[K42-discussion] DirLinuxFSVolatile deadlock
Patrick Bozeman
PEBozeman at lbl.gov
Tue Sep 26 12:23:18 EST 2006
Dilma DaSilva wrote:
> Patrick, in my understanding, the trace stack you showed us is not
> all operating on the same object.
> _getStatus will be calling getStatus on an object, which ends up
> calling eliminateStaleDir, which will call locked_doDetachInvalidDir in
> the PARENT directory. So the nameholder object that we're trying
> to acquire for write is not the one being held. Am I missing
> something?
>
Here is some more data, but first, I'll provide some context so that you
can follow the file and dir names. I am triggering the deadlock by
doing the following.
Starting '/bin/bash' on /dev/pts/1
hw1 root # ls /proc
1 304 310 319 320 321 325 cpuinfo meminfo mounts stat sys version
hw1 root # stat /proc/326/stat
<snip>
hw1 root # stat /proc/326/stat
<we just deadlocked>
The ls is to check /proc to see what the next pid is going to be. Then I
stat a file in that pid's dir, knowing it will be the next pid created. I
stat the file again, knowing that it is now gone. This leads to deadlock in
the cache invalidation code.
The following is some data I collected about the lock usage. Procfs has
already been identified as the file system, so you don't see the /proc
part of the path name at this point. (The line numbers in the trace
below are slightly off from CVS due to the addition of the logging
lines, but I suspect you can follow it.)
NameTreeLinuxFS::_getStatus 713 lookup name: /326/stat
DirLinuxFS::lookup 1748 acquireR nhParentLock: 0x10052ce8 name: /326/stat
DirLinuxFS::lookup 1775 externalLookupDirectory name: 326
DirLinuxFS::lookup 1783 grantedR nhSubDirLock: 0x1000066413c0 name: 326
DirLinuxFS::lookup 1799 releaseR nhParentLock: 0x10052ce8
DirLinuxFS::lookup 1801 assign nhParentLock = nhSubdirLock: 0x1000066413c0
DirLinuxFS::lookup 1811 returning lockRef: 0x1000066413c0 name: /326/stat remainder: stat
NameTreeLinuxFS::_getStatus 725 grantedR nhlock: 0x1000066413c0 name: /326/stat remainder: stat
NameTreeLinuxFS::_getStatus 732 getStatus name: /326/stat remainder: stat
WARNING: file "/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/fslib/DirLinuxFSVolatile.C", line 814
In DirLinuxFSVolatile::revalidate()
DirLinuxFSVolatile::locked_detachInvalidDir 603 acquireW nhlock: 0x1000066413c0
We are deadlocking on the nameholder lock at address 0x1000066413c0.
That is the nameholder lock for dir '326', i.e. the dir being eliminated
in eliminateStaleDir.
NameTreeLinuxFS::_getStatus kicks things off with a call to /proc to
look up /326/stat.
DirLinuxFS::lookup walks the directory tree, locking and unlocking
directories as it goes. It starts by locking the root nameholder at
0x10052ce8, and then performs an externalLookupDirectory for dir '326'.
It is granted a nameholder lock on '326' at addr 0x1000066413c0. It
then releases the nameholder lock for the root. DirLinuxFS::lookup
then returns a dir reference to '326' as well as its nameholder lock.
NameTreeLinuxFS::_getStatus then calls getStatus on the dirref for '326'
for file name 'stat'. DirLinuxFSVolatile determines that '326' is stale
and calls eliminateStaleDir. eliminateStaleDir determines the parent of
'326' and calls DREF(parent_of_326)->locked_detachInvalidDir(326_ref).
(That step doesn't show up in the above trace since I was focusing on
the locks, but it can be found in the backtrace at the top of this thread.)
DREF(parent_of_326)->locked_detachInvalidDir(326_ref) calls
children.remove(326_ref, &nhi). It then tries to write-lock nhi.rwlock at
memory address 0x1000066413c0, which was read-locked earlier in the same
call stack (the grantedR in lookup above), and blamo... the server is wedged.
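To make the sequence above concrete, here is a minimal sketch that replays the lock operations from the trace against a toy reader/writer lock. ToyRWLock, NameHolder and replayTrace are hypothetical names invented for this illustration, not K42 classes, and the lock only counts holders rather than blocking, so the program reports the hazard instead of wedging.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy reader/writer lock (hypothetical, not the K42 lock). It counts
// holders instead of blocking so the hazard can be observed safely.
struct ToyRWLock {
    int readers = 0;
    bool writer = false;

    bool tryAcquireR() {
        if (writer) return false;
        ++readers;
        return true;
    }
    void releaseR() {
        assert(readers > 0);
        --readers;
    }
    // A blocking acquireW would wait here until readers == 0.
    bool tryAcquireW() {
        if (writer || readers > 0) return false;
        writer = true;
        return true;
    }
    void releaseW() {
        assert(writer);
        writer = false;
    }
};

// Stand-in for a nameholder: a name plus its lock.
struct NameHolder {
    std::string name;
    ToyRWLock rwlock;
    explicit NameHolder(std::string n) : name(std::move(n)) {}
};

// Replays the lock sequence from the trace for "/326/stat".
// Returns true if the final write acquire would block forever.
bool replayTrace(std::vector<std::string> &log) {
    NameHolder root("/");
    NameHolder dir326("326");

    // DirLinuxFS::lookup: lock the parent, lock the child, drop the parent.
    root.rwlock.tryAcquireR();
    log.push_back("acquireR root");
    dir326.rwlock.tryAcquireR();
    log.push_back("grantedR 326");
    root.rwlock.releaseR();
    log.push_back("releaseR root");

    // _getStatus still holds the read lock on 326 while getStatus runs.
    // The stale-dir path then asks for the same lock in write mode.
    bool granted = dir326.rwlock.tryAcquireW();
    log.push_back(granted ? "grantedW 326" : "acquireW 326 would block");

    // Clean up whichever lock mode is actually held.
    if (granted) {
        dir326.rwlock.releaseW();
    } else {
        dir326.rwlock.releaseR();
    }
    return !granted;
}
```

Calling replayTrace with an empty log returns true and leaves "acquireW 326 would block" as the last entry: the only reader standing in the writer's way is the caller itself, so with a blocking lock nothing can ever release it.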
Deadlock seems unavoidable the way this is currently structured, but
please help me understand if I am mistaken.
Also, this morning's part of this thread is relevant here. The
children.remove call should be deleting the nameholder, at which point
the call to actually grab the write lock would be performed on freed
memory. I pointed out in the other part of this thread why I thought
the nameholder was being leaked. However, if it isn't being leaked, then
the write lock is being taken on a deleted object. It seems that a
lookup/remove model, as is performed in the file case, is necessary here.
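One possible reading of a lookup/remove model is sketched below: find the nameholder first, lock it while it is still alive, and only then unlink it from the child list, so the lock is never taken on freed memory. This is an assumption about the intended shape, not the actual K42 file-case code; ChildList, NameHolder and detachInvalidDir here are hypothetical stand-ins, and a plain flag replaces the real rwlock. It addresses only the ordering of lock and delete. The caller's own read lock would still have to be dropped or upgraded before the write lock could be granted.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a nameholder; the flag replaces the real rwlock.
struct NameHolder {
    bool writeLocked = false;
};

// Hypothetical child list that separates lookup from remove.
class ChildList {
    std::map<std::string, std::unique_ptr<NameHolder>> entries;

public:
    void add(const std::string &name) {
        entries[name] = std::unique_ptr<NameHolder>(new NameHolder());
    }

    // Step 1: find the holder without unlinking or freeing it.
    NameHolder *lookup(const std::string &name) {
        auto it = entries.find(name);
        return it == entries.end() ? nullptr : it->second.get();
    }

    // Step 2: unlink the entry and hand ownership to the caller.
    std::unique_ptr<NameHolder> remove(const std::string &name) {
        auto it = entries.find(name);
        if (it == entries.end()) return nullptr;
        std::unique_ptr<NameHolder> owned = std::move(it->second);
        entries.erase(it);
        return owned;
    }
};

// Detach a stale directory: lock first, unlink second, free last.
bool detachInvalidDir(ChildList &children, const std::string &name) {
    NameHolder *nh = children.lookup(name);
    if (nh == nullptr) return false;

    nh->writeLocked = true;   // lock while the holder is still in the list
    std::unique_ptr<NameHolder> owned = children.remove(name);
    owned->writeLocked = false;  // unlock before the holder is destroyed
    return true;                 // owned goes out of scope and is freed here
}
```

After children.add("326"), a first call to detachInvalidDir(children, "326") returns true and a second returns false, because the entry is already gone.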