[K42-discussion] DirLinuxFSVolatile deadlock

Patrick Bozeman PEBozeman at lbl.gov
Tue Sep 26 12:23:18 EST 2006


Dilma DaSilva wrote:
> Patrick, in my understanding, the trace stack you showed us is not 
> all operating on the same object.
> _getStatus will be calling getStatus on an object, which ends up
> calling eliminateStaleDir, which will call locked_doDetachInvalidDir in
> the PARENT directory. So the nameholder object that we're trying
> to acquire for write is not the one being held. Am I missing
> something?
>   
Here is some more data, but first, I'll provide some context so that you 
can follow the file and dir names.  I am triggering the deadlock by 
doing the following.

Starting '/bin/bash' on /dev/pts/1
hw1 root # ls /proc
1  304  310  319  320  321  325  cpuinfo  meminfo  mounts  stat  sys  version
hw1 root # stat /proc/326/stat
<snip>
hw1 root # stat /proc/326/stat
<we just deadlocked>

The ls checks /proc to see what the next pid is going to be.  Then I 
stat a file in that pid's dir, knowing it will be the next pid created.  
I stat the file again, knowing that it is now gone.  This leads to 
deadlock in the cache invalidation code.

The following is some data I collected about the lock usage. Procfs has 
already been identified as the file system, so you don't see the /proc 
part of the path name at this point.  (The line numbers in the trace 
below are slightly off from CVS due to the addition of the logging 
lines, but I suspect you can follow it.)

NameTreeLinuxFS::_getStatus 713 lookup name: /326/stat
DirLinuxFS::lookup 1748 acquireR nhParentLock: 0x10052ce8 name: /326/stat
DirLinuxFS::lookup 1775 externalLookupDirectory name: 326
DirLinuxFS::lookup 1783 grantedR nhSubDirLock: 0x1000066413c0 name: 326
DirLinuxFS::lookup 1799 releaseR nhParentLock: 0x10052ce8
DirLinuxFS::lookup 1801 assign nhParentLock = nhSubdirLock: 0x1000066413c0
DirLinuxFS::lookup 1811 returning lockRef: 0x1000066413c0 name: /326/stat remainder: stat
NameTreeLinuxFS::_getStatus 725 grantedR nhlock: 0x1000066413c0 name: /326/stat remainder: stat
NameTreeLinuxFS::_getStatus 732 getStatus name: /326/stat remainder: stat
WARNING: file "/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/fslib/DirLinuxFSVolatile.C", line 814
In DirLinuxFSVolatile::revalidate()
DirLinuxFSVolatile::locked_detachInvalidDir 603 acquireW nhlock: 0x1000066413c0

We are deadlocking on the nameholder lock at address 0x1000066413c0.  
That is the nameholder lock for dir '326', i.e. the dir being eliminated 
in eliminateStaleDir.

NameTreeLinuxFS::_getStatus kicks things off with a call to /proc to 
look up /326/stat.

DirLinuxFS::lookup walks the directory tree, locking and unlocking 
directories as it goes.  It starts by locking the root nameholder at 
0x10052ce8, and then performs an externalLookupDirectory for dir '326'. 
It is granted a nameholder lock on '326' at addr 0x1000066413c0.  It 
then releases the nameholder lock for the root.  DirLinuxFS::lookup 
then returns a dir reference to '326' as well as its nameholder lock.
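
To make the lock ordering in that walk easier to follow, here is a 
small sketch of the same hand-over-hand pattern.  It is not the K42 
code: ModelLock, ModelDir, LookupResult and lookupParent are names I 
made up, and the "lock" is a single-threaded model that only counts 
read holders so the order of acquires and releases can be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical single-threaded model of a nameholder lock: it only counts
// read holders so the acquire/release order can be inspected.
struct ModelLock {
    int readers = 0;
    void acquireR() { ++readers; }
    void releaseR() { assert(readers > 0); --readers; }
};

// Hypothetical stand-in for a directory and its nameholder.
struct ModelDir {
    ModelLock nhLock;
    std::map<std::string, std::shared_ptr<ModelDir>> children;
};

struct LookupResult {
    std::shared_ptr<ModelDir> dir;  // deepest directory reached, read-locked
    std::string remainder;          // last path component, not resolved here
};

// Walks every component except the last, hand over hand.  On success the
// returned directory's nhLock is still held for read and the caller must
// release it.  On failure dir is null and no lock is held.
LookupResult lookupParent(const std::shared_ptr<ModelDir>& root,
                          const std::vector<std::string>& path) {
    std::shared_ptr<ModelDir> cur = root;
    cur->nhLock.acquireR();                      // acquireR nhParentLock
    for (std::size_t i = 0; i + 1 < path.size(); ++i) {
        auto it = cur->children.find(path[i]);
        if (it == cur->children.end()) {
            cur->nhLock.releaseR();              // nothing stays locked
            return LookupResult{nullptr, path[i]};
        }
        std::shared_ptr<ModelDir> next = it->second;
        next->nhLock.acquireR();                 // grantedR nhSubDirLock
        cur->nhLock.releaseR();                  // releaseR nhParentLock
        cur = next;                              // subdir becomes the parent
    }
    return LookupResult{cur, path.empty() ? std::string() : path.back()};
}
```

Calling lookupParent(root, {"326", "stat"}) on a tree that contains 
'326' returns with only the lock for '326' held for read and with 
remainder "stat", which is the state the trace shows just before 
getStatus is called.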

NameTreeLinuxFS::_getStatus then calls getStatus on the dirref for '326' 
for file name 'stat'.  DirLinuxFSVolatile determines that '326' is stale 
and calls eliminateStaleDir.  eliminateStaleDir determines the parent of 
'326' and calls DREF(parent_of_326)->locked_detachInvalidDir(326_ref).  
(That step doesn't show up in the above trace since I was focusing on 
the locks, but it can be found in the backtrace at the top of this thread.)

DREF(parent_of_326)->locked_detachInvalidDir(326_ref) calls 
children.remove(326_ref, &nhi). It then tries to lock nhi.rwlock at 
memory address 0x1000066413c0, which was locked earlier on in the call 
stack, and blamo.. the server is wedged.
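
The shape of the problem can be reduced to a few lines.  The sketch 
below is an illustration under my own assumptions, not K42 code: 
ModelRWLock is a hypothetical single-threaded model of a reader/writer 
lock whose tryAcquireW returns false at the point where a blocking 
acquireW would wait, and the function names are invented.  The second 
variant is there only for contrast; dropping the read lock before the 
detach opens a window for races and is not being proposed as the fix.

```cpp
#include <cassert>

// Hypothetical single-threaded model of a reader/writer lock.
class ModelRWLock {
public:
    void acquireR() { assert(!writer); ++readers; }
    void releaseR() { assert(readers > 0); --readers; }
    // Returns false where a real, blocking acquireW would wait.
    bool tryAcquireW() {
        if (writer || readers > 0) return false;
        writer = true;
        return true;
    }
    void releaseW() { assert(writer); writer = false; }
private:
    int readers = 0;
    bool writer = false;
};

// Stand-in for locked_detachInvalidDir: it wants the child's nameholder
// lock for write.
bool detachInvalidDir(ModelRWLock& childLock) {
    if (!childLock.tryAcquireW()) return false;  // real code blocks here
    childLock.releaseW();
    return true;
}

// Mirrors the traced call chain: the read lock on '326' is still held
// when the detach asks for the same lock for write.
bool getStatusHoldingReadLock(ModelRWLock& nh326) {
    nh326.acquireR();
    bool ok = detachInvalidDir(nh326);           // waits on ourselves
    nh326.releaseR();
    return ok;
}

// Contrast only: the read lock is dropped before the detach runs.
bool getStatusAfterReleasingReadLock(ModelRWLock& nh326) {
    nh326.acquireR();
    nh326.releaseR();
    return detachInvalidDir(nh326);
}
```

In the first variant the write acquire can never be granted, because 
the only reader it is waiting for is the caller itself.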

Deadlock seems unavoidable the way this is currently structured, but 
please help me understand if I am mistaken.


Also, this morning's version of this thread is relevant here.  The 
children.remove call should be deleting the nameholder, at which point 
the call to actually grab the write lock would be operating on freed 
memory.  I pointed out in the other part of this thread why I thought 
the nameholder was being leaked; however, if it isn't being leaked, then 
the write lock is being taken on a deleted object.  It seems that a 
lookup/remove model, as is used in the file case, is necessary here.
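
For what it is worth, this is roughly the ordering I have in mind for a 
lookup/remove model.  The file-case code is not reproduced in this 
thread, so the sketch is only my guess at the general shape: NameHolder, 
Children and detachChild are hypothetical names, and a plain std::mutex 
stands in for the nameholder rwlock.  The point is that the entry is 
looked up and locked while it is still alive, and is only destroyed 
after the lock has been released.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical nameholder; std::mutex stands in for the rwlock.
struct NameHolder {
    std::mutex lock;
    bool detached = false;
};

// Hypothetical children table keyed by name.
class Children {
public:
    void add(const std::string& name) {
        table[name] = std::make_unique<NameHolder>();
    }
    // Returns the live entry, or nullptr if the name is not present.
    NameHolder* lookup(const std::string& name) {
        auto it = table.find(name);
        return it == table.end() ? nullptr : it->second.get();
    }
    // Unlinks the entry and hands ownership to the caller, so the object
    // is not destroyed inside remove().
    std::unique_ptr<NameHolder> remove(const std::string& name) {
        auto it = table.find(name);
        if (it == table.end()) return nullptr;
        std::unique_ptr<NameHolder> owned = std::move(it->second);
        table.erase(it);
        return owned;
    }
private:
    std::map<std::string, std::unique_ptr<NameHolder>> table;
};

// Look up first, lock the live object, then remove it from the table.
bool detachChild(Children& children, const std::string& name) {
    NameHolder* nh = children.lookup(name);
    if (nh == nullptr) return false;
    std::unique_ptr<NameHolder> owned;   // outlives the lock guard below
    {
        std::lock_guard<std::mutex> guard(nh->lock);
        nh->detached = true;
        owned = children.remove(name);   // unlinked, but still alive
    }                                    // lock released here
    return true;                         // 'owned' destroyed after unlock
}
```

This does nothing about the read lock already held further up the call 
stack; it only addresses the ordering between the remove and the lock.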





