[K42-discussion] DirLinuxFSVolatile deadlock

Patrick Bozeman PEBozeman at lbl.gov
Thu Sep 21 12:19:25 EST 2006


My next set of patches require a resolution to the following problem.
In my private tree, I have simply removed the locks under discussion
below and have not had any issues, but it is not clear to me that it
is the right thing to do.

Fyi, the general context for this issue is that I am triggering
revalidation of volatile directories and ran into deadlocks.

(Note, I don't need this fixed too badly because the approach I take
to revalidate directories makes it unlikely that this code is actually
exercised, but there are certain timing sequences that can cause
it to happen, and by tweaking my revalidation code, I can make
it happen on demand.)

I am experiencing a deadlock in the following existing K42 code.  I
have added an assert and provided the backtrace along with some
commentary.

/* virtual */ SysStatus
DirLinuxFSVolatile::locked_detachInvalidDir(DirLinuxFSRef dirRef)
{
    <snip>

    NameHolderInfo nhi;
    SysStatus rc = children.remove((ObjRef)dirRef, &nhi);
    if (_SUCCESS(rc)) {
        // FIXME: temporary assert to illustrate bug
        tassertMsg(0, "dbg break\n");
        nhi.rwlock->acquireW();
        // we're holding the directory lock, so we're the last ones
        // getting to this nameHolder, so no need to release the lock
        // before freeing the nameHolder, but let's do it anyway
        nhi.rwlock->releaseW();
        tassertWrn(_SUCCESS(rc), "children remove failed \n");
    } else {

    <snip>


(gdb) backtrace
#0  breakpoint () at 
/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/sys/abort.C:60
#1  0x00001000702da24c in raiseError() ()
    at 
/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/sys/TAssert.C:50
#2  0x00001000702da36c in errorWithMsg(char const*, char const*, 
unsigned long, char const*, ...) (failedexpr=0x1002fdd8 "0",
    fname=0x1002f7e8 
"/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/fslib/DirLinuxFSVolatile.C", 
lineno=0x253, fmt=0x1002fa18 "dbg break\n")
    at 
/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/sys/TAssert.C:108
#3  0x00000000100204d4 in 
DirLinuxFSVolatile::locked_detachInvalidDir(DirLinuxFS**) (
    this=Variable "this" is not available.
)
    at 
/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/fslib/DirLinuxFSVolatile.C:605
#4  0x0000000010013594 in 
DirLinuxFSVolatile::detachInvalidDir(DirLinuxFS**) (this=Variable "this" 
is not available.
)
    at DirLinuxFSVolatile.H:174
#5  0x0000000010020df4 in 
DirLinuxFSVolatile::eliminateStaleDir(DirLinuxFSVolatile**) (
    this=0x1000066c2c00, lockedParent=0x0) at DirLinuxFS.H:145
#6  0x0000000010020f08 in DirLinuxFSVolatile::revalidate() 
(this=0x1000066c2c00)
    at 
/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/fslib/DirLinuxFSVolatile.C:807
#7  0x00000000100202ec in DirLinuxFSVolatile::getStatus(char*, unsigned 
long, FileLinux::Stat*, unsigned long) (this=0x1000066c2c00, 
name=0x1000000addb5 "stat", namelen=0x4,
    retStatus=0x1000000addf0, followLink=0x0)
    at 
/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/fslib/DirLinuxFSVolatile.C:554
#8  0x000000001002afe4 in NameTreeLinuxFS::_getStatus(char*, unsigned 
long, stat&, unsigned long, unsigned long) (this=0x100006641480, 
name=0x1000000addb0 "\003328\004stat",
    namelen=0x9, retStatus=@0x1000000addf0, followLink=0x0, pid=Variable 
"pid" is not available.
) at FileLinux.H:191`
#9  0x00001000702a4874 in 
XNameTreeLinux::__getStatusEPcmR4statm(unsigned long) (
    this=0x100006800740, callerID=0x2100010000) at sysTypes.H:151
#10 0x00001000702b62f4 in DispatcherDefault_InvokeXObjMethod ()
    at 
/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/scheduler/arch/powerpc/DispatcherDefaultAsm.S:1203
#11 0x00001000702b618c in DispatcherDefault_PPCServerOnThread ()
    at 
/home/peb/src/k42-devel-patches/k42/kitchsrc/lib/libc/scheduler/arch/powerpc/DispatcherDefaultAsm.S:1162


A few things are wrong here.  The main reason I noticed this was that it 
causes
a deadlock.  Notice that frame 8 is NameTreeLInuxFS::_getStatus.  If you
look at that the code for that function, it has previously called
    rc = DREF(rootDirLinuxRef)->lookup(pathName, namelen,
                       remainder, remainderLen,
                       dirLinuxRef, nhLock);

which performs a read lock on the name holder lock on behalf of 
NameTreeLinuxFS
and returns it as nhLock.  NameTreeLinuxFS releases this lock a little ways
after the call to DirLinuxFSVolatile::getStatus.

Of course, this leads to deadlock because the very next line after the
temporary assert is to grab a write lock on the nameholder.

Now, in this particular case, it appears the locks can just be removed
since there aren't any statements between acquiring and releasing them
anyway.  However, this code is a cut-and-paste from the function right
below it, i.e. DirLinuxFSVolatile::locked_doDetachInvalidFile() which
splits the lookup and remove operation into two phases and acquires
a write lock before performing the remove as is shown in the snippet
below.

SysStatus
DirLinuxFSVolatile::locked_doDetachInvalidFile(ServerFileRef fileRef)
{
    <snip>

    _ASSERT_HELD(dirlock);

    NameHolderInfo nhi;
    // FIXME: could we do a remove without a lookup (but what if someone
    // else had the lock to it?
    SysStatus rc = children.lookup((ObjRef)fileRef, &nhi);
    if (_SUCCESS(rc)) {
        // remove it
        nhi.rwlock->acquireW();
        // we're holding the directory lock, so we're the last ones
        // getting to this nameHolder, so no need to release the lock
        // before freeing the nameHolder, but let's do it anyway
        nhi.rwlock->releaseW();
        rc = children.remove((ObjRef)fileRef);
        tassertWrn(_SUCCESS(rc), "children remove failed \n");
    } else {

    <snip>

It isn't clear to me if the locks are needed or not.  There are some 
comments
in doDetachInvalidFile that imply that they are.  In particular, is it 
possible
for there to be another reader, and if so, would performing a remove cause
a problem?  I also don't understand why the lock is released prior to
performing the remove in the file case.  If multiple readers are possible,
can't one slip in between the time the write lock is releases and the remove
is performed?

If the locks are not required, they can of course just be removed.  If they
are required, it seems that we need a way to upgrade to a write lock without
deadlocking.

Can someone who worked on this originally provide some commentary here?

On a related note, it seems that children.remove(ObjRef, NameHolderInfo) is
leaking memory.  In DentryListHash::remove(name, len, nhi) the hash
entry is deleted at the end, but in DentryListHash::remove(ref, nhi) it 
is not.
I'll get around to providing a patch for this, but I haven't tested out
adding the delete yet.




More information about the K42-discussion mailing list