[K42-discussion] About regress failure "Caller of notInUse lies"

Dilma DaSilva dilma at watson.ibm.com
Fri Jan 27 05:33:58 EST 2006


In the last year our nightly hardware regress failed 21 times (most
often on the xserver) when running SDET (MP configuration) with an
assertion on FCMCommonMultiRepRoot::notInUse() that states
"Caller of notInUse lies". A few weeks ago I added code to print
more information about the failure. This e-mail describes what
I learned. I'm not going to pursue this further at the moment, but
if someone else wants to explore this, I'll be happy to be involved.
(I believe Jonathan, Marc, and Orran are the primary experts;
I'm a naive tourist in this cool space.)

The call stack we have for this failure is:

FCMCommonMultiRepRoot::notInUse()   <-------- fails here
FCMCommonMultiRepRoot::locked_removeReference()
FCMCommonMultiRepRoot::removeReference()
FCMCommonMultiRep::detachForkChild()
FCMComputation::locked_detachFromParent()
FCMComputation::doDestroy()
FCMCommon::destroy()
FRComputation::destroy()
FRCommon::fcmNotInUse()
FCMCommon::notInUse()
FCMComputation::detachRegion()
RegionDefault::destroy()
RegionList::deleteRegionsAll()
ProcessReplicated::destroy()
ProcessReplicated::kill()
ProcessServer::_kill()

It seems that a process is going away, so we're trying to update the
FCM fork tree. An FCM computation object asks its parent to detach it.
The parent is an FCM multirep. The parent sees that its own reference
count is 0 and that it has no regions attached, so it also tries to
go away (by invoking notInUse()).

The code in locked_removeReference() invoking notInUse() is:

    referenceCount--;
    if (!referenceCount && regionList.isEmpty()) {
        lock.release();
        return notInUse();
    }

We have a race condition. At the failure point I printed referenceCount
for this FCMCommonMultiRep object; it was 1, even though the check in
locked_removeReference() must have seen 0 in order to call notInUse().
The lock is released before notInUse() is invoked, so it seems that
while we were collapsing the fork tree and freeing unneeded objects,
someone attached itself to the FCMCommonMultiRep.
I don't know much about the complex synchronization involving FCMs.
I recall Jonathan and Marc discussing some possible problems with the
fork logic and an alternative design, but I don't remember the details.
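
To make the suspected window concrete, here is a minimal sketch of the
pattern. It is not K42 code: ToyRoot, addReference(), the afterUnlock
hook, and the callerLied flag are all hypothetical names, std::mutex
stands in for the real lock, and I am assuming that notInUse()
re-checks the reference count (which is what the assertion message
suggests). The hook replays the bad interleaving deterministically on
one thread instead of relying on real timing.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Toy stand-in for an FCM root object. All names here are hypothetical;
// this is not K42 code.
struct ToyRoot {
    std::mutex lock;
    int referenceCount = 1;
    bool regionListEmpty = true;
    bool callerLied = false;            // set where the real code would assert
    std::function<void()> afterUnlock;  // hook run inside the unlocked window

    void addReference() {
        std::lock_guard<std::mutex> guard(lock);
        referenceCount++;
    }

    // Assumption: the real notInUse() re-checks that nothing is attached.
    void notInUse() {
        std::lock_guard<std::mutex> guard(lock);
        if (referenceCount != 0 || !regionListEmpty) {
            callerLied = true;          // "Caller of notInUse lies"
        }
    }

    // Precondition: caller holds lock. The lock is released before returning.
    void locked_removeReference() {
        assert(referenceCount > 0);
        referenceCount--;
        if (!referenceCount && regionListEmpty) {
            lock.unlock();
            // Window: the check above passed, but the lock is gone.
            if (afterUnlock) afterUnlock();
            notInUse();
            return;
        }
        lock.unlock();
    }

    void removeReference() {
        lock.lock();
        locked_removeReference();
    }
};

// Drops the last reference with nobody interfering.
bool liesWithoutInterference() {
    ToyRoot root;
    root.removeReference();
    return root.callerLied;
}

// Drops the last reference while someone attaches inside the window.
bool liesWithLateAttach() {
    ToyRoot root;
    root.afterUnlock = [&root] { root.addReference(); };
    root.removeReference();
    return root.callerLied;
}
```

In this toy, liesWithoutInterference() takes the quiet path, while
liesWithLateAttach() ends with a reference count of 1 inside
notInUse(), which matches the value I saw printed. Whether the real
attach path looks anything like addReference() is exactly the part I
don't know.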

[
A side note: the method locked_removeReference() unlocks the object
before returning (I personally don't like this style).
I noticed a code path where the lock is released
after invoking locked_removeReference().
I believe that code path is not being exercised, but I documented
my lack of understanding anyway and added an _ASSERT_HELD assertion
before the unlock invocation.
]
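
For readers who have not seen a "held" assertion, here is a rough
sketch of the idea. This is not the K42 lock or the _ASSERT_HELD
macro: TrackedLock, isHeld(), and releaseChecked() are hypothetical
names, and the sketch reports the problem with a return value where
the real code would assert.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical lock wrapper that remembers whether it is currently held.
// Not the K42 lock type.
class TrackedLock {
    std::mutex m;
    std::atomic<bool> held{false};

public:
    void acquire() {
        m.lock();
        held.store(true);
    }

    bool isHeld() const { return held.load(); }

    // Releases the lock only if it is held. Returns false where an
    // _ASSERT_HELD-style check would fire.
    bool releaseChecked() {
        if (!isHeld()) {
            return false;   // lock was already released by someone else
        }
        held.store(false);
        m.unlock();
        return true;
    }
};
```

With this shape, a caller that releases after a callee has already
released gets false from the second releaseChecked() instead of
silently unlocking a lock it no longer holds.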

dilma


