Straw, Adam D via llvm-dev
2020-May-26 20:16 UTC
[llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second attempt to perform debug registration" assert
Referring to the log messages from my previous mail… I confused myself (and
probably others) by reading the “Adding MemMgr 0x55555959f440“ message as
“Registering MemMgr 0x55555959f440”. Thus the address mismatch made no sense.
How could we be registering a `MemMgr` address/key that does not match once we
arrive in `notifyObjectLoaded` method?
Answer: Because the registrations is NOT coming from
RTDyldObjectLinkingLayer.cpp. I found this first by adding an additional debug
statement (highlighted) thinking the `reinterpret_cast` may be at fault for the
key mismatch. I never saw this print and I then realized my “Adding” vs.
“Registering” mistake.
// Run EventListener notifyLoaded callbacks.
{
std::lock_guard<std::mutex> Lock(RTDyldLayerMutex);
dbgs() << "Registering MemMgr " << MemMgr <<
"\n";
auto LOIItr = LoadedObjInfos.find(MemMgr);
assert(LOIItr != LoadedObjInfos.end() && "LoadedObjInfo
missing");
for (auto *L : EventListeners) {
dbgs() << " key = " <<
reinterpret_cast<uintptr_t>(MemMgr) << "\n";
L->notifyObjectLoaded(
static_cast<uint64_t>(reinterpret_cast<uintptr_t>(MemMgr)), *Obj,
*LOIItr->second);
}
LoadedObjInfos.erase(MemMgr);
}
So, where is the registration coming from? Remember that I am on a slightly
older version of MLIR. My mlir/lib/ExecutionEngine/ExecutionEngine.cpp has the
following code with highlighted debug statement added by me:
objectLayer->setNotifyLoaded(
[engine = engine.get()](
llvm::orc::VModuleKey, const llvm::object::ObjectFile &object,
const llvm::RuntimeDyld::LoadedObjectInfo &objectInfo) {
if (engine->gdbListener) {
uint64_t key = static_cast<uint64_t>(
reinterpret_cast<uintptr_t>(object.getData().data()));
dbgs() << "ExecutionEngine key = " << key
<< "\n";
engine->gdbListener->notifyObjectLoaded(key, object,
objectInfo);
}
});
The key printed here matches the key printed in `notifyObjectLoaded`. This code
is NOT present at the tip of MLIR master. Hence the solution here may be to
move to latest MLIR master which I probably should have tried before sending
this mail. Will do that now.
Adam
From: Straw, Adam D
Sent: Tuesday, May 26, 2020 9:41 AM
To: Lang Hames <lhames at gmail.com>; David Blaikie <dblaikie at
gmail.com>
Cc: llvm-dev at lists.llvm.org; Cyphers, Scott <Scott.Cyphers at
intel.com>; Bobba, Jayaram <jayaram.bobba at intel.com>; Mostafa, Nagy
H <Nagy.H.Mostafa at intel.com>; Iyer, Swapna <swapna.iyer at
intel.com>
Subject: RE: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
attempt to perform debug registration" assert
Sorry for the delay. Hope everyone had a good long weekend. See attached log
captured from GDB session. Here is the punchline:
Adding MemMgr 0x55555959f440
unit-test:
/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp:164:
virtual void
{anonymous}::GDBJITRegistrationListener::notifyObjectLoaded(llvm::JITEventListener::ObjectKey,
const llvm::object::ObjectFile&, const
llvm::RuntimeDyld::LoadedObjectInfo&): Assertion `ObjectBufferMap.find(K) ==
ObjectBufferMap.end() && "Second attempt to perform debug
registration."' failed.
(gdb) print/x K
$2 = 0x555559605c50
The key (K) causing the assert a) does not match the addresses of the most
recently added MemMgr and b) does not match the address of ANY MemMgr in the
entire log file. The plot thickens.
Wondering if there is some other object (besides a MemMgr) that registers its
address-as-key into `ObjectBufferMap`.
Adam
From: Lang Hames <lhames at gmail.com<mailto:lhames at gmail.com>>
Sent: Thursday, May 21, 2020 4:53 PM
To: David Blaikie <dblaikie at gmail.com<mailto:dblaikie at
gmail.com>>
Cc: Straw, Adam D <adam.d.straw at intel.com<mailto:adam.d.straw at
intel.com>>; llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>; Cyphers, Scott <scott.cyphers at
intel.com<mailto:scott.cyphers at intel.com>>; Bobba, Jayaram
<jayaram.bobba at intel.com<mailto:jayaram.bobba at intel.com>>;
Mostafa, Nagy H <nagy.h.mostafa at intel.com<mailto:nagy.h.mostafa at
intel.com>>
Subject: Re: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
attempt to perform debug registration" assert
Hi Adam,
Calls to the listeners should be protected by the RTDyldLayerMutex. Could you
apply the attached patch and share the debugging output from one of the failing
runs?
Regards,
Lang.
On Wed, May 20, 2020 at 8:00 PM David Blaikie <dblaikie at
gmail.com<mailto:dblaikie at gmail.com>> wrote:
+Lang
On Wed, May 20, 2020 at 4:44 PM Straw, Adam D via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
Hi all,
Attention: Lang Hames
I am developing the nGraph
MLIR<https://github.com/NervanaSystems/ngraph/tree/master/src/contrib/mlir>
implementation and hitting the following assert while running nGraph unit tests:
assert(ObjectBufferMap.find(K) == ObjectBufferMap.end() &&
"Second attempt to perform debug registration.");
Here is a
permalink<https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L163>
for that line of code on GitHub. The failure only occurs when running multiple
unit tests back-to-back. Inevitably, if I rerun the failing unit test, it
passes. The failure also tends to move around with a different unit tests
failing on successive runs.
I am able to hit the failure in GDB. Here is a partial backtrace:
#3 0x00007fffec4cf412 in __GI___assert_fail (assertion=assertion at
entry=0x7ffff1d145e0 "ObjectBufferMap.find(K) == ObjectBufferMap.end()
&& \"Second attempt to perform debug registration.\"",
file=file at entry=0x7ffff1d144d8
"/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp",
line=line at entry=164,
function=function at entry=0x7ffff1d15520 <(anonymous
namespace)::GDBJITRegistrationListener::notifyObjectLoaded(unsigned long,
llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo
const&)::__PRETTY_FUNCTION__> "virtual void
{anonymous}::GDBJITRegistrationListener::notifyObjectLoaded(llvm::JITEventListener::ObjectKey,
const llvm::object::ObjectFile&, const
llvm::RuntimeDyld::LoadedObjectInfo&)")
at assert.c:101
#4 0x00007ffff01ed4d3 in (anonymous
namespace)::GDBJITRegistrationListener::notifyObjectLoaded (this=0x5555591a9270,
K=93825060893584, Obj=..., L=...)
at
/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp:163
#5 0x00007ffff01b0eca in std::function<void (unsigned long,
llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo
const&)>::operator()(unsigned long, llvm::object::ObjectFile const&,
llvm::RuntimeDyld::LoadedObjectInfo const&) const (__args#2=...,
__args#1=..., __args#0=<optimized out>, this=0x5555594e09b8) at
/usr/include/c++/7/bits/std_function.h:706
#6 llvm::orc::RTDyldObjectLinkingLayer::onObjLoad (this=<optimized out>,
K=3, R=..., Obj=..., MemMgr=0x5555591c4160,
LoadedObjInfo=std::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo> =
{...},
Resolved=std::map with 8 elements = {...}, InternalSymbols=std::set with 0
elements)
at
/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:267
A little debug context:
* Frame #6 with K=3 we are about to execute the `NotifyLoaded` callback
* Frame #5 we execute the `NotifyLoaded` callback which apparently
transforms K from 3 to 93825060893584 (0x5555596CF390) which looks like a
pointer. I don’t know the code well enough to grok what’s going on here. I’m
not even sure where this callback resides, but it could be
here<https://github.com/llvm/llvm-project/blob/c66f89005f6d23b6885d8f93f33ff27dc60ce7dd/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp#L312>
in RTDyldObjectLinkingLater.cpp where we are using the address of the `MemMgr`
as K.
* Frame #4 now in `notifyObjectLoaded` with K=93825060893584 which will go
on to assert in frame #3 due to the fact that that key (K) already exists in the
map
All of this leads me to believe there is some sort of race. Theory: with two
unit tests A and B running back to back…
1. Unit test A allocates and deallocates a `MemMgr` (or whatever other object
we are using for a key) at address X
2. Unit test A calls ` notifyFreeingObject` to free the object at K=X but
gets stuck (does not lock) the ` JITDebugLock` mutex
here<https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L181>
for whatever reason
3. Meanwhile unit test B allocates a `MemMgr` (or whatever other object we
are using for a key) at newly freed address X
4. Unit test B calls `notifyObjectLoaded` to register the object at K=X and
locks the ` JITDebugLock`
here<https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L162>
with unit test A still waiting in step #2 above
5. This goes on to assert as the key was not erased (step #2) before it was
added (step #4)
I could use a little help to debug the error further. Curiously, we do hit this
issue in continuous integration (CI) --- only when running on development
systems “by hand”. I am investigating the differences (perhaps single vs. multi
threaded?) as to why this might be the case.
Note that I am using a slightly old version of MLIR master, I’m at this commit:
commit 2f8b4545f4960778e37114c024073d208751ca89
Author: Adam Straw <adam.d.straw at intel.com<mailto:adam.d.straw at
intel.com>>
Date: Tue Apr 14 22:49:18 2020 +0300
[mlir] Fix assert on signed integer type in EDSC
Integer type in Std dialect is signless so we should be checking
for signless integer type instead of signed integer type in EDSC.
Differential Revision: https://reviews.llvm.org/D78144
Thanks in advance for the help.
Adam
Note: I am out of office this Fri and the following Mon but otherwise should be
prompt with email replies.
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200526/af98b28a/attachment.html>
Straw, Adam D via llvm-dev
2020-May-27 18:54 UTC
[llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second attempt to perform debug registration" assert
Lang,
It looks like the change I need is here, correct?
https://github.com/llvm/llvm-project/commit/3c5dd5863c34ecd51e9d2a49929877d8151dea39
I would like to move the nGraph MLIR pointer all the way to the tip of master
but it is expensive (lots of compile and link issues) so hoping to take a baby
step.
Adam
From: Straw, Adam D
Sent: Tuesday, May 26, 2020 1:17 PM
To: Lang Hames <lhames at gmail.com>; David Blaikie <dblaikie at
gmail.com>
Cc: llvm-dev at lists.llvm.org; Cyphers, Scott <Scott.Cyphers at
intel.com>; Bobba, Jayaram <jayaram.bobba at intel.com>; Mostafa, Nagy
H <Nagy.H.Mostafa at intel.com>; Iyer, Swapna <swapna.iyer at
intel.com>
Subject: RE: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
attempt to perform debug registration" assert
Referring to the log messages from my previous mail… I confused myself (and
probably others) by reading the “Adding MemMgr 0x55555959f440“ message as
“Registering MemMgr 0x55555959f440”. Thus the address mismatch made no sense.
How could we be registering a `MemMgr` address/key that does not match once we
arrive in `notifyObjectLoaded` method?
Answer: Because the registrations is NOT coming from
RTDyldObjectLinkingLayer.cpp. I found this first by adding an additional debug
statement (highlighted) thinking the `reinterpret_cast` may be at fault for the
key mismatch. I never saw this print and I then realized my “Adding” vs.
“Registering” mistake.
// Run EventListener notifyLoaded callbacks.
{
std::lock_guard<std::mutex> Lock(RTDyldLayerMutex);
dbgs() << "Registering MemMgr " << MemMgr <<
"\n";
auto LOIItr = LoadedObjInfos.find(MemMgr);
assert(LOIItr != LoadedObjInfos.end() && "LoadedObjInfo
missing");
for (auto *L : EventListeners) {
dbgs() << " key = " <<
reinterpret_cast<uintptr_t>(MemMgr) << "\n";
L->notifyObjectLoaded(
static_cast<uint64_t>(reinterpret_cast<uintptr_t>(MemMgr)), *Obj,
*LOIItr->second);
}
LoadedObjInfos.erase(MemMgr);
}
So, where is the registration coming from? Remember that I am on a slightly
older version of MLIR. My mlir/lib/ExecutionEngine/ExecutionEngine.cpp has the
following code with highlighted debug statement added by me:
objectLayer->setNotifyLoaded(
[engine = engine.get()](
llvm::orc::VModuleKey, const llvm::object::ObjectFile &object,
const llvm::RuntimeDyld::LoadedObjectInfo &objectInfo) {
if (engine->gdbListener) {
uint64_t key = static_cast<uint64_t>(
reinterpret_cast<uintptr_t>(object.getData().data()));
dbgs() << "ExecutionEngine key = " << key
<< "\n";
engine->gdbListener->notifyObjectLoaded(key, object,
objectInfo);
}
});
The key printed here matches the key printed in `notifyObjectLoaded`. This code
is NOT present at the tip of MLIR master. Hence the solution here may be to
move to latest MLIR master which I probably should have tried before sending
this mail. Will do that now.
Adam
From: Straw, Adam D
Sent: Tuesday, May 26, 2020 9:41 AM
To: Lang Hames <lhames at gmail.com<mailto:lhames at gmail.com>>;
David Blaikie <dblaikie at gmail.com<mailto:dblaikie at gmail.com>>
Cc: llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>;
Cyphers, Scott <Scott.Cyphers at intel.com<mailto:Scott.Cyphers at
intel.com>>; Bobba, Jayaram <jayaram.bobba at
intel.com<mailto:jayaram.bobba at intel.com>>; Mostafa, Nagy H
<Nagy.H.Mostafa at intel.com<mailto:Nagy.H.Mostafa at intel.com>>;
Iyer, Swapna <swapna.iyer at intel.com<mailto:swapna.iyer at
intel.com>>
Subject: RE: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
attempt to perform debug registration" assert
Sorry for the delay. Hope everyone had a good long weekend. See attached log
captured from GDB session. Here is the punchline:
Adding MemMgr 0x55555959f440
unit-test:
/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp:164:
virtual void
{anonymous}::GDBJITRegistrationListener::notifyObjectLoaded(llvm::JITEventListener::ObjectKey,
const llvm::object::ObjectFile&, const
llvm::RuntimeDyld::LoadedObjectInfo&): Assertion `ObjectBufferMap.find(K) ==
ObjectBufferMap.end() && "Second attempt to perform debug
registration."' failed.
(gdb) print/x K
$2 = 0x555559605c50
The key (K) causing the assert a) does not match the addresses of the most
recently added MemMgr and b) does not match the address of ANY MemMgr in the
entire log file. The plot thickens.
Wondering if there is some other object (besides a MemMgr) that registers its
address-as-key into `ObjectBufferMap`.
Adam
From: Lang Hames <lhames at gmail.com<mailto:lhames at gmail.com>>
Sent: Thursday, May 21, 2020 4:53 PM
To: David Blaikie <dblaikie at gmail.com<mailto:dblaikie at
gmail.com>>
Cc: Straw, Adam D <adam.d.straw at intel.com<mailto:adam.d.straw at
intel.com>>; llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>; Cyphers, Scott <scott.cyphers at
intel.com<mailto:scott.cyphers at intel.com>>; Bobba, Jayaram
<jayaram.bobba at intel.com<mailto:jayaram.bobba at intel.com>>;
Mostafa, Nagy H <nagy.h.mostafa at intel.com<mailto:nagy.h.mostafa at
intel.com>>
Subject: Re: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
attempt to perform debug registration" assert
Hi Adam,
Calls to the listeners should be protected by the RTDyldLayerMutex. Could you
apply the attached patch and share the debugging output from one of the failing
runs?
Regards,
Lang.
On Wed, May 20, 2020 at 8:00 PM David Blaikie <dblaikie at
gmail.com<mailto:dblaikie at gmail.com>> wrote:
+Lang
On Wed, May 20, 2020 at 4:44 PM Straw, Adam D via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
Hi all,
Attention: Lang Hames
I am developing the nGraph
MLIR<https://github.com/NervanaSystems/ngraph/tree/master/src/contrib/mlir>
implementation and hitting the following assert while running nGraph unit tests:
assert(ObjectBufferMap.find(K) == ObjectBufferMap.end() &&
"Second attempt to perform debug registration.");
Here is a
permalink<https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L163>
for that line of code on GitHub. The failure only occurs when running multiple
unit tests back-to-back. Inevitably, if I rerun the failing unit test, it
passes. The failure also tends to move around with a different unit tests
failing on successive runs.
I am able to hit the failure in GDB. Here is a partial backtrace:
#3 0x00007fffec4cf412 in __GI___assert_fail (assertion=assertion at
entry=0x7ffff1d145e0 "ObjectBufferMap.find(K) == ObjectBufferMap.end()
&& \"Second attempt to perform debug registration.\"",
file=file at entry=0x7ffff1d144d8
"/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp",
line=line at entry=164,
function=function at entry=0x7ffff1d15520 <(anonymous
namespace)::GDBJITRegistrationListener::notifyObjectLoaded(unsigned long,
llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo
const&)::__PRETTY_FUNCTION__> "virtual void
{anonymous}::GDBJITRegistrationListener::notifyObjectLoaded(llvm::JITEventListener::ObjectKey,
const llvm::object::ObjectFile&, const
llvm::RuntimeDyld::LoadedObjectInfo&)")
at assert.c:101
#4 0x00007ffff01ed4d3 in (anonymous
namespace)::GDBJITRegistrationListener::notifyObjectLoaded (this=0x5555591a9270,
K=93825060893584, Obj=..., L=...)
at
/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp:163
#5 0x00007ffff01b0eca in std::function<void (unsigned long,
llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo
const&)>::operator()(unsigned long, llvm::object::ObjectFile const&,
llvm::RuntimeDyld::LoadedObjectInfo const&) const (__args#2=...,
__args#1=..., __args#0=<optimized out>, this=0x5555594e09b8) at
/usr/include/c++/7/bits/std_function.h:706
#6 llvm::orc::RTDyldObjectLinkingLayer::onObjLoad (this=<optimized out>,
K=3, R=..., Obj=..., MemMgr=0x5555591c4160,
LoadedObjInfo=std::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo> =
{...},
Resolved=std::map with 8 elements = {...}, InternalSymbols=std::set with 0
elements)
at
/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:267
A little debug context:
* Frame #6 with K=3 we are about to execute the `NotifyLoaded` callback
* Frame #5 we execute the `NotifyLoaded` callback which apparently
transforms K from 3 to 93825060893584 (0x5555596CF390) which looks like a
pointer. I don’t know the code well enough to grok what’s going on here. I’m
not even sure where this callback resides, but it could be
here<https://github.com/llvm/llvm-project/blob/c66f89005f6d23b6885d8f93f33ff27dc60ce7dd/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp#L312>
in RTDyldObjectLinkingLater.cpp where we are using the address of the `MemMgr`
as K.
* Frame #4 now in `notifyObjectLoaded` with K=93825060893584 which will go
on to assert in frame #3 due to the fact that that key (K) already exists in the
map
All of this leads me to believe there is some sort of race. Theory: with two
unit tests A and B running back to back…
1. Unit test A allocates and deallocates a `MemMgr` (or whatever other object
we are using for a key) at address X
2. Unit test A calls ` notifyFreeingObject` to free the object at K=X but
gets stuck (does not lock) the ` JITDebugLock` mutex
here<https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L181>
for whatever reason
3. Meanwhile unit test B allocates a `MemMgr` (or whatever other object we
are using for a key) at newly freed address X
4. Unit test B calls `notifyObjectLoaded` to register the object at K=X and
locks the ` JITDebugLock`
here<https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L162>
with unit test A still waiting in step #2 above
5. This goes on to assert as the key was not erased (step #2) before it was
added (step #4)
I could use a little help to debug the error further. Curiously, we do hit this
issue in continuous integration (CI) --- only when running on development
systems “by hand”. I am investigating the differences (perhaps single vs. multi
threaded?) as to why this might be the case.
Note that I am using a slightly old version of MLIR master, I’m at this commit:
commit 2f8b4545f4960778e37114c024073d208751ca89
Author: Adam Straw <adam.d.straw at intel.com<mailto:adam.d.straw at
intel.com>>
Date: Tue Apr 14 22:49:18 2020 +0300
[mlir] Fix assert on signed integer type in EDSC
Integer type in Std dialect is signless so we should be checking
for signless integer type instead of signed integer type in EDSC.
Differential Revision: https://reviews.llvm.org/D78144
Thanks in advance for the help.
Adam
Note: I am out of office this Fri and the following Mon but otherwise should be
prompt with email replies.
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200527/5a4841d1/attachment-0001.html>
Straw, Adam D via llvm-dev
2020-May-29 20:22 UTC
[llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second attempt to perform debug registration" assert
Closing the gap on this. Moving nGraph MLIR pointer to the 3c5ddd58 git hash
did indeed fix the problem. Thanks so much for all the help.
From: Straw, Adam D
Sent: Wednesday, May 27, 2020 11:54 AM
To: Lang Hames <lhames at gmail.com>; David Blaikie <dblaikie at
gmail.com>
Cc: llvm-dev at lists.llvm.org; Cyphers, Scott <Scott.Cyphers at
intel.com>; Bobba, Jayaram <jayaram.bobba at intel.com>; Mostafa, Nagy
H <Nagy.H.Mostafa at intel.com>; Iyer, Swapna <swapna.iyer at
intel.com>; Caballero, Diego <diego.caballero at intel.com>
Subject: RE: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
attempt to perform debug registration" assert
Lang,
It looks like the change I need is here, correct?
https://github.com/llvm/llvm-project/commit/3c5dd5863c34ecd51e9d2a49929877d8151dea39
I would like to move the nGraph MLIR pointer all the way to the tip of master
but it is expensive (lots of compile and link issues) so hoping to take a baby
step.
Adam
From: Straw, Adam D
Sent: Tuesday, May 26, 2020 1:17 PM
To: Lang Hames <lhames at gmail.com<mailto:lhames at gmail.com>>;
David Blaikie <dblaikie at gmail.com<mailto:dblaikie at gmail.com>>
Cc: llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>;
Cyphers, Scott <Scott.Cyphers at intel.com<mailto:Scott.Cyphers at
intel.com>>; Bobba, Jayaram <jayaram.bobba at
intel.com<mailto:jayaram.bobba at intel.com>>; Mostafa, Nagy H
<Nagy.H.Mostafa at intel.com<mailto:Nagy.H.Mostafa at intel.com>>;
Iyer, Swapna <swapna.iyer at intel.com<mailto:swapna.iyer at
intel.com>>
Subject: RE: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
attempt to perform debug registration" assert
Referring to the log messages from my previous mail… I confused myself (and
probably others) by reading the “Adding MemMgr 0x55555959f440“ message as
“Registering MemMgr 0x55555959f440”. Thus the address mismatch made no sense.
How could we be registering a `MemMgr` address/key that does not match once we
arrive in `notifyObjectLoaded` method?
Answer: Because the registrations is NOT coming from
RTDyldObjectLinkingLayer.cpp. I found this first by adding an additional debug
statement (highlighted) thinking the `reinterpret_cast` may be at fault for the
key mismatch. I never saw this print and I then realized my “Adding” vs.
“Registering” mistake.
// Run EventListener notifyLoaded callbacks.
{
std::lock_guard<std::mutex> Lock(RTDyldLayerMutex);
dbgs() << "Registering MemMgr " << MemMgr <<
"\n";
auto LOIItr = LoadedObjInfos.find(MemMgr);
assert(LOIItr != LoadedObjInfos.end() && "LoadedObjInfo
missing");
for (auto *L : EventListeners) {
dbgs() << " key = " <<
reinterpret_cast<uintptr_t>(MemMgr) << "\n";
L->notifyObjectLoaded(
static_cast<uint64_t>(reinterpret_cast<uintptr_t>(MemMgr)), *Obj,
*LOIItr->second);
}
LoadedObjInfos.erase(MemMgr);
}
So, where is the registration coming from? Remember that I am on a slightly
older version of MLIR. My mlir/lib/ExecutionEngine/ExecutionEngine.cpp has the
following code with highlighted debug statement added by me:
objectLayer->setNotifyLoaded(
[engine = engine.get()](
llvm::orc::VModuleKey, const llvm::object::ObjectFile &object,
const llvm::RuntimeDyld::LoadedObjectInfo &objectInfo) {
if (engine->gdbListener) {
uint64_t key = static_cast<uint64_t>(
reinterpret_cast<uintptr_t>(object.getData().data()));
dbgs() << "ExecutionEngine key = " << key
<< "\n";
engine->gdbListener->notifyObjectLoaded(key, object,
objectInfo);
}
});
The key printed here matches the key printed in `notifyObjectLoaded`. This code
is NOT present at the tip of MLIR master. Hence the solution here may be to
move to latest MLIR master which I probably should have tried before sending
this mail. Will do that now.
Adam
From: Straw, Adam D
Sent: Tuesday, May 26, 2020 9:41 AM
To: Lang Hames <lhames at gmail.com<mailto:lhames at gmail.com>>;
David Blaikie <dblaikie at gmail.com<mailto:dblaikie at gmail.com>>
Cc: llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>;
Cyphers, Scott <Scott.Cyphers at intel.com<mailto:Scott.Cyphers at
intel.com>>; Bobba, Jayaram <jayaram.bobba at
intel.com<mailto:jayaram.bobba at intel.com>>; Mostafa, Nagy H
<Nagy.H.Mostafa at intel.com<mailto:Nagy.H.Mostafa at intel.com>>;
Iyer, Swapna <swapna.iyer at intel.com<mailto:swapna.iyer at
intel.com>>
Subject: RE: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
attempt to perform debug registration" assert
Sorry for the delay. Hope everyone had a good long weekend. See attached log
captured from GDB session. Here is the punchline:
Adding MemMgr 0x55555959f440
unit-test:
/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp:164:
virtual void
{anonymous}::GDBJITRegistrationListener::notifyObjectLoaded(llvm::JITEventListener::ObjectKey,
const llvm::object::ObjectFile&, const
llvm::RuntimeDyld::LoadedObjectInfo&): Assertion `ObjectBufferMap.find(K) ==
ObjectBufferMap.end() && "Second attempt to perform debug
registration."' failed.
(gdb) print/x K
$2 = 0x555559605c50
The key (K) causing the assert a) does not match the addresses of the most
recently added MemMgr and b) does not match the address of ANY MemMgr in the
entire log file. The plot thickens.
Wondering if there is some other object (besides a MemMgr) that registers its
address-as-key into `ObjectBufferMap`.
Adam
From: Lang Hames <lhames at gmail.com<mailto:lhames at gmail.com>>
Sent: Thursday, May 21, 2020 4:53 PM
To: David Blaikie <dblaikie at gmail.com<mailto:dblaikie at
gmail.com>>
Cc: Straw, Adam D <adam.d.straw at intel.com<mailto:adam.d.straw at
intel.com>>; llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>; Cyphers, Scott <scott.cyphers at
intel.com<mailto:scott.cyphers at intel.com>>; Bobba, Jayaram
<jayaram.bobba at intel.com<mailto:jayaram.bobba at intel.com>>;
Mostafa, Nagy H <nagy.h.mostafa at intel.com<mailto:nagy.h.mostafa at
intel.com>>
Subject: Re: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
attempt to perform debug registration" assert
Hi Adam,
Calls to the listeners should be protected by the RTDyldLayerMutex. Could you
apply the attached patch and share the debugging output from one of the failing
runs?
Regards,
Lang.
On Wed, May 20, 2020 at 8:00 PM David Blaikie <dblaikie at
gmail.com<mailto:dblaikie at gmail.com>> wrote:
+Lang
On Wed, May 20, 2020 at 4:44 PM Straw, Adam D via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
Hi all,
Attention: Lang Hames
I am developing the nGraph
MLIR<https://github.com/NervanaSystems/ngraph/tree/master/src/contrib/mlir>
implementation and hitting the following assert while running nGraph unit tests:
assert(ObjectBufferMap.find(K) == ObjectBufferMap.end() &&
"Second attempt to perform debug registration.");
Here is a
permalink<https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L163>
for that line of code on GitHub. The failure only occurs when running multiple
unit tests back-to-back. Inevitably, if I rerun the failing unit test, it
passes. The failure also tends to move around with a different unit tests
failing on successive runs.
I am able to hit the failure in GDB. Here is a partial backtrace:
#3 0x00007fffec4cf412 in __GI___assert_fail (assertion=assertion at
entry=0x7ffff1d145e0 "ObjectBufferMap.find(K) == ObjectBufferMap.end()
&& \"Second attempt to perform debug registration.\"",
file=file at entry=0x7ffff1d144d8
"/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp",
line=line at entry=164,
function=function at entry=0x7ffff1d15520 <(anonymous
namespace)::GDBJITRegistrationListener::notifyObjectLoaded(unsigned long,
llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo
const&)::__PRETTY_FUNCTION__> "virtual void
{anonymous}::GDBJITRegistrationListener::notifyObjectLoaded(llvm::JITEventListener::ObjectKey,
const llvm::object::ObjectFile&, const
llvm::RuntimeDyld::LoadedObjectInfo&)")
at assert.c:101
#4 0x00007ffff01ed4d3 in (anonymous
namespace)::GDBJITRegistrationListener::notifyObjectLoaded (this=0x5555591a9270,
K=93825060893584, Obj=..., L=...)
at
/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp:163
#5 0x00007ffff01b0eca in std::function<void (unsigned long,
llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo
const&)>::operator()(unsigned long, llvm::object::ObjectFile const&,
llvm::RuntimeDyld::LoadedObjectInfo const&) const (__args#2=...,
__args#1=..., __args#0=<optimized out>, this=0x5555594e09b8) at
/usr/include/c++/7/bits/std_function.h:706
#6 llvm::orc::RTDyldObjectLinkingLayer::onObjLoad (this=<optimized out>,
K=3, R=..., Obj=..., MemMgr=0x5555591c4160,
LoadedObjInfo=std::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo> =
{...},
Resolved=std::map with 8 elements = {...}, InternalSymbols=std::set with 0
elements)
at
/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:267
A little debug context:
* Frame #6 with K=3 we are about to execute the `NotifyLoaded` callback
* Frame #5 we execute the `NotifyLoaded` callback which apparently
transforms K from 3 to 93825060893584 (0x5555596CF390) which looks like a
pointer. I don’t know the code well enough to grok what’s going on here. I’m
not even sure where this callback resides, but it could be
here<https://github.com/llvm/llvm-project/blob/c66f89005f6d23b6885d8f93f33ff27dc60ce7dd/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp#L312>
in RTDyldObjectLinkingLater.cpp where we are using the address of the `MemMgr`
as K.
* Frame #4 now in `notifyObjectLoaded` with K=93825060893584 which will go
on to assert in frame #3 due to the fact that that key (K) already exists in the
map
All of this leads me to believe there is some sort of race. Theory: with two
unit tests A and B running back to back…
1. Unit test A allocates and deallocates a `MemMgr` (or whatever other object
we are using for a key) at address X
2. Unit test A calls ` notifyFreeingObject` to free the object at K=X but
gets stuck (does not lock) the ` JITDebugLock` mutex
here<https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L181>
for whatever reason
3. Meanwhile unit test B allocates a `MemMgr` (or whatever other object we
are using for a key) at newly freed address X
4. Unit test B calls `notifyObjectLoaded` to register the object at K=X and
locks the ` JITDebugLock`
here<https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L162>
with unit test A still waiting in step #2 above
5. This goes on to assert as the key was not erased (step #2) before it was
added (step #4)
I could use a little help to debug the error further. Curiously, we do hit this
issue in continuous integration (CI) --- only when running on development
systems “by hand”. I am investigating the differences (perhaps single vs. multi
threaded?) as to why this might be the case.
Note that I am using a slightly old version of MLIR master, I’m at this commit:
commit 2f8b4545f4960778e37114c024073d208751ca89
Author: Adam Straw <adam.d.straw at intel.com<mailto:adam.d.straw at
intel.com>>
Date: Tue Apr 14 22:49:18 2020 +0300
[mlir] Fix assert on signed integer type in EDSC
Integer type in Std dialect is signless so we should be checking
for signless integer type instead of signed integer type in EDSC.
Differential Revision: https://reviews.llvm.org/D78144
Thanks in advance for the help.
Adam
Note: I am out of office this Fri and the following Mon but otherwise should be
prompt with email replies.
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200529/ae8b5f35/attachment-0001.html>