(sorry for the duplicate Fred, I failed at reply-all the first time) On Wed, Oct 22, 2014 at 6:33 PM, Frédéric Riss <friss at apple.com> wrote:> > > On Oct 22, 2014, at 4:57 PM, David Blaikie <dblaikie at gmail.com> wrote: > > > > Just working on some of the gmlt+fission debug info stuff and I came > across a comment that might be relevant to reducing the number of distinct > debugloc metadata nodes: > > > > "or some sub-optimal metadata that > > // isn't structurally identical (see: file path/name info from clang, > which > > // includes the directory of the cpp file being built, even when the > file name > > // is absolute (such as an <> lookup header)))" > > > > Seems that the file path/name isn't well canonicalized so as to allow > metadata level merging when linking. Might be helpful to figure out that > issue at some point. > > Incidentally I worked on an issue last week where the line table would get > entries representing the same file, but where the file/dir split wasn’t > done at the same place. I have a patch that remerges them at emission, but > I was planing on investigating more the source of the duplication before I > submit anything. > > The cases I’ve seen have one duplicated entry though, nothing that could > have a visible impact on memory consumption. >So the particular case where I think this arises in a way that might be measurable is if you have a build system that changes directories to build subprojects (like our make build system, if I understand correctly - but not our cmake build system, again, if I understand correctly): imagine a simple directory layout: include/ foo.h lib/ a/ a.cpp // includes foo.h and calls one inline function from it (or uses a type, etc) from some external function a() b/ b.cpp // does the same thing as a.cpp, but with its own external function, b() if you run "clang++ -emit-llvm -S -Iinclude -c lib/a/a.cpp lib/b/b.cpp -g" you get two .ll files both with the obvious: !9 = metadata !{metadata !"include/foo.h", metadata !"/tmp/dbginfo/pathtest"} But if you do this instead: "cd lib/a; clang++ -emit-llvm -S -I../../include -c a.cpp -g; cd ../../lib/b; clang++ -emit-llvm -S -I../../include -c b.cpp -g" you get two different nodes: !9 = metadata !{metadata !"../../include/foo.h", metadata !"/tmp/dbginfo/pathtest/lib/b"} !9 = metadata !{metadata !"../../include/foo.h", metadata !"/tmp/dbginfo/pathtest/lib/a"} and now you're left with a situation in which almost all the metadata is different and any place you were relying on the standard metadata uniquing you won't get it :(> > Fred > > > > - David > > _______________________________________________ > > LLVM Developers mailing list > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141023/da2dff8b/attachment.html>
> On 2014-Oct-23, at 09:19, David Blaikie <dblaikie at gmail.com> wrote: > > (sorry for the duplicate Fred, I failed at reply-all the first time) > > On Wed, Oct 22, 2014 at 6:33 PM, Frédéric Riss <friss at apple.com> wrote: > > > On Oct 22, 2014, at 4:57 PM, David Blaikie <dblaikie at gmail.com> wrote: > > > > Just working on some of the gmlt+fission debug info stuff and I came across a comment that might be relevant to reducing the number of distinct debugloc metadata nodes: > > > > "or some sub-optimal metadata that > > // isn't structurally identical (see: file path/name info from clang, which > > // includes the directory of the cpp file being built, even when the file name > > // is absolute (such as an <> lookup header)))" > > > > Seems that the file path/name isn't well canonicalized so as to allow metadata level merging when linking. Might be helpful to figure out that issue at some point. > > Incidentally I worked on an issue last week where the line table would get entries representing the same file, but where the file/dir split wasn’t done at the same place. I have a patch that remerges them at emission, but I was planing on investigating more the source of the duplication before I submit anything. > > The cases I’ve seen have one duplicated entry though, nothing that could have a visible impact on memory consumption. > > So the particular case where I think this arises in a way that might be measurable is if you have a build system that changes directories to build subprojects (like our make build system, if I understand correctly - but not our cmake build system, again, if I understand correctly): > > imagine a simple directory layout: > > include/ > foo.h > lib/ > a/ > a.cpp // includes foo.h and calls one inline function from it (or uses a type, etc) from some external function a() > b/ > b.cpp // does the same thing as a.cpp, but with its own external function, b() > > if you run "clang++ -emit-llvm -S -Iinclude -c lib/a/a.cpp lib/b/b.cpp -g" you get two .ll files both with the obvious: > !9 = metadata !{metadata !"include/foo.h", metadata !"/tmp/dbginfo/pathtest"} > But if you do this instead: "cd lib/a; clang++ -emit-llvm -S -I../../include -c a.cpp -g; cd ../../lib/b; clang++ -emit-llvm -S -I../../include -c b.cpp -g" you get two different nodes: > > !9 = metadata !{metadata !"../../include/foo.h", metadata !"/tmp/dbginfo/pathtest/lib/b"} > > !9 = metadata !{metadata !"../../include/foo.h", metadata !"/tmp/dbginfo/pathtest/lib/a"} > and now you're left with a situation in which almost all the metadata is different and any place you were relying on the standard metadata uniquing you won't get it :( >This might be fixed by making `MDFile` (or `DIFile`) first-class. We just need to canonicalize on creation. class MDFile { public: // Split the path at the right place. MDFile *get(LLVMContext &C, StringRef Path); // Convenience for callers, but the path gets canonicalized. MDFile *get(LLVMContext &C, StringRef File, StringRef Dir); StringRef getFilename() const; StringRef getDirectory() const; }; Note that whether we continue to use `MDString` under the hood is an implementation detail. However, path canonicalization (in particular, eating "..") requires a `stat()` to do correctly on *NIX, so the implementation would have to cache lookups. Doesn't seem hard though.
On Thu, Oct 23, 2014 at 9:43 AM, Duncan P. N. Exon Smith < dexonsmith at apple.com> wrote:> > > On 2014-Oct-23, at 09:19, David Blaikie <dblaikie at gmail.com> wrote: > > > > (sorry for the duplicate Fred, I failed at reply-all the first time) > > > > On Wed, Oct 22, 2014 at 6:33 PM, Frédéric Riss <friss at apple.com> wrote: > > > > > On Oct 22, 2014, at 4:57 PM, David Blaikie <dblaikie at gmail.com> wrote: > > > > > > Just working on some of the gmlt+fission debug info stuff and I came > across a comment that might be relevant to reducing the number of distinct > debugloc metadata nodes: > > > > > > "or some sub-optimal metadata that > > > // isn't structurally identical (see: file path/name info from > clang, which > > > // includes the directory of the cpp file being built, even when the > file name > > > // is absolute (such as an <> lookup header)))" > > > > > > Seems that the file path/name isn't well canonicalized so as to allow > metadata level merging when linking. Might be helpful to figure out that > issue at some point. > > > > Incidentally I worked on an issue last week where the line table would > get entries representing the same file, but where the file/dir split wasn’t > done at the same place. I have a patch that remerges them at emission, but > I was planing on investigating more the source of the duplication before I > submit anything. > > > > The cases I’ve seen have one duplicated entry though, nothing that could > have a visible impact on memory consumption. > > > > So the particular case where I think this arises in a way that might be > measurable is if you have a build system that changes directories to build > subprojects (like our make build system, if I understand correctly - but > not our cmake build system, again, if I understand correctly): > > > > imagine a simple directory layout: > > > > include/ > > foo.h > > lib/ > > a/ > > a.cpp // includes foo.h and calls one inline function from it (or > uses a type, etc) from some external function a() > > b/ > > b.cpp // does the same thing as a.cpp, but with its own external > function, b() > > > > if you run "clang++ -emit-llvm -S -Iinclude -c lib/a/a.cpp lib/b/b.cpp > -g" you get two .ll files both with the obvious: > > !9 = metadata !{metadata !"include/foo.h", metadata > !"/tmp/dbginfo/pathtest"} > > But if you do this instead: "cd lib/a; clang++ -emit-llvm -S > -I../../include -c a.cpp -g; cd ../../lib/b; clang++ -emit-llvm -S > -I../../include -c b.cpp -g" you get two different nodes: > > > > !9 = metadata !{metadata !"../../include/foo.h", metadata > !"/tmp/dbginfo/pathtest/lib/b"} > > > > !9 = metadata !{metadata !"../../include/foo.h", metadata > !"/tmp/dbginfo/pathtest/lib/a"} > > and now you're left with a situation in which almost all the metadata is > different and any place you were relying on the standard metadata uniquing > you won't get it :( > > > > This might be fixed by making `MDFile` (or `DIFile`) first-class. We > just need to canonicalize on creation. > > class MDFile { > public: > // Split the path at the right place. > MDFile *get(LLVMContext &C, StringRef Path); > > // Convenience for callers, but the path gets canonicalized. > MDFile *get(LLVMContext &C, StringRef File, StringRef Dir); > > StringRef getFilename() const; > StringRef getDirectory() const; > }; > > Note that whether we continue to use `MDString` under the hood is an > implementation detail. > > However, path canonicalization (in particular, eating "..") requires > a `stat()` to do correctly on *NIX, so the implementation would have > to cache lookups. Doesn't seem hard though. >Seems mostly orthogonal to MDFile being first class - I imagine the logic would be implemented just as well in DIBuilder::createFile and moved around wherever it happens to go when MDFile/DIFile becomes first class. (& it'll have to be done on construction/during frontend work (preaching to the choir, but just saying it to make sure I'm understanding, etc) because the files might not be around to run 'stat' on by the time we're processing IR - doesn't mean the code for it has to be written in the frontend, it can still be in a utility like DIBuilder, so long as we never have to conjure up new DIFiles during optimizations (or if we do so, they don't need to be canonicalized, which sounds plausible)) - David -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141023/4aa445ed/attachment.html>