Sean Silva
2015-Apr-29 00:36 UTC
[LLVMdev] RFC: Machine Level IR text-based serialization format
On Tue, Apr 28, 2015 at 3:51 PM, David Majnemer <david.majnemer at gmail.com> wrote:
> I love the idea of having some sort of textual representation. My only
> concern is that our YAML parser is not very actively maintained (is there
> someone expert with its implementation *and* active in the project?) and
> (IMHO) over-engineered when compared to the simplicity of our custom IR
> parser.
>
> Without TLC, I'm afraid it would make for a poor piece of LLVM
> infrastructure to rely on. The reliability of the serialization mechanism
> is very important if we are to have any chance of applying fuzz testing to
> the backend pieces; after all, testability is a huge motivation for this
> work.
>
> As a concrete example, a file solely containing '%' crashes the yaml
> parser:
> $ ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml
> yaml2obj: ~/llvm/src/lib/Support/YAMLTraits.cpp:78: bool llvm::yaml::Input::setCurrentDocument(): Assertion `Strm->failed() && "Root is NULL iff parsing failed"' failed.
> 0  yaml2obj        0x000000000048682e
> 1  yaml2obj        0x0000000000486b43
> 2  yaml2obj        0x000000000048570e
> 3  libpthread.so.0 0x00007f5e79643340
> 4  libc.so.6       0x00007f5e78c9acc9 gsignal + 57
> 5  libc.so.6       0x00007f5e78c9e0d8 abort + 328
> 6  libc.so.6       0x00007f5e78c93b86
> 7  libc.so.6       0x00007f5e78c93c32
> 8  yaml2obj        0x000000000045f378
> 9  yaml2obj        0x000000000040d4b3
> 10 yaml2obj        0x000000000040b0fa
> 11 yaml2obj        0x0000000000404a79
> 12 yaml2obj        0x0000000000404dd8
> 13 libc.so.6       0x00007f5e78c85ec5 __libc_start_main + 245
> 14 yaml2obj        0x0000000000404879
> Stack dump:
> 0. Program arguments: ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml

Hopefully a fuzzer that is fuzzing a yaml input would not waste its time with syntactically invalid or unusual YAML.

Also, you're thinking of YAMLIO which is a layer on top of the YAML parser (YAMLParser.{h,cpp}). It might make sense to not use YAMLIO (it is good for some types of data, not for all) but still use the YAML parser.
-- Sean Silva

> On Tue, Apr 28, 2015 at 2:00 PM, Alex L <arphaman at gmail.com> wrote:
>
>> 2015-04-28 10:14 GMT-07:00 Quentin Colombet <qcolombet at apple.com>:
>>
>>> Hi Alex,
>>>
>>> Thanks for working on this.
>>>
>>> Personally I would rather not have to write YAML inputs but instead
>>> rely on what the machine dumps look like. That being said, I can live
>>> with YAML :).
>>>
>>> More importantly, how do you plan to report syntax errors to the users?
>>> Things like invalid instruction, invalid registers, etc.?
>>> What about unallocated code, i.e., virtual registers, invalid SSA form,
>>> etc.?
>>>
>>> Cheers,
>>> Q.
>>
>> Thanks,
>>
>> Unfortunately, the machine dumps are quite incomplete (and tricky to
>> parse too!), and thus some sort of new syntax has to be developed.
>> I think that a YAML based container is a good candidate for this purpose,
>> as it has a structured format that represents things like machine functions,
>> frame information, register information, target specific machine function
>> details, etc in a clear and readable way.
>>
>> I haven't thought about error reporting that much, as I've been mostly
>> working on developing the syntax and making sure that all the data
>> structures can be represented by it. But I believe that the errors that
>> crop up in an invalid machine instruction syntax, like invalid basic block
>> references, invalid instructions, etc. can be reported quite well and I can
>> rely on already existing error reporting facilities in LLVM to help me.
>> The more structural errors, like missing attributes, will be handled by the
>> YAML parser automatically, and I might extend it to provide better/more
>> specific error messages. And I think that it's possible to use the machine
>> verifier to catch the other errors that you've mentioned.
>>
>> Alex
>>
>>> On Apr 28, 2015, at 9:56 AM, Alex L <arphaman at gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I would like to propose a text-based, human readable format that will be used to
>>> serialize the machine level IR. The major goal of this format is to allow LLVM
>>> to save the machine level IR after any code generation pass and then to load
>>> it again and continue running passes on the machine level IR. The primary use case
>>> of this format is to enable easier testing process for the code generation passes,
>>> by allowing the developers to write tests that load the IR, then invoke just a
>>> specific code gen pass and then inspect the output of that pass by checking the
>>> printed out IR.
>>>
>>> The proposed format has a number of key features:
>>> - It stores the machine level IR and the optional LLVM IR in one text file.
>>> - The connections between the machine level IR and the LLVM IR are preserved.
>>> - The format uses a YAML based container for most of the data structures. The LLVM
>>>   IR is embedded in the YAML container.
>>> - The format also uses a new, text-based syntax to serialize the machine instructions.
>>>   The instructions are embedded in YAML.
>>>
>>> This is an incomplete example of a YAML file containing the LLVM IR, the machine level IR
>>> and the instructions:
>>>
>>> ---
>>> ir: |
>>>   define i32 @fact(i32 %n) {
>>>     %1 = alloca i32, align 4
>>>     store i32 %n, i32* %1, align 4
>>>     %2 = load i32, i32* %1, align 4
>>>     %3 = icmp eq i32 %2, 0
>>>     br i1 %3, label %10, label %4
>>>
>>>   ; <label>:4                                       ; preds = %0
>>>     %5 = load i32, i32* %1, align 4
>>>     %6 = sub nsw i32 %5, 1
>>>     %7 = call i32 @fact(i32 %6)
>>>     %8 = load i32, i32* %1, align 4
>>>     %9 = mul nsw i32 %7, %8
>>>     br label %10
>>>
>>>   ; <label>:10                                      ; preds = %0, %4
>>>     %11 = phi i32 [ %9, %4 ], [ 1, %0 ]
>>>     ret i32 %11
>>>   }
>>>
>>> ...
>>> ---
>>> number: 0
>>> name: fact
>>> alignment: 4
>>> regInfo:
>>>   ....
>>> frameInfo:
>>>   ....
>>> body:
>>>   - bb: 0
>>>     llbb: '%0'
>>>     successors: [ 'bb#2', 'bb#1' ]
>>>     liveIns: [ '%edi' ]
>>>     instructions:
>>>       - 'push64r undef %rax, %rsp, %rsp'
>>>       - 'mov32mr %rsp, 1, %noreg, 4, %noreg, %edi'
>>>       - ....
>>>     ....
>>>   - bb: 1
>>>     llbb: '%4'
>>>     successors: [ 'bb#2' ]
>>>     instructions:
>>>       - '%edi = mov32rm %rsp, 1, %noreg, 4, %noreg'
>>>       - ....
>>>     ....
>>>   - ....
>>>   ....
>>> ...
>>>
>>> The example above shows a YAML file with two YAML documents (delimited by `---`
>>> and `...`) containing the LLVM IR and the machine function information for the function `fact`.
>>>
>>> When a specific format is chosen, I'll start with patches that serialize the
>>> embedded LLVM IR. Then I'll add support for things like machine functions and
>>> machine basic blocks, and I think that an intrusive implementation will work best
>>> for data structures like these. After that I will continue adding support for
>>> serialization of the remaining data structures.
>>>
>>> Thanks for reading through the proposal. What are your thoughts about this format?
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu llvm.cs.uiuc.edu
>>> lists.cs.uiuc.edu/mailman/listinfo/llvmdev
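
The proposal quoted above describes one text file holding two YAML documents delimited by `---` and `...`. As a rough illustration of that container layout only, here is a minimal Python sketch that splits such a stream on the marker lines. It is not the LLVM YAML parser and handles only the simple case: markers alone on a line at column zero, and nothing before the first `---`. The function name `split_yaml_documents` and the shortened IR in `SAMPLE` are made up for this example.

```python
def split_yaml_documents(text):
    """Split a YAML stream into document bodies using '---' and '...' lines.

    Simplified sketch: only bare marker lines are recognised, and any text
    that precedes the first '---' is ignored.
    """
    documents = []
    current = None  # None means we are currently between documents
    for line in text.splitlines():
        marker = line.rstrip()
        if marker == "---":
            # A start marker also closes a document that is still open.
            if current is not None:
                documents.append("\n".join(current))
            current = []
        elif marker == "...":
            if current is not None:
                documents.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    if current is not None:
        documents.append("\n".join(current))
    return documents


# Shortened stand-in for the file in the proposal (the IR body is not the
# original function, just enough text to show the layout).
SAMPLE = """\
---
ir: |
  define i32 @fact(i32 %n) {
    ret i32 %n
  }
...
---
number: 0
name: fact
...
"""

docs = split_yaml_documents(SAMPLE)
for index, doc in enumerate(docs):
    print(f"document {index}: first line is {doc.splitlines()[0]!r}")
```

Because the embedded LLVM IR is indented inside the `ir: |` block scalar, the `...` line at column zero is unambiguous as a document terminator in this layout.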
David Majnemer
2015-Apr-29 00:59 UTC
[LLVMdev] RFC: Machine Level IR text-based serialization format
On Tuesday, April 28, 2015, Sean Silva <chisophugis at gmail.com> wrote:
> Hopefully a fuzzer that is fuzzing a yaml input would not waste its time
> with syntactically invalid or unusual YAML.

Maybe. I don't see why we would want to lock ourselves out of using afl-fuzz though.
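
To make the fuzzing concern concrete, the following Python sketch shows the property a byte-level fuzzer checks: arbitrary input may be rejected, but only through a diagnosed error, never through an assertion or another unexpected failure. Everything here is hypothetical. `parse_key_value_lines` is a toy stand-in rather than LLVM's YAML parser, and the loop is plain random fuzzing, not the coverage-guided mutation that afl-fuzz performs.

```python
import random


class ParseError(Exception):
    """The only failure a well-behaved parser should expose to callers."""


def parse_key_value_lines(data):
    """Toy stand-in parser: accepts UTF-8 text made of 'key: value' lines."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ParseError(f"invalid UTF-8: {exc}") from exc
    result = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        key, separator, value = line.partition(":")
        if not separator or not key.strip():
            raise ParseError(f"line {number}: expected 'key: value'")
        result[key.strip()] = value.strip()
    return result


def fuzz(parser, iterations=2000, seed=0):
    """Feed random byte strings to `parser`; return the inputs that crashed it."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        length = rng.randrange(0, 32)
        data = bytes(rng.randrange(256) for _ in range(length))
        try:
            parser(data)
        except ParseError:
            pass  # a diagnosed error is the desired outcome for bad input
        except Exception:
            crashes.append(data)  # anything else counts as a parser bug
    return crashes


def fragile_parser(data):
    """Deliberately buggy parser that asserts instead of reporting an error."""
    assert not data.startswith(b"%"), "root is null"
    return {}


print("crashes in toy parser:", len(fuzz(parse_key_value_lines)))
print("fragile parser survives '%':", fuzz(lambda _: fragile_parser(b"%"), iterations=1) == [])
```

The second check mirrors the shape of the '%' report: a single-character input that trips an assertion is recorded as a crash rather than as a rejected file.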
Hayden Livingston
2015-Apr-29 05:35 UTC
[LLVMdev] RFC: Machine Level IR text-based serialization format
As an aside, you haven't mentioned but will the IR parser be rewritten at all? Is the YAML a container on top of the IR? If you are rewriting the IR parser, would it be possible to maintain some sort of grammar?
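
On the question of a grammar: the proposal does not define one for the machine instruction strings, so the following is only a guess at what a small grammar could look like, inferred from the three instruction strings in the example. The grammar, the `undef` operand flag handling, and all names (`parse_instruction`, `InstructionSyntaxError`) are assumptions made for illustration.

```python
import re

# Hypothetical grammar, inferred from the example strings only:
#   instruction := [ register "=" ] opcode [ operand { "," operand } ]
#   operand     := [ "undef" ] register | integer
#   register    := "%" identifier
_REGISTER = re.compile(r"%[A-Za-z_][A-Za-z0-9_]*")
_INTEGER = re.compile(r"-?[0-9]+")
_OPCODE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


class InstructionSyntaxError(ValueError):
    """Raised with a readable message when an instruction string is malformed."""


def parse_operand(text):
    """Parse one operand into a (kind, value, flags) tuple."""
    tokens = text.split()
    flags = ()
    if len(tokens) == 2 and tokens[0] == "undef":
        flags = ("undef",)
        tokens = tokens[1:]
    if len(tokens) != 1:
        raise InstructionSyntaxError(f"malformed operand: {text.strip()!r}")
    token = tokens[0]
    if _REGISTER.fullmatch(token):
        return ("reg", token, flags)
    if _INTEGER.fullmatch(token) and not flags:
        return ("imm", int(token), ())
    raise InstructionSyntaxError(f"malformed operand: {text.strip()!r}")


def parse_instruction(text):
    """Parse an instruction string into (result, opcode, operands)."""
    result = None
    body = text.strip()
    if "=" in body:
        left, body = body.split("=", 1)
        result = left.strip()
        if not _REGISTER.fullmatch(result):
            raise InstructionSyntaxError(f"invalid result register: {result!r}")
        body = body.strip()
    opcode, _, rest = body.partition(" ")
    if not _OPCODE.fullmatch(opcode):
        raise InstructionSyntaxError(f"invalid opcode: {opcode!r}")
    operands = [parse_operand(part) for part in rest.split(",")] if rest.strip() else []
    return (result, opcode, operands)


for source in ("push64r undef %rax, %rsp, %rsp",
               "%edi = mov32rm %rsp, 1, %noreg, 4, %noreg"):
    print(parse_instruction(source))
```

Keeping the grammar in one place next to the parser, as in the comment block above, is one way to keep the two from drifting apart; it also gives syntax errors an obvious place to be reported from.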
Duncan P. N. Exon Smith
2015-Apr-29 18:41 UTC
[LLVMdev] RFC: Machine Level IR text-based serialization format
> On 2015-Apr-28, at 17:59, David Majnemer <david.majnemer at gmail.com> wrote:
>
> Maybe. I don't see why we would want to lock ourselves out of using afl-fuzz though.

I don't think we're locked out of anything. We should fix bugs in the YAML parser as we find them.
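
Earlier in the thread, Alex suggests that structural errors such as missing attributes can be caught when the YAML container is read. A minimal Python sketch of that kind of check is below, operating on an already-parsed mapping. Which attributes are mandatory is not stated in the proposal, so the two `REQUIRED_*` tuples are assumptions, as is the function name.

```python
# Assumed sets of mandatory attributes; the proposal does not specify them.
REQUIRED_FUNCTION_KEYS = ("number", "name", "body")
REQUIRED_BLOCK_KEYS = ("bb", "instructions")


def find_missing_attributes(function):
    """Return readable diagnostics for attributes missing from a machine function."""
    errors = []
    for key in REQUIRED_FUNCTION_KEYS:
        if key not in function:
            errors.append(f"machine function: missing attribute '{key}'")
    for index, block in enumerate(function.get("body", [])):
        for key in REQUIRED_BLOCK_KEYS:
            if key not in block:
                errors.append(f"body[{index}]: missing attribute '{key}'")
    return errors


machine_function = {
    "number": 0,
    "name": "fact",
    "body": [
        {"bb": 0, "llbb": "%0", "instructions": []},
        {"llbb": "%4", "instructions": []},  # 'bb' left out on purpose
    ],
}
for message in find_missing_attributes(machine_function):
    print(message)
```

Collecting every diagnostic before reporting, instead of stopping at the first one, lets a test author fix a hand-written file in a single pass.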