Our tokenizer recognizes

  [A-Za-z0-9_.$/\\~=+[]*?\-:!<>]+

as a token. gold uses more complex rules to tokenize. I don't think we need rules that complex, but there seems to be room to improve our tokenizer. In particular, I believe we can parse the Linux kernel's linker script by changing the tokenizer rules as follows:

  [A-Za-z_.$/\\~=+[]*?\-:!<>][A-Za-z0-9_.$/\\~=+[]*?\-:!<>]*

or

  [0-9]+

On Mon, Jan 23, 2017 at 9:25 AM, George Rimar via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> > I'm not sure if it is easy, but I think that it's clear that the
> > linkerscript lexer needs to be improved. I think that is the source of the
> > problems with `*(.apicdrivers);` as well. This is not the first bug
> > related to lexing that we have run into (e.g. lexing `.=` as a single
> > token is the cause of https://llvm.org/bugs/show_bug.cgi?id=31128 ).
> >
> > -- Sean Silva
>
> PR31128 seems not to be an issue. Neither gold nor bfd accepts '.='.
> So it seems the only known issue we have is with math expressions like "x = 5*4".
> I am going to look again at how to fix that.
>
> George.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
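To see what the rule change above does to an expression such as "x = 5*4", here is a minimal Python sketch. It is not lld's actual C++ lexer: the `tokenize` helper and the names `CURRENT` and `PROPOSED` are made up for illustration, `[` and `]` are escaped on the assumption that the message means them as literal members of the character class, and any character outside the class is simply emitted as a one-character token.

```python
import re

# Character class from the message. '[' and ']' are escaped so Python's re
# module accepts them as literals (an assumption about the intended meaning).
CHARS = r"A-Za-z0-9_.$/\\~=+\[\]*?\-:!<>"
FIRST = r"A-Za-z_.$/\\~=+\[\]*?\-:!<>"  # same set, minus the digits

CURRENT = re.compile("[" + CHARS + "]+")
PROPOSED = re.compile("[" + FIRST + "][" + CHARS + "]*|[0-9]+")


def tokenize(rule, text):
    """Split text into tokens using the given rule (hypothetical helper)."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = rule.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            # Characters outside the class, such as ';', stand alone.
            tokens.append(text[pos])
            pos += 1
    return tokens


print(tokenize(CURRENT, "x = 5*4;"))   # ['x', '=', '5*4', ';']
print(tokenize(PROPOSED, "x = 5*4;"))  # ['x', '=', '5', '*4', ';']
```

Under these assumptions the proposed rule separates the leading number, but the operator still stays attached to the operand that follows it.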
> Our tokenizer recognizes
>
>   [A-Za-z0-9_.$/\\~=+[]*?\-:!<>]+
>
> as a token. gold uses more complex rules to tokenize. I don't think we need rules that complex, but there seems to be
> room to improve our tokenizer. In particular, I believe we can parse the Linux kernel's linker script by changing the tokenizer rules as
> follows:
>
>   [A-Za-z_.$/\\~=+[]*?\-:!<>][A-Za-z0-9_.$/\\~=+[]*?\-:!<>]*
>
> or
>
>   [0-9]+

That should probably help a bit, but it does not solve the problem in general. I think it will not work for expressions like

  . = z5*4;

because it will read "z5*4" as a single token.

I was thinking about entering a special parser state that transparently extracts sub-tokens from tokens while we are inside the code that evaluates an expression.

I think we can start with your suggestion and see how it works, and whether we really encounter scripts written like the one above in real life. At least it is not harmful, and it should help with the kernel. I'll try to prepare a patch if you do not mind.

George.
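The sub-token idea described above might look roughly like the following Python sketch. Everything in it is hypothetical: the helper names `split_expr_token` and `next_tokens`, the `in_expression` flag, and the operator set are assumptions made for illustration, and the thread does not say which operators would need splitting or how lld would track the parser state.

```python
import re

# Assumed operator set; the capture group keeps the operators in the output.
_OPERATORS = re.compile(r"([*/+\-])")


def split_expr_token(token):
    """Split one coarse token into operands and operators."""
    return [part for part in _OPERATORS.split(token) if part]


def next_tokens(token, in_expression):
    # Outside an expression, a token such as "*.o" must stay in one piece,
    # so splitting happens only when the parser expects an expression.
    return split_expr_token(token) if in_expression else [token]


print(next_tokens("z5*4", in_expression=True))   # ['z5', '*', '4']
print(next_tokens("*.o", in_expression=False))   # ['*.o']
```

The point of the flag is that the tokenizer itself stays context-free, while the expression parser decides when a coarse token needs to be broken up further.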
> Our tokenizer recognizes
>
>   [A-Za-z0-9_.$/\\~=+[]*?\-:!<>]+
>
> as a token. gold uses more complex rules to tokenize. I don't think we need rules that complex, but there seems to be
> room to improve our tokenizer. In particular, I believe we can parse the Linux kernel's linker script by changing the tokenizer rules as
> follows:
>
>   [A-Za-z_.$/\\~=+[]*?\-:!<>][A-Za-z0-9_.$/\\~=+[]*?\-:!<>]*
>
> or
>
>   [0-9]+

After more investigation, it seems this will not work so simply. Here are examples where it breaks:

  . = 0x1000;  (gives tokens "0", "x1000")
  . = A*10;    (gives "A*10")
  . = 10k;     (gives "10", "k")
  . = 10*5;    (gives "10", "*5")

"[0-9]+" could be "[0-9][kmhKMHx0-9]*", but for "10*5" that still gives the tokens "10" and "*5".

I do not think we can add operator handling here, because it is hard to assume any context at the tokenizing step: we do not know whether we are parsing a file name or a math expression.

Maybe it is worth trying to handle this at a higher level, during evaluation of expressions?

George.
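The cases above can be reproduced with a short Python sketch that compares the plain number rule with the suffixed variant. This is only an approximation of the proposal, not lld code: the `tokens` helper and the names `PLAIN` and `SUFFIXED` are invented here, `[` and `]` are escaped on the assumption that they are meant as literal characters, and any other non-space character is treated as a one-character token.

```python
import re

# First alternative of the proposed rule, with '[' and ']' escaped for Python
# (the escaping is an assumption about what the message intends).
IDENT = r"[A-Za-z_.$/\\~=+\[\]*?\-:!<>][A-Za-z0-9_.$/\\~=+\[\]*?\-:!<>]*"

PLAIN = r"[0-9]+"
SUFFIXED = r"[0-9][kmhKMHx0-9]*"


def tokens(number_rule, text):
    # Try IDENT first, then the number rule; any other non-space character
    # (such as ';') becomes a token of its own.
    return re.findall(IDENT + "|" + number_rule + r"|\S", text)


for expr in [". = 0x1000;", ". = A*10;", ". = 10k;", ". = 10*5;"]:
    print(expr, tokens(PLAIN, expr), tokens(SUFFIXED, expr))
```

In this sketch the suffixed rule keeps "0x1000" and "10k" whole, while "A*10" remains a single token and "10*5" still splits into "10" and "*5" under both rules.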