>Our tokenizer recognizes
>
>    [A-Za-z0-9_.$/\\~=+[]*?\-:!<>]+
>
>as a token. gold uses more complex rules to tokenize. I don't think we need
>rules that complex, but there seems to be room to improve our tokenizer. In
>particular, I believe we can parse the Linux kernel's linker script by
>changing the tokenizer rules as follows:
>
>    [A-Za-z_.$/\\~=+[]*?\-:!<>][A-Za-z0-9_.$/\\~=+[]*?\-:!<>]*
>
>or
>
>    [0-9]+

After more investigation, it seems this will not work so simply. Here are
examples where it breaks:

    . = 0x1000;  (gives tokens "0", "x1000")
    . = A*10;    (gives "A*10")
    . = 10k;     (gives "10", "k")
    . = 10*5;    (gives "10", "*5")

"[0-9]+" could be changed to "[0-9][kmhKMHx0-9]*", but for "10*5" that
still gives the tokens "10" and "*5". And I do not think we can add special
handling of operators, since it is hard to assume any context at the
tokenizing step: we do not know whether we are parsing a file name or a
math expression.

Maybe it is worth handling this at a higher level, during evaluation of
expressions?

George.
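(The breakage listed above can be reproduced with a short Python sketch of
the proposed rules. This is a hypothetical stand-alone tokenizer written
for illustration, not LLD's actual code.)

```python
import re

# The two token rules from the proposal: an identifier-like token that
# cannot start with a digit, and an all-digit number token.
IDENT = r'[A-Za-z_.$/\\~=+\[\]*?\-:!<>][A-Za-z0-9_.$/\\~=+\[\]*?\-:!<>]*'
NUMBER = r'[0-9]+'

def tokenize(s):
    """Scan left to right, skipping whitespace and characters (like ';')
    that belong to neither rule."""
    tokens, pos = [], 0
    while pos < len(s):
        m = re.match(NUMBER, s[pos:]) or re.match(IDENT, s[pos:])
        if m:
            tokens.append(m.group(0))
            pos += m.end()
        else:
            pos += 1
    return tokens

print(tokenize(". = 0x1000;"))  # ['.', '=', '0', 'x1000']
print(tokenize(". = 10*5;"))    # ['.', '=', '10', '*5']
```

Because a hex constant starts with a digit but continues with letters, and
because `*` sits in the identifier character class, exactly the splits
George lists fall out of these rules.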
Well, maybe we should just change the Linux kernel instead of tweaking our
tokenizer too hard.

On Tue, Jan 24, 2017 at 7:57 AM, George Rimar <grimar at accesssoftek.com>
wrote:
> [...]
>
> Maybe it is worth handling this at a higher level, during evaluation of
> expressions?
>
> George.
> Well, maybe we should just change the Linux kernel instead of tweaking our
> tokenizer too hard.

I agree. For now I am inclined to do that and watch for other scripts.

George.
On Tue, Jan 24, 2017 at 11:29 AM, Rui Ueyama <ruiu at google.com> wrote:
> Well, maybe we should just change the Linux kernel instead of tweaking
> our tokenizer too hard.

This is silly. Writing a simple and maintainable lexer is not hard (look,
e.g., at https://reviews.llvm.org/D10817). There are some complicated
context-sensitive cases in linker scripts that break our approach of
tokenizing up front (so we might want to hold off on those), but we aren't
going to die from implementing enough to lex basic arithmetic expressions
independent of whitespace. We will be laughed at. ("You seriously couldn't
even be bothered to implement a real lexer?")

-- Sean Silva
On Tue, Jan 24, 2017 at 7:57 AM, George Rimar <grimar at accesssoftek.com>
wrote:
> [...]
>
> Maybe it is worth handling this at a higher level, during evaluation of
> expressions?

The lexical format of linker scripts requires a context-sensitive lexer.
Look at how gold does it. IIRC there are three cases, something like: one
for file-name-like things, one for numbers and such, and a last category
for numbers and such where a number can also include things like `10k` (I
think; I would need to look at the code to be sure). It's done in a very
elegant way in gold: a "can continue" callback is passed in that says
which characters can continue the token, and which token rule to use
depends on the grammar production (hence context-sensitive).

If you look at the other message I sent in this thread just now,
ScriptParserBase is essentially a lexer interface and can be pretty easily
converted to a more standard on-the-fly character-scanning implementation
of a lexer.
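(As a rough sketch of the "can continue" idea, not gold's or LLD's actual
implementation: the caller picks a continuation predicate based on the
grammar production it is in, and the lexer consumes characters while the
predicate holds.)

```python
def continues_number(c):
    # Digits plus hex digits and scale suffixes, so "0x1000" and "10k"
    # stay single tokens in an expression context.
    return c.isdigit() or c in "xXabcdefABCDEFkKmMhH"

def next_token(s, pos, can_continue):
    """Return (token, new_pos): skip whitespace, take one character, then
    keep taking characters while can_continue accepts them."""
    while pos < len(s) and s[pos].isspace():
        pos += 1
    if pos == len(s):
        return None, pos
    start = pos
    pos += 1
    while pos < len(s) and can_continue(s[pos]):
        pos += 1
    return s[start:pos], pos

def lex_expr(s):
    # A parser that knows it is inside an expression production asks for
    # number continuation after a digit, and treats any other character as
    # a one-character operator token.
    tokens, pos = [], 0
    while pos < len(s):
        if s[pos].isspace():
            pos += 1
            continue
        pred = continues_number if s[pos].isdigit() else (lambda c: False)
        tok, pos = next_token(s, pos, pred)
        tokens.append(tok)
    return tokens

print(lex_expr("10*5 + 0x1000"))  # ['10', '*', '5', '+', '0x1000']
```

In a real implementation the predicate would come from whichever grammar
production the parser is in; a file-name production, for instance, would
accept a much larger set of continuation characters.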
Once that is done, it should be straightforward to add a new method that
scans a different kind of token for certain parts of the parser.

-- Sean Silva
> The lexical format of linker scripts requires a context-sensitive lexer.
> [...]
> Once that is done, it should be straightforward to add a new method that
> scans a different kind of token for certain parts of the parser.
>
> -- Sean Silva

I think that approach should work and should not be hard to implement.
Though when I think about this feature from an end user's point of view, I
wonder how many users it would actually have. AFAIK we have found only one
script in the wild that suffers from the absence of whitespace in math
expressions; it looks like 99.9% of scripts are free of that issue. And
writing "5*6" instead of "5 * 6" is not nice style anyway. Adding more
code to LLD also means more code to support in the end.

I am not saying we should or should not do this; that is just my concern.
I would probably still try it (just in case, to extend flexibility),
though I can't say I see a real need for it at the moment, based on the
above.

George.