Sean Silva via llvm-dev
2017-Jan-26
[llvm-dev] Linking Linux kernel with LLD

On Tue, Jan 24, 2017 at 11:29 AM, Rui Ueyama <ruiu at google.com> wrote:

> Well, maybe, we should just change the Linux kernel instead of tweaking
> our tokenizer too hard.

This is silly. Writing a simple and maintainable lexer is not hard (see,
e.g., https://reviews.llvm.org/D10817). There are some complicated
context-sensitive cases in linker scripts that break our approach of
tokenizing up front (so we might want to hold off on those), but we aren't
going to die from implementing enough to lex basic arithmetic expressions
independent of whitespace.

We will be laughed at. ("You seriously couldn't even be bothered to
implement a real lexer?")

-- Sean Silva

> On Tue, Jan 24, 2017 at 7:57 AM, George Rimar <grimar at accesssoftek.com>
> wrote:
>
>> > Our tokenizer recognizes
>> >
>> >   [A-Za-z0-9_.$/\\~=+[]*?\-:!<>]+
>> >
>> > as a token. gold uses more complex rules to tokenize. I don't think we
>> > need rules that complex, but there seems to be room to improve our
>> > tokenizer. In particular, I believe we can parse the Linux kernel's
>> > linker script by changing the tokenizer rules as follows.
>> >
>> >   [A-Za-z_.$/\\~=+[]*?\-:!<>][A-Za-z0-9_.$/\\~=+[]*?\-:!<>]*
>> >
>> > or
>> >
>> >   [0-9]+
>>
>> After more investigation, it seems this will not work so simply.
>> Here are examples where it breaks:
>>   . = 0x1000;  (gives tokens "0", "x1000")
>>   . = A*10;    (gives "A*10")
>>   . = 10k;     (gives "10", "k")
>>   . = 10*5;    (gives "10", "*5")
>>
>> "[0-9]+" could be "[0-9][kmhKMHx0-9]*",
>> but for "10*5" that still gives the tokens "10" and "*5".
>> And I do not think we can add special handling of operators,
>> as it is hard to assume any context at the tokenizing step:
>> we do not know whether we are parsing a file name or a math expression.
>>
>> Maybe it is worth trying to handle this at a higher level, during
>> evaluation of expressions?
>>
>> George.
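To make George's examples concrete, here is a throwaway sketch (hypothetical
code, not anything in LLD) of the two proposed split rules; running it
reproduces the "0" / "x1000" breakage for ". = 0x1000;":

    #include <cctype>
    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <vector>

    // Continuation characters of the proposed identifier-like rule.
    static bool isTokenChar(char C) {
      return C && (isalnum((unsigned char)C) ||
                   strchr("_.$/\\~=+[]*?-:!<>", C));
    }

    static std::vector<std::string> tokenize(const char *S) {
      std::vector<std::string> Toks;
      while (*S) {
        if (isspace((unsigned char)*S)) { ++S; continue; }
        std::string Tok;
        if (isdigit((unsigned char)*S)) {
          while (isdigit((unsigned char)*S)) // the proposed [0-9]+ rule
            Tok += *S++;
        } else if (isTokenChar(*S)) {
          do                                 // first char is a non-digit,
            Tok += *S++;                     // the rest may contain digits
          while (isTokenChar(*S));
        } else {
          Tok += *S++;                       // punctuation such as ';'
        }
        Toks.push_back(Tok);
      }
      return Toks;
    }

    int main() {
      for (std::string T : tokenize(". = 0x1000;"))
        printf("\"%s\" ", T.c_str());  // prints "." "=" "0" "x1000" ";"
      printf("\n");
    }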
Sean Silva via llvm-dev
2017-Jan-27
[llvm-dev] Linking Linux kernel with LLD

On Thu, Jan 26, 2017 at 7:56 PM, Sean Silva <chisophugis at gmail.com> wrote:

> This is silly. Writing a simple and maintainable lexer is not hard (see,
> e.g., https://reviews.llvm.org/D10817). There are some complicated
> context-sensitive cases in linker scripts that break our approach of
> tokenizing up front (so we might want to hold off on those), but we aren't
> going to die from implementing enough to lex basic arithmetic expressions
> independent of whitespace.

Hmm..., the crux of not being able to lex arithmetic expressions seems to
be the lack of context sensitivity. E.g., consider `foo*bar`: it could be
a multiplication, or it could be a glob pattern.

Looking at the code more closely, adding context sensitivity wouldn't be
that hard. In fact, our ScriptParserBase class is actually a lexer (look
at the interface; it is a lexer's interface). It shouldn't be hard to
change from up-front tokenization to the more usual lexer approach of
scanning the text for each call that wants the next token. Roughly
speaking, just take the body of the for loop inside
ScriptParserBase::tokenize, move it into a helper that lexes one token on
the fly, and call that helper from consume/next/etc. Instead of an index
into a token vector, just keep a `const char *` pointer that we advance.

Once that is done, we can easily add a `nextArithmeticToken` or something
like that which lexes with different rules.

Implementing a linker is much harder than implementing a lexer. If we
give our users the impression that implementing a compatible lexer is
hard for us, what impression will we give them about the linker's
implementation quality? If we can afford 100 lines of self-contained code
to implement a concurrent hash table, we can afford 100 self-contained
lines to implement a context-sensitive lexer. This is end-user-visible
functionality; we should be careful about skimping on it in the name of
simplicity.

-- Sean Silva
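A rough sketch of that refactoring (the class name and token rules here
are illustrative, not LLD's actual interface): keep one cursor into the
text instead of a pre-built token vector, and let each call site pick the
lexing rules.

    #include <cctype>
    #include <cstring>
    #include <string>

    class ScriptLexerSketch {
      const char *Pos; // cursor into the script text, advanced as we lex

    public:
      explicit ScriptLexerSketch(const char *Input) : Pos(Input) {}

      // Default rules: a glob-ish token such as "foo*bar" stays one token.
      std::string next() { return lex("_.$/\\~=+[]*?-:!<>"); }

      // Arithmetic rules: operators are single-character tokens, so
      // "10*5" lexes as "10", "*", "5" regardless of whitespace.
      std::string nextArithmeticToken() {
        skipSpace();
        if (*Pos && strchr("*/+-()", *Pos))
          return std::string(1, *Pos++);
        return lex("_.$"); // identifiers and numbers only
      }

    private:
      void skipSpace() {
        while (isspace((unsigned char)*Pos))
          ++Pos;
      }

      // Lex one token: alphanumerics plus the given extra characters.
      std::string lex(const char *Extra) {
        skipSpace();
        std::string Tok;
        while (*Pos && (isalnum((unsigned char)*Pos) || strchr(Extra, *Pos)))
          Tok += *Pos++;
        if (Tok.empty() && *Pos)
          Tok += *Pos++; // unknown punctuation becomes its own token
        return Tok;
      }
    };

The point is the shape: one cursor, two entry points, and the choice
between them made per call by the parser rather than baked into an
up-front tokenization pass.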
Rafael Avila de Espindola via llvm-dev
2017-Jan-27 19:17 UTC
[llvm-dev] Linking Linux kernel with LLD
> Once that is done, we can easily add a `nextArithmeticToken` or something
> like that which lexes with different rules.

I like that idea. I first thought of always making '*' a token, but then
whitespace has to be a token too, which is an incredible pain. I then
thought of having a "setLexMode" method, but the lex mode can always be
implicit from where we are in the parser: the parser always knows whether
it should call next or nextArithmetic.

And I agree we should probably implement this. Even if it is not common,
it looks pretty silly to not be able to handle 2*5.

Cheers,
Rafael
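A sketch of that implicit-mode idea, building on the hypothetical
ScriptLexerSketch above (again, made-up names, not LLD's parser): each
parsing routine already knows which rules apply at its position in the
grammar, so no mode setter is needed.

    #include <string>

    class ScriptParserSketch : public ScriptLexerSketch {
    public:
      using ScriptLexerSketch::ScriptLexerSketch;

      // Inside an expression such as "2*5" we always want operator
      // tokens, so this path only ever calls nextArithmeticToken().
      long readExpr() {
        long V = std::stol(nextArithmeticToken());
        while (true) {
          std::string Op = nextArithmeticToken();
          if (Op != "*")
            return V;
          V *= std::stol(nextArithmeticToken());
        }
      }

      // Inside an input section description, "foo*bar" is a glob
      // pattern, so this path calls the default next() and keeps it
      // as one token.
      std::string readInputSectionPattern() { return next(); }
    };

With that, ScriptParserSketch("2*5").readExpr() yields 10, while the same
characters seen through readInputSectionPattern() come back as the single
glob token "2*5".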