>Our tokenizer recognizes
>
>    [A-Za-z0-9_.$/\\~=+[]*?\-:!<>]+
>
>as a token. gold uses more complex rules to tokenize. I don't think we need
>rules that complex, but there seems to be room to improve our tokenizer. In
>particular, I believe we can parse the Linux kernel's linker script by
>changing the tokenizer rules as follows:
>
>    [A-Za-z_.$/\\~=+[]*?\-:!<>][A-Za-z0-9_.$/\\~=+[]*?\-:!<>]*
>
>or
>
>    [0-9]+

After more investigation, it seems this will not work so simply. Here are
examples where it breaks:

    . = 0x1000;  (gives tokens "0", "x1000")
    . = A*10;    (gives "A*10")
    . = 10k;     (gives "10", "k")
    . = 10*5;    (gives "10", "*5")

"[0-9]+" could be changed to "[0-9][kmhKMHx0-9]*", but for "10*5" that
still gives the tokens "10" and "*5". And I do not think we can add special
handling of operators, since it is hard to assume any context at the
tokenizing step: we do not know whether we are parsing a file name or a
math expression.

Maybe it is worth handling this at a higher level, during evaluation of
expressions?

George.
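(The breakage listed above can be reproduced with a short Python sketch of
the proposed rules. This is a hypothetical stand-alone tokenizer written
for illustration, not LLD's actual code.)

```python
import re

# The two token rules from the proposal: an identifier-like token that
# cannot start with a digit, and an all-digit number token.
IDENT = r'[A-Za-z_.$/\\~=+\[\]*?\-:!<>][A-Za-z0-9_.$/\\~=+\[\]*?\-:!<>]*'
NUMBER = r'[0-9]+'

def tokenize(s):
    """Scan left to right, skipping whitespace and characters (like ';')
    that belong to neither rule."""
    tokens, pos = [], 0
    while pos < len(s):
        m = re.match(NUMBER, s[pos:]) or re.match(IDENT, s[pos:])
        if m:
            tokens.append(m.group(0))
            pos += m.end()
        else:
            pos += 1
    return tokens

print(tokenize(". = 0x1000;"))  # ['.', '=', '0', 'x1000']
print(tokenize(". = 10*5;"))    # ['.', '=', '10', '*5']
```

Because a hex constant starts with a digit but continues with letters, and
because `*` sits in the identifier character class, exactly the splits
George lists fall out of these rules.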
Well, maybe we should just change the Linux kernel instead of tweaking our
tokenizer too hard.

On Tue, Jan 24, 2017 at 7:57 AM, George Rimar <grimar at accesssoftek.com>
wrote:
> [...]
>
> Maybe it is worth handling this at a higher level, during evaluation of
> expressions?
>
> George.
> Well, maybe we should just change the Linux kernel instead of tweaking our
> tokenizer too hard.

I agree. For now I am inclined to do that and watch for other scripts.

George.
On Tue, Jan 24, 2017 at 11:29 AM, Rui Ueyama <ruiu at google.com> wrote:
> Well, maybe we should just change the Linux kernel instead of tweaking
> our tokenizer too hard.

This is silly. Writing a simple and maintainable lexer is not hard (look,
e.g., at https://reviews.llvm.org/D10817). There are some complicated
context-sensitive cases in linker scripts that break our approach of
tokenizing up front (so we might want to hold off on those), but we aren't
going to die from implementing enough to lex basic arithmetic expressions
independent of whitespace. We will be laughed at. ("You seriously couldn't
even be bothered to implement a real lexer?")

-- Sean Silva
On Tue, Jan 24, 2017 at 7:57 AM, George Rimar <grimar at accesssoftek.com>
wrote:
> [...]
>
> Maybe it is worth handling this at a higher level, during evaluation of
> expressions?

The lexical format of linker scripts requires a context-sensitive lexer.
Look at how gold does it. IIRC there are three cases, something like: one
for file-name-like things, one for numbers and such, and a last category
for numbers and such where a number can also include things like `10k` (I
think; I would need to look at the code to be sure). It's done in a very
elegant way in gold: a "can continue" callback is passed in that says
which characters can continue the token, and which token rule to use
depends on the grammar production (hence context-sensitive).

If you look at the other message I sent in this thread just now,
ScriptParserBase is essentially a lexer interface and can be pretty easily
converted to a more standard on-the-fly character-scanning implementation
of a lexer.
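(As a rough sketch of the "can continue" idea, not gold's or LLD's actual
implementation: the caller picks a continuation predicate based on the
grammar production it is in, and the lexer consumes characters while the
predicate holds.)

```python
def continues_number(c):
    # Digits plus hex digits and scale suffixes, so "0x1000" and "10k"
    # stay single tokens in an expression context.
    return c.isdigit() or c in "xXabcdefABCDEFkKmMhH"

def next_token(s, pos, can_continue):
    """Return (token, new_pos): skip whitespace, take one character, then
    keep taking characters while can_continue accepts them."""
    while pos < len(s) and s[pos].isspace():
        pos += 1
    if pos == len(s):
        return None, pos
    start = pos
    pos += 1
    while pos < len(s) and can_continue(s[pos]):
        pos += 1
    return s[start:pos], pos

def lex_expr(s):
    # A parser that knows it is inside an expression production asks for
    # number continuation after a digit, and treats any other character as
    # a one-character operator token.
    tokens, pos = [], 0
    while pos < len(s):
        if s[pos].isspace():
            pos += 1
            continue
        pred = continues_number if s[pos].isdigit() else (lambda c: False)
        tok, pos = next_token(s, pos, pred)
        tokens.append(tok)
    return tokens

print(lex_expr("10*5 + 0x1000"))  # ['10', '*', '5', '+', '0x1000']
```

In a real implementation the predicate would come from whichever grammar
production the parser is in; a file-name production, for instance, would
accept a much larger set of continuation characters.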
Once that is done, it should be straightforward to add a new method that
scans a different kind of token for certain parts of the parser.

-- Sean Silva
> The lexical format of linker scripts requires a context-sensitive lexer.
> [...]
> Once that is done, it should be straightforward to add a new method that
> scans a different kind of token for certain parts of the parser.
>
> -- Sean Silva

I think that approach should work and should not be hard to implement.
Though when I think about this feature from an end user's point of view, I
wonder how many users it would actually have. AFAIK we have found only one
script in the wild that suffers from the absence of whitespace in math
expressions; it looks like 99.9% of scripts are free of that issue. And
writing "5*6" instead of "5 * 6" is not nice style anyway. Adding more
code to LLD also means more code to support in the end.

I am not saying we should or should not do this; that is just my concern.
I would probably still try it (just in case, to extend flexibility),
though I can't say I see a real need for it at the moment, based on the
above.

George.