thr3ads.net - llvm dev - [LLVMdev] Increase the flexibility of the AsmLexer in parsing identifiers. [Nov 2014]


Marcello Maggioni

2014-Nov-12 17:23 UTC

[LLVMdev] Increase the flexibility of the AsmLexer in parsing identifiers.

Hello,

I would like to gather some ideas and opinions on how to make the default
AsmLexer more flexible when dealing with Identifiers.

When the lexer emits something as an "Identifier" (read: a string of
characters), that text is delivered as one token and has to be parsed in a
single go, even if it contains elements that a target might want to parse
as separate entities.
In that case the target has to implement custom logic that re-lexes and
parses the identifier string in place to emit the operands into the
operand vector, which is not ideal.

At the moment the default AsmLexer lexes tokens like this:

A number of symbols are lexed directly into tokens (such as #, %, etc.),
then there are integer/float literals, and finally a fairly large category
that catches everything not matched by the previous cases, which is handled
by the LexIdentifier() function.

In fact, in the current default AsmLexer this function doesn't always emit
an Identifier token; it may return Float literals or Dot tokens in some
special cases, so it works more like a "handle what I couldn't directly
recognize" kind of function.

On multiple occasions I have wanted to be able to change what this function
treats as an Identifier and what it splits into separate tokens.

A use case would be this.

Let's say that my target's assembly syntax has the unusual characteristic
that operands are separated by '$' (dollar), as in:

add r0$5$r3

The default AsmLexer would lex the entire r0$5$r3 as a single "Identifier".
It is not possible to lex each operand separately; instead, some custom
lexing logic must be applied to the returned "Identifier" token to split
and recognize each of the operands.

This is a contrived example, but there are other cases where something
similar happens and it can be a hassle to deal with, because what counts as
an Identifier depends entirely on some arbitrary logic in the lexer.
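
To make the use case concrete, here is a minimal sketch of the problem and
of the workaround it forces. It is a toy model, not the real AsmLexer: the
names isIdentifierChar, lexToy and splitOperands are hypothetical, and the
identifier rule is only assumed to resemble the default one in that it
accepts '$' inside an identifier.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified identifier rule: letters, digits, '_', '.'
// and '$' all continue an identifier.
bool isIdentifierChar(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) || c == '_' ||
         c == '.' || c == '$';
}

// Toy lexer: skips whitespace, lumps each run of identifier characters
// into one token, and emits any other character as a one-character token.
std::vector<std::string> lexToy(const std::string &src) {
  std::vector<std::string> toks;
  std::size_t i = 0;
  while (i < src.size()) {
    if (std::isspace(static_cast<unsigned char>(src[i]))) {
      ++i;
      continue;
    }
    std::size_t start = i;
    if (isIdentifierChar(src[i])) {
      while (i < src.size() && isIdentifierChar(src[i]))
        ++i;
    } else {
      ++i;
    }
    toks.push_back(src.substr(start, i - start));
  }
  return toks;
}

// The workaround the target parser ends up writing: re-split the single
// identifier token on '$' to recover the individual operands.
std::vector<std::string> splitOperands(const std::string &ident) {
  std::vector<std::string> parts;
  std::size_t start = 0;
  while (true) {
    std::size_t pos = ident.find('$', start);
    if (pos == std::string::npos) {
      parts.push_back(ident.substr(start));
      break;
    }
    parts.push_back(ident.substr(start, pos - start));
    start = pos + 1;
  }
  return parts;
}
```

With this rule, lexToy("add r0$5$r3") yields two tokens, "add" and
"r0$5$r3", and only the extra splitOperands pass recovers "r0", "5" and
"r3".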

To override this logic, the entire default lexer and parser need to be
overridden (probably copying most of the existing logic for the rest of the
parsing anyway).

I would like to find an easier way to specify what is returned as an
identifier and what is returned as separate tokens, allowing for more
flexibility.

I developed a tentative patch that adds this flexibility to the current
MCAsmLexer infrastructure.
I would like to gather opinions on this approach or ideas on other possible
approaches to achieve something similar and find out if somebody else finds
this kind of concept useful or not.

Thanks,
Marcello
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configurable_asmlexer.patch
Type: application/octet-stream
Size: 5351 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20141112/d53b4ebb/attachment.obj>

Reid Kleckner

2014-Nov-12 18:45 UTC


I think allowing MCAsmParserExtensions to control this behavior by
overriding methods would be cleaner than adding more setters. I'm imagining
that each target is allowed to supply its own table or virtual method to
implement 'IsIdentifierChar' in AsmLexer.cpp. This would handle
AllowAtInIdentifier and your use case.
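
The idea of a per-target identifier rule could look roughly like the
sketch below. This is only an illustration under assumptions: ToyLexer and
DollarSeparatedLexer are hypothetical classes, not part of LLVM, and the
default rule shown is a simplified guess at what an identifier character
is.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy lexer (not the real AsmLexer): the identifier rule is
// a virtual hook so that a target can replace it.
class ToyLexer {
public:
  virtual ~ToyLexer() = default;

  // Skips whitespace, groups runs of identifier characters into one
  // token, and emits every other character as a one-character token.
  std::vector<std::string> lex(const std::string &src) const {
    std::vector<std::string> toks;
    std::size_t i = 0;
    while (i < src.size()) {
      if (std::isspace(static_cast<unsigned char>(src[i]))) {
        ++i;
        continue;
      }
      std::size_t start = i;
      if (isIdentifierChar(src[i])) {
        while (i < src.size() && isIdentifierChar(src[i]))
          ++i;
      } else {
        ++i;
      }
      toks.push_back(src.substr(start, i - start));
    }
    return toks;
  }

protected:
  // Assumed default rule: '$' is accepted inside identifiers.
  virtual bool isIdentifierChar(char c) const {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_' ||
           c == '.' || c == '$';
  }
};

// Hypothetical target lexer for the '$'-separated syntax: '$' no longer
// continues an identifier, so it comes out as its own token.
class DollarSeparatedLexer : public ToyLexer {
protected:
  bool isIdentifierChar(char c) const override {
    return c != '$' && ToyLexer::isIdentifierChar(c);
  }
};
```

With the override in place, "add r0$5$r3" lexes as add, r0, $, 5, $, r3
instead of add, r0$5$r3, and the rest of the lexing loop is shared rather
than copied.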


Marcello Maggioni

2014-Nov-12 19:21 UTC


Hello Reid,

I'm not sure I completely understand your proposal.
Are you proposing to add overridable virtual methods to
MCAsmParserExtensions, to be used by AsmLexer to decide which characters
are part of an identifier?

If that is the case, I'm not sure how MCAsmLexer or AsmLexer could make use
of them, because while MCAsmParserExtensions sees MCAsmLexer, the latter
doesn't know anything about the former.

Or ... maybe I completely misinterpreted what you were saying? :-D

Marcello
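
The dependency direction raised in this reply can be illustrated with a
small sketch. It shows one possible way around it, not something proposed
in the thread: since the extension can see the lexer, it can push a
predicate into it, and the lexer never needs to know about the extension.
PredicateLexer, ToyParserExtension and setIdentifierCharPredicate are
hypothetical names, not LLVM API.

```cpp
#include <cctype>
#include <functional>
#include <utility>

// Hypothetical lexer: it knows nothing about parser extensions and only
// stores a predicate that decides what continues an identifier.
class PredicateLexer {
public:
  using CharPredicate = std::function<bool(char)>;

  void setIdentifierCharPredicate(CharPredicate p) {
    isIdentChar_ = std::move(p);
  }

  bool isIdentifierChar(char c) const { return isIdentChar_(c); }

private:
  // Assumed default rule: '$' is accepted inside identifiers.
  CharPredicate isIdentChar_ = [](char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_' ||
           c == '.' || c == '$';
  };
};

// Hypothetical extension: it can see the lexer, so it installs the
// target-specific rule when it is initialized.
class ToyParserExtension {
public:
  void initialize(PredicateLexer &lexer) {
    lexer.setIdentifierCharPredicate([](char c) {
      // Same as the default rule, except '$' now separates tokens.
      return std::isalnum(static_cast<unsigned char>(c)) || c == '_' ||
             c == '.';
    });
  }
};
```

This is effectively a setter, which Reid's reply argues is less clean than
overriding a method; the sketch is only meant to show that the one-way
visibility between the two classes does not by itself rule out a
target-controlled rule.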

