thr3ads.net - R devel - [Rd] Mapping parse tree elements to tokens [Jul 2015]

If this information is useful, please help other people find it:
Share via:

Michael Lawrence

2015-Jul-29 18:30 UTC

[Rd] Mapping parse tree elements to tokens

Probably need a generic tree based on "ParseNode" objects that
associate the line information with the symbol (for leaf nodes). As
Duncan notes, it should be possible to gather that from the table.

But it would be nice if there was an "expr" column in the parse data
column in addition to "text". It would contain the parsed object.
Otherwise, to use the table, one is often reparsing the text, which
just seems redundant and inconvenient.

Michael

On Wed, Jul 29, 2015 at 9:43 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:> On 29/07/2015 12:13 PM, Jim Hester wrote:
>>
>> I would like to map the parsed tokens obtained from
utils::getParseData()
>> to the parse tree and elements obtained by base::parse().
>>
>> It looks like back when this code was in the parser package the parse()
>> function annotated the elements in the tree with their id, which would
>> allow you to perform this mapping.  However when the code was included
in
>> R
>> this functionality was removed.
>
>
> Yes, not all elements of the parse tree can legally have attributes
> attached.
>>
>>
>> ?getParseData states
>>    The ?id? values are not attached to the elements of the parse
>>            tree, they are only retained in the table returned by
>>            ?getParseData?.
>>
>> Is there another way you can map between the getParseData() tokens and
>> elements of the parse tree that makes this additional annotation
>> unnecessary?  Or is this simply not possible?
>
>
> I think you can't get to it, though you can get close by looking at the
id &
> parent values in the table.  For example,
>
>  code <- "x + (y + 1)"
>  p <- parse(text=code)
>
> getParseData(p)
>    line1 col1 line2 col2 id parent     token terminal text
> 15     1    1     1   11 15      0      expr    FALSE
> 1      1    1     1    1  1      3    SYMBOL     TRUE    x
> 3      1    1     1    1  3     15      expr    FALSE
> 2      1    3     1    3  2     15       '+'     TRUE    +
> 13     1    5     1   11 13     15      expr    FALSE
> 4      1    5     1    5  4     13       '('     TRUE    (
> 11     1    6     1   10 11     13      expr    FALSE
> 5      1    6     1    6  5      7    SYMBOL     TRUE    y
> 7      1    6     1    6  7     11      expr    FALSE
> 6      1    8     1    8  6     11       '+'     TRUE    +
> 8      1   10     1   10  8      9 NUM_CONST     TRUE    1
> 9      1   10     1   10  9     11      expr    FALSE
> 10     1   11     1   11 10     13       ')'     TRUE    )
>
>
> Now p is an expression, with the parse tree in p[[1]].  From the table, we
> can see that the root node has id 15, and 3 nodes have that as a parent.
> Those would be p[[c(1,1)]], p[[c(1,2)]], p[[c(1,3)]].  The tricky part is
> the re-ordering:  those correspond to `+`, x, and (y+1) respectively, not
> the order they appear in the original source or in the table.  Generally
the
> function call appears first in the parse tree, but I'm not sure you
could
> always recognize which is the function call by looking at the table.
>
> Duncan Murdoch
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Duncan Murdoch

2015-Jul-29 18:47 UTC

head link

[Rd] Mapping parse tree elements to tokens

On 29/07/2015 2:30 PM, Michael Lawrence wrote:> Probably need a generic tree based on "ParseNode" objects that
> associate the line information with the symbol (for leaf nodes). As
> Duncan notes, it should be possible to gather that from the table.
>
> But it would be nice if there was an "expr" column in the parse
data
> column in addition to "text". It would contain the parsed object.
> Otherwise, to use the table, one is often reparsing the text, which
> just seems redundant and inconvenient.
Can you (both Jim and Michael) describe the uses you might have for 
this?  There are lots of possible changes that could make this 
information available:

  - attach to each item in the parse tree, as the parser package did.  
(Bad idea for general use which is why I dropped it, but
it could be done as a special option to parse, if you aren't planning to 
evaluate the expression.)
  - give the index into the parse tree of each item, i.e. c(1,1), 
c(1,2), c(1,3) in the example below, or just 1,2,3 along with a function 
to reconstruct the full path.
  - give a copy of the branch of the parse tree, as Michael suggests.

etc.  Which is best for your purposes?

Duncan Murdoch
>
> Michael
>
> On Wed, Jul 29, 2015 at 9:43 AM, Duncan Murdoch
> <murdoch.duncan at gmail.com> wrote:
> > On 29/07/2015 12:13 PM, Jim Hester wrote:
> >>
> >> I would like to map the parsed tokens obtained from
utils::getParseData()
> >> to the parse tree and elements obtained by base::parse().
> >>
> >> It looks like back when this code was in the parser package the
parse()
> >> function annotated the elements in the tree with their id, which
would
> >> allow you to perform this mapping.  However when the code was
included in
> >> R
> >> this functionality was removed.
> >
> >
> > Yes, not all elements of the parse tree can legally have attributes
> > attached.
> >>
> >>
> >> ?getParseData states
> >>    The ?id? values are not attached to the elements of the parse
> >>            tree, they are only retained in the table returned by
> >>            ?getParseData?.
> >>
> >> Is there another way you can map between the getParseData() tokens
and
> >> elements of the parse tree that makes this additional annotation
> >> unnecessary?  Or is this simply not possible?
> >
> >
> > I think you can't get to it, though you can get close by looking
at the id &
> > parent values in the table.  For example,
> >
> >  code <- "x + (y + 1)"
> >  p <- parse(text=code)
> >
> > getParseData(p)
> >    line1 col1 line2 col2 id parent     token terminal text
> > 15     1    1     1   11 15      0      expr    FALSE
> > 1      1    1     1    1  1      3    SYMBOL     TRUE    x
> > 3      1    1     1    1  3     15      expr    FALSE
> > 2      1    3     1    3  2     15       '+'     TRUE    +
> > 13     1    5     1   11 13     15      expr    FALSE
> > 4      1    5     1    5  4     13       '('     TRUE    (
> > 11     1    6     1   10 11     13      expr    FALSE
> > 5      1    6     1    6  5      7    SYMBOL     TRUE    y
> > 7      1    6     1    6  7     11      expr    FALSE
> > 6      1    8     1    8  6     11       '+'     TRUE    +
> > 8      1   10     1   10  8      9 NUM_CONST     TRUE    1
> > 9      1   10     1   10  9     11      expr    FALSE
> > 10     1   11     1   11 10     13       ')'     TRUE    )
> >
> >
> > Now p is an expression, with the parse tree in p[[1]].  From the
table, we
> > can see that the root node has id 15, and 3 nodes have that as a
parent.
> > Those would be p[[c(1,1)]], p[[c(1,2)]], p[[c(1,3)]].  The tricky part
is
> > the re-ordering:  those correspond to `+`, x, and (y+1) respectively,
not
> > the order they appear in the original source or in the table. 
Generally the
> > function call appears first in the parse tree, but I'm not sure
you could
> > always recognize which is the function call by looking at the table.
> >
> > Duncan Murdoch
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel

Michael Lawrence

2015-Jul-29 19:06 UTC

head link

[Rd] Mapping parse tree elements to tokens

I have two use cases in mind:

1) Code indexing/searching, where the table gets me almost all of the
way there, except I ask for all of the text (including the calls) and
then parse that, because it's nice to get back an actual code object
when you are searching code (in addition to where the code lives). The
extra parsing step is just a minor inconvenience.

2) Code analysis, which I'm pretty sure is also Jim's use case, where
the analysis is implemented most easily as a parse tree traversal,
while you also want to point back to the original source location.
Here's where one would want a reference from parse node to location.

So neither of those involves code evaluation at first glance, though I
guess one could use some sort of evaluation during analysis.

On Wed, Jul 29, 2015 at 11:47 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:> On 29/07/2015 2:30 PM, Michael Lawrence wrote:
>>
>> Probably need a generic tree based on "ParseNode" objects
that
>> associate the line information with the symbol (for leaf nodes). As
>> Duncan notes, it should be possible to gather that from the table.
>>
>> But it would be nice if there was an "expr" column in the
parse data
>> column in addition to "text". It would contain the parsed
object.
>> Otherwise, to use the table, one is often reparsing the text, which
>> just seems redundant and inconvenient.
>
>
> Can you (both Jim and Michael) describe the uses you might have for this?
> There are lots of possible changes that could make this information
> available:
>
>  - attach to each item in the parse tree, as the parser package did.  (Bad
> idea for general use which is why I dropped it, but
> it could be done as a special option to parse, if you aren't planning
to
> evaluate the expression.)
>  - give the index into the parse tree of each item, i.e. c(1,1), c(1,2),
> c(1,3) in the example below, or just 1,2,3 along with a function to
> reconstruct the full path.
>  - give a copy of the branch of the parse tree, as Michael suggests.
>
> etc.  Which is best for your purposes?
>
> Duncan Murdoch
>
>>
>> Michael
>>
>> On Wed, Jul 29, 2015 at 9:43 AM, Duncan Murdoch
>> <murdoch.duncan at gmail.com> wrote:
>> > On 29/07/2015 12:13 PM, Jim Hester wrote:
>> >>
>> >> I would like to map the parsed tokens obtained from
>> >> utils::getParseData()
>> >> to the parse tree and elements obtained by base::parse().
>> >>
>> >> It looks like back when this code was in the parser package
the parse()
>> >> function annotated the elements in the tree with their id,
which would
>> >> allow you to perform this mapping.  However when the code was
included
>> >> in
>> >> R
>> >> this functionality was removed.
>> >
>> >
>> > Yes, not all elements of the parse tree can legally have
attributes
>> > attached.
>> >>
>> >>
>> >> ?getParseData states
>> >>    The ?id? values are not attached to the elements of the
parse
>> >>            tree, they are only retained in the table returned
by
>> >>            ?getParseData?.
>> >>
>> >> Is there another way you can map between the getParseData()
tokens and
>> >> elements of the parse tree that makes this additional
annotation
>> >> unnecessary?  Or is this simply not possible?
>> >
>> >
>> > I think you can't get to it, though you can get close by
looking at the
>> > id &
>> > parent values in the table.  For example,
>> >
>> >  code <- "x + (y + 1)"
>> >  p <- parse(text=code)
>> >
>> > getParseData(p)
>> >    line1 col1 line2 col2 id parent     token terminal text
>> > 15     1    1     1   11 15      0      expr    FALSE
>> > 1      1    1     1    1  1      3    SYMBOL     TRUE    x
>> > 3      1    1     1    1  3     15      expr    FALSE
>> > 2      1    3     1    3  2     15       '+'     TRUE    +
>> > 13     1    5     1   11 13     15      expr    FALSE
>> > 4      1    5     1    5  4     13       '('     TRUE    (
>> > 11     1    6     1   10 11     13      expr    FALSE
>> > 5      1    6     1    6  5      7    SYMBOL     TRUE    y
>> > 7      1    6     1    6  7     11      expr    FALSE
>> > 6      1    8     1    8  6     11       '+'     TRUE    +
>> > 8      1   10     1   10  8      9 NUM_CONST     TRUE    1
>> > 9      1   10     1   10  9     11      expr    FALSE
>> > 10     1   11     1   11 10     13       ')'     TRUE    )
>> >
>> >
>> > Now p is an expression, with the parse tree in p[[1]].  From the
table,
>> > we
>> > can see that the root node has id 15, and 3 nodes have that as a
parent.
>> > Those would be p[[c(1,1)]], p[[c(1,2)]], p[[c(1,3)]].  The tricky
part
>> > is
>> > the re-ordering:  those correspond to `+`, x, and (y+1)
respectively,
>> > not
>> > the order they appear in the original source or in the table. 
Generally
>> > the
>> > function call appears first in the parse tree, but I'm not
sure you
>> > could
>> > always recognize which is the function call by looking at the
table.
>> >
>> > Duncan Murdoch
>> >
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

Jim Hester

2015-Jul-29 21:33 UTC

head link

[Rd] Mapping parse tree elements to tokens

As Michael guessed my main use cases was code analysis.  A concrete example
where this would help is with my test code coverage tool covr.  There is
currently a bug when tracking coverage for if / else statements when the
clauses do not contain brackets (https://github.com/jimhester/covr/issues/39).
Because only one source reference is generated in this case (because it is
parsed as a single expression), it is not possible to track each of the
clauses separately.  While I can get the source reference for the entire
statement, in order to extract the if/else clauses I need to either use the
tokenized information from getParseData(), or re-parse the entire if / else
expression by hand (which seems prone to error to me).

Another example of where this would help is linking comments to
expressions.  While I know this topic has been discussed previously (
https://stat.ethz.ch/pipermail/r-devel/2009-March/052731.html) and I am
fine with the default parser dropping comments, having the ability to map
the more detailed tokens back to the parse tree would allow the comments to
be annotated to their closest expression.

Of the three options you propose I think simply supplying the index as an
additional column from the getParseData() output would be the most
straightforward to implement and use.

While it is true that you can get most of the way there with the current
source references as Michael mentions in some cases having more fine
grained location information is useful and there is no great way to get
there currently without re-parsing the full expressions from the source
reference.

The current getParseData output is already very implementation specific so
I don't think it would be a great additional support burden to add the
indexing information.  Likely the whole function would have to be removed
if a different parsing method was used.

Regardless I am glad others have shown some interest in this issue, thank
you for taking the time to read and respond!

Jim

On Wed, Jul 29, 2015 at 2:47 PM, Duncan Murdoch <murdoch.duncan at
gmail.com>
wrote:
> On 29/07/2015 2:30 PM, Michael Lawrence wrote:
>
>> Probably need a generic tree based on "ParseNode" objects
that
>> associate the line information with the symbol (for leaf nodes). As
>> Duncan notes, it should be possible to gather that from the table.
>>
>> But it would be nice if there was an "expr" column in the
parse data
>> column in addition to "text". It would contain the parsed
object.
>> Otherwise, to use the table, one is often reparsing the text, which
>> just seems redundant and inconvenient.
>>
>
> Can you (both Jim and Michael) describe the uses you might have for this?
> There are lots of possible changes that could make this information
> available:
>
>  - attach to each item in the parse tree, as the parser package did.  (Bad
> idea for general use which is why I dropped it, but
> it could be done as a special option to parse, if you aren't planning
to
> evaluate the expression.)
>  - give the index into the parse tree of each item, i.e. c(1,1), c(1,2),
> c(1,3) in the example below, or just 1,2,3 along with a function to
> reconstruct the full path.
>  - give a copy of the branch of the parse tree, as Michael suggests.
>
> etc.  Which is best for your purposes?
>
> Duncan Murdoch
>
>
>> Michael
>>
>> On Wed, Jul 29, 2015 at 9:43 AM, Duncan Murdoch
>> <murdoch.duncan at gmail.com> wrote:
>> > On 29/07/2015 12:13 PM, Jim Hester wrote:
>> >>
>> >> I would like to map the parsed tokens obtained from
>> utils::getParseData()
>> >> to the parse tree and elements obtained by base::parse().
>> >>
>> >> It looks like back when this code was in the parser package
the parse()
>> >> function annotated the elements in the tree with their id,
which would
>> >> allow you to perform this mapping.  However when the code was
included
>> in
>> >> R
>> >> this functionality was removed.
>> >
>> >
>> > Yes, not all elements of the parse tree can legally have
attributes
>> > attached.
>> >>
>> >>
>> >> ?getParseData states
>> >>    The ?id? values are not attached to the elements of the
parse
>> >>            tree, they are only retained in the table returned
by
>> >>            ?getParseData?.
>> >>
>> >> Is there another way you can map between the getParseData()
tokens and
>> >> elements of the parse tree that makes this additional
annotation
>> >> unnecessary?  Or is this simply not possible?
>> >
>> >
>> > I think you can't get to it, though you can get close by
looking at the
>> id &
>> > parent values in the table.  For example,
>> >
>> >  code <- "x + (y + 1)"
>> >  p <- parse(text=code)
>> >
>> > getParseData(p)
>> >    line1 col1 line2 col2 id parent     token terminal text
>> > 15     1    1     1   11 15      0      expr    FALSE
>> > 1      1    1     1    1  1      3    SYMBOL     TRUE    x
>> > 3      1    1     1    1  3     15      expr    FALSE
>> > 2      1    3     1    3  2     15       '+'     TRUE    +
>> > 13     1    5     1   11 13     15      expr    FALSE
>> > 4      1    5     1    5  4     13       '('     TRUE    (
>> > 11     1    6     1   10 11     13      expr    FALSE
>> > 5      1    6     1    6  5      7    SYMBOL     TRUE    y
>> > 7      1    6     1    6  7     11      expr    FALSE
>> > 6      1    8     1    8  6     11       '+'     TRUE    +
>> > 8      1   10     1   10  8      9 NUM_CONST     TRUE    1
>> > 9      1   10     1   10  9     11      expr    FALSE
>> > 10     1   11     1   11 10     13       ')'     TRUE    )
>> >
>> >
>> > Now p is an expression, with the parse tree in p[[1]].  From the
table,
>> we
>> > can see that the root node has id 15, and 3 nodes have that as a
parent.
>> > Those would be p[[c(1,1)]], p[[c(1,2)]], p[[c(1,3)]].  The tricky
part
>> is
>> > the re-ordering:  those correspond to `+`, x, and (y+1)
respectively,
>> not
>> > the order they appear in the original source or in the table.
>> Generally the
>> > function call appears first in the parse tree, but I'm not
sure you
>> could
>> > always recognize which is the function call by looking at the
table.
>> >
>> > Duncan Murdoch
>> >
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
	[[alternative HTML version deleted]]

Apparently Analagous Threads

Search for more maybe matching threads

R devel - Jul 2015 - Mapping parse tree elements to tokens

[Rd] Mapping parse tree elements to tokens

[Rd] Mapping parse tree elements to tokens

[Rd] Mapping parse tree elements to tokens

[Rd] Mapping parse tree elements to tokens

Apparently Analagous Threads