tfpt review "/shelveset:parseroptfinal;REDMOND\tomat" Improves Ruby tokenizer and parser performance by 30% total (previously 3 MBps, now 4.2 MBps - measured on real code of libraries we have checked in and including Ruby AST building, but not transformation to DLR ASTs). Adds /bm option to tokenizer test driver that measures parsing speed (IronRuby.Tests.exe /tokenizer /bm). Adds more tokenization unit tests and fixes several minor bugs in tokenizer. AST, Parser.cs, parser.y: - Replaces List<Expression> by a new class Statements where the list represents statements. This encapsulation us to change underlying representation w/o affecting code using the statements. Also refactors the rules for statement list so that we can use an empty statements singleton. - Replaces List<Expression> by Expression[] where it represents arguments. Arguments are the most frequent list of expression so it make sense to optimize it. Previously any list of arguments was constructed starting from an empty List<Expression> and adding arguments as parser reduced them. This is quite inefficient since most method calls have very few arguments (0 to 3). The parser now has a stack of Expression to which it pushes arguments as they are reduced. Each rule that defines argument list counts the number of arguments in the list in token value that is propagated throughout the list reduction. The rule that consumes the argument list then simply pops the right amount of arguments out of it and copies it to an exactly sized array. This might be further optimized if necessary - we can specialize Argument node for 0 to 3 arguments, or just remember a reference to the arguments stack and index to the array. We can use this approach for other lists as well (maybe for statemens, string concatenations, etc.) in future. - Provides dummy implementations for features that are not implemented yet: defined?(super), BEGIN, END. This avoids throwing not impl. exceptions so that we can measure performance of AST transformations over entire library code base. GPPG: - Removes Parser <: ShiftReduceParser inheritance. It was meant to enable sharing of generic GPPG parser driver and derive specific parsers but since we don''t share the class it makes no sense. Still GPPG.cs is mostly Ruby independent piece of code that could be easily copied by other languages. - Merges 3 stacks (state, value and location) into a single data structure reducing cost of push and pop. - Improves reduction of empty rules, removes unnecessary array lookups on hot path (these changes require some adjustments in GPPG generator). TokenValue: - Reduces size of the structure by using an explicit layout definition. This reduces amount of data moved around during reductions. - It might be possible in future to get rid of the obj2 field (it''s used only by argument rules). Tokenizer: - Removes a direct dependency on the parser - we only need a resolver for local variables. Introduces ILexicalVariableResolver (implemented by Parser) that provides such information. This allows tests to run tokenizer with mocked parser. - Replaces nextc(), peekc() and pushback() methods by Read, Peek, Back, SeekRelative, which don''t implicitly normalize end of lines. This allows us to drop yytext StringBuilder used to track normalized eolns. It was the culprit of very bad performance. Basically entire tokenized source code got eventually loaded into this StringBuilder one character in time :-/. 
The tokenizer now needs to deal with eoln normalization explicitly, which required a lot of small tweaks. - Reimplements line buffer loading - previous implementation was allocating 2 char[] per line, which is of course terrible waste. The buffer is now shared for all lines of code except for heredoc that needs to be handled in a special way. - The tokenizer is now finally in a good shape, although there are more small improvements that could be done. Tomas -------------- next part -------------- A non-text attachment was scrubbed... Name: parseroptfinal.diff Type: application/octet-stream Size: 849879 bytes Desc: parseroptfinal.diff URL: <http://rubyforge.org/pipermail/ironruby-core/attachments/20081203/9dafef36/attachment-0001.obj>
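A minimal sketch of what the Statements encapsulation could look like in C#. The member names and the Expression placeholder are assumptions for illustration, not the actual IronRuby API:

    using System.Collections;
    using System.Collections.Generic;

    // Placeholder for the AST node base class referenced in the message.
    public abstract class Expression { }

    // Minimal Statements wrapper; callers never see the underlying list type.
    public sealed class Statements : IEnumerable<Expression> {
        // Shared singleton used for empty statement lists; it must not be mutated.
        public static readonly Statements Empty = new Statements();

        // The underlying storage can change later without affecting callers.
        private readonly List<Expression> _list = new List<Expression>();

        public int Count {
            get { return _list.Count; }
        }

        public void Add(Expression statement) {
            _list.Add(statement);
        }

        public IEnumerator<Expression> GetEnumerator() {
            return _list.GetEnumerator();
        }

        IEnumerator IEnumerable.GetEnumerator() {
            return GetEnumerator();
        }
    }

Because consuming code only sees Statements, the backing List<Expression> could later be swapped for an array or another structure without touching that code.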
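The argument-stack scheme could be sketched like this; names are illustrative, and in the real parser the argument count travels in the token value of the list rules:

    using System;

    // Placeholder for the AST node base class.
    public abstract class Expression { }

    // Rules push arguments as they are reduced; the consuming rule pops
    // exactly the propagated count into an exactly sized array.
    public sealed class ArgumentStack {
        private Expression[] _items = new Expression[16];
        private int _count;

        public void Push(Expression argument) {
            if (_count == _items.Length) {
                Array.Resize(ref _items, _items.Length * 2);
            }
            _items[_count++] = argument;
        }

        // Pops the top 'argumentCount' arguments, preserving left-to-right order.
        public Expression[] Pop(int argumentCount) {
            var result = new Expression[argumentCount];
            Array.Copy(_items, _count - argumentCount, result, 0, argumentCount);
            Array.Clear(_items, _count - argumentCount, argumentCount);  // let the GC reclaim popped nodes
            _count -= argumentCount;
            return result;
        }
    }

For a call like foo(a, b), the rules would push two expressions and the consuming rule would call Pop(2), producing a two-element array with no intermediate List<Expression>.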
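The merged GPPG stack amounts to keeping one array of combined entries instead of three parallel stacks. A rough sketch, with generic parameters standing in for the parser's value and location types:

    using System;

    // One entry carries everything that used to live on three separate stacks.
    internal struct StackEntry<TValue, TLocation> {
        public int State;
        public TValue Value;
        public TLocation Location;
    }

    internal sealed class ParserStack<TValue, TLocation> {
        private StackEntry<TValue, TLocation>[] _entries = new StackEntry<TValue, TLocation>[128];
        private int _top;

        public void Push(int state, TValue value, TLocation location) {
            if (_top == _entries.Length) {
                Array.Resize(ref _entries, _entries.Length * 2);
            }
            _entries[_top].State = state;
            _entries[_top].Value = value;
            _entries[_top].Location = location;
            _top++;
        }

        // Reducing a rule of length N pops N entries with one index adjustment
        // instead of N pops from each of three stacks.
        public void Pop(int count) {
            _top -= count;
        }

        public StackEntry<TValue, TLocation> Peek() {
            return _entries[_top - 1];
        }
    }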
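One way an explicit layout shrinks TokenValue is by overlapping value-type fields that are never live at the same time, union style. The field set below is an assumption used to show the technique, not the actual structure:

    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Explicit)]
    internal struct TokenValue {
        // Overlapping value-type fields: a tagged union managed by the parser.
        [FieldOffset(0)] public int Integer;
        [FieldOffset(0)] public double Double;

        // Reference fields may not overlap value fields, so they get their own
        // pointer-aligned offsets.
        [FieldOffset(8)] public object Obj1;
        [FieldOffset(16)] public object Obj2;
    }

Since a TokenValue is copied on every shift and reduce, every byte shaved off the struct cuts the data moved through the merged stack.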
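ILexicalVariableResolver only has to answer one question for the tokenizer. The member signature and the mock below are assumptions used for illustration:

    using System.Collections.Generic;

    // The interface named in the message; Parser implements it in the real code.
    public interface ILexicalVariableResolver {
        bool IsLocalVariable(string identifier);
    }

    // A trivial resolver that tokenizer tests could use instead of a full parser.
    public sealed class MockVariableResolver : ILexicalVariableResolver {
        private readonly HashSet<string> _locals;

        public MockVariableResolver(params string[] locals) {
            _locals = new HashSet<string>(locals);
        }

        public bool IsLocalVariable(string identifier) {
            return _locals.Contains(identifier);
        }
    }

A tokenizer test could then hand the tokenizer a MockVariableResolver seeded with a few names instead of constructing a full Parser.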
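Finally, the shared line buffer and the non-normalizing cursor methods might look roughly like this; the real tokenizer also has to handle heredocs and "\r\n" sequences explicitly, so this is only the basic idea:

    internal sealed class LineBuffer {
        private char[] _buffer = new char[80];   // reused for every line read
        private int _length;
        private int _position;

        // Copies the next source line into the shared buffer instead of
        // allocating new char arrays per line.
        public void Load(string line) {
            if (line.Length > _buffer.Length) {
                _buffer = new char[line.Length];
            }
            line.CopyTo(0, _buffer, 0, line.Length);
            _length = line.Length;
            _position = 0;
        }

        // Returns the next character without any eoln normalization, or -1 at
        // the end of the line.
        public int Read() {
            return _position < _length ? _buffer[_position++] : -1;
        }

        // Returns the next character without consuming it.
        public int Peek() {
            return _position < _length ? _buffer[_position] : -1;
        }

        // Steps back over the character that was just read.
        public void Back() {
            _position--;
        }

        // Moves the cursor by a relative offset.
        public void SeekRelative(int offset) {
            _position += offset;
        }
    }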
Looks good!

-----Original Message-----
From: Tomas Matousek
Sent: Wednesday, December 03, 2008 4:17 PM
To: IronRuby External Code Reviewers
Cc: ironruby-core at rubyforge.org
Subject: Code Review: parseroptfinal