tfpt review "/shelveset:parseroptfinal;REDMOND\tomat" Improves Ruby tokenizer and parser performance by 30% total (previously 3 MBps, now 4.2 MBps - measured on real code of libraries we have checked in and including Ruby AST building, but not transformation to DLR ASTs). Adds /bm option to tokenizer test driver that measures parsing speed (IronRuby.Tests.exe /tokenizer /bm). Adds more tokenization unit tests and fixes several minor bugs in tokenizer. AST, Parser.cs, parser.y: - Replaces List<Expression> by a new class Statements where the list represents statements. This encapsulation us to change underlying representation w/o affecting code using the statements. Also refactors the rules for statement list so that we can use an empty statements singleton. - Replaces List<Expression> by Expression[] where it represents arguments. Arguments are the most frequent list of expression so it make sense to optimize it. Previously any list of arguments was constructed starting from an empty List<Expression> and adding arguments as parser reduced them. This is quite inefficient since most method calls have very few arguments (0 to 3). The parser now has a stack of Expression to which it pushes arguments as they are reduced. Each rule that defines argument list counts the number of arguments in the list in token value that is propagated throughout the list reduction. The rule that consumes the argument list then simply pops the right amount of arguments out of it and copies it to an exactly sized array. This might be further optimized if necessary - we can specialize Argument node for 0 to 3 arguments, or just remember a reference to the arguments stack and index to the array. We can use this approach for other lists as well (maybe for statemens, string concatenations, etc.) in future. - Provides dummy implementations for features that are not implemented yet: defined?(super), BEGIN, END. This avoids throwing not impl. exceptions so that we can measure performance of AST transformations over entire library code base. GPPG: - Removes Parser <: ShiftReduceParser inheritance. It was meant to enable sharing of generic GPPG parser driver and derive specific parsers but since we don''t share the class it makes no sense. Still GPPG.cs is mostly Ruby independent piece of code that could be easily copied by other languages. - Merges 3 stacks (state, value and location) into a single data structure reducing cost of push and pop. - Improves reduction of empty rules, removes unnecessary array lookups on hot path (these changes require some adjustments in GPPG generator). TokenValue: - Reduces size of the structure by using an explicit layout definition. This reduces amount of data moved around during reductions. - It might be possible in future to get rid of the obj2 field (it''s used only by argument rules). Tokenizer: - Removes a direct dependency on the parser - we only need a resolver for local variables. Introduces ILexicalVariableResolver (implemented by Parser) that provides such information. This allows tests to run tokenizer with mocked parser. - Replaces nextc(), peekc() and pushback() methods by Read, Peek, Back, SeekRelative, which don''t implicitly normalize end of lines. This allows us to drop yytext StringBuilder used to track normalized eolns. It was the culprit of very bad performance. Basically entire tokenized source code got eventually loaded into this StringBuilder one character in time :-/. 
The tokenizer now needs to deal with eoln normalization explicitly, which required a lot of small tweaks. - Reimplements line buffer loading - previous implementation was allocating 2 char[] per line, which is of course terrible waste. The buffer is now shared for all lines of code except for heredoc that needs to be handled in a special way. - The tokenizer is now finally in a good shape, although there are more small improvements that could be done. Tomas -------------- next part -------------- A non-text attachment was scrubbed... Name: parseroptfinal.diff Type: application/octet-stream Size: 849879 bytes Desc: parseroptfinal.diff URL: <http://rubyforge.org/pipermail/ironruby-core/attachments/20081203/9dafef36/attachment-0001.obj>
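A minimal sketch of what the Statements encapsulation could look like in C#. The member names and the Expression placeholder are assumptions for illustration, not the actual IronRuby API:

    using System.Collections;
    using System.Collections.Generic;

    // Placeholder for the AST node base class referenced in the message.
    public abstract class Expression { }

    // Minimal Statements wrapper; callers never see the underlying list type.
    public sealed class Statements : IEnumerable<Expression> {
        // Shared singleton used for empty statement lists; it must not be mutated.
        public static readonly Statements Empty = new Statements();

        // The underlying storage can change later without affecting callers.
        private readonly List<Expression> _list = new List<Expression>();

        public int Count {
            get { return _list.Count; }
        }

        public void Add(Expression statement) {
            _list.Add(statement);
        }

        public IEnumerator<Expression> GetEnumerator() {
            return _list.GetEnumerator();
        }

        IEnumerator IEnumerable.GetEnumerator() {
            return GetEnumerator();
        }
    }

Because consuming code only sees Statements, the backing List<Expression> could later be swapped for an array or another structure without touching that code.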
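The argument-stack scheme could be sketched like this; names are illustrative, and in the real parser the argument count travels in the token value of the list rules:

    using System;

    // Placeholder for the AST node base class.
    public abstract class Expression { }

    // Rules push arguments as they are reduced; the consuming rule pops
    // exactly the propagated count into an exactly sized array.
    public sealed class ArgumentStack {
        private Expression[] _items = new Expression[16];
        private int _count;

        public void Push(Expression argument) {
            if (_count == _items.Length) {
                Array.Resize(ref _items, _items.Length * 2);
            }
            _items[_count++] = argument;
        }

        // Pops the top 'argumentCount' arguments, preserving left-to-right order.
        public Expression[] Pop(int argumentCount) {
            var result = new Expression[argumentCount];
            Array.Copy(_items, _count - argumentCount, result, 0, argumentCount);
            Array.Clear(_items, _count - argumentCount, argumentCount);  // let the GC reclaim popped nodes
            _count -= argumentCount;
            return result;
        }
    }

For a call like foo(a, b), the rules would push two expressions and the consuming rule would call Pop(2), producing a two-element array with no intermediate List<Expression>.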
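The merged GPPG stack amounts to keeping one array of combined entries instead of three parallel stacks. A rough sketch, with generic parameters standing in for the parser's value and location types:

    using System;

    // One entry carries everything that used to live on three separate stacks.
    internal struct StackEntry<TValue, TLocation> {
        public int State;
        public TValue Value;
        public TLocation Location;
    }

    internal sealed class ParserStack<TValue, TLocation> {
        private StackEntry<TValue, TLocation>[] _entries = new StackEntry<TValue, TLocation>[128];
        private int _top;

        public void Push(int state, TValue value, TLocation location) {
            if (_top == _entries.Length) {
                Array.Resize(ref _entries, _entries.Length * 2);
            }
            _entries[_top].State = state;
            _entries[_top].Value = value;
            _entries[_top].Location = location;
            _top++;
        }

        // Reducing a rule of length N pops N entries with one index adjustment
        // instead of N pops from each of three stacks.
        public void Pop(int count) {
            _top -= count;
        }

        public StackEntry<TValue, TLocation> Peek() {
            return _entries[_top - 1];
        }
    }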
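One way an explicit layout shrinks TokenValue is by overlapping value-type fields that are never live at the same time, union style. The field set below is an assumption used to show the technique, not the actual structure:

    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Explicit)]
    internal struct TokenValue {
        // Overlapping value-type fields: a tagged union managed by the parser.
        [FieldOffset(0)] public int Integer;
        [FieldOffset(0)] public double Double;

        // Reference fields may not overlap value fields, so they get their own
        // pointer-aligned offsets.
        [FieldOffset(8)] public object Obj1;
        [FieldOffset(16)] public object Obj2;
    }

Since a TokenValue is copied on every shift and reduce, every byte shaved off the struct cuts the data moved through the merged stack.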
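ILexicalVariableResolver only has to answer one question for the tokenizer. The member signature and the mock below are assumptions used for illustration:

    using System.Collections.Generic;

    // The interface named in the message; Parser implements it in the real code.
    public interface ILexicalVariableResolver {
        bool IsLocalVariable(string identifier);
    }

    // A trivial resolver that tokenizer tests could use instead of a full parser.
    public sealed class MockVariableResolver : ILexicalVariableResolver {
        private readonly HashSet<string> _locals;

        public MockVariableResolver(params string[] locals) {
            _locals = new HashSet<string>(locals);
        }

        public bool IsLocalVariable(string identifier) {
            return _locals.Contains(identifier);
        }
    }

A tokenizer test could then hand the tokenizer a MockVariableResolver seeded with a few names instead of constructing a full Parser.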
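Finally, the shared line buffer and the non-normalizing cursor methods might look roughly like this; the real tokenizer also has to handle heredocs and "\r\n" sequences explicitly, so this is only the basic idea:

    internal sealed class LineBuffer {
        private char[] _buffer = new char[80];   // reused for every line read
        private int _length;
        private int _position;

        // Copies the next source line into the shared buffer instead of
        // allocating new char arrays per line.
        public void Load(string line) {
            if (line.Length > _buffer.Length) {
                _buffer = new char[line.Length];
            }
            line.CopyTo(0, _buffer, 0, line.Length);
            _length = line.Length;
            _position = 0;
        }

        // Returns the next character without any eoln normalization, or -1 at
        // the end of the line.
        public int Read() {
            return _position < _length ? _buffer[_position++] : -1;
        }

        // Returns the next character without consuming it.
        public int Peek() {
            return _position < _length ? _buffer[_position] : -1;
        }

        // Steps back over the character that was just read.
        public void Back() {
            _position--;
        }

        // Moves the cursor by a relative offset.
        public void SeekRelative(int offset) {
            _position += offset;
        }
    }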
Looks good!

-----Original Message-----
From: Tomas Matousek
Sent: Wednesday, December 03, 2008 4:17 PM
To: IronRuby External Code Reviewers
Cc: ironruby-core at rubyforge.org
Subject: Code Review: parseroptfinal