Hi Jason!

I've looked at the Ragel code for Textile and you are right: it has become quite hard to understand. I have gone through the list of difficult defects and through the current Textile reference, and I have the feeling that the current parser is quite complicated for the task at hand. Textile does not look like such a complicated grammar (at least not what is listed in the reference page), but maybe I'm wrong and there are many places where determinism is not easily attained.

I really feel that the parts that are difficult for the parser are also difficult for the reader when editing text. And most of these hard-to-parse and hard-to-read features in Textile (except for tables) are not related to describing content but to styling. Something like setting an "id" in an article seems really bad to me: what if you display two articles on one page and they both define the "hot" id? The same goes for "em" padding: that's not content, that's styling.

I feel very concerned about all these Textile issues because I am building a CMS in which my clients put *everything*: letters, comments, documents, quality certification stuff, control lists, etc. So I really need a Textile parser that can survive in the long run (10 years). To achieve this goal, we need to:

a. have a parser that is easy to enhance with new needs without breaking old text
b. have a grammar that is easy to parse

For point "a", I think we can live with S-expression generation and customization during S-expression tree processing. For example, an image with a caption would be parsed as:

  !file.jpg (foo bar baz)! ==> [:image, "file.jpg (foo bar baz)"]

So the processor will run a Ruby regex to "finish the work". This means the parser in "C" is kept simple, and if someone wants to add more features to the "image" tag, she just has to change the Ruby regex.
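
To make the two-step idea a bit more concrete, here is a minimal Ruby sketch of the second stage. Only the node shape [:image, "file.jpg (foo bar baz)"] comes from the example above; the names `process`, `process_image`, `escape_html` and `IMAGE_RE`, the `:text` node type and the generated HTML are assumptions made up for illustration, not RedCloth's actual API.

```ruby
# Hypothetical second-stage processor. The C parser is assumed to emit
# S-expressions such as [:image, "file.jpg (foo bar baz)"] and to leave
# the inner details of each tag to Ruby.

# Source path, optionally followed by a parenthesized caption.
IMAGE_RE = /\A(?<src>\S+)(?:\s+\((?<caption>[^)]*)\))?\z/

def escape_html(text)
  text.gsub(/[&<>"]/, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;")
end

# "Finishes the work" for an image node with a plain Ruby regex.
def process_image(raw)
  m = IMAGE_RE.match(raw)
  # Unrecognized content falls back to the escaped original markup.
  return escape_html("!#{raw}!") unless m

  html = %(<img src="#{escape_html(m[:src])}")
  html << %( alt="#{escape_html(m[:caption])}") if m[:caption]
  html << " />"
end

# Dispatches on the node type produced by the first stage.
def process(node)
  type, *args = node
  case type
  when :image then process_image(args.first)
  when :text  then escape_html(args.first)
  else raise ArgumentError, "unknown node type: #{type.inspect}"
  end
end

puts process([:image, "file.jpg (foo bar baz)"])
```

Under these assumptions, adding a new feature to the "image" tag only means changing `IMAGE_RE` and `process_image`; the first stage keeps handing over the raw string untouched.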

For point "b": we need to *not* support shortcut syntax for styling features such as the "id" thing or "em" padding (at least not at the "C" parser level). If someone really wants an em padding, she should use HTML (it's not nice to use, and that is an indication that this is bad practice):

  <div style='padding-left:4em;'>
  # one
  # two
  </div>

Since I *really* need such a tool, I could help refactor RedCloth into a two-step parser (half in "C", half in Ruby).

What do you think?

Gaspard

On Mon, Jun 8, 2009 at 6:40 AM, Jason Garber <jg at jasongarber.com> wrote:
> Gaspard, here's a copy of my complaining to _why, which contains a few
> examples of what I'm up against. This was a while back, so I've moved on a
> bit, but what you were asking about still applies.
>
> There's probably a way to do everything I need to in Ragel, but I can't
> figure it out. I spent a few hours figuring out a different capture
> mechanism and thought I was being quite clever, but in the end it didn't
> work out. I was relieved because it felt like I was reinventing something a
> tool should provide anyway.
>
> Jason
>
> Begin forwarded message:
>
> From: Jason Garber <jg at jasongarber.com>
> Date: June 1, 2009 8:50:23 AM EDT
> To: why the lucky stiff <why at whytheluckystiff.net>
> Subject: Need your advice on RedCloth
>
> Hi, why. I need your advice on RedCloth and Ragel. The current mark and
> capture mechanism has just gotten too ugly for me to handle. Since I took
> over the project, we've had to add several more variables/macros/actions to
> mark fallback captures and I had to add a separate machine to parse
> attributes. It's been livable, but as I fix more bugs it keeps heading in
> that direction and I don't like it.
>
> I've reduced the problem down to the deterministic nature of the machine.
> For example, merging in PyTextile-style table attributes creates a conflict
> when recognizing these two possibilities:
>
>   (#myid)# This is a list item with an id
>   (# This is a list item with padding-left:1em
>
> It has to get to the third character to know whether the first character was
> a left indent or the start of the id, but by then it's too late: the indent
> has already been stored. You wind up getting the same output from (#myid)#
> one as you would from ((#myid)# one. I solved it easily enough with a
> conditional action that looks to see if p+2 is a space, but I feel like I
> shouldn't have to.
>
> I wish that from the start state there were two '(' transitions, one marking
> the indent, one the id. The branches would have different things both
> leading to a final state. At the final state, an action would discard the
> captured bits that were not on the path to the final state, leaving only the
> things that "stuck." Basically, the same thing backtracking regex engines
> do when matching /(\()?(\(#([a-z]+)\))?# (.+)/
>
> Plus, there's the matter of having to think about nondeterminism at every
> step of the way, like when writing cite = "??" mtext "??". I never thought
> about the fact that book titles might end with or contain a question mark.
> It took me nearly an hour to get that working, but a regex would have just
> done it without me wasting brain cycles.
>
> I've also had to write duplicate patterns that don't have embedded actions
> so that I can look ahead (for extended blocks and such). So I wind up with
> duplicate patterns A and A_noactions, C and C_noactions...
>
> I'm new to all this stuff, but it seems Ragel produces a DFA, not an NFA, so
> what I describe above isn't possible. Is there a way to accomplish it with
> Ragel? It's tempting to just switch to Oniguruma. I'll bet it wouldn't be
> too much slower if we interfaced with it directly in C and did all the
> string manipulations in C. Might have distribution problems.
>
> What do you think?
>
> Jason
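
For reference, the backtracking behaviour described in the quoted message can be tried directly in Ruby. This is only a sketch: the pattern is the one from Jason's message, written so that the parentheses of the id group balance, and the `\A` anchor, the constant `LIST_ITEM` and the helper `parse_list_item` are assumptions added for illustration.

```ruby
# Pattern from the quoted message, with balanced groups and an added
# \A anchor (the anchor is an assumption, so matching starts at the
# first character of the line).
#   group 1: optional "(" marking a left indent
#   group 2: optional "(#id)" block, with the id itself in group 3
#   group 4: the list item text after "# "
LIST_ITEM = /\A(\()?(\(#([a-z]+)\))?# (.+)/

# Hypothetical helper: returns the captures that "stuck" on the
# successful path, or nil when the line is not a list item.
def parse_list_item(line)
  m = LIST_ITEM.match(line)
  return nil unless m

  { indent: !m[1].nil?, id: m[3], text: m[4] }
end

["(#myid)# one", "(# one", "((#myid)# one"].each do |line|
  p parse_list_item(line)
end
```

On "(#myid)# one" the engine first tries the leading "(" as an indent, fails two characters later, and backtracks so that the same "(" opens the id instead. The indent capture is discarded along the way, so the three inputs give three different results, which is what the deterministic machine cannot do without the extra conditional action.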