Example I have a UTF 8 encoded string that holds an XML docu Future of Coding #linking-together

Example: I have a UTF-8 encoded string that holds ...

Stefan

02/04/2020, 6:46 AM

Example: I have a UTF-8 encoded string that holds an XML document. I run a super-simple parser (scanner) over it that cuts it into .tag(String) and .text(String) values — all characters that used to be between < and > end up in .tags, everything else in .texts. Nothing more. Now in a separate step I take all the .tags and run another super-simple parser on the contents of each tag. If it starts with /, the .tag is transformed into an .endTag, if not it’s transformed into a .startTag. In another separate step I take all .startTags and run another super-simple parser on their contents to split tagName from attributes. Etc. You get the pattern. What weird way of parsing have I just reinvented? This is not a quiz. I’m trying to figure out what this is called so I can search for articles that do something like this.

Kartik Agaram

02/04/2020, 7:04 AM

You don't mention how you're managing all the .tags and .texts in step 1, all the .startTags and .endTags in step 2, and so on. For some answers to that question, this sounds like recursive descent. https://en.wikipedia.org/wiki/Recursive_descent_parser This is probably the #1 way to hand-build parsers. Lots of languages either use it in their parsers or did at some point in the past. I'm very partial to it.

wtaysom

02/04/2020, 7:07 AM

Yes, please someone answer. I have certainly looked at this kind of serial tokenizer/tree-transformer kind of thing before without reaching a clear notion that unifies with more conventional parsers and multipass stream transformation. There's a connection to this sort of stream processing http://reactivex.io/ and Bidirectional Transformations of this type https://2018.programming-conference.org/track/bx-2018-papers and Nanopass compilers https://nanopass.org/.

Stefan

02/04/2020, 7:52 AM

@wtaysom Nanopass compilers, yay — that’s the kind of stuff I was looking for. You don’t happen to have looked into common patterns for parsing passes, robustness/gracefully dealing with malformed input, concurrent parsing, or visualizing parsing passes as part of your explorations in that matter? And you also read my mind. Bidirectional Transforms is what I’m after.

gman

02/04/2020, 8:23 AM

Can you even do that? Isn't this valid XML

Copy code

<tag attr="<notatag>foo</notatag>">bar</tag>

👍 1

Stefan

02/04/2020, 8:35 AM

I’m not trying to parse XML. I was just making up an example to explain what I’m doing. Seemed easier to do than describing it abstractly. Apart from that I think you’re supposed to escape < and > with < and > in attributes and text.

👍 2

gman

02/04/2020, 8:58 AM

A well, checking the spec this is valid

Copy code

<tag attr=">">foo</tag>

but if dealing with nested delimiters wasn't important then no biggy.

Edward de Jong / Beads Project

02/05/2020, 5:42 AM

This type of multi-pass massaging of text is extremely common. I don't think the researchers think it worthy of their attention, as it is a sequential series of a set of text transformations which in the end result in a desired format. The SED program from 50 years ago was one of the first "stream" processing utilities.

Open in Slack

Previous Next