Example: I have a UTF-8 encoded string that holds ...
# linking-together
s
Example: I have a UTF-8 encoded string that holds an XML document. I run a super-simple parser (scanner) over it that cuts it into .tag(String) and .text(String) values — all characters that used to be between < and > end up in .tags, everything else in .texts. Nothing more. Now in a separate step I take all the .tags and run another super-simple parser on the contents of each tag. If it starts with /, the .tag is transformed into an .endTag, if not it’s transformed into a .startTag. In another separate step I take all .startTags and run another super-simple parser on their contents to split tagName from attributes. Etc. You get the pattern. What weird way of parsing have I just reinvented? This is not a quiz. I’m trying to figure out what this is called so I can search for articles that do something like this.
k
You don't mention how you're managing all the .tags and .texts in step 1, all the .startTags and .endTags in step 2, and so on. For some answers to that question, this sounds like recursive descent. https://en.wikipedia.org/wiki/Recursive_descent_parser This is probably the #1 way to hand-build parsers. Lots of languages either use it in their parsers or did at some point in the past. I'm very partial to it.
w
Yes, please someone answer. I have certainly looked at this kind of serial tokenizer/tree-transformer kind of thing before without reaching a clear notion that unifies with more conventional parsers and multipass stream transformation. There's a connection to this sort of stream processing http://reactivex.io/ and Bidirectional Transformations of this type https://2018.programming-conference.org/track/bx-2018-papers and Nanopass compilers https://nanopass.org/.
s
@wtaysom Nanopass compilers, yay — that’s the kind of stuff I was looking for. You don’t happen to have looked into common patterns for parsing passes, robustness/gracefully dealing with malformed input, concurrent parsing, or visualizing parsing passes as part of your explorations in that matter? And you also read my mind. Bidirectional Transforms is what I’m after.
g
Can you even do that? Isn't this valid XML
Copy code
<tag attr="<notatag>foo</notatag>">bar</tag>
👍 1
s
I’m not trying to parse XML. I was just making up an example to explain what I’m doing. Seemed easier to do than describing it abstractly. Apart from that I think you’re supposed to escape < and > with < and > in attributes and text.
👍 2
g
A well, checking the spec this is valid
Copy code
<tag attr=">">foo</tag>
but if dealing with nested delimiters wasn't important then no biggy.
e
This type of multi-pass massaging of text is extremely common. I don't think the researchers think it worthy of their attention, as it is a sequential series of a set of text transformations which in the end result in a desired format. The SED program from 50 years ago was one of the first "stream" processing utilities.