# linking-together
s
Is there a parsing technique where you make explicit which parts of the input are ignored and what the escape patterns are? Or a technique where you selectively pick only parts of the input to begin with? (I realize that sounds like regular expressions as parsers, which I know is usually not a good idea — any other techniques beyond regex?) Example: many programming languages skip comments in the lexer. Or think of string literals, where it pretty much doesn't matter what's inside the string, as long as you play by the escaping rules. If you take the strings example further, these strings often get parsed by a different parser at a different time, for instance when the string is a number or a date, or perhaps a regex. Is there anything that makes these different "stages" of parsing explicit? (I'm obviously looking for a concept I don't know a name for…)
Let me jump right in and say that I'm aware of parser combinators, which allow you to have different parsers composed of different components that can stand in for different languages, e.g. the host language + the date formatter language used in particular string literals. However, there isn't really a way to just skip over the "I don't care about what's in here" strings without breaking the whole parser, or is there?
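(To make the "just skip over it" idea a bit more concrete, here's a minimal sketch of such a combinator in TypeScript. It consumes a double-quoted string literal, honoring backslash escapes, but hands the contents back uninterpreted so a later stage can parse them as a date, number, regex, or whatever. The types and names are invented for illustration, not taken from any particular combinator library.)

```ts
// Sketch only; names invented. A parse result is the parsed value plus the
// remaining input, or null on failure.
type Result<A> = { value: A; rest: string } | null;
type Parser<A> = (input: string) => Result<A>;

// "I don't care what's in here": only the delimiters and the escape rule matter.
// The text between the quotes is returned verbatim for a later parsing stage.
const rawStringLiteral: Parser<string> = (input) => {
  if (input[0] !== '"') return null;
  let i = 1;
  while (i < input.length) {
    if (input[i] === '\\') { i += 2; continue; } // skip the escaped character
    if (input[i] === '"') {
      return { value: input.slice(1, i), rest: input.slice(i + 1) };
    }
    i++;
  }
  return null; // unterminated literal
};

// Usage: rawStringLiteral('"2020-07-01" rest') yields
// { value: '2020-07-01', rest: ' rest' }; a date parser can run on value later.
```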
s
Isn't this the difference between lexing and parsing as you alluded to yourself? Otherwise, you can set it up as a pipeline, I guess. Do you have a more concrete example?
k
Yeah, I don't think so. If you skip over some text you want to make sure there isn't a `#` in there that disables everything you're parsing. And so on. This is a key difference between tools and languages, IMO. Tools can be opt-in (though they don't have to be). But languages have to be all or nothing. In Mu I tried as far as possible to do everything with partial on-demand syntax sugar. But there are still 2 points at which I need to draw a line and start parsing everything from scratch.
m
I am no expert in parsing, but I don't see why you couldn't postpone parsing of string insides for later, as long as you can tell where the string starts and ends? For that you would need to care about escaping, but that's it. Or am I getting something wrong here?
s
@S.M Mukarram Nainar The difference between lexing and parsing is not what I mean, although I understand that my examples fit that. How about parsing source code where comments also have a certain structure — the compiler/interpreter doesn't care about the comments, a documentation generator doesn't care about the source code.
s
So is the problem that you'd want to reuse the original parser for both the compiler and the documentation generator?
s
@Martin Yes, exactly, it's done all the time. I guess I'm looking for formalisms around that. Are there formalisms about escaping?
i
I've bumped into this while writing a JSON parser in Standard ML, just for fun. I wanted the String and Number lexers to be pluggable. This won't help you much, but I decided to call them paralexers, because they sit beside and help the overall lexing phase. I'd love to know if there's an established term for this type of parameterized lexer/parser.
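(For what it's worth, here's a rough TypeScript sketch of that "paralexer" shape, i.e. a lexer parameterized over the sub-lexers it delegates to. All names are made up; this isn't the original Standard ML code.)

```ts
// Sketch only; names invented. A paralexer reads one token starting at pos,
// or returns null if it doesn't match.
type Paralexer = (input: string, pos: number) => { token: string; end: number } | null;

// The main lexer delegates string and number tokens to pluggable sub-lexers.
function makeLexer(stringLexer: Paralexer, numberLexer: Paralexer) {
  return function lex(input: string): string[] {
    const tokens: string[] = [];
    let pos = 0;
    while (pos < input.length) {
      const c = input[pos];
      if (c === ' ' || c === '\n') { pos++; continue; } // skip whitespace
      const sub = c === '"' ? stringLexer : /[0-9-]/.test(c) ? numberLexer : null;
      if (sub) {
        const hit = sub(input, pos);
        if (!hit) throw new Error(`lex error at ${pos}`);
        tokens.push(hit.token);
        pos = hit.end;
      } else {
        tokens.push(c); // punctuation: { } [ ] : ,
        pos++;
      }
    }
    return tokens;
  };
}
```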
j
Not exactly what you are looking for, but I thought this was related enough to link to. The idea is a microgrammar where you just define the parts you care about and ignore everything else. https://blog.acolyer.org/2016/05/31/how-to-build-static-checking-systems-using-orders-of-magnitude-less-code/
💡 1
s
@Jimmy Miller Uh, yes, that sounds interesting. I'll take a look. Thanks!
m
@Stefan I don't know of any formalisms, but here are a couple of simple things we all see in practice:
• Defined delimiters with a defined escaping mechanism. Example: strings.
• Defined delimiters with no escaping mechanism. Example: multiline comments in C++ (`/* */`) (not 100% sure about this, but I think they can't be escaped).
• If a line starts with a special sequence, the whole line is considered as a thing. This is really just a specialized case of defined delimiters with no escaping mechanism, where the ending delimiter is the newline. Example: line comments.
I have been playing with this lately because in the compiler/parser we are building, we currently need to parse blocks of JS without actually parsing the JS -> so I would say the same challenge as you are having. We decided to go with a simple solution for now: defined delimiters with no escaping mechanism. In our case we are going with `{=js` as the starting delimiter and `js=}` as the end delimiter. While a situation could in theory happen where a developer wants to have such a sequence (`js=}`) in their code inside this JS block, as part of a string for example, I don't think it is a real problem in practice, and it can easily be circumvented (by breaking down the string, adding some whitespace, or so on). So we basically parse these blocks as strings, and then process them later in the pipeline (some of them we even just copy-paste and never process -> but those are implementation details now). As a more advanced step, I see doing the following: once our parser hits the `{=js`, we switch to a JS parser and let it parse whatever follows, JS statement by JS statement. After each JS statement, we check if there is a `js=}` after it, and if so, we stop parsing JS and continue with our parser. I believe JSX does something similar, although maybe somewhat simpler.
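(A minimal sketch of that "defined delimiters, no escaping" approach in TypeScript, using the `{=js` / `js=}` delimiters from the message above; the function name is made up. The block body comes back as an opaque string for a later stage, or for no stage at all.)

```ts
// Sketch only; name invented. Extract one raw JS block starting at `from`,
// without interpreting anything inside it.
function extractJsBlock(input: string, from: number): { body: string; end: number } | null {
  const open = '{=js';
  const close = 'js=}';
  if (!input.startsWith(open, from)) return null;
  const closeAt = input.indexOf(close, from + open.length);
  if (closeAt === -1) return null; // unterminated block
  return {
    body: input.slice(from + open.length, closeAt), // raw JS, parsed later (or never)
    end: closeAt + close.length,
  };
}
```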
k
You're right, this is something I've done many times but never seen a formalism for. I just searched for 'hierarchical parsing' and the best I could find was https://alandix.com/blog/2011/03/12/hierarchical-grammars-for-more-human-like-compiler-parsing
i
I think the formalism is called Island Grammar: https://en.wikipedia.org/wiki/Island_grammar
💡 1
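(To illustrate the island grammar idea with a toy example: a scanner for a C-like language whose only "islands" are line comments. String literals are recognized just well enough that a // inside a string doesn't fool it; everything else is water. TypeScript, names invented.)

```ts
// Toy sketch of an island grammar: collect line comments (islands), skip
// everything else (water), staying just escape-aware enough around strings.
function extractLineComments(src: string): string[] {
  const comments: string[] = [];
  let i = 0;
  while (i < src.length) {
    if (src[i] === '"') {                              // water, but escape-aware
      i++;
      while (i < src.length && src[i] !== '"') {
        if (src[i] === '\\') i++;                      // skip escaped character
        i++;
      }
      i++;                                             // past the closing quote
    } else if (src[i] === '/' && src[i + 1] === '/') { // island
      const nl = src.indexOf('\n', i);
      comments.push(src.slice(i + 2, nl === -1 ? src.length : nl));
      i = nl === -1 ? src.length : nl;
    } else {
      i++;                                             // water
    }
  }
  return comments;
}
```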
s
@Martin Thanks, yes, that describes the cases for common text-based languages. While we're at it, I'd also add the arguably simpler ways configuration files use tabs and newlines or sometimes \0 for structure. And I'd also mention fixed-size and length-based encoding used in some lower-level network protocols. Totally get your challenges with JS. I'm actually looking at this from the other end — when you design a language (which in my case is more about data description than about programming at the moment), which of those formalisms should you use when? What are the benefits and drawbacks of each approach?
@Kartik Agaram Uh, the references in that one look promising… thanks!
@Ionuț G. Stan Huh, never heard of Island Grammars before. And oh, I guess template languages would've been a much better example to go with… 🤦
This is great. Thanks everyone for trying to follow my weird train of thought and coming up with all these pointers. (Please keep going if you find anything else! Just wanted to show my appreciation as this is one of those things where a forum like this can really help; basically LMGTFY when you don't know the terms to search for.)
👍🏻 2
k
It's actually interesting to think of lexing vs parsing as a single point in a more general space of possibilities.
👍 1
m
@Stefan Aha got it, interesting that you are looking at it from that perspective: I would love to hear more about this if you manage to formalize this or write something!
👍 1
s
@Kartik Agaram Not sure what you mean, but it sounds intriguing. For me lexing and parsing are just about regular vs. context-free. But you seem to look at how these are often separate stages…? In that space I like the idea of micro-pass parsing, here's a good intro with links to relevant papers: https://blog.sigplan.org/2019/07/09/my-first-fifteen-compilers/
At the risk of derailing the thread, another concept I'd like to see as a "single point in a more general space of possibilities" is compile-time vs. run-time. It feels similar on a very abstract level. If you think about it, there are more "times" than just the two: macro-expansion, compilation (perhaps with several stages of lowering into IL/IRs), static analysis, and linking all fall into compile-time, and then at runtime we have configuration (as long as the config file doesn't change), execution (as long as the process is running), session (e.g. as long as a user is logged in), run-loop (as long as a single run loop or frame is processed), stack frame, instruction, etc. What if type systems didn't (just) slice the space into possible values at runtime, but also across all these "stages"?
☝️ 2
i
@Stefan I believe what you're describing is called multi-stage programming (https://en.wikipedia.org/wiki/Multi-stage_programming). I don't have experience with it, but it looks like it might not exactly cover what we normally call runtime. However, it splits the compilation phase into multiple stages, so the output of one stage is the program for a subsequent stage, and so on. Having said that, I agree that sometimes it's not easy to explain some programs by splitting them just into compile-time and runtime. For example, an embedded DSL implementation might include a static analysis phase, but it's static only relative to the DSL, not to the host language. Similarly, it might even contain a compile-time phase which is run as part of the meta-program's runtime. In addition, besides the classical compile time and runtime, I also like to think about developer time. There's a lot of computation happening at developer time, and I guess what a lot of people in this community are trying to do is to improve exactly this phase.
r
I'm a big fan of multi-stage programming theory. You can also look at this through the dimension of location. Instead of "compile time" vs. "runtime", you have "does this computation run on the developer machine or the deployment machine?" Compilation is just computation. It's all a matter of when and where different parts of a computation are run. There is no reason it can't be separated on both time and space dimensions... (I'm a static types person, but thinking this way made me hate dynamic typing less. I realized they are equivalent and it's just an engineering trade-off about where the checking runs 😛) At the risk of losing some important subtleties of the project, I think @Kartik Agaram's Mu project can be seen as an effective multi-stage programming system. It's one of the things I really like about that project ❤️. Mu attempts to make sure all the stages are distinct, user-understandable (and theoretically verifiable), and only build upon earlier stages (no inter-stage recursion, a problem that macro systems run into).
💡 2
k
@Stefan Yeah my thought was not very well-formed. Basic idea: I tend to visualize parsers as scans over streams that emit some sort of artifact as a side-effect. (Hence my original answer bluntly in the negative.) Hierarchical grammars seem to be scans that emit a stream as a side-effect, a stream that can itself be parsed by a different grammar. I've been doing this naturally for as long as I can remember. Lexing feels like this sort of thing, a stream of bytes turned into a stream of tokens. Compiler passes also feel vaguely in the same family/framework, though they of course can emit whole new languages (whether IR or VM bytecode or native code) as a side-effect. Nanopass is nice, but doesn't feel quite as rigorous as the taxonomy of languages and the correspondence results between classes of languages and automata. In general, we humans seem to be good at creating and reasoning about formal systems in isolation. We have precious few results (to my knowledge) that apply to arbitrary combinations of formal systems. Every seam between formal systems (wormhole between universes) is a non-linearity that makes things harder to reason about. This is basically why my kick lately is minimizing the number of inter-language seams rather than maximizing intra-language power.
Looks like you've been noodling at this for a while now! https://futureofcoding.slack.com/archives/C5U3SEW6A/p1580798784083500. I'd love to hear where the journey of the past few months (passing through a wormhole between pre- and post-pandemic universes) has taken you.
s
You're dangerously good with search, @Kartik Agaram. Well, this parser thing is just a side project that keeps coming up every few months, usually when I get to a point where I need to either import or export some arbitrary data in some arbitrary format. I always try to take another stab at solving it properly before I go back to just using a library or cobbling something together myself. So, unfortunately I have to admit that I haven't spent the last few months on just that and therefore haven't made much progress. :-) What I'm mostly trying to do is to transfer what I learned about cognitive science (categorization, metaphorical structuring, image schemas — the stuff you hear me talk about a lot here) to system and interface design. I'm practically trying to pull off what the GoF did with Design Patterns, building on Christopher Alexander's work in architecture, but with Image Schemas from George Lakoff's cognitive semantics. Well, that's the big dream. If I only get 5% there, it'll be great. To turn this rather theoretical endeavor into something more practical and technical, I'm currently trying to build a thinking tool that covers important parts of an individual's personal knowledge management process. Think note taking, second brain, roam cult, digital garden. But trying to go a little beyond the "bi-directional links is how your brain works". It all comes together at designing the data model, both from a technical perspective (which isn't that difficult) and from a user interface/experience perspective. I want end users without technical experience to ultimately build and modify the data model themselves — without having to understand what a data model is. I think I described it better here: https://stefan-lesser.com/2019/12/06/structure-and-behavior/ Long story short: some of that image schema and metaphor stuff gave me some ideas and I'm trying to validate them. First just for myself, building something that improves my own process. If you see me pick up blogging again later this year it could mean that it worked…
👍 2
w
Oh friends, here we have one of my favorite topics. Let's start with the specific question: parsing documentation out of comments. If you have a whole language parser on hand (or just the lexer), then you may be able to use that to grab the comments, which you then parse separately. Feels like overkill, right? Why not just scan for the comments? One way to think of it is that the full language parser has a simpler sub-parser that only cares about comments. It's fun to think of how to automatically transform a full parser into one that just cares about comments. Or another way is to be clever. (See also XML, where you can get a 90% good parser with 10% of the grammar.) Your comment marker can only appear in strings, of which there are two kinds, oh and regular expressions too, oh and... well, at least you can be certain that literals don't cross line boundaries. Still... something feels off.

Consider how algorithmic parsing (a parser for things is a function from strings to lists of pairs of strings and things) is at odds with how you as a programmer parse things (starting right in the middle, looking for fencepost spacing and brackets, rarely needing to see the whole file). Instead of having a complete system, we look around a little bit and guess, moving between many partial systems (with simpler ones working most of the time) and end goals in mind. If we see anything too weird, we fudge it to make things fit. Try reading with a five-year-old. It can be a great challenge to have them focus on the letters in front of them over a fitting word in their head. How would a more person-like parser work?

Separately, parsers are interesting for their close connection to all kinds of other traversals. It's a small step from streams of tokens (first, rest) to trees (node, left branch, right branch) to the world (here, north, east, south, west), and parsers are unique for how richly they interpret the stream. A lot to explore.
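(The parenthetical above, spelled out as a type for anyone who likes it concrete; in TypeScript rather than the Haskell it usually appears in, and the alias name is mine.)

```ts
// "A parser for things is a function from strings to lists of pairs of
// strings and things": every way the input can be read, each paired with
// the leftover input. Sketch only.
type Parser<Thing> = (input: string) => Array<[string, Thing]>;

// Example instance: a parser that reads one leading digit.
const digit: Parser<number> = (input) =>
  /[0-9]/.test(input[0] ?? '') ? [[input.slice(1), Number(input[0])]] : [];
```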
k
It occurs to me that http://doc.cat-v.org/bell_labs/structural_regexps might be relevant to this thread. It makes a strong case that you can make rigorous parsers using regular expressions. You just have to let go of the line-oriented nature of our regex tools.
No, scratch that. Not sure what I was smoking. It's still regular expressions, so of course it can't handle nested brackets.
w
That is correct. They're useful, but you don't get free nesting.
My father, by the way, has been using Sam since the 80's and still is.
i
I've read the paper about structural regexes and maybe I'm thick, but it seems to me that it's not about an addition to the CS concept of regular expressions, it's just about the particularities of some implementations. Am I reading it wrong? What's special about them?
w
Nothing fundamental, just convenient for some tasks. This is the other paper to read http://doc.cat-v.org/plan_9/4th_edition/papers/sam/. The biggest idea is that Sam's command language is expressive and terse, so you keep little command snippets around, which you then middle-click to execute. So instead of having a special find/replace window in your editor, you keep open a text file that you use for the purpose. It has a certain ease and elegance where modern text editors don't: even the strongly customizable ones.
i
Thanks, William. I'll have a read through that other paper.
k