Alex Cruise

06/20/2022, 5:45 PM
There’s a recurring theme in my career, going back to my very first job, that goes under the mental headline of “wide logic” problems… That’s where you have a somewhat narrow stream (conceptually, not necessarily streaming per se) of data, and you need to evaluate a large number of functions against each item: at least a handful, often dozens, sometimes hundreds! The functions are basically always provided by the user, although normally in a declarative, non-Turing-complete DSL. I’m curious whether others have noticed this pattern, and whether you know of any more mainstream labels for it? The closest I’ve found is papers on publish/subscribe and indexing techniques for complex boolean expression trees, e.g.:
• An efficient publish/subscribe index for e-commerce databases (VLDB ’14)
• Analysis and optimization for boolean expression indexing [BE-Tree] (ACM Trans. Database Syst. ’13)
• A-Tree: A Dynamic Data Structure for Efficiently Indexing Arbitrary Boolean Expressions (SIGMOD/PODS ’21 … actually this is a new one for me, I need to read it now… 🙂)
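To make the pattern concrete, here is a minimal sketch of the counting-style matching those papers refine, assuming each rule is a conjunction of predicates; all names here are illustrative, not from the thread:
```python
# Counting-based matching: shared predicates are evaluated once per record,
# and a rule fires when all of its predicates pass.
from collections import defaultdict

class Matcher:
    def __init__(self):
        self.predicates = {}              # pred_id -> callable(record) -> bool
        self.rule_sizes = {}              # rule_id -> number of predicates
        self.index = defaultdict(list)    # pred_id -> rule_ids that use it

    def add_rule(self, rule_id, preds):
        """preds maps a predicate id to a predicate function."""
        self.rule_sizes[rule_id] = len(preds)
        for pred_id, fn in preds.items():
            self.predicates[pred_id] = fn
            self.index[pred_id].append(rule_id)

    def match(self, record):
        hits = defaultdict(int)
        for pred_id, fn in self.predicates.items():   # each predicate once
            if fn(record):
                for rule_id in self.index[pred_id]:
                    hits[rule_id] += 1
        return [r for r, n in hits.items() if n == self.rule_sizes[r]]

m = Matcher()
m.add_rule("big_us_order", {"country=US": lambda r: r["country"] == "US",
                            "amount>1000": lambda r: r["amount"] > 1000})
m.add_rule("any_us",       {"country=US": lambda r: r["country"] == "US"})
print(m.match({"country": "US", "amount": 250}))  # ['any_us']
```
The papers go further by indexing predicates by attribute and operator so only a small subset needs evaluating per event; this sketch naively evaluates every distinct predicate.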
Some stories…
First job: a (snail mail) direct marketing agency. We would run dozens of queries/reports to analyze the performance of mailings/campaigns, breaking down response rates, average donation amounts, etc. The functions were often “symbolic” (i.e. calculated) fields in the 4GL; they were written in a Turing-complete language, but often by relatively unskilled programmers. I built a system that would traverse the analysis/mailing list once, eval all the expressions once per record, and aggregate them in memory over multiple group-bys.
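A hedged sketch of that single-pass shape, with invented field names: one traversal, each derived expression evaluated once per record, several group-bys accumulated side by side:
```python
from collections import defaultdict

records = [
    {"region": "west", "campaign": "A", "donation": 25.0, "responded": True},
    {"region": "east", "campaign": "A", "donation": 0.0,  "responded": False},
    {"region": "west", "campaign": "B", "donation": 40.0, "responded": True},
]

# Each report is (name, group-by key fn); all share the same derived fields.
reports = [
    ("by_region",   lambda r: r["region"]),
    ("by_campaign", lambda r: r["campaign"]),
]
aggs = {name: defaultdict(lambda: {"n": 0, "responses": 0, "total": 0.0})
        for name, _ in reports}

for rec in records:                      # single pass over the data
    responded = rec["responded"]         # evaluate each expression once...
    donation = rec["donation"]
    for name, key_fn in reports:         # ...then feed every group-by
        cell = aggs[name][key_fn(rec)]
        cell["n"] += 1
        cell["responses"] += responded
        cell["total"] += donation

for name, groups in aggs.items():
    for key, c in sorted(groups.items()):
        print(name, key, f"response rate {c['responses']/c['n']:.0%}",
              f"avg donation {c['total']/max(c['responses'], 1):.2f}")
```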
Second, third and fourth jobs didn’t really have this feature…
Fifth job was layer7.com (an API gateway appliance): every message needed to be resolved to one of potentially hundreds of policies, but the resolution logic was pretty simple, and once resolved, the policy didn’t tend to be too branchy.

ibdknox

06/20/2022, 6:23 PM
Sounds like streaming analytics to me.

Alex Cruise

06/20/2022, 6:23 PM
Sixth job was a SaaS app for monitoring compliance rules at brokerages… There were hundreds of handcrafted rules that needed to be evaluated against every Order, Trade and Account record after a nightly ETL job. Before I got involved, every rule was run one at a time, sequentially, and some customers’ nightly jobs were starting to run up against the clock.
Definitely streaming analytics, but from what I’ve seen so far the support for this pattern isn’t great anywhere

ibdknox

06/20/2022, 6:24 PM
Differential dataflow should handle it well I would think.

Alex Cruise

06/20/2022, 6:26 PM
I’ve definitely been keeping a close eye on materialize.io 🙂 … IIUC they don’t do a great job with datasets that are significantly larger than main memory
At the brokerage monitoring thing, the hack I came up with was to stipulate that every rule needed to start with (in an AND position) a predicate that was constrained to be mechanically translatable to SQL; all other predicates were MVEL and less constrained. All the rules that shared a primary predicate would be served by a single query as the first filter, and the MVEL expressions for the sub-rules would be run in-memory instead of pushed down.
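A sketch of that grouping trick under an invented schema and rule set, with MVEL swapped for plain Python callables:
```python
# Rules sharing a primary (SQL-translatable) predicate ride on one
# pushed-down query; only the survivors get per-rule in-memory checks.
import sqlite3
from collections import defaultdict

rules = [
    {"id": "R1", "primary": "order_type = 'SHORT_SELL'",
     "secondary": lambda row: row["qty"] > 10_000},
    {"id": "R2", "primary": "order_type = 'SHORT_SELL'",
     "secondary": lambda row: row["account_age_days"] < 30},
    {"id": "R3", "primary": "symbol_restricted = 1",
     "secondary": lambda row: True},
]

by_primary = defaultdict(list)            # primary predicate -> its rules
for rule in rules:
    by_primary[rule["primary"]].append(rule)

def run_nightly(conn):
    violations = []
    for primary, grouped in by_primary.items():
        # One SQL query serves every rule that shares this primary predicate.
        # Interpolation is OK here only because the predicate is mechanically
        # generated, per the original constraint, not free-form user text.
        for row in conn.execute(f"SELECT * FROM orders WHERE {primary}"):
            for rule in grouped:          # secondary checks stay in memory
                if rule["secondary"](row):
                    violations.append((rule["id"], row["order_id"]))
    return violations

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE orders (order_id INT, order_type TEXT,"
             " symbol_restricted INT, qty INT, account_age_days INT)")
conn.executemany("INSERT INTO orders VALUES (?,?,?,?,?)",
                 [(1, "SHORT_SELL", 0, 50_000, 400),
                  (2, "SHORT_SELL", 0, 100, 10),
                  (3, "LIMIT", 1, 10, 365)])
print(run_nightly(conn))   # [('R1', 1), ('R2', 2), ('R3', 3)]
```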

ibdknox

06/20/2022, 6:35 PM
ah yeah, if you need something out of core, I don't think DD (which is lower-level than the Materialize offering) will give you that, though it may be something you could add. Relational.ai is probably almost exactly what you want (a very powerful variant of Datalog that can handle very large datasets and does incremental evaluation), but they're a bit early yet.

Alex Cruise

06/20/2022, 6:37 PM
Lately I’ve been more focused on teasing out the 3-4 very different cost models for the sub-expressions that queries are composed of:
• pure functions that don’t even look at data or the environment; those can be evaluated whenever
• functions that need to look at the record, or cheap environment stuff like the current time of day
• stateful functions, usually aggregates but sometimes more complex
• things that need to do IO: lookups, external service calls, etc. Caching becomes mission-critical
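A sketch of one way to act on those tiers, with the tier constants, the lookup, and the aggregate all invented for illustration: sort a conjunction by cost and short-circuit before the expensive work happens.
```python
from functools import lru_cache

PURE, RECORD, STATEFUL, IO = 0, 1, 2, 3   # cost tiers, cheapest first

@lru_cache(maxsize=10_000)
def country_risk(country):                # stand-in for an external lookup
    return {"US": 0.1, "KY": 0.9}.get(country, 0.5)

class RunningTotal:                       # stand-in for a stateful aggregate
    def __init__(self): self.total = 0.0
    def add(self, amount):
        self.total += amount
        return self.total

running = RunningTotal()

rule = [  # (tier, predicate) pairs making up one conjunctive rule
    (PURE,     lambda r: True),                           # a feature flag, say
    (RECORD,   lambda r: r["amount"] > 1_000),
    (STATEFUL, lambda r: running.add(r["amount"]) > 5_000),
    (IO,       lambda r: country_risk(r["country"]) > 0.5),
]

def evaluate(rule, record):
    for _tier, pred in sorted(rule, key=lambda p: p[0]):  # cheapest first
        if not pred(record):
            return False              # short-circuit before costlier tiers
    return True

print(evaluate(rule, {"amount": 6_000, "country": "KY"}))  # True
print(evaluate(rule, {"amount": 50, "country": "KY"}))     # False, no IO done
```
Reordering like this is only safe when the stateful tier tolerates conditional evaluation, which is part of why teasing the cost models apart matters.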

Tom Larkworthy

06/20/2022, 9:51 PM
yeah I have seen this a lot too, e.g. achievements on fitness trackers (tell me when I have done 10,000 steps, over a 10K customer base). The specification is declarative but the data arrives incrementally. Due to the huge maintenance issues of trying to upgrade state on an incremental streaming solution, I tend to recommend just recomputing from scratch every X minutes and throwing horizontal compute at it (e.g. BigQuery). The system remains stateless then, plus you always need the from-scratch computation to recover from disaster, so it's the first thing to build anyway.
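A minimal sketch of that stateless pattern, with invented data and threshold (the real thing would be a scheduled warehouse query, e.g. BigQuery):
```python
def fetch_all_step_events():              # stand-in for the full-table scan
    return [("u1", 4_000), ("u1", 7_000), ("u2", 2_000)]

def recompute_achievements(threshold=10_000):
    totals = {}
    for user, steps in fetch_all_step_events():   # from scratch, every run
        totals[user] = totals.get(user, 0) + steps
    return {u for u, t in totals.items() if t >= threshold}

# Run on a timer/cron every X minutes; no incremental state to migrate,
# and the same job doubles as the disaster-recovery path.
print(recompute_achievements())   # {'u1'}
```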
I see this as a recurring theme of incremental computation, with analogues to differential equations:
```
declarative is to imperative as y = f(x) is to dy/dx = f(x, y)
```

Jack Rusher

06/21/2022, 11:39 AM
These are all Complex Event Processing (CEP) use cases. We built a system at a startup called Aleri in the mid-00s that handled this sort of thing with continuous queries implemented over a dataflow engine. We sold the company to Sybase, which sold to SAP, which -- so far as I know -- still sells the product today.