I built a prototype yesterday for that demonstrate...
# thinking-together
c
I built a prototype yesterday for that demonstrates an architecture that I think is really powerful: https://github.com/ccorcos/datalog-prototype/
A full-stack prototype of a collaborative web application backed by a Datalog-inspired database.
Let me know what you guys think 🙂
👍 3
a
Datomic is a great database for this kind of thing
p
If you’re inspired by Datalog, I’m curious about your reasons for choosing EAV over n-ary relations.
c
@Peter Abrahamsen I didnt find too much reading on the subject so please fill me in if I'm wrong -- the reason EAV is nice is because you can create 5 indexes to query 3-tuples in every possible way. Whereas large tuples are going to need larger sets of indexes which get expensive.
p
Sure, or you need some system for determining what indexes you need. EAV databases would seem a natural fit for graph data, while n-ary relations more naturally capture the statements logic programmers want to make. Since they’re interchangeable, it might be primarily a matter of taste. I’m asking because I’m genuinely curious, and because I went the other way—and it was mostly a matter of taste 🙂
c
Yeah, deciding on the indexes your need is dependent on the ontology of the data you're storing. This is totally fine if you are designing a system and know exactly what the requirements are. But my goal is to build something more general so that I can represent any information I want and be able to query it effectively.
I'd love to hear more about your project though 🙂 Do you have any code or prototypes you can share?
p
My code isn’t accessible, but I should have a prototype ready in a few weeks. The database side of it will sound familiar: a “pretty much Datalog” inferential database, backed by sqlite, plus an event log. Events will be shared peer-to-peer and will grow monotonically.
The network and database layers are in Rust, with platform-specific front ends (Swift on iOS for the moment).
The product is a conversational environment plus a kind of knowledge base. An interpersonal memex, if you will. Cross-referencing, history, etc built in. The first use case is planning a road trip, but I intend to make it very general purpose, eventually with the free-form nature of a spreadsheet.
The magic is in not fucking it up.
When I say prototype I mean, you’ll be able to send messages back and forth, and maybe it will look okay.
c
Wow, we're working on very similar projects! 🙂 My goal is to built a distributed offline p2p database for personal users -- something that could replace iCloud, for example. Also something that could enable developers to build entire applications writing only front-end code. Lots more cool ideas to after that... I'd love to chat sometime if you're interested. Any chance you're in SF? ccorcos@gmail.com 19165489415
o
I have just read through your code: it makes me discover and understand EAV and datalog that I didn't know before your introduction message. Very nice code. Thanks! 🙂
By the way, how do you envision the use of this for CRDT? When I played with CRDT/CausalTree, the root data structure was an append only event log with carefully chosen timestamps. Do you think there is some interest to implement this with EAV like your prototype?
c
Hey @ogadaki just seeing this. Definitely going to use an append only log. I have this feeling that I can design it in a way that doesn't need a vector clock for causal consistency, but I havent thought that through yet. Otherwise, CRDT is implemented as a higher-level concept on top of this database and required the developer to carefully model their data. For example, for good text editing CRDT, you need to model characters as a tree that you traverse depth-first to construct the string. This requires no changes to the underlying database architecture and the CRDT behavior is entirely at the application layer 🙂
o
I have this feeling that I can design it in a way that doesn't need a vector clock for causal consistency
A way to have causal consistency in an append only log, is to use a combination of a Lamport timestamp and an unique session ID, i.e. an ID unique to each site of editing. You order events based on the timestamp and when two events from two sites have the same timestamp you use the ID to choose the order in a deterministic way. It is actually what I have successfully used for my implementation of CausalTree/CRDT. You can find more information about this here: http://archagon.net/blog/2018/03/24/data-laced-with-history/. It is a very well written article and actually it is after reading it that I work on my Rust implementation of this. Here is en excerpt, explaining the ordering of the event log:
"This won’t be convergent by default since the operations don’t have an inherent total order, but it’s easy to fix this by giving each one a globally-unique ID in the form of an owner UUID
plus a Lamport timestamp. (In the diagrams below, this is encoded as “SX@TY”, where “SX” represents the UUID for site X, and Y is just the timestamp.) With this scheme, no two operations can have the same ID: operations from the same owner will have different timestamps, while operations from different owners will have different UUIDs. The Lamport timestamps will be used to sort the operations into causal order, leaving the UUIDs for tiebreaking when concurrent operations happen to have the same timestamp."
And in my demo, there is a dynamic illustration of this event log ID for eventual consistency. If you are interested, go there: https://ogadaki.gitlab.io/sunnytech-2019-wuer/, then click "show sequences" and then "play scenario". Or you can use the three editors to add some characters and sometime click on "broadcast content". I made this demo to help understand how Causal Trees work. If you happen to use it, any feedback would be very welcome! 🙂
p
That post really made a lot of things click for me (as it did many others) when I found it. Will you be publishing your Rust implementation?
c
The problem with causal consistency is that the clock gets really large with an arbitrary number of collaborators... I’m hoping that I can come up with some way of indexing the transaction log so I can query for the last transaction involving some fact and enforce causality that way.
âž• 1
This article looks great. Thanks for sharing :)
👍 1
p
That’s a problem with Lamport clocks, but they give you more information than you generally need in practice. In collaborative text editing, as in git, the operations themselves form logical timelines, and the site ids of two operations are used just as tiebreakers, as Nicolas mentions.
If we’re talking about interactive, human-scale systems, though, eventual consistency isn’t that impressive a goal. There’s a lot of opportunity to explore different application-specific, visual presentations of timelines, for example, with or without divergence. Most of the time a user won’t care, but occasionally they might ask, was the person who made this change aware of this other change? Not all forms of inconsistency can be identified at the database or application layers, or indeed encoded at all.
I’m excited to get to the point where I can explore some of these ideas, but I have a lot to do before then.
o
The Rust code of my implementation can be seen here : https://gitlab.com/ogadaki/sunnytech-2019-wuer. But, it was only an implementation to show how to mix Rust and Vue.js using WebAssembly, so the Rust part doesn't comprise only the algorithm but also some code to communicate with JavaScript one. And it is just a prototype.
👍 1
@Chet Corcos wrote:
The problem with causal consistency is that the clock gets really large with an arbitrary number of collaborators... I’m hoping that I can come up with some way of indexing the transaction log so I can query for the last transaction involving some fact and enforce causality that way.
Yes, that concerns make me also not completely satisfied with this kind of implementation of event log. Let me know about your progress on this! 🙂
👍 1
@Peter Abrahamsen wrote:
Not all forms of inconsistency can be identified at the database or application layers, or indeed encoded at all.
I totally agree! The consistency addressed with the TimeStamp+SiteId is only to be sure the event log order is strictly the same on two sites, but it is a work at an upper abstraction to deal with the "user-faced" consistency, and maybe occasionally asking a user to manage manually the conflicts (Ă  la merge conflict in git for example).
Coming back to the question of "very large timestamp", the difficulty is that we need a way to keep the order of a long event log, in a robust a way regarding partition of appending in that log. This potentially large index seems necessary. Or maybe, one can store events as a basic list, and only when there is a fork there is a need to maintain a kind of index offset in the list on all the branch of the fork? Then, when forked sites need to merge, they can give the index of the start of the fork + their offset from this, and I guess we have all the information to have nearly a similar behavior of Lamport TimeStamp? At least regarding our needs?
p
(Realizing I confused Lamport timestamps and vector clocks earlier.) Maybe I’m forgetting something, but I don’t think there is a problem for ORDTs? (I’m not using ORDTs, at least not yet.)
Chronological (wall time) LWW is sufficient for my purposes for the next while. Everything mutable is mutable only by a specific site, or several sites controlled by the same person. Once that becomes insufficient, I will start attaching some version metadata, but probably still bounded by the set of sites that can modify, or have modified, a particular entity.