# thinking-together
j
I have been working with large-ish knowledge graphs on and off over the course of the year, and I keep running into the same problem... my intuition for what I have done when I modify the graph and how it will affect everything else is just dismally bad. I make what seems like it will be an incremental change, and it doesn't do what I expected, or if it does, it breaks all my algorithms. This happens with other data structures, too, obviously. But with graph data it happens far more often, as in nearly always, and the time between "it isn't working" and "oh, I know how I broke it" is much longer. So I'm wondering, assuming that my experience is not unique, is there something about graphs that makes them harder to grok? The absence of a clear serialization, the absence of order, complexity of structure? Or is it a UI problem, and if I could just see what I was doing it wouldn't be so bad... Often, when I come to a particularly hard problem, I sample and visualize the graph, and learn that the previous three problems were not correctly solved. So visualization helps, but is that enough? Or is there something fundamentally inhuman about large, dense graphs?
o
can you give an example of an issue you encountered?
j
In general, maybe? Like I have at least twice added edges of a different type to reflect a different relationship, and failed to appreciate the need to exclude those edges from how other algorithms in the system work, because the new edges represent relationships of a different type. And that mess-up has been invisible to me until I was investigating why steps far after that algorithm was applied were broken. But that's just one example, and it's not about the specifics. It's a pattern I'm beginning to see, and I wonder if it represents something, or if it's just the effects of a learning curve, or what.
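To make it concrete, here's a toy sketch of that failure mode (plain Python; the graph, edge types, and function names are all invented for illustration):

```python
# A graph stored as an adjacency list where each edge carries a "type"
# label. An algorithm written before the new edge type existed silently
# traverses the new edges too.
from collections import defaultdict

graph = defaultdict(list)  # node -> list of (neighbor, edge_type)

def add_edge(u, v, edge_type):
    graph[u].append((v, edge_type))

add_edge("a", "b", "is_a")
add_edge("b", "c", "is_a")
add_edge("a", "x", "mentioned_with")  # new relationship type, added later

def reachable(start):
    """Naive reachability: follows *every* edge, including the new type."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(v for v, _ in graph[n])
    return seen

def reachable_typed(start, allowed):
    """Fixed version: only follows edges whose type is in `allowed`."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(v for v, t in graph[n] if t in allowed)
    return seen

# reachable("a") now quietly "leaks" across the new edge type and
# returns {"a", "b", "c", "x"}, while reachable_typed("a", {"is_a"})
# returns only {"a", "b", "c"}.
```

Nothing crashes, and the wrong answer only surfaces several steps downstream, which is exactly the "invisible until much later" pattern.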
a
In my experience, even linked lists are hard like that. It doesn’t seem like the problem goes away until you eliminate edges entirely and have arrays.
In trees whose shape mirrors the shape of the recursive computation exactly, things can be simple. Thinking locally works.
m
"working with knowledge graphs" makes it sound like you are clicking away in some notetaking™ UI, and not programmatically modifying a particular data structure. if "mistakes while programmatically modifying" is the case – it is similar to working with a dynamic (python/clojure) vs static (java/haskell) lang codebase: you know the general rules "how to call a method on a thing/get an attribute of a thing", but a dynamic-lang env (generally) hits you with "that thing has no such method/attribute" at runtime, static (generally) – at compile/IDE time. maybe it can be generalized as: there are rules that either 1) you are unaware of, or 2) are not enforced "in time" (whatever "enforced" and "in time" mean for you, eg. red squiggly line in IDE, or absence of method in autocomplete dropdown, etc.)
n
This post covers the challenges with graphs: https://www.hillelwayne.com/post/graph-types/
j
@Misha A, the problem is not using the library. The problem is that after I do something using the library, the representation in my head of what the output will be and how it will behave elsewhere in the code is dramatically wrong.
@Nilesh Trivedi thanks for the Hillel link. I'm a fan of his and hadn't seen it before. But again, that's addressing bad libraries. The problem I'm describing is with me.
But I wonder if this isn't the same category of problem that strict typing is supposed to solve, where it detects for you when the shape you are expecting is not the shape that will arrive. But some of the properties that I would need to specify would be across wide portions of the graph, at arbitrary levels of complexity. I could define two OWL ontologies, and use reasoners to validate the before and after versions of the graph data. But that seems like a nightmarish level of overhead. Pydantic for graphs, maybe. 😅
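A toy sketch of what a "Pydantic for graphs" might look like in miniature (plain Python; the node kinds, edge types, and `validate` helper are all hypothetical): declare which edge types may connect which node kinds, then check the data against those expectations.

```python
# A tiny knowledge graph: node -> kind, plus typed edges.
nodes = {"alice": "Person", "bob": "Person", "acme": "Company"}
edges = [("alice", "works_at", "acme"),
         ("alice", "knows", "bob")]

# Schema: edge type -> (allowed source kind, allowed target kind)
schema = {"works_at": ("Person", "Company"),
          "knows": ("Person", "Person")}

def validate(nodes, edges, schema):
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for src, etype, dst in edges:
        if etype not in schema:
            errors.append(f"unknown edge type {etype!r}")
            continue
        want_src, want_dst = schema[etype]
        if nodes.get(src) != want_src or nodes.get(dst) != want_dst:
            errors.append(f"{src} -{etype}-> {dst} violates "
                          f"{want_src} -> {want_dst}")
    return errors

# validate(nodes, edges, schema) returns [] here; adding an edge like
# ("acme", "knows", "bob") would produce a violation message instead.
```

This only covers per-edge shape, nothing like the wide cross-graph properties mentioned above, but it's the cheap end of the same spectrum that OWL reasoners sit at the expensive end of.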
m
when in doubt – print it out!
d
There's a great book on this, Constraint Theory by Friedman and Phan. It starts with a description of how much of a graph someone can fit in their head and reason about, then describes how to make a graph out of your models, then how you can develop some intuition of what to look for, and finally ends with how to automate these to much larger graphs semi-efficiently
However, their technique was designed for aerospace engineering, so generally it focuses on feasibility (detecting over/exact constraint) and what questions we can ask the data ("is x vs y with z held constant meaningful/possible?") rather than an executable single output
t
At work we have a user-defined graph structure. It's a source of enough bugs that we had a whole project dedicated to "no broken graphs". We stopped trying to be clever and literally do a graph walk on every modification to ensure our constraints hold, e.g. it's still a DAG, every split node has a corresponding merge node, every node pointer is pointing at a node, etc., so bugs in one part of the system could not spill into the database. When the product started there were lofty expectations that we would be able to implement graph patching incrementally etc., but the practical reality was that we had to sacrifice fancy algorithms for constant brute-force validation of the structure. I think graph maintenance is tricky because we know the incremental approach is possible, but there is no nice out-of-the-box safe way of specifying it, so you are essentially working with an unconstrained dynamic model. In our case mistakes were quite costly and the graphs were not that big, so it was better ROI to enforce global constraints the dumb but reliable way.
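A minimal stdlib-only sketch of that kind of brute-force validation walk (the two invariants checked here, dangling pointers and acyclicity, are illustrative, not the actual product's constraints):

```python
def validate_graph(adj):
    """adj: dict mapping node -> list of successor nodes.
    Walks the whole graph and returns a list of constraint violations."""
    errors = []
    # 1. Every edge must point at a known node (no dangling pointers).
    for u, succs in adj.items():
        for v in succs:
            if v not in adj:
                errors.append(f"dangling edge {u} -> {v}")
    # 2. The graph must still be a DAG (iterative three-color DFS).
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in adj}
    for root in adj:
        if color[root] != WHITE:
            continue
        stack = [(root, iter(adj[root]))]
        color[root] = GRAY
        while stack:
            node, it = stack[-1]
            for v in it:
                if v not in color:
                    continue  # already reported as dangling above
                if color[v] == GRAY:
                    errors.append(f"cycle through {v}")
                elif color[v] == WHITE:
                    color[v] = GRAY
                    stack.append((v, iter(adj[v])))
                    break
            else:
                color[node] = BLACK
                stack.pop()
    return errors

ok = {"a": ["b"], "b": ["c"], "c": []}
bad = {"a": ["b"], "b": ["a", "ghost"]}
# validate_graph(ok) returns []; validate_graph(bad) reports both
# the dangling edge and the cycle.
```

It's O(nodes + edges) on every modification, which is exactly the "dumb but reliable" trade described above.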
j
Yeah, I'm getting the feeling that schema-free is like a double-edged nunchuck, where you spend more time hitting yourself in the armpit than hitting the thing you are trying to kill. Thanks for the feedback, everyone, it's helped a lot.
r
Can you think of this as missing a test suite for your graph + algorithms? Do you feel this is different than adding a change to code and then proving you didn't break anything? If it's not that different then perhaps if you built up a good suite you could run it periodically as you change your graph
j
I think of it more like missing Pydantic for graphs. I just need to validate the data shape against my expectations.
r
There are tools for writing "unit tests for data", like Great Expectations, etc., if you need more inspiration to read up on
Ah ok
If the structure you're trying to test is more arbitrary, only then is running a test suite necessary. Simpler checks would be schema validation like you say
I have found data test suites helpful for cases where it's really hard to predict how you might break something and it can't be expressed in just the schema
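For example, here's the kind of cross-cutting data check that a per-edge schema can't express (plain Python in the spirit of those tools; the dataset and rules are made up):

```python
# Typed edges in a toy knowledge graph.
edges = [("alice", "works_at", "acme"),
         ("bob", "works_at", "acme"),
         ("alice", "knows", "bob")]

def data_checks(edges):
    """Global rules over the whole edge set, returned as failure messages."""
    failures = []
    # A cardinality rule: each person should have at most one employer.
    # This spans multiple edges, so no single-edge schema can state it.
    employers = {}
    for src, etype, dst in edges:
        if etype == "works_at":
            employers.setdefault(src, []).append(dst)
    for person, companies in employers.items():
        if len(companies) > 1:
            failures.append(f"{person} has multiple employers: {companies}")
    # Another cross-cutting rule: "knows" must connect distinct nodes.
    for src, etype, dst in edges:
        if etype == "knows" and src == dst:
            failures.append(f"self-loop: {src} knows themselves")
    return failures

# data_checks(edges) returns [] for the data above; run it after every
# graph change, like a test suite.
```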
j
Yeah, I think I could see getting to the point where that would help a lot. For now, my problems are lower level.
t
the outcome of our "no broken graphs" project was adding transactional modification, so instead of mutating graphs one node/edge at a time, you applied a patch all-or-nothing, and we checked the post-conditions before committing the transaction. For us, our graphs needed certain matching nodes, a bit like making sure every "(" has a ")", but where the bit in the middle is a graph walk. Impossible to verify if you add one node/edge at a time, but totally possible when you apply transactional batches and maintain the invariants between transactions.
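The pattern looks roughly like this in plain Python (the patch format and the invariant are invented for illustration): apply the batch to a copy, run the post-condition checks, and only then commit.

```python
import copy

def apply_patch(graph, patch, invariants):
    """graph: dict node -> list of successors.
    patch: list of ("add_edge", u, v) operations, applied all-or-nothing.
    invariants: predicates that must hold on the *whole* resulting graph."""
    candidate = copy.deepcopy(graph)
    for op, u, v in patch:
        if op == "add_edge":
            candidate.setdefault(u, [])
            candidate.setdefault(v, [])
            candidate[u].append(v)
    for inv in invariants:
        if not inv(candidate):
            return graph, False   # reject: the original is untouched
    return candidate, True        # commit the whole batch

def no_self_loops(g):
    """Example post-condition checked between transactions."""
    return all(u != v for u, vs in g.items() for v in vs)

g = {"a": ["b"], "b": []}
g, ok = apply_patch(g, [("add_edge", "b", "c")], [no_self_loops])
# ok is True; a patch containing ("add_edge", "c", "c") would be
# rejected and the graph left as it was.
```

Mid-patch the candidate can be arbitrarily "broken" (the unmatched "(" state), which is precisely why the invariant only needs to hold at transaction boundaries.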
d
It sounds like most of you have graphs embedded into a non-graph structure. I wish it was easier to find a data structure/representation that makes graph-breaking operations non-representable (aka generate a data structure for a schema that includes read/write constraints)
j
That's actually not my issue. But it's not as if "node" and "edge" and "graph" are first order types in any language I'm using, so technically I suppose it's always a graph embedded in a non-graph structure!? My API doesn't let me do anything that would make it "not a graph". The problem is just that I am constantly creating "not the shape of graph I was expecting".
a
I thought this was a fun, related article. It won’t be of much help pragmatically, but may provide some catharsis.
> So which algorithms should come with the library? “The amount of things people want to do with graphs is absurd,” Kelly told me. That matches my experience, and the experiences of all my interviewees. It sometimes seems like graphs are too powerful, that all their possibilities are beyond my understanding.
https://www.hillelwayne.com/post/graph-types/
d
I am having trouble understanding the problem and I would appreciate more concrete examples. Is it that syntactical shapes don't align with expectations, or is it that the semantics of the query and the representation don't sufficiently match to get the right answer after an evolution of the KG? Each of these needs a very different approach to improve the situation. I have worked with very large KGs, and found them very workable, but you need to constantly garden them, preferably automatically detecting issues, and adding a lot of tests / constraints that are automatically checked, to your data.
d
> The problem is just that I am constantly creating "not the shape of graph I was expecting".
What I mean is a generated data structure that can only express the shapes you expect; aka any operation on the data structure keeps it valid. Preferably without an expensive/exhaustive "check if the entire graph is valid if I make this change"
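One toy reading of that idea in plain Python (the class and method names are invented): a DAG type whose only mutation path re-checks just the affected region, so a cycle-creating edge can never be constructed and no whole-graph validation pass is needed.

```python
class Dag:
    """A directed graph that stays acyclic by construction: the only way
    to mutate it is through methods that preserve the invariant."""

    def __init__(self):
        self._succ = {}

    def add_node(self, n):
        self._succ.setdefault(n, set())

    def add_edge(self, u, v):
        self.add_node(u)
        self.add_node(v)
        # Local check: u -> v creates a cycle only if u is already
        # reachable from v, so we search just from v, not the whole graph.
        stack, seen = [v], set()
        while stack:
            n = stack.pop()
            if n == u:
                raise ValueError(f"edge {u} -> {v} would create a cycle")
            if n not in seen:
                seen.add(n)
                stack.extend(self._succ[n])
        self._succ[u].add(v)

d = Dag()
d.add_edge("a", "b")
d.add_edge("b", "c")
# d.add_edge("c", "a") raises ValueError instead of corrupting the DAG.
```

The check still costs O(nodes reachable from v) per insertion, so it's cheaper than a global walk but not free; it just moves the enforcement to the operation, where the mistake happens, instead of a later validation pass.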