# thinking-together
f
Hi, I've been thinking lately about how end-users store and back up their data, and came to the conclusion that there's currently no really good solution for it. I've written a blog post about the topic and would be interested to hear your thoughts. I have a couple of ideas about what my dream system could look like and have started building a prototype. I hope to find time to write about that soon. Here's the post: https://fkohlgrueber.github.io/blog/data-storage-part-1/
I'm happy to discuss in this thread, on Twitter, or via email. Have a great day everyone 😉
@Ivan Reese I wasn't sure whether to post here or in #CCL5VVBAN. Feel free to move the post if you find another channel more appropriate.
👍 1
c
@Felix Kohlgrüber, I'm currently reading through your write-up, but my first thought was to include a note on tagging systems like Perkeep (multi-client, non-hierarchical, searchable blob storage) and, more recently, Supertag (a non-hierarchical FUSE filesystem).
Perhaps good for part 2 🙂
f
Hi @Cole, thanks for your feedback. The organization of files is definitely something to be covered in a future post. I'm currently thinking that a flat hash-based file store would be a good foundation on which other interesting things like tags or hierarchies could be implemented. Before that, though, I'd like to figure out the "distributed" part. That's the next step for me.
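To make that a bit more concrete, here's a minimal sketch of such a flat hash-based store (Python; the names and on-disk layout here are made up for illustration, not taken from the post):

```python
import hashlib
from pathlib import Path

# Illustrative sketch: every blob lives in one flat directory, named by
# the SHA-256 of its content. Tags, hierarchies, etc. would be separate
# indexes mapping human-friendly names onto these hashes.

STORE = Path("store")

def put(data: bytes) -> str:
    """Store a blob and return its content hash (its only 'filename')."""
    digest = hashlib.sha256(data).hexdigest()
    STORE.mkdir(exist_ok=True)
    path = STORE / digest
    if not path.exists():  # identical content is stored only once
        path.write_bytes(data)
    return digest

def get(digest: str) -> bytes:
    """Retrieve a blob by its content hash."""
    return (STORE / digest).read_bytes()

blob_id = put(b"hello, flat hash-based storage")
assert get(blob_id) == b"hello, flat hash-based storage"
```

One nice property: deduplication falls out for free, since identical content always maps to the same hash.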
👍 1
g
this triggers my belief that there should be a good, cheap system for everyone to have a low-power, high-friendliness personal server with great ux. if you feel like that’s related enough i’d be super happy to chat anytime! definitely interested either way
💭 1
👆 1
k
Data deletion: the flip side of not accidentally deleting data is confidence that deliberate deletions are truly deleted, not cached on a server somewhere.

One missing axis here is incrementality of backups. Time Machine saves only the differences relative to the previous backup. Besides being efficient, incremental backups protect against accidental deletion, since older snapshots keep deleted data around. They can be layered on your 3 categories using tools like duplicity or borgbackup, and services like rsync.net and Backblaze work particularly well with them.

One drawback of incremental backups (and of encryption) is that it's easy to get into a state where you need to download gigabytes of data just to access one file. That seems worth stating explicitly under "universal access to data". It's implied by the words there, but won't be obvious to readers.
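For readers unfamiliar with how such tools work, here's a rough sketch of the mechanism (hypothetical Python; fixed-size chunks for simplicity, whereas real tools like borgbackup use content-defined chunking plus encryption):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # real tools pick chunk boundaries by content

def snapshot(data: bytes, chunk_store: dict) -> list:
    """Split data into chunks; store only chunks not seen before.
    The returned list of hashes is a complete, restorable snapshot."""
    manifest = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:  # this is the "incremental" part:
            chunk_store[digest] = chunk  # unchanged chunks cost nothing
        manifest.append(digest)
    return manifest

def restore(manifest: list, chunk_store: dict) -> bytes:
    """Reassemble a snapshot. Note that this touches every chunk, which
    is exactly the 'download gigabytes to access one file' problem."""
    return b"".join(chunk_store[d] for d in manifest)
```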
f
@Garth Goldwater I fully agree. Laptops, tablets and smartphones (which make up most of non-tech people's devices) aren't well suited for really persistent and safe data storage. Limitations include power consumption, size / upgradability, availability, risk of loss, ... I've been thinking about what I've been calling Personal / End-User Home Servers, which is probably very close to what you're describing. This would be a device that's as easy to set up as an Amazon Echo and serves as the central place where personal data / communication is handled. So yes, this topic is definitely related.

Ideally, such a device wouldn't be required though. I imagine a distributed system that'd also work without a "server". For example, I'd like to be able to use data sync & replication with only my smartphone and laptop connected. This would lower the system's barrier to entry (no need to buy and set up another device to get started), while still letting you improve capacity, availability, etc. by connecting such a home server device.
💯 1
@Kartik Agaram That's right, confidence that the data you deleted is really gone is important too. I would consider incrementality (is that even a word?!) a performance optimization. Conceptually, it doesn't matter whether backups or updates are incremental or not, right? Accidental deletion can be prevented by keeping a history (which might be backup snapshots, file system journals, git, ...), but that's independent of incrementality. Incremental updates will become very important in the implementation though.

And that's a good point about having to download gigabytes just to access one file! One option is to store the current state in full, along with increments that contain the differences "going backwards". More recent versions (which are more likely to be accessed) are cheaper to get this way. I'm not sure, but I think git does something like this internally.
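A sketch of that reverse-delta idea (hypothetical Python, using line-based diffs for simplicity; git's packfiles use a similar trick, though the details differ):

```python
import difflib

# The latest version is stored in full; older versions are reconstructed
# by applying backward deltas, so recent reads stay cheap.

def make_delta(new: list, old: list) -> list:
    """Ops that turn `new` back into `old` (a backward delta)."""
    sm = difflib.SequenceMatcher(a=new, b=old)
    ops = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse lines from `new`
        else:
            ops.append(("insert", old[j1:j2]))  # literal lines from `old`
    return ops

def apply_delta(new: list, ops: list) -> list:
    old = []
    for op in ops:
        if op[0] == "copy":
            _, i1, i2 = op
            old.extend(new[i1:i2])
        else:
            old.extend(op[1])
    return old

class History:
    def __init__(self, initial: list):
        self.latest = initial  # full copy of the current version
        self.deltas = []       # backward deltas, newest first

    def commit(self, new: list):
        self.deltas.insert(0, make_delta(new, self.latest))
        self.latest = new

    def version(self, n_back: int) -> list:
        v = self.latest
        for ops in self.deltas[:n_back]:
            v = apply_delta(v, ops)
        return v

h = History(["line 1", "line 2"])
h.commit(["line 1", "line 2 edited", "line 3"])
assert h.version(0) == ["line 1", "line 2 edited", "line 3"]
assert h.version(1) == ["line 1", "line 2"]  # one step back costs one delta
```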
k
@Felix Kohlgrüber First of all, your description of the status quo seems correct and complete. As someone trying to help non-techies (also called "family") manage their electronic data, I can only confirm that there is no good solution today. One aspect I wonder about is the conflation of data storage, data syndication/synchronization, and backup. There are decent solutions for each aspect, but none for the combination, which is, however, exactly what matters for end users. The syndication/synchronization aspect is perhaps the hardest to solve because it is inherently cross-platform, and today's tech world is more oriented towards platform domination than cross-platform mediation.
f
@Konrad Hinsen Thanks! That's right, I also noticed during proofreading that I was mixing synchronization, storage and backup in the post. It's pretty hard to keep these aspects apart though. Synchronizing data between devices (when done correctly) also serves as a backup and can improve data availability, regular backups give you history (which is also a feature of many file systems), etc. Thinking about all of these aspects from the start will hopefully lead to a simple solution that performs well across all these use cases. Synchronization is indeed a tough problem. I hope that a simple data model and great UX will get me pretty far.
g
@Felix Kohlgrüber yeah you hit the nail on the head for the kind of thing i was envisioning. the biggest problem is honestly port forwarding for end users. i think that kind of thing could be prototyped on a raspberry pi and use something like hypercore as the storage engine
f
@Garth Goldwater there are a couple of options for that: UPnP, hole punching, etc. I'm not an expert, but other p2p software seems to have solved this already. I was thinking of using libp2p (which, for example, powers IPFS), but hypercore looks interesting too. I'll check it out, thanks :-)
k
Network configuration is indeed an issue, in particular since it involves routers, over which users sometimes have limited control. On the software side, the best one can do is offer a software distribution with minimal installation/maintenance effort (something like YunoHost, https://yunohost.org/). Users then have two options: (1) an RPi or similar at home, which implies router configuration, or (2) renting a VPS for hosting. Service providers could then step in and make (2) a zero-effort approach.