The wonderful sound of an atomic commit

February 11, 2020

There's a horrifying realization as you work with software of how fragile it is. Failures can happen almost anywhere, and a lot of code isn't equipped to deal with it.

We can at least assume, in JavaScript, that when your code is executing it will keep executing until it either ends or waits for something. Your OS could still crash so JavaScript technically isn't "run-to-completion". That's relatively safe though; your program would stop in the middle of some in-memory computations which isn't a big deal.

No, the real hairy stuff happens whenever you need to access the outside world. Which of course, all practical programs do all the time. If you need to read or write from a file or the network, you pretty much have to assume it's going to fail. Which means almost all code paths in JavaScript that use await to wait on something is going to fail. It's arduous and difficult.

What's worse is anything interesting involves multiple interactions with the outside world. Let's say you have a bunch of static files for your blog and you want to add "© James Long 2020" to the bottom of all of them (you don't use a static site generator because you can't afford the disk space of pulling from npm). You write a shell script to loop over the files and append the text, but then your script errors in the middle of it! Your left with a half-modified set of files… and there's no easy way back to the good state of things.

We can find solace in a few things that bring sanity to this world though. One of those things is atomic behavior — something which you can rely on either succeeding or failing entirely. There's no "half states". Anything providing atomic behavior gives me warm fuzzies and I like to go to sleep holding it.

Going back to our file example: one solution would be to copy the file somewhere temporary, modify them there, and move them back. You'd have to move them back under a new name, so let's assume you have "posts" and the new modified directory will be "posts-2":

mv /tmp/modified-posts ./posts-2

If "posts-2" exists after your script ends, you know all the files in it are modified. Unfortunately even the question "is mv atomic on my fs?" is complicated to answer, but generally speaking it is.

This example might not seem so important, so let's improve it. Say you're working on a server that is serving your blog files. You're so interesting and everybody wants to read your stuff, so you get 1000 requests per second. You don't want any down time, and you never want any "halfway states". How do you modify the contents of the posts directory that the server is reading from?

If posts is a regular directory, you can't. You're going to need at least two commands, one to move that directory away and another to move the new one in. The answer is something you have to do from the beginning: make posts a symlink. With symlinks, you can atomically change what they link to:

// The world will change in a single instant, atomically
ln -sf ./posts ./files/modified-posts

I don't know about you, but whenever I do something atomic, I can almost hear a BOOM as it's committed.

-s means create a symbolic link, and -f means to overwrite an existing symlink (on macOS, it's different for each OS). The nice thing is that the filesystem guarantees that this is an atomic operation (as far as I know). Symlinks are a great way to allow data to change: you can go off and do a bunch of work on files and then update the symlink to the new location if everything succeeds.

Sadly, so many tools don't work atomically. Imagine if you could ctrl+c to stop a npm install and your filesystem was exactly the way it was before you started the command? I love anything that allows me to make mistakes and then backtrack to where I was before.

There's a relationship to immutability in programming languages here, where immutability forces you to make modifications somewhere else and then swap it atomically so that the world sees it. In fact Nix, an immutable filesystem, is exploring exactly that.

I've gone off on a tangent. My sorrow for more reliable tooling finds solace in a place where atomicity is par for the course: databases.

My real-world use case

What inspired me to write this post is recently I worked on improving the reliability of syncing in Actual. Actual is a local app and keeps all data locally, and syncs changes across devices. It's critical for the syncing layer to be as robust as possible, otherwise it could corrupt your data.

Internally, Actual uses sqlite to store data. The best thing about SQL databases is they offer "transactions" as way to deal with multiple changes across time (the A in ACID stands for Atomic). Using a transaction, you can make a bunch of changes and then COMMIT them atomically to the database, or use ROLLBACK to revert those changes. The important part of COMMIT is it guarantees an atomic commit to the filesystem.

The syncing process goes something like this:

let messages = getNewMessages(receivedMessages);
let clock = getClock()
messages.forEach(msg => {
`UPDATE ${msg.table} SET ${msg.column} = ? WHERE id = ?`,
[msg.value, msg.id]
`INSERT INTO messages_crdt (timestamp, table, id, column, value) VALUES (?, ?, ?, ?, ?)`,
[msg.timestamp, msg.table, msg.id, msg.column, msg.value]
clock = insertMessage(clock)
'INSERT OR REPLACE INTO messages_clock (id, clock) VALUES (1, ?)',

Don't worry too much about the algorithm (it's greatly simplified) but we need to make several changes and they have to be applied atomically — all at the same time. We change the actual data, insert the message into the CRDT, and after processing all the messages store the new clock. If any of these changes got comitted without finishing the process, the syncing data would be corrupted.

We can do this by using a transaction, which is easy:

// … make all my changes …

I can't help it — whenever I see a COMMIT I can't help but imagine it makes a joyful BOOM sounds when it successfully commits. It's such a nice thing to hear.

That is how syncing has worked in Actual up until the next update. Unfortunately, there's a sneaky problem with the above code. We have to remember how fragile everything is: what happens if one of the SQL statements fails?

We only want all of these updates to commit if all of them succeeds, otherwise we want to rollback the entire set of changes. This way, the sync update is very much an atomic block of changes. It either happens or it doesn't.

Unfortunately, if a sqlite query fails it stops executing, but the transaction is left open. Some future code will eventually do a COMMIT for their own stuff and end up committing some leftover changes accidentally.

My problem in Actual was even worse for reasons I'll only summarize: runQuery used to be async and I wasn't awaiting on it on purpose. I don't support async transactions because it requires so much machinery (you need to block all async work until you're done), and sync is so much simpler. I was firing off updates linearly which works if all queries are successful, but if one fails my code would happily continue to execute and run all queries regardless.

I use better-sqlite3 which is a freaking fantasic libary because it's synchronous (go read the reasons why). To fix my problems I need to do a few things. First, I need to make runQuery synchronous, because better-sqlite3 is sync anyway (it was async for reasons I won't go into).

Now that runQuery is sync, it will properly stop executing when a query fails. But I still need to do a ROLLBACK to make sure no updates are leftover. The resulting code looks like this:

try {
// … make all my changes …
catch(e) {
// Yeesh, nevermind
throw e

Even better, better-sqlite3 comes with a builtin transaction function which does exactly this (and also supports nested transactions) so the code ends up looking like this:

let run = db.transaction(() => {
// … make all changes …

If an exception is thrown within db.transaction, it will automatically rollback the transaction. I've run a bunch of property tests against the newly improved syncing layer and it's had no problems, even if things fail sometimes.

It feels really good to harden the syncing layer. You could throw your computer out the window while it's syncing and it would recover fine from it. Your hardware might not though.

All of these changes will be available in the next update 0.0.114 of Actual, coming this week