3. Relational / Document Trade-Offs

There are no solutions. There are only trade-offs.
— Thomas Sowell

While the context of this quote is economics, it is a concept that has many applications, including this topic. There are generally accepted principles of data storage, proved in enterprise applications representing billions in commerce annually. The site you're reading is written by one person, whose business has occasionally crossed the threshold to profitable (but it's been a while). Do we think this site has the same data storage needs as a Fortune 50 enterprise?

Some would say yes. To get to this page, you have likely clicked links that needed to point to pages that actually exist. The software running the site needs to know who I am, and record me as the author of this (and every other) page. It has content, and that content is related. Every time I save a page edit, the software records a revision; each revision needs to be tied to the right page, and if a page goes away, all its revisions should as well.

Most people, though, would probably say no. Of course I do not need large, distributed data centers with dozens of employees supporting my data storage needs. Even if I structure my database poorly, leftover revisions from a deleted page are likely not going to cause a blip in performance, much less fill up a disk. If I do something to mess up a database, in the worst case, I can drop back to the previous night's backup.

"OK, when did become about the author?", you may be thinking. It isn't (though if you would like to help make this profitable, reach out!); it's an illustration that, while the principles are good - and I'm about to defend them - they are not the only way. By understanding the principles, and the trade-offs, you may be able to reduce complexity in your application.

The ACID Test

Relational databases, as a general rule, are ACID-compliant. This set of principles (summarized) mean that:

Transactions are treated as a single unit, whether they are a single statement or multiple statements; it all works, or it all fails (an “atomic” transaction, atomicity)
A transaction cannot leave the data store in an inconsistent state; all constraints must be satisfied (consistency)
Concurrent transactions cannot see other in-progress transactions (isolation)
Transactions reported as successful will still be there, even if the server goes down, is interrupted, etc. (durability)

These principles were a part of the data structure we designed in the first page. The links between the author and book, and patron and book, fall under consistency; if we tried to check a book out to patron 1234, and that patron did not exist, the transaction would fail. If two librarians are checking out two different books to two different patrons at the same instant, there should be no problem (isolation). However, if they are trying to check out the last copy of the same book - well, at that point, we must decide how to handle it; absent handling strategies, the second attempt will fail (isolation, consistency).

Distributed Data

Even with advances in CPU and storage, there are limits to what one database server can do. “Edge computing,” pushing content as close to its consumer as possible, is easy to do with static files, but can be more challenging for data - especially if ACID data is required. There are several strategies, and their complexities are well beyond our scope here; we'll summarize a few here, because it will help with our consideration.

Sharding - Data within the database is physically placed in the database based on the value of a field. “Region,” “year,” and “first letter of last name” are all valid sharding strategies.
Replicas - The database, in its entirety, is replicated to other locations. Read-only processes can look at these replicas, rather than the main database, if no updates are required. This reduces the load on the main database, and a replica can be promoted to main if the main becomes unavailable.
Clustering - A clustered setup designates one instance as the controller, and other instances as workers. (Often, the controller can also be a worker; that's just not its main job.) A worker can read and write, and communicates writes to the controller, which then distributes updates to the other workers. The term “eventual consistency” is often used with this structure.

Many document databases expect to be clustered from initial install; understanding that makes a lot of their other decisions make sense.

Do We Need…

Most of the trade-offs to consider revolve around needs concerning aspects of ACID. We'll look at the first three; while there may be esoteric applications that do not need durability, I'm not aware of any relational or document databases that do not guarantee that.

Atomicity?

While some document databases do support transactions, most guarantee statement-level atomicity, not transaction-level atomicity. To think through an example, let's think through removing a patron from our library. We would not want deletion of a patron to succeed if they have any books checked out, but if they have brought them back and want to close out their account, we want to handle that in one transaction. (In practice, we would probably inactivate them; but, for now, they're gone.)

In a relational database, we can do this easily. When deleting a patron, the application can look for the books they have checked out, display a list, and ask “Has the patron returned these books?” If the librarian clicks “yes”, the application can start a transaction; delete the book_checked_out rows for the patron; delete the patron; then commit the transaction. If that is successful, we know that the books have been returned and the patron has been deleted.

In our document database, we may not be able to do that. (Some databases do support transactions, but these may have different options.) Without transactions, we may need to execute more queries, and each one could succeed or fail on its own. Remember our document example has the checked-out books stored as an array within the book document. If the database supports removing items from an array, we can do that with one query; if not, we will need to retrieve the checked-out books, alter each array to exclude the patron, then update each book. Finally, we could execute a query to delete the patron.

A built-in mitigation for some of this comes in the form of the document itself. The more information stored within the document, the lower the risk that multiple queries will be needed. In our example, we do, but it's a bit contrived as well. For checking in a book, we just need to remove the checkout from the array. In a document database that does support in-place array manipulation, the transaction is a single query, just as it would be a single DELETE in the relational structure.

Consistency?

No one says “I don't need consistent data - just give me something!” However, consistency guarantees come with a cost. Relational databases must validate all constraints, which means that the developer must specify all constraints. This constraint enforcement can complicate backup and restore, which must be done in a certain order (though the relational workaround is to disable the constraints, load the data, then enable the constraints; if they fail, the backup was bad).

For document databases, consistency is not defined as constraints in the database. This does not mean that the logical constraints don't exist (remember, most data has structure and is related to other data), but it shifts responsibility for maintaining those constraints to the application. For example, this site uses document storage within a SQLite database, a hybrid concept we'll discuss more fully as we move into the libraries we've written to make this easier. The pages are documents, but the revisions are stored in a relational table. When a page is deleted, SQLite makes no attempt to keep its revisions from being orphaned.

The knowledge that the database makes no guarantees can bleed into how effective documents should be designed (also a future topic). Robust applications should treat most relationships as optional, unless its absence is one the application cannot work around. For example, the software that runs this site also supports blog posts and categories under which those posts can be assigned. The absence of a category should not prevent a post from displaying. The logic to delete categories also removes them from the array of IDs for a post, but there is no enforcement for that at the database level.

One note about “eventual consistency” - in practice, the “eventual” part is rarely an issue. Absent some huge network latency, most eventual consistency queries are consistent within the first second, and the vast majority are consistent within a few more. It's not consistent as a computer scientist would define it, but it's usually consistent enough for many uses.

Isolation?

The article linked above has a good description of an isolation failure, under the heading with the same name. A relational database with ACID guarantees will usually throw a deadlock condition on one or both of the updates. Document databases can have different ways of handling this scenario, but they usually end up with some form of “last update wins,” which can result in both “phantom reads” (where a document matches a query but its contents do not match by the time it is retrieved from disk) and lost updates.

That sounds terrible - why doesn't our consideration end here? The main reason is that isolation failures only occur with writes (updates), and they only apply to single documents. If your data is read more than written, or written-all-at-once then read, this is a low-risk issue. If you have one person updating the data, the risk rounds down to non-existent. Even in a multi-user environment, the likelihood of the same document being modified at the exact same time by different users is very, very low.

The concern should not be ignored; it would not be a principle of data integrity if it were not important. As with consistency, some document databases have the ability to require isolation on certain commands, and they should be used; the slight bit of extra time it will take the query to complete is likely much less than you would spend unwinding what would probably look like a data “glitch.” If the document database does not have a way to ensure isolation, consider application-level mitigations in cases where conflicting updates may occur.

A Final Consideration

As mentioned above, most document databases are designed with multiple instances in mind. What they do well is a quick update locally, then communicate that change up to the controller. You won't find things like sequences or automatically-increasing numeric IDs, because there is no real way to implement that in a distributed system. If you are using a single instance of a document database, many (but not all!) of the ACID concerns and exceptions go away. If an update requires a “quorum” of servers to report a successful update, but the entire cluster is 1 combination controller / worker, using things like transactions or isolation (if supported) will have no appreciable performance effect on your application.

Understanding the trade-offs in what we lose and gain from sticking to ACID or deviating from it can help guide our decision about which we want to target - and, once that decision is made, we will be writing an application which utilizes that data store. While the considerations here focused on the database itself, we'll turn to trade-offs in application development in our next section.

Next: 4. Application Trade-Offs

Previous: 2. What Are Documents?

« Using Relational Databases as Document Stores • Home

Loading…