Loading…

4. Application Trade-Offs

Working with Domain Objects

ORMs

When we first started, we mentioned Object-Relational Mapper (ORM) tools. They keep developers from having to write a lot of boilerplate code for every object and data item the application uses. Many of them provide the ability to track changes to the objects it returns, allowing the application to retrieve the object, update some properties, then tell the database to save changes; the tool determines the SQL statements needed to persist that change. It can handle changes, additions, and deletions for many objects in one request.

ORMs can also be safer. SQL injection attacks, as part of injection attacks in general, were ranked by the OWASP Top 10 at #1 in 2017 and #3 in 2021. An ORM tool will use parameters to safely pass field content in queries. They make the right thing to do the easy thing; this is its default behavior, and it has to be bypassed (in part or in whole) to write a vulnerable query.

The downside is that certain data structures do not translate well to relational databases. Consider a blog post with tags. In a relational database, the proper way to store these would be in a post_tag table, where the post's ID is repeated for each tag. Retrieving a post and its tags require multiple queries (or a nested subquery which could translate tags to a comma-delimited string). The ORM must translate this when the object is retrieved, and must decompose the domain object into its rows when updating data. (Some relational databases, most notably PostgreSQL, do have an ARRAY column type; in this case, that would simplify this scenario.)

Serialization

In general, “serialization” refers to the process by which a domain object is converted to a program-independent format; objects can be recreated by “deserializing” the output from a former serialization. (This used to be mostly text formats - plain text, XML, JSON - but can also be done used with common binary formats.) The rise of the Single Page Application (SPA), where data is exchanged using JSON, means that the majority of serialization is happening to and from JSON. As document databases store their documents in JSON form (or a binary version of it), we can build on this to get domain objects into and out of a document database.

If an application has JSON serialization strategies already defined for use in a SPA, these same strategies can (generally) be used with a document database. In some scenarios, this may completely eliminate a deserialization/serialization sequence (from the database to the application, then from the application to the browser); select the JSON from the database - and, for multiple results, combine the rows with , wrapped with [ and ]. Now, the server is passing text vs. getting text, making an object, then turning that object back into text.

One big downside, when contrasted with ORMs, is that document database drivers are likely not going to have the “update the object and save changes” paradigm of those tools. Document databases do not support row locking and other features ORMs use to ensure that their change-aware objects can be persisted.

Another difference is that an “update,” in document database terms, usually refers to replacing an entire document. The only property which cannot be changed is the ID; any other property can be added, updated, or removed by an update. To partially update a document, the term is “patch.” A patch specifies the JSON which should be present after the update, and search criteria to identify documents to be patched. (Patching by ID is fine.)

This is where some consistency risk increases. There is no currency check on what the value was when it was obtained, and even if some other process has patched it, the patch command will not be rejected. While patches that use commands like “increment” or “append” (provided by some document databases) will succeed as expected, others may not. Imagine a library patron whose last name was entered as Johnsen. The patron goes to the check-out desk to have it updated to Janssen - but someone in the back was doing a quality check on new patrons, and decided to “fix the typo” by correcting it to Johnson. If both these updates happened after users had retrieved Johnsen's record, the last one to press “Save” would overwrite the previous one.

Overfetching

Overfetching is retrieving more data from the database than is necessary for the process being conducted. It gets its own subheading because it is a problem common to both ORMs and document databases (although, in theory, the larger a document becomes, the more the problem is exacerbated). Depending on how much extra data is being returned, it may not be an issue in practice. In most cases, though, considering overfetching should likely be postponed until the process works, but no later. When your little hobby site catches fire, what used to run great can be brought to its knees retrieving data and populating objects for no reason at all.

Consider the admin area for a blog, where a list of posts is displayed. The list probably needs the title, author, status, and published date, but little else. We do not need categories, tags, comments, or even the content of the post.

Both ORMs and document databases usually provide “projections” (even if they use a different term for it). A projection is most simply understood as a different view of the data. For an ORM, perhaps we create a post_list view, and a domain object for those items; we can now query that view and only select the data we need. For document databases, some have some form of “without” statement that will exclude items from the document. (The application needs to handle these being missing.) Documents may be able to be patched in the result, leaving the database itself unchanged. (The query, paraphrased, is “give me the post document patched with the JSON object {"tags": [], "text": ""}.” The document returned would have an empty list of tags and no text.)

Working with Data

Documents do not have to be turned into domain objects. Perhaps you are using a language like PHP, where data structures are commonly returned as associative arrays (similar to a dictionary or map in other languages). Or, maybe you just need one or two items from a table or document without retrieving the entire thing.

Addressing Fields

Selecting a few fields in SQL is trivial; some ORMs make it easy to get to the underlying connection, and nearly all relational database drivers have the concept of a result set, where data items for the current row can be accessed by name, their index in the SELECT statement, or both.

Document databases usually provide the ability to retrieve arbitrary fields from documents, but their implementations can vary. MongoDB allows you to specify 1 for fields to include (and 0 to exclude ID, returned otherwise). RethinkDB provides the .pluck command to select certain fields and the .without command to exclude certain fields.

For documents stored in relational databases, there is syntax for selecting fields from documents similar to how columns from a table are selected. They are addressed the same way as columns which come from the table itself. As with document databases, though, the syntax varies, and may be implemented as custom operators or function calls.

Indexing

We have yet to discuss indexes to any great extent. They can bring huge performance boosts in either data paradigm. While we'll consider them fully when we dig into document design, a short consideration here will serve us well. Both relational and document databases use a unique index on the primary key column(s) or field; we'll look at others we may need.

In a relational database, foreign key fields should be indexed. These databases maintain integrity by checking values as described in the constraint, and they do this every time an INSERT or UPDATE is executed which sets a new value. If the foreign key is not indexed, the database has to search every row of the table manually (a “full table scan”). These help our application as well; for our patron / checked-out book association, the index will help us identify these rows quickly, whether we are starting from the patron or the book.

Other indexes can be created for fields commonly found in a WHERE clause. If we built a “find a patron by their e-mail address” process into our library system, we would likely want to index the email field we would add to their record. (Doubly so if they can also use that e-mail address to sign in to the library system to access resources there.)

Relational indexes are not free; they take up some space, and the database's ACID guarantees apply to indexes as well. This can slow down updates, particularly the more indexes need to be updated. A great starting point for indexes is primary keys (which the database does for you), foreign keys, and commonly searched items.

Document database indexing is one area where vendors can distinguish their product from others. The shape of the data (arrays, sub-documents, etc.) also require more indexing options. Most allow creation of an index on an array which can be used to mimic a SQL IN query. Some allow indexing computed values which are actually stored in the index, and can be retrieved from there. Some also allow for all indexed values to be also stored in the index; in these cases, queries that only require those fields do not have to retrieve the actual document, as the index can satisfy the query.

Document indexes may not be ACID-compliant, particularly with consistency. In some cases, an index can be explicitly not updated with the command that updates the data; it's executed in the background once the database system has said it's done. In other cases, the application can specifically request to wait until an index is consistent.

Interim Summary

We have looked at relational databases, document databases, trade-offs between their data stores, and now trade-offs from an application perspective. We have looked at both the strengths and weaknesses of each data model. What if we could get the benefits of relational data and documents at the same time?

Next: 5. Hybrid Data Stores

Previous: 3. Relational / Document Trade-Offs


« Using Relational Databases as Document Stores • Home