1. A Brief History of Relational Data
Relational Data
Relational databases were not the first data structure, but when people talk about document databases, their behavior is usually contrasted with relational databases. We do not need a PhD-level knowledge of these databases, but some high-level concepts will serve us well.
The “relation” in “relational database” names the concept that instances of data can be linked (related) to other pieces of data. Think of a system that keeps track of books for a library; some items they would need to track the book's title, its author, how many copies they have, and who has a copy checked out. If they were to make a new “book” entry for each physical copy, they would end up repeating nearly everything. With a relational database, though, we could structure tables where very little would be duplicated.
- An
author
table could hold the author's name, plus biographical information, etc. - A
patron
table could hold the library cardholder's information - The
book
table could have the name of the book and how many copies the library owns - A
book_author
table has the ID of a book and the ID of an author - A
book_checked_out
table could have the ID of a book, the ID of a patron, and the return date
In this example, we have 5 tables to hold the information, and two of those are there solely for the purpose of associating entities with each other. When we think about this structure, there are some interesting ways of parsing the data that weren't covered by the description above.
- Books can have multiple authors; this structure also provides an easy way to find books that an author has written.
- We can count the occurrences of a book in the
book_checked_out
table and subtract that from the copies we own to determine how many copies are available for check out. - We can easily track what book a single patron has checked out.
- If an author's name changes, when we update the
author
table, the system picks up the new name.
Notice the word “could” in the descriptions of the tables; there are different ways to define relations among entities, and database purists could also come up with scenarios that this structure does not cover. The intent here is to present an isolated yet non-trivial working example that we can use as we think through how this data is structured.
The ORM Bridge
In high-level programming languages, developers create structures with data representing some entity. Most database drivers deal in tables, rows, and columns. If we were to use one of those libraries, we would end up writing several queries and lots of property setting to constitute a domain object from the relational data in our tables.
An Object-Relational Mapper (ORM) library helps to translate between these structures and the relational database. There are lots of these tools, and they have some pretty solid up-sides in most cases. Take, for example, C# objects and Microsoft's Entity Framework Core (EF Core).
// This is the definition of a book public class Book { public long Id {get; set;} = 0L; public string Title {get; set;} = ""; public int CopiesOnHand {get; set;} = 0; } // This is how we retrieve a book // - ctx is an open database context (connection) // - theId is the variable with the ID of the book we want to retrieve var book = await ctx.Books.FirstOrDefaultAsync(book => book.Id = theId);
This illustrates the simplicity of code using an ORM library. Rather than create a query, execute it, check to see if anything was returned, then assign each column from the result to a property in the class - it's the single line above. (If no Book
for that ID exists, the book
variable will be null
.)
That's great for the book information, but that's not all we need to access; we need authors for the book and we need patrons to check them out. We're not trying to become EF Core experts, but adding this information looks something like this…
public class Author { public long Id {get; set;} = 0L; public string Name {get; set;} = ""; // naive // dates of birth/death, bio, etc. // A "navigation property" to find books by author public ICollection<Book> Books {get; init;} = new List<Book>(); } public class Patron { public long Id {get; set;} = 0L; public string Name {get; set;} = ""; public string Phone {get; set;} = ""; // skipping the navigation property here } public class CheckOut { public long BookId {get; set;} = 0L; public long PatronId {get; set;} = 0L; public DateTime ReturnDate {get; set;} = DateTime.Now; // "Navigation properties" public Book Book {get; set;} = default!; public Patron Patron {get; set;} = default!; } // A new Book class public class Book { // properties as before, then... // ...more navigation properties public ICollection<Author> Authors {get; init;} = new List<Author>(); public ICollection<CheckOut> CheckOuts {get; init;} = new List<CheckOut>(); }
Notice that the Author
has a collection of Book
s, and a Book
has a collection of Author
s. This is how the book_author
table is represented. For checked-out books, we actually have a type that links a Book
to a Patron
and also stores the return date. EF Core's “navigation properties” are how it exposes these other entities within .NET code. If we do that same simple Book
retrieval command from the first example, we can now traverse these properties to display the author's name, who has books checked out, and when they're due back.
Referential Integrity
The term “referential integrity” is used to describe how these relations are kept in sync. In our example above, we wouldn't want a book_author
record pointing to either a book or an author that does not exist. We would not want to allow a patron to be deleted while they still had books checked out. (And, we may not want to allow them to be deleted at all! If we need to keep records of who checked out what book, when, and when they were returned, we would lose those records if we deleted the patron.)
This is an area where relational databases excel. The IDs we have in some tables that point to tables where rows have that ID are called foreign keys. Relational databases allow us to define foreign keys that must exist; ones that must exist or can be missing; and what should happen when the parent record is deleted. To accomplish this, indexes will be applied to these key fields; this not only lets the integrity checks happen quickly, these can also be used to create queries to join the information efficiently.
So… What's the Problem?
There are no problems, only trade-offs. Let's look at how we get to documents from here.
Next: 2. What Are Documents?