BitBadger.Documents
Advanced Usage: Referential Integrity
One of the hallmarks of document database is loose association between documents. In our running hotel and room example, there is no technical reason we could not delete every hotel in the database, leaving all the rooms with hotel IDs that no longer exist. This is a feature-not-a-bug, but it shows the tradeoffs inherent to selecting a data storage mechanism. In our case, this is less than ideal - but, since we are using PostgreSQL, a relational database, we can implement referential integrity if, when, and where we need it.
NOTE: This page has very little to do with the document library itself; these are all modifications that can be made via PostgreSQL. SQLite may have similar capabilities, but this author has yet to explore that.
Enforcing Referential Integrity on the Child Document
While we've been able to use data->>'Id'
in place of column names for most things up to this point, here is where we hit a roadblock; we cannot define a foreign key constraint against an arbitrary expression. Through database triggers, though, we can accomplish the same thing.
Triggers are implemented in PostgreSQL through a function/trigger definition pair. A function defined as a trigger has NEW
and OLD
defined as the data that is being manipulated (different ones, depending on the operation; no OLD
for INSERT
s, no NEW
for DELETE
s, etc.). For our purposes here, we'll use NEW
, as we're trying to verify the data as it's being inserted or updated.
CREATE OR REPLACE FUNCTION room_hotel_id_fk() RETURNS TRIGGER AS $$ DECLARE hotel_id TEXT; BEGIN SELECT data->>'Id' INTO hotel_id FROM hotel WHERE data->>'Id' = NEW.data->>'HotelId'; IF hotel_id IS NULL THEN RAISE EXCEPTION 'Hotel ID % does not exist', NEW.data->>'HotelId'; END IF; RETURN NEW; END; $$ LANGUAGE plpgsql; CREATE OR REPLACE TRIGGER hotel_enforce_fk BEFORE INSERT OR UPDATE ON room FOR EACH ROW EXECUTE FUNCTION room_hotel_id_fk();
This is as straightforward as we can make it; if the query fails to retrieve data (returning NULL
here, not raising NO_DATA_FOUND
like Oracle would), we raise an exception. Here's what that looks like in practice.
hotel=# insert into room values ('{"Id": "one", "HotelId": "fifteen"}');
ERROR: Hotel ID fifteen does not exist
CONTEXT: PL/pgSQL function room_hotel_id_fk() line 7 at RAISE
hotel=# insert into hotel values ('{"Id": "fifteen", "Name": "Demo Hotel"}');
INSERT 0 1
hotel=# insert into room values ('{"Id": "one", "HotelId": "fifteen"}');
INSERT 0 1
(This assumes we'll always have a HotelId
field; see below on how to create this trigger if the foreign key is optional.)
Enforcing Referential Integrity on the Parent Document
We've only addressed half of the parent/child relationship so far; now, we need to make sure parents don't disappear.
Referencing the Child Key
The trigger on room
referenced the unique index in its lookup. When we try to go from hotel
to room
, though, we'll need to address the HotelId
field of the room
' document. For the best efficiency, we can index that field.
CREATE INDEX IF NOT EXISTS idx_room_hotel_id ON room ((data->>'HotelId'));
ON DELETE DO NOTHING
When defining a foreign key constraint, the final part of that clause is an ON DELETE
action; if it's excluded, it defaults to DO NOTHING
. The effect of this is that rows cannot be deleted if they are referenced in a child table. This can be implemented by looking for any rows that reference the hotel being deleted, and raising an exception if any are found.
CREATE OR REPLACE FUNCTION hotel_room_delete_prevent() RETURNS TRIGGER AS $$ DECLARE has_rows BOOL; BEGIN SELECT EXISTS(SELECT 1 FROM room WHERE OLD.data->>'Id' = data->>'HotelId') INTO has_rows; IF has_rows THEN RAISE EXCEPTION 'Hotel ID % has dependent rooms; cannot delete', OLD.data->>'Id'; END IF; RETURN OLD; END; $$ LANGUAGE plpgsql; CREATE OR REPLACE TRIGGER hotel_room_delete BEFORE DELETE ON hotel FOR EACH ROW EXECUTE FUNCTION hotel_room_delete_prevent();
This trigger in action…
hotel=# delete from hotel where data->>'Id' = 'fifteen';
ERROR: Hotel ID fifteen has dependent rooms; cannot delete
CONTEXT: PL/pgSQL function hotel_room_delete_prevent() line 7 at RAISE
hotel=# select * from room;
data
-------------------------------------
{"Id": "one", "HotelId": "fifteen"}
(1 row)
There's that child record! We've successfully prevented an orphaned room.
ON DELETE CASCADE
Rather than prevent deletion, another foreign key constraint option is to delete the dependent records as well; the delete “cascades” (like a waterfall) to the child tables. Implementing this is even less code!
CREATE OR REPLACE FUNCTION hotel_room_delete_cascade() RETURNS TRIGGER AS $$ BEGIN DELETE FROM room WHERE data->>'HotelId' = OLD.data->>'Id'; RETURN OLD; END; $$ LANGUAGE plpgsql; CREATE OR REPLACE TRIGGER hotel_room_delete BEFORE DELETE ON hotel FOR EACH ROW EXECUTE FUNCTION hotel_room_delete_cascade();
Here's is what happens when we try the same DELETE
statement that was prevented above…
hotel=# select * from room;
data
-------------------------------------
{"Id": "one", "HotelId": "fifteen"}
(1 row)
hotel=# delete from hotel where data->>'Id' = 'fifteen';
DELETE 1
hotel=# select * from room;
data
------
(0 rows)
We deleted a hotel, not rooms, but the rooms are now gone as well.
ON DELETE SET NULL
The final option for a foreign key constraint is to set the column in the dependent table to NULL
. There are two options to set a field to NULL
in a JSONB
document; we can either explicitly give the field a value of null
, or we can remove the field from the document. As there is no schema, the latter is cleaner; PostgreSQL will return NULL
for any non-existent field.
CREATE OR REPLACE FUNCTION hotel_room_delete_set_null() RETURNS TRIGGER AS $$ BEGIN UPDATE room SET data = data - 'HotelId' WHERE data->>'HotelId' = OLD.data ->> 'Id'; RETURN OLD; END; $$ LANGUAGE plpgsql; CREATE OR REPLACE TRIGGER hotel_room_delete BEFORE DELETE ON hotel FOR EACH ROW EXECUTE FUNCTION hotel_room_delete_set_null();
That -
operator is new for us. When used on a JSON
or JSONB
field, it removes the named field from the document.
Let's watch it work…
hotel=# delete from hotel where data->>'Id' = 'fifteen';
ERROR: Hotel ID <NULL> does not exist
CONTEXT: PL/pgSQL function room_hotel_id_fk() line 7 at RAISE
SQL statement "UPDATE room SET data = data - 'HotelId' WHERE data->>'HotelId' = OLD.data->>'Id'"
PL/pgSQL function hotel_room_delete_set_null() line 3 at SQL statement
Oops! This trigger execution fired the BEFORE UPDATE
trigger on room
, and it took exception to us setting that value to NULL
. The child table trigger assumes we'll always have a value. We'll need to tweak that trigger to allow this.
CREATE OR REPLACE FUNCTION room_hotel_id_nullable_fk() RETURNS TRIGGER AS $$ DECLARE hotel_id TEXT; BEGIN IF NEW.data->>'HotelId' IS NOT NULL THEN SELECT data->>'Id' INTO hotel_id FROM hotel WHERE data->>'Id' = NEW.data->>'HotelId'; IF hotel_id IS NULL THEN RAISE EXCEPTION 'Hotel ID % does not exist', NEW.data->>'HotelId'; END IF; END IF; RETURN NEW; END; $$ LANGUAGE plpgsql; CREATE OR REPLACE TRIGGER hotel_enforce_fk BEFORE INSERT OR UPDATE ON room FOR EACH ROW EXECUTE FUNCTION room_hotel_id_nullable_fk();
Now, when we try to run the deletion, it works.
hotel=# select * from room;
data
-------------------------------------
{"Id": "one", "HotelId": "fifteen"}
(1 row)
hotel=# delete from hotel where data->>'Id' = 'fifteen';
DELETE 1
hotel=# select * from room;
data
---------------
{"Id": "one"}
(1 row)
Should We Do This?
You may be thinking “Hey, this is pretty cool; why not do this everywhere?” Well, the answer is - as it is with everything software-development-related - “it depends.”
No…?
The flexible, schemaless data storage paradigm that we call “document databases” allow changes to happen quickly. While “schemaless” can mean “ad hoc,” in practice most documents have a well-defined structure. Not having to define columns for each item, then re-define or migrate them when things change, brings a lot of benefits.
What we've implemented above, in this example, complicates some processes. Sure, triggers can be disabled then re-enabled, but unlike true constraints, they do not validate existing data. If we were to disable triggers, run some updates, and re-enable them, we could end up with records that can't be saved in their current state.
Yes…?
The lack of referential integrity in document databases can be an impediment to adoption in areas where that paradigm may be more suitable than a relational one. To be sure, there are fewer relationships in a document database whose documents have complex structures, arrays, etc. This doesn't mean that there won't be relationships, though; in our hotel example, we could easily see a “reservation” document that has the IDs of a customer and a room. Just as it didn't make much sense to embed the rooms in a hotel document, it doesn't make sense to embed customers in a room document.
What PostgreSQL brings to all of this is that it does not have to be an all-or-nothing decision re: referential integrity. We can implement a document store with no constraints, then apply the ones we absolutely must have. We realize we're complicating maintenance a bit (though pgdump
will create a backup with the proper order for restoration), but we like that PostgreSQL will protect us from broken code or mistyped UPDATE
statements.
Going Further
As the trigger functions are executing SQL, it would be possible to create a set of reusable trigger functions that take table/column as parameters. Dynamic SQL in PL/pgSQL was additional complexity that would have distracted from the concepts. Feel free to take the examples above and make them reusable.
Finally, one piece we will not cover is CHECK
constraints. These can be applied to tables using the data->>'Key'
syntax, and can be used to apply more of a schema feel to the unstructured JSONB
document. PostgreSQL's handling of JSON data really is first-class and unopinionated; you can use as much or as little as you like!