Loading…

BitBadger.Documents
Advanced Usage: Referential Integrity

One of the hallmarks of document database is loose association between documents. In our running hotel and room example, there is no technical reason we could not delete every hotel in the database, leaving all the rooms with hotel IDs that no longer exist. This is a feature-not-a-bug, but it shows the tradeoffs inherent to selecting a data storage mechanism. In our case, this is less than ideal - but, since we are using PostgreSQL, a relational database, we can implement referential integrity if, when, and where we need it.

NOTE: This page has very little to do with the document library itself; these are all modifications that can be made via PostgreSQL. SQLite may have similar capabilities, but this author has yet to explore that.

Enforcing Referential Integrity on the Child Document

While we've been able to use data ->> 'Id' in place of column names for most things up to this point, here is where we hit a roadblock; we cannot define a foreign key constraint against an arbitrary expression. Through database triggers, though, we can accomplish the same thing.

Triggers are implemented in PostgreSQL through a function/trigger definition pair. A function defined as a trigger has NEW and OLD defined as the data that is being manipulated (different ones, depending on the operation; no OLD for INSERTs, no NEW for DELETEs, etc.). For our purposes here, we'll use NEW, as we're trying to verify the data as it's being inserted or updated.

CREATE OR REPLACE FUNCTION room_hotel_id_fk() RETURNS TRIGGER AS $$
    DECLARE
        hotel_id TEXT;
    BEGIN
        SELECT data ->> 'Id' INTO hotel_id FROM hotel WHERE data ->> 'Id' = NEW.data ->> 'HotelId';
        IF hotel_id IS NULL THEN
            RAISE EXCEPTION 'Hotel ID % does not exist', NEW.data ->> 'HotelId';
        END IF;
        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER hotel_enforce_fk BEFORE INSERT OR UPDATE ON room
    FOR EACH ROW EXECUTE FUNCTION room_hotel_id_fk();

This is as straightforward as we can make it; if the query fails to retrieve data (returning NULL here, not raising NO_DATA_FOUND like Oracle would), we raise an exception. Here's what that looks like in practice.

hotel=# insert into room values ('{"Id": "one", "HotelId": "fifteen"}');
ERROR:  Hotel ID fifteen does not exist
CONTEXT:  PL/pgSQL function room_hotel_id_fk() line 7 at RAISE
hotel=# insert into hotel values ('{"Id": "fifteen", "Name": "Demo Hotel"}');
INSERT 0 1
hotel=# insert into room values ('{"Id": "one", "HotelId": "fifteen"}');
INSERT 0 1

(This assumes we'll always have a HotelId field; see below on how to create this trigger if the foreign key is optional.)

Enforcing Referential Integrity on the Parent Document

We've only addressed half of the parent/child relationship so far; now, we need to make sure parents don't disappear.

Referencing the Child Key

The trigger on room referenced the unique index in its lookup. When we try to go from hotel to room, though, we'll need to address the HotelId field of the room' document. For the best efficiency, we can index that field.

CREATE INDEX IF NOT EXISTS idx_room_hotel_id ON room ((data ->> 'HotelId'));

ON DELETE DO NOTHING

When defining a foreign key constraint, the final part of that clause is an ON DELETE action; if it's excluded, it defaults to DO NOTHING. The effect of this is that rows cannot be deleted if they are referenced in a child table. This can be implemented by looking for any rows that reference the hotel being deleted, and raising an exception if any are found.

CREATE OR REPLACE FUNCTION hotel_room_delete_prevent() RETURNS TRIGGER AS $$
    DECLARE
        has_rows BOOL;
    BEGIN
        SELECT EXISTS(SELECT 1 FROM room WHERE OLD.data ->> 'Id' = data ->> 'HotelId') INTO has_rows;
        IF has_rows THEN
            RAISE EXCEPTION 'Hotel ID % has dependent rooms; cannot delete', OLD.data ->> 'Id';
        END IF;
        RETURN OLD;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER hotel_room_delete BEFORE DELETE ON hotel
    FOR EACH ROW EXECUTE FUNCTION hotel_room_delete_prevent();

This trigger in action…

hotel=# delete from hotel where data ->> 'Id' = 'fifteen';
ERROR:  Hotel ID fifteen has dependent rooms; cannot delete
CONTEXT:  PL/pgSQL function hotel_room_delete_prevent() line 7 at RAISE
hotel=# select * from room;
                data                 
-------------------------------------
 {"Id": "one", "HotelId": "fifteen"}
(1 row)

There's that child record! We've successfully prevented an orphaned room.

ON DELETE CASCADE

Rather than prevent deletion, another foreign key constraint option is to delete the dependent records as well; the delete “cascades” (like a waterfall) to the child tables. Implementing this is even less code!

CREATE OR REPLACE FUNCTION hotel_room_delete_cascade() RETURNS TRIGGER AS $$
    BEGIN
        DELETE FROM room WHERE data ->> 'HotelId' = OLD.data ->> 'Id';
        RETURN OLD;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER hotel_room_delete BEFORE DELETE ON hotel
    FOR EACH ROW EXECUTE FUNCTION hotel_room_delete_cascade();

Here's is what happens when we try the same DELETE statement that was prevented above…

hotel=# select * from room;
                data                 
-------------------------------------
 {"Id": "one", "HotelId": "fifteen"}
(1 row)

hotel=# delete from hotel where data ->> 'Id' = 'fifteen';
DELETE 1
hotel=# select * from room;
 data 
------
(0 rows)

We deleted a hotel, not rooms, but the rooms are now gone as well.

ON DELETE SET NULL

The final option for a foreign key constraint is to set the column in the dependent table to NULL. There are two options to set a field to NULL in a JSONB document; we can either explicitly give the field a value of null, or we can remove the field from the document. As there is no schema, the latter is cleaner; PostgreSQL will return NULL for any non-existent field.

CREATE OR REPLACE FUNCTION hotel_room_delete_set_null() RETURNS TRIGGER AS $$
    BEGIN
        UPDATE room SET data = data - 'HotelId' WHERE data ->> 'HotelId' = OLD.data ->> 'Id';
        RETURN OLD;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER hotel_room_delete BEFORE DELETE ON hotel
    FOR EACH ROW EXECUTE FUNCTION hotel_room_delete_set_null();

That - operator is new for us. When used on a JSON or JSONB field, it removes the named field from the document. Let's watch it work…

hotel=# delete from hotel where data ->> 'Id' = 'fifteen';
ERROR:  Hotel ID <NULL> does not exist
CONTEXT:  PL/pgSQL function room_hotel_id_fk() line 7 at RAISE
SQL statement "UPDATE room SET data = data - 'HotelId' WHERE data ->> 'HotelId' = OLD.data ->> 'Id'"
PL/pgSQL function hotel_room_delete_set_null() line 3 at SQL statement

Oops! This trigger execution fired the BEFORE UPDATE trigger on room, and it took exception to us setting that value to NULL. The child table trigger assumes we'll always have a value. We'll need to tweak that trigger to allow this.

CREATE OR REPLACE FUNCTION room_hotel_id_nullable_fk() RETURNS TRIGGER AS $$
    DECLARE
        hotel_id TEXT;
    BEGIN
        IF NEW.data ->> 'HotelId' IS NOT NULL THEN
            SELECT data ->> 'Id' INTO hotel_id FROM hotel WHERE data ->> 'Id' = NEW.data ->> 'HotelId';
            IF hotel_id IS NULL THEN
                RAISE EXCEPTION 'Hotel ID % does not exist', NEW.data ->> 'HotelId';
            END IF;
        END IF;
        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER hotel_enforce_fk BEFORE INSERT OR UPDATE ON room
    FOR EACH ROW EXECUTE FUNCTION room_hotel_id_nullable_fk();

Now, when we try to run the deletion, it works.

hotel=# select * from room;
                data                 
-------------------------------------
 {"Id": "one", "HotelId": "fifteen"}
(1 row)

hotel=# delete from hotel where data ->> 'Id' = 'fifteen';
DELETE 1
hotel=# select * from room;
     data      
---------------
 {"Id": "one"}
(1 row)

Should We Do This?

You may be thinking “Hey, this is pretty cool; why not do this everywhere?” Well, the answer is - as it is with everything software-development-related - “it depends.”

No…?

The flexible, schemaless data storage paradigm that we call “document databases” allow changes to happen quickly. While “schemaless” can mean “ad hoc,” in practice most documents have a well-defined structure. Not having to define columns for each item, then re-define or migrate them when things change, brings a lot of benefits.

What we've implemented above, in this example, complicates some processes. Sure, triggers can be disabled then re-enabled, but unlike true constraints, they do not validate existing data. If we were to disable triggers, run some updates, and re-enable them, we could end up with records that can't be saved in their current state.

Yes…?

The lack of referential integrity in document databases can be an impediment to adoption in areas where that paradigm may be more suitable than a relational one. To be sure, there are fewer relationships in a document database whose documents have complex structures, arrays, etc. This doesn't mean that there won't be relationships, though; in our hotel example, we could easily see a “reservation” document that has the IDs of a customer and a room. Just as it didn't make much sense to embed the rooms in a hotel document, it doesn't make sense to embed customers in a room document.

What PostgreSQL brings to all of this is that it does not have to be an all-or-nothing decision re: referential integrity. We can implement a document store with no constraints, then apply the ones we absolutely must have. We realize we're complicating maintenance a bit (though pgdump will create a backup with the proper order for restoration), but we like that PostgreSQL will protect us from broken code or mistyped UPDATE statements.

Going Further

As the trigger functions are executing SQL, it would be possible to create a set of reusable trigger functions that take table/column as parameters. Dynamic SQL in PL/pgSQL was additional complexity that would have distracted from the concepts. Feel free to take the examples above and make them reusable.

Finally, one piece we will not cover is CHECK constraints. These can be applied to tables using the data ->> 'Key' syntax, and can be used to apply more of a schema feel to the unstructured JSONB document. PostgreSQL's handling of JSON data really is first-class and unopinionated; you can use as much or as little as you like!

« Back to Advanced Usage