My 35 Database Opinions
“That’s just, like, your opinion, man.” - The Dude
Like all opinions, some of these might be wrong, and some might be right. Some might only apply in certain circumstances. But these are the opinions I’ve formed over the last 5+ years on what databases to use, and how to use them.
- Pick relational databases by default. “NoSQL” means “another one-off query language to learn”.
- Only use alternative databases (MongoDB, Redis, Cassandra, etc.) if you REALLY REALLY need to.
- If your data is even SLIGHTLY relational, don’t use MongoDB. Aggregation pipeline lookups don’t scale well.
- “Schemaless” means that your schema is diffused throughout your codebase.
- Schemas can be trivially handled with a migration tool. (I use Alembic or Prisma)
- Keep your schemas aggressively normalized. De-normalized data is a nightmare to keep up-to-date.
- There’s almost always other ways to gain performance before even considering de-normalizing.
- Plan out your schemas ahead of time. (Use your migration tool!)
- Plan out which columns will be indexes ahead of time!
- Ensure your indexes get created. Periodically audit your indexes and add/remove them if needed.
- Don’t bother with compound indexes unless you have VERY specific queries that need them.
- EAV is an anti-pattern… but can sometimes be useful if used SPARINGLY.
- Use UUIDs instead of auto-incrementing integers for primary keys. You gain globally unique IDs, and hide your DB’s size, for not many downsides.
- Use foreign key constraints! They check yourself before you wreck yourself!
- Dealing with time is tricky and has lots of hidden ‘gotchas’.
- ALWAYS store time as an ISO-8601 timestamp (or your DB’s equivalent timestamp type) in the UTC timezone.
- Storing peoples’ names is also very tricky. I just stick to a single field of max 128 Unicode characters.
- Try to have rows represent a single physical object, or an action/event that occurred at a singular point in time.
- Try to keep rows mostly immutable. Things about the row that frequently change should probably be modeled as appends to a separate table.
- Make relationships many-to-many by default unless you have a good reason otherwise.
- Your many-to-many relationship tables should store metadata about the relationship. e.g. Users-to-Companies table that also store’s each employee’s job title at each company.
- Hard delete things. Soft-deleting can lead to problems. Keep an audit log of all changes (creates/updates/deletes) that lets you “undo” if needed.
- If data needs to be redacted, just overwrite individual columns with “REDACTED”, rather than deleting the row.
- Use SQLite for local, single-user applications.
- Use PostgreSQL for remote and multi-user applications.
- Don’t use your Postgres superuser for normal DB connections. Create other users and assign them permissions.
- Consider partitioning Postgres tables once they reach approx 100 million rows.
- Choosing a partition key is hard. Usually it’s the least worst option.
- Choose partition key(s) that aligns with your most common WHERE clauses.
- Choose partition key(s) that will lead to partitions roughly balanced in both size and load.
- Your codebase should gracefully handle ANY possible state that’s representable in your DB’s schema. If an engineer manually changes any cell’s value in the DB, it should not blow up your codebase.
- Try to make analytical OLAP queries work using your OLTP database.
- If they’re just too much for your OLTP database, even after you’ve done every optimization you can, only then introduce an OLAP Data Warehouse.
- Don’t write your own ETL code for shipping data from your database to your Data Warehouse. USE SOMETHING PREBUILT!
- Be very cautious when querying your Data Warehouse. Most (e.g. Athena, BigQuery) are powerful. But you can accidentally rack up HUGE bills.