Much ink has been spilled on which database you should use, or how to think about choosing one, for your project. My aim in this post is not to sell you on any database paradigm, but rather to serve as a reference guide and checklist for responsibly hosting a SQL server, be it MySQL, PostgreSQL, or another, in production.
Before calling your SQL database plan ready-for-implementation, ask yourself if you’ve thought about all these requirements:
☐ Read only replicas
☐ Multi-zone hosting
☐ Automated daily backups
☐ One click rollback / backup restore
☐ Event based audit logging / full table histories / log replication
☐ Automatic disk expansion
☐ High quality Migration tooling
☐ Connection/IP Security
☐ Local dev versions / easy ability to download prod data
☐ Staging / recent-prod replica environments
☐ CPU & Memory monitoring / auto scaling
☐ Slow query monitoring
☐ High quality ORM / DB Connection library
In practice it’s very expensive or impossible to do all of these things yourself; your best bet is to choose a solution that comes with many of these features out of the box, such as Google Cloud SQL or Amazon RDS. Just make sure to enable the features you care about.
Read only replicas
More often than not a production SQL server will have use cases that divide cleanly into read-heavy vs. write-heavy. The most common is perhaps the desire to run analytics on transactional data. Generally this should be handled with a proper data pipeline or enterprise data warehouse, but having a real-time read-only mirror is good practice regardless, and even your ELT tools can point at it.
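As a sketch of the idea, here is a minimal read/write router in Python. The routing rule is deliberately crude, and two connections to one SQLite file stand in for a real primary/replica pair (which would have separate DSNs):

```python
import os
import sqlite3
import tempfile

class ReadWriteRouter:
    """Send writes to the primary and reads to a read-only replica.

    The rule here (anything not starting with SELECT is a write) is a
    simplification; real routing libraries also track transactions."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def execute(self, sql, params=()):
        is_read = sql.lstrip().upper().startswith("SELECT")
        conn = self.replica if is_read else self.primary
        cursor = conn.execute(sql, params)
        conn.commit()
        return cursor

# Two connections to the same SQLite file play primary and replica.
db_path = os.path.join(tempfile.mkdtemp(), "app.db")
router = ReadWriteRouter(sqlite3.connect(db_path), sqlite3.connect(db_path))
router.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
router.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
rows = router.execute("SELECT email FROM users").fetchall()
```

In production you would get this routing from your driver or ORM rather than writing it yourself; the point is that read traffic never touches the primary.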
Multi-zone hosting
If AWS us-east-1 goes down, will your application survive? Have a plan to replicate data in real time between zones, or better yet between data centers entirely.
Automated daily backups
Ideally you have at least daily, if not more frequent, full backups that are sent off site. Depending on your requirements perhaps that’s an exported zip file to a storage bucket with the same cloud provider, or perhaps it’s a bucket in an entirely different cloud. Make sure that everything about this process is secure and locked down tight; these are entire copies of your database, after all.
This is a good use case for that real-time read-only replica.
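To make the shape of an automated backup concrete, here is a sketch using SQLite's online backup API; in real life this slot would be filled by pg_dump or mysqldump streaming to a storage bucket, and the paths here are illustrative:

```python
import datetime
import pathlib
import sqlite3
import tempfile

def backup_database(src_conn, backup_dir):
    """Copy a live database into a timestamped file under backup_dir."""
    backup_dir = pathlib.Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S")
    dest_path = backup_dir / f"backup-{stamp}.db"
    dest = sqlite3.connect(dest_path)
    src_conn.backup(dest)  # SQLite's online backup: safe on a live connection
    dest.close()
    return dest_path

work_dir = pathlib.Path(tempfile.mkdtemp())
prod = sqlite3.connect(work_dir / "prod.db")
prod.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
prod.execute("INSERT INTO orders (total) VALUES (19.99)")
prod.commit()

snapshot = backup_database(prod, work_dir / "backups")
restored = sqlite3.connect(snapshot).execute("SELECT total FROM orders").fetchone()
```

Whatever tool you use, the test that matters is the last two lines: actually open a backup and read from it on a schedule.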
One click rollback / backup restore
Most cloud-hosted SQL offerings provide one-click point-in-time restore. At a minimum, ensure you have a fully automated, regularly tested way to restore from one of your hourly or daily backups.
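The selection logic behind a point-in-time restore is worth internalizing even if your provider hides it: given a target timestamp, you want the newest backup taken at or before it. A minimal sketch, with hypothetical backup paths:

```python
import datetime

def pick_restore_point(backups, target):
    """Given {taken_at: backup_path}, return the newest backup at or before target."""
    candidates = [taken_at for taken_at in backups if taken_at <= target]
    if not candidates:
        raise ValueError("no backup exists at or before the requested restore point")
    return backups[max(candidates)]

# Illustrative daily backups; real systems also replay WAL/binlogs
# from the chosen snapshot forward to the exact target time.
backups = {
    datetime.datetime(2022, 5, 1): "backups/20220501.db",
    datetime.datetime(2022, 5, 2): "backups/20220502.db",
    datetime.datetime(2022, 5, 3): "backups/20220503.db",
}
chosen = pick_restore_point(backups, datetime.datetime(2022, 5, 2, 12, 0))
```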
Event based audit logging / full table histories / log replication
Different databases have different terminology for this: in PostgreSQL it’s logical decoding via replication slots; in SQL Server it’s Change Data Capture. The idea is that you want CDC (change data capture): every mutation to every table recorded in a data warehouse for your analytics team to use as they need. Such data can be used to produce business audit logs, or run point-in-time analytics queries to answer questions for users such as “what was my inventory like last week?”
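A toy version of the idea, using SQLite triggers to append every mutation to an audit table (a real setup would stream the write-ahead log into a warehouse instead; the table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER);

-- Every mutation is copied into an append-only audit table.
CREATE TABLE inventory_audit (
    audit_id INTEGER PRIMARY KEY AUTOINCREMENT,
    action TEXT NOT NULL,
    row_id INTEGER,
    qty INTEGER,
    recorded_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER inventory_insert AFTER INSERT ON inventory BEGIN
    INSERT INTO inventory_audit (action, row_id, qty)
    VALUES ('INSERT', NEW.id, NEW.qty);
END;

CREATE TRIGGER inventory_update AFTER UPDATE ON inventory BEGIN
    INSERT INTO inventory_audit (action, row_id, qty)
    VALUES ('UPDATE', NEW.id, NEW.qty);
END;
""")

conn.execute("INSERT INTO inventory (sku, qty) VALUES ('widget', 10)")
conn.execute("UPDATE inventory SET qty = 7 WHERE sku = 'widget'")
history = conn.execute(
    "SELECT action, qty FROM inventory_audit ORDER BY audit_id"
).fetchall()
# history records both the insert at qty 10 and the update to qty 7
```

With timestamps on every row, “what was my inventory like last week?” becomes a filter on `recorded_at` rather than an unanswerable question.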
Automatic disk expansion
Nobody likes getting an alarm at 3AM that their database has hit its disk storage limit. In case it’s not obvious, very bad things happen when a database runs out of disk space. Make sure your SQL solution never runs out of disk by using a platform/tool that will expand automatically. Ideally it shrinks automatically too.
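The policy most managed platforms implement is simple enough to state in a few lines; this sketch is illustrative, and a real version would read usage from your provider's metrics API and call its resize endpoint (the threshold and growth factor are assumptions, not recommendations):

```python
def plan_expansion(used_gb, total_gb, threshold=0.8, growth_factor=1.5):
    """Return a new disk size in GB once usage crosses the threshold, else None.

    Growing by a factor (rather than a fixed increment) keeps resize
    operations infrequent as the database gets large."""
    if used_gb / total_gb >= threshold:
        return int(total_gb * growth_factor)
    return None

# 85% full on a 100 GB disk: expand to 150 GB. 40% full: do nothing.
```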
High quality Migration tooling
Schema and data migrations are hard; don’t try to solve these problems yourself. Use a tool or framework that will help you generate migrations and manage their execution across environments. Remember that your migrations have to work locally both for developers who have used the repository before and for brand-new ones, as well as in all staging, feature-branch, and production environments. Don’t underestimate the difficulty of this challenge.
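The core mechanism every migration tool shares is a table recording which versions have already run, so the same command is safe in every environment. A stripped-down sketch (the migration names and statements are hypothetical):

```python
import sqlite3

# Ordered migration list; real tools generate these as files for you.
MIGRATIONS = [
    ("0001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)"),
    ("0002_add_name", "ALTER TABLE users ADD COLUMN name TEXT"),
]

def migrate(conn):
    """Apply any migrations not yet recorded in schema_migrations, in order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    for version, statement in MIGRATIONS:
        if version in applied:
            continue  # already ran in this environment
        conn.execute(statement)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: already-applied versions are skipped
```

This is exactly why a fresh clone, a year-old checkout, and production can all run the same `migrate` and converge on the same schema.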
Connection/IP Security
Often you can get away with IP allowlisting access to a database, but in 2022 that’s going out of style (and will be flagged by PCI or SOC 2 auditors). Nowadays your database should sit in a private VPC with no internet access, networked/peered with your application servers. Keep in mind that this will make access challenging for developers; that’s a good thing! It’s still wise to have a strategy for emergencies, though, either a proxy or a bastion host.
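If you do rely on allowlisting anywhere (say, in an application-level proxy), the check itself is straightforward with the standard library; the networks below are hypothetical private ranges:

```python
import ipaddress

# Hypothetical allowlist: your VPC's private ranges would go here.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.1.0/24"),
]

def is_allowed(client_ip):
    """True if client_ip falls inside any allowlisted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in ALLOWED_NETWORKS)
```

Note that this is defense in depth on top of network isolation, not a substitute for it.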
Local dev versions / easy ability to download prod data
You’ll want tooling to download a sanitized copy of production data for testing. Something that runs well on a local machine with 1,000 rows may be unacceptably slow in production with 2 million records. Those 2 million records can cause trouble not just through volume but through heterogeneity: real-world users will hit edge cases your developers may not.
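One detail of sanitization worth getting right: mask values deterministically rather than randomly, so the same real value always maps to the same fake one and joins across tables still line up. A sketch of the idea (the `example.com` scheme is an assumption):

```python
import hashlib

def sanitize_email(email):
    """Replace a real address with a deterministic fake one.

    Hashing (rather than random replacement) keeps the mapping stable,
    so any tables joined on email still join after sanitization."""
    digest = hashlib.sha256(email.strip().lower().encode()).hexdigest()[:12]
    return f"user-{digest}@example.com"
```

The same approach extends to names, phone numbers, and any other PII column in your export pipeline.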
CPU, Memory, Connection monitoring / auto scaling
Ensure you have monitoring, and ideally autoscaling, on CPU, memory, and connection counts for your SQL database. It should be somebody’s job to check from time to time that these values are within acceptable ranges for your use case.
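The shape of such an alert is usually a rolling average over a window rather than a single reading, so one spike doesn't page anyone. A toy stand-in for what Cloud Monitoring or CloudWatch alarms do, with illustrative window and threshold values:

```python
from collections import deque

class MetricMonitor:
    """Fire when the rolling average of a metric crosses a threshold."""

    def __init__(self, window=5, threshold=0.9):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, value):
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # wait for a full window: don't alert on one spike
        return sum(self.samples) / len(self.samples) >= self.threshold
```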
SQL databases are generally some of the more expensive parts of the stack. I recommend you set up a budget using tools in your cloud provider so you know how much you’re spending and can monitor growth.
Slow query monitoring
It’s easy to shoot yourself in the foot with SQL, whether you’re using an ORM or writing raw queries, and end up with very expensive, slow statements. You’ll want logging, and ideally alerting, for anything abnormally slow that makes it to production.
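If your platform doesn't alert for you, a log scan is a reasonable starting point. This sketch assumes log lines loosely shaped like PostgreSQL's output when `log_min_duration_statement` is enabled; the threshold is illustrative:

```python
def find_slow_queries(log_lines, threshold_ms=500.0):
    """Flag log entries whose reported duration exceeds threshold_ms."""
    slow = []
    for line in log_lines:
        if "duration:" not in line:
            continue
        # e.g. "duration: 812.433 ms  statement: SELECT ..."
        duration = float(line.split("duration:")[1].split("ms")[0])
        if duration >= threshold_ms:
            slow.append((duration, line))
    return slow

log = [
    "duration: 812.433 ms  statement: SELECT * FROM orders JOIN users USING (id)",
    "duration: 3.120 ms  statement: SELECT 1",
]
slow = find_slow_queries(log)
```

In practice you'd feed these findings into your alerting pipeline and pull the offending statements up in `EXPLAIN`.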
High quality ORM / DB Connection library
Don’t forget about developer experience! Do you want to write raw SQL or use an ORM/DAL? There are tradeoffs in both directions; think through your options carefully. Does the ORM come with a migration tool? Does it have built-in connection pooling?
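Connection pooling in particular is worth understanding even if your library handles it: opening a database connection is expensive, so a pool hands out a fixed set of reusable ones. A toy version of what SQLAlchemy or pgbouncer provide, using SQLite connections as stand-ins:

```python
import queue
import sqlite3

class ConnectionPool:
    """Hand out a fixed set of reusable connections.

    acquire() blocks when the pool is exhausted instead of opening more,
    which also caps the load your app can put on the database."""

    def __init__(self, factory, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self):
        return self._pool.get()

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)
conn = pool.acquire()
result = conn.execute("SELECT 1").fetchone()
pool.release(conn)
```

A library pool adds the pieces this sketch omits: health checks, timeouts, and returning connections automatically when a request ends.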