The GitLab.com Database Deletion (2017)

On the night of January 31, 2017, GitLab.com suffered one of the most candidly documented data-loss incidents in software history. While dealing with a separate problem caused by spam and load on the database, an engineer working late tried to clear out the PostgreSQL data directory on a secondary database server so that replication could be set up again. They were in fact on the primary, and they ran a recursive delete against the live production database directory.

GitLab’s own postmortem records the moment with unusual honesty. The engineer “terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.” Roughly 5,000 projects, 5,000 comments, and 700 user accounts created in a window that evening were lost; the Git repositories themselves were stored separately and survived, but the relational database that ties everything together was gone.

The deletion was the first failure; the second was worse. As the team scrambled to restore, they found that none of their safeguards worked. The postmortem walks through five backup and recovery methods that all failed. Scheduled pg_dump backups were silently producing nothing because of a PostgreSQL version mismatch, and the failure emails that should have flagged this were being rejected. Azure disk snapshots were not enabled for the database servers. LVM snapshots were taken only once every 24 hours. Replication was broken, and the standby data had been wiped during the recovery attempts. As the team summarized at the time, out of five backup techniques none were reliably working. They ultimately recovered from a roughly six-hour-old LVM snapshot that happened to exist on a staging server.

What turned the incident into a landmark was the way GitLab handled it. Rather than going quiet, the company worked the problem in the open, publishing a live Google Doc of their notes as they went and streaming the recovery on YouTube to thousands of concurrent viewers. The restore was slow, taking roughly 18 hours because of storage performance limits, and the public could watch every step, including the dead ends.

The episode is a cornerstone example of blameless-postmortem culture. GitLab’s writeup focuses relentlessly on the systems and process failures rather than punishing the engineer, on the principle that anyone could have made the same mistake under the same conditions and that the real defects were the missing guardrails and the untested backups. The single most repeated lesson is brutal in its simplicity: a backup you have never restored from is not a backup, and GitLab’s five-for-five failure proved it in front of the world.

The incident reshaped how many teams talk about reliability, popularizing the discipline of regularly restoring from backups, alerting loudly when backups fail, and treating recovery as something you rehearse rather than assume. By choosing radical transparency over damage control, GitLab also turned a humiliating outage into a widely cited model for honest engineering communication.
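
The "alert loudly when backups fail" lesson can be illustrated with a small check. What follows is a minimal sketch in Python, not anything GitLab ran: the function name check_backup, the 24-hour age limit, and the 1 KiB size floor are all assumptions chosen for illustration. It flags a backup file that is missing, suspiciously small, or stale, which is the kind of silent failure the pg_dump job exhibited. It does not prove a backup can be restored; only a rehearsed restore does that.

```python
import time
from pathlib import Path


def check_backup(path, max_age_hours=24.0, min_bytes=1024, now=None):
    """Return a list of problems with a backup file; an empty list means none found.

    The thresholds are illustrative assumptions, not values from the postmortem.
    """
    now = time.time() if now is None else now
    backup = Path(path)

    # A missing file is the loudest possible failure; report it and stop.
    if not backup.is_file():
        return [f"{backup}: backup file is missing"]

    problems = []
    stat = backup.stat()

    # A dump of only a few bytes usually means the backup job failed silently.
    if stat.st_size < min_bytes:
        problems.append(
            f"{backup}: only {stat.st_size} bytes, expected at least {min_bytes}"
        )

    # A stale file means the scheduled job has stopped producing output.
    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(
            f"{backup}: {age_hours:.1f} hours old, limit is {max_age_hours}"
        )

    return problems
```

A scheduled job could call check_backup on the newest dump and page someone whenever the returned list is non-empty, using a channel that does not depend on the same email path the backup job itself uses.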
