
Strategy: Saving Your Butt With Deferred Deletes

Deferred Deletes is a technique where deleted items are marked as deleted but not garbage collected until some days or preferably weeks later. James Hamilton describes this strategy in his classic On Designing and Deploying Internet-Scale Services:

Never delete anything. Just mark it deleted. When new data comes in, record the requests on the way. Keep a rolling two week (or more) history of all changes to help recover from software or administrative errors. If someone makes a mistake and forgets the where clause on a delete statement (it has happened before and it will again), all logical copies of the data are deleted. Neither RAID nor mirroring can protect against this form of error. The ability to recover the data can make the difference between a highly embarrassing issue or a minor, barely noticeable glitch. For those systems already doing off-line backups, this additional record of data coming into the service only needs to be since the last backup. But, being cautious, we recommend going farther back anyway.

Mistakes happen and James says in Stonebraker on CAP Theorem and Databases that:

Deferred delete is not full protection but it has saved my butt more than once and I’m a believer. If you have an application error, administrative error, or database implementation bug that loses data, then it is simply gone unless you have an offline copy. This, by the way, is why I’m a big fan of deferred delete.

Something to consider in your own design.
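
As a rough illustration of the idea, here is a minimal sketch using Python's built-in sqlite3 module. The table name `items`, the `deleted_at` column, and the 14-day window are assumptions made for the example, not something prescribed by either of the quoted articles.

```python
import sqlite3

RETENTION_SECONDS = 14 * 86400  # assumed two-week window


def open_db():
    conn = sqlite3.connect(":memory:")
    # deleted_at is NULL for live rows, else the deletion time (epoch seconds).
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, deleted_at INTEGER)"
    )
    return conn


def soft_delete(conn, item_id, now):
    # Mark the row as deleted instead of removing it.
    with conn:
        conn.execute(
            "UPDATE items SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
            (now, item_id),
        )


def undelete(conn, item_id):
    # Recovery path: clear the marker while the row still exists.
    with conn:
        conn.execute("UPDATE items SET deleted_at = NULL WHERE id = ?", (item_id,))


def purge(conn, now):
    # Garbage-collect only rows deleted longer ago than the retention window.
    with conn:
        cur = conn.execute(
            "DELETE FROM items WHERE deleted_at IS NOT NULL AND deleted_at < ?",
            (now - RETENTION_SECONDS,),
        )
    return cur.rowcount


def live_items(conn):
    # Every read has to exclude rows that are marked as deleted.
    rows = conn.execute("SELECT name FROM items WHERE deleted_at IS NULL ORDER BY id")
    return [r[0] for r in rows]
```

In this sketch `purge` would be run from a scheduled job, and `undelete` only works until the purge has caught up with the row.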

Reader Comments (9)

Don't forget that an update also removes data.

I described a not-so-intrusive way of doing data versioning that is a simplified version of a DDJ article.

April 13, 2010 | Unregistered Commenter Steve Schnepp

The next logical step is to keep a whole archived history of not only deletes, but alters too... Load and performance allowing, of course.

April 13, 2010 | Unregistered Commenter Jean-Marc Liotier

It is also worth noting the downsides of deferred deletes (also known as soft deletes), outlined in The trouble with soft delete:

It’s like a tax; a mandatory WHERE clause to ensure you don’t return any deleted rows. This extra WHERE clause is similar to checking return codes in programming languages that don’t throw exceptions (like C). It’s very simple to do, but if you forget to do it in even one place, bugs can creep in very fast.

There are other performance and complexity issues involved.

As always, it is a trade-off.

April 13, 2010 | Registered Commenter saidimu apale

Another approach to 'deferred deletes' is to 'move' the data to a different data store (a different database, for example) altogether. This will keep the primary data store sane, queries simpler (you don't need to 'exclude' deleted records either in the query or in some post processing blocks), and still less embarrassing when the data is deleted accidentally :-)

April 13, 2010 | Unregistered Commenter rags
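
The "move it to a different data store" approach from the comment above might look something like the following sketch. It uses Python's sqlite3 module with a second database attached under the name `archive`; the schema and the function names are hypothetical, chosen only for illustration.

```python
import sqlite3


def open_stores():
    conn = sqlite3.connect(":memory:")
    # A second, separate database stands in for the "different data store".
    conn.execute("ATTACH DATABASE ':memory:' AS archive")
    conn.execute("CREATE TABLE main.items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE archive.items "
        "(id INTEGER PRIMARY KEY, name TEXT, deleted_at INTEGER)"
    )
    return conn


def archive_delete(conn, item_id, now):
    # Copy the row to the archive, then remove it from the primary store,
    # both inside one transaction.
    with conn:
        conn.execute(
            "INSERT INTO archive.items (id, name, deleted_at) "
            "SELECT id, name, ? FROM main.items WHERE id = ?",
            (now, item_id),
        )
        conn.execute("DELETE FROM main.items WHERE id = ?", (item_id,))


def restore(conn, item_id):
    # Reverse move, for recovering from an accidental delete.
    with conn:
        conn.execute(
            "INSERT INTO main.items (id, name) "
            "SELECT id, name FROM archive.items WHERE id = ?",
            (item_id,),
        )
        conn.execute("DELETE FROM archive.items WHERE id = ?", (item_id,))
```

Queries against `main.items` need no extra filter here, which is the benefit the comment points out; the cost is that every delete becomes a copy plus a delete.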

"...but if you forget to do it in even one place, bugs can creep in very fast."

One easy solution is to use a view that filters out the logical deletes. You'll want a construct to transparently write to the correct tables and read from the views, but that's not difficult, is easily generalized, and is an abstraction that can be applied to many other view-based solutions.

April 13, 2010 | Unregistered Commenter Brendan
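
One possible reading of the view-based suggestion above, sketched with SQLite through Python's sqlite3 module. The names `items_all` and `items` and the `deleted` flag are assumptions for the example; the INSTEAD OF triggers are one way to get the "transparently write to the correct tables" part, and other databases offer different mechanisms.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE items_all (
            id      INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            deleted INTEGER NOT NULL DEFAULT 0
        );

        -- Application code reads from this view and never sees deleted rows.
        CREATE VIEW items AS
            SELECT id, name FROM items_all WHERE deleted = 0;

        -- Writes against the view are redirected to the base table.
        CREATE TRIGGER items_insert INSTEAD OF INSERT ON items
        BEGIN
            INSERT INTO items_all (name) VALUES (NEW.name);
        END;

        -- A DELETE against the view becomes a logical delete.
        CREATE TRIGGER items_delete INSTEAD OF DELETE ON items
        BEGIN
            UPDATE items_all SET deleted = 1 WHERE id = OLD.id;
        END;
        """
    )
    return conn


def visible_names(conn):
    return [r[0] for r in conn.execute("SELECT name FROM items ORDER BY id")]


def stored_count(conn):
    return conn.execute("SELECT COUNT(*) FROM items_all").fetchone()[0]
```

With this in place, `DELETE FROM items WHERE name = 'a'` hides the row from `items` while `items_all` still holds it, so no query has to remember the extra WHERE clause.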

As a Java developer with a background in warehousing, I would like to make note of the technique in which the history of records is maintained by a date. Just like "mark for delete", when a record is changed (including deleted) we do not actually delete it, but set a date column which represents that it has been closed out. In this fashion record counts build up quickly depending on application logic, but I think I would incorporate this kind of idea with Todd's suggestion for a rolling 2 week quota of history. My experience concurs that having such a history can be very useful for investigating issues as well as recovering from them. Good post, thanks!

April 16, 2010 | Unregistered Commenter Tim
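
The date-based history described in the comment above could be sketched roughly as follows, again with Python's sqlite3 module. The `records` table, the `opened_at` and `closed_at` columns, and the two-week trim are assumptions for illustration; timestamps are plain epoch seconds.

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    # closed_at is NULL for the current version of a record.
    conn.execute(
        "CREATE TABLE records ("
        " key TEXT NOT NULL, value TEXT NOT NULL,"
        " opened_at INTEGER NOT NULL, closed_at INTEGER)"
    )
    return conn


def _close_out(conn, key, now):
    conn.execute(
        "UPDATE records SET closed_at = ? WHERE key = ? AND closed_at IS NULL",
        (now, key),
    )


def put(conn, key, value, now):
    # A change closes out the current version and opens a new one.
    with conn:
        _close_out(conn, key, now)
        conn.execute(
            "INSERT INTO records (key, value, opened_at) VALUES (?, ?, ?)",
            (key, value, now),
        )


def delete(conn, key, now):
    # A delete only closes out the current version.
    with conn:
        _close_out(conn, key, now)


def as_of(conn, key, when):
    # Value that was current at time `when`, or None if there was none.
    row = conn.execute(
        "SELECT value FROM records WHERE key = ? AND opened_at <= ?"
        " AND (closed_at IS NULL OR closed_at > ?)",
        (key, when, when),
    ).fetchone()
    return row[0] if row else None


def trim(conn, now, keep_seconds=14 * 86400):
    # Drop closed-out versions older than the rolling history window.
    with conn:
        cur = conn.execute(
            "DELETE FROM records WHERE closed_at IS NOT NULL AND closed_at < ?",
            (now - keep_seconds,),
        )
    return cur.rowcount
```

Because updates are also kept as closed-out versions, this covers the point raised in the earlier comments that an update removes data too.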

I'm looking at it from a sysadmin's point of view. Implementing it on a filesystem is simple, I guess. You might create a separate directory tree like /garbage and put "deleted" files there with their full path. Run cron periodically to delete files older than a given number of days. Then you only need an alias, say delsec, for "mv <data> /garbage/<data>".

April 18, 2010 | Unregistered Commenter linuxdatacenter
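
The filesystem variant from the comment above could be sketched in Python like this. The function names `trash` and `purge` are hypothetical, and touching the file on move (so that age is measured from the time of deletion rather than the last modification) is an assumption added for the example.

```python
import os
import shutil
import time
from pathlib import Path


def trash(path, garbage_root):
    """Move `path` under `garbage_root`, keeping its full original path."""
    src = Path(path).resolve()
    # /srv/data/a.txt ends up as <garbage_root>/srv/data/a.txt
    dest = Path(garbage_root) / src.relative_to(src.anchor)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    # Assumption: reset the timestamps to "now" so the purge below measures
    # time since deletion. This discards the file's original mtime.
    os.utime(dest, None)
    return dest


def purge(garbage_root, max_age_days, now=None):
    """Remove trashed files older than `max_age_days`; return what was removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    # Materialize the listing first so files are not unlinked mid-iteration.
    for p in list(Path(garbage_root).rglob("*")):
        if p.is_file() and p.stat().st_mtime < cutoff:
            p.unlink()
            removed.append(p)
    return removed
```

In this sketch `purge` is what the periodic cron job would call. Trashing two files with the same original path is not handled; the second would replace the first.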

I learned the hard way, and blogged about it. :-)

April 28, 2010 | Unregistered Commenter Corey

It all depends on the requirements of the application: if the requirements say do not delete, then do not delete. Use either a delete flag or a date timestamp; both are effective, depending on the scenario. By default I use the delete flag, as I believe no data should be deleted in the production world :)

July 6, 2010 | Unregistered Commenter Prashant Saraf
