MongoDB Compression

For a while people have wanted MongoDB to compress their data, or at least compress the field names. This would not only reduce the amount of disk space required, but in theory also improve performance, as we trade disk I/O for CPU time. I thought this would be a fun project to investigate, so I started by working out whether it would actually be useful.

Let’s start with compressing the data. I’ve taken a project I’ve been working on, where most of the records have a similar set of fields. An example record looks like this (across the database the average record size was 547 bytes):

{
	"_id" : ObjectId("5134b1c644ae658fc8c050c0"),
	"version" : NumberLong(0),
	"attributes" : {
		"firstName" : "Andrew",
		"lastName" : "Brampton",
		"birthday" : "1982-01-01T00:18:00.000-05:00",		
	},
	"tags" : ["25-30 years", "dc", "male" ],
	"contactPoints" : [{
		"_id" : "+11235551234",
		"type" : "sms",
		"number" : {
			"E164" : "+11235551234"
		},
		"version" : NumberLong(0),
		"subscriptions" : [{
			"version" : NumberLong(0),
			"event" : ObjectId("5134b1c644ae658fc8c050aa"),
			"status" : "ACTIVE"
		}]
	},{
		"_id" : "[email protected]",
		"type" : "email",
		"email" : "[email protected]",
		"version" : NumberLong(0),
		"subscriptions" : [ ]
	}]
}

There are some excellent presentations and articles on how MongoDB structures the data on the disk. For the sake of this investigation I took a single data file (of size 2 GiB) that was full of just these kinds of records.

Full Compression

gzip datafile.5

Original size: 2,146,435,072 bytes (2.0 GiB)
Compressed gzip size: 453,359,908 bytes (432 MiB) / 21% of the original size

Simple gzip across the whole file gave a 4.7x saving in file size. This is obviously the best case, as it compresses the entire file in one pass. Next, let’s look at the savings from compressing just the field names.

Field Name Compression

strings -n3 datafile.5 | sort | uniq -c | sort -n | tail -n20 

That prints out a list of the most popular strings in the data file, sorted by frequency. From those stats we can infer:

Unique fields: 14
Fields: 75,764,300
Field bytes: 506,244,611 bytes (482 MiB) / 23% of the total data file size

If we assume that each field name can be encoded as a single byte, the bytes taken up by field names drop from 482 MiB to 72.2 MiB, a 6.6x reduction in field-name overhead. That’s a good saving, but as the field names only take up 23% of the file, the overall saving would be about 20% of the total data file size.
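
As a rough sketch of the idea, here is the back-of-the-envelope calculation in Python (the token table is hypothetical, not anything MongoDB does today):

# Hypothetical one-byte tokens for a few of the field names above.
FIELD_TOKENS = {
    b"_id": b"\x01",
    b"version": b"\x02",
    b"attributes": b"\x03",
    b"contactPoints": b"\x04",
    b"subscriptions": b"\x05",
}

def encode_field_name(name):
    """One byte for a known field name, the full name otherwise."""
    return FIELD_TOKENS.get(name, name)

# Numbers from the stats above.
field_bytes = 506244611  # bytes currently spent on field names
field_count = 75764300   # one token byte per field after encoding
file_bytes = 2146435072
saved = field_bytes - field_count
print("field names shrink from %d to %d bytes, saving %.0f%% of the data file"
      % (field_bytes, field_count, 100.0 * saved / file_bytes))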

Document Compression

Compressing individual documents might provide a better solution than field compression. It isn’t possible to compress the whole database as one stream, as that would make it extremely hard to alter individual documents, so instead each document would be compressed independently of the others. This differs from field compression, where the field-name dictionary would be shared across documents.

To try this out I wrote a simple Python script instead of hacking on the MongoDB code. The script simulates the compression by finding each BSON-encoded document on disk, compressing it with zlib, and summing up the uncompressed and compressed document sizes.
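
Something along these lines works as a simulation (a minimal sketch, not the original script; it guesses document boundaries from the BSON int32 length prefix and trailing NUL byte rather than properly parsing MongoDB’s record and extent headers):

import mmap
import struct
import sys
import zlib

MAX_DOC = 16 * 1024 * 1024  # BSON documents are capped at 16 MB

def scan(path):
    """Sum the raw and zlib-compressed sizes of candidate BSON documents in a data file."""
    raw = packed = 0
    with open(path, "rb") as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        pos = 0
        while pos + 5 <= len(data):
            (length,) = struct.unpack_from("<i", data, pos)
            # Plausible BSON document: sane length and a trailing NUL byte.
            if 5 <= length <= MAX_DOC and pos + length <= len(data) and data[pos + length - 1] == 0:
                doc = data[pos:pos + length]
                raw += length
                packed += len(zlib.compress(doc))
                pos += length
            else:
                pos += 1
    return raw, packed

if __name__ == "__main__":
    raw, packed = scan(sys.argv[1])
    print("uncompressed: %d  compressed: %d  (%.0f%%)" % (raw, packed, 100.0 * packed / raw))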

Uncompressed document length: 1,888,733,156 bytes (1.75 GiB)
Compressed document length: 1,261,267,661 bytes (1.17 GiB) / 66% of the total document length

The keen reader will notice that the documents only made up 1.75 GiB of the total 2 GiB data file. The rest of the file contains non-document data, such as indexes, padding, and perhaps deleted documents. As the non-document data wasn’t compressed in this test, the total saving was about 30% of the file. This could improve if the documents were more compressible (for example if they were larger), or if a better compression scheme was used.

Conclusion

In conclusion, compression could certainly help reduce the size on disk, and that could lead to performance improvements. Field-level compression gave us a 20% reduction and document compression gave us 30%. The next step is to actually implement this and run various benchmarks.

Future

My current thinking on field-level compression is to create a simple lookup table that maps field names to tokens. A lookup table would be stored in each extent, which contains numerous documents. As new extents are created, new lookup tables are created with them, allowing the tables to adapt as new types of documents are put into the system. A lookup table would only ever be appended to, so that existing documents continue to work, and it could be optimised when a compact operation is run.
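
In pseudo-Python, a per-extent table might look something like this (purely a sketch of the design; the real thing would live in the C++ storage layer):

class FieldNameTable(object):
    """Append-only field-name lookup table, one per extent (hypothetical)."""

    def __init__(self):
        self._by_name = {}    # field name -> token
        self._by_token = []   # token -> field name, only ever appended to

    def token_for(self, name):
        """Return the token for a field name, adding it on first sight."""
        if name in self._by_name:
            return self._by_name[name]
        if len(self._by_token) >= 255:
            return None       # table full: store the field name uncompressed
        token = len(self._by_token)
        self._by_name[name] = token
        self._by_token.append(name)
        return token

    def name_for(self, token):
        """Tokens are never removed or reused, so existing documents keep working."""
        return self._by_token[token]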

Looking at the existing MongoDB code, adding field-level compression might be quite tricky, as the BSON objects are memory mapped and multiple places assume they can access fields in place. So an easier approach might be to compress full BSON objects, and create uncompressed copies in memory.
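
Conceptually that approach boils down to something like the following (a sketch using zlib and the bson codec that ships with PyMongo; bson.encode/bson.decode assume a reasonably recent PyMongo):

import zlib
import bson  # the BSON codec bundled with PyMongo (pymongo >= 3.9 for bson.encode/decode)

def store(doc):
    """What would be written to disk: the whole BSON object, zlib-compressed."""
    return zlib.compress(bson.encode(doc))

def load(blob):
    """What the query layer would see: an uncompressed in-memory copy."""
    return bson.decode(zlib.decompress(blob))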

Look forward to my attempt at compression.


4 Comments

  1. Roger

    Note that zlib typically can’t compress text shorter than about 100 bytes. You can get quite a bit better compression by having a prebuilt (or continuously built) compression dictionary that contains popular patterns, and storing it separately from the compressed output rather than with every compressed output, as is the usual case.

    There are libraries like smaz that do this.

  2. rmxz

    Wouldn’t just running on a compressing filesystem (say btrfs) give you more advantages (it can compress larger blocks than the approaches you considered), as well as being far simpler (no code change at all)?

  3. Rmxz, compressing the filesystem might be the easiest, and perhaps simplest, approach. However, compression built into MongoDB would most likely be able to perform better compression, as it understands the layout and format of the data. MongoDB would also be able to selectively compress different documents/collections/indexes, etc.

    However, it might be interesting to add a quick section on filesystem/disk level compression.

  4. Curt

    Compressing entire documents could wreak havoc on performance if you ever have to run queries that reference unindexed field values. In such cases, the database engine would have to decompress every single record to try to satisfy the query, and it would do so every single time that query was run.

    Field compression (or, more properly, field aliasing) has the potential for significant space savings (and some performance boost as well). This could best be accomplished by allowing the developer/DBA to run a command against a collection that would scan the db and create short (one- or two-character) aliases for all fieldnames encountered that occur in more than a threshold number of records within the collection.

    For example, something like:

    db.Collection.AliasFieldnames(byte percentOfRecords, bool doCommit)

    If this were run, Mongo would sample the documents in that collection, and any field that occurred in more than the threshold percentage of records would be stored with a shortened alias. The commit flag could control whether the collection was actually updated to use the new aliases (which could take quite a long time if the collection is big); otherwise the function would just return a sorted list of the most-encountered fieldnames.

    The engine could map and un-map those aliases on DB operations, so it could be completely transparent to the user. The CPU overhead on the server side would be trivial (there could, perhaps, even be lower CPU use, since iterating through documents to find fieldnames would involve fewer string comparisons).

    Space savings would be most dramatic in those cases where fieldnames are long, and field data is short (e.g. booleans, bytes, chars, etc.)
