Friday 3 June 2011

Wanted: Memory conservative key-value store

I would really like to find a key value store that is memory conservative. What we currently have is like a souped up version of the dumbdbm standard library module, but it has a cache budget and as it loads in new values flushes older ones to make room. However, as the amount of data managed increases so does the amount of key metadata indicating whereabouts the values lie on disk. So now the next step is to either add some form of caching for key metadata, or find a suitable free open source solution.

Does anyone know of a suitable one that is not constrained by the GPL? It doesn't have to be Python, but Python bindings are a bonus.

Considering Sqlite

When thinking of low memory database solutions, Sqlite is one that comes to mind, and even better it comes as part of the Python distribution these days. And even betterer, there's a custom port for my uncommon platform of choice. And even.. bettererer it has an IO abstraction layer that allows it to work with custom IO solutions with minimal additional work. Additionally, reading the spiel makes it sound appealing memory-wise:

SQLite is a compact library. With all features enabled, the library size can be less than 300KiB, depending on compiler optimization settings. (Some compiler optimizations such as aggressive function inlining and loop unrolling can cause the object code to be much larger.) If optional features are omitted, the size of the SQLite library can be reduced below 180KiB. SQLite can also be made to run in minimal stack space (4KiB) and very little heap (100KiB), making SQLite a popular database engine choice on memory constrained gadgets such as cellphones, PDAs, and MP3 players. There is a tradeoff between memory usage and speed. SQLite generally runs faster the more memory you give it. Nevertheless, performance is usually quite good even in low-memory environments.
But you know what? I am as yet unable to get it down to 180KiB no matter how many features I compile out of it using the handy SQLITE_OMIT... options. And not all options can be omitted if I want to use pysqlite, as it does not suit the maintainer to support them.

Here's a clipped table of the code sizes for various cross-compilations:

Overall
libsqlite3.a
libpysqlite3.a
Description
1
0
0
Without sqlite
2
487536
425060
49860
Sqlite with optimise for size
3
365212
308730
46740
Sqlite with optimise for size + code omissions
4
536608
472290
50560
Sqlite with full optimise
5
402492
344440
47430
Sqlite with full optimise + code omissions

In the Windows Python 2.7 installation _sqlite3.pyd is 48KB and sqlite3.dll is 417 KB. So the sizes above, are still comparatively above that expecting both to be done with no omissions and full optimisation. But more or less close enough.

Considering home grown

Any third party solution would need to be adapted to deal with the custom IO needs, unless it was written in pure Python. At this point, the simplest solution is just to extend what I already have.

Edit: Just a note, the key desired feature is memory management. It should be possible to put hard constraints on the amount of memory it uses, both for the cached records read from disk, and for the lookup information that maps keys to location of records on disk. Most key value stores I have looked at either claim to keep all keys in memory as a feature, or just keep them all in memory because it is the simple thing to do.

13 comments:

  1. Not sure I fully understand dumbdbm, i've gone off this... http://en.wikipedia.org/wiki/Dbm

    Does it need to be distributed?
    If so, would it be better to move to something more cache-able at the infrastructure level like a REST JSON based service with HTTP cache in-front of it, using etags for object hashes (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.19) and If-Modified Http header (http://www.w3.org/Protocols/HTTP/HTRQ_Headers.html#if-modified-since) for cache enablement?

    ReplyDelete
  2. No distributed. No REST. No JSON. :-)

    ReplyDelete
  3. If your budget for a storage engine is in the low 100Ks how can you afford to run a python interpreter?

    ReplyDelete
  4. Python is already being used and its usage is separate from whatever the storage engine would add. Not a problem.

    ReplyDelete
  5. The Durus library is ~ 194K, mostly pure Python (includes one small C module ~ 38k compiled). The easiest way to describe Durus is to consider it a simpler ZODB, meaning it is a Python object persistence system.

    If Windows is your|one of your targets, it might be best to look elsewhere. There aren't any insurmountable problems in getting durus to run on Windows but it isn't a supported platform. *nix's are.

    http://www.mems-exchange.org/software/durus/

    If still interested, to represent a key:value store you could use persistent dicts or the persistent BTree container (which act like a persistent dict) for performance with large stores. Durus can be used with a simple file based store or a network accessible client-server store; changing between the two is a line or two of code.

    ReplyDelete
  6. Right, but the key question is whether can Durus constrain how much memory it uses?

    Can it have a limit set to the cached entries? Can it have a limit set to the lookup data (what key maps to where on disk) memory usage?

    ReplyDelete
  7. I've developed key-value store based on splay tree. It's pure Python, and relatively simple to understand and extend. There are two variants:

    1) in memory: http://code.google.com/p/deebee/source/browse/deebee/splaydict.py

    2) on disk: http://code.google.com/p/deebee/source/browse/deebee/splaydictfile.py

    ReplyDelete
  8. If it is only the file size of the sqlite dll that is "large", what does that have to do with your expected memory usage? If pysqlite is too hefty, have you considered ctypes for direct access to the stripped down sqlite dll? I've done similar things for Windows COM/DCOM programming where pywin32 was using too much memory and garbage collection wasn't happening as well as I wanted, so I had to go directly to the related dll's with ctypes. SQLite is almost ubiquitous in portable memory-starved devices today, I can't see how you can run a Python interpreter but not get SQLite to work.

    ReplyDelete
  9. robwalker01:
    1) It is a disappointing higher than advertised cost. 150KB more than it should be, which matters.
    2) No, that pysqlite doesn't by design cater for omissions in sqlite is a tangential point. 48 KB is a price that is acceptable.
    3) Who says it can't be made to work? It works fine, it just costs too much to use memory wise than advertised.

    ReplyDelete
  10. Marko Tasic:
    If it is up to me to add memory usage constraints for the lookup metadata then the work required is just as much as doing the same extension to my existing solution.

    I'll check it out though, it has to be better than half the undocumented libraries that I've had to scrutinise. Who knows, maybe its what I am looking for.

    ReplyDelete
  11. Metakit? It only stores up to 2GB at most.

    ReplyDelete
  12. Metakit, ZODB...not much else that I can find.

    ReplyDelete
  13. Richard, Durus has a cache limit. In addition, start up time and memory utilization is a factor of how many changes have occurred within the DB since the last 'pack', rather than how many objects are within the DB.

    I wrote a data collection app for a hand held Linux powered tablet (going back two or three years now) that utilized Durus; memory constraints and Durus performance certainly were not an issue then for me; that particular app never had to manage much more than 10K or so objects with less than 1% of them changing in a given day or session.

    One possibly nice feature of Durus (and the considerably more complex ZODB) is the value (object) of your key:value store doesn't have to be (but certainly can be if you need it to be) a simple type like a string.

    Since Durus imposes very little additional thinking on top of the Python you already know and use today, you may find it very quick to prove or disprove its utility for your purpose.

    ReplyDelete