Building a NLTK FreqDist on Redis

May 20, 2009 at 1:27 pm (python) (nltk, probability, redis, statistics)

Say you want to build a frequency distribution of many thousands of samples with the following characteristics:

fast to build
persistent data
network accessible (with no locking requirements)
can store large sliceable index lists

The only solution I know that meets those requirements is Redis. NLTK’s FreqDist is not persistent , shelve is far too slow, BerkeleyDB is not network accessible (and is generally a PITA to manage), and AFAIK there’s no other key-value store that makes sliceable lists really easy to create & access. So far I’ve been quite pleased with Redis, especially given how new it is. It’s quite fast, is network accessible, atomic operations make locking unnecessary, supports sortable and sliceable list structures, and is very easy to configure.

Classification

Building a FreqDist allows you to create a ProbDist, which in turn can be used for classification. Having it be persistent lets you examine the data later. And the ability to create sliceable lists allows you to make sorted indexes for paging thru your samples.

Here’s some more concrete use cases for persistent frequency distributions:

RedisFreqDist

I put the code I’ve been using to build frequency distributions over large sets of words up at BitBucket. probablity.py contains RedisFreqDist, which works just like the NTLK FreqDist, except it stores samples and frequencies as keys and values in Redis. That means samples must be strings. Internally, RedisFreqDist also stores a set of all the samples under the key __samples__ for efficient lookup and sorting. Here’s some example code for using it. For more info, checkout the wiki, or read the code.

def make_freq_dist(samples, host='localhost', port=6379, db=0):
	freqs = RedisFreqDist(host=host, port=port, db=db)

	for sample in samples:
		freqs.inc(sample)

Unfortunately, I had to muck about with some of FreqDist’s internal implementation to remain compatible, so I can’t promise the code will work beyond NLTK version 0.9.9. probablity.py also includes ConditionalRedisFreqDist for creating ConditionalProbDists.

Lists

For creating lists of samples, that very much depends on your use case, but here’s some example code for doing so. r is a redis object, key is the index key for storing the list, and samples is assumed to be a sorted list. The get_samples function demonstrates how to get a slice of samples from the list.

def index_samples(r, key, samples):
	r.delete(key)

	for word in words:
		r.push(key, word, tail=True)

def get_samples(r, key, start, end):
	return r.lrange(key, start, end)

Yes, Redis is still fairly alpha, so I wouldn’t use it for critical systems. But I’ve had very few issues so far, especially compared to dealing with BerkeleyDB. I highly recommend it for your non-critical computational needs 🙂

2 Comments

casey said,

May 23, 2009 at 7:47 pm

have you looked at tokyo tyrant? It meets your requirements, and is a lot more mature than redis.
Jacob said,

May 23, 2009 at 7:59 pm

I have, but I didn’t see any way to do sliceable lists. I see that I could concat string values together, but then what? Get the whole string, split it, and return a small slice? That’s what I was doing in BerkeleyDB, and it’s incredibly inefficient with large lists.

Stream Hacker