Say you want to build a frequency distribution of many thousands of samples with the following characteristics:
- fast to build
- persistent data
- network accessible (with no locking requirements)
- can store large sliceable index lists
The only solution I know that meets those requirements is Redis. NLTK’s FreqDist is not persistent , shelve is far too slow, BerkeleyDB is not network accessible (and is generally a PITA to manage), and AFAIK there’s no other key-value store that makes sliceable lists really easy to create & access. So far I’ve been quite pleased with Redis, especially given how new it is. It’s quite fast, is network accessible, atomic operations make locking unnecessary, supports sortable and sliceable list structures, and is very easy to configure.
Building a FreqDist allows you to create a ProbDist, which in turn can be used for classification. Having it be persistent lets you examine the data later. And the ability to create sliceable lists allows you to make sorted indexes for paging thru your samples.
Here’s some more concrete use cases for persistent frequency distributions:
I put the code I’ve been using to build frequency distributions over large sets of words up at BitBucket. probablity.py contains
RedisFreqDist, which works just like the NTLK FreqDist, except it stores samples and frequencies as keys and values in Redis. That means samples must be strings. Internally,
RedisFreqDist also stores a set of all the samples under the key __samples__ for efficient lookup and sorting. Here’s some example code for using it. For more info, checkout the wiki, or read the code.
def make_freq_dist(samples, host='localhost', port=6379, db=0): freqs = RedisFreqDist(host=host, port=port, db=db) for sample in samples: freqs.inc(sample)
Unfortunately, I had to muck about with some of FreqDist’s internal implementation to remain compatible, so I can’t promise the code will work beyond NLTK version 0.9.9. probablity.py also includes
ConditionalRedisFreqDist for creating ConditionalProbDists.
For creating lists of samples, that very much depends on your use case, but here’s some example code for doing so.
r is a redis object,
key is the index key for storing the list, and
samples is assumed to be a sorted list. The
get_samples function demonstrates how to get a slice of samples from the list.
def index_samples(r, key, samples): r.delete(key) for word in words: r.push(key, word, tail=True) def get_samples(r, key, start, end): return r.lrange(key, start, end)
Yes, Redis is still fairly alpha, so I wouldn’t use it for critical systems. But I’ve had very few issues so far, especially compared to dealing with BerkeleyDB. I highly recommend it for your non-critical computational needs :)