Building a NLTK FreqDist on Redis

May 20, 2009 at 1:27 pm (python) (, , , )

Say you want to build a frequency distribution of many thousands of samples with the following characteristics:

  • fast to build
  • persistent data
  • network accessible (with no locking requirements)
  • can store large sliceable index lists

The only solution I know that meets those requirements is Redis. NLTK’s FreqDist is not persistent , shelve is far too slow, BerkeleyDB is not network accessible (and is generally a PITA to manage), and AFAIK there’s no other key-value store that makes sliceable lists really easy to create & access. So far I’ve been quite pleased with Redis, especially given how new it is. It’s quite fast, is network accessible, atomic operations make locking unnecessary, supports sortable and sliceable list structures, and is very easy to configure.


Building a FreqDist allows you to create a ProbDist, which in turn can be used for classification. Having it be persistent lets you examine the data later. And the ability to create sliceable lists allows you to make sorted indexes for paging thru your samples.

Here’s some more concrete use cases for persistent frequency distributions:


I put the code I’ve been using to build frequency distributions over large sets of words up at contains RedisFreqDist, which works just like the NTLK FreqDist, except it stores samples and frequencies as keys and values in Redis. That means samples must be strings. Internally, RedisFreqDist also stores a set of all the samples under the key __samples__ for efficient lookup and sorting. Here’s some example code for using it. For more info, checkout the wiki, or read the code.

def make_freq_dist(samples, host='localhost', port=6379, db=0):
	freqs = RedisFreqDist(host=host, port=port, db=db)

	for sample in samples:

Unfortunately, I had to muck about with some of FreqDist’s internal implementation to remain compatible, so I can’t promise the code will work beyond NLTK version 0.9.9. also includes ConditionalRedisFreqDist for creating ConditionalProbDists.


For creating lists of samples, that very much depends on your use case, but here’s some example code for doing so. r is a redis object, key is the index key for storing the list, and samples is assumed to be a sorted list. The get_samples function demonstrates how to get a slice of samples from the list.

def index_samples(r, key, samples):

	for word in words:
		r.push(key, word, tail=True)

def get_samples(r, key, start, end):
	return r.lrange(key, start, end)

Yes, Redis is still fairly alpha, so I wouldn’t use it for critical systems. But I’ve had very few issues so far, especially compared to dealing with BerkeleyDB. I highly recommend it for your non-critical computational needs 🙂

Permalink 2 Comments

Chunk Extraction with NLTK

February 23, 2009 at 4:02 pm (programming, python) (, , , , , )

Chunk extraction is a useful preliminary step to information extraction, that creates parse trees from unstructured text. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.

Chunking is basically a 3 step process:

  1. Tag a sentence
  2. Chunk the tagged sentence
  3. Analyze the parse tree to extract information

I’ve already written about how to train a part of speech tagger and a chunker, so I’ll assume you’ve already done the training, and now you want to use your tagger and chunker to do something useful.

Tag Chunker

The previously trained chunker is actually a chunk tagger. It’s a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I’ve created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.

import nltk.chunk
import itertools

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c] for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))

Chunk Extraction

Now that we have a proper chunker, we can use it to extract chunks. Here’s a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.

# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()

Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since we’re training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn’t similar to the your training data, then you probably won’t be getting many chunks.

Permalink 3 Comments

How to Train a NLTK Chunker

December 29, 2008 at 8:19 am (python) (, , , , )

In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.


The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get very complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the the conll2000 corpus. Here’s a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank), and return a list of (tag, iob) sequences.

import nltk.chunk

def conll_tag_chunks(chunk_sents):
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]


So how accurate is the trained chunker? Here’s the rest of the code, followed by a chart of the accuracy results. Note that I’m only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.

import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
Accuracy for Trained Chunker

Accuracy for Trained Chunker

The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.


Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is re-usable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.

Permalink 5 Comments

Part of Speech Tagging with NLTK – Part 3

December 3, 2008 at 10:14 am (python) (, , , )

In part 2, I showed how to produce a part-of-speech tagger using Ngram tagging in combination with Affix and Regex tagging, with accuracy approaching 90%. In part 3, I’ll use the BrillTagger to get the accuracy up to and over 90%.

Brill Tagging

The BrillTagger is different than the previous taggers. For one, it’s not a SequentialBackoffTagger, though it does use an initial tagger, which in our case will be the raubt_tagger from part 2. The BrillTagger uses the initial tagger to produce initial tags, then corrects those tags based on transformational rules. These rules are learned by training with the FastBrillTaggerTrainer and rules templates. Here’s an example, with templates copied from the demo() function in Refer to part 1 for the backoff_tagger function and the train_sents, and part 2 for the word_patterns.

import nltk.tag
from nltk.tag import brill

raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
    nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],

templates = [
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
    brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
    brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))

trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)

Brill Tagging Accuracy

So now we have a braubt_tagger. You can tweak the max_rules and min_score params, but be careful, as increasing the values will exponentially increase the training time without significantly increasing accuracy. In fact, I found that increasing the min_score tended to decrease the accuracy by a percent or 2. So here’s how the braubt_tagger fares against the other taggers.


There’s certainly more you can do for part-of-speech tagging with nltk, but the braubt_tagger should be good enough for many purposes. The most important component of part-of-speech tagging is using the correct training data. If you want your tagger to be accurate, you need to train it on a corpus similar to the text you’ll be tagging. The brown, conll2000, and treebank corpora are what they are, and you shouldn’t assume that a tagger trained on them will be accurate on a different corpus. For example, a tagger trained on one part of the brown corpus may be 90% accurate on other parts of the brown corpus, but only 50% accurate on the conll2000 corpus. But a tagger trained on the conll2000 corpus will be accurate for the treebank corpus, and vice versa, because conll2000 and treebank are quite similar. So make sure you choose your training data carefully.

Permalink 6 Comments

Part of Speech Tagging with NLTK – Part 2

November 10, 2008 at 2:42 pm (python) (, , )

Following up on Part of Speech Tagging with NLTK – Part 1, I test the accuracy of adding an AffixTagger and a RegexpTagger to my SequentialBackoffTagger chain.

Affix Tagging

The AffixTagger learns prefix and suffix patterns to determine the part of speech tag for word. I tried inserting the AffixTagger into every possible position of the ubt_tagger to see which method increased accuracy the most. As you’ll see in the results, the aubt_tagger had the highest accuracy.

ubta_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.AffixTagger])
ubat_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.AffixTagger, nltk.tag.TrigramTagger])
uabt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.AffixTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
aubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])

Regexp Tagging

The RegexpTagger allows you to define your own word patterns for determining the part of speech tag. Some of the patterns defined below were taken from chapter 3 of the NLTK book, others I added myself. Since I had already determined that the aubt_tagger was the most accurate, I only tested the RegexpTagger at the beginning and end of the chain.

word_patterns = [
	(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
	(r'.*ould$', 'MD'),
	(r'.*ing$', 'VBG'),
	(r'.*ed$', 'VBD'),
	(r'.*ness$', 'NN'),
	(r'.*ment$', 'NN'),
	(r'.*ful$', 'JJ'),
	(r'.*ious$', 'JJ'),
	(r'.*ble$', 'JJ'),
	(r'.*ic$', 'JJ'),
	(r'.*ive$', 'JJ'),
	(r'.*ic$', 'JJ'),
	(r'.*est$', 'JJ'),
	(r'^a$', 'PREP'),

aubtr_tagger = nltk.tag.RegexpTagger(word_patterns, backoff=aubt_tagger)
raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],

Affix and Regexp Tagging Accuracy


As you can see, the aubt_tagger provided the most gain over the ubt_tagger, and the raubt_tagger had a slight gain on top of that. In Part 3 I’ll discuss the results of using the BrillTagger to push the accuracy even higher.

Permalink 1 Comment

Part of Speech Tagging with NLTK – Part 1

November 3, 2008 at 6:19 pm (python) (, , )

An important part of weotta’s tag extraction is part of speech tagging, a process of identifying nouns, verbs, adjectives, and other parts of speech in context. NLTK provides the necessary tools for tagging, but doesn’t actually tell you what methods work best, so I decided to find out for myself.

Training and Test Sentences

NLTK has a data package that includes 3 tagged corpora: brown, conll2000, and treebank. I divided each of these corpora into 2 sets, the training set and the testing set. The choice and size of your training set can have a significant effect on the tagging accuracy, so for real world usage, you need to train on a corpus that is very representative of the actual text you want to tag. In particular, the brown corpus has a number of different categories, so choose your categories wisely. I chose these categories primarily because they have a higher occurance of the word food than other categories.

import nltk.corpus, nltk.tag, itertools
from nltk.tag import brill
brownc_sents = nltk.corpus.brown.tagged_sents(categories="c")
brownf_sents = nltk.corpus.brown.tagged_sents(categories="f")
brownp_sents = nltk.corpus.brown.tagged_sents(categories="p")

brown_train = list(itertools.chain(brownc_sents[:1000], brownf_sents[:1000], brownp_sents[:1000]))
brown_test = list(itertools.chain(brownc_sents[1000:2000], brownf_sents[1000:2000], brownp_sents[1000:2000]))

conll_sents = nltk.corpus.conll2000.tagged_sents()
conll_train = list(conll_sents[:4000])
conll_test = list(conll_sents[4000:8000])

treebank_sents = nltk.corpus.treebank.tagged_sents()
treebank_train = list(treebank_sents[:1500])
treebank_test = list(treebank_sents[1500:3000])

Ngram Tagging

I started by testing different combinations of the 3 NgramTaggers: UnigramTagger, BigramTagger, and TrigramTagger. These taggers inherit from SequentialBackoffTagger, which allows them to be chained together for greater accuracy. To save myself a little pain when constructing and training these taggers, I created a utility method for creating a chain of SequentialBackoffTaggers.

def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
	if not backoff:
		backoff = tagger_classes[0](tagged_sents)
		del tagger_classes[0]

	for cls in tagger_classes:
		tagger = cls(tagged_sents, backoff=backoff)
		backoff = tagger

	return backoff

ubt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
utb_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.TrigramTagger, nltk.tag.BigramTagger])
but_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.UnigramTagger, nltk.tag.TrigramTagger])
btu_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.UnigramTagger])
tub_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger])
tbu_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.BigramTagger, nltk.tag.UnigramTagger])

Accuracy Testing

To test the accuracy of a tagger, we can compare it to the test sentences using the nltk.tag.accuracy function.

nltk.tag.accuracy(tagger, test_sents)

Ngram Tagging Accuracy

Ngram Tagging Accuracy

Ngram Tagging Accuracy


The ubt_tagger and utb_taggers are extremely close to each other, but the ubt_tagger is the slight favorite (note that the backoff sequence is in reverse order, so for the ubt_tagger, the TrigramTagger backsoff to the BigramTagger, which backsoff to the UnigramTagger.)

Update: in Part of Speech Tagging with NLTK – Part 2, I do further testing using the AffixTagger and the RegexpTagger to get the accuracy up past 80%.

Permalink 4 Comments