How to Train an NLTK Chunker

December 29, 2008 at 8:19 am (python)

In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.


The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get very complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the conll2000 corpus. Here's a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank), and return a list of (tag, iob) sequences.

import nltk.chunk

def conll_tag_chunks(chunk_sents):
    # flatten each chunk tree into (word, pos, iob) triples, then drop the words
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
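To make the transformation concrete, here's a standalone sketch (no corpus required) using a hand-built list of CoNLL-style triples standing in for the output of tree2conlltags:

```python
# Hand-built (word, pos-tag, iob-tag) triples for one sentence, standing in
# for the output of nltk.chunk.tree2conlltags(tree).
chunk_tags = [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'),
              ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'B-VP')]

# Drop the words, keeping (pos-tag, iob-tag) pairs -- this is what the
# chunker actually trains on.
train_pairs = [(t, c) for (w, t, c) in chunk_tags]
# train_pairs == [('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP'), ('VBD', 'B-VP')]
```

B-NP marks the beginning of a noun phrase chunk and I-NP a continuation of it, so the chunker's job reduces to predicting an IOB tag from a part-of-speech tag sequence.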


So how accurate is the trained chunker? Here’s the rest of the code, followed by a chart of the accuracy results. Note that I’m only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.

import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    # unigram chunker: the most likely iob tag for each pos tag
    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    # bigram chunker, backing off to the unigram chunker
    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    # trigram chunker, backing off to the bigram chunker
    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    # trigram chunker, backing off directly to the unigram chunker
    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    # bigram chunker, backing off to the unigram-backed trigram chunker
    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
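Backoff chaining works the same way for chunk tags as it does for part-of-speech tags: the higher-order model only answers for contexts it has seen, and otherwise defers down the chain. Here's a toy sketch of that idea using plain dicts (not NLTK's actual classes), with a bigram model backing off to a unigram model:

```python
# Toy backoff: a unigram table mapping pos tag -> iob tag, and a bigram table
# mapping (previous pos tag, pos tag) -> iob tag. These are illustrative
# hand-built tables, not learned from a corpus.
unigram = {'DT': 'B-NP', 'JJ': 'I-NP', 'NN': 'I-NP', 'VBD': 'B-VP'}
bigram = {('VBD', 'DT'): 'B-NP'}

def tag_with_backoff(tags):
    out, prev = [], None
    for t in tags:
        # prefer the bigram answer; fall back to the unigram answer
        iob = bigram.get((prev, t)) or unigram.get(t)
        out.append((t, iob))
        prev = t
    return out

# tag_with_backoff(['DT', 'NN', 'VBD', 'DT', 'NN']) ==
# [('DT', 'B-NP'), ('NN', 'I-NP'), ('VBD', 'B-VP'), ('DT', 'B-NP'), ('NN', 'I-NP')]
```

NLTK's NgramTagger classes implement the same fallback logic via the backoff constructor argument, which is why the chunkers above are built as chains.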
[Chart: Accuracy for Trained Chunker]

The ub_chunker and utb_chunker tie for the highest accuracy, so in practice I suggest using the ub_chunker, since it takes slightly less time to train.


Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is reusable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.
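One thing to remember at chunking time: the chunker was trained on (tag, iob) pairs, so you have to POS-tag your tokens first, strip off the words, run the tags through the chunker, then re-attach the words. Here's a sketch of that round trip with the tagger and chunker stubbed out as dict lookups (in real use these would be your trained tagger and the ub_chunker, each called via .tag()):

```python
# Stub lookups standing in for a trained pos tagger and a trained chunker.
pos_lookup = {'the': 'DT', 'little': 'JJ', 'dog': 'NN', 'barked': 'VBD'}
iob_lookup = {'DT': 'B-NP', 'JJ': 'I-NP', 'NN': 'I-NP', 'VBD': 'B-VP'}

tokens = ['the', 'little', 'dog', 'barked']
tagged_toks = [(w, pos_lookup[w]) for w in tokens]   # like tagger.tag(tokens)
words, tags = zip(*tagged_toks)                      # unzip words from pos tags
chunks = [(t, iob_lookup[t]) for t in tags]          # like chunker.tag(tags)
result = [(w, t, c) for (w, (t, c)) in zip(words, chunks)]  # re-zip the words
# result == [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'),
#            ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'B-VP')]
```

Skipping the unzip step and feeding (word, tag) pairs straight into the chunker is a common mistake; the chunker has only ever seen bare POS tags, so every token comes back tagged None.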


  1. Col Wilson said,

    Hi there, thanks for the article, but I can’t seem to get it to work. I have written a class like this, around what you suggest (I think):

    class Chunker:

        def __init__(self):
            def conll_tag_chunks(chunk_sents):
                tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
                return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
            train_sents = nltk.corpus.conll2000.chunked_sents()
            train_chunks = conll_tag_chunks(train_sents)
            logger.debug('training u_chunker')
            u_chunker = UnigramTagger(train=train_chunks)
            logger.debug('training ub_chunker')
            ub_chunker = BigramTagger(train=train_chunks, backoff=u_chunker)
            #ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
            #ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
            #utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
            logger.debug('finished training')
            self.chunker = ub_chunker

        def chunk(self, tokens):
            return self.chunker.tag(tokens)

    and tried to do this:

    chunker = Chunker()
    s = "Since then, we've changed how we use Python a ton internally."
    tokens = s.split()
    chunked = chunker.chunk(tokens)
    print chunked

    which gives:

    [(u'Since', None), (u'then,', None), (u"we've", None), (u'changed', None), (u'how', None), (u'we', None), (u'use', None), (u'Python', None), (u'a', None), (u'ton', None), (u'internally.', None)]

    In other words, nothing at all gets chunked.

    Have I missed something?


  2. Jacob said,

    Hi Col,

    It looks like you left out a step: part-of-speech tagging. The chunker requires tagged tokens, like [('foo', 'JJ'), ('bar', 'NN')], in order to extract chunks. So you'll have to train a part-of-speech tagger as well as the chunker, then run the tokens through the tagger, and use that output as input to the chunker. Check out my articles about part-of-speech tagging, starting with Part 1. You also may want to look at the NLTK Chunking Guide.

  3. Col Wilson said,

    I tried that without success. My Tagger class (from your earlier article) looks like this:

    import nltk
    from nltk.tag import brill
    import logging
    logger = logging.getLogger("ballyclare.tagger")

    class Tagger:

        def __init__(self, sentences=1000, corpus=nltk.corpus.brown):
            logger.debug('training with ' + str(sentences) + ' sentences')
            train_sents = corpus.tagged_sents()[:sentences]

            def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
                if not backoff:
                    backoff = tagger_classes[0](tagged_sents)
                    del tagger_classes[0]

                for cls in tagger_classes:
                    tagger = cls(tagged_sents, backoff=backoff)
                    backoff = tagger

                return backoff

            word_patterns = [
                (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
                (r'.*ould$', 'MD'),
                (r'.*ing$', 'VBG'),
                (r'.*ed$', 'VBD'),
                (r'.*ness$', 'NN'),
                (r'.*ment$', 'NN'),
                (r'.*ful$', 'JJ'),
                (r'.*ious$', 'JJ'),
                (r'.*ble$', 'JJ'),
                (r'.*ic$', 'JJ'),
                (r'.*ive$', 'JJ'),
                (r'.*ic$', 'JJ'),
                (r'.*est$', 'JJ'),
                (r'^a$', 'PREP'),
            ]

            raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
                nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
                backoff=nltk.tag.RegexpTagger(word_patterns))

            templates = [
                brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
                brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
                brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))
            ]

            trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
            logger.debug('starting training')
            braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)
            logger.debug('finished training')
            self.tagger = braubt_tagger

        def tag(self, sentence):
            return self.tagger.tag(sentence)

    and it gives me something like:

    [('Further', 'AP'), ('snow', None), ('is', 'BEZ'), ('expected', 'VBN'), ('to', 'TO'), ('push', None), ('into', 'IN'), ('many', 'AP'), ('southern', 'JJ-TL'), ('and', 'CC'), ('eastern', 'JJ-TL'), ('parts', 'NNS'), ('of', 'IN'), ('England,', None), ('including', 'IN'), ('London,', None), ('overnight', 'NN'), ('and', 'CC'), ('during', 'IN'), ('the', 'AT'), ('day', 'NN'), ('on', 'IN'), ('Friday.', None)]

    However when I feed this into the chunker I still get nothing:

    [(('Further', 'AP'), None), (('snow', None), None), (('is', 'BEZ'), None), (('expected', 'VBN'), None), (('to', 'TO'), None), (('push', None), None), (('into', 'IN'), None), (('many', 'AP'), None), (('southern', 'JJ-TL'), None), (('and', 'CC'), None), (('eastern', 'JJ-TL'), None), (('parts', 'NNS'), None), (('of', 'IN'), None), (('England,', None), None), (('including', 'IN'), None), (('London,', None), None), (('overnight', 'NN'), None), (('and', 'CC'), None), (('during', 'IN'), None), (('the', 'AT'), None), (('day', 'NN'), None), (('on', 'IN'), None), (('Friday.', None), None)]

    Is it I wonder because not all tokens get tags?

    Thanks for your help so far.

  4. Jacob said,

    Ok, I forgot to mention a major detail: notice how the train_chunks are created by taking [(t, c) for (w, t, c) in chunk_tags]? You need to do the same thing with your part-of-speech tagged tokens. Unzip the words from the part-of-speech tags, run the tags through the chunker, giving you part-of-speech tags + chunk tags, then re-zip the words. Here's some code to illustrate:

    tagged_toks = self.tagger.tag(sentence)
    (words, tags) = zip(*tagged_toks)
    chunks = self.chunker.tag(tags)
    return [(w, t, c) for (w, (t, c)) in zip(words, chunks)]

    Hope that helps. Perhaps I should write an article about putting it all together.

  5. Col Wilson said,

    Aha! Results. Not very good, because the text is quite different from the training texts, but results nonetheless.


    Yes, it would be nice to see a working example for the more challenged of us.
