Part of Speech Tagging with NLTK – Part 3

December 3, 2008 at 10:14 am (python)

In part 2, I showed how to produce a part-of-speech tagger using Ngram tagging in combination with Affix and Regex tagging, with accuracy approaching 90%. In part 3, I’ll use the BrillTagger to get the accuracy up to and over 90%.

Brill Tagging

The BrillTagger is different from the previous taggers. For one, it’s not a SequentialBackoffTagger, though it does use an initial tagger, which in our case will be the raubt_tagger from part 2. The BrillTagger uses the initial tagger to produce initial tags, then corrects those tags based on transformational rules. These rules are learned by training the FastBrillTaggerTrainer with rule templates. Here’s an example, with templates copied from the demo() function in nltk.tag.brill. Refer to part 1 for the backoff_tagger function and the train_sents, and part 2 for the word_patterns.

import nltk.tag
from nltk.tag import brill

raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
    nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))

templates = [
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
    brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
    brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))
]

trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)

Brill Tagging Accuracy

So now we have a braubt_tagger. You can tweak the max_rules and min_score params, but be careful: increasing the values will dramatically increase the training time without significantly increasing accuracy. In fact, I found that increasing the min_score tended to decrease the accuracy by a percent or two. To see how the braubt_tagger fares against the other taggers, you can evaluate each one on the same test sentences, as in the sketch below.
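
Here’s a minimal evaluation sketch, assuming the test_sents split off in part 1 is still in scope; every nltk tagger provides an evaluate() method that reports accuracy against gold-standard tagged sentences.

# assumes test_sents was created in part 1 from the same tagged corpus as train_sents
raubt_accuracy = raubt_tagger.evaluate(test_sents)
braubt_accuracy = braubt_tagger.evaluate(test_sents)
print 'raubt accuracy: %f' % raubt_accuracy
print 'braubt accuracy: %f' % braubt_accuracy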

Conclusion

There’s certainly more you can do for part-of-speech tagging with nltk, but the braubt_tagger should be good enough for many purposes. The most important component of part-of-speech tagging is using the correct training data. If you want your tagger to be accurate, you need to train it on a corpus similar to the text you’ll be tagging. The brown, conll2000, and treebank corpora are what they are, and you shouldn’t assume that a tagger trained on them will be accurate on a different corpus. For example, a tagger trained on one part of the brown corpus may be 90% accurate on other parts of the brown corpus, but only 50% accurate on the conll2000 corpus. But a tagger trained on the conll2000 corpus will be accurate for the treebank corpus, and vice versa, because conll2000 and treebank are quite similar. So make sure you choose your training data carefully.
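
If you want to see the cross-corpus effect for yourself, here’s a rough sketch, assuming the backoff_tagger function from part 1 and the word_patterns from part 2 are in scope. It trains on the brown news category and evaluates on the conll2000 test file; note the two corpora don’t even use the same tagset, which is a big part of why accuracy drops, and the exact numbers will vary with your setup.

import nltk.tag
from nltk.corpus import brown, conll2000

# train on brown news sentences, then evaluate on a completely different corpus
brown_train = brown.tagged_sents(categories='news')
conll_test = conll2000.tagged_sents('test.txt')

cross_tagger = backoff_tagger(brown_train, [nltk.tag.AffixTagger,
    nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))

print 'cross-corpus accuracy: %f' % cross_tagger.evaluate(conll_test)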


6 Comments

  1. Joe said,

    I played around with Brown/Treebank/conll2000 a little bit. Did you test with nltk.tag.pos_tag()? It loads a pickle to do the tagging. I’m asking because it seemed to perform comparably or better, and was already set up.

  2. Jacob said,

    I have not tested nltk.tag.pos_tag() (I’m pretty sure it wasn’t released when I wrote this series). I believe it was trained with most or all of the available corpora, which would definitely make it more accurate. However, it’ll only have high accuracy for text that’s similar to the corpora it was trained on. If you’re tagging text that has a lot of specialty/unique words and phrases, you’ll need to create your own training data in order to get accurate results.
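
    If you just want to try it, basic usage looks something like this (a quick sketch using the stock pre-trained tagger, not the braubt_tagger from the post):

    import nltk

    # the bundled tagger: tokenize a raw sentence, then tag it with the pre-trained model
    print nltk.pos_tag(nltk.word_tokenize('And now for something completely different'))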

  3. Honza said,

    Hi Jacob, nice tutorial! I gotta ask, and it may be a dumb question, because I’m new to Python. Let’s say I’m going to tag a large amount of text, sentence by sentence. Right now I train the tagger, tag the sentence, then train it again, and again… It takes a lot of time, of course. So, how can I “store” the trained tagger for further use, so I can train it once and then re-use it for a couple of sentences? Again, sorry for asking this stupid question, and thanks for the answer in advance! Honza

  4. Jacob said,

    First, you don’t need to retrain a tagger for each sentence. Just train a tagger once, then run it on all your sentences, or at least a big batch. Only retrain if you significantly change the training corpus. For storage, all you have to do is pickle the tagger to a file on disk, then you can reload it later.
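
    Something along these lines should do it (a minimal sketch; the filename is arbitrary, and braubt_tagger stands for whatever trained tagger you have):

    import pickle

    # save the trained tagger to disk once
    f = open('braubt_tagger.pickle', 'wb')
    pickle.dump(braubt_tagger, f)
    f.close()

    # later, load it back instead of retraining
    f = open('braubt_tagger.pickle', 'rb')
    tagger = pickle.load(f)
    f.close()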

  5. rotzbouw said,

    Hi,

    Quick (and probably stupid) question: I’ve got this Turkish Treebank tagged corpus and some code that does what you do above; my question is, how do I go about actually using the tagger on an unknown sentence after training?

    Cheers,
    rotzbouw

  6. Jacob said,

    Once you’ve got a tagger, then to tag a new sentence, you first need to tokenize it so it’s a list of words, then pass that list into the tagger’s tag method, something like tagger.tag(tokenize.word_tokenize(sent)).
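
    In other words, something like this quick sketch, where tagger stands for whatever tagger you trained on your corpus:

    from nltk import tokenize

    # tokenize the raw sentence into a word list, then tag it
    sent = 'This is a test sentence.'
    print tagger.tag(tokenize.word_tokenize(sent))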
