Part of Speech Tagging with NLTK – Part 1

November 3, 2008 at 6:19 pm (python)

An important part of weotta’s tag extraction is part of speech tagging, a process of identifying nouns, verbs, adjectives, and other parts of speech in context. NLTK provides the necessary tools for tagging, but doesn’t actually tell you what methods work best, so I decided to find out for myself.

Training and Test Sentences

NLTK has a data package that includes 3 tagged corpora: brown, conll2000, and treebank. I divided each of these corpora into 2 sets, the training set and the testing set. The choice and size of your training set can have a significant effect on tagging accuracy, so for real-world usage you need to train on a corpus that is representative of the actual text you want to tag. In particular, the brown corpus has a number of different categories, so choose your categories wisely. I chose the categories below primarily because they have a higher occurrence of the word food than other categories.

import nltk.corpus, nltk.tag, itertools
from nltk.tag import brill
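# PRESS: REVIEWS (named "reviews" in later NLTK releases)
brownc_sents = nltk.corpus.brown.tagged_sents(categories="c")
# POPULAR LORE (named "lore" in later NLTK releases)
brownf_sents = nltk.corpus.brown.tagged_sents(categories="f")
# FICTION: ROMANCE (named "romance" in later NLTK releases)
brownp_sents = nltk.corpus.brown.tagged_sents(categories="p")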

brown_train = list(itertools.chain(brownc_sents[:1000], brownf_sents[:1000], brownp_sents[:1000]))
brown_test = list(itertools.chain(brownc_sents[1000:2000], brownf_sents[1000:2000], brownp_sents[1000:2000]))

conll_sents = nltk.corpus.conll2000.tagged_sents()
conll_train = list(conll_sents[:4000])
conll_test = list(conll_sents[4000:8000])

treebank_sents = nltk.corpus.treebank.tagged_sents()
treebank_train = list(treebank_sents[:1500])
treebank_test = list(treebank_sents[1500:3000])
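The category choice above rests on how often a target word shows up in each category. A minimal sketch of that comparison follows, assuming a plain mapping from category name to a list of tokens. The helper names and the toy data are hypothetical; with NLTK, the token lists could come from something like nltk.corpus.brown.words(categories=...).

```python
def relative_frequency(words, target):
    """Return the share of tokens equal to `target`, ignoring case."""
    words = [w.lower() for w in words]
    if not words:
        return 0.0
    return words.count(target.lower()) / len(words)


def rank_categories(words_by_category, target):
    """Sort category names by relative frequency of `target`, highest first."""
    scores = {
        category: relative_frequency(words, target)
        for category, words in words_by_category.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy stand-in data (hypothetical, not taken from the brown corpus).
toy_words = {
    "reviews": "the food was good and the food was cheap".split(),
    "lore": "old tales about food and drink".split(),
    "news": "the council met on monday".split(),
}
ranking = rank_categories(toy_words, "food")
```

Relative frequency is used instead of a raw count so that large categories do not win simply by being large.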

Ngram Tagging

I started by testing different combinations of the 3 NgramTaggers: UnigramTagger, BigramTagger, and TrigramTagger. These taggers inherit from SequentialBackoffTagger, which allows them to be chained together for greater accuracy. To save myself a little pain when constructing and training these taggers, I created a utility method for creating a chain of SequentialBackoffTaggers.

def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
	# Train each class in order; every new tagger backs off to the one
	# trained before it, so the last class in the list is the outermost tagger.
	# The caller's tagger_classes list is left unmodified.
	for cls in tagger_classes:
		backoff = cls(tagged_sents, backoff=backoff)

	return backoff

# train_sents stands for one of brown_train, conll_train, or treebank_train
ubt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
utb_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.TrigramTagger, nltk.tag.BigramTagger])
but_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.UnigramTagger, nltk.tag.TrigramTagger])
btu_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.UnigramTagger])
tub_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger])
tbu_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.BigramTagger, nltk.tag.UnigramTagger])
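To make the backoff idea concrete, here is a toy sketch in plain Python. It is not NLTK's implementation: ToyNgramTagger is a hypothetical class that looks up the most frequent tag for a (previous tags, word) context and asks its backoff tagger whenever it has never seen that context. The training sentences are made up for illustration.

```python
from collections import Counter, defaultdict


class ToyNgramTagger:
    """Toy n-gram tagger: looks up (previous n-1 tags, word), else asks its backoff."""

    def __init__(self, n, train_sents, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            history = []
            for word, tag in sent:
                counts[self._context(history, word)][tag] += 1
                history.append(tag)
        # Keep only the most frequent tag seen for each context.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, history, word):
        # A unigram tagger (n == 1) uses no tag history at all.
        prev = history[-(self.n - 1):] if self.n > 1 else []
        return (tuple(prev), word)

    def choose_tag(self, history, word):
        tag = self.table.get(self._context(history, word))
        if tag is None and self.backoff is not None:
            return self.backoff.choose_tag(history, word)
        return tag

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.choose_tag(history, word))
        return list(zip(words, history))


# Made-up training data: "duck" is usually a noun, but a verb after "they".
train = [
    [("the", "DT"), ("duck", "NN"), ("swims", "VBZ")],
    [("they", "PRP"), ("duck", "VBP"), ("quickly", "RB")],
    [("a", "DT"), ("duck", "NN"), ("quacks", "VBZ")],
]
unigram = ToyNgramTagger(1, train)
bigram = ToyNgramTagger(2, train, backoff=unigram)

unigram_only = unigram.tag(["they", "duck"])
with_backoff = bigram.tag(["they", "duck", "the", "duck"])
```

In this toy run the bigram tagger uses the preceding PRP tag to label the first "duck" as a verb, while the unigram tagger alone picks the more common noun tag. The bigram tagger never saw "the" after a verb, so it falls through to the unigram tagger for that word.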

Accuracy Testing

To test the accuracy of a tagger, we can compare its output on the test sentences against their gold-standard tags using the nltk.tag.accuracy function.

nltk.tag.accuracy(tagger, test_sents)
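For readers on a different NLTK version, the measurement itself is easy to reproduce by hand. The sketch below assumes accuracy means the fraction of tokens whose predicted tag matches the gold tag. The function name tagging_accuracy and the toy tagger are hypothetical; with NLTK, tag_fn could be a trained tagger's tag method.

```python
def tagging_accuracy(tag_fn, gold_sents):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for sent in gold_sents:
        # Strip the gold tags, re-tag the bare words, then compare token by token.
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, guess) in zip(sent, predicted):
            if gold_tag == guess:
                correct += 1
            total += 1
    return correct / total if total else 0.0


def tag_everything_as_noun(words):
    """Toy tagger (hypothetical): labels every word NN."""
    return [(word, "NN") for word in words]


gold = [
    [("the", "DT"), ("dog", "NN")],
    [("cats", "NNS"), ("food", "NN")],
]
score = tagging_accuracy(tag_everything_as_noun, gold)
```

Counting per token rather than per sentence means long sentences weigh more than short ones, which matches the usual way tagging accuracy is reported.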

[Figure: Ngram Tagging Accuracy]

Conclusion

The ubt_tagger and utb_tagger are extremely close to each other, but the ubt_tagger is the slight favorite. Note that the backoff sequence is the reverse of the construction order: for the ubt_tagger, the TrigramTagger backs off to the BigramTagger, which backs off to the UnigramTagger.

Update: in Part of Speech Tagging with NLTK – Part 2, I do further testing using the AffixTagger and the RegexpTagger to get the accuracy up past 80%.

4 Comments

  1. Part of Speech Tagging with NLTK - Part 2 « Stream Hacker said,

    […] 10, 2008 at 2:42 pm (python) (nltk, nltp, tagging) Following up on Part of Speech Tagging with NLTK – Part 1, I test the accuracy of adding an AffixTagger and a RegexpTagger to my SequentialBackoffTagger […]

  2. Andrew Lee said,

    For NLTK version 0.9.9b1, the call to tagged_sents in

    nltk.corpus.brown.tagged_sents(categories="c")

    throws the error:

    Traceback (most recent call last):
      File "", line 2, in ?
      File "/usr/lib/python2.4/site-packages/nltk/corpus/reader/tagged.py", line 211, in tagged_sents
        return TaggedCorpusReader.tagged_sents(
      File "/usr/lib/python2.4/site-packages/nltk/corpus/reader/tagged.py", line 148, in tagged_sents
        tag_mapping_function)
      File "/usr/lib/python2.4/site-packages/nltk/corpus/reader/util.py", line 409, in concat
        raise ValueError('concat() expects at least one object!')
    ValueError: concat() expects at least one object!

  3. Jacob said,

    The NLTK corpus API has changed since I wrote this. Try with categories=['reviews']

  4. Andrew Lee said,

    Thank you!
