Building an NLTK FreqDist on Redis

May 20, 2009 at 1:27 pm (python)

Say you want to build a frequency distribution of many thousands of samples with the following characteristics:

  • fast to build
  • persistent data
  • network accessible (with no locking requirements)
  • can store large sliceable index lists

The only solution I know that meets those requirements is Redis. NLTK's FreqDist is not persistent, shelve is far too slow, BerkeleyDB is not network accessible (and is generally a PITA to manage), and AFAIK there's no other key-value store that makes sliceable lists really easy to create & access. So far I've been quite pleased with Redis, especially given how new it is. It's quite fast and network accessible, its atomic operations make locking unnecessary, it supports sortable and sliceable list structures, and it's very easy to configure.

Classification

Building a FreqDist allows you to create a ProbDist, which in turn can be used for classification. Having it be persistent lets you examine the data later. And the ability to create sliceable lists allows you to make sorted indexes for paging through your samples.
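
As a quick illustration of that first point, here's a minimal sketch of how a frequency distribution becomes a probability distribution, using a plain in-memory FreqDist, NLTK's ELEProbDist, and made-up samples; a RedisFreqDist, described below, is meant to fill the same role:

from nltk.probability import FreqDist, ELEProbDist

fd = FreqDist()
for word in ['to', 'be', 'or', 'not', 'to', 'be']:
    fd.inc(word)  # count each sample (the NLTK 0.9.x API)

# expected likelihood estimate gives a smoothed probability distribution
pd = ELEProbDist(fd)
print pd.prob('to')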

These are just a few of the concrete use cases for persistent frequency distributions.

RedisFreqDist

I put the code I've been using to build frequency distributions over large sets of words up at BitBucket. probablity.py contains RedisFreqDist, which works just like the NLTK FreqDist, except it stores samples and frequencies as keys and values in Redis. That means samples must be strings. Internally, RedisFreqDist also stores a set of all the samples under the key __samples__ for efficient lookup and sorting. Here's some example code for using it. For more info, check out the wiki, or read the code.

# RedisFreqDist lives in probablity.py from the BitBucket repo
from probablity import RedisFreqDist

def make_freq_dist(samples, host='localhost', port=6379, db=0):
	freqs = RedisFreqDist(host=host, port=port, db=db)

	# count each sample in Redis
	for sample in samples:
		freqs.inc(sample)

	return freqs

Unfortunately, I had to muck about with some of FreqDist’s internal implementation to remain compatible, so I can’t promise the code will work beyond NLTK version 0.9.9. probablity.py also includes ConditionalRedisFreqDist for creating ConditionalProbDists.
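
Once built, a RedisFreqDist should answer the usual FreqDist queries. Here's a hedged sketch, assuming it really is a drop-in replacement for the NLTK 0.9.9 FreqDist as described above:

freqs = RedisFreqDist(host='localhost', port=6379, db=0)
print freqs['redis']       # frequency count for the sample 'redis'
print freqs.N()            # total number of counted sample outcomes
print freqs.freq('redis')  # relative frequency of 'redis'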

Lists

How you create lists of samples very much depends on your use case, but here's some example code for doing so. r is a redis client object, key is the index key for storing the list, and samples is assumed to be a sorted list. The get_samples function demonstrates how to get a slice of samples from the list.

def index_samples(r, key, samples):
	# clear any existing index, then push each sample onto the end of the list
	r.delete(key)

	for sample in samples:
		r.push(key, sample, tail=True)

def get_samples(r, key, start, end):
	# return the slice of samples from start to end (inclusive)
	return r.lrange(key, start, end)
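
And here's a hypothetical usage sketch tying the pieces together; it assumes the redis-py client of that era (a redis.Redis connection object) and a RedisFreqDist built with make_freq_dist above:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
freqs = make_freq_dist(['redis', 'nltk', 'redis'])
# sort the samples to build a pageable index
index_samples(r, 'sample_index', sorted(freqs.samples()))
page = get_samples(r, 'sample_index', 0, 49)  # first 50 samples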

Yes, Redis is still fairly alpha, so I wouldn’t use it for critical systems. But I’ve had very few issues so far, especially compared to dealing with BerkeleyDB. I highly recommend it for your non-critical computational needs :)


Deploying Django with Mercurial, Fab and Nginx

April 26, 2009 at 11:00 am (python)

Writing web apps with Django can be a lot of fun, but deploying them can be a chore, even if you’re using Apache. Here’s a setup I’ve been using that makes deployment fast and easy. This all assumes you’ve got sudo access on a remote server running Ubuntu or something similar.

Mercurial

This setup assumes you've got two mercurial repositories: one on your local machine, and one on the remote server you're deploying to. In the remote repository, add the following to .hg/hgrc:

[hooks]
changegroup = hg up

This makes mercurial run hg up whenever you push new code. Then in your local repo's .hg/hgrc, make sure the default path is to your remote repo. Here's an example:

[paths]
default = ssh://user@domain.com/repo

Now when you run hg push, you don’t need to include the path to the repo, and your code will be updated immediately.

FastCGI

Since I’m using nginx instead of Apache, we’ll be deploying Django with FastCGI. Here’s an example script you can use to start and restart your Django FastCGI server. Add this script to your mercurial repo as run_fcgi.sh.

#!/bin/bash
PIDFILE="/tmp/django.pid"
SOCKET="/tmp/django.sock"

# kill current fcgi process if it exists
if [ -f $PIDFILE ]; then
    kill `cat -- $PIDFILE`
    rm -f -- $PIDFILE
fi

python manage.py runfcgi socket=$SOCKET pidfile=$PIDFILE method=prefork

Important note: the FastCGI socket file will need to be readable & writable by nginx worker processes, which run as the www-data user in Ubuntu. This will be handled by the fab restart command below, or you could add chmod a+w $SOCKET to the end of the above script.

Nginx

Nginx is a great high performance web server with simple configuration. Here’s a simple example server config for proxying to your Django FastCGI process. Add this config to your mercurial repo as django.nginx.

server {
    listen 80;
    # change to your FQDN
    server_name YOUR.DOMAIN.COM;

    location / {
        # must be the same socket file as in the above fcgi script
        fastcgi_pass unix:/tmp/django.sock;
    }
}

On the remote server, make sure the following lines are in the http section of /etc/nginx/nginx.conf:

include /etc/nginx/sites-enabled/*;
# fastcgi_params should contain a lot of fastcgi_param variables
include /etc/nginx/fastcgi_params;

You must also make sure there is a link in /etc/nginx/sites-enabled to your django.nginx config. Don't worry if django.nginx doesn't exist yet; it will once you run fab nginx the first time.

you@remote.ubuntu$ cd /etc/nginx/sites-enabled
you@remote.ubuntu$ sudo ln -s ../sites-available/django.nginx django.nginx

Fab

Fab, or properly Fabric, is my favorite new tool. It's designed specifically for making remote deployment simple and easy. You create a fabfile where each function is a fab command that can run remote and sudo commands on one or more remote hosts. So let's deploy Django using fab. Here's an example fabfile with two commands: restart and nginx. These commands should only be run after you've done an hg push.

config.fab_hosts = ['YOUR.DOMAIN.COM']
config.projdir = '/PATH/TO/YOUR/REMOTE/HG/REPO'

def restart():
    sudo('cd %(projdir)s; run_fcgi.sh', user='www-data', fail='abort')

def nginx():
    sudo('cp %(projdir)s/django.nginx /etc/nginx/sites-available/', fail='abort')
    sudo('killall -HUP nginx', fail='abort')

restart

You only need to run fab restart if you’ve changed the actual python code. Changes to templates or static files don’t require a restart and will be used automatically (because of the hg up changegroup hook). Executing run_fcgi.sh as the www-data user ensures that nginx can read & write the socket.

nginx

If you’ve changed your nginx server config, you can run fab nginx to install and reload the new server config without restarting the nginx server.

Wrap Up

Now that everything is set up, the next time you want to deploy some new code, it's as simple as hg push && fab restart. And if you've only changed templates, all you need to do is hg push. I hope this helps make your Django development life easier. It has certainly done so for me :)


Django Datetime Snippets

April 13, 2009 at 9:12 am (python)

I’ve started posting over at Django snippets, which is a great resource for finding useful bits of functionality. My first set of snippets is focused on datetime conversions.

The Snippets

FuzzyDateTimeField is a drop-in replacement for the standard DateTimeField that uses dateutil.parser with fuzzy=True to clean the value, allowing the parser to be more liberal with the input formats it accepts.
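
The snippet itself lives on Django snippets, but the core idea is roughly this (a simplified sketch, not the exact code):

from django import forms
from dateutil import parser

class FuzzyDateTimeField(forms.DateTimeField):
    def clean(self, value):
        if value in (None, ''):
            # let DateTimeField handle empty values and required checks
            return super(FuzzyDateTimeField, self).clean(value)
        try:
            # accept anything dateutil can make sense of
            return parser.parse(value, fuzzy=True)
        except (ValueError, TypeError):
            # fall back to the standard DateTimeField error handling
            return super(FuzzyDateTimeField, self).clean(value)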

The isoutc template filter produces an ISO format UTC datetime string from a timezone aware datetime object.

The timeto template filter is a more compact version of django’s timeuntil filter that only shows hours & minutes, such as “1hr 30min”.

JSON encode ISO UTC datetime is a way to encode datetime objects as ISO strings just like the isoutc template filter.

JSON decode datetime is a simplejson object hook for converting the datetime attribute of a JSON object to a python datetime object. This is especially useful if you have a list of objects that all have datetime attributes that need to be decoded.
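
Together, the two JSON snippets give you a round trip along these lines. This is a rough sketch assuming simplejson and UTC datetimes, and the helper names here are made up, not the snippets verbatim:

import datetime
import simplejson
from dateutil import parser

def encode_datetime(obj):
    # passed as simplejson's default handler for objects it can't serialize
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError('%r is not JSON serializable' % obj)

def decode_datetime(d):
    # simplejson object_hook: convert a 'datetime' attribute back to a datetime object
    if 'datetime' in d:
        d['datetime'] = parser.parse(d['datetime'])
    return d

s = simplejson.dumps({'datetime': datetime.datetime.utcnow()}, default=encode_datetime)
obj = simplejson.loads(s, object_hook=decode_datetime)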

Use Case

Imagine you’re making a time based search engine for movies and/or events. Because your data will span many timezones, you decide that all dates & times should be stored on the server as UTC. This pushes local timezone conversion to the client side, where it belongs, simplifying the server side data structures and search operations. You want your search engine to be AJAX enabled, but you don’t like XML because it’s so verbose, so you go with JSON for serialization. You also want users to be able to input their own range based queries without being forced to use specific datetime formats. Leaving out all the hard stuff, the above snippets can be used for communication between a django webapp and a time based search engine.


Dates and Times in Python and Javascript

April 2, 2009 at 9:06 am (javascript, python)

If you are dealing with dates & times in python and/or javascript, there are two must have libraries.

  1. Datejs
  2. python-dateutil

Datejs

Datejs, being javascript, is designed for parsing and creating human readable dates & times. Its powerful parse() function can handle all the dates & times you'd expect, plus fuzzier human readable date words. Here are some examples from their site.

Date.parse("February 20th 1973");
Date.parse("Thu, 1 July 2004 22:30:00");
Date.parse("today");
Date.parse("next thursday");

And if you are programmatically creating Date objects, here’s a few functions I find myself using frequently.

// get a new Date object set to local date
var dt = Date.today();
// get that same Date object set to current time
var dt = Date.today().setTimeToNow();

// set the local time to 10:30 AM
var dt = Date.today().set({hour: 10, minute: 30});
// produce an ISO formatted datetime string converted to UTC
dt.toISOString();

There’s plenty more in the documentation; pretty much everything you need for manipulation, comparison, and string conversion. Datejs cleanly extends the default Date object, has been integrated into a couple date pickers, and supports culture specific parsing for i18n.

python-dateutil

Like Datejs, dateutil also has a powerful parse() function. While it can't handle words like "today" or "tomorrow", it can handle nearly every (American) date format that exists. Here are a few examples.

>>> from dateutil import parser
>>> parser.parse("Thu, 4/2/09 09:00 PM")
datetime.datetime(2009, 4, 2, 21, 0)
>>> parser.parse("04/02/09 9:00PM")
datetime.datetime(2009, 4, 2, 21, 0)
>>> parser.parse("04-02-08 9pm")
datetime.datetime(2009, 4, 2, 21, 0)

An option that comes in especially handy is to pass in fuzzy=True. This tells parse() to ignore unknown tokens while parsing. This next example would raise a ValueError without fuzzy=True.

>>> parser.parse("Thurs, 4/2/09 09:00 PM", fuzzy=True)

It don’t know how well it works for international date formats, but parse() does have options for reading days first and years first, so I’m guessing it can be made to work.

dateutil also provides some great timezone support. I’ve always been surprised at python’s lack of concrete tzinfo classes, but dateutil.tz more than makes up for it (there’s also pytz, but I haven’t figured out why I need it instead of or in addition to dateutil.tz). Here’s a function for parsing a string and returning a UTC datetime object.

from dateutil import parser, tz
def parse_to_utc(s):
    dt = parser.parse(s, fuzzy=True)
    dt = dt.replace(tzinfo=tz.tzlocal())
    return dt.astimezone(tz.tzutc())

dateutil does a lot more than provide tzinfo objects and parse datetimes; it can also calculate relative deltas and handle iCal recurrence rules. I’m sure a whole calendar application could be built based on dateutil, but my interest is in parsing and converting datetimes to and from UTC, and in that respect dateutil excels.


Chunk Extraction with NLTK

February 23, 2009 at 4:02 pm (programming, python)

Chunk extraction is a useful preliminary step to information extraction: it creates parse trees from unstructured text. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.

Chunking is basically a 3 step process:

  1. Tag a sentence
  2. Chunk the tagged sentence
  3. Analyze the parse tree to extract information

I’ve already written about how to train a part of speech tagger and a chunker, so I’ll assume you’ve already done the training, and now you want to use your tagger and chunker to do something useful.

Tag Chunker

The previously trained chunker is actually a chunk tagger. It’s a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I’ve created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.

import nltk.chunk
import itertools

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))

Chunk Extraction

Now that we have a proper chunker, we can use it to extract chunks. Here’s a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.

# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()

Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since we're training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn't similar to your training data, then you probably won't be getting many chunks.
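
To make that concrete, here's the kind of output you might see; the actual tags and chunks depend entirely on your trained tagger and chunker:

>>> tagged = tagger.tag(['the', 'little', 'dog', 'barked'])
>>> tree = chunker.parse(tagged)
>>> for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
...     print subtree.leaves()
[('the', 'DT'), ('little', 'JJ'), ('dog', 'NN')]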


Test Driven Development in Python

February 5, 2009 at 8:46 am (programming, python)

One of my favorite aspects of Python is that it makes practicing TDD very easy. What makes it so frictionless is the doctest module. It allows you to write a test at the same time you define a function. No setup, no boilerplate, just write a function call and the expected output in the docstring. Here’s a quick example of a fibonacci function.

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        >>> fib(1)
        1
        >>> fib(2)
        1
        >>> fib(3)
        2
        >>> fib(4)
        3
        '''
        if n == 0:
                return 0
        elif n == 1:
                return 1
        else:
                return fib(n - 1) + fib(n - 2)

If you want to run your doctests, just add the following three lines to the bottom of your module.

if __name__ == '__main__':
        import doctest
        doctest.testmod()

Now you can run your module to run the doctests, like python fib.py. By default there's no output unless a test fails; pass -v on the command line to see each test as it runs.

So how well does this fit in with the TDD philosophy? Here are the basic TDD practices.

  1. Think about what you want to test
  2. Write a small test
  3. Write just enough code to fail the test
  4. Run the test and watch it fail
  5. Write just enough code to pass the test
  6. Run the test and watch it pass (if it fails, go back to step 4)
  7. Go back to step 1 and repeat until done

And now a step-by-step breakdown of how to do this with doctests, in excruciating detail.

1. Define a new empty method

def fib(n):
        '''Return the nth fibonacci number.'''
        pass

if __name__ == '__main__':
        import doctest
        doctest.testmod()

2. Write a doctest

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        '''
        pass

3. Run the module and watch the doctest fail

python fib.py
**********************************************************************
File "fib1.py", line 3, in __main__.fib
Failed example:
    fib(0)
Expected:
    0
Got nothing
**********************************************************************
1 items had failures:
   1 of   1 in __main__.fib
***Test Failed*** 1 failures.

4. Write just enough code to pass the failing doctest

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        '''
        return 0

5. Run the module and watch the doctest pass

python fib.py

6. Go back to step 2 and repeat

Now you can start filling in the rest of the function, one test at a time. In practice, you may not write code exactly like this, but the point is that doctests provide a really easy way to test your code as you write it.

Unit Tests

Ok, so doctests are great for simple tests. But what if your tests need to be a bit more complex? Maybe you need some external data, or mock objects. In that case, you’ll be better off with more traditional unit tests. But first, take a little time to see if you can decompose your code into a set of smaller functions that can be tested individually. I find that code that is easier to test is also easier to understand.
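
For example, here's what the fib doctests might look like as a traditional unit test (assuming the function lives in fib.py):

import unittest
from fib import fib

class FibTest(unittest.TestCase):
    def test_base_cases(self):
        self.assertEqual(fib(0), 0)
        self.assertEqual(fib(1), 1)

    def test_recursive_cases(self):
        self.assertEqual(fib(4), 3)
        self.assertEqual(fib(5), 5)

if __name__ == '__main__':
    unittest.main()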

Running Tests

For running my tests, I use nose. I have a tests/ directory with a simple configuration file, nose.cfg:

[nosetests]
verbosity=3
with-doctest=1

Then in my Makefile, I add a test command so I can run make test.

test:
        @nosetests --config=tests/nose.cfg tests PACKAGE1 PACKAGE2

PACKAGE1 and PACKAGE2 are optional paths to your code. They could point to unit test packages and/or production code containing doctests.

And finally, if you’re looking for a continuous integration server, try Buildbot.


How to Train an NLTK Chunker

December 29, 2008 at 8:19 am (python)

In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.

Training

The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get very complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the conll2000 corpus. Here's a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank), and return a list of (tag, iob) sequences.

import nltk.chunk

def conll_tag_chunks(chunk_sents):
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]

Accuracy

So how accurate is the trained chunker? Here’s the rest of the code, followed by a chart of the accuracy results. Note that I’m only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.

import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])

Accuracy for Trained Chunker

The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.
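
As a rough sketch of how this fits with the TagChunker wrapper from the chunk extraction post above, you can train the ub_chunker and wrap it like so:

train_chunks = conll_tag_chunks(conll_train)
u_chunker = nltk.tag.UnigramTagger(train_chunks)
ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
# TagChunker is the ChunkParserI wrapper defined in the chunk extraction post
chunker = TagChunker(ub_chunker)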

Conclusion

Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is reusable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.


Part of Speech Tagging with NLTK – Part 3

December 3, 2008 at 10:14 am (python)

In part 2, I showed how to produce a part-of-speech tagger using Ngram tagging in combination with Affix and Regex tagging, with accuracy approaching 90%. In part 3, I’ll use the BrillTagger to get the accuracy up to and over 90%.

Brill Tagging

The BrillTagger is different from the previous taggers. For one, it's not a SequentialBackoffTagger, though it does use an initial tagger, which in our case will be the raubt_tagger from part 2. The BrillTagger uses the initial tagger to produce initial tags, then corrects those tags based on transformational rules. These rules are learned by training with the FastBrillTaggerTrainer and rule templates. Here's an example, with templates copied from the demo() function in nltk.tag.brill.py. Refer to part 1 for the backoff_tagger function and the train_sents, and part 2 for the word_patterns.

import nltk.tag
from nltk.tag import brill

raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
    nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))

templates = [
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
    brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
    brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))
]

trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)

Brill Tagging Accuracy

So now we have a braubt_tagger. You can tweak the max_rules and min_score params, but be careful, as increasing the values will exponentially increase the training time without significantly increasing accuracy. In fact, I found that increasing the min_score tended to decrease the accuracy by a percent or two. So here's how the braubt_tagger fares against the other taggers.
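
You can compare it against the earlier taggers with the same accuracy function used in part 1:

print 'raubt:', nltk.tag.accuracy(raubt_tagger, test_sents)
print 'braubt:', nltk.tag.accuracy(braubt_tagger, test_sents)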

Conclusion

There’s certainly more you can do for part-of-speech tagging with nltk, but the braubt_tagger should be good enough for many purposes. The most important component of part-of-speech tagging is using the correct training data. If you want your tagger to be accurate, you need to train it on a corpus similar to the text you’ll be tagging. The brown, conll2000, and treebank corpora are what they are, and you shouldn’t assume that a tagger trained on them will be accurate on a different corpus. For example, a tagger trained on one part of the brown corpus may be 90% accurate on other parts of the brown corpus, but only 50% accurate on the conll2000 corpus. But a tagger trained on the conll2000 corpus will be accurate for the treebank corpus, and vice versa, because conll2000 and treebank are quite similar. So make sure you choose your training data carefully.


Part of Speech Tagging with NLTK – Part 2

November 10, 2008 at 2:42 pm (python)

Following up on Part of Speech Tagging with NLTK – Part 1, I test the accuracy of adding an AffixTagger and a RegexpTagger to my SequentialBackoffTagger chain.

Affix Tagging

The AffixTagger learns prefix and suffix patterns to determine the part of speech tag for a word. I tried inserting the AffixTagger into every possible position of the ubt_tagger to see which method increased accuracy the most. As you'll see in the results, the aubt_tagger had the highest accuracy.

ubta_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.AffixTagger])
ubat_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.AffixTagger, nltk.tag.TrigramTagger])
uabt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.AffixTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
aubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])

Regexp Tagging

The RegexpTagger allows you to define your own word patterns for determining the part of speech tag. Some of the patterns defined below were taken from chapter 3 of the NLTK book, others I added myself. Since I had already determined that the aubt_tagger was the most accurate, I only tested the RegexpTagger at the beginning and end of the chain.

word_patterns = [
	(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
	(r'.*ould$', 'MD'),
	(r'.*ing$', 'VBG'),
	(r'.*ed$', 'VBD'),
	(r'.*ness$', 'NN'),
	(r'.*ment$', 'NN'),
	(r'.*ful$', 'JJ'),
	(r'.*ious$', 'JJ'),
	(r'.*ble$', 'JJ'),
	(r'.*ic$', 'JJ'),
	(r'.*ive$', 'JJ'),
	(r'.*est$', 'JJ'),
	(r'^a$', 'PREP'),
]

aubtr_tagger = nltk.tag.RegexpTagger(word_patterns, backoff=aubt_tagger)
raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))

Affix and Regexp Tagging Accuracy

Conclusion

As you can see, the aubt_tagger provided the most gain over the ubt_tagger, and the raubt_tagger had a slight gain on top of that. In Part 3 I’ll discuss the results of using the BrillTagger to push the accuracy even higher.


Part of Speech Tagging with NLTK – Part 1

November 3, 2008 at 6:19 pm (python)

An important part of weotta’s tag extraction is part of speech tagging, a process of identifying nouns, verbs, adjectives, and other parts of speech in context. NLTK provides the necessary tools for tagging, but doesn’t actually tell you what methods work best, so I decided to find out for myself.

Training and Test Sentences

NLTK has a data package that includes 3 tagged corpora: brown, conll2000, and treebank. I divided each of these corpora into two sets, the training set and the testing set. The choice and size of your training set can have a significant effect on the tagging accuracy, so for real world usage, you need to train on a corpus that is very representative of the actual text you want to tag. In particular, the brown corpus has a number of different categories, so choose your categories wisely. I chose these categories primarily because they have a higher occurrence of the word food than other categories.

import nltk.corpus, nltk.tag, itertools
from nltk.tag import brill
# PRESS: REVIEWS
brownc_sents = nltk.corpus.brown.tagged_sents(categories="c")
# POPULAR LORE
brownf_sents = nltk.corpus.brown.tagged_sents(categories="f")
# FICTION: ROMANCE
brownp_sents = nltk.corpus.brown.tagged_sents(categories="p")

brown_train = list(itertools.chain(brownc_sents[:1000], brownf_sents[:1000], brownp_sents[:1000]))
brown_test = list(itertools.chain(brownc_sents[1000:2000], brownf_sents[1000:2000], brownp_sents[1000:2000]))

conll_sents = nltk.corpus.conll2000.tagged_sents()
conll_train = list(conll_sents[:4000])
conll_test = list(conll_sents[4000:8000])

treebank_sents = nltk.corpus.treebank.tagged_sents()
treebank_train = list(treebank_sents[:1500])
treebank_test = list(treebank_sents[1500:3000])

Ngram Tagging

I started by testing different combinations of the 3 NgramTaggers: UnigramTagger, BigramTagger, and TrigramTagger. These taggers inherit from SequentialBackoffTagger, which allows them to be chained together for greater accuracy. To save myself a little pain when constructing and training these taggers, I created a utility method for creating a chain of SequentialBackoffTaggers.

def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
	if not backoff:
		backoff = tagger_classes[0](tagged_sents)
		del tagger_classes[0]

	for cls in tagger_classes:
		tagger = cls(tagged_sents, backoff=backoff)
		backoff = tagger

	return backoff

ubt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
utb_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.TrigramTagger, nltk.tag.BigramTagger])
but_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.UnigramTagger, nltk.tag.TrigramTagger])
btu_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.UnigramTagger])
tub_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger])
tbu_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.BigramTagger, nltk.tag.UnigramTagger])

Accuracy Testing

To test the accuracy of a tagger, we can compare it to the test sentences using the nltk.tag.accuracy function.

nltk.tag.accuracy(tagger, test_sents)

Ngram Tagging Accuracy

Conclusion

The ubt_tagger and utb_tagger are extremely close to each other, but the ubt_tagger is the slight favorite (note that the backoff sequence is in reverse order, so for the ubt_tagger, the TrigramTagger backs off to the BigramTagger, which backs off to the UnigramTagger).

Update: in Part of Speech Tagging with NLTK – Part 2, I do further testing using the AffixTagger and the RegexpTagger to get the accuracy up past 80%.

