Dates and Times in Python and Javascript

April 2, 2009 at 9:06 am (javascript, python)

If you are dealing with dates & times in Python and/or JavaScript, there are two must-have libraries:

  1. Datejs
  2. python-dateutil

Datejs

Datejs, being JavaScript, is designed for parsing and creating human-readable dates & times. Its powerful parse() function can handle all the dates & times you’d expect, plus fuzzier human-readable date words. Here are some examples from their site.

Date.parse("February 20th 1973");
Date.parse("Thu, 1 July 2004 22:30:00");
Date.parse("today");
Date.parse("next thursday");

And if you are programmatically creating Date objects, here are a few functions I find myself using frequently.

// get a new Date object set to local date
var dt = Date.today();
// get that same Date object set to current time
var dt = Date.today().setTimeToNow();

// set the local time to 10:30 AM
var dt = Date.today().set({hour: 10, minute: 30});
// produce an ISO formatted datetime string converted to UTC
dt.toISOString();

There’s plenty more in the documentation: pretty much everything you need for manipulation, comparison, and string conversion. Datejs cleanly extends the default Date object, has been integrated into a couple of date pickers, and supports culture-specific parsing for i18n.

python-dateutil

Like Datejs, dateutil also has a powerful parse() function. While it can’t handle words like “today” or “tomorrow”, it can handle nearly every (American) date format that exists. Here are a few examples.

>>> from dateutil import parser
>>> parser.parse("Thu, 4/2/09 09:00 PM")
datetime.datetime(2009, 4, 2, 21, 0)
>>> parser.parse("04/02/09 9:00PM")
datetime.datetime(2009, 4, 2, 21, 0)
>>> parser.parse("04-02-08 9pm")
datetime.datetime(2009, 4, 2, 21, 0)

An option that comes in especially handy is fuzzy=True, which tells parse() to ignore unknown tokens while parsing. The next example would raise a ValueError without fuzzy=True.

>>> parser.parse("Thurs, 4/2/09 09:00 PM", fuzzy=True)

I don’t know how well it works for international date formats, but parse() does have dayfirst and yearfirst options, so I’m guessing it can be made to work.
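
For example, dayfirst=True reads an ambiguous 04/02/09 as the 4th of February, and yearfirst=True reads the first number as the year (my own quick illustration of these options, not from the original post):

>>> parser.parse("04/02/09 9:00PM", dayfirst=True)
datetime.datetime(2009, 2, 4, 21, 0)
>>> parser.parse("09/04/02 9:00PM", yearfirst=True)
datetime.datetime(2009, 4, 2, 21, 0)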

dateutil also provides some great timezone support. I’ve always been surprised at Python’s lack of concrete tzinfo classes, but dateutil.tz more than makes up for it (there’s also pytz, but I haven’t figured out why I’d need it instead of, or in addition to, dateutil.tz). Here’s a function for parsing a string and returning a UTC datetime object.

from dateutil import parser, tz

def parse_to_utc(s):
    # parse the string, ignoring any unknown tokens
    dt = parser.parse(s, fuzzy=True)
    # treat the parsed datetime as local time
    dt = dt.replace(tzinfo=tz.tzlocal())
    # convert to UTC
    return dt.astimezone(tz.tzutc())
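
Here’s what a round trip might look like. The exact UTC result depends on your machine’s local zone; this hypothetical run assumes US/Eastern, where 9 PM EDT is 1 AM UTC the next day.

>>> parse_to_utc("Thurs, 4/2/09 09:00 PM")
datetime.datetime(2009, 4, 3, 1, 0, tzinfo=tzutc())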

dateutil does a lot more than provide tzinfo objects and parse datetimes; it can also calculate relative deltas and handle iCal recurrence rules. I’m sure a whole calendar application could be built based on dateutil, but my interest is in parsing and converting datetimes to and from UTC, and in that respect dateutil excels.
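
For a taste of those extras, here are relativedelta and rrule in action (these are my own examples of the standard dateutil API, not ones from the post):

>>> from datetime import datetime
>>> from dateutil.relativedelta import relativedelta
>>> # jump ahead one month, then back one day
>>> datetime(2009, 4, 2) + relativedelta(months=+1, days=-1)
datetime.datetime(2009, 5, 1, 0, 0)
>>> from dateutil.rrule import rrule, WEEKLY
>>> # three weekly occurrences starting April 2nd
>>> list(rrule(WEEKLY, count=3, dtstart=datetime(2009, 4, 2)))
[datetime.datetime(2009, 4, 2, 0, 0), datetime.datetime(2009, 4, 9, 0, 0), datetime.datetime(2009, 4, 16, 0, 0)]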


Chunk Extraction with NLTK

February 23, 2009 at 4:02 pm (programming, python)

Chunk extraction is a useful preliminary step to information extraction that creates parse trees from unstructured text. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.

Chunking is basically a 3-step process:

  1. Tag a sentence
  2. Chunk the tagged sentence
  3. Analyze the parse tree to extract information

I’ve already written about how to train a part-of-speech tagger and a chunker, so I’ll assume you’ve already done the training, and now you want to use your tagger and chunker to do something useful.

Tag Chunker

The previously trained chunker is actually a chunk tagger. It’s a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I’ve created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.

import nltk.chunk
import itertools

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))
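
To see the conversion that parse() relies on, here’s what nltk.chunk.conllstr2tree() does with a few IOB-formatted lines (a quick illustration of my own, not from the original post):

>>> import nltk.chunk
>>> print nltk.chunk.conllstr2tree("the DT B-NP\nquick JJ I-NP\nfox NN I-NP")
(S (NP the/DT quick/JJ fox/NN))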

Chunk Extraction

Now that we have a proper chunker, we can use it to extract chunks. Here’s a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.

# sentence should be a list of words
# tagger is a trained part-of-speech tagger; chunker is the TagChunker above
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()

Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since we’re training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn’t similar to your training data, then you probably won’t get many chunks.
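
If you want each noun phrase as plain text instead of tagged words, you can join just the words from the leaves (a small addition of mine, using the same tree as above):

for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    print ' '.join(word for (word, tag) in subtree.leaves())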


How to Train an NLTK Chunker

December 29, 2008 at 8:19 am (python)

In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.

Training

The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get complicated very fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the conll2000 corpus. Here’s a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank) and return a list of (tag, iob) sequences.

import nltk.chunk

def conll_tag_chunks(chunk_sents):
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
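
As a sanity check, here are the first few (tag, iob) pairs from the first conll2000 training sentence (my own quick check; it assumes the conll2000 corpus is installed):

>>> import nltk.corpus
>>> train_sents = nltk.corpus.conll2000.chunked_sents('train.txt')
>>> conll_tag_chunks(train_sents[:1])[0][:3]
[('NN', 'B-NP'), ('IN', 'B-PP'), ('DT', 'B-NP')]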

Accuracy

So how accurate is the trained chunker? Here’s the rest of the code, followed by a chart of the accuracy results. Note that I’m only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.

import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])

[Chart: Accuracy for Trained Chunker]

The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.
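
To train that ub_chunker on all of conll2000 for actual use, here’s a short recap assembled from the code above:

import nltk.corpus, nltk.tag

train_chunks = conll_tag_chunks(nltk.corpus.conll2000.chunked_sents('train.txt'))
u_chunker = nltk.tag.UnigramTagger(train_chunks)
ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
# ub_chunker can now be wrapped in the TagChunker class from the
# chunk extraction post to produce parse trees from tagged sentences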

Conclusion

Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is reusable across data sets. As with part-of-speech tagging, the training set really matters and should be as similar as possible to the actual text that you want to tag and chunk.
