Extracting addresses from text

A few days ago, I had a task to extract addresses from unstructured text, like the following:

Hey man! Joe lives here: **44 West 22nd Street, New York, NY 12345**. Can you contact him now? If you need any help, call me on 12345678

The text in bold must be extracted from the sentence and returned as an address string. At first glance, it doesn't seem so hard to do with regular expressions, but when you try, you find the regular expression growing into a monster with every new edge case while the precision of the recognized address string stays the same.

When you have semi-structured text with "labels" you can match to find where the address begins, a regular expression is the way to go. It's fast, and you don't need a huge dataset of addresses to train an address chunker (more on that later); all you need is a predefined regular expression, which you then tune for the cases where it fails.
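For instance, if every message carried a literal "Address:" label, a single pattern would do. The label and the pattern below are assumptions for illustration, not part of the original task:

{% highlight python %}
import re

# Works only for semi-structured text where a known label precedes
# the address and a ZIP code terminates it.
ADDRESS_AFTER_LABEL = re.compile(r'Address:\s*(?P<address>.+?\d{5})')

text = 'Name: Joe. Address: 44 West 22nd Street, New York, NY 12345. Phone: 12345678'
match = ADDRESS_AFTER_LABEL.search(text)
if match:
    print(match.group('address'))  # 44 West 22nd Street, New York, NY 12345
{% endhighlight %}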

So, with the regular expression off the table, the other option is to use Natural Language Processing to process the text and extract addresses. After spending some time getting familiar with NLP, it turned out to match the way I was thinking about this problem in the first place: not as a rule defined beforehand (a regular expression) but as a classifier that can predict the class of a chunk of text based on previous observations. Think of it as training on the same piece of text shown above, but with marks saying that this part of the text is an address and the rest is just noise.

Using NLP for address extraction

If you are not familiar with NLP and have never done anything with it in Python, I suggest getting a brief introduction from the Natural Language Processing with Python book. It is a great resource for diving into the NLP field using the NLTK toolkit, which is written in Python and comes with a large number of examples.

Now, to train our classifier we need a dataset of sentences tagged in IOB format. This means the sentence above must be tagged in a way that shows the classifier where the address begins, continues, and ends.

Let's see it in terms of a Python list:

{% highlight python %}
[('Hey', 'O'), ('man', 'O'), ('!', 'O'), ('Joe', 'O'), ('lives', 'O'),
 ('here', 'O'), (':', 'O'), ('44', 'B-GPE'), ('West', 'I-GPE'),
 ('22nd', 'I-GPE'), ('Street', 'I-GPE'), (',', 'I-GPE'), ('New', 'I-GPE'),
 ('York', 'I-GPE'), (',', 'I-GPE'), ('NY', 'I-GPE'), ('12345', 'I-GPE'),
 ('.', 'O'), ('Can', 'O'), ('you', 'O'), ('contact', 'O'), ('him', 'O'),
 ('now', 'O'), ('?', 'O'), ('If', 'O'), ('you', 'O'), ('need', 'O'),
 ('any', 'O'), ('help', 'O'), (',', 'O'), ('call', 'O'), ('me', 'O'),
 ('on', 'O'), ('12345678', 'O')]
{% endhighlight %}

The list contains several tuples where the first element is a word and the second is an IOB tag:

**O** - Outside of an address;

**B-GPE** - Beginning of an address string;

**I-GPE** - Inside an address string.

Using this dataset and a feature extraction method, we show the classifier which chunks of text we want to extract and provide a way (the feature detection method) to "map" features to IOB tags. This is a very simplified version of what is going on under the hood of the classifier; if you need more details, this post uses ClassifierBasedTagger, which is based on a Naive Bayes classifier.
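Here is a minimal sketch of that training setup. The feature names and the one-sentence training set are illustrative assumptions; the real chunker uses the full dataset and a richer feature detector:

{% highlight python %}
from nltk.tag import ClassifierBasedTagger

# Toy training set: a single IOB-tagged sentence. A real dataset
# would contain thousands of such sentences.
train_sents = [[
    ('Joe', 'O'), ('lives', 'O'), ('here', 'O'), (':', 'O'),
    ('44', 'B-GPE'), ('West', 'I-GPE'), ('22nd', 'I-GPE'),
    ('Street', 'I-GPE'), (',', 'I-GPE'), ('New', 'I-GPE'),
    ('York', 'I-GPE'), (',', 'I-GPE'), ('NY', 'I-GPE'),
    ('12345', 'I-GPE'), ('.', 'O'),
]]

def features(tokens, index, history):
    """Map the token at `index` (plus a little context) to a feature dict."""
    word = tokens[index]
    return {
        'word': word,
        'is-digit': word.isdigit(),
        'prev-word': tokens[index - 1] if index > 0 else '<START>',
        'prev-tag': history[-1] if history else '<START>',
    }

# ClassifierBasedTagger trains a Naive Bayes classifier by default.
tagger = ClassifierBasedTagger(train=train_sents, feature_detector=features)
print(tagger.tag(['Joe', 'lives', 'at', '44', 'West', '22nd', 'Street']))
{% endhighlight %}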

Where to get the dataset?

In the repository you can find an already compiled dataset of texts with US addresses. This is a pickled Python list that contains more than 9000 IOB-tagged sentences.
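Loading it is then a matter of unpickling; the filename below is a placeholder, the repository has the real one:

{% highlight python %}
import pickle

# Placeholder filename; check the repository for the actual dataset file.
with open('us_addresses_dataset.pickle', 'rb') as f:
    tagged_sentences = pickle.load(f)

print(len(tagged_sentences))    # more than 9000 IOB-tagged sentences
print(tagged_sentences[0][:5])  # first few (word, IOB-tag) pairs
{% endhighlight %}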

This list was compiled using several methods:

Taking an existing dataset of hotel/pizza contact-info web pages;

Generating fake text with fake addresses;

Inserting random addresses into the nltk.corpus.treebank corpus.

One idea that I haven't tried is to scrape websites that use structured data and automatically retrieve the addresses marked with the address tag. Since this data is already structured, the only thing left is to convert it to IOB format and add it to the dataset.
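A rough sketch of that untried idea, assuming pages that wrap addresses in the HTML address element; requests and beautifulsoup4 do the fetching and parsing, and the URL is a placeholder:

{% highlight python %}
import requests
from bs4 import BeautifulSoup

# Placeholder URL for a page that marks addresses up with <address>.
html = requests.get('https://example.com/contacts').text
soup = BeautifulSoup(html, 'html.parser')

for node in soup.find_all('address'):
    # Each string would still need tokenizing and converting to
    # (word, IOB-tag) tuples before joining the training set.
    print(node.get_text(' ', strip=True))
{% endhighlight %}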

Results

The source code of the chunker is pretty straightforward: all you need is a brief introduction to NLP concepts and you are ready to go.

But what about the results? There are cases when a part of the text that is not an address gets tagged as one. The problem is the sheer variety of address formats, and that's just for US addresses. One way to remove this ambiguity and push the accuracy of address recognition toward 100% is to use the USPS database in combination with regular expressions and a gazetteer: a dictionary of named geographic places including their names, types, and locations.
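As a hedged sketch of that idea: once the classifier proposes a chunk, a post-filter can check it against a gazetteer of known places before accepting it. The toy GAZETTEER and the street-suffix pattern below are stand-ins; a real filter would draw on USPS or GeoNames data.

{% highlight python %}
import re

# Toy gazetteer; a real one would be built from USPS or GeoNames data.
GAZETTEER = {('New York', 'NY'), ('Los Angeles', 'CA')}

STREET_RE = re.compile(r'\d+\s+.+?\b(Street|St|Avenue|Ave|Road|Rd)\b', re.I)

def looks_like_us_address(chunk):
    """Cheap post-filter for chunks the classifier tagged as addresses."""
    has_street_part = bool(STREET_RE.search(chunk))
    has_known_place = any(
        city in chunk and state in chunk for city, state in GAZETTEER
    )
    return has_street_part and has_known_place

print(looks_like_us_address('44 West 22nd Street, New York, NY 12345'))  # True
print(looks_like_us_address('call me on 12345678'))                      # False
{% endhighlight %}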