In this post I'll describe I have been doing for Instedd during the last couple of weeks. In one of the projects we have we need to classify a series of articles depending on the geographical location they are talking about. This process is known as geotagging, and is really important on the biosurveillance areas.
Geotagging items is not a new thing, and many web sites already supports adding geographic information to the objects their handle. For example, Flickr allows you to set the coordinates where the picture was taken. Wikipedia also has structured information that contains the latitude and longitude for articles about a place in the world. On the other side, specs like GeoRSS can be used to augment the information given by a feed. However, even though all these new geo-related features are being widely adopted, there are still much information out there, that would need a human reading the text to understand which places is it mentioning.
So, we decided to make this process automatically as most as possible, extracting the information from the text itself. This is know as "location extraction", and is actually a branch of a more general thing named "entity extraction".
The location extraction process can be splitted in two stages:
Infer the possible location mentions in the text
The first step in location extraction is to obtain the possible terms in the text that could be talking about a location in the world.
This is being accomplished using a gazetteer plus some heuristics. A gazetteer it's like a world location database, and in this case I used the information that can be downloaded from the site of GeoNames. This database contains dozens of million of places all around the world, each one with alternative names, kind of place and geographical coordinates between some other things.
The matching cannot be done simply by looking the words in a dictionary because most place names are composed by more than just one term (i.e.: Buenos Aires). This is why I used a variant of the Aho-Corasick algorithm in which the tokens are whole words instead of characters. A good description of this algorithm can be read in these slides. This means that I had to pre-process the GeoNames database (consisting in a bunch of zipped text files) and construct a directed graph. I decided to use SQLite as the store for this huge graph as it is really fast and lightweight.
The construction of this database graph is a slow and memory hungry process. My first approach was to implement it with a Python script and it was enough to process test data from one or two countries. But for the final database that implies millions of nodes, the lightweight of C++ made a good job and I end up with a process that takes around 20 minutes in my Core 2 Duo notebook at 2.2 GHz. I think it's important to mention that I used a ramfs file system to construct the SQLite database. This really makes a big difference. As I taken only the most important places for my database, the final version sizes around 500Mb.
Once the database is ready, the graph can be traversed as a kind of state machine with the words being read from the text. Every time a match is found, it's saved in a list for later disambiguation. The Aho-Corasick algorithm finds all the occurrences of substrings, so those that overlaps must be discarded. For instance if we read "San Fernando" we have a positive match that must be later discarded if we read "San Fernando del Valle de Catamarca". This is very straightforward if we save the starting word index and length of each location mention, so for each match we look up the list of previous matches from the bottom, removing all those entries that would overlap with the new one.
To improve the accuracy of the matching algorithm I added these heuristics to the process:
- Every place must start with a capital letter: If the name is composed by more than one word, the following words are not enforced to start with upper case. This doesn't work for all languages, but at this stage I only need it does with English and Spanish.
- Uppercase chains must not be broken: this means that consecutive uppercase words must be taken together or not taken at all.
Before an specific location name extracted from the text can be placed on a map we have to disambiguate it from different meanings. For example, there are dozens of places named 'Springfield' and not only in US. So the meaning of a particular name must be inferred from the context. For this part of the process I used mainly this paper.
The central idea is calculate a weight for each of the senses of every location. This weights are calculated taking into account the other senses of other locations coocurring in the same text, and the distance of these terms in the text.
To calculate the weight of a sense we use the following formula:
Where "num" is the number of mentions of a place, "dist" represents the distance in words between two locations, and "weight" represents the influence of a location sense to other sense of a different location, and is calculated with the following table:
|Sense 1||Sense 2||Condition||Weight|
|City||City||in same state||2|
|City||in same country||1|
|State||in same state||3|
|Country||in country without state||4|
Some heuristics are also applied in this stage of the process. For example, if a location name can be interpreted as a country, other meanings are discarded. Also, if the same location name appears many times in the same text, the sense with maximum weight is taken and propagated to the other mentions, based on the "one sense per discourse" principle.
To improve the accuracy and based on my own tests, I introduced these modifications to the original formula:
1) Increase weight based on the location inside the text: the weight of the locations found at the beginning of the text, or even at the title, are multiplied by a factor that decreases exponentially as we advance in the document.
2) Reduce the weight based on a blacklist: Many locations has names of common words. The weight of these locations are reduced if are found in a blacklist to avoid potential mismatches. This blacklist can also be populated based on feedback of the users of the final system.
Once the algorithm has finished, only those places with weights above the 5% of the maximum weight are considered for the final result.
The algorithm described here is far from being perfect. There is some much work done out there about this subject and I need to introduce some interesting stuff. I found that the biggest effort should be placed on the first part of the process that currently is introducing some noise because of false positive matches.
These are some of the points in my wishlist:
1) Use pattern matching before the disambiguation algorithm for patterns like "<city>, <state>" or "<city>, <country>", or even "city of <city>".
2) Prepare a better blacklist based on statistics. These can be obtained looking for common words appearing in hundead of texts.
3) Improve the mention lookup with previous processing. For example using sentence splitting and POS tagging.
I hope you found this article interesting, as it was to me to implement such process. The results are truly amazing!