While preparing for a client demo, someone recommended I look at Clavin, a geoparser that extracts locations from natural-language text. I was impressed with the results and thought it was a pretty creative solution that combines three core open-source technologies:

  • Stanford NER - Performing named entity recognition (people, locations, organizations, times) on natural language is tough. You need more than a dictionary lookup of location names, especially for international locations that happen to match English words: “nice” and “Nice, France.” Stanford NER uses a trained statistical model that looks at the context around each word, not just the word itself, to guess when it refers to a location.
  • GeoNames gazetteer - A comprehensive list of 10M locations, with alternative names and lat/longs
  • Lucene - A full-text search engine used to quickly look up text references, including fuzzy matches

Clavin uses Stanford NER as the first pass to guess which words are locations. As the second pass, it uses Lucene to look up exact and fuzzy matches for those guesses in the GeoNames gazetteer data. It works best on news-style language, and the results are pretty good given the complexities of natural language.
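
To make the two-pass idea concrete, here is a minimal Python sketch. It is not Clavin's code and uses none of its APIs. Everything in it is a stand-in I made up for illustration: a three-entry `GAZETTEER` dictionary instead of the gazetteer data, a crude capitalization heuristic (`extract_candidates`) instead of Stanford NER, and Python's standard `difflib` (`resolve`) instead of a Lucene index.

```python
import difflib
import re

# Hypothetical toy gazetteer: name -> (lat, lon). Stand-in for the real data.
GAZETTEER = {
    "Nice": (43.70, 7.27),
    "Paris": (48.85, 2.35),
    "Reston": (38.97, -77.34),
}


def extract_candidates(text):
    """Pass 1 (stand-in for NER): capitalized words that do not start a sentence.

    A real NER model uses far richer context; this heuristic only illustrates
    why "nice" and "Nice" should be treated differently.
    """
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        candidates.extend(w for w in words[1:] if w[0].isupper())
    return candidates


def resolve(candidate, cutoff=0.8):
    """Pass 2 (stand-in for the search index): exact match first, then fuzzy."""
    if candidate in GAZETTEER:
        return candidate
    close = difflib.get_close_matches(candidate, list(GAZETTEER), n=1, cutoff=cutoff)
    return close[0] if close else None


def geoparse(text):
    """Run both passes and return (text as written, gazetteer name, (lat, lon))."""
    results = []
    for candidate in extract_candidates(text):
        name = resolve(candidate)
        if name is not None:
            results.append((candidate, name, GAZETTEER[name]))
    return results


if __name__ == "__main__":
    print(geoparse("We had a nice trip. She flew from Pariss to Nice last week."))
```

In this sketch the lowercase "nice" never becomes a candidate, the misspelled "Pariss" is resolved to "Paris" by the fuzzy match, and "Nice" is resolved by the exact match. The `cutoff` value is an arbitrary choice; a lower value accepts more misspellings but also more false matches.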