Converting place names harvested from LinkedIn to latitude/longitude with Groovy, HTTPBuilder, JSONSlurper and Yahoo’s GeoPlanet

I was initially sceptical about how well GeoPlanet would handle place names, but I was pleasantly surprised at how well it resolves ambiguous ones.
In its simplest form, you can call it like this:

http://where.yahooapis.com/v1/places.q('<place name>')?appid=[<<Your App Id here>>]

The highest-ranking result is returned, and the ranking algorithm they use to pick that first result is awesome.
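As a rough sketch of how that URL can be put together in Groovy, the helper below encodes the place name and drops it into the pattern shown above. placesUrl is a name I made up for illustration, YOUR_APP_ID is a placeholder, and the format=json parameter is an optional extra; nothing here actually calls the service.

```groovy
// Builds a GeoPlanet "places" query URL for a free-text place name.
// placesUrl is a hypothetical helper; no network call is made here.
String placesUrl(String place, String appId) {
  // URLEncoder targets form encoding and emits '+' for spaces,
  // so swap in %20 for use inside a URL path.
  def encoded = URLEncoder.encode(place, 'UTF-8').replace('+', '%20')
  "http://where.yahooapis.com/v1/places.q('${encoded}')?appid=[${appId}]&format=json"
}

println placesUrl('Sutton, UK', 'YOUR_APP_ID')
// \u00f8 is the letter o with a stroke, as in Tromso spelt the Norwegian way
println placesUrl('Troms\u00f8, Norway', 'YOUR_APP_ID')
```

Encoding with an explicit UTF-8 charset matters for the accented names, since each accented character becomes a multi-byte escape sequence.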

It copes with odd cases like Sutton, UK: it was able to determine that I would want the London Borough and returned this first.

Newport – it knew to pick the one in Wales.

It handles place names with accented characters, along with most of the data I threw at it. For example:

  • Tromsø, Norway
  • Hyderābād, India

It handled Almere Stad, Netherlands – where Stad is Dutch for city.

You can also call it in this format:

http://where.yahooapis.com/v1/places.q('<place name>');start=0;count=0?appid=[<<Your App Id here>>]&format=json

The ‘count=0’ in the URL means return all results.

If you look at the JSON you will see Yahoo’s GeoPlanet gives a result that has a placeTypeName and a corresponding code.

If you are using the count=0 option, the result is a nested structure: a places element containing a repeating place element, followed at the end by start, count and total values that summarise the results that have gone before.

Buried within the JSON is a centroid element that contains the central latitude and longitude for the place.

Each place also has a woeid (Where On Earth ID), which is GeoPlanet’s unique identifier for that place.
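To make the shape of the response concrete, here is a minimal sketch that parses a hand-written sample and pulls out the centroid and woeid of the first place. The sample JSON is my own approximation of the structure described above and every value in it is invented. It uses groovy.json.JsonSlurper (bundled with Groovy 1.8 and later) rather than the json-lib JsonSlurper used in the script further down.

```groovy
import groovy.json.JsonSlurper

// Hand-written sample shaped like the structure described above;
// all values (woeids, coordinates) are invented for illustration.
def sample = '''
{"places": {
   "place": [
     {"woeid": 11111, "placeTypeName": "Town", "name": "Newport",
      "centroid": {"latitude": 51.58, "longitude": -3.0}},
     {"woeid": 22222, "placeTypeName": "Town", "name": "Newport",
      "centroid": {"latitude": 50.7, "longitude": -1.29}}
   ],
   "start": 0, "count": 2, "total": 2}}
'''

def json = new JsonSlurper().parseText(sample)

// The first entry is the highest-ranking match.
def best = json.places.place[0]
def result = [lat  : best.centroid.latitude,
              lon  : best.centroid.longitude,
              woeid: best.woeid]

println "Best match: $result (of ${json.places.total} results)"
```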

The only thing I tweaked was the handling of places like Minneapolis/St. Paul or Raleigh-Durham, each an area made up of two geographically close cities. As a consequence, I wrote a routine to feed in the two names and compute the geographic mid-point between them. I think the problem may actually have been down to me using a ‘/’ symbol in the name. Either way, I was happy to make this quick tweak.
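The mid-point routine boils down to something like the sketch below. It takes a plain arithmetic mean of the centroids, which is fine for neighbouring cities but is not a true great-circle midpoint. splitArea and midPoint are illustrative names, and the coordinates are rough values made up for the example.

```groovy
// Split a combined area name such as 'Minneapolis/St. Paul' into its cities.
List<String> splitArea(String area) {
  area.tokenize('/')*.trim()
}

// Arithmetic mean of a list of [lat: ..., lon: ...] maps.
// Good enough for cities a short distance apart; not a spherical midpoint.
Map midPoint(List<Map> points) {
  [lat: points*.lat.sum() / points.size(),
   lon: points*.lon.sum() / points.size()]
}

// Rough, made-up centroids standing in for the two GeoPlanet lookups.
def centroids = [[lat: 44.0, lon: -93.0], [lat: 45.0, lon: -94.0]]

println splitArea('Minneapolis/St. Paul')
println midPoint(centroids)
```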

My couple of gripes with GeoPlanet were:

  • The documentation didn’t provide a tabulation of all the place type codes and names.
  • There are inconsistencies in the results. More often than not, areaRank and popRank don’t appear in the results.
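Because those rank fields come and go, any code that reads them needs a fallback. A minimal sketch, assuming each place has been parsed into a Map; rankOrDefault is a hypothetical helper and -1 is an arbitrary sentinel I picked for "not supplied":

```groovy
// Returns the named rank if the place carries it, otherwise a default.
// containsKey is used instead of the Elvis operator so that a genuine
// rank of 0 is not mistaken for a missing value.
def rankOrDefault(Map place, String field, defaultValue = -1) {
  place.containsKey(field) ? place[field] : defaultValue
}

def withRanks    = [name: 'Newport', areaRank: 4, popRank: 10]
def withoutRanks = [name: 'Sutton']

println rankOrDefault(withRanks, 'areaRank')
println rankOrDefault(withoutRanks, 'popRank')
```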

Here is the script (loc.groovy) I used to do the mapping:

package jgf
import groovy.grape.Grape
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.TEXT
import net.sf.json.groovy.*
@Grapes([
  @Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.5.0-RC2' ),
  @Grab(group='net.sf.json-lib', module='json-lib', version='2.3', classifier='jdk15')
])

class Loc {

  def execute() {
    def results
    def rs
    def locVars = init()
    sanityCheck(locVars)
    locVars.with {
      countriesDistinct.each{country ->
        co_ll << getLocationsForPlaceLvl2(locVars, country, null)
      }
      areaDistinct.each{area ->
        def parts = area.tokenize('|')
        ar_ll << getLocationsForPlace(locVars, parts[0], parts[1])
      }
      co_ll.each{println it}
      ar_ll.each{println it}
    }
    updateLocalities(locVars)
    return null
  }

  def updateLocalities(locVars) {
    def text = ''
    locVars.with {
      def localitiesTextFileReader = new File(lf).newReader(encoding)
      localitiesTextFileReader.eachLine{line ->
        def part = line.tokenize('|')
        part = part.collect{it.trim()}
        def country = co_ll.find{it.country == part[2]}
        def newline = "${part[0]}|"
        if (part[1]) {
          def area = ar_ll.find{it.area == part[1] && it.country == part[2]}
          newline += "${part[1]}|${area.lat}|${area.lon}|${area.woeid}|"
        } else {
          newline += " | | | |"
        }
        newline += "${part[2]}|${country.lat}|${country.lon}|${country.woeid}$nl"
        text += newline
      }
      def localitiesTextFileLL = new File(lfll)
      localitiesTextFileLL.write(text, encoding)
    }
    return null
  }

  def init() {
    println ''
    def locVars = [:]
    locVars.with {
      h                 = System.getenv('HOME')                // OS Shell var
      fs                = System.getProperty('file.separator') // Java Sys Property
      nl                = System.getProperty("line.separator") // Newline character
      encoding          = 'UTF-8'
      d                 = "${h}${fs}Desktop"
      gsd               = "${d}${fs}Groovy Scripts"

      // get username and password. Contains Yahoo Maps ydnAppId key too. Secret stuff!
      def credentials = new ConfigSlurper().parse(new File("${gsd}${fs}credentials.properties").toURL())
      // Generic stuff...
      def props         = new ConfigSlurper().parse(new File("${gsd}${fs}crawl.properties").toURL())
      lf                = "${d}${props.countriesFile}"
      lfll              = "${d}${props.countriesFileLL}"
                          getLocalities(locVars)
      localitiesKL      = localities.collect{it.loc}  // Used to identify Country in Location
      areaDistinct      = new TreeSet()
                          localities.each{locality -> if (locality.area) areaDistinct << "${locality.country}|${locality.area}"}
      countriesDistinct = new TreeSet(localities*.country)
      yId               = credentials.ydnAppId         // Yahoo App Id used to get geo-coordinates
      co_ll             = []
      ar_ll             = []
    }
    return locVars
  }

  def getLocalities(locVars) {
    locVars.with {
      localities = []
      def localitiesTextFileReader = new File(lf).newReader(encoding)
      localitiesTextFileReader.eachLine{line ->
        def part = line.tokenize('|')
        part = part.collect{it.trim()}
        def locality = [loc:part[0], area:(part[1]) ?: '', country:part[2] ?:'']
        localities << locality
      }
    }
    return null
  }

  def sanityCheck(locVars) {
    locVars.with {
      println 'Localities - Distinct collection (SQL ese!)'
      localitiesKL.each{locality -> println locality}
      lkls = localitiesKL.size()
      println "Localities Distinct count: $lkls"
      println '==='
      println 'Countries - Distinct collection'
      countriesDistinct.each{country -> println country}
      cds = countriesDistinct.size()
      println "Countries Distinct count: $cds"
      println '==='
      println 'Areas - Distinct collection - excludes Localities that are just country'
      areaDistinct.each{area -> println area}
      ads = areaDistinct.size()
      println "Area Distinct count: $ads"
      println '==='
      println "lkls: $lkls cds: $cds ads: $ads cds+ads: ${cds +ads}"
    }
    return null
  }

  def geoPlanetExample(locVars) {
    def country = 'United Kingdom'
    def area = 'Newport'
    def result = getLocationsForPlace(locVars, country, area)
    return result
  }

  def getPlace(country, area) {
    def ar = (area) ? "${area}, " : ''
    def place = "$ar$country"
    return place
  }

  def getnameAndStateFromArea(country, area) {
    def res = [:]
    if (country == 'United States') {
      // US areas look like 'Dallas/Fort Worth, TX': split off the trailing ', TX'
      res.name = area[0..-5]
      res.state = area[-2..-1]
    } else {
      res.name = area
      res.state = null
    }
    return res
  }

  def getLocationsForPlace(locVars, country, area) {
    def result = [:]
    def ns
    def aSplits
    if (area) {
      ns = getnameAndStateFromArea(country, area)
      aSplits = ns.name.tokenize('/')
    }
    if (!area || aSplits.size() == 1) { // No area means just the country...
      result = getLocationsForPlaceLvl2(locVars, country, area)
    } else {
      // Issues with Cleveland/Akron, OH
      // Dallas/Fort Worth, TX
      // Miami/Fort Lauderdale, FL
      // Tampa/St. Petersburg, FL
      // Raleigh/Durham, NC
      // Minneapolis/St. Paul
      def multRes = []
      aSplits.each{aSplit ->
        def ar = (ns.state) ? "${aSplit}, ${ns.state}" : aSplit
        multRes << getLocationsForPlaceLvl2(locVars, country, ar)
      }
      def mrCount = multRes.size()
      def lat = 0
      def lon = 0
      multRes.each{
        lat += it.lat
        lon += it.lon
      }
      lat = lat / mrCount
      lon = lon / mrCount

      result.lat = lat
      result.lon = lon
      result.woeid = multRes[0].woeid
      result.country = country
      result.area = area
    }
    //println result
    return result
  }

  def getLocationsForPlaceLvl2(locVars, country, area) {
    def place = getPlace(country, area)
    def placeEncoded = URLEncoder.encode(place, 'UTF-8') // explicit charset; the one-arg overload is deprecated
    def http = new HTTPBuilder('http://where.yahooapis.com/')
    def data = null
    try {
      http.request(GET,TEXT) {
        uri.path = "v1/places.q('${placeEncoded}')"
        uri.query = [format:"json",
                     appid:"[${locVars.yId}]"]

        response.success = {resp, unformatted ->
          data = unformatted.text
        }
      }
    } catch (Exception e) {
      println place
      throw e
    }
    def json = new JsonSlurper().parseText(data)
    //println "Passed param $ptna_c"
    def firstResult = json.places.place[0]
    def result = [:]
    if (firstResult) {
      result.lat = firstResult.centroid.latitude
      result.lon = firstResult.centroid.longitude
      result.woeid = firstResult.woeid
      result.country = country
      result.area = area
    }
    return result
  }
}

def l = new Loc()
//l.sanityCheck(l.init())
//l.geoPlanetExample(l.init())
l.execute()

As you can see at the bottom of the script, there are three ways to run it:

  • I can dump out the countries file, summarising what I want to convert first (sanityCheck).
  • I can run a single example (geoPlanetExample).
  • I can execute the whole process (execute).

I summarise Countries first, then areas within Countries to make the most efficient calls to the GeoPlanet mapping service.
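The saving comes from collapsing the input down to distinct countries and distinct country/area pairs before making any calls. Here is a cut-down sketch of that step, using in-memory rows instead of the pipe-delimited file; the sample rows are invented for illustration.

```groovy
// Invented sample rows; the real script reads these from a file.
def localities = [
  [loc: 'Newport, United Kingdom', area: 'Newport', country: 'United Kingdom'],
  [loc: 'Sutton, United Kingdom',  area: 'Sutton',  country: 'United Kingdom'],
  [loc: 'Newport, United Kingdom', area: 'Newport', country: 'United Kingdom'],
  [loc: 'Sutton, United Kingdom',  area: 'Sutton',  country: 'United Kingdom'],
  [loc: 'Oslo, Norway',            area: 'Oslo',    country: 'Norway'],
  [loc: 'Norway',                  area: '',        country: 'Norway']
]

// One lookup per distinct country...
def countriesDistinct = new TreeSet(localities*.country)

// ...and one per distinct country|area pair, skipping country-only rows.
def areaDistinct = new TreeSet(
  localities.findAll { it.area }
            .collect { "${it.country}|${it.area}".toString() })

def lookups = countriesDistinct.size() + areaDistinct.size()
println "Lookups needed: $lookups for ${localities.size()} rows"
```

With real data harvested from profiles, where the same handful of areas crop up again and again, the gap between rows and lookups is far bigger than in this toy sample.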

Links I found useful whilst writing this script:

Special thanks to Matt Morten for his blog post combining the two technologies I wanted to use; it was one of the few resources that I could get to work. I had issues getting statusLine to compile when following other posts.
HttpBuilder: 1 2 3 JSON Timeouts API Docs Tom Nichols blog (twitter) Mvn Repo
JSONSlurper: Groovy 1.7 Release Notes has Grape annotations API Docs

Testing latitude & longitude results

GeoPlanet (see example URLs & how to get your own App Id here) Hierarchy concepts, placetypes

Scott Davis also mentioned this link to me today when I was getting ready to write this blog post. It provides a way of doing this kind of stuff through REST web services, but at a cursory glance I don’t think it’s as sophisticated as GeoPlanet. It appears to have an Anglo/American bias and I needed something a little more global.
