Groovy script to gather geo-coded latitude/longitude for UK telephone exchanges by crawling Wikipedia

This time the source for my web crawling was the Wikipedia page listing UK dialling codes.

I basically extract all the telephone exchanges from the ‘short codes’ and ‘long codes’ tables.

I ignore all entries that are obsolete (i.e. a city name prefixed with ‘unused’ in the short codes table, and UL DOM nodes followed by a DL DOM node in the long codes table).

The program then navigates to each city’s wiki page and extracts the latitude and longitude from the ‘geo’ SPAN.
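The ‘geo’ SPAN holds the coordinates as a single ‘lat; long’ string, so splitting it is enough. A minimal sketch of that parsing step (the sample string mirrors the London values in the output below):

```groovy
// The 'geo' SPAN on a city page contains text like '51.50806; -0.12472'.
// tokenize('; ') treats both ';' and ' ' as delimiters, yielding the two halves.
def geo = '51.50806; -0.12472'
def ll  = geo ? geo.tokenize('; ') : ['', '']

assert ll[0] == '51.50806'
assert ll[1] == '-0.12472'
```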

It’s about 99.9% successful in the conversion.

The few codes I had issues with were:

  • No latitude/longitude (because the city’s Wiki page doesn’t contain a ‘geo’ SPAN) for:
    • 0191 – Tyne & Wear
    • 01634  – Medway
    • 01806 Voe (got Shetland)
    • 01847 Tongue
  • Didn’t catch:
    • 01950 Sandwick. It has an embedded SPAN node that messes with extraction of the city name to the XML file.

      (screenshot: the 01950 Sandwick markup as seen in Firebug)

As usual I’m using ConfigSlurper to configure where I want my XML output to go.

crawl.properties
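The properties file itself isn’t reproduced here, but since the script reads `props.ukCallingCodesFile` and prefixes it with the Desktop path in `init()`, it needs a single entry along these lines (the exact value shown is an assumption, not the original file’s contents):

```groovy
// crawl.properties — parsed by ConfigSlurper; the value is appended to ~/Desktop
// by init(). The path shown is illustrative, not the author's actual value.
ukCallingCodesFile = '/ukCallingCodes.xml'
```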

The program had to cater for the fact that either an A tag or plain text may appear in a table cell.

You’ll see 01207 has a URL for the code.

01200-01207
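The distinction shows up when walking a cell’s children with XmlParser: text nodes arrive as plain Strings while links arrive as A nodes, which is what the crawler’s `node.class.simpleName == 'String'` / `node.name() == 'A'` branches handle. A small self-contained sketch (the TD fragment is made up for illustration):

```groovy
// Sketch: a table cell's children are either bare Strings or parsed nodes like <A>.
// The crawler branches on the node type to cover both cases.
def td = new XmlParser().parseText('<TD>01200 <A href="/wiki/01207">01207</A></TD>')

td.children().each { node ->
    if (node instanceof String) {
        println "text : ${node.trim()}  (no URL)"
    } else if (node.name() == 'A') {
        println "link : ${node.text()}  -> ${node.'@href'}"
    }
}
```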

Also, codes such as 01806 and 01864 each extract two cities for a single code:

01806 Voe & Shetland

01864 Tinto & Abingdon

Here’s some console output.

Early output from the program, before I added lat/long:

01200-01207 console output

01864 console output

Here’s a summary of the codes that get written to disk:

Shows 630 codes converted

Last page of console output.

Sample final output

Another thing of note was the use of JLine, again, to control console output in the early stages of development.

This way I can page through the output and not lose early content when the console history wraps.

I mention another post at the end where I used JLine before.
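The paging pattern is simple: print a chunk, then block on a keypress before printing more. A sketch with the pause made pluggable so it can be exercised without a terminal (`paged` and its parameters are illustrative names, not from the script; the every-75-lines threshold matches what `printResults` uses):

```groovy
// Sketch: page long console output by pausing every pageSize lines.
// 'pause' is pluggable so the real blocking keypress read can be swapped in.
def paged = { List lines, int pageSize, Closure pause ->
    lines.eachWithIndex { line, i ->
        println line
        if (i && (i % pageSize == 0)) pause()
    }
}

// In the script the pause is a blocking JLine keypress read:
// paged(output, 75) { new jline.UnixTerminal().readCharacter(System.in) }
```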

Here’s the start of the XML output:

<ukCallingCodes>
  <code>
    <prefix>020</prefix>
    <local-no-len>8</local-no-len>
    <prefix-url>/wiki/020</prefix-url>
    <city-name>London</city-name>
    <city-url>/wiki/London</city-url>
    <city-lat>51.50806</city-lat>
    <city-long>-0.12472</city-long>
  </code>
  <code>
    <prefix>023</prefix>
    <local-no-len>8</local-no-len>
    <prefix-url></prefix-url>
    <city-name>Southampton</city-name>
    <city-url>/wiki/Southampton</city-url>
    <city-lat>50.89696</city-lat>
    <city-long>-1.40416</city-long>
  </code>
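Once written, the file parses straight back in with Groovy’s XmlSlurper. A minimal sketch of reading it (using an inline string here rather than the configured file path; note the hyphenated element names need quoting):

```groovy
// Sketch: read the generated XML back and pull out fields for one code.
def xml = '''<ukCallingCodes>
  <code>
    <prefix>020</prefix>
    <city-name>London</city-name>
    <city-lat>51.50806</city-lat>
    <city-long>-0.12472</city-long>
  </code>
</ukCallingCodes>'''

def codes = new XmlSlurper().parseText(xml)
def first = codes.code[0]

assert first.prefix.text()      == '020'
assert first.'city-name'.text() == 'London'
assert first.'city-lat'.text()  == '51.50806'
```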

Here’s the code for crawlWikiUKDialingCodes.groovy:

package jgf
import groovy.grape.Grape
import com.thoughtworks.selenium.*

@Grapes([
    @Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14'),
    @Grab(group='xerces', module='xercesImpl', version='2.9.1'),
    @Grab(group='org.seleniumhq.selenium.client-drivers', module='selenium-java-client-driver', version='1.0.1') ])

class CrawlWikiUKDialingCodes extends GroovySeleneseTestCase {

  @Override
  void setUp() throws Exception {
    super.setUp('http://en.wikipedia.org', '*chrome')
    setDefaultTimeout(50000)
    setCaptureScreenshotOnFailure(false)
    return null
  }

  void testCrawlWikiUKDialingCallingCodes() throws Exception {
    def crawl = init()
    extractShortCodes(crawl)
    extractLongCodes(crawl)
    extractLatLong(crawl)
    writeAllCodesToDisk(crawl)
    printResults(crawl)
    return null
  }

  def init() {
    selenium.open("http://en.wikipedia.org/wiki/List_of_United_Kingdom_dialling_codes")
    def crawl = [:]
    crawl.with {
      page                            = getNekoHtml()
      h                               = System.getenv('HOME')                // OS Shell var
      fs                              = System.getProperty('file.separator') // Java Sys Property
      nl                              = System.getProperty("line.separator") // Newline character
      d                               = "${h}${fs}Desktop"
      gsd                             = "${d}${fs}Groovy Scripts"
      def props                       = new ConfigSlurper().parse(new File("${gsd}${fs}crawl.properties").toURL())
      ukCallingCodesFile              = "${d}${props.ukCallingCodesFile}"
      encoding                        = 'UTF-8'
      scodes                          = [[],[],[]]
      lcodes                          = []
      errors                          = []
    }
    println ''
    return crawl
  }

  def getNekoHtml() {
    def parser = new org.cyberneko.html.parsers.SAXParser()
    parser.setFeature('http://xml.org/sax/features/namespaces', false)
    def nekoHtml = new XmlParser(parser).parseText(selenium.getHtmlSource())
    return nekoHtml
  }

  def extractShortCodes(crawl) {
    crawl.with {
      shortCodes = page.depthFirst().TABLE.findAll{it.'@class' == 'wikitable'}[1]
      def code = ''
      def city = ''
      def lastCode = ''
      def lastCity = ''
      shortCodes.TBODY.TR.eachWithIndex{tr, tri ->
        if (tri > 1) {
          tr.eachWithIndex{td, tdi ->
            def n = tdi % 2
            def sci = tdi.intdiv(2)
            td.each{node ->
              if (node.class.simpleName == 'String') {
                if (n) {
                  if (node == 'and') {
                    code = lastCode
                  } else {
                    city    = node
                    cityUrl = ''
                  }
                } else {
                  code    = node
                  codeUrl = ''
                }
              } else {
                if (node.name() == 'A') {
                  if (n) {
                    city    = node.text()
                    cityUrl = node.'@href'
                  } else {
                    code    = node.text()
                    codeUrl = node.'@href'
                  }
                }
              }
              if (tdi && n && city && code) {
                if (!city.startsWith('unused')) {
                  def cd = [code: code, codeUrl: codeUrl, city: city, cityUrl: cityUrl]
                  scodes[sci] << cd
                }
                lastCode = code
                code = ''
                city = ''
              }
            }
          }
        }
      }
    }
    return null
  }

  def extractLongCodes(crawl) {
    crawl.with {
      longCodes = page.depthFirst().TABLE.findAll{it.'@class' == 'wikitable'}[2]
      def uli = -1
      longCodes.TBODY.TR.eachWithIndex{tr, tri ->
        if (tri > 1) {
          tr.eachWithIndex{td, tdi ->
            def tdsz = td.children().size() - 1
            td.eachWithIndex{node, ndi ->
              if (!ndi
              ||  node.name() == 'HR') {
                lcodes << []
                uli += 1
              }
              if (node.name() == 'UL') {
                if ( ( (ndi + 1) <= tdsz) && td.children()[ndi + 1].name() != 'DL') {
                  def cds = []
                  def dash = '–'
                  def code
                  def codeUrl = ''
                  def city
                  def cityUrl
                  def cd
                  def li = node.LI[0]
                  li.eachWithIndex{ln, lni ->
                    if (ln.class.simpleName == 'String') {
                      if (ln.size() > 4 && ln[0..4].isNumber() ) {
                        code    = ln[0..4]
                      }
                    } else if (ln.name() == 'A') {
                      if (ln.text().size() > 4 && ln.text()[0..4].isNumber()) {
                        code    = ln.text()[0..4]
                        codeUrl = ln.'@href'
                      } else {
                        city    = ln.text() //ln.'@title'
                        cityUrl = ln.'@href'
                        cd = [code:code, codeUrl: codeUrl, city: city, cityUrl: cityUrl]
                        cds << cd
                      }
                    }
                  }
                  cds.each{c ->
                    lcodes[uli] << c
                  }
                }
              }
            }
          }
        }
      }
    }
    return null
  }

  def extractLatLong(crawl) {
    crawl.with {
      scodes.each{codes ->
        processCodes(codes)
      }
      lcodes.each{codes ->
        processCodes(codes)
      }
    }
    return null
  }

  def processCodes(codes) {
    codes.each{code ->
      selenium.open("http://en.wikipedia.org${code.cityUrl}")
      def page = getNekoHtml()
      def geo = page.depthFirst().SPAN.find{it.'@class' == 'geo'}?.text()
      def ll  = (geo) ? geo.tokenize('; ') : ['','']
      code.latitude  = ll[0]
      code.longitude = ll[1]
    }
  }

  def writeAllCodesToDisk(crawl) {
    crawl.with {
      def scsz = 0
      scodes.each{sc -> scsz += sc.size()}
      def lcsz = 0
      lcodes.each{lc -> lcsz += lc.size()}
      println "Writing $scsz short codes & $lcsz long codes to disk at : $ukCallingCodesFile"
      new jline.UnixTerminal().readCharacter(System.in)
      def xmlFile = new File(ukCallingCodesFile)
      xmlFile.write("<ukCallingCodes>$nl", encoding)
      def digitsLocal = 8
      scodes.each{codes ->
        writeCodesToDisk(codes, digitsLocal, xmlFile, crawl)
        digitsLocal = 7
      }
      lcodes.each{codes ->
        writeCodesToDisk(codes, 6, xmlFile, crawl)
      }
      xmlFile.append("</ukCallingCodes>$nl", encoding)
    }
    return null
  }

  def writeCodesToDisk(codes, digitsLocal, xmlFile, crawl) {
    crawl.with {
      codes.each{code ->
        def xml  = "  <code>$nl"
            xml += "    <prefix>$code.code</prefix>$nl"
            xml += "    <local-no-len>$digitsLocal</local-no-len>$nl"
            xml += "    <prefix-url>$code.codeUrl</prefix-url>$nl"
            xml += "    <city-name>$code.city</city-name>$nl"
            xml += "    <city-url>$code.cityUrl</city-url>$nl"
            xml += "    <city-lat>$code.latitude</city-lat>$nl"
            xml += "    <city-long>$code.longitude</city-long>$nl"
            xml += "  </code>$nl"
        xmlFile.append(xml, encoding)
        if (xml.contains('null')) {
          println "Null in: $code.code : $code.city"
          errors << code
        }
      }
    }
    return null
  }

  def printResults(crawl) {
    crawl.with {
      println '---'
      println ' Short Codes'
      println '---'
      scodes.each{codes ->
        codes.each{println it}
        println '---'
      }

      //new jline.UnixTerminal().readCharacter(System.in)
      println '==='

      println '---'
      println ' Long Codes'
      println '---'
      lcodes.each{codes ->
        codes.each{println it}
        //new jline.UnixTerminal().readCharacter(System.in)
        println '---'
      }
      if (errors) {
        println '==='

        println '---'
        println ' Calling codes with null data in XML'
        println '---'
        errors.eachWithIndex{e, i ->
          println e
          if (i && ( (i % 75) == 0) ) new jline.UnixTerminal().readCharacter(System.in)
        }
      }
    }
    return null
  }

}

Related posts:
