Embellishing NekoHTML text method with Groovy & recursion

NekoHTML is a fantastic library for crawling and extracting content from web pages.
But if you’ve got nodes that are nested to an arbitrary depth, calling text() on a node doesn’t include the nested text content. For example, text in a bold tag (B node) nested within a paragraph will not be returned when you call text() on the P node.

One of the things I’ve been working on at the moment is extracting content from the Wikipedia page that lists NANP telephone area codes.

I had a hell of a time tweaking and debugging a recursive routine to construct the paragraph text. The final result was quite elegant, but the path to the solution wasn’t obvious and was quite arduous to get right.
So, I thought I’d share the final result, along with a little debugging routine I built into the recursive process to trace the path the code follows.

I basically extract the P and UL nodes associated with each area code and store them in a separate AreaCode class.

The getText() method of AreaCode has an overloaded signature. You call it in its simple form, with a single parameter, from the client code and it calls into the helper routine. The depth parameter isn’t strictly necessary, but acts as a debugging aid in following the path through the code.
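As a rough illustration of that overloaded shape, here is a minimal sketch. The tree is built by hand with groovy.util.Node as a stand-in for what the parser would produce, and the names collectText, textOf and trace are hypothetical, not part of the AreaCode class. It joins pieces with a plain space rather than the smarter spacing rule used in the real listing.

```groovy
// Hand-built stand-in for <P><B><A>928</A></B>:<A>Arizona</A></P>
def p = new Node(null, 'P')
def b = new Node(p, 'B')
new Node(b, 'A', '928')
p.children().add(':')           // a text child sitting between two element children
new Node(p, 'A', 'Arizona')

def trace = []                  // records "depth:text" entries instead of printing

// Helper form: carries the current depth and the shared accumulator.
def collectText
collectText = { node, int depth, StringBuilder res ->
    node.children().each { child ->
        if (child instanceof String) {
            if (res.length() > 0) res.append(' ')
            res.append(child)
            trace << "${depth}:${child}".toString()
        } else {
            collectText(child, depth + 1, res)   // descend into nested element
        }
    }
    res
}

// Simple form: client code passes only the node.
def textOf = { node -> collectText(node, 1, new StringBuilder()).toString() }

println textOf(p)
println trace
```

The depth value plays no part in building the text; it only labels each trace entry, which is the same role it has in the debugging output shown later in this post.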

Analysing the debugging dump, I was surprised to find that you can call each on a String and it hands you one character at a time. I consequently refactored the code to behave more intelligently: when a text node is encountered, the whole node is appended to the resulting StringBuilder in one go rather than character by character!
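That String behaviour is easy to reproduce in isolation. The small sketch below (variable names are just for illustration) shows each visiting a String one character at a time, and a type check that lets a text node be appended whole:

```groovy
def pieces = []
'928'.each { ch -> pieces << ch }   // each on a String visits it character by character
println pieces.size()               // three one-character pieces, not a single '928'

// Checking the type first lets a text node be added in a single append.
def sb = new StringBuilder()
def node = '928'
if (node instanceof String) {
    sb.append(node)
} else {
    node.each { child -> sb.append(child) }
}
println sb
```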

The AreaCode class has a debug flag, set to false on creation. When it’s set to true, the code will dump out a debugging/trace output.
Here’s the code you can tweak to enable debugging:

  1. In the toString() method, uncomment the two println lines that print the Beg and End markers, and change the first debug = false (the one commented “Change to true to debug recursion”) to true.
  2. Uncomment the ‘if statement’ inside areaCodes.each at the end of extract200to900() to only debug certain area codes.

I ended up setting the debug flag in the toString() method, because the getText() method indirectly gets called in the code() method via pText() and I didn’t want to create debugging output inadvertently.

The appendText() method, which intelligently adds a space at the appropriate time as the text gets built, was at the heart of getting the desired result.
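To experiment with the spacing rule outside the class, a standalone sketch along these lines may help. It is a closure rewrite of the same rule with hypothetical local names (startsWithCloser, endsWithOpener), not the method from the listing:

```groovy
// Add a separating space unless the new text starts with closing punctuation,
// the result is still empty, or the result already ends with a space or an opener.
def appendText = { StringBuilder res, String newText ->
    boolean startsWithCloser = newText[0] in ['.', ')', ',', ']']
    boolean endsWithOpener   = res.length() > 0 &&
                               (res.charAt(res.length() - 1) as String) in [' ', '(', '[']
    if (!startsWithCloser && res.length() > 0 && !endsWithOpener) {
        res.append(' ')
    }
    res.append(newText)
}

def res = new StringBuilder()
['928', ':', 'Arizona', '(', 'Flagstaff', ',', 'Kingman'].each { appendText(res, it) }
println res   // 928 : Arizona (Flagstaff, Kingman
```

The pieces fed in here are the same text nodes that appear for area code 928 in the debug output further down.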

Here’s the code for crawlWikiNANPDialingCodes7.groovy:

Hope you enjoy.

package jgf
import groovy.grape.Grape
import com.thoughtworks.selenium.*

@Grapes([
    @Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14'),
    @Grab(group='xerces', module='xercesImpl', version='2.9.1'),
    @Grab(group='org.seleniumhq.selenium.client-drivers', module='selenium-java-client-driver', version='1.0.1') ])

class CrawlWikiNANPDialingCodes extends GroovySeleneseTestCase {

  @Override
  void setUp() throws Exception {
    super.setUp('http://en.wikipedia.org', '*chrome')
    setDefaultTimeout(50000)
    setCaptureScreenshotOnFailure(false)
    return null
  }

  void testCrawlWikiUKDialingCallingCodes() throws Exception {
    def crawl = init()
    extract200to900(crawl)

    return null
  }

  def init() {
    selenium.open("http://en.wikipedia.org/wiki/List_of_NANP_area_codes")
    def crawl = [:]
    crawl.with {
      page                            = getNekoHtml()
      areaCodes                       = []
    }
    println ''
    return crawl
  }

  def getNekoHtml() {
    def parser = new org.cyberneko.html.parsers.SAXParser()
    parser.setFeature('http://xml.org/sax/features/namespaces', false)
    def nekoHtml = new XmlParser(parser).parseText(selenium.getHtmlSource())
    return nekoHtml
  }

  def extract200to900(crawl) {
    crawl.with {
      def bc     = page.depthFirst().DIV.find{it.'@id' == 'bodyContent'}
      def ignore = true
      def ac
      bc.each{node ->
        if (node.name() == 'H2' && ignore && node.SPAN[1].'@id' == '200')
          ignore = false
        if (node.name() == 'H2' && !ignore && node.SPAN[1].'@id'.startsWith('By_state'))
          ignore = true
        if (!ignore) {
          switch (node.name()) {
             case 'P'  : ac        = new AreaCode()
                         areaCodes << ac
                         ac.pNode  = node
                         break
             case 'UL' : ac.ulNode = node
                         break
          }
        } // don't ignore
      } // bc.each (body content)
      areaCodes.each{ //if ((it.code() in ['928','971','975', '989']))
         println it
      }
    } // crawl.with
  }

}

class AreaCode {
   def pNode
   def ulNode
   def debug

   AreaCode() {
     debug = false
   }

   def pText() {
     return getText(pNode)
   }

   static nl() {
     return System.getProperty("line.separator")
   }

   def liNodes() {
     return (ulNode) ? ulNode.LI : null
   }

   def code() {
     return pText()[0..2]
   }

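   def uText() {
     // Join the LI texts with the line separator. Joining (rather than appending
     // a separator and trimming one character) also copes with the two-character
     // Windows separator and with a UL that has no LI children.
     if (!ulNode) return ''
     return liNodes().collect{node -> getText(node).toString()}.join(nl())
   }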

   String toString() {
     //println "-- Beg ${code()} ---"
     debug = false // Change to true to debug recursion
     def res
     def p = pText()
     if (ulNode) {
       res    = "$p${nl()}${uText()}"
     } else {
       res = p
     }
     debug = false
     //println "--- End ${code()} ---"
     return res
   }

   def getText(node) {
     def res = new StringBuilder()
     getText(node, 1, res)
     return res
   }

   def getText(node, depth, res) {
    if (debug) println "0B: $depth res: |$res|"
    if (node.class.simpleName =='String') {
      if (debug) println "1B: $depth res: |$res| node: $node"
      appendText(res, node)
      if (debug) println "1A: $depth res: |$res| node: $node"
    } else {
      node.each{child ->
        if (child.class.simpleName == 'String') {
          if (debug) println "2B: $depth res: |$res| child: |$child|"
          appendText(res, child)
          if (debug) println "2A: $depth res: |$res| child: |$child|"
        } else  {
          if (debug) println "3B: $depth res: |$res| child: |${child.name()}|"
          getText(child, depth +1, res)
          if (debug) println "3A: $depth res: |$res| child: |${child.name()}|"
        }
      }
    }
    if (debug) println "0A: $depth res: |$res|"
    return res.toString()
  }

  def appendText(res, newText) {
    if ( // if text being added doesn't start with . , ) or ]
         // and result has data
         // and result doesn't end with space ( or [
         // then append a space to result
         // Note: charAt returns type Char, so you need to cast to String for in to work..
        (!
          ( newText[0] in ['.', ')', ',', ']'] )
        ) &&
        ( res &&
          (!
            (
              (res.charAt(res.size() -1) as String)
               in [' ','(','[']
            )
          )
        )
       ) {
           res.append(' ')
         }
    res.append(newText)
  }
}

Here’s a sample of the tail end of the output in normal mode:

954 : Florida (all of Broward County : Fort Lauderdale, Hollywood, Coral Springs)
Created in 1995 by split from 305.
Overlain by 754 in 2002.
955 : not used
956 : Texas (Laredo, Brownsville, McAllen, Harlingen and South Texas)
Created in 1997 by split from 210.
957–958 : not used
959 : Connecticut
Overlain on 860 in 2001.
960–969 : not used (96x block reserved in case consecutive numbers are ever needed)
970 : Colorado (Aspen, Durango, Grand Junction, Fort Collins and northern and western Colorado)
Created in 1995 by split from 303.
971 : Oregon (Portland, Salem, Hillsboro, Beaverton and northwestern Oregon)
Partially overlain on 503 in 2000. Clatsop and Tillamook Counties, originally excluded from the overlay, were added in 2008.
972 : Texas
Created in 1996 by split from 214. In 1999 the split was reversed to become an overlay, and a second overlay of 469 was added.
973 : New Jersey (Newark, Paterson and northwestern New Jersey)
Created in 1997 by split from 201.
Overlain by 862.
974 : not used
975 is assigned for numbering relief to 816 (Missouri) but no date has been scheduled for this to go into effect.
976–977 : not used
978 : Massachusetts (Fitchburg, Peabody and northeastern Massachusetts)
Created in 1997 by split from 508.
Overlain by 351 in 2001.
979 : Texas (Wharton, Bryan, Bay City, College Station, Lake Jackson, La Grange and Southeast Texas)
Created in 2000 by split from 409.
980 : North Carolina
Overlain on 704 in 2001.
981–983 : not used
984 is assigned for overlay relief to 919 (North Carolina) but no date has been scheduled for this to go into effect.
985 : Louisiana (Houma, Slidell and southeastern Louisiana excluding New Orleans)
Created in 2001 by split from 504.
986–988 : not used
989 : Michigan (Alpena, Mt. Pleasant, Bay City, Saginaw, Midland, Owosso and central Michigan)
Created in 2000 by split from 517.

Here’s a sample of the output in debug mode:

The quirky B/A suffix in the first column means before/after and correlates to the specific println statements. The depth of recursion comes next, then res (the string being appended to), and finally either the HTML node name or the text node value.

-- Beg 928 ---
0B: 1 res: ||
3B: 1 res: || child: |B|
0B: 2 res: ||
3B: 2 res: || child: |A|
0B: 3 res: ||
2B: 3 res: || child: |928|
2A: 3 res: |928| child: |928|
0A: 3 res: |928|
3A: 2 res: |928| child: |A|
0A: 2 res: |928|
3A: 1 res: |928| child: |B|
2B: 1 res: |928| child: |:|
2A: 1 res: |928 :| child: |:|
3B: 1 res: |928 :| child: |A|
0B: 2 res: |928 :|
2B: 2 res: |928 :| child: |Arizona|
2A: 2 res: |928 : Arizona| child: |Arizona|
0A: 2 res: |928 : Arizona|
3A: 1 res: |928 : Arizona| child: |A|
2B: 1 res: |928 : Arizona| child: |(|
2A: 1 res: |928 : Arizona (| child: |(|
3B: 1 res: |928 : Arizona (| child: |A|
0B: 2 res: |928 : Arizona (|
2B: 2 res: |928 : Arizona (| child: |Flagstaff|
2A: 2 res: |928 : Arizona (Flagstaff| child: |Flagstaff|
0A: 2 res: |928 : Arizona (Flagstaff|
3A: 1 res: |928 : Arizona (Flagstaff| child: |A|
2B: 1 res: |928 : Arizona (Flagstaff| child: |,|
2A: 1 res: |928 : Arizona (Flagstaff,| child: |,|
3B: 1 res: |928 : Arizona (Flagstaff,| child: |A|
0B: 2 res: |928 : Arizona (Flagstaff,|
2B: 2 res: |928 : Arizona (Flagstaff,| child: |Kingman|
2A: 2 res: |928 : Arizona (Flagstaff, Kingman| child: |Kingman|
0A: 2 res: |928 : Arizona (Flagstaff, Kingman|
3A: 1 res: |928 : Arizona (Flagstaff, Kingman| child: |A|
2B: 1 res: |928 : Arizona (Flagstaff, Kingman| child: |,|
2A: 1 res: |928 : Arizona (Flagstaff, Kingman,| child: |,|
3B: 1 res: |928 : Arizona (Flagstaff, Kingman,| child: |A|
0B: 2 res: |928 : Arizona (Flagstaff, Kingman,|
2B: 2 res: |928 : Arizona (Flagstaff, Kingman,| child: |Prescott|
2A: 2 res: |928 : Arizona (Flagstaff, Kingman, Prescott| child: |Prescott|
0A: 2 res: |928 : Arizona (Flagstaff, Kingman, Prescott|
3A: 1 res: |928 : Arizona (Flagstaff, Kingman, Prescott| child: |A|
2B: 1 res: |928 : Arizona (Flagstaff, Kingman, Prescott| child: |,|
2A: 1 res: |928 : Arizona (Flagstaff, Kingman, Prescott,| child: |,|
3B: 1 res: |928 : Arizona (Flagstaff, Kingman, Prescott,| child: |A|
0B: 2 res: |928 : Arizona (Flagstaff, Kingman, Prescott,|
2B: 2 res: |928 : Arizona (Flagstaff, Kingman, Prescott,| child: |Yuma|
2A: 2 res: |928 : Arizona (Flagstaff, Kingman, Prescott, Yuma| child: |Yuma|
0A: 2 res: |928 : Arizona (Flagstaff, Kingman, Prescott, Yuma|
3A: 1 res: |928 : Arizona (Flagstaff, Kingman, Prescott, Yuma| child: |A|
2B: 1 res: |928 : Arizona (Flagstaff, Kingman, Prescott, Yuma| child: |and northern and western Arizona)|
2A: 1 res: |928 : Arizona (Flagstaff, Kingman, Prescott, Yuma and northern and western Arizona)| child: |and northern and western Arizona)|
0A: 1 res: |928 : Arizona (Flagstaff, Kingman, Prescott, Yuma and northern and western Arizona)|
0B: 1 res: ||
2B: 1 res: || child: |Created in 2001 by split from|
2A: 1 res: |Created in 2001 by split from| child: |Created in 2001 by split from|
3B: 1 res: |Created in 2001 by split from| child: |B|
0B: 2 res: |Created in 2001 by split from|
3B: 2 res: |Created in 2001 by split from| child: |A|
0B: 3 res: |Created in 2001 by split from|
2B: 3 res: |Created in 2001 by split from| child: |520|
2A: 3 res: |Created in 2001 by split from 520| child: |520|
0A: 3 res: |Created in 2001 by split from 520|
3A: 2 res: |Created in 2001 by split from 520| child: |A|
0A: 2 res: |Created in 2001 by split from 520|
3A: 1 res: |Created in 2001 by split from 520| child: |B|
2B: 1 res: |Created in 2001 by split from 520| child: |.|
2A: 1 res: |Created in 2001 by split from 520.| child: |.|
0A: 1 res: |Created in 2001 by split from 520.|
--- End 928 ---
928 : Arizona (Flagstaff, Kingman, Prescott, Yuma and northern and western Arizona)
Created in 2001 by split from 520.
-- Beg 971 ---
0B: 1 res: ||
3B: 1 res: || child: |B|
0B: 2 res: ||
3B: 2 res: || child: |A|
0B: 3 res: ||
2B: 3 res: || child: |971|
2A: 3 res: |971| child: |971|
0A: 3 res: |971|
3A: 2 res: |971| child: |A|
0A: 2 res: |971|
3A: 1 res: |971| child: |B|
2B: 1 res: |971| child: |:|
2A: 1 res: |971 :| child: |:|
3B: 1 res: |971 :| child: |A|
0B: 2 res: |971 :|
2B: 2 res: |971 :| child: |Oregon|
2A: 2 res: |971 : Oregon| child: |Oregon|
0A: 2 res: |971 : Oregon|
3A: 1 res: |971 : Oregon| child: |A|
2B: 1 res: |971 : Oregon| child: |(Portland, Salem, Hillsboro, Beaverton and northwestern Oregon)|
2A: 1 res: |971 : Oregon (Portland, Salem, Hillsboro, Beaverton and northwestern Oregon)| child: |(Portland, Salem, Hillsboro, Beaverton and northwestern Oregon)|
0A: 1 res: |971 : Oregon (Portland, Salem, Hillsboro, Beaverton and northwestern Oregon)|
0B: 1 res: ||
2B: 1 res: || child: |Partially overlain on|
2A: 1 res: |Partially overlain on| child: |Partially overlain on|
3B: 1 res: |Partially overlain on| child: |B|
0B: 2 res: |Partially overlain on|
2B: 2 res: |Partially overlain on| child: |503|
2A: 2 res: |Partially overlain on 503| child: |503|
0A: 2 res: |Partially overlain on 503|
3A: 1 res: |Partially overlain on 503| child: |B|
2B: 1 res: |Partially overlain on 503| child: |in 2000.|
2A: 1 res: |Partially overlain on 503 in 2000.| child: |in 2000.|
3B: 1 res: |Partially overlain on 503 in 2000.| child: |A|
0B: 2 res: |Partially overlain on 503 in 2000.|
2B: 2 res: |Partially overlain on 503 in 2000.| child: |Clatsop|
2A: 2 res: |Partially overlain on 503 in 2000. Clatsop| child: |Clatsop|
0A: 2 res: |Partially overlain on 503 in 2000. Clatsop|
3A: 1 res: |Partially overlain on 503 in 2000. Clatsop| child: |A|
2B: 1 res: |Partially overlain on 503 in 2000. Clatsop| child: |and|
2A: 1 res: |Partially overlain on 503 in 2000. Clatsop and| child: |and|
3B: 1 res: |Partially overlain on 503 in 2000. Clatsop and| child: |A|
0B: 2 res: |Partially overlain on 503 in 2000. Clatsop and|
2B: 2 res: |Partially overlain on 503 in 2000. Clatsop and| child: |Tillamook|
2A: 2 res: |Partially overlain on 503 in 2000. Clatsop and Tillamook| child: |Tillamook|
0A: 2 res: |Partially overlain on 503 in 2000. Clatsop and Tillamook|
3A: 1 res: |Partially overlain on 503 in 2000. Clatsop and Tillamook| child: |A|
2B: 1 res: |Partially overlain on 503 in 2000. Clatsop and Tillamook| child: |Counties, originally excluded from the overlay, were added in 2008.|
2A: 1 res: |Partially overlain on 503 in 2000. Clatsop and Tillamook Counties, originally excluded from the overlay, were added in 2008.| child: |Counties, originally excluded from the overlay, were added in 2008.|
0A: 1 res: |Partially overlain on 503 in 2000. Clatsop and Tillamook Counties, originally excluded from the overlay, were added in 2008.|
--- End 971 ---
971 : Oregon (Portland, Salem, Hillsboro, Beaverton and northwestern Oregon)
Partially overlain on 503 in 2000. Clatsop and Tillamook Counties, originally excluded from the overlay, were added in 2008.
-- Beg 975 ---
0B: 1 res: ||
3B: 1 res: || child: |B|
0B: 2 res: ||
3B: 2 res: || child: |A|
0B: 3 res: ||
2B: 3 res: || child: |975|
2A: 3 res: |975| child: |975|
0A: 3 res: |975|
3A: 2 res: |975| child: |A|
0A: 2 res: |975|
3A: 1 res: |975| child: |B|
2B: 1 res: |975| child: |is assigned for numbering relief to|
2A: 1 res: |975 is assigned for numbering relief to| child: |is assigned for numbering relief to|
3B: 1 res: |975 is assigned for numbering relief to| child: |B|
0B: 2 res: |975 is assigned for numbering relief to|
3B: 2 res: |975 is assigned for numbering relief to| child: |A|
0B: 3 res: |975 is assigned for numbering relief to|
2B: 3 res: |975 is assigned for numbering relief to| child: |816|
2A: 3 res: |975 is assigned for numbering relief to 816| child: |816|
0A: 3 res: |975 is assigned for numbering relief to 816|
3A: 2 res: |975 is assigned for numbering relief to 816| child: |A|
0A: 2 res: |975 is assigned for numbering relief to 816|
3A: 1 res: |975 is assigned for numbering relief to 816| child: |B|
2B: 1 res: |975 is assigned for numbering relief to 816| child: |(Missouri) but no date has been scheduled for this to go into effect.|
2A: 1 res: |975 is assigned for numbering relief to 816 (Missouri) but no date has been scheduled for this to go into effect.| child: |(Missouri) but no date has been scheduled for this to go into effect.|
0A: 1 res: |975 is assigned for numbering relief to 816 (Missouri) but no date has been scheduled for this to go into effect.|
--- End 975 ---
975 is assigned for numbering relief to 816 (Missouri) but no date has been scheduled for this to go into effect.
-- Beg 989 ---
0B: 1 res: ||
3B: 1 res: || child: |B|
0B: 2 res: ||
3B: 2 res: || child: |A|
0B: 3 res: ||
2B: 3 res: || child: |989|
2A: 3 res: |989| child: |989|
0A: 3 res: |989|
3A: 2 res: |989| child: |A|
0A: 2 res: |989|
3A: 1 res: |989| child: |B|
2B: 1 res: |989| child: |:|
2A: 1 res: |989 :| child: |:|
3B: 1 res: |989 :| child: |A|
0B: 2 res: |989 :|
2B: 2 res: |989 :| child: |Michigan|
2A: 2 res: |989 : Michigan| child: |Michigan|
0A: 2 res: |989 : Michigan|
3A: 1 res: |989 : Michigan| child: |A|
2B: 1 res: |989 : Michigan| child: |(Alpena, Mt. Pleasant, Bay City, Saginaw, Midland, Owosso and central Michigan)|
2A: 1 res: |989 : Michigan (Alpena, Mt. Pleasant, Bay City, Saginaw, Midland, Owosso and central Michigan)| child: |(Alpena, Mt. Pleasant, Bay City, Saginaw, Midland, Owosso and central Michigan)|
0A: 1 res: |989 : Michigan (Alpena, Mt. Pleasant, Bay City, Saginaw, Midland, Owosso and central Michigan)|
0B: 1 res: ||
2B: 1 res: || child: |Created in 2000 by split from|
2A: 1 res: |Created in 2000 by split from| child: |Created in 2000 by split from|
3B: 1 res: |Created in 2000 by split from| child: |B|
0B: 2 res: |Created in 2000 by split from|
3B: 2 res: |Created in 2000 by split from| child: |A|
0B: 3 res: |Created in 2000 by split from|
2B: 3 res: |Created in 2000 by split from| child: |517|
2A: 3 res: |Created in 2000 by split from 517| child: |517|
0A: 3 res: |Created in 2000 by split from 517|
3A: 2 res: |Created in 2000 by split from 517| child: |A|
0A: 2 res: |Created in 2000 by split from 517|
3A: 1 res: |Created in 2000 by split from 517| child: |B|
2B: 1 res: |Created in 2000 by split from 517| child: |.|
2A: 1 res: |Created in 2000 by split from 517.| child: |.|
0A: 1 res: |Created in 2000 by split from 517.|
--- End 989 ---
989 : Michigan (Alpena, Mt. Pleasant, Bay City, Saginaw, Midland, Owosso and central Michigan)
Created in 2000 by split from 517.