Groovy Regular Expressions to abbreviate compass directions with look ahead and look behind.

I wrote this handy routine to abbreviate compass directions (plus ‘Central’ to ‘C’).
The catch was that I didn’t want to abbreviate US state names like North Dakota or South Dakota.
I also didn’t want to corrupt place names like Northridge, and I had to deal with all sorts of permutations of compass points.
As well as not matching when certain names follow a compass point, I also had the case of a place called George West, where the name comes before the point. So I had my work cut out for me. :-)
Anyway, here’s the routine:

def points = [[k: ~/(N|n)ortheast(ern)?/                                                                        , v:'NE'],
              [k: ~/(?>(N|n)orth(W|w)est(ern|:)?)(?! Territories)/                                              , v:'NW'],
              [k: ~/(S|s)outheast(ern)?/                                                                        , v:'SE'],
              [k: ~/(?>(S|s)outh(\s)?west(ern)?)(?! Hill| Bend)/                                                , v:'SW'],
              [k: ~/(?>(N|n)orth(ern)?|Upstate)(?! Carolina| Dakota| Platte| Neck| Mariana Islands| Bay|ridge)/ , v:'N' ],
              [k: ~/(E|e)ast(ern)?/                                                                             , v:'E' ],
              [k: ~/(?>(S|s)outh(ern|side)?)(?! Carolina| Dakota)/                                              , v:'S' ],
              [k: ~/(?!(?<=George ))(?>(W|w)est(ern| of the)?)(?! Virginia| Palm Beach)/                        , v:'W' ],
              [k: ~/(?>(C|c)entral|Center|Middle|the middle section of the)(?!town| Peninsula| Tennessee|ia)/   , v:'C' ]
             ]
def text =  'West Virginia'
points.each {p ->
  def matcher = (text =~ p.k)
  text = matcher.replaceAll(p.v)
  println "p.v: ${p.v} text: $text"
}
return null

If I break apart the eighth line of the routine (the ‘W’ pattern), which is the most sophisticated of the lines in the example, it’s saying:

  1. Don’t match ‘West’ (or a variation) if it’s prefixed by ‘George ‘.
  2. It’s using a ‘look behind’, i.e. ?<= for ‘George ‘. So effectively, at the point where ‘West’ would start to match, it discounts ‘George West’.
  3. The exclamation mark is the ‘not’ symbol.
  4. Then we have the variations: ‘West’, ‘Western’, or ‘West of the’ (upper/lower case permutations).
  5. The question mark represents an ‘optional (zero or one occurrence)’, and the vertical bar an ‘or’ condition.
  6. The parentheses break the regex into ‘groups’ to which you can then apply quantifiers or cardinality rules.
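  7. The ?> symbol makes the West grouping ‘atomic’: once it has matched the longest variation it won’t back off to a shorter one, so ‘Western Virginia’ can’t be half-matched as ‘West’. The ?! that follows is the ‘negative look ahead’: whatever the West grouping matched only counts if it is not (‘!’) followed by a suffix of ‘ Virginia’ or (‘|’) ‘ Palm Beach’.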
  8. So, finally ‘West Virginia’ or ‘West Palm Beach’ will not be matched, but ‘West Los Angeles’ would match.
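
To see those pieces working, here is a small sketch that runs the ‘W’ pattern on its own. The helper name abbreviateWest and the second, non-atomic pattern are my own additions for illustration; they are not part of the routine.

```groovy
// The 'W' pattern, copied as-is from the routine.
def west = ~/(?!(?<=George ))(?>(W|w)est(ern| of the)?)(?! Virginia| Palm Beach)/

// Hypothetical helper, for illustration only.
def abbreviateWest = { String s -> west.matcher(s).replaceAll('W') }

// Variations that should be abbreviated.
assert abbreviateWest('West Los Angeles')    == 'W Los Angeles'
assert abbreviateWest('Western Australia')   == 'W Australia'
assert abbreviateWest('West of the Rockies') == 'W Rockies'

// The look ahead protects these suffixes.
assert abbreviateWest('West Virginia')   == 'West Virginia'
assert abbreviateWest('West Palm Beach') == 'West Palm Beach'

// The look behind protects this prefix.
assert abbreviateWest('George West') == 'George West'

// Why the atomic group (?>...) matters: without it the engine backs off
// from 'Western' to 'West', and the look ahead then sees 'ern Virginia'.
def nonAtomic = ~/(W|w)est(ern| of the)?(?! Virginia| Palm Beach)/
assert nonAtomic.matcher('Western Virginia').replaceAll('W') == 'Wern Virginia'
assert abbreviateWest('Western Virginia') == 'Western Virginia'
```

As far as I can tell, (?!(?<=George )) could also be written as a plain negative look behind, (?<!George ), but I’ve left the pattern the way it appears in the routine.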

Try setting text to ‘Northridge’, ‘George West’, ‘Middletown’, ‘South Dakota’ etc. and you’ll see it doesn’t abbreviate the text. But something like ‘East Los Angeles’ would become ‘E Los Angeles’.
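
If you’d rather check a batch of names in one go, the loop can be wrapped in a closure, roughly like this. It’s only a sketch: the name abbreviate is made up for the example, and the patterns are a copy of the ones in the routine (with ‘Territories’ spelled correctly).

```groovy
// Copy of the patterns from the routine. Order matters: the two-letter
// points (NE, NW, SE, SW) have to be tried before N, E, S and W.
def points = [
  [k: ~/(N|n)ortheast(ern)?/, v: 'NE'],
  [k: ~/(?>(N|n)orth(W|w)est(ern|:)?)(?! Territories)/, v: 'NW'],
  [k: ~/(S|s)outheast(ern)?/, v: 'SE'],
  [k: ~/(?>(S|s)outh(\s)?west(ern)?)(?! Hill| Bend)/, v: 'SW'],
  [k: ~/(?>(N|n)orth(ern)?|Upstate)(?! Carolina| Dakota| Platte| Neck| Mariana Islands| Bay|ridge)/, v: 'N'],
  [k: ~/(E|e)ast(ern)?/, v: 'E'],
  [k: ~/(?>(S|s)outh(ern|side)?)(?! Carolina| Dakota)/, v: 'S'],
  [k: ~/(?!(?<=George ))(?>(W|w)est(ern| of the)?)(?! Virginia| Palm Beach)/, v: 'W'],
  [k: ~/(?>(C|c)entral|Center|Middle|the middle section of the)(?!town| Peninsula| Tennessee|ia)/, v: 'C']
]

// Hypothetical wrapper: feed the text through every pattern in turn.
def abbreviate = { String text ->
  points.inject(text) { acc, p -> p.k.matcher(acc).replaceAll(p.v) }
}

// Names that should come back untouched.
['Northridge', 'George West', 'Middletown', 'West Virginia', 'North Dakota'].each {
  assert abbreviate(it) == it
}

// Names that should be abbreviated.
assert abbreviate('East Los Angeles')        == 'E Los Angeles'
assert abbreviate('Northeastern University') == 'NE University'
assert abbreviate('Central Park')            == 'C Park'
```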

Footnote:

After reading Mastering Regular Expressions, I found you can also set the regular expression to ‘ignore case’ mode with (?i). So the (N|n) can be simplified by prefixing the RegEx pattern like so.
Obviously the RegEx would now match ‘NORTHEASTERN’, whereas it didn’t before, so it’s a broader matcher.

def points = [[k: ~/(?i)northeast(ern)?/                                                                         , v:'NE'],
//... rest of code as before
                    ]
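
A quick way to convince yourself of the difference is to try the ‘NE’ pattern with and without the flag. This is just a sketch with variable names I picked for the example. One thing to watch: the flag has to be written (?i). The look-alike (?i:) with nothing after the colon is an empty group, so as far as I can tell the flag ends up applying to nothing.

```groovy
def caseSensitive   = ~/(N|n)ortheast(ern)?/
def caseInsensitive = ~/(?i)northeast(ern)?/
def emptyFlagGroup  = ~/(?i:)northeast(ern)?/   // flag scoped to an empty group

// The original pattern only allows the first letter to vary in case.
assert caseSensitive.matcher('Northeastern').replaceAll('NE') == 'NE'
assert !caseSensitive.matcher('NORTHEASTERN').find()

// With (?i) any mix of upper and lower case matches.
assert caseInsensitive.matcher('NORTHEASTERN').replaceAll('NE') == 'NE'
assert caseInsensitive.matcher('NorthEastern').replaceAll('NE') == 'NE'

// (?i:) leaves 'northeast' case sensitive, so nothing changes here.
assert !emptyFlagGroup.matcher('NORTHEASTERN').find()
```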

