Groovy Regex text manipulation example

def text = '201 : New Jersey ( Hackensack , Jersey City , Hoboken , Bayonne , Ridgewood , Union City , Teaneck , New Milford , and northeast New Jersey)'
def textPattern = /(.*)(\()(.*)(\))/         /* Break text into its constituent parts.
                                                Literal brackets must be escaped because ( and ) are regex metacharacters.
                                                A colon is not a metacharacter, so escaping it (as p1Pattern does) is harmless but unnecessary.
                                                Groovy's slashy strings /.../ save you from doubling every backslash, as an ordinary string literal would require.
                                                1) before '('
                                                2) '('
                                                3) between '(' & ')'
                                                4) ')'
                                              */
def p1Pattern = /(\d{3})(\s\:\s)([\w\s]+)/   /* Break part 1 from above into
                                                1) Three digits
                                                2) ' : '
                                                3) One or more word or whitespace characters.
                                                   Inside [ ] a | is a literal pipe, not alternation, so the class is written [\w\s].
                                              */
def p3Pattern = /([\w\s]+)/                  /* One or more word or whitespace characters. Applied to part 3) of textPattern, each match ends at a comma
                                              * (or any other non-word, non-space character), so the text is effectively split on commas.
                                              * A word character is [a-zA-Z_0-9]. See: http://java.sun.com/docs/books/tutorial/essential/regex/pre_char_classes.html
                                              */
(text =~ textPattern).each{fullText, p1, openbracket, p3, closebracket ->
  println fullText
  println "p1: $p1"
  (p1 =~ p1Pattern).each{fullP1, code, p2, state  ->
    println "code: $code"
    println "state: $state|"
  }
  println "p3: $p3"
  def matchP3 = (p3 =~ p3Pattern)
  println matchP3.groupCount()                  /* groupCount() returns the number of capturing groups in the pattern (here 1), not counting group 0, the full match.
                                                 * Handy for checking how many parameters to declare to the left of the closure arrow (->): groupCount() + 1.
                                                 * If you rely on the implicit 'it' parameter instead, each match arrives as a list: it[0] = full match, it[1] = first group.
                                                 * Since the single group here spans the whole pattern, it[0] and it[1] are always equal.
                                                 */
  matchP3.each{
    println "${it[1]}"
  }
}

Output from running program:

201 : New Jersey ( Hackensack , Jersey City , Hoboken , Bayonne , Ridgewood , Union City , Teaneck , New Milford , and northeast New Jersey)
p1: 201 : New Jersey
code: 201
state: New Jersey |
p3:  Hackensack , Jersey City , Hoboken , Bayonne , Ridgewood , Union City , Teaneck , New Milford , and northeast New Jersey
1
 Hackensack
 Jersey City
 Hoboken
 Bayonne
 Ridgewood
 Union City
 Teaneck
 New Milford
 and northeast New Jersey
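
As a follow-on to the program above, here is a minimal sketch of collecting the place names into a list rather than printing them. The shortened sample text and the variable names (inside, cityPattern, cities) are my own for illustration; findAll and trim are standard Groovy/Java String methods.

```groovy
// Shortened sample text in the same shape as the original (illustrative only).
def text = '201 : New Jersey ( Hackensack , Jersey City , Hoboken )'

// Group 2 of this pattern is everything between the outermost brackets.
def m = (text =~ /(.*)\((.*)\)/)
def inside = m[0][2]

// Precompiled pattern: one or more word or whitespace characters.
def cityPattern = ~/[\w\s]+/

// findAll returns every match as a String; the spread operator trims each one.
def cities = inside.findAll(cityPattern)*.trim()
println cities
```

Trimming removes the leading and trailing spaces that show up in the printed output above.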

Also note the difference:

def x = ~/(\w+)/
def y = /(\w+)/
println "x: ${x.class}"
println "y: ${y.class}"

Output:

x: class java.util.regex.Pattern
y: class java.lang.String

So strictly speaking it’s better to precompile a pattern with the ~/ / syntax when it is going to be reused: a String pattern has to be compiled again every time it is handed to =~ or ==~.
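
A small sketch of reusing one precompiled pattern across several strings. The sample lines (including the '212 : New York' entry) and the names codePattern and codes are made up for illustration.

```groovy
import java.util.regex.Pattern

// Compiled once: the ~ operator turns the slashy string into a Pattern.
def codePattern = ~/\d{3}/
assert codePattern instanceof Pattern

// Made-up sample data, not from the original post.
def lines = ['201 : New Jersey', '212 : New York', 'no code here']

// =~ accepts a Pattern on the right-hand side as well as a String,
// so the same compiled object is reused for every line.
def codes = lines.collect { line ->
    def m = (line =~ codePattern)
    m.find() ? m.group() : null
}
println codes
```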

Also note that the Groovy ‘find’ operator =~ returns a Matcher object, whereas the ‘match’ operator ==~ returns a Boolean and is more restrictive.

The pattern for ‘match’ has to match the entire search string for the result to be true.
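
To make the distinction concrete, a minimal sketch (variable names are my own):

```groovy
import java.util.regex.Matcher

def text = '201 : New Jersey'

// 'find' operator: returns a Matcher that can locate the pattern anywhere in the text.
def findResult = (text =~ /\d{3}/)
assert findResult instanceof Matcher
def hasCode = findResult.find()
def code = findResult.group()

// 'match' operator: returns a Boolean, true only if the WHOLE string matches.
def partialMatch = (text ==~ /\d{3}/)
def fullMatch = (text ==~ /\d{3} : [\w ]+/)

println "find: $hasCode ($code), partial match: $partialMatch, full match: $fullMatch"
```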

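You can also get caught out by the pre-defined character class ‘.’, which Java documents as ‘any character’. That only holds while the dot is unescaped: written as \. it matches a literal full stop and nothing else, so the pattern below cannot match the en dash in the text. :-( Grrr!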

def t            = '257–259 : not used'
def patCodeRange = ~/\d{3}\.\d{3}/
def r = (t =~ patCodeRange)
if (r) println "found"
r.each{println it}
return null

This doesn’t yield a match!

Whereas this does:

def t            = '257–259 : not used'
def patCodeRange = ~/\d{3}\W\d{3}/
def r = (t =~ patCodeRange)
if (r) println "found"
r.each{println it}
return null
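
To see where the first attempt goes wrong, here is a small side-by-side sketch. The separator in the text is an en dash (U+2013), written here as a \u escape so that the source file encoding does not matter; the variable names are my own.

```groovy
// '257' + en dash (U+2013) + '259', as in the text above.
def t = '257\u2013259 : not used'

// Escaped dot: matches a literal '.' only, so the en dash is rejected.
def escapedDot = (t =~ /\d{3}\.\d{3}/).find()

// Unescaped dot: matches any character (line terminators excepted by default).
def unescapedDot = (t =~ /\d{3}.\d{3}/).find()

// \W: any non-word character, which includes the en dash.
def nonWord = (t =~ /\d{3}\W\d{3}/).find()

println "escaped dot: $escapedDot, unescaped dot: $unescapedDot, non-word: $nonWord"
```

Of the two patterns that match, \W is the tighter choice, since an unescaped dot would also accept a digit or a letter between the two codes.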

Footnote:

Since purchasing Jeffrey Friedl’s Mastering Regular Expressions book, I’ve found out that you can also use \p{Pd} to represent hyphens and dashes of all sorts (P123, Table 3.9).

This also gets a mention in the Javadoc for Pattern.

found
257–259
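
A minimal sketch of the \p{Pd} approach from the footnote, assuming the Unicode "dash punctuation" category is what you want to accept between the two codes. The sample strings other than the first are made up for illustration.

```groovy
// \p{Pd} = Unicode category "Punctuation, dash": hyphen-minus, en dash, em dash and so on.
def dashRange = ~/\d{3}\p{Pd}\d{3}/

def withEnDash = ('257\u2013259 : not used' =~ dashRange).find()   // en dash, U+2013
def withHyphen = ('257-259' =~ dashRange).find()                   // plain hyphen-minus, U+002D
def withDot    = ('257.259' =~ dashRange).find()                   // a full stop is not a dash

println "en dash: $withEnDash, hyphen: $withHyphen, dot: $withDot"
```

Unlike \W, this rejects separators such as spaces or full stops while still accepting every kind of dash.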
