Adapting location data extracted from LinkedIn locality in readiness for a Mashup
I experienced a lot of problems with the locality data from LinkedIn.
You’ll see I had to deal with Germany vs Deutschland issues, a spurious ‘Area’ suffix in the data (as in ‘Ottawa, Canada Area’), a missing country name (United States), US state names not in two-character format, and so on.
None of this is going to help me make a Google or Yahoo Maps mashup…
So I ended up summarising all the permutations from my contacts and writing a mapping text file (countries4.txt in my code).
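The kind of clean-up described above can be sketched in a few lines of Groovy. This is only an illustration, not the contents of countries4.txt: the alias and state tables, the `normalise` closure and the `[city, state, country]` result shape are all assumptions made for the example.

```groovy
// Hypothetical lookup tables; the real entries would come from the mapping file.
def countryAliases = ['Deutschland': 'Germany', 'Nederland': 'Netherlands']
def usStates = ['California': 'CA', 'New York': 'NY', 'Texas': 'TX']

// Normalise one raw locality string into [city, state, country].
def normalise = { String raw ->
    // Drop the trailing "Area" that appears on some localities.
    def cleaned = raw.trim().replaceAll('(?i)\\s+area$', '')
    def parts = cleaned.split(',').collect { it.trim() }
    def city = parts[0]
    def second = parts.size() > 1 ? parts[1] : ''

    // A full US state name implies the missing country and gets its two-letter code.
    if (usStates.containsKey(second)) {
        return [city, usStates[second], 'United States']
    }
    // Otherwise treat the second part as a country and translate known aliases.
    def country = countryAliases[second] ?: second
    return [city, '', country]
}

println normalise('Ottawa, Canada Area')
println normalise('Berlin, Deutschland')
println normalise('Austin, Texas')
```

A rule-based pass like this only covers the regular cases. The hand-written mapping file is still needed for localities that follow no pattern.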
I also had to tweak the XML output from the original crawl post today and add a lot of CDATA sections, because the XML files wouldn’t parse. There were all sorts of problems with the ‘&’ character. I then summarised the results.
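For anyone hitting the same ‘&’ problem, here is a minimal sketch of wrapping a value in a CDATA section and checking that the result parses. The `cdata` closure, the element names and the sample value are made up for the example and are not taken from my crawl script.

```groovy
import javax.xml.parsers.DocumentBuilderFactory

// Wrap free text in a CDATA section so characters like '&' and '<' stay literal.
// A literal "]]>" inside the text would end the section early, so it is split in two.
def cdata = { String text ->
    '<![CDATA[' + text.replace(']]>', ']]]]><![CDATA[>') + ']]>'
}

String company = 'Smith & Jones'   // made-up value containing a bare '&'
String xml = '<contact><company>' + cdata(company) + '</company></contact>'

// Parse with the JDK's DOM parser to confirm the document is well formed.
def doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))
String parsed = doc.getElementsByTagName('company').item(0).textContent
println parsed
```

An alternative is to escape the character as `&amp;` instead. CDATA is simply less fiddly when the text comes straight from a scraped page.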
Here’s the script I used to seed the text file.
xmllocs.groovy (You can click for PDF copy/pastable version)
I wrote an intermediary throwaway script to write the ‘locs’ to a file. It stripped out the enclosing ‘[’ and ‘]’ characters and put the separator character ‘|’ in situ, ready for manually tweaking the converted data.
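A rough sketch of that throwaway step, assuming each ‘locs’ entry arrives as the text form of a Groovy list such as `[Ottawa, , Canada]`. The `toRecord` closure and the sample values are hypothetical. The sketch pads empty fields with a single space because Groovy’s `tokenize` silently drops empty tokens between adjacent separators, which is one plausible way to end up with records of varying length.

```groovy
// Turn the text form of a list, e.g. "[Ottawa, , Canada]", into a '|' separated record.
def toRecord = { String listText ->
    def inner = listText.trim()
    // Strip the enclosing '[' and ']' characters.
    if (inner.startsWith('[') && inner.endsWith(']')) {
        inner = inner.substring(1, inner.length() - 1)
    }
    // Keep empty fields (limit -1) and pad them with a space so every record
    // has the same number of fields when it is tokenized later.
    inner.split(',', -1).collect { it.trim() ?: ' ' }.join('|')
}

def record = toRecord('[Ottawa, , Canada]')
println record

// tokenize drops the empty token between back-to-back separators...
def collapsed = 'Ottawa||Canada'.tokenize('|')
// ...but keeps the field when it contains a space.
def padded = record.tokenize('|')
println "${collapsed.size()} fields vs ${padded.size()} fields"
```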
I experienced problems with back-to-back separator characters, so I had to put a space between them to get a consistent data structure of the format:
I also experienced some encoding issues here.
A bit of Tweeting and Googling around threw up some answers. Thanks to both @franz_see and @lucastex.
- Groovy write API. It was ambiguous about whether the encoding was the first or second String parameter.
- Groovy files. This gave the answer.
- Lucastex also mentioned .withWriter, which I was able to find out more about in Groovy in Action (GinA), around p. 297.
- Now I had the problem of where to find a definitive list of encodings, and which one is best to use. Here’s the excellent link Franz See sent me.
- I found MacRoman worked well for me in the end, what with the Scandinavian and Polish character sets.
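To make the points above concrete, here is a small sketch of writing and reading a file with an explicit encoding. In the Groovy `File.write(text, charset)` method the text comes first and the charset name second, and `withWriter(charset)` takes the charset name before the closure. The temp file, the sample lines and the UTF-8 fallback are assumptions for the example. The non-ASCII letters are written as Unicode escapes so the script does not depend on its own source encoding.

```groovy
import java.nio.charset.Charset

// List the charset names this JVM knows about.
def names = Charset.availableCharsets().keySet()
println "${names.size()} charsets available"

// Use MacRoman if this JVM supports it, otherwise fall back to UTF-8 (an assumption).
def encoding = Charset.isSupported('MacRoman') ? 'MacRoman' : 'UTF-8'

def file = File.createTempFile('locs', '.txt')   // temp file just for the demo
file.deleteOnExit()

// Made-up sample records: Malmö and Århus.
def lines = ['Malm\u00f6| |Sweden', '\u00c5rhus| |Denmark']

// Option 1: write(text, charset) takes the text first and the charset second.
file.write(lines.join('\n'), encoding)

// Option 2: withWriter(charset) hands a writer to the closure and closes it afterwards.
file.withWriter(encoding) { w ->
    lines.each { w.write(it + '\n') }
}

// Read back with the same encoding to confirm the round trip.
def readBack = file.readLines(encoding)
println readBack
```

One caveat worth checking for your own data: as far as I know, Mac OS Roman covers Scandinavian letters such as ö and Å but not Polish letters such as Ł or ą, so UTF-8 may be the safer default when both appear.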
Here is a snippet showing how the new functionality gets embedded into my main crawl.groovy script.
Snippet showing how text file utilized in main crawl.groovy (You can click for PDF copy/pastable version)
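As a hedged sketch of the idea (not the actual code from crawl.groovy), loading a mapping file into a map and looking up a raw locality could look like this. The four-field line format `raw | city | state | country`, the `loadMapping` closure and the sample entries are all assumptions for illustration. The real layout of countries4.txt may differ.

```groovy
// Load a '|' separated mapping file into a map keyed by the raw locality text.
// Assumed line format (hypothetical): raw | city | state | country
def loadMapping = { File f, String encoding ->
    def mapping = [:]
    f.eachLine(encoding) { line ->
        // limit -1 keeps trailing empty fields
        def fields = line.split('\\|', -1).collect { it.trim() }
        if (fields.size() == 4) {
            mapping[fields[0]] = [city: fields[1], state: fields[2], country: fields[3]]
        }
    }
    mapping
}

// Demo data written to a temp file; in the real script this would be the mapping file.
def f = File.createTempFile('countries', '.txt')
f.deleteOnExit()
f.write('Ottawa, Canada Area | Ottawa | | Canada\n' +
        'Greater New York City Area | New York | NY | United States\n', 'UTF-8')

def mapping = loadMapping(f, 'UTF-8')

// Fall back to the raw text as the city when a locality has no mapping entry.
def lookup = { String raw ->
    mapping[raw] ?: [city: raw, state: '', country: '']
}

println lookup('Ottawa, Canada Area')
println lookup('Somewhere Unmapped')
```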
Hmm. I will have to adapt my crawler script to crawl my own profile. I was going to show the XML file for myself.
Here’s the one for my buddy Steve Dalton, who was writing about Groovy & Grails around the globe in Groovymag recently.
Excerpt of one of the XML files, showing the location node