Crawling LinkedIn Contacts using Groovy and Selenium

This script extracts the profile content of each of my first-level LinkedIn contacts. It automates the process with Selenium-RC, which handles the login and the page requests.

I initially looked at the Automating Form Submission section of this page, in order to crawl LinkedIn with HtmlUnit. However, it failed pitifully and couldn’t cope with the Ajax content on the page.

It also needed a load of JARs, which were a pain to add to the classpath when testing in the GroovyConsole.

import com.gargoylesoftware.htmlunit.WebClient
def webClient = new WebClient()
def page = webClient.getPage('https://www.linkedin.com/secure/login?trk=hb_signin')
assert page.titleText.startsWith('Sign In | LinkedIn')

I got an error to the effect:
Exception thrown: TypeError: Cannot find function hasOwnProperty in object net.sourceforge.htmlunit.corejs.javascript.EcmaError: ReferenceError: “Y$” is not defined….

So, after a hint from good ole Sven Haiges, I spent a couple of days reading the documentation for Selenium, covering both the IDE (a Firefox plug-in) and Selenium-RC.

Invoking Selenium IDE from Firefox Tools menu (click to download from Firefox)

The IDE lets you do the equivalent of recording macros. It’s best to learn with the HTML format, as shown below.

Selenium IDE showing languages available from Format Menu

With the HTML format selected, you can choose each command from the drop-down.

Selenium IDE commands from drop down

I’ve included the script that I used to get me started here.

Hand Crafted LinkedIn Login Selenium IDE script

As you can see, the tabbed pane at the bottom describes the selected command in detail.

My one gripe with Selenium in general is that the placement of the parameter for single-argument commands isn’t that intuitive.

Target to me sounds like locator. So why do echo and store place the parameter in Target? Value seems like a more intelligent default.

The Reference tab at the bottom of the IDE also fails to distinguish between Target and Value.

Consequently, I was forever placing a command with a single argument in the wrong place.

I was able to fix things by a process of trial and error, by changing the format to Groovy JUnit.

HTML is the only format that gives you two panes (Table and Source) in the upper panel.

Groovy and other languages don’t give you the level of hand holding the HTML option does.

Here’s the Groovy test code for the same test, which can be cut and pasted into a Groovy script when you switch over to RC.

LinkedIn Login Groovy equivalent Test script

You can find a more complete Selenium command reference here.

A core concept of Selenium is the locator. You can select elements by XPath, by CSS selector (a la jQuery) or by document (DOM) expression. I found the W3Schools XPath tutorials handy here. A word of warning: copying and pasting absolute XPath locations, by inspecting nodes in Firebug and using Copy XPath, was a bit hit and miss.
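To illustrate why absolute XPath locations can be hit and miss, here is a minimal sketch that uses only the JDK's XPath classes rather than Selenium. The markup and the `session_key` field id are made up for the example; they are assumptions, not LinkedIn's real page structure.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathFactory

// Parse a well-formed markup string into a DOM document (JDK classes only).
def parse = { String xml ->
    def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    builder.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))
}
def xpath = XPathFactory.newInstance().newXPath()

// Hypothetical login page before and after the site inserts a banner <div>.
def before = parse('<html><body><div><form><input id="session_key"/></form></div></body></html>')
def after = parse('<html><body><div class="banner"/><div><form><input id="session_key"/></form></div></body></html>')

// An absolute path depends on the exact position of every ancestor.
String absolutePath = '/html/body/div[1]/form/input/@id'
// A relative path keyed on an attribute only depends on the element itself.
String relativePath = "//input[@id='session_key']/@id"

def absoluteBefore = xpath.evaluate(absolutePath, before)
def absoluteAfter = xpath.evaluate(absolutePath, after)
def relativeAfter = xpath.evaluate(relativePath, after)

println "absolute path, original page: '${absoluteBefore}'"
println "absolute path, changed page:  '${absoluteAfter}'"
println "relative path, changed page:  '${relativeAfter}'"
```

The absolute path stops matching as soon as an extra element shifts the position of the form, whereas the attribute-based path keeps working.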

Firebug Copy Xpath

When using LinkedIn as a source, I believe it’s better to insert waitForElementPresent commands, otherwise you often get Element not found errors. This is probably because the site uses Ajax heavily and the complete page hasn’t always finished loading. It’s a trial and error process, but quite intuitive.
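Once you move from the IDE to a Groovy script, the same idea can be expressed as a small polling loop. This is only a sketch: `waitFor` is a hypothetical helper, not part of Selenium, and the condition here is a counter standing in for a real check such as `selenium.isElementPresent(locator)`.

```groovy
// Hypothetical helper: poll a condition until it holds or the timeout expires.
boolean waitFor(long timeoutMs, long intervalMs, Closure<Boolean> condition) {
    long deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
        if (condition()) {
            return true
        }
        Thread.sleep(intervalMs)
    }
    // One last check so a condition that became true during the final sleep still counts.
    return condition()
}

// Stand-in for an element that only "appears" on the third poll.
// With Selenium-RC the closure body would be a call like selenium.isElementPresent(locator).
int polls = 0
boolean found = waitFor(2000, 50) { ++polls >= 3 }

println "found=${found} after ${polls} polls"
```

Polling with a timeout avoids guessing a fixed pause length, at the cost of a slightly longer script.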

As you can see from the script, the store commands let you access variables later using ${name} notation, a la Velocity/FreeMarker or GString. Pause is sometimes useful for handling those pesky page load issues.

Click commands have a clickAndWait equivalent that serves a similar purpose.

Type fills in the input fields of your form, and echo prints out content a la Unix/DOS shells.

From here I now had enough pieces of the jigsaw to make the transition over to Selenium-RC and capitalise on conditional and branching type logic, already present in my Groovy extraction scripts.

More information on Selenium-RC and its ingenious architecture can be found here.

My script logs in to LinkedIn, navigates to my Contacts and extracts the ids 500 at a time (AppendIds), clicking the next link at the bottom of the pane to add the next batch of ids to a Groovy List.
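The batching can be sketched roughly as follows. This is not the original AppendIds code: `collectIds` is a hypothetical name, and the pages are modelled as plain lists of ids instead of real Contacts pages. The guard makes the run fail if a batch is read twice, which is what happens when the next click hasn't taken effect before the page is read.

```groovy
// Hypothetical stand-in: each inner list is the batch of ids read from one pane of contacts.
List<String> collectIds(List<List<String>> pages) {
    List<String> ids = []
    pages.each { batch ->
        // Fail fast if this batch overlaps what we already have,
        // e.g. because the same pane was read twice.
        assert ids.disjoint(batch) : "batch already collected: ${batch}"
        ids.addAll(batch)
    }
    ids
}

def healthyRun = [['1001', '1002', '1003'], ['1004', '1005'], ['1006']]
def allIds = collectIds(healthyRun)
println "collected ${allIds.size()} ids: ${allIds}"

// A stalled "next" click shows up as the same batch appearing twice.
def stalledRun = [['1001', '1002'], ['1001', '1002']]
boolean stalledDetected = false
try {
    collectIds(stalledRun)
} catch (AssertionError e) {
    stalledDetected = true
}
println "stalled run detected: ${stalledDetected}"
```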

Selenium (RC) has a command, getHtmlSource(), which returns the web page as a String.

This String (the web page HTML) in turn gets parsed by NekoHTML and XmlParser to pull out the ids.
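Here is a rough sketch of the id extraction step. To keep it self-contained it parses a small, well-formed snippet with plain XmlParser; for real-world HTML you would hand XmlParser the NekoHTML SAX parser instead. The markup, the `/profile/view?id=` link format and the `extractIds` name are assumptions made for the example, not LinkedIn's actual page.

```groovy
// Groovy 3+ import; on Groovy 2.x the class is groovy.util.XmlParser and needs no import.
import groovy.xml.XmlParser

// Hypothetical, well-formed stand-in for the page source returned as a String.
def html = '''<html><body>
  <ul id="contacts">
    <li><a href="/profile/view?id=1001">Ann Example</a></li>
    <li><a href="/profile/view?id=1002">Bob Example</a></li>
    <li><a href="/help">Help</a></li>
  </ul>
</body></html>'''

// Walk every node, keep the <a> elements and pull the numeric id out of the href.
List<String> extractIds(String pageSource) {
    def root = new XmlParser().parseText(pageSource)
    List<String> ids = []
    root.depthFirst().each { node ->
        if (node instanceof Node && node.name() == 'a') {
            def matcher = (node.attribute('href') ?: '').toString() =~ /[?&]id=(\d+)/
            if (matcher.find()) {
                ids << matcher.group(1)
            }
        }
    }
    ids
}

def ids = extractIds(html)
println "ids found: ${ids}"
```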

For each contact, two files get saved locally on my hard drive:

  • the HTML page for the profile, named LinkedIn-#########.html
  • an XML File named {fullName}-##########.xml

If a page doesn’t convert successfully, the id of the contact gets added to an errId List, which gets dumped out at the end.

The extraction process is wrapped in a try/catch block to enable the whole process to complete.
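In outline, the error handling looks something like the sketch below. The `convert` closure is a hypothetical stand-in for the real profile conversion, rigged to fail for one id so that the error list has something in it.

```groovy
// Hypothetical converter: fails for one id to simulate a quirky profile page.
def convert = { String id ->
    if (id == '1002') {
        throw new IllegalStateException("unexpected markup for contact ${id}")
    }
    "<contact id='${id}'/>".toString()
}

List<String> errIds = []
Map<String, String> converted = [:]

['1001', '1002', '1003'].each { id ->
    try {
        converted[id] = convert(id)
    } catch (Exception e) {
        // Remember the failure and carry on with the remaining contacts.
        errIds << id
    }
}

println "converted: ${converted.keySet()}"
println "failed ids to re-run: ${errIds}"
```

Printing the failed ids at the end makes it easy to paste them back in for a targeted re-run.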

I’ve managed to successfully convert 548 contacts with 100% success, but because of the dynamic nature of a LinkedIn page, there’s nothing to stop some quirky entry from causing the extraction process to splutter.

If a page doesn’t go through, I can set up an ids collection by copying and pasting the printed-out ids, then re-run without going through the contacts pages. There are some comments in the script to this effect.

I tend to do this and comment out the try/catch to home in on the problem if and when a new one arises during the profile page extraction process.

After an initial pass, if you re-run, only new additions will get converted.

A pre-existing XML file for a contact id means the conversion process for that contact will be skipped.

An HTML file without an XML file means the conversion didn’t go smoothly first time round, so the program reconverts from the local file.
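The re-run rules boil down to a small decision per contact id. The sketch below is one way to express it, assuming the file naming described earlier (LinkedIn-id.html and fullName-id.xml); the `Action` enum and `decide` method are hypothetical names, and the files are created in a temporary folder purely for the demonstration.

```groovy
import java.nio.file.Files

enum Action { SKIP, RECONVERT_LOCAL, FETCH_REMOTE }

// Decide what to do for one contact id, based on which files already exist.
Action decide(File dir, String id) {
    boolean hasXml = dir.listFiles().any { it.name.endsWith("-${id}.xml") }
    boolean hasHtml = new File(dir, "LinkedIn-${id}.html").exists()
    if (hasXml) {
        return Action.SKIP             // already converted on an earlier pass
    }
    hasHtml ? Action.RECONVERT_LOCAL   // downloaded, but the conversion failed
            : Action.FETCH_REMOTE      // new contact, nothing stored yet
}

// Demo data in a throwaway folder.
File dir = Files.createTempDirectory('linkedin-crawl').toFile()
new File(dir, 'LinkedIn-1001.html').text = '<html/>'
new File(dir, 'Ann Example-1001.xml').text = '<contact/>'
new File(dir, 'LinkedIn-1002.html').text = '<html/>'

['1001', '1002', '1003'].each { id ->
    println "${id}: ${decide(dir, id)}"
}
```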

To get the script to compile, you need to add NekoHTML (1.9.14), XercesImpl (2.9.1) and selenium-java-client-driver from the Selenium (1.0.3) download to your GroovyConsole classpath.

You also need to use the selenium-server.jar and create a bash script to initiate running the server prior to running the script. More information is available here.

I’m not sure whether this is necessary. But, I think it helps to have a copy of Firefox up and running before launching the script. I experienced some odd behaviour with things timing out when it wasn’t an active application.

With that out of the way, here’s the script in all its glory…















Revision/Version history:

5: I’ve refactored the code so it’s not quite so monolithic. I still intend to go on making it more object orientated. But this was a major refactor.

  • I’ve added Grape annotations for the JARs. Earlier I failed to import the Grape annotation itself and couldn’t fathom out why things weren’t working, until I came across some Tellurium documentation.
  • I continued to have some teething problems with encoding and UTF-8. Jeremy Brown’s blog post was some help here.
  • I added ConfigSlurper support for credential information (thanks to Jason Warner for the heads-up on the password in the script!) in the run-up to integrating with Yahoo Maps. The Google documentation was arduous compared to Yahoo’s for mapping stuff! I also noticed an error here, with Groovy being unable to recognise comments in properties files. According to this site a ‘!’ or ‘#’ at the start of the line is allowed, but ConfigSlurper croaks…
  • I’ve also used ConfigSlurper for configuration parameters.
    • The program can run locally or remotely.
    • The program can process a single myId, All or New.

    This results in six permutations. Locally, the ids are built off the downloaded HTML files; remotely, they are built off the Contacts pages. All rebuilds all XML files, while New or My mode pulls down new HTML or re-converts failed conversions to XML. The difference in My mode is that it works on a single user. You can configure the My mode id in the crawl.properties file, which also contains the locations of the folders where files are stored.

  • I thought there was an odd problem with Groovy list sizes: sometimes my list of contacts read 548 and at other times 1001. This was due to a Selenium timing issue. I fixed a timeout issue that resulted in the same 500 contacts being added to the list of ids a second time, and I’ve coded an assert to make the process fail if this happens again.
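On the ConfigSlurper configuration mentioned in the list above, here is a minimal sketch of reading crawl-style settings. The key names (`location`, `mode`, `myId`) are made up for the example and are not the real contents of crawl.properties. Loading the text through java.util.Properties first and then handing the result to ConfigSlurper is one possible way around the comment problem, since Properties accepts both ‘#’ and ‘!’ comment lines; whether that matches the exact failure described above is an assumption.

```groovy
// Hypothetical properties text; the real crawl.properties keys are not shown in this post.
def text = '''
# where the ids come from: local or remote
location = local
! which contacts to process: my, all or new
mode = new
myId = 12345
'''

// java.util.Properties understands both '#' and '!' comment lines.
def props = new Properties()
props.load(new StringReader(text))

// ConfigSlurper can build its config tree from an already loaded Properties object.
def config = new ConfigSlurper().parse(props)

// Two locations times three modes gives the six permutations.
def permutations = [['local', 'remote'], ['my', 'all', 'new']].combinations()

println "location=${config.location}, mode=${config.mode}, myId=${config.myId}"
println "supported permutations: ${permutations.size()}"
```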

4: I added my own profile to scan, fixed a lot of CDATA issues with the ‘&’ symbol and significantly enhanced the locality processing, which I talked about in more detail in this post.

Closing comments:

I would appreciate any tips on how to refactor.

Leave comments. Happy to share code with contacts on LinkedIn or Twitter followers.

One of the problems I faced was that calling the text() method on Nodes often returns spurious newline characters, which I had to remove. I will probably do this with MOP/Category, so that it’s done across the board whenever text() gets invoked. The page is comprised of several images. If you click on them individually, you can enlarge them to see the script more clearly.
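As a sketch of the Category idea, the example below adds a separate method to Node instead of replacing text() itself, which sidesteps the question of how to call the original text() from inside an override. `CleanTextCategory` and `cleanText` are hypothetical names, and the node is built by hand rather than parsed from a real profile page.

```groovy
// Hypothetical category: adds a cleanText() method to groovy.util.Node.
class CleanTextCategory {
    static String cleanText(Node self) {
        // Collapse each run of line breaks (plus surrounding whitespace) into one space.
        self.text().replaceAll(/\s*[\r\n]+\s*/, ' ').trim()
    }
}

// Hand-built node with the kind of stray newlines and indentation a parser can leave behind.
def node = new Node(null, 'headline', '\n  Senior Developer\n  at Example Corp\n')

def cleaned = use(CleanTextCategory) {
    node.cleanText()
}

println "raw:     ${node.text().inspect()}"
println "cleaned: ${cleaned.inspect()}"
```

The method is only visible inside the use block, so the rest of the script still sees the standard Node behaviour.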
