Extracting Data via HTTP: HtmlUnit, Part 2

Welcome to Part 2 of our HTML parsing tutorial. In this segment, we will go over an alternate way to pull HTML data, using the HtmlUnit Java library. The library can be downloaded from here; the most recent version at the time of writing was 2.13, so that is what this tutorial uses. Download whichever zip file suits your operating system and extract it to a safe place where you can get at the .jar files. I extracted mine with 7zip (just to be safe) and then got right to work.

image 1

Step 1: Load all of the Libraries.

HtmlUnit has quite a few .jar dependencies that you will need to load into Talend before you can use it. Create a job called HTML_Unit and place a tLibraryLoad component into it from the palette.

image 2

Inside the component, click the “…” button, navigate to your htmlunit/lib folder to see all the jar dependencies, and select the first one. Because tLibraryLoad doesn’t conveniently let you load more than one jar file, you’ll need a separate tLibraryLoad for every jar in that folder. It’s a pain, true, but you can copy and paste the component until you have them all and then chain them together with “OnComponentOk” triggers. Be sure each one loads a different library! Ours looks like this:

image 3

Luckily, now that you’ve done this, you can reuse this job anywhere else you want to use HtmlUnit: just drag the job in.

Step 2: Create your job

Now that you have your library loading job, you can really get to work. Create a new job called HU_Demo. Then, from the repository, under Job Designs, drag the HTML_Unit job you just created onto the canvas. This will create a tRunJob component set to run HTML_Unit. Just like that, your library is added!

image 4

Step 3: Using External Java Libraries

Next, add a tJavaRow component to the canvas and create a row from HTML_Unit to it to define the order of execution. This is where things may get a little complicated.

image 5

Working with the HtmlUnit API feels much like working with the standard Java API, and there is plenty of documentation to support it. The API reference can be found here. It may take a bit of playing around to get used to, but the information there is key to finding the right methods to use.

image 6

Our code won’t be too complicated, but I’ll break down what we’re doing line-by-line:

image 7

First, click the “Advanced Settings” tab in the Component tab of tJavaRow and add the following import statements:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.DomNode;

These statements import classes from the libraries we loaded in Step 1 so that we can use them in our code. Each one will be explained as it is used.

image 8

final WebClient webClient = new WebClient();

The first line establishes our web client. HtmlUnit’s main function is to act as a Java-based web browser, so this first line is similar to opening Firefox, Internet Explorer, or Chrome.

webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);

The slew of getOptions() calls that follows the first line configures our client to skip anything we don’t need, such as JavaScript and CSS. I encourage you to remove the statements one at a time to see what happens.

final HtmlPage page = webClient.getPage("http://etladvisors.wpengine.com/talend-tutorials/");

The page variable is like opening a new web page, and the URL we are accessing is defined in a string. Just like the last tutorial, we are going to our Talend Tutorials page to search for the headers.

final List<HtmlElement> headers = (List<HtmlElement>)page.getByXPath("//h2");

The list of headers is built with XPath (very useful): the expression “//h2” pulls every element of type “h2” from the entire document. If you remember from the last tutorial, that is the type of header that encapsulates the tutorial titles, and this is much more intuitive than a regex search for a seemingly random string.
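To get a feel for what “//h2” selects, here is a minimal sketch that runs the same expression with the XPath classes built into the JDK instead of HtmlUnit. It is only a stand-in: the sample markup, the class name XPathH2Sketch, and the method h2Texts are made up for illustration, and the JDK parser needs well-formed markup, whereas HtmlUnit is designed to load real web pages.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the Talend job.
class XPathH2Sketch {

    // Returns the text of every <h2> element, in document order.
    static List<String> h2Texts(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // "//h2" matches h2 elements at any depth in the document.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//h2", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup, not the real tutorials page.
        String sample = "<html><body>"
                + "<h1>Site title</h1>"
                + "<h2><a href=\"/one\">First tutorial</a></h2>"
                + "<p>Some text</p>"
                + "<div><h2><a href=\"/two\">Second tutorial</a></h2></div>"
                + "</body></html>";
        System.out.println(h2Texts(sample)); // prints [First tutorial, Second tutorial]
    }
}
```

Note that the second h2 is nested inside a div and is still matched, because the leading “//” searches the whole document rather than only the top level.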

int count = headers.size();

We’re going to iterate over each item in the list, so we want to know how many items there are. This line is basic Java.

row2.out = "";

row2.out is where the outgoing data will be saved, so we’re just setting it to an empty string for now.

for (int i = 0; i < count; i++) {
    final DomNode h = headers.get(i);
    row2.out += h.getChildNodes().get(0).getTextContent() + ",";
}

The for-loop first selects an item from the list and then appends the text we’re looking for, which lives in its children. Each of these items holds the text in its first child (the child being the <a href> </a> from before), and the getTextContent() method returns the String we want. The comma at the end separates each value so that the string can be split later.
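The string-building part of that loop can be tried on its own, without Talend or HtmlUnit. The sketch below is a plain-Java approximation: the class HeaderJoinSketch, the method joinWithTrailingComma, and the sample titles are all made up, and a StringBuilder stands in for the repeated += on row2.out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the loop inside tJavaRow.
class HeaderJoinSketch {

    // Appends each title followed by a comma, as the tutorial's loop does.
    static String joinWithTrailingComma(List<String> titles) {
        StringBuilder out = new StringBuilder();
        for (String title : titles) {
            out.append(title).append(",");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up titles standing in for the text of each h2 link.
        List<String> titles = Arrays.asList("First tutorial", "Second tutorial");
        System.out.println(joinWithTrailingComma(titles)); // prints First tutorial,Second tutorial,
    }
}
```

Two things are worth noticing: the result always ends in a comma, and a title that itself contains a comma would later be split into two values, so a different separator may be safer for other pages.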

webClient.closeAllWindows();

The last line is to close the web browser to ensure there is no memory leak.

Step 4: Split into Rows

Now add a tNormalize component and a final tLogRow component, connecting them with main rows.

image 9

Just as we did before, we want to split our single row into multiple rows. We know each value is separated by a comma, so we can choose that as our item separator.

image 10

Splitting this way leaves an empty string after the very last comma, so under the “Advanced Settings” tab, tick the “Discard the trailing empty strings” checkbox and you won’t have to worry about it.

image 11
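The effect of that checkbox can be illustrated with plain Java string splitting. This is only an analogy, not tNormalize’s actual implementation; the class NormalizeSketch, its two methods, and the sample line are made up for the example.

```java
import java.util.Arrays;

// Hypothetical illustration of keeping vs. discarding trailing empty strings.
class NormalizeSketch {

    // A negative limit keeps the empty string that follows the last comma.
    static String[] splitKeepingTrailing(String line) {
        return line.split(",", -1);
    }

    // The default split drops trailing empty strings.
    static String[] splitDiscardingTrailing(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        String line = "First tutorial,Second tutorial,"; // made-up sample value
        System.out.println(Arrays.toString(splitKeepingTrailing(line)));
        // prints [First tutorial, Second tutorial, ]
        System.out.println(Arrays.toString(splitDiscardingTrailing(line)));
        // prints [First tutorial, Second tutorial]
    }
}
```

With the trailing empty string kept, the job would log one extra blank row after the last title.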

Step 5: Run the Job

After that, our job is done! Click the run tab and behold how much simpler that was than the first job!

image 12

That concludes our tutorials on using HTTP to pull information from web pages.
