Extracting Data Via HTTP, Part 1

By Matt Irvin

This is the first of a two-part series on data acquisition via HTTP. In this first part, we will show how simple it is to pull information from between HTML tags with a little Java and some quick filtering with the tNormalize and tMap components. In Part 2, we will use an external library called HtmlUnit to do the job with a different, possibly simpler approach (depending on how you like to work, of course).

Step 1: Open Talend

Open Talend and create a new project or open an existing one.

Step 2: Create a new job

• Right-click on Job Designs in the Repository window and select “Create job”
• Name the job “http_extract”

image 1

Step 3: HTTP Access

Start by looking at our Talend Tutorials page and noting how the items are arranged. The layout is regular enough that there should be a standard way to pull information from it. We will attempt to pull all the tutorial headers from this page.

image 2

After you’ve created your job, add in a tHttpRequest Component from the Palette.

image 3

You’ll want to set the URI to the tutorials web page and change the acquisition method to “GET”. This retrieves the page’s HTML into the job as the response content.
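For readers who want to see roughly what a GET request like this amounts to in plain Java, here is a minimal sketch. It is not how tHttpRequest is implemented; the class and method names are hypothetical, the URI is whatever you pass on the command line, and UTF-8 decoding is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class HttpGetSketch {

    // Prepare a GET request. Nothing is sent until the response is read.
    static HttpURLConnection openGet(String uri) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
            conn.setRequestMethod("GET");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the whole response body into one string (assumes UTF-8 content).
    static String fetch(String uri) throws IOException {
        HttpURLConnection conn = openGet(uri);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java HttpGetSketch <uri>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

The string returned by `fetch` plays the same role as the “ResponseContent” column that the job passes downstream.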

Step 4: Choosing What to Find

To know what we want to pull from the HTML, we need a good look at it, so view the page source in your browser, or save the document to a file you can read. (For the screenshot, I right-clicked in Google Chrome and viewed the page source.)

image 4

The highlighted text appears before every link that leads to a Talend tutorial. Since the links themselves are all different, we cannot match on them, so the h2 tag is the better anchor: a quick Ctrl-F shows it is only used in front of tutorial names. We will use it in the regex to pull our titles.

Step 5: tJavaRow and Regular Expressions

image 5

For this step, we’ll need a custom code component, so drag in a tJavaRow component from the Palette and create a main row to it from the tHttpRequest component. For initial testing, you may also want to add a tLogRow to view what is being pulled. Be sure that “ResponseContent” is passed along Row2 to your next component.

In the tJavaRow component, enter the following code in the Basic settings and the proper import statements in the Advanced settings.

image 6 image 7

This code uses the Java regex Pattern and Matcher classes to find our h2 tag header and grab everything between it and the closing </a> tag. Each match is appended to a single string for us to parse in the next step.
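A minimal sketch of that kind of extraction follows, written as a standalone class so it can be run outside Talend. It is not necessarily the exact code in the screenshots: the regular expression, class name, and sample HTML are assumptions. Inside a tJavaRow you would inline the loop, read from `input_row.ResponseContent`, and assign the result to a column on `output_row`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone version of the tJavaRow logic.
class HeaderExtractSketch {

    // Capture everything after an opening <h2 ...> tag, up to (not including) the next </a>.
    private static final Pattern H2_TO_LINK_END =
            Pattern.compile("<h2[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Append every captured fragment into one string, in document order.
    static String extract(String html) {
        Matcher m = H2_TO_LINK_END.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            out.append(m.group(1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample markup; the real page will differ.
        String sample = "<h2><a href=\"http://example.com/a\">First</a></h2>"
                + "<p>ignored</p>"
                + "<h2><a href=\"http://example.com/b\">Second</a></h2>";
        System.out.println(extract(sample));
    }
}
```

With this pattern each fragment still carries its opening `<a href="...">` tag, so the combined string holds both the link and the title text for every header.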

Step 6: Split Into Rows

image 8

Now that we have all the information, we want to split it into separate rows we can process, so add in a tNormalize and set the item separator to “>”, which will give us each individual header by itself (plus an extra row at the beginning, for now).

image 9

To make our next task easier, we will also add a second tNormalize after the first to split on another item separator, “<a”, which puts our titles and links into their own rows.

image 10
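To see what the two splits do to the data, here is a rough plain-Java imitation of the two tNormalize steps. It is a sketch only: the class and method names are hypothetical, the sample string is made up, and the real component has options (such as discarding empty strings) that may change the exact rows you get.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical imitation of tNormalize, for illustration only.
class NormalizeSketch {

    // Split every input row on a literal separator, producing one output row per piece.
    static List<String> normalize(List<String> rows, String separator) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            // Pattern.quote treats the separator as plain text, not as a regex.
            for (String piece : row.split(Pattern.quote(separator))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up combined string of the kind produced in the previous step.
        List<String> input = Arrays.asList(
                "<a href=\"http://example.com/a\">First<a href=\"http://example.com/b\">Second");
        List<String> afterFirst = normalize(input, ">");
        List<String> afterSecond = normalize(afterFirst, "<a");
        for (String row : afterSecond) {
            System.out.println("[" + row + "]");
        }
    }
}
```

After the second split, the titles sit in rows of their own, next to rows holding the `href` fragments and an empty row left over from the leading “<a”.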

Step 7: Filter Out What We Don’t Need

image 11

Now we’re getting close to our goal, but we have excess rows of information that need to be discarded. Your first thought may be to use a tFilterRow, but its basic settings offer no obvious way to filter on whether a string contains a given word. It is much easier to use a tMap. So create one linked to your second tNormalize component and have its output go to a tLogRow. Then open the Map Editor for your tMap component.

image 12

On your output table, open the expression filter and type in a filter on strings that contain “etladvisors”, since those rows hold the URL information. I used the Java .contains() method to accomplish this. There is also an empty row to deal with, so add an AND (&&) condition to filter out any empty strings. It should look something like this in the end (remember to use “!” to signify you DO NOT want these in the output):

image 13
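The same filter logic can be sketched in plain Java as follows. This is an illustration rather than the exact expression in the screenshot; the class name and the sample rows are made up, and the row and column names in the comment are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone version of the tMap expression filter.
class TitleFilterSketch {

    // Keep a row only if it is NOT a URL fragment and NOT empty.
    // In the tMap expression filter this might read (names are assumptions):
    //   !row4.line.contains("etladvisors") && !row4.line.isEmpty()
    static boolean keep(String value) {
        return value != null && !value.contains("etladvisors") && !value.isEmpty();
    }

    // Apply the filter to a list of rows, preserving their order.
    static List<String> filter(List<String> rows) {
        List<String> kept = new ArrayList<>();
        for (String row : rows) {
            if (keep(row)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up rows of the kind left after the two tNormalize steps.
        List<String> rows = Arrays.asList(
                "",
                " href=\"http://etladvisors.example/a\"",
                "First",
                " href=\"http://etladvisors.example/b\"",
                "Second");
        System.out.println(filter(rows));
    }
}
```

Only the rows that hold title text pass the filter; the URL fragments and the empty row are dropped.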

Final Step: Run the Job

You should now be getting all the headers. Congratulations!

image 14

Stay tuned, and read the alternative approach in Part 2: Using the HtmlUnit Library.
