Extracting Data via HTTP, Part 1
This is the first of a two-part series on data acquisition via HTTP. In this part, we will show how simple it is to pull information from between HTML tags with a little Java and some quick filtering from the tNormalize and tMap components. In Part 2, we will use an external library called HtmlUnit to do the same job in a different, possibly simpler way (depending on how you like to work, of course).
Open Talend and create a new project or open an existing one
• Right-click on Job Designs in the Repository window and select “Create job”
• Name the job “http_extract”
Start by looking at our Talend Tutorials page and noting how the items are arranged. From the layout of the page, there should be a standardized way to pull information out of it. We will attempt to pull all the tutorial headers from this page.
After you’ve created your job, add in a tHttpRequest Component from the Palette.
You’ll want to set the URI to the tutorials web page and change the method to “GET”. This will retrieve the page’s HTML into the job.
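Outside of Talend, the same GET can be sketched with Java’s built-in HttpClient (Java 11+). The URL below is a placeholder, not the real tutorials address:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HttpGetSketch {
    // Build a GET request for the given page, mirroring tHttpRequest's
    // URI + "GET" settings
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL: substitute the actual tutorials page
        HttpRequest request = buildRequest("https://example.com/tutorials");
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // the raw HTML, like ResponseContent
    }
}
```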
To know what we want to pull from the HTML, we’ll need a good look at it, so examine the source code in your browser, or save the document to a file you can read. (For the screenshot, I right-clicked in Google Chrome to view the source code.)
The highlighted text appears before every link that leads to a Talend tutorial. Since those links are each unique, we cannot parse on them; the h2 tag is a better anchor, since a quick Ctrl-F shows it is used only in front of tutorial names. We will use it in our regex to pull the titles.
For this step, we’ll need a custom code component, so drag in a tJavaRow component from the Palette and create a Main row to it from the tHttpRequest component. For initial testing, you may also want to add a tLogRow to view what is being pulled. Be sure that “ResponseContent” is being passed into Row2 leading to your next component.
In the tJavaRow component, enter the following code under Basic settings and the necessary import statements under Advanced settings.
This code uses the Java regex Pattern and Matcher classes to find our h2 tag header and capture everything between it and a closing </a> tag. Each match is appended into one string for us to parse in the next step.
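The logic described above might look like the following sketch. The exact pattern and class name are assumptions, not the original tutorial’s code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderExtractor {
    // Capture everything between an <h2 ...> opening tag and the closing </a>
    // (a lazy match so each header block is captured separately)
    private static final Pattern H2_BLOCK =
            Pattern.compile("<h2[^>]*>(.*?)</a>", Pattern.DOTALL);

    public static String extract(String html) {
        StringBuilder sb = new StringBuilder();
        Matcher m = H2_BLOCK.matcher(html);
        while (m.find()) {
            // Append each match into one string for the tNormalize steps to split
            sb.append(m.group(1));
        }
        return sb.toString();
    }
}
```

In the job itself, the input would be `row1.ResponseContent` and the result would be assigned to the output row’s column.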
Now that we have all the information, we want to split it into separate rows we can process, so add a tNormalize and set the item separator to “>”, which gives us each individual header by itself (plus an extra row at the beginning, for now).
To make the next step easier, add a second tNormalize after the first, with the item separator “<a”. This splits the titles and the links into their own rows.
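The two chained tNormalize components behave like the following sketch (the sample input string is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class NormalizeSketch {
    // Mimic two chained tNormalize components:
    // first split each row on ">", then split each result on "<a"
    public static List<String> normalize(String combined) {
        List<String> rows = new ArrayList<>();
        for (String first : combined.split(">")) {       // first tNormalize
            for (String second : first.split("<a")) {    // second tNormalize
                rows.add(second);
            }
        }
        return rows;
    }
}
```

After both splits, link fragments (containing the site URL) and titles sit in separate rows, ready for filtering.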
Now we’re getting close to our goal, but we have excess rows of information that need to be discarded. Your first thought may be to use a tFilterRow, but it has no built-in way to filter on whether a string contains a given word; a tMap is much easier. Create one linked to your second tNormalize component and have its output go to a tLogRow, then open the Map Editor for the tMap component.
In your output table, open the expression filter and type in a filter for strings that contain “etladvisors”, since those rows hold the URL information. I used the Java .contains() method to accomplish this. There is also an empty row to process, so add an AND (&&) clause to filter out empty strings. It should look something like this in the end (remember to use “!” to signify you DO NOT want these rows in the output):
You should be getting all the headers now, congratulations!
Stay tuned, and read the alternative approach in Part 2: Using the HtmlUnit Library.