Information Aggregation

In some jobs, using a tMap to extract just two columns of data isn’t worth the mapping. Especially if there are operations you need to perform on the data. The tAggregateRow component can easily solve this problem, with a wide array of functions to quickly give you the information you need.

Step 1: The Components

To begin, examine the simple structure of the job, as the components will not change from each part of this tutorial. All data will be fed in from a tFileInputDelimited component, passed thorugh the tAggregateRow component and then output.

Step 2: The Input File

The Company Info .txt document describes our input. You can download it here: CompanyInfo. Be sure to select its location in your tFileInputDelimited component, and indicate the first row is a header, and not to be parsed.

Step 3: Schema Control

“Department” and “Company” should be Strings, while “Employees” and “Expenses” are Integers. Our input and output schemas will be different, so for now only keep “Department” and add another row called “Count” of type Integer.

Step 4: First Setup, Department Count

 

In the “Group by” table the output column should be department, and the input column should be company. This means that the output will be written into the string “Department” but organized by input from “Company”. Thus, in this case though our output schema is named department, all the outputs will be company names. There will be one output per company.

In the “Operations” table, select the output column as count, function as count, and input column as deparment.

By doing this, we’ve set the job to do the following:

Return Count as {function:count} of {input:Department} grouped by {input:Company}.

Step 5: Run the Job

The output should be as follows with Company listed next to the counts of different departments within them.

Step 6: Alter the job

Change the input column for “Group By” to “Department”. In “Operations” make the function “sum” and input “Employees”

Then run the job again and view the new output.

The different departments are listed, but they do not discriminate by company. Of course, the only department that had been listed twice before was Marketing, so it added the values of 3 and 8 to get 11.

Step 7: One last change

Edit the schema of tAggregateRow and make the “Count” column a String. The change the “Group by” back to company, and the operation input to “Department”. Make the “Function” perform the list operation. Run the job for a new output.


A string with all the combined Departments of each company is given.

There are many other operations that can be performed, and they should all be looked at to reveal the best way to get very specific information out of large lists of data. The tAggregateRow components has plenty of uses if you know the full extent of its capability.

 

Leave a Reply

*

captcha *