Dynamic Hadoop Load

For this project, our client had a number of flat files with different schemas that they wanted to load into HDFS / Hive for large-volume data processing. Rather than building a separate job for each file format, we implemented a dynamic schema approach.
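As a rough illustration of the idea (the file layout, field names, and JSON format below are assumptions for this sketch, not the client's actual metadata), each flat file can be paired with a small schema definition that the load job parses at runtime:

```python
import json

# Hypothetical schema definition for one flat-file format; in the real job
# each incoming file type would ship with its own definition.
EXAMPLE_SCHEMA = """
{
  "table_name": "customer_orders",
  "delimiter": "|",
  "columns": [
    {"name": "order_id",   "type": "BIGINT", "nullable": false},
    {"name": "order_date", "type": "STRING", "nullable": false},
    {"name": "amount",     "type": "DOUBLE", "nullable": true}
  ]
}
"""

def load_schema(text):
    """Parse a schema definition and check it has the required keys."""
    schema = json.loads(text)
    for required in ("table_name", "delimiter", "columns"):
        if required not in schema:
            raise ValueError("schema definition is missing '%s'" % required)
    return schema

schema = load_schema(EXAMPLE_SCHEMA)
print(schema["table_name"], "->", len(schema["columns"]), "columns")
```

Because the job only knows about the schema definition, not any particular file format, the same code path can ingest every new feed that arrives with its own definition.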

This project contains the following features:

  • HDFS / Hive data ingestion
  • Dynamic schema implementation so the job is reusable across different data formats
  • Dynamic Hive table creation based on the relevant file schema (see the sketch after this list)
  • Schema validation of each file against its metadata definition
  • Data validation of data files prior to the HDFS put
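The sketch below shows how the Hive table creation and validation steps could fit together under the same assumed schema format; the CREATE TABLE template and row checks are simplified illustrations, not the production job.

```python
# Continues the sketch above: a schema dict of the same shape is assumed,
# and the DDL template / validation rules are simplified for illustration.
schema = {
    "table_name": "customer_orders",
    "delimiter": "|",
    "columns": [
        {"name": "order_id",   "type": "BIGINT", "nullable": False},
        {"name": "order_date", "type": "STRING", "nullable": False},
        {"name": "amount",     "type": "DOUBLE", "nullable": True},
    ],
}

def build_hive_ddl(schema, hdfs_location):
    """Generate a Hive CREATE TABLE statement from the schema definition."""
    cols = ",\n  ".join("%s %s" % (c["name"], c["type"]) for c in schema["columns"])
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS %s (\n  %s\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '%s'\n"
        "STORED AS TEXTFILE\n"
        "LOCATION '%s';"
        % (schema["table_name"], cols, schema["delimiter"], hdfs_location)
    )

def validate_rows(lines, schema):
    """Check field counts and non-nullable columns before the HDFS put."""
    errors = []
    expected = len(schema["columns"])
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(schema["delimiter"])
        if len(fields) != expected:
            errors.append((line_no, "expected %d fields, found %d" % (expected, len(fields))))
            continue
        for value, col in zip(fields, schema["columns"]):
            if not col["nullable"] and value == "":
                errors.append((line_no, "empty value in non-nullable column '%s'" % col["name"]))
    return errors

# Example usage: print the generated DDL and flag a bad row before loading.
print(build_hive_ddl(schema, "/data/raw/customer_orders"))
print(validate_rows(["1001|2014-03-14|25.50", "1002||"], schema))
```

In this shape, only rows that pass validation are staged for the HDFS put, and the generated DDL keeps the Hive table definition in lockstep with the file's schema.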