For this project our client had a number of flat files with different schemas that they wanted to load into HDFS / Hive for large-volume data processing. Rather than building a separate job for each file format, we implemented a dynamic schema approach.
This project contains the following features:
- HDFS / Hive data ingestion
- Dynamic schema implementation so the job is reusable across different data formats
- Dynamic Hive table creation using the relevant file schema
- Schema validation of each file against its metadata definition
- Data validation of data files prior to the HDFS put
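The dynamic table creation can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the type map, function names, and sample schema are all assumptions, standing in for whatever schema-definition format the job reads.

```python
# Hypothetical sketch: generate a Hive CREATE TABLE statement from a
# schema definition, so one job can create tables for any file format.
# The logical-to-Hive type map and all names below are illustrative.

TYPE_MAP = {"string": "STRING", "int": "INT", "decimal": "DECIMAL(18,2)", "date": "DATE"}

def build_create_table(table_name, columns, location):
    """Render a CREATE EXTERNAL TABLE statement from a list of
    (column_name, logical_type) pairs taken from the schema file."""
    col_defs = ",\n  ".join(
        f"`{name}` {TYPE_MAP[ltype]}" for name, ltype in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table_name} (\n"
        f"  {col_defs}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )

ddl = build_create_table(
    "sales_raw",
    [("order_id", "int"), ("customer", "string"), ("amount", "decimal")],
    "/data/raw/sales",
)
print(ddl)
```

Generating the DDL from the same schema definition used for validation keeps the table layout and the file checks from drifting apart.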
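The validation steps before the HDFS put could look something like the sketch below, again under assumed names and a CSV input: check the file's header against the schema's column names, then check that every row has the expected field count, and only put the file if no errors are found.

```python
# Hypothetical sketch of pre-put validation: verify the header row
# matches the schema and every data row has the right field count.
import csv
import io

def validate_file(text, schema):
    """Return a list of error strings; an empty list means the file
    passed validation and is safe to put into HDFS."""
    errors = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    expected = [name for name, _ in schema]
    if header != expected:
        errors.append(f"header mismatch: {header} != {expected}")
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(expected):
            errors.append(
                f"line {lineno}: expected {len(expected)} fields, got {len(row)}"
            )
    return errors

schema = [("order_id", "int"), ("customer", "string"), ("amount", "decimal")]
sample = "order_id,customer,amount\n1,acme,10.50\n2,globex\n"
errors = validate_file(sample, schema)
print(errors)
```

Rejecting malformed files before the put keeps bad rows out of the Hive table rather than surfacing as NULLs at query time.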