In NiFi I have flows that copy data from SQL databases (SQL Server, MySQL, Oracle, etc.) to Hive. After depositing the .avro files in HDFS and manually defining an .avsc schema, I build an external table (used as a temp/staging table) and then copy that table's data into a final Hive-managed table stored as Parquet with Snappy compression.
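For reference, the final managed table is defined roughly like this; the table name matches the one used further down, but the columns shown here are just placeholders, and the TBLPROPERTIES entry is one common way to request Snappy for Parquet output:

    -- final Hive-managed table: Parquet files, Snappy compression
    CREATE TABLE FINAL_TABLE (id INT, name STRING, updated_at TIMESTAMP)
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression'='SNAPPY');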
At the moment my flows look like the following:
GenerateFlowFile ->
DeleteHDFS (delete the Avro files left in HDFS by the previous flow execution) ->
ExecuteSQL (get the new Avro data) ->
PutHDFS (write the Avro files to HDFS) ->
ReplaceText (empty the FlowFile content) ->
MergeContent (e.g., 28 FlowFiles generated by ExecuteSQL are merged into 1 FlowFile) ->
ReplaceText (DROP and CREATE EXTERNAL TABLE, pointing to the Avro data and the .avsc file; see the HiveQL sketch after this list) ->
PutHiveQL ->
ReplaceText (INSERT INTO FINAL_TABLE SELECT * FROM EXTERNAL_TABLE) ->
PutHiveQL
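Concretely, the HiveQL produced by the two ReplaceText processors and executed by the two PutHiveQL steps looks something like this; the HDFS location and schema path are placeholders for my actual setup:

    -- first PutHiveQL: recreate the staging table over the freshly written Avro files
    DROP TABLE IF EXISTS EXTERNAL_TABLE;
    CREATE EXTERNAL TABLE EXTERNAL_TABLE
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/staging/my_table'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/my_table.avsc');

    -- second PutHiveQL: copy the staged Avro data into the managed Parquet table
    INSERT INTO TABLE FINAL_TABLE SELECT * FROM EXTERNAL_TABLE;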
I would like to know whether this approach is the right way to move data between these environments, or whether there is a more effective and efficient way to do it.
I'd also like to reduce the number of processors in my NiFi dataflows and, if possible, improve the efficiency of the data movement.