The input files containing the sales data will be delivered to an S3 bucket in CSV format once per day, at a specific time, with a maximum of 5 files per day.
The CSV data is to be loaded into Redshift using Glue.
Transformations such as calculations and column mapping are needed on the CSV data before loading it into Redshift.
Please share your inputs/suggestions on how to build an efficient solution.
The following has to be taken care of:
1. The client may or may not send duplicate rows within the CSV files.
2. The same data (e.g. the previous day's data) may be sent again on the next day.
3. Updates and inserts (upserts) on the Redshift tables need to be handled (see the sketch after this list).
4. Once the CSV files are processed/loaded into Redshift, move them to another location in the S3 bucket.
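A staging-table load in Redshift, driven by the Glue job, is one common way to cover points 1-3. Below is a minimal sketch, assuming a hypothetical target table sales with columns order_id, sale_date and amount, a hypothetical incoming S3 prefix, and a hypothetical IAM role; the Glue job could issue these statements as pre/post actions or over a JDBC connection, and point 4 (archiving the processed files) would be a separate S3 copy/delete step in the job.

```sql
-- Stage the day's CSV files into an empty copy of the target table.
CREATE TEMP TABLE sales_stage (LIKE sales);

COPY sales_stage
FROM 's3://my-bucket/incoming/'                              -- hypothetical prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'      -- hypothetical role
FORMAT AS CSV
IGNOREHEADER 1;

BEGIN;

-- Points 1 and 2: drop staged rows that are identical to rows already loaded
-- (covers a resend of the previous day's data).
DELETE FROM sales_stage
USING sales
WHERE sales_stage.order_id  = sales.order_id
  AND sales_stage.sale_date = sales.sale_date
  AND sales_stage.amount    = sales.amount;

-- Point 3: upsert = delete target rows being replaced, then insert the staged rows.
DELETE FROM sales
USING sales_stage
WHERE sales.order_id  = sales_stage.order_id
  AND sales.sale_date = sales_stage.sale_date;

INSERT INTO sales
SELECT DISTINCT * FROM sales_stage;   -- DISTINCT removes exact duplicates within the files

COMMIT;
```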
When you say "without using copy command", are you referring to all sources for COPY? COPY can load from S3, EMR, DynamoDB, or through an SSH connection. If that is the case, I can only think of 2 ways to get data into Redshift w/o COPY:
#1 - INSERT statements with literal values, issued through the leader node.
#2 - INSERT INTO ... SELECT from an external (Spectrum) table defined over the data in S3.
#1 is very slow and limited in how much data you can put in a single statement (SQL command length limit of 64k characters). It also bogs down the leader node, which can impact overall cluster performance. This is fine for a few rows of data but not a good idea for large amounts of data.
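For reference, #1 is just a multi-row INSERT; every value travels through the leader node as SQL text (hypothetical sales table and values):

```sql
-- Each batch is limited by the maximum SQL statement length and is parsed on the leader node.
INSERT INTO sales (order_id, sale_date, amount)
VALUES
  (1001, '2023-06-01', 49.99),
  (1002, '2023-06-01', 15.00),
  (1003, '2023-06-02', 120.50);
```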
#2 is a fine way to go, but you need to define the external table, have the data in S3, and ensure that the query picks up the right set of S3 files. This is a good pattern if you are going to keep reusing the S3 files, but I'm not sure how it is better than COPY from S3 in the single-load case.
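A sketch of #2, assuming an external schema (here called spectrum) has already been created against the Glue Data Catalog, and using a hypothetical S3 prefix and column layout:

```sql
-- External (Spectrum) table over the raw CSV files; the data stays in S3.
CREATE EXTERNAL TABLE spectrum.sales_raw (
  order_id  BIGINT,
  sale_date DATE,
  amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/incoming/'
TABLE PROPERTIES ('skip.header.line.count'='1');

-- Load into the local Redshift table without COPY.
INSERT INTO sales
SELECT order_id, sale_date, amount
FROM spectrum.sales_raw;
```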
If your intent is to find patterns that can load data w/o COPY FROM S3, then there are also the other 3 forms of COPY to think about.
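For example, COPY from DynamoDB follows the same shape as COPY from S3, just with a different source (hypothetical table and role):

```sql
COPY sales
FROM 'dynamodb://my-sales-table'                             -- hypothetical DynamoDB table
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'      -- hypothetical role
READRATIO 50;   -- share of the table's provisioned read capacity to use
```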