I am starting in Data Science and I come from math/stats/economics. I am very used to precise definitions even if it means going a bit deeper into the theory to explain something as simple as a function.
I tried to look for precise definitions of Stage / Staging when used as:
- Staging area
- Staging environment
- Staging models
- Staging file
- a staging step in git
- etc
For example: https://githowto.com/staging_and_committing Here, I can understand the context, of course, but I'd like an abstract computer engineering explanation of what it is as if you were learning the theory to build a "stage" on your own.
However, none of the explanations were able to precisely define what it is and where it comes from. For example, if you are an electronic or computer engineer or computer scientist, how would you define it, and would you mind pointing out research papers or a famous textbook where you learned it?
I am working in the context of "data", but I would argue that the concept is independent of the field, because it is a computing concept after all, as I understand it. I may be wrong, though.
Thank you!
It's an analogy.
I think of staging data like an actor's script on a theater stage. As soon as the actor (the ETL job) enters the stage, they need text (data) to work with. Putting data on stage is like handing an actor a new script: they know how to read, interpret, and perform, but they don't know the text yet. So providing the text ("staging" the data) happens before the play (the process/job) actually begins, but it can also happen between scenes. The picture might be a little odd, but I think you get the point.
Actually, I doubt there is a precise definition for it, but technically, the staging area, also called a landing zone, is the intermediate storage area that sits between extracting the data and loading it in an ETL process.
Generally, this data is treated as non-persistent: it is overwritten by the next ETL job, or deleted before or after a job runs. However, there are also cases in which staging data becomes metadata, parameters, or comparison data for the next job run, depending on the ETL architecture. I prefer keeping it non-persistent wherever possible.
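To make the ETL case concrete, here is a minimal sketch in Python of a non-persistent staging area. Everything in it is an assumption made up for illustration: the file name `orders_raw.csv`, the sample rows, the functions `extract`, `transform_and_load` and `run_job`, and a plain dict standing in for the target system. It is not how any particular ETL tool does it, just the shape of the idea: raw data lands in staging untouched, the load step reads only from staging, and the staging area is thrown away afterwards.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def extract(staging_dir: Path) -> Path:
    """Extract: land the raw source rows in the staging area, unmodified."""
    # Hypothetical source data; a real job would pull this from a source system.
    raw_rows = [("id", "amount"), ("1", "10.5"), ("2", "n/a"), ("3", "4.5")]
    staged_file = staging_dir / "orders_raw.csv"
    with staged_file.open("w", newline="") as f:
        csv.writer(f).writerows(raw_rows)
    return staged_file


def transform_and_load(staged_file: Path, target: dict) -> None:
    """Transform and load: read only from staging, write clean rows to the target."""
    with staged_file.open(newline="") as f:
        for row in csv.DictReader(f):
            try:
                target[int(row["id"])] = float(row["amount"])
            except ValueError:
                continue  # skip rows that cannot be parsed


def run_job(target: dict) -> Path:
    """Run one job with a staging directory that exists only for this run."""
    staging_dir = Path(tempfile.mkdtemp(prefix="staging_"))
    try:
        staged_file = extract(staging_dir)
        transform_and_load(staged_file, target)
    finally:
        shutil.rmtree(staging_dir)  # non-persistent: removed after the job
    return staging_dir


warehouse = {}  # stand-in for the real target (database, warehouse table, ...)
staging_path = run_job(warehouse)
print(warehouse)
print(staging_path.exists())
```

In the persistent variant mentioned above, the cleanup step would be skipped or replaced, so that the next run could compare its fresh extract against the previous one.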
In git, staging would be the "get on stage and be ready" step (think of the theater stage behind the closed curtain), and committing would be (again) the "delivery" to the audience.
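The git case can be pictured with a toy model as well. This is only a conceptual sketch in Python, not git's actual implementation or data format: the class `ToyRepo` and its attributes are hypothetical names chosen for illustration. The one point it tries to show is that a commit records what was staged, not whatever happens to be in the working files at commit time.

```python
class ToyRepo:
    """Toy model of working tree, staging area (index) and commits."""

    def __init__(self):
        self.working_tree = {}  # path -> content, edited freely
        self.index = {}         # staging area: content chosen for the next commit
        self.commits = []       # list of (message, snapshot) pairs

    def add(self, path):
        """Stage: copy the file's current content into the index."""
        self.index[path] = self.working_tree[path]

    def commit(self, message):
        """Commit: record a snapshot of the index, not of the working tree."""
        snapshot = dict(self.index)
        self.commits.append((message, snapshot))
        return snapshot


repo = ToyRepo()
repo.working_tree["a.txt"] = "v1"
repo.working_tree["b.txt"] = "draft"

repo.add("a.txt")                   # a.txt goes "on stage" with content v1
repo.working_tree["a.txt"] = "v2"   # edited again, but not re-staged

snapshot = repo.commit("first commit")
print(snapshot)
```

Here the snapshot contains only `a.txt` with the content it had when it was staged; the later edit and the unstaged `b.txt` stay behind the curtain until they are staged too.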