I want to use the Cookiecutter Data Science project structure in my project. I found http://drivendata.github.io/cookiecutter-data-science/ and it looks great.
While comparing my directory layout against theirs, I have some questions about the different data stages. Their README.md distinguishes between external, interim, processed and raw data:
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
I am working on a project in which the data originates from sensors and is managed through a web application dashboard. Additionally, I have been performing some JOINs on an SQL database dump in order to extract the data I need to start working with.
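To make the extraction step concrete, here is a minimal sketch of the kind of JOIN I mean. The table and column names are assumptions (not the real schema), and an in-memory SQLite database stands in for the actual dump so the example runs:

```python
import sqlite3

import pandas as pd

# Hypothetical schema standing in for the real database dump.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE farms (id INTEGER PRIMARY KEY, lat REAL, lng REAL, farmName TEXT);
CREATE TABLE measurements (weight REAL, date TEXT, number INTEGER, farm_id INTEGER);
INSERT INTO farms VALUES (1, 57.766231, -16.762676, 'Totti');
INSERT INTO measurements VALUES (3.09, '2012-07-27 07:08:58', 15, 1);
""")

# JOIN the sensor measurements with the farm metadata to build the
# dataset I start working with.
query = """
SELECT m.weight, m.date, m.number, f.lat, f.lng, f.farmName
FROM measurements m
JOIN farms f ON m.farm_id = f.id
"""
raw = pd.read_sql(query, conn)
conn.close()
```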
What is the difference between raw data and external data? Does the extraction process I describe above, or the way I obtain the data, determine whether it should be cataloged as raw data?
Why isn't this considered external data?
Would it only count as external data if it came from sources outside my organization, which owns the sensors and administers the web application dashboard data?
About raw data, they specifically advise:
Don't ever edit your raw data, especially not manually, and especially not in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis.
I understand this, and it is a best practice :)
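The pipeline idea above can be sketched as a pair of functions, one per stage, so that data/raw is only ever read and each later stage is derived by code. The column names follow the fruit example below; the aggregation in the processed step is a made-up placeholder, not anything prescribed by the template:

```python
import pandas as pd


def make_interim(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform the raw data without mutating it (here: keep selected columns)."""
    return raw[['weight', 'date', 'number']].copy()


def make_processed(interim: pd.DataFrame) -> pd.DataFrame:
    """Produce the final modeling table (here: mean weight per number)."""
    return interim.groupby('number', as_index=False)['weight'].mean()


# Tiny stand-in for the raw dump, mirroring the columns shown below.
raw = pd.DataFrame({
    'weight': [3.09, 1.50],
    'date': ['2012-07-27 07:08:58', '2012-07-27 07:09:01'],
    'number': [15, 15],
    'lat': [57.766231] * 2,
    'lng': [-16.762676] * 2,
    'farmName': ['Totti'] * 2,
})
interim = make_interim(raw)
processed = make_processed(interim)
```

Because each stage only derives new data, the raw DataFrame (and the raw file it came from) is never overwritten.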
To illustrate my question, I want to select some columns from a dataset sample I am working with:
I read a raw dataset which I extracted using SQL joins, and then the data get changed.
This is my raw data:
import pandas as pd

# I read the raw dataset
data = pd.read_csv('fruit-RawData.csv')
data.head()
weight date number lat lng farmName
0 3.09 2012-07-27 07:08:58 15 57.766231 -16.762676 Totti
1 1.50 2012-07-27 07:09:01 15 57.766231 -16.762676 Totti
2 10.50 2012-07-27 07:09:02 15 57.766231 -16.762676 Totti
3 2.50 2012-07-27 07:09:04 15 57.766231 -16.762676 Totti
4 6.50 2012-07-27 07:09:06 15 57.766231 -16.762676 Totti
If I select only the weight, date and number columns ...
data = data[['weight','date','number']]
data.to_csv('fruits.csv', sep=',', header=True, index=False)
And I get:
weight date number
0 3.09 2012-07-27 07:08:58 15
1 1.50 2012-07-27 07:09:01 15
2 10.50 2012-07-27 07:09:02 15
3 2.50 2012-07-27 07:09:04 15
4 6.50 2012-07-27 07:09:06 15
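Under the Cookiecutter convention, a subset like this would be written to data/interim/, leaving the file in data/raw/ untouched. A minimal sketch of that layout, using a temporary directory as a stand-in for the project root (the file names follow the example above):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Temporary directory standing in for the project root.
root = Path(tempfile.mkdtemp())
(root / 'data/raw').mkdir(parents=True)
(root / 'data/interim').mkdir(parents=True)

# Write a tiny stand-in for the immutable raw dump.
raw_path = root / 'data/raw/fruit-RawData.csv'
pd.DataFrame({
    'weight': [3.09], 'date': ['2012-07-27 07:08:58'], 'number': [15],
    'lat': [57.766231], 'lng': [-16.762676], 'farmName': ['Totti'],
}).to_csv(raw_path, index=False)

# The transformation reads from data/raw and writes to data/interim;
# the raw file itself is never overwritten.
data = pd.read_csv(raw_path)
data[['weight', 'date', 'number']].to_csv(
    root / 'data/interim/fruits.csv', index=False)
```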
Could this data subset be considered intermediate data that has been transformed, or is it still raw data?
I don't know whether these questions are valid.