About original raw data and intermediate data that has been transformed


I want to use the Cookiecutter Data Science project structure for my project. I found http://drivendata.github.io/cookiecutter-data-science/ and it looks great.

I am looking at the directories in their structure and I have some questions about the different data stages. The README.md file sets out the difference between external, interim, processed and raw data:

    ├── data
    │   ├── external       <- Data from third party sources.
    │   ├── interim        <- Intermediate data that has been transformed.
    │   ├── processed      <- The final, canonical data sets for modeling.
    │   └── raw            <- The original, immutable data dump.

I am working on a project in which the data originates from sensors and is managed via a web application dashboard. Additionally, I have been performing some JOINs on a SQL database dump in order to extract other data that I need to start working with.

What is the difference between raw data and external data? Does the data I extracted as described above, or the way I obtained it, mean that it should be catalogued as raw data?

Why isn't it considered external data?

Would it be considered external data only if I got it from sources other than my organization, which owns the sensors and administers the web application dashboard data?

About raw data, they put special emphasis on the following:

Don't ever edit your raw data, especially not manually, and especially not in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable. The code you write should move the raw data through a pipeline to your final analysis

I understand this, and it is a best practice :)

To illustrate my question, I want to select some columns from a sample dataset I am working with.

I read a raw dataset that I extracted using SQL joins (the data have been changed).

These, then, are my raw data:

import pandas as pd

# Read the raw dataset
data = pd.read_csv('fruit-RawData.csv')
data.head()


   weight                 date  number        lat        lng farmName
0    3.09  2012-07-27 07:08:58      15  57.766231 -16.762676    Totti
1    1.50  2012-07-27 07:09:01      15  57.766231 -16.762676    Totti
2   10.50  2012-07-27 07:09:02      15  57.766231 -16.762676    Totti
3    2.50  2012-07-27 07:09:04      15  57.766231 -16.762676    Totti
4    6.50  2012-07-27 07:09:06      15  57.766231 -16.762676    Totti

If I select only the weight, date and number ...

data = data[['weight','date','number']]
data.to_csv('fruits.csv', sep=',', header=True, index=False)

And I get:

   weight                 date  number
0   23.09  2012-07-27 07:08:58       5
1   30.50  2012-07-27 07:08:58       5
2   19.50  2012-07-27 07:08:58       5
3   25.50  2012-07-27 07:08:58       5
4   26.50  2012-07-27 07:08:58       5
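For reference, here is a minimal sketch of the same column selection laid out against the Cookiecutter directories. It assumes the subset is written under data/interim, which is exactly the classification I am asking about, so treat that placement as an assumption. The helper name select_columns and the temporary demo directory are hypothetical, and it uses only the standard csv module instead of pandas so that it runs without extra dependencies. The point it shows is that the file under data/raw is only ever read, never rewritten.

```python
import csv
import tempfile
from pathlib import Path


def select_columns(raw_path, interim_path, columns):
    """Copy only `columns` from the CSV at raw_path into a new CSV at interim_path."""
    raw_path, interim_path = Path(raw_path), Path(interim_path)
    # Refuse to write onto the file being read, so the raw data is never overwritten.
    if raw_path.resolve() == interim_path.resolve():
        raise ValueError("output path would overwrite the raw file")
    interim_path.parent.mkdir(parents=True, exist_ok=True)
    with raw_path.open(newline="") as src, interim_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" drops every column not listed in `columns`.
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
    return interim_path


# Demo in a throwaway directory (hypothetical layout mirroring data/raw and data/interim).
root = Path(tempfile.mkdtemp())
raw_file = root / "data" / "raw" / "fruit-RawData.csv"
raw_file.parent.mkdir(parents=True)
raw_text = (
    "weight,date,number,lat,lng,farmName\n"
    "3.09,2012-07-27 07:08:58,15,57.766231,-16.762676,Totti\n"
    "1.50,2012-07-27 07:09:01,15,57.766231,-16.762676,Totti\n"
)
raw_file.write_text(raw_text)

interim_file = select_columns(
    raw_file,
    root / "data" / "interim" / "fruits.csv",
    ["weight", "date", "number"],
)
```

With pandas, the equivalent would be the selection and to_csv call shown earlier, with the output path pointing somewhere other than the raw directory.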

Should this data subset be considered intermediate data that has been transformed, or is it still raw data?

I don't know whether these questions are valid.
