SDMX-Json Parsing - No Time_Period problem - OECD Dataset python extract

1.3k Views Asked by At

I am using the python library pandaSDMX to extract entire datasets from the OECD database, and convert them to a CSV format (any readable format would be OK, because I want to put the final output in an SQL database).

To have access to a OECD dataset in SDMX-Json format (there is also a SDMXML format), you can paste on the browser a link like this one:

http://stats.oecd.org/sdmx-json/data/PART2/all/all

Note: 'PART2' is the name of the dataset

The problem is, there are some OECD datasets for which SDMX-Json does not have a dimension named Time_Period, and ALL the sdmx-json parsing python libraries that I've found only parse SDMX-Json that have this dimension.

I've also tried to find some python library that could parse a SDMXML (SDMX-ML) file (that would also be OK), but building that process in python is not working for me...

Here is the piece of the code that I use to parse the SDMX-Json datasets and write the output CSV file on Azure blob storage:

# itera sob cada ficheiro json e converte o conteudo para csv
    for dataset in datasetList:
        cursor = cnxn.cursor()

        Values = [dataset, datasetFiltersList[counter]]
        cursor.execute(SQLLogCommand, Values)
        cnxn.commit()

        #logging.info('Analysing dataset: ' + dataset)
        try:
            data_response = oecd.data(resource_id=dataset, key=datasetFiltersList[counter])
        except UnicodeDecodeError:
            Values = [datetime.today().strftime('%Y-%m-%d-%H:%M:%S'), 'Unicode Decode Error - Error', 0]
            cursor.execute(SQLUpdateCommand, Values)
            cnxn.commit()
        except KeyError:
            Values = [datetime.today().strftime('%Y-%m-%d-%H:%M:%S'), 'Key Error - Error', 0]
            cursor.execute(SQLUpdateCommand, Values)
            cnxn.commit()
        else:
            data = data_response.data
            if data.dim_at_obs == 'TIME_PERIOD':
                df = createDF(data, useIDs=False)
                blobService.create_blob_from_text(CONTAINERNAME, csvBlobPath + dataset + '.csv' , df.to_csv(index=False, sep='|', encoding='utf-8'))

                Values = [datetime.today().strftime('%Y-%m-%d-%H:%M:%S'), 'Success', 1]
                cursor.execute(SQLUpdateCommand, Values)
            cnxn.commit()  
            #logging.info('Dataset ' + dataset + ' analysed!')
        else:
            print('no time period error on dataset: ' + dataset)
            Values = [datetime.today().strftime('%Y-%m-%d-%H:%M:%S'), 'No Time Period - Error', 0]
            cursor.execute(SQLUpdateCommand, Values)
            cnxn.commit()  

cnxn.close()

So, my final question is:

Does anyone know a way to parse SDMX-Json files that not have a Time_Period dimension? Or other way to extract automatically a entire dataset from OECD and parse it...

As example:

  1. Dataset with time period (this one works fine for me): http://stats.oecd.org/sdmx-json/data/TALIS_EDUGPS/all/all
  2. Dataset without time period: http://stats.oecd.org/sdmx-json/data/PART2/all/all
1

There are 1 best solutions below

0
Paul Natsuo Kishimoto On

The package sdmx1 (an improved fork of pandaSDMX) does not have the restriction you describe, and can be used to retrieve and parse this data.

Version 2.5.0 also handles an idiosyncrasy of this particular OECD data: that the structure part of the message describes a TIME_FORMAT (n.b. not TIME_PERIOD) attribute on each observation, but the observations do not actually include this attribute.

>>> import sdmx
>>> OECD = sdmx.Client("OECD")
>>> message = OECD.data("PART2")
No AttributeValues for attribute <DataAttribute TIME_FORMAT>; discard

>>> message
<sdmx.DataMessage>
  <Header>
    id: '710c36ac-0beb-4781-b6d6-01d86476a10c'
    prepared: '2021-07-11T17:54:09.547694+00:00'
    sender: <Agency OECD: Organisation for Economic Co-operation and Development>
    source: 
    test: False
  response: <Response [200]>
  DataSet (1)
  dataflow: <DataflowDefinition (missing id)>
  observation_dimension: <sdmx.model._AllDimensions object at 0x7f008efd7e80>

>>> sdmx.to_pandas(message.data[0])
LOCATION  IND    PER  GRD  FLD  MSR
AUS       10CAU  CHG  EIG  MAT  MN     11.430378
                                SE      6.079738
                           SCI  MN     -3.697873
                                SE      4.820530
                      FOG  MAT  MN      6.051900
                                         ...    
ZAF       8UTSR  END  EIG  SCI  SE      3.203199
                 INI  EIG  MAT  MN     60.268028
                                SE      3.864160
                           SCI  MN     55.894505
                                SE      3.500335
Name: value, Length: 18755, dtype: float64