I am using the python library pandaSDMX to extract entire datasets from the OECD database, and convert them to a CSV format (any readable format would be OK, because I want to put the final output in an SQL database).
To have access to a OECD dataset in SDMX-Json format (there is also a SDMXML format), you can paste on the browser a link like this one:
http://stats.oecd.org/sdmx-json/data/PART2/all/all
Note: 'PART2' is the name of the dataset
The problem is, there are some OECD datasets for which SDMX-Json does not have a dimension named Time_Period, and ALL the sdmx-json parsing python libraries that I've found only parse SDMX-Json that have this dimension.
I've also tried to find some python library that could parse a SDMXML (SDMX-ML) file (that would also be OK), but building that process in python is not working for me...
Here is the piece of the code that I use to parse the SDMX-Json datasets and write the output CSV file on Azure blob storage:
# itera sob cada ficheiro json e converte o conteudo para csv
for dataset in datasetList:
cursor = cnxn.cursor()
Values = [dataset, datasetFiltersList[counter]]
cursor.execute(SQLLogCommand, Values)
cnxn.commit()
#logging.info('Analysing dataset: ' + dataset)
try:
data_response = oecd.data(resource_id=dataset, key=datasetFiltersList[counter])
except UnicodeDecodeError:
Values = [datetime.today().strftime('%Y-%m-%d-%H:%M:%S'), 'Unicode Decode Error - Error', 0]
cursor.execute(SQLUpdateCommand, Values)
cnxn.commit()
except KeyError:
Values = [datetime.today().strftime('%Y-%m-%d-%H:%M:%S'), 'Key Error - Error', 0]
cursor.execute(SQLUpdateCommand, Values)
cnxn.commit()
else:
data = data_response.data
if data.dim_at_obs == 'TIME_PERIOD':
df = createDF(data, useIDs=False)
blobService.create_blob_from_text(CONTAINERNAME, csvBlobPath + dataset + '.csv' , df.to_csv(index=False, sep='|', encoding='utf-8'))
Values = [datetime.today().strftime('%Y-%m-%d-%H:%M:%S'), 'Success', 1]
cursor.execute(SQLUpdateCommand, Values)
cnxn.commit()
#logging.info('Dataset ' + dataset + ' analysed!')
else:
print('no time period error on dataset: ' + dataset)
Values = [datetime.today().strftime('%Y-%m-%d-%H:%M:%S'), 'No Time Period - Error', 0]
cursor.execute(SQLUpdateCommand, Values)
cnxn.commit()
cnxn.close()
So, my final question is:
Does anyone know a way to parse SDMX-Json files that not have a Time_Period dimension? Or other way to extract automatically a entire dataset from OECD and parse it...
As example:
- Dataset with time period (this one works fine for me): http://stats.oecd.org/sdmx-json/data/TALIS_EDUGPS/all/all
- Dataset without time period: http://stats.oecd.org/sdmx-json/data/PART2/all/all
The package
sdmx1(an improved fork ofpandaSDMX) does not have the restriction you describe, and can be used to retrieve and parse this data.Version 2.5.0 also handles an idiosyncrasy of this particular OECD data: that the structure part of the message describes a TIME_FORMAT (n.b. not TIME_PERIOD) attribute on each observation, but the observations do not actually include this attribute.