Strict consumption of JSON, how to reorder key:values to specific JSON schema for Open Refine

551 Views Asked by At

Trying to use Open Refine to analyze a data set of messy JSON strings (40k lines), however due to JSONs' nature of being unordered, some of the lines of JSON objects were mixed up when returned and recorded to a file.

Some objects are missing keys, some objects have incorrect order. Example:

1   {"about":"foo", "category":"bar", "id":"123", "cat_list": ["category1":"foo2"]}
2   {"id":"22","about":"barFoo", "category":"NotABar"}
3   {"about":"barbar", "category":"website", "id":"3333", "cat_list": ["category1":"foo22"]}
....
....
....
40,000 {"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""}

ISSUE:

Importing the data into Open Refine, the program asks for a specific schema to compare to when it reads the file. It then reads the supplied file, comparing each JSON object on the line to the schema and imports or discards depending on how well it matches the schema! As a result many entries are left out!

IDEALLY:

Using Python, I would like to reorder the JSON objects to a specific schema which I specify.

Example:

Specified Schema

{"about":"", "category":"", "id":"", "cat_list": ""}

Which would then rearrange each line of JSON and its key-values to be in this specific format:

1   {"about": ....
2   {"about": ....
3   {"about": ....
....
....
....
40,000 {"about": ....

I am not entirely sure how I can do this efficiently?

EDIT:

I decided to just write a script to organize this. I removed some of the complex fields and have a full .JSON file:

{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater
"}, 
{"name":"Febreze", 
"category":"Product/Service
", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS
", 
"city":"Middle Sackville"},
{"name":"The Hunger Games", 
"category":"Movie
", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},

Yet. Google-Refine still refuses to accept my file? What is it that I am doing incorrectly?

2

There are 2 best solutions below

1
On BEST ANSWER

Not sure if you resolved this.

The JSON needs to be valid before it can be successfully imported - at the moment the text you've posted in the Q above doesn't validate with a tool such as http://jsonlint.com.

The issue you have in terms of importing this to OpenRefine (aka Google Refine) is that the JSON objects have to be in an array:

[{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater"},
{"name":"Febreze", 
"category":"Product/Service", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS", 
"city":"Middle Sackville"},
{"name":"The Hunger Games", 
"category":"Movie", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"}]

I can successfully import this JSON as posted here into OpenRefine it works fine - screenshots:

enter image description here enter image description here

0
On

"Importing the data into Open Refine, the program asks for a specific schema to compare to when it reads the file."

This sounds like it accidentally detected it as XML rathar than as JSON or even Lines.

However, You can choose which importer you wish to use (like Line based or JSON), not just the auto-picked importer that OpenRefine tries to guess at and sometimes gets wrong.

To my eyes, it looks like you might be dealing with the new upcoming "JSON Lines" or "newline-delimited JSON" format such as documented here: http://jsonlines.org/

We have an issue open to add JSON Lines support to OpenRefine eventually: https://github.com/OpenRefine/OpenRefine/issues/1135

In the meantime, look at the section On the Web at the jsonlines.org site to get find tooling support to help you with your needs.