OpenRefine: select value based on a variable another column

1.4k Views Asked by At

I have a problem with OpenRefine. I am adding a new column based on a url and from there calling an API for getting some terms from a controlled vocabulary (AAT). I parse the results and I obtain a multivalued cells such as:

http://vocab.getty.edu/aat/300041366||aquatints (prints)::http://vocab.getty.edu/aat/300053242||aquatint (printing process)::http://vocab.getty.edu/aat/300191265||dust bags::http://vocab.getty.edu/aat/300191278||dust boxes::http://vocab.getty.edu/aat/300191278||dust boxes::http://vocab.getty.edu/aat/300191278||dust boxes::http://vocab.getty.edu/aat/300249564||aquatinters::http://vocab.getty.edu/aat/300249564||aquatinters::http://vocab.getty.edu/aat/300249564||aquatinters::http://vocab.getty.edu/aat/300249564||aquatinters::http://vocab.getty.edu/aat/300053242||aquatint (printing process)::http://vocab.getty.edu/aat/300041366||aquatints (prints)::http://vocab.getty.edu/aat/300041368||sandpaper aquatints::http://vocab.getty.edu/aat/300041368||sandpaper aquatints

Where I have the current structure:

URI||Corresponding_TERM::URI||Corresponding_TERM

etc.

I now need to choose one of those records. My solution is to use
value.split("::")[0]
in order to choose which element I want.
Unfortunately this solutions has very evident drawbacks, because the order of the elements in the array is not constant, so if the first element [0] would be the right one for one record, it would probably not be for the next one.

For explain myself better, I now have this kind of structure

-----------------------------------------------------------
|ID | Classification | Term_From_Thesaurus                |  
| 1 | Aquatints      | uri||term1::uri||term2::           | 
| 1 | Aquatints      | uri||term1::uri||term2::           | 
| 2 | Drypoints      | uri||term3::uri||term4::           |
| 3 | Woodcut        | uri||term5::uri||term6::uri||term7 |
-----------------------------------------------------------  

And I need to associate term1 with Aquatints, term 4 with Drypoints and term 7 with Woodcut.

How can I do that? A solution could be using facet and a lot of manual work, but maybe there is a better one? What about going to each record and if ID = 1 they should use term1, if ID=2 should use term 4 etc. Would it be possible? I sincerely do not know how to use the value of another column as variable to determine the result of an operation. cell.cross would help, but in case I need to split the data into two files, and doesn’t seems to me a proper solution..

2

There are 2 best solutions below

0
On BEST ANSWER

So I'm not sure if I've understood your question correctly, but it is possible to "select value based on a variable in another column".

If you have:

-----------------------------------------------------------
|ID | Classification | Term_From_Thesaurus                |  
| 1 | Aquatints      | uri||term1::uri||term2::           | 
| 1 | Aquatints      | uri||term1::uri||term2::           | 
| 2 | Drypoints      | uri||term3::uri||term4::           |
| 3 | Woodcut        | uri||term5::uri||term6::uri||term7 |
-----------------------------------------------------------

Then if you split the 'Term_From_Thesaurus' column into an array, then you can use the number in the 'ID' column to select the relevant entry in the array. However, note that for this to work you need to have the number in the ID column to be converted into a Number type (if it isn't already). In this example I'll assume that the number in the ID column starts off as a String rather than Number.

So the formula:

value.split("::")[cells.ID.value.toNumber()-1]

Will find the first value in the first and second row, the second value in the third row and the 4th item in the 4th row. This is illustrated here:

Screenshot illustrating formula given above

The formula breaks down as follows:

  • value.split("::") = splits the list of URI/Term pairs into an array
  • cells.ID.value.toNumber() = converts the value in the ID column into a number type
  • -1 = because array members are counted from zero

Hope this is clear

3
On

If you have to choose the right term yourself according to your knowledge of the field, I do not see how to automate the operation. Here is a solution to match each term in your "Classification" column with the one that most closely resembles to it amongst the terms returned by the API.

The Grel formula used is :

value.fingerprint() == cells.terms.value.replace(/\(.+\)/,'').fingerprint()

enter image description here