I would like to fill a Solr index from a pandas DataFrame. The DataFrame looks like this:
position value
5.6,-2.3 65
-35.6,-1.2 43.1
#...
etc.
I am doing the following to convert the DataFrame to JSON records and then add them to Solr:
import json
import pandas as pd
import pysolr
# I have a pandas dataframe df as described above
jsonObject = json.loads(df.to_json(orient='records'))
solrServer = pysolr.Solr('pathToMySolrIndex',timeout=100)
solrServer.add(jsonObject)
I get the following error:
multiple values encountered for non multiValued field position
If I rename the field position to _position, then it kind of works. From pysolr's documentation page, I understand that this creates a parent/child dependency, which I don't really want. Indeed, reading back from the index using:
results = solrServer.search(**{'q':'*'})
df2 = pd.DataFrame(list(results))
print(df2.head())
I get something like this:
_position value
[5.6,-2.3] [65]
[-35.6,-1.2] [43.1]
#...
Even with this "hackish" solution, I'm still not getting a good result: every element is a list. I would have preferred tuples for position and plain floats for value. I guess this comes from the orient keyword used when converting to JSON.
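One way to check that guess is to inspect the records before they are sent to Solr. This is a minimal sketch, assuming position is held as strings such as '5.6,-2.3' (the dtype is not shown above) and using made-up sample rows that mirror the table:

```python
import json

import pandas as pd

# Assumed sample data: position as a string column, value as a float column.
df = pd.DataFrame(
    {"position": ["5.6,-2.3", "-35.6,-1.2"], "value": [65.0, 43.1]}
)

# Same conversion as in the question.
records = json.loads(df.to_json(orient="records"))

# Print each record to see its shape before it reaches Solr.
for record in records:
    print(record)

# True only if some field value is already a list at this stage.
has_lists = any(isinstance(v, list) for r in records for v in r.values())
print(has_lists)
```

If this prints False, the lists are introduced after the conversion, possibly on the Solr side (for example by fields declared as multiValued in the schema; that is an assumption to verify against the actual index).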
Questions and Expected output
First, I would like to avoid renaming position to _position. The Solr index shouldn't have to contain renamed fields just for the sake of pysolr.
Second, I would like to avoid getting lists when reading from the built Solr index. I know that Solr can store numeric fields as single values rather than lists. The problem seems to come from the transformation from DataFrame to JSON. How can I do this?
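To illustrate the expected output, the shape I'm after can be produced client-side by post-processing the returned documents. This is only a workaround sketch, not a fix for the index itself. It assumes each returned field is a one-element list and that position comes back as a comma-separated string; `unwrap` and `parse_position` are hypothetical helpers, and `docs` stands in for `list(results)`:

```python
def unwrap(doc):
    """Return a copy of doc with one-element lists replaced by their element."""
    flat = {}
    for key, val in doc.items():
        if isinstance(val, list) and len(val) == 1:
            val = val[0]
        flat[key] = val
    return flat


def parse_position(text):
    """Turn a string like '5.6,-2.3' into a tuple of floats."""
    return tuple(float(part) for part in text.split(","))


# Stand-in for list(results); the values are assumed, mirroring the output above.
docs = [
    {"_position": ["5.6,-2.3"], "value": [65.0]},
    {"_position": ["-35.6,-1.2"], "value": [43.1]},
]

rows = []
for doc in docs:
    row = unwrap(doc)
    row["_position"] = parse_position(row["_position"])
    rows.append(row)

print(rows)
```

The resulting `rows` can be passed to `pd.DataFrame(rows)` in place of `list(results)`.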