Currently elasticsearch-hadoop converts a dataset/RDD to documents with a 1-to-1 mapping, i.e. one row in the dataset becomes one document. In our scenario we are doing something like this for unique values:
PUT spark/docs/1
{
  "_k": ["one", "two", "three"],  // large sets; we don't need to store much, we just want to map multiple keys to a single value
  "_v": "key:"
}
GET spark/docs/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "_k": ["one"]  // querying with any of the keys works
        }
      }
    }
  }
}
Any suggestion on how we can implement the above? If there is a better strategy, please suggest.
The code below is not working, but it shows what I am trying to achieve in theory:
final Dataset<String> df = spark.read().csv("src/main/resources/star2000.csv")
        .select("_c1").dropDuplicates().as(Encoders.STRING());

final Dataset<ArrayList> arrayListDataset = df.mapPartitions(new MapPartitionsFunction<String, ArrayList>() {
    @Override
    public Iterator<ArrayList> call(Iterator<String> iterator) throws Exception {
        ArrayList<String> keys = new ArrayList<>();
        iterator.forEachRemaining(keys::add);
        return Iterators.singletonIterator(keys);
    }
}, Encoders.javaSerialization(ArrayList.class));

JavaEsSparkSQL.saveToEs(arrayListDataset, "spark/docs");
I don't want to collect the complete dataset into one list as it can result in an OOM, so the plan is to get a list for each partition and index it against a partition key.
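A rough sketch of that per-partition idea (purely illustrative, continuing from the df dataset above; using the partition index as a stand-in for the partition key and saving via JavaEsSpark from org.elasticsearch.spark.rdd.api.java are assumptions, not a confirmed solution):

// One document per partition: collect the partition's rows into "_k"
// without pulling the whole dataset back to the driver.
JavaRDD<Map<String, Object>> perPartitionDocs = df.toJavaRDD()
        .mapPartitionsWithIndex((Integer partitionId, Iterator<String> rows) -> {
            List<String> keys = new ArrayList<>();
            rows.forEachRemaining(keys::add);

            Map<String, Object> doc = new HashMap<>();
            doc.put("_k", keys);                        // all keys seen in this partition
            doc.put("_v", "partition-" + partitionId);  // assumption: value derived from the partition key
            return Collections.singletonList(doc).iterator();
        }, true);

// elasticsearch-hadoop turns each java.util.Map into one Elasticsearch document.
JavaEsSpark.saveToEs(perPartitionDocs, "spark/docs");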
It would help to post some of the source code you're using; the question is also not clear on what you're trying to achieve.
I assume you would like to post an array to the key field (_k) and a different value to the value field (_v)?
So you could create a JavaPairRDD and save that to Elasticsearch, something like the example below:
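For instance, a minimal sketch along those lines (assuming elasticsearch-hadoop's JavaEsSpark.saveToEsWithMeta so that the pair key becomes the document id; the _k/_v field names and the spark/docs resource come from the question, everything else is made up for illustration):

SparkConf conf = new SparkConf().setAppName("keys-to-es");
conf.set("es.nodes", "localhost:9200");   // assumption: where Elasticsearch runs
JavaSparkContext jsc = new JavaSparkContext(conf);

// One java.util.Map per document body, holding the array of keys (_k) and the single value (_v).
Map<String, Object> doc1 = new HashMap<>();
doc1.put("_k", Arrays.asList("one", "two", "three"));
doc1.put("_v", "value-1");   // assumption: the single value shared by those keys

Map<String, Object> doc2 = new HashMap<>();
doc2.put("_k", Arrays.asList("four", "five"));
doc2.put("_v", "value-2");

// The left side of each tuple becomes the Elasticsearch document id.
List<Tuple2<String, Map<String, Object>>> docs = new ArrayList<>();
docs.add(new Tuple2<>("1", doc1));
docs.add(new Tuple2<>("2", doc2));

JavaPairRDD<String, Map<String, Object>> pairs = jsc.parallelizePairs(docs);
JavaEsSpark.saveToEsWithMeta(pairs, "spark/docs");

With the terms filter from the question, searching _k for any one of "one", "two", "three" would then match document 1.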