Merge documents in elasticsearch hadoop, create multi key value pairs using es-sparksql


Currently, elasticsearch-hadoop converts a Dataset/RDD to documents with a one-to-one mapping, i.e. one row in the dataset becomes one document. In our scenario we are doing something like this:

for 'uni

PUT spark/docs/1
{
"_k":"one",
"_k":"two",
"_k":"three" // large sets , we dont need to store much, we just want to map multiple keys to single value.
"_v" :"key:
}

GET spark/docs/_search
{
"query" : {
  "constant_score" : {
    "filter" : {
      "terms" : {
        "_k" : ["one"] // all values work.
        }
      }
    }
  }
}

Any suggestions on how we can implement the above? If there is a better strategy, please suggest it.

The code below does not work, but in theory I am trying to achieve something like this:

  final Dataset<String> df = spark.read().csv("src/main/resources/star2000.csv").select("_c1").dropDuplicates().as(Encoders.STRING());
  final Dataset<ArrayList> arrayListDataset = df.mapPartitions(new MapPartitionsFunction<String, ArrayList>() {
        @Override
        public Iterator<ArrayList> call(Iterator<String> iterator) throws Exception {
            ArrayList<String> s = new ArrayList<>();
            iterator.forEachRemaining(it -> s.add(it));
            return Iterators.singletonIterator(s);
        }
    }, Encoders.javaSerialization(ArrayList.class));
  JavaEsSparkSQL.saveToEs(arrayListDataset,"spark/docs");

I don't want to collect the complete dataset into one list, as that can result in an OOM error, so the plan is to build a list for each partition and index it against a partition key.
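The per-partition plan can be sketched in plain Java, without Spark: each partition's iterator is drained into a de-duplicated set and turned into one document that holds all keys in an array-valued `_k` field and a single value in `_v`. The class name `PartitionDocs`, the method `toDocument`, and the `partition-N` value labels are hypothetical and only serve the illustration; in a Spark job the same body would presumably sit inside `mapPartitions`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of elasticsearch-hadoop.
class PartitionDocs {

    // Drain one partition's iterator into a single document:
    // "_k" holds every distinct key seen, "_v" holds the shared value.
    static Map<String, Object> toDocument(Iterator<String> partition, String value) {
        Set<String> keys = new LinkedHashSet<>();   // de-duplicates, keeps insertion order
        partition.forEachRemaining(keys::add);

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_k", new ArrayList<>(keys));       // array-valued field
        doc.put("_v", value);
        return doc;
    }

    public static void main(String[] args) {
        // Two stand-in "partitions"; in Spark these would be the iterators
        // handed to mapPartitions.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("one", "two", "two"),
                Arrays.asList("three"));

        for (int i = 0; i < partitions.size(); i++) {
            System.out.println(toDocument(partitions.get(i).iterator(), "partition-" + i));
        }
    }
}
```

Only one partition's keys are held in memory at a time, which is the point of avoiding a global collect. An Elasticsearch `terms` query on `_k` should then match the document if any one of the array elements matches.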


There are 2 best solutions below

3
Patrick Plaatje On

It would help to post the source code you're using; the question is also not clear about what you're trying to achieve.

I assume you would like to post an array to the key field (_k) and a different value to the value field (_v)?

So you could create a JavaPairRDD and save that to Elasticsearch, something like the below:

String[] keys = {"one", "two", "three"};
String value = "key";

List<Tuple2<String[],String>> l = new ArrayList<Tuple2<String[],String>>();
l.add(new Tuple2<String[],String>(keys, value));

// ctx is assumed to be an existing JavaSparkContext
JavaPairRDD<String[],String> R = ctx.parallelizePairs(l);

JavaEsSpark.saveToEs(R,"index/type");
0
rohit On

Using a POJO such as

class Document {
   String[] vals;
   String key;
}

and the code snippet below:

Dataset<String> df = spark.sqlContext().read().parquet(params.getPath())
                        .select(params.getColumnName())
                        .as(Encoders.STRING());

final Dataset<Document> documents = df.coalesce(numPartitions).mapPartitions(iterator -> {
       final Set<String> set = Sets.newHashSet(iterator);
       Document d = new Document(set.toArray(new String[set.size()]),"key1");
       return Iterators.singletonIterator(d);}, Encoders.bean(Document.class));
JavaEsSparkSQL.saveToEs(documents, params.getTableIndexName() + "/"+params.getTableIndexType());

This creates the array-valued index described above.
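Since `Encoders.bean` works on JavaBean-style classes, the `Document` POJO likely needs a no-argument constructor plus getters and setters in addition to the two fields. A minimal sketch of such a bean follows; everything beyond the two fields and the two-argument constructor used in the snippet is an assumption about what the answer's class looked like.

```java
import java.io.Serializable;
import java.util.Arrays;

// Package-private here so the snippet compiles in any file;
// for use with Encoders.bean it should be declared public.
class Document implements Serializable {
    private String[] vals;   // all keys collected from one partition
    private String key;      // the single value they map to

    // No-argument constructor, as expected of a JavaBean.
    public Document() { }

    public Document(String[] vals, String key) {
        this.vals = vals;
        this.key = key;
    }

    public String[] getVals() { return vals; }
    public void setVals(String[] vals) { this.vals = vals; }

    public String getKey() { return key; }
    public void setKey(String key) { this.key = key; }

    public static void main(String[] args) {
        Document d = new Document(new String[] {"one", "two", "three"}, "key1");
        System.out.println(d.getKey() + " -> " + Arrays.toString(d.getVals()));
    }
}
```

With this bean, the indexed fields would presumably be named `vals` and `key` rather than `_k` and `_v`, because the column names are derived from the bean properties.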