Optic api, existsJoin by two or more columns causes a decrease in performance

84 Views Asked by At

For such input data, I want to obtain a result in which I will have only those values that for the same id1 and id2 have the value of Reported and Primary.

var items = [
    {
      "id1": 241,
      "id2": 716,
      "type": "Primary"
    },
    {
      "id1": 241,
      "id2": 716,
      "type": "Reported"
    },
    {
      "id1": 477,
      "id2": 850,
      "type": "Reported"
    },
    {
      "id1": 563,
      "id2": 340,
      "type": "Primary"
    },
    {
      "id1": 649,
      "id2": 322,
      "type": "Reported"
    }];

I try to use exists join like below:

const op = require('/MarkLogic/optic');

var reportedItems = op.fromLiterals(items)
    .where(op.in(op.col('type'), 'Reported'))
    .select(['id1', 'id2'], 'reported');

var primaryItems = op.fromLiterals(items)
    .where(op.in(op.col('type'), 'Primary'))
    .select(['id1', 'id2'], 'primary')

primaryItems
    .existsJoin(reportedItems, [
        op.on(op.viewCol('primary', 'id1'),
            op.viewCol('reported', 'id1')),
        op.on(op.viewCol('primary', 'id2'),
            op.viewCol('reported', 'id2'))
    ])
    .select("id1")
    .result()
    .toArray()

However, when the number of items in the input data exceeds 5000, the execution time exceeds 2 seconds. If, however, I perform a join only on one property (e.g. id1), the result will almost immediately appear (for 5k objects 0.3s).

Why is it so slow, can I improve it in some way or use another method to obtain the same result?

1

There are 1 best solutions below

0
On

I expect this should give the same result, and in my testing it seems to be much faster:

'use strict';

const op = require('/MarkLogic/optic');

var reportedItems = op.fromView('test', 'exists')
    .where(op.in(op.col('type'), 'Reported'))
    .select(['id1', 'id2'], 'reported')
    .whereDistinct();

var primaryItems = op.fromView('test', 'exists')
    .where(op.in(op.col('type'), 'Primary'))
    .select(['id1', 'id2'], 'primary')
    .whereDistinct()

primaryItems
  .joinInner(reportedItems, [
    op.on(op.viewCol('primary', 'id1'),
      op.viewCol('reported', 'id1'))
  ],
    op.eq(op.viewCol('primary', 'id2'),
      op.viewCol('reported', 'id2'))
  )
  .select(op.viewCol('primary', 'id1'), '')
  .result()

What I think is happening in the original query is that the existsJoin is performing a cross product join of primary and reported, and using that result to filter primary. Because the cross product is so large, the filter takes a long time to execute.

It might be possible to omit whereDistinct if you know you don't have any duplicate rows - my data was randomly generated for testing so I had to use it.