Using Elasticsearch 6 to filter an array of objects in a document, removing unmatched objects

In Elasticsearch 6.0, I have created an index with a nested mapping type:

PUT node2
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 0,
            "mapping.total_fields.limit" : "2300"
        }
    },
    "mappings": {
      "content": {
        "properties": {
          "basicPageBodyParagraphs": {
            "type": "nested"
          }
        }
      }
    }
}

Here, basicPageBodyParagraphs is an array of objects. A document in this index looks something like:

{
  "id": "16dfb723-dac7-47cd-a898-47d9bd054c09",
  "fields": null,
  "more": null,
  "pathAlias": "about/sdc-access-test-2",
  "status": 0,
  "basicPageBodyParagraphs": [
    {
      "fabTextContent": "<p>This is a full text paragraph. The page is restricted to students. This paragraph has no further restrictions, so should be visible to all students.</p>",
      "paragraphAccessRoles": [],
      "type": "fab-text"
    },
    {
      "fabTextContent": "<p>This is a second full text paragraph. This one is restricted to students studying Biology.</p>",
      "paragraphAccessRoles": ["155eccdf-5ea0-ec11-8135-00155dfb7c0d"],
      "type": "fab-text"
    },
    {
      "type": "bullets",
      "items": [
        {
          "content": "<p>This is the first bullet point.</p>"
        },
        {
          "content": "<p>This is the second bullet point.</p>"
        }
      ],
      "title": "This is a bullets paragraph with access restrictions",
      "alignment": "left",
      "paragraphAccessRoles": ["4efd1649-ba34-eb11-810c-005056930a83"]
    }
  ],
  "contentType": "basic_page"
}

I want to be able to query my index and retrieve basicPageBodyParagraphs based on paragraphAccessRoles. For example, if a student had an ID of 155eccdf-5ea0-ec11-8135-00155dfb7c0d, the query would return the document containing only:

"basicPageBodyParagraphs": [
    {
      "fabTextContent": "<p>This is a full text paragraph. The page is restricted to students. This paragraph has no further restrictions, so should be visible to all students.</p>",
      "paragraphAccessRoles": [],
      "type": "fab-text"
    },
    {
      "fabTextContent": "<p>This is a second full text paragraph. This one is restricted to students studying Biology.</p>",
      "paragraphAccessRoles": ["155eccdf-5ea0-ec11-8135-00155dfb7c0d"],
      "type": "fab-text"
    }
]

That is, the first paragraph, which has no paragraphAccessRoles, is returned; the second, whose paragraphAccessRoles matches the student's ID, is returned; and the third is not returned, because its paragraphAccessRoles does not match the student's ID (155eccdf-5ea0-ec11-8135-00155dfb7c0d).
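
To pin down the expected behaviour, the rule can be written as a small Python sketch. This is not an Elasticsearch query, only an illustration of which paragraphs should survive; the function name visible_paragraphs is hypothetical, and the sample data is trimmed to the fields that matter for the rule.

```python
def visible_paragraphs(paragraphs, role_ids):
    """Keep paragraphs that are unrestricted or share a role with the user."""
    allowed = set(role_ids)
    return [
        p for p in paragraphs
        # An empty or missing paragraphAccessRoles means "no restriction".
        if not p.get("paragraphAccessRoles")
        or allowed.intersection(p["paragraphAccessRoles"])
    ]


# Trimmed copies of the three paragraphs from the sample document.
paragraphs = [
    {"type": "fab-text", "paragraphAccessRoles": []},
    {"type": "fab-text",
     "paragraphAccessRoles": ["155eccdf-5ea0-ec11-8135-00155dfb7c0d"]},
    {"type": "bullets",
     "paragraphAccessRoles": ["4efd1649-ba34-eb11-810c-005056930a83"]},
]

kept = visible_paragraphs(paragraphs, ["155eccdf-5ea0-ec11-8135-00155dfb7c0d"])
print([p["type"] for p in kept])  # ['fab-text', 'fab-text']
```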

For this, I am using the query:

POST /node2/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "pathAlias.keyword": "about/sdc-access-test-2"
          }
        },
        {
          "nested": {
            "path": "basicPageBodyParagraphs",
            "query": {
              "bool": {
                "should": [
                  {
                    "terms": {
                      "basicPageBodyParagraphs.paragraphAccessRoles.keyword": [
                        "155eccdf-5ea0-ec11-8135-00155dfb7c0d"
                      ]
                    }
                  },
                  {
                    "bool": {
                      "must_not": [
                        {
                          "exists": {
                            "field": "basicPageBodyParagraphs.paragraphAccessRoles"
                          }
                        }
                      ]
                    }
                  }
                ]
              }
            },
            "inner_hits": {}
          }
        }
      ]
    }
  }
}

This query partly returns what I want:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 7.4361506,
    "hits": [
      {
        "_index": "node2",
        "_type": "content",
        "_id": "16dfb723-dac7-47cd-a898-47d9bd054c09",
        "_score": 7.4361506,
        "_source": {
          "id": "16dfb723-dac7-47cd-a898-47d9bd054c09",
          "fields": null,
          "more": null,
          "pathAlias": "about/sdc-access-test-2",
          "status": 0,
          "basicPageBodyParagraphs": [
            {
              "fabTextContent": "<p>This is a full text paragraph. The page is restricted to students. This paragraph has no further restrictions, so should be visible to all students.</p>",
              "paragraphAccessRoles": [],
              "type": "fab-text"
            },
            {
              "fabTextContent": "<p>This is a second full text paragraph. This one is restricted to students studying Biology.</p>",
              "paragraphAccessRoles": ["155eccdf-5ea0-ec11-8135-00155dfb7c0d"],
              "type": "fab-text"
            },
            {
              "type": "bullets",
              "items": [
                {
                  "content": "<p>This is the first bullet point.</p>"
                },
                {
                  "content": "<p>This is the second bullet point.</p>"
                }
              ],
              "title": "This is a bullets paragraph with access restrictions",
              "alignment": "left",
              "paragraphAccessRoles": ["4efd1649-ba34-eb11-810c-005056930a83"]
            }
          ],
          "contentType": "basic_page"
        },
        "inner_hits": {
          "basicPageBodyParagraphs": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_nested": {
                    "field": "basicPageBodyParagraphs",
                    "offset": 1
                  },
                  "_score": 1,
                  "_source": {
                    "fabTextContent": "<p>This is a second full text paragraph. This one is restricted to students studying Biology.</p>",
                    "paragraphAccessRoles": [
                      "155eccdf-5ea0-ec11-8135-00155dfb7c0d"
                    ],
                    "type": "fab-text"
                  }
                },
                {
                  "_nested": {
                    "field": "basicPageBodyParagraphs",
                    "offset": 0
                  },
                  "_score": 1,
                  "_source": {
                    "fabTextContent": "<p>This is a full text paragraph. The page is restricted to students. This paragraph has no further restrictions, so should be visible to all students.</p>",
                    "paragraphAccessRoles": [],
                    "type": "fab-text"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}

The document is returned, but basicPageBodyParagraphs contains all 3 objects rather than the 2 I expect. Only further down does the response include an inner_hits property containing the 2 expected paragraphs (though not in their original order: offset 1 comes before offset 0). I would prefer not to get the results back in a property outside the document, and instead have the query remove the unmatched basicPageBodyParagraphs objects within the main document.
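
For comparison, the client-side merge that this would otherwise require might look roughly like the sketch below. It assumes the response shape shown above; apply_inner_hits is a hypothetical helper, not part of any Elasticsearch client, and the sample hit is trimmed to the relevant fields.

```python
import copy


def apply_inner_hits(hit, path="basicPageBodyParagraphs"):
    """Return a copy of hit["_source"] whose nested array holds only the inner hits."""
    source = copy.deepcopy(hit["_source"])
    inner = (
        hit.get("inner_hits", {}).get(path, {}).get("hits", {}).get("hits", [])
    )
    # inner_hits are sorted by score, so restore the original array order
    # using the offset each hit reports under "_nested".
    ordered = sorted(inner, key=lambda h: h["_nested"]["offset"])
    source[path] = [h["_source"] for h in ordered]
    return source


# A trimmed version of the search hit shown above.
hit = {
    "_source": {
        "pathAlias": "about/sdc-access-test-2",
        "basicPageBodyParagraphs": [
            {"type": "fab-text", "paragraphAccessRoles": []},
            {"type": "fab-text",
             "paragraphAccessRoles": ["155eccdf-5ea0-ec11-8135-00155dfb7c0d"]},
            {"type": "bullets",
             "paragraphAccessRoles": ["4efd1649-ba34-eb11-810c-005056930a83"]},
        ],
    },
    "inner_hits": {
        "basicPageBodyParagraphs": {
            "hits": {
                "hits": [
                    {"_nested": {"field": "basicPageBodyParagraphs", "offset": 1},
                     "_source": {"type": "fab-text", "paragraphAccessRoles": [
                         "155eccdf-5ea0-ec11-8135-00155dfb7c0d"]}},
                    {"_nested": {"field": "basicPageBodyParagraphs", "offset": 0},
                     "_source": {"type": "fab-text", "paragraphAccessRoles": []}},
                ]
            }
        }
    },
}

filtered = apply_inner_hits(hit)
print(len(filtered["basicPageBodyParagraphs"]))  # 2
```

Note that inner_hits returns only 3 hits per document by default, so a page with more matching paragraphs would also need a larger "size" inside "inner_hits".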

Is there a way to have the query filter out the unmatched basicPageBodyParagraphs and return those in the main document result?

There is 1 best solution below
xeraa On

I don't think this will work, because of how arrays of objects are flattened out. I haven't seen this done with flattened either, and I'm skeptical that it would work.

My first idea would be to try parent-child. This will have a performance overhead and your queries will be more complicated, but I'm not sure there is another way.
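
A rough sketch of what the parent-child route could look like in Elasticsearch 6, written as Python dicts that would be sent as request bodies. Everything named here is an assumption made for illustration: the join field name paragraphJoin, the relation names page and paragraph, and the idea of indexing each paragraph as its own child document with paragraphAccessRoles at the top level. It is one possible layout, not a tested migration of the index above.

```python
import json

# Mapping body for a new index: each page is a parent and each paragraph
# becomes its own child document (hypothetical field and relation names).
mapping = {
    "mappings": {
        "content": {
            "properties": {
                "paragraphJoin": {
                    "type": "join",
                    "relations": {"page": "paragraph"},
                }
            }
        }
    }
}


def paragraphs_query(path_alias, role_ids):
    """Build a search body that returns only the visible child paragraphs."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Restrict to children of the page with this pathAlias.
                    {"has_parent": {
                        "parent_type": "page",
                        "query": {"term": {"pathAlias.keyword": path_alias}},
                    }},
                    # Visible if a role matches or no role is set at all.
                    {"bool": {
                        "should": [
                            {"terms": {
                                "paragraphAccessRoles.keyword": list(role_ids)}},
                            {"bool": {"must_not": [
                                {"exists": {"field": "paragraphAccessRoles"}}]}},
                        ],
                        "minimum_should_match": 1,
                    }},
                ]
            }
        }
    }


body = paragraphs_query(
    "about/sdc-access-test-2", ["155eccdf-5ea0-ec11-8135-00155dfb7c0d"]
)
print(json.dumps(body, indent=2))
```

With this layout the matching paragraphs come back as ordinary top-level hits rather than inside the page document, so the page itself would be fetched separately. Child documents have to be indexed with routing set to the parent's ID, and because hits are no longer array elements, preserving paragraph order would need an explicit position field to sort on.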