Elasticsearch multi-word overlapping synonyms


I developed a thesaurus of job titles and I am trying to put it into a format that works with Elasticsearch.

My Problem: Multi-word Overlapping Synonyms

I am trying to identify a solution for multi-word overlapping synonyms. When I process a job with a job title of "Info Security Engineer", I want it to add "Info Security" and "Security Engineer" to the index.

Previously, I had included synonyms for information security in the index, but I found that it would index "Info Security Engineer" as "Info Security" and it would not index "Security Engineer". Because of that, I removed the sets of synonyms such as "Info Security" from the index.
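The behavior described above can be modeled with a small sketch. It assumes the synonym filter matches greedily (earliest start, then longest rule) and never reuses a token that an earlier match has already consumed. The function `apply_synonyms` and the rule dictionary are illustrative names, not part of Elasticsearch, and the model ignores token positions entirely.

```python
def apply_synonyms(tokens, rules):
    """Greedy, non-overlapping, longest-match-first replacement.

    `rules` maps a tuple of input tokens to a list of output tokens.
    This is a simplified model, not the actual Lucene implementation.
    """
    max_len = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n  # consumed tokens are not offered to later rules
                break
        else:
            # No rule starts here, so the token passes through unchanged.
            out.append(tokens[i])
            i += 1
    return out


rules = {
    ("info", "security"): ["information_security"],
    ("information", "security"): ["information_security"],
    ("security", "engineer"): ["security_engineer"],
}

tokens = "Info Security Engineer".lower().split()
result = apply_synonyms(tokens, rules)
print(result)  # ['information_security', 'engineer']
```

Because "security" is consumed by the "info security" match, the "security engineer" rule never gets a chance to fire, which matches what I observed.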

Now, I am looking for a way to include Information Security synonyms in the index.

A Few Options To Choose From:

1.) I could add "Info Security Engineer" as a synonym for "Security Engineer" and then have "Security Engineer" also be indexed as "Information Security". I could add the "Information Security" synonyms to the analyzer and the search analyzer.

Example at Index time:

"synonyms" : [
    "security engineer, info security engineer => security_engineer, information_security",
    "information security, info security => information_security"
]

Example at Search time:

"synonyms" : [
    "security engineer, info security engineer => security_engineer, information_security",
    "information security, info security => information_security"
]

Making sure that "Security Engineer" synonyms included all the "Information Security" synonyms would be difficult to implement across the entire thesaurus.

2.) I could have "Security Engineer" also be indexed as "Information Security". I would add the synonyms for "Information Security" to the search_analyzer, so it would search for the "Information Security" term.

Example at Index time:

"synonyms" : [
    "security engineer => security_engineer, information_security"
]

Example at Search time:

"synonyms" : [
    "security engineer => security_engineer",
    "information security, info security => information_security"
]

When someone searches for "Information Security" jobs, it would return any of the job titles that were set up to include "Information Security" at index time. However, jobs that have a phrase such as "Information Security" in the title, but that do not map to any of the information security job titles at index time, would not be included in a search for "Information Security".
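To keep the index-time and search-time lists for this option in step, both could be generated from one thesaurus structure. The following is a minimal sketch under assumptions: the `titles` and `functions` dictionaries and the helper names are hypothetical, and the output is just the rule strings in the same format as the examples above.

```python
# Hypothetical thesaurus layout: each canonical job title lists its
# spelling variants and the job functions it implies.
titles = {
    "security engineer": {"variants": [], "functions": ["information security"]},
}
# Each job function lists its own spelling variants.
functions = {
    "information security": ["info security"],
}


def token(phrase):
    """Turn a phrase into the single-token form used on the right-hand side."""
    return phrase.replace(" ", "_")


def index_rules(titles):
    """Index time: a title maps to its own token plus its function tokens."""
    rules = []
    for title, entry in titles.items():
        lhs = ", ".join([title] + entry["variants"])
        rhs = ", ".join([token(title)] + [token(f) for f in entry["functions"]])
        rules.append(f"{lhs} => {rhs}")
    return rules


def search_rules(titles, functions):
    """Search time: titles and functions each map only to their own token."""
    rules = []
    for title, entry in titles.items():
        lhs = ", ".join([title] + entry["variants"])
        rules.append(f"{lhs} => {token(title)}")
    for function, variants in functions.items():
        lhs = ", ".join([function] + variants)
        rules.append(f"{lhs} => {token(function)}")
    return rules


for rule in index_rules(titles):
    print("index: ", rule)
for rule in search_rules(titles, functions):
    print("search:", rule)
```

With the sample data this reproduces the two rule lists shown for option 2, so adding a new title or function only means editing the thesaurus, not the rules by hand.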

3.) I could add "Information Security" to the search_analyzer and have it expand it to "Security Engineer" and any other information security jobs.

Example at Index time:

"synonyms" : [
    "security engineer => security_engineer"
]

Example at Search time:

"synonyms" : [
    "security engineer => security_engineer",
    "information security, info security => security_engineer, information_security_analyst, penetration_tester"
]

This would place more work on the query processing, because it would look for all the jobs that I marked as information security jobs.
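The query-time expansion rule for this option could be derived by inverting a title-to-function mapping. This is only a sketch: `title_functions`, `function_variants`, and `expansion_rules` are hypothetical names, and the three titles are the ones from the example rule above.

```python
from collections import defaultdict

# Hypothetical mapping from canonical job title to the functions it implies.
title_functions = {
    "security engineer": ["information security"],
    "information security analyst": ["information security"],
    "penetration tester": ["information security"],
}
# Hypothetical spelling variants for each function.
function_variants = {
    "information security": ["info security"],
}


def expansion_rules(title_functions, function_variants):
    """Build one search-time rule per function, expanding to every title token."""
    titles_by_function = defaultdict(list)
    for title, funcs in title_functions.items():
        for function in funcs:
            titles_by_function[function].append(title.replace(" ", "_"))

    rules = []
    for function, title_tokens in titles_by_function.items():
        lhs = ", ".join([function] + function_variants.get(function, []))
        rules.append(f"{lhs} => {', '.join(title_tokens)}")
    return rules


expanded = expansion_rules(title_functions, function_variants)
print(expanded[0])
```

The length of the right-hand side grows with the number of titles tagged with a function, which is where the extra query-time work comes from.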

4.) I could remove the use of synonyms at index time and only use synonyms at query time.

It would include all the jobs that had information security in the job title, but not any where it is implied, such as Security Engineer. It also makes it more resource-intensive to process the query.

5.) I could use one index for job titles and a different index for job functions, such as information security.

It would include all jobs that had information security in the job title, but would miss the implied jobs. It would add another step to determine which index to use.

What do you think?

Any advice is appreciated. Am I missing other options? Is it problematic to expand to a lot of terms at query time or at index time?

I am leaning towards option #2. I am trying to design it so that my thesaurus would be easy to use for job search engines and applicant tracking systems.

Background / Current Setup

I create a jobs index that uses an analyzer and a search analyzer, both of which include synonyms.

curl -XPUT 'http://localhost:9200/jobs/?pretty' -H 'Content-Type: application/json'  -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "my_job_title_filter_for_index" : {
               "type" : "synonym",
               "synonyms" : [
                  "security engineer => security_engineer"
               ]
            },
            "my_job_title_filter_for_search" : {
               "type" : "synonym",
               "synonyms" : [
                  "security engineer => security_engineer"
               ]
            }
         },
         "analyzer" : {
            "my_job_title_analyzer_for_index" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "stop",
                  "my_job_title_filter_for_index"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            },
            "my_job_title_analyzer_for_search" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "stop",
                  "my_job_title_filter_for_search"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   },
   "mappings" : {
      "job" : {
         "properties" : {
            "job_title" : {
               "type" : "text",
               "analyzer" : "my_job_title_analyzer_for_index",
               "search_analyzer" : "my_job_title_analyzer_for_search"
            }
         }
      }
   }
}
'

I load in data to the index:

curl -XPOST 'http://localhost:9200/jobs/job/_bulk?pretty' -H "Content-Type: application/json" -d'
{"index":{"_id":"1"}}
{"job_title":"Security Engineer"}
{"index":{"_id":"2"}}
{"job_title":"Info Security Engineer"}
'

I query the data for security engineer and it returns both jobs.

curl -XGET 'http://localhost:9200/jobs/job/_search?pretty' -H 'Content-Type: application/json' -d '
{
   "query" : {
      "match_phrase" : {"job_title" : "security engineer"}
   }
}
'

I query the index for cyber security and it returns no results.

curl -XGET 'http://localhost:9200/jobs/job/_search?pretty' -H 'Content-Type: application/json' -d '
{
   "query" : {
      "match_phrase" : {"job_title" : "cyber security"}
   }
}
'

(side note: I use both the analyzer and the search analyzer so that jobs like a "SQL DBA" are indexed as a "SQL DBA" and as a "DBA". Then, at query time, a search for "SQL DBA" only searches for "SQL DBA" and not "DBA".)
