Neo4j Lucene full-text search and keyword extraction from the text

I have a Neo4j FULLTEXT INDEX with ~60k records (keywords). This is my keyword vocabulary. I need to extract all possible keywords (those present in this index) from different input texts. Is it possible to implement this with Neo4j, Cypher, or APOC?

UPDATED

For example, there is a text:

Looking for Apache Spark expert to coach me on the core concepts of optimizing the parallelism of Spark using Scala and OpenAcc programming model.

The mentor must have comprehensive hands-on knowledge of Big Data analytics in large scale of data  (especially Spark and GPU programming) to design the software tool with sample data analysis using Scala language and OpenAcc directives. 

In the Neo4j database with FULLTEXT INDEX I have the following keywords:

apache-spark
scala
gpu

I need to extract the following from the text above:

Apache Spark
Scala
GPU

Answer by Christophe Willemsen:

So, generally, an FT index is meant for the opposite use case: storing the texts in the index and matching them against keywords. Nevertheless:

Poor man's solution

Query the index with your text. For example, given the following setup:

// create the vocabulary index (run as its own statement), then add the keywords
CALL db.index.fulltext.createNodeIndex('Keyword', ['Keyword'], ['value'])
CREATE (:Keyword {value: 'apache-spark'})
CREATE (:Keyword {value: 'gpu'})
CREATE (:Keyword {value: 'scala'})
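
Depending on your Neo4j version, the freshly created index may need a moment to come online before it returns results; waiting for it explicitly is one option (a sketch, assuming db.awaitIndexes is available in your version):

CALL db.awaitIndexes()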

Use your text as the search query:

CALL db.index.fulltext.queryNodes('Keyword', 'Looking for Apache Spark expert to coach me on the core concepts of optimizing the parallelism of Spark using Scala and OpenAcc programming model.

The mentor must have comprehensive hands-on knowledge of Big Data analytics in large scale of data  (especially Spark and GPU programming) to design the software tool with sample data analysis using Scala language and OpenAcc directives. ')

Since a Lucene query will, by default, combine all tokens of the text with an OR operator, this works.

Result:

╒════════════════════════╤═══════════════════╕
│"node"                  │"score"            │
╞════════════════════════╪═══════════════════╡
│{"value":"apache-spark"}│1.480496883392334  │
├────────────────────────┼───────────────────┤
│{"value":"scala"}       │0.9932447671890259 │
├────────────────────────┼───────────────────┤
│{"value":"gpu"}         │0.49662238359451294│
└────────────────────────┴───────────────────┘
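
In practice you would yield the matches and perhaps filter on the score; a minimal sketch (the 0.5 threshold is an arbitrary illustration, and $text is assumed to be passed in as a parameter):

CALL db.index.fulltext.queryNodes('Keyword', $text)
YIELD node, score
// keep only reasonably strong matches; tune the threshold for your data
WHERE score > 0.5
RETURN node.value AS keyword, score
ORDER BY score DESC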

Limitations:

This uses an OR operator, so while it works here, be aware that when you index the keywords, a keyword like apache-spark actually produces two tokens in the index, namely apache and spark. It would therefore also be returned if your text contained Apache Age.
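
A quick sketch of that pitfall (assuming the same Keyword index as above; a text mentioning only Apache Age, not Spark, still matches):

CALL db.index.fulltext.queryNodes('Keyword', 'Apache Age')
YIELD node, score
// returns the apache-spark keyword, matched via its shared "apache" token
RETURN node.value, score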

Alternative solution

Do it the other way around; the process would be:

  1. create an FTS index for the input texts
  2. temporarily store the input text in a node
  3. start from the keywords, clean them, and dynamically build Lucene queries from them
  4. query the FTS index for input texts
  5. delete the temporary text node
// step 1: create the FTS index for the input texts (run this as its own statement)
CALL db.index.fulltext.createNodeIndex('Text', ['Text'], ['text'])

// steps 2-4: store the text, build a Lucene query per keyword, query the index
WITH 'Looking for Apache Spark expert to coach me on the core concepts of optimizing the parallelism of Spark using Scala and OpenAcc programming model.

The mentor must have comprehensive hands-on knowledge of Big Data analytics in large scale of data  (especially Spark and GPU programming) to design the software tool with sample data analysis using Scala language and OpenAcc directives. '
AS text
CREATE (t:Text {text: text})
WITH t
MATCH (n:Keyword)
// remove non-alphanumeric characters (backslashes doubled for Cypher string escaping)
WITH n, apoc.text.regreplace(n.value, '[^a-zA-Z\\d\\s:]', ' ') AS clean
WITH n, split(clean, ' ') AS tokens
// build up an FTS query joining the tokens with the `AND` operator
WITH n, '(' + apoc.text.join(tokens, ' AND ') + ')' AS query
CALL db.index.fulltext.queryNodes('Text', query)
YIELD node, score
// return the keyword node as well, so we know which keyword matched
RETURN n, node, sum(score)

These are the Lucene queries produced:

╒════════════════════╕
│"query"             │
╞════════════════════╡
│"(apache AND spark)"│
├────────────────────┤
│"(gpu)"             │
├────────────────────┤
│"(scala)"           │
├────────────────────┤
│"(apache AND age)"  │
└────────────────────┘
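
(The (apache AND age) row suggests an additional apache-age keyword was present in this run, illustrating the earlier tokenization caveat.) To sanity-check the query-building step in isolation, the expression from the query above can be run on a single keyword; a minimal sketch:

RETURN '(' + apoc.text.join(split(apoc.text.regreplace('apache-spark', '[^a-zA-Z\\d\\s:]', ' '), ' '), ' AND ') + ')' AS query
// → "(apache AND spark)"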

Finally, delete the temporary text node (step 5):

MATCH (n:Text) DELETE n

Result:

╒════════════════════════╤════════════════════════════════════════════════════╤═══════════════════╕
│"n"                     │"node"                                              │"sum(score)"       │
╞════════════════════════╪════════════════════════════════════════════════════╪═══════════════════╡
│{"value":"apache-spark"}│{"text":"Looking for Apache Spark expert to coach…"}│0.33785906434059143│
├────────────────────────┼────────────────────────────────────────────────────┼───────────────────┤
│{"value":"gpu"}         │{"text":"Looking for Apache Spark expert to coach…"}│0.13164746761322021│
├────────────────────────┼────────────────────────────────────────────────────┼───────────────────┤
│{"value":"scala"}       │{"text":"Looking for Apache Spark expert to coach…"}│0.18063414096832275│
└────────────────────────┴────────────────────────────────────────────────────┴───────────────────┘

Summary

In my opinion, there is no truly bulletproof solution here.