"Reverse boolean search" or What is best way to build subscription by keywords (with boolean rules) system?


I need to build a system that triggers notifications whenever a new post matches rules defined by a user.

For example, the system has a list of users (let's say millions) and a stream of incoming posts (also a large volume).

Some users want to be notified when a new post matches the rules they have defined.

A rule is a boolean expression describing which words must (or must not) appear in the post.

For example, user A defines the following rule:

"I want to be notified if a new post contains the word "programming" or "coding", but does not contain the word "javascript"."

Pseudo logical expression:

notify = (post.contains("programming") OR post.contains("coding")) AND NOT (post.contains("javascript"))

Users with the above rule should be notified about a post like this:

"Programming best practices with python"

On the other hand, users with the above rule should NOT be notified about something like this:

"Programming backend with javascript and nodejs"

So it is something like a "reverse (boolean) search" (I'm not sure what the proper name is).

In a "direct" boolean search, the user would type "programming python" and all posts matching both "programming" and "python" would be returned.

But I need the opposite: given a post, I need to return the users whose rules match it.
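To make the "reverse" direction concrete, here is a minimal Python sketch of the naive baseline: every stored rule is evaluated against the incoming post. The rule representation (a set of "any of" words plus a set of excluded words), the user names, and the helpers `tokenize`, `rule_matches` and `users_matching` are assumptions made for illustration, not part of any existing system.

```python
import re

def tokenize(text):
    # Lowercase the text and split it into a set of alphanumeric words.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rule_matches(words, any_of, none_of):
    # (contains any of any_of) AND NOT (contains any of none_of)
    return bool(words & any_of) and not (words & none_of)

# Hypothetical rules: user -> (words of which at least one must appear, excluded words)
rules = {
    "user_a": ({"programming", "coding"}, {"javascript"}),
    "user_b": ({"javascript"}, set()),
}

def users_matching(post):
    # Naive reverse search: evaluate every rule against the single post.
    words = tokenize(post)
    return sorted(u for u, (any_of, none_of) in rules.items()
                  if rule_matches(words, any_of, none_of))

print(users_matching("Programming best practices with python"))
print(users_matching("Programming backend with javascript and nodejs"))
```

This costs one rule evaluation per user per post, which is exactly what becomes too expensive with millions of users.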

One "dumb" solution I'm thinking of is to use Elasticsearch. In Elasticsearch, I would store the user-defined rules:

user A -> rules (keywords with boolean rules), user B -> rules, ...

When a new post is created, users with rules would be searched in ES by the content of the post (this just looks for occurrences of the post's words in the rules, without applying any boolean logic).

This narrows down the set of candidate users. Let's say this step finds 10000 users.

The new post will also be stored in Elasticsearch (in another index).

The second step is to make a bulk search request (msearch) to the posts index against that single post.

The bulk search request will contain 10000 queries (one per user found), each containing that user's boolean rule (query->boolean->must..., etc).

Thus only users with matched rules will be notified.
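The two steps above can be sketched in memory, without Elasticsearch, as a rough illustration of the idea: an inverted index from keywords to users narrows the candidates, and only those candidates have their full rule evaluated. The rule format, the user names, and the function `users_to_notify` are hypothetical. The sketch also assumes every rule has at least one positive keyword, since a rule made only of exclusions would never be picked up as a candidate.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase the text and split it into a set of alphanumeric words.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical rules: user -> (words of which at least one must appear, excluded words)
RULES = {
    "user_a": ({"programming", "coding"}, {"javascript"}),
    "user_b": ({"javascript"}, set()),
    "user_c": ({"cooking"}, set()),
}

# Inverted index built once: positive keyword -> users whose rule mentions it.
index = defaultdict(set)
for user, (any_of, _none_of) in RULES.items():
    for word in any_of:
        index[word].add(user)

def users_to_notify(post):
    words = tokenize(post)
    # Step 1: candidate users, found by keyword occurrence only (no boolean logic).
    candidates = set()
    for word in words:
        candidates |= index.get(word, set())
    # Step 2: evaluate the full boolean rule, but only for the candidates.
    notified = []
    for user in sorted(candidates):
        any_of, none_of = RULES[user]
        if (words & any_of) and not (words & none_of):
            notified.append(user)
    return notified

print(users_to_notify("Programming best practices with python"))
print(users_to_notify("Programming backend with javascript and nodejs"))
```

The work per post then scales with the number of candidate users rather than with the total number of users.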

What do you think about this solution? As far as I know, Elasticsearch is fast only when retrieving a limited number of top documents, but I need to retrieve several thousand (in the first step).

Maybe Apache Spark is a better fit for this problem? (I'm not familiar with it; I just know that it can process huge amounts of data, and I wonder whether this use case suits Spark.)

Could you please give a short suggestion or some advice on which direction I should take to solve this problem?

Thank you!


There is 1 solution below

Teimuraz On

I'm answering my own question (I have no idea why I didn't find this before):

One solution is to use Elasticsearch percolate queries.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
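As a rough sketch of how the rule from the question might look with percolation, the Python snippet below only builds and prints the JSON request bodies; it does not talk to a cluster. The index name `user-rules`, the document id `user-a`, and the field names `query`, `body` and `user` are assumptions chosen for illustration. The `percolator` field type and the `percolate` query with its `field` and `document` parameters come from the linked documentation.

```python
import json

INDEX = "user-rules"  # hypothetical index name

# Index mapping: a percolator field holds the stored query,
# and "body" mirrors the post field that the stored queries refer to.
mapping = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "body": {"type": "text"},
            "user": {"type": "keyword"},
        }
    }
}

# User A's rule: ("programming" OR "coding") AND NOT "javascript".
rule_user_a = {
    "user": "A",
    "query": {
        "bool": {
            "should": [
                {"match": {"body": "programming"}},
                {"match": {"body": "coding"}},
            ],
            "minimum_should_match": 1,
            "must_not": [{"match": {"body": "javascript"}}],
        }
    },
}

def percolate_request(post_text):
    # Search body that asks which stored queries match this one post.
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"body": post_text},
            }
        }
    }

print(f"PUT /{INDEX}")
print(json.dumps(mapping, indent=2))
print(f"PUT /{INDEX}/_doc/user-a")
print(json.dumps(rule_user_a, indent=2))
print(f"GET /{INDEX}/_search")
print(json.dumps(percolate_request("Programming best practices with python"), indent=2))
```

Each hit returned by the last request is a stored rule document, so the `user` field of the hits would give the users to notify.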