Named Entity Recognition on Search Engine Queries with Python


I'm trying to do Named Entity Recognition on search engine queries with Python.

The main difficulty with search engine queries is that they are usually incomplete or all lowercase.

For this task, I've been recommended spaCy, NLTK, Stanford NLP, Flair, and Hugging Face Transformers as possible approaches.

I was wondering whether anybody in the SO community knows the best approach to NER for search engine queries, because so far I've run into problems.

For example, with spaCy:

import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_sm")

# Two example queries, both entirely lowercase
queries = [
    "google and apple are looking at buying u.k. startup for $1 billion",
    "who is barack obama",
]

# Extract and print the entities found in each query
for text in queries:
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)

For the first query I got:

google ORG
u.k. GPE
$1 billion MONEY

This is a good result, although "apple" is missing. However, for the lowercase search query "who is barack obama", it returned no entities.

I'm sure I'm not the first person to do NER on search engine queries in Python, so I'm hoping to find someone who can point me in the right direction.

Answer by Daniel Perez Efremova

Problem

Most NER models rely heavily on token casing as a main feature, so they struggle with all-lowercase text.

Solution

I would try GPT models: they are trained to predict text from context, so they should be able to recognise entities based on the context rather than on casing.

I ran a quick experiment with ChatGPT.

Prompt:

Named entity recognition (NER) is a natural language processing (NLP) method that extracts information from text. NER involves detecting and categorizing important information in text known as named entities. Named entities refer to the key subjects of a piece of text, such as names, locations, companies, events and products, as well as themes, topics, times, monetary values and percentages. You are an expert on recognizing Named entities. 

I will provide you with short sentences and you will respond with all the entities you find.

Return the entities classified into four types:

PER for persons such as Bill Clinton, Gauss, Jennifer Lopez
LOC for locations such as California, Europe, 9th Avenue
ORG for organizations such as Apple, Google, UNO
MISC for any other type of entity that does not fit the cases above.

Respond in JSON format. 

For example:

"google and apple are looking at buying u.k. startup for $1 billion"

response:

{"entities": [
{"name": "google", "type": "ORG"},
{"name": "apple", "type": "ORG"},
{"name": "u.k.", "type": "MISC"}
]}

It responded well for your use case (try it in the ChatGPT app!)
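
Since a language model can return labels outside the four requested types, or names that do not appear in the query at all, it may be worth checking the reply before using it. Here is a minimal sketch, assuming the JSON shape shown in the prompt above. `filter_entities` and `ALLOWED_TYPES` are hypothetical names for illustration, not part of any library:

```python
ALLOWED_TYPES = {"PER", "LOC", "ORG", "MISC"}  # the four types requested in the prompt


def filter_entities(query, reply):
    """Keep only entities with an allowed type whose name occurs in the query.

    `reply` is assumed to be a dict shaped like
    {"entities": [{"name": ..., "type": ...}, ...]}.
    """
    query_lower = query.lower()
    kept = []
    for entity in reply.get("entities", []):
        if not isinstance(entity, dict):
            continue
        name = str(entity.get("name", "")).strip()
        label = entity.get("type")
        # Case-insensitive substring match, since queries are often all lowercase
        if (
            name
            and isinstance(label, str)
            and label in ALLOWED_TYPES
            and name.lower() in query_lower
        ):
            kept.append({"name": name, "type": label})
    return {"entities": kept}
```

The substring check is deliberately simple; it would need adjusting if the model is allowed to normalise names (for example, expanding "u.k." to "United Kingdom").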

Code

The following code and dependencies should do the trick as a first approach with OpenAI models

!pip install openai==1.2.0 pyautogen==0.2.0b2

(It has been difficult to find a working combination of versions; OpenAI recently migrated to a new API, so many tutorials in the wild are now outdated...)

from openai import OpenAI
import json

# Initialize OpenAI client
client = OpenAI(api_key="<your OpenAI API key>")

# Function to perform Named Entity Recognition (NER)
def perform_ner(text):
    # Define the prompt for NER task
    prompt = """
    
    You are an expert on recognising named entities. I will provide you with short sentences and you will respond with all the entities you find. Return the entities classified into four types:
    PER for persons such as Bill Clinton, Gauss, Jennifer Lopez
    LOC for locations such as California, Europe, 9th Avenue
    ORG for organizations such as Apple, Google, UNO
    MISC for any other type of entity that does not fit the cases above.

    Respond in JSON format. 

    For example:

    "google and apple are looking at buying u.k. startup for $1 billion"

    response:

    {"entities": [
    {"name": "google", "type": "ORG"},
    {"name": "apple", "type": "ORG"},
    {"name": "u.k.", "type": "MISC"}
    ]}
    
    """

    # Generate completion using OpenAI API
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text}
        ],
        max_tokens=100,
        n=1,
        stop=None,
        temperature=0
    )

    # Extract and return entities from response
    
    entities = response.choices[0].message.content.strip()
    return json.loads(entities)

# Function to receive new text and return NER JSON
def get_ner_json(new_text):
    # Perform NER on the new text
    entities = perform_ner(new_text)
    return entities

# Example new text
new_text = "I went to Paris last summer and visited the Eiffel Tower."

# Get NER JSON for the new text
ner_json = get_ner_json(new_text)
print(json.dumps(ner_json, indent=2))

The output:

{
  "entities": [
    {
      "name": "paris",
      "type": "LOC"
    },
    {
      "name": "eiffel tower",
      "type": "LOC"
    }
  ]
}
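
One caveat: `json.loads` in `perform_ner` raises an error if the model wraps the JSON in extra prose, or if the reply is cut off by `max_tokens`. Below is a minimal sketch of a more forgiving parser that uses only the standard library. `parse_entities` is a hypothetical helper name, and falling back to an empty entity list is an assumption about what you would want on failure:

```python
import json


def parse_entities(raw):
    """Extract the first JSON object from a model reply.

    Returns {"entities": []} when no valid object with an
    "entities" list can be found.
    """
    empty = {"entities": []}
    start = raw.find("{")
    if start == -1:
        return empty
    try:
        # raw_decode parses one JSON value and ignores any trailing text
        parsed, _ = json.JSONDecoder().raw_decode(raw[start:])
    except json.JSONDecodeError:
        return empty
    if not isinstance(parsed, dict) or not isinstance(parsed.get("entities"), list):
        return empty
    return parsed
```

In `perform_ner`, `return parse_entities(entities)` could then stand in for `return json.loads(entities)`.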