Having one vector column for multiple text columns on Qdrant


I have a products table with many columns, of which the following are important for our search:

  1. Title 1 to Title 6 (title in 6 different languages)
  2. Brand name (in 6 different languages)
  3. Category name (in 6 different languages)
  4. Product attributes like size, color, etc. (in 6 different languages)

We are planning to use Qdrant vector search to implement fast vector queries. The problem is that all the data important for searching is spread across different columns, and I do not think (correct me if I am wrong) that generating vector embeddings separately for every column is the best solution.

I came up with the idea of combining the columns and generating separate collections per language, since the title, category, brand, and attribute columns are essentially the same data, just in different languages.

I also use the "BAAI/bge-m3" model, a multilingual text embedding model that supports more than 100 languages.

So, in short, I created a separate collection for each language. In each collection, every product has one vector: the embedding of the combined text of title, brand, color, and category in that language. At search time, since we already know which language the website is in, we search only that language's collection.
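For concreteness, here is a minimal sketch of that setup using the Python qdrant-client and FlagEmbedding libraries. The collection names, language codes, payload field, and the index_product helper are illustrative, not my exact code:

from FlagEmbedding import BGEM3FlagModel
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

model = BGEM3FlagModel("BAAI/bge-m3")  # bge-m3 dense vectors are 1024-dimensional
client = QdrantClient(url="http://localhost:6333")

LANGS = ["en", "de", "fr", "es", "tr", "ar"]  # illustrative language codes

# One collection per language, each holding one combined-text vector per product
for lang in LANGS:
    client.create_collection(
        collection_name=f"products_{lang}",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )

def index_product(product_id: int, lang: str, title: str, brand: str,
                  category: str, attrs: str) -> None:
    combined = " ".join([title, brand, category, attrs])
    vector = model.encode([combined])["dense_vecs"][0]
    client.upsert(
        collection_name=f"products_{lang}",
        points=[PointStruct(id=product_id, vector=vector.tolist(),
                            payload={"title": title})],
    )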

Now, the question is: is this a valid method? What are its pros and cons? I know for certain that once the fields are combined, I cannot give different weights to different parts of the vector. For example, one combined text of title, category, color, and brand may look like this:

"Koala patterned hoodie children blue Bubito"

or something like:

"Striped t-shirt men navy blue Zara"

Now, a user may search for "blue hoodie for men", but due to the unweighted structure of the combined vector, it may not retrieve the best results.

I may be wrong and this may be one of the best approaches, but please tell me more about the pros and cons of this method and, if you can, suggest a better idea.

It is important to note that we currently have more than 300,000 (300K) products, and this will grow to more than 1,000,000 (1M) in the near future.


3 Answers

LiteApplication (Best Answer)

It seems like you have thought this through already, and your method is valid, practical, simple, and scalable. Here is a quick overview of my thoughts on your particular question.


Pros of your method

  1. By segregating data into collections based on language, you ensure that searches are conducted within the correct linguistic context. It is quite rare for users to mix languages in search terms, so I feel you are right on this point.
  2. In terms of scalability, your approach seems optimal, as you can expand linearly as your database grows. Separating the languages could also let you split the databases across regions (Chinese in China, English in England) and query only the one for the relevant region.
  3. Combining relevant fields into a single vector for each language streamlines the search process. This approach reduces the complexity of managing multiple vectors per product, which can lower overhead and improve efficiency.

Cons of your method

  1. As you stated previously, combining fields without weighting can lead to less precise search outcomes because there is no way to tell which keywords are important.
  2. The combined vector approach might not always accurately reflect the nuances of the data. For instance, a product's title, brand, and category might not always align perfectly with the user's search intent, especially if the brand name is a common word in the user's language, which could make results feel like Google's "Verbatim" mode.

Alternative approaches

Weighted Vector Combination

Instead of merging all fields into a single vector, consider creating separate vectors for each field (title, brand, category, attributes) and then combining them with weights that reflect their importance. This method allows more precise control over search relevance, but it adds computational cost and complexity, and requires a fair bit of judgment if you tune the weights yourself.
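As a side note, Qdrant also supports multiple named vectors per point, so each field can keep its own embedding inside a single collection and be queried (or weighted) independently. A rough sketch, with the field names and sizes assumed rather than taken from your schema:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# One named vector per searchable field (field names are illustrative)
client.create_collection(
    collection_name="products_en",
    vectors_config={
        "title": VectorParams(size=1024, distance=Distance.COSINE),
        "brand": VectorParams(size=1024, distance=Distance.COSINE),
        "category": VectorParams(size=1024, distance=Distance.COSINE),
        "attributes": VectorParams(size=1024, distance=Distance.COSINE),
    },
)

# Search against one specific field's vector, e.g. the title:
# hits = client.search("products_en", query_vector=("title", title_vec), limit=20)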

A similar solution would be to hard-code some "important" keywords, or pin them to a specific column during search. This might be doable if your catalog has a few "main" categories of products, but it can be very tedious, or outright unworkable, if your products are very diverse.

Semantic Search with Fine-Tuning

Utilize the BAAI/bge-m3 model to generate embeddings for each field individually, then combine these embeddings in a manner that allows for weighting. This could involve training a custom model on your data to better understand the significance of different fields in the context of your products. This approach essentially automates the previous one, but requires you to already have data about the search intent and the keywords used by the clients.

This method is also fairly complicated to implement but could yield good results if you combine it with analytics from the websites so that it can learn over time.
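To make that idea concrete, here is a rough, self-contained sketch of learning per-field weights with PyTorch. The data here is random placeholder tensors; in practice, field_vecs would be your precomputed per-field embeddings and query_vecs the embeddings of queries that led to a click on each product:

import torch

n_samples, n_fields, dim = 256, 4, 1024
field_vecs = torch.randn(n_samples, n_fields, dim)  # placeholder: per-field embeddings
query_vecs = torch.randn(n_samples, dim)            # placeholder: clicked-query embeddings

w = torch.nn.Parameter(torch.ones(n_fields))
opt = torch.optim.Adam([w], lr=0.05)

for step in range(200):
    weights = torch.softmax(w, dim=0)                          # field weights sum to 1
    combined = torch.einsum("nfd,f->nd", field_vecs, weights)  # weighted field average
    # Pull each combined product vector toward the query that led to its click
    loss = 1 - torch.nn.functional.cosine_similarity(combined, query_vecs).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(w, dim=0))  # learned per-field weights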


I hope this helps; I would be interested to know which method you end up using.

Vahid

Beta Answer (not implemented yet, posted for discussion)

As expected, the previous method does indeed have a weight distribution problem with search queries. For example, if you search for something like "women red skirt", it retrieves not only women's red skirts but also "women red shoes" and similar items.

But with weighted importance levels assigned to different fields, this issue would not occur. Let me explain what I think can be done to implement the weighted importance distribution method.

First Step: Tokenizing the search query

First, we have to tokenize the search query "women red skirt" to see which keywords it contains. We have to somehow (I do not know exactly how) determine that "women" is a gender, "red" is an attribute (color), and "skirt" is a category.
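One low-tech possibility I can imagine (not implemented, and the vocabularies here are illustrative) is to match tokens against known lists of categories, colors, and genders built from the catalog:

# Illustrative vocabularies; in practice these would be built from the catalog
CATEGORIES = {"skirt", "hoodie", "t-shirt", "shirt"}
COLORS = {"red", "blue", "navy"}
GENDERS = {"women", "men", "children"}

def classify_tokens(query: str) -> dict[str, list[str]]:
    fields = {"category": [], "color": [], "gender": [], "other": []}
    for token in query.lower().split():
        if token in CATEGORIES:
            fields["category"].append(token)
        elif token in COLORS:
            fields["color"].append(token)
        elif token in GENDERS:
            fields["gender"].append(token)
        else:
            fields["other"].append(token)
    return fields

print(classify_tokens("women red skirt"))
# {'category': ['skirt'], 'color': ['red'], 'gender': ['women'], 'other': []}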

Second Step: Step-by-step filtration

Then, according to the importance level of each field, it would filter the data step by step: first search the category vector column and fetch everything with category "skirt", then filter that result list for gender "women", and in the final step filter for the color attribute "red".
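In Qdrant, this cascade could also be collapsed into a single filtered vector search, assuming the category, gender, and color values are stored as payload fields (a sketch with a placeholder query vector, not my actual schema):

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")
query_vector = [0.0] * 1024  # placeholder: embedding of "women red skirt"

hits = client.search(
    collection_name="products_en",
    query_vector=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="skirt")),
            FieldCondition(key="gender", match=MatchValue(value="women")),
            FieldCondition(key="color", match=MatchValue(value="red")),
        ]
    ),
    limit=20,
)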

The Problem

Now, the problem is that I do not know whether this method is practical, feasible, or performant. I would appreciate any input on this matter.

UPDATE ON THIS ANSWER

This method had some complexities as below:

  1. Tokenizing the search query required a model, which I either had to write and train myself (a very hard and time-consuming task) or find ready-made. I could not find any existing model built specifically for this purpose, and using LLMs such as the OpenAI API increased the response time by 2-3 seconds.
  2. Step-by-step filtration also increases the search time. Combined with the OpenAI API latency, this could have a drastic effect on the overall response time.

So, in short, this method is not the best option. Maybe if we had the manpower to develop a lightweight model specifically for tokenizing the query, this method would work.

However, I did find another answer on the matter, which I will add in another answer.

Vahid

A Newer Answer, Using Vector Weights and Re-ranker Models

I am adding another answer, as I do not want to delete/edit the previous one; I think it could still add some insights for some situations.

So, in this new method, to give weights to the different field vectors, I implemented @LiteApplication's suggestion: instead of combining the field texts and generating one vector, I use the following code:

import numpy as np

def generate_weighted_vector(fields: list[str], weights: np.ndarray) -> np.ndarray:
    weights = weights.astype("float64")
    # generate_embeddings is our existing helper: one embedding per field text
    embeddings = map(generate_embeddings, fields)
    vectors = np.array(list(embeddings))  # map is lazy, so materialize it first
    return (vectors.T @ weights) / np.sum(weights)  # weighted average of field vectors

This way, each field has a weighted effect on the query results.
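For example, using the combined text from the question split back into its fields (the weight values here are illustrative, not tuned):

vector = generate_weighted_vector(
    fields=["Koala patterned hoodie", "children", "blue", "Bubito"],
    weights=np.array([3.0, 2.0, 2.0, 1.0]),  # title, category, color, brand
)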

But I still encountered unrelated results, WHICH I THINK IS NORMAL (please correct me if it is not normal to retrieve unrelated results in vector-based semantic search). I did some searching and came upon a technique called "cross-encoder re-ranking", which uses models that take the search query as one input and a primary candidate hit as the other, compute a fine-grained similarity score for each pair, and re-rank the results by these scores. This is the function that performs this action:

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_results(primary_results: list, search_query: str):
    """
    Rerank the primary search results by cross-encoder scores obtained by
    comparing the search query with each result's title.

    Args:
    - primary_results (list): Primary search results retrieved via Qdrant search.
    - search_query (str): The search query used to score and rerank the results.

    Returns:
    - list: The search results ordered by cross-encoder score, descending.
    """
    cross_inp = [[search_query, hit["title"]] for hit in primary_results]
    cross_scores = cross_encoder.predict(cross_inp)
    # Attach each score to its result, then sort by score
    for idx, score in enumerate(cross_scores):
        primary_results[idx]["cross-score"] = score
    return sorted(primary_results, key=lambda x: x["cross-score"], reverse=True)

But the results still include unrelated products. For example, searching for "blue striped shirt for men" returns the related products, but also unrelated ones like sunglasses or other items that have nothing to do with the query.

This is really frustrating, as my employer insists that if any unrelated product is returned, the search API is no good.

I am looking for insights on the following topics:

  1. Why are there unrelated products in the results before re-ranking?
  2. Why, even after re-ranking, are there still some unrelated products?

UPDATE

So, changing the re-ranker model solved the problem; it now produces the desired results for my queries. This is the new version of the re-ranker function:

from FlagEmbedding import FlagReranker

cross_encoder = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def rerank_results(primary_results: list, search_query: str):
    """
    Rerank the primary search results by cross-encoder scores obtained by
    comparing the search query with each result's title.

    Args:
    - primary_results (list): Primary search results retrieved via Qdrant search.
    - search_query (str): The search query used to score and rerank the results.

    Returns:
    - list: The search results ordered by cross-encoder score, descending.
    """
    # Only score the top 72 candidates to keep re-ranking latency bounded
    cross_inp = [[search_query, hit["title"]] for hit in primary_results[:72]]
    cross_scores = cross_encoder.compute_score(cross_inp, normalize=True)

    # Attach each score to its result; unscored tail results default to 0
    for idx, score in enumerate(cross_scores):
        primary_results[idx]["cross-score"] = score
    hits = sorted(primary_results, key=lambda x: x.get("cross-score", 0), reverse=True)
    for hit in hits:
        print("hits:", hit["title"], hit.get("cross-score", 0))  # debug output
    return hits

With this new re-ranker model, the results for same-language queries are really good, but problems arise when I search with a query in another language. My main embedding model (BAAI/bge-m3) is cross-lingual and supports over 100 languages; without re-ranking, the results are not great, but still related. Re-ranker models, however, generally work at the token level, so they may not cover cross-lingual similarity scoring (this is my understanding, but maybe I am wrong). So I get totally unrelated results when searching in another language.

I wonder if there are re-ranker models that support cross-lingual re-ranking.