I have a company's data (basically a dump of their website) that I want to index so that I can build a semantic search engine. The data is structured like this: [{'title': 'some title', 'content': "web page's content", 'url': "the page's url"}, {...}, ...], where each dictionary represents a page.

The problem is the size of the content. If a page's content is too large, I have to split it into chunks, vectorize each chunk, and then index the chunks on Pinecone. All chunks from the same page share the same title and url, so when I query the database I often get several results with the same url and title. How can I avoid this?

Also, what if I don't make chunks and instead vectorize the entire content, even if it is big, and index that on Pinecone? Would the search results still be effective? Is there any other efficient way of indexing this data to build a powerful, effective search engine?
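To make the setup above concrete, here is a minimal sketch of the chunking step plus one possible way to collapse duplicate pages at query time. It is not a definitive implementation. The function names `chunk_page` and `best_match_per_page`, the character-based splitting, the `max_chars` and `overlap` defaults, and the record and match shapes are all assumptions for illustration. The matches are plain dictionaries with `id`, `score`, and `metadata` keys, similar to what a vector database returns when metadata is included, and a higher score is assumed to mean a closer match.

```python
from typing import Dict, List


def chunk_page(page: Dict[str, str], max_chars: int = 1000, overlap: int = 100) -> List[Dict]:
    """Split one page into overlapping character chunks that share title and url."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    text = page["content"]
    step = max_chars - overlap
    records = []
    start, i = 0, 0
    while True:
        records.append({
            # Hypothetical id scheme: page url plus chunk number.
            "id": f"{page['url']}#chunk-{i}",
            "text": text[start:start + max_chars],
            "metadata": {"title": page["title"], "url": page["url"], "chunk": i},
        })
        if start + max_chars >= len(text):
            break  # this chunk already reaches the end of the page
        start += step
        i += 1
    return records


def best_match_per_page(matches: List[Dict], top_k: int = 5) -> List[Dict]:
    """Keep only the highest-scoring chunk for each url, best pages first."""
    best: Dict[str, Dict] = {}
    for match in matches:
        url = match["metadata"]["url"]
        # Assumption: higher score means more similar (e.g. cosine similarity).
        if url not in best or match["score"] > best[url]["score"]:
            best[url] = match
    ranked = sorted(best.values(), key=lambda m: m["score"], reverse=True)
    return ranked[:top_k]
```

With this approach one would typically request more chunks from the index than the number of pages wanted (for example, several times `top_k`), then pass the returned matches through `best_match_per_page` so each url appears at most once.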
Indexing custom data on Pinecone
145 Views Asked by Krishna Gupta At