I have a company's data (basically a dump of their website) and I want to index it so that I can build a semantic search engine. The data is structured roughly like this: [{'title': 'some title', 'content': 'the page content', 'url': 'the page URL'}, {...}, ...], where each dictionary represents one page.

The problem is the size of the content. If a page's content is too large, I have to split it into chunks, vectorize each chunk, and then index the vectors on Pinecone. Every chunk from the same page carries the same title and URL. As a result, when I query the database I often get several results with the same URL and title.

How can I avoid this? Alternatively, what if I don't make chunks at all and instead vectorize the entire content of each page, even when it is large, and index that on Pinecone? Would the search results still be effective in that case? Is there any other efficient way to index this data so as to build a powerful, effective search engine?
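To make the setup concrete, here is a minimal sketch of the two steps described above: splitting a page into chunks that keep the page's title and URL as metadata, and collapsing chunk-level search hits back to one result per URL. It does not call Pinecone at all. The function names `chunk_page` and `dedupe_by_url`, the word-based chunk size and overlap, and the ID scheme are all hypothetical choices for illustration. The match dictionaries only mimic an id/score/metadata shape, and the ranking assumes that a higher score means a closer match (as with cosine or dot-product similarity).

```python
from typing import Dict, Iterable, List


def chunk_page(page: Dict[str, str], chunk_size: int = 200, overlap: int = 40) -> List[Dict]:
    """Split one page into overlapping word windows, copying title/url into each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = page["content"].split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append({
            # Hypothetical ID scheme: page URL plus chunk position.
            "id": f"{page['url']}#chunk-{i}",
            "text": " ".join(window),
            "metadata": {"title": page["title"], "url": page["url"], "chunk": i},
        })
        # Stop once this window already reaches the end of the page.
        if start + chunk_size >= len(words):
            break
    return chunks


def dedupe_by_url(matches: Iterable[Dict], top_k: int = 5) -> List[Dict]:
    """Keep only the best-scoring chunk per URL, then return the top_k pages."""
    best: Dict[str, Dict] = {}
    for match in matches:
        url = match["metadata"]["url"]
        # Assumption: higher score means more similar.
        if url not in best or match["score"] > best[url]["score"]:
            best[url] = match
    ranked = sorted(best.values(), key=lambda m: m["score"], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    page = {
        "title": "Example page",
        "content": " ".join(f"w{i}" for i in range(500)),
        "url": "https://example.com/a",
    }
    print(len(chunk_page(page)), "chunks")

    # Hand-written stand-ins for chunk-level search hits.
    hits = [
        {"id": "a#0", "score": 0.90, "metadata": {"url": "a", "title": "A"}},
        {"id": "a#1", "score": 0.95, "metadata": {"url": "a", "title": "A"}},
        {"id": "b#0", "score": 0.80, "metadata": {"url": "b", "title": "B"}},
    ]
    for hit in dedupe_by_url(hits):
        print(hit["metadata"]["url"], hit["score"])
```

With this approach the query would ask for more chunk-level hits than the number of pages wanted, since several hits may collapse into one page after deduplication.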
Indexing custom data on Pinecone
Asked by Krishna Gupta. 143 views.