I'm working with movies_data.json that has documents like this:
[
  {
    "metadata": {
      "id": 580489,
      "original_title": "venom: let there be carnage",
      "popularity": 5401.308,
      "release_date": "2021-09-30",
      "vote_average": 6.8,
      "vote_count": 1736,
      "revenue": 424000000,
      "tagline": "",
      "poster_url": "https://image.tmdb.org/t/p/original/rjkmN1dniUHVYAtwuV3Tji7FsDO.jpg",
      "adult": 0
    },
    "embedded_data": {
      "overview": "After finding a host body in investigative reporter Eddie Brock, the alien symbiote must face a new enemy, Carnage, the alter ego of serial killer Cletus Kasady.",
      "genre": "['Science Fiction', 'Action', 'Adventure']"
    }
  },
  ....
]
Is it fine to split JSON documents? In my case, each document has metadata and embedded_data fields. If I split the file, one movie's metadata might get wrongly associated with another movie's embedded data.
I've converted the parsed JSON into a string and split that, but my data gets dispersed, e.g. one movie's metadata ends up associated with another movie's embedded data.
This made me question how I should split my data in such cases.
So far, my code looks like this:
const loader = new JSONLoader(
  "/input.json"
);
let docs = await loader.load();
// console.log(docs);
docs = JSON.stringify(docs);
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400,
  chunkOverlap: 1
});
const docOutput = await splitter.createDocuments([docs]);
console.log(docOutput);
You most likely do not want to split the metadata and embedded data of a single movie object. Unfortunately, keeping the data together in a single Document is not possible with JSONLoader and the format of your JSON file. The loader will load every string it finds in the file into a separate Document.
Here's an approach that will probably achieve what you want:
1. Read and parse the JSON file yourself instead of using JSONLoader.
2. Create a Document for each object.
3. You now have an array of Documents. You can do whatever you need with them. For example, pass them into a vectorstore for retrieval later.
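The approach above can be sketched roughly as follows. This is only an illustration under a few assumptions: moviesToDocuments is a hypothetical helper name, the sample array is cut-down placeholder data, and plain objects with the { pageContent, metadata } shape stand in for LangChain's Document class so the snippet runs without any dependencies. Which fields go into pageContent is also an assumption; here it is the overview plus the genre string.

```javascript
// Sketch: build one Document-shaped object per movie, so a movie's metadata
// can never be paired with another movie's text.
// ASSUMPTION: plain { pageContent, metadata } objects stand in for LangChain's
// Document class; moviesToDocuments is a hypothetical helper, not a library call.
function moviesToDocuments(movies) {
  return movies.map((movie) => {
    const { overview, genre } = movie.embedded_data;
    return {
      // Text to embed: the overview plus the genre string from the same movie.
      pageContent: `${overview}\nGenres: ${genre}`,
      // Copy of this movie's "metadata" object; it stays with this text only.
      metadata: { ...movie.metadata },
    };
  });
}

// In a real script this string would be read from the file yourself,
// e.g. with fs.readFileSync("movies_data.json", "utf8").
// The second entry is made-up placeholder data for illustration.
const rawJson = JSON.stringify([
  {
    metadata: { id: 580489, original_title: "venom: let there be carnage" },
    embedded_data: {
      overview: "After finding a host body in investigative reporter Eddie Brock, the alien symbiote must face a new enemy.",
      genre: "['Science Fiction', 'Action', 'Adventure']",
    },
  },
  {
    metadata: { id: 1, original_title: "hypothetical second movie" },
    embedded_data: { overview: "Placeholder overview.", genre: "['Drama']" },
  },
]);

const movieDocs = moviesToDocuments(JSON.parse(rawJson));
console.log(movieDocs.length);
console.log(movieDocs[0].metadata.original_title);
```

Because each object is built from exactly one movie, there is no step at which text and metadata from different movies can be mixed. If an overview turns out to be too long for a single chunk, splitting each Document's pageContent on its own (rather than splitting one big stringified array) keeps every chunk tied to the metadata of the movie it came from.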