I'm developing a Node.js application on AWS Lambda that reads various file types (PDF, CSV, TXT, JSON, DOCX, and PPTX) from S3, splits their text, and stores it in a Pinecone database. The application uses the LangChain library for document loading and text splitting. Processing PDF, CSV, TXT, JSON, and DOCX files works fine, but I get an error specifically when processing PPTX files.
The error message is as follows:
2023-12-04T16:07:19.174Z 6d39f682-34ce-5441-af57-ab65cfa7facb ERROR Invoke Error {
  "errorType": "Error",
  "errorMessage": "[OfficeParser]: Error: ENOENT: no such file or directory, mkdir 'officeParserTemp/tempfiles'",
  "stack": [
    "Error: [OfficeParser]: Error: ENOENT: no such file or directory, mkdir 'officeParserTemp/tempfiles'",
    " at Object.intoError (file:///var/runtime/index.mjs:46:16)",
    " at Object.textErrorLogger [as logError] (file:///var/runtime/index.mjs:684:56)",
    " at postError (file:///var/runtime/index.mjs:801:27)",
    " at done (file:///var/runtime/index.mjs:833:11)",
    " at fail (file:///var/runtime/index.mjs:845:11)",
    " at file:///var/runtime/index.mjs:872:20"
  ]
}
The error occurs when OfficeParser tries to create a temporary directory while processing the PPTX file. Here is the relevant part of my code:
async function processFile(filename: string, key: string) {
  try {
    // Fetch the file content from S3
    const command = new GetObjectCommand({
      Bucket: process.env.S3_BUCKET_NAME,
      Key: filename,
    });
    const content: any = await s3Client.send(command);
    const tempFilePath = path.join('/tmp', path.basename(filename));
    await fs.writeFile(tempFilePath, content.Body); // Save the S3 object body to /tmp

    // Pick a loader based on the file extension
    const fileExtension = path.extname(tempFilePath).toLowerCase();
    let loader: any;
    switch (fileExtension) {
      case '.pdf':
        loader = new PDFLoader(tempFilePath); // working
        break;
      case '.csv':
        loader = new CSVLoader(tempFilePath); // working
        break;
      case '.txt':
        loader = new TextLoader(tempFilePath); // working
        break;
      case '.json':
        loader = new JSONLoader(tempFilePath); // working
        break;
      case '.docx':
        loader = new DocxLoader(tempFilePath); // working
        break;
      case '.pptx':
        console.log('LOADING THE PPTX FILE');
        console.log('tempFilePath SRC', tempFilePath);
        loader = new PPTXLoader(tempFilePath); // fails on load()
        break;
      default:
        console.log('PROVIDED KEY: ', key);
        loader = new UnstructuredLoader(tempFilePath, { apiKey: key });
    }

    // Load and split the file
    const rawDocs = await loader.load();
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });
    const docs = await textSplitter.splitDocuments(rawDocs);

    // Generate embeddings and ingest into Pinecone
    const embeddings = new OpenAIEmbeddings();
    const index = pinecone.index(process.env.PINECONE_INDEX_NAME);
    await PineconeStore.fromDocuments(docs, embeddings, {
      pineconeIndex: index,
      namespace: process.env.PINECONE_NAME_SPACE,
      textKey: 'text',
    });
  } catch (error) {
    console.error(`Error processing and ingesting File: ${filename}. Error: ${error}`);
    throw error;
  }
}
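I suspect the issue is related to AWS Lambda's file system, but I'm not sure how to resolve it. The PPTX processing needs to create a temporary directory ('officeParserTemp/tempfiles'), and that mkdir call is what fails. Note that the path in the error is relative, not under /tmp where I write the downloaded file. The console.log statements in the '.pptx' case print the expected values, so the failure happens later, during loading.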
For reference, please see the LangChain documentation for the PPTX loader.
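To make the file-system suspicion concrete, here is a minimal sketch of one workaround I am considering. It rests on two assumptions I have not confirmed: that the relative path 'officeParserTemp/tempfiles' is resolved against process.cwd() (which is how Node resolves relative paths in fs calls), and that on Lambda only /tmp is writable. The helper name prepareScratchDir is hypothetical and is not part of LangChain or OfficeParser; it only uses Node's built-in fs and path modules.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper, not part of LangChain or OfficeParser.
// It switches the working directory to a writable base and pre-creates
// a relative scratch directory there, so that later relative-path
// mkdir/writeFile calls resolve inside the writable location.
export function prepareScratchDir(
  writableBase: string,
  relativeDir: string = "officeParserTemp/tempfiles",
): string {
  // Relative paths passed to fs calls are resolved against process.cwd().
  process.chdir(writableBase);
  const scratch = path.resolve(relativeDir);
  // recursive: true creates missing parent directories and does not
  // throw if the directory already exists.
  fs.mkdirSync(scratch, { recursive: true });
  return scratch;
}
```

In my handler this would be called as prepareScratchDir('/tmp') in the '.pptx' case, before loader.load(). Changing the working directory affects the whole process, so this is only a stopgap; if the parser exposes its own setting for the temp location, that would probably be the cleaner fix.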