ENOENT Error When Processing PPTX Files with Node.js on AWS Lambda - Langchain and Pinecone


I'm developing a Node.js application on AWS Lambda that fetches files of various types (PDF, CSV, TXT, JSON, DOCX, and PPTX) from S3, splits their text, and stores the result in a Pinecone index. The application uses the langchain library for document loading and text splitting. PDF, CSV, TXT, JSON, and DOCX files are processed without problems, but PPTX files fail with an error.

The error message is as follows:

2023-12-04T16:07:19.174Z    6d39f682-34ce-5441-af57-ab65cfa7facb    ERROR   Invoke Error    {
    "errorType": "Error",
    "errorMessage": "[OfficeParser]: Error: ENOENT: no such file or directory, mkdir 'officeParserTemp/tempfiles'",
    "stack": [
        "Error: [OfficeParser]: Error: ENOENT: no such file or directory, mkdir 'officeParserTemp/tempfiles'",
        "    at Object.intoError (file:///var/runtime/index.mjs:46:16)",
        "    at Object.textErrorLogger [as logError] (file:///var/runtime/index.mjs:684:56)",
        "    at postError (file:///var/runtime/index.mjs:801:27)",
        "    at done (file:///var/runtime/index.mjs:833:11)",
        "    at fail (file:///var/runtime/index.mjs:845:11)",
        "    at file:///var/runtime/index.mjs:872:20"
    ]
}

According to the message, the failure happens when OfficeParser tries to create a temporary directory while processing the PPTX file. Here is the relevant part of my code:

async function processFile(filename: string, key: string) {
  try {
    // Fetch the file content from S3 with the AWS SDK v3 GetObjectCommand
    const command = new GetObjectCommand({
      Bucket: process.env.S3_BUCKET_NAME,
      Key: filename,
    });
    const content: any = await s3Client.send(command);

    const tempFilePath = path.join('/tmp', path.basename(filename));
    await fs.writeFile(tempFilePath, content.Body); // Save the object body under /tmp

    // Load and split the File
    const fileExtension = path.extname(tempFilePath).toLowerCase();
    let loader: any;
    switch (fileExtension) {
      case '.pdf':
        loader = new PDFLoader(tempFilePath); // working
        break;
      case '.csv':
        loader = new CSVLoader(tempFilePath); // working
        break;
      case '.txt':
        loader = new TextLoader(tempFilePath); // working
        break;
      case '.json':
        loader = new JSONLoader(tempFilePath); // working
        break;
      case '.docx':
        loader = new DocxLoader(tempFilePath); // working
        break;
      case '.pptx':
        console.log('LOADING THE PPTX FILE');
        console.log('tempFilePath SRC', tempFilePath);
        loader = new PPTXLoader(tempFilePath);
        break;
      default:
        console.log('PROVIDED KEY: ', key);
        loader = new UnstructuredLoader(tempFilePath, { apiKey: key });
    }
    const rawDocs = await loader.load();

    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });
    const docs = await textSplitter.splitDocuments(rawDocs);

    // Generate embeddings and ingest into Pinecone
    const embeddings = new OpenAIEmbeddings();

    const index = pinecone.index(process.env.PINECONE_INDEX_NAME);
    await PineconeStore.fromDocuments(docs, embeddings, {
      pineconeIndex: index,
      namespace: process.env.PINECONE_NAME_SPACE,
      textKey: 'text',
    });
  } catch (error) {
    console.error(`Error processing and ingesting File: ${filename}. Error: ${error}`);
    throw error;
  }
}

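I suspect the issue is related to AWS Lambda's file system, but I'm not sure how to resolve it. Judging by the error, OfficeParser tries to create its temporary directory at the relative path 'officeParserTemp/tempfiles', which is resolved against the process's current working directory. On Lambda that is the deployment directory, which is read-only; only /tmp is writable. The console.log calls print the expected values, so the file itself is written to /tmp correctly before the loader runs.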

For reference, see the PPTX loader documentation by LangChain.
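
One possible workaround, given that the failing path is relative, is to switch the working directory to a writable location only while the loader runs. The sketch below is an assumption rather than a confirmed fix: `withWorkingDirectory` and `makeRelativeTempDir` are hypothetical names, and `makeRelativeTempDir` only stands in for a library that creates a relative temp directory. It uses nothing beyond the Node.js standard library.

```typescript
import { promises as fsp } from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical helper: run `task` with the process working directory set to
// `dir`, then restore the previous working directory even if `task` throws.
async function withWorkingDirectory<T>(
  dir: string,
  task: () => Promise<T>,
): Promise<T> {
  const previous = process.cwd();
  process.chdir(dir);
  try {
    return await task();
  } finally {
    process.chdir(previous);
  }
}

// Stand-in for a library that creates a temp directory at a relative path.
// The path is resolved against process.cwd() at the time of the call.
async function makeRelativeTempDir(): Promise<string> {
  await fsp.mkdir('officeParserTemp/tempfiles', { recursive: true });
  return path.resolve('officeParserTemp/tempfiles');
}
```

In the handler above, this would mean replacing the `loader.load()` call with `await withWorkingDirectory('/tmp', () => loader.load())`. Note that `process.chdir` affects the whole process, so any other code that relies on relative paths while the loader is running would also resolve them against /tmp.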
