What I want to do: Download a PDF file from S3 in a Lambda function and extract its text, using Rust.
The Error:
ERROR PDF error: Invalid file header
I checked the PDF file in the bucket: I downloaded it from the console and it opens correctly, so something must be breaking in the way I write the file to disk.
How I am doing it:
let config = aws_config::load_from_env().await;
let client = s3::Client::new(&config);
// Get uploaded object in raw bucket (serde derived the json)
let key = event.records.get(0).unwrap().s3.object.key.clone();
let key = key.replace('+', " ");
let key = percent_encoding::percent_decode_str(&key).decode_utf8().unwrap().to_string();
let content = client
    .get_object()
    .bucket(raw_bucket_name)
    .key(&key)
    // .response_content_type("application/pdf") // this did not make any difference
    .send()
    .await?;
let mut bytes = content.body.into_async_read();
let file = tempfile::NamedTempFile::new()?;
let path = file.into_temp_path();
let mut file = tokio::fs::File::create(&path).await?;
tokio::io::copy(&mut bytes, &mut file).await?;
let content = pdf_extract::extract_text(path)?; // this line breaks
Versions:
tokio = { version = "1", features = ["macros"] }
aws-sdk-s3 = "0.21.0"
aws-config = "0.51.0"
pdf-extract = "0.6.4"
I feel like I misunderstood something in how to store the byte stream, but e.g. https://stackoverflow.com/a/62003659/4986655 does it the same way, as far as I can see.
Any help or pointers on what the issue might be or how to debug this are very welcome.
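One way to narrow this down might be to inspect what actually landed on disk before handing the path to pdf_extract. The sketch below uses only the standard library; `inspect_file` and `looks_like_pdf` are hypothetical helper names, not part of any crate used above. It relies on the fact that a PDF file normally starts with the marker `%PDF-`.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical helper: returns the file size in bytes and up to the
/// first 8 bytes of the file at `path`.
fn inspect_file(path: &Path) -> io::Result<(u64, Vec<u8>)> {
    let file = File::open(path)?;
    let size = file.metadata()?.len();
    let mut head = Vec::new();
    // Read at most 8 bytes; a shorter (or empty) file yields fewer.
    file.take(8).read_to_end(&mut head)?;
    Ok((size, head))
}

/// Hypothetical helper: true if the bytes begin with the PDF marker "%PDF-".
fn looks_like_pdf(head: &[u8]) -> bool {
    head.starts_with(b"%PDF-")
}
```

Calling `inspect_file(&path)` right after `tokio::io::copy` and logging the result would show whether the file is empty, shorter than the object in the bucket, or starts with something other than `%PDF-`. An empty or short file would point at the write not having completed, while unexpected leading bytes would point at the wrong object or key being fetched.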