I have this method SearchPdf(string path, string keyword), where path is the folder that contains all the PDF files to search and keyword is the keyword to look for in each PDF's text or file name.
I'm using Spire.Pdf to read the PDFs.
Here is the method:
    public static ConcurrentBag<KeyValuePair<string, string>> SearchPdf(string path, string keyword)
    {
        var results = new ConcurrentBag<KeyValuePair<string, string>>();
        var directory = new DirectoryInfo(path);
        var files = directory.GetFiles("*.pdf", SearchOption.AllDirectories);
        Parallel.ForEach(files, file =>
        {
            // Open the PDF file
            var document = new PdfDocument(file.FullName);
            Console.WriteLine("\n\rSearching for: " + keyword + " in file: " + file.Name + "\n\r");
            // Iterate over the pages of the document
            for (int i = 0; i < document.Pages.Count; i++)
            {
                // Extract the text of the page
                var page = document.Pages[i];
                var text = page.ExtractText();
                // Search for the keyword
                keyword = keyword.ToLower().Trim();
                if (text.ToLower().Contains(keyword) || file.Name.ToLower().Trim().Contains(keyword) || file.FullName.ToLower().Trim().Contains(keyword))
                {
                    results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
                }
            }
        });
        return results;
    }
Everything works fine, but when I have more than 200 keywords to search for and more than 1500 files it's a bit slow. Is there anything I can do to optimize this loop?
You are loading and processing every PDF again for each keyword. I think it would be much more efficient to load each file once and check it against all the keywords.
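A minimal sketch of that idea, assuming the method is changed to accept the whole keyword list (the `keywords` parameter and the `PdfSearch` class name are hypothetical; the Spire.Pdf calls are the ones already used in the question). Keywords are assumed to be trimmed by the caller.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Spire.Pdf;

public static class PdfSearch
{
    // Hypothetical variant of SearchPdf that takes every keyword at once.
    public static ConcurrentBag<KeyValuePair<string, string>> SearchPdf(
        string path, IReadOnlyCollection<string> keywords)
    {
        var results = new ConcurrentBag<KeyValuePair<string, string>>();
        var files = new DirectoryInfo(path).GetFiles("*.pdf", SearchOption.AllDirectories);

        Parallel.ForEach(files, file =>
        {
            // Each PDF is opened exactly once, however many keywords there are.
            var document = new PdfDocument(file.FullName);
            for (int i = 0; i < document.Pages.Count; i++)
            {
                // Extract the page text once and test every keyword against it.
                var text = document.Pages[i].ExtractText();
                foreach (var keyword in keywords)
                {
                    if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                    {
                        results.Add(new KeyValuePair<string, string>(keyword, file.FullName));
                    }
                }
            }
        });
        return results;
    }
}
```

With 200 keywords this opens each of the 1500 files once instead of 200 times. Like the original, it can add the same keyword/file pair more than once when the keyword appears on several pages.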
The next thing you can optimize is to stop searching early. Since you only care whether a keyword was found in the file, not on which page, break out as soon as the keyword has been found (and/or once all keywords have been found), for example by maintaining a local HashSet of found keywords.
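One way this could look, as a sketch rather than a drop-in replacement: `KeywordScan.FindKeywords` is a hypothetical helper that takes the page texts of a single file as a lazy sequence, so pages after the early exit are never extracted. It assumes the keywords are distinct; with duplicates it still returns the right set but never exits early.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordScan
{
    // Hypothetical helper: returns the keywords that occur in at least one page.
    public static HashSet<string> FindKeywords(
        IEnumerable<string> pageTexts, IReadOnlyCollection<string> keywords)
    {
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (var text in pageTexts)
        {
            foreach (var keyword in keywords)
            {
                // Skip keywords that were already found on an earlier page.
                if (found.Contains(keyword)) continue;
                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                    found.Add(keyword);
            }
            // Every keyword has been seen: no need to read the remaining pages.
            if (found.Count == keywords.Count) break;
        }
        return found;
    }
}
```

To benefit from the early exit, the page texts would have to be produced on demand, for instance by an iterator method that yields document.Pages[i].ExtractText() one page at a time. The caller then adds one result per returned keyword, which also avoids duplicate entries for the same file.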
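Then optimize the search itself (as suggested in the comments): there is no need to create a bunch of temporary strings with ToLower and put pressure on the GC. Instead of text.ToLower().Contains(keyword), just use a case-insensitive comparison that takes a StringComparison argument.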
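A small sketch of such a comparison (`TextMatch.ContainsIgnoreCase` is a hypothetical helper name). It uses IndexOf because that overload exists on every .NET version; on .NET Core 2.1 and later, text.Contains(keyword, StringComparison.OrdinalIgnoreCase) does the same thing.

```csharp
using System;

public static class TextMatch
{
    // Case-insensitive substring test that allocates no lowered copy of the text.
    public static bool ContainsIgnoreCase(string text, string keyword)
    {
        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Note that OrdinalIgnoreCase ignores the current culture, whereas ToLower() uses it, so results may differ for a few culture-specific characters. If that matters, StringComparison.CurrentCultureIgnoreCase is the closer match.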
Also consider performing the file name and full path checks before the full-text search (possibly even before loading the file or page).
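As a rough illustration, the path check could be done up front so that only the remaining keywords need a full-text search. `PathPrefilter.Split` is a hypothetical helper; it checks only the full path, on the assumption that the full path already ends with the file name.

```csharp
using System;
using System.Collections.Generic;

public static class PathPrefilter
{
    // Hypothetical helper: splits keywords into those already matched by the file path
    // and those that still require a full-text search of the PDF.
    public static void Split(
        string fullPath, IEnumerable<string> keywords,
        List<string> matchedByPath, List<string> needTextSearch)
    {
        foreach (var keyword in keywords)
        {
            // The full path includes the file name, so one check covers both.
            if (fullPath.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                matchedByPath.Add(keyword);
            else
                needTextSearch.Add(keyword);
        }
    }
}
```

If needTextSearch ends up empty for a file, the PDF does not need to be opened at all.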