How to extract specific Tables from multiple PDFs in Python

I have a data bank of PDF files that I downloaded through web scraping. I can extract the tables from these PDF files and display them in a Jupyter notebook like this:

import os
import camelot.io as camelot

pdf_dir = r'D:\Test'  # raw string, so the backslash is not read as an escape
n = 1

arr = os.listdir(pdf_dir)  # arr is the list of PDF file names
for item in arr:
    # listdir returns bare file names, so join each one with the folder path
    tables = camelot.read_pdf(os.path.join(pdf_dir, item), pages='all', split_text=True)
    print(f'''DATENBLATT {n}: {item}

    ''')
    n += 1
    for tabs in tables:
        print(tabs.df, "\n==============================================================================\n")

This way I get the following results for two PDF files in the data bank:

(PDF1, PDF2)

Now I would like to ask how I can get only specific data from the tables, for example the rows that contain "Voltage" and "Current" information. More specifically, I would like to extract user-defined or targeted values and make charts from them instead of printing the tables as a whole.

Thanks in advance.

DATENBLATT 1: HY-Energy-Plus-Peak-Pack-HYP-00-2972-R2.pdf

    
                         0                                                  1
0              Part Number                                        HYP-00-2972
1          Voltage Nominal                                              51.8V
2    Voltage Range Min/Max                                        43.4V/58.1V
3           Charge Current  160A maximum \nDe-rated by BMS message over CA...
4        Discharge Current  300A maximum \nDe-rated by BMS message over CA...
5         Maximum Capacity                                    5.76kWh/111.4Ah
6   Maximum Energy Density                                           164Wh/kg
7         Useable capacity         Limited to 90% by BMS to improve cell life
8               Dimensions                       W: 243 x L: 352 x H: 300.5mm
9                   Weight                                               37kg
10       Mounting Fixtures     4x M8 mounting points for easy secure mounting
11                                                                            
==============================================================================

                                 0  \
0           Communication Protocol   
1             Reported Information   
2        Pack Protection Mechanism   
3                 Balancing Method   
4             Multi-Pack Behaviour   
5  Compatible Chargers as standard   
6                  Charger Control   
7             Auxiliary Connectors   
8                 Power connectors   
9                                    

                                                   1  
0  CAN bus at user selectable baud rate (propriet...  
1  Cell Temperatures and Voltages, Pack Current, ...  
2  Interlock to control external protection devic...  
3          Actively controlled dissipative balancing  
4  BMS implements a single master and multi-slave...  
5  Zivan, Victron, Delta-Q, TC-Charger, SPE. For ...  
6  Direct current control based on cell voltage/t...  
7              Binder 720-Series 8-way male & female  
8  4x Amphenol SurLok Plus 8mm \nWhen using batte...  
9                                                      
==============================================================================

                              0  \
0     Max no of packs in series   
1  Max Number of Parallel Packs   
2  External System Requirements   
3                                 

                                                   1  
0                                                 10  
1                                                127  
2  External Protection Device (e.g. Contactor) co...  
3                                                      
==============================================================================

DATENBLATT 2: HY-Energy-Standard-Pack-HYP-00-2889-R2.pdf

    
                         0                                                  1
0              Part Number                                        HYP-00-2889
1          Voltage Nominal                                              44.4V
2    Voltage Range Min/Max                                        37.2V/49.8V
3           Charge Current  132A maximum \nDe-rated by BMS message over CA...
4        Discharge Current  132A maximum \nDe-rated by BMS message over CA...
5         Maximum Capacity                                      4.94kWh/111Ah
6   Maximum Energy Density                                           152Wh/kg
7         Useable capacity         Limited to 90% by BMS to improve cell life
8               Dimensions                         W: 243 x L: 352 x H: 265mm
9                   Weight                                               32kg
10       Mounting Fixtures     4x M8 mounting points for easy secure mounting
11                                                                            
==============================================================================

                                 0  \
0           Communication Protocol   
1             Reported Information   
2        Pack Protection Mechanism   
3                 Balancing Method   
4             Multi-Pack Behaviour   
5  Compatible Chargers as standard   
6                  Charger Control   
7             Auxiliary Connectors   
8                 Power connectors   
9                                    

                                                   1  
0  CAN bus at user selectable baud rate (propriet...  
1  Cell Temperatures and Voltages, Pack Current, ...  
2  Interlock to control external protection devic...  
3          Actively controlled dissipative balancing  
4  BMS implements a single master and multi-slave...  
5  Zivan, Delta-Q, TC-Charger, SPE, Victron, Bass...  
6  Direct current control based on cell voltage/t...  
7              Binder 720-Series 8-way male & female  
8  4x Amphenol SurLok Plus 8mm \nWhen using batte...  
9                                                      
==============================================================================

                              0  \
0     Max no of packs in series   
1  Max Number of Parallel Packs   
2  External System Requirements   
3                                 

                                                   1  
0                                                 12  
1                                                127  
2  External Protection Device (e.g. Contactor) co...  
3                                                      
==============================================================================
There is 1 solution below

You can define a list of the strings of interest, then select only the tables that contain at least one of these strings:

import os
import camelot.io as camelot

pdf_dir = r'D:\Test'  # raw string, so the backslash is not read as an escape
n = 1

# define your strings of interest (lowercase, because the table text is lowercased below)
interesting_strings = ["voltage", "current"]

arr = os.listdir(pdf_dir)  # arr is the list of PDF file names
for item in arr:
    # listdir returns bare file names, so join each one with the folder path
    tables = camelot.read_pdf(os.path.join(pdf_dir, item), pages='all', split_text=True)
    print(f'''DATENBLATT {n}: {item}

    ''')
    n += 1
    for tabs in tables:
        # select only tables which contain at least one of the interesting strings
        if any(s in tabs.df.to_string().lower() for s in interesting_strings):
            print(tabs.df, "\n==============================================================================\n")

If you want to search for the interesting strings only in specific places (for example, in the first column), you can use the pandas DataFrame indexers, such as iloc. Note that iloc[0] would select the first row; the first column is iloc[:, 0]:

any(s in tabs.df.iloc[:, 0].to_string().lower() for s in interesting_strings)
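
To go one step further than selecting whole tables, the same first-column idea can be applied row by row, and a number can then be pulled out of each matching value so it can be charted. The following is only a minimal sketch under some assumptions: the helper names select_rows and first_number are hypothetical, the sample DataFrame is typed in by hand from the first table of DATENBLATT 1 in the question (standing in for tabs.df), and taking the first number in a cell is a simplification that ignores units and ranges.

```python
import re
import pandas as pd

# Same keywords as in the answer above.
interesting_strings = ["voltage", "current"]

def select_rows(df, keywords):
    """Return the rows whose first column contains any keyword (case-insensitive)."""
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = df.iloc[:, 0].astype(str).str.contains(pattern, case=False, regex=True)
    return df[mask]

def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", str(text))
    return float(match.group()) if match else None

# Hand-typed stand-in for tabs.df, using rows shown in the question's output.
sample = pd.DataFrame([
    ["Part Number", "HYP-00-2972"],
    ["Voltage Nominal", "51.8V"],
    ["Voltage Range Min/Max", "43.4V/58.1V"],
    ["Maximum Capacity", "5.76kWh/111.4Ah"],
    ["Weight", "37kg"],
])

rows = select_rows(sample, interesting_strings)
# Map each matching label to the first number in its value cell.
values = {
    label: first_number(value)
    for label, value in zip(rows.iloc[:, 0], rows.iloc[:, 1])
}
print(values)  # {'Voltage Nominal': 51.8, 'Voltage Range Min/Max': 43.4}
```

In the loop above, tabs.df would take the place of sample. Collecting one such dictionary per PDF gives plain numbers that can be handed to a plotting library; for example, pd.Series(values).plot.bar() should draw a bar chart if matplotlib is installed. Cells that hold a range, such as "43.4V/58.1V", would need their own parsing rule if both ends matter.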