I am using tabula-py to read my class timetable PDF file in Python, and the return value 'data' contains a lot of 'NaN' values that I cannot seem to clean up. Can someone suggest a solution? Should I be using something other than tabula-py? I've attached a link to a picture of the PDF. I have redacted some information from the PDF for privacy.
My code is as follows:
import tabula

class ClassTimetable:
    def __init__(self, filename):
        self.filename = filename

    def read_data(self):
        data = tabula.read_pdf(self.filename, pages='all')
        # data1 = tabula.convert_into(self.filename, output_format="csv", output_path='file.csv')
        print(data)
My output is as follows:
[ Course Course Regn. ... Unnamed: 2 Room
0 Code Title Credit Type ... GCR Code No.
1 Critical and NaN ... NaN NaN
2 1 18PDM202L Creative 0 ... A- wubaing
3 Thinking Skills NaN ... ISOLATED NaN
4 Management NaN ... NaN NaN
5 2 18PDH102T Principles for 2 ... A- NaN
6 Engineers NaN ... COMBINED NaN
7 Professional Lab3 18EEC206J Analog Electronics 4 ... B boc5om
8 Generation, NaN ... NaN NaN
9 4 18EEC208T Transmission & 3 NaN ... NaN NaN
10 Distribution NaN ... C 4qjaetp
11 Numerical NaN ... NaN NaN
12 5 18MAB202T Methods for Engineers 4 ... D vvbxlqp
13 6 18EEC205J Electrical Machines II 4 ... E drcfega
14 7 18BTB101T Biology 2 ... F NaN
15 Electrical and NaN ... NaN NaN
16 Electronics NaN ... NaN NaN
17 8 18EEC207J Measurements and 4 ... G koed72
18 Instrumentation NaN ... NaN NaN
19 9 18EEC205J Electrical Machines II 4 ... P7-P8- drcfega
20 NaN NaN ... NaN NaN
21 10 18EEC206J Analog Electronics 4 ... P3-P4- boc5om
22 Electrical and NaN ... NaN NaN
23 Electronics NaN ... NaN NaN
24 11 18EEC207J Measurements 4 ... NaN NaN
25 and NaN ... P19-P20- NaN
26 Instrumentation NaN ... NaN NaN
27 Total 23 NaN ... NaN NaN
[28 rows x 8 columns]]
Also, what does '...' mean in the output?
I figured it out. I realised the problem was that the library was not reading the separations between the lines properly, so I set 'lattice=True'. That solved about 50% of my problem, and I realised the program needed more specific instructions.
I downloaded Tabula for Windows and used it to find the coordinates of the entire table and of the separate columns. I fed those values into tabula-py through the 'area=' and 'columns=' options. I realise using both options is probably overkill, but after converting to .csv, all my data is neatly placed in separate columns with no 'NaN' values. Attaching my code below:
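A minimal sketch of what such a call could look like; this is not the exact code used here. The coordinates are placeholders (the real ones come from the Tabula desktop app), the helper names build_options and timetable_to_csv are hypothetical, and passing guess=False alongside an explicit area is an assumption. Whether lattice=True should stay on once 'area' and 'columns' are given is something to check against your own PDF.

```python
def build_options(area, columns, lattice=True):
    """Collect keyword arguments for tabula.read_pdf / tabula.convert_into.

    area    -- (top, left, bottom, right) of the table, in PDF points
    columns -- x coordinates of the column boundaries, in PDF points
    """
    top, left, bottom, right = area
    if not (top < bottom and left < right):
        raise ValueError("area must be given as (top, left, bottom, right)")
    return {
        "pages": "all",
        "lattice": lattice,          # use the ruling lines drawn in the table
        "guess": False,              # assumption: do not auto-detect the table area
        "area": list(area),
        "columns": sorted(columns),  # boundaries in left-to-right order
    }


def timetable_to_csv(pdf_path, csv_path, options):
    """Write the extracted table to a CSV file (requires tabula-py and Java)."""
    import tabula  # imported here so build_options works without tabula installed

    tabula.convert_into(pdf_path, csv_path, output_format="csv", **options)


# Placeholder coordinates, NOT measured from the real timetable.
AREA = (100.0, 30.0, 500.0, 580.0)
COLUMNS = [60.0, 130.0, 300.0, 340.0, 400.0, 470.0, 530.0]

options = build_options(AREA, COLUMNS)
# timetable_to_csv("timetable.pdf", "file.csv", options)  # hypothetical file names
```

Keeping the options in one dictionary makes it easy to try the same settings with tabula.read_pdf first and only write the CSV once the columns look right.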
Output, as follows:
I still don't know what '...' means.
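The '...' in the printed output is most likely pandas abbreviating the display: when a DataFrame has more columns than the 'display.max_columns' option allows, the middle columns are hidden and replaced by a '...' column, and a summary such as '[28 rows x 8 columns]' is printed underneath. The data itself is not affected. A small sketch with a made-up frame (the column names and values are placeholders, not the timetable data):

```python
import pandas as pd

# A made-up frame that is wide enough to be abbreviated when printed.
frame = pd.DataFrame(
    [[row * 30 + col for col in range(30)] for row in range(3)],
    columns=[f"col{col}" for col in range(30)],
)

# With a small column limit, pandas hides the middle columns behind '...'.
with pd.option_context("display.max_columns", 4):
    truncated = repr(frame)

# With the limit removed, every column is printed (wrapped over several lines if needed).
with pd.option_context("display.max_columns", None):
    full = repr(frame)

print(truncated)
print(full)
```

To see every column of the tabula result, the same option can be set once with pd.set_option("display.max_columns", None) before printing.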