Python 3.4 : XPATH : loop through tr tags and embedded td tags

The tr[2] in the XPath for contentB below retrieves only one tr element. I would like to loop through all of the tr elements in the table and append the content of each td to the list e.

import re

e = []  # collected cell values
# tree is assumed to be the lxml.html document parsed earlier
for i in range(1, 5):
    contentB = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[2]/td[{i}]".format(i=i))[0].text_content().strip()
    if re.match(r'[A-Z]', contentB) is None:
        contentB = int(contentB.replace(',', ''))

    e.append(contentB)

print(e)

Below is a snippet of the HTML I am working with:

<table cellspacing="0" cellpadding="0" border="0" width="100%" class="yfnc_tabledata1" id="yui_3_9_1_9_1434360249110_44"><tbody id="yui_3_9_1_9_1434360249110_43"><tr id="yui_3_9_1_9_1434360249110_42"><td id="yui_3_9_1_9_1434360249110_41"><table cellspacing="0" cellpadding="2" border="0" width="100%" id="yui_3_9_1_9_1434360249110_40"><tbody id="yui_3_9_1_9_1434360249110_39"><tr style="border-top:none;" class="yfnc_modtitle1"><td style="border-top:2px solid #000;" colspan="2"><small><span class="yfi-module-title">Period Ending</span></small></td><th style="border-top:2px solid #000;text-align:right; font-weight:bold" scope="col">Dec 31, 2014</th><th style="border-top:2px solid #000;text-align:right; font-weight:bold" scope="col">Dec 31, 2013</th><th style="border-top:2px solid #000;text-align:right; font-weight:bold" scope="col">Dec 31, 2012</th></tr><tr id="yui_3_9_1_9_1434360249110_38"><td colspan="2" id="yui_3_9_1_9_1434360249110_37">
                        <strong>
                    Total Revenue
                        </strong>
                    </td><td align="right">
                            <strong>
                        31,821,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        30,871,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        29,904,000&nbsp;&nbsp;
                            </strong>
                        </td></tr><tr><td colspan="2">Cost of Revenue</td><td align="right">16,447,000&nbsp;&nbsp;</td><td align="right">16,106,000&nbsp;&nbsp;</td><td align="right">15,685,000&nbsp;&nbsp;</td></tr><tr><td style="height:0;padding:0; border-top:3px solid #333;" colspan="5"><span style="display:block; width:5px; height:1px;"></span></td></tr><tr><td colspan="2">
                        <strong>
                    Gross Profit
                        </strong>
                    </td><td align="right">
                            <strong>
                        15,374,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        14,765,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        14,219,000&nbsp;&nbsp;
                            </strong>
                        </td></tr><tr><td style="height:0;padding:0; " colspan="5"><span style="display:block; width:5px; height:10px;"></span></td></tr><tr>
                <td><spacer width="1" height="1" type="block"></spacer></td>
            <td colspan="4" class="yfnc_d">Operating Expenses</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Research Development</td><td align="right">1,770,000&nbsp;&nbsp;</td><td align="right">1,715,000&nbsp;&nbsp;</td><td align="right">1,634,000&nbsp;&nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Selling General and Administrative</td><td align="right">6,469,000&nbsp;&nbsp;</td><td align="right">6,384,000&nbsp;&nbsp;</td><td align="right">6,102,000&nbsp;&nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Non Recurring</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Others</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td></tr><tr>
                <td><spacer width="1" height="1" type="block"></spacer></td>
            <td class="yfnc_d" style="height:0; padding:0; " colspan="5"><span style="display:block; width:5px; height:1px;"></span></td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Total Operating Expenses</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td></tr><tr><td style="height:0;padding:0; " colspan="5"><span style="display:block; width:5px; height:10px;"></span></td></tr><tr><td style="height:0;padding:0; border-top:3px solid #333;" colspan="5"><span style="display:block; width:5px; height:1px;"></span></td></tr><tr><td colspan="2">
                        <strong>
                    Operating Income or Loss
                        </strong>
                    </td><td align="right">
                            <strong>
                        7,135,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        6,666,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        6,483,000&nbsp;&nbsp;
                            </strong>
                        </td></tr><tr><td style="height:0;padding:0; " colspan="5"><span style="display:block; width:5px; height:10px;"></span></td></tr><tr>
                <td><spacer width="1" height="1" type="block"></spacer></td>
            <td colspan="4" class="yfnc_d">Income from Continuing Operations</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Total Other Income/Expenses Net</td><td align="right">33,000&nbsp;&nbsp;</td><td align="right">41,000&nbsp;&nbsp;</td><td align="right">39,000&nbsp;&nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Earnings Before Interest And Taxes</td><td align="right">7,168,000&nbsp;&nbsp;</td><td align="right">6,707,000&nbsp;&nbsp;</td><td align="right">6,522,000&nbsp;&nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Interest Expense</td><td align="right">142,000&nbsp;&nbsp;</td><td align="right">145,000&nbsp;&nbsp;</td><td align="right">171,000&nbsp;&nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Income Before Tax</td><td align="right">7,026,000&nbsp;&nbsp;</td><td align="right">6,562,000&nbsp;&nbsp;</td><td align="right">6,351,000&nbsp;&nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Income Tax Expense</td><td align="right">2,028,000&nbsp;&nbsp;</td><td align="right">1,841,000&nbsp;&nbsp;</td><td align="right">1,840,000&nbsp;&nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Minority Interest</td><td align="right">(42,000)</td><td align="right">(62,000)</td><td align="right">(67,000)</td></tr><tr>
                <td><spacer width="1" height="1" type="block"></spacer></td>
            <td class="yfnc_d" style="height:0; padding:0; " colspan="5"><span style="display:block; width:5px; height:1px;"></span></td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Net Income From Continuing Ops</td><td align="right">4,956,000&nbsp;&nbsp;</td><td align="right">4,659,000&nbsp;&nbsp;</td><td align="right">4,444,000&nbsp;&nbsp;</td></tr><tr><td style="height:0;padding:0; " colspan="5"><span style="display:block; width:5px; height:10px;"></span></td></tr><tr>
                <td><spacer width="1" height="1" type="block"></spacer></td>
            <td colspan="4" class="yfnc_d">Non-recurring Events</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Discontinued Operations</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Extraordinary Items</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Effect Of Accounting Changes</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td></tr><tr>
                <td width="30" class="yfnc_tabledata1"><spacer height="1" width="30" type="block"></spacer></td>
            <td>Other Items</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td></tr><tr><td style="height:0;padding:0; " colspan="5"><span style="display:block; width:5px; height:10px;"></span></td></tr><tr><td style="height:0;padding:0; border-top:3px solid #333;" colspan="5"><span style="display:block; width:5px; height:1px;"></span></td></tr><tr><td colspan="2">
                        <strong>
                    Net Income
                        </strong>
                    </td><td align="right">
                            <strong>
                        4,956,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        4,659,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        4,444,000&nbsp;&nbsp;
                            </strong>
                        </td></tr><tr><td colspan="2">Preferred Stock And Other Adjustments</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td><td align="right">
        -
        &nbsp;</td></tr><tr><td style="height:0;padding:0; border-top:3px solid #333;" colspan="5"><span style="display:block; width:5px; height:1px;"></span></td></tr><tr><td colspan="2">
                        <strong>
                    Net Income Applicable To Common Shares
                        </strong>
                    </td><td align="right">
                            <strong>
                        4,956,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        4,659,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        4,444,000&nbsp;&nbsp;
                            </strong>
                        </td></tr></tbody></table></td></tr></tbody></table>

There are 2 best solutions below

If I correctly understand what you are asking, you just need to replace tr[2] with tr.

The predicate [2] here restricts you to the second matching tr element; removing it removes that restriction.
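
The difference can be seen in a minimal sketch. To stay dependency-free it uses the standard library's xml.etree.ElementTree in place of lxml (an assumption made for illustration only; positional predicates such as tr[2] behave the same way in both), and the three-row table is a made-up stand-in for the real page:

```python
import xml.etree.ElementTree as ET

# Hypothetical three-row table standing in for the real page.
SAMPLE = """
<table>
  <tr><td>Total Revenue</td><td>31,821,000</td></tr>
  <tr><td>Cost of Revenue</td><td>16,447,000</td></tr>
  <tr><td>Gross Profit</td><td>15,374,000</td></tr>
</table>
"""

table = ET.fromstring(SAMPLE)

# With the positional predicate, only the cells of the second tr are selected.
second_row_only = table.findall("./tr[2]/td")

# Without it, the cells of every tr are selected, in document order.
all_rows = table.findall("./tr/td")

print([td.text for td in second_row_only])
print([td.text for td in all_rows])
```

The first list holds only the two cells of the "Cost of Revenue" row; the second holds all six cells.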

EDITED

To extract the text content of the table cells, you can modify your code as follows:

for i in range(1, 5):
    # list of cells in column i of the table
    collist = tree.xpath("//table[@class='yfnc_tabledata1']//table//tr/td[{i}]".format(i=i))
    contentB = [c.text_content().strip() for c in collist]
    # contentB is now a list in which each element is the text of one of the
    # cells in column i of the table

    # continue processing per your desired result...

I'm not sure whether the previous code snippet answered your question. If not, here is my solution. Note the additional 'tbody' elements, which were not included in your original XPath.

import re
import lxml.html  # "import lxml" alone does not load the html submodule

tree = lxml.html.parse("stack-tmp.html")
e = []
rows = tree.xpath('//table[@class="yfnc_tabledata1"]/tbody/tr[1]/td/table/tbody/tr')
for row in rows:
    for td in row.xpath('./td'):
        thistext = td.text_content().strip()
        if not thistext:
            continue  # skip spacer and empty cells
        if re.match(r'[A-Z]', thistext) is None:
            try:
                e.append(int(thistext.replace(',', '')))
            except ValueError:
                # cells such as "-" or "(42,000)" are not plain integers; skip them
                pass
        else:
            e.append(thistext)

print(e)

Which extracts the following items:

['Period Ending', 
'Total Revenue', 31821000, 30871000, 29904000, 
'Cost of Revenue', 16447000, 16106000, 15685000,
'Gross Profit', 15374000, 14765000, 14219000,
'Operating Expenses',
'Research Development', 1770000, 1715000, 1634000,
'Selling General and Administrative', 6469000, 6384000, 6102000,
'Non Recurring',
'Others',
'Total Operating Expenses',
'Operating Income or Loss', 7135000, 6666000, 6483000,
'Income from Continuing Operations', 
'Total Other Income/Expenses Net', 33000, 41000, 39000, 
'Earnings Before Interest And Taxes', 7168000, 6707000, 6522000, 
'Interest Expense', 142000, 145000, 171000, 
'Income Before Tax', 7026000, 6562000, 6351000, 
'Income Tax Expense', 2028000, 1841000, 1840000, 
'Minority Interest', 
'Net Income From Continuing Ops', 4956000, 4659000, 4444000, 
'Non-recurring Events', 
'Discontinued Operations', 
'Extraordinary Items', 
'Effect Of Accounting Changes', 
'Other Items', 
'Net Income', 4956000, 4659000, 4444000, 
'Preferred Stock And Other Adjustments', 
'Net Income Applicable To Common Shares', 4956000, 4659000, 4444000]
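
In the list above, rows such as 'Minority Interest' have no figures because cells like "(42,000)" and "-" are not plain integers. If those values are needed, one possible approach is a small conversion helper. This is only a sketch: parse_cell is a hypothetical name, and treating parentheses as negative numbers assumes the usual accounting notation applies to this table.

```python
import re

def parse_cell(text):
    """Convert one cell's text to an int, None, or a label string (hypothetical helper)."""
    # Normalise non-breaking spaces, then trim surrounding whitespace.
    cleaned = text.replace('\xa0', ' ').strip()
    if cleaned in ('', '-'):
        return None  # empty cell or "-" placeholder
    # Assumption: "(42,000)" is accounting notation for -42000.
    match = re.fullmatch(r'\((\d[\d,]*)\)', cleaned)
    if match:
        return -int(match.group(1).replace(',', ''))
    # Plain figures with thousands separators, e.g. "31,821,000".
    if re.fullmatch(r'\d[\d,]*', cleaned):
        return int(cleaned.replace(',', ''))
    return cleaned  # anything else is treated as a label

print(parse_cell('31,821,000\xa0\xa0'))
print(parse_cell('(42,000)'))
print(parse_cell('-\n        \xa0'))
print(parse_cell('Minority Interest'))
```

Calling parse_cell(td.text_content()) for each cell would then keep a None placeholder for "-" entries instead of dropping them, so every row retains the same number of values.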