Remove html and uuencode from .txt file


I want to process a text file that contains a lot of HTML markup and uuencoded data.

For example, see the .txt file at the following link:

https://www.sec.gov/Archives/edgar/data/1522690/000121390016011794/0001213900-16-011794.txt

I am using the following code:

from bs4 import BeautifulSoup

def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)

with open("C:/EDGAR/forms_to_process/10K/20160322_10-K_edgar_data_1522690_0001213900-16-011794_1.txt") as f:
    lines = f.readlines()

# The with blocks close both files, so no explicit close() calls are needed.
with open("PROCESSED.txt", 'w', encoding='utf-8') as f1:
    for i, line in enumerate(lines, start=1):
        # Parse each line separately and keep only its text.
        soup = BeautifulSoup(line, "lxml")
        print(i, "Initial line: ", line)
        print(i, "Soup get text line: ", soup.get_text())
        bs_line = soup.get_text()
        ascii_line = strip_non_ascii(bs_line)
        print(i, "Ascii line: ", ascii_line)
        f1.write(ascii_line)

which reduces the file from 8.5 MB to 2.5 MB, but it still has a lot of elements I do not need, such as:

</tr>
<tr style="vertical-align: bottom; background-color: #cceeff;">
<td
style="padding: 0px 0px 0px 10pt; text-indent: -10pt;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: right;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: right;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>

And

EXCEL
86
Financial_Report.xlsx
IDEA: XBRL DOCUMENT

begin 644 Financial_Report.xlsx
M4$L#!!0    (  J%=D@6'2-4(0(  $8I   3    6T-O;G1E;G1?5'EP97-=
M+GAM;,W:2V[;,! &X*L8VA86S62DZ(U
MW")I8^#?6):'G!EII&_EJV\/@=+BX(8QK:LNY_"!L=1TY&RJ?:"Q1#8^.IO+
M:=RR8)N=W1(3JY5AC1\SC7F9IQS5]=67/<78M[3X> Q,N=>5#6'H&YM[/[+]
MV)YD7?K-IF^H]M31U1=D.=\L- Z5S]8^2I\@UM[-V07U3X\=[5D89Y3>KZ\%CJTZ%D2>6W=56B
MZ5D53C?^K;/>34,+X_:W'=/Y/U[+R4WM[KY[OWO-QX2FJVJI7898%L;M([5?MWH+T\0ZDC_M D56@2*K0)%5H,@J4&05*+(*%%D%BJP215:)(JM$D56BR"I19)4HLDH4626*
MK!)%5HDBJT*15:'(JE!D52BR*A19%8JL"D56A2*K0I%5HMBJP:15:-(JM&D56CR*I19-4HLAH460V*K 9%5H,BJT&1U:#(:E!D-2BR&A19

Is there a way to remove these and keep only the relevant textual information included in the text file?

EDIT: From the link I provided, one example of text I would like to keep is:

<P STYLE="font: 10pt/normal Times New Roman,serif; margin: 0; text-align: justify">The table above indicates the current yields
to maturity (YTM) for the senior bonds of selected life insurance carriers with durations, on average, that our similar to our
life insurance portfolio.&nbsp; The average yield to maturity of these bonds was 3.02% which, we believe, reflects in part the
financial market&rsquo;s judgement that credit risk is low with regard to these carriers&rsquo; financial obligations. It should
be noted that the obligations of life insurance carriers to pay life insurance policy benefits is senior in rank to any other obligation.&nbsp;
This &ldquo;super senior&rdquo; priority is not reflected in the yield to maturity in the table and, if considered, would result
in a lower yield to maturity all else being equal. As such, as long as the respective premium payments have been made, it is highly
likely that the owner of the insurance policy will collect the insurance policy benefit upon the mortality of the insured.</P>

That is, I would like to remove all the HTML tags and the uuencoded binary data, and keep only the text.

EDIT 2:

Gerrit's response below is definitely very close to what I want to achieve, for the .txt file under consideration at least. But still, it leaves the following part at the end of the file:

Actuarial Pricing Systems, LP Model Actuarial
Pricing Systems, LP  33(Q7.U=JG''<]S7/R,ZG4BCJ0V3TKG/'&I;?V=X:N-K;9;C]RA^O4_EFG
M:==/<^*KESYJ(^GP2")\_*26SQV-%M9T2^ER$N(E=_96.&'X J:]=&,<=*\L\2V
MWB>ZTU9M7LH$M[;D-$5!4'CL3QTKH]*\07E[I&CVUFT;(NYU=))9E+!!&!G@$
M9)RO?O6N(3G%3OKL88:2IRE"SMNCL=X]*7--R3Z'/J"VI>Y=WC\L,/)7RB<9
MSR?>CD8O:)['4%@!D\#UKE_'K!O",S @CS(R#G_:%5AKUS=23VDLUO<03V<[
MI)#"Z!2HZ MPXP>HJ'Q!@?#*UQ_SRM^G_ :TI1:J1OW1E6FG3E;LS)70=)?X
M>KJDR>7>>4S"7>?F8,0!CH<]*W_AU<3/X==)22D<[)%GLN G1LGELK%Y,Q;NN>.3R>^>
MU;5IIIPO=W,*$&I1G;:RM]YUV[C.* V:YBPU'4KQX[33Q:0I;6L#R"56;<77(
M5<'@ #KS4"ZI=P2R0V,5LDL^JR6Y+[B.$SN//7CH./I7+RL[>=;G79J.;?Y;
M>7]_:=OUQQ7+2>([Z&W6;*>2TAG%Z]I)<%28U"KNW;_N9M#%]J
M6U6(=SLC*@("<'!YY S^-)Q:!33T/-/#;Z,NHW2>)(B97; >7.U7R=V[N#[F
MO3=$TO3]+LW73&+6\SF4'?O'(QP?3BN?U:#PGX@L)+\7=M%-LW"='VOTXW+W
M^A%8O@W4;^QT'5)X4\R"V>.0HP) &?W@'OMYKLJIU(.:NMM'M\CBHVI3Y&D;]
M]5^IZAFDWO7[&]EL8U>TAFCA$PC:0J-NYWV@Y8#*C ]Z;%>ZC=Z_I
MC07]M);2V;2OY<;;'PR@D#/7GCTYZURM1VNNZ[=)IKJM@HU'>D8*M^Z*Y.X\\\ \<N>*ZN*020QOO1M
MR@[D/!]Q[5+BUN4I)G@##YC]:3%.;[Q^M)7T9\M<3%:WAO\ Y#D'T;^1K*K4
M\.?\AR#Z-_(UE7_AR-B23R64-RK3D%]RNW0D\9Z=33+?1-'MM9?5HK>X%TQ9B2C[
M06ZD#'^/\ 9%.TO6[IY[:QDMY9B$03W&[.
M&9-^>@&.0*J\M7W8NRTI8_P"BQ(%V,N\*#N'S @L,@_A4
M=QX@U18YW2TAAV6;3[)F.X,'*YZ=.,]J/>[A[BOH6;K1-)NWN#(EZ([EM\T2
M&14=O[Q4=^E33Z=837+W*-?V\LBA9&MS;(GF <#=CJ??K5.;Q#=V=U/$8/M,K
M2X2)"<*!$C, 0.>3QFGR^)+M))!'IZLB-(H+SX/[M0S9&..#^='O#O THH;.
M&^>\1+CSWB6%BR.:J0Z1I4.MR:NL5R;Q\Y9E<@9&#

which seems to be the uuencoding binary part. Any idea how to get rid of this?


There are 2 best solutions below

Gerrit Verhaar On

Instead of filtering out the unwanted text, I would use BeautifulSoup to select only the text you really need. If that text is contained in <p> tags, then:

from bs4 import BeautifulSoup, SoupStrainer

# Parse only the <p> elements and ignore everything else in the file.
only_p_tags = SoupStrainer("p")

with open("C:/EDGAR/forms_to_process/10K/20160322_10-K_edgar_data_1522690_0001213900-16-011794_1.txt") as f:
    soup = BeautifulSoup(f, "html.parser", parse_only=only_p_tags)

for p in soup.find_all("p"):
    print(p.get_text())
Krazick On

When examining an SEC filing, remember that it is composed of header data and one or more files. The files can be of many types, such as HTML, PDF, TXT, JPG, GIF, and ZIP.

Because JPG and GIF files normally contain "non-printable" characters, they are uuencoded and must be decoded to return the file to its "proper" state for normal use.
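As a rough illustration of that decoding step, here is a minimal sketch that turns one uuencoded block (everything from its "begin" line to its "end" line) back into bytes. It relies only on binascii.a2b_uu from the standard library; the function name uudecode is a hypothetical helper, not something defined by the SEC or by any library.

```python
import binascii


def uudecode(block):
    """Decode the lines between a 'begin' line and its 'end' line into bytes."""
    lines = block.splitlines()
    # Locate the header line, e.g. "begin 644 Financial_Report.xlsx".
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("begin ")), None
    )
    if start is None:
        raise ValueError("no uuencode 'begin' line found")

    decoded = bytearray()
    for line in lines[start + 1:]:
        stripped = line.strip()
        if stripped == "end":
            break
        if stripped in ("", "`"):
            # Zero-length data line that normally precedes "end".
            continue
        decoded += binascii.a2b_uu(line)
    return bytes(decoded)
```

If you only want the text of the filing, you do not need to decode these blocks at all; it is enough to recognise them and skip them.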

With your example filing, the Filing Details page (https://www.sec.gov/Archives/edgar/data/1522690/000121390016011794/0001213900-16-011794-index.html) shows there are 8 HTML Pages, two Graphics (jpg), XML and XSD files. If you need to use the RAW "Accession-Number.txt" file that is the complete submission, you must parse out the individual files and perform the uudecode as part of the process.
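One way to do that parsing is sketched below, under some assumptions: each embedded file in the complete submission is wrapped in <DOCUMENT> ... </DOCUMENT>, starts with one-line <TYPE> and <FILENAME> headers, and carries its content between <TEXT> and </TEXT>; a uuencoded payload is recognised by its "begin 644 ..." line. The helper names header_field, split_submission and text_documents are hypothetical, and the regular expressions are a simplification rather than a full parser for the format.

```python
import re

# Assumption: every embedded file is wrapped in <DOCUMENT> ... </DOCUMENT>.
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)
# A uuencoded payload starts with a line such as "begin 644 Financial_Report.xlsx".
UU_BEGIN_RE = re.compile(r"^begin [0-7]{3,4} \S", re.MULTILINE)


def header_field(document, name):
    """Return the value of a one-line header tag such as <TYPE>, or ''."""
    match = re.search(r"<%s>([^\n<]*)" % name, document, re.IGNORECASE)
    return match.group(1).strip() if match else ""


def split_submission(raw):
    """Yield (type, filename, body) for every embedded document."""
    for match in DOCUMENT_RE.finditer(raw):
        document = match.group(1)
        body = TEXT_RE.search(document)
        yield (
            header_field(document, "TYPE"),
            header_field(document, "FILENAME"),
            body.group(1) if body else "",
        )


def text_documents(raw):
    """Keep only the documents whose body is not a uuencoded payload."""
    return [
        (doc_type, filename, body)
        for doc_type, filename, body in split_submission(raw)
        if not UU_BEGIN_RE.search(body)
    ]
```

Each body returned by text_documents(raw) is a whole document, so it can be passed to BeautifulSoup in one piece (for example BeautifulSoup(body, "lxml").get_text()) instead of line by line, which lets tags that span several lines be recognised. You may also want to filter on the document type, for instance keeping only "10-K", depending on which files you consider relevant.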