Extracting info between two tags with re


I need to parse a text file containing XML. The usual XML parsers don't work with it (I tried), so I decided to do this with Python's re module. The problem is that I can't find the right pattern to get everything between <w> and </w> without including the tags themselves.

The closest pattern I've got is r"((<w>).+(</w>))", but it has two problems:

  1. It includes the closing tags.
  2. It matches EVERYTHING from the first <w> to the last </w>, which is not what I need. I need every tagged word, with all its features and tags, as a separate element.
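Both problems look like the result of a greedy quantifier and of placing the tags inside the captured text. Here is a minimal sketch of a non-greedy alternative, assuming the <w> elements are never nested. The variable names and the shortened sample string are made up for illustration.

```python
import re

# Shortened, hypothetical sample in the same shape as the real data.
sample = (
    '<w><ana lex="об" gr="PR"/>об</w> '
    '<w><ana lex="это" sem="r:dem" gr="S-PRO n sg loc"/>этом</w>, '
    '<w><ana lex="Г" gr="INIT abbr"/>Г</w>.'
)

# Greedy: .+ runs on to the LAST </w>, so the whole line is one match.
greedy = re.findall(r"<w>.+</w>", sample)

# Non-greedy: .+? stops at the FIRST </w> it can reach. Only the capture
# group is returned, so the <w> and </w> tags themselves are left out.
# re.DOTALL lets an element span a line break, should that ever occur.
words = re.findall(r"<w>(.+?)</w>", sample, flags=re.DOTALL)

print(len(greedy))  # a single match covering all three words
for word in words:
    print(word)     # one <w> element's content per line
```

With re.findall, a pattern that contains exactly one capture group returns only that group, which is why the surrounding tags disappear from the results.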

Here's the text sample:

<p><se><w><ana lex="сообщить" sem="t:speech ca:noncaus d:pref" gr="V pf tran ger act praet"/>Сообщив</w> <w><ana lex="я" sem="r:pers" gr="S-PRO sg 1p dat"/>мне</w> <w><ana lex="об" gr="PR"/>об</w> <w><ana lex="это" sem="r:dem" gr="S-PRO n sg loc"/>этом</w>, <w><ana lex="Говоров" sem="t:hum r:propn t:famn" gr="S famn m anim sg nom"/>Говоров</w> <w><ana lex="приняться" sem="der:v ca:noncaus d:pref" gr="V pf intr med m sg praet indic"/>принялся</w> <w><ana lex="яростно" sem="ev dt:humq der:a" gr="ADV"/>яростно</w> <w><ana lex="колотить" sem="t:impact ca:noncaus t:sound" gr="V ipf tran inf act"/>колотить</w> <w><ana lex="на" gr="PR"/>на</w> <w><ana lex="компьютер" sem="t:tool:device r:concr" gr="S m inan sg loc"/>компьютере</w> <w><ana lex="гневный" sem="t:psych:emot r:rel der:s" gr="A n sg acc inan plen"/>гневное</w> <w><ana lex="письмо" sem="r:concr t:text" gr="S n inan sg acc"/>письмо</w> <w><ana lex="председатель" sem="t:hum r:concr d:nag" gr="S m anim sg dat"/>председателю</w> <w><ana lex="гильдия" sem="t:group pt:set r:concr sc:hum hi:class" gr="S f inan sg gen"/>Гильдии</w> <w><ana lex="российский" sem="dt:topon r:rel der:s" gr="A pl gen plen"/>российских</w> <w><ana lex="адвокат" sem="t:hum r:concr" gr="S m anim pl gen"/>адвокатов</w> <w><ana lex="Мирзоев" sem="t:hum r:propn t:famn" gr="S famn m anim sg dat"/>Мирзоеву</w> <w><ana lex="Г" gr="INIT abbr"/>Г</w>. <w><ana lex="Б" gr="INIT abbr"/>Б</w>. 
<w><ana lex="Гасан" sem="t:hum r:propn t:famn t:persn" gr="S persn m anim sg nom"/>Гасан</w> <w><ana lex="Борисович" sem="t:hum t:patrn r:propn der:s" gr="S patrn m anim sg nom"/>Борисович</w> <w><ana lex="правовед" sem="der:comp t:hum r:concr t:prof" gr="S m anim sg nom"/>правовед</w> <w><ana lex="известнейший" sem="der:a d:super r:qual" gr="A m sg nom plen"/>известнейший</w>, <w><ana lex="депутат" sem="t:hum r:concr" gr="S m anim sg nom"/>депутат</w> <w><ana lex="государственный" sem="dt:space r:rel der:s" gr="A f sg gen plen"/>Государственной</w> <w><ana lex="дума" sem="t:ment r:abstr" gr="S f inan sg gen"/>думы</w>, <w><ana lex="а" gr="CONJ"/>а</w> <w><ana lex="потому" sem="der:apro r:dem" gr="ADV-PRO"/>потому</w>, <w><ana lex="как" gr="CONJ"/>как</w> <w><ana lex="уверенный" sem="t:ment der:v dt:ment r:qual" gr="A m sg brev"/><ana lex="уверить" sem="t:speech ca:caus d:pref" gr="V pf tran partcp m sg brev pass praet"/>уверен</w> <w><ana lex="судья" sem="der:v t:hum r:concr d:nag" gr="S m anim sg nom"/>судья</w>, <w><ana lex="понять" sem="t:ment ca:noncaus der:v d:pref" gr="V pf tran sg act fut 3p indic"/>поймёт</w> <w><ana lex="и" gr="CONJ"/>и</w> <w><ana lex="его" sem="r:poss" gr="A-PRO"/>его</w>   <w><ana lex="гнев" sem="r:abstr" gr="S m inan sg acc"/>гнев</w> , <w><ana lex="и" gr="CONJ"/>и</w> <w><ana lex="его" sem="r:poss" gr="A-PRO"/>его</w> <w><ana lex="беспомощность" sem="t:humq r:abstr der:a" gr="S f inan sg acc"/>беспомощность</w> <w><ana lex="перед" gr="PR"/>перед</w> <w><ana lex="адвокатский" sem="r:rel der:s dt:hum" gr="A pl ins plen"/>адвокатскими</w> <w><ana lex="сюрприз" sem="r:abstr" gr="S m inan pl ins"/>сюрпризами</w> <w><ana lex="и" gr="CONJ"/>и</w>
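To get "every tagged word with all its features" out of text shaped like this sample, each <w> element could be split further into its surface form and its <ana> analyses. The following is only a sketch under assumptions read off the sample: <w> is never nested, every <ana> is self-closing, and attribute values are always double-quoted. The names parse_words, WORD_RE, ANA_RE and ATTR_RE are hypothetical.

```python
import re

WORD_RE = re.compile(r"<w>(.+?)</w>", re.DOTALL)  # content of one <w> element
ANA_RE = re.compile(r"<ana\s+([^>]*?)/>")         # attribute string of one <ana/>
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')          # a single name="value" pair


def parse_words(text):
    """Return one dict per <w> element: its surface form and its analyses."""
    result = []
    for inner in WORD_RE.findall(text):
        # A word may carry several <ana/> tags (ambiguous analyses).
        analyses = [dict(ATTR_RE.findall(attrs)) for attrs in ANA_RE.findall(inner)]
        # Whatever remains after removing the <ana/> tags is the word itself.
        form = ANA_RE.sub("", inner).strip()
        result.append({"form": form, "analyses": analyses})
    return result


# Two elements copied from the sample; the second has two analyses.
sample = (
    '<w><ana lex="как" gr="CONJ"/>как</w> '
    '<w><ana lex="уверенный" sem="t:ment der:v dt:ment r:qual" gr="A m sg brev"/>'
    '<ana lex="уверить" sem="t:speech ca:caus d:pref" '
    'gr="V pf tran partcp m sg brev pass praet"/>уверен</w>'
)

parsed = parse_words(sample)
for word in parsed:
    print(word["form"], [a["lex"] for a in word["analyses"]])
```

Punctuation and whitespace that sit between the <w> elements are ignored by this sketch. If they matter, they would need a separate pattern.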