In Python, I would like to display only the Lao characters in this HTML (only the contents of the "textarea" tag):
<font color="Red">ພິມຄໍາສັບລາວ ຫຼື ອັງກິດແລ້ວກົດປຸ່ມຄົ້ນຫາ - Enter English or Lao Then Hit Search</font><br />
<center><table id='display' border='0' width='100%'>
<tr>
<td id='lao2' colspan='3' style='height: 18px; text-align: left'>
<span style='color: #660033'><span style='font-size: 12pt'> </span></span>
</td>
</tr>
<tr>
<td style='width: 120px'> </td>
<td style='width: 192px'>
<textarea ID='lao' Font-Name='Phetsarath OT' Font-Size='12' rows='10' cols='84' readonly='readonly'>
1. (loved, loving)
1. ຮັກ
2. ມັກຫຼາຍ
3. would love ຢາກໄດ້ຫຼາຍ, ຢາກເຮັດຫຼາຍ
ປະເພດ: ຄໍາກໍາມະ
ການອອກສຽງ: ເລັຟ
2.
1. ຄວາມຮັກ
2. ຄົນຮັກ, ຄູ່ຮັກ, ສິ່ງທີ່ເຈົ້າຮັກ
3. ທີ່ຮັກ, (ເທັນນິດ) ສູນ
be in love with ຮັກຜູ້ໃດຜູ້ໜຶ່ງ
make love ຮ່ວມປະເວນີ
ປະເພດ: ຄຳນາມ
ການອອກສຽງ: ເລັຟ
</textarea>
</td>
<td style='width: 284px'> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td id='lao1' align='center'>ກະຊວງ ໄປສະນີ, ໂທລະຄົມມະນາຄົມ ແລະ ການສື່ສານ</td><td> </td>
</tr>
<tr>
<td> </td>
<td id='lao1' align='center'>ສູນບໍລິຫາລັດດ້ວຍເອເລັກໂຕຣນິກ</td><td> </td>
</tr>
</table></center><br />
I just want the value in the "textarea". What should I do?
Don't use a regular expression. Use an HTML parser. BeautifulSoup makes the task easy:
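A minimal sketch of the parsing step, assuming the `bs4` package is installed and using a shortened stand-in for the page in the question (the `html` string below is illustrative, not the full page):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (an assumption for illustration).
html = """
<table id='display'>
<tr><td>
<textarea ID='lao' rows='10' cols='84' readonly='readonly'>
1. (loved, loving)
1. ຮັກ
2. ມັກຫຼາຍ
</textarea>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab the first <textarea> element and pull out its text content.
textarea = soup.find("textarea")
text = textarea.get_text()
print(text)
```

At this point `text` still holds both the English and the Lao parts of the entry.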
If you then need to limit the result to just Lao characters, you can further process the text variable.
However, the Python re module isn't that strong (yet) when it comes to Unicode. Your options are to use a regular expression to just grab code points in the range U+0E80–U+0EFF, use the unicodedata module and filter on the Unicode code point name, or use the regex library to only match Lao characters.
Using a regular expression:
Demo:
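A minimal sketch of the code point range approach, assuming Python 3 and a short sample string in place of the extracted text variable (the name `lao_runs` is only illustrative):

```python
import re

# Sample standing in for the text extracted from the textarea.
text = "3. would love ຢາກໄດ້ຫຼາຍ, ຢາກເຮັດຫຼາຍ\nປະເພດ: ຄໍາກໍາມະ"

# The Lao Unicode block spans U+0E80 to U+0EFF; match runs of those code points.
lao_runs = re.findall(r"[\u0e80-\u0eff]+", text)
print(lao_runs)
```

Punctuation, digits and spaces fall outside the block, so each Lao word or phrase comes back as its own list item.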
Using the unicodedata module:
Demo:
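A minimal sketch of filtering on the code point name, assuming that every character you want to keep has a Unicode name starting with "LAO " (the helper `is_lao` is a hypothetical name):

```python
import unicodedata

# Sample standing in for the text extracted from the textarea.
text = "make love ຮ່ວມປະເວນີ"

def is_lao(ch):
    # Pass a default of "" so unnamed code points (e.g. control characters)
    # do not raise ValueError.
    return unicodedata.name(ch, "").startswith("LAO ")

lao_only = "".join(ch for ch in text if is_lao(ch))
print(lao_only)
```

This works character by character, so it drops the spaces between Lao words as well as the English text.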
Using the regex module:
Demo:
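A minimal sketch using the third-party `regex` package (installed separately, for example with `pip install regex`), assuming its Unicode script property support; the sample string again stands in for the extracted text:

```python
import regex  # third-party package, not the standard-library re module

# Sample standing in for the text extracted from the textarea.
text = "be in love with ຮັກຜູ້ໃດຜູ້ໜຶ່ງ"

# \p{Script=Lao} matches any code point whose Unicode script is Lao.
lao_runs = regex.findall(r"\p{Script=Lao}+", text)
print(lao_runs)
```

Matching on the script property avoids hard-coding the code point range.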