Regular expression with Lao?

Question

407 Views Asked by Frank Xayachack At 09 May 2013 at 13:51

In Python, I would like to extract only the Lao characters from this HTML, and only those inside the "textarea" tag:

<font color="Red">ພິມຄໍາສັບລາວ ຫຼື ອັງກິດແລ້ວກົດປຸ່ມຄົ້ນຫາ - Enter English or Lao Then Hit Search</font><br />
<center><table id='display' border='0' width='100%'>
  <tr>
    <td id='lao2' colspan='3' style='height: 18px; text-align: left'>
      <span style='color: #660033'><span style='font-size: 12pt'>&nbsp;&nbsp;&nbsp;</span></span>&nbsp;&nbsp;
    </td>
  </tr>
  <tr>
    <td style='width: 120px'>&nbsp;</td>
    <td style='width: 192px'>
      <textarea ID='lao' Font-Name='Phetsarath OT' Font-Size='12' rows='10' cols='84' readonly='readonly'>
    1.  (loved, loving)
      1. ຮັກ
      2. ມັກຫຼາຍ
      3. would love ຢາກໄດ້ຫຼາຍ, ຢາກເຮັດຫຼາຍ
      ປະເພດ: ຄໍາກໍາມະ
      ການອອກສຽງ: ເລັຟ

    2.
      1. ຄວາມຮັກ
      2. ຄົນຮັກ, ຄູ່ຮັກ, ສິ່ງທີ່ເຈົ້າຮັກ
      3. ທີ່ຮັກ, (ເທັນນິດ) ສູນ
      be in love with ຮັກຜູ້ໃດຜູ້ໜຶ່ງ
      make love ຮ່ວມປະເວນີ
      ປະເພດ: ຄຳນາມ
      ການອອກສຽງ: ເລັຟ
      </textarea>
    </td>
    <td style='width: 284px'>&nbsp;&nbsp;</td>
  </tr>
  <tr>
    <td>&nbsp;</td>
    <td>&nbsp;</td>
    <td>&nbsp;</td>
  </tr>
  <tr>
    <td>&nbsp;</td>
    <td id='lao1' align='center'>ກະຊວງ ໄປສະນີ, ໂທລະຄົມມະນາຄົມ ແລະ ການສື່ສານ</td><td>&nbsp;</td>
  </tr>
  <tr>
    <td>&nbsp;</td>
    <td id='lao1' align='center'>ສູນບໍລິຫາລັດດ້ວຍເອເລັກໂຕຣນິກ</td><td>&nbsp;</td>
  </tr>
</table></center><br />

I just want the value in the "textarea". What should I do?

**Martijn Pieters** · Accepted Answer · 2013-05-09T13:53:38.750000

Don't use a regular expression to pull the text out of the HTML; use an HTML parser. BeautifulSoup makes the task easy:

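from bs4 import BeautifulSoup

# htmltext holds the HTML document from the question
soup = BeautifulSoup(htmltext, 'html.parser')
# find the <textarea> whose id is 'lao' and take its text content
text = soup.find('textarea', id='lao').string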
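
If installing BeautifulSoup is not an option, roughly the same extraction can be sketched with the standard library's html.parser module. This is an illustration and not part of the original answer: the class name TextareaExtractor, the helper textarea_text and the shortened sample string are assumptions, and the code targets Python 3.

```python
from html.parser import HTMLParser


class TextareaExtractor(HTMLParser):
    """Collect the text inside the <textarea> that has a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so ID='lao' arrives as 'id'.
        if tag == "textarea" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "textarea":
            self.inside = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect them all.
        if self.inside:
            self.chunks.append(data)


def textarea_text(html, target_id):
    parser = TextareaExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Shortened stand-in for the HTML in the question.
sample = "<td id='lao1'>ກະຊວງ</td><textarea ID='lao' rows='10'>1. ຮັກ\n2. ມັກຫຼາຍ</textarea>"
text = textarea_text(sample, "lao")
```

Only the text inside the matching textarea is collected; the Lao text in the surrounding table cells is ignored.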

If you then need to limit the result to just Lao characters, you can further process the text variable.

However, the Python re module does not (yet) support Unicode script properties such as \p{Lao}. Your options are to use a regular expression that matches code points in the Lao block (U+0E80–U+0EFF), to use the unicodedata module and filter on each character's Unicode name, or to use the third-party regex library, which can match Lao characters by script.

Using a regular expression (the snippets in this answer use Python 2 syntax, such as ur'' literals and the print statement):

import re

lao_codepoints = re.compile(ur'[\u0e80-\u0eff]', re.UNICODE)
lao_text = u''.join(lao_codepoints.findall(text))

Demo:

>>> print u''.join(lao_codepoints.findall(text))
ຮັກມັກຫຼາຍຢາກໄດ້ຫຼາຍຢາກເຮັດຫຼາຍປະເພດຄໍາກໍາມະການອອກສຽງເລັຟຄວາມຮັກຄົນຮັກຄູ່ຮັກສິ່ງທີ່ເຈົ້າຮັກທີ່ຮັກເທັນນິດສູນຮັກຜູ້ໃດຜູ້ໜຶ່ງຮ່ວມປະເວນີປະເພດຄຳນາມການອອກສຽງເລັຟ
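
On Python 3, the same character-class approach might look like the sketch below. The + quantifier is an addition that is not in the answer: it keeps each run of Lao characters together, so the runs can be rejoined with spaces instead of being fused into one string. The names LAO_RUN, lao_runs and lao_spaced, and the shortened sample text, are assumptions for illustration.

```python
import re

# The Lao Unicode block spans U+0E80 to U+0EFF.
# In Python 3, str patterns are Unicode by default, so no u prefix or re.UNICODE flag is needed.
LAO_RUN = re.compile(r"[\u0e80-\u0eff]+")

# Shortened stand-in for the textarea contents.
text = "1. ຮັກ\n2. ມັກຫຼາຍ\n3. would love ຢາກໄດ້ຫຼາຍ"

lao_runs = LAO_RUN.findall(text)   # one list entry per consecutive run of Lao characters
lao_text = "".join(lao_runs)       # same result as joining single-character matches
lao_spaced = " ".join(lao_runs)    # keeps the runs separated

print(lao_spaced)
```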

Using the unicodedata module:

import unicodedata

lao_text = u''.join([ch for ch in text if unicodedata.name(ch, '').startswith('LAO')])

Demo:

>>> print u''.join([ch for ch in text if unicodedata.name(ch, '').startswith('LAO')])
ຮັກມັກຫຼາຍຢາກໄດ້ຫຼາຍຢາກເຮັດຫຼາຍປະເພດຄໍາກໍາມະການອອກສຽງເລັຟຄວາມຮັກຄົນຮັກຄູ່ຮັກສິ່ງທີ່ເຈົ້າຮັກທີ່ຮັກເທັນນິດສູນຮັກຜູ້ໃດຜູ້ໜຶ່ງຮ່ວມປະເວນີປະເພດຄຳນາມການອອກສຽງເລັຟ
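
A Python 3 rendering of the unicodedata filter could look like this. The helper name is_lao and the sample string are assumptions made for the sketch; the test itself is the one from the answer, namely that the character's Unicode name starts with LAO.

```python
import unicodedata


def is_lao(ch):
    # unicodedata.name() raises ValueError for unnamed code points
    # unless a default is given, hence the empty-string fallback.
    return unicodedata.name(ch, "").startswith("LAO")


# Shortened stand-in for the textarea contents.
text = "make love ຮ່ວມປະເວນີ (1)"

lao_text = "".join(ch for ch in text if is_lao(ch))
print(lao_text)
```

This version needs no knowledge of the block boundaries, at the cost of one name lookup per character.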

Using the regex module:

import regex

lao_codepoints = regex.compile(ur'\p{Lao}', regex.UNICODE)
lao_text = u''.join(lao_codepoints.findall(text))

Demo:

>>> print u''.join(lao_codepoints.findall(text))
ຮັກມັກຫຼາຍຢາກໄດ້ຫຼາຍຢາກເຮັດຫຼາຍປະເພດຄໍາກໍາມະການອອກສຽງເລັຟຄວາມຮັກຄົນຮັກຄູ່ຮັກສິ່ງທີ່ເຈົ້າຮັກທີ່ຮັກເທັນນິດສູນຮັກຜູ້ໃດຜູ້ໜຶ່ງຮ່ວມປະເວນີປະເພດຄຳນາມການອອກສຽງເລັຟ
