How to change the case of HTML text to sentence case in Python

I have a string containing HTML text; let's call it S.

S = "<b>this is a sentence. and this is one more sentence</b>"

I want to convert S into the following:

S = "<b>This is a sentence. And this is one more sentence</b>"

The problem is that my function can convert plain text to sentence case, but when the text contains HTML it has no way to tell which parts are text and which parts are markup that should be left alone. So when I pass S to my function, it yields the following incorrect result:

S = "<b>this is a sentence. And this is one more sentence</b>"

This happens because the function treats '<' as the first character of the sentence and tries to convert it to uppercase, which leaves it unchanged.

My question is: how can I convert text to sentence case in Python when the text is already marked up as HTML? I don't want to lose the HTML formatting.

There is 1 best solution below

An overly simplistic approach would be

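import xml.etree.ElementTree as ET

S = "<b> This is sentence. and this is one more. </b>"

delim = '. '

def convert(sentence):
    # Capitalize the first non-whitespace character; leave the rest untouched.
    stripped = sentence.lstrip()
    if not stripped:
        return sentence
    lead = sentence[:len(sentence) - len(stripped)]
    return lead + stripped[0].upper() + stripped[1:]

def convert_text(text):
    # Split on the delimiter, fix each sentence, and rejoin, so that no
    # extra delimiter is appended after the last sentence.
    if not text:
        return text
    return delim.join(convert(s) for s in text.split(delim))

def convert_node(node):
    # Handle the text inside the node and the text that follows it (its tail).
    # Child nodes are not visited here.
    node.text = convert_text(node.text)
    node.tail = convert_text(node.tail)
    return node

node = ET.fromstring(S)
# encoding='unicode' makes tostring() return str rather than bytes in Python 3
S = ET.tostring(convert_node(node), encoding='unicode')

# gives '<b> This is sentence. And this is one more. </b>'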

Obviously, this will not cover every situation, but it will work if the task is constrained well enough. This approach should be adaptable to the function you already have. Essentially, I believe you need to parse the HTML with a parser and then manipulate the text values of each HTML node.
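
To illustrate the "manipulate the text values of each node" idea on nested markup, here is a rough sketch. SentenceCaser, convert_tree and sentence_case_html are hypothetical names I made up, not part of any library. It assumes the input is well-formed XML (ElementTree will reject ordinary sloppy HTML) and that a sentence ends with '.', '!' or '?' followed by whitespace, so abbreviations like "e.g. " will be treated as sentence ends.

```python
import xml.etree.ElementTree as ET


class SentenceCaser:
    """Hypothetical helper that remembers sentence state between text chunks."""

    def __init__(self):
        self.at_start = True     # the next letter or digit begins a sentence
        self.after_stop = False  # the previous character was '.', '!' or '?'

    def feed(self, text):
        out = []
        for ch in text:
            if ch.isspace():
                # Whitespace right after a terminator opens a new sentence.
                if self.after_stop:
                    self.at_start = True
                self.after_stop = False
            elif ch in '.!?':
                self.after_stop = True
            else:
                if ch.isalnum():
                    if self.at_start:
                        ch = ch.upper()
                    self.at_start = False
                self.after_stop = False
            out.append(ch)
        return ''.join(out)


def convert_tree(elem, caser):
    # Visit text in document order: the element's own text, then each child
    # (recursively), then the tail text that follows that child.
    if elem.text:
        elem.text = caser.feed(elem.text)
    for child in elem:
        convert_tree(child, caser)
        if child.tail:
            child.tail = caser.feed(child.tail)


def sentence_case_html(markup):
    root = ET.fromstring(markup)
    convert_tree(root, SentenceCaser())
    return ET.tostring(root, encoding='unicode')


print(sentence_case_html("<p>first one. <i>second</i> one. third</p>"))
# prints <p>First one. <i>Second</i> one. Third</p>
```

Because the state lives in one object that is fed every chunk in order, a sentence that starts inside one tag and continues after it is only capitalized once.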

If you are reluctant to use a parser, you could use a regex. This is likely to be much more fragile, so you must constrain your inputs much more. Something like this as a start:

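import re

S = "<b>this is a sentence. and this is one more sentence</b>"
# The capturing group keeps the tags and the full stops in the result.
split_str = re.split(r'(</?\w+>|\.)', S)
# split_str is ['', '<b>', 'this is a sentence', '.', ' and this is one more sentence', '</b>', '']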

You can then check whether each item in the split list starts with < and ends with >, and convert only the items that are not tags:

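for i, word in enumerate(split_str):
    is_tag = word.startswith('<') and word.endswith('>')
    # Skip tags and empty or whitespace-only items.
    if word.strip() and not is_tag:
        stripped = word.lstrip()
        lead = word[:len(word) - len(stripped)]
        # Capitalize the first non-whitespace character of the item.
        split_str[i] = lead + stripped[0].upper() + stripped[1:]

# Join with an empty string; the split kept every original character.
S = ''.join(split_str)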
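
Putting the regex steps together into one function might look roughly like the sketch below. The names TAG_RE, AFTER_STOP_RE, LEADING_RE and sentence_case_regex are made up for illustration. It assumes a tag is anything between '<' and '>' (so it breaks on '>' inside attribute values, comments or scripts) and only handles lowercase ASCII letters.

```python
import re

# Assumption: a "tag" is any run of characters between '<' and '>'.
TAG_RE = re.compile(r'(<[^>]+>)')
# A lowercase letter that follows sentence-ending punctuation and whitespace.
AFTER_STOP_RE = re.compile(r'([.!?]\s+)([a-z])')
# The first lowercase letter of a chunk, allowing leading whitespace.
LEADING_RE = re.compile(r'^(\s*)([a-z])')


def _upper_letter(match):
    # Keep the prefix (group 1) and uppercase the letter (group 2).
    return match.group(1) + match.group(2).upper()


def sentence_case_regex(markup):
    pieces = TAG_RE.split(markup)
    pending = True  # True while the next text chunk should start a sentence
    for i, piece in enumerate(pieces):
        # With one capturing group, re.split puts the tags at odd indices.
        if i % 2 == 1 or not piece.strip():
            continue
        if pending:
            piece = LEADING_RE.sub(_upper_letter, piece)
        piece = AFTER_STOP_RE.sub(_upper_letter, piece)
        pieces[i] = piece
        # The next chunk starts a sentence only if this one ended with one.
        pending = piece.rstrip().endswith(('.', '!', '?'))
    return ''.join(pieces)


print(sentence_case_regex("<b>this is a sentence. and this is one more sentence</b>"))
# prints <b>This is a sentence. And this is one more sentence</b>
```

Splitting only on tags and handling the full stops inside each text chunk avoids having to glue the '.' items back together, but it is still a regex over markup, so treat it as a starting point for constrained input only.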