Can't add space before and after the list numbering


I am writing a program whose output sometimes contains lists that cause problems during voice synthesis. For example, the text to be synthesized looks like this: "Suggestions for restaurants:1. Pizza2. Burger3. Sushi4. Noodles...". The voice synthesis interprets the numbers as part of the adjacent word, resulting in awkward pronunciation. To resolve this, whitespace should be inserted between the numbers and the words. Additionally, the output should not be too long; it would be better to limit the list to the first three suggestions.

I have tried this code:

import re

def post_processing(text):
  """
  Post-processes a text string to address formatting issues for voice synthesis.

  Args:
    text: The input text string.

  Returns:
    The processed text string.
  """
  # Process lists with improved handling
  parts = text.split(":")
  if len(parts) > 1:
    # Split based on newlines, limiting to 3 items
    items = parts[1].strip().split("\n")[:3]
    # Remove trailing spaces, handle punctuation, and add spaces correctly
    items = [
        f"{item.strip()[:-1].rstrip('.')}{' ' if item.strip()[-1].isdigit() or item.strip()[-1] == '.' else ''}{item.strip()[-1:]}"
        for item in items
    ]
    text = ": ".join(items)
  else:
    text = text.strip()  # Remove leading/trailing whitespace

  # Remove URLs completely
  text = re.sub(r"https?://\S+", "", text)

  return text

When I pass in the following input:

text = "Suggestions for restaurants: 1 . Pizza2. Burger3. Sushi4. Noodles...."
text = post_processing(text)

I expect this output:

print(text)  # Output: 1. Pizza 2. Burger 3. Sushi

but what I actually get is:

1 . Pizza2. Burger3. Sushi4. Noodles .

There are 3 best solutions below

1
Abhijit On

If you want to ensure there's a space both before and after the list numbering, you can adjust the formatting in the list comprehension. Try this:

import re

def post_processing(text):
    """
    Post-processes a text string to address formatting issues for voice synthesis.

    Args:
      text: The input text string.

    Returns:
      The processed text string.
    """
    # Process lists with improved handling
        parts = re.split(r'[:,]\s*', text, maxsplit=1)  # Split once on the first colon or comma, skipping any whitespace after it
    if len(parts) > 1:
        # Extract list items with numbers
        items_with_numbers = re.findall(r'\d+\.\s*\S+', parts[1])
        # Limit to the first three items
        items = items_with_numbers[:3]
        # Replace consecutive periods with a single period within each word
        items = [re.sub(r'\.+', '.', item) for item in items]
        # Add space before and after list numbering
        items = [f"{item[:-1]}. {item[-1]}" for item in items]
        text = " ".join(items)
    else:
        text = text.strip()  # Remove leading/trailing whitespace

    # Remove URLs completely
    text = re.sub(r"https?://\S+", "", text)

    return text

text = "Suggestions for restaurants: 1. Piz. . 2. Burg. . 3. Sus. . 4. Nood. les...."
text = post_processing(text)
print(text) 
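
Note that `\d+\.\s*\S+` expects the dot directly after the number and whitespace between items, so as far as I can tell it would not pick up the items individually on the question's original input (`1 . Pizza2. Burger3. ...`). Below is a minimal sketch of a more tolerant variant. The function name `first_items` and the `limit` parameter are illustrative choices, not part of the code above, and the sketch assumes that item names contain no digits or dots.

```python
import re

def first_items(text, limit=3):
    """Return the first `limit` numbered items as '1. Pizza 2. Burger ...'.

    Hypothetical helper for illustration; assumes item names contain
    no digits or dots.
    """
    # Keep only what follows the first colon, if there is one
    body = text.split(":", 1)[-1]
    # A number, optional spaces, a dot, then the name up to the next digit or dot
    pairs = re.findall(r"(\d+)\s*\.\s*([^\d.]+)", body)
    return " ".join(f"{num}. {name.strip()}" for num, name in pairs[:limit])

print(first_items("Suggestions for restaurants: 1 . Pizza2. Burger3. Sushi4. Noodles...."))
# 1. Pizza 2. Burger 3. Sushi
```

Capturing the number and the name separately means the spacing is rebuilt from scratch, so it does not matter how the input was spaced.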
9
JonSG On

In this case, I think you can split on numbers and strip out the leading characters you don't want. Then re-introduce your numbering via enumerate()

import re

def post_processing(text):
    """
    Post-processes a text string to address formatting issues for voice synthesis.

    Args:
      text: The input text string.

    Returns:
      The processed text string.
    """

    return "Suggestions for restaurants: " + " ".join(
        f"{i}. {p.strip(' .')}"
        for i, p
        in enumerate(re.split(r"\d", text)[1:-1], start=1)
    )

text = "Suggestions for restaurants: 1 . Pizza2. Burger3. Sushi4. Noodles...."
print(text)
print(post_processing(text))

Giving you:

Suggestions for restaurants: 1 . Pizza2. Burger3. Sushi4. Noodles....
Suggestions for restaurants: 1. Pizza 2. Burger 3. Sushi

If your ultimate goal is just to give the numbers in your input text proper spacing, then I might use:

text = "Suggestions for restaurants: 1 . Pizza2. Burger3. Sushi4. Noodles...."
text = re.sub(r"\s*(\d)\s*\.?\s*", r" \1. ", text).split(" 4.")[0]
print(text)

With the new test:

import re

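def post_processing(text):
  # Give every digit the form " N. ", then keep only what precedes item 4
  text = re.sub(r"\s*(\d)\s*\.?\s*", r" \1. ", text).split(" 4.")[0].strip()
  # A trailing "." is assumed to belong to a dangling number such as " 3."; drop those 3 characters
  if text.endswith("."):
    text = text[:-3]
  return text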

print(post_processing("Suggestions for restaurants: 1 . Pizza2. Burger3. Sushi4. Noodles...."))
print(post_processing("Recommendation of Stadiums: 1.OldTrafford 2Manchester Birmingham3"))

Giving you:

Suggestions for restaurants: 1. Pizza 2. Burger 3. Sushi
Recommendation of Stadiums: 1. OldTrafford 2. Manchester Birmingham
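
The `split(" 4.")` above hard-codes the cutoff at three items. If the limit should be configurable, one possible sketch is to normalize the spacing the same way and then cut at the number just past the limit. The function name `space_numbers` and the `limit` parameter are assumptions for illustration. Like the code above, this only handles single-digit numbering and assumes the items are numbered 1, 2, 3 and so on, in order.

```python
import re

def space_numbers(text, limit=3):
    """Hypothetical variant of post_processing with a configurable item limit."""
    # Same normalization as above: every digit becomes " N. "
    text = re.sub(r"\s*(\d)\s*\.?\s*", r" \1. ", text)
    # Cut everything from the first item past the limit onwards
    text = text.split(f" {limit + 1}.")[0].strip()
    # Drop a dangling number left at the very end, e.g. "... Birmingham 3."
    return re.sub(r"\s*\d\.$", "", text)

print(space_numbers("Suggestions for restaurants: 1 . Pizza2. Burger3. Sushi4. Noodles...."))
print(space_numbers("Suggestions for restaurants: 1 . Pizza2. Burger3. Sushi4. Noodles....", limit=2))
```

Removing the dangling number with a regex rather than `text[:-3]` means a sentence that legitimately ends in a period is left alone.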
8
The fourth bird On

You could use a pattern with 3 capture groups, where each part starts with 1 or more digits followed by a dot.

\b(\d+\s*\..*?)(\d+\s*\..*?)(\d+\s*\..*?)(?=\d+\.|$)

The pattern matches:

  • \b A word boundary to prevent a partial word match
  • (\d+\s*\..*?) Capture group 1, match 1+ digits followed by optional whitespace chars and a dot. Then match the least amount of characters until the next occurrence of the same pattern
  • (\d+\s*\..*?) Same for group 2
  • (\d+\s*\..*?) Same for group 3
  • (?=\d+\.|$) But as there might not be a 4th occurrence, we can assert the digits and dot to the right, or assert the end of the string
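
To make the capture groups concrete, here is a small sketch that applies the pattern to the sample string and prints the raw groups before any cleanup (the variable names are just for illustration):

```python
import re

pattern = re.compile(r"\b(\d+\s*\..*?)(\d+\s*\..*?)(\d+\s*\..*?)(?=\d+\.|$)", re.M)
s = "Suggestions for restaurants:1  . Pizza2. Burger3. Sushi4. Noodles..."

match = pattern.search(s)
# Fall back to an empty tuple when fewer than three numbered items are present
groups = match.groups() if match else ()
print(groups)
# ('1  . Pizza', '2. Burger', '3. Sushi')
```

Group 1 still contains the stray spaces between the `1` and the dot, which is why the values need a cleanup pass afterwards.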

See the regex matches and a Python demo.

Then you can post-process the group values.

import re


def post_processing(text):
    pattern = re.compile(r"\b(\d+\s*\..*?)(\d+\s*\..*?)(\d+\s*\..*?)(?=\d+\.|$)", re.M)
    match = pattern.search(text)
    if match is None:
        return None  # the pattern needs three numbered items to match

    # Normalize "1  ." to "1." inside each captured item, then join the items with spaces
    result = [re.sub(r"(\d)\s+\.", r"\1.", s) for s in match.groups()]
    return " ".join(result)


s = "Suggestions for restaurants:1  . Pizza2. Burger3. Sushi4. Noodles..."
print(post_processing(s))

Output

1. Pizza 2. Burger 3. Sushi