Regex (Kotlin) to match end of sentence periods only and ignore periods in the middle such as abbreviations


I need a regex that finds all sentence-ending periods and ignores periods in the middle of a sentence, such as those in abbreviations. Note: I understand that there are many other variations and that it may not be possible to account for all of them, so the question is: can at least the sample below be solved with a regex?

Suppose I have the text below. The regex rule matches any period followed by whitespace, but it also matches the periods in p.m. and U.S. How can I ignore a) periods in a word whose characters are all separated by periods (such as U.S.) and b) a period preceded by a single character only (such as J.)? This is in Kotlin.

        val text = "At 12.51 p.m. local time, J. Knapp, former U.S. Navy,  went out for a walk. Yes he did. And then a Mw6.3 earthquake happened."
        val regexRule = "\\.\\s+"
        val splitText = text.split(regexRule.toRegex())
        val result = splitText.joinToString(separator = ".\n\n")

Current result with just that rule:

At 12.51 p.m.

local time, J.

Knapp, former U.S.

Navy, went out for a walk.

Yes he did.

And then a Mw6.3 earthquake happened.

Wiktor Stribiżew On BEST ANSWER

You can use

val regexRule = "(?<!\\b\\p{L})\\.(?<!\\d.(?=\\d))(?!\\s*\$)\\s*"

See the regex demo.

Details:

  • (?<!\b\p{L}) - a negative lookbehind: no single letter preceded with a word boundary is allowed immediately to the left of the current location
  • \. - a dot
  • (?<!\d.(?=\d)) - the dot should not be in-between digits
  • (?!\s*$) - a negative lookahead: immediately to the right, there must not be zero or more whitespace characters followed by the end of the string, so the final period of the text is not matched
  • \s* - zero or more whitespace characters.
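A minimal sketch of how this rule could be applied to the sample from the question. Because the pattern consumes the period itself, the pieces returned by split lose their trailing dot, so it is put back when joining, as in the question's code. The function name `splitSentences` is only an illustrative helper and not part of the answer.

```kotlin
// Hypothetical helper: splits on sentence-ending periods using the rule above.
fun splitSentences(text: String): List<String> {
    // "\$" is Kotlin's escape for a literal dollar sign inside a string.
    val regexRule = "(?<!\\b\\p{L})\\.(?<!\\d.(?=\\d))(?!\\s*\$)\\s*"
    return text.split(regexRule.toRegex())
}

fun main() {
    val text = "At 12.51 p.m. local time, J. Knapp, former U.S. Navy,  went out for a walk. Yes he did. And then a Mw6.3 earthquake happened."
    // The matched period is removed by split, so restore it in the separator.
    // The last sentence keeps its own period because (?!\s*$) excludes it.
    println(splitSentences(text).joinToString(separator = ".\n\n"))
}
```

For the sample text this should give three pieces, with the periods in 12.51, p.m., J., U.S. and Mw6.3 left untouched.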
JvdV On

You mentioned that you want a pattern that would at least solve the given sample. Try matching the whitespace after a dot while asserting that the dot is preceded neither by a single character that starts at a word boundary nor by a digit. For example:

(?<=(?<!\b.|\d)\.)\s+
  • (?<= - Open a positive lookbehind;
    • (?<! - Open a negative lookbehind;
      • \b.|\d) - Match a word boundary followed by a single character (any), or a digit, and close the negative lookbehind;
    • \.) - Match a literal dot and close the positive lookbehind;
    • \s+ - Match 1+ whitespace characters.
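As a rough illustration of what each part of the lookbehind rejects, the pattern can be checked against a few short fragments. The fragments and the helper name `isSentenceBreak` are made up for this sketch. The last fragment is not from the question and shows a side effect of the `\d` alternative.

```kotlin
// Whitespace counts as a sentence break only if the dot before it is not
// preceded by a single character at a word boundary or by a digit.
val sentenceBreak = Regex("(?<=(?<!\\b.|\\d)\\.)\\s+")

// Hypothetical helper: true if the fragment contains a sentence break.
fun isSentenceBreak(fragment: String): Boolean = sentenceBreak.containsMatchIn(fragment)

fun main() {
    val fragments = listOf(
        "a walk. Yes",   // ordinary word before the dot
        "J. Knapp",      // single letter before the dot
        "U.S. Navy",     // letters separated by dots
        "p.m. local",    // same case, lowercase
        "in 2020. Next"  // digit before the dot
    )
    for (fragment in fragments) {
        println("$fragment -> ${isSentenceBreak(fragment)}")
    }
}
```

Only the first fragment should report a break. The last one does not, which means a sentence that ends in a number is not split by this pattern.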

See an online demo


Here is a snippet of code:

fun main() {
    val info = "At 12.51 p.m. local time, J. Knapp, former U.S. Navy,  went out for a walk. Yes he did. And then a Mw6.3 earthquake happened."
    val lines = info.split(Regex("(?<=(?<!\\b.|\\d)\\.)\\s+"))
    for (line in lines) {
        println(line)
    }
}

Returns:

At 12.51 p.m. local time, J. Knapp, former U.S. Navy,  went out for a walk.
Yes he did.
And then a Mw6.3 earthquake happened.

But please note that I have no experience with Kotlin.