Regex to find a lowercase letter followed by an uppercase letter inside an HTML tag


I want to use a regular expression in TextWrangler to find a lowercase letter followed by an uppercase letter between these HTML font-color tags. For example:

<font color =#0B610B> Word word wordWord </font>
<font color =#C0C0C0> Word word wordWord </font>

I want to split them with a colon and a space, like this:

<font color =#0B610B> Word word word: Word </font>
<font color =#C0C0C0> Word word word: Word </font>

I have used:

<font color =#0B610B\b[^>]*>(.*?)</font>

But it finds everything between the font-color tags.

I have also tried:

<font color =#0B610B\b[^>]*>([a-z])([A-Z])</font>

But it does not work.

Could anyone help me? Thank you very much.


3
On

How about a positive lookahead, something like this:

[a-z](?=[A-Z])

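I don't have TextWrangler, but you can use this to match the lowercase letter and then add your colon and space. I tested this regex in Perl and it looks OK. Note that the one-liner below replaces only the first occurrence on each line; add the g modifier (s/.../.../g) to replace every occurrence.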

[jaypal:~/Temp] cat temp
<font color =#0B610B> Word word wordWord </font>
<font color =#C0C0C0> Word word wordWord </font>

[jaypal:~/Temp] perl -pe 's/([a-z])(?=[A-Z])/$1: /' temp
<font color =#0B610B> Word word word: Word </font>
<font color =#C0C0C0> Word word word: Word </font>
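
For comparison, here is a minimal JavaScript sketch of the same lookahead substitution. The function name splitCamel and the sample string are made up for illustration, and the g flag is an addition to the one-liner above so that every occurrence is handled. Like the Perl version, it does not restrict itself to text inside font tags.

```javascript
// Hypothetical helper (not from the answer): insert ": " between a
// lowercase letter and the uppercase letter that directly follows it.
function splitCamel(text) {
  // (?=[A-Z]) is the positive lookahead from the answer: it checks for an
  // uppercase letter without consuming it. The g flag replaces every match.
  return text.replace(/([a-z])(?=[A-Z])/g, "$1: ");
}

const sample = "<font color =#0B610B> Word word wordWord </font>";
console.log(splitCamel(sample));
// prints: <font color =#0B610B> Word word word: Word </font>
```
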

Update: I forgot that I have BBEdit, which is the big brother of TextWrangler. Here it is in action.

Update 2: Here it is in action in TextWrangler.

1
On

Try this:

<font.*?>.*?[a-z][A-Z].*?</font>

2
On

How about this one:

<font[^>]*>[^<>]*([a-z][A-Z])[^<>]*</font>
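
The patterns in this answer and the previous one locate a font element that contains a lowercase-Uppercase pair, but they do not show the replacement. One possible way to combine them with a substitution, sketched in JavaScript: match each font element with a variation of the pattern above, then split the pairs only in the text between the tags. The function name splitInsideFontTags is hypothetical, and the sketch assumes the text between the tags contains no nested tags.

```javascript
// Hypothetical helper (not from the answers): split lowercase-Uppercase
// pairs only in the text between <font ...> and </font>.
function splitInsideFontTags(html) {
  // Outer pattern: opening font tag, text with no angle brackets, closing tag.
  return html.replace(/(<font[^>]*>)([^<>]*)(<\/font>)/g,
    (match, open, body, close) =>
      // Inner pattern: add ": " after a lowercase letter followed by uppercase.
      open + body.replace(/([a-z])(?=[A-Z])/g, "$1: ") + close);
}

const line = "<font color =#0B610B> Word word wordWord </font> otherText";
console.log(splitInsideFontTags(line));
// prints: <font color =#0B610B> Word word word: Word </font> otherText
```

Text outside the font element (otherText here) is left alone because the inner replacement only ever sees the captured body.
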
0
On

I don't think you can do it in a single regular expression, but you can if you are able to loop over it:

<script type="text/javascript">
function checkscript() {
    var content = document.regexForm.input.value;
    // Match: (any HTML tag; you could restrict it to font)(text without "<")(lowercase)(uppercase)(text without "<")
    while(content.match(/(<[^>]*?>)([^<]*)([a-z])([A-Z])([^<]*)/))
    {
        content = content.replace(/(<[^>]*?>)([^<]*)([a-z])([A-Z])([^<]*)/g,"$1$2$3: $4$5");
    }
    document.regexForm.output.value = content;
}
</script>
<body>

<form name="regexForm">
    <textarea rows="10" cols="50" name="input"> 
            <font color =#0B610B> Word myWord<BR> wordWord </font>
            <font color =#C0C0C0> Word word wordWord </font>
    </textarea>
<BR>    
<input type=button value="run test regex" onClick="checkscript();return true;">
<BR><textarea rows="10" cols="50" name="output"></textarea>
</form>

This input:

<font color =#0B610B> Word myWord<BR> wordWord </font>
<font color =#C0C0C0> Word word wordWord </font>

becomes:

<font color =#0B610B> Word my: Word<BR> word: Word </font>
<font color =#C0C0C0> Word word word: Word </font>
0
On

This question has not been marked as Answered. If you still have not found an adequate answer, you can try this:

Given the following examples, only lines 1, 2, and 3 should "match" your criteria. Line 4 should NOT match, since there is no "lowercase-Uppercase" combination. Line 5 should also not match because the font color (#FFFFFF) does not match what you specified (in the OP as well as follow-up comments).

<font color =#0B610B> Word word wordWord </font>
<font color =#C0C0C0> Word word wordWord </font>
<font color =#C0C0C0> wordWord wordWordwordWord </font>
<font color =#0B610B> word word word Word Word Word Wordword </font>
<font color =#FFFFFF> Word word wordWord </font>

The search term could be written like this:

(?<=font color =#(?:0B610B|C0C0C0)>)((?:(?!</font>|[\r\n]).)*[a-z])([A-Z])

The replacement term could be written like this:

\1: \2

The search term has several nested parentheses. The first, (?<=...), is a lookbehind: it requires the "<font color =#...>" tag immediately to the left and starts the match just to the right of it. The (?:0B610B|C0C0C0) matches either of your specified font colors (you can add more, separated by "|" pipes) and does not store the result in one of the \# registers (like \1 or \2).

Three opening parentheses follow. The first opens a capturing group, whose text is available as \1. The third (skipping the second for now), (?!...), is a negative lookahead: it checks that the characters just to the right of the current position are NOT the closing </font> tag, and not a newline character either. While that condition holds, the . consumes one character and the check is repeated at the next position. Because the * is greedy, this runs up to the closing </font> tag (or the end of the line) and then backtracks to the last place where a lowercase letter is followed by an uppercase letter.

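The second group, (?:...), is non-capturing: it only groups the lookahead and the . so that the * can repeat them together, without storing anything in a register of its own. The text it consumes still ends up in \1, which therefore holds everything between the <font> tag and the split point, excluding the tags themselves.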

Finally, in the replacement term, \1 pastes back the text from just after the <font> tag up to and including the lowercase letter, then a colon and a space are inserted, and \2 pastes back the uppercase letter. Because the search is greedy and anchored to the opening tag, each pass splits only one pair per tag (the last one), so you may have to run this replacement multiple times for cases where a single line contains several pairs, such as wordWordWordWord.
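
The "run it multiple times" step can be automated outside the editor. Below is a minimal JavaScript sketch that applies the search and replacement terms repeatedly until the text stops changing. It assumes a JavaScript engine with lookbehind support (for example a recent Node.js); the name splitUntilStable is hypothetical, and the replacement is written as $1: $2 because JavaScript uses $ rather than \ for group references.

```javascript
// Search term from the answer as a JavaScript literal ("/" in </font> is
// escaped). With the g flag, each pass splits one pair per matching tag.
const search =
  /(?<=font color =#(?:0B610B|C0C0C0)>)((?:(?!<\/font>|[\r\n]).)*[a-z])([A-Z])/g;

// Hypothetical helper: repeat the replacement until nothing changes.
function splitUntilStable(text) {
  let previous;
  do {
    previous = text;
    text = text.replace(search, "$1: $2");
  } while (text !== previous);
  return text;
}

const input = [
  "<font color =#C0C0C0> wordWord wordWordwordWord </font>",
  "<font color =#FFFFFF> Word word wordWord </font>",
].join("\n");
console.log(splitUntilStable(input));
// prints:
// <font color =#C0C0C0> word: Word word: Wordword: Word </font>
// <font color =#FFFFFF> Word word wordWord </font>
```

The loop ends because every pass removes at least one lowercase-Uppercase pair and never creates a new one. The #FFFFFF line is left untouched because the lookbehind only accepts the two listed colors.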