I have a string that (potentially) contains HTML tags.
I want to split it into smaller valid HTML strings based on (text) character length. The use case is essentially pagination. I know the length of text that can fit on a single page. So I want to divide the target string into "chunks" or pages based on that character length. But I need each of the resulting pages to contain valid HTML without unclosed tags, etc.
So for example:
const pageCharacterSize = 10
const testString = 'some <strong>text with HTML</strong> tags'
function paginate(string, pageSize) { /* @TODO */ }
const pages = paginate(testString, pageCharacterSize)
console.log(pages)
// ['some <strong>text </strong>', '<strong>with HTML</strong> ', 'tags']
I think this is possible to do with a DocumentFragment or Range, but I can't figure out how to slice the pages based on character offsets.
This MDN page has a demo that does something close to what I need, but it uses caretPositionFromPoint(), which takes X and Y coordinates as arguments.
Update
For the purposes of clarity, here are the tests I'm working with:
import { expect, test } from 'vitest'
import paginate from './paginate'
// 1
test('it should chunk plain text', () => {
// a
const testString = 'aa bb cc dd ee';
const expected = ['aa', 'bb', 'cc', 'dd', 'ee']
expect(paginate(testString, 2)).toStrictEqual(expected)
// b
const testString2 = 'a a b b c c';
const expected2 = ['a a', 'b b', 'c c']
expect(paginate(testString2, 3)).toStrictEqual(expected2)
// c
const testString3 = 'aa aa bb bb cc cc';
const expected3 = ['aa aa', 'bb bb', 'cc cc']
expect(paginate(testString3, 5)).toStrictEqual(expected3)
// d
const testString4 = 'aa bb cc';
const expected4 = ['aa', 'bb', 'cc']
expect(paginate(testString4, 4)).toStrictEqual(expected4)
// e
const testString5 = 'a b c d e f g';
const expected5 = ['a b c', 'd e f', 'g']
expect(paginate(testString5, 5)).toStrictEqual(expected5)
// f
const testString6 = 'aa bb cc';
const expected6 = ['aa bb', 'cc']
expect(paginate(testString6, 7)).toStrictEqual(expected6)
})
// 2
test('it should chunk an HTML string without stranding tags', () => {
const testString = 'aa <strong>bb</strong> <em>cc dd</em>';
const expected = ['aa', '<strong>bb</strong>', '<em>cc</em>', '<em>dd</em>']
expect(paginate(testString, 3)).toStrictEqual(expected)
})
// 3
test('it should handle tags that straddle pages', () => {
const testString = '<strong>aa bb cc</strong>';
const expected = ['<strong>aa</strong>', '<strong>bb</strong>', '<strong>cc</strong>']
expect(paginate(testString, 2)).toStrictEqual(expected)
})
Here is a solution that assumes and supports the following: <b><i>wrong nesting</b></i>, missing <b>end tag, missing start</b> tag
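As a minimal sketch of one way to cover these cases (not necessarily the approach this answer took), the input can be tokenized into tags, whitespace and words while keeping a stack of open tags. The name paginateHtml is hypothetical. The sketch assumes words are separated by whitespace, counts entities such as &amp; by their raw length, and does not handle void elements such as <br>. Tags left open at a page break are closed at the end of that page and reopened on the next one, stray end tags are dropped, and a wrongly nested end tag closes everything opened inside it.

```javascript
// Hypothetical helper: word-based pagination that keeps every page balanced.
function paginateHtml(html, pageSize) {
  // Tokens: a tag (capturing its name), a whitespace run, a word, or a lone "<".
  const tokenPattern = /<\/?([a-zA-Z][\w-]*)[^>]*>|\s+|[^<\s]+|</g;
  const pages = [];
  const stack = []; // currently open tags: { name, tag, written }
  let page = '';
  let textLength = 0; // visible characters on the current page
  let pendingSpace = false;

  const closePage = () => {
    // Close the tags written on this page, innermost first,
    // and mark them all as unwritten so the next page reopens them.
    for (let i = stack.length - 1; i >= 0; i--) {
      if (stack[i].written) page += `</${stack[i].name}>`;
      stack[i].written = false;
    }
    pages.push(page);
    page = '';
    textLength = 0;
    pendingSpace = false;
  };

  for (const [token, tagName] of html.matchAll(tokenPattern)) {
    if (tagName && token.startsWith('</')) {
      // End tag: find the nearest matching open tag.
      const name = tagName.toLowerCase();
      let index = stack.length - 1;
      while (index >= 0 && stack[index].name !== name) index--;
      if (index === -1) continue; // missing start tag: drop the stray end tag
      // Wrong nesting: also close anything opened inside the matching tag.
      while (stack.length > index) {
        const open = stack.pop();
        if (open.written) page += `</${open.name}>`;
      }
    } else if (tagName) {
      // Start tag: defer writing it until a word actually lands on a page.
      stack.push({ name: tagName.toLowerCase(), tag: token, written: false });
    } else if (/^\s+$/.test(token)) {
      // Collapse whitespace to one space, and never start a page with it.
      pendingSpace = textLength > 0;
    } else {
      // Word: start a new page if it does not fit. A word longer than
      // pageSize still gets a page of its own.
      const needed = token.length + (pendingSpace ? 1 : 0);
      if (textLength > 0 && textLength + needed > pageSize) closePage();
      if (pendingSpace) {
        page += ' ';
        textLength += 1;
      }
      for (const open of stack) {
        if (!open.written) {
          page += open.tag;
          open.written = true;
        }
      }
      page += token;
      textLength += token.length;
      pendingSpace = false;
    }
  }
  if (page) closePage(); // also closes tags with a missing end tag
  return pages;
}

console.log(paginateHtml('<strong>aa bb cc</strong>', 2));
// ['<strong>aa</strong>', '<strong>bb</strong>', '<strong>cc</strong>']
console.log(paginateHtml('<b><i>wrong nesting</b></i>', 20));
// ['<b><i>wrong nesting</i></b>']
```

Deferring each start tag until the next word is written is what keeps empty elements such as <strong></strong> from being stranded at the end of a page.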
Output:
Update
Based on a new request in the comments, I changed the split regex from
'[\\s\\S]{1,' + pageSize + '}(?!\\S)'
to
'\\s*[\\s\\S]{1,' + pageSize + '}(?!\\S)'
That is, I added \\s* to catch leading spaces. I also added a page.trim() to remove leading spaces. Finally, I added a few of the OP's examples.
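To illustrate how that pattern behaves on plain text, here is a small sketch. The function name paginateText is hypothetical and tag handling is left out entirely; only the pattern and the trim step come from the description above. The filter for empty strings is an added assumption to cope with trailing whitespace.

```javascript
// Hypothetical helper: splits plain text into pages of at most pageSize
// characters, breaking only where the next character is not a non-space.
function paginateText(text, pageSize) {
  // \s*           swallows the whitespace left over from the previous page
  // [\s\S]{1,n}   greedily takes up to n characters of any kind
  // (?!\S)        backtracks until the match ends at whitespace or end of input
  const pattern = new RegExp('\\s*[\\s\\S]{1,' + pageSize + '}(?!\\S)', 'g');
  const matches = text.match(pattern) || [];
  return matches
    .map((page) => page.trim())
    .filter((page) => page.length > 0);
}

console.log(paginateText('aa bb cc', 4));      // ['aa', 'bb', 'cc']
console.log(paginateText('a b c d e f g', 5)); // ['a b c', 'd e f', 'g']
```

Note that a single word longer than pageSize is not matched in full by this pattern, so it would need separate handling.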