Using icu::RuleBasedBreakIterator with hardcoded rules


I'm trying to use an ICU RuleBasedBreakIterator in C++ to segment Lao text into syllables. ICU has corresponding rules for Thai, which is "same same but different". The SOLR folks have something working in Java that I could take the rules from, but I cannot find any example of how to instantiate a RuleBasedBreakIterator directly via the constructor that takes the rules, as opposed to going through the factory methods in BreakIterator. Here's what I have so far, a slightly modified function from the ICU docs:

#include <stdio.h>
#include <stdlib.h>
#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/rbbi.h>

using namespace icu;

void listWordBoundaries(const UnicodeString&);

// Placeholder for the Lao.rbbi rules; left empty because the crash is the same.
const char RULES[] = "";

int main(int argc, char *argv[]) {
    listWordBoundaries(UnicodeString::fromUTF8("ປະເທດລາວ"));
    return 0;
}

void listWordBoundaries(const UnicodeString& s) {
    UParseError parse_error = {};
    UErrorCode status = U_ZERO_ERROR;
    RuleBasedBreakIterator* bi = new RuleBasedBreakIterator(
        UnicodeString::fromUTF8(RULES), parse_error, status
    );

    if (U_FAILURE(status)) {
        // u_errorName() returns the symbolic name of the error code.
        fprintf(stderr, "Error creating RuleBasedBreakIterator: %s\n", u_errorName(status));
        // Rule errors are reported as U_BRK_* codes rather than
        // U_MESSAGE_PARSE_ERROR, so print the position unconditionally.
        fprintf(stderr, "Parse error on line %d offset %d\n", parse_error.line, parse_error.offset);
        delete bi;
        exit(1);
    }

    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d (status %d)\n", p, bi->getRuleStatus());
        p = bi->next();
    }
    delete bi;
}

However, I get a segmentation fault as soon as I call bi->next(), caused by a NULL statetable according to gdb:

Program received signal SIGSEGV, Segmentation fault.
icu_54::RuleBasedBreakIterator::handleNext (this=this@entry=0x614c70, statetable=0x0) at rbbi.cpp:1008
1008        UBool               lookAheadHardBreak = (statetable->fFlags & RBBI_LOOKAHEAD_HARD_BREAK) != 0;

The RULES string is supposed to hold the Lao.rbbi rules I linked to above. I have omitted them here because the effect is the same with an empty rule set. If I put some gibberish in the rules, the if(!U_SUCCESS(status)) check does fire and the program exits with an error, so the rule parsing itself seems to work. However, even a successful status code doesn't seem to be sufficient to guarantee that the iterator is actually usable.

Any ideas what I'm missing here?
