I want to do some basic Hebrew stemming.
All the examples of custom analyzers I could find merge other analyzers and filters but never do any string-level processing themselves.
For example, what do I have to do if I want to create an analyzer that, for each term in the stream it gets, emits either one or two terms by the following rules: if the incoming term begins with anything other than "a", it should be passed through as is. If the incoming term begins with "a", two terms should be emitted: the original term, and a second one without the leading "a" and with a lower boost.
So that if the document has "help away" it will return "help", "away", and "way^0.8".
What methods of the analyzer should I override to do this? (A pointer to a similar example would be very helpful.)
Thanks
Here's one example: http://www.java2s.com/Open-Source/Java-Document/Search-Engine/lucene/org/apache/lucene/wordnet/SynonymTokenFilter.java.htm
From a brief scan of the code, it appears to emit additional tokens at the same position as the original (synonyms). It does that by overriding incrementToken(), which is what you'll have to do for your problem: maintain a stack of pending tokens and return them one by one.
If this example doesn't work, just try to find one that explains how to implement a synonym filter with Lucene; it's almost identical to your problem. The Lucene in Action book has a good example of this, and the code is available here: http://www.manning.com/hatcher3/LIAsourcecode.zip, class SynonymFilter.
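To make the "stack of pending tokens" idea concrete, here is a minimal plain-Java sketch of the control flow. It is not the Lucene API: the Token class, the Filter class, its next() method and the analyze() helper are all hypothetical names made up for illustration, with next() playing the role that incrementToken() plays in a real TokenFilter. The boost field is also only a stand-in; how you carry a per-term weight in Lucene (payloads may be one option) is outside this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Plain-Java model of the pending-token pattern; NOT the Lucene API.
final class PrefixStrippingSketch {

    // Hypothetical token: text, position increment (0 = same position
    // as the previous token, like a synonym) and an illustrative boost.
    static final class Token {
        final String text;
        final int positionIncrement;
        final float boost;

        Token(String text, int positionIncrement, float boost) {
            this.text = text;
            this.positionIncrement = positionIncrement;
            this.boost = boost;
        }

        @Override
        public String toString() {
            return boost == 1.0f ? text : text + "^" + boost;
        }
    }

    // Hypothetical filter; next() mirrors the shape of incrementToken():
    // every call yields exactly one token, or null at end of stream.
    static final class Filter {
        private final Iterator<String> input;
        private final Deque<Token> pending = new ArrayDeque<>();
        private final String prefix;
        private final float variantBoost;

        Filter(Iterator<String> input, String prefix, float variantBoost) {
            this.input = input;
            this.prefix = prefix;
            this.variantBoost = variantBoost;
        }

        Token next() {
            // 1. Drain variants queued by an earlier call first.
            if (!pending.isEmpty()) {
                return pending.pop();
            }
            // 2. Otherwise pull the next term from upstream.
            if (!input.hasNext()) {
                return null;
            }
            String term = input.next();
            // 3. Queue the stripped variant at the same position.
            if (term.startsWith(prefix) && term.length() > prefix.length()) {
                String stripped = term.substring(prefix.length());
                pending.push(new Token(stripped, 0, variantBoost));
            }
            // 4. Always emit the original term first.
            return new Token(term, 1, 1.0f);
        }
    }

    // Whitespace-tokenizes the text and runs it through the filter,
    // using the question's rule: prefix "a", lower boost 0.8.
    static List<Token> analyze(String text) {
        List<String> terms = new ArrayList<>(Arrays.asList(text.trim().split("\\s+")));
        terms.removeIf(String::isEmpty);
        Filter filter = new Filter(terms.iterator(), "a", 0.8f);
        List<Token> out = new ArrayList<>();
        for (Token t = filter.next(); t != null; t = filter.next()) {
            out.add(t);
        }
        return out;
    }
}
```

With the question's input, PrefixStrippingSketch.analyze("help away") yields help, away and way^0.8, with the last token at the same position as "away". In an actual filter the same four steps would live inside incrementToken(), and you would copy the term text and position increment into the stream's attributes instead of returning a Token object.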