I need to do some synonym matching with Solr.
For instance, Swedish street names usually have the form Foogatan, where "gatan" is the Swedish word for "street". Such a name can also be abbreviated as Foog. (much like writing "st." for "street" in English).
I'm familiar with how synonyms.txt works, but I don't know how to create a synonym that checks that the token contains some letters before "gatan" or before "g.".
I would need a synonym that matches both *g. and *gatan.
I ended up doing this (it seems to work as a rough draft of what I'm after):
public boolean incrementToken() throws IOException {
    // See http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/
    if (!input.incrementToken()) return false;
    String string = charTermAttr.toString();
    if (string.contains("gatan")) {
        // Long form found: rewrite it to the abbreviated form.
        string = string.replace("gatan", "g.");
    } else if (string.contains("g.")) {
        // Abbreviated form found: rewrite it to the long form.
        string = string.replace("g.", "gatan");
    } else {
        // Returning false would end the token stream, so pass other tokens through unchanged.
        return true;
    }
    charTermAttr.setEmpty().append(string);
    return true;
}
I also have a similar problem: phone numbers can be written either as 031-123456 or as 031123456. A search for 031123456 should also find 031-123456.
How can I achieve this in Solr?
For the first one you could write a custom TokenFilter and hook it up in your analyzers (it's not that hard; take a look at org.apache.lucene.analysis.ASCIIFoldingFilter for a simple example).
The second one could possibly be solved by using PatternReplaceCharFilterFactory: http://docs.lucidworks.com/display/solr/CharFilterFactories
You would have to remove the '-' character from the numbers and index/search for digits only. Similar question: Solr PatternReplaceCharFilterFactory not replacing with specified pattern
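To make the hyphen removal concrete, here is a minimal plain-Java sketch of the kind of regex replacement a pattern-replace char filter would perform. The class name PhoneNumberNormalizer and the exact pattern are assumptions for illustration, not something taken from Solr; the pattern only removes hyphens that sit between two digits, so hyphenated words are left alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr or Lucene.
final class PhoneNumberNormalizer {
    // Matches a hyphen that has a digit on both sides, e.g. the one in "031-123456".
    private static final Pattern HYPHEN_BETWEEN_DIGITS =
            Pattern.compile("(?<=\\d)-(?=\\d)");

    // Removes hyphens between digits and leaves everything else untouched.
    static String normalize(String text) {
        return HYPHEN_BETWEEN_DIGITS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("031-123456"));
        System.out.println(normalize("031123456"));
    }
}
```

If the same replacement is applied at both index and query time, both spellings of the number end up as the same digit string, which is what makes them match.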
Simple example removing gatan from the end of each token:
and I've registered my TokenFilter on a Solr field:
You'll also need a simple GatanizerFactory that will return your Gatanizer.
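The per-token rewrite that such a filter applies can be sketched in plain Java without any Lucene dependencies. This is only one possible variant: it expands a trailing "g." to "gatan" rather than stripping the suffix, and the class name StreetSuffixNormalizer is hypothetical. It also assumes the tokenizer keeps the trailing period on the token (a whitespace-style tokenizer would; others may strip it).

```java
import java.util.Locale;

// Hypothetical helper illustrating the string logic only; not a Lucene TokenFilter.
final class StreetSuffixNormalizer {
    private static final String LONG_SUFFIX = "gatan";
    private static final String SHORT_SUFFIX = "g.";

    // Lowercases the token and expands a trailing "g." to "gatan".
    static String normalize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        // Require at least one character before the suffix so a bare "g." is left alone.
        if (lower.length() > SHORT_SUFFIX.length() && lower.endsWith(SHORT_SUFFIX)) {
            return lower.substring(0, lower.length() - SHORT_SUFFIX.length()) + LONG_SUFFIX;
        }
        return lower;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Foog."));
        System.out.println(normalize("Foogatan"));
    }
}
```

Inside a real filter, the same method could be called on the term text in incrementToken(), with the result written back to the term attribute. Checking only the end of the token, instead of using contains, avoids rewriting a "g." or "gatan" that appears in the middle of a word.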