I need to do some synonym matching with Solr.
For instance, in Sweden street names usually have the form Foogatan, where gatan is the Swedish word for street. Such a street name can also be written abbreviated, as Foog. (much like you write st. for street in English).
I'm familiar with how synonyms.txt works, but I don't know how to create a synonym that checks that the token contains some letters before gatan or before g.
I would need a synonym that matches both *g. and *gatan.
I ended up doing this (seems to work as a rough draft for what I'm after)
    public boolean incrementToken() throws IOException {
        // See http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/
        if (!input.incrementToken()) return false; // no more tokens

        String term = charTermAttr.toString();
        if (term.contains("gatan")) {
            // Long form -> abbreviated form
            charTermAttr.setEmpty().append(term.replace("gatan", "g."));
        } else if (term.contains("g.")) {
            // Abbreviated form -> long form
            charTermAttr.setEmpty().append(term.replace("g.", "gatan"));
        }
        // Always return true once a token has been read; returning false here
        // would end the stream at the first token containing neither form.
        return true;
    }
A similar problem I have is that phone numbers can be written both as 031-123456 and as 031123456. When searching for a phone number like 031123456, it should also find 031-123456.
How can I achieve this in Solr?
For the first one, you could write a custom TokenFilter and hook it up in your analyzers (it's not that hard; take a look at org.apache.lucene.analysis.ASCIIFoldingFilter for a simple example).
The second one could possibly be solved by using PatternReplaceCharFilterFactory: http://docs.lucidworks.com/display/solr/CharFilterFactories
You would have to remove the '-' character from the numbers and index/search for digits only. Similar question: Solr PatternReplaceCharFilterFactory not replacing with specified pattern
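The hyphen removal can be tried out in plain Java before putting it into the schema. This is only a sketch: the class name PhoneNormalizer and the regular expression are assumptions, not something taken from the Solr documentation. The idea is that the same expression would then be supplied as the pattern attribute of PatternReplaceCharFilterFactory, with an empty replacement, in both the index and query analyzers.

```java
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the pattern outside Solr.
class PhoneNormalizer {
    // Matches a hyphen only when it has a digit on both sides,
    // so ordinary hyphenated words are left untouched.
    private static final Pattern HYPHEN_BETWEEN_DIGITS =
            Pattern.compile("(?<=\\d)-(?=\\d)");

    static String normalize(String text) {
        return HYPHEN_BETWEEN_DIGITS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("031-123456"));
        System.out.println(normalize("031123456"));
        System.out.println(normalize("well-known"));
    }
}
```

Because both spellings reduce to the same digit string, a query for either form should match a document indexed with the other, provided the char filter is applied on both sides.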
A simple example is a Gatanizer filter that removes gatan from the end of each token. I've registered my TokenFilter on a Solr field's analyzer. You'll also need a simple GatanizerFactory that returns your Gatanizer.
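The per-term logic of such a filter might look roughly like the following. It is a sketch in plain Java rather than a Lucene class, so it can be run without Solr on the classpath; GatanSuffix and normalize are hypothetical names. Also stripping the abbreviated g. suffix is an assumption added here so that both spellings end up as the same term. Inside a TokenFilter, incrementToken() would call something like this on the contents of the CharTermAttribute and write the result back.

```java
// Hypothetical helper holding only the string handling of a Gatanizer-style filter.
class GatanSuffix {
    private static final String LONG_FORM = "gatan";
    private static final String SHORT_FORM = "g.";

    // Strips a trailing "gatan" or "g." so that both spellings share one stem.
    // The suffix is only removed when at least one character precedes it.
    static String normalize(String term) {
        if (term.length() > LONG_FORM.length() && term.endsWith(LONG_FORM)) {
            return term.substring(0, term.length() - LONG_FORM.length());
        }
        if (term.length() > SHORT_FORM.length() && term.endsWith(SHORT_FORM)) {
            return term.substring(0, term.length() - SHORT_FORM.length());
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Foogatan"));
        System.out.println(normalize("Foog."));
        System.out.println(normalize("gatan"));
    }
}
```

Mapping both forms to one stem, rather than swapping one form for the other, means the same filter can be used at index time and at query time. Depending on the tokenizer, the trailing dot of Foog. may already be gone before the filter sees the term, so the short form may need adjusting.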