How can I make it so that, for the query "20", a document with the content "something 20" gets something like MAX_SCORE, while another document, e.g. "something 20/12", gets a regular score?
I'm playing around with overriding the Similarity algorithm to simplify the search, but this behavior is a pain right now. I need the lengthNorm factor set to "1" because I don't want the "shorter documents get a bigger score" behavior (without this, "20" obviously wins, not because it matches entirely but because it's shorter).
My custom Similarity class looks like this at the moment:
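    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.similarities.DefaultSimilarity;

    public class SimpleSimilarity extends DefaultSimilarity {
        public SimpleSimilarity() {}

        // Ignore how rare the term is across the index.
        @Override
        public float idf(long docFreq, long numDocs) { return 1f; }

        // Ignore how often the term occurs in the document.
        @Override
        public float tf(float freq) { return 1f; }

        // Ignore the field length.
        @Override
        public float lengthNorm(FieldInvertState state) {
            return 1f;
        }
    }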
You can still do this with a custom similarity. The goal is not for shorter documents to score higher as such; what you need in the score is the ratio (matched tokens / total terms in document).
Try a lengthNorm in your custom similarity that returns 1 / state.getLength() as a float (keep tf, idf etc. returning 1f as you mentioned above). state.getLength() returns the number of tokens in the field.
As per the similarity score equation (http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html), the lengthNorm() value is part of the contribution of each matched term, and those contributions are summed. With tf and idf fixed at 1, the net result is the ratio (matched tokens / total terms in document).
Now, in your example, if your query is "20", the documents are returned in this order:
1) 20 (the document has only one term, and it matches the query) - score ~1.0
2) something 20 (the document has two terms and one matches) - score ~0.5
3) something 20/12 (the document has three terms and one matches) - score ~0.33
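To check the arithmetic without an index, here is a minimal plain-Java sketch of the same idea. It does not use Lucene: the class RatioScoreSketch and its tokenize, lengthNorm and score methods are hypothetical stand-ins, and the tokenizer simply assumes that "20/12" is split into "20" and "12", which depends on the analyzer in a real setup.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class RatioScoreSketch {

    // Hypothetical stand-in for an analyzer: lowercase, then split on
    // anything that is not a letter or a digit ("20/12" -> "20", "12").
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Same idea as the suggested lengthNorm: 1 / number of tokens.
    static float lengthNorm(int numTokens) {
        return numTokens == 0 ? 0f : 1f / numTokens;
    }

    // Sum of tf * idf * norm over the distinct query terms that match,
    // with tf and idf fixed at 1 as in the question's similarity.
    static float score(String query, String document) {
        List<String> docTokens = tokenize(document);
        float norm = lengthNorm(docTokens.size());
        Set<String> queryTerms = new LinkedHashSet<>(tokenize(query));
        float score = 0f;
        for (String term : queryTerms) {
            if (docTokens.contains(term)) {
                score += 1f * 1f * norm;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String[] docs = {"20", "something 20", "something 20/12"};
        for (String doc : docs) {
            System.out.println(doc + " -> " + score("20", doc));
        }
    }
}
```

In a real index the scores will only be approximately these values: lengthNorm is computed at index time, so documents need to be reindexed after changing it, and Lucene 4.x stores the norm with reduced precision. Other factors in the scoring formula, such as coord and queryNorm, are left out of this sketch.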