Ignore special characters in Examine

1k Views Asked by At

In Umbraco, I use Examine to search in the website but the content is in french. Everything works fine except when I search for "Français" it's not the same result as "Francais". Is there a way to ignore those french characters? I try to find a FrenchAnalyser for Leucene/Examine but did not found anything. I use Fuzzy so it return results even if the words is not the same.

Here's the code of my search :

public static ISearchResults Search(string searchTerm)
        {
            var provider = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
            var criteria = provider.CreateSearchCriteria(BooleanOperation.Or);

            var crawl = criteria.GroupedOr(BoostedSearchableFields, searchTerm.Boost(15))
            .Or().GroupedOr(BoostedSearchableFields, searchTerm.Fuzzy(Fuzziness))
            .Or().GroupedOr(SearchableFields, searchTerm.Fuzzy(Fuzziness))
            .Not().Field("umbracoNavHide", "1");

            return provider.Search(crawl.Compile());
        }
3

There are 3 best solutions below

0
On BEST ANSWER

We ended up using a custom analyer based on the SnowballAnalyzer

public class CustomAnalyzer : SnowballAnalyzer
{
    public CustomAnalyzer() : base("French") { }

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream result = base.TokenStream(fieldName, reader);

        result = new ISOLatin1AccentFilter(result);

        return result;
    }
}
0
On

You can actually convert the unicode characters with diacritics to english equivalents using the following method. That will enable you to search for "Français" with the search term "Francais".

public static string RemoveDiacritics(this string text)
{
    if (string.IsNullOrWhiteSpace(text))
        return text;

    text = text.Normalize(NormalizationForm.FormD);
    var chars = text.Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark).ToArray();

    return new string(chars).Normalize(NormalizationForm.FormC);
}

Use it on any string like this:

var converted = unicodeString.RemoveDiacritics();
1
On

Try using Regex like this below:

var strInput ="Français";
var strToReplace = string.Empty;
var sNewString = Regex.Replace(strInput, "[^A-Za-z0-9]", strToReplace);

I've used this pattern "[^A-Za-z0-9]" to replace all non-alphanumeric string with a blank.

Hope it helps.