Query section symbol (§) i SQL Server 2012 full text index

890 Views Asked by At

We are using the SQL Server 2012 for doing full text indexing of legislative documents. However, it appears that in 2012 it is not possible to create queries containing characters like the section symbol (§).

I can't seem to find the documentation on MSDN that states which characters are "un-queryable". In our use case, it seems rather annoying that section symbols (§) are filtered out from the query (as confirmed when parsing query using sys.dm_fts_parser).

Would this be possible to do using SQL server 2012 and full text search, by implementing some configuration or workaround?

2

There are 2 best solutions below

0
On

The problem you're facing is that symbols like § are considered word-breakers in SQL Server Full Text, just like spaces.

The A+ solution is to create your own custom word breaker. I'm having a hard time finding a good example or even a reference for how to do this, but I believe you can reference the Windows Search Developer's Guide (check out Extending the Index) since they're built on the same technology.

The quick and dirty solution is to use the LIKE operator. Ex: SELECT * FROM Table WHERE Column LIKE '%§123%'

0
On

We haven't yet checked implementing a custom word breaker, and I believe using LIKE search defies the purpose of using a full text index, and I think it would hamper performance. However, we have discovered that it is possible to configure the word breaker so it actually excludes certain characters. It is explained here:

http://technet.microsoft.com/en-us/library/hh231252.aspx and here http://technet.microsoft.com/library/cc263242.aspx

I believe this would be the answer for making signs such as § queryable in a full text index in SQL server 2012.

Unfortunately, this is only possible for some language, and although Danish is one of the supported language, and this is the language we are supposed to use (the language of the documents being indexed are Danish), we have found severe problems with the linguistic analyses when using Danish as language, probably originating in the Stemming class used for Danish in SQL server 2008/2012. Now we are looking into changing or disabling the stemming used for Danish. This proves to be dificult, if possible at all, but is something for an entirely different question here on StackOverflow.