original table contains json, but i've stripped it down to the table below:
| id | json |
|---|---|
| 1 | "name":"one.it.two" |
| 2 | "name": "one.it.two" |
difference between the two rows is the space after :
catalog has no stopwords.
searching for CONTAINS (json, 'it') return both rows.
searching for CONTAINS (json, 'two') return both rows.
searching for CONTAINS (json, 'one') returns only the second row.
why does searching for one not return the first row?
i've reduced the test case even further. thanks to @RobinWebb
this is no more a json or delimited text issue.
| id | text1 |
|---|---|
| 1 | name:first.it |
| 2 | name: first.it |
difference between the two rows is the space after :
searching for first does not return the first row.
search works if i change first.it to first.and
thanks to @AlwaysLearning, this is an issue with the word breaker
results from sys.dm_fts_parser is not consistent:
| text | words |
|---|---|
| name:first.it | name:first.it name :first it |
| name:first.and | name first.and first and |
| name:first,it | name first it |
i used SELECT * FROM sys.dm_fts_parser ('"<text>"', 1033, NULL, 0)
Based on the info provided in this answer https://dba.stackexchange.com/a/65845/94130 it seems that
.itis treated as a special word (possibly a top level domain) by word breaker.I can only infer that this "special word logic" has a b̶u̶g̶ feature in it, where
:is treated as part of the name. Examples:Notice that it always returns one extra result. I assume it includes the extra line when it thinks that part of the string is a URL.
Note that some words are not affected:
I have modified code provided in https://dba.stackexchange.com/a/25848/94130 to get all characters that are treated this way.
Output:
There is a Microsoft article explaining how to switch word breakers/stemmers, which may (or may not) solve this but I have not tried this.
Note: Above code was executed on Win 11 and SQL 2019 Dev