LEN and DATALENGTH of VARCHAR and NVARCHAR

1.6k Views Asked by At

After reading "What is the difference between char, nchar, varchar, and nvarchar in SQL Server?" I have a question.

I'm using MS SQL Server 2008 R2

DECLARE @T TABLE
(
C1 VARCHAR(20)  COLLATE Chinese_Traditional_Stroke_Order_100_CS_AS_KS_WS,
C2 NVARCHAR(20) COLLATE Chinese_Traditional_Stroke_Order_100_CS_AS_KS_WS
)

INSERT INTO @T VALUES (N'中华人民共和国',N'中华人民共和国')

SELECT LEN(C1)        AS [LEN(C1)],
       DATALENGTH(C1) AS [DATALENGTH(C1)],
       LEN(C2)        AS [LEN(C2)],
       DATALENGTH(C2) AS [DATALENGTH(C2)]
FROM   @T  

Returns

LEN(C1)     DATALENGTH(C1) LEN(C2)     DATALENGTH(C2)
----------- -------------- ----------- --------------
7           12             7           14

Why the second DATALENGTH(C1) is 12?

1

There are 1 best solutions below

0
On

In your INSERT you are converting text from unicode to chinese codepage for C1. Most likely this process alters the text and something may be lost.

Here is SQL Fiddle.

You can see that the second character is stored as 3F in varchar. You can also see that the last character is also stored as 3F in varchar. 3F is a code for ?. When Windows tries to convert text from unicode to the codepage and certain character can't be represented in the given codepage, the conversion function (most likely WideCharToMultiByte) puts ? for such characters .

One more example. The last, but one character is encoded as A94D in varchar and 8C54 in nvarchar. If you look it up in Character Map it will show these codes (unicode and codepage):

character map

See also:

What does it mean when my text is displayed as Question Marks?

https://www.microsoft.com/middleeast/msdn/Questionmark.aspx

Any time Unicode data must be displayed, they may be internally converted from Unicode using the WideCharToMultiByte API. Any time a character cannot be represented on the current code page, it will be replaced by a question mark (?).

This is exactly what is happening when you store a unicode literal N'中华人民共和国' in a varchar column. The unicode text is converted to multi-byte and some characters can't be represented in that code page and they are replaced by question marks ?.