Wednesday, December 21, 2011

Query XML with different languages or different UNICODE encodings [UTF-8 & UTF-16]

While creating XML documents we should take care of the UTF encoding if we are dealing with NON-US or international data. Non-English foreign languages fall in different ranges and takes more memory size compared to English language and thus require different encoding schemes, as follows:
>> UTF-8: represents an octet (8 bit) loosless encoding of Unicode characters. UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets, where the number of octets depends on the integer value assigned to the Unicode character. It is an efficient encoding of Unicode documents that use mostly US-ASCII characters because it represents each character in the range U+0000 through U+007F as a single octet. UTF-8 is the default encoding for XML.
In an XML prolog the encoding is represented as:
‘<?xml version=”1.0″ encoding=”UTF-8″ ?>’
>> UTF-16: encoding is the 16-bit encoding of Unicode. It use 2 bytes per character (and sometimes combines two pairs), it makes implementation easier, but looks a bit overkill for Western languages encoding.
In an XML prolog the encoding is represented as:
‘<?xml version=”1.0″ encoding=”UTF-16″ ?>’
Let’s check both the encodings and issues if they are not used correctly:

-- ===================================================

-- XML containing NON Unicode English characters only:

-- ===================================================

DECLARE @xml as XML

SET @xml =

'<?xml version="1.0" encoding="utf-8"?>

<root>
    <to>Gaurav</to>

    <from>Garvit</from>
    <subject>Happy Birthday</subject>

    <body>Wish you a very very Happy Birth Day!!!</body>

</root>'


SELECT

    T.C.value('to[1]', 'VARCHAR(10)') AS 'TO',

    T.C.value('from[1]', 'VARCHAR(10)') AS 'FROM',

    T.C.value('subject[1]', 'VARCHAR(50)') AS 'SUBJECT',

    T.C.value('body[1]', 'VARCHAR(200)') AS 'BODY'

FROM @xml.nodes('//root') T(C)

GO
 Correct expected output:-

TO FROM SUBJECT  BODY
Gaurav Garvit Happy Birthday Wish you a very very Happy Birth Day!!!
Let’s see how this works with international languages. Let’s change the language to French with same UTF-8 encoding:



-- ============================================

-- When XML contains Unicode French characters:

-- ============================================

DECLARE @xml as XML

SET @xml =

'<?xml version="1.0" encoding="utf-8"?>

<root>
    <to>Gaurav</to>

    <from>Garvit</from>

    <subject>Joyeux anniversaire</subject>

    <body>Je vous souhaite une très très heureux anniversaire!</body>

</root>'


SELECT

    T.C.value('to[1]', 'VARCHAR(10)') AS 'TO',

    T.C.value('from[1]', 'VARCHAR(10)') AS 'FROM',

    T.C.value('subject[1]', 'VARCHAR(50)') AS 'SUBJECT',

    T.C.value('body[1]', 'VARCHAR(200)') AS 'BODY'

FROM @xml.nodes('//root') T(C)

GO


We get an Error Message:
Msg 9420, Level 16, State 1, Line 2
XML parsing: line 6, character 31, illegal xml character
As a NON-English language (French) is used to so we have to use the UTF-16 encoding:



-- Let's change the encoding to UTF-16, i.e. for UNICODE characters:

DECLARE @xml as XML

SET @xml =

'<?xml version="1.0" encoding="utf-16"?>

<root>

    <to>Gaurav</to>

    <from>Garvit</from>

    <subject>Joyeux anniversaire</subject>

    <body>Je vous souhaite une très très heureux anniversaire!</body>

</root>'


SELECT

    T.C.value('to[1]', 'VARCHAR(10)') AS 'TO',

    T.C.value('from[1]', 'VARCHAR(10)') AS 'FROM',

    T.C.value('subject[1]', 'VARCHAR(50)') AS 'SUBJECT',

    T.C.value('body[1]', 'VARCHAR(200)') AS 'BODY'

FROM @xml.nodes('//root') T(C)

GO


Still getting the Error Message:
Msg 9402, Level 16, State 1, Line 2
XML parsing: line 1, character 39, unable to switch the encoding
Now, as we are dealing with UNICODE data so the XML string should also be of UNICODE type, so I’ll prefix ‘N’ before the string starts in SET statement:



-- Add N for to make the xml string Non-Unicode:

DECLARE @xml as XML

SET @xml =

N'<?xml version="1.0" encoding="utf-16"?>

<root>

    <to>Gaurav</to>

    <from>Garvit</from>

    <subject>Joyeux anniversaire</subject>

    <body>Je vous souhaite une très très heureux anniversaire!</body>

</root>'


SELECT

    T.C.value('to[1]', 'VARCHAR(10)') AS 'TO',

    T.C.value('from[1]', 'VARCHAR(10)') AS 'FROM',

    T.C.value('subject[1]', 'VARCHAR(50)') AS 'SUBJECT',

    T.C.value('body[1]', 'VARCHAR(200)') AS 'BODY'

FROM @xml.nodes('//root') T(C)

GO
Correct expected output:-
TO FROM SUBJECT   BODY
Gaurav Garvit Joyeux anniversaire Je vous souhaite une très très heureux anniversaire!
As per MS BOL: “SQL Server stores XML data using the UTF-16 encoding scheme. Because UTF-16 data is variable-width, it is processed according to a byte-oriented protocol. This means that UTF-16 data can be treated in a way that is independent of the byte ordering on different computers (little endian versus big endian). Therefore, UTF-16 is well-suited for traversing different computers that use different encodings and byte-ordering systems. Because XML data is typically shared widely across networks, it makes sense to maintain the default UTF-16 storage of XML data in your database, and when you export XML data to clients.”
More info on: http://msdn.microsoft.com/en-US/library/ms145866%28v=SQL.90%29.aspx

No comments: