Wednesday, 30 December 2009

XML Encoding & Special Characters

On another day at work (right after the holidays, too) I ran into a problem with XML and character encoding. I did not know much about the topic, so it took me a while to figure out (with some asking around for help), but it turned out to be a very valuable lesson for me.

You see, the XML file that caused the error contained some special characters. By 'special characters' I mean ones like these (see link): Special Characters and Symbols, e.g. ™, • or √.

The 'actual' symbols you see above are Unicode characters, which an encoding such as UTF-8 can store directly. However, an XML file may instead be stored in Latin-1 (ISO-8859-1). Latin-1 uses a single 8-bit byte per character, so it can only represent the 256 code points from 0 to 255. UTF-8, on the other hand, uses one to four bytes per character and can represent every Unicode character. So some characters that can be stored in UTF-8 cannot be converted directly into Latin-1. In these cases, the ampersand-encoded version (a numeric character reference) may be used instead to represent the same character in a Latin-1 file, e.g. &#8482; &#8226; &#8730;. The parser relies on the declared encoding to turn the file's bytes back into characters, which is why it is important to make sure the 'encoding' listed in your XML file matches the encoding the file is actually saved in.
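
To make the Latin-1 limit concrete, here is a small Java sketch of the idea: every character above code point 255 is swapped for its numeric character reference, and everything else is left alone. The class and method names are made up for this example, and it deliberately does not escape XML markup characters such as & or <, so treat it as an illustration rather than a complete escaper.

```java
final class Latin1Escaper {

    /**
     * Replaces each code point that Latin-1 (ISO-8859-1) cannot hold,
     * i.e. anything above U+00FF, with a decimal numeric character
     * reference such as &#8482;. Hypothetical helper for illustration.
     */
    static String escapeForLatin1(String text) {
        StringBuilder out = new StringBuilder();
        // codePoints() keeps surrogate pairs together as one code point.
        text.codePoints().forEach(cp -> {
            if (cp <= 0xFF) {
                out.appendCodePoint(cp);                    // fits in one Latin-1 byte
            } else {
                out.append("&#").append(cp).append(';');    // needs a reference
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u2122 = trade mark sign, \u2022 = bullet, \u221A = square root
        String symbols = "\u2122 \u2022 \u221A";
        // returns "&#8482; &#8226; &#8730;"
        System.out.println(escapeForLatin1(symbols));
    }
}
```

The symbols are written as Unicode escapes in the source so the example does not depend on the encoding of the .java file itself.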

Now, the actual problem I was experiencing is, I believe, related to something called HTML Purifier. See this very useful web page, which tells you all about it (and gives a good explanation of XML encoding)! I was trying to write some Java classes to remove some unwanted nodes from an XML file's DOM tree. However, every time I passed in an XML file with 'special characters', they came back out as question marks ('?'). The problem was that I got fooled by the encoding that was listed in the XML file, and also by the unexpected kindness of Java in trying to help me by purifying the XML file.
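
One plausible way such question marks can appear (I am not claiming this is exactly what my original code did) is converting text to bytes with an encoding that cannot hold the character. In Java, String.getBytes(Charset) silently substitutes the charset's replacement byte for anything it cannot map, and for Latin-1 that replacement is '?'. The class and method names below are made up for the sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LossyLatin1Demo {

    /** Encodes the text to bytes and decodes it again with the same charset. */
    static String roundTrip(String text, Charset charset) {
        // getBytes replaces unmappable characters instead of throwing.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String tradeMark = "\u2122"; // trade mark sign, outside Latin-1

        // Latin-1 has no byte for U+2122, so it comes back as "?"
        System.out.println(roundTrip(tradeMark, StandardCharsets.ISO_8859_1));

        // UTF-8 can hold it, so the round trip returns the original string
        System.out.println(roundTrip(tradeMark, StandardCharsets.UTF_8).equals(tradeMark));
    }
}
```

Once the character has been turned into '?', the original is gone for good, so the fix has to happen before the bytes are written.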

The problem was resolved by understanding how Java and XML work together. Most importantly, the encoding in an XML file is not there purely for the sake of metadata: it is very important in determining which encoding / decoding to use when writing Java classes to process the XML file.
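
Here is a minimal sketch of that lesson using the standard JAXP classes. It is not my original code, and the class name, method name and tag names are invented for the example. The two points it tries to show are: hand the parser raw bytes (an InputStream) so it can read the encoding declaration itself, and tell the Transformer which encoding to write so the output declaration and the output bytes agree.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class XmlNodeRemover {

    /** Removes every element named tagName and serializes in outputEncoding. */
    static byte[] removeElements(byte[] xmlBytes, String tagName, String outputEncoding)
            throws Exception {
        // Parsing from bytes lets the parser honour the XML declaration's encoding.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));

        // The NodeList is live, so walk it backwards while removing.
        NodeList unwanted = doc.getElementsByTagName(tagName);
        for (int i = unwanted.getLength() - 1; i >= 0; i--) {
            Node node = unwanted.item(i);
            node.getParentNode().removeChild(node);
        }

        // Write to a byte stream and state the encoding explicitly.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, outputEncoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // \u2122 = trade mark sign, \u221A = square root
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<doc><keep>\u2122 \u221A</keep><drop>x</drop></doc>";
        byte[] result = removeElements(xml.getBytes(StandardCharsets.UTF_8), "drop", "UTF-8");
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

Passing "ISO-8859-1" as the output encoding is also worth trying: as far as I can tell, the serializer bundled with the JDK then writes characters that do not fit as numeric character references rather than question marks, though I would check that against whichever XML library is actually on the classpath.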

Additional ref: Processing XML with Java
