Friday, October 21, 2005
Handling Control Characters in XML
I started my application with the decision that it would use XML for data communication. It worked well for a long time, and I was happy. Then the day came when my application broke because the XML was not being parsed properly. I had to fix the issue, but I was no longer happy about my decision to use XML for the kind of data I am working with. I was really sad when I tried to open the XML file isocontrol.xml in the browser and it reported a parsererror.
It was really hard for me now, as some character I had not expected was getting into the XML from my application. I set out to fix this and looked around the web to learn more about this kind of problem. Many people suggested using a different encoding; I tried that, but it did not fix it. Another suggestion was to encode the data in Base64 (as we do for binary data like images), but I was not sure I should do this, as my application handles far more text than binary data.
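For reference, the Base64 route mentioned above could look roughly like the sketch below. It assumes java.util.Base64 (Java 8 and later) and UTF-8 text; the class and method names are made up for illustration. The cost is that the payload is no longer readable in the XML file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: wraps text as Base64 so no control character
// ever appears literally in the XML document.
final class Base64Payload {
    /** Encodes the text's UTF-8 bytes as a Base64 string. */
    public static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode(): Base64 string back to the original text. */
    public static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "a\u0006b"; // text containing CONTROL-F
        String wrapped = "<data>" + encode(raw) + "</data>";
        System.out.println(wrapped);
        System.out.println(decode(encode(raw)).equals(raw));
    }
}
```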
It was late in the night, and I had no choice left except to read the XML file and find out which character was causing the problem. I figured out that it was the character with integer value 6, or CONTROL-F, that made my XML go for a toss. I then looked at some references which stated that control characters from 0 to 31 are not handled by most XML parsers and lead to parser failures. So all I had to do was convert the control characters to entity references, and I wrote a small fix for my application. The source is below.
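Hunting for the offending character by eye is slow, so here is a minimal sketch of how the search could be automated. It is not the code from the application; ControlCharScan and firstControlChar are hypothetical names, and it deliberately skips tab, LF and CR because those are ordinary whitespace in XML.

```java
// Hypothetical diagnostic: report where the first troublesome control
// character sits in a piece of text.
final class ControlCharScan {
    /**
     * Returns the index of the first ISO control character other than
     * tab, LF or CR, or -1 if the text contains none.
     */
    public static int firstControlChar(String text) {
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isISOControl(ch) && ch != '\t' && ch != '\n' && ch != '\r') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "ab\u0006c"; // CONTROL-F at index 2
        int at = firstControlChar(text);
        if (at >= 0) {
            System.out.println("index " + at + ", value " + (int) text.charAt(at));
        }
    }
}
```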
The few lines of code in toXMLdata were worth a lot to my application. All it does is this: when it sees an ISO control character (0 to 31; Character.isISOControl also matches 127 to 159), it replaces it with an entity reference (&#integervalue;).
/**
 * Get proper xml string value (with ISOControl characters converted to
 * entity references).
 * Note: XML 1.0 parsers reject references to control characters other
 * than tab, LF and CR.
 * @param source String which has to be converted
 * @param startTag Start tag of the xml data (null if not needed)
 * @param endTag End tag of the xml data (null if not needed)
 * @return String with ISOControl characters converted to entity references.
 */
public static String toXMLdata(String source, String startTag, String endTag) {
    StringBuffer xmldatasb = new StringBuffer();
    if (startTag != null) xmldatasb.append(startTag);
    // A null or empty source contributes nothing; appending null
    // directly would emit the literal text "null".
    if (source != null) {
        for (int index = 0; index < source.length(); ++index) {
            char sourceCh = source.charAt(index);
            if (Character.isISOControl(sourceCh))
                xmldatasb.append(entityReference(sourceCh));
            else
                xmldatasb.append(sourceCh);
        }
    }
    if (endTag != null) xmldatasb.append(endTag);
    return xmldatasb.toString();
}

/**
 * Get the entity reference for the character.
 * @param sourceCh character whose entity reference is needed.
 * @return Entity reference &#integer;, where integer corresponds
 * to the number that represents sourceCh.
 */
public static String entityReference(char sourceCh) {
    return "&#" + (int) sourceCh + ";";
}
Comments:
"All it does it when it sees the iso control characters (0 to 31) it will be replaced by entity references (&#integervalue;)"
Just as a comment on this for posterity: This isn't actually well-formed XML. According to the spec, the only control characters that are supported are tab (&#9;), LF (&#10;), CR (&#13;).
http://www.w3.org/TR/2004/REC-xml-20040204/#NT-Char
In the real world, this matters: the parsers that come with JDK 6 fail to parse even the character reference.
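To make that concrete, here is a minimal sketch that checks characters against the Char production of the XML 1.0 spec linked above and tries the JDK's default SAX parser on both forms. The class and method names (Xml10Chars, isLegalXml10, stripIllegal, parses) are hypothetical, and dropping illegal characters is only one possible policy; replacing them with a placeholder or Base64-encoding the field are alternatives.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class Xml10Chars {
    /** True if the code point matches the XML 1.0 Char production. */
    public static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow. */
    public static String stripIllegal(String source) {
        StringBuilder sb = new StringBuilder(source.length());
        source.codePoints()
              .filter(Xml10Chars::isLegalXml10)
              .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    /** True if the JDK's default SAX parser accepts the document. */
    public static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<d>a\u0006b</d>"));  // raw CONTROL-F
        System.out.println(parses("<d>a&#6;b</d>"));    // character reference to it
        System.out.println(parses("<d>" + stripIllegal("a\u0006b") + "</d>"));
    }
}
```

Only the stripped version is expected to parse; the raw character and the &#6; reference are both outside what XML 1.0 permits.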