Friday, October 21, 2005

Handling Control Characters in XML

I started my application with the decision that it would use XML for data communication. It worked well for a long time, and I was happy. Then the day came when my application started breaking because the XML was not being parsed properly. I had to fix the issue, but I was no longer happy about my decision to use XML for the kind of data I am working with. I was really sad when I tried to open the XML file isocontrol.xml in the browser and it reported a parsererror.

It was really hard for me now, as some character was getting into the XML from my application (which I had not expected). I started looking around the web to learn more about this kind of problem. Many people suggested using a different encoding; I tried that, but it did not fix the problem. Another solution was to encode the data in Base64 (as we do for binary data like images), but I was not sure I should do this, as my application handles more text than binary data.
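
For comparison, here is a minimal sketch of what the Base64 route could look like, assuming Java 8 or later for java.util.Base64. The class name, the method names, and the <data> tag are made up for illustration and are not part of my application.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: wraps arbitrary text as Base64 so that no control
// character ever reaches the XML document.
class Base64TextDemo {

    // Encode the text's UTF-8 bytes as a Base64 string.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Reverse of encode(): Base64 string back to the original text.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "ack" + (char) 6 + "done";   // contains CONTROL-F
        String xml = "<data>" + encode(raw) + "</data>";
        System.out.println(xml);
        System.out.println(decode(encode(raw)).equals(raw));
    }
}
```

The trade-off is the one I was worried about: the payload is no longer readable as text inside the XML, and it grows by roughly a third.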

It was late at night, and I had no other choice but to read the XML file and find out which character was causing the problem. I figured out that it was the character with integer value 6, or CONTROL-F, that had made my XML go for a toss. I then looked at some references which stated that the control characters from 0 to 31 (other than tab, line feed, and carriage return, which XML allows) are not handled by most XML parsers and will lead to parser failures. So all I had to do was convert the control characters to entity references (strictly speaking, numeric character references). So I wrote a small fix for my application. Link to Source.
/**
 * Get proper xml string value (with ISOControl characters converted to
 * entity references)
 * @param source String which has to be converted (null is treated as empty)
 * @param startTag Start tag of the xml data (null if not needed)
 * @param endTag End tag of the xml data (null if not needed)
 * @return String with ISOControl characters converted to entity references.
 */
public static String toXMLdata(String source, String startTag, String endTag) {
    StringBuffer xmldatasb = new StringBuffer();
    if (startTag != null) xmldatasb.append(startTag);

    // A null source contributes nothing; appending it would emit the text "null".
    if (source != null) {
        for (int index = 0; index < source.length(); ++index) {
            char sourceCh = source.charAt(index);
            if (Character.isISOControl(sourceCh))
                xmldatasb.append(entityReference(sourceCh));
            else
                xmldatasb.append(sourceCh);
        }
    }
    if (endTag != null) xmldatasb.append(endTag);
    return xmldatasb.toString();
}

/**
 * Get the entity reference for the character.
 * @param sourceCh character whose entity reference is needed.
 * @return Entity Reference &#integer;, where integer corresponds
 * to the number that represents sourceCh.
 */
public static String entityReference(char sourceCh) {
    return "&#" + (int) sourceCh + ";";
}

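These few lines of code were worth a lot to my application. Whenever the code sees an ISO control character, it replaces it with an entity reference (&#integervalue;). Note that Character.isISOControl covers the values 127 to 159 as well as 0 to 31. One caveat: XML 1.0 forbids most of these characters even when they are written as references (only &#9;, &#10;, and &#13; are legal), so a strict XML 1.0 parser will still reject something like &#6;. XML 1.1 does allow such references, except for &#0;.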
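
Reading the file by hand to find the offending character, as I did that night, can also be automated. The following is only a sketch: the class and method names are hypothetical, and the check follows the Char production of the XML 1.0 specification rather than anything in my application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical diagnostic helper: reports every code point in a string
// that XML 1.0 does not allow in a document.
class XmlCharScanner {

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns one entry per illegal code point, giving its index and value.
    static List<String> findIllegal(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                hits.add(String.format(Locale.ROOT, "index %d: U+%04X", i, cp));
            }
            i += Character.charCount(cp);   // step over surrogate pairs correctly
        }
        return hits;
    }

    public static void main(String[] args) {
        // CONTROL-F (value 6) sits at index 3 of this sample string.
        System.out.println(findIllegal("ack" + (char) 6 + "done"));   // prints [index 3: U+0006]
    }
}
```

Running a check like this on the data before it is written out shows exactly which characters need attention, instead of waiting for the parser to fail.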

Monday, October 03, 2005

Using the kjParsing Module

This Sunday I came across the kjParsing module, "A parser generator written in Python for Python". It is good for writing experimental translators, code generators, interpreters, or compilers. I wanted to work with it and get at least a simple application built on it. By the end of the night, yep, I got my "Simple Calculator Program" working :). I chose a calculator program because it gave me a chance to play around with defining different grammar rules, which are very simple. Here is the Source. You can download the complete source here.
