May 08, 2004
A Token Effort
This week, I began working on a task to parse a CSV file (generated from an Excel spreadsheet) and construct an object model based on the contents of the file, whilst applying a simple set of row-based validations to the CSV. "Simple!", thinks I, "I'll just start with my old friend StringTokenizer and go from there..."
How wrong I was!
It seems my friendship with StringTokenizer is one based on deceit as it singularly failed to do what I wanted (nay - expected) it to do. I'll use a simple test to demonstrate exactly where my frustrations lay:
public void testTokenizerNullTokenHandling() {
    StringTokenizer tokenizer = new StringTokenizer("a,b,,d,e,,", ",");
    assertEquals(7, tokenizer.countTokens()); // fails
    assertEquals(4, tokenizer.countTokens()); // passes
}
To perform my validation correctly, I needed to be able to tell when a field in my source spreadsheet had no value, thus producing two consecutive commas. However, my comrade StringTokenizer treats consecutive delimiters as a single one, thus removing all knowledge that an empty position existed.
This causes me all sorts of pain: as each column in the spreadsheet, and therefore each field in the CSV, holds data of a different type, treating consecutive delimiters in this manner causes fields to ooze left into neighbouring positions when empty values are supplied.
Interestingly, my second solution returned results slightly more in alignment with what I needed, but still not sufficient. Using split from java.lang.String resulted in a structure that included embedded empty values, but not the ones at the end of the string being processed.
public void testSplitNullTokenHandling() {
    String[] tokens = "a,b,,d,e,,".split(",");
    assertEquals(7, tokens.length); // fails
    assertEquals(5, tokens.length); // passes
}
So, at this point I turned to my own means for a solution, and came up with:
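protected List tokenizeSpec(String spec) {
    ArrayList tokens = new ArrayList();
    int delimiterPos = spec.indexOf(ATTRIBUTE_DELIMITER);
    while (delimiterPos >= 0) {
        // Everything before the next delimiter is one field
        String token = spec.substring(0, delimiterPos);
        // Blank fields are stored as null so later fields keep their positions
        tokens.add(token.trim().length() > 0 ? token : null);
        // Drop the consumed field and its delimiter, then look for the next one
        spec = spec.substring(delimiterPos + 1);
        delimiterPos = spec.indexOf(ATTRIBUTE_DELIMITER);
    }
    // Whatever follows the last delimiter (possibly nothing) is the final field
    tokens.add(spec.trim().length() > 0 ? spec : null);
    return tokens;
}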
This way, I get a nice data structure out the other end that I can quickly query for information such as "is there an element in field 3?". Admittedly, I lose the ability to determine how many values have been supplied, but if needed I could abstract this list and method into a class and provide a method to return exactly that.
In summary, I'm not particularly miffed that my passing acquaintance StringTokenizer and its partner in crime String.split tokenize the way they do by default, but I was a little surprised there isn't some way to override the default implementation to provide such functionality, as I imagine it would be a fairly common need.
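For what it's worth, such a class might look something like the sketch below. The names FieldList, fieldCount, hasValue and suppliedCount are all invented for illustration, and it assumes a single-character delimiter and that blank fields should count as "not supplied".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper around the tokenized fields; all names are illustrative. */
final class FieldList {
    private final List<String> fields = new ArrayList<>();

    /** Splits spec on a single-character delimiter, storing blank fields as null. */
    FieldList(String spec, char delimiter) {
        int start = 0;
        int pos;
        while ((pos = spec.indexOf(delimiter, start)) >= 0) {
            add(spec.substring(start, pos));
            start = pos + 1;
        }
        // The text after the last delimiter (possibly empty) is the final field
        add(spec.substring(start));
    }

    private void add(String token) {
        fields.add(token.trim().isEmpty() ? null : token);
    }

    /** Number of field positions, empty ones included. */
    int fieldCount() {
        return fields.size();
    }

    /** True if the field at the given zero-based index holds a value. */
    boolean hasValue(int index) {
        return fields.get(index) != null;
    }

    /** Number of fields that actually hold a value. */
    int suppliedCount() {
        int count = 0;
        for (String field : fields) {
            if (field != null) {
                count++;
            }
        }
        return count;
    }
}
```

With the string from the tests above, new FieldList("a,b,,d,e,,", ',') reports 7 field positions, of which 4 are supplied.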
Footnote: I'm extremely open to comments along the lines of "You, sir, are a fool of the highest order, pray tell why thou didn'tst use the bletch method from the foobar class to achieve your aim?"
Posted by Andy Marks at May 8, 2004 08:24 AM
Comments
Try this;
String s = "a,b,,d,e,,";
assertEquals(7, ((String[])s.split(",", -1)).length);
From the javadoc on String.split();
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
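To make the quoted rules concrete, here is a minimal sketch that counts the fields produced for a few limit values. The class and method names (SplitLimitDemo, countFields) are made up for the example; only String.split itself comes from the JDK.

```java
/** Illustrative only: shows how the limit argument changes the result of String.split. */
final class SplitLimitDemo {

    /** Returns how many fields split produces for a comma-separated line. */
    static int countFields(String line, int limit) {
        return line.split(",", limit).length;
    }

    public static void main(String[] args) {
        String line = "a,b,,d,e,,";
        System.out.println(line.split(",").length); // 5: same as limit 0
        System.out.println(countFields(line, 0));   // 5: trailing empty strings discarded
        System.out.println(countFields(line, -1));  // 7: trailing empty strings kept
        System.out.println(countFields(line, 3));   // 3: last entry is ",d,e,,"
    }
}
```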
Posted by: Jon Eaves at May 8, 2004 11:42 AM
Andy, you need to read the Javadoc... when not treating delimiters as tokens themselves (the default behaviour), it returns tokens which are defined as "a maximal sequence of consecutive characters that are not delimiters". In other words, your expectations were wrong. :)
If you need to use StringTokenizer, you could tell it to return delimiters as well. Then, a simple check to see if a token was a delimiter would be enough to help you out.
But StringTokenizer is legacy! The right way is what Jon says: "a,b,,d,e,,".split(",", -1).
My rule of thumb is that whenever my (naive) expectations are shattered, there was something I didn't understand. I try to learn about things I don't understand. :)
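A rough sketch of that approach, assuming a single-character delimiter and representing empty fields as null (the class and method names are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Illustrative only: recovers empty fields by asking StringTokenizer for the delimiters too. */
final class DelimiterAwareTokenizer {

    /** Assumes delimiter is a single character, e.g. ",". */
    static List<String> fields(String line, String delimiter) {
        List<String> fields = new ArrayList<>();
        // The third argument (returnDelims = true) makes delimiters come back as tokens
        StringTokenizer tokenizer = new StringTokenizer(line, delimiter, true);
        boolean lastWasDelimiter = true; // the start of the line counts as a field boundary
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (token.equals(delimiter)) {
                if (lastWasDelimiter) {
                    fields.add(null); // two boundaries in a row: an empty field
                }
                lastWasDelimiter = true;
            } else {
                fields.add(token);
                lastWasDelimiter = false;
            }
        }
        if (lastWasDelimiter) {
            fields.add(null); // the line ended on a boundary: a trailing empty field
        }
        return fields;
    }
}
```

For "a,b,,d,e,," this yields seven entries, with null at positions 2, 5 and 6.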
Oh, boy, am I going to hassle you about this on Monday. *laugh*
Posted by: Robert Watkins at May 8, 2004 09:38 PM
Cheers for the feedback guys.
Robert, to circumvent the roasting it looks like I'm heading for, my expectations of StringTokenizer were not based on what I believed it did, but rather on what I wanted it to do. I soon realized from having read the Javadoc (obviously not for split though :-)) that it wasn't written to perform in that manner.
Goody, I can now refactor/remove my hand-coded tokenizer!
Thanks again all.
Posted by: Andy Marks at May 9, 2004 08:27 PM