Stop Rolling Your Own CSV Parser!
secretGeek .:dot Nuts about dot Net:.
home .: about .: sign up .: sitemap .: secretGeek RSS

Stop Rolling Your Own CSV Parser!

Would you write your own XML Parser? Only if you're f***ing crazy.

Yet developers constantly write their own "little" csv parsers.

How does this madness occur?

Step 1 -- Ignorance

"Oh this will be easy, I'll just read the file one line at a time, calling String.Split(',') to break each line into an array.

"Then I'll be able to refer to each item by number."

(You're already headed for stormy water... anything you do from now on will only drive you into the rocks harder and faster...)

continues...

Step 2 -- First Doubts

"Oops. I need to handle for commas, which are either escaped (by prefixing them with a special symbol) or contained inside quotes."

(So you decide to use regular expressions. After a bit of tinkering you've got a nice little regular expression that seems to work.)

(That ringing in your ears is Jamie Zawinski saying "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems. )

Step 3 -- Uh oh

"The quotes worked good at first... but quotes need to be escaped too. And sometimes there's double quotes, sometimes single quotes. Easy -- I'll just fix my regular expression."

Step 4 -- The Descent into Chaos

You start to adopt a 'test-driven' approach, only it's more of a 'panic driven' approach. You write numerous test cases for your unwieldy csv parser. It behaves nice.

You test it on more real world examples... it breaks your existing code and you need a new test case or two.

You begin to add new test cases, and trying always to do the simplest thing that will get the code to work.

It's now eight weeks since you said "I know! I'll just use String.Split(...". You have grown a long beard, which is particularly annoying as you are a woman. You have lost all boundaries in regard to personal hygiene. Managers circle your desk like vultures circling a wounded leopard.

Step 5 -- Enough!

You lift your head from the keyboard for just moment when a thought strike you. The problems you are facing have been faced before. You are re-inventing the wheel.

You download a code sample from the internet, and use your test cases to try them out. The downloaded code is much worse than what you've written yourself.

You download more samples from the internet. They're all broken. In. Different. Ways.

When you try to contact the developers of each library to see how they work, you find that the developers have generally retired and/or passed away and/or quit working in the IT industry. You consider how fortunate they are.

Step 5 -- Help me!

You go to the blog of someone you know and trust. You email that person. That person writes back and says, in big letters:

Just Use Marcos Meli's File Helpers.

The great thing about File Helpers is not just that it works, but that it is actively developed by Marcos, and if you need an improvement to it, you can contact Marcos (marcosdotnet at yahoo.com.ar). He's a real person who cares about getting his library to work properly. He's not just stopping at a 'good-enough' solution.

The other, and perhaps even greater, advantage is one you never dreamed of. Now you don't have to refer to fields by number. No more "myArray[4]" -- you can now say "myCustomer.Id".

The resulting code is so readable that you'll survive your next code inspection without getting your arms and legs torn off by Terry (Head Code Nazi and leader of the local chapter of The Programming Gestapo).

You can stop re-inventing the wheel and get on with your day job: cranking out more bugs, faster.

(But it's a good thing this experience gave you a chance to try out test driven development!)





'Daniel' on Wed, 13 Sep 2006 01:34:23 GMT, sez:

In the past I've used the ODBC text drivers. Haven't looked at a Dot Net equivalent though.



'lb' on Wed, 13 Sep 2006 01:52:50 GMT, sez:

i was gonna do a side track about the ODBC text drivers. they're excellent and worth while -- but they're still a potential dead end.

two problem i have with them. one, you've got this opaque technique with it's own limitations (e.g. max 256 columns [might be out of date]) that act as real brick walls when/if you hit them.

second, i see treating a csv file as a database table to be a potentially 'leaky abstraction' (in JoelSpeak)

having said that ymmv of course.



'Erv' on Wed, 13 Sep 2006 02:05:30 GMT, sez:

They do look interesting, but unfortunatly, they have an unacceptable license (at least for the company I work for). We'll have to keep rolling our own :(



'lb' on Wed, 13 Sep 2006 02:08:12 GMT, sez:

who doesn't have an acceptable license? the odbc text drivers? or filehelpers?



'Gavin' on Wed, 13 Sep 2006 04:01:26 GMT, sez:

I started reading: "Stop rolling your own CVS parser" and thought "What the hell!?!? nobody does that! Is this guy just making up stories now!?"

and then it clicked.

And then I felt really, really guilty.



'Farmer Jeb' on Wed, 13 Sep 2006 04:34:13 GMT, sez:

Are you just having a go at me because you KNOW I do this?!



'lb' on Wed, 13 Sep 2006 04:36:55 GMT, sez:

hey i do it too. over and over. in the last week i've worked with three different groups of people who have all done it too. and we've all hit the same problems. this is more of a go at myself than anyone jeb -- though the stringbuilder lesson we learnt today certainly *did* bring you to mind.



'Farmer JEB' on Wed, 13 Sep 2006 04:45:35 GMT, sez:

Yes, of course I told you about how when I eventually (1.5 months later) followed your suggestion of using a StringBuilder it sped my program up several orders of magnitude. The latest innovation was to write a wrapper around Split to put its results in a List instead of an array (to facilitate Removing subarrays of data efficiently).



'Dave' on Wed, 13 Sep 2006 05:09:04 GMT, sez:

Wow, what a coincidence. I just wrote a CSV parser today. It's certainly not the first time I've done this and it's not quite as bad as you make it out to be :) I'd never consider regular expressions for this so maybe that's why I don't have as many problems.

Historically, my issues tend to be with the people on the other end who like to randomly decide to move columns around, add new columns, etc. without telling me. No library can help with that unfortunately.



'Chris Wallace' on Wed, 13 Sep 2006 09:29:17 GMT, sez:

I stick to the good ole' pipe | as it "should" never be in legitimate text.



'anon' on Wed, 13 Sep 2006 10:01:06 GMT, sez:

Don't you work for a company that has built a 'little' csv parser?

http://msdn2.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx



'Marcos' on Wed, 13 Sep 2006 11:29:49 GMT, sez:

Leon
Thanks a lot for spread the FileHelpers to the community and for your comments about it =)

Dave:

"Historically, my issues tend to be with the people on the other end who like to randomly decide to move columns around, add new columns, etc. without telling me. No library can help with that unfortunately."

If you the FileHelpers you only need to move a field up or down and you can make some checks of types and length to ensure that your files are not modified.

Cheers





'b0n' on Wed, 13 Sep 2006 12:19:29 GMT, sez:

This sounds fantastic. I was always too humble/indignant to write my own CSV parser, but instead wasted those 1.5 months searching for decent code on the Internet. What a load. It's nice to see somebody focusing on the basics.

Now if you could get Microsoft to package this with every crappy copy of SQL Server Express, you'd be up for a Nobel peace prize. I was in a warlike mood most of the time. It should be easy!!!

Thanks for pointing it out Leon. Thanks for doing it Carlos.



'lb' on Wed, 13 Sep 2006 20:48:22 GMT, sez:

>Don't you work for a company that has
>built a 'little' csv parser?

hey i don't work for microsoft! i work for advantechsoftware.com.au
We're better than microsoft ;-) Smaller though.



'Haacked' on Wed, 13 Sep 2006 23:28:04 GMT, sez:

Why write a CSV parser when you can write a domain specific programming language for parsing CSV files?



'A Nony Mouse' on Mon, 18 Sep 2006 18:07:27 GMT, sez:

Score one for XML - you are guaranteed a parser that can work

There is "Standard" for CSV files so any solution can fail on your specific data, and probably it does.



'Julian Morrison' on Wed, 08 Nov 2006 07:09:40 GMT, sez:

Experienced coders would spot the danger instantly. Any delimited format holding arbitrary data needs quoting, which rules out pure regex and implies a need for escapes. Someone who's been around the block a few times will see all of the above in a single glance and go straight to a parser generator.



'Dave' on Wed, 08 Nov 2006 11:18:58 GMT, sez:

Writing a CSV parser is not hard, takes only a few hours tops with unit tests--but only for programmers who understand the facts you list. I hope all beginning programmers see this webpage and decide whether they understand the rules or not.

How to write it: Write the parser until it starts to run on simple cases. Then stop. Figure out what few simple homemade parsing functions would help make you parser handle all official rules. Then start all over and write the parser so that it is relatively simple due to calling your smart parsing functions.

If you control the data being input, you can restrict it to a subset of CSV rules. Programming that takes only minutes.

Write it yourself to avoid licensing problems. Also so that you can re-use it on other projects.

But, if you are reading in major amounts of data from outside of your control, you should buy (with money) a real full-blown CSV parser library. Or, expect to spend months on all the degenerative cases for poorly formed input data. Otherwise, what happens when your client Foobar Industries sends you a gigabyte file of pseudo CSV and you must load it *today*. It doesn't quite follow the official rules. Your parser needs to be very smart.



'Pierre' on Wed, 08 Nov 2006 14:24:05 GMT, sez:

Thanks for the link to Marcos, looks like a great utility.

(BTW I did write my own CSV/tab-delimited parser in a couple of days).

Also, the TimeSnapper plug is fantastic. Just what I've been looking for.



'RegExp Santero' on Thu, 09 Nov 2006 01:16:07 GMT, sez:

This regex matches quotes containing backslashed quotes signs:

"[^\\"]*(\\.[^\\"]*)*"

It's pure evil, but I learnt it by heart (if you stare at it long enough it starts making sense) and now it serves me well.

Once you can barf that off your fingers at command csv parsing it's finally feasible.



'Sébastien Lorion' on Wed, 22 Nov 2006 16:55:56 GMT, sez:

If you need *real* performance AND flexibility, you can try my CsvParser library on CodeProject:

http://www.codeproject.com/cs/database/CsvReader.asp

I know ... shameless plug ;)



'Sébastien Lorion' on Wed, 22 Nov 2006 17:00:33 GMT, sez:

BTW, nice blog, just discovered it ;)



'Percival' on Fri, 24 Nov 2006 03:58:58 GMT, sez:

I've used Sebastien's CSV parser --- WELL DONE mate! Excellent and integrated with what I was doing... But then I found Odbc Text drivers and it allows csv to be queried using SQL - so i stuck with it. But all in all I give a big thumbs up for Sebastien's piece of excellent work -- and it's free too! Thanks.



'Percival' on Fri, 24 Nov 2006 04:01:02 GMT, sez:

BTW... I found Seb's CSV parser about 5 months ago and it was the best (and fastest) by far after going through weeks of testing the others. However, I moved to Odbc text drivers only because of SQL query capabilities. I am keeping Seb's CSV parser close at hand in case I may require it in future. Thanks again Seb.



'Sébastien Lorion' on Mon, 27 Nov 2006 08:55:40 GMT, sez:

Your welcome! Always glad to be able to pay back when I can ;)



'http://mandar.date.googlepages.com' on Fri, 02 Mar 2007 11:03:51 GMT, sez:

If you want simple function, try this..
http://www.codeproject.com/useritems/Basic_CSV_Parser_Function.asp



'lb' on Sat, 03 Mar 2007 03:15:52 GMT, sez:

Thanks Mandar -- i like the approach of your code.



'TheDigitalHippy' on Fri, 09 Mar 2007 21:30:14 GMT, sez:

Ok. I'll admit it.. I've done the parser thing. Well, the vultures (managers) wouldn't get near my desk. I bite.

But the "Error: unknown error" is really freaking me out man!

Thanks for the link!
TheDigitalHippy



'Jeff Zanooda' on Sat, 10 Mar 2007 00:28:41 GMT, sez:

I'm sorry, I don't get it. What's so hard about writing a CSV parser? It's just a finite state machine (one state variable + one large switch statement). And why would one want to use regular expressions here?



'radix' on Sat, 10 Mar 2007 08:15:04 GMT, sez:

@jeff:

well, then sit down and try it.
this is not talking about data that you can control, this is about handling data which you can`t and which is not (allways) following rules.
But still you have to turn garbage to data.



'zepolen' on Sat, 10 Mar 2007 08:34:18 GMT, sez:

or you could just make sure you escape the commas before they go into the csv file...a lot simpler don't ya think?

<?
$file='bob.csv';
$fieldnames = array('ID', 'Name', 'Description');
$lines = file($file);
$l = 0;
foreach($lines as $line) {
$line = str_replace('\\,', '$$££**COMMA**$$££', $line);
$fields = explode(',', $line);
$i = 0;
foreach($fieldnames as $fieldname) {
$value = str_replace('$$££**COMMA**$$££', ',', $fields[++$i]);
$csv[$l][$fieldname] = $value;
}
$l++;
}
print_r($csv);
?>

even php can take care of it



'lb' on Sat, 10 Mar 2007 09:48:00 GMT, sez:

"escape the commas before they go into the csv file"

sorry Zepolen -- but generally when you have a nasty csv parsing problem, you're dealing with someone else's csv -- so how you would write it is irrelevant.

i like (and am a little amused by) your code -- replacing the commas with a weird string, then replacing the weird string later. i've done it myself... but it's not a great trick in the long run.

lb



'pop' on Sat, 10 Mar 2007 12:28:16 GMT, sez:

@zepolen check out fgetcsv in the PHP manual.



'Bob' on Sun, 11 Mar 2007 21:41:18 GMT, sez:

Use tab-delimited. Anything that can output CSV can output tab-delimited. About the only thing that makes a mess is storing file locations on Windows.



'lb' on Sun, 11 Mar 2007 21:43:53 GMT, sez:

bob you freakin reddit troll nitwit --

"generally when you have a nasty csv parsing problem, you're dealing with someone else's csv"

got that?? it's not about how you GENERATE the csv -- it's about how you parse it.

And TABS offer no benefit over commas, you undergrad, inexperienced twerp.

damn i'm sick of stupid comments. I blame Des Traynor ;-)



'Augusto' on Wed, 14 Mar 2007 00:42:02 GMT, sez:

Sadly, File Helpers is not cross platform enough. For something so simple, that's a big limitation.



Comments closed due to spam. Sorry.

Articles

The Canine Pyramid The Canine Pyramid
Humans: A Tragedy. Humans: A Tragedy.
ACK! ACK!
OfficeQuest... Gamification for the Office Suite OfficeQuest... Gamification for the Office Suite
New product launch: NimbleSET New product launch: NimbleSET
Programming The Robot from Diary of a Wimpy Kid Programming The Robot from Diary of a Wimpy Kid
Happy new year 2014 Happy new year 2014
Downtime as a service Downtime as a service
The Shape of Your Irrationality The Shape of Your Irrationality
This is why I don't go to nice restaurants any more. This is why I don't go to nice restaurants any more.
A flowchart of what programmers do at work all day A flowchart of what programmers do at work all day
The Telepresent Man. The Telepresent Man.
Interview with an Ex-Microsoftie. Interview with an Ex-Microsoftie.
CRUMBS! Commandline navigation tool for Powershell CRUMBS! Commandline navigation tool for Powershell
Little tool for making Amazon affiliate links Little tool for making Amazon affiliate links
Extracting a Trello board as markdown Extracting a Trello board as markdown
hgs: Manage Lots of Mercurial Projects Simultaneously hgs: Manage Lots of Mercurial Projects Simultaneously
You Must Get It! You Must Get It!
AddDays: A Very Simple Date Calculator AddDays: A Very Simple Date Calculator
Google caught in a lie. Google caught in a lie.
NimbleText 2.0: More Than Twice The Price! NimbleText 2.0: More Than Twice The Price!
A Computer Simulation of Creative Work, or 'How To Get Nothing Done' A Computer Simulation of Creative Work, or 'How To Get Nothing Done'
NimbleText 1.9 -- BoomTown! NimbleText 1.9 -- BoomTown!
Line Endings. Line Endings.
**This** is how you pivot **This** is how you pivot
Art of the command-line helper Art of the command-line helper
Go and read a book. Go and read a book.
Slurp up mega-traffic by writing scalable, timeless search-bait Slurp up mega-traffic by writing scalable, timeless search-bait
Do *NOT* try this Hacking Script at home Do *NOT* try this Hacking Script at home
The 'Should I automate it?' Calculator The 'Should I automate it?' Calculator

Archives Complete secretGeek Archives

TimeSnapper -- Automated Screenshot Journal TimeSnapper: automatic screenshot journal

25 steps for building a Micro-ISV 25 steps for building a Micro-ISV
3 minute guides -- babysteps in new technologies: powershell, JSON, watir, F# 3 Minute Guide Series
Universal Troubleshooting checklist Universal Troubleshooting Checklist
Top 10 SecretGeek articles Top 10 SecretGeek articles
ShinyPower (help with Powershell) ShinyPower
Now at CodePlex

Realtime CSS Editor, in a browser RealTime Online CSS Editor
Gradient Maker -- a tool for making background images that blend from one colour to another. Forget photoshop, this is the bomb. Gradient Maker



[powered by Google] 

How to be depressed How to be depressed
You are not inadequate.



Recommended Reading


the little schemer


The Best Software Writing I
The Business Of Software (Eric Sink)

Recommended blogs

Jeff Atwood
Joseph Cooney
Phil Haack
Scott Hanselman
Julia Lerman
Rhys Parry
Joel Pobar
OJ Reeves
Eric Sink

InfoText - amazing search for SharePoint
LogEnvy - event logs made sexy
Computer, Unlocked. A rapid computer customization resource
Aussie Bushwalking
BrisParks :: best parks for kids in brisbane
PhysioTec, Brisbane Specialist Physiotherapy & Pilates
 
home .: about .: sign up .: sitemap .: secretGeek RSS .: © Leon Bambrick 2006 .: privacy

home .: about .: sign up .: sitemap .: RSS .: © Leon Bambrick 2006 .: privacy