Introduction

Let's be blunt - I honestly don't feel like typing an entire article about yet another CSV parser. In the end, people that are looking for a solution to a problem probably won't actualy read the text. They'll simply download the code, try to use it, and if I'm lucky, they'll actually take the time to use or hopefully develop some debug skills and figure out the issues on their own before posting a message below demanding that I fix the code for their specific application.

For this reason, I'm not going to go into as much detail about how the code works as much as what it does and why it does it the way it does it. There will be few code snippets, absolutely no images, and as much apathy as I can muster while still conveying the intended information.

Background

I live in a very bizarre programming world. My most persistent project involves importing data from almost five DOZEN different data sources, comprised mostly of Excel XLSX files, a web site that provides raw data in XML format, and a couple of actual database queries. The Excel spreadsheets come from a mix of database queries on web sites that can return the results in the form of said spreadsheet files, whith the rest being manually generated by humans.

Garbage In

You would think that the human-generated files would present the most quirks because humans are flawed. Mind numbingly boring data entry, a substantial lack of monetary compensation in the form of a paycheck, and the fact that it's a government employee all conspire to for a perfect storm of inability to muster anything resembling an attention to detail, which leads to "nuances" in the tabular data. However, database pulls can be equally frought with errors, (amusingly enough) especially if the data entry side of the database doesn't quite catch all of the potential errors that a human is apt to introduce.

Garbage Out

The primary method of importing said spreadsheets is a library called EPPlus. While a nice library in general, it has some weaknesses. The one that drove me to writing the code featured in this article is that for some reason (as yet undiscovered by - well - anybody), some .XLSX files simply will not load using the library. This "nuance" forces me to use Excel to save the desired sheet as a CSV file, which then forced me to write more code to implement the capability. And that's why you're reading this.

Assumptions

As with most of my articles, ths one is not about theory, or the newest doo-dad or gee-gaw that Microsoft seems to think we want to see in .Net. To put it simply, this is real-world code that lives and breathes in an actual project. As it gets used, it gets tested a little more thoroughly, and as problems crop up, they are promptly fixed. The code as presented here appears to work fairly well - today. Tomorrow will be a fresh hell of "might now", because I cannot think of every possible thing that might get thrown at it. I try to head most of the obvious stuff off, but like everything else related to programming, just when you think your code is idiot-proof, the world invents a better idiot, and you end up performing what I call knee-jerk programming.

This article assumes that you're a moderately accomplished developer, but one who wants some code to put out a fire that you'd really rather not put too much time into yourself. I didn't do anything too fancy or elegant, because "fancy and elegant" code is quite often a lot harder to maintain. The code is heavily commented, so there should be ample explanation about how it works.

A very short sample file (comprised of a header row and two data rows) is provided to exercise the class. This sample file also includes one malformed data row to exercise the AutoCorrect code. To ensure that the class meets your particular needs, use the included sample project to establish CSVParser's suitability, and make any changes that you deem necessary before puttting it into your own project(s).

The Code

Once again, this article isn't as much about how it works, but more about what it does and why it does it. Keep in mind that more often than not, the answer to "why" is going to be because I'm the laziest redneck you'd ever want to meet. And I'm old. Really old. I simply don't care if it fits everyone's needs, as long as it fits mine. By now, I've probably pissed more than a few people off, but that doesn't bother me because if I hadn't, I wouldn't be perpetuating my repution here on CP, thus dissappoint my legions of fans (okay, maybe one or two people would be disappointed, so "legions" is a subjective term).

What it is

The CSVParser class is an abstract class that parses a file (or stream) of comma-separated values. Being abstract, it must be inherited by a programmer-developed class, which must, at a minimum, implement the abstract methods. Most of the methods in the CSVParser class are virtual, allowing the programmer to override their functionality with new or supplementary processing.

What It Does

The code takes a comma-delimited text file (or stream) and parses each line into discrete fields.

Configuration Properties

Internally Consumed Properties

Parsing, Generally Speaking

The constructor takes no parameters, allowing the programmer to use automatic properties to set the configuration properties as desired. I personally prefer this to setting up what results in being a seemingly endless succession of overridden constructors, each having a range of parameters that need to be provided. I still use parameteried constructors, but only when absolutely necessary or expedient. The constructor is only used to initialized the various List properties. In your derived class, you don't really have to do anything unless you prefer to set the configuration properties there instead of using automatic properties.

To start the parsing process, call the Parse method with the desired filename or stream object. Each line is parsed in turn, using the following method:

protected virtual string[] ReadFields(string text, bool removeQuotes=true)
{
    //assume we have a proper line of text
    this.IsMalformed = false;
    // split the string on commas (because this is a CSV file, after all)
    string[] parts = text.Trim().Split(',');
 
    // create a container for our results
    List<string> newParts = new List<string>();
    // set some initial values
    bool inQuotes = false;
    string currentPart = string.Empty;
 
    // iterate the parts array
    for (int i = 0; i < parts.Length; i++) 
    {
        // get the part at the current index
        string part = parts[i];
        // if we're in a quoted string and the current part starts with a single double 
        // quote AND currentPart isn't empty, assume the currentPart is complete, add it to 
        // the newParts list, and reset for the new part
        if (inQuotes && part.StartsWithSingleDoubleQuote()==true && !string.IsNullOrEmpty(currentPart))
        {
            currentPart = string.Concat(currentPart, "\"");
            newParts.Add(currentPart);
            currentPart = string.Empty;
            inQuotes = false;
        }
        // see if we're in a quoted string
        inQuotes = (inQuotes || (!inQuotes && part.StartsWithSingleDoubleQuote() == true));
        // if so, add the part to the current currentPart
        if (inQuotes)
        {
            currentPart = (string.IsNullOrEmpty(currentPart))? part : string.Format("{0},{1}", currentPart, part);
        }
        // otherwise, simply set the currentPart to the part
        else
        {
            currentPart = part;
        }
        // see if we're still in a quoted string
        inQuotes = (inQuotes && currentPart.EndsWithSingleDoubleQuote()==false);
        // if not
        if (!inQuotes)
        {
            // remove the quote characters
            currentPart = (removeQuotes) ? currentPart.Trim('\"') : currentPart;
            // put the currentPart into our container
            newParts.Add(currentPart);
            // reset the currentPart
            currentPart = string.Empty;
        }
    }
    this.IsMalformed = (inQuotes || (this.Columns.Count > 0 && newParts.Count != this.Columns.Count));
    return newParts.ToArray();
}
    

Originally, I was using the VisualBasic.FileIOTextFieldParser object to handle this, but I despise all things connected with VB, and adding a reference to a VB assembly simply felt wrong on so many levels. Since I'm handling all the other stuff (that I needed) that the VB object provided, I figured it would be a lot more "developery" to roll my own version of the TextFieldParser.ReadFields method.

When parsing of the line is finished, the abstract method ProcessFields(bool isMalFormed) is called. In your derived class, you would do something like this:

protected override void ProcessFields(bool isMalformed)
{
    if (this.CurrentData != null && !isMalformed)
    {
        // TO-DO: Your stuff
        try
        {
            int      col1 = this.FindValue("col1", -1);
            string   col2 = this.FindValue("col2", "ERROR");
            string   col3 = this.FindValue("col3", "ERROR");
            double   col4 = this.FindValue("col4", -1d);
            DateTime col5 = this.FindValue("col5", new DateTime(0));
        }
        catch (FindValueException fvex)
        {
            //TO-DO: react to an exception thrown because the value found could not be cast 
            //		 to the expected type.
        }
    }
}
    

This is where you retrieve the field values and do something weith them. Most likely, that something would be to create an instance of an appropriate application-specific object and set its properties to the field values.

Remember that you can choose to throw exceptions if a field does not parse to the expected type (indicated by the receiving variable/property). However, setting a default to something that indicates an error can sometimes be infinitely more useful, especially when debugging. I have a method in my objects that performs validity checking based on the contents of the properties. I don't use it in this sample project, but I'm including it in this article because it might be useful to others.

public bool IsValid 
{
    get 
    {
        bool           valid = false;
        PropertyInfo[] infos = this.GetType().GetProperties();
        foreach(PropertyInfo info in infos)
        {
            if (info.Name != "IsValid")
            {
                object property   = info.GetValue(this, null);
                string propString = string.Format("{0}",property);
                valid             = (!propString.IsInExact("-1,-1.0,ERROR"));

                if (!valid)
                {
                    break;
                }
            }
        }
        // we don't need to check the dates specificially because if they weren't 
        // valid, this object would not have been created
        return valid;
    }
}
	

It uses reflection but that really can't be helped if you want to use it everywhere, such as from a base class or something. I also included the IsInExact() extension method in the project (this IsValid property utilizes it). You may note that it doesn't validate DateTimes because my code won't create the objects that contain this property if the dates aren't valid, but it wouldn't be difficult to include, even for a mildly skilled programmer.

Once the stream/file has been parsed, the CSVParser class calls the abstract Finished() method. This gives you the opportunity to do whatever you need to do with the parsed data that you've retained. It can also be of assistance during debugging, allowing you to examine the list of line indexes for valid, invalid, and corrected lines.

Finding Values

Once parsing has started, and the base class invokes the ProcessFields method, you can retrieve the data one field at a time. To do so, you use the FindValue() method. FindValue() is overriden for the four most common types (string, int, double, and DateTime). FindValue accepts the desired column name, as well as a default value to assign if there is a problem either finding the specified column, or not being able to convert the found data into the desired type.

Finding the specified column involves the IsLike extension method, which functions similarly to the SQL LIKE function. You can use any appropriate SQL wildcard character in the column name to find it in the dictionary of discovered columns. So, given a column name of "Really Long Column Name", you could use something like "really long col%", "%long column%", %column name", or "Really Long Column Name" to find that column in the dictionary. It is merely a vehicle to allow less typing. Of course, care must be taken so that the column name specified in FindValue is qualified enough to find the desired column. As will the SQL LIKE function, string matching is not case sensitive, regardless of whether wildcards are used or not.

What It Doesn't Do

Usage

First, you instantiate your derived parser object:

CSVFileParser parser = new CSVFileParser();
parser.Parse("sample1.csv");

Inside your derived parser object, override the two abstract methods:

public class CSVFileParser : CSVParser
{
    public CSVFileParser() : base()
    {
    }

    protected override void ProcessFields(bool isMalformed)
    {
        if (this.CurrentData != null && !isMalformed)
        {
            try
            {
	            // TO-DO: Your stuff (involves calling FindValue for each field 
                //        you want to retrieve from the current line)
            }
            catch (FindValueException fvex)
            {
                //TO-DO: react to an exception thrown because the value found could not be cast 
                //       to the expected type.
            }
        }
    }

    protected override void Finished()
    {
        //TO-DO: Celebration that could include dancing naked around a fire, pounding your 
        //       chest and singing songs of a successful parsing event.

        // At this point you can examine the properties that invalid lines, and the total lines processed 
		// (excluding the header row and blank lines).
    }
}

Of course, you can override pretty much any method in the base class to modify the parser's behavior.

Definitions

Caveats and Emptors

There are obviously other solutions to this problem that are either more minimalistic, or more substantive, and I leave you to your googling talents to find said alternatives. Please do not waste space in the comments section by informing me of these alternatives. I simply don't care. I wrote this code because it fits my needs. I'm sharing it because it might fit someone else's.

Article History