The Dirty Data of Data Mining
Tuesday, October 28th, 2008
Today I came across a survey on data mining by a consulting firm called Rexer Analytics. Their survey took into account 348 responses from data mining professionals around the world. A few interesting tidbits:
* Dirty data, data access issues, and explaining data mining to others remain the top challenges faced by data miners.
* Data miners spend only 20% of their time on actual modeling. More than a third of their time is spent accessing and preparing data.
* In selecting their analytic software, data miners place a high value on dependability, the ability to handle very large datasets, and quality output.
We've found these issues to hold true with our clients as well, particularly in various auditing industries. Auditors will get hold of their client's data, perhaps as a delimited text file. The data set is inevitably too large for Excel to handle easily, so they may try Access (of course, once they are thoroughly frustrated, they give Kirix Strata™ a shot).
Once they can actually see the data set, they start exploring it to learn what they're looking at, and inevitably discover how dirty it is. Multiple fields are mashed together, or individual ones are split apart. Company names appear multiple times in various forms (“I.B.M” vs. “IBM”). An important bit of numeric information is embedded in a text field. Endless time goes into “purifying” the data set to avoid the “garbage in, garbage out” syndrome.
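To make those three problems concrete, here is a minimal Python sketch of the kind of cleanup involved. It is only an illustration, not how any particular tool does it: the function names, the "City, ST" field layout, and the sample values are all made up for the example.

```python
import re
from typing import Optional, Tuple


def normalize_company(name: str) -> str:
    """Drop punctuation, spaces, and case so 'I.B.M' and 'IBM' compare equal."""
    return re.sub(r"[^A-Za-z0-9]+", "", name).upper()


def split_city_state(value: str) -> Tuple[str, str]:
    """Split a hypothetical mashed-together 'City, ST' field at the first comma."""
    city, _, state = value.partition(",")
    return city.strip(), state.strip()


def extract_amount(text: str) -> Optional[float]:
    """Return the first number found in free text, or None if there is none."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Remove thousands separators before converting.
    return float(match.group().replace(",", ""))
```

With these in place, `normalize_company("I.B.M")` and `normalize_company("IBM")` produce the same key, so the two spellings can be grouped together. Real data is messier than this: the normalizer will not reconcile "IBM" with "International Business Machines", and the number pattern assumes commas are thousands separators, so expect to adapt each rule to the data set in front of you.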
Often overlooked, data cleansing is really as important as the analysis itself. Only once this step is complete can you move on to your data mining or other data analysis.
Check out the survey summary yourself and let us know if it matches your experience.