Microsoft Excel is a creation of staggering boneheadedness

What I learned about Microsoft Excel today makes me not really want to use it again for anything that is even moderately important.

Did you know that there is no easy way to import data into Excel with any fidelity?

Just write to a delimited file, you say. CSV data, TSV data, whatever... they are all very easy to produce programmatically. And they will all get screwed up when you open them.

You see, when you open such a file, one of two things will happen, both of them bad.

First, Excel may open the file without complaint. I think this happens when you double-click on the file in Windows. Excel will then apply heuristics to set the format for each cell appropriately. These heuristics are not 100% reliable, for which I can hardly fault Excel. As one example, a cell containing a list of numbers 50001,50002,50003,50014,50018 is interpreted as a single large integer, which Excel converts to the floating-point number 5.00e24.

However, and here is the problem, the conversions are silent (no warning is given upon opening a CSV file) and lossy (above, some of the precision needed to construct the original sequence is lost), and they cannot be reverted by any magic incantations within Excel after you've opened the spreadsheet.

That's right. When opening CSV/TSV files, Microsoft Excel's default behavior is to silently corrupt your data.

I'm running into this problem at work, and I'm just glad that someone noticed what was wrong, because this is really pernicious. You might spot-check your spreadsheet and think that everything is fine and not notice that your data is corrupted starting in row 8000. This is actually happening in biomedical research. Gene names, identifiers, you name it— are silently and irreparably converted to numbers and dates in Excel. (And those researchers don't believe there's any good way to deal with this either.)

For the record, OpenOffice gets this right, by preserving data verbatim by default when importing. That is positively brilliant in comparison.

Now, the second thing that might have happened when you opened your file in Excel is that you invoke Excel's "Import Wizard." I think this happens when you choose "File" "Open" and select a delimited file. Excel will dutifully ask you how the columns are to be delimited and what format to use for each column.

And, although you can select "Text" format here (meaning, preserve the input verbatim), the default format is the "do what Excel thinks is best" option. Once again, Microsoft Excel's default behavior is to corrupt your data. Sure, in this case at least you can click on each column you think might have a problem and select "Text" format for it. But if you are a human, and you're opening these kinds of files all the time, and you trust yourself to do this consistently and correctly, you probably deserve what's coming to you. Humans are not good at performing repetitive tasks. That's supposed to be the computer's job.

What are we expected to do, write out Excel's XML-based file format directly, now that's it's nominally "open" and "documented"? That seems really heavyweight. It is kind of a remarkable oversight that it is so difficult to massage arbitrary data into any format that can be reliably read by Excel.

No comments:

Post a Comment