It turns out that one of the major data exchange formats for genetics is Microsoft Excel, and we have now discovered that the Redmond company’s flagship spreadsheet program has been autocorrecting the data into oblivion:
For many people, working with error-ridden spreadsheets is a way of life. This takes on added meaning for genomics researchers, who study the building blocks of life. It turns out that their work, too, is rife with dodgy spreadsheets.
A new paper has revealed the vast extent of errors in published genomics research, which is down to an unfortunate quirk of Microsoft Excel. A trio of scientists in Australia scanned 7,500 Excel files with gene lists accompanying 3,600 papers in 18 journals over a 10-year period. One-fifth of the files had easily identified errors, which is “quite striking and a little bit embarrassing,” says Mark Ziemann of the Baker IDI medical research institute in Melbourne, one of the paper’s co-authors.
What happened? By default, Excel and other popular spreadsheet applications convert some gene symbols to dates and numbers. For example, instead of writing out “Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase,” researchers have dubbed the gene MARCH1. Excel converts this into a date—03/01/2016, say—because that’s probably what the majority of spreadsheet users mean when they type it into a cell. Similarly, gene identifiers like “2310009E13” are converted to exponential numbers (2.31E+19). In both cases, the conversions strip out valuable information about the genes in question.
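To make the failure mode concrete, here is a minimal sketch of how one might flag identifiers that a spreadsheet is likely to reinterpret. It is not the method used in the paper. The function name `conversion_risk`, the month-prefix list, and both patterns are assumptions chosen for illustration, and real spreadsheet parsing rules are broader than this.

```python
import re

# Hypothetical list of month-like prefixes; an assumption, not Excel's actual rule set.
MONTH_PREFIXES = (
    "JAN", "FEB", "MAR", "MARCH", "APR", "APRIL", "MAY", "JUN", "JUNE",
    "JUL", "JULY", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC",
)

# A month-like prefix followed by one or two digits, e.g. MARCH1 or SEPT2.
DATE_LIKE = re.compile(
    r"^(?:%s)-?\d{1,2}$" % "|".join(MONTH_PREFIXES), re.IGNORECASE
)

# Digits, the letter E, then more digits, e.g. 2310009E13, which reads as
# scientific notation.
SCI_LIKE = re.compile(r"^\d+(?:\.\d+)?E[+-]?\d+$", re.IGNORECASE)


def conversion_risk(identifier):
    """Return "date", "number", or None for a gene symbol or identifier."""
    text = identifier.strip()
    if DATE_LIKE.match(text):
        return "date"
    if SCI_LIKE.match(text):
        return "number"
    return None


if __name__ == "__main__":
    for name in ["MARCH1", "SEPT2", "2310009E13", "TP53"]:
        print(name, conversion_risk(name))
```

Running a check like this over a gene list before opening it in a spreadsheet would at least show which entries are at risk; importing the affected column as text is the usual way to avoid the conversion.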
What on earth inspired all these researchers to use what can only be described as the greasy kid stuff of analysis and data storage for this purpose?
It’s nucking futz.
Clippy knew how to fix those types of errors. It never would have happened if he were on the job.
You win the internet today.