Six tips for better spreadsheets

As a data science librarian at the University of California, San Diego, Stephanie Labou has seen her share of spreadsheet horror stories. Most disturbing was a hand-entered table of GPS coordinates.

“It was a complete mix,” recalls Labou. The spreadsheet was produced by citizen-scientists. Some had written the word ‘degrees’, some ‘0’ and some ‘o’; some had used superscripts, others hadn’t; others wrote ‘north’, ‘west’ or neither. “We’re talking about tens of thousands of rows of data, where each latitude and longitude was recorded differently,” he says. “That was the least consistent spreadsheet I’ve ever seen.”

Data scientists express strong feelings about using spreadsheets for data analysis. In general, they prefer programming languages ​​such as R and Python, where analyzes are more easily documented and reproducible. But many researchers are more comfortable with spreadsheets, and being embarrassed about using them is counterproductive, Labou says. Sometimes spreadsheets are the fastest way to solve a problem. And there really is no other option to record tabular data.

Spreadsheets are reactive: cells that depend on other cells will automatically update as data changes. They can also be useful, intelligently formatting data to make it easier to read. Also, they are everywhere. Spreadsheets are “where data science begins,” says Tracy Teal, director of the open source program at software developer RStudio in Boston, Massachusetts.

But they are also more complicated than they seem. A function to take the average of a column, for example, will return the wrong value if the formula doesn’t take into account the correct data range. Cells that appear empty may not be. And auto formatting doesn’t always work as expected. Researchers have long known that some genomic studies contain skewed data because Excel incorrectly converted some genetic symbols, such as 4TH OF OCTOBER, on dates. An analysis of around 11,100 articles published between 2014 and 2020 found that 31% still include such errors (M. Abeysooriya et al. PLoS computing. Biol. 17, e1008984; 2021).

As data scientists Karl Broman of the University of Wisconsin-Madison and Kara Woo, then at the University of Washington, Seattle, wrote in 2018: “Spreadsheets, despite their mundane rectangularity, have been the subject of angst and controversy for decades. (KW Broman and KH Woo Am. Statistics 72, 2–10; 2018).

Here are six tips for using them correctly.

Keep raw data raw

Christie Bahlai, a computational ecologist at Kent State University in Ohio, has helped create workshops and teaches courses on spreadsheet best practices for environmentalists. She says her number one tip is to “keep your data raw.”

Spreadsheets, says Bahlai, are “tactile”: they’re easy to use, intuitive and easy to manipulate. But they’re also “easy to mess up,” and it’s “easy to lose track of what you’ve done.” A wrong mouse click can cause the data to end up in the wrong place. And the auto format function can mess up the data. Additionally, the spreadsheet may contain organizational information that may not be immediately clear. As a result, Bahlai recommends that users make their original spreadsheet a read-only document and work in copies, so they can start over if needed.

Bahlai recalls a case where she kept finding stray letters in one of the spreadsheet columns as she began to process the data. “I’m like, ‘what does ‘M’ stand for? What does ‘A’ stand for?’” It turned out that a team member had written ‘NO SAMPLE’ vertically in one of the columns, one letter per row, a decision by the organization that is clear to a human reader, but not to a human reader. computer. When he tidied up the table, that visual organization was lost. “It was like solving a jumble,” he says with a laugh. “I realized, ‘Oh, this spells something, there’s a message!'”

Make data machine readable

Spreadsheets provide extensive formatting options, from font styling to background fills and borders. This digital ‘bling’ can liven up a table and make it more readable. But when researchers use that style to code data, they can run into trouble.

“My top piece [of advice] that is, don’t color-code or format the data, create another column that can be sorted or filtered,” says Mine Çetinkaya-Rundel, a statistician at Duke University in Durham, North Carolina.

This is because cell formatting is difficult for intermediate users to capture. “All the tools available to data scientists are unaware of data expressed as format rather than text or numerical values,” says Duncan Garmonsway, a data scientist at the UK Government Digital Service in Lincoln. Formatting can be lost during routine table manipulations. And researchers may have a hard time remembering what the format represents when they return to the spreadsheet months or years later.

Luis Verde Arregoitia, a mammal specialist at the Institute of Ecology (INECOL) in Veracruz, Mexico, experienced that when he went through an old collection of biodiversity records. He had highlighted rows in yellow, orange, or green to indicate his level of confidence in the data. “At this point,” he says, “I don’t really remember the exact color coding scheme he was using.”

Be consistent

Data analysis tools expect spreadsheets to have a specific format: one row of column headings, no merged cells, and one table per page. Ideally, all cells are filled, even when there is no data (eg with ‘NA’), and contain precisely one piece of data. To tabulate data from a field study to count insects, for example, use separate columns for types of insects and for count, says Teal, rather than, say, ‘3 red beetles’.

Specialized tools can untangle spreadsheets that deviate from the ideal. Verde Arregoitia’s ‘unheadr’ package, for example, handles tables that include rows to subdivide a table into different groups, which he calls ’embedded subheads’. Garmonsway’s ‘tidyxl’ and RStudio data scientist Jenny Bryan’s ‘googlesheets4’ provide ways to extract the format.

The most important thing, Labou says, is consistency: Decide on an approach, document it, and stick with it. How will species be indicated? And how should dates be formatted? Does ‘2/1/2022’ mean February 1st or January 2nd? Most experts recommend the YYYY-MM-DD format (the International Organization for Standardization standard) or dedicate separate columns to the year, month, and day. When combined with data validation, using separate columns means “there’s absolutely no ambiguity,” says Labou. But, Broman warns, it makes it harder to calculate date differences.

document your work

While programming scripts can be saved and version controlled, keystrokes and mouse clicks generally cannot. But spreadsheet users can still document their analyses.

Designate a spreadsheet (or tab) as a “code book” that documents the abbreviations, how the data was collected, the units of measure, how missing values ​​will be represented, the calculations that are performed, and the metadata needed to understand , process and maintain the spreadsheet. “Writing a roadmap for yourself is important,” says Çetinkaya-Rundel.

Then, says Bahlai, “write the recipe for what you’ve done with your data.” What does each formula do and where does it pull its data from? “You’ll be sorry if, when you go to write your methods and say, ‘Hey, how did I average this?’” he says. (In Excel, you can use the ‘audit’ feature to see the flow of data through formulas, says Felienne Hermans, a computer scientist at Leiden University in the Netherlands.)

Check your data

Data analysts often add cross checks to ensure their data processing code works as expected. Spreadsheet users can do something similar, says Hermans.

In a study with case-control samples, for example, the total number of values ​​in the two groups must always equal the number of samples; at least that cross check ensures that the cells you think are empty actually are. “Incorporating some of these cross-checks so you can see that everything is in order is actually a very, very good idea,” he says.

You can also ‘protect’ parts of the spreadsheet from being changed and apply data validation to ensure that date columns contain valid dates, numbers are within certain ranges, or text fields include terms expected. Alternatively, Çetinkaya-Rundel suggests, use a data entry form (like a Google Form) instead of editing the spreadsheet directly. That way, values ​​can be verified as they’re entered and users can’t accidentally modify the document. Finally, says Teal, double check your work. Data analysis is often iterative, she notes. “You don’t walk in the door and say, ‘I’m going to do this equation,’ sit down, do it, and you’re good to go.” So once you’ve settled on a workflow, reset and start over, she says, and just make sure you have the answer she thought she had.

think ahead

The good news is that data scientists can generally handle spreadsheets regardless of their format. “A key principle I have as a data analyst is that if someone asks me what form I would like to have the data in, I always say ‘in its current form,’” says Broman. “If the data needs to be reorganized or transformed in any way, I’m always in the best position to do it.” But it’s best, says Labou, to figure out what you expect to do with your data before you create your spreadsheet in the first place. What variables and covariates will you use? What time steps do you need? What analysis are you going to perform? “Thinking about that ahead of time is one of the best things people can do,” she says.

And consult your collaborators, adds Garmonsway. The rules for data organization “are not set in stone anywhere,” she says. “Physicists did not discover them in the fundamental laws of the Universe. They arose because it is difficult to work with other people. So if you collaborate when you create your spreadsheet, it’s much more likely to be useful to other people, because it’s already useful to someone other than you.”

Source: www.nature.com