Reproducibly getting rid of Qualtrics cruft

The problem: Qualtrics cruft

Qualtrics is the best online survey software I’ve used. However, when I export CSV files from a Qualtrics survey, the first three rows are filled with cruft that makes it a bit hard to analyze the data in R: variable names, longform question wording, and so-on. Because of these lines, R interprets each column as a factor (or a character if you use the tidyverse’s read_csv), which screws up further analysis.

Two bad solutions

1. Excel is not reproducible

One quick and easy option is to just edit out the junk in Excel. However, I’d rather do it in R because (1) when in doubt, I’d prefer to keep it reproducible, and (2) I’d rather not edit the CSV directly in case I need that information later.

2. Skipping the lines leaves out variable names

I could just import this into R using the skip = 3 option in read_csv()*. However, this omits the first line, which contains the variable names. The variable names are useful, especially if you took time to rename them in Qualtrics when you created the survey.

*Note: I use read_csv in this post, but it should work with read.csv, as well. At least I think :)

As best as I can tell, there’s no way to have read_csv read the first line for variable names, skip 2 lines, and then read the rest of the data. So here’s a solution that I hacked together.

The less-bad solution: import the variable names separately

The solution I came up with is to import the column names separately and use them when importing the actual data. The basic steps are as follows:

  1. Import the variable names from the downloaded CSV using read_csv
  2. Convert the variable names into a character vector using unlist
  3. Import the data, skipping the 3 lines of cruft and using the variable names as column names.

Here’s the solution in code:

# load the tidyverse package for read_csv. This isn't necessary if you're
# using read.csv.

library(tidyverse)

# create a test csv with 3 lines of cruft and 5 observations of
# 2 variables

testCSV <- ("goodUse,presenterKnowledgeable,presentedWell
cruft,junk,text
more cruft,more junk,more text
4,3,2
5,5,2
5,5,5
2,3,1
1,4,5")

# load just the column names and use unlist to convert to a character
# vector.
# use n_max = 1 to load only the first row and col_names = FALSE to make
# sure read_csv doesn't interpret the first row as column names.

variable_names <- unlist(read_csv(testCSV, n_max = 1, col_names = FALSE))

# Read the file again to get the data.
# Use skip = 3 to skip the 3 lines of cruft and use
# col_names = variable_names to assign variable names.

decrufted_data <- read_csv(testCSV, skip = 3, col_names = variable_names)

> decrufted_data

# A tibble: 5 × 3
  goodUse presenterKnowledgeable presentedWell
    <int>                  <int>         <int>
1       4                      3             2
2       5                      5             2
3       5                      5             5
4       2                      3             1
5       1                      4             5



# Boom! Now analyze as you see fit.

A bonus: wrap it all up in a function

Here’s the same thing as a function. I’m going to stick this in my personal package (see Hilary Parker’s post to get started on making your own packages) to use in the future.

decruftQ <- function(input_csv){
  require(tidyverse)
  colnamesQ <- unlist(read_csv(input_csv,n_max = 1, col_names = FALSE))
  output_data <- read_csv(input_csv, skip = 3, col_names = colnamesQ)
  return(output_data)
  }