This post has been slightly modified from its original form on woodpeckR.
Problem
How can I (quickly and intuitively) figure out how many NA
’s are in my dataset and which columns they’re in?
Context
When I tried to run PCA (Principal Components Analysis) on some USGS fish sampling data, I noticed that I had a bunch of missing values. PCA needs complete observations, so this was a problem.
One option would have been to remove any observations with missing values from my data set:
# Select only "complete" rows from the data frame `df`
# noNAs <- df[complete.cases(df),]
The problem was, I had over 30 variables and who knows how many missing values. The data frame had only ~2000 observations. By using only complete cases, I might lose a lot of observations and reduce my sample size by a huge amount.
In fact, I pretty often find myself in this situation. It would be really nice to have a quick way to see where those NA
values are located so I can get a better sense of my dataset and figure out how to move forward.
Solution
Write a loop that tells us how many NA
’s are in each column.
First, let’s create a sample data frame and call it sample.df
:
# Create the data frame
<- data.frame(
sample.df site = 1:4,
temp = c(10, 15, 13, NA),
depth = c(1.1, NA, 2.0, NA)
)
# Show the data frame
sample.df
site temp depth
1 1 10 1.1
2 2 15 NA
3 3 13 2.0
4 4 NA NA
Loop through the columns and print out the number of NA
’s:
# Create a vector full of NA's, the same length as the number of columns in sample.df
<- rep(NA, ncol(sample.df))
na.vec
# Loop through the columns and fill na.vec
for(i in 1:ncol(sample.df)){
<- sum(is.na(sample.df[,i]))
na.vec[i]
}
# Take a look at na.vec
na.vec
[1] 0 1 2
Now we can see that there are 0 NA
’s in the first column, 1 NA
in the second column, and 2 NA
’s in the third column.
But if you have 30 columns, it’s a pain to map those numbers to the column names. So let’s do better. Instead of just printing the numbers of NA
’s in a vector, we’ll put them in a data frame along with the names of the columns.
# Create a data frame
<- data.frame(
na.df Column = names(sample.df),
num.nas = NA
)
# Loop through the columns of sample.df and fill na.df
for(i in 1:ncol(sample.df)){
$num.nas[i] <- sum(is.na(sample.df[,i]))
na.df
}
# Take a look at na.df
na.df
Column num.nas
1 site 0
2 temp 1
3 depth 2
So much better!
Once you get used to it, this is a quick loop to write. But I got sick of re-creating this process every time, so I wrote a function called locate.nas. Feel free to use it:
#Locate NA's: produces a data frame with column names and number of na's
<- function(df){
locate.nas <- data.frame(
na.df colname = names(df),
nas = NA
)
for(i in 1:ncol(df)){
$nas[i] <- sum(is.na(df[,i]))
na.df
}return(na.df)
}
Outcome
A quick look at the distribution of missing values (NA
’s) in my data frame turned up an obvious pattern. I checked the sampling protocol and saw that certain variables had only been measured for lotic areas (moving water), while others had only been measured for lentic areas (still water). Since every observation point was in either a lotic or a lentic area, filtering out incomplete observations would have left me with no data at all.
By adding an indicator variable for lotic/lentic area, I could sort out my data and run PCA separately. Or I could remove the variables measured for only one area. Problem solved.
Resources
There’s also a whole package that makes dealing with NA’s easier. I didn’t know about it when I originally wrote this post, but I’ve since discovered it, and you should check it out! It’s called
naniar
, and you can find it here
Citation
@online{gahm2018,
author = {Gahm, Kaija},
title = {Where Are My {NA’s?}},
date = {2018-07-06},
url = {https://kaijagahm.github.io/posts/2018-07-06-where-are-my-nas},
langid = {en}
}