tidyverse overlap 90 percent column values in r

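To find pairs of columns whose values overlap by 90 percent or more, compare every pair of columns row by row and keep the pairs whose share of matching rows is at least 0.9:

main.r
library(tidyverse)

# create example data frame
df <- data.frame(
  col1 = c(1,2,3,4,5),
  col2 = c(1,2,3,4,5),
  col3 = c(1,2,3,4,6), # 80% overlap with col1 and col2
  col4 = c(1,2,3,6,7), # 60% overlap with col1, col2, and col3
  col5 = c(1,8,9,10,11) # 20% overlap with every other column (first row only)
)

# keep only the numeric columns
num_df <- df %>% 
  select(where(is.numeric))

# find column pairs with overlapping values of 90% or more
combn(names(num_df), 2, simplify = FALSE) %>% # every unordered pair of column names
  map(~ tibble(
    col_a = .x[1],
    col_b = .x[2],
    overlap = mean(num_df[[.x[1]]] == num_df[[.x[2]]]) # share of rows with equal values
  )) %>% 
  bind_rows() %>% 
  filter(overlap >= 0.9)

In this example, col1 and col2 match in every row (100% overlap), col3 matches each of them in 4 of 5 rows (80%), col4 matches col1, col2, and col3 in 3 of 5 rows (60%), and col5 matches the other columns only in the first row (20%). Therefore, the code returns a single row: the pair col1 and col2, with an overlap of 1. Note that the ratio n_distinct(.x) / length(.x) measures how unique the values within one column are, not how much two columns overlap; it equals 1 for every column here, so it cannot be used for this comparison. If the columns can contain NA, the == comparison yields NA for those rows, so decide how to treat them (for example with mean(..., na.rm = TRUE)) before applying the threshold.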
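The overlap figures quoted above are consistent with comparing columns position by position. If overlap should instead mean "values shared regardless of row order", a set-based measure is one alternative. The following is a minimal sketch under that assumption; the helper name set_overlap is hypothetical and not part of the tidyverse, and it defines overlap as the share of distinct values in the first vector that also occur in the second.

```r
# Hypothetical helper (an assumption, not a tidyverse function):
# share of distinct values in x that also occur anywhere in y
set_overlap <- function(x, y) {
  ux <- unique(x)
  mean(ux %in% y)
}

col1 <- c(1, 2, 3, 4, 5)
col3 <- c(1, 2, 3, 4, 6)
col4 <- c(1, 2, 3, 6, 7)

set_overlap(col1, col3) # 1, 2, 3 and 4 are shared: 4 of 5 values
set_overlap(col3, col4) # 1, 2, 3 and 6 are shared: 4 of 5 values
```

Under this definition col3 and col4 overlap by 80%, whereas the row-by-row comparison gives 60%, so the choice of definition can change which pairs pass the 90% threshold. The measure is also not symmetric when the two columns have different numbers of distinct values.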
