Parse the OrthoFinder output to long table format

Ji Huang 2019-10-10 1 min read

It’s been a while since I last updated this blog. This post shows how to parse the OrthoFinder output (Orthogroups.csv) into a tidy long format.

I ran OrthoFinder on two species, rice and Arabidopsis. The Orthogroups.csv file has three columns:

orthogroup ID
Arabidopsis gene IDs (separated by commas)
rice gene IDs (separated by commas)
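
To make that layout concrete, here is a small sketch of such a table and of the first cleaning step. It assumes the columns are named X1, ath and msu7, as in the code in this post. The first row reuses gene IDs from the result shown in this post; the second row and its IDs are made-up placeholders.

```r
library(tidyverse)

# Toy stand-in for Orthogroups.csv (not real OrthoFinder output).
# Row 2 uses placeholder IDs and has no rice genes.
toy_ogroup <- tribble(
  ~X1,         ~ath,                   ~msu7,
  "OG0000000", "AT1G05080, AT1G06630", "LOC_Os01g57920",
  "OG9999999", "AT9G99999",            NA_character_
)

# Keep orthogroups that have genes in both species, then split each
# comma-separated string into a character vector (a list column).
toy_split <- toy_ogroup %>%
  drop_na() %>%
  mutate(ath  = str_split(ath,  pattern = ", "),
         msu7 = str_split(msu7, pattern = ", "))
```

After this step each remaining row holds two character vectors, one per species, which is the shape needed for building gene pairs.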

The following code does the parsing:

library(tidyverse)
library(here)  # provides here() for project-relative paths

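# Read the OrthoFinder result. The file is read as tab-separated despite its .csv extension.
# The first column has no header: readr < 2.0 names it X1, readr >= 2.0 names it ...1,
# so rename it by position to keep the code below working either way.
ogroup <- read_tsv(here("data", "orthoFinder", "Orthogroups.csv")) %>% rename(X1 = 1)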

# Drop orthogroups present in only one species, then split each ID string
# into a character vector (stored as a list column).
ogroup1 <- ogroup %>% drop_na() %>%
  mutate(ath = str_split(ath, pattern = ", "),
         msu7 = str_split(msu7, pattern = ", "))

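# For orthogroup i, build every Arabidopsis-rice gene pair and tag it with the group ID.
# cross() and lift() are deprecated as of purrr 1.0.0; they still run but emit warnings.
parse_orthogroup_long <- function(i) {
    temp1 <- list(ogroup1$ath[[i]], ogroup1$msu7[[i]])
    # All combinations, each pasted into one "ath msu7" string.
    temp1 <- cross(temp1) %>%  map(lift(paste)) %>% unlist()
    # Split the pasted string back into two columns and return the tibble.
    tibble(edge = temp1) %>% mutate(group = ogroup1$X1[i]) %>%
        separate(edge, into = c("tair10", "msu7"), sep = " ")
}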

ogroup_long <- map(seq_len(nrow(ogroup1)), parse_orthogroup_long)
ogroup_long <- bind_rows(ogroup_long)

write_tsv(ogroup_long, here("result", "tair10_to_msu7_orthogroup.tsv.gz"))
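
As an alternative to looping over rows, the same gene pairs can probably be built with tidyr's separate_rows(), applied to one species column at a time. This is only a sketch on toy data, not the code I ran: the second rice ID is a made-up placeholder, and the row order differs from the loop version above.

```r
library(tidyverse)

# Toy input with the same column names as above; LOC_Os99g99999 is a placeholder ID.
toy_ogroup <- tribble(
  ~X1,         ~ath,                   ~msu7,
  "OG0000000", "AT1G05080, AT1G06630", "LOC_Os01g57920, LOC_Os99g99999"
)

# Splitting one column at a time expands each row into all
# Arabidopsis x rice combinations within the orthogroup.
toy_long <- toy_ogroup %>%
  drop_na() %>%
  separate_rows(ath, sep = ", ") %>%
  separate_rows(msu7, sep = ", ") %>%
  select(tair10 = ath, msu7, group = X1)
```

Passing both columns to a single separate_rows() call would not work here, because that pairs the IDs position by position instead of crossing them.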

The resulting file is a long-format table with three columns, one row per Arabidopsis-rice gene pair within an orthogroup. In this format, joining against other tables is easy.

tair10	msu7	group
AT1G05080	LOC_Os01g57920	OG0000000
AT1G06630	LOC_Os01g57920	OG0000000
AT1G13780	LOC_Os01g57920	OG0000000
AT1G16930	LOC_Os01g57920	OG0000000
AT1G16940	LOC_Os01g57920	OG0000000
AT1G16945	LOC_Os01g57920	OG0000000
AT1G19070	LOC_Os01g57920	OG0000000
AT1G19410	LOC_Os01g57920	OG0000000
AT1G21990	LOC_Os01g57920	OG0000000
AT1G22000	LOC_Os01g57920	OG0000000
AT1G26890	LOC_Os01g57920	OG0000000
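
As a sketch of the kind of join this enables, the snippet below maps a list of Arabidopsis genes to their rice counterparts. The ath_hits table and its score column are hypothetical, with an invented value; only the two ogroup_long rows come from the table above.

```r
library(tidyverse)

# Two rows copied from the long table above.
ogroup_long <- tribble(
  ~tair10,     ~msu7,            ~group,
  "AT1G05080", "LOC_Os01g57920", "OG0000000",
  "AT1G06630", "LOC_Os01g57920", "OG0000000"
)

# Hypothetical per-gene table for Arabidopsis; the score is invented.
ath_hits <- tribble(
  ~tair10,     ~score,
  "AT1G05080", 1.5
)

# Look up the rice genes that share an orthogroup with each hit.
rice_candidates <- ath_hits %>%
  inner_join(ogroup_long, by = "tair10") %>%
  distinct(msu7, group)
```

Because every gene pair is its own row, the lookup needs no string splitting at join time.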

