It's been a while since I updated this blog. This post shows how to parse the OrthoFinder output (Orthogroups.csv) into a tidy long format.
I ran OrthoFinder on two species, rice and Arabidopsis. The Orthogroups.csv file has three columns:
- orthogroup ID
- Arabidopsis gene IDs (comma-separated)
- rice gene IDs (comma-separated)
To parse the file, use the following code:
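library(tidyverse)
library(here)  # provides here() for project-relative paths

# Read the OrthoFinder result (read as tab-separated, although the extension is .csv).
ogroup <- read_tsv(here("data", "orthoFinder", "Orthogroups.csv"))

# Drop orthogroups found in only one species, then split each
# comma-separated gene string into a list-column of gene IDs.
ogroup1 <- ogroup %>%
  drop_na() %>%
  mutate(ath = str_split(ath, pattern = ", "),
         msu7 = str_split(msu7, pattern = ", "))

# Build every Arabidopsis-rice gene pair for the i-th orthogroup.
# Note: cross() and lift() are deprecated as of purrr 1.0.0.
# X1 is the name older readr gives the unnamed first column (readr >= 2.0 uses `...1`).
parse_orthogroup_long <- function(i) {
  temp1 <- list(ogroup1$ath[[i]], ogroup1$msu7[[i]])
  temp1 <- cross(temp1) %>% map(lift(paste)) %>% unlist()
  tibble(edge = temp1) %>%
    mutate(group = ogroup1$X1[i]) %>%
    separate(edge, into = c("tair10", "msu7"), sep = " ")
}

# Apply the function to every orthogroup and stack the results into one table.
ogroup_long <- map(seq_len(nrow(ogroup1)), parse_orthogroup_long)
ogroup_long <- bind_rows(ogroup_long)
write_tsv(ogroup_long, here("result", "tair10_to_msu7_orthogroup.tsv.gz"))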
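The same pairing can probably be done without the helper function. Below is a minimal sketch, assuming tidyr 1.3.0 or later (for `separate_longer_delim()`). The `ogroup_toy` table and the gene IDs in its second and third rows are made up for illustration, and the first column is named `group` here instead of `X1`. Splitting one column after the other gives every Arabidopsis-rice combination within a group, which should be the same set of pairs as above, though the row order may differ.

```r
library(tidyverse)

# Toy stand-in for the wide OrthoFinder table (IDs are illustrative only).
ogroup_toy <- tribble(
  ~group,      ~ath,                   ~msu7,
  "OG0000000", "AT1G05080, AT1G06630", "LOC_Os01g57920",
  "OG0000001", "AT1G00001",            "LOC_Os01g00001, LOC_Os01g00002",
  "OG0000002", "AT1G00002",            NA
)

ogroup_long_toy <- ogroup_toy %>%
  drop_na() %>%                                  # drop single-species groups
  separate_longer_delim(ath, delim = ", ") %>%   # one row per Arabidopsis gene
  separate_longer_delim(msu7, delim = ", ") %>%  # then one row per rice gene
  select(tair10 = ath, msu7, group)              # match the column order used above

print(ogroup_long_toy)
```

Splitting the two columns in separate calls matters here: passing both columns to a single `separate_longer_delim()` call would try to split them in parallel rather than build all combinations.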
The result is a long-format table with three columns, one row per Arabidopsis-rice gene pair. From this format, I can do joins quite easily.
tair10 | msu7 | group |
---|---|---|
AT1G05080 | LOC_Os01g57920 | OG0000000 |
AT1G06630 | LOC_Os01g57920 | OG0000000 |
AT1G13780 | LOC_Os01g57920 | OG0000000 |
AT1G16930 | LOC_Os01g57920 | OG0000000 |
AT1G16940 | LOC_Os01g57920 | OG0000000 |
AT1G16945 | LOC_Os01g57920 | OG0000000 |
AT1G19070 | LOC_Os01g57920 | OG0000000 |
AT1G19410 | LOC_Os01g57920 | OG0000000 |
AT1G21990 | LOC_Os01g57920 | OG0000000 |
AT1G22000 | LOC_Os01g57920 | OG0000000 |
AT1G26890 | LOC_Os01g57920 | OG0000000 |
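
As an example of the kind of join this format allows, here is a small sketch. The `ath_hits` table, its `log2fc` column, and the gene `AT5G99999` are hypothetical, standing in for any Arabidopsis gene list keyed by TAIR10 ID. The two orthogroup rows are copied from the table above.

```r
library(tidyverse)

# Two rows from the long table shown above.
ogroup_long_toy <- tribble(
  ~tair10,     ~msu7,            ~group,
  "AT1G05080", "LOC_Os01g57920", "OG0000000",
  "AT1G06630", "LOC_Os01g57920", "OG0000000"
)

# Hypothetical Arabidopsis gene list with a made-up value column.
ath_hits <- tribble(
  ~tair10,     ~log2fc,
  "AT1G05080",  1.5,
  "AT5G99999", -2.0
)

# Keep only genes that have a rice partner and attach the rice gene ID.
rice_candidates <- ath_hits %>%
  inner_join(ogroup_long_toy, by = "tair10")

print(rice_candidates)
```

Using `left_join()` instead would keep Arabidopsis genes without a rice partner, with `NA` in the `msu7` and `group` columns.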