This post is to convert the first column of reFlat format from transcript ID to Gene ID for version 3 of maize genome annotation. For version 3 genome, either the transcript ID is (1) GRMZM6G708185_T01 or (2) AC202015.3_FGT004 (or starts with AF, AY or EF). To convert these IDs to gene ID, for (1) need to discard after “-"; for (2) need to delete the “T” after the underscore.
For how to create refFlat format from GFF3, please see the previous post.
1. Load libraries
library(tidyverse)
library(stringr)
setwd("~/projects/Misc/iRNA-Seq/irna-seq_maize/v3/")
2. Read the file.
## read all columns as character to avoid error at Pt and Mt.
ref <- read_tsv("Zea_mays.AGPv3.31.refFlat", col_names = F,
col_types = cols(.default = col_character())) %>%
arrange(X1)
3. Convert
The basic idea is to separate geneIDs starting with “AC|AF|AY|EF” and “GRMZM”, convert each type and then combine them together.
## use linux command to find unique gene ID starting characters.
system("cut -f 1 Zea_mays.AGPv3.31.refFlat|cut -c 1-3|sort -u |uniq")
## convert for gene IDs starting with "AC, AF, AY and EF"
t_ac <- ref %>%
filter(., str_detect(.$X1, "AC|AF|AY|EF")) %>%
mutate(X1 = str_replace(.$X1, "_FGT", "_FG"))
## convert for gene IDs starting with "GRMZM"
t_gm <- ref %>%
filter(., str_detect(.$X1, "GRMZM")) %>%
mutate(X1 = str_replace(.$X1, "_T.+", ""))
## combine two together
ref_gene <- bind_rows(t_ac, t_gm)
## check all rows are retained.
nrow(ref_gene) == nrow(ref)
## check how many genes in the annotation.
length(unique(ref_gene$X1)) # 39469
4. Write table
write_tsv(ref_gene, "Zea_mays.AGPv3.31.geneID.refFlat")
rm(list = ls())
gc()