R语言包EGAD用来评估基因共表达网络的表现(2)准备自己的数据
EGAD 这个R包的安装需要用到bioconductor,设置国内镜像[https://mirrors.tuna.tsinghua.edu.cn/help/bioconductor/](https://mirrors.tuna.tsinghua.edu.cn/help/bioconductor/)
先运行一下这行命令
```
options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
```
基于基因表达矩阵构建共表达网络的代码来源于论文
**Aggregated gene co-expression networks for predicting transcription factor regulatory landscapes in a non-model plant species**
[**https://www.biorxiv.org/content/10.1101/2023.04.24.538042v1**](https://www.biorxiv.org/content/10.1101/2023.04.24.538042v1)
**这个论文应该是已经发表了,题目稍微修改了**
**Aggregated gene co-expression networks predict transcription factor regulatory landscapes in grapevine**
[**https://academic.oup.com/jxb/article-abstract/74/21/6522/7260101?redirectedFrom=fulltext**](https://academic.oup.com/jxb/article-abstract/74/21/6522/7260101?redirectedFrom=fulltext)
**代码链接**
[**https://github.com/Tomsbiolab/non_agg_WGCN/tree/main**](https://github.com/Tomsbiolab/non_agg_WGCN/tree/main)
**西红柿的基因表达矩阵和基因对应的GO注释来源于论文**
**Graph pangenome captures missing heritability and empowers tomato breeding**
!(data/attachment/forum/plugin_zhanmishu_markdown/202408/8549b4be5af485088b7123a5a5e5fba2_1722854347_5929.jpg)
论文中提供的基因GO注释格式是这个格式,需要转换成下面的格式
!(data/attachment/forum/plugin_zhanmishu_markdown/202408/e026342f14cc475f44fc814048f76e37_1722854347_6700.jpg)
两列 基因id 和 go
## 对表达矩阵的处理
表达矩阵里有5万多个基因,只选择Solyc开头的基因
```
library(data.table)
library(tidyverse)
fread("expression_raw_332.txt")%>%
select("target_id",matches("Solyc"))%>%
column_to_rownames("target_id") -> raw.exp
exp.df<-raw.exp[,colSums(raw.exp > 0.5)>=100]
```
保留基因 至少在100个样本中的表达量大于0.5。这个表达量的格式是行是样本,列是基因
计算皮尔逊相关系数
```
pcc = cor(exp.df, method = "pearson")
write.table(pcc, file = "pcc.matrix", quote = F)
```
## 构建共表达网络
计算HRR矩阵
```
python computing_HRR_matrix_TOP420.py p pcc.matrix -o HRR.matrix -t 24
## 构建网络,这里两个脚本具体应该用哪一个暂时没有想明白
## 先用第一个吧
python ../top1_co_occurrence_matrix_version2_TOP420_removing_ties.py -p HRR.matrix -c non_agg_filtered_net_Cyto.csv non_agg_filtered_net_EGAD.csv
## 这一步的时间挺长的
```
## 运行EGAD
```
library(EGAD)
library(tidyverse)
net.dat<-read.table("Nature.tomato/non_agg_filtered_net_EGAD.csv",row.names = 1)
net.dat = 1
tomato.GO<-read_tsv("tomato.GO.table")
go.terms <- tomato.GO%>%pull(X2)%>%unique()
anno <- make_annotations(tomato.GO%>%as.data.frame(),gene_list,go.terms)
which(rowSums(anno) == 0) ## 每个基因都有go注释
```
!(data/attachment/forum/plugin_zhanmishu_markdown/202408/ad598077aa4fb7c218c8896e89acf06d_1722854347_7749.jpg)
```
GO_groups_voted <- run_GBA(network = net.dat%>%as.matrix(), anno)
## AUROCs的值
auc_GO_nv = GO_groups_voted[][,1]
## 这里GO 对应这AUROCs的值,也并不是所有的anno里的go都有AUROCs的值
```
画个频率分布直方图吧
```
library(ggplot2)
load("C:/Users/xiaoming/Desktop/auroc.Rdata")
ggplot(data = auroc.dat,aes(x=X2))+
geom_histogram(binwidth = 0.02,color="grey",
fill="#009f73")+
theme_bw(base_size = 15)+
theme(panel.grid = element_blank())+
labs(x=NULL)+
scale_y_continuous(expand = expansion(mult = c(0,0.02)),
limits = c(0,300))+
geom_vline(xintercept = 0.7,lty="dashed")
```
!(data/attachment/forum/plugin_zhanmishu_markdown/202408/887c54d67d8f10f3f44ac252f0db795f_1722854347_5652.jpg)
**欢迎大家关注我的公众号**
**小明的数据分析笔记本**
> 小明的数据分析笔记本 公众号 主要分享:1、R语言和python做数据分析和数据可视化的简单小例子;2、园艺植物相关转录组学、基因组学、群体遗传学文献阅读笔记;3、生物信息学入门学习资料及自己的学习笔记!
>
页:
[1]