全基因组关联分析(GWAS)后使用机器学习算法+显著位点+表型做基因组预测

小明的数据分析笔记本

论文

Integrative genomics reveals the polygenic basis of seedlessness in grapevine

https://doi.org/10.1016/j.cub.2024.07.022

论文中提供的部分代码

https://github.com/zhouyflab/Polygenic_Basis_Seedless_Grapes

论文中做完GWAS后接着用显著位点加表型用机器学习算法做基因组预测，使用了很多模型。显著位点用连锁不平衡值做了过滤。今天的推文按照论文中的思路做一下。

论文中首先使用beagle软件对vcf文件进行基因型填充，使用bcftools view命令提取需要用到的变异位点，然后提供了一个perl脚本把vcf文件中的基因型修改成了 0 1 2 的格式

运行论文中提供的perl脚本，我用拟南芥的数据做测试

perl change_vcf_format_impute.pl practice.impute.vcf | cut -f 3,10- | grep -v "^#"

每行是一个位点，每列是一个样本

没有找到论文中做机器学习的代码，使用sklearn，先试着用简单的线性模型跑一下代码（其他模型的代码都类似）

我这里就随便选择了拟南芥vcf文件的前100行位点，没有选择显著位点，这个预测出来的准确性肯定是没有保证的

sklearn的代码参考 https://www.geeksforgeeks.org/multiple-linear-regression-with-scikit-learn/#

表型数据是每行一个样本 Y

基因型数据 X 每行是样本列是位点，上面获取的基因型数据还需要做个转置

perl change_vcf_format_impute.pl practice.impute.vcf | cut -f 3,10- | grep -v "^##" > genotype.txt

读取基因型数据

import pandas as pd
genotype = pd.read_table("GWAS/ATgwas/genotype.txt").set_index("ID") ## 把第一列样本ID设置为行名
genotype_T = genotype.transpose() ## 转置

读取表型数据

pheno = pd.read_table("GWAS/ATgwas/ATphenoNewPhytologist.txt",dtype={"sampleID":str})
## 因为样本名是数字编号，将其指定为字符类型

合并数据

dat = pd.merge(pheno,genotype_T.reset_index(),left_on="sampleID",right_on="index")
new_dat = dat.drop(columns=["sampleID","index"]) ## 去掉多余的两列数据

线性模型需要用到的模块

from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_socre

拆分测试集和训练集

X = new_dat.drop(columns=['seedSize'])
Y = new_dat['seedSize']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=101)

运行模型

from sklearn import preprocessing 
model = LinearRegression() 
model.fit(X_train, y_train)
predictions = model.predict(X_test) 

print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions)) 
r2_score(y_test,predictions)

这里的R2 非常非常小，因为位点都是随机选择的，有时间把前面的gwas分析也做一下，用显著位点来试试，也试试其他模型

欢迎大家关注我的公众号 小明的数据分析笔记本

小明的数据分析笔记本公众号主要分享：1、R语言和python做数据分析和数据可视化的简单小例子；2、园艺植物相关转录组学、基因组学、群体遗传学文献阅读笔记；3、生物信息学入门学习资料及自己的学习笔记！

微信扫一扫分享文章