Sharing a complete chromosome-level genome assembly and annotation pipeline
## Original article
![](data/attachment/forum/plugin_zhanmishu_markdown/202407/9002cfbef12bc41b8ba1f428e9dbddd9_1720005758_3876.jpg)
**This tutorial follows the genome assembly of Plantago ovata (psyllium), described in "A chromosome-level genome assembly of Plantago ovata":** [https://www.nature.com/articles/s41598-022-25078-5](https://www.nature.com/articles/s41598-022-25078-5).
**The authors provide the corresponding assembly code on GitHub, and it is fairly complete, with detailed commands for both assembly and annotation.**
**To get the corresponding code and data, reply** **20240703** **in the backend. Please read the reply keyword carefully; every day many people reply with the wrong keyword, and I do not have the time and energy to answer each one individually.**
## Abstract
**Plantago ovata is grown for the husk of its seeds (psyllium). When wet, the husk turns into a mucilage whose properties suit the pharmaceutical industry, for example in supplements that help control blood cholesterol, and the food industry, for making gluten-free products. Breeding efforts to improve husk quantity and quality have had limited success, in part because a reference genome was lacking. Here, we used 5.98 million PacBio and 636.5 million Hi-C reads to build the first chromosome-level reference assembly of P. ovata. We also estimated the genome size from corrected PacBio reads and generated gene models from transcripts. The final assembly covers about 500 Mb with 99.3% genome completeness. 97% of the sequence is anchored to four chromosomes, with an N50 of about 128.87 Mb. The genome contains 61.90% repetitive sequence, of which 40.04% are long terminal repeats. We identified 41,820 protein-coding genes, 411 non-coding RNAs, 108 ribosomal RNAs, and 1,295 transfer RNAs. This genome will provide a resource for plant breeding programs, for example to reduce agronomic constraints such as seed shattering, to improve psyllium yield and quality, and to address crop disease susceptibility.**
![](data/attachment/forum/plugin_zhanmishu_markdown/202407/671affecbf0cd5ead0861b96b62f4c96_1720005758_7815.jpg)
![](data/attachment/forum/plugin_zhanmishu_markdown/202407/65e17bbb7355d7b7897096246202e9b9_1720005758_3434.jpg)
![](data/attachment/forum/plugin_zhanmishu_markdown/202407/d6d315a649dbe03c796cea0bb3f1a7c5_1720005758_1768.jpg)
![](data/attachment/forum/plugin_zhanmishu_markdown/202407/d00d1a598a8f7aae09eae84bc5777438_1720005758_1266.jpg)
### Contig assembly
**Environment setup and software installation**
```plain
conda create -n pacbio
conda activate pacbio
conda install -c bioconda bam2fastx
```
![](data/attachment/forum/plugin_zhanmishu_markdown/202407/47651026dc1dc33aa91410a80ac6c917_1720005758_2710.jpg)
**Converting PacBio unaligned bam files into fastq files**
```plain
bam2fastq -c 9 m54078_170831_060817.subreads.bam -o m54078_170831_060817.subreads
bam2fastq -c 9 m54078_170831_160707.subreads.bam -o m54078_170831_160707.subreads
bam2fastq -c 9 m54078_170901_080645.subreads.bam -o m54078_170901_080645.subreads
bam2fastq -c 9 m54078_170901_180552.subreads.bam -o m54078_170901_180552.subreads
bam2fastq -c 9 m54078_170902_041524.subreads.bam -o m54078_170902_041524.subreads
bam2fastq -c 9 m54078_170902_142504.subreads.bam -o m54078_170902_142504.subreads
bam2fastq -c 9 m54078_170903_003441.subreads.bam -o m54078_170903_003441.subreads
### if you have many files, you can do this:
for i in *.bam
do
outfile=$(basename ${i} .bam)
bam2fastq -c 9 ${i} -o ${outfile}
done
### concatenating and compressing all fastq files
cat *.subreads.fastq > Plantago_pacbio.fastq
bgzip -c -l 9 Plantago_pacbio.fastq > Plantago_pacbio.fastq.gz
```
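After conversion it is worth checking that no reads were lost. A minimal Python sketch (not part of the original pipeline; the file name is illustrative) that counts reads in a plain or gzipped FASTQ:

```python
# Sanity check after bam2fastq: count the reads in a (gzipped) FASTQ.
# A well-formed FASTQ has exactly 4 lines per read.
import gzip

def count_fastq_reads(path):
    """Count reads in a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        n_lines = sum(1 for _ in fh)
    if n_lines % 4 != 0:
        raise ValueError("truncated FASTQ: line count not divisible by 4")
    return n_lines // 4

# e.g. count_fastq_reads("Plantago_pacbio.fastq.gz")
```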
**Removing unwanted reads**
**Creating index file**
```plain
minimap2 -d plantago_chloroplast.fasta.gz.mmi plantago_chloroplast.fasta.gz
minimap2 -d plantago_mitocondria.fasta.gz.mmi plantago_mitocondria.fasta.gz
```
**Removing chloroplast reads** (mitochondrial reads are removed the same way using the mitochondria index, which yields the Plantago_pacbio_no_mito_chloro.fastq.gz used in later steps)
```plain
data="Plantago_pacbio.fastq.gz"
index="plantago_chloroplast.fasta.gz.mmi"
output="Plantago_pacbio_no_chloro.bam"
time minimap2 \
-ax map-pb \
-t 2 $index $data \
| samtools view -f 0x04 -u \
| samtools sort \
--threads 2 -l 7 \
-o $output
### Converting a bam file to fastq file then compressing it
bamToFastq -i Plantago_pacbio_no_chloro.bam -fq Plantago_pacbio_no_chloro.fastq
bgzip -c -l 9 Plantago_pacbio_no_chloro.fastq > Plantago_pacbio_no_chloro.fastq.gz
```
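The `-f 0x04` filter above is what discards the organelle reads: bit 0x4 of the SAM FLAG marks a read as unmapped, so the reads that failed to map to the chloroplast reference, i.e. the reads we want to keep, pass the filter. A small illustration of the bit test:

```python
# The SAM FLAG is a bitfield; bit 0x4 means "segment unmapped".
# `samtools view -f 0x04` keeps only records with this bit set,
# i.e. reads that did NOT align to the organelle reference.
UNMAPPED = 0x4

def is_unmapped(flag):
    """True if the SAM FLAG has the unmapped bit set."""
    return bool(flag & UNMAPPED)

print(is_unmapped(4))    # True  (unmapped)
print(is_unmapped(0))    # False (mapped, forward strand)
print(is_unmapped(16))   # False (mapped, reverse strand)
```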
**Genome size prediction**
```plain
jellyfish count -C -m 21 -s 1000000000 -t 10 *.fastq -o reads_21.jf
jellyfish count -C -m 31 -s 3000000000 -t 20 *.fastq -o reads_31.jf
jellyfish count -C -m 39 -s 3000000000 -t 10 *.fastq -o reads_39.jf
jellyfish count -C -m 45 -s 3000000000 -t 20 *.fastq -o reads_45.jf
jellyfish count -C -m 50 -s 3000000000 -t 20 *.fastq -o reads_50.jf
jellyfish count -C -m 70 -s 3000000000 -t 30 *.fastq -o reads_70.jf
jellyfish count -C -m 100 -s 3000000000 -t 50 *.fastq -o reads_100.jf
jellyfish histo -t 10 reads_21.jf > reads_21.histo
jellyfish histo -t 10 reads_31.jf > reads_31.histo
jellyfish histo -t 10 reads_39.jf > reads_39.histo
jellyfish histo -t 10 reads_45.jf > reads_45.histo
jellyfish histo -t 10 reads_50.jf > reads_50.histo
jellyfish histo -t 10 reads_70.jf > reads_70.histo
jellyfish histo -t 10 reads_100.jf > reads_100.histo
Rscript genomescope.R reads_21.histo 21 150 genomescope_plantago_21
Rscript genomescope.R reads_31.histo 31 150 genomescope_plantago_31
Rscript genomescope.R reads_39.histo 39 150 genomescope_plantago_39
Rscript genomescope.R reads_45.histo 45 150 genomescope_plantago_45
Rscript genomescope.R reads_50.histo 50 150 genomescope_plantago_50
Rscript genomescope.R reads_70.histo 70 150 genomescope_plantago_70
Rscript genomescope.R reads_100.histo 100 150 genomescope_plantago_100
```
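GenomeScope fits a k-mer coverage model to each histogram to estimate genome size and heterozygosity. The classic back-of-envelope version of the same idea, total k-mers divided by the coverage of the homozygous peak, can be sketched as follows (toy numbers, not the Plantago data):

```python
# Back-of-envelope genome-size estimate from a jellyfish .histo file.
# GenomeScope fits a proper mixture model; this is only the rough
# total-kmers / peak-coverage approximation.

def estimate_genome_size(histo, min_cov=5):
    """histo: list of (coverage, count) pairs as found in a .histo file.
    Ignores low-coverage error k-mers below min_cov, finds the main peak,
    and returns total_kmers / peak_coverage."""
    usable = [(c, n) for c, n in histo if c >= min_cov]
    peak_cov = max(usable, key=lambda cn: cn[1])[0]      # homozygous peak coverage
    total_kmers = sum(c * n for c, n in usable)          # sum of coverage * count
    return total_kmers / peak_cov

# toy histogram: error k-mers at coverage 1-2, main peak around coverage 20
toy = [(1, 5000), (2, 1000), (18, 200), (19, 400), (20, 600), (21, 380), (22, 150)]
print(round(estimate_genome_size(toy)))  # 1724
```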
**Step 6. Contig assembly**
```plain
canu -p Po_2021 -d canu_2021 genomeSize=584m \
-correct -pacbio Plantago_pacbio_no_mito_chloro.fastq.gz \
corMhapSensitivity=high corMinCoverage=0 corOutCoverage=200 correctedErrorRate=0.105 \
gridEngineArrayOption="-a ARRAY_JOBS%20" gridOptions="--partition=batch --nodes=1 --time=24:00:00"
canu -p Po_2021 -d canu_2021 genomeSize=584m \
-trim -corrected -pacbio "canu_2021/Po_2021.correctedReads.fasta.gz" \
correctedErrorRate=0.105 gridOptions="--partition=batch --nodes=1 --time=72:00:00"
canu -p Po_2021 -d canu_2021 genomeSize=584m \
-assemble -trimmed -corrected -pacbio "canu_2021/Po_2021.trimmedReads.fasta.gz" \
correctedErrorRate=0.105 batMemory=9 ovbMemory=16 ovsMemory=16 executiveMemory=1 cnsMemory=20 cnsThreads=8 \
gridOptions="--partition=batch --nodes=1 --time=24:00:00" "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" utgovlMemory=30
```
**Step 7. Polishing**
**Creating index files**
```plain
for i in *.subreads.bam
do
pbindex $i
done
```
**Filtering reads**
```plain
### creating xml file from all PacBio raw read
ls *.subreads.bam > mymovies.fofn
dataset create --type SubreadSet --name Plantago PlantagoGenomeSet.subreadset.xml mymovies.fofn
### listing clean read names from the fastq file
### (grep '@' is unreliable here because '@' is a valid quality character,
###  so take every 4th line, i.e. the headers, and strip the leading '@')
awk 'NR % 4 == 1 { sub(/^@/, ""); print $1 }' Plantago_pacbio_no_mito_chloro.fastq > PlantagoGenome_final.txt
### filtering PacBio reads using list of clean reads
dataset filter PlantagoGenomeSet.subreadset.xml Plantago_filter.subreadset.xml 'qname=PlantagoGenome_final.txt'
```
**Generating a bam file for each contig**
```plain
### aligning or mapping filtered reads to draft contig assembly
pbmm2 align --log-level INFO --log-file pbmm2_log -j 5 --sample Plantago canu_2021/Po_2021.contigs.fasta.mmi \
Plantago_filter.subreadset.xml Plantago.aligned.bam
### sorting and indexing
samtools sort -m 10G --threads 5 -l 7 -o Plantago_sorted_aligned.bam -T tmp.ali Plantago.aligned.bam
samtools index -b Plantago_sorted_aligned.bam Plantago_sorted_aligned.bam.bai
### getting contig names
samtools faidx Po_2021.contigs.fasta ### writes the Po_2021.contigs.fasta.fai index in place
awk '{ print $1 }' Po_2021.contigs.fasta.fai > contigs.txt
### splitting a bam file by contigs
mkdir -p split
for line in `cat contigs.txt`
do
samtools view -bh Plantago_sorted_aligned.bam ${line} > split/${line}.bam
done
### indexing each bam file
for i in split/*.bam
do
samtools index -b $i $i.bai
done
### creating new directories and distributing bam files to the new folders
mkdir bam_0 bam_1 bam_2 bam_3 bam_4 bam_5 bam_6 bam_7
mv tig00007* bam_7
mv tig00006* bam_6
mv tig00005* bam_5
mv tig00004* bam_4
mv tig00003* bam_3
mv tig00002* bam_2
mv tig00001* bam_1
mv tig0000* bam_0
```
**Generating a fasta file for each contig**
```plain
### splitting unpolished assembled genome by contigs
### modifying from https://raw.githubusercontent.com/harish0201/General_Scripts/master/Fasta_splitter.py
#!/usr/bin/env python
"""
usage: python Fasta_splitter.py Po_2021.contigs.fasta 2> /dev/null
"""
import sys

f = open(sys.argv[1], "r")
opened = False
for line in f:
    if line.startswith(">"):
        if opened:
            of.close()
        opened = True
        # name each output file after the first token of the header, without '>'
        of = open("%s.fasta" % line[1:].split()[0], "w")
        print(line.rstrip())
    of.write(line)
of.close()
f.close()
### indexing each fasta file using samtools
for i in *.fasta
do
samtools faidx $i
done
### creating bed files
for i in *.fasta.fai
do
OUTPUT=$(basename "$i" .fasta.fai).bed
awk -F'\t' '{ print $1, $2 }' $i > $OUTPUT
done
### indexing fasta files using a PacBio tool
N=100
(
for file in *.fasta
do
((i=i%N)); ((i++==0)) && wait
OUTPUT=$(basename "$file").mmi
pbmm2 index $file $OUTPUT &
done
)
### creating new directories and distributing fasta files to the new folders
mkdir fasta_0 fasta_1 fasta_2 fasta_3 fasta_4 fasta_5 fasta_6 fasta_7
mv tig00007* fasta_7
mv tig00006* fasta_6
mv tig00005* fasta_5
mv tig00004* fasta_4
mv tig00003* fasta_3
mv tig00002* fasta_2
mv tig00001* fasta_1
mv tig0000* fasta_0
```
**Parallelising polishing step**
```plain
### getting contig names and their starts and ends (e.g. tig00000002:0-23132)
awk '{ print $1":0-"$2 }' Po_2021.contigs.fasta.fai > window.txt
grep "tig00007" window.txt > window_7.txt
grep "tig00006" window.txt > window_6.txt
grep "tig00005" window.txt > window_5.txt
grep "tig00004" window.txt > window_4.txt
grep "tig00003" window.txt > window_3.txt
grep "tig00002" window.txt > window_2.txt
grep "tig00001" window.txt > window_1.txt
grep "tig00000" window.txt > window_0.txt
### polishing step by running individual scripts (8 scripts) below
### (the output and tmp directories must exist first)
mkdir -p polished_seqs/file_{0..7} TMPDIR_1 TMPDIR_2 TMPDIR_3 TMPDIR_4
### script 1
for window in `cat window_0.txt`
do
line=$(echo ${window} | cut -c 1-11)
echo "gcpp --max-iterations 4 --log-level INFO --log-file polished_seqs/file_0/${line}.log \
-w ${window} -r fasta_split/fasta_0/${line}.fasta \
-o polished_seqs/file_0/${line}.polished.fastq,polished_seqs/file_0/${line}.polished.fasta,polished_seqs/file_0/${line}.polished.gff \
split/bam_0/${line}.bam"
done | parallel -j7 --tmpdir TMPDIR_1 --compress
### script 2
for window in `cat window_1.txt`
do
line=$(echo ${window} | cut -c 1-11)
echo "gcpp --max-iterations 4 --log-level INFO --log-file polished_seqs/file_1/${line}.log \
-w ${window} -r fasta_split/fasta_1/${line}.fasta \
-o polished_seqs/file_1/${line}.polished.fastq,polished_seqs/file_1/${line}.polished.fasta,polished_seqs/file_1/${line}.polished.gff \
split/bam_1/${line}.bam"
done | parallel -j7 --tmpdir TMPDIR_1 --compress
### script 3
for window in `cat window_2.txt`
do
line=$(echo ${window} | cut -c 1-11)
echo "gcpp --max-iterations 4 --log-level INFO --log-file polished_seqs/file_2/${line}.log \
-w ${window} -r fasta_split/fasta_2/${line}.fasta \
-o polished_seqs/file_2/${line}.polished.fastq,polished_seqs/file_2/${line}.polished.fasta,polished_seqs/file_2/${line}.polished.gff \
split/bam_2/${line}.bam"
done | parallel -j7 --tmpdir TMPDIR_2 --compress
### script 4
for window in `cat window_3.txt`
do
line=$(echo ${window} | cut -c 1-11)
echo "gcpp --max-iterations 4 --log-level INFO --log-file polished_seqs/file_3/${line}.log \
-w ${window} -r fasta_split/fasta_3/${line}.fasta \
-o polished_seqs/file_3/${line}.polished.fastq,polished_seqs/file_3/${line}.polished.fasta,polished_seqs/file_3/${line}.polished.gff \
split/bam_3/${line}.bam"
done | parallel -j7 --tmpdir TMPDIR_3 --compress
### script 5
for window in `cat window_4.txt`
do
line=$(echo ${window} | cut -c 1-11)
echo "gcpp --max-iterations 4 --log-level INFO --log-file polished_seqs/file_4/${line}.log \
-w ${window} -r fasta_split/fasta_4/${line}.fasta \
-o polished_seqs/file_4/${line}.polished.fastq,polished_seqs/file_4/${line}.polished.fasta,polished_seqs/file_4/${line}.polished.gff \
split/bam_4/${line}.bam"
done | parallel -j7 --tmpdir TMPDIR_4 --compress
### script 6
for window in `cat window_5.txt`
do
line=$(echo ${window} | cut -c 1-11)
echo "gcpp --max-iterations 4 --log-level INFO --log-file polished_seqs/file_5/${line}.log \
-w ${window} -r fasta_split/fasta_5/${line}.fasta \
-o polished_seqs/file_5/${line}.polished.fastq,polished_seqs/file_5/${line}.polished.fasta,polished_seqs/file_5/${line}.polished.gff \
split/bam_5/${line}.bam"
done | parallel -j7 --tmpdir TMPDIR_4 --compress
### script 7
for window in `cat window_6.txt`
do
line=$(echo ${window} | cut -c 1-11)
echo "gcpp --max-iterations 4 --log-level INFO --log-file polished_seqs/file_6/${line}.log \
-w ${window} -r fasta_split/fasta_6/${line}.fasta \
-o polished_seqs/file_6/${line}.polished.fastq,polished_seqs/file_6/${line}.polished.fasta,polished_seqs/file_6/${line}.polished.gff \
split/bam_6/${line}.bam"
done | parallel -j7 --tmpdir TMPDIR_4 --compress
### script 8
for window in `cat window_7.txt`
do
line=$(echo ${window} | cut -c 1-11)
echo "gcpp --max-iterations 4 --log-level INFO --log-file polished_seqs/file_7/${line}.log \
-w ${window} -r fasta_split/fasta_7/${line}.fasta \
-o polished_seqs/file_7/${line}.polished.fastq,polished_seqs/file_7/${line}.polished.fasta,polished_seqs/file_7/${line}.polished.gff \
split/bam_7/${line}.bam"
done | parallel -j7 --tmpdir TMPDIR_4 --compress
### concatenating and compressing
cat file_0/*.fasta > polished_0.fasta
cat file_1/*.fasta > polished_1.fasta
cat file_2/*.fasta > polished_2.fasta
cat file_3/*.fasta > polished_3.fasta
cat file_4/*.fasta > polished_4.fasta
cat file_5/*.fasta > polished_5.fasta
cat file_6/*.fasta > polished_6.fasta
cat file_7/*.fasta > polished_7.fasta
cat polished_*.fasta > polished.fasta
bgzip -c -l 9 polished.fasta > polished.fasta.gz
### removing bubbles from the polished contigs
reformat.sh in=polished.fasta out=polished.fixed.fasta trd tuc -Xmx1g ### To make all uppercase
sed '/^>/s/|arrow//' polished.fixed.fasta > polished.fixed1.fasta ### To remove additional |arrow from contig names
seqtk subseq polished.fixed1.fasta no_bubles.bed > polished_final.fasta
```
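Because the assembly was polished contig-by-contig and then re-concatenated, it is worth verifying that no contig was silently dropped. A minimal sketch (not part of the original pipeline) comparing sequence names between two FASTA files; the `|arrow` suffix added by gcpp is stripped, matching the sed step above:

```python
# Compare sequence names between the unpolished and polished assemblies
# to confirm every contig survived the per-contig polishing round-trip.

def fasta_names(path):
    """Return the set of sequence names (first token after '>', any
    '|'-suffix such as '|arrow' removed) in a FASTA file."""
    names = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                names.add(line[1:].split()[0].split("|")[0])
    return names

# usage sketch (file names from the pipeline above):
# missing = fasta_names("Po_2021.contigs.fasta") - fasta_names("polished.fixed1.fasta")
# print(sorted(missing))
```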
**Step 8. Purging haplotigs (alternative contigs)**
```plain
### indexing polished assembly
minimap2 -d polished.fasta.mmi polished.fasta
### mapping clean read to polished assembly
minimap2 -t 4 -ax map-pb polished.fasta.mmi Plantago_pacbio_no_mito_chloro.fasta --secondary=no \
| samtools sort -m 1G -o polished_aligned.bam -T polished_tmp.ali
### purging and clipping
purge_haplotigs hist -b polished_aligned.bam -g polished.fasta -t 4 -d 200
purge_haplotigs cov -i ./polished_aligned.bam.gencov -l 5 -m 70 -h 190 -o polished_coverage_stats.csv -j 80 -s 80
purge_haplotigs purge -g polished.fasta -c polished_coverage_stats.csv -t 4 -r repeat.bed -o purge_polished
purge_haplotigs clip -p purge_polished.fasta -h purge_polished.haplotigs.fasta
### renaming and compressing file
mv clip.fasta Plantago.fasta
bgzip -c -l 9 Plantago.fasta > Plantago.fasta.gz
```
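The abstract quotes an N50 of about 128.87 Mb for the final assembly. For reference, N50 is the contig length at which contigs of that length or longer cover at least half the assembly; a minimal sketch:

```python
# N50: the length L such that contigs of length >= L cover at least
# half of the total assembly.

def n50(lengths):
    """Compute N50 from a list of contig lengths."""
    total = sum(lengths)
    acc = 0
    for length in sorted(lengths, reverse=True):
        acc += length
        if acc * 2 >= total:
            return length

print(n50([100, 50, 40, 10]))  # 100
```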
**This is only a small part of the content; if you are interested, you can go straight to the original code.**
**Although the genome annotation results later in this paper are not ideal, it is still a fairly good reference for beginners, and the pipeline it provides is very detailed.**
![](data/attachment/forum/plugin_zhanmishu_markdown/202407/caf044d6ef4763d522f2d2e63556c94b_1720005758_7440.jpg)
**If our tutorials help you, please** like + save + share**; that is the biggest support you can give us.**
### Selected past articles
**「1. The most complete WGCNA tutorial (just swap in your data to get all results and figures)」**
* (https://mp.weixin.qq.com/s/M0LAlE-61f2ZfpMiWN-iQg)
* (https://mp.weixin.qq.com/s/Ln9TP74nzWhtvt7obaMp1A)
* (https://mp.weixin.qq.com/s/rU76rLG4AayuiHbDhgOGBg)
* (https://mp.weixin.qq.com/s/Ot2h3LH82vfg4m7YLG8e0w)
* (https://mp.weixin.qq.com/s/ca8_eqirjJYgC8hiaF8jfg)
---
**「2. Publication-quality figure tutorials」**
* [Publication-quality figure tutorials](https://mp.weixin.qq.com/mp/appmsgalbum?__biz=MzAwODY5NDU0MA==&action=getalbum&album_id=2614156000866385923&scene=173&from_msgid=2455848496&from_itemidx=1&count=3&nolastread=1#wechat_redirect)
**「3. Transcriptome analysis tutorials」**
* **「**[Upstream transcriptome analysis tutorial (for beginners)](https://mp.weixin.qq.com/mp/appmsgalbum?__biz=MzAwODY5NDU0MA==&action=getalbum&album_id=2870608342451224581&scene=126&uin=&key=&devicetype=Windows+10+x64&version=63090719&lang=zh_CN&ascene=0)**」**
* **「**[An upstream transcriptome analysis workflow | Hisat2-Stringtie](https://mp.weixin.qq.com/s/A4cFpkrKGqPeESVQl69jcA)**」**
**「4. Downstream transcriptome analysis」**
* [Batch differential expression analysis and plotting | DESeq2-based analysis](https://mp.weixin.qq.com/s/aIiR5v1olhv7bMkrZjfddQ)
* (https://mp.weixin.qq.com/s/NeHjQk9DEWtx5hGOJhWuZw)
* [Single-gene GSEA enrichment analysis](https://mp.weixin.qq.com/s/g8ZWgSIIw_6fZimFLmRMng)
* [Whole gene set GSEA enrichment analysis](https://mp.weixin.qq.com/s/BbkdplN6tHT_tHGEKPABhQ)
**「小杜的生信筆記」** mainly publishes and curates bioinformatics tutorials, along with R-based analysis and visualization (data analysis, figure plotting, etc.), and shares papers and learning resources of interest!