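Quantification is a key analysis step in scRNA-seq: it produces the per-cell, per-gene expression counts. Cell Ranger wraps alignment, quality control, and quantification into a single pipeline, which makes it convenient to use.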
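For background on single-cell RNA sequencing (scRNA-seq), see the following articles:
Single-cell RNA sequencing (scRNA-seq): introduction to the workflow
Single-cell RNA sequencing (scRNA-seq): cell isolation and amplification
Single-cell RNA sequencing (scRNA-seq): downloading SRA data and splitting it with fastq-dump
Single-cell RNA sequencing (scRNA-seq): introduction to the Cell Ranger pipeline and data quality control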
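1. Installing the required software
Cell Ranger (cellranger) and samtools are required.
For installing and using cellranger, see: Single-cell RNA sequencing (scRNA-seq): introduction to the Cell Ranger pipeline and data quality control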
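# Install samtools and STAR (both are distributed through the bioconda channel)
conda install -c conda-forge -c bioconda samtools -y
conda install -c conda-forge -c bioconda star -y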
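Before moving on, it can help to confirm that the tools are actually visible on PATH. The following is a minimal sketch rather than part of the original workflow: the helper name missing_tools is hypothetical, and the executable names (cellranger, samtools, STAR) are assumed to match the installs above.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every tool that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names; the bioconda "star" package installs a binary called STAR.
missing=$(missing_tools cellranger samtools STAR)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Not found on PATH:" $missing
else
    echo "All tools found."
fi
```

If a name is reported as missing, the usual causes are an inactive conda environment or a Cell Ranger directory that has not been added to PATH.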
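2. Preparing the human reference genome and gene annotation files
This walkthrough uses the hg38 human reference genome. An hg19 reference is built the same way: substitute the hg19 FASTA and GTF files.
The latest release can be downloaded from the Ensembl database: https://asia.ensembl.org/Homo_sapiens/Info/Index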
# Download the hg38 human reference genome in the background
wget -c -b ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# Download the hg38 gene annotation file in the background
wget -c -b ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
# Latest release at the time of writing: wget -c -b https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz
# Decompress both files (wait for the background downloads to finish first)
gzip -d ./*.gz
# Build the reference genome index
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
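Because the FASTA and GTF are downloaded separately, one check that may be worth running is whether every sequence name used in the GTF also exists in the FASTA (for example, "MT" versus "chrM"). The sketch below is an illustration under stated assumptions, not part of the original workflow: the function name gtf_contigs_missing_from_fasta is hypothetical, and the demo runs on tiny made-up files rather than the real downloads.

```shell
#!/bin/sh
# Hypothetical helper: list sequence names used in a GTF that are absent from a FASTA.
# Usage: gtf_contigs_missing_from_fasta genome.fa annotation.gtf
gtf_contigs_missing_from_fasta() {
    fasta=$1
    gtf=$2
    tmp=$(mktemp -d) || return 1
    # FASTA names: text after ">" up to the first whitespace
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmp/fa.names"
    # GTF names: column 1 of every non-comment line
    grep -v '^#' "$gtf" | cut -f1 | sort -u > "$tmp/gtf.names"
    # comm -13 prints lines that appear only in the second file
    comm -13 "$tmp/fa.names" "$tmp/gtf.names"
    rm -rf "$tmp"
}

# Tiny mock inputs (made-up data, for demonstration only)
demo=$(mktemp -d)
printf '>1 dna:chromosome\nACGT\n>MT dna:chromosome\nACGT\n' > "$demo/mock.fa"
printf '#!genome-build mock\n1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\nchrM\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n' > "$demo/mock.gtf"

gtf_contigs_missing_from_fasta "$demo/mock.fa" "$demo/mock.gtf"
```

On the mock data this prints chrM, the one name that appears in the GTF but not in the FASTA. On the real files, empty output would mean the two naming schemes agree.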
3. Building the genome annotation
3.1 Listing the gene biotypes
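The biotypes can be listed by extracting the gene_biotype attribute from the GTF file. The listing below includes biotypes that the filtering step in 3.2 removes, so the command is run here on the unfiltered Homo_sapiens.GRCh38.84.gtf. Each count is the number of GTF records (lines) carrying that biotype, not the number of genes.
grep -v '^#' Homo_sapiens.GRCh38.84.gtf | awk -v FS='gene_biotype' 'NF>1{print $2}' | awk -F ";" '{print $1}' | sort | uniq -c > gene_biotype.list
# Print the result
cat gene_biotype.list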
"""
128 "3prime_overlapping_ncrna"
45662 "antisense"
24 "bidirectional_promoter_lncrna"
213 "IG_C_gene"
33 "IG_C_pseudogene"
152 "IG_D_gene"
76 "IG_J_gene"
9 "IG_J_pseudogene"
1209 "IG_V_gene"
646 "IG_V_pseudogene"
58181 "lincRNA"
3 "macro_lncRNA"
12594 "miRNA"
6918 "misc_RNA"
6 "Mt_rRNA"
66 "Mt_tRNA"
9 "non_coding"
2006 "polymorphic_pseudogene"
32462 "processed_pseudogene"
15779 "processed_transcript"
2337766 "protein_coding"
104 "pseudogene"
24 "ribozyme"
1647 "rRNA"
147 "scaRNA"
3671 "sense_intronic"
1444 "sense_overlapping"
2871 "snoRNA"
5715 "snRNA"
60 "sRNA"
3207 "TEC"
3282 "transcribed_processed_pseudogene"
59 "transcribed_unitary_pseudogene"
15187 "transcribed_unprocessed_pseudogene"
12 "translated_unprocessed_pseudogene"
125 "TR_C_gene"
16 "TR_D_gene"
316 "TR_J_gene"
12 "TR_J_pseudogene"
848 "TR_V_gene"
110 "TR_V_pseudogene"
3021 "unitary_pseudogene"
13327 "unprocessed_pseudogene"
3 "vaultRNA"
"""
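3.2 Building the genome annotation with a shell script
Run the script with sh mkgtf.sh.
# mkgtf.sh
# File paths (the FASTA and GTF must exist in the current directory; the filtered GTF is written by cellranger mkgtf)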
hg38_fasta=Homo_sapiens.GRCh38.dna.primary_assembly.fa
gtf=Homo_sapiens.GRCh38.84.gtf
filter_gtf=Homo_sapiens.GRCh38.84.filtered.gtf
cellranger mkgtf $gtf $filter_gtf \
--attribute=gene_biotype:protein_coding \
--attribute=gene_biotype:lincRNA \
--attribute=gene_biotype:antisense \
--attribute=gene_biotype:IG_LV_gene \
--attribute=gene_biotype:IG_V_gene \
--attribute=gene_biotype:IG_V_pseudogene \
--attribute=gene_biotype:IG_D_gene \
--attribute=gene_biotype:IG_J_gene \
--attribute=gene_biotype:IG_J_pseudogene \
--attribute=gene_biotype:IG_C_gene \
--attribute=gene_biotype:IG_C_pseudogene \
--attribute=gene_biotype:TR_V_gene \
--attribute=gene_biotype:TR_V_pseudogene \
--attribute=gene_biotype:TR_D_gene \
--attribute=gene_biotype:TR_J_gene \
--attribute=gene_biotype:TR_J_pseudogene \
--attribute=gene_biotype:TR_C_gene
cellranger mkref --genome=GRCh38 \
--fasta=$hg38_fasta \
--genes=$filter_gtf \
--ref-version=hg38
# The message "DONE: Genome generation, EXITING" indicates success.
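Besides watching for the success message, the output directory itself can be inspected. The sketch below assumes that a finished reference named GRCh38 contains reference.json plus fasta/, genes/ and star/ subdirectories; that layout is an assumption that may differ between Cell Ranger versions, and the function name looks_like_cellranger_ref is hypothetical. The demo runs on a mock directory tree, not on a real reference.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory looks like a Cell Ranger reference.
# Assumption: the layout includes reference.json, fasta/, genes/ and star/;
# exact contents can differ between Cell Ranger versions.
looks_like_cellranger_ref() {
    ref=$1
    [ -f "$ref/reference.json" ] || { echo "missing: $ref/reference.json"; return 1; }
    for sub in fasta genes star; do
        [ -d "$ref/$sub" ] || { echo "missing: $ref/$sub/"; return 1; }
    done
    echo "ok: $ref"
}

# Demonstration on a mock directory tree (no real reference is needed here)
demo=$(mktemp -d)
mkdir -p "$demo/GRCh38/fasta" "$demo/GRCh38/genes" "$demo/GRCh38/star"
: > "$demo/GRCh38/reference.json"
looks_like_cellranger_ref "$demo/GRCh38"
```

In practice the function would be pointed at the GRCh38 directory created by the script above, which is the path later passed to cellranger count through its --transcriptome option.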