Snippy进行SNP的检测

实验室牛马 发表于 2024-7-8 21:33:39

# Snippy—快速单倍体变异检出和核心基因组比对

Snippy 是一个用于在单倍体参考基因组和你的下一代测序（NGS）序列读数之间发现单核苷酸多态性（SNPs）和插入/删除变异（indels）的工具。它可以在你的单台计算机上使用尽可能多的CPU（测试到64核）。它以速度为设计重点，并在单个文件夹中生成一致的输出文件集。然后可以使用相同的参考生成一组Snippy结果，并生成核心SNP对齐（最终生成一个系统发育树）。

## 安装 Snippy

### 使用 Conda 安装

```
conda install -c conda-forge -c bioconda -c defaults snippy
```

### 检查安装

确保使用正确的版本：

```
snippy --version
```

检查所有依赖项是否已安装并正常工作：

```
snippy --check
```

## 快速使用

### 输入文件

1. FASTA或GENBANK格式的参考基因组（可以是多个重叠群）
2. 序列读取文件的 FASTQ 或 FASTA 格式（可以.gz压缩）格式
3. 用于放置结果的文件夹

### 快速运行

```
% snippy --cpus 16 --outdir mysnps --ref Listeria.gbk --R1 FDA_R1.fastq.gz --R2 FDA_R2.fastq.gz
<cut>
Walltime used: 3 min, 42 sec
Results folder: mysnps
Done.

% ls mysnps
snps.vcf snps.bed snps.gff snps.csv snps.tab snps.html
snps.bam snps.txt reference/ ...

% head -5 mysnps/snps.tab
CHROMPOS TYPE REF ALT EVIDENCE    FTYPE STRAND NT_POS AA_POS LOCUS_TAG GENE PRODUCT EFFECT
chr    5958snp A G    G:44 A:0    CDS +    41/600 13/200 ECO_0001dnaA replication protein DnaA missense_variant c.548A>C p.Lys183Thr
chr 35524snp G T    T:73 G:1 C:1 tRNA-
chr 45722ins ATT ATTT ATTT:43 ATT:1 CDS -                ECO_0045gyrA DNA gyrase
chr 100541del CAAACAA CAA:38 CAAA:1 CDS +                ECO_0179    hypothetical protein
plas    619complex GATCAATA GATC:28 AATA:0
plas 3221mnp GA CT CT:39 CT:0    CDS +                ECO_p012rephypothetical protein

% snippy-core --prefix core mysnps1 mysnps2 mysnps3 mysnps4
Loaded 4 SNP tables.
Found 2814 core SNPs from 96615 SNPs.

% ls core.*
core.aln core.tab core.tab core.txt core.vcf
```

### 输出文件

| Extension       | Description                                                                                                                |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------ |
| .tab             | A simple(http://en.wikipedia.org/wiki/Tab-separated_values) summary of all the variants                         |
| .csv             | A(http://en.wikipedia.org/wiki/Comma-separated_values) version of the .tab file                               |
| .html          | A(http://en.wikipedia.org/wiki/HTML) version of the .tab file                                                             |
| .vcf             | The final annotated variants in(http://en.wikipedia.org/wiki/Variant_Call_Format) format                                  |
| .bed             | The variants in(http://genome.ucsc.edu/FAQ/FAQformat.html#format1) format                                                 |
| .gff             | The variants in(http://www.sequenceontology.org/gff3.shtml) format                                                       |
| .bam             | The alignments in(http://en.wikipedia.org/wiki/SAMtools) format. Includes unmapped, multimapping reads. Excludes duplicates. |
| .bam.bai       | Index for the .bam file                                                                                                    |
| .log             | A log file with the commands run and their outputs                                                                         |
| .aligned.fa    | A version of the reference but with at position with and for (does not have variants `-depth=0``N0 < depth < --mincov`)    |
| .consensus.fa    | A version of the reference genome with*all* variants instantiated                                                          |
| .consensus.subs.fa | A version of the reference genome with*only substitution* variants instantiated                                           |
| .raw.vcf       | The unfiltered variant calls from Freebayes                                                                                  |
| .filt.vcf       | The filtered variant calls from Freebayes                                                                                  |
| .vcf.gz          | Compressed .vcf file via(http://blastedbio.blogspot.com.au/2011/11/bgzf-blocked-bigger-better-gzip.html)                |
| .vcf.gz.csi    | Index for the .vcf.gz via `bcftools index`)                                                                               |

### TAB/CSV/HTML 格式的每列说明

| Name | Description                                                                            |
| -------- | ----------------------------------------------------------------------------------------- |
| CHROM | The sequence the variant was found in eg. the name after the `>` in the FASTA reference |
| POS    | Position in the sequence, counting from 1                                              |
| TYPE | The variant type: snp msp ins del complex                                              |
| REF    | The nucleotide(s) in the reference                                                    |
| ALT    | The alternate nucleotide(s) supported by the reads                                     |
| EVIDENCE | Frequency counts for REF and ALT                                                       |

### 变异类型

| Type | Name                         | Example    |
| ------- | ------------------------------- | ------------ |
| snp | Single Nucleotide Polymorphism| A => T    |
| mnp | Multiple Nuclotide Polymorphism | GC => AT |
| ins | Insertion                   | ATT => AGTT|
| del | Deletion                      | ACGG => ACG|
| complex | Combination of snp/mnp       | ATTC => GTTA |

### variant caller的主要参数

变异类型的调用由 (https://github.com/ekg/freebayes) 完成。关键参数包括：

* `--mincov` - 要考虑的覆盖站点的最小读取次数（默认值=10）
* `--minfrac` - 必须与reference不同的读数的最小比例
* `--minqual` - 最小 VCF 的“quality”（默认值 = 100）
*

### 详细查看比对结果 `snippy-vcf_report`

如果使用该选项运行 Snippy，它将自动运行并生成一个包含部分的对于每个 SNP，就像这样：`--reportsnippy-vcf_report``snps.report.txtsnps.vcf`

```
>LBB_contig000001:10332 snp A=>T DP=7 Q=66.3052

      10301 10311 10321 10331 10341 10351 10361
tcttctccgagaagggaatataatttaaaaaaattcttaaataattcccttccctcccgttataaaaattcttcgcttat
........................................T.......................................
,,,,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,t,,t,,,,,,,,,,,,,,,,g,,,,,,,g,,,,,,,,,t,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .......T..................A............A.......
.........................A........A.....T........... .........C..............
.....A.....................C..C........CT.................TA.............
,a,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,t,,,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,ga,,,,,,,c,,,,,,,t,,,,,,,,,,g,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
                        ............T.C..............G...............G......
                                                ,,,,,,,g,,,,,,,,g,,,,,,,,,,,
                                                      g,,,,,,,,,,,,,,,,,,,,
```

如果希望在运行 Snippy 后生成此报告，可以直接运行：

```
cd snippydir
snippy-vcf_report --cpus 8 --auto > snps.report.txt
```

如果要在 Web 浏览器中查看 HTML 版本，请使用以下选项：`--html`

```
cd snippydir
snippy-vcf_report --html --cpus 16 --auto > snps.report.html
```

## 主要参数

* `--rgid`将在 BAM 和 VCF 文件中设置读取组（） ID （）和示例（）。如果未提供，它将同时使用和的文件夹名称。`RG``ID``SM``--outdir``ID``SM`
* `--mapqual`是变体调用中要接受的最低映射质量。BWA MEM 用于表示读取是“唯一映射的”。`60`
* `--basequal`是核苷酸在变异检出中需要使用的最低质量。我们使用 which 对应于 ~5% 的错误概率。这是一个传统的 SAMtools 值。`13`
* `--maxsoft`是在丢弃之前允许软夹紧的对齐基数对齐方式。这是为了鼓励全局对齐而不是局部对齐，并传递给该工具。`samclip`
* `--mincov`并用于将硬阈值应用于变体调用超出现有的统计措施。最佳值取决于您的排序深度和污染率。通常使用 10 和 0.9 的值。`--minfrac`
* `--targets`采用 BED 文件，并且仅调用这些区域中的变体。通常不需要除非您只对特定位点的变体感兴趣（例如。AMR基因），但仍然进行WGS而不是扩增子测序。
* `--contigs`允许您从重叠群而不是读取中调用 SNP。将重叠群打断成reads，以便将调用与多样本分析。

## 核心SNP系统发育

如果从同一参考进行多个单株 SNP的鉴定，则可以产生“核心 SNP”的比对，可用于构建高分辨率系统发育树（忽略可能的重组）。“核心位点” 是存在于所有样本中的基因组位置。核心位点可以每个样品中都有相同的核苷酸（“单态”），或者某些样品可以是不同的（“多态性”或“变异”）。如果我们忽略并发症 “ins”、“del”变异类型，仅使用变异位点，这些是“核心 SNP 基因组”。

### 输入文件

一组使用同一个参考基因组的比对的snippy文件夹

#### snippy-multi

批量运行，创建一个input.tab文件，如下所示，处理双端reads、单端reads或者contigs

```
# input.tab = ID R1
Isolate1 /path/to/R1.fq.gz /path/to/R2.fq.gz
Isolate1b /path/to/R1.fastq.gz /path/to/R2.fastq.gz
Isolate1c /path/to/R1.fa /path/to/R2.fa

# single end reads supported too

Isolate2 /path/to/SE.fq.gz
Isolate2b /path/to/iontorrent.fastq

# or already assembled contigs if you dont have reads

Isolate3 /path/to/contigs.fa
Isolate3b /path/to/reference.fna.gz
```

运行命令如下：

```
% snippy-multi input.tab --ref Reference.gbk --cpus 16 > runme.sh

% less runme.sh# check the script makes sense

% sh ./runme.sh # leave it running over lunch
```

最后生成一系列的核心基因组 SNP 比对文件，snippy-corecore.*

### 输出文件

| Extension    | Description                                                                         |
| --------------- | -------------------------------------------------------------------------------------- |
| .aln          | A core SNP alignment in the `--aformat` format (default FASTA)                   |
| .full.aln    | A whole genome SNP alignment (includes invariant sites)                            |
| .tab          | Tab-separated columnar list of**core** SNP sites with alleles but NO annotations |
| .vcf          | Multi-sample VCF file with genotype `GT` tags for all discovered alleles          |
| .txt          | Tab-separated columnar list of alignment/core-size statistics                      |
| .ref.fa       | FASTA version/copy of the `--ref`                                                 |
| .self\_mask.bed | BED file generated if `--mask auto` is used.                                     |

#### core.full.aln文件过滤

core.full.aln 文件是一个FASTA格式的多序列对齐文件。它为参考序列提供了一个序列，并且为参与核心基因组计算的每个样本提供了一个序列。每个序列的长度都与参考序列的长度相同。

当你需要进一步构建进化树是，可以将snppy-clean _ full _ aln中所有“怪异”字符删除，并使用N替换。

```%
% run_gubbins.py -p gubbins clean.full.aln
% snp-sites -c gubbins.filtered_polymorphic_sites.fasta > clean.core.aln
% FastTree -gtr -nt clean.core.aln > clean.core.tree
```

## 进阶分析

### reads数据量太大

因为大多数 SNP 将通过 50-100 倍深度。如果您知道您拥有的数据量是所需数据量的 10 倍， Snippy 可以随机对 FASTQ 数据进行子采样：

```#
snippy --subsample 0.1...
<snip>
Sub-sampling reads at rate 0.1
<snip>
```

### 仅在特定区域鉴定SNP

如果寻找特定的 SNP，只需将感兴趣的区域放入 BED 文件中：

```
snippy --targets sites.bed ...
```

### 在重叠群之间查找 SNP

有时，您的一个样本仅以重叠群的形式提供，而没有相应的 FASTQ 数据。仍然可以将这些重叠群与 Snippy 一起使用鉴定SNP。

```
% ls
ref.gbk mutant.fasta

% snippy --outdir mut1 --ref ref.gbk --ctgs mut1.fasta
Shredding mut1.fasta into pseudo-reads.
Identified 257 variants.

% snippy --outdir mut2 --ref ref.gbk --ctgs mut2.fasta
Shredding mut2.fasta into pseudo-reads.
Identified 413 variants.

% snippy-core mut1 mut2
Found 129 core SNPs from 541 variant sites.

% ls
core.aln core.full.aln ...
```

此输出文件夹完全兼容，因此您可以混合 FASTQ 和基于重叠群的输出文件夹以生成对齐方式。snippy-coresnippy

页: [1]

ResBang生信交流网's Archiver

Snippy进行SNP的检测