「Learn, summarize, and share as you go!」
Preface
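In earlier tutorials we covered how to download FASTQ data from NCBI; for details, see "Upstream Transcriptome Analysis Tutorial | Downloading NCBI Transcriptome Data" and "Downloading SRA Data from the NCBI Database". In those posts we mainly downloaded data by SRR run accession, either one run at a time or in batches.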
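Today, building on those tutorials, we share another excellent data download tool: Kingfisher. The "King" in the name hints at how capable it is, and it really is pleasant to use. Downloads are also very fast and can come close to saturating your bandwidth (PS: this depends on the time of day and on luck; if the network or hardware is the bottleneck, even the best software cannot help).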
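First, a screenshot of the download speed.
This run saturated the download bandwidth at my institute. If your server sits on a 10-gigabit network, the speed will be far higher.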
Software website
https://wwood.github.io/kingfisher-download/
Installation
You can create a new environment, or install directly into base.
## Create the environment
conda create -n kingfisher -c conda-forge -c bioconda kingfisher
conda activate kingfisher
## or
mamba create -n kingfisher -c conda-forge -c bioconda kingfisher
mamba activate kingfisher
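Before moving on, it can help to confirm that the kingfisher executable is visible in the current shell. The snippet below is only a minimal sketch: `check_tool` is a helper name made up for this example and is not part of Kingfisher.

```shell
# check_tool is a hypothetical helper for this sketch, not a Kingfisher command.
# It reports whether the named program can be found on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found (is the conda environment activated?)"
    fi
}

check_tool kingfisher
```

If this prints "not found", activating the environment created above is the first thing to try.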
Basic usage
kingfisher get: download data
- Download the 15 runs in NCBI BioProject PRJNA938179 and extract them to FASTQ files after download
kingfisher get -p PRJNA938179 -m aws-http prefetch aws-cp gcp-cp ena-ascp ena-ftp -f fastq --check-md5sums
kingfisher get -p PRJNA938179 -m aws-http prefetch aws-cp gcp-cp ena-ascp ena-ftp -t 10 -f fastq.gz --check-md5sums
Result files
- Download a single run from the ENA database
kingfisher get -r ERR1739691 -m ena-ascp aws-http prefetch
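After a download with `-f fastq.gz`, a quick integrity pass over the compressed files can catch truncated downloads early. The following is a sketch of one possible check using `gzip -t`, not something Kingfisher provides. `check_gz` is a hypothetical helper, and it assumes the output files end in `.fastq.gz` and sit in the directory you pass in.

```shell
# check_gz is a hypothetical helper for this sketch.
# It tests every *.fastq.gz file in the given directory with `gzip -t`
# and prints one line per file: "OK <file>" or "CORRUPT <file>".
check_gz() {
    for f in "$1"/*.fastq.gz; do
        [ -e "$f" ] || continue    # the glob matched nothing
        if gzip -t "$f" 2>/dev/null; then
            echo "OK $f"
        else
            echo "CORRUPT $f"
        fi
    done
}

# Check the current directory (prints nothing if no .fastq.gz files exist)
check_gz .
```

Any file reported as CORRUPT is worth downloading again, for example by rerunning the same `kingfisher get` command with `--force`.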
「Relevant parameters:」
-r, --run-identifiers RUN_IDENTIFIERS [RUN_IDENTIFIERS ...]
Run number(s) to download/extract e.g. ERR1739691
--run-identifiers-list RUN_IDENTIFIERS_LIST
Text file containing a newline-separated list of run identifiers i.e. a 1 column CSV file.
-p, --bioprojects BIOPROJECTS [BIOPROJECTS ...]
BioProject IDs number(s) to download/extract from e.g. PRJNA621514 or SRP260223
-m, --download-methods {aws-http,prefetch,aws-cp,gcp-cp,ena-ascp,ena-ftp} [{aws-http,prefetch,aws-cp,gcp-cp,ena-ascp,ena-ftp} ...]
How to download .sra file. If multiple are specified, each is tried in turn until one works [required].
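The `--run-identifiers-list` option takes a plain text file with one run accession per line. As a rough sketch of how such a file could be prepared, the snippet below filters a few inputs and keeps only those that look like run accessions (SRR, ERR, or DRR followed by digits). `is_run_accession` and the file name `runs.txt` are assumptions made for this example; `SSR123` and the empty string are deliberately malformed inputs.

```shell
# is_run_accession is a hypothetical helper for this sketch: it accepts
# SRR/ERR/DRR followed by digits, the usual shape of a run accession.
is_run_accession() {
    printf '%s\n' "$1" | grep -Eq '^[SED]RR[0-9]+$'
}

# Start with an empty list, then keep only well-formed accessions.
: > runs.txt
for acc in ERR1739691 "SSR123" ""; do
    if is_run_accession "$acc"; then
        printf '%s\n' "$acc" >> runs.txt
    else
        echo "skipping malformed accession: '$acc'" >&2
    fi
done

# Print the download command to run once the list looks right.
echo "kingfisher get --run-identifiers-list runs.txt -m ena-ftp prefetch"
```

The last line only prints the command; paste it into the shell (or drop the `echo`) when you are ready to download.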
「Download tuning parameters」
「Full documentation:」
https://wwood.github.io/kingfisher-download/usage/get
Example:
kingfisher extract
kingfisher extract converts sequence data out of the *.sra format, for example to FASTQ.
kingfisher extract --sra ERR1739691.sra -t 16 -f fastq.gz
「Relevant parameters:」
--sra SRA
Extract this SRA file [required]
--output-directory OUTPUT_DIRECTORY
Output directory to write to [default: current working directory]
-f, --output-format-possibilities {sra,fastq,fastq.gz,fasta,fasta.gz} [{sra,fastq,fastq.gz,fasta,fasta.gz} ...]
Allowable output formats. If more than one is specified, downloaded data will processed as little as possible [default: "fastq fastq.gz"]
--force
Re-download / extract files even if they already exist [default: Do not].
--unsorted
Output the sequences in arbitrary order, usually the order that they appear in the .sra file. Even pairs of reads may be in the usual order, but it is possible to tell which pair is which, and which is a forward and which is a reverse read from the name [default: Do not].
Currently requires download from NCBI rather than ENA.
--stdout
Output sequences to STDOUT. Currently requires --unsorted [default: Do not].
-t, --threads THREADS
Number of threads to use for extraction [default: 8]
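When a directory holds many .sra files, the extract command can be wrapped in a loop. This is a minimal sketch under a few assumptions: `extract_all`, the `DRY_RUN` switch, and the output directory name `fastq_out` are invented for this example. With `DRY_RUN=1` (the default here) the commands are only printed, so nothing is run by accident.

```shell
# DRY_RUN=1 (default in this sketch) prints each command instead of running it.
# Set DRY_RUN=0 to actually call kingfisher.
DRY_RUN="${DRY_RUN:-1}"

# extract_all is a hypothetical helper for this sketch.
# $1: directory containing .sra files, $2: output directory
extract_all() {
    for sra in "$1"/*.sra; do
        [ -e "$sra" ] || continue    # the glob matched nothing
        if [ "$DRY_RUN" = "1" ]; then
            echo "kingfisher extract --sra $sra --output-directory $2 -t 16 -f fastq.gz"
        else
            mkdir -p "$2"
            kingfisher extract --sra "$sra" --output-directory "$2" -t 16 -f fastq.gz
        fi
    done
}

extract_all . fastq_out
```

The thread count and output format are copied from the single-file example above; adjust them to your machine.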
kingfisher annotate
Annotate runs by their metadata e.g. number of sequenced bases, BioSample attributes, etc.
## Annotate the metadata of a run
kingfisher annotate -r ERR1739691
## Output metadata of all runs in a BioProject to a CSV file
kingfisher annotate --bioprojects PRJNA177893 -o PRJNA177893.csv -f csv
## Output the full set of metadata from a run
kingfisher annotate -r ERR1739691 -a
「Relevant parameters:」
-r, --run-identifiers RUN_IDENTIFIERS [RUN_IDENTIFIERS ...]
Run number to download/extract e.g. ERR1739691
--run-identifiers-list, --run-accession-list, --run-identifiers-list RUN_IDENTIFIERS_LIST
Text file containing a newline-separated list of run identifiers i.e. a 1 column CSV file.
-p, --bioprojects BIOPROJECTS [BIOPROJECTS ...]
BioProject IDs number(s) to download/extract from e.g. PRJNA621514 or SRP260223
-o, --output-file OUTPUT_FILE
Output file to write to [default: stdout]
-f, --output-format {human,csv,tsv,json,feather,parquet}
Output format [default human]
-a, --all-columns
Print all metadata columns [default: Print only a few select ones]
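Once the metadata has been written to a CSV file, counting its rows is a quick way to see how many runs a BioProject contains before starting a large download. The snippet below is a sketch only: `count_csv_rows` is a hypothetical helper, and it assumes the CSV has exactly one header line, one row per run, and no fields that span multiple lines.

```shell
# count_csv_rows is a hypothetical helper for this sketch.
# It counts the non-empty lines after the header line of a CSV file.
count_csv_rows() {
    tail -n +2 "$1" | grep -c .
}

# File name taken from the annotate example above
csv=PRJNA177893.csv
if [ -f "$csv" ]; then
    echo "$csv lists $(count_csv_rows "$csv") run(s)"
else
    echo "$csv not found; run kingfisher annotate first"
fi
```

If the count differs from what the project page shows, check the CSV by hand before downloading.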
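「小杜的生信筆記」 mainly publishes and collects bioinformatics tutorials, along with R-based analysis and visualization (data analysis, plotting, and so on), and shares interesting papers and learning materials!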