seqkit+awk+sed+grep高级用法技巧合辑
****## 软件安装与示例数据文件下载
ensembl最新版本下载网址: **[https://ftp.ensembl.org/pub/release-112/](https://ftp.ensembl.org/pub/release-112/)**
```
# seqkit安装
conda install -c bioconda seqkit -y
# 格式转换
conda install csvtk -y
# 查看seqkit帮助
seqkit
```
![](data/attachment/forum/plugin_zhanmishu_markdown/202407/de548f7260deb772649e8326b8849c4b_1720311179_5685.png)
## 1. seqkit 查看fa.gz和fq.gz序列文件
```
# 下载GRCh38 fasta文件
wget https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# 解压
# gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# 下载GRCh38 gtf文件
wget https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz
```
**seqkit可自动识别文件扩展名,而无需使用zcat查看.gz文件。**
```
seqkit seq hairpin.fa.gz|less -S
# >cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop
# UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAAC
# UAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA
seqkit seq reads_1.fq.gz|less -S
# @HWI-D00523:240:HF3WGBCXX:1:1101:2574:2226 1:N:0:CTGTAG
# TGAGGAATATTGGTCAATGGGCGCGAGCCTGAACCAGCCAAGTAGCGTGAAGGATGACTGCCCTACGGG
# +
# HIHIIIIIHIIHGHHIHHIIIIIIIIIIIIIIIHHIIIIIHHIHIIIIIGIHIIIIHHHHHHGHIHIII
# 仅显示序列名称
seqkit seq Homo_sapiens.GRCh37.dna.primary_assembly.fa -n|head
# 1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 REF
# 10 dna:chromosome chromosome:GRCh37:10:1:135534747:1 REF
# 11 dna:chromosome chromosome:GRCh37:11:1:135006516:1 REF
```
## 2. seqkit 压缩fasta文件
```
# -o 自动识别文件扩展名, 将fasta压缩为.gz格式
seqkit seq Homo_sapiens.GRCh37.dna.primary_assembly.fa.fasta \
-o Homo_sapiens.GRCh37.dna.primary_assembly.fa.fasta.gz
```
## 3. seqkit 统计fasta/fastq文件
```
# 单个文件统计
seqkit stats Homo_sapiens.GRCh37.dna.primary_assembly.fa
# file formattypenum_seqs sum_lenmin_len avg_len max_len
# Homo_sapiens.GRCh37.dna.primary_assembly.faFASTA DNA 843,101,804,739 4,26236,926,246.9249,250,621
# 统计全部fq.gz 和 fq.gz文件全部信息
seqkit stats *.f{a,q}.gz -a
# file formattypenum_seqs sum_lenmin_lenavg_lenmax_len Q1 Q2 Q3sum_gapN50N50_numQ20(%)Q30(%)AvgQualGC(%)
# reads_1.fq.gz FASTQ DNA 2,500 567,516 226 227 229227227227 0227 3 91.24 86.62 15.4553.63
# reads_2.fq.gz FASTQ DNA 2,500 560,002 223 224 225224224224 0224 2 91.06 87.66 14.6254.77
# 表格格式输出, 并通过csvtk转换为tab分割
seqkit stats *.f{a,q}.gz -T | csvtk pretty -t
# file format type num_seqs sum_len min_len avg_len max_len
# ----------------- ------ ---- -------- ------- ------- ------- -------
# reads_1.fq.gz FASTQ DNA 2500 567516 226 227.0 229
# reads_2.fq.gz FASTQ DNA 2500 560002 223 224.0 225
```
# 4. seqkit 提取fasta指定染色体序列
```
# 提取1号染色体序列
seqkit grep -p 1 Homo_sapiens.GRCh37.dna.primary_assembly.fa \
-o Homo_sapiens.GRCh37.dna.primary_assembly.chr1.fa
# 查看>开头信息
cat Homo_sapiens.GRCh37.dna.primary_assembly.fa|grep '^>'
# 提取2号染色体10000-10050碱基序列
seqkit faidx Homo_sapiens.GRCh37.dna.primary_assembly.fa 2:10000-10050
# create or read FASTA index ...
# read FASTA index from Homo_sapiens.GRCh37.dna.primary_assembly.fa.fai
# 84 records loaded from Homo_sapiens.GRCh37.dna.primary_assembly.fa.fai
#>2:10000-10050
#NCGTATCCCACACACCACACCCACACACCACACCCACACACACCCACACC
```
## 5. seqkit合并fastq文件夹全部fastq文件
```
seqkit scat -j 4 -f fastq_dir > all.fq
```
## 6. seqkit转换fasta为氨基酸序列文件
```
# 查看密码子表详细表, notincluding ambigugous codons
seqkit translate -l 1
# 查看密码子表详细表, including ambigugous codons
seqkit translate -L 1
# The Standard Code (transl_table=1)
# Source: https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes#SG1
# Initiation Codons:
# ATG, CTG, TTG
# Stop Codons:
# TAA, TAG, TGA
# Stranslate Table:
# AAA: K, AAC: N, AAG: K, AAT: N
# ACA: T, ACC: T, ACG: T, ACT: T
# AGA: R, AGC: S, AGG: R, AGT: S
# ATA: I, ATC: I, ATG: M, ATT: I
# CAA: Q, CAC: H, CAG: Q, CAT: H
# CCA: P, CCC: P, CCG: P, CCT: P
# CGA: R, CGC: R, CGG: R, CGT: R
# CTA: L, CTC: L, CTG: L, CTT: L
# GAA: E, GAC: D, GAG: E, GAT: D
# GCA: A, GCC: A, GCG: A, GCT: A
# GGA: G, GGC: G, GGG: G, GGT: G
# GTA: V, GTC: V, GTG: V, GTT: V
# TAA: *, TAC: Y, TAG: *, TAT: Y
# TCA: S, TCC: S, TCG: S, TCT: S
# TGA: *, TGC: C, TGG: W, TGT: C
# TTA: L, TTC: F, TTG: L, TTT: F
# --trim参数表示去除*
seqkit translate mouse-p53-cds.fna --trim
# >lcl|AB021961.1_cds_BAA82344.1_1 #
# MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEA
# LRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVM
# CTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCS
# DGDGLAPPQHRIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCM
# GGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGS
# AKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSR
# AHSSYLKTKKGQSTSRHKKTMVKKVGPDSD
```
## 7. 连接碱基序列
```
echo -e ">seq\nACGT-ACTGC-ACC"| seqkit seq -g -u
# >seq
# ACGTACTGCACC
```
## 8. 使用序列长度过滤fasta文件
**-m 最小长度, -M 最大长度**
```
cat hairpin.fa| seqkit seq -m 100 -M 1000| seqkit stats
# 文件格式类型num_seqs sum_len min_len avg_len max_len
#- FASTA RNA 10,972 1560270 100 142.2 938
```
## 9. RNA和DNA序列的相互转换
```
# RNA to DNA, DNA to RNA 用--dna2rna参数
echo -e ">seq\nUCAUAUGCUUGUCUCAAAGAUUA" | seqkit seq --rna2dna
# >seq
# TCATATGCTTGTCTCAAAGATTA
```
## 10. 输出反向互补序列
```
seqkit seq hairpin.fa.gz -r -p
# >cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop
# UCGAAGAGUUCUGUCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCC
#AAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
```
# 11. grep 提取gtf指定染色体注释信息
```
# 提取1号染色体gtf注释
zcat Homo_sapiens.GRCh38.112.gtf.gz|grep -w '^1'| \
gzip -c > Homo_sapiens.GRCh38.112.gtf.chr1.gz
```
## 12. GTF文件转换为bed格式文件
```
# 需要使用gtftobed, 需conda安装bedops
conda install bedops -y
# gtf转换为bed格式
# gzip -c 压缩为.gz格式
zcat Homo_sapiens.GRCh38.112.gtf.gz|gtf2bed --do-not-sort| \
gzip -c > Homo_sapiens.GRCh38.112.bed.gz
```
## 13. 提取GTF文件外显子exon信息
```
# 提取gtf exon信息
zcat Homo_sapiens.GRCh38.112.gtf.gz|grep exon | \
cut -f1,4,5,9 | cut -d ";" -f1|\
awk '{print $1, $2, $3, $5}' | \
sed -e 's/ /\t/g' | sed -e 's/\"//g' > gtf.exon.bed
```
页:
[1]