
Formatted UTOPIA db

The formatted UTOPIA db is available here.

UTOPIA database construction

Prerequisites

Change the library paths in the scripts extract_seq.pl and extract_cluster_taxonomy.pl.

NCBI Taxonomy database construction

Download and extract

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz

Load SQLite database

loadTaxonomy.pl -names names.dmp -nodes nodes.dmp -struct taxonomyStructure.sql

Database generation

eDirect download

esearch -db "nucleotide" -query "\"internal transcribed spacer\"[Title] AND \"fungi\"[Porgn] AND \"complete sequence\" [Title]" | efetch -format gb > GB_ITS.gb

WARN: Errors can occur during download and corrupt the GenBank file. Search the file for the term 'error' and manually remove any corrupted sequences.
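The search described above can be sketched as a small shell function. This is only a minimal sketch: the function name `list_suspect_records` is hypothetical, and it assumes each GenBank record starts with a `LOCUS` line whose second field identifies the record. A match may also be a legitimate annotation that happens to contain the word "error", so every hit still needs a manual look.

```shell
# Hypothetical helper: print "<line number><TAB><LOCUS name>" for every line
# of a GenBank file that contains the word "error" (case-insensitive).
list_suspect_records() {
    awk '
        /^LOCUS/ { locus = $2 }                      # remember the current record
        tolower($0) ~ /error/ { print NR "\t" locus }
    ' "$1"
}
```

For example, `list_suspect_records GB_ITS.gb` lists the lines to inspect. The LOCUS column is empty if a match appears before the first record.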

Extract sequences and taxonomy from the GenBank file

This script uses the SQLite database created previously. You MUST edit the database path on line 18 to match your own.

perl extract_seq.pl GB_ITS.gb

This creates two files:

  • A FASTA file: sequences.fna.
  • A tab-delimited, two-column file: taxonomy.txt.
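A quick consistency check on the two files can be sketched as follows. It assumes, without confirmation from the script itself, that taxonomy.txt holds exactly one line per sequence in the FASTA file. The function name `check_seq_tax_counts` is hypothetical.

```shell
# Hypothetical helper: compare the number of FASTA records with the number
# of lines in the taxonomy table.
# Assumption: one taxonomy line per sequence.
check_seq_tax_counts() {
    fasta_file=$1
    taxonomy_file=$2
    n_seq=$(grep -c '^>' "$fasta_file" || true)        # FASTA headers start with ">"
    n_tax=$(wc -l < "$taxonomy_file" | tr -d ' ')       # strip padding added by some wc builds
    printf 'sequences=%s taxonomy_lines=%s\n' "$n_seq" "$n_tax"
    [ "$n_seq" -eq "$n_tax" ]                            # non-zero exit status on mismatch
}
```

Call it with the FASTA file and taxonomy.txt produced by extract_seq.pl. A non-zero exit status signals that the counts differ.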

Prinseq sequence check

Generate statistics

  • TODO: maybe embed the Prinseq scripts here if they are light enough.
prinseq-lite.pl -fasta sequences.fna -graph_data sequences.gd -graph_stats ld,gc,ns,pt,ts,de,da,sc,dn

Generate graphs in HTML format

prinseq-graphs.pl -i sequences.gd -html_all -o sequences_prinseq

Filtering sequences

Based on the statistics reported by Prinseq, we chose to exclude sequences shorter than 100 bp or longer than 5000 bp. We also exclude sequences containing N's or IUPAC ambiguity codes.

prinseq-lite.pl -fasta sequences.fna -out_format 1 -out_good sequences_good -noniupac -ns_max_n 1 -min_len 100 -max_len 5000
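To confirm that the filtered file respects the 100 bp and 5000 bp bounds, the sequence lengths can be summarised with a short awk sketch. The function name `fasta_length_range` is hypothetical. The output file name used in the example below is an assumption based on Prinseq appending a `.fasta` extension to the `-out_good` prefix.

```shell
# Hypothetical helper: print "<record count><TAB><min length><TAB><max length>"
# for a FASTA file. Sequences may span several lines.
fasta_length_range() {
    awk '
        function close_record() {
            if (seen) {
                if (n == 0 || len < min) min = len
                if (len > max) max = len
                n++
            }
        }
        /^>/ { close_record(); seen = 1; len = 0; next }   # header starts a new record
        { len += length($0) }                              # accumulate sequence length
        END { close_record(); printf "%d\t%d\t%d\n", n, min, max }
    ' "$1"
}
```

For example, `fasta_length_range sequences_good.fasta` should report a minimum of at least 100 and a maximum of at most 5000 if the filter behaved as intended.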

[OPTIONAL] Extract ITS1 and ITS2 segments with ITSx.