Etienne Rifa authored
Formatted UTOPIA db
The formatted UTOPIA db is available here.
UTOPIA database construction
Prerequisites
- Perl, with the Perl libraries Getopt, Log4Perl, DBI, SQL.
- SQLite
- NCBI tool kit: https://www.ncbi.nlm.nih.gov/books/NBK179288/
- NCBI Taxonomy Database: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
Install personal Perl libraries
Change the library path in the scripts extract_seq.pl and extract_cluster_taxonomy.pl.
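As a possible complement to editing the scripts, Perl also searches the directories listed in the `PERL5LIB` environment variable. The sketch below assumes the personal libraries are installed under `$HOME/perl5/lib/perl5`, which is a hypothetical location and not something this repository prescribes. It has not been checked against these scripts, so the path edit described above may still be needed.

```shell
# Hypothetical install location of the personal Perl libraries; adjust to your setup.
PERL_LIB_DIR="$HOME/perl5/lib/perl5"

# Prepend it to PERL5LIB, keeping any value that is already set.
export PERL5LIB="${PERL_LIB_DIR}${PERL5LIB:+:${PERL5LIB}}"

# To inspect the directories Perl will search, run:
#   perl -e 'print join("\n", @INC), "\n"'
```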
NCBI Taxonomy database construction
Download and extract
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz
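Before loading the database it can help to confirm that the extraction produced the two dump files used in the next step. A minimal sketch; the function name `check_taxdump` is introduced here for illustration and is not part of this repository.

```shell
# Check that names.dmp and nodes.dmp exist and are non-empty.
# Usage: check_taxdump [directory]   (defaults to the current directory)
check_taxdump() {
    dir="${1:-.}"
    status=0
    for f in names.dmp nodes.dmp; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_taxdump` in the directory where the archive was extracted returns a non-zero status if either file is absent.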
Load SQLite database
loadTaxonomy.pl -names names.dmp -nodes nodes.dmp -struct taxonomyStructure.sql
Database generation
eDirect download
esearch -db "nucleotide" -query "\"internal transcribed spacer\"[Title] AND \"fungi\"[Porgn] AND \"complete sequence\" [Title]" | efetch -format gb > GB_ITS.gb
WARN: Errors can occur during the download and corrupt the GenBank file. Search the file for the term 'error' and manually remove any corrupted sequences.
Extract sequences and taxonomy from the GenBank file
This script uses the SQLite database created previously. You MUST edit the database path on line 18 of the script to match your own.
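The search described above can be sketched as a small shell function. It lists lines containing "error" and, as an extra check that is an addition here and not part of the original procedure, compares the number of `LOCUS` lines with the number of `//` record terminators, since a truncated record typically lacks its terminator. The function name `check_genbank` is hypothetical. A hit on "error" may also be legitimate record text, so each hit still needs manual inspection.

```shell
# Report likely corruption in a GenBank flat file.
# Usage: check_genbank GB_ITS.gb
check_genbank() {
    gb="$1"
    [ -r "$gb" ] || { echo "cannot read: $gb" >&2; return 2; }
    # Lines mentioning "error" (case-insensitive), with their line numbers.
    grep -n -i 'error' "$gb" || true
    # Every GenBank record starts with a LOCUS line and ends with a "//" line.
    n_locus=$(grep -c '^LOCUS' "$gb" || true)
    n_end=$(grep -c '^//' "$gb" || true)
    echo "LOCUS lines: $n_locus, record terminators: $n_end"
    # Succeed only when every record that was opened is also closed.
    [ "$n_locus" -eq "$n_end" ]
}
```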
perl extract_seq.pl GB_ITS.gb
This creates two files:
- A FASTA file: sequences.fna.
- A tab-separated two-column file: taxonomy.txt.
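A quick consistency check on the two output files is to compare their record counts. This sketch assumes that taxonomy.txt holds exactly one line per sequence and has no header line, which is not confirmed by the script's documentation. The function name `check_extract` is hypothetical.

```shell
# Compare the number of FASTA records with the number of taxonomy lines.
# Usage: check_extract <fasta file> taxonomy.txt
check_extract() {
    fasta="$1"
    tax="$2"
    # Count FASTA header lines (one per sequence).
    n_seq=$(grep -c '^>' "$fasta" || true)
    # Count all lines, including a last line without a trailing newline.
    n_tax=$(grep -c '' "$tax" || true)
    echo "sequences: $n_seq, taxonomy lines: $n_tax"
    # Assumption: one taxonomy line per sequence, no header line.
    [ "$n_seq" -eq "$n_tax" ]
}
```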
Prinseq sequence check
Generate statistics
(Maybe embed the prinseq script if it is light enough.)
prinseq-lite.pl -fasta sequences.fna -graph_data sequences.gd -graph_stats ld,gc,ns,pt,ts,de,da,sc,dn
Generate graphs in HTML format
prinseq-graphs.pl -i sequences.gd -html_all -o sequences_prinseq
Filtering sequences
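Based on the statistics given by Prinseq, we chose to exclude sequences shorter than 100 bp or longer than 5000 bp. We also do not allow sequences containing N's or IUPAC ambiguity codes. Note that -noniupac removes sequences with characters other than A, C, G, T or N, and that -ns_max_n 1 still tolerates a single N; use -ns_max_n 0 to reject every N.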
prinseq-lite.pl -fasta sequences.fna -out_format 1 -out_good sequences_good -noniupac -ns_max_n 1 -min_len 100 -max_len 5000
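To confirm that the length filter did what was intended, the shortest and longest remaining sequences can be computed with awk. This is a sketch independent of Prinseq. The function name `fasta_length_range` is hypothetical, and the output file is assumed to be `sequences_good.fasta`.

```shell
# Print the number of sequences and the shortest and longest sequence length
# of a FASTA file. Sequences wrapped over several lines are handled.
# Usage: fasta_length_range sequences_good.fasta
fasta_length_range() {
    awk '
        # Fold the sequence read so far into the running statistics.
        function record() {
            if (seen) {
                n++
                if (n == 1 || len < min) min = len
                if (len > max) max = len
            }
        }
        /^>/ { record(); seen = 1; len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { record(); printf "n=%d min=%d max=%d\n", n, min, max }
    ' "$1"
}
```

With the thresholds used above, `min` should be at least 100 and `max` at most 5000.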