Nucleotide database
In order to taxonomically assign sequences, a certain amount of reference data is required. Below are the instructions to retrieve standard databases and how to create custom ones.
BLAST database
The sequence comparison step as implemented uses the BLAST command line tools. This requires that the nucleotide database is indexed and formatted in a BLAST-specific way.
Preformatted databases
A large pre-formatted nucleotide sequence database is freely available from the NCBI servers. The BLAST NT database contains a collection of sequences from different sources and can be downloaded directly from the NCBI servers.
A script to fetch the BLAST NT database and additional required taxonomic definitions is available with FooDMe:
Running the above commands will create the blast_nt directory and retrieve all the nescessary files from the NCBI servers.
Warning
Downloading the BLAST NT database will require a large chunk of available memory and take several hours.
Custom databases
A collection of non-redundant reference sequences (RefSeq) is also available from the NCBI servers. The collection is not yet in a BLAST format but we provide a script to retrieve and format it.
Running the commands below will create a folder called refseq, download the RefSeq collection and format it in a blast format. This requires to create a conda environment containing the blast tools.
mamba create -n blast -c conda-forge -c bioconda blast
conda activate blast
cd ~
mkdir refseq
bash ~/FooDMe/ressources/create_RefSeq_blastdb.sh -d refseq -t
Additionally it is possible to format any sequence collection in a BLAST compliant format. A User Guide therefore is available from the BLAST documentation.
Additonal files
In addition to the nucleotide collection, several taxonomy definition files are nescessary. These are available from the NCBI servers and can be used with other sources.
Warning
The links below are provided as information and might not link to the most recent version of the files.
Note
If using the scripts above to retrieve either the NT or RefSeq database, the files below should already be included in your local database.
Taxonomic information
The taxdb
files are nescessary to link taxonomic identifier (taxid) and human-readable
informations. These can be downloaded here:
Taxonomic classification
The taxdump
files contain the taxonomy hierarchical information and are nescessary
to determine the degree of relationship between different taxa. They can be downloaded here: