SEQenv: A pipeline capable of annotating genetic sequences based on environment descriptive terms occurring within their records and/or in relevant literature.
Given a set of sequence files (in FASTA format), SEQenv retrieves highly similar sequences from public repositories (such as SILVA and GenBank). From each of those records, text fields carrying environmental context information (such as the reference title and the isolation source) are then extracted. Existing links to PubMed abstracts are also followed and the relevant abstracts collected.
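To illustrate the extraction step, here is a minimal Python sketch that pulls the reference title, the isolation source and any PubMed identifiers out of a GenBank flat-file record. It is not the SEQenv implementation: the function name `extract_context` and the sample record (including its accession and PubMed ID) are made up for illustration, and the regular expressions assume the standard GenBank flat-file layout.

```python
import re

# Made-up record in GenBank flat-file layout (hypothetical accession and PubMed ID).
SAMPLE_RECORD = "\n".join([
    "LOCUS       XX000001    1400 bp    DNA     linear   ENV 01-JAN-2000",
    "REFERENCE   1  (bases 1 to 1400)",
    "  AUTHORS   Doe,J.",
    "  TITLE     Bacterial diversity in a coastal",
    "            lagoon sediment",
    "  JOURNAL   Unpublished",
    "   PUBMED   12345678",
    "FEATURES             Location/Qualifiers",
    "     source          1..1400",
    '                     /organism="uncultured bacterium"',
    '                     /isolation_source="lagoon sediment"',
])


def extract_context(record: str) -> dict:
    """Collect free-text fields that may carry environmental context."""
    # A TITLE may wrap onto continuation lines indented by 12 spaces.
    titles = [
        " ".join(match.group(1).split())
        for match in re.finditer(r"^ *TITLE +(.*(?:\n {12}\S.*)*)", record, re.M)
    ]
    # Qualifier values are quoted; collapse any internal line wrapping.
    sources = [
        " ".join(value.split())
        for value in re.findall(r'/isolation_source="([^"]*)"', record)
    ]
    # PubMed identifiers let a later step fetch the linked abstracts.
    pubmed_ids = re.findall(r"^ +PUBMED +(\d+)", record, re.M)
    return {"titles": titles, "isolation_sources": sources, "pubmed_ids": pubmed_ids}


if __name__ == "__main__":
    print(extract_context(SAMPLE_RECORD))
```

For real data, a dedicated parser for the GenBank format would be a safer choice than hand-written regular expressions.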
Once the relevant pieces of text for each matching sequence have been gathered, they are processed by a text mining module that identifies any Environment Ontology (EnvO) environment descriptive terms mentioned in them.
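The idea of spotting environment descriptive terms in text can be sketched with a naive dictionary lookup. This is only an illustration, not the algorithm used by the text mining module: the term list is a tiny hand-picked stand-in for the EnvO vocabulary, and `count_terms` is a hypothetical helper name.

```python
import re
from collections import Counter

# Tiny stand-in vocabulary; the real EnvO dictionary is far larger.
TERMS = ["coral reef", "cultivated land", "glacier", "pelagic", "forest", "lagoon"]


def count_terms(text: str, terms=TERMS) -> Counter:
    """Count case-insensitive whole-word mentions of each term in text."""
    counts = Counter()
    lowered = text.lower()
    for term in terms:
        # Word boundaries prevent matches inside longer words.
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[term] = hits
    return counts


if __name__ == "__main__":
    print(count_terms("Lagoon sediment near a coral reef; the lagoon is shallow."))
```

Exact matching of this kind misses plurals, synonyms and spelling variants, which a full tagger would need to handle.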
The identified EnvO terms, along with their mention frequencies, are then subjected to clustering analysis and multivariate statistics. As a result, tag clouds and heatmaps of environment descriptive terms characterizing different sets of sequences (e.g. originating from different samples) are generated.
Characterize sequences from novel environments based on the environmental context of highly similar known sequences
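As a rough sketch of how term frequencies could feed a clustering step, the Python below turns per-sample term counts into relative frequencies and compares two samples with Bray-Curtis dissimilarity. The text does not say which distance or clustering method the pipeline uses, so the choice of Bray-Curtis, the function names and the counts are all assumptions made for illustration.

```python
def frequency_table(sample_counts: dict) -> tuple:
    """Return (sorted terms, {sample: relative frequency per term})."""
    terms = sorted({term for counts in sample_counts.values() for term in counts})
    table = {}
    for sample, counts in sample_counts.items():
        total = sum(counts.values())
        table[sample] = [counts.get(term, 0) / total if total else 0.0 for term in terms]
    return terms, table


def bray_curtis(u: list, v: list) -> float:
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for disjoint ones."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Made-up term counts for two hypothetical samples.
    counts = {
        "sample_A": {"lagoon": 3, "forest": 1},
        "sample_B": {"lagoon": 1, "glacier": 3},
    }
    terms, table = frequency_table(counts)
    print(terms)
    print(bray_curtis(table["sample_A"], table["sample_B"]))
```

A matrix of such pairwise dissimilarities is a common input for hierarchical clustering, and the frequency table itself corresponds to the values a heatmap would display.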
Identify potential sample contamination sequences
Add standardized environmental context to already deposited, plain-text annotated sequences
Microbial 16S rRNA sequence annotation of pit-latrine samples from Vietnam and Tanzania
Lagoon sediment sample annotation
Availability: an input form facilitating the processing of user-submitted sequences will be made available at this site. The pipeline components are all open source software and will be made available either here or via links to their dedicated web pages.
ENVIRONMENTS: a standalone command line application capable of identifying environment descriptive terms, such as "coral reef, cultivated land, glacier, pelagic, forest, lagoon", in text.
Funding: European COST Action ES 1103 Microbial ecology & the earth system: collaborating for insight and success with the new generation of sequencing tools