TranslatorX server help

What does TranslatorX do?

TranslatorX is a tool to align (protein-coding) nucleotide sequences based on the corresponding amino acid alignments. Given a set of nucleotide sequences, TranslatorX translates them to amino acids using the appropriate genetic code, aligns those amino acid sequences and, from the resulting alignment, determines the optimal alignment of nucleotide sequences.

This approach to aligning nucleotides makes sense because coding DNA evolves as codon triplets rather than as single nucleotides. It is particularly useful when nucleotide sequences are so divergent that alignment programs fail to align them accurately. Because sequence similarity degrades more slowly at the amino acid level than at the DNA level, it is easier to align amino acid translations than the corresponding nucleotides. This difference is related to the size of the alphabets in which DNA and proteins are coded (4 and 20 letters, respectively), to the redundant nature of the genetic code (i.e. the existence of synonymous codons), and to the physico-chemical similarities that exist between different amino acids.

The practical consequence is that positional homologies are easier to identify between proteins than between DNA sequences. Insertions and deletions are a particularly illustrative case. When divergent (protein-coding) nt-sequences are aligned directly, alignment programs often introduce gaps whose length is not a multiple of three. Such insertions/deletions are highly unlikely because they would break the reading frame. When nt-sequences are aligned based on the aa-alignment, every insertion or deletion is forced to consist of whole codons, i.e. its length is a multiple of three.
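
The translate/align/back-translate idea described above can be sketched in a few lines of Python. This is only an illustration, not the TranslatorX implementation: the function names (split_codons, translate, thread_codons) are hypothetical, only the standard genetic code is handled, and the aligned amino acid sequence is assumed to come from an external alignment program.

```python
from itertools import product

# Standard genetic code (NCBI table 1), with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(nt_seq):
    """Cut an in-frame nucleotide sequence into codons, dropping a trailing partial codon."""
    nt = nt_seq.upper().replace("U", "T")
    usable = len(nt) - len(nt) % 3
    return [nt[i:i + 3] for i in range(0, usable, 3)]


def translate(nt_seq):
    """Translate with the standard code; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(codon, "X") for codon in split_codons(nt_seq))


def thread_codons(nt_seq, aligned_aa):
    """Place each codon under its amino acid; every aa gap becomes a three-base gap."""
    codons = split_codons(nt_seq)
    residues = sum(ch != "-" for ch in aligned_aa)
    if residues != len(codons):
        raise ValueError("aligned protein and nucleotide sequence do not match in length")
    remaining = iter(codons)
    return "".join("---" if ch == "-" else next(remaining) for ch in aligned_aa)


if __name__ == "__main__":
    nt = "ATGGCCAAA"
    print(translate(nt))               # MAK
    print(thread_codons(nt, "MA-K"))   # ATGGCC---AAA
```

Because gaps are copied from the protein alignment as whole codons, the resulting nt-alignment can only contain gaps whose length is a multiple of three.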

You can see an illustrative example of the usefulness of TranslatorX here


How do I use TranslatorX?

The simplest usage is to paste the nt-sequences (or upload a file containing them), and press the "Submit" button.

Importantly, sequences must be in the correct reading frame, i.e., must begin with the first nucleotide of a codon. If you are not sure if this holds, you can mark the checkbox "Guess most likely reading frame?" (see below). Alternatively, you can use this tool (EMBOSS Transeq) to determine the reading frame and subsequently properly format your sequences.
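
As a rough illustration of how a reading frame can be guessed, the sketch below counts stop codons in each frame and picks the frame with the fewest. It rests on several assumptions that may differ from what the server does: only the three forward frames are examined, stop codons are those of the standard genetic code, and ties are resolved in favour of the lowest offset. The function names are hypothetical.

```python
# Stop codons of the standard genetic code (assumption: other codes are not handled here).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_stops(nt_seq, offset):
    """Count stop codons when reading starts at the given 0-based offset."""
    nt = nt_seq.upper().replace("U", "T")
    return sum(nt[i:i + 3] in STOP_CODONS for i in range(offset, len(nt) - 2, 3))


def guess_frame(nt_seq):
    """Return the offset (0, 1 or 2) of the forward frame with the fewest stop codons."""
    return min(range(3), key=lambda offset: count_stops(nt_seq, offset))


if __name__ == "__main__":
    seq = "GCTAACTGAC"
    print([count_stops(seq, k) for k in range(3)])  # [1, 0, 1]
    print(guess_frame(seq))                         # 1
```

An offset of 1 means that the first base should be trimmed so that the sequence starts with the first nucleotide of a codon.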

Apart from this simple usage, there are several additional options.

  • Choice of multiple alignment software: The translated nt-sequences can be aligned using several different programs, including Muscle, Mafft, T-Coffee, Prank and Clustalw.

  • Precomputed aa alignment: Instead of relying on an automatically computed alignment, users can provide their own (pre-calculated) protein alignment.

  • Genetic code: By default, all sequences are presumed to be translated according to the standard/universal genetic code. If this is not the case, an alternative code can be specified, either a single alternative for all of the sequences or specific alternatives for each of the sequences.

    The latter can be accomplished either through interactive menus that help the user define the code of each species, or using a predefined text file that can be pasted or uploaded. The format used to define the genetic code for each species is: "the name of the taxon (or sequence)" + "comma or tab" + "the index of the corresponding genetic code". As an example, for a diverse mitochondrial dataset the definition could be:

      Homo sapiens,2
      Limulus polyphemus,100
      Bolinus brandaris,4
      ...
    
    Importantly, the name must correspond to the name provided in the nt-sequences file.

    The list of the available genetic codes corresponds to those on the NCBI database with two additional codes (100 and 101):

    • 1 - Standard
    • 2 - Vertebrate Mitochondrial
    • 3 - Yeast Mitochondrial
    • 4 - Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma
    • 5 - Invertebrate Mitochondrial
    • 6 - Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear
    • 9 - Echinoderm Mitochondrial; Flatworm Mitochondrial
    • 10 - Euplotid Nuclear
    • 11 - Bacterial and Plant Plastid
    • 12 - Alternative Yeast Nuclear
    • 13 - Ascidian Mitochondrial
    • 14 - Alternative Flatworm Mitochondrial
    • 15 - Blepharisma Macronuclear
    • 16 - Chlorophycean Mitochondrial
    • 21 - Trematode Mitochondrial
    • 22 - Scenedesmus obliquus Mitochondrial
    • 23 - Thraustochytrium Mitochondrial
    • 100 - Ancestral Arthropod Mitochondrial (AGG=K)
    • 101 - Hemichordate Mitochondrial

  • Automatic reading frame identification: If you are not sure that your nucleotide sequences are in frame +1 (i.e. that the first codon starts with the first base), you can ask TranslatorX to automatically determine the most likely reading frame: the one with the fewest stop codons.

  • Alignment cleaning: Alignment cleaning is a common practice in phylogenetic studies. Highly variable regions are likely to be saturated (homoplasy) and are difficult to align, i.e. the homology relationships between sites are difficult to establish. Such regions are usually removed with the help of tools such as GBlocks.

    TranslatorX provides an innovative approach. Rather than cleaning the nt-alignment based on its own (nt-) information, TranslatorX first cleans the aa-alignment and then reconstructs the nt-alignment. Such an approach is less restrictive and hence might not be appropriate under all circumstances. However, it offers a new kind of information.
    The resulting nt-alignment may contain highly variable regions, but we are confident about the homology relationships between sites in such alignment. In addition, such aa-based cleaning of an nt-alignment can be useful for repetitive regions where the nt-alignment might appear conserved, whereas the aa-alignment might indicate the opposite.
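
The aa-based cleaning described above amounts to keeping, in the nt-alignment, only the codons that sit under the amino acid columns retained by the cleaning step. A minimal sketch, assuming the retained aa columns are already available as a list of 0-based indices (how they are obtained from GBlocks is not covered here, and the function name is hypothetical):

```python
def keep_codon_columns(codon_alignment, kept_aa_columns):
    """Keep only the codons under the retained aa columns.

    codon_alignment: dict mapping sequence name -> aligned nt-sequence
                     (length is three times the aa-alignment length).
    kept_aa_columns: 0-based indices of the aa columns kept by the cleaning step.
    """
    return {
        name: "".join(seq[3 * col:3 * col + 3] for col in kept_aa_columns)
        for name, seq in codon_alignment.items()
    }


if __name__ == "__main__":
    alignment = {
        "seq1": "ATGGCC---AAA",
        "seq2": "ATGGCTCCCAAA",
    }
    # Suppose the third aa column (index 2) was removed by the cleaning step.
    print(keep_codon_columns(alignment, [0, 1, 3]))
    # {'seq1': 'ATGGCCAAA', 'seq2': 'ATGGCTAAA'}
```

Since whole codons are kept or dropped together, the cleaned nt-alignment stays in frame.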


What sequence formats are accepted by TranslatorX?

Most sequence and alignment formats are supported thanks to the Readseq program. However, we have noticed that the NEXUS format is sometimes not properly interpreted, so alternative formats (e.g. FASTA) are recommended. In addition, the user is encouraged to check that the automatic format conversion worked appropriately. For this purpose, a link to the converted file is provided at the top of the results page.


TranslatorX output

The output is structured in different sections.

The basic output includes the nt-alignment and the corresponding aa-alignment. Alignments are visualized with the Jalview program. Apart from constituting a friendly alignment interface, Jalview can be used to:

  • export the alignment in different formats (Menu File > Output to Textbox),
  • calculate a neighbour-joining tree,
  • refine the alignment by hand-editing.
In addition, a link is provided to a codon based alignment coloured according to the amino acid coded.

A compositional analysis of the sequences is also shown. The GC content of each species is shown in a table, both for all three codon positions together and for the first, second, and third codon positions separately. It is known that species with similar compositional biases might group together in a phylogenetic tree independently of their evolutionary distance, so the composition table may help to identify problematic sequences.
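
For readers who want to reproduce such a table offline, the following sketch computes GC content overall and per codon position for one aligned, in-frame sequence. It assumes that gaps and ambiguous characters are simply ignored, which may not match the server's exact calculation; the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def gc_by_codon_position(aligned_nt):
    """GC content for all positions and for codon positions 1, 2 and 3 separately."""
    return {
        "all": gc_content(aligned_nt),
        "pos1": gc_content(aligned_nt[0::3]),
        "pos2": gc_content(aligned_nt[1::3]),
        "pos3": gc_content(aligned_nt[2::3]),
    }


if __name__ == "__main__":
    for key, value in gc_by_codon_position("ATGGCC---AAA").items():
        print(f"{key}: {value:.2f}")
```

The slicing by steps of three relies on the alignment being codon-based, so that every column index modulo three identifies the codon position.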

Three nt-alignments including the individual 1st, 2nd and 3rd codon positions are provided. These alignments might help users to inspect the conservation at each specific class of site.

If the user has opted for a GBlocks-based cleaning of the alignment, two additional alignments will be shown: one for the aa-cleaned-alignment, and one for the nt-cleaned-alignment (derived from the aa-cleaned-alignment). The results of GBlocks cleaning can also be inspected to see which regions have been discarded. When the cleaning of the alignment is performed, the compositional biases section also shows the GC content of the cleaned sequences. By inspecting it, one can determine whether the cleaning has reduced a particular bias (biases are expected to accumulate more markedly in the most variable regions, which are the regions that GBlocks attempts to delete).


How can I cite TranslatorX?

Abascal F, Zardoya R, Telford MJ (2010)
TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations
Nucleic Acids Res. doi:10.1093/nar/gkq291


References

  • Castresana, J. (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol, 17, 540-552.
  • Clamp, M., Cuff, J., Searle, S.M. and Barton, G.J. (2004) The Jalview Java alignment editor. Bioinformatics, 20, 426-427.
  • Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 32, 1792-1797.
  • Gilbert, D. (2001) ReadSeq: Read & reformat biosequences. http://iubio.bio.indiana.edu/.
  • Katoh, K., Kuma, K., Toh, H. and Miyata, T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res, 33, 511-518.
  • Loytynoja, A. and Goldman, N. (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A, 102, 10557-10562.
  • Notredame, C., Higgins, D.G. and Heringa, J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol, 302, 205-217.
  • Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22, 4673-4680.