Affiliations: Institute of Computational Modeling, Russian Academy
of Sciences, Russia | Centre for Mathematical Modelling, University of
Leicester, UK | Institut des Hautes Etudes Scientifiques,
Bures-sur-Yvette, France | Service Bioinformatique, Institut Curie, Paris,
France
Abstract: Coding information is the main source of heterogeneity
(non-randomness) in the sequences of microbial genomes. The heterogeneity
corresponds to a cluster structure in triplet distributions of relatively short
genomic fragments (200–400 bp). We found a universal 7-cluster structure in
microbial genomic sequences and explained its properties. We show that the codon
usage of bacterial genomes is, to high accuracy, a multi-linear function of
their genomic G+C content. Based on the analysis of 143
completely sequenced bacterial genomes available in GenBank in August 2004, we
show that four "pure" types of the 7-cluster structure are observed.
Animated 3D scatter plots of the clusters for all 143 genomes are collected in
a database available on our website
(http://www.ihes.fr/~zinovyev/7clusters). The findings
can be readily incorporated into software for gene prediction, sequence alignment,
or microbial genome classification.
Keywords: word frequency, codon usage, clustering, visualization, symmetry
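
As a rough illustration of the "triplet distributions of relatively short genomic fragments" mentioned in the abstract, the following minimal Python sketch turns a nucleotide sequence into one 64-dimensional triplet-frequency vector per fragment. It is not the authors' implementation. The function names, the counting of non-overlapping triplets from the first position of each fragment, and the `step` default are assumptions made for illustration; only the fragment width of 300 bp is taken from the 200–400 bp range stated above.

```python
from collections import Counter
from itertools import product

# All 64 possible nucleotide triplets in a fixed order (AAA, AAC, ..., TTT).
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def triplet_frequencies(fragment):
    """Relative frequencies of non-overlapping triplets read from position 0.

    Assumption: triplets are counted in a single phase, starting at the
    first base of the fragment. Triplets containing symbols other than
    A, C, G, T are ignored.
    """
    fragment = fragment.upper()
    counts = Counter(fragment[i:i + 3] for i in range(0, len(fragment) - 2, 3))
    total = sum(counts[t] for t in TRIPLETS)
    if total == 0:
        return [0.0] * len(TRIPLETS)
    return [counts[t] / total for t in TRIPLETS]


def fragment_vectors(sequence, width=300, step=50):
    """One 64-dimensional frequency vector per sliding-window fragment.

    `width` lies in the 200-400 bp range quoted in the abstract;
    `step` is a hypothetical default chosen only for this sketch.
    """
    return [
        triplet_frequencies(sequence[start:start + width])
        for start in range(0, len(sequence) - width + 1, step)
    ]
```

The resulting vectors could then be passed to any standard clustering or projection routine to look for the cluster structure. With a step that is not a multiple of three, successive fragments start in different phases relative to any underlying reading frame, which is one plausible way for phase-dependent groups to appear in such data.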