Abstract: The polyadenylation signal plays a key role in determining the site
at which a poly(A) tail is added to nascent mRNA, and mutations in this signal
have been reported in many diseases. Thus, identifying poly(A) sites is important for
understanding the regulation and stability of mRNA. In this study, we developed
Support Vector Machine (SVM) models for predicting poly(A) signals in a DNA
sequence using 100 nucleotides each upstream and downstream of the
signal. We introduce a novel split nucleotide frequency technique, and
the models thus developed achieved maximum Matthews correlation coefficients
(MCC) of 0.58, 0.69, 0.70 and 0.69 using mononucleotide, dinucleotide,
trinucleotide, and tetranucleotide frequencies, respectively. Finally, a hybrid
model developed using a combination of dinucleotide, second-order dinucleotide,
and tetranucleotide frequencies achieved a maximum MCC of 0.72. Moreover, on
independent datasets this model achieved a precision ranging from
75.8% to 95.7% at a sensitivity of 57%, which is better than other
known methods.
Keywords: Polyadenylation signals, mRNA, Support Vector Machine (SVM), Matthews correlation coefficient (MCC), ROC plot, nucleotide frequency
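
The feature construction and evaluation metric summarized in the abstract can be illustrated with a minimal sketch. The abstract does not define the split nucleotide frequency technique, so this sketch assumes it means computing k-mer frequencies separately for the upstream and downstream flanking regions and concatenating the two vectors. The function names (kmer_frequencies, split_features, mcc) are hypothetical and are not taken from the study.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Return normalized frequencies of all 4**k k-mers, using overlapping windows."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def split_features(upstream, downstream, k):
    """ASSUMPTION: 'split' means one frequency vector per flanking region,
    concatenated into a single feature vector of length 2 * 4**k."""
    return kmer_frequencies(upstream, k) + kmer_frequencies(downstream, k)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Under this assumption, two 100-nucleotide flanks yield 8, 32, 128, and 512 features for k = 1 to 4, and the resulting vectors could be passed to any SVM implementation. Keeping the two regions separate lets a classifier weight upstream and downstream composition differently, which a single pooled frequency vector would not allow.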