The development of the computational scheme for promoter prediction
was based on a training set of mRNA promoters with experimentally identified
start sites (PromEC).
For each of these mRNA start sites, a promoter was determined by consensus
considerations. σ70 promoters are defined by two consensus hexamers
located 10 and 35 base pairs upstream of the transcription start site,
TATAAT and TTGACA, respectively. We searched the regions upstream of the
mRNA start sites for sequences with no more than three mismatches to the
consensus in either of the two hexamers. This process often yielded
several candidates. After the most probable promoter sequence was assigned
to each mRNA start site (based on a hierarchy of heuristic criteria), these
sequences were aligned and served as a basis for a weight matrix that provided
scores for the four bases in each position of the promoter. The scores
were determined by log2(Pij/Pi), where Pij is the frequency of base i at
position j and Pi, the background frequency, is 0.25 (i = A, C, G, T). The length of the
spacer between the two hexamers was scored similarly. Five spacer lengths were considered
(15, 16, 17, 18 and 19 bp), and their scores were determined as log2(Ps/0.2), where Ps is the frequency
of a spacer of length s in the data. The overall score of a promoter sequence
was determined as the sum of its position scores and spacer score.
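
The scoring scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each aligned promoter is represented by its two hexamers concatenated into a 12-base string, and it adds a pseudocount so that bases or spacer lengths absent from the training data do not produce log2(0). Neither choice is stated in the text, and all function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"
SPACER_LENGTHS = (15, 16, 17, 18, 19)


def build_weight_matrix(aligned_sites, pseudocount=1.0):
    """Return one {base: log2(Pij / 0.25)} dict per alignment column.

    The pseudocount is an assumption of this sketch, not part of the text.
    """
    total = len(aligned_sites) + pseudocount * len(BASES)
    matrix = []
    for j in range(len(aligned_sites[0])):
        counts = Counter(site[j] for site in aligned_sites)
        matrix.append({
            b: math.log2(((counts[b] + pseudocount) / total) / 0.25)
            for b in BASES
        })
    return matrix


def build_spacer_scores(spacers, pseudocount=1.0):
    """Return {s: log2(Ps / 0.2)} for the five spacer lengths considered."""
    counts = Counter(spacers)
    total = len(spacers) + pseudocount * len(SPACER_LENGTHS)
    return {
        s: math.log2(((counts[s] + pseudocount) / total) / 0.2)
        for s in SPACER_LENGTHS
    }


def score_promoter(minus35, minus10, spacer_len, matrix, spacer_scores):
    """Overall score: sum of the position scores plus the spacer score."""
    site = minus35 + minus10
    position_score = sum(matrix[j][base] for j, base in enumerate(site))
    return position_score + spacer_scores[spacer_len]


# Toy training data (invented for illustration only).
sites = ["TTGACATATAAT"] * 3 + ["TTGACTTATAAT"]
spacers = [17, 17, 17, 16]
matrix = build_weight_matrix(sites)
spacer_scores = build_spacer_scores(spacers)
print(score_promoter("TTGACA", "TATAAT", 17, matrix, spacer_scores))
```

With uniform base frequencies every position score is zero, so a positive overall score indicates a sequence that resembles the training promoters more closely than random sequence does.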
The training set of promoters based on the experimentally determined mRNA
start sites was used to set a threshold for promoter scores. To predict
the promoters of the putative sRNA sequences, the empty regions were
searched for the consensus hexamers as above, except that no more than four mismatches
in total were allowed. The putative promoters were scored with the
weight matrix, and only those scoring above the threshold were recorded.
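
The candidate search and threshold filter can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code. The function names and the tuple layout are hypothetical. The text does not say whether the per-hexamer limit still applies during prediction, so both limits are exposed as optional parameters. The scoring function is passed in by the caller.

```python
MINUS35 = "TTGACA"
MINUS10 = "TATAAT"
SPACER_LENGTHS = (15, 16, 17, 18, 19)


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_candidates(region, max_per_hexamer=None, max_total=None):
    """List (start, spacer, minus35, minus10) placements within the limits.

    max_per_hexamer=3 mirrors the training-set search; max_total=4 mirrors
    the prediction search. How the two combine is an assumption left open.
    """
    region = region.upper()
    hits = []
    # Shortest possible promoter: 6 + 15 + 6 = 27 bases.
    for start in range(len(region) - 12 - min(SPACER_LENGTHS) + 1):
        m35 = region[start:start + 6]
        for spacer in SPACER_LENGTHS:
            m10_start = start + 6 + spacer
            m10 = region[m10_start:m10_start + 6]
            if len(m10) < 6:
                break  # longer spacers would run past the end of the region
            d35 = mismatches(m35, MINUS35)
            d10 = mismatches(m10, MINUS10)
            if max_per_hexamer is not None and max(d35, d10) > max_per_hexamer:
                continue
            if max_total is not None and d35 + d10 > max_total:
                continue
            hits.append((start, spacer, m35, m10))
    return hits


def above_threshold(hits, score_fn, threshold):
    """Keep only candidates whose score is strictly above the threshold."""
    return [hit for hit in hits if score_fn(hit) > threshold]


# Toy region (invented): a perfect promoter with a 17-bp spacer.
region = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
candidates = find_candidates(region, max_total=4)
```

Because several placements usually pass the mismatch filter in one region, the score threshold is what narrows the candidates down to the recorded promoters.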