Promoter prediction

The computational scheme for promoter prediction was developed on a training set of mRNA promoters with experimentally identified start sites (PromEC). For each of these mRNA start sites, a promoter was assigned by consensus considerations. σ70 promoters are defined by two consensus hexamers, TATAAT and TTGACA, located 10 and 35 base pairs upstream of the transcription start site, respectively. We searched the regions upstream of the mRNA start sites for sequences with no more than three mismatches to the consensus in either of the two hexamers. This search often yielded several candidates, and the most probable promoter sequence was assigned to each mRNA start site based on a hierarchy of heuristic criteria.

The assigned promoter sequences were aligned and used to build a weight matrix that provides a score for each of the four bases at each position of the promoter. The scores were determined as log2(Pij/Pi), where Pij is the frequency of base i at position j, and Pi is 0.25 (i = A, C, G, T). The length of the spacer sequence between the two hexamers was scored similarly. Five spacer lengths were considered (15, 16, 17, 18 and 19), and each was scored as log2(Ps/0.2), where Ps is the frequency of a spacer of length s in the data. The overall score of a promoter sequence was the sum of its position scores and its spacer score. The training set of promoters based on the experimentally determined mRNA start sites was used to set a threshold on promoter scores.

To predict the promoters of the putative sRNA sequences, the consensus sequence was searched in the empty regions as above, except that no more than four mismatches in total were allowed. The putative promoters were scored by the weight matrix, and only promoter sequences scoring above the threshold were recorded.
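The scoring scheme described above can be illustrated with a minimal Python sketch. It is not the original implementation, and several details are assumptions made only for illustration: the weight matrix is taken to cover just the two hexamers, a small pseudocount (PSEUDO) is added so that bases or spacer lengths absent from the training data do not produce log2(0), the threshold is passed in as a plain number because the text does not state how it was derived, and the training promoters are invented toy data. All function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"
MINUS35, MINUS10 = "TTGACA", "TATAAT"   # consensus hexamers from the text
SPACERS = (15, 16, 17, 18, 19)          # the five spacer lengths considered
PSEUDO = 0.01  # assumption: pseudocount to avoid log2(0); not stated in the text


def mismatches(site, consensus):
    """Number of positions at which a hexamer differs from the consensus."""
    return sum(a != b for a, b in zip(site, consensus))


def position_scores(sites):
    """log2(Pij / 0.25) for every base i at every position j of aligned sites."""
    n = len(sites)
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({b: math.log2(((counts[b] + PSEUDO) / (n + 4 * PSEUDO)) / 0.25)
                       for b in BASES})
    return matrix


def spacer_scores(lengths):
    """log2(Ps / 0.2) for each allowed spacer length s."""
    counts, n = Counter(lengths), len(lengths)
    return {s: math.log2(((counts[s] + PSEUDO) / (n + len(SPACERS) * PSEUDO)) / 0.2)
            for s in SPACERS}


def train(promoters):
    """promoters: list of (minus35_hexamer, spacer_length, minus10_hexamer)."""
    return {"m35": position_scores([p[0] for p in promoters]),
            "m10": position_scores([p[2] for p in promoters]),
            "spacer": spacer_scores([p[1] for p in promoters])}


def score(model, m35, spacer, m10):
    """Overall score: sum of the position scores plus the spacer score."""
    total = sum(model["m35"][j][b] for j, b in enumerate(m35))
    total += sum(model["m10"][j][b] for j, b in enumerate(m10))
    return total + model["spacer"][spacer]


def scan(model, region, threshold, max_total_mismatches=4):
    """Return (start, spacer, score) for candidates scoring above the threshold."""
    hits = []
    for start in range(len(region) - 5):
        m35 = region[start:start + 6]
        for s in SPACERS:
            m10 = region[start + 6 + s:start + 12 + s]
            if len(m10) < 6 or set(m35 + m10) - set(BASES):
                continue  # window runs off the region or contains a non-ACGT base
            if mismatches(m35, MINUS35) + mismatches(m10, MINUS10) > max_total_mismatches:
                continue
            sc = score(model, m35, s, m10)
            if sc > threshold:
                hits.append((start, s, sc))
    return hits


# Toy training data (invented for illustration only).
toy = [("TTGACA", 17, "TATAAT"), ("TTGACA", 17, "TATAAT"),
       ("TTGACT", 16, "TATACT"), ("TTTACA", 18, "TAAAAT")]
model = train(toy)
region = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGG"
hits = scan(model, region, threshold=0.0)
```

Here scan applies the looser criterion used for the putative sRNA promoters (at most four mismatches in total); the per-hexamer limit used on the training set could be expressed by testing each hexamer's mismatch count separately instead.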