Predicting synthetic mRNA stability using massively parallel kinetic measurements, biophysical modeling, and machine learning

Predicting synthetic mRNA stability using massively parallel kinetic measurements, biophysical modeling, and machine learning

Massively parallel design and construction of 5′ UTRs to vary mRNA decay rate

We designed 62,120 5′ UTRs using combinations of sequence determinants that are proposed to alter mRNA decay rates, including: (1) the RppH binding site, which includes the first four nucleotides of the mRNA transcript; (2) the length and sequence composition of single-stranded (unstructured) regions in the 5′ UTR; (3) the sizes and folding energetics of mRNA secondary structures, varying the number of mRNA hairpins, bulges, internal loops within the 5′ UTR; (4) the presence of mRNA tertiary structures, including G-quadruplexes and i-motifs; and (5) designed ribosome binding site sequences that systematically varied the translation rate of a protein coding sequence (CDS).

Each designed mRNA transcript contains combinations of these factors to broadly explore the sequence-structure-function space, collected together into design groups. For example, in one design group, we designed 1280 5′ UTR sequences where each sequence varied the RppH binding site sequence (256 variants), followed by a single-stranded region (a 16-nt polyA motif) and a different ribosome binding site sequence (5 variants) that varied the CDS translation rates across a 10,000-fold range. We created four more design groups (1116 to 1772 designed sequences each) that systematically varied the RppH binding site and CDS translation rate, while varying the sequence and composition of the single-stranded RNA (ssRNA) region using polyU, AAACAAA, AAAGAAA, or UUUGUUU motifs. We created six more design groups (1170 to 8338 designed sequences each) that systematically varied the sequence and structure of the upstream 5′ UTR region, either utilizing polyA, polyG, polyC, or polyU tracts of varying length, shortening the region to 6-nt, inserting mRNA hairpins with varying duplex lengths (2, 8, or 16 duplexed base pairings), or removing the upstream region completely, all while varying the RppH binding site and CDS translation rates in each design group.

The next several design groups tested the importance of specific types of mRNA structures or binding sites. We began with designing 8073 5′ UTR sequences that have highly diverse sequences, but where subsets of sequences fold into a targeted secondary structure, using our recently developed Non-Repetitive Parts Calculator with an RNA structural constraint to carry out rational design49. The purpose of this design group is to distinguish between sequence motifs that specifically control RNase binding versus structural motifs that control overall accessibility to RNase binding sites. We next designed 3111 5′ UTR sequences with highly non-repetitive G-quadruplexes, using the pattern [(3G-6N)4], to test whether these compact tertiary structures can block RNase binding. Similarly, we designed 3310 5′ UTR sequences with i-motifs structures, using the pattern [(3C-6N)4], to test whether these non-canonical structures utilizing C-rich Hoogsteen base pairing can block RNase binding. We designed 1298 5′ UTR sequences containing between 1 and 5 sequence motifs previously suggested to bind to RNase E50 or that bind to CsrA51, which is a factor that regulates translation rates and may protect mRNAs from degradation. Finally, we designed 974 5′ UTR sequences that systematically varied CDS translation rates, using the RBS Calculator v2.1 to design ribosome binding sites with varied standby sites, Shine-Dalgarno sequences, and inhibitory mRNA structures2,52.

Each 5′ UTR was encoded within a 170-nt oligonucleotide that is paired to a distinct DNA barcode sequence. All oligonucleotides contain flanking PCR primer binding sites and two pairs of restriction sites to support 2-step library-based cloning. Oligonucleotides were synthesized in five separates oligopools (Genscript). Each oligopool was then PCR amplified, digested, and ligated into a plasmid-based expression system, followed by a second round of digestion and ligation to insert a constant sfGFP coding sequence (Methods). The final library of barcoded genetic systems uses a constant bacterial promoter (J23100) and the designed 5′ UTR to express the sfGFP reporter. However, we do not use the sfGFP reporter for measuring protein levels; instead, its purpose is to create a nominal expression environment where ribosomes bind to the mRNA transcript and provide protection against RNases, depending on its translation rate (Fig. 1A). During the 2-step cloning, the barcode sequence is repositioned downstream of the sfGFP reporter to minimize its effects on the mRNA decay rate. After separate cloning and transformation of the five plasmid libraries into E. coli DH5α cells, we combined cells together into a single-cell library and carried out MiSeq next-generation sequencing to assess library coverage. Overall, MiSeq sequencing yielded 7,955,477 reads and identified 59,721 expected barcode sequences (96.1% library coverage).

Fig. 1: Massively parallel design and kinetic decay measurements.
figure 1

A 62,120 barcoded genetic systems with varied 5′ UTR sequences were constructed using oligopool synthesis and library-based cloning. mRNA stabilities in exponentially growing E. coli cells were measured using rifampicin treatment to halt transcription initiation, followed by kinetic mRNA level measurements using DNA-Seq and RNA-Seq. B mRNA sequences were designed with combinations of factors affecting mRNA decay rate, including RppH binding affinity, mRNA secondary & tertiary structure, and mRNA translation rate. C Time course mRNA level measurements (dots) were fitted to exponential decay functions (lines), using spike-in RNA controls for normalization. Four characterized mRNAs with widely different mRNA decay rates are shown. D The distribution of measured half-lives are shown.

Massively parallel kinetic measurements to characterize mRNA decay rates

We then carried out massively parallel mRNA level measurements to determine the steady-state and kinetic decay rates of each engineered mRNA transcript (Methods). Mixed cell libraries were serially cultured in selective LB media at 37 °C so that all cells were maintained in the exponential growth phase for at least 4 h. At the starting timepoint (T0), cell samples were taken, followed by addition of rifampicin. Rifampicin directly binds to E. coli RNA polymerase and halts transcription of new mRNA, but does not otherwise affect the mRNA decay machinery53. Cells samples were then taken 2, 4, 8, and 16 min after addition of rifampicin, while cells continued to grow. All cell samples were protected from chemical degradation using RNAprotect reagent (Qiagen). Total RNA was immediately extracted from all timepoints. Plasmid DNA was extracted from T0 cell samples, followed by barcoded amplicon generation. A known and constant amount of spike-in RNA was then added to each total RNA sample for cross-timepoint normalization, followed by rRNA depletion, cDNA synthesis, and barcoded amplicon generation. Deep sequencing was then carried out on the barcoded amplicon libraries (Illumina NovaSeq), yielding 1.6 billion reads mapped to expected RNA barcodes or spike-in-control RNA and another 301 million reads mapped to expected DNA barcodes. As expected from mRNA decay, the number of reads matching mRNA-derived barcodes steadily decreased across timepoints with a corresponding increase in reads matching the spike-in control RNA (Supplementary Table 1). We then counted the number of each DNA and mRNA barcode that mapped to reads. DNA barcode counts quantify the composition of the cell library at the T0 timepoint, while mRNA barcode counts quantify mRNA levels at the T0 to T16 timepoints. We found that 56,816 DNA barcodes (91.5%) and 58,080 mRNA barcodes (93.5%) had at least 100 mapped reads at the T0 timepoint, showing that deep sequencing yielded enough reads to broadly cover the cell library. The median read counts were 1576 and 2015, respectively. Cell library coverage remained high after rifampicin treatment with between 82.2 and 91.4% of mapped mRNA barcodes having at least 100 reads across each timepoint. All genetic part sequences, designed 5′ UTR variants, barcode sequences, and barcode counts are available in Supplementary Data 1.

For each 5′ UTR variant, we then quantified its mRNA levels over each timepoint and assessed how well the kinetic mRNA levels match to an exponential decay curve. We quantified mRNA levels by dividing the mRNA barcode counts at each timepoint by the DNA barcode counts at the T0 timepoint. We then compared the kinetic mRNA levels to the exponential curve M(t) = M0 e–kt, where M0 is the mRNA level at the T0 timepoint, t is time (minutes), M(t) is the mRNA level at each timepoint, and k is the first-order kinetic decay constant (1/min). mRNA half-lives are calculated according to t1/2 = log(2)/k, where the natural log is used. In Fig. 1C, we show measured mRNA levels at each timepoint in comparison to fitted exponential decay curves, selecting mRNAs with widely different decay kinetics. Overall, we found that 50,048 mRNAs (80.6%) followed exponential decay kinetics (R2 > 0.75 for each curve) with in vivo mRNA half-lives that varied from 0.31 to 25.4 min (Fig. 1D). These results show that the designed 5′ UTR sequences greatly altered in vivo mRNA decay rates with half-lives that span the physiologically relevant range from seconds to minutes54.

Multi-factor sequence determinants controlling mRNA decay

Our first data analysis step was to determine how the sequence design factors affected the measured mRNA decay rates. Initially, we compared how changing a single design factor altered mRNA decay rates. However, we found that focusing on only a single design factor resulted in larger intra-category differences than inter-category differences, indicating that multiple factors are collectively controlling the mRNA’s decay rate with equally important contributions (Supplementary Information). For example, after tallying the decay rates for all mRNAs starting with the same RppH binding site and calculating their mean and quartile values, we found that the differences in the mean decay rates across different RppH binding sites were much smaller than the differences in the upper and lower quartiles when using the same RppH binding site (Supplementary Fig. 1). We found similar outcomes when using single design factor analysis to determine the effect of changing the CDS’ translation rates (Supplementary Fig. 2) or the amount of ssRNA in the 5′ UTR (Supplementary Fig. 3). This initial analysis motivated the use of multi-factor categorization to determine how each factor controlled mRNA decay rates, while keeping other factors constant.

We next applied multi-factor categorization to determine how individual design factors controlled mRNA decay rates. We collected together characterized mRNAs that had RppH binding sites that were more likely to confer greater mRNA stability (20 most stable RppH binding sites) as well as characterized mRNAs that had either high or low predicted CDS translation rates (higher or lower than 5000 au on the RBS Calculator v2.1 scale). We also collected together mRNAs that had moderate amounts of ssRNA in their 5′ UTR (11 to 40 nt). Notably, when we focus on mRNAs with stable RppH binding sites and moderate ssRNA lengths, we find an inverse relationship between the CDSs’ predicted translation initiation rates and their measured decay rates (Pearson R = −0.31, p-value = 4 × 10−27, N = 1527 mRNAs) with higher translation rates providing mRNAs with more stability (Fig. 2A), confirming that some design factors may only have a clear effect on mRNA decay rates when other contributing factors are kept relatively constant.

Fig. 2: Multi-factor sequence determinants controlling bacterial mRNA decay.
figure 2

A Measured mRNA decay rates are shown when varying the predicted mRNA translation initiation rates, while only including 5′ UTRs with stable (less active) RppH binding sites and moderate single-stranded RNA lengths (Pearson R = −0.31, two-tailed p = 4.2 × 10−27). B Measured mRNA decay rates are shown when varying single-stranded RNA lengths inside the 5′ UTR, while only including 5′ UTRs with stable RppH binding sites and low predicted translation initiation rates (<5000 au) (small size effect = 0.0008 for 0 to 30 nt, Pearson R = −0.19, two-tailed p = 1.6 × 10−14). C Measured mRNA decay rates are shown when varying single-stranded RNA lengths inside the 5′ UTR, now including 5′ UTRs with stable RppH binding sites and high predicted translation initiation rates (>5000 au) (large size effect = 0.01 for 0 to 30 nt, R = 0.924, two-tailed p = 1.3 × 10−14). Measured mRNA decay rates are shown when varying the length and composition of single-stranded RNA regions inside 5′ UTRs, using ribosome binding sites with either (D) low or (E) high predicted translation initiation rates. Box plots show the median value of measured mRNA decay rates (N > 100 mRNAs) across each category (orange line), the 25% and 75% quartile boundaries (boxes), the maximum and minimum of the in-distribution data (bars), and the outliers (points). Measured mRNA decay rates are shown when introducing either (F) G-quadruplex tertiary structures or (G) i-motif tertiary structures into highly translated mRNAs as compared to highly translated mRNAs lacking each type of structure, each indexed by the amount of single-stranded RNA in their 5′ UTRs.

In another example, we found that increasing the amount of ssRNA in the 5′ UTR increased the mRNAs’ decay rates, but the relationship only becomes clear when the analysis focused on mRNAs with both stable RppH binding sites and high predicted translation initiation rates (Fig. 2B, C). When mRNAs had low predicted translation rates, their measured mRNA decay rates were already high and any additional ssRNA in their 5′ UTR did not significantly increase the decay rate (Fig. 2B). The sequence composition of the ssRNA also contributed to changes in mRNA decay (Fig. 2D, E) independent of their effect on the mRNA’s structure and translation rate, possibly due to changes in the mRNAs’ rigidity and persistence length, which control its overall accessibility to RNase binding. PolyA motifs appeared to have a larger impact on mRNA decay than other homopolymer motifs, particularly when mRNAs had high predicted translation rates (Fig. 2E), based on both the larger changes in the median mRNA decay rates and the greater spread in their distributions.

Finally, we found that G-quadruplex tertiary structures had a protective effect on mRNA stability, generally lowering their decay rates, while i-motif tertiary structures had no apparent protective effect on mRNA stability (Fig. 2G, F). In this multi-factor analysis, we compared mRNAs that form tertiary mRNA structures versus mRNAs that only fold into secondary mRNA structures, focusing only on mRNAs with high translation initiation rates that should have a low baseline mRNA decay rate (high stability). In both cases, we calculate the amount of ssRNA in the 5′ UTR to carry out equal-category (apples-to-apples) comparisons; for mRNAs designed with G-quadruplexes and i-motif tertiary structures, we calculate the amount of ssRNA when only secondary mRNA structures are allowed to form. In this way, our comparative analysis determines the contribution of the tertiary mRNA structure itself, independent of other contributing factors. We found that the introduction of G-quadruplexes into the 5′ UTR exhibited low mRNA decay rates regardless of the secondary mRNA structures that would otherwise form with varied amounts of ssRNA (Fig. 2F). In the absence of a G-quadruplex, a longer ssRNA region caused the mRNA decay rate to increase by about 0.01 1/min per ssRNA nucleotide up to a length of about 33 nucleotides (linear regression; R = 0.926, p = 5.14 × 10−15, N = 31) with a peak mRNA decay rate of about 0.4 1/min. But when a G-quadruplex was introduced into the 5′ UTR, the mRNA decay rate was generally constant at about 0.2 1/min and did not depend on the 5′ UTR length (for up to 33 nucleotides of added sequence). In contrast, using the same analysis, we found that the introduction of i-motifs generally increased mRNA decay rates, compared to an equivalent mRNA that only formed secondary mRNA structures (Fig. 2G). For example, introducing an i-motif into a shorter 5′ UTR caused its mRNA decay rate to be high (about 0.5 1/min) in comparison to mRNAs with an equivalent length of ssRNA (5 to 10 nt), which had mRNA decay rates of about 0.2 1/min.

Development of a predictive model using biophysics and machine learning

We next developed a quantitative sequence-to-function model that predicts a bacterial mRNA’s decay rate, combining both biophysical model calculations and machine learning to quantify the most important design factors that control mRNA stability (Fig. 3). Our first step was to generate a long list of candidate sequence, structural, and functional determinants that may contribute to mRNA levels and mRNA decay rate, including the promoter’s transcription initiation rate, the RppH binding sites, the types and locations of mRNA structures, the amount of ssRNA in the 5′ UTR, the sequence compositions of different regions, and the translation initiation rate of the CDS. We applied several biophysical models to calculate these determinants, using only the genetic system’s sequence as the input. For example, we used a newly developed thermodynamic model (the Promoter Calculator v1.0) to calculate the promoters’ site-specific transcription initiation rates; even though the same promoter (J23100) is used to transcribe all mRNAs, its transcriptional start site and transcription initiation rate can change based on the first 20 nucleotides of the mRNA transcript, called the initial transcribed region (ITR)1. We used another thermodynamic model (the RBS Calculator v2.1) to calculate the translation initiation rates of the CDSs, which vary greatly across different 5′ UTR sequences2. We also applied RNA folding algorithms (Vienna RNA v2.4.11) to calculate the mRNA structures that form in each 5′ UTR-CDS region, including the types and locations of the thermodynamically most stable secondary structures and G-quadruplexes55. We used these structural calculations to tally several potential determinants, including hairpin duplex lengths, hairpin loop lengths, the number of bulges and internal loops inside hairpins, the amount of ssRNA in between hairpins, and the sequence composition of these ssRNA regions.

Fig. 3: A hybrid biophysical-machine learning model of mRNA decay.
figure 3

A The Promoter Calculator and RBS Calculator biophysical models are used to predict the transcription initiation rates and translation initiation rates of the five most predominant mRNA isoforms, followed by calculating the isoforms’ structural features. The biophysical features, measured mRNA levels, and measured mRNA decay rates of 45,283 characterized mRNAs are split into a training dataset (N = 36,219) and an unseen test dataset (N = 9064), followed by training and testing of a machine learning model (LightGBM). B Individually trained models accurately predicted mRNA levels at the steady-state timepoint (train R2 = 0.75, test R2 = 0.69) and post-rifampicin treatment timepoints (+2 min, train R2 = 0.72, test R2 = 0.65; +4 min, train R2 = 0.68, test R2 = 0.60; +8 min, train R2 = 0.61, test R2 = 0.53; +16 min, train R2 = 0.52, test R2 = 0.43). C The mRNA Stability Calculator accurately predicted mRNA decay rates by combining biophysical features and model-predicted mRNA levels at each timepoint (train R2 = 0.56, test R2 = 0.43).

Based on our prior data analysis, we found that the first four nucleotides of the mRNA transcript (the RppH binding site) have a particularly outsized impact on a mRNA’s decay rate. From that, we recognized that small differences in the promoter’s transcriptional start site will generate distinct mRNA isoforms with different RppH binding sites. Notably, a key advantage of the Promoter Calculator model is its ability to predict the transcription initiation rate for each potential start site. We accordingly applied the Promoter Calculator to calculate the five most predominant mRNA isoforms, ranked according to their predicted transcription initiation rates. For each distinct mRNA isoform sequence, we then separately calculated their structural and functional features (Fig. 3A). Altogether, we used these isoform-specific features to develop and test a machine learning regressor that quantitatively predicts steady-state mRNA levels and kinetic mRNA decay rates.

We began the model training and testing procedure by combining the isoform-specific features and mRNA level measurements (steady-state levels and 4 kinetic levels per mRNA) into an augmented dataset, following by randomized splitting into a training dataset (36,219 mRNAs) and an unseen test dataset (9064 mRNAs). Importantly, to develop a more generalized model, we randomly split the mRNAs in each design group into training and test datasets to ensure that the same fraction of each design group was uniformly represented in both datasets. Using the training dataset, we then carried out machine learning using a gradient boosting algorithm (LightGBM v3.3.2), followed by testing on the unseen test dataset. We first developed five independently trained models, one to predict steady-state mRNA levels and four to predict mRNA levels at each post-rifampicin kinetic timepoint (Fig. 3B).

Gradient boosting uses a learned decision tree with automatic feature categorization to develop an ensemble of weak models that are summed together to predict the outcome. LightGBM is an enhanced gradient boosting algorithm that rapidly sub-divides numerical features (e.g., the predicted translation initiation rates) into optimal discrete bins for improved model accuracy. LightGBM also quantities the importances of each isoform-specific feature, enabling the removal of features that do not affect the outcome. Through iterative development, we identified a single set of hyperparameters (Supplementary Table 2) and a single minimal set of 496 features (Supplementary Data 2) that yielded accurate models for both steady-state and kinetic mRNA levels, using log-transformed mRNA levels as the outcome (Fig. 3B). Model accuracy was highest when predicting steady-state mRNA levels (train R2 = 0.75, test R2 = 0.69) and mRNA levels after 2 min of rifampicin treatment (train R2 = 0.72, test R2 = 0.65), followed by reductions in accuracy as mRNA decay further reduced mRNA levels. The lowest model accuracy was observed at the longest kinetic timepoint (16 min after rifampicin treatment, R2 = 0.52, test R2 = 0.43). We then developed a machine learning model to predict a mRNA’s decay rate by combining the isoform-specific features with the models’ log-transformed mRNA level predictions. Using the same procedure and hyperparameters, we developed a LightGBM model with moderate accuracy on both the training and unseen test datasets (train R2 = 0.56, test R2 = 0.43) with absolute-time mRNA decay rate predictions that spanned a 10.3-fold range (half-lives from 0.7 to 9.8 min) (Fig. 3C). Overall, the model predicted mRNA decay rates with a median relative error of 18.5% when evaluated on the unseen test dataset.

During model development, we eliminated several candidate features that had low importances when predicting mRNA levels, including mRNA hairpin heights, the lengths of mRNA internal loops, and the number of mRNA bulges. We confirmed that changing these mRNA structural factors, while keeping other factors constant, did not significantly alter the mRNAs’ measured decay rates (Supplementary Fig. 4). In comparison, the same calculations identified that the RppH binding sites, the CDS translation rates, the promoter transcription rates, and the ssRNA compositions were the most important features controlling the model’s predictions in agreement with our multi-factor analysis (Supplementary Fig. 5). We also found that tracking only the single most predominant mRNA isoform, ranked according to their transcription rates, resulted in a statistically significant reduction in model accuracy (train R2 = 0.72, test R2 = 0.67; p-value of comparison =1.4 × 10−7). Finally, we verified that the model-predicted steady-state mRNA levels were similarly accurate across all design groups (Supplementary Fig. 6), suggesting that the model has high generalizability even when mRNAs have combinations of diverse sequence, structural, and functional features.

Design rules for controlling mRNA decay rate

As we have shown, several interactions work together in a coupled fashion to collectively control mRNA decay rates, making it a challenge to directly analyze data and disentangle cause-effect relationships to use as design rules. However, by using the developed LightGBM model as a predictor, we can now carry out a systematic variation of the interactions and quantitatively determine how they each individually alter mRNA stability. We first selected a baseline mRNA with sequence, structural, and functional features that confer high mRNA stability, followed by systematically changing single feature values across the physiologically relevant range (Fig. 4). In this analysis, all five predominant mRNA isoforms are treated as having identical features to more clearly show cause-effect relationships. For example, we systematically varied the RppH binding site of the baseline mRNA, ranked them according to the predicted effect on the mRNA’s steady-state level, and found that RppH binding sites with homopolymeric compositions (skewed statistics towards one nucleotide) greatly lowered mRNA stabilities as compared to RppH binding sites with balanced compositions (Fig. 4A). Accordingly, for a first design rule, simply changing the first four nucleotides of the mRNA transcript can be used to tune mRNA decay rates with small increments across a fourfold range (Supplementary Fig. 7).

Fig. 4: Model-predicted design rules for controlling mRNA levels and decay rate.
figure 4

A The RppH binding sites sequences ranked according to their model-predicted changes in steady-state mRNA levels. The top and bottom 10 RppH binding sites are shown that (red) decrease or (blue) increase mRNA stability. B Model-predicted design rules for changing steady-state mRNA levels when varying (left) translation initiation rate of a CDS, (middle) the amount of single-stranded RNA in the 5′ UTR with either polyA, polyG, polyC, or polyU composition, or (right) the amount of double-stranded RNA in the 5′ UTR.

We then systematically varied the baseline mRNA’s translation initiation rate and found that the LightGBM model predicted a sigmoidal effect on the steady-state mRNA level with a ceiling plateau at around 50,000 au on the RBS Calculator 2.1 scale (Fig. 4B). Therefore, all other factors being equal, a higher translation rate will also stabilize a mRNA transcript, yielding higher protein expression levels, up to a ceiling plateau. The model also quantitatively shows that longer ssRNA regions in the 5′ UTR lead to lower mRNA stability, although the effect is most pronounced for C-rich and A-rich ssRNA sequences (Fig. 4B, middle). PolyA nucleic acids are known for being particularly straight and stiff, which can increase their accessibility to RNase activity56. Finally, we found only modest increases in mRNA stability when 5′ UTR regions contained more mRNA secondary structures, independent of other factors (Fig. 4B, right). Therefore, it’s likely that mRNA secondary structures on their own do not protect mRNAs and, instead, it is the presence or absence of unstructured RNA that is the dominant design factor.

link

Leave a Reply

Your email address will not be published. Required fields are marked *