Machine learning predicts meter-scale laboratory earthquakes

Experimental settings

We briefly summarize the laboratory experiment settings used to measure acoustic emissions and mechanical data31. Acoustic emissions were recorded using 64 shear-mode piezoelectric transducers with a response frequency of 500 kHz. These sensors were attached to both long sides of the lower sample, with 32 sensors on each side. The signals were amplified 20 times and continuously sampled at 10 MHz. The shear load was measured using a load cell. For further details, refer to the original experimental studies30,31.

Event catalog in the meter-scale laboratory experiment

We utilize a catalog of laboratory quake cycles containing foreshock activity31. This catalog is generated using the STA/LTA detection technique combined with a grid search for spatial (x) and temporal (t) information, and the ball-drop calibration technique57 to determine moment magnitude (Mw). The seismicity shows a non-patterned spatial distribution (Fig. S1a, b) and follows a tapered Gutenberg-Richter magnitude-frequency relation31,58.

The published catalog does not include detailed laboratory quake information beyond their origin times, owing to the difficulty of precisely estimating their magnitudes and hypocenters. However, our forecasting method requires both quake and foreshock data. To address this, we roughly estimate the quake information (Fig. S1c, blue histogram) and validate these extrapolations after prediction (see Methods for details).

Seismic moment is estimated according to M = μDA, where μ is the shear modulus, D is the co-seismic slip amount, and A is the co-seismic slip area. The rigidity of the host rock in the laboratory experiment is μ = 41.2 GPa30. Assuming the entire fault slips co-seismically during laboratory quakes, we estimate A as A = 0.1 × 1.5 = 0.15 m². We further assume D is uniform across the fault, estimating it at 0.1–0.2 mm based on displacement data. The estimated seismic moment is thus M = 0.62–1.24 × 10⁶ N m, corresponding to moment magnitudes of Mw = −2.2 to −2.0. To account for potential non-uniform slip, we introduce a random error of 10% to M.

The effective rigidity during co-seismic slip might be lower than that of the host rock, as the simulated fault surface does not fully arrest the entire rupture. As a result, the estimated seismic moment could carry an error of roughly an order of magnitude. Therefore, we use Mw = −2.0 + error as the moment magnitude of the quakes and validate the prediction results over the range Mw = −3.0 to −2.0 for the quake moment magnitude. Ultimately, the extrapolation does not significantly impact the prediction score, as shown in Fig. S3; the low moment magnitude of the laboratory quakes does not meaningfully reduce the prediction accuracy.

Preparation of nominal shear stress data

We use nominal shear force data measured on the reaction force bar (Fig. 1a), provided by F. Yamashita. To account for the contribution of the shear force to the fault shear stress, we divide the shear force by the fault area of 0.15 m², obtaining the nominal shear stress, which is subsequently detrended (only the training set is used to determine the trend). After the initial four quake cycles, the nominal shear stress stabilizes at a consistent level throughout the experiment. However, a slight increasing trend, possibly due to hardening and compaction, is observed, which could potentially impact the performance of the ML and DL models. To mitigate this, we assume that the increasing trend is linear and remove it using a least-squares linear regression (Fig. S4). Note that the regression coefficients were calculated using only the training set, to prevent any information leakage into the validation and test sets. The resulting regression line is τlin(t) = 40.3 × 10⁻⁴ t + 66.9 MPa, where τlin is the linear trend and t is the time in seconds. Our training, validation, and test sets are continuous in time, and we know the time stamp of each data point in the validation and test sets. Thus, the validation and test sets can be detrended by subtracting τlin(tvalid) or τlin(ttest) from the raw shear stress data. This procedure remains valid even in real-time data processing, provided that the linear-trend assumption is reasonable. We refer to this detrended nominal shear stress as the nominal shear stress in the main text. To effectively train the DL model and ensure consistency with the ML model training conditions, the nominal shear stress is normalized using the min-max values from the training set.
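
As a concrete illustration, a minimal sketch of this training-set-only detrending is given below; the synthetic arrays merely stand in for the measured load-cell record.

```python
import numpy as np

# Synthetic stand-ins for the training data: time (s) and raw nominal
# shear stress (MPa); the real arrays come from the load-cell record.
rng = np.random.default_rng(0)
t_train = np.linspace(0.0, 600.0, 6000)
tau_train = 66.9 + 40.3e-4 * t_train + rng.normal(0.0, 0.05, t_train.size)

# Fit the linear trend on the training set only, so no information
# leaks into the validation or test sets.
slope, intercept = np.polyfit(t_train, tau_train, deg=1)

def detrend(t, tau):
    """Subtract the training-set trend tau_lin(t) = slope * t + intercept."""
    return tau - (slope * t + intercept)

# The same coefficients are reused for later (validation/test) data, which
# remains valid in real time as long as the linear-trend assumption holds.
tau_train_detrended = detrend(t_train, tau_train)
```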

Description of random forest algorithm

The Random Forest (RF) algorithm, developed by Breiman (2001)59 and implemented in the scikit-learn Python package by Louppe (2013)60 and Pedregosa et al. (2011)61, is an ensemble algorithm that aggregates the outputs of multiple decision trees. Each decision tree recursively partitions the input data using the input features and target labels. At each node, the tree evaluates whether feature k is above or below a threshold tk, and the data point proceeds to the left or right child node, respectively; each subsequent node repeats this process. The threshold at each split is determined by the Classification and Regression Tree (CART) algorithm, which aims to divide the current node into the two purest possible child nodes by minimizing the cost function J,

$$J{(k,{t}_{k})}_{s,j}=\frac{{m}_{j,{{{\rm{left}}}}}}{{m}_{j}}{{{{\rm{MSE}}}}}_{j,{{{\rm{left}}}}}+\frac{{m}_{j,{{{\rm{right}}}}}}{{m}_{j}}{{{{\rm{MSE}}}}}_{j,{{{\rm{right}}}}}$$

(1)

$${{{{\rm{MSE}}}}}_{j,{{{\rm{node}}}}}={\sum}_{i\in {{{\rm{node}}}}}{({\hat{y}}_{j,{{{\rm{node}}}}}-{y}_{j,i})}^{2},\qquad {\hat{y}}_{j,{{{\rm{node}}}}}=\frac{1}{{m}_{j,{{{\rm{node}}}}}}{\sum}_{i\in {{{\rm{node}}}}}{y}_{j,i}$$

(2)

where \(J{(k,{t}_{k})}_{s,j}\) is the cost function when the current node j is separated by split s using feature k at the threshold tk, mj,node is the number of data points in the resulting child node, mj is the number of data points before the split, \({\hat{y}}_{j,{{{\rm{node}}}}}\) is the average label value in the child node, and yj,i are the label values in the child node. Minimizing the cost function \(J{(k,{t}_{k})}_{s,j}\) corresponds to finding the feature k and threshold tk such that the data within each child node is as homogeneous as possible, while the two resulting nodes are as distinct from each other as possible. The RF model aggregates the outputs from all the trees to provide the final prediction values. This algorithm directly seeks the relationship between features and labels, making it well-suited for data with non-flat, non-monotonic distributions in the feature-label space. A conceptual illustration of the RF algorithm is provided in Fig. S14a.
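
For concreteness, a minimal regression-mode RF sketch using scikit-learn follows; the synthetic data and hyperparameter values are illustrative, not the optimized settings of Table S2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))        # 600 samples, 20 network features
y = rng.normal(size=600)              # synthetic labels

rf = RandomForestRegressor(
    n_estimators=100,   # number of trees whose outputs are aggregated
    max_depth=None,     # grow each tree until the CART splits stop helping
    random_state=0,
)
rf.fit(X[:500], y[:500])        # each split minimizes the MSE cost J
y_pred = rf.predict(X[500:])    # final prediction = average over all trees
```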

One may notice that R2 is higher in the training phase than in the testing phase (Figs. 1 and 2). While such a gap is typically indicative of overfitting, it is well established that the near-perfect training scores of Random Forest (RF) models do not, by themselves, degrade validation or testing scores59.

Description of a forecasting method

We employ the network representation53 and the classical ML technique, Random Forest (RF), implemented in scikit-learn59,60,61 as a forecasting method (see details of RF model in “Methods” and Fig. S14a). This approach, developed in a numerical study53, is designed to predict the time remaining before synthetic mainshocks using only catalog information. To compute input features, we define a network as a group of earthquakes (e.g., enclosed by the green line in Fig. S14b) and statistically summarize the catalog information regarding the origin times and seismic moments. Specifically, we use 2 parameters Xi = (Δti, Mi), where Δti is the event interval of each earthquake pair in the current network, Mi is the seismic moment transformed by \(M=1{0}^{3({M}_{{{{\rm{w}}}}}+6.067)/2}\)62, and i denotes the ith earthquake in the current network. The network abstracts them by taking the average and variance in the current network as follows:

$$\overline{{X}_{j}}={\sum}_{i}^{j}\frac{{X}_{i}}{j},\quad \,{{\mbox{Var}}}\,{(X)}_{j}=\frac{1}{j}{\sum}_{i}^{j}{({X}_{i}-\overline{{X}_{j}})}^{2}$$

(3)

where \(\overline{{X}_{j}}\) is the average, Var is the variance, and j is the number of earthquakes in the current network, which we define as the network size. The network size j is determined as a percentage of the average foreshock number, j = pf, where p is the percentage and f is the average number of foreshocks per cycle in the training set. The number of networks and the percentage p are optimized by cross-validation, along with the hyperparameters of the RF model (see details in “Methods” and Table S2). Using multiple networks corresponds to incorporating the temporal evolution of seismicity, which is abstracted by each network (Fig. S14c). The network representation therefore conceptually corresponds to adaptive time windows with no fixed duration, whose duration instead changes depending on the stage of the quake cycle. For example, the notation \({\overline{M}}_{5f}\) refers to the seismic moment averaged over a network sized to 5% of the average foreshock number over the quake cycles. We use log-normalized values to effectively train the ML and DL models, although the RF model does not require normalization.
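
As a sketch, the statistics of Eq. (3) can be computed over the most recent j events as follows; the names are illustrative and the window boundary is handled in a simplified way.

```python
import numpy as np

def network_features(times, moments, j):
    """Mean and population variance of (dt_i, M_i) over the last j events,
    matching the 1/j normalization of Eq. (3)."""
    dt = np.diff(times)[-j:]       # inter-event times in the current network
    m = np.asarray(moments)[-j:]   # seismic moments in the current network
    return dt.mean(), dt.var(), m.mean(), m.var()

# Synthetic example: a network sized to 20% of an average of f foreshocks.
rng = np.random.default_rng(0)
times = np.cumsum(rng.exponential(0.5, size=200))
moments = 10 ** rng.uniform(-1, 2, size=200)
f = 100
print(network_features(times, moments, j=max(2, int(0.20 * f))))
```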

For labels, we compute the time remaining before the quake for each foreshock’s origin time. During the training phase, the computed features are associated with the time to quake of the most recent foreshock. If the latest event is a quake, we calculate the time to the next quake. Additionally, we attempt to predict the nominal shear stress observed by the reaction force bar. Given the differences in the orders of magnitude between the labels, we use the log-normalized value of time to quake and the normalized linear value of shear stress for training and testing. Consequently, the trained ML model outputs predictions of either the time to quake or the shear stress value by considering the current and past catalog information extracted from multiple networks.
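
A minimal sketch of this label construction, with illustrative arrays standing in for the catalog:

```python
import numpy as np

def time_to_quake(event_times, quake_times):
    """Time remaining until the next quake for each event; a quake itself is
    labeled with the time to the *next* quake (hence side='right')."""
    idx = np.searchsorted(quake_times, event_times, side="right")
    keep = idx < quake_times.size        # drop events after the last quake
    return quake_times[idx[keep]] - event_times[keep]

# Synthetic example with log-normalized labels, as used for training.
quake_times = np.array([30.0, 55.0, 90.0])
event_times = np.array([5.0, 12.0, 30.0, 40.0, 70.0])
labels = np.log10(time_to_quake(event_times, quake_times))
```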

Optimization of hyperparameters of the random forest model, network sizes, and the number of networks

We first optimize the network sizes, the number of networks, and the hyperparameters of the Random Forest (RF) model using mini-batch training and validation to predict the time to laboratory quakes. Hyperparameter tuning is performed using a simple grid search. Subsequently, we fix the network sizes and the number of networks to ensure consistent feature extraction and then optimize the hyperparameters for predicting nominal shear stress and synthetic variables. The training batch size is set to 100, while the validation batch size is 25. This results in 6 and 4 batches for the training and validation sets, respectively. Consequently, the model minimizes the loss across 24 different pairs of training and validation sets. The loss function is the mean squared error (MSE) of either the logarithmic time to quake or the linear shear stress data. The optimized features comprise the average and variance of Δti and Mi within networks of p = (100, 70, 50, 20, inst), where inst refers to a network containing only two earthquakes (j = 2), capturing the instantaneous behavior. Thus, we use 20 features for each prediction task (the average and variance of (Δti, Mi) across the 5 networks, i.e., 2 statistics × 2 parameters × 5 networks). The optimized hyperparameters of the RF model for each prediction task are listed in Table S2.
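
The mini-batch grid search could be sketched as follows; the synthetic data, batch layout, and parameter grid are placeholders rather than the actual search space.

```python
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X, y = rng.normal(size=(700, 20)), rng.normal(size=700)
train_batches = [(X[i:i + 100], y[i:i + 100]) for i in range(0, 600, 100)]   # 6 batches
valid_batches = [(X[600 + i:600 + i + 25], y[600 + i:600 + i + 25])
                 for i in range(0, 100, 25)]                                 # 4 batches

param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
best_loss, best_params = np.inf, None
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    losses = []
    for X_tr, y_tr in train_batches:
        rf = RandomForestRegressor(**params, random_state=0).fit(X_tr, y_tr)
        losses += [mean_squared_error(y_va, rf.predict(X_va))
                   for X_va, y_va in valid_batches]
    if np.mean(losses) < best_loss:        # mean MSE over the 6 x 4 = 24 pairs
        best_loss, best_params = np.mean(losses), params
```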

Additionally, we derive feature importance from the trained RF model based on the reduction of the cost function J (Eq. (1)), as shown in Fig. S15. However, since each network uses overlapping data points to compute averages and variances, features from different networks are correlated. As a result, the absolute value of the importance is less meaningful and primarily serves as an index indicating which features are most frequently referenced in the current model. To visualize the general trend of input data, we only present the features from the 20% network for each prediction target.

Evaluation of the model performance

We quantitatively evaluate the prediction score using the coefficient of determination, calculated as \({}^{\log }{{{{\rm{R}}}}}^{2}\) and \({}^{{{{\rm{lin}}}}}{{{{\rm{R}}}}}^{2}\), following the methodology outlined in previous studies11,53:

$${}^{\log }{{{{\rm{R}}}}}^{2}=1-\frac{{\sum }_{i=1}^{n}{({\log }_{10}{y}_{i}-{\log }_{10}{\hat{y}}_{i})}^{2}}{\mathop{\sum }_{i=1}^{n}{({\log }_{10}{y}_{i}-{\log }_{10}\overline{y})}^{2}}$$

(4)

$${}^{{{{\rm{lin}}}}}{{{{\rm{R}}}}}^{2}=1-\frac{\mathop{\sum }_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}}{\mathop{\sum }_{i=1}^{n}{({y}_{i}-\overline{y})}^{2}}$$

(5)

where n is the number of data points, yi are the true target values, \({\hat{y}}_{i}\) are the predictions, and \(\overline{y}\) is the average of yi. We use logarithmic labels to equalize the contribution of values across different orders of magnitude in predicting the time to lab quake (Eq. (4)). Since the shear stress does not evolve logarithmically, we apply the typical definition of R2 (Eq. (5)).
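
Both scores are the standard coefficient of determination, applied in log10 space for Eq. (4); a direct transcription:

```python
import numpy as np

def lin_r2(y_true, y_pred):
    """Eq. (5): the usual coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def log_r2(y_true, y_pred):
    """Eq. (4): the same score after a base-10 log transform, so that all
    orders of magnitude contribute equally; inputs must be positive."""
    return lin_r2(np.log10(y_true), np.log10(y_pred))
```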

Deep learning algorithms and their optimized architectures

In order to compare the performance of the Random Forest (RF) algorithm with deep learning (DL), which has also recently been used to predict laboratory earthquakes18,21, we trained two classic DL architectures (a multilayer perceptron and a recurrent neural network) to predict the time to laboratory quakes and the nominal shear stress. For the multilayer perceptron (referred to as DNN here), the input dimension consists of the 20 features derived from the network representation, while the output dimension is 1, corresponding to either the time to a laboratory quake or the nominal shear stress at a given time. We apply batch normalization63, Xavier initialization64, and dropout regularization65. Mish activation66 is used as the nonlinear activation function. The model architecture, the number of neurons (n), and the dropout probability are determined using a mini-batch training-validation process, similar to that used for training the RF model. For optimization, we employ AMSGrad67 with a learning rate of lr = 0.001 and hyperparameters (β1, β2) = (0.9, 0.999). The mean squared error (MSE) loss is minimized during gradient descent. Training is terminated early when the minimum validation loss has not improved for 100 epochs. The optimized architecture is shown in Fig. S16a.
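
A minimal PyTorch sketch of such a DNN follows; the layer widths and dropout probability are placeholders, not the optimized architecture of Fig. S16a.

```python
import torch
import torch.nn as nn

class DNN(nn.Module):
    def __init__(self, n_in=20, n_hidden=64, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.BatchNorm1d(n_hidden),
            nn.Mish(), nn.Dropout(p_drop),
            nn.Linear(n_hidden, n_hidden), nn.BatchNorm1d(n_hidden),
            nn.Mish(), nn.Dropout(p_drop),
            nn.Linear(n_hidden, 1),        # time to quake or shear stress
        )
        for m in self.net:                 # Xavier initialization
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)

    def forward(self, x):
        return self.net(x)

model = DNN()
# AMSGrad with lr = 0.001 and (beta1, beta2) = (0.9, 0.999), as in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), amsgrad=True)
loss_fn = nn.MSELoss()
```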

For the recurrent neural network, we train a Long Short-Term Memory (LSTM)68,69 architecture to test the robustness and model independence of our results. Unlike traditional neural networks, LSTM replaces standard neurons with memory blocks that inherently capture the history of input sequences. Since LSTM can capture the temporal evolution of features, we use only 1 network (the 20% network, 4 features) to characterize seismicity. We apply orthogonal weight initialization70, Xavier initialization64, and dropout regularization. The activation function used is \(\tanh (x)\). Optimization follows the same procedure as for the DNN, except that training is terminated early when the minimum validation loss has not improved for 50 epochs. The optimized architecture is shown in Fig. S16b.
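
A corresponding LSTM sketch, again with placeholder sizes rather than the optimized architecture of Fig. S16b:

```python
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    def __init__(self, n_in=4, n_hidden=32, p_drop=0.2):
        super().__init__()
        self.lstm = nn.LSTM(n_in, n_hidden, batch_first=True)
        self.drop = nn.Dropout(p_drop)
        self.head = nn.Linear(n_hidden, 1)
        # Orthogonal initialization for recurrent weights, Xavier for input weights.
        for name, w in self.lstm.named_parameters():
            if "weight_hh" in name:
                nn.init.orthogonal_(w)
            elif "weight_ih" in name:
                nn.init.xavier_uniform_(w)

    def forward(self, x):                      # x: (batch, sequence, 4)
        out, _ = self.lstm(x)
        return self.head(self.drop(out[:, -1]))  # predict from the last step

model = LSTMRegressor()
y = model(torch.zeros(8, 10, 4))               # dummy forward pass
```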

The DL models yield accuracy comparable to that of the RF models in predicting both the time to laboratory quakes and the nominal shear stress, although, unlike the RF algorithm, DL requires careful attention to potential overfitting (avoided here, as shown by the similar training and testing performance). The current model is optimized for the logarithmic time to laboratory quakes, prioritizing this aspect over accuracy on a linear scale.

Inter-event time model

We construct a simple inter-event time model that forecasts the timing of mainshocks based solely on the average recurrence interval of laboratory quakes11,19. At the occurrence of a laboratory earthquake, the model outputs the average recurrence interval from the training set, with the predicted time to the next event decreasing linearly. Hence, it does not utilize any foreshock information and is merely a countdown from the previous mainshock. Fig. S2 presents the predictions of the inter-event time model. The linear-scale accuracy is \({}^{{{{\rm{lin}}}}}{{{{\rm{R}}}}}^{2}=0.73\), while the log-scale accuracy is \({}^{\log }{{{{\rm{R}}}}}^{2}=-0.28\). When the true recurrence interval exceeds the average recurrence, the model predicts a negative value; for \({}^{\log }{{{{\rm{R}}}}}^{2}\) calculations, these values are set to 10⁻³ seconds.
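
As a sketch, the baseline reduces to a linear countdown (the numbers below are illustrative):

```python
def baseline_prediction(t_now, t_last_quake, T_avg):
    """Predicted time to the next quake: a linear countdown from the
    previous mainshock, ignoring all foreshock information."""
    return T_avg - (t_now - t_last_quake)

# For the log-scale score, negative predictions (cycles that outlast the
# average interval) are floored at 1e-3 s, as noted above.
T_avg = 25.0    # illustrative mean recurrence interval from training, s
pred = max(baseline_prediction(t_now=68.0, t_last_quake=40.0, T_avg=T_avg), 1e-3)
```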

Set up of a two-dimensional, fully dynamic earthquake-cycle model to replicate the laboratory data

To examine the physical mechanisms underlying precursory behavior, we develop a two-dimensional fully dynamic earthquake-cycle model to replicate laboratory quake behavior (Fig. 4a, b). The model setup is similar to those in Ito & Kaneko (2023)54 and Norisugi et al. (2024)53. The fault is embedded in an elastic continuum and subjected to tectonic loading from both edges. To prevent rupture initiation exclusively from the fault’s edges, we also apply a time-independent stressing rate (Fig. 4a).

The numerical approach is based on a boundary integral method71,72, adapted for a two-dimensional (in-plane, Mode II) fully dynamic model of earthquake cycles. This dynamic approach allows for the realistic simulation of dynamic ruptures, crucial for understanding earthquake rupture arrest and the resulting earthquake sizes54,71. The fault length is 2400 mm, with 1200 mm regions at both ends subjected to the imposed loading rate. The entire fault is divided into 16,384 cells, each 0.15 mm in size. The fault’s constitutive behavior is governed by rate-and-state friction laws with the aging form of state variable evolution73,74,75:

$$\tau=\sigma \left[{f}_{{{{\rm{0}}}}}+a\ln \left(\frac{V}{{V}_{{{{\rm{0}}}}}}\right)+b\ln \left(\frac{{V}_{{{{\rm{0}}}}}\theta }{{D}_{{{{\rm{RS}}}}}}\right)\right],\qquad \frac{{{{\rm{d}}}}\theta }{{{{\rm{d}}}}t}=1-\frac{V\theta }{{D}_{{{{\rm{RS}}}}}}$$

(6)

where τ is the shear strength, σ is the effective normal stress, a and b are the rate-and-state constitutive parameters, V is the slip rate, f0 is the friction coefficient at V = V0, θ is the state variable, and DRS is the characteristic slip distance for state variable evolution. The parameter a − b primarily controls the fault's slip behavior, with positive and negative values corresponding to velocity-strengthening (VS) and velocity-weakening (VW) patches, respectively. We introduce frictional heterogeneity on the fault with alternating VS and VW patches, where aVS − bVS = 0.0025 and aVW − bVW = −0.0030, with lengths of LVS = 12 mm and LVW = 30 mm. The 130-mm VS regions at both ends of the fault act as permanent rupture barriers. The fault contains 50 VS patches and 51 VW patches. A uniform value of DRS = 1.5 nm is used to generate tiny foreshocks. The parameters for this model are listed in Table S1.
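
Eq. (6) translates directly into code; the f0 and V0 defaults below are illustrative stand-ins for the Table S1 values:

```python
import numpy as np

def shear_strength(sigma, V, theta, a, b, f0=0.6, V0=1e-6, D_RS=1.5e-9):
    """Rate-and-state shear strength from Eq. (6); all arguments in SI units."""
    return sigma * (f0 + a * np.log(V / V0) + b * np.log(V0 * theta / D_RS))

def dtheta_dt(V, theta, D_RS=1.5e-9):
    """Aging-law evolution of the state variable from Eq. (6)."""
    return 1.0 - V * theta / D_RS

# At steady state (V * theta = D_RS) the strength reduces to
# sigma * (f0 + (a - b) * ln(V / V0)), the usual steady-state friction law,
# which makes the sign of a - b control velocity strengthening vs. weakening.
```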

Under slow loading, stick-slip frictional instability develops only in the VW region, where the size of the slipping zone exceeds the critical nucleation length. The theoretical estimate of the nucleation length relevant to the present simulation is given by76:

$${h}_{{{{\rm{RA}}}}}^{*}=\frac{2}{\pi }\frac{{\mu }^{{\prime} }{b}_{{{{\rm{VW}}}}}{D}_{{{{\rm{RS}}}}}}{\sigma {({b}_{{{{\rm{VW}}}}}-{a}_{{{{\rm{VW}}}}})}^{2}}$$

(7)

where σ is the effective normal stress, \({\mu }^{{\prime} }=\mu /(1-\nu )\), μ is the shear modulus, ν is Poisson's ratio, and aVW and bVW are the frictional constitutive parameters on the velocity-weakening patch. Given the parameters listed in Table S1, the estimated critical nucleation length is \({h}_{{{{\rm{RA}}}}}^{*} \sim 20\) mm. The actual nucleation length is influenced by the background loading rate and the manner in which the VW patch is loaded72,76, so the actual nucleation size can differ from \({h}_{{{{\rm{RA}}}}}^{*}\). Within the expected range of nucleation sizes, we confirm that the spatial discretization is fine enough to avoid one-cell instabilities, which would cause numerical artifacts.
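
Evaluating Eq. (7) numerically reproduces the quoted scale; because Table S1 is not reproduced here, ν, σ, and the individual aVW and bVW values below are assumptions chosen only to be consistent with bVW − aVW = 0.0030 and h* ≈ 20 mm:

```python
import numpy as np

mu = 41.2e9                    # shear modulus of the host rock, Pa
nu = 0.25                      # Poisson's ratio (assumed)
sigma = 2.6e6                  # effective normal stress, Pa (assumed)
a_VW, b_VW = 0.0060, 0.0090    # assumed split with b_VW - a_VW = 0.0030
D_RS = 1.5e-9                  # characteristic slip distance, m

mu_prime = mu / (1.0 - nu)
h_star = (2 / np.pi) * mu_prime * b_VW * D_RS / (sigma * (b_VW - a_VW) ** 2)
print(f"h*_RA = {h_star * 1e3:.1f} mm")   # ~20 mm with these values
```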

We define the origin of a synthetic event as the moment when the fault slip rate exceeds 1 cm/s at any location on the fault, and the end of the event as when the slip rate decelerates below 0.9 cm/s. The slightly lower threshold prevents double-counting of a single event with an oscillating slip rate. The catalog contains the origin time (t), location (x), and seismic moment M = μAD, where μ is the shear modulus, A is the co-seismic slip area, and D is the co-seismic slip amount. Given that the fault is one-dimensional, we assume that A is the square of the slip length.
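
The event-picking rule, with its hysteresis thresholds, could be sketched as follows (v_max denotes the peak on-fault slip rate at each time step; names are illustrative):

```python
V_ON, V_OFF = 1e-2, 0.9e-2     # onset / arrest thresholds, m/s

def pick_events(t, v_max):
    """Return (origin, end) time pairs; the lower off-threshold prevents
    double-counting a single event whose slip rate oscillates near 1 cm/s."""
    events, in_event, t_on = [], False, None
    for ti, vi in zip(t, v_max):
        if not in_event and vi > V_ON:
            in_event, t_on = True, ti
        elif in_event and vi < V_OFF:
            in_event = False
            events.append((t_on, ti))
    return events
```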

When the loading rate from both edges is tuned to match the value used in laboratory experiments (Vpl = 0.01 mm/s), the model produces an unrealistically fast earthquake sequence with recurrence intervals of less than one second, whereas laboratory experiments yield intervals of 20 seconds or more. Therefore, we use a much slower loading rate of Vpl = 50 nm/s and set the time-independent stressing rate to \(\dot{\tau }=0.05\) MPa/s, ensuring that the fault is primarily loaded by this stressing rate. This discrepancy may arise due to differences in fault dimensions, which affect the energy provided per unit area.

The synthetic catalog produces complex seismicity patterns (Fig. S7a), with foreshock activity sometimes coherently clustering near the fault edge or spreading toward the center. Locked areas (without foreshocks) persist similarly to the laboratory situation31. However, the event size distribution (Fig. S1c)31,58 is not fully reproduced, and the productivity of medium-sized foreshocks (−5 < Mw < −4) is insufficient in this simulation (Fig. S7b). This may be due to differences in the model’s dimensionality and frictional properties, as the dominant size of foreshocks is restricted by the VW patch size or nucleation size, and the behavior of rupture termination may differ in a two-dimensional fault model. While there are some qualitative differences in the catalog, our focus here is to replicate the fundamental and simplified physics of the laboratory situation rather than achieve a perfect replication. Therefore, we use the synthetic catalog and related physical quantities under the assumption that a similar physical mechanism governs the production of laboratory foreshocks and quakes.

Additionally, we vary the a − b values on the velocity-weakening (VW) patches to examine how different degrees of fault heterogeneity affect ML prediction performance. These additional cases use VW patch a − b values of −0.0020 (Fig. S9), −0.0035 (Fig. S10), and −0.0015 (Fig. S11). Furthermore, to simulate a scenario that produces both partial and full ruptures, we introduce a slightly longer (54 mm) and weaker VS barrier at the fault center, with (a − b)VS = 0.0017 (Fig. S12).

Method for Gutenberg-Richter b-value estimation

We follow54 and apply the maximum likelihood method to estimate b-values from the synthetic catalog77,78:

$$b{{{\rm{-value}}}}=\frac{{\log }_{10} \; {e}}{{\overline{M}}_{{{{\rm{w}}}}}-({M}_{{{{\rm{c}}}}}-\Delta {M}_{{{{\rm{w}}}}}/2)}$$

(8)

where \({\overline{M}}_{{\rm{w}}}\) is the mean magnitude above Mc, Mc is the magnitude of completeness, and ΔMw is the bin size. The foreshock catalog is stacked across multiple mainshock cycles within the training set after preprocessing, which involves removing events smaller than the magnitude completeness threshold (Mc = −4.6) and events with a time to synthetic quake smaller than 10⁻³ seconds. We set ΔMw = 0.05. The standard error of the b-value is estimated as follows79:

$$\epsilon (b{{{\rm{-value}}}})=2.30{(b{{{\rm{-value}}}})}^{2}\epsilon ({\overline{M}}_{{{{\rm{w}}}}})\,$$

(9)

where

$$\epsilon {(\overline{M}_{{\rm{w}}})}^{2}=\mathop{\sum }_{i=1}^{n}\frac{{({M}_{{{{\rm{w}}}}}^{i}-{\overline{M}}_{{{{\rm{w}}}}})}^{2}}{n(n-1)}\ .$$

(10)
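
Eqs. (8)-(10) combine into a short estimator; the synthetic magnitudes below are illustrative:

```python
import numpy as np

def b_value(mw, mc=-4.6, dmw=0.05):
    """Maximum-likelihood b-value (Eq. 8) and its standard error (Eqs. 9-10)."""
    mw = np.asarray(mw)
    mw = mw[mw >= mc]                       # enforce the completeness cut
    b = np.log10(np.e) / (mw.mean() - (mc - dmw / 2))
    n = mw.size
    err_mean = np.sqrt(np.sum((mw - mw.mean()) ** 2) / (n * (n - 1)))
    return b, 2.30 * b**2 * err_mean

# Synthetic check: exponentially distributed magnitudes above Mc give b ~ 1.
rng = np.random.default_rng(0)
mw_synth = -4.6 + rng.exponential(scale=1 / np.log(10), size=5000)
print(b_value(mw_synth))
```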
