Below is a write-up for an analysis performed on the relationship between the volume of injected fluids and occurrence of
earthquakes in Oklahoma. Assocated documentation and code can be found on the GitHub repository:
Oklahoma Earthquakes GitHib repository.
The project was a personal side project not associated with the class. The goal was to intersect my geologic background and
recently learned analyical methods learned. Below the formal write-up is an informal discussion where I discuss my thoughts
on the project and lessons learned.
Oklahoma has experienced a marked increase in seismicity since 2009 which has been strongly linked to the increase in saltwater disposal (SWD) volumes injected into the Arbuckle Group [1, 2, 3]. This SWD is a direct result of hydralic fracturing (informally known as fracking); the hydraulic fracturing requires an enormous amount of water that must be disposed of after the hydrocarbon recovery process - in 2015, the year with the highest injection volume, nearly 330,000,000 m³ of wastewater were injected into the Arbuckle Group in Oklahoma, with more volume injected into various other formations. This project attempts to model the realationship between the increase of injection volumes and the increase in seismic events within Oklahoma.
The Arbuckle Group is a relatively poorly understood geological formation of approximately Late Cambrian - Ordovician age (approximately 480-490 million years old). The Arbuckle Group is primarily composed of nearly pure carbonates, which, through karst dissolution processes, have produced significant amounts of porosity making it a suitable candidate for wastewater storage [4]. However, the Arbuckle Group has hydraulic communication with the crystalline basement below containing pre-existing faults under stress. It is hypothesized that injection of wastewater into the Arbuckle Group causes pressure on the pre-existing faults, thus reducing the frictional resistance, inducing seismicity [1].
Waste water injection wells are used to inject fluid into porous geologic formations. The waste water is typically a brine mixed with proprietary chemicals used in the hydraulic fracturing process. The wells included in this study are Class II wells, used for fluids associated with oil and natural gas production.
Acquisition of data for this project was relatively stright forward: injection data was downloaded as XLSX files from the Oklahoam Coproration Commission's (OCC) website [5], and the earthquake data was retreived via API from the United States Geological Survey (USGS) [6]. Additionally Oklahoma Geological Survey (OGS) has data on seismic events; however, their download interface is less suited for large scale downloads and do not include earthquake data post 2018.
This section details pre-processing of earthquake and injection data prior to linear regression modeling,
along with further processing descisions resulting from exploratory data analysis.
Wastewater injection data from the the OCC website were provided in three major categories:
1) Daily injection volumes from SWD wells located in the Area of Interest (AOI) completed in the Arbuckle
Group. Each years' data is contained in one XLSX file, and the years 2012-2021 are represented,
and updated daily. However, daily injection rates for these wells were not required until 27 March
2016, thus, data prior to that date are considered incomplete. As shown in Figures 1 and 2, the
dropoff in injection volumes during 2020 are possibly die to incomplete data.
2) Monthly injection volumes for all SWD wells located in Oklahoma. Each years' data is contained
in one XLSX file; there is monthly data for 2011-2020. It's noted, though, that data post 2019 is
considered incomplete and is subject to being updated.
3) Yearly injection volumes for all SWD Wells located in Oklahoma from 2006-2010.
During intital viewing of the data, it was noticed that there was a discrepency between volumes
from the daily and monthly injection data files: after approximately 2015, the daily rates summed to
a monthly volume were high as 3,179,746 m³ (~840,000,000 gallons; 20,000,000 barrels). After analysis
of the data, it was determined that there were different reported volumes for the same wells in
both data sets. Ostensibly, the monthly data should have the complete data, as, per the OCC website,
the monthly injection volumes sorted by year "...contains annual reporting of volumes for all injection
wells submitted to the Underground Injection Control (UIC) Department..." [5], while the daily data should only
be a subset of the total wells located in the AOI. Figure 1 (below) displays the discrepency.
Figure 1: Comparison of Wastewater Injection Volume Sources
Figure 1 displays the differences in volumes between the monthly, daily, and combined data sources. As stated above, daily
volumes (red; note that daily volumes were grouped and summed by month) were not required to be submitted
to the OCC until 27 March 2016; however, the large increase in volume corresponds
to March 2015 - it is possible that there is a typo on the OCC website. Due to the lack of data prior to
March 2015, the daily data could not be used on its own for the analysis. Monthly data is available as far
back as 2011, and makes up the bulk of the injection numbers used for analysis. Monthly data is available for
every well in Oklahoma - as such, the monthly data were filted by well identification number to match wells
included in the daily data, as those wells are located in the AOI. However, after filtering the monthly data
and grouping the daily data by month, the monthly injection volumes did not match. The blue line in Figure 1
displays the filtered monthly injection volumes.
The daily and monthly data were then combined, with duplicate well identification numbers with lower total monthly
volumes dropped; the green line in Figure 1 displays the combined data. After combination, it was decided that
the monthly data would be used for analysis without combining with the daily data since it is impossible to know whether the daily or
monthly datasets contain the values more closely aligned with true injection volumes, thus, there was no adequate
basis for determing which of the values to drop. However, the volumes and trends identified here within are
analagous to other independent and peer reviewed studies [1, 7].
It should be noted that, due to the limitations and goals of this study, the well injection volume data
here within should not be considered completely representative of the actual well injection volumes: a more complete
study should involve communication with the OCC to resolve possible discrepencies in their data.
Data were downloaded from the USGS website via an API call; data cleaning involved removing earthquake entries with negative magnitude and those with missing data (e.g., no magnitude, geographic coordainates). The earthquake data were then filtered for only events which occur wthin 20 km of an injection well. The spatial relationship between the seismic event location and injection locations is noted to be an important attribute: 20 km was chosen to be a reasonable distance after comparison with other studies. Lastly, only earthquakes of magnitude 3 or higher were included in the analysis, per convention with literature [1, 3, 7, 8]. The interactive map below displays the earthquakes per year.
Preprocessed data were imported into a Jupyter Notebook and combined for intitial visualization and analysis.
Minor aditional data manipuation was performed to prepare the datasets for plotting and
regression analysis. The general structure of data used for analysis was as follows: only earthquakes within
20km of an injection well were included, earthquakes were filtered such that only earthquakes of magnitude 3.0
or higher were included, and earthquake events were then grouped and summed by month to match the resolution of the
injection data.
Consideration for running analysis on the daily earthquake data and daily injection data was ultimitaly discarded
for a number of reasons discussed in this paragraph. First, temporal distribution of seismic events is an important
factor, thus, it was deeemed necessary that analysis should capture the largest timeframe possble - this required
balancing the resolution of injection data. As described above, yearly injection data was available from 2006-2010 -
this resolution was deemed too coarse for our needs, and did not caplture the increase in seismic activity. Daily injection volume
was available from 2012 to 2021; however, daily volumes were not required to be submitted prior to March 2016, thus
any data prior to that data could not be considered complete. As such, daily cata could only be considered reliable post March 2016.
As shown below in Figure 2 (note that Figure 2 is interactive), starting analysis post March 2016 would exclude a
large proprtion of data, including the increase in both injection rates and earthquake events. Monthly data was available
from 2011 to 2020. After the prior considerations, it was determined that monthly data would strike the best balance
between temporal distribution and temporal resolution.
Figure 2: Monthly Count of Earthquakes of Magnitude 3 or Greater vs. Monthly Injection Rates.
Two important details were noticed after plotting the monthly injection volumes and monthly earthquake occurrences
through time: 1) There was a lag between increasing injection rates and increased seismicity, and 2) there appeared
to be a critical threshold of injection volume required necessary prior to trigging statewide, consistent, earthquake
events. Each of the aformentioned details are described in greater detail in the following paragraphs.
The time lag is evident in Figure 2 and is discussed by Langenbruch and Zoback [1], where they state that "A time lag
of several months between the injection of fluid and associated occurrence of earthquakes is not unexpected." Further
noting that a lag of "several years" is common. To account for the time lag in this analysis, in inciments of seven
days, time was subtracted from seismic events so as to simulate the occurence of seismic events occurring
"concurrently" with thir respective injection volumes: offsets were calculated up to two years by.
Additionally, there is a threshold of injection volumes required to "kick-start" the seismic processes, with the exception of the
November 2011 "Prague" earthquake (magnitude 5.6). Note that this analysis is a state-wide look at earthquake
activity, as such, it is possible that the injection volume local to the "Prague" earthquake was suffeciant to trigger
the seismic event, but state-wide volume was still below the threshold for more wide spread events.
There are a multitude of factors that can influence the occurrence of and individual seismic event, and is beyond
the scope of this project to account for all of them.
As shown in Figure 2, the trigging threshold appears to be
10,000,000 m³ of injected fluid per month, which was achieved approximately December 2012; however, it is possible
that the triggering threshold is lower due to the time lag between injection volumes and seismic events -
down to approimately 8,000,000 m³ per month, if the lag is as high as two years as stated as a possibility
by Langenbruch and Zoback [1]. Additionally, it appears that once the triggering threshold is exceeded, dropping below
the triggering threshold does not cease seismic activity. Note that after injection volumes dropped below 10,000,000 m³
per month (approximately April 2016), seismic activity continued at an increased rate relative to similar monthly
injection volumes prior to December 2012 when the triggering threshold was breached. When accounting for two-year time
lag and dropping the the triggering threshold
to 8,000,000 m³ per month, seismic activity still continued at an elevated rate. As seismic events are still occuring at
rates above the rates prior to the triggering threshold, the "reverse threshold", this is the monthly injection volume at
which seismic events reach their pre-triggering threshold levels, is unknown. However, 2020 has had the lowest occurence
of earthquakes of magnitude 3 or greater, with monthly counts only as high as 3 for January through May. The injection
data for 2020 is considered incomplete by the OCC, thus, we cannot accuratly draw a correlation between 2020 seismic
events and their concurrent injection volumes.
Linear regression anaylsis was performed on data from 2011 to 2019, because, as mentioned above, injection data post 2019
is considered incomplete. Splitting of the training and testing data was done under two arrangments: 1) the training and
testing data were split randomly using "train_test_split" from sklearn, and 2) using data from 2011-2018 as the training
data, and data from 2019 as the testing data to test the model's predictive capability on past data.
Figure 3 (below) displays the R² values for the results of random split arrangment. A linear regression model was perfomed
using the complete offset earthquake dates, up to two years, as descibed above. Overall shape of the curve followed a rough
"hill" pattern, with maximum coefficient of determination (R²) values in the center, and lower R² on the sides. Testing the model with the random subset
(25% of the data) produced an R² of 0.45, that is, injection volumes accounted for appoximately 45% of the variance in
the model. Subsequent offsets of seven days continuously increased model performance up to a local maximum of 0.89,
corresponding to an offset of 175 days (25 weeks). After 175 days, R² values followed a roughly sinusoidal pattern
with periods ranging from 49 to 70 days leading to a maximum R² value of 0.91, corresponding to an offset of 357 days
(51 weeks). Following the maximum, the sinusoidal pattern became less pronounced and decreased to a minumum R² of -17.95,
corresponding to the maximum tested offset of 728 days (104 weeks). The mechanism behind the sinusoidal pattern has not
been determined; however, it is plausible that the pattern an artifact due to clustering of earthquakes by month - due to large
spikes in earthquake occurrences resulting from aftershocks (discussed in greater detail in the Discussion Section),
there exist combinations of months with disproportionally large counts of earthquakes relative to the general trend.
Data not fitting the general trend, that is, outliers, negatively affect performance of the model lower the R² scores.
Figure 3: R² Values vs. Offset in Weeks - Random Train/Test Data
Results of the second second arrangement, where the model was used in a predictive capacity to test earthquake occurrences in 2019, are shown in Figure 4 below, again, displaying R² values versus the offset in days. Results followed a rough "valley" shape, with relatively higher R² values towards the ends of the graph and a minimum R² in the center. R² results from the predictive arrangement are considerably lower than results from the random split arranement, and the resulting pattern is roughly inverse, though much less defined. The maximum R² value was -0.23, corresponding to an offset of 21 days. The minimum offset was -58.57, corresponding to an offset of 434 days (62 weeks).
Figure 4: R² Values vs. Offset in Weeks - 2019 Prediction
As presented above, the results of the correlation arrangement produced a reasobale suffecient R² value to suggest that
there is a correlation between the volumne of injected water and the occurrence of seismic activity. In accordance with
the literature, there appears to be a lag between injection and seismic events - with an offset of 25 weeks resulting
in an R² value of 0.89. Following 25 weeks, a sinusoidal pattern emerges with peaks corresponding to an R² value of
approximately 0.90. While the sinusoidal pattern possibly results from grouping earthquake counts by month, the relatively
peaks suggest that there is no single value that corresponds to an ideal offset - that is, delays beween injection and
earthquakes likely range from 25 to 51 weeks, after which, the sinusoidal trend then follows a decreasing pattern with the
decreasing peak values with increasing offset (See Figure 3).
Given the scope and generalizations required for this model(that is, this project represents the entirety of Oklahoma's
injection wells), a range of earthquake delays is not unexpected. Beyond selecting only earthquakes that occur within 20
km of an injection well, this model does not take into acount the spatial relationship between injection and earthquake
focal point location.
In contrast to the relatively high R² value for determing the correlation between injection and earthquakes, the attempt
to predict occurrences in 2019 based on a linear regression model produced from data from 2011-2018 resulted in very low
R² values. This is not unexpected do to a variety of factors, most of which relate to the nature of earthquakes.
Individual earthquakes do not necessarily have a strict linear relationship with injection volumes and the resulting
propogated pressure wave; slippage along a fault requires a minimum pressure to trigger that slippage. Linearly increasing
local injection rates will not linearly increase the resultant count nor magnitude of earthquake prior to reaching the
pressure threshold for triggering the seismic event. After the seismic event, local faults realize a pressure stabilization
and reach a state of relative equilibrium. Aggregating earthquake counts over a large enough area helps to combat this
localization problem; however, the problem is not completely mitigated and appears when attempting to make a predicictions
of monthly earthquake counts.
The second major problem for earthquake predictions deals with the occurrence of preshocks and (to a much greater extent)
aftershocks. The "Earthquake Declustering" section below discusses this in greater detail, along with a discussion of an
initial attempt to decluster earthquake events.
Large spikes in monthly earthquake counts, as shown in Figure 2, correlate to large magnitude earthquakes - as the
magnitude of an earthquake increases, the number of preshocks and aftershocks increases, with aftershocks commonly
occuring immediately after the main earthquake and into the following days. For example, McGarr and Barbour concluded
that 13 February 2016 "Fairview" earthquake was associated with as many as 63 foreshocks and 89 aftershocks [8]. Predicting
aftershock counts is beyond the scope of this project; however, an attempt was made to "descluster" earthquake
occurrences, that is, to isolate the main earthquake from the pre/aftershocks. Declustering was performed by ordering
seismic events by magnitude in decending order, then removing all earthquakes that occur within a two kilometer radius.
Results of declustering substantially reduced the number of seismic events, from 2799 events to 564 events. An exploratory
linear regression analysis was performed on the new declusted data; however, results were genreally no better than
previous regressions. It is likely declustering was too liberal in removing earthquakes. The fewer earthquake events there
are, the smaller the errors in the prections need to be to get small R² values. A more focused study could attempt to
more accurately classify earthquakes as main vs. preshocks and aftershocks.
Results from the predictive arrangement (Figure 4) were insuffecient to accurately predict monthly occurrences of earthquakes,
as R² value were largly sufficiently below zero.
The best use for this model was roughly determining the offset range, that is, the lag between injection and seismic events, to
back-calculate the overall state-wide, monthly injection volume corresponding to the triggering threshold. As
discussed above, the best R² values from the correlation focused model suggested an offset range of 25 to 51 weeks. Prior, to
applying any offset, the injection threshold appears to be approximately 10,000,000 m³ of injected fluid per month. After applying
the offset resulting from the model, the injection threshold drops to approximately 8,600,000 m³ to 8,000,000 m³ of injected fluid
per month; however, those values were reached in August 2017 and April 2017, respectively. Despite monthly injection values dropping
below the caluculated threshold, earthquake events continue above pre-threshold levels, suggesting additional factors responsible
for continued seismic activity.
As such the 8,600,000 m³ to 8,000,000 m³ of injected fluid per month range should be considered a maximum, and could be used as such
for future models. As R² values reached a maximum of approximately 0.9 for the 25 to 51 week offset range, an additional 10% of the
variance can be explained through other varaibles.
The calculated threshold failed to solely predict the approriate threshold at which seismic events would return to their pre-threshold values. Additional work should focus on the spatial relationships between the earthquake events and local injection volumes. This project correlation statewide injections with statewide earthquakes; however, earthquakes are only dependent on local injection volumes. Additonally, further attempt should be made to accurately decluster earthquake events. Proper classification of main earthquakes vs. pre/aftershocks could narrow down which earthquakes result from injection and which are the result of a larger main earthquake.
Overall, the linear regression results between injection volumes and increase in seismicity present high enough R² values to suggest that there is correlation between the increase in injection volumes since 2011 and the increase in seismic events in Oklahoma. While the monthly predictive capability of the model presented poor results, the model ultimately provised a possible maximum monthly injection threshold target for the State of Oklahoma in order for statewide earthquakes to reach pre-2011 levels.
[1] Langenbruch, C and Zoback, M -
How will induced seismicity in Oklahoma respond to decreased saltwater injection rates?
[2] United States Geological Survey -
Oklhoma has had a surge of earthquakes since 2009. Are they due to fracking?
[3] Hincks, et al. -
Oklahoma's induced seismicity strongly linked to wastewater injection depth
[4] Bureau of Economic Geology -
Arbuckle Group, Oklahoma
[5] Oklahoma Corporation Commission -
Oil and Gas Data Files
[6] United States Geological Survey -
API Documentation - Earthquake Catalog
[7] Zhai, et al. -
Pore-pressure diffusion, enhanced by poroelastic stresses, controls induced seismicity in Oklahoma
[8] McGarr, A and Barbour, A -
Wastewater Disposal and the Earthquake Sequences During 2016 Near Fairview, Pawnee, and Cushing, Oklahoma
The goal of this project was to apply methods learned during my
data analytics program to a problem more relevant to my interests and background - a large chunk
of homework and classwork in the proram deals with neutral, curated datasets. Of course, that is
not unexpected, but it doesn't allow for much in the way of creativity. Enter this project. Immediately
after graduating with my MS, I worked for the Oklahoma Geological Survey where I contributed to a project
looking at the relationship between wastewater injection and seismic activity. Note that that was in 2016,
right in the midst of the earthquake madness. So I thought it would be neat to revisit the problem half a
decade later.
Expectedly, scope creep occured quite quickly in this project. What started out as a way to practice some linear
regression turned into a madhouse of data cleaning, manipulation, learning about seismicity , and reading
scientific papers dealing with this very topic. In short - the entire topic of
induced seismicity relating to injection wells is a pretty dang complex topic with a lot of confounding
factors that were far beyond the original scope of the project. So, while the results of my analysis
leave a little to be desired, I still chalk up this project up as a win. I got a ton of practical
practice with Python (especialy Pandas) and I learned about manipulating and cleaning data and how it can
affect the results of regression analyses. Yes, there is definitely more that can be done with the project;
however, in time-limited circumstances, eventaully you reach a point of diminishing returns and you need
to move on to another project that allows for learning of new topics.