Below is a write-up for an analysis performed on the relationship between the volume of injected fluids and occurrence of 
            earthquakes in Oklahoma. Assocated documentation and code can be found on the GitHub repository: 
            Oklahoma Earthquakes GitHib repository.
            
            The project was a personal side project not associated with the class. The goal was to intersect my geologic background and 
            recently learned analyical methods learned. Below the formal write-up is an informal discussion where I discuss my thoughts 
            on the project and lessons learned. 
        
Oklahoma has experienced a marked increase in seismicity since 2009 which has been strongly linked to the increase in saltwater disposal (SWD) volumes injected into the Arbuckle Group [1, 2, 3]. This SWD is a direct result of hydralic fracturing (informally known as fracking); the hydraulic fracturing requires an enormous amount of water that must be disposed of after the hydrocarbon recovery process - in 2015, the year with the highest injection volume, nearly 330,000,000 m³ of wastewater were injected into the Arbuckle Group in Oklahoma, with more volume injected into various other formations. This project attempts to model the realationship between the increase of injection volumes and the increase in seismic events within Oklahoma.
The Arbuckle Group is a relatively poorly understood geological formation of approximately Late Cambrian - Ordovician age (approximately 480-490 million years old). The Arbuckle Group is primarily composed of nearly pure carbonates, which, through karst dissolution processes, have produced significant amounts of porosity making it a suitable candidate for wastewater storage [4]. However, the Arbuckle Group has hydraulic communication with the crystalline basement below containing pre-existing faults under stress. It is hypothesized that injection of wastewater into the Arbuckle Group causes pressure on the pre-existing faults, thus reducing the frictional resistance, inducing seismicity [1].
Waste water injection wells are used to inject fluid into porous geologic formations. The waste water is typically a brine mixed with proprietary chemicals used in the hydraulic fracturing process. The wells included in this study are Class II wells, used for fluids associated with oil and natural gas production.
Acquisition of data for this project was relatively stright forward: injection data was downloaded as XLSX files from the Oklahoam Coproration Commission's (OCC) website [5], and the earthquake data was retreived via API from the United States Geological Survey (USGS) [6]. Additionally Oklahoma Geological Survey (OGS) has data on seismic events; however, their download interface is less suited for large scale downloads and do not include earthquake data post 2018.
            This section details pre-processing of earthquake and injection data prior to linear regression modeling, 
            along with further processing descisions resulting from exploratory data analysis.
            
            
                Wastewater injection data from the the OCC website were provided in three major categories:
                
                1) Daily injection volumes from SWD wells located in the Area of Interest (AOI) completed in the Arbuckle 
                Group. Each years' data is contained in one XLSX file, and the years 2012-2021 are represented, 
                and updated daily. However, daily injection rates for these wells were not required until 27 March 
                2016, thus, data prior to that date are considered incomplete. As shown in Figures 1 and 2, the 
                dropoff in injection volumes during 2020 are possibly die to incomplete data. 
                
                2) Monthly injection volumes for all SWD wells located in Oklahoma. Each years' data is contained 
                in one XLSX file; there is monthly data for 2011-2020. It's noted, though, that data post 2019 is 
                considered incomplete and is subject to being updated.
                
                3) Yearly injection volumes for all SWD Wells located in Oklahoma from 2006-2010. 
                
                During intital viewing of the data, it was noticed that there was a discrepency between volumes 
                from the daily and monthly injection data files: after approximately 2015, the daily rates summed to
                a monthly volume were high as 3,179,746 m³ (~840,000,000 gallons; 20,000,000 barrels). After analysis 
                of the data, it was determined that there were  different reported volumes for the same wells in 
                both data sets. Ostensibly, the monthly data should have the complete data, as, per the OCC website, 
                the monthly injection volumes sorted by year "...contains annual reporting of volumes for all injection 
                wells submitted to the Underground Injection Control (UIC) Department..." [5], while the daily data should only 
                be a subset of the total wells located in the AOI. Figure 1 (below) displays the discrepency. 
                
            
 
                
            Figure 1: Comparison of Wastewater Injection Volume Sources
                Figure 1 displays the differences in volumes between the monthly, daily, and combined data sources. As stated above, daily 
                volumes (red; note that daily volumes were grouped and summed by month) were not required to be submitted 
                to the OCC until 27 March 2016; however, the large increase in volume corresponds 
                to March 2015 - it is possible that there is a typo on the OCC website. Due to the lack of data prior to 
                March 2015, the daily data could not be used on its own for the analysis. Monthly data is available as far 
                back as 2011, and makes up the bulk of the injection numbers used for analysis. Monthly data is available for 
                every well in Oklahoma - as such, the monthly data were filted by well identification number to match wells 
                included in the daily data, as those wells are located in the AOI. However, after filtering the monthly data 
                and grouping the daily data by month, the monthly injection volumes did not match. The blue line in Figure 1 
                displays the filtered monthly injection volumes. 
                
                The daily and monthly data were then combined, with duplicate well identification numbers with lower total monthly
                volumes dropped; the green line in Figure 1 displays the combined data. After combination, it was decided that 
                the monthly data would be used for analysis without combining with the daily data since it is impossible to know whether the daily or 
                monthly datasets contain the values more closely aligned with true injection volumes, thus, there was no adequate 
                basis for determing which of the values to drop.  However, the volumes and trends identified here within are 
                analagous to other independent and peer reviewed studies [1, 7].
                
                It should be noted that, due to the limitations and goals of this study, the well injection volume data 
                here within should not be considered completely representative of the actual well injection volumes: a more complete
                study should involve communication with the OCC to resolve possible discrepencies in their data.
            
Data were downloaded from the USGS website via an API call; data cleaning involved removing earthquake entries with negative magnitude and those with missing data (e.g., no magnitude, geographic coordainates). The earthquake data were then filtered for only events which occur wthin 20 km of an injection well. The spatial relationship between the seismic event location and injection locations is noted to be an important attribute: 20 km was chosen to be a reasonable distance after comparison with other studies. Lastly, only earthquakes of magnitude 3 or higher were included in the analysis, per convention with literature [1, 3, 7, 8]. The interactive map below displays the earthquakes per year.
                Preprocessed data were imported into a Jupyter Notebook and combined for intitial visualization and analysis. 
                Minor aditional data manipuation was performed to prepare the datasets for plotting and 
                regression analysis. The general structure of data used for analysis was as follows: only earthquakes within 
                20km of an injection well were included, earthquakes were filtered such that only earthquakes of magnitude 3.0
                or higher were included, and earthquake events were then grouped and summed by month to match the resolution of the 
                injection data. 
                
                Consideration for running analysis on the daily earthquake data and daily injection data was ultimitaly discarded 
                for a number of reasons discussed in this paragraph. First, temporal distribution of seismic events is an important 
                factor, thus, it was deeemed necessary that analysis should capture the largest timeframe possble - this required 
                balancing the resolution of injection data. As described above, yearly injection data was available from 2006-2010 - 
                this resolution was deemed too coarse for our needs, and did not caplture the increase in seismic activity. Daily injection volume 
                was available from 2012 to 2021; however, daily volumes were not required to be submitted prior to March 2016, thus
                any data prior to that data could not be considered complete. As such, daily cata could only be considered reliable post March 2016. 
                As shown below in Figure 2 (note that Figure 2 is interactive), starting analysis post March 2016 would exclude a 
                large proprtion of data, including the increase in both injection rates and earthquake events. Monthly data was available 
                from 2011 to 2020. After the prior considerations, it was determined that monthly data would strike the best balance 
                between temporal distribution and temporal resolution. 
                
            
Figure 2: Monthly Count of Earthquakes of Magnitude 3 or Greater vs. Monthly Injection Rates.
                Two important details were noticed after plotting the monthly injection volumes and monthly earthquake occurrences 
                through time: 1) There was a lag between increasing injection rates and increased seismicity, and 2) there appeared 
                to be a critical threshold of injection volume required necessary prior to trigging statewide, consistent, earthquake 
                events. Each of the aformentioned details are described in greater detail in the following paragraphs. 
                
                The time lag is evident in Figure 2 and is discussed by Langenbruch and Zoback [1], where they state that "A time lag 
                of several months between the injection of fluid and associated occurrence of earthquakes is not unexpected." Further 
                noting that a lag of "several years" is common. To account for the time lag in this analysis, in inciments of seven 
                days, time was subtracted from seismic events so as to simulate the occurence of seismic events occurring 
                "concurrently" with thir respective injection volumes: offsets were calculated up to two years by. 
                
                Additionally, there is a threshold of injection volumes required to "kick-start" the seismic processes, with the exception of the 
                November 2011 "Prague" earthquake (magnitude 5.6). Note that this analysis is a state-wide look at earthquake 
                activity, as such, it is possible that the injection volume local to the "Prague" earthquake was suffeciant to trigger 
                the seismic event, but state-wide volume was still below the threshold for more wide spread events.  
                There are a multitude of factors that can influence the occurrence of and individual seismic event, and is beyond 
                the scope of this project to account for all of them. 
                
                As shown in Figure 2, the trigging threshold appears to be 
                10,000,000 m³ of injected fluid per month, which was achieved approximately December 2012; however, it is possible 
                that the triggering threshold is lower due to the time lag between injection volumes and seismic events - 
                down to approimately 8,000,000 m³ per month, if the lag is as high as two years as stated as a possibility 
                by Langenbruch and Zoback [1]. Additionally, it appears that once the triggering threshold is exceeded, dropping below 
                the triggering threshold does not cease seismic activity. Note that after injection volumes dropped below 10,000,000 m³ 
                per month (approximately April 2016), seismic activity continued at an increased rate relative to similar monthly 
                injection volumes prior to December 2012 when the triggering threshold was breached. When accounting for two-year time 
                lag and dropping the the triggering threshold 
                to 8,000,000 m³ per month, seismic activity still continued at an elevated rate. As seismic events are still occuring at 
                rates above the rates prior to the triggering threshold, the "reverse threshold", this is the monthly injection volume at 
                which seismic events reach their pre-triggering threshold levels, is unknown. However, 2020 has had the lowest occurence 
                of earthquakes of magnitude 3 or greater, with monthly counts only as high as 3 for January through May. The injection 
                data for 2020 is considered incomplete by the OCC, thus, we cannot accuratly draw a correlation between 2020 seismic 
                events and their concurrent injection volumes. 
            
            Linear regression anaylsis was performed on data from 2011 to 2019, because, as mentioned above, injection data post 2019 
            is considered incomplete. Splitting of the training and testing data was done under two arrangments: 1) the training and 
            testing data were split randomly using "train_test_split" from sklearn, and 2) using data from 2011-2018 as the training 
            data, and data from 2019 as the testing data to test the model's predictive capability on past data. 
            
            Figure 3 (below) displays the R² values for the results of random split arrangment. A linear regression model was perfomed 
            using the complete offset earthquake dates, up to two years, as descibed above. Overall shape of the curve followed a rough 
            "hill" pattern, with maximum coefficient of determination (R²) values in the center, and lower R² on the sides. Testing the model with the random subset 
            (25% of the data) produced an R² of 0.45, that is, injection volumes accounted for appoximately 45% of the variance in 
            the model. Subsequent offsets of seven days continuously increased model performance up to a local maximum of 0.89, 
            corresponding to an offset of 175 days (25 weeks). After 175 days, R² values followed a roughly sinusoidal pattern 
            with periods ranging from 49 to 70 days leading to a maximum R² value of 0.91, corresponding to an offset of 357 days 
            (51 weeks). Following the maximum, the sinusoidal pattern became less pronounced and decreased to a minumum R² of -17.95, 
            corresponding to the maximum tested offset of 728 days (104 weeks). The mechanism behind the sinusoidal pattern has not 
            been determined; however, it is plausible that the pattern an artifact due to clustering of earthquakes by month - due to large 
            spikes in earthquake occurrences resulting from aftershocks (discussed in greater detail in the Discussion Section), 
            there exist combinations of months with disproportionally large counts of earthquakes relative to the general trend. 
            Data not fitting the general trend, that is, outliers, negatively affect performance of the model lower the R² scores. 
        
 
            
        Figure 3: R² Values vs. Offset in Weeks - Random Train/Test Data
Results of the second second arrangement, where the model was used in a predictive capacity to test earthquake occurrences in 2019, are shown in Figure 4 below, again, displaying R² values versus the offset in days. Results followed a rough "valley" shape, with relatively higher R² values towards the ends of the graph and a minimum R² in the center. R² results from the predictive arrangement are considerably lower than results from the random split arranement, and the resulting pattern is roughly inverse, though much less defined. The maximum R² value was -0.23, corresponding to an offset of 21 days. The minimum offset was -58.57, corresponding to an offset of 434 days (62 weeks).
 
            
        Figure 4: R² Values vs. Offset in Weeks - 2019 Prediction
            As presented above, the results of the correlation arrangement produced a reasobale suffecient R² value to suggest that 
            there is a correlation between the volumne of injected water and the occurrence of seismic activity. In accordance with 
            the literature, there appears to be a lag between injection and seismic events - with an offset of 25 weeks resulting 
            in an R² value of 0.89. Following 25 weeks, a sinusoidal pattern emerges with peaks corresponding to an R² value of 
            approximately 0.90. While the sinusoidal pattern possibly results from grouping earthquake counts by month, the relatively 
            peaks suggest that there is no single value that corresponds to an ideal offset - that is, delays beween injection and  
            earthquakes likely range from 25 to 51 weeks, after which, the sinusoidal trend then follows a decreasing pattern with the 
            decreasing peak values with increasing offset (See Figure 3).
            
            Given the scope and generalizations required for this model(that is, this project represents the entirety of Oklahoma's 
            injection wells), a range of earthquake delays is not unexpected. Beyond selecting only earthquakes that occur within 20 
            km of an injection well, this model does not take into acount the spatial relationship between injection and earthquake 
            focal point location.   
            
            In contrast to the relatively high R² value for determing the correlation between injection and earthquakes, the attempt 
            to predict occurrences in 2019 based on a linear regression model produced from data from 2011-2018 resulted in very low 
            R² values. This is not unexpected do to a variety of factors, most of which relate to the nature of earthquakes. 
            Individual earthquakes do not necessarily have a strict linear relationship with injection volumes and the resulting 
            propogated pressure wave; slippage along a fault requires a minimum pressure to trigger that slippage. Linearly increasing 
            local injection rates will not linearly increase the resultant count nor magnitude of earthquake prior to reaching the 
            pressure threshold for triggering the seismic event. After the seismic event, local faults realize a pressure stabilization 
            and reach a state of relative equilibrium. Aggregating earthquake counts over a large enough area helps to combat this 
            localization problem; however, the problem is not completely mitigated and appears when attempting to make a predicictions 
            of monthly earthquake counts.
            
            The second major problem for earthquake predictions deals with the occurrence of preshocks and (to a much greater extent) 
            aftershocks. The "Earthquake Declustering" section below discusses this in greater detail, along with a discussion of an 
            initial attempt to decluster earthquake events. 
        
            Large spikes in monthly earthquake counts, as shown in Figure 2, correlate to large magnitude earthquakes - as the 
            magnitude of an earthquake increases, the number of preshocks and aftershocks increases, with aftershocks commonly 
            occuring immediately after the main earthquake and into the following days. For example, McGarr and Barbour concluded 
            that 13 February 2016 "Fairview" earthquake was associated with as many as 63 foreshocks and 89 aftershocks [8]. Predicting 
            aftershock counts is beyond the scope of this project; however, an attempt was made to "descluster" earthquake 
            occurrences, that is, to isolate the main earthquake from the pre/aftershocks. Declustering was performed by ordering 
            seismic events by magnitude in decending order, then removing all earthquakes that occur within a two kilometer radius. 
            
            Results of declustering substantially reduced the number of seismic events, from 2799 events to 564 events. An exploratory 
            linear regression analysis was performed on the new declusted data; however, results were genreally no better than 
            previous regressions. It is likely declustering was too liberal in removing earthquakes. The fewer earthquake events there 
            are, the smaller the errors in the prections need to be to get small R² values. A more focused study could attempt to 
            more accurately classify earthquakes as main vs. preshocks and aftershocks.
  
        
            Results from the predictive arrangement (Figure 4) were insuffecient to accurately predict monthly occurrences of earthquakes, 
            as R² value were largly sufficiently below zero. 
            
            The best use for this model was roughly determining the offset range, that is, the lag between injection and seismic events, to 
            back-calculate the overall state-wide, monthly injection volume corresponding to the triggering threshold. As 
            discussed above, the best R² values from the correlation focused model suggested an offset range of 25 to 51 weeks. Prior, to 
            applying any offset, the injection threshold appears to be approximately 10,000,000 m³ of injected fluid per month. After applying 
            the offset resulting from the model, the injection threshold drops to approximately 8,600,000 m³ to 8,000,000 m³ of injected fluid 
            per month; however, those values were reached in August 2017 and April 2017, respectively. Despite monthly injection values dropping 
            below the caluculated threshold, earthquake events continue above pre-threshold levels, suggesting additional factors responsible 
            for continued seismic activity.
            
            As such the 8,600,000 m³ to 8,000,000 m³ of injected fluid per month range should be considered a maximum, and could be used as such 
            for future models. As R² values reached a maximum of approximately 0.9 for the 25 to 51 week offset range, an additional 10% of the 
            variance can be explained through other varaibles. 
        
The calculated threshold failed to solely predict the approriate threshold at which seismic events would return to their pre-threshold values. Additional work should focus on the spatial relationships between the earthquake events and local injection volumes. This project correlation statewide injections with statewide earthquakes; however, earthquakes are only dependent on local injection volumes. Additonally, further attempt should be made to accurately decluster earthquake events. Proper classification of main earthquakes vs. pre/aftershocks could narrow down which earthquakes result from injection and which are the result of a larger main earthquake.
Overall, the linear regression results between injection volumes and increase in seismicity present high enough R² values to suggest that there is correlation between the increase in injection volumes since 2011 and the increase in seismic events in Oklahoma. While the monthly predictive capability of the model presented poor results, the model ultimately provised a possible maximum monthly injection threshold target for the State of Oklahoma in order for statewide earthquakes to reach pre-2011 levels.
            [1] Langenbruch, C and Zoback, M - 
            
                How will induced seismicity in Oklahoma respond to decreased saltwater injection rates?
            
            [2] United States Geological Survey - 
            
                Oklhoma has had a surge of earthquakes since 2009. Are they due to fracking?
            
            [3] Hincks, et al. - 
            
                Oklahoma's induced seismicity strongly linked to wastewater injection depth
            
            [4] Bureau of Economic Geology - 
            
                Arbuckle Group, Oklahoma
            
            [5] Oklahoma Corporation Commission - 
            
                Oil and Gas Data Files   
            
            [6] United States Geological Survey - 
            
                API Documentation - Earthquake Catalog
            
            [7] Zhai, et al. - 
            
                Pore-pressure diffusion, enhanced by poroelastic stresses, controls induced seismicity in Oklahoma
            
 
            [8] McGarr, A and Barbour, A - 
            
                Wastewater Disposal and the Earthquake Sequences During 2016 Near Fairview, Pawnee, and Cushing, Oklahoma
            
 
            
            
                The goal of this project was to apply methods learned during my 
                data analytics program to a problem more relevant to my interests and background - a large chunk 
                of homework and classwork in the proram deals with neutral, curated datasets. Of course, that is 
                not unexpected, but it doesn't allow for much in the way of creativity. Enter this project. Immediately 
                after graduating with my MS, I worked for the Oklahoma Geological Survey where I contributed to a project 
                looking at the relationship between wastewater injection and seismic activity. Note that that was in 2016, 
                right in the midst of the earthquake madness. So I thought it would be neat to revisit the problem half a 
                decade later. 
                
 
                Expectedly, scope creep occured quite quickly in this project. What started out as a way to practice some linear 
                regression turned into a madhouse of data cleaning, manipulation, learning about seismicity , and reading 
                scientific papers dealing with this very topic. In short - the entire topic of 
                induced seismicity relating to injection wells is a pretty dang complex topic with a lot of confounding 
                factors that were far beyond the original scope of the project. So, while the results of my analysis 
                leave a little to be desired, I still chalk up this project up as a win. I got a ton of practical 
                practice with Python (especialy Pandas) and I learned about manipulating and cleaning data and how it can 
                affect the results of regression analyses. Yes, there is definitely more that can be done with the project; 
                however, in time-limited circumstances, eventaully you reach a point of diminishing returns and you need 
                to move on to another project that allows for learning of new topics.