CURTIS S. SMITH, P.G.

Search for a beer using the search bar below. Start typing to see matches in real time, then click on the desired entry and 10 beers will be recommended to you based on the results of a K-Means clustering algorithm.

Project Description

This web app is the result of the final project for the data analytics course. Initially, a dataset from Kaggle (Beer Information - Tasting Profiles) was used for the model; however, after initial EDA (including a preliminary K-Means clustering algorithm), we determined that additional information could help with model accuracy. The Kaggle dataset initially contained approximately 5,500 beers scraped from BeerAdvocate. The scraped entries contained counts of key words from 25 reviews, grouped into 11 taste profiles (e.g., the "fruity" taste profile contained key words such as "berries", "fruit", "juice", and "tropical") to provide scores for each beer. The flavor profiles are: fruity, hoppy, spices, malty, bitter, sweet, sour, salty, astringent, body, and alcoholic.
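The key-word counting described above can be sketched as follows. This is a minimal illustration, not the dataset uploader's actual script: the function name `profile_scores`, the "hoppy" word list, and the whole-word matching rule are assumptions; only the four "fruity" key words come from the text.

```python
import re
from collections import Counter

# Key words per taste profile. Only the "fruity" list is taken from the
# description above; the "hoppy" list is a made-up placeholder.
PROFILE_KEYWORDS = {
    "fruity": ["berries", "fruit", "juice", "tropical"],
    "hoppy": ["hops", "hoppy"],  # assumed for illustration
}


def profile_scores(reviews, keywords=PROFILE_KEYWORDS):
    """Count whole-word key word hits across all reviews, per profile."""
    word_counts = Counter()
    for review in reviews:
        # Lowercase and split into words so "Fruit," matches "fruit".
        word_counts.update(re.findall(r"[a-z']+", review.lower()))
    return {
        profile: sum(word_counts[word] for word in words)
        for profile, words in keywords.items()
    }


reviews = [
    "Tropical fruit up front, then juice and more fruit.",
    "Very hoppy, loads of hops and some berries.",
]
scores = profile_scores(reviews)
```

Because matching is on whole words, "fruity" would not count toward "fruit" in this sketch; the original scoring may have handled word variants differently.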

After we contacted the Kaggle dataset uploader, they provided us with the key words used to calculate each profile (and also uploaded them to the Kaggle page linked above). We then created a web scraping script to scrape data for the top-ranking beers from each substyle (e.g., "Stouts" contain the substyles "American" and "Irish Dry") that had at least 75 reviews. We determined that more reviews would better differentiate the flavor profiles for each beer and reduce clustering overlap.

After obtaining the newly scraped data, we re-ran the K-Means clustering algorithm and received better silhouette scores for each cluster, along with recommended beers that felt more similar to the input beer. The dataset was tested using both the min/max scaler and the standard scaler. Both returned similar silhouette scores; however, the min/max scaler returned a slightly higher score, so it was chosen for the final model. The data were then clustered into 3 main clusters (classes), and each main class was clustered again into subclusters (subclasses), with the main clusters containing 7, 8, and 2 subclasses, respectively. Figure 1 (interactive) displays the number of beers in each class and subclass. Figure 2 (also interactive) presents the distribution of beer ratings.
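The scaler comparison and two-level clustering described above might look roughly like the sketch below, using scikit-learn. The data here are synthetic stand-ins for the 11 flavor-profile scores, and the helper `cluster_with_scaler`, the random seeds, and the fixed choice of 2 subclusters per class are assumptions made for illustration; the project's own code may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data: 3 blobs of 60 "beers" with 11 features each.
# This is NOT the project's dataset.
rng = np.random.default_rng(0)
centers = rng.uniform(0, 100, size=(3, 11))
X = np.vstack([c + rng.normal(0, 5, size=(60, 11)) for c in centers])


def cluster_with_scaler(scaler, data, k=3):
    """Scale the data, run K-Means, and return (silhouette, scaled data, labels)."""
    scaled = scaler.fit_transform(data)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    return silhouette_score(scaled, labels), scaled, labels


# Step 1: compare the two scalers by silhouette score.
scores = {
    "minmax": cluster_with_scaler(MinMaxScaler(), X)[0],
    "standard": cluster_with_scaler(StandardScaler(), X)[0],
}

# Step 2: cluster into main classes, then re-cluster each class into subclasses.
_, X_scaled, classes = cluster_with_scaler(MinMaxScaler(), X)
subclasses = {}
for c in np.unique(classes):
    members = X_scaled[classes == c]
    # The number of subclusters is fixed at 2 here purely for illustration.
    sub_model = KMeans(n_clusters=2, n_init=10, random_state=0)
    subclasses[int(c)] = sub_model.fit_predict(members)
```

In practice the number of subclusters per class would be chosen separately for each class, for example by comparing silhouette scores over a range of values.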


Figure 1: Sunburst chart displaying the number of beers in each class and subclass.

Figure 2: Histogram displaying distribution of beer ratings.


After modeling, a script was created that allowed a user to input a beer (contained within the dataset) and receive 10 recommended beers. The recommended beers will be within the same class and subclass as the input beer. The flavor profile scores for the input beer are compared to those of the other beers within the subclass, and the beers with the smallest difference in scores are returned. Following that proof-of-concept, the code was refactored to operate within the Django framework. This made the tool easier to use, since no code download is required, and allowed for a "live-search" function so the user can search for beers within the dataset in real time.
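The recommendation step described above can be sketched in plain Python. The `Beer` record, the `recommend` function, and the use of a sum of absolute differences as the "difference in scores" are assumptions for illustration; the text does not say which distance measure the project used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beer:
    name: str
    beer_class: int   # main cluster label
    subclass: int     # subcluster label within the main cluster
    profile: tuple    # scaled flavor-profile scores


def recommend(beers, target_name, n=10):
    """Return up to n beer names from the target's class and subclass,
    ordered by smallest total difference in flavor-profile scores."""
    target = next(b for b in beers if b.name == target_name)
    peers = [
        b for b in beers
        if b is not target
        and b.beer_class == target.beer_class
        and b.subclass == target.subclass
    ]

    def difference(beer):
        # Assumed measure: sum of absolute per-profile differences.
        return sum(abs(x - y) for x, y in zip(beer.profile, target.profile))

    return [b.name for b in sorted(peers, key=difference)[:n]]


# Tiny made-up catalog with two flavor profiles per beer.
catalog = [
    Beer("A", 0, 0, (0.1, 0.9)),
    Beer("B", 0, 0, (0.2, 0.8)),
    Beer("C", 0, 0, (0.9, 0.1)),
    Beer("D", 0, 1, (0.1, 0.9)),
    Beer("E", 1, 0, (0.1, 0.9)),
]
top_two = recommend(catalog, "A", n=2)
```

Here beers "D" and "E" are never recommended for "A", even though their scores are identical to A's, because they fall in a different subclass or class.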

A Tableau storyboard displaying various metrics from data exploration and creation of the model can be found in the following link: Tableau Storyboard

All of the code for the project is located on the following GitHub repository: GitHub Repository

This project was a joint venture - links for the project members are listed below:

LinkedIn || GitHub

curtis.smith.geo@gmail.com