Outliers, don’t kill my vibe… or my model

Kristin Cooper
May 10, 2021

Welcome to the second installment of my blog as a new data science student. I am now in week 11 of Flatiron School’s online data science bootcamp — which means I now know enough Python to be dangerous, my statistics and calculus brain cells, which have been hibernating for the last 10 years, are stretching back into action (shoutout to my high school math teachers), and I’ve built my first model!

The modeling workflow can get complex quickly — there are many tools available, but no one-size-fits-all formula that always yields the best model. The best approach depends on the use case, underlying data, biases, and root question you’re trying to answer.

One ubiquitously important step in generating a reliable model is dealing with outliers.

Some modeling basics

The purpose of a statistical model is to predict something that hasn’t happened yet, based on a bunch of things that have happened. Said a bit more mathematically, a model is fed a sample dataset, draws conclusions about which factors influence the target value and by how much, then uses those conclusions to predict the target value of a new sample.

Logically, you might expect models to draw better conclusions if they have bigger samples to train on. The Central Limit Theorem backs up this idea, stating that the distribution of sample means converges to a normal distribution centered on the population mean as the sample size increases. As you add more elements to a sample, by the laws of probability you will add many values that are likely to occur and only a few values that are not as likely to occur.
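A quick simulation makes this concrete (a hedged sketch assuming only numpy; the population and sample sizes are made up for illustration). Individual draws from a skewed population are far from normal, but the means of repeated samples cluster more and more tightly around the population mean as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(42)
# A deliberately skewed "population" -- individual values are NOT normal
population = rng.exponential(scale=3.5, size=100_000)

# For each sample size n, measure how spread out the sample means are;
# the spread should shrink as n grows, per the Central Limit Theorem
spread_by_n = {}
for n in (5, 50, 500):
    sample_means = [rng.choice(population, size=n).mean() for _ in range(500)]
    spread_by_n[n] = float(np.std(sample_means))

print(spread_by_n)  # spread of the sample means shrinks as n increases
```

Plotting histograms of `sample_means` for each `n` would show the three increasingly narrow bell curves from the figure below.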

Three normal distributions with increasing sample sizes (“n”)

You also may deduce that the sample fed to the model should be representative of the new samples you want the model to predict. For example, feeding a model a sample of real estate data from Akron, Ohio will probably not set it up to accurately predict home prices in New York City, and vice versa.

Pop culture example

Let’s introduce a real-life example to bring this to life (source):

In 2019, the average length of a song on the Billboard Hot 100 was 3 minutes and 30 seconds (3:30). If we took a huge sample of songs, we would expect most of them to clock in around 3:30, tapering off as you get much shorter and much longer than 3:30. A model trained on that huge sample would probably do pretty well at predicting the length of a new popular song.

Conversely, Kendrick Lamar’s popular 2012 album good kid, m.A.A.d city comprises 13 songs that are quite a bit longer than the rest of the population, with a mean of 5:37, a minimum of 3:33, and a maximum of 12:04. A model trained on a sample comprised only of songs from good kid, m.A.A.d city probably would not do very well at predicting the length of a new popular song.

good kid, m.A.A.d city illustrates a few important points about modeling:

  1. Sample size is key — since the bad model only has the 13 songs on good kid, m.A.A.d city to learn from, it’s going to think 12-minute songs occur ~7.7% of the time (1 song out of 13 in the sample) and 3 and a half minute songs must occur ~15% of the time (2 songs out of 13 in the sample). Adding more songs to the sample would have introduced a lot more 3–4 minute songs and probably very few, if any, 12 minute songs, therefore teaching the model that 12-minute songs actually represent way less than 7.7% of the population.

2. Not every mathematically calculated outlier is problematic — outliers in a distribution are found mathematically by calculating the 25th and 75th percentile values, adding a buffer (typically 1.5 times the interquartile range), then finding the sample values that fall outside of that range. There is some finesse to determining whether the points that fall outside of that range are problematic enough to remove from the dataset, or whether they’re just part of the left and right tails that every normal distribution has.

By Jhguch at en.wikipedia, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=14524285

3. Problematic outliers can teach models wrong lessons — some outliers, like Lamar’s 12:04 song Sing About Me, I’m Dying Of Thirst, are so unlikely to occur in the real world that it’s probably better not to train a model on them at all. We refer to these as “extreme outliers,” and they can be quickly identified on a graph as those values way outside the min and max fences.
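To make point 3 concrete, here’s a toy sketch (the song lengths below are made up for illustration, not real chart data): a sample of lengths clustered near 3:30, plus one 12:04 track standing in for Sing About Me, I’m Dying Of Thirst, run through the same IQR fence math used later in this post:

```python
import numpy as np

# Hypothetical song lengths in seconds, mostly near 3:30 (210s),
# plus one 12:04 (724s) track
lengths = np.array([185, 195, 200, 205, 210, 210, 215, 220, 228, 240, 255, 724])

q1, q3 = np.quantile(lengths, [0.25, 0.75])
iqr = q3 - q1  # same value scipy.stats.iqr(lengths) would return
min_fence = q1 - 1.5 * iqr
max_fence = q3 + 1.5 * iqr

outliers = lengths[(lengths < min_fence) | (lengths > max_fence)]
print(outliers)  # only the 724-second track is flagged
```

The 12-minute track sits far beyond the max fence, while the ordinary 3–4 minute songs all fall safely inside it.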

Making all these calculations and graphs can become an involved, time-consuming preprocessing step. As part of my first modeling project, I created a function that helped make this step much more efficient and insightful. I’ll walk through the function in detail below.

Enter: plot_iqr()

Cutting right to the chase, here’s the function followed by a simple workflow that I followed. Please note, the function is written for univariate continuous numerical features and would need some tweaking to work with other data types.

Packages:

import pandas as pd
import numpy as np
import scipy.stats as stats
import plotly.express as px
import plotly.graph_objects as go

Function:

def plot_iqr(df, col, title=None, marker_height=100, line_height=90):
    '''
    Create a histogram with marginal box plot, plus markers and labels
    at the max fence and min fence to show where there are outliers in
    your dataset.

    Parameters:
    df = pd.DataFrame, main data source
    col = column name in df to plot
    title = title for graph (default=None)
    marker_height = int, where on the y-axis to place the text label;
        set based on the highest bin's count
    line_height = int, where to end the vertical line markers; set
        based on the highest bin's count

    Returns:
    tuple (min_fence, max_fence)
    '''
    # Calculate the IQR, Q1, Q3, and the minimum and maximum fences
    iqr = stats.iqr(df[col])
    q1 = np.quantile(a=df[col], q=.25)
    minimum = q1 - 1.5 * iqr
    q3 = np.quantile(a=df[col], q=.75)
    maximum = q3 + 1.5 * iqr
    # Clamp the fences to the observed data range
    if minimum < df[col].min():
        min_fence = df[col].min()
    else:
        min_fence = minimum
    if maximum > df[col].max():
        max_fence = df[col].max()
    else:
        max_fence = maximum

    # Visualize distribution of target
    fig = px.histogram(df, x=df[col], marginal='box', title=title)
    fig.update_layout(plot_bgcolor='#f2f2f2', bargap=0.1,
                      height=500, title_font_size=18, font_size=12)

    # Create string variables formatted with commas for readable labels
    min_fence_str = '{:,}'.format(int(min_fence))
    max_fence_str = '{:,}'.format(int(max_fence))

    # Plot max marker
    fig.add_trace(
        go.Scatter(x=[max_fence], y=[marker_height],
                   text=max_fence_str, mode='text',
                   name='Max Fence: {}'.format(max_fence_str))
    )
    fig.add_shape(type='line', x0=max_fence, y0=0,
                  x1=max_fence, y1=line_height)

    # Plot min marker
    fig.add_trace(
        go.Scatter(x=[min_fence], y=[marker_height],
                   text=min_fence_str, mode='text',
                   name='Min Fence: {}'.format(min_fence_str))
    )
    fig.add_shape(type='line', x0=min_fence, y0=0,
                  x1=min_fence, y1=line_height)

    fig.show()
    return min_fence, max_fence

Workflow:

  1. Select a column in df to explore; run plot_iqr().
  2. Evaluate the distribution of data. If there are extreme outliers on either end, continue to step 3. If there are only mild outliers, you can move on to the next col.
  3. Evaluate the extreme outliers and decide on an approach: keep ONLY if all signs point to the value being natural/non-error, there are not too many data points above the max fence, AND you’re concerned about sample size. A better option might be to replace the outlier(s) with a more likely value, like the median. Finally, if you have plenty of elements in your sample and are aiming for accuracy, remove the whole row. Save new_df after outliers have been processed.
  4. Rerun the plot_iqr() function with new_df and reevaluate the distribution of values.
  5. Repeat 1–4 until new_df is outlier-free!
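The looping part of this workflow (steps 2–5) can be sketched as a simple helper. This is a hypothetical, plot-free companion to plot_iqr() shown only to make the repeat-until-clean idea concrete; the toy prices below are fabricated:

```python
import numpy as np
import pandas as pd

def remove_extreme_outliers(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Repeat the fence calculation and row removal until no values
    in col fall outside the min/max fences."""
    while True:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        min_fence, max_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        mask = df[col].between(min_fence, max_fence)
        if mask.all():
            return df  # outlier-free with respect to the current fences
        df = df[mask]

# Toy data: 200 ordinary prices plus two extreme ones
rng = np.random.default_rng(0)
prices = np.concatenate([rng.normal(400_000, 80_000, 200), [5e6, 8e6]])
clean = remove_extreme_outliers(pd.DataFrame({'price': prices}), 'price')
print(len(clean))  # the two multimillion-dollar rows are gone
```

In practice you would keep the human-in-the-loop judgment from step 3 (keep vs. replace vs. remove) rather than removing everything automatically; this sketch only automates the remove-and-recheck path.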

Explain, please!

Let’s break down the components of plot_iqr() using the example of my project — predicting home sale price in King County, Washington, USA (dataset found on Kaggle here and my full analysis here).

Parameters:

plot_iqr() takes df, a pandas DataFrame; col, the name of a column in the dataframe to investigate; and three parameters for graphing: title, the graph’s title; marker_height, where on the y-axis the text label is located; and line_height, how tall the vertical lines at the min and max fences are.

Output:

plot_iqr() displays a histogram with markers at the minimum and maximum fence and marginal box plot, and returns a stored 2-tuple (min_fence, max_fence).

Example of plot_iqr() function output, as used to review home price distribution in King County, WA dataset

Mathematically calculate outliers:

The first part of the function calculates the interquartile range, 25th percentile, 75th percentile, minimum fence, and maximum fence.

Source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
  • Interquartile Range — the difference between the 75th percentile value and the 25th percentile value. 50% of the data lies in the interquartile range. SciPy’s stats package has an iqr() function that returns this handily.
  • Q1 & Q3 — the values of the 25th percentile (Q1) and 75th percentile (Q3). Those values are calculated easily using numpy’s quantile() function.
  • Minimum & Maximum Fences — the minimum fence is calculated by multiplying the IQR by 1.5 and subtracting the result from Q1, and the maximum fence by adding IQR*1.5 to Q3. For roughly normal data, ~99.3% of values fall between the minimum fence and maximum fence. Points outside the fences are considered outliers.
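A tiny worked example ties the arithmetic together (toy numbers, not the King County data):

```python
import numpy as np
import scipy.stats as stats

values = np.array([10, 12, 13, 14, 15, 15, 16, 17, 18, 40])

iqr = stats.iqr(values)               # Q3 - Q1 = 16.75 - 13.25 = 3.5
q1 = np.quantile(values, 0.25)
q3 = np.quantile(values, 0.75)
min_fence = q1 - 1.5 * iqr            # 13.25 - 5.25 = 8.0
max_fence = q3 + 1.5 * iqr            # 16.75 + 5.25 = 22.0

outliers = values[(values < min_fence) | (values > max_fence)]
print(outliers)  # only 40 falls outside the fences
```

Here the value 40 lands far above the max fence of 22.0, so it would be the one to scrutinize.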

Sometimes in the real world, the fences may be calculated as a value that doesn’t make sense. For example, my model attempted to predict home sale price, but the calculated minimum fence was negative. Clearly a home cannot sell for negative $ (though that would be cool!).

To control for this, the function includes a pair of if/else checks that set the min_fence variable to the larger of the smallest value found in the sample set OR the calculated minimum fence, and the max_fence variable to the smaller of the largest value found in the sample set OR the calculated maximum fence.

# Calculate the IQR, Q1, Q3, and the minimum and maximum fences
iqr = stats.iqr(df[col])
q1 = np.quantile(a=df[col], q=.25)
minimum = q1 - 1.5 * iqr
q3 = np.quantile(a=df[col], q=.75)
maximum = q3 + 1.5 * iqr
# Clamp the fences to the observed data range
if minimum < df[col].min():
    min_fence = df[col].min()
else:
    min_fence = minimum
if maximum > df[col].max():
    max_fence = df[col].max()
else:
    max_fence = maximum

Visualize using plotly express:

plot_iqr() calls upon the plotly express graphing library to plot a histogram of the target col given to the function. A marginal box plot is added at the top of the visual to show the number of outliers more clearly. If you’ve given the function a title (optional parameter), it will be used here. The fig.update_layout line sends some formatting parameters, which aren’t necessary for the function but align with my personal preferences.

I chose to use plotly as my graphing library of choice because it’s very simple to add the marginal box plot, and the graph is automatically interactive with hover data, no additional code needed. Check out my blog post to learn more about plotly.

# Visualize distribution of target
fig = px.histogram(df, x=df[col], marginal='box', title=title)
fig.update_layout(plot_bgcolor='#f2f2f2', bargap=0.1,
                  height=500, title_font_size=18, font_size=12)

Add markers at the Minimum and Maximum Fences:

Discretely showing where the minimum and maximum fences are on the histogram helps a viewer focus in on those mathematically calculated outliers. The next part of the function adds vertical lines and text labels to the histogram at those values.

First off, to help with readability, we format our min_fence and max_fence values with commas at the thousands place and remove decimals. Whenever we physically show these values on the graph, we’ll use the new string-formatted variables.

Next we plot the max and min markers, each following the same code pattern:

  • Label — .add_trace() gives us a text label (using our newly string-formatted min/max variables for readability) at the top of the vertical line. Notice the y-value of this text label is set by the marker_height parameter you sent the function. You may need to play around with marker_height to your satisfaction, but it should always be a bit higher than your line_height.
  • Line — .add_shape() gives us the vertical line at x=min_fence/max_fence, running from y=0 to y=line_height. Again, your line_height may need some finessing depending on your specific distribution. Visually, you want your line to extend to the y-value of the most populous bin in your histogram.
# Create string variables formatted with commas
min_fence_str = '{:,}'.format(int(min_fence))
max_fence_str = '{:,}'.format(int(max_fence))

# Plot max marker
fig.add_trace(
    go.Scatter(x=[max_fence], y=[marker_height],
               text=max_fence_str, mode='text',
               name='Max Fence: {}'.format(max_fence_str))
)
fig.add_shape(type='line', x0=max_fence, y0=0,
              x1=max_fence, y1=line_height)

# Plot min marker
fig.add_trace(
    go.Scatter(x=[min_fence], y=[marker_height],
               text=min_fence_str, mode='text',
               name='Min Fence: {}'.format(min_fence_str))
)
fig.add_shape(type='line', x0=min_fence, y0=0,
              x1=min_fence, y1=line_height)

Plot & return stored values:

Last but most important, show your graph and return stored minimum and maximum values to be used in the next step of the workflow.

fig.show()
return min_fence, max_fence
Example output of plot_iqr() function

Cool function, what now?

Now that you understand what’s under the hood of plot_iqr(), you can use it in a workflow on real-life data.

In my dataset on King County real estate, this function came in handy to explore multiple features including sale price, lot size, neighboring lot size, and home square footage. I recommend checking out section 3.6 Deal with Outliers of my report to see this in action.

Taking price as an example, I found there to be several extreme outliers way above the maximum fence price of $1.13M. Because I was seeking to build a model that predicted home price based on multiple factors, it was risky to keep these extreme values in my dataset for the model to draw conclusions from. Extremely expensive homes probably don’t follow the same patterns as the vast majority (~99.3%) of homes which fell between $78k and $1.13M.

For this reason, and because I had plenty of data in my sample, I decided to remove all outliers above the maximum fence sale price.

Because the max_fence value is stored in the results of my plot_iqr() function, it’s very easy to remove these values, save a new df, and review the resulting distribution.

Step 1: Run plot_iqr() and save the results as a variable:

sale_price = plot_iqr(df, 'price',
                      title='Price Distribution Before Outlier Removal',
                      marker_height=930, line_height=900)
sale_price

Step 2: Save a new dataframe with price outliers above the calculated max_fence removed.

Remember, max_fence is stored as the second item in the sale_price tuple.

df_outliers_removed = df[df.price < sale_price[1]]

Step 3: Check new distribution

sale_price_new = plot_iqr(df_outliers_removed, 'price',
                          title='Price Distribution After Outlier Removal',
                          marker_height=540, line_height=510)
sale_price_new

This distribution is much less risky, since there are no extreme outliers.

Remember, the IQR/max/min method of mathematically calculating outliers will almost always calculate “outliers,” and not all calculated outliers are problematic. Outliers become problematic when they are extreme, falling very far away from the calculated minimum/maximum fences.

These outliers won’t kill Kendrick Lamar’s vibe, or my model.
