Data sources: https://www.kaggle.com/fivethirtyeight/fivethirtyeight-bad-drivers-dataset
https://www.census.gov/newsroom/press-kits/2018/pop-estimates-national-state.html
http://worldpopulationreview.com/e5cc582b-5aef-4049-a59e-c86ec83c0ed9
We chose the "bad drivers" dataset because it focuses on the rate at which each state's drivers are involved in fatal collisions and on how many of those collisions involve factors such as alcohol or speeding. The number of people who die in car accidents has risen over the years, and motor vehicle crashes remain one of the leading causes of accidental death. Consequently, we want to explore the relationship between fatal accidents and the factors behind them.
This dataset covers all 50 states plus the District of Columbia, so it gives us nationwide coverage. Most of the figures come from national bodies such as the National Highway Traffic Safety Administration and the National Association of Insurance Commissioners. We can analyze the relationship between each state's accident rate and its insurance premiums, and thereby evaluate how well insurance companies' quote systems work. For example, if a state has a relatively high insurance premium but its accident rate is not comparably high, we can conclude that insurance companies are not treating drivers in that state fairly. This is a social issue that affects many citizens. The dataset consists mostly of numerical variables. Some are rates, such as the "number of drivers involved in fatal collisions per billion miles"; others are percentages, such as the "Percentage of Drivers Involved In Fatal Collisions Who Were Speeding". The size of the dataset is well suited to our tutorial: if it were much larger, the notebook would take too long to run, which would not be good for presenting.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
plt.style.use("seaborn")
df_driver = pd.read_csv("./data/bad-drivers.csv")
df_driver.head()
df_raw = df_driver.copy()  # work on a copy so the original frame stays untouched
Adding longitude and latitude to our data
We want to build folium maps later to see whether there is any relationship between geographical location and the fatality rate or the insurance premium.
location = pd.read_csv("./data/statelatlong.csv")
location = location.drop(columns=['State'])            # drop the abbreviation column
location = location.rename(columns={'City': 'State'})  # the 'City' column holds the full state name
location.head()
# merge in the longitude and latitude dataset
merged_left = pd.merge(left=df_raw,right=location, how='left', left_on='State', right_on='State')
df_raw = merged_left
df_raw.head()
Reading, merging, and renaming the density dataset
We also want to include population-related factors behind fatal car accidents, so we join population data for each state from the same year in preparation for our analysis.
density = pd.read_csv("./data/density.csv")
density = density.rename(columns={'Pop': 'Population'})  # rename before merging so the merged frame gets the fixed name
merged_left = pd.merge(left=df_raw, right=density, how='left', left_on='State', right_on='State')
df_raw = merged_left
df_raw.head()
Reading, merging, and renaming the death rate dataset
df10 = pd.read_csv("./data/Motor_Vehicle_Occupant_Death_Rate__by_Age_and_Gender__2012___2014__All_States.csv")
df10 = df10[['All Ages, 2012','State']]
death_rate = df10.rename(columns={"All Ages, 2012": "Death_rate"})
merged_left = pd.merge(left=df_raw,right=death_rate, how='left', left_on='State', right_on='State')
merged_left
df_raw = merged_left
df_raw.head()
per_columns = ["Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding",
"Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired",
"Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted",
"Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents"
]
df_raw[per_columns] = df_raw[per_columns].astype(float)/100
df_raw.head()
Here, we convert the negatively phrased percentages ("not distracted", "no previous accidents") into their complements by subtracting them from 1.
df_raw["Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted"] = 1 - df_raw["Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted"]
df_raw["Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents"] = 1 - df_raw["Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents"]
df_raw.head()
Renaming the columns
Some of the variable names are too long and redundant, so we tidy the dataset by renaming them.
new_name = ["state",
"num_driver_fatal",
"percent_speeding",
"percent_alcohol",
"percent_distracted",
"percent_record",
"insurance_premium",
"Insurance_lost_per_driver",
]
# only the first eight columns (the original bad-drivers columns) are renamed;
# the columns added by the merges keep their existing names
df_raw2 = df_raw.rename(columns=dict(zip(df_raw.columns, new_name)))
# impute missing values with the column mean
df_raw2 = df_raw2.fillna(df_raw2.mean(numeric_only=True))
df_raw2.head()
After renaming the variables, our dataset looks much clearer and easier to understand.
We also take a look at our variables to make sure each one has the correct data type.
df_raw2.info()
At this point, we've finished our data extraction and transformation and loaded our data successfully.
Our main dataset is bad-drivers.csv. We added some more variables that we think might be interesting to analyze. We merged in the longitude and latitude of each state because we want to build folium maps later. We also merged in the ratio of population to area (population density) to see whether density has anything to do with the insurance premium or the fatality rate. We have some missing data, such as parts of the information about Washington, DC, so we impute the NaN values with the column mean, i.e. the average across the whole US.
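Before moving on, it is worth checking which columns actually contained missing values. A minimal sketch, run on the pre-imputation frame df_raw, could look like this:
# Count missing values per column on the frame before imputation;
# keep only the columns that have at least one NaN.
missing = df_raw.isna().sum()
missing.loc[missing > 0]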
We are now ready for some Exploratory Data Analysis (EDA).
First of all, we want to see some basic information about our dataset, namely the distribution of each variable. We look at whole distributions rather than a single summary statistic because we do not want a few outliers to distort our impression; a distribution plot tells us more than just the mean or median of each variable.
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    # keep only columns with a manageable number of distinct values
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]]
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow  # integer number of subplot rows
    plt.figure(num=None, figsize=(6 * nGraphPerRow, 8 * nGraphRow), dpi=80, facecolor='w', edgecolor='k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if not np.issubdtype(type(columnDf.iloc[0]), np.number):
            # categorical column: bar chart of value counts
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            # numerical column: histogram
            columnDf.hist()
        plt.ylabel('counts', size=20)
        plt.xticks(rotation=90, size=20)
        plt.title(f'{columnNames[i]}', size=20)
    plt.tight_layout(pad=1.0, w_pad=1.0, h_pad=1.0)
    plt.show()

plotPerColumnDistribution(df_raw2, 10, 5)
Now we want to make folium maps to see whether there is any relationship between the different regions of the country (East Coast, West Coast, Deep South, for example) and these variables.
Install the folium package if it is not already installed on the computer, then restart the kernel.
!pip install folium
import folium
import requests
import pandas
dict2 = {
'Alabama': 'AL',
'Alaska': 'AK',
'Arizona': 'AZ',
'Arkansas': 'AR',
'California': 'CA',
'Colorado': 'CO',
'Connecticut': 'CT',
'Delaware': 'DE',
'District of Columbia': 'DC',
'Florida': 'FL',
'Georgia': 'GA',
'Hawaii': 'HI',
'Idaho': 'ID',
'Illinois': 'IL',
'Indiana': 'IN',
'Iowa': 'IA',
'Kansas': 'KS',
'Kentucky': 'KY',
'Louisiana': 'LA',
'Maine': 'ME',
'Maryland': 'MD',
'Massachusetts': 'MA',
'Michigan': 'MI',
'Minnesota': 'MN',
'Mississippi': 'MS',
'Missouri': 'MO',
'Montana': 'MT',
'Nebraska': 'NE',
'Nevada': 'NV',
'New Hampshire': 'NH',
'New Jersey': 'NJ',
'New Mexico': 'NM',
'New York': 'NY',
'North Carolina': 'NC',
'North Dakota': 'ND',
'Northern Mariana Islands':'MP',
'Ohio': 'OH',
'Oklahoma': 'OK',
'Oregon': 'OR',
'Palau': 'PW',
'Pennsylvania': 'PA',
'Puerto Rico': 'PR',
'Rhode Island': 'RI',
'South Carolina': 'SC',
'South Dakota': 'SD',
'Tennessee': 'TN',
'Texas': 'TX',
'Utah': 'UT',
'Vermont': 'VT',
'Virgin Islands': 'VI',
'Virginia': 'VA',
'Washington': 'WA',
'West Virginia': 'WV',
'Wisconsin': 'WI',
'Wyoming': 'WY',
}
# thank you to @kinghelix and @trevormarburger for this idea
# map full state names to two-letter abbreviations so they match the GeoJSON feature ids
df_raw2['state'] = df_raw2['state'].map(dict2)
df_raw2.head()
my_USA_map = './data/us-states.json'
rate_map = folium.Map(location=[38, -98], zoom_start=4.4)
rate_map.choropleth(geo_data=my_USA_map, data=df_raw2,
columns=['state', 'insurance_premium'],
key_on='feature.id',
fill_color='YlGnBu', fill_opacity=0.7, line_opacity=0.2,
legend_name='Insurance_premium')
folium.LayerControl().add_to(rate_map)
rate_map
#https://github.com/bradtraversy/python_folium_example/blob/master/data/us-states.json
It turns out that Southern states tend to have higher insurance premiums.
Louisiana, New Jersey, and New York are the three states with the highest insurance premiums.
my_USA_map = './data/us-states.json'
speed_map = folium.Map(location=[38, -95], zoom_start=4.4)
speed_map.choropleth(geo_data=my_USA_map, data=df_raw2,
columns=['state', 'num_driver_fatal'],
key_on='feature.id',
fill_color='BuPu', fill_opacity=0.7, line_opacity=0.2,
legend_name='Number of drivers involved in fatal collisions per billion miles')
folium.LayerControl().add_to(speed_map)
speed_map
From the map we can see that Southern and Northern states tend to have a larger number of fatal collisions per billion miles.
However, comparing the two maps, there is no obvious relationship between a state's fatal collision rate and its average insurance premium, which runs against common sense. So we decide to dig deeper and see how the individual factors relate to one another.
Next we want to see which states have relatively high insurance premiums. This is the main question we want to discuss in our presentation, so we plot the states ranked by insurance premium.
df_as=df_raw2.sort_values(by="insurance_premium" , ascending=True)
plt.figure(figsize=(15,12))
sns.barplot(x=df_as['state'], y=df_as['insurance_premium'],palette=sns.cubehelix_palette(len(df_as['state'])))
plt.xticks(rotation= 90)
plt.xlabel('State',size = 16)
plt.ylabel('insurance_premium',size = 16)
plt.title('Car_insurance_premiums',fontsize=25)
plt.show()
Now we want to see how the variables relate to one another. We are looking for correlations between variables, and a heatmap is a good choice for that.
# compute correlations between the numeric columns only
corrmat = df_raw2.select_dtypes(include=np.number).corr()
plt.figure(figsize=(10,10))
sns.heatmap(corrmat, annot=True, cmap="RdYlGn")
plt.title('Correlation Matrix for the Bad Drivers data frame', fontsize=15)
The next thing we are curious about is the relationship between each state's average insurance premium and its fatal accident rate, so we make a scatter plot.
sns.lmplot(x='num_driver_fatal', y='insurance_premium', fit_reg=True,
           scatter_kws={"color": "red", "alpha": 0.5, "s": 30}, data=df_raw2)
plt.title('The correlation between insurance premium and the fatal collision rate', fontsize=20)
plt.xlabel('Drivers involved in fatal collisions per billion miles', size=20)
plt.ylabel('State average insurance premium', size=20)
df_raw2['num_driver_fatal'].corr(df_raw2['insurance_premium'])
According to the graph and the correlation coefficient, there is no clear relationship between the insurance premium and the number of drivers involved in fatal collisions per billion miles, which we can read as each state's rate of serious accidents. The correlation coefficient is close to zero (about -0.02). Sometimes drivers in a state with a lower accident rate are charged more than drivers in a state with a higher one, which strikes us as unfair. To understand the insurance companies' quote system, we make a scatter plot of the losses incurred by insurance companies for collisions per insured driver against the insurance premium.
sns.lmplot(x='Insurance_lost_per_driver', y='insurance_premium', fit_reg=True,
           scatter_kws={"color": "black", "alpha": 0.5, "s": 30}, data=df_raw2)
plt.title("The Correlation Between Insurance Premium and Insurance Company Loss per Driver", fontsize=16)
plt.xlabel('Insurance Company Loss per Driver', size=16)
plt.ylabel('State Average Insurance Premium', size=16)
As we can see, the graph shows some degree of correlation, so we do a bit more analysis to quantify it. We use the call below to compute the correlation coefficient.
df_raw2['Insurance_lost_per_driver'].corr(df_raw2['insurance_premium'])
It turns out that the losses incurred and the insurance premium are positively correlated.
We also want to explore the correlation between population density and insurance premium, so we draw the graph below.
df_raw3 = df_raw2.drop([8])  # drop row 8 (the District of Columbia), an extreme density outlier
sns.lmplot(x='Density', y='insurance_premium', fit_reg=True,
           scatter_kws={"color": "red", "alpha": 0.5, "s": 30}, data=df_raw3)
plt.title('The correlation between insurance premium and state population density', fontsize=20)
plt.xlabel('Population density of the state', size=20)
plt.ylabel('State average insurance premium', size=20)
There is no obvious relationship between a state's population density and its insurance premium.
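To back up this visual impression with a number, we can compute the correlation coefficient the same way as before. A quick check, reusing df_raw3 (the frame with the District of Columbia outlier dropped):
# Correlation between population density and insurance premium (DC excluded);
# a value near zero would support the "no obvious relationship" reading above.
df_raw3['Density'].corr(df_raw3['insurance_premium'])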
df_raw2.head()
Next, we test the effect of speeding among fatal collisions on the insurance premium by keeping all other variables the same and changing only the percent_speeding variable.
We use K-nearest neighbors regression to predict: if we hold the other variables fixed and change just one of them, what would the resulting insurance premium be? If our regression works, we can estimate how much less drivers in Louisiana could pay if they adopted better driving habits.
features = ["percent_speeding", "percent_alcohol","percent_distracted", "percent_record","Insurance_lost_per_driver"]
X_train = df_raw2[features]
y_train = df_raw2["insurance_premium"]
X_train.columns
x_new = pd.Series(index=X_train.columns)
# Set the values of the known variables.
x_new["percent_speeding"] = 0.4
x_new["percent_alcohol"] = 0.13
x_new["percent_distracted"] = 0.17
x_new["percent_record"] = 0.02
x_new["Insurance_lost_per_driver"] = 194.78
x_new
df_raw2.iloc[18]  # row 18 is Louisiana; shown here for reference
# Standardize the variables.
X_train_mean = X_train.mean()
X_train_std = X_train.std()
X_train_sc = (X_train - X_train_mean) / X_train_std
x_new_sc = (x_new - X_train_mean) / X_train_std
# Find the indices of the 18 nearest neighbors.
dists = np.sqrt(((X_train_sc - x_new_sc) ** 2).sum(axis=1))
i_nearest = dists.sort_values()[:18].index
# Average the labels of these 18 nearest neighbors.
y_train.loc[i_nearest].mean()
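As a sanity check on the hand-rolled computation above, the same prediction can be reproduced with scikit-learn's KNeighborsRegressor. This is only a sketch and assumes scikit-learn is installed; it standardizes the features and uses 18 neighbors to mirror the cell above.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features, then average the premium of the 18 nearest states.
knn_model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=18))
knn_model.fit(X_train, y_train)
knn_model.predict(pd.DataFrame([x_new]))  # should closely match the manual prediction above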
Then we keep all other conditions the same but change percent_speeding from 0.4 to 0.3, i.e. a smaller share of fatal collisions involves speeding. According to our hypothesis, if fewer people die because of speeding, the state's average insurance premium should be lower.
X_train.columns
x_new = pd.Series(index=X_train.columns)
# Set the values of the known variables.
x_new["percent_speeding"] = 0.3
x_new["percent_alcohol"] = 0.13
x_new["percent_distracted"] = 0.17
x_new["percent_record"] = 0.02
x_new["Insurance_lost_per_driver"] = 194.78
# Standardize the variables.
X_train_mean = X_train.mean()
X_train_std = X_train.std()
X_train_sc = (X_train - X_train_mean) / X_train_std
x_new_sc = (x_new - X_train_mean) / X_train_std
# Find the indices of the 18 nearest neighbors.
dists = np.sqrt(((X_train_sc - x_new_sc) ** 2).sum(axis=1))
i_nearest = dists.sort_values()[:18].index
# Average the labels of these 18 nearest neighbors.
y_train.loc[i_nearest].mean()
This number agrees with our hypothesis. If the share of fatal collisions involving speeding drops to 30 percent, the predicted state average insurance premium is roughly 23 dollars lower than at 40 percent. We then look at what happens if only 10 percent of fatal collisions involve speeding.
X_train.columns
x_new = pd.Series(index=X_train.columns)
# Set the values of the known variables.
x_new["percent_speeding"] = 0.1
x_new["percent_alcohol"] = 0.13
x_new["percent_distracted"] = 0.17
x_new["percent_record"] = 0.02
x_new["Insurance_lost_per_driver"] = 194.78
# Standardize the variables.
X_train_mean = X_train.mean()
X_train_std = X_train.std()
X_train_sc = (X_train - X_train_mean) / X_train_std
x_new_sc = (x_new - X_train_mean) / X_train_std
# Find the indices of the 18 nearest neighbors.
dists = np.sqrt(((X_train_sc - x_new_sc) ** 2).sum(axis=1))
i_nearest = dists.sort_values()[:18].index
# Average the labels of these 18 nearest neighbors.
y_train.loc[i_nearest].mean()
This shows that if only 10 percent of fatal collisions involved speeding, the predicted average premium would be approximately 937 dollars per year, about 33 dollars less than the prediction at 40 percent. This is encouraging, because it suggests drivers can save money simply by not speeding.
Here are the outputs from testing the effect of percent_speeding:
0.1 ---> 937.7944
0.3 ---> 947.6633
0.4 ---> 971.0183
We can thus infer that the higher the percentage of speeding among fatal collisions, the higher the insurance premium in that state tends to be.
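Instead of re-running the cell for each value, the whole sweep can be done in one loop. A short sketch, reusing X_train and y_train and holding the other inputs at the baseline values used above:
# Sweep percent_speeding while holding the other inputs fixed, using the same
# standardize / 18-nearest-neighbors / average recipe as the cells above.
X_train_mean = X_train.mean()
X_train_std = X_train.std()
X_train_sc = (X_train - X_train_mean) / X_train_std

for speeding in [0.1, 0.2, 0.3, 0.4]:
    x_try = pd.Series({"percent_speeding": speeding,
                       "percent_alcohol": 0.13,
                       "percent_distracted": 0.17,
                       "percent_record": 0.02,
                       "Insurance_lost_per_driver": 194.78})
    x_try_sc = (x_try - X_train_mean) / X_train_std
    dists = np.sqrt(((X_train_sc - x_try_sc) ** 2).sum(axis=1))
    i_nearest = dists.sort_values()[:18].index
    print(speeding, "->", round(y_train.loc[i_nearest].mean(), 2))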
According to our data visualization and the correlation analysis of Insurance_lost_per_driver and insurance_premium, we found that insurance companies never lose: when a state incurs large losses from accidents, they simply charge every driver in that state more. We want to encourage everyone to drive more cautiously. That would reduce car accidents and the losses they cause, and eventually bring down the high insurance premium in Louisiana.
We then used a K-nearest neighbors regression model to explore the relationships between the features and the predicted insurance premium. We experimented by changing the input values of different features and observing their effect on the predicted premium. We found that the percentage of speeding among fatal collisions has a noticeable influence on the insurance premium. Thus, we want to show that citizens of New Orleans could save a good deal of money on insurance if drivers in Louisiana slowed down a little.