This analysis has two objectives:
First, the analysis concludes that the following insured-company features matter most in determining loss cost:
The winning model was XGBoost Regression, with an R^2 of 0.8346 and a Mean Absolute Error of 77.79 units of loss cost.
Second, the analysis correctly predicted the category of a news headline 98.26% of the time.
The winning model was a Multinomial Naïve Bayes Classifier enhanced with a TF-IDF vectorizer.
Credit to God, my Mother, family and friends.
All errors are my own.
Best,
George John Jordan Thomas Aquinas Hayward, Optimist
The insurance product is mainly focused on two metrics: losses and premiums. Losses are what the insurance company pays out to policyholders for claims, and premiums are the price charged to policyholders for purchasing an insurance policy. Because the premium is set at the inception of the policy, insurance companies must do their best to calculate the appropriate premium amount, which fundamentally involves estimating the expected future loss from the policy.
Two ratios are often used to assess the profitability of an insurance company:
$$\text{Loss Ratio} = \frac{\text{Total Losses}}{\text{Premiums}}$$
$$\text{Loss Cost} = \frac{\text{Total Losses}}{\text{Exposure Units}}$$
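As a quick worked illustration of these two ratios (toy numbers, not from the datasets below):
total_losses = 1200000.0    #total claims paid out over the period
premiums = 2000000.0        #total premiums charged for the same period
exposure_units = 15000.0    #total units of exposure insured
loss_ratio = total_losses / premiums         #0.60 -> 60 cents of loss per dollar of premium
loss_cost = total_losses / exposure_units    #80.0 -> average loss per exposure unit
print(loss_ratio, loss_cost)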
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import mysql.connector as mysql
from sqlalchemy import create_engine
import altair as alt
alt.renderers.enable('notebook')
from scipy import stats
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn import linear_model
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.utils.multiclass import unique_labels
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
from sklearn.metrics import mean_absolute_error, r2_score, median_absolute_error, \
explained_variance_score, confusion_matrix, accuracy_score, precision_score, recall_score
import statsmodels.formula.api as sm
import missingno as msno
import xgboost as xgb
from sklearn.svm import SVR
from scipy import stats
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
mnb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
import warnings #just for the final version
warnings.filterwarnings("ignore")
You are provided with 3 datasets containing information related to the 2016 policy year for a particular line of business:
Policies_Data.csv: A dataset containing details about the company's policies written in 2016.
Claims_Data.csv: A dataset containing details about the company's claims paid (also called losses) for the 2016 policy year.
Companies_Data.csv: A dataset containing detailed information regarding the insureds.
Note that Claims_Data can be joined with Policies_Data using policy_id
, and Policies_Data can be joined with Companies_Data using company_id
.
For this first task, read in the 3 datasets and explore the data a bit. Create some summary tables and visualizations to draw meaningful conclusions that can help us later in the modeling or decision-making process. Some suggestions to get you started (note that this isn't an exhaustive list):
policies = pd.read_csv("Policies_Data.csv")
claims = pd.read_csv("Claims_Data.csv")
companies = pd.read_csv("Companies_Data.csv")
companies_policies = companies.merge(policies, left_on='company_id', right_on='company_id',\
how='left',\
suffixes = ('_company','_policy'))
combined = companies_policies.merge(claims, left_on='policy_id', right_on='policy_id',\
how='left',\
suffixes = ('_policy','_claims'))
combined.head(3)
loss_cost = []
was_there_a_loss = []
for i in range(len(combined.company_id_policy)):
if pd.isnull(combined.claim_amount[i]) == True:
loss_cost.append(0)
was_there_a_loss.append(0)
else:
numerator = combined.claim_amount[i]
denominator = combined.exposure_units[i]
loss_cost_for_given_row = numerator / denominator
loss_cost.append(loss_cost_for_given_row)
was_there_a_loss.append(1)
combined['loss_cost'] = loss_cost
combined['was_there_a_loss'] = was_there_a_loss
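For reference, here is a vectorized sketch that should produce the same two quantities without the row loop (written to new *_vec columns, hypothetical names, so it does not overwrite the columns just created):
#vectorized equivalent of the loop above: rows with no claim get 0, otherwise claim_amount / exposure_units
has_loss = combined['claim_amount'].notnull()
combined['was_there_a_loss_vec'] = has_loss.astype(int)
combined['loss_cost_vec'] = np.where(has_loss, combined['claim_amount'] / combined['exposure_units'], 0)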
combined['inception_date_policy'] = pd.to_datetime(combined['inception_date_policy'],\
format='%Y-%m-%d')
combined['expiration_date_policy'] = pd.to_datetime(combined['expiration_date_policy'],\
format='%Y-%m-%d')
combined['claim_date'] = pd.to_datetime(combined['claim_date'],\
format='%Y-%m-%d')
combined['inception_date_claims'] = pd.to_datetime(combined['inception_date_claims'],\
format='%Y-%m-%d')
combined['expiration_date_claims'] = pd.to_datetime(combined['expiration_date_claims'],\
format='%Y-%m-%d')
combined.head(3)
engine = create_engine('mysql+mysqlconnector://newuser:data@localhost:3306/sys', echo=False)
combined.to_sql(name='insurance2', con=engine, if_exists = 'append', index=False)
#this table is named 'insurance2' for display purposes...the rest of the analysis queries from the existing 'insurance' table
#connect to the MySQL database
db = mysql.connect(
host = "localhost",
user = "newuser",
passwd = "data",
auth_plugin='mysql_native_password',
database = 'sys')
loss_cost_and_country_by_count = pd.read_sql("""
select
country,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim,
round((count(*) / (select count(*) from sys.insurance where was_there_a_loss = 1)),2) as prct_of_all_losses
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 4 desc;
""", con=db)
loss_cost_and_country_by_count.head(10)
loss_cost_and_country_by_amount = pd.read_sql("""
select
country,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim,
round((count(*) / (select count(*) from sys.insurance where was_there_a_loss = 1)),2) as prct_of_all_losses
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 2 desc;
""", con=db)
loss_cost_and_country_by_amount.head(10)
loss_cost_and_state_by_count = pd.read_sql("""
select
state,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim,
round((count(*)/(select count(*) from sys.insurance where country='USA' and was_there_a_loss=1)),2) as prct_usa_losses
from sys.insurance
where was_there_a_loss = 1
and state is not null
group by 1
order by 4 desc;
""", con=db)
loss_cost_and_state_by_count.head(10)
company_type_break_down = pd.read_sql("""
select
company_type,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim,
round((count(*) / (select count(*) from sys.insurance where was_there_a_loss = 1)),2) as prct_of_all_losses
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 2 desc;
""", con=db)
company_type_break_down
industry_code_breakdown = pd.read_sql("""
select
industry_code,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 2 desc;
""", con=db)
industry_code_breakdown.head(10)
sector_code_breakdown = pd.read_sql("""
select
sector_code,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 2 desc
limit 10;
""", con=db)
sector_code_breakdown
location_count_breakdown = pd.read_sql("""
select
locations,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 2 desc;
""", con=db)
location_count_breakdown
scatter_plot = alt.Chart(location_count_breakdown)\
.mark_circle(size=200)\
.encode(
x= alt.X('locations',scale=alt.Scale(domain=[0, 30]),\
axis=alt.Axis(title="Number of Locations")),
y= alt.Y('loss_per_company',scale=alt.Scale(domain=[0, 400]),\
axis=alt.Axis(title='Average Loss Cost')))\
.properties(
width=300,
height=300
)#.interactive()
scatter_plot.save('exhbit000a1_numb_locatoins_to_avg_loss_count.png', scale_factor=3.0)
scatter_plot
year_founded_loss_cost_breakdown = pd.read_sql("""
select
year_founded,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 2 desc;
""", con=db)
year_founded_loss_cost_breakdown.head(10)
employee_count_loss_cost_breakdown = pd.read_sql("""
select
employees,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 2 desc;
""", con=db)
employee_count_loss_cost_breakdown.head(10)
scatter_plot = alt.Chart(employee_count_loss_cost_breakdown)\
.mark_circle(size=45)\
.encode(
x= alt.X('employees',scale=alt.Scale(domain=[0, 300]),\
axis=alt.Axis(title="Count of Employees")),
y= alt.Y('loss_per_company',scale=alt.Scale(domain=[0, 1000]),\
axis=alt.Axis(title='Average Loss Cost')))\
.properties(
width=300,
height=300
)#.interactive()
scatter_plot.save('exhbit001_employee_count_to_avg_loss_count.png', scale_factor=3.0)
scatter_plot
risk_zone_loss_cost_breakdown = pd.read_sql("""
select
risk_zone,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 2 desc;
""", con=db)
risk_zone_loss_cost_breakdown
legal_involvement_breakdown = pd.read_sql("""
select
legal_involvement,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 2 desc;
""", con=db)
legal_involvement_breakdown
internet_complaints_breakdown = pd.read_sql("""
select
internet_complaints,
round(avg(loss_cost),2) as loss_per_company,
round(std(loss_cost),2) as std_of_loss,
count(*) as count_of_companies_with_a_claim
from sys.insurance
where was_there_a_loss = 1
group by 1
order by 2 desc;
""", con=db)
internet_complaints_breakdown.head(10)
scatter_plot = alt.Chart(internet_complaints_breakdown)\
.mark_circle(size=150)\
.encode(
x= alt.X('internet_complaints',scale=alt.Scale(domain=[0, 12]),\
axis=alt.Axis(title="Number of Internet Complaints")),
y= alt.Y('loss_per_company',scale=alt.Scale(domain=[0, 300]),\
axis=alt.Axis(title='Average Loss Cost')))\
.properties(
width=300,
height=300
)#.interactive()
scatter_plot.save('exhbit002_number_of_internet_complaints_to_avg_loss_count.png', scale_factor=3.0)
scatter_plot
how_rare_are_losses = pd.read_sql("""
select
case when was_there_a_loss = 1 then "Loss"
when was_there_a_loss = 0 then "No Loss" end as was_there_a_loss,
count(*) as count_of_companies_with_a_claim,
round((count(*) / (select count(*) from sys.insurance)),3) as prct_of_whole
from sys.insurance
group by 1
order by 2 desc;
""", con=db)
how_rare_are_losses
pie_labels = how_rare_are_losses.was_there_a_loss
sizes = how_rare_are_losses.count_of_companies_with_a_claim
explode = (0, 0.1) # only 'explode' the second slice
colors_nt = ['pink', 'hotpink']
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=pie_labels, autopct='%1.1f%%', colors=colors_nt,
shadow=False, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.suptitle("Insurance Claim Loss vs. No Loss Breakdown", fontsize = 14, fontweight = 'bold')
plt.savefig('exhbit003_loss_to_no_loss_breakdown.png',dpi=300, bbox_inches='tight')
plt.show()
#first we pull out a good number of the variables
insurance = pd.read_sql("""
select
state, company_type, industry_code, sector_code, locations, year_founded,
employees, risk_zone, revenue, cost_of_revenue, legal_involvement, internet_complaints,
exposure_units, extract(month from claim_date) as claim_month, claim_amount, was_there_a_loss, loss_cost
from sys.insurance
where 1=1
#and was_there_a_loss = 1
and country = 'USA';
""", con=db)
#all credit due to: Pedro Marcelino
#https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
#correlation matrix
corrmat = insurance.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.savefig('exhibit_0a_correlation_matrix.png',dpi=300, bbox_inches='tight')
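To read the same relationships off numerically for the target, a small addition (using the corrmat computed above):
#features ranked by absolute correlation with loss_cost (self-correlation dropped)
corrmat['loss_cost'].drop('loss_cost').abs().sort_values(ascending=False).head(10)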
#all credit due to: Pedro Marcelino
#https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
#scatterplot
sns.set()
#cols = ['column1', 'column2']
sns.pairplot(insurance, size = 2.5)
#plt.savefig('exhibit_0b_pair_plot.png',dpi=600, bbox_inches='tight')
#600 dpi is commented out. It's useful for viewing all the data up close on your desktop, but slows down the script.
plt.savefig('exhibit_0b_pair_plot_lower_res.png',dpi=150, bbox_inches='tight')
plt.show()
msno.matrix(insurance, color = (.0,.0,.2))
#missing data
total = insurance.isnull().sum().sort_values(ascending=False)
percent = (insurance.isnull().sum()/insurance.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(10)
insurance_processed = insurance.drop(['cost_of_revenue','revenue','year_founded'],\
axis = 1)
risk_zone = []
for i in range(len(insurance_processed.risk_zone)):
if insurance_processed.risk_zone[i] == 5.0:
risk_zone.append(5)
elif insurance_processed.risk_zone[i] == 4.0:
risk_zone.append(4)
else:
risk_zone.append(2) #I am choosing this from a rough approx of what i saw in the sql
insurance_processed.risk_zone = risk_zone
company_type = []
for i in range(len(insurance_processed.company_type)):
if insurance_processed.company_type[i] == 2:
company_type.append(2)
elif insurance_processed.company_type[i] == 1:
company_type.append(1)
else:
company_type.append('Other')
insurance_processed.company_type = company_type
insurance_processed = insurance_processed.dropna(subset=['locations', 'employees','exposure_units'])
#sometimes a number is really a categorical code, so we turn it into a string so pandas treats it as categorical too
insurance_processed['industry_code'] = insurance_processed['industry_code'].apply(str)
insurance_processed['sector_code'] = insurance_processed['sector_code'].apply(str)
dummy = pd.get_dummies(insurance_processed[['state',\
'company_type','industry_code',\
'sector_code']], drop_first=True)
Note that drop_first=True is used above so as to avoid the 'dummy variable trap' of multicollinearity (a small illustration follows after this block).
insurance_ml = pd.concat([insurance_processed, dummy], axis = 1)
insurance_ml = insurance_ml.drop(['state','company_type','industry_code','sector_code'],\
axis = 1)
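As a tiny illustration of the dummy-variable trap (a hypothetical three-level column, not from the data):
#with all three dummies the columns always sum to 1, so they are perfectly collinear with an intercept;
#drop_first=True keeps two columns and treats the dropped level as the baseline
toy = pd.DataFrame({'risk_band': ['low', 'medium', 'high', 'low']})
print(pd.get_dummies(toy['risk_band']))                    #three columns: high, low, medium
print(pd.get_dummies(toy['risk_band'], drop_first=True))   #two columns: low, medium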
msno.matrix(insurance_ml, color = (.0,.0,.2))
#for classification work to find loss or not
#splitting up the data-set for the classification side-project
insurance_ml_classification = insurance_ml
#for regression work to find total loss
#here, we only want to look at what happens WHEN there is a loss
insurance_ml_regression = insurance_ml[insurance_ml.was_there_a_loss == 1]
#we may want to eliminate the outliers of the dependent variable for regression
#for classification, because there is a threshold approach and the sigmoid maps everything between 0 and 1, I'm
#less concerned
insurance_ml_regression.loss_cost.describe()
feature_to_check = 'loss_cost'
percentile_check = [1,25,50,75,90,95,97.0,98.0,99.0,99.7,99.90, 99.99]
percentile_values = []
y_pos = np.arange(len(percentile_check))
for i in percentile_check:
percentile = i
value = round(np.percentile(insurance_ml_regression[feature_to_check], i),2)
percentile_values.append(value)
print('Percentile '+str(i)+": "+str(value))
fig, axs = plt.subplots(ncols=2, figsize=(14,4))
#using a subplot method coupled with an inline parameter to have high resolution
axs[1].set_axis_off()
axs[0].bar(y_pos, percentile_values, align='center', alpha=1)
axs[0].set_ylabel('Values')
axs[0].set_xticks(y_pos)
axs[0].set_xticklabels(percentile_check, fontsize=10)
axs[0].set_xlabel('Percentile')
plt.suptitle('Value By Percentile: Understanding Outliers',x=0.44, \
horizontalalignment='right' )
axs[0].set_title(feature_to_check, fontweight = 'bold')
extent = axs[0].get_window_extent().transformed(fig.dpi_scale_trans.inverted())
fig.savefig('exhibit_a_percentile_look',dpi=300, bbox_inches=extent.expanded(1.38, 1.35))
plt.show()
fig, axs = plt.subplots(ncols=2, figsize=(12,4))
axs[1].set_axis_off()
axs[0].hist(insurance_ml_regression.loss_cost, bins=100)
axs[0].set_title('Histogram of Dependent Variable', fontweight = 'bold')
axs[0].set_ylabel('Frequency')
axs[0].set_xlabel('Values')
plt.show()
insurance_ml_regression.loss_cost.skew()
insurance_ml_regression.loss_cost.kurt()
fig, axs = plt.subplots(ncols=2, figsize=(12,4))
axs[1].set_axis_off()
stats.probplot(insurance_ml_regression.loss_cost, plot=axs[0])
plt.show()
insurance_ml_regression['loss_cost_logged'] = \
np.log1p(insurance_ml_regression['loss_cost'])
fig, axs = plt.subplots(ncols=2, figsize=(12,4))
axs[1].set_axis_off()
axs[0].hist(insurance_ml_regression.loss_cost_logged, bins=100)
axs[0].set_title('Histogram of Log-Transformed Dependent Variable', fontweight = 'bold')
axs[0].set_ylabel('Frequency')
axs[0].set_xlabel('Values')
plt.show()
print("Skew: "+str(insurance_ml_regression.loss_cost_logged.skew()))
print("Kurtosis: "+str(insurance_ml_regression.loss_cost_logged.kurt()))
fig, axs = plt.subplots(ncols=2, figsize=(12,4))
axs[1].set_axis_off()
stats.probplot(insurance_ml_regression.loss_cost_logged, plot=axs[0])
plt.show()
#finally, I will also run the log-transform on any continuous variables in the data that also show positive skewness
#For this block, credit goes to Alexandru Papiu
#(https://www.kaggle.com/apapiu/regularized-linear-models)
#log transform skewed numeric features:
continous_features = ['locations','employees','internet_complaints','exposure_units']
skewed_feats = insurance_ml_regression[continous_features].apply(lambda x: stats.skew(x))
#compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
insurance_ml_regression[skewed_feats] = np.log1p(insurance_ml_regression[skewed_feats])
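A tiny self-contained check of what this transform does (toy numbers, independent of the data above):
#a right-skewed toy sample: log1p pulls the long tail in, and expm1 inverts it exactly
toy_vals = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 300.0])
print(stats.skew(toy_vals))                                  #strongly positive skew
print(stats.skew(np.log1p(toy_vals)))                        #noticeably smaller skew after the transform
print(np.allclose(np.expm1(np.log1p(toy_vals)), toy_vals))   #True: expm1 undoes log1p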
Using what you’ve learned from the data exploration exercise above, try building a few different models to predict loss cost (defined above). Some suggestions to help you start:
Note: Do not use premium as an independent variable, as in reality premium would only be available after a predicted loss cost is calculated.
You've built a few models in the previous exercise:
features = insurance_ml_regression.drop(['loss_cost','claim_amount','was_there_a_loss',\
'loss_cost_logged','exposure_units'], axis = 1)
#we can't use claim amount because it is what the dependent variable (loss cost) is computed from
#we also know from the correlation matrix that exposure units & employees are highly correlated, so we only need one
y = insurance_ml_regression.loss_cost_logged #y = insurance_ml_regression.loss_cost
#xgb, just done again to have it all in a nice place
X_train_xgb, X_test, y_train_xgb, y_test = train_test_split\
(features, y, test_size = 0.2, random_state = 1)
xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)
xgb_model.fit(X_train_xgb,y_train_xgb)
y_pred_xgb = xgb_model.predict(X_test)
#print r^2 and plot xgb
print("R^2:"+str(round(r2_score(y_test, y_pred_xgb),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_test),\
np.exp(y_pred_xgb)),2)))
print("STD of Test Data (in original units): "+str(round(np.std(np.exp(y_test)),2)))
print("Mean of Test Data (in original units): "+str(round(np.mean(np.exp(y_test)),2)))
print("Residual Skew: "+str(round(stats.skew(np.exp(y_pred_xgb)-np.exp(y_test)),2)))
print("Residual Kurtosis: "+str(round(stats.kurtosis(np.exp(y_pred_xgb)-np.exp(y_test)),2)))
fig, axs = plt.subplots(ncols=(3), nrows=(2), figsize=(15,8))
plt.subplots_adjust(top = 0.95, bottom=0.01, hspace=0.25, wspace=0.16)
axs[0,0].scatter(np.exp(y_test), np.exp(y_pred_xgb))
axs[0,0].set_title('Actuals (x) vs Predicteds (y) \n For 98% of Losses',fontweight = 'bold')
axs[0,0].set_xlim([0, 1000])
axs[0,0].set_ylim([0, 1000])
axs[0,1].scatter(np.exp(y_test), (np.exp(y_pred_xgb)-np.exp(y_test)))
axs[0,1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)\n For 98% of Losses',fontweight = 'bold')
axs[0,1].set_xlim([0, 1000])
axs[0,1].set_ylim([-1000, 1000])
axs[0,2].hist((np.exp(y_pred_xgb)-np.exp(y_test)), bins=100)
axs[0,2].set_title('Histogram of Residuals [predicted - actual]\n For 100% of Losses',fontweight = 'bold')
axs[1,0].scatter(np.exp(y_test), np.exp(y_pred_xgb))
axs[1,0].set_title('Actuals (x) vs Predicteds (y) \n For 100% of Losses',fontweight = 'bold')
axs[1,1].scatter(np.exp(y_test), (np.exp(y_pred_xgb)-np.exp(y_test)))
axs[1,1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)\n For 100% of Losses',fontweight = 'bold')
axs[1,2].set_axis_off()
plt.savefig('exhibit_b_xgb_regression',dpi=300, bbox_inches='tight')
plt.show()
#cross-val
print("Cross-Validation Scoring for XGBoost")
#print('Mean Absolute Error: {}'.format(-1*round(cross_val_score(xgb_model, X_train_xgb, y_train_xgb, \
# cv=10, scoring='neg_mean_absolute_error').mean(),2)))
#no option for RMSE in cross val score, so looking at mean absolute error
#multiply by -1 to get a positive value to take a look at it
print('R^2: {}'.format(round(cross_val_score(xgb_model, X_train_xgb, y_train_xgb, cv=10, scoring='r2').mean(),2)))
#thinking about overfitting
print("XGB Overfitting Check")
print(" ")
y_pred_xgb_from_test = xgb_model.predict(X_test)
y_pred_xgb_from_train = xgb_model.predict(X_train_xgb)
print("_________Scoring Against Training Data (Scoring Against Itself)_________")
print(" ")
print("R^2:"+str(round(r2_score(y_train_xgb, y_pred_xgb_from_train),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_train_xgb),\
np.exp(y_pred_xgb_from_train)),2)))
print(" ")
print("_________Scoring Against Test Data_________")
print(" ")
print("R^2:"+str(round(r2_score(y_test, y_pred_xgb_from_test),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_test),\
np.exp(y_pred_xgb_from_test)),2)))
print(" ")
print("_________Train to Test MAE Ratio_________")
print(
round(
mean_absolute_error(np.exp(y_test), np.exp(y_pred_xgb_from_test))
/
mean_absolute_error(np.exp(y_train_xgb),np.exp(y_pred_xgb_from_train))
,2)
)
#grid search for tuning parameters #should run in about 30 seconds
#With all due credit to: Omar Essam in Egypt
#https://www.kaggle.com/omarito/gridsearchcv-xgbregressor-0-556-lb
# A parameter grid for XGBoost
params_xgb_searching = {'min_child_weight':[4,5], 'max_depth': [2,3,4]}
# Initialize XGB and GridSearch
xgb_searching = xgb.XGBRegressor(nthread=-1, objective="reg:squarederror", random_state=42)
grid = GridSearchCV(xgb_searching, params_xgb_searching)
grid.fit(X_train_xgb, y_train_xgb)
# Print the r2 score
print("Potential Optimum Results from XGBoost Hyperparameter Tuning via Grid Search")
print("R^2 Score: "+str(round(r2_score(y_test, grid.best_estimator_.predict(X_test)),2)))
#print("Mean Absolute Error: "+str(round(mean_absolute_error(y_test, grid.best_estimator_.predict(X_test)),2)))
#commented out because these log units are not comparable
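To see which hyperparameter combination the grid search actually settled on (using the grid object above):
print(grid.best_params_)              #best combination of min_child_weight and max_depth
print(round(grid.best_score_, 4))     #mean cross-validated R^2 of that combination, on the log-scale target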
top_10_xgb_features = pd.DataFrame(sorted(list(zip(features,xgb_model.feature_importances_))\
,key = lambda x: abs(x[1]),reverse=True)[:10], columns=['Feature', 'XGBoost Importance'])
top_10_xgb_features
#plt.xticks(rotation=-25)
bar_count = range(len(top_10_xgb_features.Feature))
fig, axs = plt.subplots(ncols=2, figsize=(14,4))
#using a subplot method coupled with an inline parameter to have high resolution
#note: "[::-1]" reverses the column in a pandas dataframe
axs[1].set_axis_off()
axs[0].barh(bar_count, top_10_xgb_features['XGBoost Importance'][::-1],\
align='center', alpha=1)
axs[0].set_xlabel('Values')
axs[0].set_yticks(bar_count)
axs[0].set_yticklabels(top_10_xgb_features.Feature[::-1], fontsize=10)
axs[0].set_xlabel('XGBoost Importance')
axs[0].set_title("XGBoost's Feature Importances",fontweight = 'bold')
extent = axs[0].get_window_extent().transformed(fig.dpi_scale_trans.inverted())
fig.savefig('exhibit_c_xgbfeatures',dpi=300, bbox_inches=extent.expanded(1.5, 1.5))
plt.show()
X_train1, X_test, y_train1, y_test = train_test_split\
(features, y, test_size = 0.2, random_state = 1)
model = LinearRegression()
model.fit(X_train1,y_train1)
y_slr_predict = model.predict(X_test)
#linear regression standard
#print r^2 and plot linear regression
print("R^2:"+str(round(r2_score(y_test, y_slr_predict),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_test),\
np.exp(y_slr_predict)),2)))
print("STD of Test Data (in original units): "+str(round(np.std(np.exp(y_test)),2)))
print("Mean of Test Data (in original units): "+str(round(np.mean(np.exp(y_test)),2)))
print("Residual Skew: "+str(round(stats.skew(np.exp(y_slr_predict)-np.exp(y_test)),2)))
print("Residual Kurtosis: "+str(round(stats.kurtosis(np.exp(y_slr_predict)-np.exp(y_test)),2)))
fig, axs = plt.subplots(ncols=(3), nrows=(2), figsize=(15,8))
plt.subplots_adjust(top = 0.95, bottom=0.01, hspace=0.25, wspace=0.16)
axs[0,0].scatter(np.exp(y_test), np.exp(y_slr_predict))
axs[0,0].set_title('Actuals (x) vs Predicteds (y) \n For 98% of Losses',fontweight = 'bold')
axs[0,0].set_xlim([0, 1000])
axs[0,0].set_ylim([0, 1000])
axs[0,1].scatter(np.exp(y_test), (np.exp(y_slr_predict)-np.exp(y_test)))
axs[0,1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)\n For 98% of Losses',fontweight = 'bold')
axs[0,1].set_xlim([0, 1000])
axs[0,1].set_ylim([-1000, 1000])
axs[0,2].hist((np.exp(y_slr_predict)-np.exp(y_test)), bins=100)
axs[0,2].set_title('Histogram of Residuals [predicted - actual]\n For 100% of Losses',fontweight = 'bold')
axs[1,0].scatter(np.exp(y_test), np.exp(y_slr_predict))
axs[1,0].set_title('Actuals (x) vs Predicteds (y) \n For 100% of Losses',fontweight = 'bold')
axs[1,1].scatter(np.exp(y_test), (np.exp(y_slr_predict)-np.exp(y_test)))
axs[1,1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)\n For 100% of Losses',fontweight = 'bold')
axs[1,2].set_axis_off()
plt.savefig('exhibit_d_standard_linear_regression',dpi=300, bbox_inches='tight')
plt.show()
#cross-val
print("Cross-Validation Scoring for Standard Linear Regression")
#print('Mean Absolute Error: {}'.format(-1*round(cross_val_score(model, X_train1, y_train1, \
# cv=10, scoring='neg_mean_absolute_error').mean(),2)))
#commented out because these log units are not comparable
#no option for RMSE in cross val score, so looking at mean absolute error
#multiply by -1 to get a positive value to take a look at it
print('R^2: {}'.format(round(cross_val_score(LinearRegression(), X_train1, y_train1, cv=10, scoring='r2').mean(),2)))
#thinking about overfitting
print("Standard Linear Regression Overfitting Check")
print(" ")
y_pred_slr_from_test = model.predict(X_test)
y_pred_slr_from_train = model.predict(X_train1)
print("_________Scoring Against Training Data (Scoring Against Itself)_________")
print(" ")
print("R^2:"+str(round(r2_score(y_train1, y_pred_slr_from_train),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_train1),\
np.exp(y_pred_slr_from_train)),2)))
print(" ")
print("_________Scoring Against Test Data_________")
print(" ")
print("R^2:"+str(round(r2_score(y_test, y_pred_slr_from_test),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_test),\
np.exp(y_pred_slr_from_test)),2)))
print(" ")
print("_________Train to Test MAE Ratio_________")
print(
round(
mean_absolute_error(np.exp(y_test), np.exp(y_pred_slr_from_test))
/
mean_absolute_error(np.exp(y_train1),np.exp(y_pred_slr_from_train))
,2)
)
X_train2, X_test, y_train2, y_test = train_test_split\
(features, y, test_size = 0.2, random_state = 1)
#svm requires standardization because it uses regularization
scaler = StandardScaler()
#"To determine the scaling factors and apply the scaling to the feature data:" -Codecademy
svr_train_features = scaler.fit_transform(X_train2)
svr_test_features = scaler.transform(X_test)
#"To apply the scaling to the test data:" -Codecademy
##we do NOT want to fit to the test
##svr_train_independent_variable = scaler.fit_transform(y_train2.values.reshape(-1, 1))
#^^if we ever need it
svr_poly = SVR(kernel='poly', C=1, degree=3, gamma='auto')
#it should be ok not to scale the dependent variable, if the features are scaled
y_poly_pred = svr_poly.fit(svr_train_features, y_train2).predict(svr_test_features)
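If one did want to scale the target as well, one option (a sketch only, not what is run above) is sklearn's TransformedTargetRegressor, which applies the target transform before fitting and inverts it at prediction time:
from sklearn.compose import TransformedTargetRegressor
#wraps the SVR so y is standardized for fitting and un-standardized automatically when predicting
svr_scaled_y = TransformedTargetRegressor(regressor=SVR(kernel='poly', C=1, degree=3, gamma='auto'), transformer=StandardScaler())
#svr_scaled_y.fit(svr_train_features, y_train2) would then be used the same way as svr_poly above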
#print r^2 and plot svm
print("R^2:"+str(round(r2_score(y_test, y_poly_pred),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_test),\
np.exp(y_poly_pred)),2)))
print("STD of Test Data (in original units): "+str(round(np.std(np.exp(y_test)),2)))
print("Mean of Test Data (in original units): "+str(round(np.mean(np.exp(y_test)),2)))
print("Residual Skew: "+str(round(stats.skew(np.exp(y_poly_pred)-np.exp(y_test)),2)))
print("Residual Kurtosis: "+str(round(stats.kurtosis(np.exp(y_poly_pred)-np.exp(y_test)),2)))
fig, axs = plt.subplots(ncols=(3), nrows=(2), figsize=(15,8))
plt.subplots_adjust(top = 0.95, bottom=0.01, hspace=0.25, wspace=0.16)
axs[0,0].scatter(np.exp(y_test), np.exp(y_poly_pred))
axs[0,0].set_title('Actuals (x) vs Predicteds (y) \n For 98% of Losses',fontweight = 'bold')
axs[0,0].set_xlim([0, 1000])
axs[0,0].set_ylim([0, 1000])
axs[0,1].scatter(np.exp(y_test), (np.exp(y_poly_pred)-np.exp(y_test)))
axs[0,1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)\n For 98% of Losses',fontweight = 'bold')
axs[0,1].set_xlim([0, 1000])
axs[0,1].set_ylim([-1000, 1000])
axs[0,2].hist((np.exp(y_poly_pred)-np.exp(y_test)), bins=100)
axs[0,2].set_title('Histogram of Residuals [predicted - actual]\n For 100% of Losses',fontweight = 'bold')
axs[1,0].scatter(np.exp(y_test), np.exp(y_poly_pred))
axs[1,0].set_title('Actuals (x) vs Predicteds (y) \n For 100% of Losses',fontweight = 'bold')
axs[1,1].scatter(np.exp(y_test), (np.exp(y_poly_pred)-np.exp(y_test)))
axs[1,1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)\n For 100% of Losses',fontweight = 'bold')
axs[1,2].set_axis_off()
plt.savefig('exhibit_e_svm_regression',dpi=300, bbox_inches='tight')
plt.show()
#cross-val
print("Cross-Validation Scoring for Support Vector Regression")
#print('Mean Absolute Error: {}'.format(-1*round(cross_val_score(svr_poly, X_train2, y_train2, \
# cv=10, scoring='neg_mean_absolute_error').mean(),2)))
#commented out because these log units are not comparable
#no option for RMSE in cross val score, so looking at mean absolute error
#multiply by -1 to get a positive value to take a look at it
print('R^2: {}'.format(round(cross_val_score(svr_poly, svr_train_features, y_train2, cv=10, scoring='r2').mean(),2)))
#thinking about overfitting
print("Support Vector Machine Overfitting Check")
print(" ")
y_pred_svm_from_test = svr_poly.predict(svr_test_features) #predict on the scaled features the model was fit with
y_pred_svm_from_train = svr_poly.predict(svr_train_features)
print("_________Scoring Against Training Data (Scoring Against Itself)_________")
print(" ")
print("R^2:"+str(round(r2_score(y_train2, y_pred_svm_from_train),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_train2),\
np.exp(y_pred_svm_from_train)),2)))
print(" ")
print("_________Scoring Against Test Data_________")
print(" ")
print("R^2:"+str(round(r2_score(y_test, y_pred_svm_from_test),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_test),\
np.exp(y_pred_svm_from_test)),2)))
print(" ")
print("_________Train to Test MAE Ratio_________")
print(
round(
mean_absolute_error(np.exp(y_test), np.exp(y_pred_svm_from_test))
/
mean_absolute_error(np.exp(y_train2),np.exp(y_pred_svm_from_train))
,2)
)
X_train3, X_test, y_train3, y_test = train_test_split\
(features, y, test_size = 0.2, random_state = 1)
#lasso requires standardization because it uses regularization
scaler = StandardScaler()
#"To determine the scaling factors and apply the scaling to the feature data:" -Codecademy
lasso_train_features = scaler.fit_transform(X_train3)
#"To apply the scaling to the test data:" -Codecademy
lasso_test_features = scaler.transform(X_test) #we do NOT want to fit to the test
reg = linear_model.Lasso(alpha=0.0005)
reg.fit(lasso_train_features,y_train3) #I do not think it is necessary to standardize the dependent variable,
#...so long as you have standardized the features
y_lasso_predict = reg.predict(lasso_test_features)
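The alpha=0.0005 above is taken as a fixed choice; one data-driven way to pick it (a sketch, not run as part of the analysis) is LassoCV, which cross-validates over a grid of candidate alphas:
from sklearn.linear_model import LassoCV
#searches the listed alphas with 10-fold CV on the standardized training features
lasso_cv = LassoCV(alphas=[0.0001, 0.0005, 0.001, 0.005, 0.01], cv=10)
lasso_cv.fit(lasso_train_features, y_train3)
print(lasso_cv.alpha_)    #the alpha chosen by cross-validation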
#lasso
print("R^2:"+str(round(r2_score(y_test, y_lasso_predict),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_test),\
np.exp(y_lasso_predict)),2)))
print("STD of Test Data (in original units): "+str(round(np.std(np.exp(y_test)),2)))
print("Mean of Test Data (in original units): "+str(round(np.mean(np.exp(y_test)),2)))
print("Residual Skew: "+str(round(stats.skew(np.exp(y_lasso_predict)-np.exp(y_test)),2)))
print("Residual Kurtosis: "+str(round(stats.kurtosis(np.exp(y_lasso_predict)-np.exp(y_test)),2)))
fig, axs = plt.subplots(ncols=(3), nrows=(2), figsize=(15,8))
plt.subplots_adjust(top = 0.95, bottom=0.01, hspace=0.25, wspace=0.16)
axs[0,0].scatter(np.exp(y_test), np.exp(y_lasso_predict))
axs[0,0].set_title('Actuals (x) vs Predicteds (y) \n For 98% of Losses',fontweight = 'bold')
axs[0,0].set_xlim([0, 1000])
axs[0,0].set_ylim([0, 1000])
axs[0,1].scatter(np.exp(y_test), (np.exp(y_lasso_predict)-np.exp(y_test)))
axs[0,1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)\n For 98% of Losses',fontweight = 'bold')
axs[0,1].set_xlim([0, 1000])
axs[0,1].set_ylim([-1000, 1000])
axs[0,2].hist((np.exp(y_lasso_predict)-np.exp(y_test)), bins=100)
axs[0,2].set_title('Histogram of Residuals [predicted - actual]\n For 100% of Losses',fontweight = 'bold')
axs[1,0].scatter(np.exp(y_test), np.exp(y_lasso_predict))
axs[1,0].set_title('Actuals (x) vs Predicteds (y) \n For 100% of Losses',fontweight = 'bold')
axs[1,1].scatter(np.exp(y_test), (np.exp(y_lasso_predict)-np.exp(y_test)))
axs[1,1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)\n For 100% of Losses',fontweight = 'bold')
axs[1,2].set_axis_off()
plt.savefig('exhibit_f_lasso_regression',dpi=300, bbox_inches='tight')
plt.show()
#cross-val
print("Cross-Validation Scoring for Lasso Regression")
#print('Mean Absolute Error: {}'.format(-1*round(cross_val_score(reg, X_train3, y_train3, \
# cv=10, scoring='neg_mean_absolute_error').mean(),2)))
#commented out because these log units are not comparable
#no option for RMSE in cross val score, so looking at mean absolute error
#multiply by -1 to get a positive value to take a look at it
print('R^2: {}'.format(round(cross_val_score(reg, lasso_train_features, y_train3, cv=10, scoring='r2').mean(),2)))
#thinking about overfitting
print("Lasso Overfitting Check")
print(" ")
y_pred_lasso_from_test = reg.predict(lasso_test_features) #predict on the scaled features the model was fit with
y_pred_lasso_from_train = reg.predict(lasso_train_features)
print("_________Scoring Against Training Data (Scoring Against Itself)_________")
print(" ")
print("R^2:"+str(round(r2_score(y_train3, y_pred_lasso_from_train),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_train3),\
np.exp(y_pred_lasso_from_train)),2)))
print(" ")
print("_________Scoring Against Test Data_________")
print(" ")
print("R^2:"+str(round(r2_score(y_test, y_pred_lasso_from_test),4)))
print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error(np.exp(y_test),\
np.exp(y_pred_lasso_from_test)),2)))
print(" ")
print("_________Train to Test MAE Ratio_________")
print(
round(
mean_absolute_error(np.exp(y_test), np.exp(y_pred_lasso_from_test))
/
mean_absolute_error(np.exp(y_train3),np.exp(y_pred_lasso_from_train))
,2)
)
#prepare the classification Dataframe from before
insurance_ml_classification.shape
msno.matrix(insurance_ml_classification, color = (.0,.0,.2))
insurance_ml_classification_processed = insurance_ml_classification.drop(['claim_amount',\
'loss_cost','claim_month', 'exposure_units'],axis = 1)
msno.matrix(insurance_ml_classification_processed, color = (.0,.0,.2))
#address skew in features
#For this block, credit goes to Alexandru Papiu
#(https://www.kaggle.com/apapiu/regularized-linear-models)
#log transform skewed numeric features:
continous_features_classification = ['locations','employees','internet_complaints']
#exposure units have already been taken out
skewed_feats = insurance_ml_classification_processed[continous_features_classification].apply(\
lambda x: stats.skew(x))
#compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
insurance_ml_classification_processed[skewed_feats] = np.log1p(insurance_ml_classification_processed[skewed_feats])
classification_features = insurance_ml_classification_processed.drop(['was_there_a_loss'], axis = 1)
classification_outcome = insurance_ml_classification_processed.was_there_a_loss
train_features, test_features, train_labels, test_labels = train_test_split(classification_features,\
classification_outcome, test_size = 0.2)
#normalize this, since sklearn's logistic regression uses regularization
scaler = StandardScaler()
#"To determine the scaling factors and apply the scaling to the feature data:" -Codecademy
classification_train_features = scaler.fit_transform(train_features)
#"To apply the scaling to the test data:" -Codecademy
classification_test_features = scaler.transform(test_features) #we do NOT want to fit to the test
#run model
log_model = LogisticRegression(solver="liblinear") #to remove warning
log_model.fit(classification_train_features, train_labels)
print("_________Logistic Regression Classification_________")
print("")
print("Scored Against Itself")
print('Accuracy Score: {}'.format(round(log_model.score(classification_train_features, train_labels),3)))
print("")
print("Scored Against Test Data")
print('Accuracy Score: {}'.format(round(log_model.score(classification_test_features, test_labels),3)))
#cross-val
print("_________Cross-Validation Scoring for Logistic Regression Classification_________")
print('Accuracy: {}'.format(round(cross_val_score(log_model, classification_train_features, train_labels, \
cv=10, scoring='accuracy').mean(),3)))
print('Precision: {}'.format(round(cross_val_score(log_model, classification_train_features, train_labels, \
cv=10, scoring='precision').mean(),3)))
print('Recall: {}'.format(round(cross_val_score(log_model, classification_train_features, train_labels, \
cv=10, scoring='recall').mean(),3)))
log_model_predictions = log_model.predict(classification_test_features)
cm = confusion_matrix(test_labels, log_model_predictions)
#print(cm) #this is the barebones confusion matrix
#all credit due to: Michael Galarnyk, "Logistic Regression using Python (scikit-learn)", Towards Data Science
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual Label');
plt.xlabel('Predicted Label');
all_sample_title = 'Logistic Regression Classification \n Accuracy Score: {0:.3f}'.format(\
log_model.score(classification_test_features, test_labels))
plt.title(all_sample_title, size = 15)
plt.savefig('exhibit_g_logistic_regression_confusion_matrix',dpi=300, bbox_inches='tight')
print("Accuracy: "+str(accuracy_score(test_labels, log_model_predictions))) #this is just a little check at the end
print("Precision: "+str(precision_score(test_labels, log_model_predictions))) #this is just a little check at the end
print("Recall: "+str(recall_score(test_labels, log_model_predictions))) #this is just a little check at the end
classification_features_xgb = insurance_ml_classification_processed.drop(['was_there_a_loss'], axis = 1)
classification_outcome_xgb = insurance_ml_classification_processed.was_there_a_loss
train_features, test_features, train_labels, test_labels = train_test_split(classification_features_xgb,\
classification_outcome_xgb, test_size = 0.2)
xgb_classy = xgb.XGBClassifier()
xgb_classy.fit(train_features,train_labels)
xgb_classy_predictions = xgb_classy.predict(test_features)
#run model
print("_________XGBoost Classification_________")
print("")
print("Scored Against Itself")
print('Accuracy Score: {}'.format(round(xgb_classy.score(train_features, train_labels),3)))
print("")
print("Scored Against Test Data")
print('Accuracy Score: {}'.format(round(xgb_classy.score(test_features, test_labels),3)))
#cross-val #this will run about 70 seconds per print
print("_________Cross-Validation Scoring for XGBoost Classification_________")
print('Accuracy: {}'.format(round(cross_val_score(xgb_classy, train_features, train_labels, \
cv=10, scoring='accuracy').mean(),3)))
print('Precision: {}'.format(round(cross_val_score(xgb_classy, train_features, train_labels, \
cv=10, scoring='precision').mean(),3)))
print('Recall: {}'.format(round(cross_val_score(xgb_classy, train_features, train_labels, \
cv=10, scoring='recall').mean(),3)))
xgb_classy_predictions = xgb_classy.predict(test_features) #redundant, but copied here too so I can see it
xgb_cm = confusion_matrix(test_labels, xgb_classy_predictions)
#print(cm) #this is the barebones confusion matrix
#all credit due to: Michael Galarnyk, "Logistic Regression using Python (scikit-learn)", Towards Data Science
plt.figure(figsize=(9,9))
sns.heatmap(xgb_cm, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual Label');
plt.xlabel('Predicted Label');
all_sample_title = 'XGBoost Classification \n Accuracy Score: {0:.3f}'.format(\
xgb_classy.score(test_features, test_labels))
plt.title(all_sample_title, size = 15)
plt.savefig('exhibit_h_xgboost_regression_confusion_matrix',dpi=300, bbox_inches='tight')
print("Accuracy: "+str(accuracy_score(test_labels, xgb_classy_predictions))) #this is just a little check at the end
print("Precision: "+str(precision_score(test_labels, xgb_classy_predictions))) #this is just a little check at the end
print("Recall: "+str(recall_score(test_labels, xgb_classy_predictions))) #this is just a little check at the end
You are given a dataset of 25,000 news article titles about company activities scraped from the web (News_Titles.csv). Each news article has been manually categorized into one of 5 categories:
Build a text classification model to automatically classify future news articles into one of these 5 categories, and present some measurements for how well your model performs. Some suggestions to start:
This final exercise is independent of the previous ones.
news = pd.read_csv('News_Titles.csv')
news.head(3)
news_labels = news.news_category
news_features = news.news_article_title
X_train_news, X_test_news, y_train_news, y_test_news = train_test_split(news_features, news_labels,\
test_size=0.20, random_state=42)
mnb_model.fit(X_train_news, y_train_news)
labels = mnb_model.predict(X_test_news)
print("______Example of the Model at Work______")
print(" ")
print("Actual News Headline:")
print(news.news_article_title[0])
print(" ")
print("Predicted Category:")
print(mnb_model.predict([news.news_article_title[0]]))
print(" ")
print("Actual Category:")
print(news.news_category[0])
print(" ")
print("✅✅✅✅")
accuracy_score(y_test_news, labels)
mat = confusion_matrix(y_test_news, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels=unique_labels(y_test_news, labels), yticklabels=unique_labels(y_test_news, labels))
plt.xlabel('True Label')
plt.ylabel('Predicted Label')
plt.savefig('exhibit_x9_text_classification_confusion_matrix',dpi=300, bbox_inches='tight')
print("Overall Accuracy: "+str(accuracy_score(y_test_news, labels)))
uniques = unique_labels(y_test_news, labels)
prec_score = precision_score(y_test_news, labels, average=None)
reca_score = recall_score(y_test_news, labels, average=None)
print(" ")
print("_______________Precision_______________")
for i in range(len(uniques)):
print(str(uniques[i])+" "+str(prec_score[i]))
print(" ")
print("_______________Recall_______________")
for i in range(len(uniques)):
print(str(uniques[i])+" "+str(reca_score[i]))
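For reference, the same per-class numbers (plus F1 and support) can be produced in one call with sklearn's classification_report, using the variables above:
from sklearn.metrics import classification_report
#per-class precision, recall, F1 and support for the held-out news headlines
print(classification_report(y_test_news, labels))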