Part I. Beginning Notes
Part I. Question 1
Part I. Question 2
Part I. Question 3
Part I. Code (SQL and Python) and Data Visualizations
.....Data Viz 1 - Tipping Percentage
.....Data Viz 2 - Time to Serve
.....Data Viz 3 - Order Frequency
.....Data Viz 4 - New vs. Repeat Orders
Part II. Question 1
Part II. Question 2
Part II. Question 3
Not shown here: I cleaned up the data the team provided. I deleted the null values for simplicity's sake (though in the real world I would talk with the team about why we have the nulls and we'd decide whether to fill them in), re-formatted the datetime columns, replaced the whitespace in the column headers with underscores, and finally loaded the fixed csv into MySQL Workbench to run my SQL queries.*
Two data notes: I will definitely want to talk to the team about (1) using a more standard datetime format going forward and (2) better understanding why there were so many nulls in the “Driver at restaurant datetime” column.*
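For reference, here is a minimal sketch of the cleaning steps described above, not the exact script I ran; the raw and cleaned file names are placeholders for illustration.
import pandas as pd

# A sketch of the cleaning described above; the csv file names are placeholders
raw = pd.read_csv('deliveries_raw.csv')

# replace the whitespace in the column headers with underscores
raw.columns = [c.strip().replace(' ', '_') for c in raw.columns]

# drop the null rows for simplicity (in practice, discuss the nulls with the team first)
clean = raw.dropna().copy()

# re-format the datetime columns so MySQL can parse them
datetime_cols = ['Customer_placed_order_datetime', 'Placed_order_with_restaurant_datetime',
                 'Driver_at_restaurant_datetime', 'Delivered_to_consumer_datetime']
for col in datetime_cols:
    clean[col] = pd.to_datetime(clean[col])

# write out the fixed csv, then load it into MySQL Workbench
clean.to_csv('deliveries_clean.csv', index=False)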
I analyzed the data per delivery region, unless otherwise noted. The metrics that stood out to me were:
Even though this data is from one of DoorDash’s earliest months, the business shows a lot of promise, with over $925,000 in topline revenue (Insight 8c).
Further, there are roughly 2.7 orders per customer, which indicates that we are performing a service people like enough to order again (Insight 6a). In fact, in this month's data, over 63% of our users were returning customers (Insight 8b).
We operate in three regions: Palo Alto, Mountain View, and San Jose. Most of our orders are in Palo Alto (11k+), then Mountain View (almost 4k), and finally San Jose (just under 3k) (Insight 7). The average order value seems steady in the $40-45 range (Insight 1).
I've also noticed a few interesting trends. First, our order frequency is bimodal, with orders bunching around 11am (lunch) and 6pm (dinner) every day. We have very few orders before 10am or after 9pm, and this can guide the way we use our resources (Insight 2).
I've also looked at how long it takes us to get the order to the customer. It's usually just under 50 minutes, and we can see here a breakdown of what we do in those 50 minutes: Step 1 (Customer -> Restaurant), Step 2 (Restaurant -> Dasher), and Step 3 (Dasher -> Customer).
Regarding refunds, our refund rate is very good. It hovers around 2-3%, varying from delivery region to delivery region, and this rate is in line with my understanding of e-commerce businesses. (Insight 7).
Additionally, I've analyzed data on our top restaurants, for which I considered a short total time to delivery and a popularity of more than 50 customer orders. For instance, Restaurant ID 12 seems like a particularly outstanding performer (Insight 3). Along similar lines, I've also analyzed our power users. Customer ID 514 has already ordered over 60 times this month (Insight 5).
Finally, I noticed something quirky, which can be seen here. Although Palo Alto is one of the wealthiest cities in the United States, its customers seem to be our stingiest tippers. This may be an area for us to work on. In fact, it's where I will focus my experiment.
Using what I’ve seen so far, here are a few recommendations:
There are many parts of the business we can focus on for improvement, but one that stands out to me is our low tipping rate.
I think that places like Palo Alto can tip more. If we can raise that tip rate, I think it will **a)** be positive for our dashers, **b)** attract more workers to our business, **c)** lead to good press coverage, and **d)** increase employee morale because we know we are doing the right thing.
Further, I would only want to make the ask when people are at the final stage of the sales funnel. This is similar to when you are at a checkout register and someone asks if you would like to add a small charitable donation to the end of your purchase. As such, we would not expect it to decrease our number of orders, because the person is already committed to the order at that stage.
Can we increase tips in Palo Alto by adding a photo of the dasher, a short bio of the dasher, and a “15% Tip” button on the screen?
Design: A/B test, evaluated with a two-sample difference-of-means t-test.
Key Metric: Mean Tip Percentage
Control: Status quo; no changes.
Treatment: Status quo + the photo/bio/button combo at checkout.
Null Hypothesis: H_0: μ_control - μ_treatment = 0
Alternative Hypothesis: H_A: μ_control - μ_treatment ≠ 0
Conditions: Randomized assignment, individuals independent, groups independent, samples at least nearly normal (make a histogram to check). A sketch of the test computation follows below.
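To make the analysis step concrete, here is a minimal sketch of how the two-sample t-test could be run once the experiment data comes in; the dataframe and its numbers are made-up placeholders, not DoorDash data.
import numpy as np
import pandas as pd
from scipy import stats

# Made-up placeholder data: one tip-percentage observation per Palo Alto order,
# tagged with its experiment group
rng = np.random.default_rng(0)
experiment_df = pd.DataFrame({
    'group': ['control'] * 500 + ['treatment'] * 500,
    'tip_pct': np.concatenate([rng.normal(10, 4, 500), rng.normal(11, 4, 500)])
})

control = experiment_df.loc[experiment_df['group'] == 'control', 'tip_pct']
treatment = experiment_df.loc[experiment_df['group'] == 'treatment', 'tip_pct']

# two-sample difference-of-means t-test (Welch's version, which does not assume equal variances)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# if p < 0.05, reject H_0 and conclude the mean tip percentage differs between the groups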
Best-Practice Steps:
Because this is software-based, I think we can implement it at a wider scale, although by "wider scale" I still mean Palo Alto.
I think it's the kind of thing that can be a win-win, so I don't anticipate too many things going wrong. One operational challenge may be that dashers are protective of their information, so we would make sure to get their consent and talk to the lawyers (note: I am also a lawyer). We would want to stay in close communication with the dashers, make sure privacy laws are followed, and measure the results.
The engineers who are building the functionality may have more pressing concerns, so we would want to work with them respectfully.
The great thing about digital A/B testing is that when you do have statistical information showing an improvement, the change can be scaled rapidly.
I now connect to my local MySQL database below, then I reproduce the queries I ran as strings, and lastly, I load their output tables as pandas dataframes.
import pandas as pd
import mysql.connector as mysql
from matplotlib import pyplot as plt
#connecting to the local sql database I set up
db = mysql.connect(
    host="localhost",
    user="newuser",
    passwd="data",
    auth_plugin="mysql_native_password",
    database="sys")
tips_and_supplychain_per_region = pd.read_sql("""
WITH supply_flow AS (
select
Delivery_Region,
Is_ASAP,
Restaurant_ID,
Is_New,
Order_total,
round(100*(Amount_of_tip/Order_total),2) as 'Tip_Percent',
(timestampdiff(MINUTE, Customer_placed_order_datetime, Placed_order_with_restaurant_datetime)) as 'Step_1_Time',
(timestampdiff(MINUTE, Placed_order_with_restaurant_datetime, Driver_at_restaurant_datetime)) as 'Step_2_Time',
(timestampdiff(MINUTE, Driver_at_restaurant_datetime, Delivered_to_consumer_datetime)) as 'Step_3_Time',
(timestampdiff(MINUTE, Customer_placed_order_datetime, Delivered_to_consumer_datetime)) as 'Total_Time'
from sys.doordash
where Is_Asap = 'True'
)
select
Delivery_Region,
round(avg(Tip_Percent),2) as 'Tip Pct',
round(avg(Order_total),2) as 'Avg Order',
round(avg(Step_1_Time),1) as 'Cust->Rest Time',
round(avg(Step_2_Time),1) as 'Rest->Driv Time',
round(avg(Step_3_Time),1) as 'Driv->Cust Time',
round(avg(Total_Time),1) as 'Total Time',
count(*) as 'Orders'
from supply_flow
where Step_1_Time > 0 # remove overnight outliers (e.g., placed on the 31st, delivered on the 1st), since the data did not come with months or years
and Step_2_Time > 0
and Step_3_Time > 0
and Total_Time > 0
group by 1
having orders > 50
order by 8 asc
;
""", con=db)
tips_and_supplychain_per_region
#time of the supply chain is in minutes
Two Notes Here:
labels = ['Tip', '']
pa_tip = [tips_and_supplychain_per_region['Tip Pct'][2], (100-tips_and_supplychain_per_region['Tip Pct'][2])]
mv_tip = [tips_and_supplychain_per_region['Tip Pct'][0], (100-tips_and_supplychain_per_region['Tip Pct'][0])]
sj_tip = [tips_and_supplychain_per_region['Tip Pct'][1], (100-tips_and_supplychain_per_region['Tip Pct'][1])]
national_tip = [15,85]
fig, axs = plt.subplots(2, 2,figsize=(9, 6))
colors_pa = ['limegreen','lightgreen']
colors_mv = ['paleturquoise','lightcyan']
colors_sj = ['sandybrown','bisque']
colors_nt = ['hotpink','pink']
axs[0,0].pie(pa_tip, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_pa, explode = (0, 0.1))
axs[0,0].set_title('Palo Alto',fontweight = 'bold')
axs[0,1].pie(mv_tip, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_mv, explode = (0, 0.1))
axs[0,1].set_title('Mountain View',fontweight = 'bold')
axs[1,0].pie(sj_tip, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_sj, explode = (0, 0.1))
axs[1,0].set_title('San Jose',fontweight = 'bold')
axs[1,1].pie(national_tip, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_nt, explode = (0, 0.1))
axs[1,1].set_title('National',fontweight = 'bold')
plt.subplots_adjust(wspace=0, hspace=0.1)
plt.suptitle("Tip Percentage Comparison", fontsize = 14, fontweight = 'bold')
plt.savefig('ex_1_tips.png')
plt.show()
f, ax = plt.subplots(figsize=(8, 6))
step1_time = tips_and_supplychain_per_region['Cust->Rest Time']
step2_time = tips_and_supplychain_per_region['Rest->Driv Time']
step3_time = tips_and_supplychain_per_region['Driv->Cust Time']
total_time = tips_and_supplychain_per_region['Total Time']
p1 = plt.bar(range(len(step1_time)),
step1_time, color = 'cornflowerblue')
p2 = plt.bar(range(len(step2_time)),
step2_time, bottom = step1_time, color = 'sandybrown')
p3 = plt.bar(range(len(step3_time)),
step3_time, bottom = (step2_time+step1_time), color = 'lightgreen')
plt.locator_params(axis='x', nbins=4)
ax.set_xticklabels(['0','San Jose', 'Mountain View', 'Palo Alto'], style='italic')
plt.xlabel("Delivery Region", fontsize = 13)
plt.ylabel("Time (minutes)", fontsize = 13)
plt.legend((p1[0], p2[0], p3[0]), ('Customer to Restaurant', 'Restaurant to Dasher',\
'Dasher to Customer'),loc='center left', bbox_to_anchor=(1, 0.5))
plt.title("DoorDash Time To Serve Avg. Breakdown", fontsize = 14, fontweight = 'bold')
plt.savefig('ex_2_time_to_serve.png')
plt.show()
time_of_order = pd.read_sql("""
select
extract(HOUR from date_add(Customer_placed_order_datetime, INTERVAL -7 HOUR)) as 'Hour',
count(*) as 'Number of Orders',
round(100* count(*) / 13524,2) as 'Percentage' # 13524 = hard-coded total order count used for the percentage
from sys.doordash
group by 1
order by 2 desc
;
""", con=db)
time_of_order.head(10)
f, ax = plt.subplots(figsize=(8, 6))
plt.bar(time_of_order['Hour'],time_of_order['Number of Orders'])
plt.xlabel("Hour of Day", fontsize = 13)
plt.locator_params(axis='x', nbins=17)
ax.set_xticklabels(['0','12am', '2am', '4am','6am', '8am',\
'10am', '12pm','2pm', '4pm',\
'6pm', '8pm','10pm', 'Midnight'], style='italic')
plt.xticks(rotation=-25)
plt.ylabel("Number of Orders", fontsize = 13)
plt.title("Order Frequency Throughout the Day (Month Aggregation)", fontsize = 13, fontweight = 'bold')
plt.savefig("ex_3_order_frequency.png")
plt.show()
best_restaurants = pd.read_sql("""
WITH supply_flow AS (
select
Delivery_Region,
Is_ASAP,
Restaurant_ID,
Is_New,
Order_total,
round(100*(Amount_of_tip/Order_total),2) as 'Tip_Percent',
(timestampdiff(MINUTE, Customer_placed_order_datetime, Placed_order_with_restaurant_datetime)) as 'Step_1_Time',
(timestampdiff(MINUTE, Placed_order_with_restaurant_datetime, Driver_at_restaurant_datetime)) as 'Step_2_Time',
(timestampdiff(MINUTE, Driver_at_restaurant_datetime, Delivered_to_consumer_datetime)) as 'Step_3_Time',
(timestampdiff(MINUTE, Customer_placed_order_datetime, Delivered_to_consumer_datetime)) as 'Total_Time'
from sys.doordash
where Is_Asap = 'True'
)
select
Restaurant_ID,
Delivery_Region,
round(avg(Tip_Percent),2) as 'Tip Pct',
round(avg(Order_total),2) as 'Avg Order',
round(avg(Step_1_Time),1) as 'Cust->Rest Time',
round(avg(Step_2_Time),1) as 'Rest->Driv Time',
round(avg(Step_3_Time),1) as 'Driv->Cust Time',
round(avg(Total_Time),1) as 'Total Time',
count(*) as 'Orders'
from supply_flow
where Step_1_Time > 0 # remove overnight outliers (e.g., placed on the 31st, delivered on the 1st), since the data did not come with months or years
and Step_2_Time > 0
and Step_3_Time > 0
and Total_Time > 0
group by 1, 2
having orders > 50
order by 8 asc;
""", con=db)
best_restaurants.head()
Please note: To make sure I analyzed full numbers for the 'Insight 4-8' sections, I created a new table with the 'null' rows added back in. The nulls generally appeared in the datetime fields, so I removed those rows for my logistical time analysis; however, to get a better picture of user size and topline revenue, I've brought them back. You can see that these SQL queries reference a 'doordash_no_dt' table. The previous results came from queries that, at least partially, relied on the datetime values, whereas the next results come from queries that do not.
new_orders_by_region = pd.read_sql("""
select
Delivery_Region,
sum(case when Is_New = 'True' then 1 else 0 end) as new_orders,
count(*) as orders,
round(100*(sum(case when Is_New = 'True' then 1 else 0 end)) / (count(*)),2) as pct_new
from sys.doordash_no_dt
group by 1
;
""", con=db)
new_orders_by_region
labels = ['New Orders', 'Repeat Orders']
pa_data = [new_orders_by_region.new_orders[0], (new_orders_by_region.orders[0] - new_orders_by_region.new_orders[0])]
mv_data = [new_orders_by_region.new_orders[1], (new_orders_by_region.orders[1] - new_orders_by_region.new_orders[1])]
sj_data = [new_orders_by_region.new_orders[2], (new_orders_by_region.orders[2] - new_orders_by_region.new_orders[2])]
fig, axs = plt.subplots(1, 3,figsize=(12, 3))
colors_pa = ['limegreen','lightgreen']
colors_mv = ['paleturquoise','lightcyan']
colors_sj = ['sandybrown','bisque']
axs[0].pie(pa_data, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_pa, explode = (0, 0.1))
axs[0].set_title('Palo Alto',fontweight = 'bold')
axs[1].pie(mv_data, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_mv, explode = (0, 0.1))
axs[1].set_title('Mountain View',fontweight = 'bold')
axs[2].pie(sj_data, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_sj, explode = (0, 0.1))
axs[2].set_title('San Jose',fontweight = 'bold')
plt.savefig("ex_4_new_vs_repeat_orders.png")
plt.show()
#this is a little trickier because there are two sql statements to be run sequentially
#we split them up in the following way
sql1="set @total_orders = (select count(*) from sys.doordash_no_dt);"
sql2="""
select
Consumer_ID as "Customer_ID",
count(*) as 'Num_Orders',
round(100*(count(*) / @total_orders),3) as "Pct_of_all"
from sys.doordash_no_dt
group by 1
order by 2 desc
limit 50;
"""
cursor=db.cursor(buffered=True)
cursor.execute(sql1)
cursor.close()
power_customers = pd.read_sql(sql2, con=db)
power_customers.head(10)
orders_to_customer = pd.read_sql("""
select
count(distinct Consumer_ID) as num_customers,
count(*) as num_orders,
round((count(*) / count(distinct Consumer_ID)),2) as orders_per_customer
from sys.doordash_no_dt
;
""", con=db)
orders_to_customer
orders_to_returning_customer = pd.read_sql("""
select
count(distinct Consumer_ID) as num_customers,
count(*) as num_orders,
round((count(*) / count(distinct Consumer_ID)),2) as orders_per_returning_customer
from sys.doordash_no_dt
where Is_New = "False"
;
""", con=db)
orders_to_returning_customer
#this makes sense because if we start with 2/3 = 0.67
#and then take 1 from the numerator and 1 from the denominator
#we get (2-1)/(3-1) = 1/2 = 0.50
#this is effectively what we're doing by taking out the 1-time-only customers
refund_rate_by_region = pd.read_sql("""
select
Delivery_Region as region,
sum(case when Refunded_amount > 0 then 1 else 0 end) as refund_count,
count(*) as order_count,
round(100*(sum(case when Refunded_amount > 0 then 1 else 0 end)) / (count(*)),2) as refund_rate_pct
from sys.doordash_no_dt
group by 1
order by 4 asc;
""", con=db)
refund_rate_by_region
Please note: As above, to capture full numbers for the 'Insight 8' section, these queries reference the 'doordash_no_dt' table, which keeps the rows whose datetime fields were null, so that user counts and topline revenue reflect the whole month.
total_users = pd.read_sql("""
select
count(distinct Consumer_ID) as total_users_this_month
from sys.doordash_no_dt;
""", con=db)
total_users
total_new_users = pd.read_sql("""
select
count(distinct Consumer_ID) as total_new_users_this_month
from sys.doordash_no_dt
where Is_New = "True";
""", con=db)
total_new_users
2,431 new users / 6,701 total users = 36.28% new users this month
100% - 36.28% = 63.72% returning users this month
total_financial_sums = pd.read_sql("""
select
sum(Order_total) as total_revenue,
sum(Amount_of_tip) as total_tip,
sum(Amount_of_discount) as total_discount_applied
from sys.doordash_no_dt;
""", con=db)
total_financial_sums
• MySQL has a nifty function called DAYOFWEEK(), so I used that here. The heavy lifting was just making sure the date was in the right DATETIME format. Once you've got that, there's a lot you can do with SQL.
• For the average pay per hour, I find the difference in hours between the dasher's start and end times, then divide their total pay by that difference. Please note that because TIMESTAMPDIFF returns integers (so it would truncate to whole hours), I found it easier to take the difference in minutes and divide by 60 to get hours as a float.
• Finally, I grouped by the day of the week.
• The output will list the days of the week as 1, 2, 3, 4, etc., but you can easily reformat that (perhaps with a CASE WHEN). A sketch of the query follows below.
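Here is a minimal sketch of that query, following the same pandas + MySQL pattern as Part I. The 'dasher_shifts' table and its columns (Dasher_ID, Shift_start_datetime, Shift_end_datetime, Total_pay) are assumed names for illustration, since the Part II schema is not reproduced here.
# A sketch only -- 'dasher_shifts' and its columns are assumed for illustration
avg_pay_by_weekday = pd.read_sql("""
select
dayofweek(Shift_start_datetime) as 'Day_of_Week', # 1 = Sunday ... 7 = Saturday
round(avg(Total_pay / (timestampdiff(MINUTE, Shift_start_datetime, Shift_end_datetime) / 60)),2) as 'Avg_Pay_Per_Hour'
from sys.dasher_shifts
where Shift_end_datetime > Shift_start_datetime # guard against zero or negative shift lengths
group by 1
order by 1
;
""", con=db)
avg_pay_by_weekday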
• The core of this query is the same as the previous part, earnings per hour, but here you need to pull information from another table.
• I make an assumption to limit the data to dashes that occur entirely between 11am and 2pm. This misses dashes that started before or ended after that time block. However, I think the assumption is valid: if a dash ran beyond the 11am-2pm block, we could not correctly assess the pay per hour during that window, because we only have total pay and no way of knowing how much of it came from orders outside the block. Therefore, I limited the query to dashes beginning after 11am and ending before 2pm, using an HOUR extract in the where clause.
• To join the tables, I joined on dasher_id via a standard inner join.
• I then made sure we had the correct submarket_id, also in the where clause. A sketch follows below.
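A minimal sketch of that query, reusing the assumed 'dasher_shifts' table and adding an assumed 'dashers' table (Dasher_ID, Email, Submarket_ID); the submarket id value is a placeholder.
# A sketch only -- 'dasher_shifts' and 'dashers' are assumed; the submarket id is a placeholder
lunch_pay_per_hour = pd.read_sql("""
select
round(avg(s.Total_pay / (timestampdiff(MINUTE, s.Shift_start_datetime, s.Shift_end_datetime) / 60)),2) as 'Avg_Pay_Per_Hour_11am_2pm'
from sys.dasher_shifts s
join sys.dashers d on s.Dasher_ID = d.Dasher_ID
where d.Submarket_ID = 1 # placeholder submarket id
and extract(HOUR from s.Shift_start_datetime) >= 11 # dash begins at or after 11am
and extract(HOUR from s.Shift_end_datetime) < 14 # dash ends before 2pm
and s.Shift_end_datetime > s.Shift_start_datetime
;
""", con=db)
lunch_pay_per_hour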
• There are several ways to attack this problem, and I would like to start with a common SQL gripe: the median is very important in EDA, and there is basically no easy way to get the median in SQL (contrast that with Microsoft Excel).
• The plan was to first aggregate average hourly pay using the methods above. I noticed a slight semantic difference in this problem ("total pay per hour" vs. "average earnings per hour") and interpreted "total pay per hour" as meaning the same thing "average earnings per hour" did before. So I first aggregated average hourly pay, just as I did above.
• Then I grouped the pay by dasher_id and the dasher's email address (which we can do because they have a 1:1 relationship).
    o I had to join the tables again to get this information.
• I then used a DATEDIFF function in the where clause to make sure we are only tracking the last 30 days, as specified by the prompt.
• Finally, and this was the key, I decided to use a window function to rank the dashers by their hourly pay.
    o The one I used was NTILE(), because it fits this problem well. NTILE buckets the rows into as many buckets as you specify; here I specify 2 buckets. The rows are ordered ascending by the average hourly pay that is also in the SELECT clause.
    o Here's where NTILE shines:
        ▪ The issue with the median is that with an odd number of rows the median is the middle row, but with an even number of rows it becomes the average of the two middle rows.
        ▪ NTILE handles this beautifully for our problem. When the total number of rows is even, it buckets evenly; when it is odd, it makes the first group 1 row bigger than the others.
        ▪ For us that is perfect.
        ▪ The goal is to give a promotion to the dashers in the bottom 50%. So, in a 'tie goes to the runner' mindset, we want to err on the side of giving the mid-point dasher the promotion, because the goal is to help the dasher!
        ▪ So as long as you set up the NTILE to follow an ascending hourly pay order, you will have 1's next to everyone who needs the promotion.
• To simplify things, I then nested the query. The WITH clause query gives me the dasher, their email, their pay, and their bucket.
• All I have to do is query the table the WITH clause made, this time setting the bucket number = 1 in a where clause.
• This produces the desired list. A sketch of the full query follows below.
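Here is a minimal sketch of the nested NTILE query described above, again with the assumed 'dasher_shifts' and 'dashers' tables (NTILE requires MySQL 8.0+); it is illustrative, not the exact query I ran.
# A sketch only -- same assumed 'dasher_shifts' and 'dashers' tables as the sketches above
bottom_half_dashers = pd.read_sql("""
WITH hourly_pay AS (
select
s.Dasher_ID,
d.Email,
round(avg(s.Total_pay / (timestampdiff(MINUTE, s.Shift_start_datetime, s.Shift_end_datetime) / 60)),2) as avg_pay_per_hour,
ntile(2) over (order by avg(s.Total_pay / (timestampdiff(MINUTE, s.Shift_start_datetime, s.Shift_end_datetime) / 60)) asc) as pay_bucket
from sys.dasher_shifts s
join sys.dashers d on s.Dasher_ID = d.Dasher_ID
where datediff(curdate(), s.Shift_start_datetime) <= 30 # last 30 days only
and s.Shift_end_datetime > s.Shift_start_datetime
group by 1, 2
)
select Dasher_ID, Email, avg_pay_per_hour
from hourly_pay
where pay_bucket = 1 # bucket 1 = bottom half of hourly pay
order by avg_pay_per_hour asc
;
""", con=db)
bottom_half_dashers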