DESCRIPTIVE ANALYSIS¶

Background¶

This slide its the example of Explanatory Data Analysis in order to share the knowledge about being a Data Scientist Enjoy !

Overview¶

We begin with the overview from this slide. After downloading the data based on instructions in iykra web. The outline as follows:

## 1. Read the Data
## 2. Plotting the Data
## 3. Summary Data
## 4. Correlation Analysis
## 5. Regression Analysis
## 6. Summary

1. Read the Data¶

You can download the data from through this link : This data is transactional data from burger outlet

import csv
import pandas as pd
import numpy as np

#plotting 
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as ipd
import librosa.display

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from IPython.display import display,HTML

CSS = """
.output {
    flex-direction : row;
}
"""

HTML('<style>{}</style>'.format(CSS))

data = pd.read_csv('Static Dataset Transaction - Burger Star (Outlet A).csv')

data.head()

data = data.sort_values(['Amount Burger','Amount Drink'], ascending = False)

The data is consists of several feature as follows :¶

Transaction time
Burger type
Burger code
Burger amount
Drink type
Drink code
Drink Amount
Gender
Payment Method

Here is the metadata that given by default from the csv file. On the next session, We have to modify some type of feature with domain knowledge that we have. For instance, Gender have to be categorical one (or object). This session is to get a proper intuition and interpretation from the data.

data.dtypes

Timestamp        object
Type Burger      object
Code Burger       int64
Amount Burger     int64
Type Drink       object
Code Drink        int64
Amount Drink      int64
Gender            int64
Payment          object
dtype: object

For the following session, we plot the data to get the sense of the data and know what analysis we have to create¶

2. Plotting the data¶

data['hour'] = data['Timestamp'].str[:2].astype(int)
data['trx'] = 1

import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15,8))
plt.scatter(data['Amount Burger'], data['Amount Drink'], color='red')
fig.suptitle('Amount Burger Vs Amount Drink', fontsize=20)
plt.xlabel('Amount Burger', fontsize=14)
plt.ylabel('Amount Drink', fontsize=14)
plt.grid(False)
plt.show();

From the chart above, we hardly found the linear relationship between Amount Drink and Amount Burger. Since The amount of burger does not change linearly as amount drink change. But we know that, sum of the customers order more than one burger for one drink. It can be customer bring their own drink, or just take away order. For further, may be we can create the cross selling package for increase the revenue of this outlet

Next, we want to see the composition of the burger type by number of transactions. For further research, we can conduct upselling or crosseling by the type of burger

Aggregating the data based on the type of burger¶

#Type burger yang paling banyak dibeli
burger = data.groupby(['Type Burger'])['trx'].sum().reset_index().sort_values(['trx'],ascending=False).reset_index(drop=True)
sumburger = sum(burger['trx'])
burger['pct'] = (100. * burger['trx']/sumburger).round(1).astype(str) + '%'

display(burger)
import plotly.express as px
fig = px.bar(burger, x='Type Burger', y='trx',
             #hover_data=['gender'], color='gender',
             labels={'Type Burger':'Type of Burger','trx':'Number of Transactions'}, height=400)
fig.show()

#Type burger yang paling banyak dibeli
burger = data.groupby(['Type Burger'])['Amount Burger'].sum().reset_index().sort_values(['Amount Burger'],ascending=False).reset_index(drop=True)
sumburger = sum(burger['Amount Burger'])
burger['pct'] = (100. * burger['Amount Burger']/sumburger).round(1).astype(str) + '%'

display(burger)
import plotly.express as px
fig = px.bar(burger, x='Type Burger', y='Amount Burger',
             #hover_data=['gender'], color='gender',
             labels={'Type Burger':'Type of Burger','Amount Burger':'Amount of Burger'}, height=400)
fig.show()

From the table above, Beef Burger is the most popular burger in this outlet by 37% of total transactions, also by the size of amount burger ordered as well, Beef burger has the most burger that customers ordered by 27% of total Amount Burger.

datahour = data.groupby(['hour'])['Amount Burger'].sum().reset_index().sort_values(by=['hour'], ascending = True)

We define hour as two first string in timestamp feature. Since, we want to know the relationship between hour transactions and the amount of burger ordered by customers, also timestamp future is too rigit to be aggregated

fig = plt.figure(figsize=(15,8))
plt.scatter(datahour['hour'], datahour['Amount Burger'], color='green')
fig.suptitle('Hour of Transactions Vs Sum of Amount Burger', fontsize=20)
plt.xlabel('Hour', fontsize=14)
plt.ylabel('Amount Burger', fontsize=14)
plt.grid(False)
plt.show();

Based on the scatter plot above, the maximum that the amount burger ordered by customers is at 11 a.m, and the minimum amount of burger ordered by the customers is at 3 p.m.

timedata = data.groupby(['Type Burger','hour'])['Amount Burger'].sum().reset_index().sort_values(['Amount Burger','hour'], ascending = False).reset_index(drop=True)
g = sns.FacetGrid(timedata, col="Type Burger")
g.map(plt.scatter, "hour", "Amount Burger", alpha=.7)
g.add_legend();

datahourdrink = data.groupby(['hour'])['Amount Drink'].sum().reset_index().sort_values(by=['hour'], ascending = True)

fig = plt.figure(figsize=(15,8))
plt.scatter(datahourdrink['hour'], datahourdrink['Amount Drink'], color='blue')
fig.suptitle('Hour of Transactions Vs Sum of Amount Drink', fontsize=20)
plt.xlabel('Hour', fontsize=14)
plt.ylabel('Amount Drink', fontsize=14)
plt.grid(False)
plt.show();

timedatadrink = data.groupby(['Type Drink','hour'])['Amount Drink'].sum().reset_index().sort_values(['Amount Drink','hour'], ascending = False).reset_index(drop=True)
g = sns.FacetGrid(timedatadrink, col="Type Drink")
g.map(plt.scatter, "hour", "Amount Drink", alpha=.7)
g.add_legend();

By aggregating sum of the amount burger, we can see the 11 a.m is the peak hour for the outlet that sold the largest amount of all burger type. On the next pages, we can see what the most popular Type of burger by the hour of the transactions.

Here are the snapshot result from aggregating sum of the data based on the hour feature that created previously.

gendertrx = data.groupby(['Gender'])['trx'].sum().reset_index()
fig = go.Figure(data=[go.Pie(labels=['gender0', 'gender1'], values=gendertrx['trx'])])
fig.show()

genderbtype = data.groupby(['Gender','Type Burger'])['trx'].sum().reset_index()
gender0 = genderbtype[genderbtype['Gender']==0]
gender1 = genderbtype[genderbtype['Gender']==1]

from plotly.subplots import make_subplots

labels = genderbtype['Type Burger'].tolist()

fig = make_subplots(rows=1, cols=2,specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=gender0['trx'].tolist(),name = 'gender0'),1, 1)
fig.add_trace(go.Pie(labels=labels, values=gender1['trx'].tolist(),name = 'gender1'),1, 2)
fig.update_layout(title_text="Gender Vs Type of Burger Transactions")
fig.show();

PAYMENT¶

payment = data.groupby(['Payment'])['Timestamp'].count().reset_index()
payment.columns = ['Payment','sumtrx']
payment = payment.sort_values(['sumtrx'],ascending = False).reset_index(drop=True)

import plotly.express as px
fig = px.bar(payment, x='Payment', y='sumtrx',
             #hover_data=['gender'], color='gender',
             labels={'Payment':'Payment Method','sumtrx':'Number of Transactions'}, height=400)
fig.show()

payfood = data.groupby(['Payment','Type Burger'])['trx'].sum().reset_index()
payfood = payfood.sort_values(['Payment','trx'], ascending=False).reset_index(drop=True)

fig = px.bar(payfood, x='Payment', y='trx',
             hover_data=['Type Burger'], color='Type Burger',
             labels={'Payment':'Payment Method','Amount Burger':'Number of Transactions'}, height=400)
fig.show()

fig = go.Figure()

fig = px.histogram(data, x='Amount Drink',
             hover_data=['Gender'],color='Gender',
             labels={'count':'Count of Burger','Amount Drink':'Amount of Drink'}, height=400)
fig.update_layout(
    margin=dict(l=80, r=30, t=30, b=30),
    paper_bgcolor="LightSteelBlue",
)

g = sns.FacetGrid(data, col="Gender")
g.map(plt.hist, "Amount Drink");

data[['Amount Burger','Amount Drink']].describe()

data[data['Code Burger']==1][['Amount Burger','Amount Drink']].describe()
data[data['Code Burger']==2][['Amount Burger','Amount Drink']].describe()

data[data['Code Burger']==3][['Amount Burger','Amount Drink']].describe()
data[data['Code Burger']==4][['Amount Burger','Amount Drink']].describe()

import seaborn as sns
sns.set(style="whitegrid")
ax = sns.boxplot(x=data["Amount Burger"])

import seaborn as sns
sns.set(style="whitegrid")
ax = sns.boxplot(x=data["Amount Drink"])

ax = sns.boxplot(x="hour", y="Amount Burger", data=data)

ax = sns.boxplot(x="hour", y="Amount Drink", data=data)

Correlation between Amount burger & Amount drink¶

data['Gender0'] = (data['Gender']==1).astype(int)
data[['Amount Burger','Amount Drink','Gender0']].corr()

data['Gender1'] = (data['Gender']==0).astype(int)
data[['Amount Burger','Amount Drink','Gender1']].corr()

Based on the table above, assume that the Burger, Drink, and customer gender have a linear relationship. Gender 0 have negative correlation with amount of burger. It means that, customer with gender 1 is more like to order burger and drink than the customers with gender 0. So, if we want to offer the promo, it would be more efficient.

Correlation between Amount Burger & Hour¶

data[['Amount Burger','hour']].corr()
data[['Amount Drink','hour']].corr()

data['food'] = data['Type Burger'] +','+data['Type Drink']
data['sumitem'] = data['Amount Burger'] + data['Amount Drink']
data['countrx'] = 1

Correlation between Sum Item & Hour¶

data[['sumitem','hour']].corr()

# Paket burger yang paling banyak dibeli
foodgroup = data.groupby(['food','hour','Gender'])['sumitem'].sum().reset_index().sort_values(['sumitem'],ascending = False).reset_index(drop=True)
sumfoodgroup = sum(foodgroup['sumitem'])
foodgroup['pct'] = (foodgroup['sumitem']/sumfoodgroup)
foodgroup.head()

# Paket burger yang paling banyak dibeli
foodgrouptrx = data.groupby(['food','hour','Gender'])['countrx'].sum().reset_index().sort_values(['countrx'],ascending = False).reset_index(drop=True)
sumfoodgrouptrx = sum(foodgrouptrx['countrx'])
foodgrouptrx['pct'] = (foodgrouptrx['countrx']/sumfoodgrouptrx)
foodgrouptrx.head()

fig = px.bar(foodgroup, x='hour', y='sumitem',
             #hover_data=['gender'], color='gender',
             labels={'hour':'Time of Transactions','sumitem':'Sum of Pacakage'}, height=400)
fig.show()

fig = px.bar(foodgrouptrx, x='hour', y='countrx',
             #hover_data=['gender'], color='gender',
             labels={'hour':'Time of Transactions','countrx':'Number of Transactions'}, height=400)
fig.show()

import plotly.express as px
fig = px.bar(foodgroup, x='food', y='sumitem',
             #hover_data=['gender'], color='gender',
             labels={'food':'Food Package','sumitem':'Sum of Burger & Drinks'}, height=600)
fig.show()

import plotly.express as px
fig = px.bar(foodgrouptrx, x='food', y='countrx',
             #hover_data=['gender'], color='gender',
             labels={'food':'Food Package','countrx':'Count of Transactions'}, height=600)
fig.show()

# Paket burger berdasarkan gender
foodgender = data.groupby(['food','Gender'])['sumitem'].sum().reset_index().sort_values(['sumitem','Gender'],ascending = False).reset_index(drop=True)
sumfoodgender = sum(foodgender['sumitem'])
foodgender['pct'] = (foodgender['sumitem']/sumfoodgender)
foodgender

import plotly.express as px
fig = px.bar(foodgender, x='food', y='sumitem',
             hover_data=['Gender'], color='Gender',
             labels={'food':'Food Package','sumitem':'Number of Transactions'}, height=600)
fig.show()

For the very first suggestions, we can't already decide to make a proper crosseling burger product since the transactions is just sligthly diffrent. Need to explore more feature. We just can see the beef burger is the most popular burger in this outlet.

Yes, the diffrent is clearly view on the chart above. From the previous chart we know that the beef burger is the most popular burger at the outlet. But if we combine with the drinks, the cheese burger is the most burger that the customers want. So, we suggest the package of combination between cheese burger with coke and beef burger with the most profitable drink (call it Beef Cheese package). Since the beef burger is the most burger that customers want, no matter the drink (based on the transactional data).Then the customers can choose the type of drink whenever he/she want. The amount of combinations can't decide yet since we don't know the price and the profitability for each food (burger & drink) yet.

For further scheme, we can offer the crossseling package based on the gender of the customers. Then we can make the cross selling / up selling more effective (Since we know the target of the customers)

gender = data.groupby(['Gender'])['sumitem'].sum().reset_index()
gender

Clearly show from the chart above that the gender 0 are the most customers gender type to buy the Cheese burger and coke. Also with the beef burger with no drinks. So, we can offer more the cross selling in the previous slide to gender 0. Maybe with the gift item specifically for gender 0. For gender 1, we can offer the beef burger + chicken burger with the most profitable drink. Since we knew from the chart above, the gender 1 most likely to buy the beef burger and chicken burger

Then, we also can combine the package based on the peak hour of the transaction made. We can offer the package at the specific time

data['hour'] = data['Timestamp'].str[:2]
peak = data.groupby(['hour'])['sumitem'].sum().reset_index().sort_values(['sumitem'],ascending=False)
peak

fig = px.bar(peak, x='hour', y='sumitem',
             #hover_data=['Type Burger'], color='Type Burger',
             labels={'hour':'Time Transactions (hour)','sumitem':'Sum of item transactions'}, height=400)
fig.show()

data['trx'] = 1
peakhourbin = data.groupby(['hour','food'])['Amount Burger'].sum().reset_index().sort_values(['Amount Burger'],ascending=False).reset_index(drop=True)
peakhourbin.head()

We can offer the Beef Cheese package at the 11-13 at launch time. Also with beef burger package as well.

Conclusion¶

Finally, we have some simple strategy to create cross/up selling scheme. As a Data Analyst / Data Scientist, we have to discuss with the Business Stake Holder to get more insight. So, the scheme would be align with the business process. I hope, you can gather some insight to get the point.

Thank you :)

from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

	Timestamp	Type Burger	Code Burger	Amount Burger	Type Drink	Code Drink	Amount Drink	Gender	Payment
0	10:01	Beef burger	1	6	Mineral water	1	3	1	DANA
1	10:04	Cheese burger	3	12	Coke	2	10	0	DANA
2	10:05	Double cheese burger	4	10	Mineral water	1	5	0	LINK AJA
3	10:06	Chicken burger	2	9	Coffe	3	9	0	DANA
4	10:07	Cheese burger	3	11	Coke	2	3	1	DANA

	Type Burger	trx	pct
0	Beef burger	116	37.2%
1	Chicken burger	71	22.8%
2	Double cheese burger	69	22.1%
3	Cheese burger	56	17.9%

	Type Burger	Amount Burger	pct
0	Beef burger	526	27.0%
1	Chicken burger	489	25.1%
2	Cheese burger	487	25.0%
3	Double cheese burger	448	23.0%

	Amount Burger	Amount Drink
count	312.00000	312.000000
mean	6.25000	3.913462
std	3.45969	2.960981
min	1.00000	0.000000
25%	3.00000	2.000000
50%	6.00000	3.000000
75%	9.00000	5.000000
max	12.00000	10.000000

	Amount Burger	Amount Drink
count	116.000000	116.000000
mean	4.534483	3.094828
std	3.386498	2.617594
min	1.000000	0.000000
25%	1.000000	1.000000
50%	4.500000	3.000000
75%	7.000000	3.000000
max	11.000000	10.000000

ForEach.id

Search this blog

Up Selling / Cross Selling Analysis Schema with Python