This project aims to use the restaurant tips dataset to practice creating composition plots and visualizations. We will examine the relationship between different variables and the tips given.
The dataset consists of information from 244 restaurant bills, collected in the US in 1987.
It includes details about the tips given to restaurant staff, such as the total bill, tip amount, gender of the person paying, smoking status, day of the week, time of day, and party size.
Data details
Source: Swiss coding academy
The main goal of this analysis: learn more about the relationship between different variables and the tips given.
To reach this goal, we need to answer the questions below:
- What does the data look like?
- What needs to be cleaned?
- How are customers grouped?
- What needs to be calculated?
- How do we compare the groups?
How do we answer these questions?
What does the data look like?
- Import pandas and matplotlib.
- Read and check the data. Besides the tip amount, each record contains:
  - The day it occurred
  - Whether it was at lunch or dinner
  - The total bill
  - The sex of the person paying
  - Whether they were a smoker or not
  - The size of the party
The first five rows of the data:
| | Id | Total_bill | Tip | Sex | Smoker | Day | Time | Size |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
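A minimal sketch of this first look at the data, assuming pandas is installed. The five rows are typed in from the table above so the snippet runs without the original file; with the full dataset one would presumably load the file with `pd.read_csv` instead (the file name is not given here).

```python
import pandas as pd

# The five sample rows from the table above, entered by hand so the
# example is self-contained (the real dataset has 244 rows).
sample = pd.DataFrame(
    {
        "Id": [0, 1, 2, 3, 4],
        "Total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
        "Tip": [1.01, 1.66, 3.50, 3.31, 3.61],
        "Sex": ["Female", "Male", "Male", "Male", "Female"],
        "Smoker": ["No", "No", "No", "No", "No"],
        "Day": ["Sun", "Sun", "Sun", "Sun", "Sun"],
        "Time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
        "Size": [2, 3, 3, 2, 4],
    }
)

print(sample.head())   # first rows of the table
print(sample.shape)    # (number of rows, number of columns)
sample.info()          # column names, non-null counts and dtypes
```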
What needs to be cleaned?
- Check for null values and duplicate rows. Result: none were found.
- Check the types. The string columns were read as objects, so we converted them to the correct types. Result: the dtypes are now Float64(2), Int64(2), string(4).
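The checks above could look roughly like the sketch below. It reuses the five sample rows shown earlier as a stand-in for the full data. Using `convert_dtypes()` is an assumption: it is one pandas call that yields the Float64, Int64 and string dtypes reported here, but the notebook may have converted the columns another way.

```python
import pandas as pd

# Stand-in for the loaded data: the five sample rows shown earlier.
df = pd.DataFrame(
    {
        "Id": [0, 1, 2, 3, 4],
        "Total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
        "Tip": [1.01, 1.66, 3.50, 3.31, 3.61],
        "Sex": ["Female", "Male", "Male", "Male", "Female"],
        "Smoker": ["No", "No", "No", "No", "No"],
        "Day": ["Sun", "Sun", "Sun", "Sun", "Sun"],
        "Time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
        "Size": [2, 3, 3, 2, 4],
    }
)

null_counts = df.isnull().sum()               # missing values per column
duplicate_rows = int(df.duplicated().sum())   # rows identical to an earlier row

# Assumption: convert_dtypes() stands in for the type fix; it selects
# pandas' nullable dtypes (Float64, Int64, string) where possible.
cleaned = df.convert_dtypes()

print(null_counts)
print("duplicate rows:", duplicate_rows)
print(cleaned.dtypes)
```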
How are customers grouped?
We need to compare tip figures between pairs of customer groups, so we categorize the data as follows:
- Smokers and non-smokers
- Male and female
- Weekends and weekdays
- Lunch and dinner (we checked the Time column, and the restaurant has only these 2 service times)
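Three of these splits map directly onto existing columns, while the weekend/weekday split has to be derived from the day. A hedged sketch follows: the rows are invented for illustration, the day labels `Thur`, `Fri`, `Sat`, `Sun` are assumed from the `Sun` value in the sample table, and `Day_type` is a hypothetical column name.

```python
import pandas as pd

# Made-up rows for illustration only; these are not values from the dataset.
df = pd.DataFrame(
    {
        "Tip": [2.0, 3.5, 1.5, 4.0],
        "Smoker": ["Yes", "No", "No", "Yes"],
        "Sex": ["Male", "Female", "Male", "Female"],
        "Day": ["Thur", "Sat", "Fri", "Sun"],
        "Time": ["Lunch", "Dinner", "Lunch", "Dinner"],
    }
)

# Confirm how many service times exist before treating Time as a two-way split.
print(df["Time"].unique())

# Assumption: Saturday and Sunday count as the weekend.
is_weekend = df["Day"].isin(["Sat", "Sun"])
df["Day_type"] = is_weekend.map({True: "Weekend", False: "Weekday"})

print(df[["Day", "Day_type"]])
```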
What needs to be calculated?
- View the descriptive statistics of the data
- Calculate these metrics for each of the customer groups listed above: min, max, mean, median
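As a rough sketch, both steps are one-liners in pandas. The rows below are invented so the snippet runs on its own; only the column names follow the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; these are not values from the dataset.
df = pd.DataFrame(
    {
        "Tip": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "Smoker": ["Yes", "Yes", "Yes", "No", "No", "No"],
    }
)

# Overall descriptive statistics for the tip column.
print(df["Tip"].describe())

# Min, max, mean and median of the tip for each group.
summary = df.groupby("Smoker")["Tip"].agg(["min", "max", "mean", "median"])
print(summary)
```

The same `groupby` call works for the other splits by swapping `Smoker` for the relevant column.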
How do we compare the groups?
We use a t-test to compare the groups on each criterion, and matplotlib to compare their distributions.
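A minimal sketch of one such comparison, assuming SciPy's `scipy.stats.ttest_ind` is the test in use (the `TtestResult` output reported in this write-up suggests so, but the exact call is not shown here). The tip values are invented, and the `Agg` backend line is only there so the plot can be built without a display.

```python
import matplotlib
matplotlib.use("Agg")  # optional: headless backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

# Made-up rows for illustration only; these are not values from the dataset.
df = pd.DataFrame(
    {
        "Tip": [1.5, 2.0, 3.0, 4.0, 1.0, 2.5, 3.0, 3.5],
        "Smoker": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    }
)

smokers = df.loc[df["Smoker"] == "Yes", "Tip"]
non_smokers = df.loc[df["Smoker"] == "No", "Tip"]

# Two-sided independent-samples t-test; by default SciPy assumes
# equal variances in the two groups.
result = stats.ttest_ind(smokers, non_smokers)
print(result.statistic, result.pvalue)

# Overlaid histograms to compare the two tip distributions.
fig, ax = plt.subplots()
ax.hist(smokers, bins=5, alpha=0.5, label="Smokers")
ax.hist(non_smokers, bins=5, alpha=0.5, label="Non-smokers")
ax.set_xlabel("Tip (USD)")
ax.set_ylabel("Number of bills")
ax.legend()
```

In a notebook, `plt.show()` would display the figure; the order of the two samples only affects the sign of the statistic, not the p-value.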
You can see the detailed steps here: https://colab.research.google.com/drive/1ZyT3H1C8TqUXtE99oqKlXoY_WLR1vGZl?usp=sharing
From the results, we draw the following insights and conclusions:
Based on the measures for smokers and non-smokers:
- The maximum tip, 10 USD, comes from the smokers group.
- The average tip of smokers is higher than that of non-smokers.
We have TtestResult:
- statistic=0.09222805186888201
- pvalue=0.9265931522244976
The t-test between smokers and non-smokers gives pvalue = 0.927 > 0.05, so we cannot reject the hypothesis that both groups tip the same on average. The data show no statistically significant difference between these two customer groups in the amount tipped to the restaurant's service staff.
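To make the link between the statistic and the p-value concrete, the sketch below recomputes a two-sided p-value from the reported statistic. It assumes a pooled-variance (Student) t-test, which gives 244 - 2 = 242 degrees of freedom; if the notebook used a different variant, the result would differ slightly.

```python
from scipy import stats

t_statistic = 0.09222805186888201   # statistic reported above
n_bills = 244                       # total number of bills in the dataset
degrees_of_freedom = n_bills - 2    # assumption: pooled-variance t-test

# Two-sided p-value: probability of a |t| at least this large
# if the two groups really had the same mean tip.
p_value = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)
print(p_value)
```

A statistic this close to zero means the two group means are almost indistinguishable relative to the spread of the data, which is why the p-value is so large.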
From the table of min, max, and median values and the distribution plot, we can see that the average tip is about 3 USD and the highest tip is 10 USD. Smokers tip slightly more than non-smokers, but the difference is not significant. Tips between 1 USD and 2.5 USD are the amounts that restaurant staff receive most often.
Based on the measures for male and female customers:
- The maximum tip, 10 USD, comes from the male group.
- The average tip of male customers is higher than that of female customers.
We have TtestResult:
- statistic=-1.387859705421269
- pvalue=0.16645623503456755
The t-test between male and female customers gives pvalue = 0.166 > 0.05. The data show no statistically significant difference between these two groups in the amount tipped to the restaurant's service staff.
From the table of min, max, and average values and the distribution plot, we see that male customers tip more than female customers, but the difference is not significant.
Based on the measures for weekends and weekdays:
- The maximum tip, 10 USD, was given on a weekend.
- The average tip on weekends is higher than on weekdays.
We have TtestResult:
- statistic=1.1028993019409794
- pvalue=0.27154326510606286
The t-test between weekends and weekdays gives pvalue = 0.27 > 0.05. The data show no statistically significant difference between these two groups in the amount tipped to the restaurant's service staff.
From the table of min, max, and average values and the distribution plot, we see that weekend customers usually tip more than weekday customers, but the difference is not significant.
Based on the measures for dinner and lunch:
- The maximum tip, 10 USD, was given at dinner.
- The average tip at dinner is higher than at lunch.
We have TtestResult:
- statistic=1.9062569301202392
- pvalue=0.05780153475171558
The t-test between dinner and lunch gives pvalue = 0.0578 > 0.05. This is the comparison closest to the 0.05 threshold, but the difference between these two groups in the amount tipped to the restaurant's service staff is still not statistically significant.
From the table of min, max, and average values and the distribution plot, we see that the dinner group usually tips more than the lunch group, but the difference is not significant.
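The four decisions above all follow the same rule, which can be sketched in a few lines of plain Python. The p-values are the ones reported in this write-up; the 0.05 threshold is the one used throughout, and the variable names are illustrative.

```python
ALPHA = 0.05  # significance threshold used in this analysis

# p-values copied from the four TtestResult outputs above.
reported_p_values = {
    "Smokers vs non-smokers": 0.9265931522244976,
    "Male vs female": 0.16645623503456755,
    "Weekends vs weekdays": 0.27154326510606286,
    "Dinner vs lunch": 0.05780153475171558,
}

for comparison, p_value in reported_p_values.items():
    verdict = "significant" if p_value < ALPHA else "not significant"
    print(f"{comparison}: p = {p_value:.3f} -> {verdict} at alpha = {ALPHA}")

# The comparison with the smallest p-value is the one closest to significance.
closest = min(reported_p_values, key=reported_p_values.get)
print("Closest to the threshold:", closest)
```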
In summary:
- Smokers tip more than non-smokers
- Male customers tip more than female customers
- Weekend customers tip more than weekday customers
- The dinner group tips more than the lunch group
However, none of these differences is statistically significant at the 0.05 level, so the data do not show a clear relationship between these variables and the tips given.