Coffee Sales Data Cleaning & Exploration Project
Project Overview
In this SQL project, I practiced creating a database and a table using pgAdmin 4 and imported a csv file. I then made a backup copy of the table, viewed the data and cleaned several columns. After the data was ready to work with, I began analyzing fictitious cafe sales data to uncover performance metrics.
Data Source: Kaggle Dataset, found here
You can view the full project here.
About the Data
Number of Rows: 10,000
Number of Columns: 8
The dataset contains synthetic data representing sales transactions in a cafe. This dataset is intentionally “dirty,” with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA).
| Column Name | Description |
|---|---|
| transaction_id | unique identifier for each transaction |
| item | name of item purchased |
| quantity | quantity of item purchased |
| price_per_unit | price of a single unit of the item |
| total_spent | total amount spent on transaction |
| payment_method | the method of payment used |
| location | location where transaction occurred |
| transaction_date | the date of the transaction |
Learning Objectives
- Practice DDL (Data Definition Language) SQL commands: CREATE and ALTER
- Practice DML (Data Manipulation Language) SQL commands: COPY and UPDATE
- Practice DQL (Data Query Language) SQL commands: SELECT, FROM, WHERE, GROUP BY, HAVING, DISTINCT, ORDER BY, LIMIT, CTEs, and Window Functions
Cleaning Tasks
- Updated the ‘item’, ‘quantity’, ‘price_per_unit’, ‘total_spent’, ‘payment_method’, ‘location’, and ‘transaction_date’ columns to correctly represent null values in SQL
- Changed the ‘quantity’ column data type to be SMALLINT
- Changed the ‘price_per_unit’ column data type to be MONEY
- Changed the ‘total_spent’ column data type to be MONEY
- Changed the ‘transaction_date’ column data type to be DATE
- Added a new ‘category’ column with VARCHAR(255) data type for menu items: Beverages, Sandwiches & Salads, Pastries
- Updated the coffee_sales table to set the category to be equal to ‘Beverages’ if the item value is either Juice, Coffee, Smoothie, or Tea
- Updated the coffee_sales table to set the category to be equal to ‘Sandwiches & Salads’ if the item value is either Salad or Sandwich
- Updated the coffee_sales table to set the category to be equal to ‘Pastries’ if the item value is either Cake or Cookie
Practice Questions
- Which transactions with the total amount spent were equivalent to the highest amount spent across all transactions?
- What is the most common payment method used at the cafe?
- Which month had the highest sales?
- Were there more takeaway orders or in-store?
- Did in-store or takeaway orders have the highest sales?
- Which item sold the most?
- Which category had the highest sales?
- What was the average total spent amongst all transactions?
- What was the sales performance by month and the net change from previous month?
- Which months had a positive net change in sales?
- What were the most popular items? Display their ranks.
- What was the total sales by month? Compare to the rolling 3-month average of sales to smooth out data spikes.
Future Ideas
- Next time I would create an additional column to calculate the total_spent (price_per_unit * quantity) to verify that it was equal to the ‘total_spent’ column calculation.
- While cleaning the data I noticed the ‘location’ column was a categorical variable with two values, so I recognized it could be encoded as a single binary column — take_out (1 = take-out, 0 = in-store) — to be used as a feature in a future regression or ML model.
