I was tasked with coming up with actionable business steps for the company of Northwind, based on data insights from their company database. I would need to dig into their database via SQL query commands, use the data to come up with insights, and test those insights with hypothesis testing to check for significance. I tried to revolve my insights around three key areas that most businesses would be concerned about:
-
What does Northwind care most about? a) Insights: What questions could I ask that could help uncover important information for increasing overall revenue?
-
What does Northwind’s customers care most about? a) Insights: What questions could I ask that could help uncover information about the products that Northwind provides?
-
What can help out Northwind’s empoyees? a) Insights: What questions could I ask that could help uncover information about sales rep experience.
Reviewing Northwind’s database schema, I came up with the following questions to dig into:
-
Do discounts have a statistically significant effect on the number of products customers order? If so, at what level(s) of discount?
-
Do discounts for different product categories have a statistically significant effect on the number of products customers order from that product category?
-
Is there statistical significance in the category of product based on the revenue they produce?
-
Is there statistical significance in the different total revenue generated by each sales representative?
All of these questions were answered using the same process. The methodology used for each question could be broken down into the following steps:
1) What is the question we are trying to answer? 2) What metrics are needed to answer this question? 3) What will our null hypothesis be if we find evidence of our question to be true? 4) For our alternative hypothesis, should our test be a one tail or two tail test? 5) What data tables from the database are needed to answer the question? 6) How can we find evidence of our question being true? 7) If evidence is found, was hypothesis test can we use to determine if the evidence is statistically significant? 8) If our evidence is found to be statistically significant, use a Cohen D test to see the effect size.
I will use the first business question to demonstrate this process. When first looking into the significance of discounts towards the quantity of products bought by customers, the first thing I needed to do was determine what metrics I will need to use to answer this question. I determined that looking at the quantity per product bought is our target variable, and the discount level is what will provide us our insights. I then determined that if we find evidence that discounts do influence purchase quantity, then our default null hypothesis would be that the mean quantity purchase per product for products with a discount would be the same as for products without a discount. Since we want to accept the alternative hypothesis as the mean of the discounted products being higher (ie discounts improve quantity purchase), this will be a one tail hypothesis test. If the opposite was being shown in the data, I believe further investigation would have needed to be made, as this seems counter intuitive.
The next step was to figure out what data table or combination of data tables from the Northwind database scheme could provide me the information needed. The “Order Detail” data table provided me the proper variables to answer this question: discounts and quantity per product. I was going to answer this question by first reviewing the mean quanitity of products at each discount level to determine whether there was even a difference to begin with. As I suspected, discounts at first appearance did show an increase in the mean quantity of products purchased. In order to determine whether this increase was significant, I would need to run a hypothesis test. I decided to do a Tukey T-Test with an alpha threshold of 0.05, to determine the probability of each discount level’s product quantity to the mean quantity level of those products without a discount. The Tukey T-Test allows me to compare probabilities amongst all combinations of the discount levels at once. Through one test, I was able to determine whether at any discount showed a significance in the increase of quantity sold, as well as at each particular discount level.
Going through this process brought some challenges within gathering the data needed. One challenge I came across was combining data from different data tables in the schema so that I had all the information needed in one DataFrame. For example, in order to answer my questions related to product categories and quantity of products sold, I need to combine the “Order Detail” data table with the “Products Information” table, as well as the “Category Detail” table. My approach to creating DataFrames from the data base tables was to use the pandas command of “read_sql_query”, in which pandas creates a DataFrame out of whatevery SQL query I run. Based on my current knowledge of SQL queries, I only combined two data tables at a time in one SQL query for the instances in where I needed to do this (i.e. ‘Products’ and ‘Product Category’). My DataFrames came out with some columns having the same column name and information, which became annoying at some points. But I did determine the code to remove those columns that are repeated. Once I was able to get the data into a pandas DataFrame, the analysis became a lot easier as I am more comfortable working on within pandas.
To conclude, I think that the process steps I used above is a good workflow for asking questions, researching and developing insights, and running hypothesis testing to determine insights that have statistical impact.