Chi-Square Test With More Than 2 Categories


I have just enrolled in a Data Science course on Udemy and I learned some good stuff.

In this article, we will do a Chi-square test with more than 2 categories. We will use the « Country » A/B test, which has 3 categories corresponding to 3 countries: Germany, Spain and France. Select the « Gender Actual » tab, right-click it and select « Duplicate » to make a copy.

Rename the new tab from « Gender Actual (2) » to « Country Actual ».

From « Dimensions », drag the variable « Geography » onto « Gender » in « Columns » to replace « Gender » with « Geography ».

Here’s how to do a statistical test on an A/B test when there are 3 categories. We’ll start with the classic method, and then I’ll show you another way to do a Chi-square test with any number of categories.

Let’s start with the classic method. The online tool from the previous article only accepts 2 categories, « Sample1 » and « Sample2 », so we can’t use it for 3 categories. That’s why we’re going to use another online tool, click here.

In this online tool, we enter only the number of observations in each category, without the totals. We simply need to enter the values from our A/B test. To make that easier and avoid mistakes, I’m going to show you how to turn our A/B test into a table.

Go to the « Show Me » tool at the top right.

Click on « text tables ».

Click on the « Swap Rows and Columns » button.

Cool, now you have a table arranged in exactly the same way as the online tool.

In the online tool, we select 2 rows and 3 columns.

As we have 3 categories and 2 possible outcomes, we enter our values exactly as in the table we just created in Tableau.

Perfect, our table is ready. You can click on the « Calculate » button.

As you can see, we observe the same thing as with the other online tool. Our indicator, the « p » value, is less than 5%, which means the result is statistically significant.

This statistical significance means that the differences are unlikely to be due to chance in the sample, so we can expect them to hold for all of the bank’s clients and not just for the sample of 10 000 clients. The « Country » A/B test, whose results are based solely on that sample, showed the same differences. We can conclude that, among all of the bank’s clients, the clients in Germany are the most likely to leave the bank. This is how we do things cleanly.
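If you are curious about what such an online tool computes, here is a minimal Python sketch of a Chi-square test on a 2-by-3 table. It is my own illustration, not the code of the online tool, and the function name is hypothetical. The France and Germany counts come from this article (810 out of 5014 and 814 out of 2509). The Spain counts are an assumption: 10 000 - 5014 - 2509 = 2477 clients, of whom about 17% (roughly 421) left.

```python
import math


def chi_square_test(table):
    """Pearson chi-square test of independence on a table of observed counts.

    `table` is a list of rows, each row a list of counts.
    Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect if country and exit were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Columns: France, Germany, Spain. Rows: exited, stayed.
# The Spain column is an approximation (see the assumption above).
observed = [
    [810, 814, 421],
    [4204, 1695, 2056],
]
statistic, dof = chi_square_test(observed)
# With exactly 2 degrees of freedom, the p-value is exp(-statistic / 2).
p_value = math.exp(-statistic / 2)
print(f"chi2 = {statistic:.1f}, dof = {dof}, significant at 5%: {p_value < 0.05}")
```

The shortcut for the p-value only works for a 2-by-3 (or 3-by-2) table; for other table sizes you would need a statistics library.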

As you saw, this online tool is limited to 5-by-5 tables, so you can’t use it when you have 6 categories or more. Fortunately, it’s possible to do a Chi-square test with any number of categories. It’s a special method, and to help you understand it, I’ll give you a theoretical explanation.

Here we have 3 countries: Germany, Spain and France.

What we’re trying to compare is the number of clients leaving the bank in each of these countries.

With our basic A/B test based on a sample of 10 000 clients, we obtained 16% for France, 32% for Germany and 17% for Spain. Now the question is: « Do we observe the same results across all of the bank’s clients? », that is: « In general, does the country have a significant effect on the number of clients leaving the bank? ». Germany has the highest proportion of clients leaving the bank, so the idea is: « Why would we need to compare the 3 countries at the same time? ».

If we do a statistical test on Germany and France only and we get a significant difference in the number of clients leaving the bank between these 2 countries, then that would mean that, in general, the country has a significant effect on the number of clients who leave the bank. Indeed, if we find by comparing Germany and France that German clients are more likely to leave the bank than French clients, we can consider that Spain will not change anything. German clients will always be more likely to leave the bank than French clients. Maybe there will be a different relationship between Germany and Spain, but there will always be a statistically significant difference between France and Germany, with a larger proportion of clients leaving the bank in Germany than in France.

Here is a way to confirm that this logic holds. Imagine a test whose participants are German, Spanish and French, and imagine that this test was done without looking at what is happening in Spain. Now you get the result and you ask yourself: « Would the results have changed if you had added Spain? ». The answer is « no » because there is no interdependence between Germany, Spain and France. That is, the decision to leave the bank in France and Germany doesn’t depend on Spain. Therefore, it’s quite correct to set 1 category aside to compare the 2 others. And as we now have 2 categories, we can do a Chi-square test with the online tool that we used in the previous article.

So let’s go back to our worksheet and put a country aside to compare only 2 countries. Select the « Country » tab.

What we observe is that the difference between Spain and France is very small, so it wouldn’t be interesting to do a Chi-square test between Spain and France. It’s more interesting to do a Chi-square test between Germany and France and to prove that there is a statistically significant difference between these 2 countries. This will be enough to conclude that the country has a statistically significant impact on the number of clients who leave the bank.

Select the « Country Actual » tab.

We will use the online tool from the previous article, click here.

We will make a copy of « Country Actual » to get a bar chart with absolute values. Right-click on « Country Actual » and select « Duplicate ».

In « Show Me », select « horizontal bars ».

Remove « SUM(Number of Records) » from « Columns » and remove « Exited » and « Geography » from « Rows ».

From « Dimensions », drag « Geography » to « Columns ».

From « Measures », drag « Number of Records » to « Rows ».

From « Measures », drag « SUM(Number of Records) » to « Label ».

From « Dimensions », drag « Exited » to « Label ».

From « Dimensions », drag « Exited » to « Color ».

We also need the total absolute values, which means the total number of clients in each country. There is a very fast way to get that. Right-click on the vertical axis and select « Add Reference Line ».

Then in « Value », click on the drop-down on the right and select « Sum » to get the total sum of the observations.

And in « Scope », select the « Per Cell » option to specify that you want the total for each category, that is, for each country.

Now we have the total at the top of each bar. We will modify the labels to show the absolute values. In « Label », change « Computation » to « Value » and click on the « OK » button.

Here’s how to enter the data:

For « Sample1 » (France), enter 810 in #success because 810 clients left the bank, and enter 5014 in #trials because there are 5014 clients in total.

For « Sample2 » (Germany), enter 814 in #success because 814 clients left the bank, and enter 2509 in #trials because there are 2509 clients in total.

Here is the verdict: « Sample2 is more successful ». « Sample2 » corresponds to the German clients and #success means « yes, the client left the bank ». This verdict means that clients from Germany are more likely to leave the bank than clients from France. And look, there is something important: « p<0.001 ». This means that the « p » value is strictly less than 0.001. As you can see, the « p » value is very small, so we conclude that the difference is statistically significant.
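To check this verdict by hand, here is a minimal Python sketch of a Chi-square test comparing two proportions. It is an illustration under my own assumptions: the function name is hypothetical, no continuity correction is applied, and the online tool may use a slightly different formula. The inputs are the #success and #trials values from this article.

```python
import math


def two_sample_chi_square(success1, trials1, success2, trials2):
    """Chi-square test (1 degree of freedom) comparing two proportions.

    Returns (statistic, p_value).
    """
    table = [
        [success1, trials1 - success1],
        [success2, trials2 - success2],
    ]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if both samples had the same success rate.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Sample1 = France, Sample2 = Germany.
statistic, p_value = two_sample_chi_square(810, 5014, 814, 2509)
print(f"chi2 = {statistic:.1f}, p < 0.001: {p_value < 0.001}")
```

You can reuse the same function for any pair of categories, for example two age groups, by passing their own #success and #trials values.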

Ooh, there’s another thing I wanted to show you with the « age » tab and its 2 parallel bar charts.

As you can see, there are many categories (more than 5) because each category corresponds to a 5-year age group, with clients of the bank aged from 15 to 90 years old. That is a lot of comparisons, but it would be a good exercise for you to find 2 categories whose comparison shows a statistically significant difference.

Here is a hint: look at the 50 to 54 or the 35 to 39 age groups. In fact, you should compare all pairs of categories where you observe a difference on this basic A/B test. Do a basic A/B test with absolute values. Then do a Chi-square test to check whether the difference is statistically significant, I mean, whether the result is valid for all of the bank’s clients.

This is a way to statistically validate the insights we see on Tableau. You see, it’s not very difficult and it’s effective. Here is a way to find insights on Tableau and validate them.

Subscribe to my newsletter and share this article if you think it can help someone you know. Thank you.

-Steph

A Practical Tip To Validate Your Approach

I have just enrolled in a Data Science course on Udemy and I learned some good stuff.

How was the « Number Of Products » A/B test? Easy or difficult?

Here is the result I found.

I think you noticed something strange: there is an anomaly. We would expect that the more products a client has, the more satisfied the client is with the bank, so this type of client should stay with the bank.

In the first 2 bars we can see that a client who has 1 product is more likely to leave the bank than a client who has 2 products. But when a client has 3 or 4 products, we see a huge rate of clients leaving the bank.

Look, there is another odd little detail. In the 2nd bar, we can’t see the « Exited » label because there is no room in the orange part for the text. To keep things simple, we’ll remove the « Exited » label: drag the « Exited » text label and drop it outside.

Perfect, we can read the percentages. On the 1st bar, we can see that among the clients who have 1 product, 28% left the bank. On the 2nd bar, we can see that among the clients who have 2 products, 8% left the bank. This shows us that clients who have 1 product are more likely to leave the bank than clients with 2 products.

For the next bars, we observe an anomaly. On the 3rd bar, we can see that among the clients who have 3 products, 83% left the bank. On the 4th bar, we can see that among the clients who have 4 products, 100% left the bank. We clearly see that there is a problem and we need to do a deeper analysis to understand what is going on.

As Data Scientists, we need to explain what happens in bars 3 and 4. Usually when a client has 3 or 4 banking products, that means he/she is satisfied and loyal to the bank. But in our case, it’s the opposite because a high rate of these clients left the bank. This is the time to do a deeper analysis.

The first thing to analyze is the quality of the data. There is a very big anomaly and it may be because something insignificant in our data distorts the statistics. For example, it’s possible that when the bank selected the clients in this sample, there were very few clients with 4 products and all of those clients left the bank. Sometimes chance creates anomalies, and you have to pay attention to these effects of chance because they don’t seem important but they can lead to false interpretations.

To start, we will check the number of clients with 4 products.

From « Measures », drag « Number Of Records » (which gives the number of observations) to « Label ».

We observe on the first 2 bars that many clients with 1 or 2 products were selected for our sample. For clients with 3 or 4 products, far fewer clients were selected.

There are 220 clients with 3 products and 60 clients with 4 products. These small numbers of clients probably explain why we observe these anomalies.

In this sample of randomly selected clients, there are very few clients with 4 products and they all left the bank. In this situation, we can reasonably suspect that it’s due to chance. When things like that happen, you have to be very careful not to draw conclusions too fast and misinterpret the data.

The conclusion is that a lot of clients were selected for categories 1 and 2. For categories 3 and 4, few clients were selected, so we can’t compute accurate statistics. We need to do a deeper analysis for these categories of clients with 3 and 4 products.
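To make this check systematic, here is a small Python sketch that flags the categories representing only a small share of the sample. It is only an illustration: the function name and the 5% threshold are my own assumptions, not a rule from the course. The counts are the ones read on the bar chart.

```python
def small_categories(counts, sample_size, min_share=0.05):
    """Return the categories whose share of the sample is below `min_share`.

    The 5% default is an arbitrary threshold chosen for illustration.
    """
    return {
        category: count / sample_size
        for category, count in counts.items()
        if count / sample_size < min_share
    }


# Counts read on the bar chart, for a sample of 10 000 clients.
counts = {"3 products": 220, "4 products": 60}
flagged = small_categories(counts, 10_000)
for category, share in flagged.items():
    print(f"{category}: {share:.1%} of the sample")
```

A flagged category is not necessarily wrong; it simply deserves a closer look before you draw conclusions from it.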

Now, let’s put the percentage back on the bar chart. Click on the « Back » button.

Or drag « SUM(Number of Records) » and drop it outside.

We saw that there is an anomaly, so it’s worth adding a comment as a reminder to do a more in-depth analysis of columns 3 and 4.

Right-click between the bar chart’s title and the bars. Select « Annotate » then « Area… ».

A window appears. In this window, write « Low observation in last 2 categories » and click on the « OK » button.

Click on the comment and move it over bars 3 and 4.

The next time you work on this bar chart, you will see this comment, which will remind you to seriously analyze the clients who have 3 and 4 products.

Validate our approach

It’s time to show you how to validate an approach and how to validate the data. For this we will create a new A/B test.

Duplicate this worksheet: right-click on the « NumberOfProducts » tab and select « Duplicate ».

And rename the tab « Validation ».

For this tab, we will remove the comment. Select the comment and press the « Delete » key on your keyboard.

Everything is ready. The idea is to find a variable that doesn’t affect our results, that is, a variable that has no impact on a client’s decision to leave or stay with the bank.

Take, for example, the variable « Customer Id ». A client’s identification number has no influence on the client’s decision to stay with or leave the bank.

We’ll do an A/B test with the last digit of the « Customer Id » and we’ll check that the same proportion of clients leave the bank in each of the 10 categories of the last digit of the « Customer Id ». The 10 categories are the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Let’s go. To start, we will create the variable that contains the last digit of the « Customer Id ». To get this variable, we will create a « Calculated Field ».

Right-click on « Customer Id », select « Create » and click on « Calculated Field ».

Name the calculated field « LastDigitOfCustID ». In the text field, we use the « RIGHT » function with « Customer Id » in parentheses to select the last character of the « Customer Id ». In our case, the last character of the « Customer Id » is the last digit.

Here is the code to write in the text field: RIGHT([Customer Id], 1)

Oops, you see there is a small mistake: « The calculation contains errors ».

There is an error in the formula because « Customer Id » is a numeric variable and the « RIGHT » function applies to a variable of type « STRING ».

To use the « RIGHT » function, we will convert « Customer Id » into a string. We will use the « STR » function with « Customer Id » in parentheses.

Here is the code to write in the text field: RIGHT(STR([Customer Id]), 1)

And click on the « OK » button.
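The same idea can be sketched outside Tableau. Here is a short Python illustration (the function name and the customer ID are hypothetical) showing why the number must be converted to a string before taking its last character.

```python
def last_digit(customer_id: int) -> str:
    """Return the last digit of a numeric ID as a one-character string."""
    # Convert to a string first, as STR() does, then keep the last character.
    return str(customer_id)[-1]


customer_id = 15634602  # hypothetical ID, for illustration only

# Taking the last character of a number directly fails, much like RIGHT()
# applied to a numeric field.
try:
    customer_id[-1]
except TypeError as error:
    print("Without conversion:", error)

print("With conversion:", last_digit(customer_id))
```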

Now you can see that our calculated field « LastDigitOfCustID » is in « Dimensions ».

Drag « LastDigitOfCustID » onto « NumOfProducts » in « Columns ».

Now we have a new bar chart and we see that for every last digit of the « Customer Id » there is about the same proportion of clients leaving the bank. These proportions don’t all correspond exactly to the average of 20%, but these slight variations aren’t important.

Seeing this uniform distribution allows us to validate our data because the data are homogeneous.
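For readers who prefer code, here is a rough Python sketch of what this bar chart computes: the share of clients who left, grouped by the last digit of the ID. The function name and the records are hypothetical and only show the shape of the input; they are not taken from the bank’s dataset.

```python
from collections import defaultdict


def exit_rate_by_last_digit(records):
    """Map each last digit to the share of clients who exited.

    `records` is an iterable of (customer_id, exited) pairs, exited being 0 or 1.
    """
    totals = defaultdict(int)
    exits = defaultdict(int)
    for customer_id, exited in records:
        digit = customer_id % 10  # last digit of the ID
        totals[digit] += 1
        exits[digit] += exited
    return {digit: exits[digit] / totals[digit] for digit in sorted(totals)}


# Hypothetical records: four IDs ending in 1 and four IDs ending in 2.
records = [
    (1001, 1), (1011, 0), (1021, 0), (1031, 0),
    (1002, 0), (1012, 1), (1022, 0), (1032, 0),
]
rates = exit_rate_by_last_digit(records)
print(rates)
```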

Conclusion

Here’s how you can check the homogeneity of your data. Take a variable that has no impact on whether a client leaves or stays with the bank. The example we did with the last digit of the « Customer Id » is excellent. We were able to verify that, in each of the categories of this variable, the same proportion of clients left the bank. Since that is the case, we can validate our data.

Imagine another result. When we do the test with the last digit of the « Customer Id », we observe that for one of the digits, the rate of clients who left is much higher than the average. This would show us that there is a problem in our data, because it indicates an anomaly.
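That imagined situation could be detected with a few lines of Python. This is only a sketch: the function name, the 5-point tolerance and the rates are all hypothetical, with digit 7 deliberately set far from the 20% average.

```python
def flag_deviations(rates, average, tolerance=0.05):
    """Return the categories whose rate differs from `average` by more than `tolerance`."""
    return [
        category
        for category, rate in rates.items()
        if abs(rate - average) > tolerance
    ]


# Hypothetical exit rates per last digit; digit 7 is deliberately off.
rates_by_digit = {
    0: 0.21, 1: 0.19, 2: 0.20, 3: 0.20, 4: 0.22,
    5: 0.19, 6: 0.20, 7: 0.35, 8: 0.21, 9: 0.20,
}
suspicious = flag_deviations(rates_by_digit, average=0.20)
print(suspicious)
```

The tolerance is a judgment call: too small and normal variations get flagged, too large and a real anomaly slips through.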

You can find other ways to verify your data by using other « insignificant variables » to see if the distribution is homogeneous. But be careful when you select an « insignificant variable » because there may be traps.

Here is an example. If you create a variable that takes the first letter of the first name, the distribution will not be homogeneous. The reason is simple, there are many more people who have a name that starts with the letter « M » than with the letter « Y ».

Share this article if you think it can help someone you know. Thank you.

-Steph

Work Effectively And Earn More (Part 2)

I watched one of Olivier Roland’s videos and I learned some good stuff.

If you haven’t read Part 1, click here.

5 actions to be effective


Optimize your working time

Use Pareto’s Law by focusing on the 20% of your actions that contribute 80% of your results, and use Parkinson’s Law to decide how much time to allow for a task.

Here are other actions to put in place to optimize your time:

  • Don’t spread yourself too thin.

  • Stop multitasking: this has been scientifically proven to be a waste of time and productivity. Read this scientific study.

  • Stop interruptions: things like smartphone notifications, emails or messages.

  • Group actions.

  • Remove unnecessary tasks: to find out if you’re doing a useless task, ask yourself this question from Peter Drucker: « Why am I doing this? Is it necessary? » With this question, you can easily delete unnecessary tasks. If needed, you can set a notification on your smartphone that displays this question every 30 minutes, as a reminder all day long.

  • Identify the 20% of things and people that cause 80% of your problems and eliminate them. If it’s someone in your family, talk to that person 2-3 times a week instead of every day.

Automate everything you can

Many tasks can be automated in companies, for example sending messages on social media (I use Buffer). It’s possible to automate a sale on the internet, where the customer does everything: the customer looks for a product, pays by credit card by filling out the payment form on the website, and the invoice is created automatically based on the information provided by the customer, etc.

It’s also possible to automate a whole company, as with drop shipping. Drop shipping is when you sell products that you don’t have in stock and that are sent directly from the supplier to the customers. Amazon offers this type of service too: you can list the products you sell in its catalog and entrust Amazon with stock management, shipping and returns. I wrote an article on Amazon’s drop shipping, here.

There is also the case of « muses », which Tim Ferriss explains in his book « The 4-Hour Workweek ».

Delegate

Focus on your strengths and delegate the rest. Create a list of tasks that you want to delegate, with instructions. Then give these tasks to a team, assigning each type of task to a specialist.

Duplicate

There is no point in reinventing the wheel. You can duplicate your mentors’ recipes for success and use them in your own company.

Recycle

Work that you have done can be reused in a different form. For example, articles from a blog can be used to make a book, a podcast or a video.

4 actions to earn more


Determine your goal and strategy

Determine your goal, your process to reach it and the strategy to put in place. Here are some examples of strategies for developing your wealth :

  • Replace your salary with real estate income and start your own company.

  • Keep your job as an employee and invest as much as possible in the stock market to create passive income.

  • Create a company to earn a complementary income, like a blog, a podcast or a YouTube channel.

  • Buy a piece of land and build several apartments (condos).

  • Etc.

Optimize your management to spend less money

  • Analyze what you pay for to eliminate waste: unnecessary subscriptions, overpriced insurance, etc.

  • Print your bank statement and analyze it.

  • Look for ways to achieve the same result by spending less: compare, buy cheaper, negotiate to save money on the things you really need.

  • Optimize your taxation to reduce your taxes.

Recycle your skills and your work

You can work on something once and get paid several times. You can create a seminar, babysit 3 children instead of 1, walk 5 dogs instead of 1, etc.

You can also reuse work you have already done to create complementary income. For example, if you like to take pictures, you can put them on stock photo sites on the internet.

Duplicate the processes known to create wealth

  • Pay yourself first

  • Make money work for you by saving at least 10% of your income and investing it.

  • Invest in yourself with training to learn new skills.

These are the levers you can use to create a company that serves your life (and not a life that serves your company). With the internet, it’s easier to use these levers by creating content with a blog, a podcast or a YouTube channel.

Share this article if you think it can help someone you know. Thank you.

-Steph

Look For Anomalies

I have just enrolled in a Data Science course on Udemy and I learned some good stuff.

We’ll learn how to duplicate a bar chart to create a new A/B test. We’ll create several A/B tests to look for anomalies.

But before that, we’ll name the sheet. Right-click on the tab and select « Rename Sheet ».

Rename the sheet « Gender ».

Now right-click on the « Gender » tab and select « Duplicate ».

Rename this new tab « Country ».

We’ll do an A/B test with the countries and, to save time, we’ll reuse everything we did with the « Gender » A/B test.

As you can see, « Gender » is in « Columns ».

To use this A/B test with a variable other than « Gender », drag the variable you want onto « Gender » in « Columns ».

Let’s go! « Geography » is in « Dimensions »: take « Geography » and drop it on « Gender ».

Boom, with 1 click we have our A/B test for countries.

We have the percentage of clients who left and who stayed with the bank for each country (Germany, Spain and France).

In this A/B test we can see that in Germany, many clients left the bank, with a rate of 32%. For Spain and France, the rate of clients who left the bank is below the average departure rate (20%): 17% for Spain and 16% for France.
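To see where such percentages come from, here is a tiny Python sketch (my own illustration, with a hypothetical function name). It uses the France and Germany counts quoted in the Chi-square article of this series: 810 clients out of 5014 and 814 clients out of 2509.

```python
def departure_rate(exited, total):
    """Share of clients who left, as a percentage rounded to the nearest integer."""
    return round(100 * exited / total)


# (clients who left, clients in total) for each country.
counts = {"France": (810, 5014), "Germany": (814, 2509)}
for country, (exited, total) in counts.items():
    print(f"{country}: {departure_rate(exited, total)}% left the bank")
```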

Already, we have interesting insights. We can investigate whether a new aggressive competitor with more attractive offers has appeared in Germany, or whether a new law unfavorable to the bank’s offers has been passed. Research is needed in Germany to find the reason for this high departure rate.

As you have seen, an A/B test usually has 2 categories, but in our case there are 3. We could call it an A/B/C test but that’s a bit odd. When there are more than 2 categories, we call it a classification test.

In this article, I will continue to use the term A/B test but remember the term classification test for the next time.

Let’s do another A/B test quickly.

Duplicate this A/B test by right-clicking on the « Country » tab and selecting « Duplicate ».

This time we will study the variable « Has Cr Card ». This variable is « 1 » if the client has a credit card and « 0 » if the client doesn’t have a credit card.

Did you see? This variable is categorical because it is binary (« 1 » or « 0 »), but it is in « Measures ». Since this variable is categorical, it should be in « Dimensions », so we will move the variable « Has Cr Card » from « Measures » to « Dimensions ».

Now that it’s done, drag « Has Cr Card » onto « Geography » in « Columns ».

It’s cool, we have a new A/B test for credit cards. What we can observe in this A/B test is that there is not a big difference between the departure rate of clients who don’t have a credit card (21%) and the departure rate of clients who have a credit card (20%).

It’s time to create aliases for this A/B test. Right-click on « Has Cr Card » and select « Aliases… ».

To start, « 0 » means that the client doesn’t have a credit card, so in « Value », write « No ». « 1 » means that the client has a credit card, so in « Value », write « Yes ». Then click on the « OK » button.

That’s it, the bar chart is easy to read now. We understand that among clients who don’t have a credit card, 21% left the bank and among clients who have a credit card, 20% left the bank. We can conclude that having or not having a credit card doesn’t have a significant impact on the decision to leave the bank.
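If you work with the same data outside Tableau, aliases amount to a simple lookup table. Here is a minimal Python sketch of that idea; the names are hypothetical and this is not how Tableau stores aliases internally.

```python
# Lookup table playing the role of the aliases: raw code -> readable label.
ALIASES = {0: "No", 1: "Yes"}


def apply_aliases(values, aliases=ALIASES):
    """Replace raw codes with readable labels, leaving unknown codes untouched."""
    return [aliases.get(value, value) for value in values]


print(apply_aliases([1, 0, 1]))
```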

It’s time to rename this tab. Right-click on the « Sheet4 » tab and select « Rename Sheet ». Name the sheet « HasCreditCard ».

Let’s go, let’s do another A/B test with another variable. Let’s look at « Measures » and study the variable « IsActiveMember ».

The variable « IsActiveMember » is « 1 » if the client is active and « 0 » if the client is inactive. The definition of « active » needs to be spelled out, because it depends on the bank’s criteria. For example, it could be: « Did the client log in to their bank account at least once last month? » or « Did the client make at least one banking transaction last month? », etc.

As you can see, the variable « IsActiveMember » is a categorical variable (binary 1 and 0), so it’s a variable to move to « Dimensions ».

Here’s another way to move a variable from « Measures » to « Dimensions ». Right-click on « IsActiveMember » and select « Convert to Dimension ».

Perfect, the variable « IsActiveMember » is in « Dimensions ».

We will duplicate our « HasCreditCard » sheet. Right-click on the « HasCreditCard » tab and select « Duplicate ».

Rename this tab « IsActiveMember ».

Since we have duplicated what we did with « HasCreditCard », we simply need to take the variable « IsActiveMember » from « Dimensions » and drag it onto « Has Cr Card » in « Columns ».

Let’s create aliases to make this bar chart easier to read. Right-click on « IsActiveMember » and select « Aliases… ».

For « 0 », we put « No » because the client is not active and for « 1 », we put « Yes » because the client is active. Click on the « OK » button.

Here is what we can see with the « IsActiveMember » A/B test. Among inactive clients, 27% left the bank. Among active clients, 14% left the bank. This shows that clients who are not active are more likely to leave the bank than active clients.

Indeed, an active client uses his/her bank account and the bank’s products, so an active client is probably satisfied with the bank. It’s possible that some clients leave the bank because of external factors such as a competitor, new regulations or events in the client’s private life.

It’s cool, we created 4 A/B tests in a few minutes.

  1. An A/B test « Gender » that allowed us to see that women were more likely to leave the bank.

  2. An A/B test « Country » that allowed us to see that it is in Germany that clients are most likely to leave the bank.

  3. An A/B test « HasCreditCard » that allowed us to see that having or not having a credit card didn’t have a significant impact on the decision to leave the bank.

  4. An A/B test « IsActiveMember » that allowed us to see that clients who aren’t active are more likely to leave the bank.

I will leave you some homework. Do an A/B test with the variable « Number Of Products », which is also a categorical variable. It indicates the number of products that the client has with the bank. Add aliases to make the bar chart easier to read.

I trust you. I’ll give you the answer in the next article.

Share this article if you think it can help someone you know. Thank you.

-Steph