## Chi-Square Test With More Than 2 Categories

I recently enrolled in a Data Science course on Udemy and I learned some good stuff.

In this article, we will do a Chi-square test with more than 2 categories. We will use the A/B test « Country », which has 3 categories corresponding to 3 countries : Germany, Spain and France. Select the « Gender Actual » tab, right-click and select « Duplicate » to make a copy.

Rename the tab « Gender Actual (2) » to « Country Actual ».

In « Dimensions », drag the variable « Geography » onto « Gender » in « Columns » to replace « Gender » with « Geography ».

Here’s how to do a statistical A/B test when there are 3 categories. We’ll start with the classic method and then I’ll show you another way to do a Chi-square test with any number of categories.

Let’s start with the classic method. In the previous article we used an online tool that accepts only 2 categories, « Sample1 » and « Sample2 ». Here there are 3 categories, so we can’t use it. That’s why we’re going to use another online tool, click here.

In this online tool, we can enter the values without using the totals. That is, we enter only the number of observations in each category, exactly as they appear in our A/B test. I’m going to show you how to turn our A/B test into a table. That way, it will be easier to enter the values in the online tool without making mistakes.

Go to the « Show Me » tool at the top right.

Click on « text tables ».

Click on the « Swap Rows and Columns » button.

Cool, now you have a table arranged in exactly the same way as the online tool.

In the online tool, we will select 2 rows and 3 columns.

Since we have 3 categories and 2 possible results, we enter our values exactly as in the table we just created in Tableau.

Perfect, our table is ready. You can click on the « Calculate » button.

As you can see, we observe the same thing as with the other online tool. Our « p » value is less than 5%, which means the result is statistically significant.

This statistical significance means that these results hold for all of the bank’s clients and not just for the sample of 10 000 clients. We observed the same differences in the basic A/B test « Country », whose results are based solely on the sample of 10 000 clients. We can conclude that, among all of the bank’s clients, clients in Germany are the most likely to leave the bank. This is how we do things cleanly.
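If you want to check the online tool by hand, the same 2 by 3 test can be sketched in plain Python. This is only a minimal sketch, not the tool’s own code. The France and Germany counts are the ones used later in this article. The Spain counts (413 out of 2477) are not quoted anywhere : I derived them by subtracting from the sample totals (10 000 clients, of whom 1139 + 898 = 2037 left), so treat them as an assumption. With 2 rows and 3 columns the test has 2 degrees of freedom, and in that case the p-value has the simple closed form exp(-chi2 / 2).

```python
import math

# Rows: result (left the bank, stayed). Columns: France, Germany, Spain.
# France and Germany counts come from the article; the Spain counts are
# derived from the sample totals and are an assumption.
observed = [
    [810, 814, 413],     # clients who left the bank
    [4204, 1695, 2064],  # clients who stayed
]


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Count we would expect if country had no effect on leaving.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (count - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


chi2, dof = chi_square_statistic(observed)
# Closed form of the chi-square survival function, valid only for dof == 2.
p_value = math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p < 0.05: {p_value < 0.05}")
```

The closed form only works for 2 degrees of freedom, so for bigger tables you would need a statistics library or the online tool.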

As you saw, this online tool is limited to 5 by 5 tables, so you can’t use it when you have 6 categories or more. Fortunately, it’s possible to do a Chi-square test with any number of categories. It’s a special method, and to help you understand it, I’ll give you a theoretical explanation.

Here we have 3 countries : Germany, Spain and France.

What we’re trying to compare is the number of clients leaving the bank in each of these countries.

With our basic A/B test based on a sample of 10 000 clients, we obtained 16% for France, 32% for Germany and 17% for Spain. Now the question is : « Do we observe the same results across all of the bank’s clients ? », which means : « In general, does the country have a significant effect on the number of clients leaving the bank ? ». Germany has the highest rate of clients leaving the bank, so the idea is : « Why would we need to compare the 3 countries at the same time ? ».

If we do a statistical A/B test with Germany and France and we get a significant difference in the number of clients leaving the bank between these 2 countries, then that would mean that, in general, the country has a significant effect on the number of clients who leave the bank. Indeed, if we find by comparing Germany and France that Germans are more likely to leave the bank than the French, we can consider that Spain will not change anything. Germans will always be more likely to leave the bank than the French. Maybe there will be a different relationship between Germany and Spain, but there will always be a statistically significant difference between France and Germany, with a larger share of clients leaving the bank in Germany than in France.

Here is a way to check this logic. Imagine a test whose participants are German, Spanish and French, and imagine that it was run without looking at what is happening in Spain. Now you get the result and you ask yourself : « Would the results change if you added Spain ? ». The answer is « no » because there is no interdependence between Germany, Spain and France. That is, the decision to leave the bank in France and Germany doesn’t depend on Spain. Therefore, it’s quite correct to set 1 category aside and compare the 2 others. And as we now have 2 categories, we can do a Chi-square test with the online tool that we used in the previous article.

So let’s go back to our worksheet and put a country aside to compare only 2 countries. Select the « Country » tab.

What we observe is that the difference between Spain and France is very small, so it wouldn’t be interesting to do a Chi-square test between Spain and France. It’s more interesting to do a Chi-square test between Germany and France and to prove that there is a statistically significant difference between these 2 countries. This will be enough to conclude that the country has a statistically significant impact on the number of clients who leave the bank.
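To see which pair is worth testing, you can also rank the pairs by the gap between their exit rates. Here is a small sketch of that idea in Python. The France and Germany counts are the ones from this article; the Spain counts are derived from the sample totals and are an assumption on my side.

```python
from itertools import combinations

# (clients who left, total clients) per country.
# Spain's counts are derived from the sample totals, not quoted in the article.
counts = {"France": (810, 5014), "Germany": (814, 2509), "Spain": (413, 2477)}

# Exit rate per country, the percentage shown in the basic A/B test.
rates = {country: exited / total for country, (exited, total) in counts.items()}

# Absolute gap in exit rate for every pair of countries, largest first.
gaps = sorted(
    ((abs(rates[a] - rates[b]), a, b) for a, b in combinations(rates, 2)),
    reverse=True,
)

for gap, a, b in gaps:
    print(f"{a} vs {b}: {gap:.1%} difference")
```

The pair at the top of the list is the one to take to the Chi-square test, and the pair at the bottom is the one that is not worth the effort.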

Select the « Country Actual » tab.

We will use the online tool from the previous article, click here.

We will make a copy of « Country Actual » to have a bar chart with absolute values. Select « Country Actual », right-click and select « Duplicate ».

In « Show Me », select « horizontal bars ».

Remove « SUM(Number of Records) » from « Columns » and remove « Exited » and « Geography » from « Rows ».

In « Dimensions », move « Geography » to « Columns ».

In « Measures », move « Number of Records » to « Rows ».

In « Measures », move « SUM(Number of Records) » to « Label ».

In « Dimensions », move « Exited » to « Label ».

In « Dimensions », move « Exited » to « Colors ».

We also need total absolute values, which means the total number of clients in each country. There is a very fast way to get that. Right-click on the vertical axis and select « Add Reference Line ».

Then in « Value », click on the drop-down on the right and select « Sum » to have the total sum of the observations.

And in « Scope », select the « Per Cell » option to specify that you want the total sum for each category, that is, for each country.

Now, we have the total sum at the top of the bars. We will modify labels to have the absolute values. In « Label », we will change « Computation » to « Value » and click on the « OK » button.

Here’s how to enter the data :

For « Sample1 » in #success, you enter 810 because there are 810 clients in France who left the bank. For « Sample1 » in #trials, you enter 5014 because there are 5014 clients in France in total.

For « Sample2 » in #success, you enter 814 because there are 814 clients in Germany who left the bank. For « Sample2 » in #trials, you enter 2509 because there are 2509 clients in Germany in total.

Here is the verdict : « Sample2 is more successful ». « Sample2 » corresponds to the German clients and #success is : « yes, the client left the bank ». This verdict means that clients from Germany are more likely to leave the bank than clients from France. And look, there is something important, it’s « p<0.001 ». This means that « p » is strictly less than 0.001. As you can see, the « p » value is very small, which means that the difference is statistically significant.
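For the curious, here is a rough Python sketch of what such a 2-sample tool computes from #success and #trials. I don’t know the tool’s exact formula, so this is an assumption : a plain Pearson Chi-square test with 1 degree of freedom and no continuity correction. If the tool applies a correction, its statistic will differ slightly. The function name `two_sample_chi_square` is mine.

```python
import math


def two_sample_chi_square(success1, trials1, success2, trials2):
    """Pearson chi-square test (no continuity correction) for two samples.

    Returns the statistic and the p-value for 1 degree of freedom.
    """
    table = [
        [success1, trials1 - success1],
        [success2, trials2 - success2],
    ]
    total = trials1 + trials2
    col_totals = [success1 + success2, total - success1 - success2]
    chi2 = 0.0
    for row, row_total in zip(table, (trials1, trials2)):
        for count, col_total in zip(row, col_totals):
            # Count expected if both samples shared the same exit rate.
            expected = row_total * col_total / total
            chi2 += (count - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Sample1 = France, Sample2 = Germany, as entered in the online tool.
chi2, p_value = two_sample_chi_square(810, 5014, 814, 2509)
print(f"chi2 = {chi2:.1f}, p < 0.001: {p_value < 0.001}")
```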

Ooh, there’s another thing I wanted to show you : the tab « age » with the 2 bar charts in parallel.

As you can see, there are many categories (more than 5) because each category corresponds to a 5-year age group, with the bank’s clients aged from 15 to 90 years old. That is a lot of comparisons, but it would be a good exercise for you to find 2 categories whose comparison shows a statistically significant difference.

I’ll give you a hint : look at the 50 to 54 or the 35 to 39 age groups. In fact, you should compare all pairs of categories where you observe a difference in this basic A/B test. Do a basic A/B test with absolute values. Then do a Chi-square test to check whether the difference is statistically significant, I mean, whether the result is valid for all of the bank’s clients.

This is a way to statistically validate the insights we see in Tableau. You see, it’s not very difficult and it’s effective. Here is a way to find insights in Tableau and validate them.

-Steph

## Validate Data Mining In Tableau With A Chi-Square Test

In this article we will start using statistics. Don’t worry, we’ll do something simple : we’ll use the Chi-square test in a basic way. There is a special section to learn how to do statistics at an advanced level.

I’ll explain why we’re going to learn how to use the Chi-square test. The results we have with these 2 bar charts are good. We see on these 2 bar charts that age has a significant impact on the rate of clients leaving the bank. We also see in which age groups clients leave the bank the most and in which age groups clients leave the bank the least. With that we have good insights.

In the A/B test « Gender », we can see that there is a correlation between gender and the choice to leave the bank. But as I said before, this A/B test is basic. The results of a basic A/B test show us visually what is probably happening in reality, but we aren’t 100% sure of these results. To validate them, we need to use a statistical test like the Chi-square test.

Writing a report based on a basic A/B test is very risky and you can end up with completely false insights. I don’t advise you to do it (unless you want to leave your job). That’s why using the Chi-square test will help us get strong insights.

The Chi-square test will allow us to know whether our results are statistically significant. Our results are based on a sample of 10 000 clients, and the Chi-square test will tell us whether these results are due to chance or whether they can represent all the clients of the bank.

For example in our A/B test « Gender », we observed that in our sample of 10 000 clients, women are more likely to leave the bank compared to men.

Now, we aren’t sure if the results of this sample represent the behavior of all the bank’s clients.

To run a basic Chi-square test, we use an online tool. Click here.

On the internet, there are plenty of websites to do a Chi-square test, but we’ll use this one so that you can understand how it works. To do a Chi-square test, we need to use absolute values, and in our A/B test we have percentages.

Let’s go back to Tableau. We’ll create a new tab with a version of the A/B test with absolute values. In this way, we keep the A/B test with the percentages. Right-click on the « Gender » tab and select « Duplicate ».

Name the new tab « Gender Actual » to specify that it’s absolute values.

To have the absolute values, move « Number of Records » from « Measures » to the « Marks » area and drop it on top of « SUM(Number of Records) ».

Move « Number of Records » from « Measures » to « Rows », on top of « SUM(Number of Records) ».

Cool, we have our absolute values.

We also need total absolute values, which means the total number of men and women. There is a very fast way to get that. Right-click on the vertical axis and select « Add Reference Line ».

Then in « Value », click on the drop-down on the right and select « Sum » to have the total sum of the observations.

And in « Scope », you select « Per Cell » option to specify that you want the total sums for each category, male and female.

Now, we have the total sum at the top of the bars. We will modify labels to have the absolute values. In « Label », we will change « Computation » to « Value » and click on the « OK » button.

Perfect, we have the total number of observations at the top of each bar : 4543 women and 5457 men. We have what we need to use our online tool.

OK, I’ll explain how this tool works. « Sample1 » and « Sample2 » correspond to the 2 values of the independent variable « Gender ». You choose the order in which you enter the data : « Sample1 » for men or the opposite. In our case, we use « Sample1 » for women and « Sample2 » for men.

« #success » corresponds to the result Y=1, which means in our case « yes, the client left the bank ».

« #trials » is the total number of observations, which means the total number of women in « Sample1 » and the total number of men in « Sample2 ».

Here’s how you enter the data :

• For « Sample1 » in #success, you enter 1139 because there are 1139 women who left the bank. For « Sample1 » in #trials, you enter 4543 because there are 4543 women in total.

• For « Sample2 » in #success, you enter 898 because there are 898 men who left the bank. For « Sample2 » in #trials, you enter 5457 because there are 5457 men in total.

Here is the verdict : « Sample1 is more successful ». « Sample1 » corresponds to women and #success is : « yes, the client left the bank ». This verdict means that, of all the bank’s clients, women are more likely to leave the bank than men. And look, there is something important, it’s « p<0.001 ». This means that « p » is strictly less than 0.001.

« p » is the value that indicates whether an independent variable has a statistically significant effect on a dependent variable. In our case, the independent variable is « Gender » and the dependent variable is « Exited », which is : « yes, the client left the bank ». So « p » is strictly less than 0.001, which means that the independent variable « Gender » has a statistically significant effect on the dependent variable « Exited ». This shows us that out of the total number of bank’s clients, women are more likely to leave the bank than men.
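To make « p » a bit less mysterious, here is a minimal Python sketch of the calculation behind it. It uses the well-known shortcut formula for a 2 by 2 table, with 1 degree of freedom and no continuity correction. I’m assuming the online tool does something similar; its exact formula may differ slightly. The function name `chi_square_2x2` is mine.

```python
import math


def chi_square_2x2(success1, trials1, success2, trials2):
    """Chi-square statistic and p-value for a 2x2 table (1 degree of freedom).

    Uses the shortcut formula N * (a*d - b*c)**2 / (product of the margins),
    without continuity correction.
    """
    a, b = success1, trials1 - success1  # Sample1: left, stayed
    c, d = success2, trials2 - success2  # Sample2: left, stayed
    n = trials1 + trials2
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Probability of a statistic at least this large if gender had no effect.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Sample1 = women, Sample2 = men, with the counts read from Tableau.
chi2, p_value = chi_square_2x2(1139, 4543, 898, 5457)
women_rate, men_rate = 1139 / 4543, 898 / 5457
print(f"women: {women_rate:.0%}, men: {men_rate:.0%}")
print(f"chi2 = {chi2:.1f}, p < 0.001: {p_value < 0.001}")
```

The bigger the gap between the 2 rates compared to what chance alone would produce, the bigger the statistic and the smaller « p ».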

This is how we use the Chi-square test with this online tool. The principle is the same for all the online tools that you can find on Google or DuckDuckGo. You can repeat the instructions that I gave you with other tools and you will get the same results.

It’s cool : with the Chi-square test we validated the A/B test. To mark this A/B test as validated, we’ll color the tab green.

Right-click on the tab, select « Color » and select « Green ».

Perfect, now we’ll validate another A/B test. Select the « HasCreditCard » tab.

We’re going to create an A/B test « HasCreditCard » only with absolute values. To save time, right-click on « Gender Actual » tab and select « Duplicate ».

We’ll remove the green color on the tab « Gender Actual (2) ». Right-click on the tab and select « Color » and « None ».

Rename the tab to « HasCreditCard Actual ».

Move the variable « HasCrCard » over « Gender » in « Columns ».

Excellent, everything is ready to do a Chi-square test. We’ll remove the « Exited » labels to better see the absolute values. Click on them and drag them out.

Perfect, let’s go back to our online tool. In this case, « Sample1 » is « no », which means clients who don’t have a credit card, and « Sample2 » is « yes », which means clients who have a credit card.

That’s how you enter the data :

• For « Sample1 » in #success, you enter 613 because there are 613 clients who left the bank. For « Sample1 » in #trials, you enter 2945 because there are 2945 clients who don’t have a credit card.
• For « Sample2 » in #success, you enter 1424 because there are 1424 clients who left the bank. For « Sample2 » in #trials, you enter 7055 because there are 7055 clients who have a credit card.

Let’s look at the verdict, it’s « No significant difference ». « p » value is very high, it’s above 5%. This confirms that the independent variable « HasCrCard » has no statistically significant effect on the dependent variable « Exited ». That was the conclusion we had made when we had done the A/B test with percentages.
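Here is a small Python sketch to check this verdict by hand. Note that it is not the Chi-square formula itself : it uses a two-proportion z-test, which for a 2 by 2 table gives the same p-value as the Chi-square test without continuity correction (z squared equals the Chi-square statistic). The function name `two_proportion_z_test` is mine, and the online tool’s exact method may differ a little.

```python
import math


def two_proportion_z_test(success1, trials1, success2, trials2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = success1 / trials1, success2 / trials2
    # Rate we would see in both samples if the variable had no effect.
    pooled = (success1 + success2) / (trials1 + trials2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials1 + 1 / trials2))
    z = (p1 - p2) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Sample1 = no credit card, Sample2 = has a credit card.
z, p_value = two_proportion_z_test(613, 2945, 1424, 7055)
alpha = 0.05  # the 5% threshold used in the article
verdict = "significant" if p_value < alpha else "No significant difference"
print(f"z = {z:.2f}, p = {p_value:.2f} -> {verdict}")
```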

We had seen that there were 21% of « Exited » (clients who left the bank) in the category « no » and 20% in the category « yes ». With these results we concluded that the variable « HasCrCard » most likely had no impact on the rate of clients who left the bank. The Chi-square test confirms our conclusion and we can color the tab « HasCreditCard » green to say that it’s OK.

Right-click on the tab « HasCreditCard » => « Color » => « Green ».

Excellent, now, you can do a statistical A/B test with 2 categories. Soon, we will do statistical A/B tests with more than 2 categories.

-Steph

## Combine 2 charts

We’ll move to the next level. We’ll work with 2 bar charts in parallel for more efficient data mining. In a previous article, we created 2 different bar charts. The 1st was an A/B test (actually, it’s a classification test) that told us in which age range the clients were most likely to leave the bank. The 2nd was a bar chart showing the age distribution of clients in our sample of 10 000 clients.

Let’s go. We’re going to have an A/B test with age range and we’ll add a bar chart of the client distribution below. To add a bar chart, we must start by choosing what we want to keep and what we want to add. In our case, we want to keep the columns because they’re the same in the 2 bar charts.

And we just want to add a new line, so we will add a new variable in « Rows ». As we want to add a distribution bar chart, we will use the variable that corresponds to the number of observations, « Number of Records ».

In « Measures », move the variable « Number of Records » to « Rows », to the right of « SUM(Number of Records) ».

We have a 2nd bar chart below the 1st bar chart. As you can see, these 2 bar charts are in one column. « Columns » is « Age(bins) ». These 2 bar charts are in 2 different lines which are the lines that correspond to the 2 « SUM(Number of Records) » in « Rows ».
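If it helps, here is what the 2 bar charts compute, sketched in Python : for each 5-year age group, the number of clients (the distribution chart) and the share of clients who left (the A/B test chart). The records below are made up for illustration only, they are not the bank’s data, and the variable names are mine.

```python
from collections import defaultdict

# Hypothetical (age, exited) records, made up for illustration only.
records = [(31, 0), (33, 0), (34, 1), (36, 0), (38, 1), (52, 1), (53, 0)]

BIN_SIZE = 5  # 5-year age groups, as in the age bins used in « Columns »

clients = defaultdict(int)  # distribution chart: number of clients per group
exits = defaultdict(int)    # A/B test chart: clients who left per group

for age, exited in records:
    bin_start = (age // BIN_SIZE) * BIN_SIZE  # e.g. 33 falls in the 30 group
    clients[bin_start] += 1
    exits[bin_start] += exited

for bin_start in sorted(clients):
    rate = exits[bin_start] / clients[bin_start]
    label = f"{bin_start}-{bin_start + BIN_SIZE - 1}"
    print(f"{label}: {clients[bin_start]} clients, {rate:.0%} left")
```

Both charts share the same age groups, which is why they can share the same columns in Tableau.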

The space on the left has also changed. There is « All », which represents the 2 bar charts at the same time. It means that when you select « All », you make changes in the 2 bar charts.

Below this tab « All » we have 2 tabs. The 1st tab represents the 1st bar chart so the 1st « SUM(Number of Records) » in « Rows » and the 2nd tab represents the 2nd bar chart so the 2nd « SUM(Number of Records) » in « Rows ».

Which means that if you want to make changes on the 2 bar charts at the same time, you make the changes in the tab « All ». If you want to make changes only in the first bar chart, you select the first tab below « All ». If you want to make changes only in the 2nd bar chart, you select the second tab below « All ».

So if you change the color in the tab « All », our 2 bar charts will get the same color.

Select the « All » tab and click on « Colors ».

Click on « Edit Colors… » and select « Stayed ». Select the green color and click on the « OK » button.

As you can see, the color changed in the 2 bar charts.

Click on the tab of the 2nd bar chart.

Remove the « Exited » variable from « Colors » to remove colors only in the 2nd bar chart.

Remove the « SUM(Number of Records) » variable from « Label » to remove the labels only in the 2nd bar chart.

We will add color on this 2nd bar chart. Click on « Colors », click on « More colors… » and select the blue color. Click on the « OK » button.

Now, we would like to see the colors vary in intensity depending on the number of observations. Take « SUM(Number of Records) » from the 2nd line in « Rows » and holding « Ctrl » or « Command », move it to « Colors ».

Cool ! We will take care of the 1st bar chart. Select the tab of the 1st bar chart.

Click on « Colors ». Click on « Edit Colors… ». Select « Stayed ». Select the brown color and click on the « OK » button.

For more clarity, we will add labels to the 2nd bar chart. Click on the tab of the 2nd bar chart. Take « SUM(Number of Records) » from « Colors » and, holding « Ctrl » or « Command », move it to « Label ».

Perfect. Now we will change the location of the bar chart. We will put the 2nd bar chart instead of the 1st bar chart. According to the logic of « Rows » and « Columns », simply put the 2nd line « SUM(Number of Records) » to the left to pass in 1st line.

BOOM, the bar chart of the age distribution moves to the top because it’s now in the 1st line in « Rows ». With these changes, the tabs used to edit the bar charts have also changed order.

Observation

What we can observe with these bar charts is that, on the 1st bar chart, the majority of the bank’s clients are in the 30 to 34 and 35 to 39 age groups. For these 2 age groups, we see on the 2nd bar chart that clients aged 30 to 34 are less likely to leave the bank than clients aged 35 to 39. Look at ages 30 to 34 : the rate of clients leaving the bank is 8%, while in the 35 to 39 age group it is 13%.

In the 40 to 54 age groups, we see on the 2nd bar chart that the rate of clients leaving the bank is increasing and is above the average rate of clients leaving the bank (20%). But we see in the 1st bar chart that the number of clients decreases from one age group to the next between 40 and 54.

Do you remember the potential anomalies in age groups 75, 85 and 90 ? Let’s check them. In the 1st bar chart we can see that there are 11 clients in the age group of 80 to 84 years old, 2 clients in the age group of 85 to 89 years old and 2 clients in the age group of 90 to 94 years old. We can conclude that the observations in age groups 80, 85 and 90 aren’t very significant from a statistical point of view because 2 clients is negligible in this sample of 10 000 clients.

In the first age group of 15 to 19 years old, we can see that there are 49 clients, which is not very significant.

Comparing these 2 bar charts in parallel gives us additional insights.

-Steph

## Work With An Alias

In the last article, I showed you how to do a simple A/B test. We will continue with the result we had with the A/B test.

Here is the result of the A/B test. What is in orange is the percentage of clients who left the bank : 16% for men and 25% for women. What is in blue is the percentage of clients who stayed.

With our bar chart we can quickly see that women are more likely to leave the bank than men, all else being equal in our sample.

I remind you that this is a basic A/B test. There are 2 types of A/B test, the basic A/B test and the statistical A/B test. The statistical A/B test is done with a statistical test like the Chi-square test. In our case, the basic A/B test already gives us good insights.

To make our bar chart even easier to read, we will work with aliases.

The first thing we will do is improve the format. Right-click on the space between « Gender » and the bars and select « Format… ».

The « Sheet » tab appears. In « Worksheet », change the text size to « 12 ».

What is good with data mining is that we don’t have to make perfect charts because we don’t have to present them in a report to managers or in a meeting.

For example, if I had to present this chart in a report, it would be necessary to change the vertical title. But we only make a model so this change isn’t necessary.

Now, look at this rectangle. We can see « Exited », « 0 » and « 1 ».

« 0 » means that the client stayed in the bank and « 1 » means that the client left the bank. We can also see that clients who left the bank are in orange, so 25% for women and 16% for men. And the clients who stayed in the bank are in blue, so 75% for women and 84% for men.

We did an excellent basic A/B test but it would be much easier to read if we replaced « 0 » with « Stayed » and « 1 » with « Exited ».

With aliases we can do that. An alias replaces a raw value with a readable name when it is displayed. Here we replace the binary values « 0 » and « 1 » with « Stayed » and « Exited » because it’s not easy to remember the meaning of « 0 » and « 1 ».

There are 2 ways to do it : create a calculated field or use aliases.

We will use aliases. Know that aliases are not going to change the « 0 » and « 1 » in the dataset, this change is only in Tableau.

In « Dimensions », right-click on « Exited » and select « Aliases… ».

A small window appears. In this small window, you can create an alias for each value contained in the « Exited » variable.

The variable « Exited » contains the value « 0 » and « 1 ». For the value « 0 », we will create the alias « Stayed » to say that the client stayed in the bank. For the value « 1 », we will create the alias « Exited » to say that the client left the bank. Then click on the « OK » button.

Look, we can see the new values in the rectangle.

The values « 0 » and « 1 » have been replaced by « Stayed » and « Exited ».
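If you think in code, an alias behaves like a display-only lookup table : the raw values stay untouched and only the labels change. Here is a tiny Python sketch of that idea. It’s an analogy, not how Tableau works internally, and the sample rows are hypothetical.

```python
# Raw values of the "Exited" variable, exactly as stored in the dataset.
exited_values = [0, 1, 1, 0]  # hypothetical sample rows

# Display aliases, kept separate from the data itself.
aliases = {0: "Stayed", 1: "Exited"}

# Labels are built at display time; the raw list is never modified.
labels = [aliases[value] for value in exited_values]
print(labels)         # ['Stayed', 'Exited', 'Exited', 'Stayed']
print(exited_values)  # [0, 1, 1, 0]
```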

Now that the aliases are saved, we will take the variable « Exited » in « Dimensions » and move it to « Label ».

Look, we have our aliases « Stayed » and « Exited » on the bar chart.

In this way, it’s easier for people to read the bar chart without asking what the values « 0 » and « 1 » mean. « Stayed » and « Exited » are clearer.

Now you know how to use aliases so that people can easily read the binary values of a chart.

-Steph

## How To Plagiarize Others Smartly

I watched a video by Olivier Roland and there is good stuff in it.

You can plagiarize others smartly : not by copying what they do, but by drawing efficient inspiration from it. It’s possible if you copy something people don’t think about.

Olivier Roland talked with a youtuber, and this youtuber told him he had completely plagiarized Casey Neistat. Casey Neistat is a famous youtuber with more than 80 million views. The truth is that this guy copied the structure of Casey Neistat’s videos. The point is to copy the structure of something that works, not the content.

# Star Wars

You know George Lucas’s Star Wars. These films are famous, but George Lucas copied a structure. Joseph Campbell wrote a book called « The Hero with a Thousand Faces ».

In this book, Joseph Campbell analyzed a dozen myths from around the world, old myths we still hear today. He did it for several reasons. First, for him, if myths that are thousands of years old are still told today, there is a natural selection for myths. Only myths that are interesting to the human mind have survived. Second, if we analyze myths from Africa, Europe, Oceania, Asia, America, etc., we can find what they have in common. Joseph Campbell found what he calls the « monomyth ». It’s a narrative structure in stories that speaks deeply to the human mind in a universal way.

When George Lucas discovered Joseph Campbell’s work, he had already begun to write Star Wars, but it was a revelation and he rewrote a big part of the script following Joseph Campbell’s structure exactly. There are a lot of movies and books that use this structure, like The Matrix. It’s funny to compare the structure of Star Wars and The Matrix with the monomyth’s structure.

# Structure

It’s always smart to copy structures that work. When you see something that works, don’t look at the content as 90 % of people do. Try to analyze the structure : in which order the content is presented, what rhythm, what type of interruption, etc.

Olivier Roland did improvisational theater for 3 years. The concept is to ask the spectators for a subject and create a story from it. It’s not easy, but every story has the same structure, so you just take the subject and tell it with this structure. You improvise the content without improvising the structure, it’s cool.

This is the monomyth of Joseph Campbell :

1. A hero receives a call to adventure. In Star Wars episode 4, it’s 2 droids that bring Luke a message from Princess Leia.

2. The hero refuses the call to adventure. In this way we feel more connected with him/her. We appreciate him/her because he/she is like everybody else. He/she is not really courageous. A hero is someone who becomes a hero. In Star Wars episode 4, Luke doesn’t really refuse. It’s his uncle and aunt who tell him to stay at the farm.

3. A trigger encourages the hero to accept the call to adventure. For Luke, it’s his family getting killed.

4. During the journey, there are a lot of problems and incidents, and the hero meets people who help him/her. Typically there are archetypes such as the old sage, the princess, etc. There are also enemies who put the hero to the test.

5. At the end, there is the great confrontation with the villain.

Here is the structure of all classic stories. Braveheart has the same structure. You can watch a youtuber and try to understand the structure of the videos. It isn’t easy to do reverse engineering because we aren’t used to doing it. We don’t think about it, but you have to know that the structure makes up at least half of the success of something.

We can copy a structure at 100 %. It can account for at least half of the success without anyone noticing it and without ethical problems, because a structure is a method. It’s cool, right ?

# Use it for you

What do you absolutely want to succeed at in your life ? Who has reached these goals and what method did he/she use ? What is the structure of this method ? These questions can help you discover a key to your success. You can find the sources that allowed this person to build this structure. It’s so interesting to find the inspirations of the people who inspire us.

What structure did you find ?

-Steph