## Chi-Square Test With More Than 2 Categories

I have just enrolled in a Data Science course on Udemy  and I learned good stuff.

In this article, we will do a Chi-square test with more than 2 categories. We will use the A/B test « Country » which has 3 categories which corresponds to 3 countries : German, Spain and France. Select « Gender Actual » tab, make a copy with a right-click and select « Duplicate ».

Name the tab « Gender Actual (2) » by « Country Actual ».

In « Dimensions », move the variable « Geography » over « Gender » in « Columns » to replace « Gender » with « Geography ».

Here’s how to do an A/B statistical test when there are 3 categories. We’ll start with the classic method and then I’ll show you another way to do Chi-square test with any number of categories.

Let’s start with the classical method. In this case, there are 3 categories so we can’t use the online tool of the previous article. In the previous article we used an online tool with only 2 categories « Sample1 » and « Sample2 ». That’s why we’re going to use another online tool, click here  .

In this online tool, we can enter the values without using the total values. That is, we enter only the number of observations in each category. We simply need to enter the values that are on our A/B test. And I’m going to show you how to turn our A/B test into a table. In this way, it will be easier to enter the values in the online tool without making any mistakes.

Go to the « Show me » tool at the top right.

Click on « text tables »

Click on « Swap Rows ans Columns » button.

Cool, now you have a table arranged in exactly the same way as the online tool.

In the online tool, we will select 2 rows and 3 columns.

As we have 3 categories and 2 possible results, we enter our values exactly as in the table we just created on Tableau.

Perfect, our table is ready. You can click on the « Calculate » button.

As you can see, we observe the same thing as the other online tool. There is our indicator « p » value which is less than 5%. Which means there is a meaning.

This statistical significance means that these results are valid for the total number of the bank’s clients and not just for the sample of 10 000 clients. We observe similar differences with A/B test « Country » whose results are based solely on the sample of 10 000 clients. We can conclude that in the total number of the bank’s clients, it’s the clients in Germany who are more likely to leave the bank. This is how we do things cleanly.

You saw, this online tool limited by 5 by 5 tables so you can’t use this tool when you have 6 categories or more. But fortunately it’s possible to do Chi-square test with any number of categories. It’s a special method and for you to understand that, I’ll give you a theoretical explanation.

Here we have 3 countries : German, Spain and France.

What we’re trying to compare is the clients number leaving the bank in each of these countries.

With our basic A/B test based on a sample of 10 000 clients, we obtained 16% for France, 32% for Germany and 17% for Spain. Now the question is : « Do we observe the same results on the total clients number of the bank ? », it means : « In general, does the country have a significant effect on the clients number leaving bank ? ». Germany has the largest number of clients leaving the bank so the idea is : « Why would we need to compare the 3 countries at the same time ? ».

If we do an A/B test statistical test with Germany and France and we get a significant difference in the clients number leaving the bank between these 2 countries, then that would mean that in general, the country has a significant effect on the clients number who bank. Indeed, if we find by comparing Germany and France that the Germans are more likely to leave the bank than the French, we can consider that Spain will not change anything. Germans will always be more likely to leave the bank than the French. Maybe there will be a different relationship between Germany and Spain but there will always be a statistically significant difference between France and Germany with a larger number of clients leaving the bank in Germany than France.

Here is a way to confirm that this logic is true. There is a test and the participants of this test are German, Spanish and French. Imagine that this test was done without looking at what is happening in Spain. Now you get the result and you ask yourself the question : « Would the results changed if you added Spain ? ». The answer is « no » because there is no interdependence between Germany, Spain and France. That is, the decision to leave the bank in France and Germany doesn’t depend on Spain. And therefore, it’s quite correct to separate the categories by putting 1 aside to compare the 2 others. And as now we have 2 categories, we can do a Chi-square test with the online tool that we used in the previous article.

So let’s go back to our worksheet and put a country aside to compare only 2 countries. Select « Country » tab.

What we observe is that the difference between Spain and France is very small, so it wouldn’t be interesting to do a Chi-square test between Spain and France. It’s more interesting to do a Chi-square test between Germany and France and to prove that there is a statistically significant difference between these 2 countries. This will be enough to conclude that the country has a statistically significant impact on the clients number who leave the bank.

Selects « Country Actual » tab.

We will use the online tool of the previous article, click here  .

We will make a copy of « Country Actual » to have a bar chart with absolute values. Select « Country Actual », right-click and select « Duplicate ».

In « Show Me », select « horizontal bars ».

Removes « SUM (Number of Records )» from « Columns » and removes « Exited » and « Geography » from « Rows ».

In « Dimensions », move « Geography » in « Columns ».

In « Measures », move « Number of Records » to « Rows ».

In « Measures », move « SUM(Number of Records) » in « Label ».

In « Dimensions », move « Exited » in « Label ».

In « Dimensions », move « Exited » in « Colors ».

We also need total absolute values, which means the total number of men and women. There is a very fast way to get that. Right-click on the vertical axis and select « Add Reference Line ».

Then in « Value », click on the drop-down on the right and select « Sum » to have the total sum of the observations.

And in « Scope », you select « Per Cell » option to specify that you want the total sums for each category, male and female.

Now, we have the total sum at the top of the bars. We will modify labels to have the absolute values. In « Label », we will change « Computation » to « Value » and click on the « OK » button.

Here’s how to enter the data :

For « Sample1 » in #success, you enter 810 because there are 810 people who left the bank. For « Sample1 » in #trials, you enter 5014 because there are 5014 people in total.

For « Sample2 » in #success, you enter 814 because there are 814 people who left the bank. For « Sample2 » in #trials, you enter 2509 because there are 2509 people in total.

Here is the verdict : « Sample2 is more successful ». « Sample2 » corresponds to German’s clients and #success is :« yes, the client left the bank ». This verdict means that of all the clients from German are more likely to leave the bank than clients from France. And look, there is something important, it’s « p<0.001 ». This means that the « p » is strictly less than 0.001. As you can see, « p » value is very small, which concludes that the tests are statistically significant.

Ooh, there’s another thing I wanted to show you with the tab « age » with the 2 bar charts in parallel.

As you can see, there are many categories (more than 5) because each category corresponds to a 5-year ago group with clients of the bank aged from 15 to 90 years old. This is a lot of comparison but it would be a good exercise for you to find what are the 2 categories to compare that shows that there is a significant statistic difference.

I give you a hint, compare slices from 50 to 54 years old or from 35 to 39 years olds. In fact, you should compare all peer categories where you observe difference on this basic A/B test. Do a basic A/B test with absolutes values. Then do a Chi-square test to check if the difference is statistically significant, I mean, if the result is valid for the total number of bank’s clients.

This is a way to statistically validate the insights we see onTableau. You see, it’s not very difficult and it’s effective. Here is a way to find insights on Tableau and validate them.

-Steph

## Work With An Alias

I have just enrolled in a Data Science course on Udemy and I learned good stuff.

In the last article, I showed you how to do a simple A/B test. We will continue with the result we had with the A/B test.

Here is the result of the A/B test. What is in orange is the percentage of men who left the bank, it’s 16%. What is in blue is the percentage of women who left the bank, it’s 25%.

With our bar chart we can quickly see that women are more likely to leave the bank than men, all the rest being equal in our sample.

I remind you that this is a basic A/B test. There are 2 type of A/B test, the basic A/B test and the statistical A/B test. The statistical A/B test is done with a statistical test like the KHI-2 test. For our case, the basic A/B test already give us good insights.

To make our bar chart even easier to read, we will work with aliases.

The first thing we will do is we will improve the format. Right-click on this space between « Gender » and the bars and select « Format… ».

The « Sheet » tab appears. In « Worksheet » changes the text size to « 12 ».

What is good with data mining is that we aren’t obligated to make a perfect chart because we don’t have to present them in a report to managers or a meeting.

For example, if I had to present this chart in a report, it would be necessary to change the vertical title. But we only make a model so this change isn’t necessary.

Now, look at this rectangle. We can see « Exited », « 0 » and « 1 ».

« 0 » means that the client stayed in the bank and « 1 » means that the client left the bank. We can also see that client who left the bank are in orange so 25% for women and 16% for men. And the client who stayed in the bank are blue so 75% for women and 84% for men.

We did an excellent basic A/B test but it would be much easier to read if we replace « 0 » with « Stayed » and « 1 » with « Exited ».

With aliases we can do that. An alias is to replace the binary results « 0 » and « 1 » with « Stayed » and « Exited » because it’s not easy to remember the meaning of « 0 » and « 1 ».

There are 2 ways to do it : create a calculated field or use aliases.

We will use aliases. Know that aliases are not going to change the « 0 » and « 1 » in the dataset, this change is only in Tableau.

In « Dimensions », right-click on « Exited » and select « Aliases… ».

A small window appears. In this small window, you can create an alias for each value contained in the « Exited » variable.

The variable « Exited » contains the value « 0 » and « 1 ». For the value « 0 », we will create the alias « Stayed » to say that the client stayed in the bank. For the value « 1 », we will create the alias « Exited » to say that the client left the bank. Then click on the « OK » button.

Look, we can see the new values in the rectangle.

The values « 0 » and « 1 » have been replaced by « Stayed » and « Exited ».

Now that the aliases saved, we will take the variable « Exited » in « Dimensions » and move it to « Label ».

Look, we have our aliases « Stayed » and « Exited » on the bar chart.

In this ways, it’s easier for people to read the bar chart without asking what meaning of « 0 » of « 1 » values. « Stayed » and « Exited » are clearer.

Now you know how to use aliases so that people can easily read the binary values of a chart.

-Steph

## Connect Tableau to An Excel File

I have just enrolled in a Data Science course on Udemy  and I learned good stuff.

Now that you downloaded the dataset in Excel file format, we’ll use Tableau to analyze this.

We’ll connect to the dataset using the « Excel » option.

Now that you downloaded the dataset which is in Excel format, we will use Tableau to analyze this.

We will connect the the dataset using the « Excel » option.

Select the dataset in Excel file you downloaded and click on the « Open » button.

And as you can see, there is only one tab.

There is only one tab because in the Excel file there is only one tab. If in the Excel file there were several tabs, they would all have been listed here.

It’s necessary to check that all data is « OK ». Scroll the lines and columns to see that. Everything is good, there are 10 000 lines as in the Excel file.

Excellent, we connected our Excel source file to Tableau.

Now, click on the « Sheet1 » tab to access the Worksheet.

We’ll have a little fun.

For example, let’s look at what we have with « Geography »

« Geography » is the dimension that gives us the country, so we’ll make a map to see where the clients from the bank come from.

Move « Geography » on this area.

Ah, it’s odd, nothing happens ?!? Why ? Look, when you look at « Geography », it’s not recognized by Tableau as a geographic dimension. Here,, you can see that Tableau recognized « Geography » as a dimension of type text with the label « ABC »

Don’t worry, we can fix it quickly. Click on the arrow of « Geography ».

Selects « Geography Roles » and « Country Region » so that the « Geography » dimension become geography’s type.

Now you remove « Geography » made a table with a click-and-drag.

Look, we have a globe next to « Geography ». This means that Tableau recognize that « Geography » is a geographic dimension.

Since « Geography » is a dimension of geography type, there are 2 new measures that have appeared : Latitude (generated) and Longitude (generated).

Put « Geography » in this space with a click and drag.

Look, this time there is a map.

You have the possibility of zooming with these buttons.

The map is fine but we’ll remove the blue dots and modify the map so that it’s easier to read.

We’ll color the countries and display the clients number that has in each country.

We know that in the dataset each line corresponds to a client. What we can do is use the « number Of Record », it means the total of number of observations. In this way, we can visualize the number of lines attended to each country and the number of lines attended to each country is the number of client per country.

Then, take the « number Of Record » and move it to « Colors ».

Boom ! Each country has a color.

Look at the color contrasts. France has a darker color which indicates that it is the country with the most clients. Germany and Spain have almost the same colors which indicates that they have almost the same clients number.

But we want to know the clients number per country without have the cursor on the country.

To do this we’ll add a label. Take « number Of Record » and moves it to « Label ».

We’ll increase the text’s size and put in bold. Click on « Label », click on « Font » and select « 12 » and bold.

It’s cool, we can see the clients number per country. You have the possibility to zoom on a region. Click on « Zoom area » and drag and drag to select the region on the map.

Now we can see that the majority of clients are in France, this represents almost half of the total clients number of the dataset. Germany and Spain have almost the same number of clients.

-Steph

I have just enrolled in a Data Science course on Udemy and I learned good stuff.

Podcast:

In the last article, we created our calculated field « TotalSales » that you can see in «Measure » zone.

In Tableau, the calculated field is very used (almost every time) because in most case the data don’t give the value you want to show.

The calculated field « TotalSales » is a simple example to make you understand how it works but know that you can do things more complex. I’ll show you that later.

In this article, I’ll show you how to manipulate colors because it’s an important element to communicate. With colors, people will understand more quickly what you want to explain to them.

Imagine that you have to show this bar chart to the manager who handles the bonuses. By putting a little color, a little art, you could improve the reading of this bar chart.

To use colors, click on this button.

You can change the color with the basic colors.

Or you can have more colors by clicking here.

If you have a picture in the background, you have the possibility to change the opacity to have a transparent effect of colors.

You can add a border, change the border’s color, etc.

But what would be nice to do is to have bars with different colors.

To start, take « Rep » and move it on « Colors ».

With this, there is a unique color for each representative.

There is also another method to do that. Instead of taking « Rep » and moving it to « Color », you can click « Rep » here.

If you move it to « Colors », you’ll break everything because « Rep » will no longer be in the « Columns » zone.

To avoid this, press Ctrl or Command on your keyboard and click « Rep » to make appear the sign « + ». Now that you made a copy of « Rep », move it to « Colors ». It’s like making a copy/paste from « Rep » to « Colors ».

With this method, « Rep » is always in the « Columns » zone. This is a method that is very practical when there are many dimensions.

It’s possible to change representative’s colors by clicking here.

As you can see, there are several choices of palettes.

You can test the « color blind » palette which is very useful for color blind people. To select this palette, click « Assign Palette » and « Apply ».

When a palette has fewer colors than representatives, you will have a message saying that some colors will be duplicated. But this is not a problem because there are names below the bars.

Now we want to see something else with our bar chart. Press “Ctrl” or “Command” on your keyboard and click on SUM(TotalSales) to display the « + » sign. Then move SUM(TotalSales) to « Colors » to replace « Rep ».

As you can see SUM(TotalSales) has different colors. The colors are on a continuous basis which means that the more sales there are, the darker the color.

For our case, this is not useful because the size of the bars represents the sales number but for other situations, this is useful.

The problem now is that there are duplicate colors and because of this, the Manager could misinterpret the results. An alternative approach would be to ensure that the Manager understands the results.

The solution is to take « Region » (by pressing “Ctrl” or “Command” on your keyboard) and move it to « Colors ».

You can also take « Region » (with “Ctrl” or “Command”) and move it to SUM(TotalSales) to replace SUM(TotalSales).

With that, the bars are colored by region.

That way, you can clearly see the 3 regions through colors that are unique to each region and you can see the total sales per representatives with the size of the bar.

This is a small example so that you can understand the basics to manipulate colors in Tableau. There are still more complex techniques to manage the colors that I will show you later.

Plays with the colors so you can fully understand how it works. You could find your favorite palette and find your style. Have fun.

-Steph

## Navigate In Tableau

I have just enrolled in a Data Science course on Udemy  and I learned good stuff.

We’ll explore Tableau’s tools

From the connection manager, we’ll go into the Tableau’s workspace.

Click on the « Sheet1 » tab at the bottom of the window.

Here is the Tableau’s workspace.

The 2 important elements of the workspace are « Data » on the left and the workspace on the right. It’s in the workspace that you’ll create tables and charts.

« Data » divided into 2 zones : dimensions and measures.

The dimensions and measures are 2 different rules that will allow you to manipulate data.

Tableau sets the numerical values in « measures » and the categorical or quantitative variables in « dimension ». This is the Tableau’s settings by default.

There is also another way to explain « dimension » and « measures ». The « dimensions » are independent variables and the « measures » are dependent variables.

For exemple, « Units » is a measure, it’s the number of items sold per product. « Region » is a dimension, it’s the geographic region where the product sold. With 2 elements we can know how many items sold by region. This means that « Region » is an independent variable and « Units » is a dependent variable because it will be grouped by region.

But if you don’t like it, you can move the entities between dimension and measures and the opposite by click and drag.

In the menu bar, at the top, there is « File » where you can open and save file.

« Data » to connect to new source files.

« Worksheet » is the workspace to create analyzes

« Dashboard » is a combination of worksheet

« Story » is a combination of worksheet and dashboard

« Analysis » to specify how you want to do your analysis on your workspace

« Map » to add maps to the workspace

« Format » contains formatting options

Now, let’s study the workspace.

In the workspace, the main elements are « Columns » and « Rows ». This is where you decide which data goes in columns and rows in your worksheet.

You can also choose different format for these elements like colors, size, text level of detail and tooltips (useful tool optional).

Let’s do a test. Use data from « Region » (which is in « dimension »). Move « Region » with a click and drop to the center of your workspace. Now, « Region » is in the element « Rows ».

A table appears in your workspace.

You put a dimension in your workspace. Now put a measure in your workspace.

Uses the « Units » data. Move « Units » with a click and drop next to the « Region » column.

As you can see, Tableau automatically put « Region » in the « Rows » element and the « Units » data aggregated by region. In this way, you can tell how many items were sold by region.

Now, what you can do is to move « SUM(Units) » to the « Columns » element.

And then, you have a « bar chart » to see how many items have been sold by region. You can enlarge the graphic with a click and drop.

Let’s look at the tools that are in « Show Me » zone.

Click on « Pie chart » to have this chart’s type.

Click on « Size » icon and drag from left to right you can increase the chart’s size.

In this chart, each region has a color and proportion of items sold by region.

You can also test the « bubble chart ». Tableau organizes the data automatically and everything and placed in the « Marks ».

You can test « Treemaps » chart. This is the same principle as « bubble chart » but it’s rectangles instead of circles.

As you can see in « Show Me », there are charts disabled. This is because you need some elelments in your data to be able to activate them.

For example for the « Area chart », you need « date »data to activate it.