## Validate Data Mining In Tableau With A Chi-Square Test

I have just enrolled in a Data Science course on Udemy and I learned good stuff.

In this article, we’ll start using statistics. Don’t worry, we’ll keep it simple and use the Chi-square test in a basic way. There is a special section for learning statistics at an advanced level.

First, I’ll explain why we need the Chi-square test. The results from these 2 bar charts are good. They show that age has a significant impact on the rate of clients leaving the bank. They also show in which age groups clients leave the bank the most and in which they leave the least. That already gives us good insights.

In the A/B test « Gender », we can see a correlation between gender and the choice to leave the bank. But as I said before, this A/B test is basic. A basic A/B test shows us visually what is probably happening in reality, but we can’t be 100% sure of the results. To validate them, we need a statistical test like the Chi-square test.

Building a report on a basic A/B test alone is very risky because you can end up with completely false insights. I don’t advise you to do it (unless you want to leave your job). That’s why using the Chi-square test will help us get solid insights.

The Chi-square test tells us whether our results are statistically significant. Our results are based on a sample of 10 000 clients, and the test will tell us whether they could simply be due to chance or whether they can represent all the clients of the bank.

For example in our A/B test « Gender », we observed that in our sample of 10 000 clients, women are more likely to leave the bank compared to men.

Now, we aren’t sure if the results of this sample represent the behavior of all the bank’s clients.

To run a basic Chi-square test, we’ll use an online tool.

There are plenty of websites on the internet for doing a Chi-square test, but we’ll use this one so that you can understand how it works. A Chi-square test needs absolute values (counts), and our A/B test currently shows percentages.
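If you’re wondering why the test needs absolute values, here is a small Python sketch (it isn’t part of the Tableau workflow). It uses the standard Pearson Chi-square formula for a 2x2 table without continuity correction, which is an assumption because the online tool may compute things slightly differently. The function name `chi_square_p_value` and the two samples are made up for the illustration : both show 25% versus 15%, only the sample size changes.

```python
import math


def chi_square_p_value(success1, trials1, success2, trials2):
    """Pearson chi-square test for a 2x2 table (1 degree of freedom,
    no continuity correction). Returns (chi2, p_value)."""
    a, b = success1, trials1 - success1   # group 1: success / no success
    c, d = success2, trials2 - success2   # group 2: success / no success
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper tail of the chi-square
    # distribution is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical samples: same percentages (25% vs 15%), different sizes.
_, p_small = chi_square_p_value(10, 40, 6, 40)
_, p_large = chi_square_p_value(1000, 4000, 600, 4000)

print(f"40 clients per group:   p = {p_small:.3f}")
print(f"4000 clients per group: p = {p_large:.3g}")
```

The percentages are identical, yet only the larger sample gives a small « p ». That’s why the tool asks for counts and not for percentages.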

Let’s go back to Tableau. We’ll create a new tab with a version of the A/B test that shows absolute values. That way, we keep the A/B test with the percentages. Right-click on the « Gender » tab and select « Duplicate ».

Name the new tab « Gender Actual » to indicate that it shows absolute values.

To get the absolute values, drag « Number of Records » from « Measures » to the « Marks » area and drop it on top of « SUM(Number of Records) ».

Then drag « Number of Records » from « Measures » to « Rows » and drop it on top of « SUM(Number of Records) ».

Cool, we have our absolute values.

We also need total absolute values, which means the total number of men and women. There is a very fast way to get that. Right-click on the vertical axis and select « Add Reference Line ».

Then in « Value », click on the drop-down on the right and select « Sum » to get the total sum of the observations.

In « Scope », select the « Per Cell » option to specify that you want the total for each category, male and female.

Now we have the total sum at the top of the bars. We’ll modify the labels to show the absolute values. In « Label », change « Computation » to « Value » and click on the « OK » button.

Perfect, we have the total number of observations at the top of each bar : 4543 women and 5457 men. We have what we need to use our online tool.

OK, I’ll explain how this tool works. « Sample1 » and « Sample2 » correspond to the two categories of the independent variable « Gender ». You choose the order in which you enter the data : « Sample1 » can be men or women. In our case, we use « Sample1 » for women and « Sample2 » for men.

« #success » corresponds to the result Y=1, which means in our case « yes, the client left the bank ».

« #trials » is the total number of observations, which means the total number of women in « Sample1 » and the total number of men in « Sample2 ».

Here’s how you enter the data :

• For « Sample1 » in #success, you enter 1139 because there are 1139 women who left the bank. For « Sample1 » in #trials, you enter 4543 because there are 4543 women in total.

• For « Sample2 » in #success, you enter 898 because there are 898 men who left the bank. For « Sample2 » in #trials, you enter 5457 because there are 5457 men in total.

Here is the verdict : « Sample1 is more successful ». « Sample1 » corresponds to women and #success means « yes, the client left the bank ». This verdict means that, among all the bank’s clients, women are more likely to leave the bank than men. And look, there is something important : « p<0.001 ». This means that « p » is strictly less than 0.001.

« p » is the value that indicates whether an independent variable has a statistically significant effect on a dependent variable. It’s the probability of seeing a difference at least this large in a sample if the variable had no real effect. In our case, the independent variable is « Gender » and the dependent variable is « Exited », which means « yes, the client left the bank ». Here « p » is strictly less than 0.001, well below the usual 5% threshold, so the independent variable « Gender » has a statistically significant effect on the dependent variable « Exited ». This tells us that, among all the bank’s clients, women are more likely to leave the bank than men.
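If you want to double-check the tool’s verdict by hand, here is a minimal Python sketch of the same test. It’s an assumption that the online tool uses the plain Pearson Chi-square formula without continuity correction, so the statistic may differ slightly from what the tool computes internally. The counts are the ones we read in the « Gender Actual » tab.

```python
import math

# Observed counts from the « Gender Actual » tab.
left = {"women": 1139, "men": 898}
total = {"women": 4543, "men": 5457}

grand_total = sum(total.values())
overall_rate = sum(left.values()) / grand_total  # share of all clients who left

chi2 = 0.0
for group in total:
    observed_left = left[group]
    observed_stayed = total[group] - observed_left
    # Counts we would expect if gender had no effect on leaving.
    expected_left = total[group] * overall_rate
    expected_stayed = total[group] * (1 - overall_rate)
    chi2 += (observed_left - expected_left) ** 2 / expected_left
    chi2 += (observed_stayed - expected_stayed) ** 2 / expected_stayed

# 2 categories x 2 outcomes -> 1 degree of freedom,
# so the p-value is erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.2f}")
print(f"p < 0.001 ? {p_value < 0.001}")
```

The bigger the gap between the observed counts and the counts expected under « no effect », the bigger the Chi-square statistic and the smaller « p ».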

That’s how we use the Chi-square test with this online tool. The principle is the same with the other online tools you can find on Google or DuckDuckGo. If you repeat these instructions with another tool, you should get the same results.

Cool, with the Chi-square test we’ve validated the A/B test. To mark this A/B test as validated, we’ll color the tab green.

Right-click on the tab, select « Color » and select « Green ».

Perfect, now we’ll validate another A/B test. Select the « HasCreditCard » tab.

We’re going to create an A/B test « HasCreditCard » only with absolute values. To save time, right-click on « Gender Actual » tab and select « Duplicate ».

We’ll remove the green color on the tab « Gender Actual (2) ». Right-click on the tab and select « Color » and « None ».

Rename the tab « HasCreditCard Actual ».

Move the variable « HasCrCard » over « Gender » in « Columns ».

Excellent, everything is ready for a Chi-square test. We’ll remove the « Exited » labels to see the absolute values better. Click on the label and drag it out.

Perfect, let’s go back to our online tool. In this case, « Sample1 » is « no », which means clients who don’t have a credit card, and « Sample2 » is « yes », which means clients who have a credit card.

Here’s how you enter the data :

• For « Sample1 » in #success, you enter 613 because 613 clients without a credit card left the bank. For « Sample1 » in #trials, you enter 2945 because there are 2945 clients who don’t have a credit card.
• For « Sample2 » in #success, you enter 1424 because 1424 clients with a credit card left the bank. For « Sample2 » in #trials, you enter 7055 because there are 7055 clients who have a credit card.

Let’s look at the verdict : « No significant difference ». The « p » value is very high, above 5%. This confirms that the independent variable « HasCrCard » has no statistically significant effect on the dependent variable « Exited ». That’s the conclusion we reached when we did the A/B test with percentages.

We saw that 21% of clients in the category « no » and 20% in the category « yes » had « Exited » (left the bank). From these results, we concluded that the variable « HasCrCard » most likely had no impact on the rate of clients leaving the bank. The Chi-square test confirms our conclusion, so we can color the « HasCreditCard » tab green to say that it’s OK.
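The same check can be written as a comparison of two proportions, which is close to how the tool’s « #success » and « #trials » fields are laid out. This is a sketch with a hypothetical helper `two_proportion_z_test`, not the tool’s actual code. For 2 categories and 2 outcomes, the square of this z statistic equals the Pearson Chi-square statistic (without continuity correction), so both approaches give the same « p ».

```python
import math


def two_proportion_z_test(success1, trials1, success2, trials2):
    """Two-sided z-test comparing two proportions, using the pooled rate."""
    rate1 = success1 / trials1
    rate2 = success2 / trials2
    pooled = (success1 + success2) / (trials1 + trials2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / trials1 + 1 / trials2))
    z = (rate1 - rate2) / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return rate1, rate2, z, p_value


# Sample1 = clients without a credit card, Sample2 = clients with one.
rate_no, rate_yes, z, p_value = two_proportion_z_test(613, 2945, 1424, 7055)

print(f"no credit card:  {rate_no:.1%} left the bank")
print(f"has credit card: {rate_yes:.1%} left the bank")
print(f"z = {z:.2f}, z squared = {z * z:.2f}")
print(f"p above 5% ? {p_value > 0.05}")
```

The two rates round to the 21% and 20% we saw on the bar chart, and « p » stays well above 5%, in line with the « No significant difference » verdict.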

Right-click on the tab « HasCreditCard » => « Color » => « Green ».

Excellent, now, you can do a statistical A/B test with 2 categories. Soon, we will do statistical A/B tests with more than 2 categories.

Share this article if you think it can help someone you know. Thank you.

-Steph

## A Practical Tip To Validate Your Approach

I have just enrolled in a Data Science course on Udemy and I learned good stuff.

How was the A/B test « Number Of Product » ? Easy or difficult ?

Here is the result I found.

I think you noticed something bizarre. There is an anomaly. We would expect that the more products a client has, the more satisfied the client is with the bank, so this type of client should stay with the bank.

In the first 2 bars we can see that a client who has 1 product is more likely to leave the bank than a client who has 2 products. But when a client has 3 or 4 products, we see a huge rate of clients leaving the bank.

Look, there is a bizarre little detail. In the 2nd bar, we can’t see the « Exited » label. That’s because there is no room in the orange part for the text. To keep things simple, we’ll remove the « Exited » label. Drag the « Exited » text label and drop it outside.

Perfect, we can read the percentages. On the 1st bar, we can see that among clients who have 1 product, 28% left the bank. On the 2nd bar, we can see that among clients who have 2 products, 8% left the bank. This shows us that clients with 1 product are more likely to leave the bank than clients with 2 products.

And for the next bars, we observe an anomaly. On the 3rd bar, we can see that among clients who have 3 products, 83% left the bank. On the 4th bar, we can see that among clients who have 4 products, 100% left the bank. There is clearly a problem, and we need a deeper analysis to understand what is going on.

As Data Scientists, we need to explain what happens in bars 3 and 4. Usually, when a client has 3 or 4 banking products, it means he/she is satisfied and loyal to the bank. But in our case it’s the opposite, because a high rate of these clients left the bank. This is the time to do a deeper analysis.

The first thing to analyze is the quality of the data. There is a very big anomaly, and it may be because something insignificant in our data disturbs the statistics. For example, it’s possible that when the bank selected the clients for this sample, there were very few clients with 4 products and all of them left the bank. Sometimes chance creates anomalies, and you have to pay attention to these effects of chance : they don’t seem important, but they can lead to false interpretations.

To start, we will check the number of clients with 4 products.

In « Measures », drag « Number of Records » (which gives the number of observations) onto « Label ».

We can see on the first 2 bars that many clients with 1 or 2 products were selected for our sample. For clients with 3 or 4 products, far fewer clients were selected.

There are 220 clients with 3 products and 60 clients with 4 products. These small numbers of clients probably explain why we observe these anomalies.

In this sample of randomly selected clients, there are very few clients with 4 products and they all left the bank. In this situation, we can suspect an effect of chance. When things like that happen, you have to be very careful not to jump to conclusions and misinterpret the data.

The conclusion is that a lot of clients were selected for categories 1 and 2. For categories 3 and 4, few clients were selected, so we can’t do accurate statistics. We need a deeper analysis for these categories of clients with 3 and 4 products.

Now, let’s put the percentages back on the bar chart. Click on the « Back » button.

Or click on « SUM(Number of Records) » and drag it outside.

We saw that there is an anomaly, so it’s worth adding a comment to remind us to do a more in-depth analysis of columns 3 and 4.

Right-click between the bar chart’s title and the bars. Select « Annotate » then « Areas… ».

A window appears. In this window, you write « Low observation in last 2 categories » and click on the « OK » button.

Click on the comment and move it onto bars 3 and 4.

The next time you work on this bar chart, you’ll see this comment, which will remind you to seriously analyze the clients who have 3 and 4 products.

# Validate our approach

It’s time to show you how to validate an approach and how to validate the data. For this we will create a new A/B test.

Duplicate this worksheet with a right-click on the « NumberOfProducts » tab and select « Duplicate ».

And rename the tab « Validation ».

For this tab, we will erase the comment. Select the comment and press the « Delete » button on your keyboard.

Everything is ready. The idea is to find a variable that doesn’t affect our results, that is, a variable that has no impact on a client’s decision to leave or stay with the bank.

Take, for example, the variable « Customer Id ». A client’s identification number has no influence on the client’s decision to stay with or leave the bank.

We’ll do an A/B test with the last digit of the « Customer Id » and we’ll check that the same proportion of clients leaves the bank in each of the 10 categories of the last digit of the « Customer Id ». The 10 categories are the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Let’s go. To start, we’ll create the variable that contains the last digit of the « Customer Id ». To get this variable, we’ll create a « Calculated Field ».

Right-click on « Customer Id », select « Create » and click on « Calculated Field ».

Name the calculated field « LastDigitOfCustID ». In the text field, we use the « RIGHT » function with « Customer Id » in parentheses to select the last character of the « Customer Id ». In our case, the last character of the « Customer Id » is the last digit.

Here is the code to write in the text field : RIGHT([Customer Id], 1)

Oooops, you see there is a small mistake => The calculation contains errors.

There is an error in the formula because « Customer Id » is a number variable and the « RIGHT » function applies to a variable of type « STRING ».

To use the « RIGHT » function, we’ll convert « Customer Id » into a string. We’ll use the « STR » function with « Customer Id » in parentheses.

Here is the code to write in the text field : RIGHT(STR([Customer Id]), 1)

Then click on the « OK » button.

Now, you can see that our calculated field « LastDigitOfCustID » is in « Dimensions ».

Click on « LastDigitOfCustID » and move it on top of « NumOfProducts » in « Columns ».

Now we have a new bar chart, and we see that for every last digit of the « Customer Id » there is about the same proportion of clients leaving the bank. The proportions don’t match the 20% average exactly, but these slight variations aren’t important.

Seeing this uniform distribution allows us to validate our data because the data are homogeneous.
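Outside Tableau, the same sanity check can be sketched in a few lines of Python. The function name `exit_rate_by_last_digit` and the tiny sample below are hypothetical and only show the mechanics. The expression `str(customer_id)[-1]` plays the role of the « STR » and « RIGHT » functions.

```python
from collections import defaultdict


def exit_rate_by_last_digit(records):
    """records: iterable of (customer_id, exited) pairs, exited being 0 or 1.
    Returns {last_digit: share of clients who left the bank}."""
    left = defaultdict(int)
    total = defaultdict(int)
    for customer_id, exited in records:
        # Convert the number to text first, then keep the last character.
        last_digit = str(customer_id)[-1]
        total[last_digit] += 1
        left[last_digit] += exited
    return {digit: left[digit] / total[digit] for digit in sorted(total)}


# Tiny made-up sample, only to show the shape of the input and output.
sample = [(1001, 0), (1002, 1), (1004, 1), (1014, 0),
          (1018, 0), (1022, 1), (1031, 0), (1048, 1)]

rates = exit_rate_by_last_digit(sample)
for digit, rate in rates.items():
    print(f"last digit {digit}: {rate:.0%} left the bank")
```

With only 8 made-up clients the rates jump around a lot. With a real sample of 10 000 clients, each digit should land close to the overall average if the data are homogeneous.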

# Conclusion

Here’s how you can check the homogeneity of your data. You take a variable that has no impact on whether a client leaves or stays with the bank. The example we did with the last digit of the « Customer Id » is excellent. We were able to verify that each category of this variable has about the same proportion of clients leaving the bank. Since that’s the case, we can validate our data.

Imagine another result. When we do the test with the last digit of the « Customer Id », we observe that for one of the digits, the rate of clients who left is much higher than the average. This would show us that there is a problem in our data, because it indicates an anomaly.

You can find other ways to verify your data by using other « insignificant variables » to see if the distribution is homogeneous. But be careful when you select an « insignificant variable » because there may be traps.

Here is an example. If you create a variable that takes the first letter of the first name, the distribution will not be homogeneous. The reason is simple, there are many more people who have a name that starts with the letter « M » than with the letter « Y ».

Share this article if you think it can help someone you know. Thank you.

-Steph

## Topic Of Content To Publish On Internet

I watched one of Olivier Roland’s videos and I learned good stuff.

Many people wonder how to find content to publish on a blog or a Youtube channel. Before creating my blog, I asked myself this question too, and maybe you’re asking yourself the same question today.

You may want to create a blog or a Youtube channel for several reasons, and I think one of them is that you want to build your own company. In this case, I advise you to let these 3 things guide your content :

• Passion – It must be a subject that fascinates you.

• Skill – It would be nice if it’s something you already have skills in.

• Economic potential – If your idea has no economic potential, you will not be able to build your own company.

There is something surprising about economic potential. It’s possible to create a profitable blog or Youtube channel on a subject that doesn’t interest you. For example, there is a person who has a blog that teaches math to help students pass their exams. This person isn’t passionate about mathematics, but it’s not something he hates. I know that in a few years he will stop this company, even if it works well, because he will have lost motivation. That happens often when it’s not a passion.

This case is an exception, because it’s very difficult to succeed on a subject that doesn’t fascinate you. Statistics show that 95% of bloggers give up after 6 months. These statistics come from « Meta-blogs Technorati » and are based on millions of blog creations.

I really advise you to create content about one of your passions, otherwise you risk ending up among the 95% of people who give up after 6 months. Outside regulated areas like medicine, you can create a blog in an area where you don’t have skills and learn little by little. You can do that, but you have to be transparent with your audience about your skills.

That’s what I do. I had no knowledge of anatomy and biomechanics in sport when I started publishing articles. Every day I do my research and I learn little by little. The best example is Data Science. I had no knowledge, I paid for a class, and I show you what I’m learning even though I haven’t finished this class yet. At the beginning of my articles, you see my sources, and that’s my strength.

Which means that you can be passionate and have no skills at the start. But it’s not possible to have a profitable blog or Youtube channel if the sector where you start has no economic potential.

To find out if your sector has economic potential, there are several criteria to check :

• Are there forums or Facebook groups on this topic ?

If you find forums or Facebook groups on the subject you want to cover, it’s because there are people who spend several minutes every day discussing it. These people are likely to visit your blog or Youtube channel if you share content that may interest them.

• Are there already blogs or Youtube channels on this subject ?

If you find blogs or Youtube channels that have existed for several years and have a large number of subscribers, it’s a sign that you can become one of their competitors.

• Are there many people who are passionate about this ?

Creating a profitable blog or Youtube channel is like creating a company : you need to do market research. What is interesting today is that you can do it on the internet. Search the internet for your future competitors and analyze them.

There are still a dozen criteria but with these 3 criteria, you’ll have a good base to select your passion that has the best economic potential to create your company with a blog or a Youtube channel.

Share this article if you think it can help someone you know. Thank you.

-Steph

## Schools To Be An Entrepreneur

I watched one of Olivier Roland’s videos and I learned good stuff.

There are schools with specific courses about entrepreneurship. But several studies show that only 10% of students create their own companies. The vast majority go looking for a job.

It’s possible that these students create their own companies 5 or 10 years later but these are statistics that are hard to get.


Listen to this podcast about « How should Business Schools prepare students for startup ? ».

What is interesting is to talk to entrepreneurs and to see that the vast majority of them have never studied at a school about entrepreneurship.

Being an entrepreneur is like any other exciting job : we never stop learning. Being an entrepreneur is a life mission, and you need to get better day after day.

Even if there is no effective school for learning entrepreneurship, we know there is a curriculum for becoming a businessman.

# MBA

An MBA (Master of Business Administration) is a huge curriculum. It often takes 1-2 years of your life after some professional experience, which means that for 1-2 years you don’t earn a salary. On top of that, an MBA costs between \$ 50 000.- and \$ 100 000.-. You see, you really need to be motivated to do it.

It’s clear that having an MBA makes it possible to earn a better salary, and it also gives you business knowledge, but there are interesting criticisms of the results obtained in the field.

An MBA shows your employers that you’re able to sacrifice your life for your job. During an MBA, you work 70-80 hours a week to get it, which means you’re able to work the same hours for your employer.

An MBA helps you build a network, but now, with the internet, there is another way to build a network. Interesting people don’t look for people who went to the best schools; they look for people who have projects and who show that they’re capable of carrying them out.

Among the people who are critical of the MBA, we find Seth Godin. Seth Godin is a well-known author and marketing expert in the USA. He has an MBA from Stanford and says : « Having an MBA is learning the best techniques for running a company in the 1990s for 2 years, while the world runs as fast as it can ».

We also have Peter Thiel. Peter Thiel is the co-founder of PayPal and Palantir and he was the first external investor in Facebook. He is also critical of what an MBA brings in the field.

What is certain is that an MBA is not really about creating entrepreneurs but about creating good employees able to run a company of the 1990s.

At the moment, there is no effective school in the world to learn to be a good entrepreneur. Being an entrepreneur is a state of mind and a life decision. It’s all learned in the field, by reading books and talking to more experienced entrepreneurs.

There is an interesting book, « The Personal MBA », that lets you learn the basics of an MBA without spending 1-2 years of your life without pay, working 70-80 hours per week, and it saves you between \$ 50 000.- and \$ 100 000.-. All this just by reading this book.

But I’m open-minded and it’s possible that someone took a course on entrepreneurship in a school. If that’s your case, it would be cool if you shared the experience you had in school and how it helped you to create and grow your company.

Share this article if you think it can help someone you know. Thank you.

-Steph