## Create Bins and View Distributions

I have just enrolled in a Data Science course on Udemy  and I learned good stuff.

It’s cool, you finished the 1st part. Now we’re going to do more deep Data Mining analysis with this bank’s dataset.

To make these analyzes more deep, we’ll create a more statistical approach.

To do that we will create a new tab.

For this new tab, we want to understand how client distributed according to their age. Is there a majority of young or old people ?

Move the variable « Age » in « Columns ».

As we want to see the distribution of client ages, we need to use the variable « Number of Records » to see the number of observations. Move the variable « Number of Record » to « Rows ».

Boom, we have a chart but there is only one point on the top right. What happened is that Tableau took the sum of the ages of all the bank’s clients and the sum of all the « Number of Records », it means the total number of clients, 10 000 clients.

We’ll find a solution but before we’ll change the format to better see the chart. Right-click in the middle of the chart and select « Format ».

For the font’s size, select « 12 ».

Here you can see that the total age is 39 218 but that’s not what we’re looking for. What we want to see is the number of clients for each age.

I’ll explain what’s going on. We took the aggregated sums of our variables. Aggregate means that we took the total sum of the variable for each category. We added the ages but in fact we want to see the total number of observations for each age separately.

To have that, just click on the arrow in « SUM(Age) » in « Columns ».

Then select « Dimensions »

You see, Tableau doesn’t take the aggregated sum of ages but it takes ages separately. We have a curve that shows us the continuous distribution of our clients ages. That is to say, for each age, the curve gives is the number of clients of this age.

We’ll look at the dataset. Right-click on « Churn Modelling » and select « View Data… ».

There is window that appears that shows us the data in detail. If you scroll to the right, you will find the column « Age ».

We see that the ages rounded. As all ages rounded, Tableau is able to group clients by age. By positioning the mouse on the curve, we can see that there are 200 clients who are 26 years old.

If in the dataset, ages weren’t rounded, you would have seen clients with 26.5 or 26.3 years. It would create a lot of irregularity, there would be plenty of spikes with lots of variations.

Oooooh look, there is a variation that isn’t normal.

Let’s analyze it in detail. Around this peak, we see that there are 348 clients who are 29 years old.

Here, 404 clients who are 31 years old.

And this peak down that shows us that there are 327 clients who are 30 years old.

How to explain this irregularity ? It’s possible that many people of 29 years old are about to turn 30 years old and many people of 31 years old who just had 31 years old. It’s chance that make us have inaccuracies. You may have other inaccuracies if you data isn’t precise and rounded. In our case, the ages are rounded but we want to get rid of our small irregularity that we see on our curve.

There is way to see our distribution without our irregularities, it’s « bins ». « Bins » consists of grouping the information into different categories. That is we’re going to regroup our clients in different age groups.

Right-click on « Age » in « Measures ». Select « Create » and select « Bins… ».

A window appears. We’ll group our clients in 5-years increments. In « Size of bins », write « 5 » and click on the « OK » button.

As you can see, the variable « Age » has remained in « Measures » but there is a new variable in « Dimensions ».This is the variable we created « Age(bins) ».

Our « Age(bins) » variable was correctly placed in « Dimensions » because it is a category variable because each category corresponds to a 5-year age group.

For example, one category is 20 to 24 age group. Now we’ll create a new distribution based on « bins ».

To do that, we’ll remove the variable « Age » from « Columns » with a click and drag outside.

You move the variable « Age(bins) » from « Dimensions » to « Columns ».

Note

In this case, it’s not possible to directly replace « Age » by « Age(bins) » over « Age » on « Columns ». This is because « Age » is a measure and « Age(bins) is a dimension.

That’s nice distribution, it’s usually the type of distribution (chart) we see in economics or mathematics. The difference with the old chart is that this chart is discrete. This chart is discrete because the clients grouped by age group while the previous chart was continuous.

On this distribution (chart), each bar corresponds to an age range. For example, this bar corresponds to the 25-29 age group.

Now, we’ll change the colors.

In « Row », move « SUM(Number of Record) » while holding down the « Ctrl » or « Command » key on your keyboard to « Colors ».

We get our distribution in blue but we’ll change the color to red. Click on « Colors » and click on « Edit Colors »

In the window that appears, click on the blue square on the right to display the color pallet.

Select the red color and click on the « OK » button.

Click on the « OK » button of the « Edit Colors » window.

To facilitate the reading of the bar chart, we’ll add the number of clients in each age group. In « Row », move « SUM (Number of Record) » while holding the « Ctrl » or « Command » key on your keyboard to « Label ».

That’s it, we can see how many clients there are in each age group.

We see that the dominant bar is the 35-39 age bracket and the second dominant bar is the 30-34 age bracket. Overall, we can see that most clients are between 25 and 40 years old, which seems consistent.

On our bar chart, we have absolute values. We’ll replace that with percentages. Click in the little arrow in « SUM(Number of Records) » in « Label » and you select « Add Table Calculation… » but I’ll show you another way to do it.

Instead of clicking « Add Table Calculation… », click on « Quick Table Calculation » and select « Percent of total ».

It’s cool, we have the exact percentage of people in each age bracket. Now, we can see that in the 25 to 40 age group, we have 20 + 23 +17= 60% of clients.

I’ll show you one last thing.You can change the size of the slices easily, just click on « Age(bins) » and select « Edit ».

In the windows, you can change the size of the slices (bins). Put « 10 » instead of « 5 » to get 10-years slices. Click on the « OK » button.

Now, we have a distibution with fewer slices and the dominant slice is 30 to 39 years old.

Well, it was just to show you how to change the size of bins. To go back to the old distribution with the 5-years slices, click on « Back » button.

As you can see, the values on bars are in percentages but the values on the axis are in absolutes values. Here is an exercise that I ask you to do : « Put the values of the axis in percentage ». I’ll give you the answer the next article.

-Steph

## A Pratical Tip To Validate Your Approach

I have just enrolled in a Data Science course on Udemy  and I learned good stuff.

How was the A/B test « Number Of Product » ? Easy or difficult ?

Here is the result I found.

I think you noticed there was something bizarre. There is an anomaly. We imagine that the more the client has products, the more the client is satisfied with the bank so this type of clients should stay in the bank.

In the first 2 bars we can see that a client who has 1 product is more likely to leave the bank than a client who has 2 products. But when a client has 3 or 4 products, we see a huge rate of clients leaving the bank.

Look, there is a little bizarre detail. In the 2nd bar, we can’t see the « Exited » label. This is because there is no place in the orange part to put the text. To make it simpler, we’ll remove the label « Exited ». Drag and drop on the « Exited » text label to the outside.

Perfect, we can read the percentages. On the 1st bar, we can see that among the client that have 1 products, 28% left the bank. On the 2nd bar, we can see that among clients who have 2 products, 8% left the bank. This show us that clients who have 1 products are more likely to leave the bank than clients with 2 products.

And for the next bars, we observe an anomaly. On the 3rd bar, we can see that among the clients who have 3 products, 83% left the bank. On the 4th bar, we can see that among clients who have 4 products, 100% left the bank. We clearly see that there is a problem and we need to do a deeper analysis to understand what is going on .

As a Data Scientist, we need to explain what happens in bars 3 and 4. Usually when a client has 3 or 4 banking products, that means he/she is satisfied and is loyal to the bank. But in our case, it’s the opposite because there is a high rate of client who left the bank. This is the time to do deeper analysis.

The first thing to analyze is the quality of the data. There is a very big anomaly and it may be because there is something insignificant in our data that disturbs the statistics. For example, it’s possible that when the bank selected these clients in this sample, there were very few clients with 4 products and all those clients with 4 products left the bank. Sometimes chance can create anomalies and you have to play attention to these effects of chance because they don’t seem important but they can create false interpretations.

To start, we will check the number of clients with 4 products.

In « Measure », move « Number Of Records » (which gives the number of observations) on « Label ».

We observe on the first 2 bars than many clients with 1 or 2 products selected for our sample. For clients with 3 or 4 products, we can see that there were fewer clients selected for our sample.

There are 220 clients with 3 products and 60 clients with 4 products. These small number of clients probably explain why we observe these anomalies.

In this sample of randomly selected clients, there are very few clients with 4 products and they all left the bank. In this situation, we can confirm that it’s a chance. When thing like that happen, you have to be very careful not to make conclusion too fast and make misinterpretations.

The conclusion is that a lot of clients have been selected for category 1 and 2. For category 3 and 4, there have been few clients selected so we can’t do accurate statistics. We need to do deeper analyze for these categories of clients with 3 and 4 products.

Now, let’s put the percentage back on the bar chart. Click on the « Back » button.

.

Or do a click and drag of « SUM(Number of Record) » to outside.

We saw that there is an anomaly and what is interesting to do is to have a comment to remember to do a more in-depth analysis of columns 3 and 4.

Right-click between the bar chart’s title and the bars. Select « Annotate » then « Areas… ».

A window appears. In this window, you write « Low observation in last 2 categories » and click on the « OK » button.

Click on the comment and move it on bars 3 and 4.

The next time you work on this bar chart, you will see this comment that will remind you to seriously analyze client who have 3 and 4 products.

# Validate our approach

It’s time to show you how to validate an approach and how to validate the data. For this we will create a new A/B test.

Duplicate this worksheet with a right-click on the « NumberOfProducts » tab and select « Duplicate ».

And rename the tab « Validation ».

For this tab, we will erase the comment. Select the comment and press the « Delete » button on your keyboard.

Everything is ready, the idea is to find a variable that doesn’t affect our results. That is a variable that has no impact on a client’s decision to leave or stay in the bank.

Take for example, the variable « Customer Id ». Client’s identification number has no influence on the client’s decision to stay or leave the bank.

We’ll do an A/B test with the last digit of the « Customer Id » and we’ill check that there is the same clients proportion who leave the bank in the 10 categories of the last digit of the « Customer Id ». The 10 categories are the numbers 0,1,2,3,4,5,6,7,8,9.

Let’s g.To start, we will create the variable that contains the last digit of the « Customer Id ». To have this variable, we will create a « Calculated Field ».

Right-click on « Customer Id », select « Create » and click on « Calculated Field ».

Name the calculated field « LastDigitOfCustID ». In the text field, we use the « RIGHT » function with « Customer Id » in parenthesis to select the last character of the « Customer Id ». In our case, the last character of the « Customer Id » is the last digit.

Here is the code to write in the text field : Right ({Customer Id},1)

Oooops, you see there is a small mistake => The calculation contains errors.

There is an error in the formula because « Customer Id » is a number variable and the « RIGHT » function applies to a variable of type « STRING ».

To use the « RIGHT » function, we will convert « Customer Id » into a string. We will use the « STR » function with « Customer Id » in parenthesis.

Here is the code to write in the text field

And click on the « OK » button : Right (STR({Customer Id}),1).

Now, you can see that our calculated field « LastDigitOfCustID » is in « Dimensions ».

Click on « LastDigitOfCustID » and move it on top of « NumOfProducts » in « Columns ».

Now we have a new bar chart and we see that for every last digit of the « Customer Id » there is about the same proportion of clients leaving the bank. All these proportions don’t correspond exactly to the average of 20% but these slight variations aren’t important.

Seeing this uniform distribution allows us to validate our data because these data are homogenous.

# Conculsion

Here’s how you can check the homogeneity of your data. You take a variable that has no impact on the fact that a client leaves or stays in the bank. The example we did with the last digit of the « Customer Id » is excellent. We were able to verify that in each of the categories taken by this variable, if there was the same proportion of clients leaving the bank. As is the case, we can validate our data.

Imagine another result. When we do the test with the last digit of the « Customer Id », we observe that for one of the numbers, the rate of clients who left is really higher than the average. This shows us that there is a problem in our data because it indicates an anomaly.

You can find other ways to verify your data by using other « insignificant variables » to see if the distribution is homogeneous. But be careful when you select an « insignificant variable » because there may be traps.

Here is an example. If you create a variable that takes the first letter of the first name, the distribution will not be homogeneous. The reason is simple, there are many more people who have a name that starts with the letter « M » than with the letter « Y ».