## A Practical Tip To Validate Your Approach

I have just enrolled in a Data Science course on Udemy and I learned good stuff.

How was the « Number Of Products » A/B test? Easy or difficult?

Here is the result I found.

You probably noticed something strange: there is an anomaly. We would expect that the more products a client has, the more satisfied the client is with the bank, so clients with many products should stay.

In the first 2 bars we can see that a client who has 1 product is more likely to leave the bank than a client who has 2 products. But when a client has 3 or 4 products, we see a huge rate of clients leaving the bank.

Look, there is a small odd detail. In the 2nd bar, we can’t see the « Exited » label because there isn’t enough room in the orange part to fit the text. To keep things simple, we’ll remove the « Exited » label: drag the « Exited » text label and drop it outside.

Perfect, now we can read the percentages. On the 1st bar, we can see that among the clients who have 1 product, 28% left the bank. On the 2nd bar, we can see that among the clients who have 2 products, 8% left the bank. This shows us that clients who have 1 product are more likely to leave the bank than clients with 2 products.

For the next bars, we observe an anomaly. On the 3rd bar, we can see that among the clients who have 3 products, 83% left the bank. On the 4th bar, we can see that among the clients who have 4 products, 100% left the bank. There is clearly a problem, and we need a deeper analysis to understand what is going on.

As Data Scientists, we need to explain what happens in bars 3 and 4. Usually, when a client has 3 or 4 banking products, it means he/she is satisfied and loyal to the bank. But in our case it’s the opposite: a high proportion of these clients left the bank. This is the time to do a deeper analysis.

The first thing to analyze is the quality of the data. The anomaly is very big, and it may come from a group that is too small to be statistically meaningful and that distorts the statistics. For example, it’s possible that when the bank selected the clients for this sample, there were very few clients with 4 products and all of them happened to have left the bank. Chance can sometimes create anomalies, and you have to pay attention to these effects: they may not seem important, but they can lead to false interpretations.

To start, we will check the number of clients with 4 products.

In « Measure », drag « Number Of Records » (which gives the number of observations) onto « Label ».

On the first 2 bars, we observe that many clients with 1 or 2 products were selected for our sample. Far fewer clients with 3 or 4 products were selected.

There are 220 clients with 3 products and 60 clients with 4 products. These small numbers of clients probably explain why we observe these anomalies.

In this sample of randomly selected clients, there are very few clients with 4 products and they all left the bank. In this situation, chance is a plausible explanation. When things like that happen, you have to be very careful not to draw conclusions too fast and misinterpret the results.

The conclusion is that a lot of clients were selected for categories 1 and 2. For categories 3 and 4, few clients were selected, so we can’t compute accurate statistics. We need a deeper analysis for the clients with 3 and 4 products.
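
Outside Tableau, the same check (group size next to exit rate) can be sketched in a few lines of Python. This is only an illustration: the records below are invented toy values, and the `summarize` helper is a hypothetical name, not something from the course.

```python
from collections import defaultdict

# Toy records (invented for illustration): (number_of_products, exited)
records = [
    (1, 1), (1, 0), (1, 0), (1, 0),
    (2, 0), (2, 0), (2, 0), (2, 1),
    (3, 1), (3, 1),
    (4, 1),
]

def summarize(rows):
    """Return {category: (number_of_clients, exit_rate)}."""
    counts = defaultdict(int)
    exits = defaultdict(int)
    for category, exited in rows:
        counts[category] += 1
        exits[category] += exited
    return {c: (counts[c], exits[c] / counts[c]) for c in sorted(counts)}

# Print the group size next to the percentage for every category.
for category, (n, rate) in summarize(records).items():
    print(f"{category} product(s): {n} clients, {rate:.0%} exited")
```

Showing the number of clients beside each percentage makes it obvious when a striking rate rests on only a handful of observations.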

Now, let’s put the percentage back on the bar chart. Click on the « Back » button.

Or click on « SUM(Number of Records) » and drag it outside.

We saw that there is an anomaly, so it’s worth adding a comment as a reminder to do a more in-depth analysis of bars 3 and 4.

Right-click between the bar chart’s title and the bars. Select « Annotate » then « Areas… ».

A window appears. In this window, you write « Low observation in last 2 categories » and click on the « OK » button.

Click on the comment and move it over bars 3 and 4.

The next time you work on this bar chart, you will see this comment, which will remind you to seriously analyze the clients who have 3 and 4 products.

# Validate our approach

It’s time to show you how to validate an approach and how to validate the data. For this we will create a new A/B test.

Duplicate this worksheet with a right-click on the « NumberOfProducts » tab and select « Duplicate ».

And rename the tab « Validation ».

For this tab, we will erase the comment. Select the comment and press the « Delete » button on your keyboard.

Everything is ready. The idea is to find a variable that doesn’t affect our results, that is, a variable that has no impact on a client’s decision to leave or stay in the bank.

Take, for example, the variable « Customer Id ». A client’s identification number has no influence on the client’s decision to stay or leave the bank.

We’ll do an A/B test with the last digit of the « Customer Id » and we’ll check that the same proportion of clients leaves the bank in each of the 10 categories of the last digit. The 10 categories are the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9.

Let’s go. To start, we will create the variable that contains the last digit of the « Customer Id ». To get this variable, we will create a « Calculated Field ».

Right-click on « Customer Id », select « Create » and click on « Calculated Field ».

Name the calculated field « LastDigitOfCustID ». In the text field, we use the « RIGHT » function with « Customer Id » in parentheses to select the last character of the « Customer Id ». In our case, the last character of the « Customer Id » is the last digit.

Here is the code to write in the text field: RIGHT([Customer Id], 1)

Oops, there is a small mistake: Tableau reports « The calculation contains errors ».

There is an error in the formula because « Customer Id » is a numeric variable and the « RIGHT » function only applies to a variable of type « STRING ».

To use the « RIGHT » function, we will convert « Customer Id » into a string. We will use the « STR » function with « Customer Id » in parentheses.

Here is the code to write in the text field: RIGHT(STR([Customer Id]), 1)

Then click on the « OK » button.
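
For readers who think in code, the same conversion can be sketched in Python. This is not Tableau syntax, only an analogy: the identifier value and the `last_digit_of` helper are made up for the example.

```python
def last_digit_of(customer_id: int) -> str:
    """Return the last digit of a numeric identifier as a one-character string."""
    # An integer cannot be sliced, much like RIGHT cannot be applied to a
    # number field, so the value is converted to a string first (the role
    # that STR plays in the Tableau formula).
    return str(customer_id)[-1]

customer_id = 15634602  # hypothetical identifier stored as a number
last_digit = last_digit_of(customer_id)
print(last_digit)  # prints 2
```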

Now, you can see that our calculated field « LastDigitOfCustID » is in « Dimensions ».

Click on « LastDigitOfCustID » and move it on top of « NumOfProducts » in « Columns ».

Now we have a new bar chart and we see that for every last digit of the « Customer Id » there is about the same proportion of clients leaving the bank. All these proportions don’t correspond exactly to the average of 20% but these slight variations aren’t important.

Seeing this uniform distribution allows us to validate our data because the data are homogeneous.
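
Here is a minimal Python sketch of the same validation idea, under stated assumptions: the client list is toy data built so that leaving the bank is unrelated to the last digit, and the 5-point tolerance is an arbitrary threshold chosen for the illustration, not a statistical rule.

```python
from collections import defaultdict

# Toy data (invented): 1,000 sequential ids. A client "exited" when the id
# without its last digit is a multiple of 5, so the outcome does not depend
# on the last digit.
clients = [(cid, 1 if (cid // 10) % 5 == 0 else 0) for cid in range(1000, 2000)]

def exit_rate_by_last_digit(rows):
    """Return {last_digit: share of clients who exited}."""
    totals = defaultdict(int)
    exits = defaultdict(int)
    for cid, exited in rows:
        digit = str(cid)[-1]
        totals[digit] += 1
        exits[digit] += exited
    return {d: exits[d] / totals[d] for d in sorted(totals)}

overall = sum(exited for _, exited in clients) / len(clients)
rates = exit_rate_by_last_digit(clients)

TOLERANCE = 0.05  # arbitrary threshold chosen for this sketch
suspicious = [d for d, r in rates.items() if abs(r - overall) > TOLERANCE]

print(f"overall exit rate: {overall:.0%}")
print("digits that deviate from the average:", suspicious)
```

An empty list of suspicious digits plays the same role as the uniform bars in the chart. On real data the rates will vary slightly from one digit to the next, so the tolerance has to leave room for those small variations.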

# Conclusion

Here’s how you can check the homogeneity of your data. You take a variable that has no impact on whether a client leaves or stays in the bank. The example we did with the last digit of the « Customer Id » is excellent. We were able to verify that, in each of the categories taken by this variable, the same proportion of clients left the bank. Since this is the case, we can validate our data.

Imagine another result: when we do the test with the last digit of the « Customer Id », we observe that for one of the digits, the rate of clients who left is much higher than the average. This would show us that there is a problem in our data because it indicates an anomaly.

You can find other ways to verify your data by using other « insignificant variables » to see if the distribution is homogeneous. But be careful when you select an « insignificant variable » because there may be traps.

Here is an example. If you create a variable that takes the first letter of the first name, the distribution will not be homogeneous. The reason is simple, there are many more people who have a name that starts with the letter « M » than with the letter « Y ».

-Steph

## Work With An Alias

I have just enrolled in a Data Science course on Udemy and I learned good stuff.

In the last article, I showed you how to do a simple A/B test. We will continue with the result we had with the A/B test.

Here is the result of the A/B test. The orange part of each bar is the percentage of clients who left the bank: 16% for men and 25% for women.

With our bar chart we can quickly see that women are more likely to leave the bank than men, all else being equal in our sample.

Remember that this is a basic A/B test. There are 2 types of A/B test: the basic A/B test and the statistical A/B test. The statistical A/B test is done with a statistical test such as the chi-squared test. In our case, the basic A/B test already gives us good insights.
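
To give an idea of what the statistical version involves, here is a hedged Python sketch of a chi-squared statistic computed by hand. The counts are invented, chosen only so that they match the 16% and 25% exit rates; they are not the real sample. The critical value 3.841 is the standard threshold for 1 degree of freedom at the 5% level.

```python
# Invented counts, chosen only to match the 16% and 25% exit rates.
observed = {
    "Male": {"Exited": 160, "Stayed": 840},
    "Female": {"Exited": 250, "Stayed": 750},
}

def chi_squared(table):
    """Chi-squared statistic of independence for a table of counts."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    grand_total = sum(row_totals.values())
    statistic = 0.0
    for r in rows:
        for c in cols:
            # Count expected if gender and exit were independent.
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (table[r][c] - expected) ** 2 / expected
    return statistic

CRITICAL_VALUE = 3.841  # chi-squared, 1 degree of freedom, 5% significance level
statistic = chi_squared(observed)
print(f"chi-squared = {statistic:.2f}, significant: {statistic > CRITICAL_VALUE}")
```

With these made-up counts the statistic is far above the threshold, but the outcome depends entirely on the real sample sizes, which is exactly why the statistical test complements the visual one.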

To make our bar chart even easier to read, we will work with aliases.

The first thing we will do is improve the format. Right-click on the space between « Gender » and the bars and select « Format… ».

The « Sheet » tab appears. In « Worksheet », change the text size to « 12 ».

What is good with data mining is that we don’t have to make a perfect chart because we won’t present it in a report to managers or in a meeting.

For example, if I had to present this chart in a report, it would be necessary to change the vertical title. But we only make a model so this change isn’t necessary.

Now, look at this rectangle. We can see « Exited », « 0 » and « 1 ».

« 0 » means that the client stayed in the bank and « 1 » means that the client left the bank. We can also see that the clients who left the bank are in orange, so 25% for women and 16% for men. The clients who stayed in the bank are in blue, so 75% for women and 84% for men.

We did an excellent basic A/B test but it would be much easier to read if we replace « 0 » with « Stayed » and « 1 » with « Exited ».

With aliases we can do that. An alias replaces a stored value with a more readable label: here, the binary results « 0 » and « 1 » become « Stayed » and « Exited », because it’s not easy to remember the meaning of « 0 » and « 1 ».

There are 2 ways to do it : create a calculated field or use aliases.

We will use aliases. Know that aliases are not going to change the « 0 » and « 1 » in the dataset, this change is only in Tableau.

In « Dimensions », right-click on « Exited » and select « Aliases… ».

A small window appears. In this small window, you can create an alias for each value contained in the « Exited » variable.

The variable « Exited » contains the value « 0 » and « 1 ». For the value « 0 », we will create the alias « Stayed » to say that the client stayed in the bank. For the value « 1 », we will create the alias « Exited » to say that the client left the bank. Then click on the « OK » button.

Look, we can see the new values in the rectangle.

The values « 0 » and « 1 » have been replaced by « Stayed » and « Exited ».
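
The idea behind an alias can be sketched in Python as a lookup table applied only at display time. The sample values below are invented, and the `EXITED_ALIASES` name is a made-up label for this illustration.

```python
# Display labels only: the stored values stay 0 and 1.
EXITED_ALIASES = {0: "Stayed", 1: "Exited"}

exited_values = [0, 1, 1, 0]  # invented sample of the « Exited » column

# Build the labels for display without modifying the original values.
labels = [EXITED_ALIASES[value] for value in exited_values]

print(labels)         # ['Stayed', 'Exited', 'Exited', 'Stayed']
print(exited_values)  # [0, 1, 1, 0]
```

As with Tableau aliases, the underlying data is untouched; only what the reader sees changes.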

Now that the aliases are saved, we will take the variable « Exited » in « Dimensions » and move it to « Label ».

Look, we have our aliases « Stayed » and « Exited » on the bar chart.

This way, people can read the bar chart without having to ask what the « 0 » and « 1 » values mean. « Stayed » and « Exited » are clearer.

Now you know how to use aliases so that people can easily read the binary values of a chart.

-Steph

I have just enrolled in a Data Science course on Udemy and I learned good stuff.

In the last article, we created our calculated field « TotalSales » that you can see in the « Measure » zone.

In Tableau, calculated fields are used very often (almost every time) because in most cases the data don’t directly provide the value you want to show.

The calculated field « TotalSales » is a simple example to help you understand how it works, but know that you can do more complex things. I’ll show you that later.

In this article, I’ll show you how to manipulate colors because it’s an important element to communicate. With colors, people will understand more quickly what you want to explain to them.

Imagine that you have to show this bar chart to the manager who handles the bonuses. By putting a little color, a little art, you could improve the reading of this bar chart.

To use colors, click on this button.

You can change the color with the basic colors.

Or you can have more colors by clicking here.

If you have a picture in the background, you have the possibility to change the opacity to have a transparent effect of colors.

You can add a border, change the border’s color, etc.

But what would be nice to do is to have bars with different colors.

To start, take « Rep » and move it to « Colors ».

With this, there is a unique color for each representative.

There is also another method to do that. Instead of taking « Rep » from the list of dimensions and moving it to « Color », you can use the « Rep » that is already in « Columns ».

But if you simply move it to « Colors », you’ll break everything because « Rep » will no longer be in the « Columns » zone.

To avoid this, press Ctrl or Command on your keyboard and click « Rep » to make the « + » sign appear. Now that you have made a copy of « Rep », move it to « Colors ». It’s like doing a copy/paste from « Rep » to « Colors ».

With this method, « Rep » stays in the « Columns » zone. This method is very practical when there are many dimensions.

It’s possible to change representative’s colors by clicking here.

As you can see, there are several choices of palettes.

You can test the « color blind » palette which is very useful for color blind people. To select this palette, click « Assign Palette » and « Apply ».

When a palette has fewer colors than representatives, you will have a message saying that some colors will be duplicated. But this is not a problem because there are names below the bars.

Now we want to see something else with our bar chart. Press “Ctrl” or “Command” on your keyboard and click on SUM(TotalSales) to display the « + » sign. Then move SUM(TotalSales) to « Colors » to replace « Rep ».

As you can see SUM(TotalSales) has different colors. The colors are on a continuous basis which means that the more sales there are, the darker the color.

For our case, this is not useful because the size of the bars represents the sales number but for other situations, this is useful.

The problem now is that there are duplicate colors and, because of this, the manager could misinterpret the results. We need an approach that makes sure the manager understands the results.

The solution is to take « Region » (by pressing “Ctrl” or “Command” on your keyboard) and move it to « Colors ».

You can also take « Region » (with “Ctrl” or “Command”) and move it to SUM(TotalSales) to replace SUM(TotalSales).

With that, the bars are colored by region.

That way, you can clearly see the 3 regions through colors that are unique to each region and you can see the total sales per representative with the size of the bar.

This is a small example so that you can understand the basics of manipulating colors in Tableau. There are more complex techniques to manage colors that I will show you later.

Play with the colors so you can fully understand how it works. You could find your favorite palette and find your style. Have fun.

-Steph

## Create A Calculated Field

I have just enrolled in a Data Science course on Udemy and I learned good stuff.

Now is the time to solve the problem of who is winning the bonus.

The first thing to do is clean the dashboard. To do this, click on the « Clear Sheet » button.

To start, we need to create a « Bar Chart » to see the salespeople. In this data, the salespeople are named « Rep », for representatives.

To see how many items were sold by each sales representative, you need to put « Rep » in « Columns » and « Units » in « Rows ».

You can see that the representative who sold the most is Richard.

But we want to have more details. We need to know who is the best representative by region and for now, we can’t see that.

To see this, you put « Region » in « Columns » before « Rep ».

As you can see, the « Bar Chart » changed. There are separations by region.

In each region, you can see the representatives and the number of items they sold. Alex is the best in Central. Richard is the best in the East and James is the best in West.

To get better visibility, you can sort the bars in descending order. Move the mouse over the « Units » label of the bar chart and an icon will appear. Click on it and the bars will be sorted in descending order.

Unfortunately, we haven’t answered the question because « Units » is only the number of items sold. What interests us is the amount of money earned by selling the products, but we don’t have this type of measure. There is no measure that shows us the total value of sales. There are only « Units » and « Unit Price ».

In this case, you have to make a calculation to get the total sales in dollars. For each sale made by a representative, you need to multiply « Units » by « Unit Price ».

Let’s look at the data to make a test calculation. Right-click on the « OfficeSupplies » data and click on « View Data ». For example, you see that Nick sold 29 binders at \$1.99. So 29 (Units) multiplied by 1.99 (Unit Price) equals \$57.71 for this sale.
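
The test calculation can be double-checked with a couple of lines of Python. This is only a sanity check outside Tableau; `Decimal` is used here (my choice, not the course’s) so that the cents come out exact instead of being subject to floating-point rounding.

```python
from decimal import Decimal

units = 29                    # binders sold by Nick
unit_price = Decimal("1.99")  # price per binder in dollars

# Total value of this single sale.
total = units * unit_price
print(total)  # prints 57.71
```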

In our data, we have no measure where « Units » is multiplied by « Unit Price ».

To solve this problem, you will create an additional measure with a « Calculated Field ». A « Calculated Field » is an element that allows us to create new measures by computing them from existing ones.

To create this, right-click in the « Measure » zone and select « Create calculated field… ».

You can name the calculated field « TotalSales ».

Select « Units », use the « * » sign to multiply, select « Unit Price » and click « OK ». The formula reads [Units] * [Unit Price].

Now, you can see that your measure « TotalSales » is in the « Measure » zone.

If you look closely, you can see that there is an « = » sign before the « # ». This indicates that this measure is a calculated field.

Ok, the measure is ready, let’s go. Drop « TotalSales » on « Units » to replace « Units » with « TotalSales ». Tableau automatically applies the sum aggregation.

If you want to do this more cleanly, you can remove « Units » by dragging it outside « Rows », then take « TotalSales » and put it in « Rows ».

Now that we have a « Bar Chart » with data from « TotalSales », we sort the bars in descending order by clicking here.

As you can see, the results are different: Richard is no longer the best representative in the East, it’s Suzanne. The best representative in Central is Mathiew and the best representative in the West is still James.

It’s with the calculated field « TotalSales » that we can know who the best representatives are by region, so Suzanne, Mathiew and James earn a bonus.

This was an example to learn how to create a calculated field in Tableau. Have fun creating new calculated fields to master this tool, which is really useful.
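
Why can the ranking change when we switch from « Units » to « TotalSales »? The Python sketch below illustrates the mechanism with invented rows (they are not the real OfficeSupplies data, and `best_rep_by_region` is a hypothetical helper): a representative who sells many cheap items can lead in units while another leads in dollars.

```python
from collections import defaultdict
from decimal import Decimal

# Invented rows (region, rep, units, unit_price); not the real OfficeSupplies data.
rows = [
    ("East", "Richard", 90, Decimal("1.99")),
    ("East", "Richard", 60, Decimal("4.99")),
    ("East", "Suzanne", 50, Decimal("19.99")),
    ("West", "James", 40, Decimal("8.99")),
    ("West", "Thomas", 30, Decimal("4.99")),
]

def best_rep_by_region(rows, measure):
    """Sum measure(units, unit_price) per rep, then keep the top rep of each region."""
    totals = defaultdict(lambda: defaultdict(Decimal))
    for region, rep, units, unit_price in rows:
        totals[region][rep] += measure(units, unit_price)
    return {region: max(reps, key=reps.get) for region, reps in totals.items()}

# Ranking by number of items sold.
by_units = best_rep_by_region(rows, lambda units, price: Decimal(units))
# Ranking by total sales, computed row by row as units * unit price.
by_sales = best_rep_by_region(rows, lambda units, price: units * price)

print(by_units)  # {'East': 'Richard', 'West': 'James'}
print(by_sales)  # {'East': 'Suzanne', 'West': 'James'}
```

Note that the multiplication happens on each row before the sum, which mirrors how the calculated field is evaluated before Tableau aggregates it.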