## Create Bins and View Distributions

I have just enrolled in a Data Science course on Udemy  and I learned good stuff.

Nice work, you finished the first part. Now we’ll dig deeper into this bank’s dataset with some data mining analysis.

To go deeper, we’ll take a more statistical approach.

To do that, we’ll create a new tab.

In this new tab, we want to understand how the clients are distributed by age. Are most of them young or old?

Move the variable « Age » to « Columns ».

Since we want to see the distribution of client ages, we need the variable « Number of Records » to count the observations. Move the variable « Number of Records » to « Rows ».

Boom, we have a chart, but there is only one point in the top right. What happened is that Tableau took the sum of the ages of all the bank’s clients and the sum of « Number of Records », which is the total number of clients: 10,000.

We’ll find a solution, but first we’ll change the format to see the chart better. Right-click in the middle of the chart and select « Format ».

For the font size, select « 12 ».

Here you can see that the total age is 389,218, but that’s not what we’re looking for. What we want to see is the number of clients for each age.

Here is what’s going on. We took the aggregated sums of our variables. Aggregating means that Tableau computes a single total for the variable (one per category, when there are categories). We added up all the ages, but what we actually want is the number of observations for each age separately.

To get that, click on the arrow on « SUM(Age) » in « Columns ».

Then select « Dimension ».

You see, Tableau no longer takes the aggregated sum of the ages; it treats each age separately. We have a curve that shows us the continuous distribution of our clients’ ages. That is to say, for each age, the curve gives us the number of clients of that age.
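Outside Tableau, the difference between the aggregated sum and the per-age breakdown can be sketched in a few lines of Python. This is only an illustration: the `ages` list is made-up sample data, not the bank’s dataset, and `Counter` merely stands in for what happens when « Age » is treated as a dimension.

```python
from collections import Counter

# Hypothetical sample of client ages (NOT the real Churn Modelling data).
ages = [26, 29, 29, 30, 31, 31, 31, 42]

# Aggregated view: one number for the whole dataset,
# comparable to SUM(Age) and SUM(Number of Records).
total_age = sum(ages)
total_records = len(ages)

# Per-age view: one count for each distinct age,
# comparable to using Age as a dimension.
clients_per_age = Counter(ages)

print(total_age, total_records)
for age in sorted(clients_per_age):
    print(age, clients_per_age[age])
```

The first `print` gives the single point we saw at the start; the loop gives one value per age, which is what the curve plots.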

Let’s look at the dataset. Right-click on « Churn Modelling » and select « View Data… ».

A window appears that shows the data in detail. If you scroll to the right, you will find the column « Age ».

We see that the ages are rounded. Since all ages are rounded, Tableau is able to group clients by age. By hovering the mouse over the curve, we can see that there are 200 clients who are 26 years old.

If the ages in the dataset weren’t rounded, you would see clients aged 26.5 or 26.3. That would create a lot of irregularity: the curve would be full of spikes and variations.
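To see why unrounded ages would give a spiky curve, here is a small sketch. The values are invented, and I assume ages are recorded in completed years (`math.floor`); the dataset may have been rounded differently.

```python
from collections import Counter
import math

# Hypothetical unrounded ages in years (illustrative values only).
exact_ages = [26.1, 26.3, 26.5, 26.9, 27.2, 27.8]

# Without rounding, every value is its own category:
# six separate points, each with a count of 1.
exact_counts = Counter(exact_ages)

# With whole-year ages (assumption: completed years),
# the clients fall into a few shared categories.
whole_year_counts = Counter(math.floor(age) for age in exact_ages)

print(len(exact_counts), dict(whole_year_counts))
```

With whole years there are only two categories here, so the counts become large enough to form a readable curve.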

Oooooh look, there is a variation that doesn’t look normal.

Let’s analyze it in detail. Around this peak, we see that there are 348 clients who are 29 years old.

Here, there are 404 clients who are 31 years old.

And this dip shows that there are 327 clients who are 30 years old.

How can we explain this irregularity? It’s possible that many of the 29-year-olds are about to turn 30 and that many of the 31-year-olds have only just turned 31. It’s chance that gives us these inaccuracies. You may see other inaccuracies if your data isn’t precise and rounded. In our case, the ages are rounded, but we still want to get rid of the small irregularity that we see on our curve.

There is a way to see our distribution without these irregularities: « bins ». Binning consists of grouping the values into ranges. In other words, we’re going to group our clients into age groups.

Right-click on « Age » in « Measures ». Select « Create », then « Bins… ».

A window appears. We’ll group our clients in 5-year increments. In « Size of bins », enter « 5 » and click on the « OK » button.

As you can see, the variable « Age » has remained in « Measures », but there is a new variable in « Dimensions ». This is the variable we created, « Age(bins) ».

Our « Age(bins) » variable was correctly placed in « Dimensions » because it is a categorical variable: each category corresponds to a 5-year age group.

For example, one category is the 20 to 24 age group. Now we’ll create a new distribution based on the bins.

To do that, we’ll remove the variable « Age » from « Columns » by clicking on it and dragging it out.

Then move the variable « Age(bins) » from « Dimensions » to « Columns ».

Note

In this case, it’s not possible to replace « Age » directly by dropping « Age(bins) » over « Age » on « Columns ». This is because « Age » is a measure and « Age(bins) » is a dimension.

That’s a nice distribution; it’s the type of chart we usually see in economics or mathematics. The difference from the previous chart is that this one is discrete: the clients are grouped by age group, while the previous chart was continuous.

On this chart, each bar corresponds to an age range. For example, this bar corresponds to the 25-29 age group.
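The grouping behind these bars can be sketched in Python. This is a rough illustration, not Tableau’s internal implementation: `bin_start` is a hypothetical helper, the ages are made up, and I assume each bin starts at a multiple of the bin size and includes its lower edge (which matches the 20-24 and 25-29 groups above).

```python
from collections import Counter

def bin_start(age, bin_size=5):
    """Return the lower edge of the bin that contains `age`."""
    return (age // bin_size) * bin_size

# Hypothetical client ages (not the real dataset).
ages = [24, 25, 27, 29, 30, 31, 34, 35, 38, 39, 41]

# One count per bin: this is what each bar of the chart represents.
bin_size = 5
bin_counts = Counter(bin_start(age, bin_size) for age in ages)

for start in sorted(bin_counts):
    print(f"{start}-{start + bin_size - 1}: {bin_counts[start]} clients")
```

Calling `bin_start(age, 10)` instead would give 10-year groups, the same change we make later by editing the bin size.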

Now, we’ll change the colors.

From « Rows », drag « SUM(Number of Records) » to « Color » while holding down the « Ctrl » or « Command » key on your keyboard.

We get our distribution in blue, but we’ll change the color to red. Click on « Color » and click on « Edit Colors ».

In the window that appears, click on the blue square on the right to display the color palette.

Select the red color and click on the « OK » button.

Click on the « OK » button of the « Edit Colors » window.

To make the bar chart easier to read, we’ll add the number of clients in each age group. From « Rows », drag « SUM(Number of Records) » to « Label » while holding down the « Ctrl » or « Command » key on your keyboard.

That’s it, we can see how many clients there are in each age group.

We see that the tallest bar is the 35-39 age bracket and the second tallest is the 30-34 age bracket. Overall, we can see that most clients are between 25 and 40 years old, which seems consistent.

Our bar chart shows absolute values. We’ll replace them with percentages. Click on the little arrow on « SUM(Number of Records) » in « Label ». You could select « Add Table Calculation… », but I’ll show you another way to do it.

Instead of clicking « Add Table Calculation… », click on « Quick Table Calculation » and select « Percent of Total ».

Cool, we now have the exact percentage of clients in each age bracket. We can see that the 25 to 40 age group accounts for 20 + 23 + 17 = 60% of clients.
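The percent-of-total calculation itself is simple arithmetic: each bin’s count divided by the grand total. Here is a minimal sketch with invented counts (they are not the bank’s figures, so the percentages differ from the 20, 23 and 17 above).

```python
# Hypothetical client counts per 5-year bin (lower edge -> count).
bin_counts = {20: 10, 25: 30, 30: 40, 35: 20}

# Grand total across all bins.
total = sum(bin_counts.values())

# Share of each bin, expressed as a percentage of the total.
percent_of_total = {
    start: 100 * count / total for start, count in bin_counts.items()
}

# Adding up several bins gives the share of a wider age range.
share_25_to_39 = sum(percent_of_total[start] for start in (25, 30, 35))

print(percent_of_total)
print(share_25_to_39)
```

Summing the shares of the 25, 30 and 35 bins is the same addition we did by hand on the chart.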

I’ll show you one last thing. You can change the size of the bins easily: just click on « Age(bins) » and select « Edit ».

In the window, you can change the size of the bins. Enter « 10 » instead of « 5 » to get 10-year bins. Click on the « OK » button.

Now we have a distribution with fewer bins, and the dominant bin is 30 to 39 years old.

Well, that was just to show you how to change the bin size. To go back to the previous distribution with 5-year bins, click on the « Back » button.

As you can see, the values on the bars are percentages, but the values on the axis are absolute values. Here is an exercise for you: « Put the values of the axis in percentages ». I’ll give you the answer in the next article.

-Steph

In the last article, we created our calculated field « TotalSales », which you can see in the « Measures » zone.

In Tableau, calculated fields are used very often (almost every time) because in most cases the data doesn’t directly contain the value you want to show.

The calculated field « TotalSales » is a simple example to help you understand how it works, but you can do much more complex things. I’ll show you that later.

In this article, I’ll show you how to work with colors, because color is an important communication tool. With colors, people will understand more quickly what you want to explain to them.

Imagine that you have to show this bar chart to the manager who handles the bonuses. By adding a little color, a little art, you could make the bar chart easier to read.

To use colors, click on this button.

You can pick one of the basic colors.

Or you can get more colors by clicking here.

If you have a picture in the background, you can change the opacity to make the colors semi-transparent.

You can add a border, change the border’s color, etc.

But what would be really nice is to have bars with different colors.

To start, take « Rep » and move it to « Color ».

With this, there is a unique color for each representative.

There is another way to do that. Instead of taking « Rep » and moving it to « Color », you can click on « Rep » here.

If you simply move it to « Color », you’ll break everything, because « Rep » will no longer be in the « Columns » zone.

To avoid this, hold down Ctrl or Command on your keyboard and click on « Rep » so that the « + » sign appears. Now that you have made a copy of « Rep », move it to « Color ». It’s like copying and pasting « Rep » into « Color ».

With this method, « Rep » stays in the « Columns » zone. This method is very practical when there are many dimensions.

You can change the representatives’ colors by clicking here.

As you can see, there are several palettes to choose from.

You can try the « color blind » palette, which is very useful for color-blind people. To select this palette, click on « Assign Palette » and then « Apply ».

When a palette has fewer colors than there are representatives, you will get a message saying that some colors will be duplicated. This is not a problem here because the names appear below the bars.
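To picture what "duplicated colors" means, here is a small sketch. The palette and the representative names are invented, and I assume the colors are simply reused in order once the palette runs out; Tableau’s actual assignment may differ.

```python
from itertools import cycle

# Hypothetical 3-color palette and 5 representatives.
palette = ["blue", "orange", "green"]
reps = ["Ann", "Bob", "Cara", "Dan", "Eve"]

# Assumption: once the palette is exhausted, colors are reused from the start.
# zip() stops at the shorter input, so every rep gets exactly one color.
rep_colors = dict(zip(reps, cycle(palette)))

for rep, color in rep_colors.items():
    print(rep, color)
```

Two pairs of representatives end up sharing a color, which is why the names under the bars matter.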

Now we want to see something else with our bar chart. Hold down « Ctrl » or « Command » on your keyboard and click on SUM(TotalSales) to display the « + » sign. Then move SUM(TotalSales) to « Color » to replace « Rep ».

As you can see, the bars colored by SUM(TotalSales) have different shades. The colors are on a continuous scale, which means that the more sales there are, the darker the color.

In our case, this is not useful because the size of the bars already represents the sales figure, but in other situations it is.
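A continuous color scale can be sketched as a linear blend between a light and a dark shade. This is just one simple way to do it, not necessarily how Tableau computes its gradients; `shade`, the two RGB endpoints and the sales figures are all hypothetical.

```python
def shade(value, lowest, highest, light=(198, 219, 239), dark=(8, 48, 107)):
    """Blend linearly from `light` to `dark` as `value` goes from lowest to highest."""
    if highest == lowest:
        t = 0.0  # every value is identical: use the light shade
    else:
        t = (value - lowest) / (highest - lowest)
    return tuple(round(l + t * (d - l)) for l, d in zip(light, dark))

# Hypothetical total sales per representative.
sales = {"Ann": 100, "Bob": 250, "Cara": 400}

lowest, highest = min(sales.values()), max(sales.values())
colors = {rep: shade(value, lowest, highest) for rep, value in sales.items()}

for rep, rgb in colors.items():
    print(rep, sales[rep], rgb)
```

The representative with the lowest sales gets the light shade, the one with the highest gets the dark shade, and everyone else lands in between.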

The problem now is that there are duplicate colors, and because of this the manager could misinterpret the results. A better approach is one that makes sure the manager understands the results.

The solution is to take « Region » (while holding down « Ctrl » or « Command » on your keyboard) and move it to « Color ».

You can also take « Region » (with « Ctrl » or « Command ») and drop it on SUM(TotalSales) to replace SUM(TotalSales).

With that, the bars are colored by region.

That way, you can clearly see the 3 regions through colors that are unique to each region, and you can see the total sales per representative from the size of the bars.

This is a small example to help you understand the basics of working with colors in Tableau. There are more complex techniques for managing colors that I will show you later.

Play with the colors so you can fully understand how they work. You may find your favorite palette and your own style. Have fun.

-Steph

## Navigate In Tableau

We’ll explore Tableau’s tools.

From the connection manager, we’ll go into Tableau’s workspace.

Click on the « Sheet1 » tab at the bottom of the window.

Here is Tableau’s workspace.

The 2 important elements of the window are « Data » on the left and the workspace on the right. It’s in the workspace that you’ll create tables and charts.

« Data » is divided into 2 zones: dimensions and measures.

Dimensions and measures are 2 different roles that allow you to manipulate data.

By default, Tableau puts numerical values in « Measures » and categorical (qualitative) variables in « Dimensions ».

There is another way to explain « Dimensions » and « Measures »: dimensions are independent variables and measures are dependent variables.

For example, « Units » is a measure: it’s the number of items sold per product. « Region » is a dimension: it’s the geographic region where the product was sold. With these 2 elements, we can find out how many items were sold by region. This means that « Region » is an independent variable and « Units » is a dependent variable, because it will be grouped by region.

But if you don’t like the default, you can move fields from dimensions to measures, and back, by clicking and dragging.

In the menu bar at the top, there is « File », where you can open and save files.

« Data » lets you connect to new source files.

« Worksheet » is where you create analyses.

« Dashboard » is a combination of worksheets.

« Story » is a combination of worksheets and dashboards.

« Analysis » lets you specify how you want to analyze the data in your workspace.

« Map » lets you add maps to the workspace.

« Format » contains formatting options.

Now, let’s study the workspace.

In the workspace, the main elements are « Columns » and « Rows ». This is where you decide which data goes in the columns and rows of your worksheet.

You can also choose different formats for these elements, such as color, size, text, level of detail, and tooltips (a useful but optional tool).

Let’s do a test. Use the « Region » data (which is in « Dimensions »). Drag and drop « Region » to the center of your workspace. Now « Region » is in the « Rows » element.

A table appears in your workspace.

You have put a dimension in your workspace. Now put a measure in it.

Use the « Units » data. Drag and drop « Units » next to the « Region » column.

As you can see, Tableau automatically put « Region » in the « Rows » element and aggregated the « Units » data by region. This way, you can tell how many items were sold in each region.
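The aggregation Tableau just did (a measure summed within each value of a dimension) can be sketched like this. The rows and region names are invented for the example; they are not the tutorial’s data.

```python
from collections import defaultdict

# Hypothetical sales rows: one dimension (Region) and one measure (Units).
rows = [
    {"Region": "East", "Units": 10},
    {"Region": "West", "Units": 7},
    {"Region": "East", "Units": 5},
    {"Region": "Central", "Units": 12},
    {"Region": "West", "Units": 3},
]

# SUM(Units) grouped by Region: the dimension defines the groups,
# the measure is what gets added up inside each group.
units_by_region = defaultdict(int)
for row in rows:
    units_by_region[row["Region"]] += row["Units"]

for region in sorted(units_by_region):
    print(region, units_by_region[region])
```

This is the "independent" and "dependent" idea from earlier: « Region » decides the groups, and « Units » is the value computed for each one.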

Now you can move « SUM(Units) » to the « Columns » element.

You then have a bar chart that shows how many items were sold by region. You can enlarge the chart by clicking and dragging.

Let’s look at the tools in the « Show Me » zone.

Click on « Pie chart » to get this type of chart.

Click on the « Size » icon and drag from left to right to increase the chart’s size.

In this chart, each region has its own color and a slice proportional to the items sold in that region.

You can also try the « bubble chart ». Tableau organizes the data automatically and everything is placed in « Marks ».

You can try the « Treemaps » chart. It works on the same principle as the « bubble chart », but with rectangles instead of circles.

As you can see in « Show Me », some charts are disabled. This is because you need certain elements in your data to activate them.

For example, for the « Area chart », you need « date » data to activate it.