Data Science Domains

I have just enrolled in a Data Science course on Udemy, and I have already learned some good stuff.

In science, there are several domains. Data Science is no different.

[Diagram: data science domains]

Data Science is made up of 3 fields: computer science, math and statistics, and domain knowledge. Over the past few years, though, this has changed a bit. Data Scientists now need skills beyond programming and statistics.

Look at this new diagram:

[Diagram: data science domains]

Let’s look at these skills in detail.

Statistics

Data is the foundation of a Data Scientist's work, so they must be able to filter it down to the relevant data that will give them insights. This allows Data Scientists to build models that classify a population and make reliable forecasts of future events.

Visualization

Do you speak the computer's language? Can you read binary code like « 00100010100101010110 »? No, and neither can I. That is why Data Scientists must be able to see through the data and, above all, show it to others. Visualization is the skill that makes the data readable for everyone.

Data Mining

This is the part of the work where the Data Scientist plays detective, like Sherlock Holmes. In this phase, we search the data for insights and anomalies.

Databases and Data Processing

It's simple: the Data Scientist cleans the data, stores it in a database, and processes it there.

Pattern recognition, Machine learning and Neurocomputing

These 3 disciplines are about teaching computers to learn to do a specific task on their own. These are not things I'll be learning, but they are interesting disciplines for some business problems.

In a world where competition is increasingly aggressive, technical skills are no longer enough. Here are the other skills that Data Scientists need to have.

Communication

Data Scientists need to interact with people every day, because the insights are not only in the data. Some insights can only be found by talking to people. That's why it's important not to be afraid to talk to people and ask them questions on a daily basis.

Presentation

This is another type of communication. In this case, the Data Scientist isn't trying to extract information but to explain to people what he or she has found. This is a very important skill because the Data Scientist is the intermediary between the insights and the people. In a way, the Data Scientist is a data translator whose job is simply to explain what the data contains.

Domain knowledge

Data Science can be used in any industry. One day you might be hunting for fraudulent transactions, and another day you might be building a compensation model for the employees of a medical establishment.

That is why, whatever industry you work in, you must do your research and quickly learn the basics of that industry. The rest will come naturally.

Practice in real situations

The proverb « It's by forging that one becomes a blacksmith » says it all. It applies extremely well to Data Science.

Programming

This is the 2nd basic domain of Data Scientists. The better you talk to your computer and the more efficient you are, the more successful you will be. If you don't know how to program, start learning today. Programming has to become a hobby, something you like to do.

Creativity

This is what makes the difference between Data Scientists and Data Analysts. To become an excellent Data Scientist, you need to exercise your creativity. Be curious and you will find insights that nobody else would ever have found.

Now you know the skills needed to become an excellent Data Scientist. As you can see, I have a lot to do.

Share this article if you think it can help someone you know. Thank you.

-Steph

Data Science: An Underrated Job

I have just enrolled in a Data Science course on Udemy, and I have already learned some good stuff.

I know you've heard it many times: « Look at this one, this is the job of the future! ». The simplest thing I can do is explain why it's worth learning about Data Science. It is an extremely useful skill for the future.

The principle is that the more data there is, the more work there is for Data Scientists. Let's look at the amount of data created in the world in the past and the present, and at the estimates for the future.

130 Exabytes of data were created by humans from the beginning of humanity until 2005. OK, that number probably means nothing to you. Don't worry, it was the same for me. Let's go back to the source.

Measuring data

The source is 1 byte (1 B): the space a hard drive needs to store one letter. For example, the letter « S » = 1 byte (1 B).

Go up one level: multiply 1 byte (1 B) by 1000 and you get 1 Kilobyte (1 KB). A page of a book contains between 2000 and 5000 letters, so about half a page of text is 1 Kilobyte (1 KB).

Go up one level: multiply 1 Kilobyte (1 KB) by 1000 and you get 1 Megabyte (1 MB). A 500-page book is about 1 Megabyte (1 MB).

Go up one level: multiply 1 Megabyte (1 MB) by 1000 and you get 1 Gigabyte (1 GB). An encoded human genome fits in 1 Gigabyte (1 GB).

Go up one level: multiply 1 Gigabyte (1 GB) by 1000 and you get 1 Terabyte (1 TB). If you took a picture with an HD camera every hour of every day for 80 years, all of those pictures would fit in 1 Terabyte (1 TB).

Go up one level: multiply 1 Terabyte (1 TB) by 1000 and you get 1 Petabyte (1 PB). If you turned all the trees of the Amazon rainforest into paper and wrote text on both sides of each sheet, all that paper would represent between 1 and 2 Petabytes (1-2 PB).

Go up one level: multiply 1 Petabyte (1 PB) by 1000 and you get 1 Exabyte (1 EB). All existing data on planet Earth is contained in 1 Exabyte (1 EB).
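To make this ladder of units concrete, here is a small Python sketch. It assumes, as the text does, that each level is exactly 1000 times the previous one. The function names `to_bytes` and `humanize` are made up for this illustration and do not come from the course.

```python
# Decimal units as described above: each level is 1000 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]

def to_bytes(value, unit):
    """Convert a value in the given unit (e.g. 130, "EB") to a number of bytes."""
    return value * 1000 ** UNITS.index(unit)

def humanize(num_bytes):
    """Return (value, unit) using the largest unit that keeps value >= 1."""
    exponent = 0
    # Climb one level at a time while the number still fills the next unit.
    while exponent < len(UNITS) - 1 and num_bytes >= 1000 ** (exponent + 1):
        exponent += 1
    return num_bytes / 1000 ** exponent, UNITS[exponent]

# Half a page of text is about 1000 letters, i.e. about 1 KB.
print(humanize(1000))
# The 130 exabytes mentioned earlier, written out in bytes.
print(to_bytes(130, "EB"))
```

Written out in full, 1 Exabyte is a 1 followed by 18 zeros in bytes, which is why nobody counts in bytes at this scale.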

More more more data

I think you now have a better idea of how we measure amounts of data. At the start, I told you that 130 Exabytes (130 EB) were created by humans from the beginning of humanity until 2005.

By 2010, this had increased to 1200 Exabytes (1200 EB). By 2015, it had increased to 7900 Exabytes (7900 EB). The forecast for 2020 is that it will reach 40 900 Exabytes (40 900 EB). You can see how data creation is growing in the world: it is going very, very fast.
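To get a feel for how fast that is, here is a rough Python sketch that computes the growth between each pair of figures. It uses only the four numbers quoted above and treats them as the total amount of data at each year, which is my reading of the figures. The 2020 value is a forecast, and the name `growth_between` is made up for this example.

```python
# Figures quoted in the article, in exabytes (EB), keyed by year.
# The 2020 value is the article's forecast, not a measurement.
DATA_EB = {2005: 130, 2010: 1200, 2015: 7900, 2020: 40900}

def growth_between(data):
    """Return (start, end, factor, yearly_rate) for each consecutive pair of years."""
    years = sorted(data)
    rows = []
    for start, end in zip(years, years[1:]):
        factor = data[end] / data[start]
        # Compound yearly rate that would produce `factor` over the period.
        yearly_rate = factor ** (1 / (end - start)) - 1
        rows.append((start, end, factor, yearly_rate))
    return rows

for start, end, factor, rate in growth_between(DATA_EB):
    print(f"{start} -> {end}: x{factor:.1f} ({rate:.0%} per year)")
```

With these figures, the amount of data is multiplied by more than 5 in every five-year period.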

With a graph, it's easier to visualize all that.

[Graph: data science forecast]

The blue line on the graph corresponds to the quantity of data that machines (computers) can store. You see, there is much more data than computers can store.

The red line corresponds to the amount of data that Data Scientists can process. You see, there is much more data than Data Scientists can process.

Another important point is that the gap between the machines and the Data Scientists will widen over time.

There are very few Data Scientists in the world, and because they're rare, their salaries are high.

As companies increasingly seek out Data Scientists, universities and engineering schools are beginning to offer this type of training.

As the amount of data increases, companies' demand for Data Scientists to process it also increases. This demand is so enormous that, within a dozen years, everyone is expected to know the basics of Data Science, the way people know the basics of programming now.

I advise you to do some research on Data Science. You'll see that it can be used in any industry, and it's really interesting.

Share this article if you think it can help someone you know. Thank you.

-Steph