Scientists have long been

interested in Machine Learning. But Computers made very bad students because

they started out from such a low level. For many people, it takes a PhD to become

a Data Scientist. That’s because Data Science requires a deep understanding of

Statistics, Programming, Machine Learning and Business. Having a basic knowledge of all

these different domains is achievable in a few months. But becoming an employable Data Scientist

is where the real challenge lies. You’ll need to be very strategic about what you learn and how you

learn it. Through my Master’s in Computer Science and Applied Maths as well as by working closely

with my Data Science colleagues at Microsoft, I have found a path that will not only provide

you with all the necessary skills but also prepare you for the Data Science interviews

at big tech companies.

In this video, I will share all the steps of this path and provide you with

the free resources you will need at every step. Along the way, I will also tell you about 3 mistakes

that stop people from becoming a Data Scientist. Let’s start with the first pillar of Data Science

and that is: Statistics. Let’s say that Google decides to change the color of the Search button

to green. You’re the Data Scientist in charge of testing this change on a small portion of Google

users.
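
To make that concrete, here is a minimal sketch of how the results of such an experiment could be checked in Python. It assumes a two-proportion z-test, which is one common choice for comparing click rates and is not something specified in this video. The function name and all the click counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Return (z, two-sided p-value) for the difference in click rates.

    Group A sees the old button, group B sees the green button.
    """
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    # Pooled click rate under the null hypothesis that the color changes nothing.
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, purely for illustration.
z, p = two_proportion_z_test(clicks_a=1000, users_a=10000,
                             clicks_b=1100, users_b=10000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference in click rates is unlikely to be random noise alone. How small is small enough is a threshold you would pick before running the experiment.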

Statistics will help you design this experiment and it will also guide you on what to

measure. And not only that, Statistics can help you decide whether the data you collected in your

experiment is reliable or just some random noise. Statistics also sits at the core of machine

learning algorithms like linear regression. So, a good knowledge of Statistics is necessary

to become a good Data Scientist. But to learn Statistics, you need to know some basic

concepts of Mathematics. Look, I know that many of you don’t like Maths. And I wish I could say

that Maths is not needed for Data Science. But, if you’re looking to build a good career in Data

Science, you need to know some basic things. To make your life easy, I recommend doing this free

4-week course on Coursera. This course is called Data Science Math Skills by Duke University.

This course covers important concepts like Mean, Variance, Derivatives and Bayes’ theorem.

The best part about this course is that it’s great for beginners.

For example, it covers

even the most basic things like Venn diagrams and Sigma notation. Another great thing about

this course is that it does not try to teach you everything. It provides you just enough

knowledge to get started with Statistics. Now that you feel confident about your maths

skills, let’s learn Statistics. This is where many people make their first big mistake. And that is

they try to learn everything. Look, Statistics is a vast field and it requires many, many years

to fully understand it. For most Data Scientist jobs, you just need to know some key concepts

in Statistics. To put things in perspective, here is the distribution of different Data

Science roles in the market. In this diagram, we see that the majority of Data Science roles

are Analytics roles, which means that they mainly focus on defining business metrics and making data-driven decisions through data visualization, among other things. I will link this article in the description for

you to review. A major insight from this article is that Statistics-heavy Data Science roles make

up a small minority, just 5% of total roles.

So, we don’t need to go very deep into Statistics.

In my case, I did multiple advanced-level courses in Statistics and later found out that I did

not need most of it for Data Science. To learn all the key concepts that you actually

need, I recommend this course called "Introduction to Statistics" by Stanford University.

This course covers all the important ideas like Probability, the Normal distribution, Confidence

Intervals and many more.

By the end of this course, you will know all the Statistics you

need to move on to Machine Learning. But before we can move on to Machine

Learning, we need to learn some Programming, which is the second pillar of Data Science.

When it comes to programming for Data Science, we primarily have 2 languages to choose from.

The first one is R, which is designed purely for Statistics and Data analysis. The second and more

popular option is Python, which is a full-fledged programming language that can be used for

applications beyond Statistics and Machine Learning. That’s why I would recommend picking

Python as your programming language. But how do we learn Python? In our video on the “Fastest

way to learn coding and actually get a job”, we recommended learning Python by doing

actual coding. For that, we gave you this website called “learnpython.org”. On this

website, complete the tutorials covering the basics as well as Data Science.

As always, play with

the code and complete the exercise portion. Now that we have learnt Programming, let’s

move on to the third pillar of Data Science and that is Machine Learning. This is where many

people make their second big mistake. They forget that knowing Machine Learning algorithms

will not help much if you don’t know how to get the data to apply these algorithms to. When

you are working on your personal projects for Machine Learning, you can go to websites

like UC Irvine’s Machine Learning Repository and choose data to work on. For example, in one

of my personal projects for a Computer Vision class, I used UC Irvine’s Handwritten digits dataset. But

in the real world, you rarely get well-defined, cleaned-up data. You have to decide what

data makes sense for your application and then use SQL to extract that data. That’s

why SQL questions are very common in Data Science interviews. The mistake that people make is

that they skip learning SQL. To learn SQL, we will write some SQL queries. So, go

to this tutorial on W3Schools and work through it hands-on.
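
If you want to practice on your own machine as well, here is a small sketch that runs a typical extraction query from Python using the built-in sqlite3 module. The searches table, its columns and its rows are invented for illustration, and a real company database will look different.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE searches (user_id INTEGER, country TEXT, clicked INTEGER)"
)
conn.executemany(
    "INSERT INTO searches VALUES (?, ?, ?)",
    [(1, "US", 1), (2, "US", 0), (3, "IN", 1), (4, "IN", 1), (5, "FR", 0)],
)

# Count searches and compute the click rate per country.
query = """
    SELECT country, COUNT(*) AS n_searches, AVG(clicked) AS click_rate
    FROM searches
    GROUP BY country
    ORDER BY click_rate DESC, country
"""
rows = conn.execute(query).fetchall()
for country, n_searches, click_rate in rows:
    print(country, n_searches, round(click_rate, 2))

conn.close()
```

The same SELECT, GROUP BY and ORDER BY pattern shows up in many interview questions, so it is worth typing it out yourself rather than only reading it.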

Make sure to go through

at least the SQL Tutorial portion at the top. Also, don’t forget the SQL examples portion at

the bottom where you can test your knowledge. Before you can apply a machine learning algorithm, you need to know what your data looks like. Some

of the best presentations that I have attended are the ones where Data Scientists slice and

dice data to bring out some deep insights just through data visualization. Two very

popular libraries for data visualization in Python are Matplotlib and Seaborn. To learn

these libraries, you can do this course called “Data Visualization in Python” on Coursera. In

this course, you’ll learn how to make box plots, scatter plots and regression plots using

Matplotlib, Seaborn and some other libraries. This course also covers dashboarding, which is

an essential part of most Data Science jobs.
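
As a taste of what these plots look like in code, here is a minimal sketch using only Matplotlib on synthetic data. The variable names, the numbers and the output file name plots.png are assumptions made for illustration, and the regression line is fitted with a hand-written least-squares formula. Seaborn's regplot can draw a similar plot in one call.

```python
import random

import matplotlib
matplotlib.use("Agg")  # draw off-screen so the script also runs without a display
import matplotlib.pyplot as plt

random.seed(0)
# Synthetic data: y grows roughly linearly with x, plus random noise.
x = [random.uniform(0, 10) for _ in range(50)]
y = [2 * xi + 1 + random.gauss(0, 1) for xi in x]

# Ordinary least-squares fit of a straight line y = intercept + slope * x.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot with the fitted regression line on top.
ax_scatter.scatter(x, y, alpha=0.6)
x_ends = [min(x), max(x)]
ax_scatter.plot(x_ends, [intercept + slope * v for v in x_ends], color="red")
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatter plot with regression line")

# Box plot comparing y for small and large values of x.
low = [yi for xi, yi in zip(x, y) if xi < 5]
high = [yi for xi, yi in zip(x, y) if xi >= 5]
ax_box.boxplot([low, high])
ax_box.set_xticklabels(["x < 5", "x >= 5"])
ax_box.set_title("Box plot of y by group")

fig.tight_layout()
fig.savefig("plots.png")
```

Try changing the noise level or the number of points and watch how the scatter and the boxes change. That kind of playing around is how these libraries start to feel familiar.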