Being a Good Machine Learning Engineer/Data Scientist

August 2018

In my experience doing machine learning and data science, there are a lot of different elements you have to know quite well to be effective at your job. These include the following:

1. General Programming Familiarity

I consider data scientists to be engineers. Instead of coding a web app as a frontend engineer would, they are responsible for architecting data processing pipelines, designing and implementing models, and developing infrastructure for system evaluation and metrics computation.

As you can imagine, performing these tasks requires reasonable fluency with a high-level programming language (think Python, R, MATLAB, or Julia), as well as with data-science-specific libraries (think pandas, scikit-learn, Matplotlib, or TensorFlow). Developing this skill alone can take up a year or more of an undergraduate computer science degree.

2. Mathematical Maturity

Data scientists must be comfortable with a variety of mathematical topics, including statistics, probability theory, and linear algebra, to name a few. Very often in a traditional data science role, you will have to read academic papers describing some model or dataset and then implement the key ideas of the paper. This requires being comfortable with mathematical concepts and notation, which can be quite hard: research papers are rarely written like a New York Times bestselling popular science book that the general public can follow.

I would also argue that developing this skill is more important than programming familiarity, since this is what makes data scientists... well, scientists. You need to build intuition for the models you are implementing, the data you are processing, and the analyses you are performing, because this intuition is what lets you truly do your job well.

3. Data Insights

Using, understanding, and manipulating data is a very different skill from those required in other traditional engineering jobs. I remember walking into an interview for a data science role some years ago. After a brief introduction, the interviewer literally pulled up a CSV file in Sublime Text with just a bunch of numbers in different columns and said “Here are our records for the past month of customers using our product. What would you do with this?”

While this may seem like an unorthodox (and admittedly intimidating) interview, these are the types of situations data scientists have to deal with on a daily basis. You will often be presented with unfiltered, incomplete, and sparse datasets and expected to derive meaningful insights from them. This will require you to get comfortable asking a score of different questions about your data including:

How do I fill in missing values?
How do I deal with/remove outliers?
How do I get more even label distributions to train my model?
What kinds of features are most relevant for my model?
How do I detect/deal with model overfitting/underfitting?
What metrics are most useful to assess model performance?
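
To make the first three questions concrete, here is a minimal sketch using only the Python standard library. It is one reasonable set of choices among many, not a recommended recipe: median imputation for missing values, the 1.5 × IQR rule for flagging outliers, and random oversampling to even out labels. The `spend` and `churned` data and all function names are made up for illustration.

```python
import random
import statistics
from collections import Counter


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outlier_mask(values, k=1.5):
    """Return True for each value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows until every label is as frequent as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical monthly spend per customer; None marks a missing record.
spend = [12.0, 15.0, None, 14.0, 13.0, 400.0, 16.0, None, 11.0]
# Hypothetical labels: 1 means the customer churned.
churned = [0, 0, 0, 0, 0, 1, 0, 0, 1]

filled = fill_missing_with_median(spend)
mask = iqr_outlier_mask(filled)
balanced_spend, balanced_churned = oversample_minority(filled, churned)
```

Whether a flagged value like 400.0 should be removed, capped, or kept is a judgment call that depends on the data. If you do oversample, do it only on the training split, since duplicating rows before splitting puts copies of the same record in both the training and test sets.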

As you can see, being a machine learning engineer/data scientist oftentimes amounts to being a jack-of-all-trades. You have to know a fair bit, which is why it seems that there is a pretty high barrier to entry. That being said, it is an extremely rewarding job, one where you are the person shedding light on the unknown, a data-whisperer of sorts 🙂.

I’m also very confident that anyone who follows a disciplined course of study can acquire the skills necessary to become a machine learning engineer/data scientist, without having to spend years at some top university.

Shameless Pitch Alert: If you’re interested in practicing MLOps, data science, and data engineering concepts, check out Confetti AI, the premier educational machine learning platform used by students at Harvard, Stanford, Berkeley, and more!