Jon Wayland is a Data Scientist working in the Healthcare industry with a BS and MS in mathematics. I always enjoy reading his answers on Quora because they are packed with useful information for aspiring data analysts/scientists.
If you’re not already following him, I strongly suggest you do. In the meantime, I’ve put together a summary of Jon’s top 5 Data Science answers for you. Enjoy!
According to Jon, an analyst should have a basic understanding of the following:
- Mean & Median (specifically when to use one over the other)
- Standard Deviation
- Using the Distribution for Identifying Outliers
- Difference in Means
- Difference in Proportions
- Logistic Regression (with interpretation on likelihood)
- Linear Regression
Additionally, he thinks having the ability to research a statistical approach given a problem related to the data is crucial.
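To make the first few items on that list concrete, here is a minimal sketch using only Python's standard library. The `wait_times` sample is made up for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers, not necessarily the one Jon has in mind.

```python
import statistics

# Hypothetical sample: nine typical values and one extreme value.
wait_times = [12, 15, 14, 13, 16, 15, 14, 13, 15, 90]

mean = statistics.mean(wait_times)      # pulled upward by the extreme value
median = statistics.median(wait_times)  # barely affected by it
stdev = statistics.stdev(wait_times)    # sample standard deviation

# One common convention: flag points more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(wait_times, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in wait_times if x < lower or x > upper]

print(f"mean={mean}, median={median}, stdev={stdev:.1f}")
print(f"outliers: {outliers}")
```

When a single extreme value drags the mean well away from the median, as it does here, the median is usually the better description of a "typical" observation.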
Jon notes that there were 32,305 job openings for data scientist positions worldwide on LinkedIn (as of 7/16/2018 at 9:30 AM EST), and that half of them were in the United States. Many of these postings require a minimum of 2 years of experience and a graduate degree in a STEM field, with a preference for mathematics and statistics.
About 181k students were awarded a master’s degree in a science or engineering field in 2015:
Notice how low the mathematics and statistics line is. Naively assume that everyone who graduated in 2015 with a master’s in mathematics or statistics planned to become a data scientist. By now they would have the preferred qualifications: at least 2 years of experience and a graduate degree in math or stats. That would be fewer than 10k candidates for roughly 16k US openings today.
Also, according to a report on Johns Hopkins University’s data science specialization offered through Coursera, there were 1.76 million course signups in less than a year after the specialization launched in April of 2014. Of the 1.76 million course signups, only 71,589 Signature Track verified certificates were awarded.
Based on this information, he drew the following two takeaways:
- Yes, too many people are training to become data scientists.
- Not enough of these people are following through with what it actually takes to be considered qualified for a data scientist position.
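The figures quoted above can be sanity-checked with a few lines of arithmetic. This sketch only restates numbers from the text; the variable names are mine, and the last ratio compares certificates to course signups (one person can sign up for several courses), so it is a rough indicator rather than a per-person completion rate.

```python
# Figures quoted in the text above.
openings_worldwide = 32_305
us_share = 0.5                   # "half of them are in the United States"
course_signups = 1_760_000
verified_certificates = 71_589

us_openings = openings_worldwide * us_share
certificates_per_signup = verified_certificates / course_signups

print(f"US openings: about {us_openings / 1000:.0f}k")
print(f"Verified certificates per course signup: {certificates_per_signup:.1%}")
```

Roughly 4 certificates were awarded for every 100 course signups, which is the gap behind the second takeaway.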
Jon cleverly answers this question by writing that there are many reasons one would choose Python over R, and many reasons one would choose R over Python:
Reasons one would choose Python over R:
- General purpose programming language
- Complements other efforts such as building applications
- Flows nicely for machine learning pipelines
Reasons one would choose R over Python:
- True statistical programming language
- Easy to explore data and perform any analysis on it
- Creates state-of-the-art visualizations
Ultimately, the answer to the question is no, Python is not better than R for data science. If the question were reversed, i.e. “Is R better than Python for data science?”, the answer would still be no.
His rule of thumb for the right tool to use is:
- When diving for insights to answer specific questions, I use R.
- When putting a model into a production workflow, I use Python.
Generally speaking, these are the steps that he takes to understand a data set, assuming it’s tabular:
- Where did the data come from? Is it static, or will it be updated on a consistent basis? Is it reliable?
- Check the dimensions of the table. How many columns are there? How many rows?
- What are the data types of each column? Are they continuous/numeric? Are they integer? Are they categorical?
- What are the distributions of each column? For the numeric fields, what’s the mean, standard deviation, 1st through 3rd quartiles, minimum and maximum? For the categorical fields, how many observations fall within each category? Are there any signs of skewness for any of the fields? Are there any outliers? Are there many missing values?
- Check relationships. Are any of the numeric fields linearly correlated? Are any of the categorical fields significantly associated with one another? Are there any clear differences in distributions for numeric fields between each category of the categorical fields?
If you can answer all of these questions, you’ll have a very good understanding of the kind of data set you’re working with.
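The checklist above can be sketched in code. This is a dependency-free illustration on a tiny, made-up table (the `rows` data and the `summarize` and `pearson` helpers are hypothetical names of mine, not something from Jon's answer). In practice a data-frame library such as pandas would cover most of these steps with far less code.

```python
import statistics
from collections import Counter

# Hypothetical tabular data: each dict is one row; None marks a missing value.
rows = [
    {"age": 34, "visits": 2, "plan": "basic"},
    {"age": 51, "visits": 5, "plan": "premium"},
    {"age": 29, "visits": 1, "plan": "basic"},
    {"age": 62, "visits": 7, "plan": "premium"},
    {"age": 45, "visits": None, "plan": "basic"},
]

# Dimensions: how many rows and columns?
columns = list(rows[0])
shape = (len(rows), len(columns))

def summarize(column):
    """Type, missing count, and distribution summary for one column."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    summary = {"missing": len(values) - len(present)}
    if all(isinstance(v, (int, float)) for v in present):
        q1, median, q3 = statistics.quantiles(present, n=4)
        summary.update(
            kind="numeric",
            mean=statistics.mean(present),
            stdev=statistics.stdev(present),
            minimum=min(present), q1=q1, median=median, q3=q3,
            maximum=max(present),
        )
    else:
        summary.update(kind="categorical", counts=dict(Counter(present)))
    return summary

summaries = {col: summarize(col) for col in columns}

def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Relationships: correlate numeric fields using rows where both are present.
pairs = [(r["age"], r["visits"]) for r in rows if r["visits"] is not None]
ages, visits = zip(*pairs)
age_visits_corr = pearson(ages, visits)

# Relationships: compare a numeric field across categories.
mean_age_by_plan = {
    plan: statistics.mean([r["age"] for r in rows if r["plan"] == plan])
    for plan in summaries["plan"]["counts"]
}

print(shape)
for col, summary in summaries.items():
    print(col, summary)
print("age vs. visits correlation:", round(age_visits_corr, 3))
print("mean age by plan:", mean_age_by_plan)
```

The first question on the list (where the data came from and whether it is reliable) cannot be answered in code, so it has no counterpart here.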
- Getting your first job won’t be easy
- An MS or PhD will give you a fast-pass in a line of BS degrees
- Your future company will have high expectations of you
- If you don’t like school, you won’t like data science
- Data science is not computer science
- Interpretation is more valuable than a black box model, so know how to interpret results
- Why confidence intervals trump a single prediction
- Machine learning is only a portion of the job
- Business acumen is as important as (if not more important than) coding skills
- Most problems are literally word problems — it’s up to you to figure out what kind of problem it is and the appropriate statistical method
- It’s fun. And then it’s not. And then it is. And then it’s not.
- You’ll probably have to seek out problems
- Being prepared to prove your worth can be stressful, but is necessary. Your job will be sought after by others.
I hope this was useful for your journey to become a data analyst or data scientist. Do you agree with Jon’s answers?