What Affects Data Scientist Salaries?
In this Metis Data Science Bootcamp project I wanted to understand how different factors affect estimated salaries for data professionals on a leading employment website. I scraped thousands of job listings, conducted exploratory data analysis of the scraped data, and built a model to predict salary estimates for full-time data scientist and data analyst roles in five cities across the US.
The biggest challenge for me was getting salary estimate data. Very few listings on the website have salary information, but there is a filter on the search page (see below) that allows setting a minimum expected salary. Thus, for each combination of job title, city and experience level, I slid the salary filter from $40,000 to $300,000 in $5,000 increments and saved all results using Beautiful Soup. Knowing the filter value at which each job listing stopped appearing, I inferred its salary estimate.
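The inference step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name, the dictionary layout and the listing IDs are all hypothetical, and it assumes the scraped results have already been grouped by the filter value that produced them.

```python
def infer_salary_estimates(results_by_threshold):
    """Map each listing ID to the highest filter value at which it still appeared.

    results_by_threshold: dict of {minimum_salary: set of listing IDs returned}
    (a hypothetical structure assumed for this sketch).
    """
    estimates = {}
    # Walk the thresholds in ascending order so that later (higher) values
    # overwrite earlier ones for listings that are still showing up.
    for threshold in sorted(results_by_threshold):
        for listing_id in results_by_threshold[threshold]:
            estimates[listing_id] = threshold
    return estimates


# Made-up example: three listings, three filter values.
results = {
    40_000: {"a", "b", "c"},
    45_000: {"a", "b"},
    50_000: {"a"},
}
estimates = infer_salary_estimates(results)
```

Under this scheme each value is a lower bound: a listing last seen at 45,000 has an estimate somewhere between 45,000 and the next filter step.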
Once I had all the individual job listing URLs, I visited each of them and downloaded its job description. I then created binary features indicating the presence of certain keywords. In the example below we have SQL, Python and Tableau.
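One way to build such keyword flags is a case-insensitive whole-word match per keyword. The sketch below is an assumption about how this could be done, not the original implementation; the function name and the sample descriptions are invented for illustration.

```python
import re

KEYWORDS = ["SQL", "Python", "Tableau"]


def keyword_features(description, keywords=KEYWORDS):
    """Return a {keyword: 0 or 1} dict flagging which keywords occur in the text."""
    features = {}
    for keyword in keywords:
        # \b keeps the match to whole words; re.escape guards against
        # keywords that contain regex metacharacters.
        pattern = rf"\b{re.escape(keyword)}\b"
        found = re.search(pattern, description, flags=re.IGNORECASE)
        features[keyword] = int(found is not None)
    return features


# Invented job description snippets.
descriptions = [
    "Strong SQL and Python skills required.",
    "Build dashboards in Tableau; sql is a plus.",
]
feature_rows = [keyword_features(text) for text in descriptions]
```

Each dictionary in `feature_rows` becomes one row of the feature table, with one column per keyword.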
Exploratory Data Analysis and Modeling
As a result, my final dataset comprised almost three thousand job listings across five major US cities: New York, San Francisco, Seattle, Austin and Chicago.
Here you can see the distributions of job listing salary estimates per city for data analysts and data scientists. The salary estimates are clearly higher for data scientists. Data analyst salary estimates vary less across cities than data scientist estimates do. San Francisco leads for both roles.
The plot below shows the keyword frequencies across the job descriptions. Some of the biggest differences are in Python, Cloud and Spark, which appear much more frequently in data scientist job descriptions.
For modeling I used simple, interpretable models: Ordinary Least Squares, Ridge and Lasso regressions. For Ridge and Lasso, I ran a cross-validated grid search to find the best value of the alpha (regularization strength) hyperparameter. Below, you can see a scatterplot of actual vs predicted values, with color and size encoding job title and seniority respectively.
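A cross-validated grid search over alpha might look like the sketch below, here using scikit-learn's `GridSearchCV` with `Ridge`. The data is synthetic and the alpha grid, fold count and scoring metric are assumptions chosen for illustration, not the values used in the project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real dataset: three binary keyword features
# and a noisy salary target (all numbers are made up).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3)).astype(float)
true_effects = np.array([-5_000.0, 15_000.0, 2_000.0])
y = 90_000 + X @ true_effects + rng.normal(0, 5_000, size=200)

# Candidate regularization strengths (an assumed grid).
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alphas},
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",  # higher (less negative) is better
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
coefficients = search.best_estimator_.coef_
```

Swapping `Ridge()` for `Lasso()` gives the corresponding search for Lasso; Ordinary Least Squares has no alpha to tune.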
Here we can see the intercept and the coefficients for each feature of our final model.
We have to be very careful with the interpretation of these results. For example, a negative coefficient for SQL obviously doesn’t mean that knowing SQL reduces your expected salary; rather, it means that in this dataset, listings that mention SQL in the job description tend to have a lower estimated salary.