What Affects Data Scientist Salaries?

Scraping Data Scientist Job Listings and Predicting Salaries

Photo by Jonathan Riley on Unsplash

In this Metis Data Science Bootcamp project I wanted to understand how different factors affect estimated salaries for data professionals on a leading employment website. I scraped thousands of job listings, conducted exploratory data analysis of the scraped data, and built a model to predict salary estimates for full-time data scientist and data analyst roles in five cities across the US.

The biggest challenge for me was getting salary estimate data. Very few listings on the website have salary information, but the search page has a filter (see below) that sets a minimum expected salary. For each combination of job title, city and experience level, I slid the salary filter from $40,000 to $300,000 in $5,000 increments and saved all results using Beautiful Soup. By noting the filter value at which each job listing stopped appearing, I inferred its salary estimate.

Job Result Search Page
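The inference step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function name `infer_salary_estimates`, the listing IDs and the tiny example input are all hypothetical, and it assumes the scraped results are already stored as a set of listing IDs per filter value.

```python
from typing import Dict, Set


def infer_salary_estimates(results_by_threshold: Dict[int, Set[str]]) -> Dict[str, int]:
    """Map each listing ID to the highest salary filter value at which it still appeared."""
    estimates: Dict[str, int] = {}
    # Visit filter values in ascending order so that higher values overwrite lower ones.
    for threshold in sorted(results_by_threshold):
        for listing_id in results_by_threshold[threshold]:
            estimates[listing_id] = threshold
    return estimates


# Hypothetical scrape output: listing IDs returned for each minimum-salary filter value.
example_results = {
    40_000: {"job_a", "job_b", "job_c"},
    45_000: {"job_a", "job_b"},
    50_000: {"job_a"},
    55_000: set(),
}

example_estimates = infer_salary_estimates(example_results)
print(example_estimates)
```

With $5,000 steps, each inferred value is a lower bound: the listing's underlying estimate lies somewhere between that filter value and the next one.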

Once I had all the individual job listing URLs, I visited each one and downloaded its job description. I then created binary features indicating the presence of certain keywords. The example listing below mentions SQL, Python and Tableau.

Job Listing Page
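A minimal sketch of how such binary keyword features could be built is shown here. The keyword list, the `keyword_flags` helper and the sample description are assumptions made for illustration, and the matching rule (case-insensitive, whole-word) is one reasonable choice, not necessarily the one used in the project.

```python
import re
from typing import Dict, Iterable

# Hypothetical keyword list; the project's actual list is longer.
KEYWORDS = ["SQL", "Python", "Tableau"]


def keyword_flags(description: str, keywords: Iterable[str] = KEYWORDS) -> Dict[str, int]:
    """Return 1 for each keyword found as a whole word in the description, else 0."""
    flags = {}
    for keyword in keywords:
        # Lookarounds prevent matches inside longer words, e.g. "SQL" inside "MySQLite".
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        found = re.search(pattern, description, flags=re.IGNORECASE)
        flags[keyword] = int(found is not None)
    return flags


sample_description = "We are looking for an analyst with strong SQL and python skills."
sample_flags = keyword_flags(sample_description)
print(sample_flags)
```

Applying the helper to every downloaded description yields one row of 0/1 features per listing.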

My final dataset comprised almost three thousand job listings across five major US cities: New York, San Francisco, Seattle, Austin and Chicago.

The plot below shows the distributions of job listing salary estimates per city for data analysts and data scientists. Salary estimates are clearly higher for data scientists. Data analyst estimates also vary less from city to city than data scientist estimates do. San Francisco leads for both roles.

Distributions of salary estimates per city for data analyst and data scientist listings

The plot below shows the keyword frequencies across the job descriptions. Some of the biggest differences are in Python, Cloud and Spark, which appear much more frequently in data scientist job descriptions.

Keyword frequencies in job descriptions for data analyst and data scientist listings

For modeling I used simple, interpretable models: Ordinary Least Squares, Ridge and Lasso regressions. I ran a cross-validated grid search to find the best value of the alpha hyperparameter. The scatterplot below shows actual vs. predicted values, with color encoding job title and size encoding seniority.

Ridge Regression (α=0.1)
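The alpha search can be sketched with scikit-learn's `GridSearchCV`, as below. This is an illustration under stated assumptions, not the project's code: the data is synthetic (random binary features with made-up coefficients), and the alpha grid, the five folds and the scoring metric are my own choices for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real dataset: 200 listings, 5 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)
made_up_coefs = np.array([15_000.0, -5_000.0, 8_000.0, 0.0, 3_000.0])
y = 70_000 + X @ made_up_coefs + rng.normal(0, 5_000, size=200)

# Hypothetical grid of regularization strengths to compare.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

searches = {}
for name, model in {"ridge": Ridge(), "lasso": Lasso(max_iter=10_000)}.items():
    # 5-fold cross-validation; scikit-learn maximizes the score, hence the negated MSE.
    search = GridSearchCV(model, param_grid, cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)
    searches[name] = search
    print(name, search.best_params_, round(search.best_score_))
```

After fitting, `search.best_estimator_` holds the model refit on all the data with the selected alpha, and its `intercept_` and `coef_` attributes give the values to interpret.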

Here we can see the intercept and the coefficients for each feature of our final model.

Intercept and feature coefficients with an example
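To make the arithmetic of a linear model concrete, here is a small sketch of how a prediction is assembled from an intercept and coefficients. The numbers and feature names are invented for illustration and are not the coefficients of the model in this article.

```python
from typing import Dict


def predict_salary(intercept: float, coefficients: Dict[str, float], listing: Dict[str, int]) -> float:
    """Linear prediction: intercept plus the coefficient of every feature, weighted by its value."""
    return intercept + sum(coef * listing.get(name, 0) for name, coef in coefficients.items())


# Hypothetical values, chosen only to illustrate the calculation.
intercept = 80_000.0
coefficients = {"is_data_scientist": 20_000.0, "python": 5_000.0, "sql": -3_000.0}

# A listing for a data scientist role that mentions both Python and SQL.
listing = {"is_data_scientist": 1, "python": 1, "sql": 1}

estimate = predict_salary(intercept, coefficients, listing)
print(estimate)  # 80,000 + 20,000 + 5,000 - 3,000 = 102,000
```

Because every feature is binary, each coefficient is simply the amount added to or subtracted from the prediction when that feature is present.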

We have to be very careful when interpreting these results. For example, a negative coefficient for SQL does not mean that knowing SQL reduces your expected salary. Rather, it means that listings in this dataset that mention SQL tend to have lower estimated salaries, with the other features held fixed.

Please visit my GitHub account to see the code for this and my other projects. Also feel free to reach out to me on LinkedIn.

Data Scientist with more than six years of experience working in Operations Management, Data Analytics, Finance and Marketing