Data Science Job Positions on Indeed. Observations and Analysis

Hello and welcome to the Data Science Job Positions on Indeed project. This is a revisited project that gives an overview of data science jobs on Indeed across more than 21 cities in the United States, published and scraped in August 2017.

The dataset contains 619 advertised positions that mention an annual/monthly/daily/hourly salary; all further code and findings are based on these observations. Our task was to predict whether, for a given city, salaries would fall above or below the overall median. The median was chosen as the reference because the distribution has a strong right skew, so the mean would not be descriptive of the data set.
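The effect of right skew on the mean can be seen on a small made-up sample (these numbers are purely illustrative, not from the dataset): a handful of very high salaries pulls the mean well above the median.

```python
import numpy as np

# Hypothetical right-skewed salary sample (made-up values)
salaries = np.array([45000, 52000, 60000, 65000, 70000, 80000, 250000, 300000])

mean = salaries.mean()        # pulled upward by the two high outliers
median = np.median(salaries)  # unaffected by how extreme the tail is

print(mean)    # 115250.0
print(median)  # 67500.0
```

With a right-skewed distribution the mean sits well above the median, which is why the median is the more robust cutoff here.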

I decided to add statistical data for each city to see whether it helps predict if the mean salary in a city falls above or below the overall median salary.

If you are mostly interested in the structure of the web-scraping code, please proceed to the very bottom of the page.

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
import time
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

Part 1. Dataset and EDA

In [2]:
df = pd.read_csv('../data/jobsdf.csv')
In [3]:
print df.shape
df.head()
(15187, 8)
Out[3]:
Unnamed: 0 jobtitle company city location salarytxt paybase summary
0 0 machine learning engineer (associate) - intell... JP Morgan Chase New+York%2CNY New York, NY 10001 (Chelsea area) NaN NaN jp morgan intelligent solutions (jpmis) is a n...
1 1 data science developer Morgan Stanley New+York%2CNY New York, NY 10032 (Washington Heights area) NaN NaN as a data scientist developer your role will b...
2 2 quantitative analysis, full time analyst (nort... Citi New+York%2CNY New York, NY NaN NaN you're the brains behind our work. you’re read...
3 3 sr data scientist Fimo Info Solutions LLC New+York%2CNY New York, NY 110000.0 annual sr data scientist, nyc*. the data scientist wi...
4 4 tech lead - data scientist, nlp Grubhub New+York%2CNY New York, NY NaN NaN good understanding of data structures and comp...

Next steps:

- Drop column "Unnamed: 0"
- Drop all records that don't mention a salary (leaves only 857 records)
- See what's left
In [4]:
# Dropping "Unnamed: 0" and all lines that don't contain salaries
df = df[df.paybase.isnull() == False]
df.drop("Unnamed: 0", axis = 1, inplace = True)
print df.shape
df.head()
(857, 7)
Out[4]:
jobtitle company city location salarytxt paybase summary
3 sr data scientist Fimo Info Solutions LLC New+York%2CNY New York, NY 110000.0 annual sr data scientist, nyc*. the data scientist wi...
6 climate and sustainability analyst DEPT OF ENVIRONMENT PROTECTION New+York%2CNY New York, NY 65977.0 annual knowledge and practical application of quantit...
7 data analyst POLICE DEPARTMENT New+York%2CNY New York, NY 79249.5 annual extensive knowledge of applied statistics, ana...
15 data specialist, bureau of systems strengtheni... DEPT OF HEALTH/MENTAL HYGIENE New+York%2CNY Queens, NY 65600.0 hourly experience and the ability to conduct evaluati...
20 assistant research scientist i grade 14 job #1... New York State Psychiatry Institute New+York%2CNY New York, NY 32000.0 hourly the assistant research scientist is responsibl...

Drop duplicates

In [5]:
df.drop_duplicates(inplace = True)
df.reset_index(inplace = True)
df.drop("index", axis = 1, inplace = True)
print df.shape
df.head()
(619, 7)
Out[5]:
jobtitle company city location salarytxt paybase summary
0 sr data scientist Fimo Info Solutions LLC New+York%2CNY New York, NY 110000.0 annual sr data scientist, nyc*. the data scientist wi...
1 climate and sustainability analyst DEPT OF ENVIRONMENT PROTECTION New+York%2CNY New York, NY 65977.0 annual knowledge and practical application of quantit...
2 data analyst POLICE DEPARTMENT New+York%2CNY New York, NY 79249.5 annual extensive knowledge of applied statistics, ana...
3 data specialist, bureau of systems strengtheni... DEPT OF HEALTH/MENTAL HYGIENE New+York%2CNY Queens, NY 65600.0 hourly experience and the ability to conduct evaluati...
4 assistant research scientist i grade 14 job #1... New York State Psychiatry Institute New+York%2CNY New York, NY 32000.0 hourly the assistant research scientist is responsibl...
In [6]:
def eda(dataframe):
    print "missing values \n", dataframe.isnull().sum()
    print "dataframe index \n", dataframe.index
    print "dataframe types \n", dataframe.dtypes
    print "dataframe shape \n", dataframe.shape
    print "dataframe describe \n", dataframe.describe()
    for item in dataframe:
        print item
        print dataframe[item].nunique()

eda(df)
missing values 
jobtitle     0
company      0
city         0
location     0
salarytxt    0
paybase      0
summary      0
dtype: int64
dataframe index 
RangeIndex(start=0, stop=619, step=1)
dataframe types 
jobtitle      object
company       object
city          object
location      object
salarytxt    float64
paybase       object
summary       object
dtype: object
dataframe shape 
(619, 7)
dataframe describe 
           salarytxt
count     619.000000
mean    95153.390953
std     46808.678054
min     19200.000000
25%     60262.750000
50%     86400.000000
75%    125000.000000
max    300000.000000
jobtitle
489
company
288
city
29
location
185
salarytxt
266
paybase
3
summary
575
In [7]:
df.describe()
Out[7]:
salarytxt
count 619.000000
mean 95153.390953
std 46808.678054
min 19200.000000
25% 60262.750000
50% 86400.000000
75% 125000.000000
max 300000.000000

Plotting the salary distribution

In [140]:
import matplotlib.pyplot as plt
df.salarytxt.plot.hist(figsize = (12,8),bins=10)
plt.show()

Here I check whether any records remain with salaries below $20K

In [141]:
df[df.salarytxt < 20000]
Out[141]:
jobtitle company city location salarytxt paybase summary
267 market research analyst/quality assurance RDAssociates, Inc. Philadelphia Narberth, PA 19200.0 hourly marketing research and consulting firm seeks a...
365 data analysis intern hear.com Miami Miami, FL 33138 (Upper Eastside area) 19200.0 hourly hear.com, the leading innovator in providing m...

Grouping by city to see how many positions with salaries there are in each city

In [142]:
print df.groupby(df.city)['summary'].count()
df.groupby(df.city)['summary'].count().plot(figsize = (16,12),kind = 'bar', fontsize = 14)
_ = plt.title("Number of Vacancies with Salaries in City", fontsize = 16)
_ = plt.ylabel("Number of Vacancies", fontsize = 14)
_ = plt.xlabel("City", fontsize = 14)
plt.show()
city
Atlanta                  25
Austin                   22
Baltimore                25
Boston                   42
Charlotte                12
Chicago                  29
Columbus                  6
Dallas                    9
Denver                    9
Detroit                   4
Fort Worth                3
Houston                  24
Kansas+City               7
Los+Angeles              25
Mesa%2CAZ                11
Miami                    15
Minneapolis               5
New+York%2CNY           111
Philadelphia             25
Phoenix                   9
Pittsburgh                7
Portland%2COR            12
Raleigh                  21
San+Antonio%2CTX          2
San+Diego                13
San+Francisco            35
San+Jose                 29
Seattle                  23
Washington+City%2CDC     59
Name: summary, dtype: int64

Now let's save the overall median salary

In [143]:
median = df.salarytxt.median()
median
Out[143]:
86400.0

I also found some statistics for major cities, including population, density, DPF, latitude, longitude, and median household income

In [144]:
cities = pd.read_csv('../data/city_stats.csv')
cities.dropna(inplace = True)
In [145]:
cities.head()
Out[145]:
City State Population Density DPF Latitude Longitude MedianHHInc Ppower
0 New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
1 Los+Angeles California 3976322 8092 32176.40 34.0194 118.4108 45903.0 118.0
2 Chicago Illinois 2704958 11842 32032.11 41.8376 87.6818 51046.0 118.0
3 Philadelphia Pennsylvania 1567872 11379 17840.82 40.0094 75.1333 47528.0 116.0
4 San+Francisco California 870887 17179 14960.97 37.7751 122.4193 63024.0 102.0

Then I merge the original dataframe with the statistics for each city and drop the redundant "City" column
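Since an inner merge silently drops any job whose city has no row in the stats table (which is why 619 records shrink to 580 below), it can be worth flagging the unmatched cities first. A minimal sketch on toy frames, with made-up rows but the same key columns as the real data:

```python
import pandas as pd

# Toy stand-ins for df and cities (made-up rows, same key columns)
df_toy = pd.DataFrame({'city': ['New+York%2CNY', 'Columbus', 'Nowhere'],
                       'salarytxt': [110000.0, 65000.0, 50000.0]})
cities_toy = pd.DataFrame({'City': ['New+York%2CNY', 'Columbus'],
                           'Population': [8537673, 860090]})

# A left merge with indicator=True marks rows whose city has no stats row;
# these are exactly the records an inner merge would drop.
check = df_toy.merge(cities_toy, left_on='city', right_on='City',
                     how='left', indicator=True)
unmatched = check[check['_merge'] == 'left_only']['city'].tolist()
print(unmatched)  # ['Nowhere']
```

This is only a diagnostic; the actual analysis below proceeds with the inner merge.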

In [146]:
jobs = pd.merge(df, cities, left_on = 'city', right_on = 'City', how = "inner")
jobs
Out[146]:
jobtitle company city location salarytxt paybase summary City State Population Density DPF Latitude Longitude MedianHHInc Ppower
0 sr data scientist Fimo Info Solutions LLC New+York%2CNY New York, NY 110000.0 annual sr data scientist, nyc*. the data scientist wi... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
1 climate and sustainability analyst DEPT OF ENVIRONMENT PROTECTION New+York%2CNY New York, NY 65977.0 annual knowledge and practical application of quantit... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
2 data analyst POLICE DEPARTMENT New+York%2CNY New York, NY 79249.5 annual extensive knowledge of applied statistics, ana... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
3 data specialist, bureau of systems strengtheni... DEPT OF HEALTH/MENTAL HYGIENE New+York%2CNY Queens, NY 65600.0 hourly experience and the ability to conduct evaluati... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
4 assistant research scientist i grade 14 job #1... New York State Psychiatry Institute New+York%2CNY New York, NY 32000.0 hourly the assistant research scientist is responsibl... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
5 geospatial data scientist DEPARTMENT OF FINANCE New+York%2CNY Manhattan, NY 75643.0 annual the property valuation & mapping unit is seeki... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
6 data analyst, bureau of immunization/immunizat... DEPT OF HEALTH/MENTAL HYGIENE New+York%2CNY New York, NY 68000.0 hourly -conduct geographic analysis using cir data. e... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
7 research analyst Paradigm Investigations New+York%2CNY New York, NY 40000.0 hourly collecting documents and information about cus... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
8 entry level – research analyst/editor/content ... XG Consultants Group, Inc. New+York%2CNY New York, NY 10017 (Midtown area) 24000.0 hourly job overview: xg consultants group is looking ... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
9 senior data engineer Enterprise Select New+York%2CNY New York, NY 10011 (Chelsea area) 140000.0 annual ensure data pipeline provides clean organized,... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
10 city research scientist 1 DEPT OF ENVIRONMENT PROTECTION New+York%2CNY Queens, NY 65977.0 annual collect asset condition data and costs of vari... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
11 research analyst ii Research Foundation of The City University of ... New+York%2CNY New York, NY 65000.0 annual work closely with reps data management staff t... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
12 data scientist - neural networks Geode Executive Search New+York%2CNY New York, NY 130000.0 annual a well-known tech company in new york is looki... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
13 senior assessment data analyst DEPARTMENT OF FINANCE New+York%2CNY Manhattan, NY 75557.5 annual experience using relational databases, data mi... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
14 client facing python engineers for a.i data co... Birch & James Associates Limited New+York%2CNY New York, NY 90000.0 annual an interest in machine learning and/or data sc... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
15 predictive risk analyst ADMIN FOR CHILDREN'S SVCS New+York%2CNY New York, NY 77035.0 annual critical to this position is strong analytical... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
16 senior data engineer Harnham New+York%2CNY New York, NY 175000.0 annual senior data engineer. design, implement and ma... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
17 data analyst/modeler DEPARTMENT OF FINANCE New+York%2CNY Brooklyn, NY 75557.5 annual strong programming, data analysis, statistical... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
18 data engineer ektello New+York%2CNY New York, NY 72800.0 hourly experience with integration of data from multi... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
19 senior research analyst CONSUMER AFFAIRS New+York%2CNY Manhattan, NY 76143.0 annual and assisting with data analysis, management, ... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
20 environmental analyst, bureau of environmental... DEPT OF HEALTH/MENTAL HYGIENE New+York%2CNY Manhattan, NY 62693.0 annual completing data checks and documentation for r... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
21 health scientist Centers for Disease Control and Prevention New+York%2CNY New York, NY 120186.0 annual analyze scientific investigation data. the cen... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
22 statistician, level i POLICE DEPARTMENT New+York%2CNY New York, NY 52903.5 annual selected candidate will be responsible for pro... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
23 policy data analyst DEPARTMENT OF FINANCE New+York%2CNY Manhattan, NY 75557.5 annual strong quantitative and research skills, inclu... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
24 associate ux researcher Key Lime Interactive New+York%2CNY New York, NY 10011 (Chelsea area) 50000.0 annual analyzing qualitative & quantitative data. per... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
25 research assistant i GUTTMACHER New+York%2CNY New York, NY 40000.0 annual research associate, senior/principal research ... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
26 criminal justice analyst, bureau of mental health DEPT OF HEALTH/MENTAL HYGIENE New+York%2CNY Queens, NY 79249.5 annual develop expertise in maven, and other availabl... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
27 research analyst , bureau of hiv/aids preventi... DEPT OF HEALTH/MENTAL HYGIENE New+York%2CNY New York, NY 78790.5 annual and a general appreciation for data quality. -... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
28 bioinformatics analyst Envisagenics, Inc. New+York%2CNY New York, NY 60000.0 annual to work with envisagenics data engineers and b... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
29 vp, data scientist - banking Harnham New+York%2CNY New York, NY 180000.0 annual vp, data scientist - banking. data scientist |... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
550 program support specialist Department of Defense Washington+City%2CDC Bethesda, MD 47354.5 annual analyze problems to identify significant facto... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
551 supervisory physical scientist zp-1301-05 (de/cr) Department of Commerce Washington+City%2CDC Suitland, MD 146833.5 annual supervisory physical scientist, zp-1301-05. th... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
552 executive director, u.s. botanic garden (super... Legislative Branch Washington+City%2CDC Washington, DC 155703.0 annual e-verify is an internet-based system that comp... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
553 apps developer Central Intelligence Agency Washington+City%2CDC Washington, DC 90936.0 annual big data concepts and technologies such as apa... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
554 machine learning Jobspring Partners Washington+City%2CDC Washington, DC 140000.0 annual professional data science experience. familiar... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
555 data scientist ( chinese speaking / deep learn... Venturi Ltd Washington+City%2CDC Washington, DC 20007 (Georgetown area) 107500.0 annual as the senior data scientist you will:. senior... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
556 statistical data analyst (us citizen) PeopleTek Washington+City%2CDC Bethesda, MD 70000.0 annual of various data, reporting, and a wide variety... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
557 data scientist - ousd(i)-hcmo-dcips Red Gate Group Washington+City%2CDC Arlington, VA 112500.0 annual data scientist (full-time contract). under thi... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
558 fellow for clinical epidemiology NCI, Epidemiology and Genomics Research Program Washington+City%2CDC Rockville, MD 41700.0 annual and data analysis. managing and analyzing epid... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
559 data scientist/developer/engineer with securit... ByteCubed Washington+City%2CDC Crystal City, VA 137500.0 annual bytecubed is seeking a data scientist/develope... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
560 environmental scientist - water resources mana... WSSC Washington+City%2CDC Laurel, MD 20707 95396.5 annual ability to plan and direct the work of profess... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
561 statistician - healthcare or pharma industry Contingent/Direct Consultants Washington+City%2CDC Gaithersburg, MD 20877 150000.0 annual experience of development, program design and ... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
562 earned value management & cost research analys... National Security Agency Washington+City%2CDC Fort Meade, MD 77025.0 annual use scientific methods to collect and analyze ... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
563 payroll support associate at epa Oak Ridge Associated Universities Washington+City%2CDC Washington, DC 35200.0 hourly performing data entry in various ord and agenc... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
564 data scientist Workbridge Associates Washington+City%2CDC Washington, DC 95000.0 annual this data scientist will be responsible for an... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
565 senior data scientist Jobspring Partners Washington+City%2CDC Washington, DC 145000.0 annual one of the largest media conglomerates in the ... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
566 embedded engineer intern digiBlitz Inc Washington+City%2CDC Herndon, VA 42500.0 annual implementation and integration of heterogeneou... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
567 senior java developer HireStrategy Washington+City%2CDC Washington, DC 20005 (Logan Circle area) 88000.0 hourly partner with data scientists to implement new ... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
568 java/big data engineer Jobspring Partners Washington+City%2CDC Washington, DC 115000.0 annual these applications are transforming traditiona... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
569 supervisory statistician Court Services and Offender Supervision Agency... Washington+City%2CDC Washington, DC 146833.5 annual directs and supervises data analyses and evalu... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
570 computer scientist Department of Defense Washington+City%2CDC Fort Meade, MD 56530.5 annual general experience is defined as experience th... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
571 data scientist Robert Walters Washington+City%2CDC Washington, DC 90000.0 annual about the data scientist:. key responsibilitie... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
572 director, advanced analytics Harnham Washington+City%2CDC Washington, DC 180000.0 annual campaign, insight, analytics, security, politi... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
573 software engineer NatureServe Washington+City%2CDC Arlington, VA 72500.0 annual experience with big data. most of these projec... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
574 data analyst Jobspring Partners Washington+City%2CDC Washington, DC 125000.0 annual experience using data visualization tools such... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
575 software engineer intern digiBlitz Inc Washington+City%2CDC Herndon, VA 42500.0 annual - knowledge on hadoop or other big data analyt... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
576 medical technologist (molecular systems) Department of the Army Washington+City%2CDC Silver Spring, MD 70376.0 annual confirm test results and develop data that may... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
577 machine learning data engineer Jobspring Partners Washington+City%2CDC Washington, DC 120000.0 annual professional data science experience. familiar... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
578 principal environmental engineer WSSC Washington+City%2CDC Laurel, MD 20707 95396.5 annual ability to plan and oversee the work of profes... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0
579 principal statistician - oncology Penfield Search Partners Washington+City%2CDC Gaithersburg, MD 150000.0 annual experience of development, programme design an... Washington+City%2CDC District of Columbia 681170 9856 6713.61 38.9041 77.0171 57291.0 107.0

580 rows × 16 columns

  • Dropping the duplicate "city" column.
In [147]:
jobs[jobs.City.isnull() == True].head()
Out[147]:
jobtitle company city location salarytxt paybase summary City State Population Density DPF Latitude Longitude MedianHHInc Ppower
In [148]:
jobs = jobs.drop('city', axis = 1)
print jobs.shape
jobs.head()
(580, 15)
Out[148]:
jobtitle company location salarytxt paybase summary City State Population Density DPF Latitude Longitude MedianHHInc Ppower
0 sr data scientist Fimo Info Solutions LLC New York, NY 110000.0 annual sr data scientist, nyc*. the data scientist wi... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
1 climate and sustainability analyst DEPT OF ENVIRONMENT PROTECTION New York, NY 65977.0 annual knowledge and practical application of quantit... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
2 data analyst POLICE DEPARTMENT New York, NY 79249.5 annual extensive knowledge of applied statistics, ana... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
3 data specialist, bureau of systems strengtheni... DEPT OF HEALTH/MENTAL HYGIENE Queens, NY 65600.0 hourly experience and the ability to conduct evaluati... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
4 assistant research scientist i grade 14 job #1... New York State Psychiatry Institute New York, NY 32000.0 hourly the assistant research scientist is responsibl... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0
In [149]:
temp = pd.DataFrame()
for i in jobs.paybase.unique():
    mean_ = round(jobs[jobs.paybase == i].salarytxt.mean(),2)
    median_ = round(jobs[jobs.paybase == i].salarytxt.median(),2)
    n = jobs[jobs.paybase == i].salarytxt.count()
    temp = temp.append([[i, mean_, median_, n]], ignore_index=True)
temp.columns = ['Base', 'Mean', 'Median','Count']
temp
Out[149]:
Base Mean Median Count
0 annual 106466.23 97200.0 452
1 hourly 62197.80 51200.0 91
2 monthly 63416.92 64140.0 37
In [150]:
temp.set_index('Base', inplace = True)
temp.Count.plot.pie(autopct='%.2f', figsize = (8,8), fontsize = 14)
plt.title('Number of Job Announcements with Salaries per Payment Base', fontsize = 14)
plt.show()
In [151]:
temp['Mean'].sort_values(ascending = False).plot(figsize = (12,8),kind = 'bar')
plt.show()

Now building a dummy variable comparing salaries to the overall median (in USD)

  • Since annual salaries differ significantly from salaries paid on a monthly or hourly basis, I decided to drop the latter.
  • Re-measure the median for the annual salaries only
In [152]:
jobs = jobs[jobs.paybase == 'annual']
median_ = jobs.salarytxt.median()
print median_
97200.0
In [153]:
jobs.shape
Out[153]:
(452, 15)

Categorizing Data

  • creating a dummy variable: True (1) if above the median, False (0) otherwise
In [154]:
# If the salary is higher than the city median, then it's ONE
jobs['sal_to_med'] = (jobs.salarytxt > median_)
jobs.head()
Out[154]:
jobtitle company location salarytxt paybase summary City State Population Density DPF Latitude Longitude MedianHHInc Ppower sal_to_med
0 sr data scientist Fimo Info Solutions LLC New York, NY 110000.0 annual sr data scientist, nyc*. the data scientist wi... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0 True
1 climate and sustainability analyst DEPT OF ENVIRONMENT PROTECTION New York, NY 65977.0 annual knowledge and practical application of quantit... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0 False
2 data analyst POLICE DEPARTMENT New York, NY 79249.5 annual extensive knowledge of applied statistics, ana... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0 False
5 geospatial data scientist DEPARTMENT OF FINANCE Manhattan, NY 75643.0 annual the property valuation & mapping unit is seeki... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0 False
9 senior data engineer Enterprise Select New York, NY 10011 (Chelsea area) 140000.0 annual ensure data pipeline provides clean organized,... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0 True

Adding a dummy variable for "supervisor", "manager", "director", "senior", or "president" in the job title

In [155]:
jobs['MgrDummy'] = jobs.jobtitle.str.contains('supervisor|manager|director|senior|president')
jobs.head(2)
Out[155]:
jobtitle company location salarytxt paybase summary City State Population Density DPF Latitude Longitude MedianHHInc Ppower sal_to_med MgrDummy
0 sr data scientist Fimo Info Solutions LLC New York, NY 110000.0 annual sr data scientist, nyc*. the data scientist wi... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0 True False
1 climate and sustainability analyst DEPT OF ENVIRONMENT PROTECTION New York, NY 65977.0 annual knowledge and practical application of quantit... New+York%2CNY New York 8537673 27012 230619.62 40.6643 73.9385 59799.0 100.0 False False

Selecting and reorganizing columns, so that I can turn them into categories later

In [156]:
jobs_temp = jobs[['salarytxt', 'Population','Density','MgrDummy','sal_to_med']]
jobs_temp.shape
Out[156]:
(452, 5)
In [157]:
jobs_temp.head()
Out[157]:
salarytxt Population Density MgrDummy sal_to_med
0 110000.0 8537673 27012 False True
1 65977.0 8537673 27012 False False
2 79249.5 8537673 27012 False False
5 75643.0 8537673 27012 False False
9 140000.0 8537673 27012 True True

Binning the selected variables by rank into two categories (below/above the median rank), except for MgrDummy and the target column sal_to_med

In [158]:
y = jobs_temp.sal_to_med
X = jobs_temp.drop(["sal_to_med",'salarytxt'], axis = 1)
print y.head(), X.head()
0     True
1    False
2    False
5    False
9     True
Name: sal_to_med, dtype: bool    Population  Density  MgrDummy
0     8537673    27012     False
1     8537673    27012     False
2     8537673    27012     False
5     8537673    27012     False
9     8537673    27012      True
In [159]:
from sklearn import preprocessing
X_norm = preprocessing.normalize(X, norm = 'l1')
In [160]:
print X.shape
X_norm
(452, 3)
Out[160]:
array([[ 0.99684612,  0.00315388,  0.        ],
       [ 0.99684612,  0.00315388,  0.        ],
       [ 0.99684612,  0.00315388,  0.        ],
       ..., 
       [ 0.98573715,  0.01426285,  0.        ],
       [ 0.98573715,  0.01426285,  0.        ],
       [ 0.98573715,  0.01426285,  0.        ]])
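Note that `preprocessing.normalize` scales each *row* to unit L1 norm, which is why the binary MgrDummy column all but vanishes next to Population in the output above. If per-feature scaling were the intent instead, a column-wise scaler such as `MinMaxScaler` would be closer; a sketch contrasting the two on a toy array (made-up values):

```python
import numpy as np
from sklearn.preprocessing import normalize, MinMaxScaler

# Toy rows of [Population, Density, MgrDummy] (made-up values)
X_toy = np.array([[8537673., 27012., 0.],
                  [3976322., 8092., 1.]])

# Row-wise L1: each row sums to 1, so the 0/1 dummy is dwarfed by Population
row_norm = normalize(X_toy, norm='l1')
print(row_norm.sum(axis=1))  # [1. 1.]

# Column-wise min-max: each feature is mapped to [0, 1] independently
col_scaled = MinMaxScaler().fit_transform(X_toy)
print(col_scaled[:, 2])  # the dummy column survives as [0. 1.]
```

Which scaling is appropriate depends on the model; tree-based models like the Random Forest used later are largely insensitive to either choice.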
In [161]:
X = pd.DataFrame(X_norm, columns = X.columns)
X.head()
Out[161]:
Population Density MgrDummy
0 0.996846 0.003154 0.000000e+00
1 0.996846 0.003154 0.000000e+00
2 0.996846 0.003154 0.000000e+00
3 0.996846 0.003154 0.000000e+00
4 0.996846 0.003154 1.167585e-07
In [162]:
jobs_categories = np.floor(jobs_temp[jobs_temp.columns[:-2]].rank() / len(jobs_temp) /.5001).astype(int)
jobs_categories = jobs_categories.join([jobs_temp.MgrDummy, jobs_temp.sal_to_med])
print jobs_categories.shape
jobs_categories.head()
(452, 5)
Out[162]:
salarytxt Population Density MgrDummy sal_to_med
0 1 1 1 False True
1 0 1 1 False False
2 0 1 1 False False
5 0 1 1 False False
9 1 1 1 True True
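The rank trick above (`floor(rank / len / .5001)`) produces two bins per column, not quartiles. If actual quartile bins were wanted, `pd.qcut` does this directly; a small sketch on a toy salary column (made-up values):

```python
import pandas as pd

# Toy salary column (made-up values)
s = pd.Series([30000, 45000, 60000, 75000, 90000, 120000, 150000, 200000])

# pd.qcut splits by rank into q equal-sized bins; labels=False gives
# integer bin codes 0..q-1 instead of interval labels
quartile = pd.qcut(s, q=4, labels=False)
print(quartile.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The two-bin split used here is effectively `q=2`, i.e. below/above the median rank.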
In [163]:
# Next I populate the vocabulary from the job summaries, merge the two
# dataframes, and run the Random Forest on a train/test split

Populating the vocabulary based on the job summaries

In [164]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = 'english',   \
                             max_features = 50) 

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
train_data_features = vectorizer.fit_transform(jobs.summary)

# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_features = train_data_features.toarray()
print train_data_features.shape
train_data_features
(452, 50)
Out[164]:
array([[0, 0, 0, ..., 1, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]])

Next, check the vocabulary (its length and contents)

In [165]:
vocab = vectorizer.get_feature_names()
len(vocab)
Out[165]:
50

Create the dataframe

In [166]:
summary = pd.DataFrame(train_data_features, columns = vocab)
print summary.shape
summary.head()
(452, 50)
Out[166]:
ability analysis analyst analytics analyze analyzing big client clinical company ... solutions sources statistical team tools use using work working years
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 1 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
2 0 0 0 1 0 0 1 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 50 columns

In [167]:
print 'DataFrame with: senior, manager, director or president in jobtitle: ',jobs[jobs.MgrDummy == True].shape
print 'Overall DataFrame ', jobs.shape
DataFrame with: senior, manager, director or president in jobtitle:  (92, 17)
Overall DataFrame  (452, 17)
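The MgrDummy flag counted above was built earlier in the notebook from the job titles; a hypothetical reconstruction of that kind of keyword check (the exact pattern is my assumption, based on the keywords named in the printout):

```python
import re

# Assumed seniority keywords, taken from the printout above
SENIOR_PATTERN = re.compile(r'\b(senior|manager|director|president)\b')

def mgr_dummy(jobtitle):
    # True when the job title signals a senior or managerial position
    return bool(SENIOR_PATTERN.search(jobtitle.lower()))
```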
In [168]:
print jobs_categories.shape
jobs_categories.head(3)
(452, 5)
Out[168]:
salarytxt Population Density MgrDummy sal_to_med
0 1 1 1 False True
1 0 1 1 False False
2 0 1 1 False False
In [169]:
X.head()
Out[169]:
Population Density MgrDummy
0 0.996846 0.003154 0.000000e+00
1 0.996846 0.003154 0.000000e+00
2 0.996846 0.003154 0.000000e+00
3 0.996846 0.003154 0.000000e+00
4 0.996846 0.003154 1.167585e-07
In [170]:
summary.head()
Out[170]:
ability analysis analyst analytics analyze analyzing big client clinical company ... solutions sources statistical team tools use using work working years
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 1 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
2 0 0 0 1 0 0 1 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 50 columns

In [171]:
super_jobs = pd.concat([X,summary], axis = 1, ignore_index=True)
super_jobs.columns = list(X.columns) + list(summary.columns)
print 'Super Data Frame', super_jobs.shape
super_jobs.head()
Super Data Frame (452, 53)
Out[171]:
Population Density MgrDummy ability analysis analyst analytics analyze analyzing big ... solutions sources statistical team tools use using work working years
0 0.996846 0.003154 0.000000e+00 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0.996846 0.003154 0.000000e+00 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
2 0.996846 0.003154 0.000000e+00 0 0 0 1 0 0 1 ... 0 0 0 0 1 0 0 0 0 0
3 0.996846 0.003154 0.000000e+00 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0.996846 0.003154 1.167585e-07 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 53 columns

In [172]:
X = super_jobs
X.head()
Out[172]:
Population Density MgrDummy ability analysis analyst analytics analyze analyzing big ... solutions sources statistical team tools use using work working years
0 0.996846 0.003154 0.000000e+00 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0.996846 0.003154 0.000000e+00 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
2 0.996846 0.003154 0.000000e+00 0 0 0 1 0 0 1 ... 0 0 0 0 1 0 0 0 0 0
3 0.996846 0.003154 0.000000e+00 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0.996846 0.003154 1.167585e-07 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 53 columns

Let's try Random Forest on Population, Density, EmedHHI, MgrDummy as features and sal_to_med as predictor¶

In [173]:
jobs_categories.head()
y = jobs_categories.sal_to_med
print X.shape
X.head()
(452, 53)
Out[173]:
Population Density MgrDummy ability analysis analyst analytics analyze analyzing big ... solutions sources statistical team tools use using work working years
0 0.996846 0.003154 0.000000e+00 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0.996846 0.003154 0.000000e+00 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
2 0.996846 0.003154 0.000000e+00 0 0 0 1 0 0 1 ... 0 0 0 0 1 0 0 0 0 0
3 0.996846 0.003154 0.000000e+00 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0.996846 0.003154 1.167585e-07 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 53 columns

In [174]:
#1. Split the data into training and testing parts

feat_labels = list(X.columns)
print len(feat_labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print X.shape
print X_train.shape, X_test.shape
print y_train.shape, y_test.shape
53
(452, 53)
(316, 53) (136, 53)
(316,) (136,)
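The 316/136 split above holds out 30% of the 452 rows. What `train_test_split` does can be sketched with the standard library (a simplified illustration, not sklearn's actual implementation):

```python
import random

def shuffle_split(n_rows, test_size, seed=0):
    # Shuffle the row indices, then hold out the last `test_size` fraction
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[:-n_test], indices[-n_test:]

train_idx, test_idx = shuffle_split(452, 0.3)  # 316 train rows, 136 test rows
```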
In [175]:
# 2. Create a random forest classifier
clf = RandomForestClassifier(n_estimators=70, random_state=0, n_jobs=-1)

# Train the classifier
clf.fit(X_train, y_train)

# Print the name and gini importance of each feature
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature[:20])
('Population', 0.095743921567549448)
('Density', 0.092879374851942781)
('MgrDummy', 0.045664113263210987)
(u'ability', 0.022270531453539601)
(u'analysis', 0.030215406246759845)
(u'analyst', 0.011007792874737059)
(u'analytics', 0.032237918886094855)
(u'analyze', 0.020139095162945299)
(u'analyzing', 0.010916816931129473)
(u'big', 0.01591316130714782)
(u'client', 0.0098473250840064256)
(u'clinical', 0.0051950999616316718)
(u'company', 0.02309398416319865)
(u'data', 0.06499031213688207)
(u'design', 0.0093490994589386454)
(u'develop', 0.010725509208902861)
(u'development', 0.0067876694383505283)
(u'engineer', 0.014742830664195636)
(u'engineers', 0.0053284235187250201)
(u'experience', 0.018829804749553803)
(u'growing', 0.0065143590198864553)
(u'join', 0.010133615174390759)
(u'learning', 0.02539238375475085)
(u'level', 0.0045626102635137914)
(u'looking', 0.019099852799998349)
(u'machine', 0.02239565230445242)
(u'management', 0.015105116386194266)
(u'methods', 0.011088771555010278)
(u'modeling', 0.0099924122394484914)
(u'new', 0.014481593605827245)
(u'processing', 0.010397202981342142)
(u'projects', 0.0086034585234265049)
(u'python', 0.0028732779064370023)
(u'quality', 0.0088362222130459103)
(u'quantitative', 0.0060850481062211942)
(u'reports', 0.008876007105868217)
(u'research', 0.041858871264691039)
(u'science', 0.021982614556429799)
(u'scientist', 0.051895841116068125)
(u'scientists', 0.022526144998467054)
(u'seeking', 0.0073873211371032468)
(u'senior', 0.019439338894228955)
(u'software', 0.0064214493582875846)
(u'solutions', 0.013458347509564825)
(u'sources', 0.0046799323287907814)
(u'statistical', 0.016872746185586478)
(u'team', 0.017132925456288785)
(u'tools', 0.0074698204151467705)
(u'use', 0.0066235429772719843)
(u'using', 0.0062791891508070059)
(u'work', 0.011675362481737072)
(u'working', 0.0058672204373647029)
(u'years', 0.0081135568629094063)

Next, I will assess the words on their own, ranked by importance¶

In [176]:
X_w = summary
feat_labels = list(X_w.columns)
print len(feat_labels)

X_train, X_test, y_train, y_test = train_test_split(X_w, y, test_size=0.3, random_state=0)
50
In [177]:
# 2. Create a random forest classifier
clf = RandomForestClassifier(n_estimators=70, random_state=0, n_jobs=-1)

# Train the classifier
clf.fit(X_train, y_train)

# Print the name and gini importance of each feature
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
(u'ability', 0.019731211862590803)
(u'analysis', 0.039797299790266499)
(u'analyst', 0.010826838292737638)
(u'analytics', 0.039833546679966754)
(u'analyze', 0.019980081272659302)
(u'analyzing', 0.015669750395107426)
(u'big', 0.021158356402510761)
(u'client', 0.012403419200068364)
(u'clinical', 0.010400502870459148)
(u'company', 0.030310916537109302)
(u'data', 0.096701208642020794)
(u'design', 0.01462885507184811)
(u'develop', 0.01172453545968126)
(u'development', 0.017826227078231571)
(u'engineer', 0.015829231637755514)
(u'engineers', 0.010895628208146416)
(u'experience', 0.024097686623611631)
(u'growing', 0.014031044519856577)
(u'join', 0.012675834345331403)
(u'learning', 0.035850116755974312)
(u'level', 0.0069589877168694881)
(u'looking', 0.021450911807070233)
(u'machine', 0.024865669559638492)
(u'management', 0.018720744101380353)
(u'methods', 0.011389095656509135)
(u'modeling', 0.012060254472723891)
(u'new', 0.020411143447278644)
(u'processing', 0.011763436273070235)
(u'projects', 0.010720395521500848)
(u'python', 0.0061078146783048104)
(u'quality', 0.010927817606336218)
(u'quantitative', 0.0097624302882069729)
(u'reports', 0.014483425188068689)
(u'research', 0.049036050163072435)
(u'science', 0.022676385745280532)
(u'scientist', 0.054750465422041344)
(u'scientists', 0.030392229228164855)
(u'seeking', 0.010642175524143458)
(u'senior', 0.039640780236206548)
(u'software', 0.0099552704653666224)
(u'solutions', 0.013889854316013453)
(u'sources', 0.0059238538249489405)
(u'statistical', 0.021306607112017227)
(u'team', 0.019107779721431229)
(u'tools', 0.011748094195373581)
(u'use', 0.0075455788169863976)
(u'using', 0.0086793132835269703)
(u'work', 0.021821442397927344)
(u'working', 0.0091954225058919394)
(u'years', 0.0096942790787155481)
In [178]:
listofimportance = pd.DataFrame(zip(feat_labels,clf.feature_importances_), columns = ['words','importance'])
In [179]:
importantwords = listofimportance[listofimportance['importance']>0.003].sort_values('importance', ascending=False)

So here are the twenty most influential words in the summary field, ranked by importance

In [180]:
importantwords[:20]
Out[180]:
words importance
10 data 0.096701
35 scientist 0.054750
33 research 0.049036
3 analytics 0.039834
1 analysis 0.039797
38 senior 0.039641
19 learning 0.035850
36 scientists 0.030392
9 company 0.030311
22 machine 0.024866
16 experience 0.024098
34 science 0.022676
47 work 0.021821
21 looking 0.021451
42 statistical 0.021307
6 big 0.021158
26 new 0.020411
4 analyze 0.019980
0 ability 0.019731
43 team 0.019108
In [182]:
# Let's see how exactly those words affect the salary prediction:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
m = lr.fit(X_w, y)
m = m.coef_.tolist()
# Pair the coefficients with X_w.columns: the model was fit on the 50-word
# matrix X_w, so zipping against X's 53 columns would misalign the labels
m = pd.DataFrame(zip(X_w.columns, m[0]), columns = ['features','log'])
m['exp'] = np.exp(m.log)
print "Seven words most negatively affecting salary", m.sort_values('exp', ascending=True).head(7)
print "Seven words most positively affecting salary", m.sort_values('exp', ascending=False).head(7)
Seven words most negatively affecting salary      features       log       exp
5    analyzing -0.946242  0.388197
32     reports -0.921383  0.397968
33    research -0.847504  0.428483
0      ability -0.791032  0.453377
15   engineers -0.690045  0.501554
48     working -0.593036  0.552647
8     clinical -0.588213  0.555318
Seven words most positively affecting salary      features       log       exp
38      senior  1.333546  3.794476
19    learning  1.136274  3.115140
9      company  1.053281  2.867043
25    modeling  0.957801  2.605960
26         new  0.946400  2.576419
43        team  0.903495  2.468215
11      design  0.854428  2.350031
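The `exp` column is the odds ratio, exp(coefficient): a log-odds coefficient of 1.333546 multiplies the odds of an above-median salary by roughly 3.79, while -0.946242 cuts them to roughly 0.39 of baseline. Checking that arithmetic against the printed values:

```python
import math

# Largest positive and most negative log-odds coefficients from the printout above
log_odds = [1.333546, -0.946242]

# exp turns additive log-odds into multiplicative odds ratios; values above 1
# push toward an above-median salary, values below 1 push away from it
odds_ratios = [math.exp(c) for c in log_odds]
```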

Nevertheless, I will return to the initial list of predictors and rerun the Random Forest, Extra Trees and Decision Tree classifiers

In [183]:
y = jobs_temp.salarytxt
print X.shape
print y.head()
X.head()
(452, 53)
0    110000.0
1     65977.0
2     79249.5
5     75643.0
9    140000.0
Name: salarytxt, dtype: float64
Out[183]:
Population Density MgrDummy ability analysis analyst analytics analyze analyzing big ... solutions sources statistical team tools use using work working years
0 0.996846 0.003154 0.000000e+00 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0.996846 0.003154 0.000000e+00 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
2 0.996846 0.003154 0.000000e+00 0 0 0 1 0 0 1 ... 0 0 0 0 1 0 0 0 0 0
3 0.996846 0.003154 0.000000e+00 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0.996846 0.003154 1.167585e-07 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 53 columns

In [184]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier

skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(X, y)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=41)
In [192]:
dt = RandomForestClassifier(class_weight='balanced')
s = cross_val_score(dt, X, y, cv=cv, n_jobs=30)
print "{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3))
Random Forest Score:	0.716 ± 0.056
In [186]:
y = jobs_categories.sal_to_med
#X = jobs_categories.drop(['sal_to_med','salarytxt'], axis = 1)
print X.shape
print y.unique()
X.head()
(452, 53)
[ True False]
Out[186]:
Population Density MgrDummy ability analysis analyst analytics analyze analyzing big ... solutions sources statistical team tools use using work working years
0 0.996846 0.003154 0.000000e+00 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0.996846 0.003154 0.000000e+00 0 1 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
2 0.996846 0.003154 0.000000e+00 0 0 0 1 0 0 1 ... 0 0 0 0 1 0 0 0 0 0
3 0.996846 0.003154 0.000000e+00 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0.996846 0.003154 1.167585e-07 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 53 columns

Stratified K-Fold with the Decision Tree Classifier

In [187]:
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=41)
dt = DecisionTreeClassifier(class_weight='balanced')
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print "{} Score:\t{:0.3} ± {:0.3}".format("Decision Tree", s.mean().round(3), s.std().round(3))
Decision Tree Score:	0.677 ± 0.076
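StratifiedKFold keeps the True/False balance of sal_to_med roughly constant in every fold, which matters because the classes here are not equally sized. The core idea can be sketched in plain Python (a simplified illustration, not sklearn's implementation):

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    # Deal each class's row indices round-robin across the folds, so every
    # fold keeps roughly the overall class proportions
    folds = [[] for _ in range(n_splits)]
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % n_splits].append(i)
    return folds

# A 30/20 class split divides into five folds of 6 True and 4 False each
labels = [True] * 30 + [False] * 20
folds = stratified_folds(labels, n_splits=5)
```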

Cross-validation with Random Forest, using different n_jobs settings.

In [188]:
dt = RandomForestClassifier(class_weight='balanced')
s = cross_val_score(dt, X, y, cv=cv, n_jobs=1)
print "{} Score:\t{:0.3} ± {:0.3}".format("Random Forest with Balanced Classes", s.mean().round(3), s.std().round(3))
Random Forest with Balanced Classes Score:	0.763 ± 0.073
In [189]:
# Random Forest Classifier without class weight identification
In [190]:
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=20)
print "{} Score:\t{:0.3} ± {:0.3}".format("Random Forest without Balanced Classes", s.mean().round(3), s.std().round(3))
Random Forest without Balanced Classes Score:	0.752 ± 0.074

Bagging Classifier

In [191]:
dt = BaggingClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=1)
print "{} Score:\t{:0.3} ± {:0.3}".format("Bagging", s.mean().round(3), s.std().round(3))
Bagging Score:	0.712 ± 0.088

Part 3. Web Scraper

Below is the web scraper code. It contains a couple of functions that you need to run before you start scraping:

  1. The salary extraction function: define it first so the scraper can read the salary information.
  2. The salary type classifier: as I discovered later, salaries quoted on an hourly, daily or monthly basis, once converted to an annual figure, differ significantly and skew the overall salary statistics.

3.1. Salary Extraction Function¶

In [2]:
def salary(i):
    # Extract the quoted salary from a job block and convert it to an annual
    # figure: ranges are averaged; hourly pay is scaled by 1600 hours/year,
    # daily by 200 days/year, monthly by 12 months/year
    if i.find("span", {"class":"no-wrap"}) is not None:
        js = str(i.find("span", {"class":"no-wrap"}).text.strip()).split()
        if js[-1] == 'year':
            if js[1] == '-':
                js1 = float(re.sub(",","",(re.findall(r"\d+\,\d+|\d+\.\d+",js[0])[0])))
                js2 = float(re.sub(",","",(re.findall(r"\d+\,\d+|\d+\.\d+",js[2])[0])))
                js = (js1+js2)/2
            else: js = float(re.sub(",","",(re.findall(r"\d+\,\d+|\d+\.\d+",js[0])[0])))
        elif js[-1] == 'hour':
            if js[1] == '-':
                js1 = float(re.findall(r"\d+",js[0])[0])
                js2 = float(re.findall(r"\d+",js[2])[0])
                js = (js1+js2)/2*1600
            else: js = float(re.findall(r"\d+",js[0])[0])*1600
        elif js[-1] == 'day':
            if js[1] == '-':
                js1 = float(re.findall(r"\d+",js[0])[0])
                js2 = float(re.findall(r"\d+",js[2])[0])
                js = (js1+js2)/2*200
            else: js = float(re.findall(r"\d+",js[0])[0])*200
        elif js[-1] == 'month':
            if js[1] == '-':
                js1 = float(re.sub(",","",(re.findall(r"\d+\,\d+|\d+\.\d+",js[0])[0])))
                js2 = float(re.sub(",","",(re.findall(r"\d+\,\d+|\d+\.\d+",js[2])[0])))
                js = (js1+js2)/2*12
            else: js = float(re.sub(",","",(re.findall(r"\d+\,\d+|\d+\.\d+",js[0])[0])))*12
    else:
        js = str('NaN')
    return js
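All the branches above share the same arithmetic: take the midpoint of a quoted range, then annualize with a per-basis multiplier. A simplified sketch of that logic on plain salary strings (it skips the HTML handling and edge cases of the full function):

```python
import re

# Per-basis multipliers to an annual figure; 1600 hours and 200 working days
# per year are the notebook's own assumptions
ANNUALIZE = {'year': 1, 'month': 12, 'day': 200, 'hour': 1600}

def parse_salary(text):
    # Midpoint of the quoted figures, scaled by the pay basis (the last word)
    numbers = [float(n.replace(',', ''))
               for n in re.findall(r'\d[\d,]*(?:\.\d+)?', text)]
    return sum(numbers) / len(numbers) * ANNUALIZE[text.split()[-1]]

parse_salary("$80,000 - $100,000 a year")  # midpoint of the quoted range
```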

3.2. Salary Classifier Function¶

In [3]:
def base(i):
    # Classify the pay basis (annual/hourly/daily/monthly) of a job block
    if i.find("span", {"class":"no-wrap"}) is not None:
        pb = str(i.find("span", {"class":"no-wrap"}).text.strip()).split()
        if pb[-1] == 'year':
            pb = 'annual'
        elif pb[-1] == 'hour':
            pb = 'hourly'
        elif pb[-1] == 'day':
            pb = 'daily'
        elif pb[-1] == 'month':
            pb = 'monthly'
    else:
        pb = str("NaN")
    return pb
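Since base() only maps the last word of the salary string to a label, the elif chain can equivalently be table-driven (a sketch on plain strings; spelling normalized to 'daily'):

```python
# Pay-basis label keyed by the last word of the quoted salary string
PAYBASE = {'year': 'annual', 'hour': 'hourly', 'day': 'daily', 'month': 'monthly'}

def pay_basis(salary_text):
    # Unrecognized strings fall back to 'NaN', as in base()
    return PAYBASE.get(salary_text.split()[-1], 'NaN')
```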

3.3. Body of the web scraper¶

In [134]:
MyCity = 'Washington+City%2CDC'
city_set = ['New+York%2CNY', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland%2COR', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Boston', 'San+Diego', 
    'Baltimore', 'San+Jose', 'Minneapolis','San+Antonio%2CTX','Detroit','Columbus','Charlotte','Fort Worth',
    'Jacksonville+FL', 'Fresno', 'Kansas+City', 'Mesa%2CAZ','Raleigh', MyCity]
jobsdf = pd.DataFrame()

for city in city_set:
    page0 = requests.get('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l='+str(city))
    soup = BeautifulSoup(page0.text, 'html.parser', from_encoding = 'utf-8')
    n = int((soup.find("div", {"id":"searchCount"}).text.strip().split()[-1]).replace(',' , ''))                     
    for start in range(0, min(n,2000), 10):
        page = requests.get('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l='
                            +str(city)+'&start='+str(start))
        time.sleep(1) # pause a second between requests (approach borrowed from M. Salmon's code)
        soup = BeautifulSoup(page.text, 'html.parser', from_encoding = 'utf-8')
        # blocks collects every job posting on the page, each wrapped in a "row result" div
        blocks = soup.find_all("div", {"class":[" row result","lastRow row result"]})
        for i in blocks:
            jt = i.select_one("a")["title"].lower()
            if i.find("span", {"class":"company"}) != None:
                cn = i.find("span", {"class":"company"}).text.strip()
            else: cn = str('NaN')
            loc = str(i.find("span", {"class":"location"}).text.strip())
            js = salary(i)
            pb = base(i)
            jsum = i.find("span", {"class":"summary"}).text.strip().lower()
#            print jt,"/", cn, '/', city, "/", loc, '/', js, '/', pb, '/', jsum, "\n"
            row = pd.DataFrame([[jt, cn, city, loc, js, pb,jsum]], columns = ['jobtitle','company','city','location','salarytxt','paybase', 'summary'])
            jobsdf = pd.concat([jobsdf,row], ignore_index = True) 

jobsdf.to_csv('../../project_3_data/jobsdf.csv', encoding='utf-8')       
In [ ]:
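The scraper above pages through results ten at a time by appending a `start` offset to the search URL. That construction can be isolated into a small helper (a sketch of the URL format used above; the helper name is mine):

```python
# Query string from the scraper above: "data scientist" with a $20,000 salary floor
BASE_URL = 'http://www.indeed.com/jobs?q=data+scientist+%2420%2C000'

def search_url(city, start=0):
    # Indeed serves 10 results per page; `start` is the result offset
    url = BASE_URL + '&l=' + str(city)
    if start:
        url += '&start=' + str(start)
    return url

# First three result pages for one city
pages = [search_url('Chicago', s) for s in range(0, 30, 10)]
```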