Hello and welcome to the Data Science Job Positions on Indeed project. This is a revisited project that gives you an idea of the data science jobs posted on Indeed in more than 21 cities across the United States, scraped in August 2017.
The dataset has over 619 advertised positions that mention an annual, monthly, daily, or hourly salary; the code and findings below are based on those observations. Our task was to predict whether a position's salary in a given city would fall above or below the overall median. The median was chosen as the reference point because the salary distribution is strongly skewed to the right, so the mean would not be representative of the data set.
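As a quick illustration of why the median is preferred here (toy numbers, not the actual salaries), a few very high values pull the mean far above a typical salary while the median barely moves:
import numpy as np
# Toy right-skewed sample (hypothetical values): a couple of large salaries
# inflate the mean but barely move the median.
toy_salaries = np.array([60000, 65000, 70000, 75000, 80000, 300000, 450000])
print np.mean(toy_salaries)    # ~157,143 -- pulled up by the long right tail
print np.median(toy_salaries)  # 75,000 -- closer to a "typical" salary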
I decided to add statistical data for each city and see whether that data helps predict if a position's salary falls below or above the overall median salary.
If you are mostly interested in the structure of the web-scraping code, please skip to the very bottom of the page.
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
import time
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_csv('../data/jobsdf.csv')
print df.shape
df.head()
Next steps:
- Drop the "Unnamed: 0" column
- Drop all records that don't mention a salary (this leaves 857 records)
- See what's left
# Dropping "Unnamed: 0" and all lines that don't contain salaries
df = df[df.paybase.isnull() == False]
df.drop("Unnamed: 0", axis = 1, inplace = True)
print df.shape
df.head()
Drop duplicates
df.drop_duplicates(inplace = True)
df.reset_index(inplace = True)
df.drop("index", axis = 1, inplace = True)
print df.shape
df.head()
def eda(dataframe):
print "missing values \n", dataframe.isnull().sum()
print "dataframe index \n", dataframe.index
print "dataframe types \n", dataframe.dtypes
print "dataframe shape \n", dataframe.shape
print "dataframe describe \n", dataframe.describe()
for item in dataframe:
print item
print dataframe[item].nunique()
eda(df)
df.describe()
import matplotlib.pyplot as plt
df.salarytxt.plot.hist(figsize = (12,8),bins=10)
plt.show()
df[df.salarytxt < 20000]
print df.groupby(df.city)['summary'].count()
df.groupby(df.city)['summary'].count().plot(figsize = (16,12),kind = 'bar', fontsize = 14)
_ = plt.title("Number of Vacancies with Salaries in City", fontsize = 16)
_ = plt.ylabel("Number of Vacancies", fontsize = 14)
_ = plt.xlabel("City", fontsize = 14)
plt.show()
median = df.salarytxt.median()
median
cities = pd.read_csv('../data/city_stats.csv')
cities.dropna(inplace = True)
cities.head()
jobs = pd.merge(df, cities, left_on = 'city', right_on = 'City', how = "inner")
jobs
jobs[jobs.City.isnull() == True].head()
jobs = jobs.drop('city', axis = 1)
print jobs.shape
jobs.head()
temp = pd.DataFrame()
for i in jobs.paybase.unique():
    mean_ = round(jobs[jobs.paybase == i].salarytxt.mean(), 2)
    median_ = round(jobs[jobs.paybase == i].salarytxt.median(), 2)
    n = jobs[jobs.paybase == i].salarytxt.count()
    temp = temp.append([[i, mean_, median_, n]], ignore_index=True)
temp.columns = ['Base', 'Mean', 'Median','Count']
temp
temp.set_index('Base', inplace = True)
temp.Count.plot.pie(autopct='%.2f', figsize = (8,8), fontsize = 14)
plt.title('Number of Job Announcements with Salaries per Payment Base', fontsize = 14)
plt.show()
temp['Mean'].sort_values(ascending = False).plot(figsize = (12,8),kind = 'bar')
plt.show()
jobs = jobs[jobs.paybase == 'annual']
median_ = jobs.salarytxt.median()
print median_
jobs.shape
# If the salary is higher than the overall median, the flag is True (one)
jobs['sal_to_med'] = (jobs.salarytxt > median_)
jobs.head()
jobs['MgrDummy'] = jobs.jobtitle.str.contains('supervisor|manager|director|senior|president')
jobs.head(2)
jobs_temp = jobs[['salarytxt', 'Population','Density','MgrDummy','sal_to_med']]
jobs_temp.shape
jobs_temp.head()
y = jobs_temp.sal_to_med
X = jobs_temp.drop(["sal_to_med",'salarytxt'], axis = 1)
print y.head(), X.head()
from sklearn import preprocessing
X_norm = preprocessing.normalize(X, norm = 'l1')
print X.shape
X_norm
X = pd.DataFrame(X_norm, columns = X.columns)
X.head()
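Note that preprocessing.normalize works row-wise by default: with norm = 'l1' each posting's feature vector (Population, Density, MgrDummy) is rescaled so that its values sum to 1. A minimal sketch with made-up numbers:
# Toy row (hypothetical values): after L1 normalization the entries sum to 1
toy_row = np.array([[8000000., 27000., 1.]])
print preprocessing.normalize(toy_row, norm='l1')  # ~[[0.9966, 0.0034, 0.00000012]]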
# Rank-based binning: each numeric column becomes 0 (below its median) or 1 (above it)
jobs_categories = np.floor(jobs_temp[jobs_temp.columns[:-2]].rank() / len(jobs_temp) / .5001).astype(int)
jobs_categories = jobs_categories.join([jobs_temp.MgrDummy, jobs_temp.sal_to_med])
print jobs_categories.shape
jobs_categories.head()
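To make the rank-based binning above more concrete, here is a minimal sketch on a toy series (made-up values, not the jobs data): ranks in the lower half map to 0 and ranks in the upper half map to 1, i.e. a below/above-median flag; the .5001 keeps the top-ranked value in bin 1 rather than bin 2.
# Toy illustration of the rank / len / .5001 binning used above
toy_ranks = pd.Series([10, 20, 30, 40, 50, 60])
print np.floor(toy_ranks.rank() / len(toy_ranks) / .5001).astype(int)  # -> 0, 0, 0, 1, 1, 1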
# Below I build a bag-of-words vocabulary from the job summaries, merge it with the
# city features, split the data into train and test sets, and run a Random Forest
from sklearn.feature_extraction.text import CountVectorizer
# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words='english',
                             max_features=50)
# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(jobs.summary)
# Numpy arrays are easy to work with, so convert the result to an
# array
train_data_features = train_data_features.toarray()
print train_data_features.shape
train_data_features
vocab = vectorizer.get_feature_names()
len(vocab)
summary = pd.DataFrame(train_data_features, columns = vocab)
print summary.shape
summary.head()
print 'DataFrame with: supervisor, senior, manager, director or president in jobtitle: ', jobs[jobs.MgrDummy == True].shape
print 'Overall DataFrame ', jobs.shape
print jobs_categories.shape
jobs_categories.head(3)
X.head()
summary.head()
super_jobs = pd.concat([X,summary], axis = 1, ignore_index=True)
super_jobs.columns = list(X.columns) + list(summary.columns)
print 'Super Data Frame', super_jobs.shape
super_jobs.head()
X = super_jobs
X.head()
jobs_categories.head()
y = jobs_categories.sal_to_med
print X.shape
X.head()
#1. Split the data into training and testing parts
feat_labels = list(X.columns)
print len(feat_labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print X.shape
print X_train.shape, X_test.shape
print y_train.shape, y_test.shape
# 2. Create a random forest classifier
clf = RandomForestClassifier(n_estimators=70, random_state=0, n_jobs=-1)
# Train the classifier
clf.fit(X_train, y_train)
# Print the name and gini importance of each feature
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
X_w = summary
feat_labels = list(X_w.columns)
print len(feat_labels)
X_train, X_test, y_train, y_test = train_test_split(X_w, y, test_size=0.3, random_state=0)
# 2. Create a random forest classifier
clf = RandomForestClassifier(n_estimators=70, random_state=0, n_jobs=-1)
# Train the classifier
clf.fit(X_train, y_train)
# Print the name and gini importance of each feature
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
listofimportance = pd.DataFrame(zip(feat_labels,clf.feature_importances_), columns = ['words','importance'])
importantwords = listofimportance[listofimportance['importance']>0.003].sort_values('importance', ascending=False)
importantwords[:20]
# Let's see how exactly those words affect the salary prediction:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
m = lr.fit(X_w, y)
m = m.coef_.tolist()
m = pd.DataFrame(zip(X_w.columns, m[0]), columns = ['features','log'])
m['exp'] = np.exp(m.log)
print "Seven words most negatively affecting salary", m.sort_values('exp', ascending=True).head(7)
print "Seven words most positively affecting salary", m.sort_values('exp', ascending=False).head(7)
# Target for classification: is the salary above the overall median?
y = jobs_temp.sal_to_med
print X.shape
print y.head()
X.head()
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(X, y)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=41)
dt = RandomForestClassifier(class_weight='balanced')
s = cross_val_score(dt, X, y, cv=cv, n_jobs=30)
print "{} Score:\t{:0.3} ± {:0.3}".format("Decision Tree", s.mean().round(3), s.std().round(3))
y = jobs_categories.sal_to_med
#X = jobs_categories.drop(['sal_to_med','salarytxt'], axis = 1)
print X.shape
print y.unique()
X.head()
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=41)
dt = DecisionTreeClassifier(class_weight='balanced')
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print "{} Score:\t{:0.3} ± {:0.3}".format("Decision Tree", s.mean().round(3), s.std().round(3))
dt = RandomForestClassifier(class_weight='balanced')
s = cross_val_score(dt, X, y, cv=cv, n_jobs=1)
print "{} Score:\t{:0.3} ± {:0.3}".format("Random Forest with Balanced Classes", s.mean().round(3), s.std().round(3))
# Random Forest classifier without balanced class weights
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=20)
print "{} Score:\t{:0.3} ± {:0.3}".format("Random Forest without Balanced Classes", s.mean().round(3), s.std().round(3))
dt = BaggingClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=1)
print "{} Score:\t{:0.3} ± {:0.3}".format("Bagging", s.mean().round(3), s.std().round(3))
Below is the web-scraper code. It starts with two helper functions, salary() and base(), that you need to run before you start scraping:
def salary(i):
    # Parse the salary text in the "no-wrap" span and convert it to an approximate
    # annual figure: ranges are averaged, hourly rates are multiplied by 1600,
    # daily rates by 200, and monthly rates by 12.
    if i.find("span", {"class":"no-wrap"}) != None:
        js = str(i.find("span", {"class":"no-wrap"}).text.strip()).split()
        if js[-1] == 'year':
            if js[1] == '-':
                js1 = float(re.sub(",", "", (re.findall(r"\d+\,\d+|\d+\.\d+", js[0])[0])))
                js2 = float(re.sub(",", "", (re.findall(r"\d+\,\d+|\d+\.\d+", js[2])[0])))
                js = (js1 + js2) / 2
            else:
                js = float(re.sub(",", "", (re.findall(r"\d+\,\d+|\d+\.\d+", js[0])[0])))
        elif js[-1] == 'hour':
            if js[1] == '-':
                js1 = float(re.findall(r"\d+", js[0])[0])
                js2 = float(re.findall(r"\d+", js[2])[0])
                js = (js1 + js2) / 2 * 1600
            else:
                js = float(re.findall(r"\d+", js[0])[0]) * 1600
        elif js[-1] == 'day':
            if js[1] == '-':
                js1 = float(re.findall(r"\d+", js[0])[0])
                js2 = float(re.findall(r"\d+", js[2])[0])
                js = (js1 + js2) / 2 * 200
            else:
                js = float(re.findall(r"\d+", js[0])[0]) * 200
        elif js[-1] == 'month':
            if js[1] == '-':
                js1 = float(re.sub(",", "", (re.findall(r"\d+\,\d+|\d+\.\d+", js[0])[0])))
                js2 = float(re.sub(",", "", (re.findall(r"\d+\,\d+|\d+\.\d+", js[2])[0])))
                js = (js1 + js2) / 2 * 12
            else:
                js = float(re.sub(",", "", (re.findall(r"\d+\,\d+|\d+\.\d+", js[0])[0]))) * 12
        else:
            js = str('NaN')
        return js
def base(i):
    # Classify the payment base (annual / hourly / daily / monthly)
    # from the last word of the salary text in the "no-wrap" span.
    if i.find("span", {"class":"no-wrap"}) != None:
        pb = str(i.find("span", {"class":"no-wrap"}).text.strip()).split()
        if pb[-1] == 'year':
            pb = 'annual'
        elif pb[-1] == 'hour':
            pb = 'hourly'
        elif pb[-1] == 'day':
            pb = 'daily'
        elif pb[-1] == 'month':
            pb = 'monthly'
        else:
            pb = str("NaN")
        return pb
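As a quick sanity check of the two helpers on made-up markup (these snippets only mimic the "no-wrap" span the functions look for; they are not real Indeed result blocks):
# Hypothetical result blocks to exercise salary() and base()
fake_year = BeautifulSoup('<div><span class="no-wrap">$100,000 - $120,000 a year</span></div>', 'html.parser')
fake_hour = BeautifulSoup('<div><span class="no-wrap">$50 an hour</span></div>', 'html.parser')
print salary(fake_year), base(fake_year)   # 110000.0 annual -- midpoint of the yearly range
print salary(fake_hour), base(fake_hour)   # 80000.0 hourly  -- 50 * 1600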
#['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle','Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh',
#'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', MyCity]
MyCity = 'Washington+City%2CDC'
city_set = ['New+York%2CNY', 'Chicago', 'San+Francisco', 'Austin', 'Seattle',
'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh',
'Portland%2COR', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Boston', 'San+Diego',
'Baltimore', 'San+Jose', 'Minneapolis','San+Antonio%2CTX','Detroit','Columbus','Charlotte','Fort+Worth',
'Jacksonville+FL', 'Fresno', 'Kansas+City', 'Mesa%2CAZ','Raleigh', MyCity]
start = '10'
#max_results_per_city = min(10,n)
jobsdf = pd.DataFrame()
for city in city_set:
    page0 = requests.get('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=' + str(city))
    soup = BeautifulSoup(page0.text, 'html.parser', from_encoding='utf-8')
    # total number of results reported for this search
    n = int((soup.find("div", {"id":"searchCount"}).text.strip().split()[-1]).replace(',', ''))
    # page through the results, 10 postings per page, capped at 2,000 per city
    for start in range(0, min(n, 2000), 10):
        page = requests.get('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l='
                            + str(city) + '&start=' + str(start))
        time.sleep(1)  # politeness pause; this part is borrowed from M. Salmon's code
        soup = BeautifulSoup(page.text, 'html.parser', from_encoding='utf-8')
        # each job announcement is wrapped in a div with a "row result" class
        blocks = soup.find_all("div", {"class":[" row result", "lastRow row result"]})
        for i in blocks:
            jt = i.select_one("a")["title"].lower()
            if i.find("span", {"class":"company"}) != None:
                cn = i.find("span", {"class":"company"}).text.strip()
            else:
                cn = str('NaN')
            loc = str(i.find("span", {"class":"location"}).text.strip())
            js = salary(i)
            pb = base(i)
            jsum = i.find("span", {"class":"summary"}).text.strip().lower()
            row = pd.DataFrame([[jt, cn, city, loc, js, pb, jsum]],
                               columns=['jobtitle', 'company', 'city', 'location', 'salarytxt', 'paybase', 'summary'])
            jobsdf = pd.concat([jobsdf, row], ignore_index=True)
jobsdf.to_csv('../../project_3_data/jobsdf.csv', encoding='utf-8')