This was the first project we were given in the Data Science Immersive course. The data set contains SAT results for 2001 across 52 state-level records, split between the verbal and math tests. The project focused mostly on exploratory data analysis and on our ability to draw inferences from the given data.
sat_scores.csv file. Investigate the data, and answer the questions below.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("../data/sat_scores.csv")
print(df.shape)
df.head()
The data set has 52 observations with four features: State, Rate, Verbal, and Math.
df.isnull().sum()  # number of missing values in each column
df.plot(figsize=(12,8))
plt.show()
As the plot shows, the Verbal and Math scores agree closely with each other, which is intuitively expected. I will take another look on a scatter plot below. Before that, I will add a column with the total score, Math + Verbal.
df['Total'] = df['Verbal'] + df['Math']
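The agreement between the Verbal and Math scores noted above can be put into a single number with a correlation coefficient. The sketch below computes Pearson's r by hand; the `pearson` helper and the score lists are hypothetical, with values invented for illustration rather than taken from sat_scores.csv.

```python
# Hypothetical state-level means, invented for illustration (not the real data)
verbal = [500.0, 520.0, 560.0, 580.0, 510.0]
math_scores = [505.0, 515.0, 565.0, 585.0, 500.0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance and variances, each without the common 1/n factor
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


r = pearson(verbal, math_scores)
print(round(r, 3))
```

A value close to 1 means the two scores rise and fall together. On the real data frame, `df['Verbal'].corr(df['Math'])` should give the same kind of figure.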
z = df.Rate
y = df.Total
n = df.State
fig, ax = plt.subplots(figsize = (12,8))
ax.scatter(z, y)
for i, txt in enumerate(n):
    ax.annotate(txt, (z[i], y[i]))
plt.show()
It looks like the higher the rate of students taking the test, the lower the average result in the state. On further investigation, it became clear that the high-participation states are mostly coastal, where the SAT is better established. In states with lower participation rates, however, students take the test mainly if they want to apply to schools that require the SAT for admission. That is why the average SAT score is higher there.
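The self-selection argument can be illustrated with a small simulation. This is only a sketch under a strong simplifying assumption: every state has the same student population, and when participation is low it is strictly the strongest students who sit the test. The population parameters and the `mean_of_top` helper are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical student population: scores drawn from one normal distribution.
# The mean (1000) and spread (200) are invented for illustration.
population = [random.gauss(1000, 200) for _ in range(10000)]


def mean_of_top(scores, rate):
    """Mean score when only the top `rate` fraction of students sit the test."""
    ranked = sorted(scores, reverse=True)
    takers = ranked[:max(1, int(len(ranked) * rate))]
    return sum(takers) / len(takers)


for rate in (0.05, 0.25, 0.50, 1.00):
    print('participation {:>4.0%}: mean score {:.0f}'.format(
        rate, mean_of_top(population, rate)))
```

Under this assumption the mean falls as participation rises, even though the underlying population never changes, which matches the direction of the pattern in the scatter plot.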
'''There are four columns of data:
- State - US state coded to two letters,
- Rate - not documented; I suppose it is the rate of high school seniors taking the exam,
- Verbal - mean SAT verbal exam score for each state,
- Math - mean SAT math exam score for each state.
The overall SAT score for each state is the sum of Verbal and Math.'''
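import csv
data = []
print('Opening File. Data: ')
print('')
# the 'U' mode flag was removed in Python 3.11; plain 'r' works on Python 2 and 3
with open('../data/sat_scores.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(row)
# the with-block closes the file, so no explicit f.close() is needed
print(data[0:10])
print('file closed')
data[0]  # the first row, which holds the column names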
'''Do NOT run this cell several times: each run moves the top row of the data into the header
and shortens the data by one row. It is understood that this cell would be part of a longer script.'''
header = data[0]
print header
data.remove(header)
print data[0:5]
if len(data) < 52:
    print("Run this code again from the top of the page")
# If you re-run this cell and len(data) drops below 52, go to the top and re-run the whole page.
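One way to avoid the re-run problem is to split the header off by slicing instead of calling `data.remove(header)`. The sketch below assumes the same State, Rate, Verbal, Math layout; the in-memory `sample` file and its rows are hypothetical stand-ins for sat_scores.csv.

```python
import csv
import io

# Hypothetical stand-in for sat_scores.csv; the rows are invented
sample = io.StringIO("State,Rate,Verbal,Math\nAA,80,500,505\nBB,10,580,585\n")
data = list(csv.reader(sample))

# Slicing copies instead of mutating `data`, so re-running is harmless
header = data[0]
rows = data[1:]

# Running the same two lines again gives the same result
header = data[0]
rows = data[1:]

print(header)
print(len(rows))
```

Because `data` itself is never modified, the cell can be executed any number of times without losing a row.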
# I don't think I understood what the hint was about, so I decided to extract every fourth item
# from each row in the list of lists.
states = []
rates = []
sat_verb = []
sat_math = []
# note: range(len(data)-1) stops one row short, so the last row of the file is skipped
for item in range(len(data)-1):
    states += data[item][0::4]
    rates += data[item][1::4]
    sat_verb += data[item][2::4]
    sat_math += data[item][3::4]
print(states[:10])
print(rates[:10])
print(sat_verb[:10])
print(sat_math[:10])
len(row)
print(type(rates[0]))
print(type(sat_verb[0]))
print(type(sat_math[0]))
# convert the numeric columns from strings to floats
for i in range(len(data)-1):
    rates[i] = float(rates[i])
    sat_verb[i] = float(sat_verb[i])
    sat_math[i] = float(sat_math[i])
print(rates[:10])
print(sat_verb[:10])
print(sat_math[:10])
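The column extraction and type conversion above can also be sketched with `zip(*rows)`, which transposes a list of rows into columns in one step. The `rows` list here is hypothetical, with invented values in the State, Rate, Verbal, Math layout.

```python
# Hypothetical rows in the State, Rate, Verbal, Math layout; values invented
rows = [['AA', '80', '500', '505'],
        ['BB', '10', '580', '585'],
        ['CC', '45', '530', '520']]

# zip(*rows) transposes the list of rows into one tuple per column
states, rates, sat_verb, sat_math = zip(*rows)

# Convert the numeric columns from strings to floats
rates = [float(v) for v in rates]
sat_verb = [float(v) for v in sat_verb]
sat_math = [float(v) for v in sat_math]

print(states)
print(rates)
```

This avoids the index arithmetic of the `[0::4]` slices and keeps every row, including the last one.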
rates_dict = dict(zip(states, rates))
sat_verb_dict = dict(zip(states, sat_verb))
sat_math_dict = dict(zip(states, sat_math))
print(rates_dict)
print(sat_verb_dict)
print(sat_math_dict)
# for each measure, find the state with the smallest and the largest value
for label, d in [('rate of college-bound high school students', rates_dict),
                 ('Verbal Score among students', sat_verb_dict),
                 ('Math Score among students', sat_math_dict)]:
    lowest = min(d, key=d.get)
    highest = max(d, key=d.get)
    print('Minimum {} is in {} {}'.format(label, lowest, d[lowest]))
    print('Maximum {} is in {} {}'.format(label, highest, d[highest]))
# Here is a hand-written standard deviation function. To make sure I wrote it
# right, the same calculations are repeated below with the numpy package.
def std_list(datalist):
    # population standard deviation: divide by n, as np.std does by default
    mean = sum(datalist) / len(datalist)
    return (sum((mean - x) ** 2 for x in datalist) / len(datalist)) ** .5
print('Standard deviation of rates of CBHS-students in US is {}'.format(std_list(rates)))
print('Standard deviation of verbal SAT test of CBHS-students in US is {}'.format(std_list(sat_verb)))
print('Standard deviation of math SAT test of CBHS-students in US is {}'.format(std_list(sat_math)))
import numpy as np
# list(...) is needed on Python 3, where dict.values() returns a view
print('Standard deviation of rates of CBHS-students in US is {}'.format(np.std(list(rates_dict.values()))))
print('Standard deviation of verbal SAT test of CBHS-students in US is {}'.format(np.std(list(sat_verb_dict.values()))))
print('Standard deviation of math SAT test of CBHS-students in US is {}'.format(np.std(list(sat_math_dict.values()))))
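When checking a hand-written standard deviation against a library, it helps to know which divisor each one uses. `np.std` divides by n by default (population form), while the sample form divides by n - 1. The sketch below shows both with the standard library's `statistics` module; the `rates` values are hypothetical and invented for illustration.

```python
import statistics

# Hypothetical participation rates, invented for illustration
rates = [4.0, 8.0, 9.0, 51.0, 65.0, 82.0]

n = len(rates)
mean = sum(rates) / n
squared_dev = sum((x - mean) ** 2 for x in rates)

population_std = (squared_dev / n) ** 0.5    # divide by n
sample_std = (squared_dev / (n - 1)) ** 0.5  # divide by n - 1

print(population_std, statistics.pstdev(rates))
print(sample_std, statistics.stdev(rates))
```

If a hand-written result differs slightly from a library result, a mismatch between these two divisors is a likely cause.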
import matplotlib.pyplot as plt
# plt.title sets the caption; plt.legend expects series labels, not a title string
_ = plt.hist(rates, bins = 10)
_ = plt.xlabel('Rates of Participation')
_ = plt.ylabel('frequency')
_ = plt.title('Distribution of Rates of Test Participation')
plt.show()
_ = plt.hist(sat_math, bins = 10)
_ = plt.xlabel('SAT Math Scores')
_ = plt.ylabel('frequency')
_ = plt.title('Distribution of SAT Math scores USA')
plt.show()
_ = plt.hist(sat_verb, bins = 10)
_ = plt.xlabel('SAT Verbal Scores')
_ = plt.ylabel('frequency')
_ = plt.title('Distribution of SAT Verbal scores USA')
plt.show()
# - The typical assumption is that the data would be normally distributed.
# But it is not normal. It looks rather like a mixture of two
# normal distributions: those people who study and those who don't.
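To see what a mixture of two normal distributions looks like in a 10-bin histogram, the sketch below builds one from two hypothetical groups and counts the values per bin by hand. The group means, spreads, and sizes are invented; nothing here is fitted to the SAT data.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Two hypothetical groups; the means, spreads, and sizes are invented
low = [random.gauss(500, 15) for _ in range(500)]
high = [random.gauss(600, 15) for _ in range(500)]
mixture = low + high

# Count values in 10 equal-width bins, as plt.hist(..., bins=10) would
lo, hi = min(mixture), max(mixture)
width = (hi - lo) / 10
counts = [0] * 10
for x in mixture:
    idx = min(int((x - lo) / width), 9)  # put the maximum in the last bin
    counts[idx] += 1

print(counts)
```

The counts show two peaks with a dip between them, which is the shape that would suggest two underlying groups rather than a single normal distribution.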
figure to present multiple plots at once.
import matplotlib.pyplot as plt
_ = plt.plot(sat_verb, sat_math, marker = '.',linestyle ='None')
_ = plt.xlabel('SAT verbal test scores')
_ = plt.ylabel('SAT math test scores')
plt.show()
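# keyword arguments make the notch and outlier-marker settings explicit
_ = plt.boxplot(rates, notch=False, sym='gD')
_ = plt.title('Rates of Participation')
plt.show()
_ = plt.boxplot(sat_verb, notch=True)
_ = plt.title('SAT Verbal Scores')
plt.show()
_ = plt.boxplot(sat_math, notch=False)
_ = plt.title('SAT Math Scores')
plt.show()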
# Here I am trying to open an image that I exported from Tableau
from IPython.display import Image
Image(url= "../images/Overview.png")