State of Iowa Liquor Sales

The GA project brief suggested choosing one of two options:

  • to forecast sales for state budgeting purposes
  • to give recommendations on where and why to open a new liquor store

I decided to go with the first option.

A short introduction to the project and data

The state of Iowa controls the wholesale of alcoholic beverages with an alcohol content higher than 10%, which excludes wine and beer. All stronger alcoholic beverages are either imported or purchased by the state-owned Iowa Alcoholic Beverages Division (ABD), which adds a universal markup of 50%, i.e. increases the price 1.5 times when selling to stores and wholesale distributors.
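The markup rule is simple enough to sketch. The 50% figure is from the ABD; the helper function below is my own illustration, and real ABD prices may round half-cents up, so edge cases can differ by a cent:

```python
def retail_price(state_cost, markup=0.50):
    """Apply the ABD's universal 50% markup to the state's per-bottle cost."""
    return round(state_cost * (1 + markup), 2)

# e.g. a bottle the state buys at $4.50 is sold to stores at $6.75
print(retail_price(4.50))   # 6.75
print(retail_price(8.20))   # 12.3
```

These match the State Bottle Cost / State Bottle Retail pairs visible in the data sample below (e.g. $4.50 → $6.75).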

The dataset consists of about 2.7 million observations (records), one per individual purchase from the state by retailers or wholesalers such as Costco or Walmart. The data is stored on my Google Drive, and you can follow the analysis by walking through this Jupyter Notebook. Each observation has 18 features: the date, store number, city, zip code, county number and name, beverage category and category name, vendor number, item number and description, bottle volume, state bottle cost, state bottle retail price, number of bottles sold, revenue from each transaction, and volume sold in liters and gallons.

For the purposes of this presentation, I used an extract of 10% of the initial data, about 270 thousand observations. You can re-run the project with the original dataset stored here - https://drive.google.com/file/d/0B8xhteSGeVx6QWpXMUcxWGxJdk0/view?usp=sharing
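For reproducibility, such a 10% extract could be drawn with `DataFrame.sample`; the seed and exact method used for the provided extract are unknown to me, so this is only a sketch on toy data:

```python
import pandas as pd

# Toy stand-in for the full ~2.7M-row dataset
full = pd.DataFrame({'Sale (Dollars)': range(1000)})

# Draw a reproducible 10% extract (random_state=42 is an arbitrary choice)
sample = full.sample(frac=0.10, random_state=42)
print(len(sample))  # 100
```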

The main takeaways are:

  • forecasting sales per store is not necessary for taxation purposes,
  • as reality has shown, revenue is tied to the way the state manages liquor sales,
  • there is a strong ethical tension between increasing sales of alcohol and reducing the risks associated with its consumption (health, crime, road accidents),
  • the state has some of the strictest liquor sales regulations, which pushes those living close to the state's borders to purchase liquor in other states,
  • the two small cities with the largest and smallest consumption of alcohol are located on the borders with states that have stricter and more liberal legislation, respectively.

Don't hesitate to check the presentation (../data/2_project_Iowa_ABD.pdf)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 10, 10
import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline
%load_ext giphy_magic
from scipy import stats
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
In [5]:
## Load the data into a DataFrame
df = pd.read_csv('../data/Iowa_Liquor_sales_sample_10pct.csv', na_filter=True, low_memory=False)
# (passing dtype={'County Number': np.int64, 'Category': np.int64, ...} fails here
#  because those columns contain NaNs)
print df.shape
df.head()
(270955, 18)
Out[5]:
Date Store Number City Zip Code County Number County Category Category Name Vendor Number Item Number Item Description Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons)
0 11/04/2015 3717 SUMNER 50674 9.0 Bremer 1051100.0 APRICOT BRANDIES 55 54436 Mr. Boston Apricot Brandy 750 $4.50 $6.75 12 $81.00 9.0 2.38
1 03/02/2016 2614 DAVENPORT 52807 82.0 Scott 1011100.0 BLENDED WHISKIES 395 27605 Tin Cup 750 $13.75 $20.63 2 $41.26 1.5 0.40
2 02/11/2016 2106 CEDAR FALLS 50613 7.0 Black Hawk 1011200.0 STRAIGHT BOURBON WHISKIES 65 19067 Jim Beam 1000 $12.59 $18.89 24 $453.36 24.0 6.34
3 02/03/2016 2501 AMES 50010 85.0 Story 1071100.0 AMERICAN COCKTAILS 395 59154 1800 Ultimate Margarita 1750 $9.50 $14.25 6 $85.50 10.5 2.77
4 08/18/2015 3654 BELMOND 50421 99.0 Wright 1031080.0 VODKA 80 PROOF 297 35918 Five O'clock Vodka 1750 $7.20 $10.80 12 $129.60 21.0 5.55

Exploratory Data Analysis

Let's check whether the columns have the correct types; if not, let's bring some of them into a workable format.

In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270955 entries, 0 to 270954
Data columns (total 18 columns):
Date                     270955 non-null object
Store Number             270955 non-null int64
City                     270955 non-null object
Zip Code                 270955 non-null object
County Number            269878 non-null float64
County                   269878 non-null object
Category                 270887 non-null float64
Category Name            270323 non-null object
Vendor Number            270955 non-null int64
Item Number              270955 non-null int64
Item Description         270955 non-null object
Bottle Volume (ml)       270955 non-null int64
State Bottle Cost        270955 non-null object
State Bottle Retail      270955 non-null object
Bottles Sold             270955 non-null int64
Sale (Dollars)           270955 non-null object
Volume Sold (Liters)     270955 non-null float64
Volume Sold (Gallons)    270955 non-null float64
dtypes: float64(4), int64(5), object(9)
memory usage: 37.2+ MB
In [10]:
## Transform the dates if needed, e.g.
df.Date = pd.to_datetime(df["Date"], format = "%m/%d/%Y")
# I converted all the dollar amounts into numbers, but was not successful in turning the
# County and Category numbers into integers: they contain NaNs, which int64 cannot hold.
#df['County Number'] = df['County Number'].astype(int)
#df['Category'] = df['Category'].astype(int)
df['State Bottle Cost'] = df['State Bottle Cost'].map(lambda x: str(x)[1:]).astype(float)
df['State Bottle Retail'] = df['State Bottle Retail'].map(lambda x: str(x)[1:]).astype(float)
df['Sale (Dollars)'] = df['Sale (Dollars)'].map(lambda x: str(x)[1:]).astype(float)
df['month'] = df.Date.dt.month
df['year'] = df.Date.dt.year
df.head()
Out[10]:
Date Store Number City Zip Code County Number County Category Category Name Vendor Number Item Number Item Description Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) month year
0 2015-11-04 3717 SUMNER 50674 9.0 Bremer 1051100.0 APRICOT BRANDIES 55 54436 Mr. Boston Apricot Brandy 750 4.50 6.75 12 81.00 9.0 2.38 11 2015
1 2016-03-02 2614 DAVENPORT 52807 82.0 Scott 1011100.0 BLENDED WHISKIES 395 27605 Tin Cup 750 13.75 20.63 2 41.26 1.5 0.40 3 2016
2 2016-02-11 2106 CEDAR FALLS 50613 7.0 Black Hawk 1011200.0 STRAIGHT BOURBON WHISKIES 65 19067 Jim Beam 1000 12.59 18.89 24 453.36 24.0 6.34 2 2016
3 2016-02-03 2501 AMES 50010 85.0 Story 1071100.0 AMERICAN COCKTAILS 395 59154 1800 Ultimate Margarita 1750 9.50 14.25 6 85.50 10.5 2.77 2 2016
4 2015-08-18 3654 BELMOND 50421 99.0 Wright 1031080.0 VODKA 80 PROOF 297 35918 Five O'clock Vodka 1750 7.20 10.80 12 129.60 21.0 5.55 8 2015
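As noted in the comments above, `astype(int)` fails on County Number and Category because they contain NaNs. A nullable integer dtype (available in pandas ≥ 0.24) sidesteps that, and `str.replace` is a slightly more defensive dollar parser than slicing off the first character; a sketch:

```python
import numpy as np
import pandas as pd

# County Number with a missing value: plain int64 cannot hold NaN, nullable Int64 can
county = pd.Series([9.0, 82.0, np.nan])
county_int = county.astype('Int64')
print(county_int.tolist())

# Strip the dollar sign explicitly instead of assuming it is the first character
money = pd.Series(['$4.50', '$13.75', '$453.36'])
parsed = pd.to_numeric(money.str.replace('$', '', regex=False))
print(parsed.tolist())  # [4.5, 13.75, 453.36]
```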
In [12]:
def eda(dataframe):
    print "missing values \n", dataframe.isnull().sum()
    print "dataframe index \n", dataframe.index
    print "dataframe types \n", dataframe.dtypes
    print "dataframe shape \n", dataframe.shape
    print "dataframe describe \n", dataframe.describe()
    for item in dataframe:
        print item
        print dataframe[item].nunique()

eda(df)
missing values 
Date                        0
Store Number                0
City                        0
Zip Code                    0
County Number            1077
County                   1077
Category                   68
Category Name             632
Vendor Number               0
Item Number                 0
Item Description            0
Bottle Volume (ml)          0
State Bottle Cost           0
State Bottle Retail         0
Bottles Sold                0
Sale (Dollars)              0
Volume Sold (Liters)        0
Volume Sold (Gallons)       0
month                       0
year                        0
dtype: int64
dataframe index 
RangeIndex(start=0, stop=270955, step=1)
dataframe types 
Date                     datetime64[ns]
Store Number                      int64
City                             object
Zip Code                         object
County Number                   float64
County                           object
Category                        float64
Category Name                    object
Vendor Number                     int64
Item Number                       int64
Item Description                 object
Bottle Volume (ml)                int64
State Bottle Cost               float64
State Bottle Retail             float64
Bottles Sold                      int64
Sale (Dollars)                  float64
Volume Sold (Liters)            float64
Volume Sold (Gallons)           float64
month                             int64
year                              int64
dtype: object
dataframe shape 
(270955, 20)
dataframe describe 
        Store Number  County Number      Category  Vendor Number  \
count  270955.000000  269878.000000  2.708870e+05   270955.00000   
mean     3590.263701      57.231642  1.043888e+06      256.43443   
std       947.662050      27.341205  5.018211e+04      141.01489   
min      2106.000000       1.000000  1.011100e+06       10.00000   
25%      2604.000000      31.000000  1.012200e+06      115.00000   
50%      3722.000000      62.000000  1.031200e+06      260.00000   
75%      4378.000000      77.000000  1.062310e+06      380.00000   
max      9023.000000      99.000000  1.701100e+06      978.00000   

         Item Number  Bottle Volume (ml)  State Bottle Cost  \
count  270955.000000       270955.000000      270955.000000   
mean    45974.963300          924.830341           9.773281   
std     52757.043086          493.088489           7.047119   
min       168.000000           50.000000           0.890000   
25%     26827.000000          750.000000           5.500000   
50%     38176.000000          750.000000           8.000000   
75%     64573.000000         1000.000000          11.920000   
max    995507.000000         6000.000000         425.000000   

       State Bottle Retail   Bottles Sold  Sale (Dollars)  \
count        270955.000000  270955.000000   270955.000000   
mean             14.675760       9.871285      128.902375   
std              10.570145      24.040912      383.027369   
min               1.340000       1.000000        1.340000   
25%               8.270000       2.000000       30.450000   
50%              12.270000       6.000000       70.560000   
75%              17.880000      12.000000      135.000000   
max             637.500000    2508.000000    36392.400000   

       Volume Sold (Liters)  Volume Sold (Gallons)          month  \
count         270955.000000          270955.000000  270955.000000   
mean               8.981351               2.372830       5.821103   
std               28.913690               7.638182       3.621291   
min                0.100000               0.030000       1.000000   
25%                1.500000               0.400000       3.000000   
50%                5.250000               1.390000       5.000000   
75%               10.500000               2.770000       9.000000   
max             2508.000000             662.540000      12.000000   

                year  
count  270955.000000  
mean     2015.193246  
std         0.394845  
min      2015.000000  
25%      2015.000000  
50%      2015.000000  
75%      2015.000000  
max      2016.000000  
Date
274
Store Number
1400
City
385
Zip Code
415
County Number
99
County
99
Category
83
Category Name
71
Vendor Number
116
Item Number
2696
Item Description
2173
Bottle Volume (ml)
29
State Bottle Cost
1086
State Bottle Retail
1112
Bottles Sold
137
Sale (Dollars)
6580
Volume Sold (Liters)
265
Volume Sold (Gallons)
261
month
12
year
2

Let me see what the effect of dropping those rows would be

In [13]:
print 'Total sum of sales with missing category is ---- $', df['Sale (Dollars)'][pd.isnull(df['Category']) == True].sum(), ', that is ', df['Sale (Dollars)'][pd.isnull(df['Category']) == True].sum()/(df['Sale (Dollars)'].sum()), 'of total sales'
print 'Total sum of sales with missing category name is $', df['Sale (Dollars)'][pd.isnull(df['Category Name']) == True].sum(),', that is ', df['Sale (Dollars)'][pd.isnull(df['Category Name']) == True].sum()/(df['Sale (Dollars)'].sum()), 'of total sales'
print 'Total sum of sales with missing County is ------ $', df['Sale (Dollars)'][pd.isnull(df['County']) == True].sum(),', that is ', df['Sale (Dollars)'][pd.isnull(df['Category Name']) == True].sum()/(df['Sale (Dollars)'].sum()), 'of total sales'
print 'While total sales are -------------------------- $', df['Sale (Dollars)'].sum()
dropped_sum = (df['Sale (Dollars)'][pd.isnull(df['Category']) == True].sum() + df['Sale (Dollars)'][pd.isnull(df['Category Name']) == True].sum() + df['Sale (Dollars)'][pd.isnull(df['County']) == True].sum())
print 'Total loss of sales numbers from missing data -- $',dropped_sum, 'That is ', dropped_sum/df['Sale (Dollars)'].sum()
Total sum of sales with missing category is ---- $ 13789.92 , that is  0.000394824104261 of total sales
Total sum of sales with missing category name is $ 124113.99 , that is  0.00355355179204 of total sales
Total sum of sales with missing County is ------ $ 116833.68 , that is  0.00355355179204 of total sales
While total sales are -------------------------- $ 34926742.95
Total loss of sales numbers from missing data -- $ 254737.59 That is  0.0072934825433
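One caveat with the totals above: a row missing both Category Name and County is counted in more than one of the three sums, so `dropped_sum` can overstate the loss. Summing once over the union of missing rows avoids that; a sketch on toy data (column names match the notebook, the values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sale (Dollars)': [10.0, 20.0, 30.0, 40.0],
    'Category':       [1.0, np.nan, 3.0, 4.0],
    'Category Name':  ['a', None, None, 'd'],   # row 1 is missing two fields
    'County':         ['x', 'y', None, 'w'],
})

cols = ['Category', 'Category Name', 'County']
per_column = sum(df['Sale (Dollars)'][df[c].isnull()].sum() for c in cols)
union = df['Sale (Dollars)'][df[cols].isnull().any(axis=1)].sum()
print(per_column, union)  # 100.0 50.0 -- rows 1 and 2 get double counted per column
```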
In [14]:
# Here I gave another thought and decided to finally drop the missing lines and duplicates
df.dropna(inplace = True)
df.drop_duplicates(inplace = True)
df.shape
Out[14]:
(269223, 20)

There is another dataset, with population per county, that I wanted to incorporate into the main dataset

In [15]:
population = pd.read_csv('../data/Total_County_Population_by_Year.csv',index_col = 'County', na_filter = True)
population.head()
Out[15]:
Population
County
Adair 7228
Adams 3796
Allamakee 13886
Appanoose 12529
Audubon 5773

Run the cell below once only! County population is merged with the rest of the DataFrame

I know I am doing this after the presentation and after the submission date. I was not happy with everything I did, so I am redoing it

In [16]:
df = df.join(population, on = "County", how = "outer")
df.head()
Out[16]:
Date Store Number City Zip Code County Number County Category Category Name Vendor Number Item Number ... Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) month year Population
0 2015-11-04 3717 SUMNER 50674 9.0 Bremer 1051100.0 APRICOT BRANDIES 55 54436 ... 750 4.50 6.75 12 81.00 9.0 2.38 11 2015 24722
15 2015-06-10 3717 SUMNER 50674 9.0 Bremer 1051100.0 APRICOT BRANDIES 55 54436 ... 750 4.50 6.75 4 27.00 3.0 0.79 6 2015 24722
226 2016-02-24 4670 TRIPOLI 50676 9.0 Bremer 1011300.0 TENNESSEE WHISKIES 85 26827 ... 1000 18.38 27.57 6 165.42 6.0 1.59 2 2016 24722
546 2016-02-02 2651 WAVERLY 50677 9.0 Bremer 1071100.0 AMERICAN COCKTAILS 395 58838 ... 1750 8.20 12.30 6 73.80 10.5 2.77 2 2016 24722
579 2015-03-10 2651 WAVERLY 50677 9.0 Bremer 1022100.0 TEQUILA 434 89387 ... 1000 6.92 10.38 12 124.56 12.0 3.17 3 2015 24722

5 rows × 21 columns
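A note on the join above: with `how='outer'`, counties present only in the population table would enter the frame as rows with NaN sales, while `how='left'` would keep exactly the sales rows. A sketch of the difference (toy values):

```python
import pandas as pd

sales = pd.DataFrame({'County': ['Bremer', 'Scott'], 'Sale (Dollars)': [81.0, 41.26]})
population = pd.DataFrame(
    {'Population': [24722, 172126, 7228]},
    index=pd.Index(['Bremer', 'Scott', 'Adair'], name='County'))

outer = sales.join(population, on='County', how='outer')
left = sales.join(population, on='County', how='left')
print(len(outer), len(left))  # 3 2 -- the outer join picks up Adair with NaN sales
```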

In [ ]:
 

Now let's play with pivot tables.

In [17]:
dfpivot = pd.pivot_table(df, values = ['County','Sale (Dollars)','Bottles Sold', 'Volume Sold (Liters)'], index = ['Store Number'], aggfunc=np.sum)
dfpivot['av_price_unit'] = dfpivot['Sale (Dollars)']/dfpivot['Bottles Sold']
dfpivot['av_vol_unit'] = dfpivot['Volume Sold (Liters)'] / dfpivot['Bottles Sold']
dfpivot['Population'] = df['Population']
dfpivot['County'] = df['County']
dfpivot['liquor_capita'] = dfpivot['Volume Sold (Liters)'] / dfpivot['Population']


dfpivot = pd.DataFrame(dfpivot)
#dfpivot = dfpivot.merge(population, on = 'County', how = 'inner')

dfpivot.head()
Out[17]:
Bottles Sold Sale (Dollars) Volume Sold (Liters) av_price_unit av_vol_unit Population County liquor_capita
Store Number
2106 12573 176517.45 11836.10 14.039406 0.941390 80133.0 Dallas 0.147706
2113 830 11376.12 836.85 13.706169 1.008253 26643.0 Boone 0.031410
2130 9144 139440.02 8436.27 15.249346 0.922602 102782.0 Woodbury 0.082079
2152 670 8625.74 720.87 12.874239 1.075925 97125.0 Dubuque 0.007422
2178 2408 29912.68 2437.92 12.422209 1.012425 10676.0 Guthrie 0.228355
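Two quirks in the cell above: `'County'` sits in `values` with `aggfunc=np.sum` (summing strings), and `dfpivot['Population'] = df['Population']` aligns the Store Number index against df's row numbers, which can silently mismatch. A groupby with named aggregation (needs pandas ≥ 0.25) does the same rollup without those pitfalls; a sketch on toy rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Store Number':   [2106, 2106, 2113],
    'County':         ['Dallas', 'Dallas', 'Boone'],
    'Population':     [80133, 80133, 26643],
    'Sale (Dollars)': [100.0, 50.0, 30.0],
    'Bottles Sold':   [10, 5, 3],
})

per_store = df.groupby('Store Number').agg(
    County=('County', 'first'),
    Population=('Population', 'first'),
    Sales=('Sale (Dollars)', 'sum'),
    Bottles=('Bottles Sold', 'sum'),
)
per_store['av_price_unit'] = per_store['Sales'] / per_store['Bottles']
print(per_store.loc[2106, 'Sales'])  # 150.0
```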
In [19]:
df[(df.Date.dt.quarter == 1) & (df.Date.dt.year == 2015)].groupby('Store Number')
Out[19]:
<pandas.core.groupby.DataFrameGroupBy object at 0x10487b790>
In [ ]:
 
In [20]:
#print '5 counties with top average price per bottle bought \n'
#dfpivot.sort_values('av_price_unit', ascending = False).head()
In [21]:
# print '5 counties with least average price per bottle bought \n'
# dfpivot.sort_values('av_price_unit', ascending = False).tail()
In [22]:
# print '5 counties with top average volume per bottle bought \n'
# dfpivot.sort_values('av_vol_unit', ascending = False).head()
In [23]:
# print '5 counties with least average volume per bottle bought \n'
# dfpivot.sort_values('av_vol_unit', ascending = False).tail()

Calculate the yearly liquor sales for each store using the provided data: you can add up the transactions for each store.

Let's see what summaries we can produce from this set:

In [24]:
df.head()
Out[24]:
Date Store Number City Zip Code County Number County Category Category Name Vendor Number Item Number ... Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) month year Population
0 2015-11-04 3717 SUMNER 50674 9.0 Bremer 1051100.0 APRICOT BRANDIES 55 54436 ... 750 4.50 6.75 12 81.00 9.0 2.38 11 2015 24722
15 2015-06-10 3717 SUMNER 50674 9.0 Bremer 1051100.0 APRICOT BRANDIES 55 54436 ... 750 4.50 6.75 4 27.00 3.0 0.79 6 2015 24722
226 2016-02-24 4670 TRIPOLI 50676 9.0 Bremer 1011300.0 TENNESSEE WHISKIES 85 26827 ... 1000 18.38 27.57 6 165.42 6.0 1.59 2 2016 24722
546 2016-02-02 2651 WAVERLY 50677 9.0 Bremer 1071100.0 AMERICAN COCKTAILS 395 58838 ... 1750 8.20 12.30 6 73.80 10.5 2.77 2 2016 24722
579 2015-03-10 2651 WAVERLY 50677 9.0 Bremer 1022100.0 TEQUILA 434 89387 ... 1000 6.92 10.38 12 124.56 12.0 3.17 3 2015 24722

5 rows × 21 columns

I grouped the sales by each store for the quarters of 2015 and Q1 2016

In [25]:
VolumeL = pd.DataFrame({'VolumeL':df.groupby(['Store Number','County','Population'])['Volume Sold (Liters)'].sum()})
Bottles = pd.DataFrame({'Bottles':df.groupby(['Store Number','County','Population'])['Bottles Sold'].sum()})
Sales_Y15Q1 = pd.DataFrame({'Q1Y15':df[(df.Date >= '2015-01-01') & (df.Date < '2015-04-01')].groupby(['Store Number','County', 'Population'])['Sale (Dollars)'].sum()})
Sales_Y15Q2 = pd.DataFrame({'Q2Y15':df[(df.Date >= '2015-04-01') & (df.Date < '2015-07-01')].groupby(['Store Number','County', 'Population'])['Sale (Dollars)'].sum()})
Sales_Y15Q3 = pd.DataFrame({'Q3Y15':df[(df.Date >= '2015-07-01') & (df.Date < '2015-10-01')].groupby(['Store Number','County', 'Population'])['Sale (Dollars)'].sum()})
Sales_Y15Q4 = pd.DataFrame({'Q4Y15':df[(df.Date >= '2015-10-01') & (df.Date < '2016-01-01')].groupby(['Store Number','County', 'Population'])['Sale (Dollars)'].sum()})
Sales_Y16Q1 = pd.DataFrame({'Q1Y16':df[(df.Date >= '2016-01-01') & (df.Date < '2016-04-01')].groupby(['Store Number','County', 'Population'])['Sale (Dollars)'].sum()})
Total_Sales = pd.concat([Bottles, VolumeL ,Sales_Y15Q1, Sales_Y15Q2, Sales_Y15Q3, Sales_Y15Q4, Sales_Y16Q1], axis =1, join = 'outer')

Total_Sales.fillna(0,inplace = True)

Total_Sales.head()
Total_Sales.tail()
Out[25]:
Bottles VolumeL Q1Y15 Q2Y15 Q3Y15 Q4Y15 Q1Y16
Store Number County Population
9002 Scott 172126 2286 1291.5 4865.16 10206.36 10180.68 12448.32 1382.64
9010 Warren 48626 360 221.4 3861.60 1155.60 740.76 0.00 177.48
9013 Warren 48626 264 201.0 0.00 677.16 165.12 1375.20 361.56
9018 Dubuque 97125 300 225.0 0.00 0.00 0.00 6435.00 0.00
9023 Carroll 20498 24 18.0 0.00 0.00 0.00 0.00 633.36
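The five nearly identical quarterly blocks above could also be produced in one pass with a period column and `pivot_table`; a sketch on toy rows with the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    'Store Number': [1, 1, 2],
    'Date': pd.to_datetime(['2015-02-01', '2015-05-10', '2016-01-15']),
    'Sale (Dollars)': [100.0, 40.0, 70.0],
})

# One row per store, one column per quarter actually present in the data
quarterly = (df.assign(Quarter=df['Date'].dt.to_period('Q'))
               .pivot_table(index='Store Number', columns='Quarter',
                            values='Sale (Dollars)', aggfunc='sum', fill_value=0))
print(quarterly.shape)
```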
In [26]:
# print Total_Sales.Q1Y16.sort_values(ascending = False).head()
# print 'Total Sales in Q1Y16 $', Total_Sales.Q1Y16.sort_values(ascending = False).sum()

REVISED:

I will:

  1. group Total_Sales by county,
  2. add the average bottle volume (AvgBotVol),
  3. add the average volume per capita (AvgVolCap),

and build a model on three predictors (Q1Y15 sales plus the two above), with Q1Y16 as the predicted value.
In [27]:
VolumeL = pd.DataFrame({'VolumeL':df.groupby(['County','Population'])['Volume Sold (Liters)'].sum()})
Bottles = pd.DataFrame({'Bottles':df.groupby(['County','Population'])['Bottles Sold'].sum()})
Sales_Y15Q1 = pd.DataFrame({'Q1Y15':df[(df.Date >= '2015-01-01') & (df.Date < '2015-04-01')].groupby(['County', 'Population'])['Sale (Dollars)'].sum()})
Sales_Y15Q2 = pd.DataFrame({'Q2Y15':df[(df.Date >= '2015-04-01') & (df.Date < '2015-07-01')].groupby(['County', 'Population'])['Sale (Dollars)'].sum()})
Sales_Y15Q3 = pd.DataFrame({'Q3Y15':df[(df.Date >= '2015-07-01') & (df.Date < '2015-10-01')].groupby(['County', 'Population'])['Sale (Dollars)'].sum()})
Sales_Y15Q4 = pd.DataFrame({'Q4Y15':df[(df.Date >= '2015-10-01') & (df.Date < '2016-01-01')].groupby(['County', 'Population'])['Sale (Dollars)'].sum()})
Sales_Y16Q1 = pd.DataFrame({'Q1Y16':df[(df.Date >= '2016-01-01') & (df.Date < '2016-04-01')].groupby(['County', 'Population'])['Sale (Dollars)'].sum()})
County_Sales = pd.concat([Bottles, VolumeL ,Sales_Y15Q1, Sales_Y15Q2, Sales_Y15Q3, Sales_Y15Q4, Sales_Y16Q1], axis =1, join = 'outer')
County_Sales.reset_index(level = 1, inplace = True)
County_Sales['AvgBotVol'] = County_Sales.VolumeL / County_Sales.Bottles
County_Sales['AvgVolCap'] = County_Sales.VolumeL / County_Sales['Population']

County_Sales.head()
Out[27]:
Population Bottles VolumeL Q1Y15 Q2Y15 Q3Y15 Q4Y15 Q1Y16 AvgBotVol AvgVolCap
County
Adair 7228 4638 4350.84 10222.00 10820.66 11879.83 10730.49 11797.02 0.938085 0.601942
Adams 3796 955 981.38 2253.60 2015.54 3504.52 2471.48 2196.57 1.027623 0.258530
Allamakee 13886 7450 7868.04 14907.98 23686.20 19409.71 21914.58 19081.53 1.056113 0.566617
Appanoose 12529 8171 8038.16 15228.09 25218.66 20258.66 16328.23 21396.23 0.983743 0.641564
Audubon 5773 1694 1717.80 5510.75 2783.12 3239.32 4896.07 3226.28 1.014050 0.297558
In [28]:
print Total_Sales.sum()
k = Total_Sales.Q1Y16.sum()/Total_Sales.Q1Y15.sum()
print 'YoY increment from 2015 to 2016:', k
Q2Y16 = Total_Sales.Q2Y15.sum()*Total_Sales.Q1Y16.sum()/Total_Sales.Q1Y15.sum()
print 'Forecast of sales for Q2 Y16: $', Q2Y16
FY16 = Total_Sales.Q3Y15.sum() + Total_Sales.Q4Y15.sum() + Total_Sales.Q1Y16.sum() + Q2Y16
print 'Thus the forecast for the fiscal year 2016: $', FY16
Bottles    2658010.00
VolumeL    2420108.31
Q1Y15      6147480.47
Q2Y15      7308799.87
Q3Y15      6718786.91
Q4Y15      8295874.82
Q1Y16      6208995.80
dtype: float64
YoY increment from 2015 to 2016: 1.01000659218
Forecast of sales for Q2 Y16: $ 7381936.04963
Thus the forecast for the fiscal year 2016: $ 28605593.5796
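The naive forecast above can be replayed from the printed sample totals (Iowa's fiscal year 2016 runs from July 2015 through June 2016, hence the four quarters chosen):

```python
# Quarterly totals printed by the cell above (10% sample)
q1y15, q2y15 = 6147480.47, 7308799.87
q3y15, q4y15 = 6718786.91, 8295874.82
q1y16 = 6208995.80

k = q1y16 / q1y15                      # year-over-year factor from the overlapping quarter
q2y16 = q2y15 * k                      # scale last year's Q2 by the same factor
fy16 = q3y15 + q4y15 + q1y16 + q2y16   # FY2016 = Q3'15 + Q4'15 + Q1'16 + Q2'16 (forecast)
print(round(k, 6), round(fy16, 2))
```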

Preliminary result: the forecast of FY2016 sales is USD 289,790,002 vs. the real USD 288,908,790.

That is a difference of $881,212, or about a 0.3 percent error.

For annual reports please click here - https://abd.iowa.gov/annual-reports

Now let me try to run a linear regression on the DataFrame with quarterly sales, to predict each county's sales in Q1 2016

In [29]:
# For forecasting I will be using the per-county sales
# Let's see the shape of County_Sales
print County_Sales.shape
County_Sales.head()
(99, 10)
Out[29]:
Population Bottles VolumeL Q1Y15 Q2Y15 Q3Y15 Q4Y15 Q1Y16 AvgBotVol AvgVolCap
County
Adair 7228 4638 4350.84 10222.00 10820.66 11879.83 10730.49 11797.02 0.938085 0.601942
Adams 3796 955 981.38 2253.60 2015.54 3504.52 2471.48 2196.57 1.027623 0.258530
Allamakee 13886 7450 7868.04 14907.98 23686.20 19409.71 21914.58 19081.53 1.056113 0.566617
Appanoose 12529 8171 8038.16 15228.09 25218.66 20258.66 16328.23 21396.23 0.983743 0.641564
Audubon 5773 1694 1717.80 5510.75 2783.12 3239.32 4896.07 3226.28 1.014050 0.297558
In [30]:
y = County_Sales.Q1Y16
X = County_Sales[['Q1Y15', 'AvgBotVol', 'AvgVolCap']]
# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print X_train.shape, y_train.shape
print X_test.shape, y_test.shape
(69, 3) (69,)
(30, 3) (30,)
In [31]:
# fit a model
lm = linear_model.LinearRegression()
print lm
model = lm.fit(X_train, y_train)
print model
predictions = lm.predict(X_test)
print predictions
print X_train.shape
print y_train.shape
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
[ 26831.94669702  26253.57062783   9798.8960754    7440.79262964
  77241.48878451  30212.24138306  56470.75342467  99048.95006273
  12534.46829682   5823.97060365  13365.85443148  14754.82192909
  74619.23305879  11477.90883066  31153.10972357  30316.68594541
  14418.48062607  11902.39930987  15568.40906036   5363.22662336
  39507.44802931  20207.30998635  28881.96106603  34560.39940023
  20555.32738915   9774.03297333   3946.93126442  26269.6456897
  14161.76473531  17092.16946972]
(69, 3)
(69,)
In [32]:
# Validate the model: plot predictions against the true values
plt.scatter(y_test, predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")
Out[32]:
<matplotlib.text.Text at 0x1048a41d0>

So here are the results, and the model is:

In [33]:
print "Score:", model.score(X_test, y_test)  
print 'Model coefficient is:', lm.coef_
print lm.intercept_
print 'Based on multilinear regression, the forecast for fiscal year 2016 is $'
Score: 0.874231431656
Model coefficient is: [  1.00834934e+00  -1.62269828e+04   1.99972543e+03]
15133.7585652
Based on multilinear regression, the forecast for fiscal year 2016 is $

Thus, using the coefficients printed above, the MLR looks like this:

Q1Y16 = 1.0083*(Q1Y15) - 16,226.98*(AvgBotVol) + 1,999.73*(AvgVolCap) + 15,133.76
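As a sanity check (my addition, using the coefficients printed above and the Adair row from County_Sales.head()), the fitted equation lands within a few percent of that county's actual Q1 2016 sales:

```python
# Coefficients and intercept printed by the notebook above
coef = [1.00834934, -16226.9828, 1999.72543]
intercept = 15133.7585652

# Adair county: Q1Y15, AvgBotVol, AvgVolCap -> actual Q1Y16
q1y15, avg_bot_vol, avg_vol_cap, actual = 10222.00, 0.938085, 0.601942, 11797.02

pred = coef[0] * q1y15 + coef[1] * avg_bot_vol + coef[2] * avg_vol_cap + intercept
print(round(pred, 2), actual)  # prediction is within ~4% of the actual value
```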

Let's try to apply it to each county

In [34]:
County_Sales['FY2016'] = (County_Sales.Q1Y15+County_Sales.Q2Y15+County_Sales.Q3Y15+County_Sales.Q4Y15)*lm.coef_[0]+County_Sales.AvgBotVol*lm.coef_[1]+County_Sales.AvgVolCap*lm.coef_[2]+lm.intercept_
County_Sales.head()
County_Sales.FY2016.sum()
Out[34]:
28733016.270820655

The Final Part - Regularization

In [35]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
In [36]:
# Perform 6-fold cross validation
print County_Sales.shape
scores = cross_val_score(lm, X_train, y_train, cv=6)
print "Cross-validated scores:", scores
print "Average: ", scores.mean()
(99, 11)
Cross-validated scores: [ 0.99866802  0.99118501  0.97591599  0.9975724   0.98556389  0.95918441]
Average:  0.984681619309
In [37]:
# LASSO regression

# Let me play with alpha = 0.01
In [38]:
# try alpha=0.01 and examine coefficients
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.01, normalize=True)
lassoreg.fit(X_train, y_train)
print lassoreg.coef_
[  1.00835052e+00  -1.62254089e+04   1.99899112e+03]
In [39]:
# calculate RMSE (for alpha=0.01)
y_pred = lassoreg.predict(X_test)
print np.sqrt(metrics.mean_squared_error(y_test, y_pred))
7344.77250749
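To go beyond a single alpha, one could sweep a few values and compare test RMSE. The real County_Sales split lives earlier in the notebook, so this sketch uses synthetic data; note also that `normalize=True` was removed from scikit-learn's `Lasso` in recent versions, so it is omitted here:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 99 "counties", 3 features, a known linear signal plus noise
rng = np.random.RandomState(0)
X = rng.rand(99, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=99)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rmses = {}
for alpha in (0.001, 0.01, 0.1):
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    rmses[alpha] = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(alpha, round(rmses[alpha], 4))
```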
In [40]:
# Not bad. Now I am happier.

I hope I am done. I am happier with the current results. I would present this model if I had to.

In [ ]: