The aim of my project is to explore algorithms that help distinguish fake news from real news. The subject of fake news became prominent around the recent US presidential election and the possible Russian meddling in elections in the US, France, and Germany. The topic became so hot that several crowdsourced groups compiled a dataset of about 13,000 articles published on various websites, categorized as "bs" (totally fake), fake, conspiracy, hate, junksci, and bias.
I selected only the articles classified as 'bs' and written in English to see what makes them stand apart and whether I could predict how credible a fresh piece of news is likely to be. To compare this set with credible news, I used a news web-scraping package called "newspaper", which helped me add about 6,000 credible news articles from the left, center, and right of the political spectrum.
The model trained on DataCamp's dataset produced interesting results on the data scraped from the web and on the dataset I borrowed from Kaggle.com. The DataCamp tutorial comes with its own dataset of about six thousand English-language news articles marked as fake or real; it uses a Multinomial Naive Bayes classifier and fits a model with an accuracy score of 0.858.
I intend to take the model that has been fit and tested on DataCamp's data and see whether it is applicable to an independent dataset that I am creating in this notebook. Additionally, I want to see how the model rates the credibility of news from different parts of the political spectrum.
Since applying the MultinomialNB algorithm produced results that made me question common sense, I tried other algorithms and finally settled on RandomForest. It produced the most reliable results and held up well on the train/test split.
By way of introduction, I use the dataset from the Datacamp.com tutorial and, based on the model developed there, check it against the dataset I have from Kaggle.com and the one I scraped myself. (https://www.datacamp.com/community/tutorials/scikit-learn-fake-news#gs.664H2N0)
import pandas as pd
import matplotlib.pyplot as plt
dt = pd.read_csv("data/fake_or_real_news.csv")
dt.info()
dt.head()
dt = dt.set_index("Unnamed: 0")
dt.head()
# Checking how balanced the observation groups are.
dt.label.value_counts()
from sklearn.model_selection import train_test_split
#Set `y`
y = dt.label
# Drop the `label` column
dt = dt.drop("label", axis=1)
# Make training and test sets
X_train, X_test, y_train, y_test = train_test_split(dt['text'], y, test_size=0.33, random_state=53)
# Originally the DataCamp article uses both CountVectorizer and TfidfVectorizer.
# I decided to go with the latter: for a large dataset with articles of very
# different lengths, TF-IDF weighting seemed the better choice for MultinomialNB.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# Initialize the `tfidf_vectorizer`
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8)
# Fit and transform the training data
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
# Transform the test set
tfidf_test = tfidf_vectorizer.transform(X_test)
# Get the feature names of `tfidf_vectorizer`
print(tfidf_vectorizer.get_feature_names()[-10:])
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())
tfidf_df.head()
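To make the difference between the two vectorizers concrete, here is a tiny sketch on two made-up sentences (my own illustration, using the imports above):
# Toy comparison: raw counts vs TF-IDF weights on two short documents.
toy_docs = ["the cat sat on the mat", "the dog sat on the log"]
toy_count = CountVectorizer(stop_words='english')
print(toy_count.fit_transform(toy_docs).toarray())     # plain term counts
toy_tfidf = TfidfVectorizer(stop_words='english')
print(toy_tfidf.fit_transform(toy_docs).toarray())     # counts reweighted by inverse document frequency
print(toy_tfidf.get_feature_names())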
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    See full source and example:
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import numpy as np
import itertools
clf = MultinomialNB()
clf.fit(tfidf_train, y_train)
pred = clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
#rocauc = metrics.roc_auc_score(y_test, pred, average='weighted')
print ("accuracy: %0.3f" % score)
#print ("ROC_AUC score: %.3f" % rocauc)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
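As an extra sanity check (my own addition, not part of the DataCamp tutorial), the fitted classifier can be asked which tokens pull hardest toward each class:
# Most informative tokens per class; assumes `clf`, `tfidf_vectorizer` and `np` from the cells above.
feature_names = np.array(tfidf_vectorizer.get_feature_names())
log_odds = clf.feature_log_prob_[0] - clf.feature_log_prob_[1]   # classes_ are ['FAKE', 'REAL']
print("Tokens leaning FAKE:", feature_names[np.argsort(log_odds)[-10:]])
print("Tokens leaning REAL:", feature_names[np.argsort(log_odds)[:10]])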
<font color = "grey"> So the model is fit, and I will use it later on Kaggle's dataset of fake news and on the set of scraped news. </font>
In this part I import the dataset that was kindly crowdsourced and shared on Kaggle.com. I use this set to identify what political wing the news sources belong to and to look for commonalities. I will also test it against the model fit above in Part 1.
%matplotlib inline
import time
import re
import tldextract
f = pd.read_csv('data/fake.csv')
In case you decide to load the data directly from the Internet, you may run into problems parsing the file.
# url_ = 'https://github.com/baursafi/GA_DSI5_Capstone_Data/blob/master/fake.csv'
# a = pd.read_csv(url_)
print(f.shape)
f.head()
f.type.value_counts()
plt.figure(figsize=(10,6))
f.type.value_counts().plot(kind='bar',title = 'Frequency of Observations by Type', grid = True)
len(f.language.unique())
plt.figure(figsize=(10,6))
f.language.value_counts().plot(kind='bar',title = 'Frequency of Observations by Language', grid = True)
l = []
for i in range(len(f.text)):
    l.append(len(str(f.text[i])))
l = pd.DataFrame(l)
f['textlen'] = l
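The same column can be built without the loop; a vectorized one-liner that should give identical values:
# Equivalent to the loop above: length of each article's text as a string.
f['textlen'] = f.text.astype(str).str.len()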
plt.figure(figsize=(10,6))
f.textlen.hist(bins = 100)
Next I keep only the news that are:
- in English (all other languages are dropped)
- identified as 'bs' or 'fake'
- between 500 and 12,000 characters long
#f = f[(f.type == 'bs')|(f.type == 'fake')]
f = f[f.language == 'english']
f = f[(f.textlen > 500)&(f.textlen < 12000)]
f = f[['site_url', 'title', 'text']]
f.reset_index(drop=True, inplace=True)
f.rename(columns = {'site_url':'url'}, inplace=True)
print(f.shape)
f.head()
<font color = "red">NOTE:</font>
I will keep only the news that come from more or less established news agencies and aggregators. I decided to take the first 200 websites, which account for the vast majority of all the fake news (9,885 articles remain after trimming off the blogs and long papers that hardly fall into the news category).
f.url.value_counts()[:200].sum()
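The cell above only reports how many articles those 200 sites account for. A sketch of how the actual trimming could look (my own addition; the notebook does not show this step explicitly):
# Illustrative only: restrict the set to the 200 most frequent fake-news sites.
top_sites = f.url.value_counts()[:200].index
f_top = f[f.url.isin(top_sites)]
print(f_top.shape)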
# At this stage I can add the 'FAKE' label to this part of the dataset
f['label'] = 'FAKE'
f.head(2)
<font color = "red">
I checked as many sources as possible at https://mediabiasfactcheck.com.
Out of 215 news resources, 60 remain unidentified by their political affiliation, but a great part of them are either pure conspiracy or satire sites.
</font>
f['wing'] = '0-unknown'
f['conspiracy'] = 0
f['satire'] = 0
f.head(1)
# Below I have checked 215 news resources at https://mediabiasfactcheck.com to identify the political affiliation
# of each news aggregator or news agency
f.loc[f.url.str.contains('consortiumnews|usatoday|politifact|sctimes|timesofsandiego'),'wing'] = '4-center'
f.loc[f.url.str.contains('presstv|mintpressnews|latimes|chicagotribune|bustle|natmonitor|politico'),'wing'] = '3-center left'
f.loc[f.url.str.contains('antiwar|russia-insider|sputniknews|strategic-culture|postbulletin|hpenews|ustfactsdaily'),'wing'] = '5-center right'
f.loc[f.url.str.contains('politicususa|opednews|liberalamerica|truthdig|counterpunch|blackagendareport|guardianlv|ahtribune|intrepidreport|wakingtimes|addictinginfo|activistpost|other98|countercurrents|huffingtonpost|rabble|cnn'),'wing'] = '2-left'
f.loc[f.url.str.contains('naturalnews|ijr|wearechange|awdnews|twitchy|thenewamerican|amtvmedia|abovetopsecret|nowtheendbegins|thecommonsenseshow|fromthetrenchesworldreport|nakedcapitalism|prisonplanet|investmentwatchblog|ronpaulinst|thecontroversialfiles|gulagbound|rt|thedailybell|corbettreport|zerohedge|whatreally|wikileaks|newstarg|regated|southfront'),'wing'] = '6-right'
f.loc[f.url.str.contains('occupydemocrats|ifyouonlynews|pravdareport|usuncut|newcenturytimes|trueactivist|dailynewsbin'),'wing'] = '1-extreme left'
f.loc[f.url.str.contains('madworldnews|thefederalistpapers|conservativetribune|libertyunyielding|truthfeed|freedomoutpost|frontpagemag|dccl|othesline|wnd|ihavethetruth|amren|barenakedislam|returnofkings|trunews|jewsnews|shtfplan|lewrockwell|dailystormer|libertynews|endingthefield|dailywire|vdare|100percentfedup|21stcenturywire|westernjournalism|redflagnews|libertywritersnews|conservativedailypost|departed|breitbart|donaldtrumpnews.co|bipartisanreport|americanlookout|spinzon|usapoliticsnow|usanewsflash|hangthebankers|toprightnews|usasupreme|americasfreedomfighters|viralliberty'),'wing'] = '7-extreme right'
# Below, similarly to the cell above, the code flags whether the news agency or aggregator
# publishes conspiracy/pseudoscience content
f.loc[f.url.str.contains('yournewswire|trunews|naturalnews|infowars|eutimes|truthfeed|topinfopost|thedailysheeple|jewsnews|wearechange|awdnews|worldtruth|govtslaves|thetruthseeker.co|amtvmedia|sott|abovetopsecret.com|collective-evolution.com|shtfplan.com|theeventchronicle.com|thefreethoughtproject.com|humansarefree.com|veteranstoday.com|lewrockwell.com|nowtheendbegins.com|thecommonsenseshow.com|themindunleashed.comfromthetrenchesworldreport.com|intellihub.com|realfarmacy.com|greanvillepost.com|dailystormer.com|disclose.tv|whydontyoutrythis.com|prisonplanet.com|investmentwatchblog.com|thecontroversialfiles.net|godlikeproductions.com|anonhq.com|abeldanger.net|wakingtimes.com |gulagbound.com|endingthefed.com|healthimpactnews.com|truthbroadcastnetwork.com|21stcenturywire.com|corbettreport.com|undergroundhealth.com|zerohedge.com|geoengineeringwatch.org|conservativedailypost.com|pakalertpress.com|whatreallyhappened.com|coasttocoastam.com|trueactivist.com|activistpost.com|theantimedia.org|usapoliticsnow.com|newstarget.com|theearthchild.co.za|anonews.co|southfront.org|americasfreedomfighters.com|davidwolfe.com|vigilantcitizen.com'),'conspiracy'] = 1
f.loc[f.url.str.contains('waterfordwhispersnews|theonion|thedailymash|thespoof|clickhole|newsthump|newsbiscuit|theunrealtimes|dailysquib|adobochronicles|gomerblog|thelastlineofdefense|satirewire|reductress'),'satire'] = 1
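The long `str.contains` chains above are hard to maintain; the same labelling could be driven by a small mapping, for example (a sketch with a shortened pattern list):
# Sketch: assign wings from a dict of regex patterns instead of repeated .loc lines.
wing_patterns = {
    '4-center': 'consortiumnews|usatoday|politifact|sctimes|timesofsandiego',
    '1-extreme left': 'occupydemocrats|ifyouonlynews|pravdareport|usuncut|newcenturytimes|trueactivist|dailynewsbin',
}
for wing, pattern in wing_patterns.items():
    f.loc[f.url.str.contains(pattern), 'wing'] = wing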
This is how the news from those sources split between the political affiliations:
f.wing.value_counts()
f.head()
# I could only identify about 140 resources by their political affiliation; the remaining 60 stayed unknown.
# A great part of those are satirical resources, so I decided to keep them.
f.wing.value_counts().plot(kind = 'bar')
f.wing.value_counts()
z_z = pd.DataFrame()
z_ = []
for i in sorted(f.wing.unique()):
    cvec = CountVectorizer(stop_words='english')
    cvec.fit(f[f.wing == i].text)
    cvecdata = cvec.transform(f[f.wing == i].text)
    df = pd.DataFrame(cvecdata.todense(),
                      columns=cvec.get_feature_names())
    z_ = pd.DataFrame(df.sum())
    z_.columns = ["sums"]
    z_ = pd.DataFrame(z_.sums.sort_values(ascending=False)[:100] / float(z_.sum()) * 100)
    z_z = pd.concat([z_z, z_], axis=1)
z_z.columns = sorted(f.wing.unique())  # must match the sorted order used in the loop above
z_z.dropna(inplace = True)
print(z_z.shape)
z_z = z_z[sorted(f.wing.unique())]
z_z['mean'] = z_z.mean(axis=1)
z_z = z_z.sort_values(by = "mean", ascending=False)[:10]
z_z.to_csv("data/fake_word_freq.csv")
z_z
df.head()
f.groupby('wing').sum()
cvec = CountVectorizer(stop_words='english')
cvec.fit(f[f.satire == 1].text)
cvecdata = cvec.transform(f[f.satire == 1].text)
df = pd.DataFrame(cvecdata.todense(),
columns = cvec.get_feature_names())
z_ = pd.DataFrame(df.sum())
z_.columns = ["percent"]
z_ = pd.DataFrame(z_.percent.sort_values(ascending=False)[:100]/float(z_.sum())*100)
z_.head(10)
# Transform the Kaggle fake-news texts with the vectorizer fitted on the DataCamp data
tfidf_test = tfidf_vectorizer.transform(f.text)
y = f.label
# Get the feature names of `tfidf_vectorizer`
print(tfidf_vectorizer.get_feature_names()[-10:])
pred = clf.predict(tfidf_test)
pred_df = pd.DataFrame(pred)
pred_df[0].value_counts()
score = metrics.accuracy_score(y, pred)
print("accuracy: %0.3f" % score)
Not a bad result for the FAKE news: since every article in this set is labeled FAKE, this accuracy is effectively the share of fake articles the model catches.
Code is borrowed from Evann Smith's LDA presentation https://github.com/baursafi/GA/blob/master/modeling_text_data.ipynb
import random
import string
import re
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from tqdm import tqdm
tqdm.pandas(desc='progress-bar')
stop = set(stopwords.words('english') + list(string.punctuation))
stemmer = PorterStemmer()
re_punct = re.compile('[' + ''.join(string.punctuation) + ']')
def preprocess(text):
    try:
        text = text.lower()
        tokens = word_tokenize(text)
        tokens = [t for t in tokens if not t in stop]
        tokens = [re.sub(re_punct, '', t) for t in tokens]
        tokens = [t for t in tokens if len(t) > 2]
        tokens = [stemmer.stem(t) for t in tokens]
        if len(tokens) == 0:
            return None
        else:
            return ' '.join(tokens)
    except:
        return None
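A quick look at what the preprocessing does to a sample sentence (the output shown in the comment is approximate):
# Stop words and punctuation are dropped, short tokens removed, the rest stemmed.
print(preprocess("The senators are debating the new election laws today!"))
# roughly: 'senat debat new elect law today'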
data_f = f[['url','title','text','wing', 'conspiracy', 'label']]
print(data_f.shape)
data_f.head()
data_f['tokens'] = data_f['text'].progress_map(preprocess)
data_f = data_f[data_f['tokens'].notnull()]
data_f.reset_index(inplace=True)
data_f.drop('index', inplace=True, axis=1)
print('{} text'.format(len(data_f)))
data_f.head()
Now I can run the tokenized, preprocessed news through the model fitted on the DataCamp dataset.
# Transform the test set
tfidf_test_tokens = tfidf_vectorizer.transform(data_f.tokens)
y = data_f.label
# Get the feature names of `tfidf_vectorizer`
print(tfidf_vectorizer.get_feature_names()[-10:])
pred = clf.predict(tfidf_test_tokens)
pred_f = pd.DataFrame(pred)
pred_f[0].value_counts()
score = metrics.accuracy_score(y, pred)
print("accuracy: %0.3f" % score)
data_f['pred'] = pred
data_f.head()
# Here I decided to build a little table of how each wing of political affiliation performs on the FAKE/REAL model
# trained on the DataCamp.com dataset.
z_f = pd.concat([data_f[data_f.pred == "FAKE"].wing.value_counts(),data_f[data_f.pred == "REAL"].wing.value_counts()], join='outer', axis = 1)
z_f.columns = ['FAKE', 'REAL']
z_f['TOTAL'] = z_f.REAL + z_f.FAKE
z_f['CREDIBILITY'] = z_f.REAL/(z_f.FAKE +z_f.REAL)
z_f.to_csv('data/z_f_pred.csv')
z_f
plt.figure(figsize=(8,6))
z_f.CREDIBILITY.plot(kind = 'bar')
_ = plt.legend(["Credibility of Kaggle's FAKE news according to the DataCamp model"])
plt.show()
Tokenizing the text decreases the accuracy of a model trained on non-tokenized texts: the TF-IDF vocabulary was fitted on raw, unstemmed words, so many stemmed tokens fall outside it.
You will see below that the same model gives unreliable results for the tokenized text of the web-scraped news.
Based on these results, I decided to take a closer look at the web-scraped news.
<font color = "green">
I would like to check how these news score on the REAL/FAKE check. The scraping code is in the bottom part of this Jupyter Notebook. </font>
df = pd.read_csv('data/scraped_news.csv')
df.head()
# Let's restore the set's original index
df = df.set_index("Unnamed: 0")
print(df.shape)
df.head(2)
df['wing'] = '0-unknown'
df['conspiracy'] = 0
df['satire'] = 0
# Below I have checked 215 news resources at https://mediabiasfactcheck.com to identify the political affiliation
# of each news aggregator or news agency
df.loc[df.url.str.contains('consortiumnews|usatoday|politifact|sctimes|timesofsandiego'),'wing'] = '4-center'
df.loc[df.url.str.contains('presstv|mintpressnews|latimes|chicagotribune|bustle|natmonitor|politico|nytimes'),'wing'] = '3-center left'
df.loc[df.url.str.contains('antiwar|russia-insider|sputniknews|strategic-culture|postbulletin|hpenews|ustfactsdaily'),'wing'] = '5-center right'
df.loc[df.url.str.contains('politicususa|opednews|liberalamerica|truthdig|counterpunch|blackagendareport|guardianlv|ahtribune|intrepidreport|wakingtimes|addictinginfo|activistpost|other98|countercurrents|huffingtonpost|rabble|cnn'),'wing'] = '2-left'
df.loc[df.url.str.contains('naturalnews|ijr|wearechange|awdnews|twitchy|thenewamerican|amtvmedia|abovetopsecret|nowtheendbegins|thecommonsenseshow|fromthetrenchesworldreport|nakedcapitalism|prisonplanet|investmentwatchblog|ronpaulinst|thecontroversialfiles|gulagbound|rt|thedailybell|corbettreport|zerohedge|whatreally|wikileaks|newstarg|regated|southfront'),'wing'] = '6-right'
df.loc[df.url.str.contains('occupydemocrats|ifyouonlynews|pravdareport|usuncut|newcenturytimes|trueactivist|dailynewsbin'),'wing'] = '1-extreme left'
df.loc[df.url.str.contains('madworldnews|thefederalistpapers|conservativetribune|libertyunyielding|truthfeed|freedomoutpost|frontpagemag|dccl|othesline|wnd|ihavethetruth|amren|barenakedislam|returnofkings|trunews|jewsnews|shtfplan|lewrockwell|dailystormer|libertynews|endingthefield|dailywire|vdare|100percentfedup|21stcenturywire|westernjournalism|redflagnews|libertywritersnews|conservativedailypost|departed|breitbart|donaldtrumpnews|bipartisanreport|americanlookout|spinzon|usapoliticsnow|usanewsflash|hangthebankers|toprightnews|usasupreme|americasfreedomfighters|viralliberty'),'wing'] = '7-extreme right'
# Below, similarly to the cell above, the code flags whether the news agency or aggregator
# publishes conspiracy/pseudoscience content
df.loc[df.url.str.contains('yournewswire|trunews|naturalnews|infowars|eutimes|truthfeed|topinfopost|thedailysheeple|jewsnews|wearechange|awdnews|worldtruth|govtslaves|thetruthseeker.co|amtvmedia|sott|abovetopsecret.com|collective-evolution.com|shtfplan.com|theeventchronicle.com|thefreethoughtproject.com|humansarefree.com|veteranstoday.com|lewrockwell.com|nowtheendbegins.com|thecommonsenseshow.com|themindunleashed.comfromthetrenchesworldreport.com|intellihub.com|realfarmacy.com|greanvillepost.com|dailystormer.com|disclose.tv|whydontyoutrythis.com|prisonplanet.com|investmentwatchblog.com|thecontroversialfiles.net|godlikeproductions.com|anonhq.com|abeldanger.net|wakingtimes.com |gulagbound.com|endingthefed.com|healthimpactnews.com|truthbroadcastnetwork.com|21stcenturywire.com|corbettreport.com|undergroundhealth.com|zerohedge.com|geoengineeringwatch.org|conservativedailypost.com|pakalertpress.com|whatreallyhappened.com|coasttocoastam.com|trueactivist.com|activistpost.com|theantimedia.org|usapoliticsnow.com|newstarget.com|theearthchild.co.za|anonews.co|southfront.org|americasfreedomfighters.com|davidwolfe.com|vigilantcitizen.com'),'conspiracy'] = 1
df.loc[df.url.str.contains('waterfordwhispersnews.com|theonion.com|thedailymash.co.uk|thespoof.com|clickhole.com|newsthump.com|newsbiscuit.com|theunrealtimes.com|dailysquib.co.uk|adobochronicles.com|gomerblog.com|thelastlineofdefense.org|satirewire.com|reductress.com'),'satire'] = 1
print(df.wing.value_counts())
df.wing.value_counts().plot(kind = 'bar')
df['label'] = 'REAL'
df.tail()
df = df[df.wing != '0-unknown']  # the default value assigned above is '0-unknown'
df.reset_index(inplace = True)
df.drop('Unnamed: 0',axis = 1, inplace = True)
# Simplifying the URL: keep only the domain and suffix.
# Later you will see even this is too much.
for i in range(df.shape[0]):
    ext = tldextract.extract(df.loc[i, 'url'])
    try:
        df.set_value(i, 'url', (ext.domain + '.' + ext.suffix))
    except:
        pass
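For reference, this is what tldextract returns for a full article URL (illustrative example):
# tldextract splits a URL into subdomain, registered domain and public suffix.
ext = tldextract.extract('http://www.latimes.com/politics/some-article.html')
print(ext.domain + '.' + ext.suffix)   # -> latimes.com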
df.tail()
df.groupby('wing').sum()
For the web-scraped set I apply the same tokenizing and preprocessing. The code is borrowed from Evann Smith's LDA presentation: https://github.com/baursafi/GA/blob/master/modeling_text_data.ipynb
stop = set(stopwords.words('english') + list(string.punctuation))
stemmer = PorterStemmer()
re_punct = re.compile('[' + ''.join(string.punctuation) + ']')
def preprocess(text):
    try:
        text = text.lower()
        tokens = word_tokenize(text)
        tokens = [t for t in tokens if not t in stop]
        tokens = [re.sub(re_punct, '', t) for t in tokens]
        tokens = [t for t in tokens if len(t) > 2]
        tokens = [stemmer.stem(t) for t in tokens]
        if len(tokens) == 0:
            return None
        else:
            return ' '.join(tokens)
    except:
        return None
data_w = df[['url','title','text','wing', 'conspiracy', 'label']]
print(data_w.shape)
data_w.head()
data_w['tokens'] = data_w['text'].progress_map(preprocess)
data_w = data_w[data_w['tokens'].notnull()]
data_w.reset_index(inplace=True)
data_w.drop('index', inplace=True, axis=1)
print('{} text'.format(len(data_w)))
data_w.head()
Now I can run the tokenized, preprocessed news through the model fitted on the DataCamp dataset.
# Transform the test set
tfidf_test_tokens = tfidf_vectorizer.transform(data_w.tokens)
y = data_w.label
# Get the feature names of `tfidf_vectorizer`
print(tfidf_vectorizer.get_feature_names()[-10:])
pred = clf.predict(tfidf_test_tokens)
pred_df = pd.DataFrame(pred)
pred_df[0].value_counts()
score = metrics.accuracy_score(y, pred)
print("accuracy: %0.3f" % score)
data_w['pred'] = pred
data_w.head()
# Here I decided to build a little table of how each wing of political affiliation performs on the FAKE/REAL model
# trained on the DataCamp.com dataset.
z = pd.concat([data_w[data_w.pred == "FAKE"].wing.value_counts(),data_w[data_w.pred == "REAL"].wing.value_counts()], join='outer', axis = 1)
z.columns = ['FAKE', 'REAL']
z['TOTAL'] = z.REAL + z.FAKE
z['CREDIBILITY'] = z.REAL/(z.FAKE +z.REAL)
z.to_csv('data/z_pred.csv')
z
<font color = "red"> It is strange and I am very curious why the scraped news produce such a strange result for the left and extreme right wings. </font>
data_w[data_w.wing == '2-left'].url.value_counts()
data_w[data_w.wing == '7-extreme right'].url.value_counts()
plt.figure(figsize=(10,6))
z.CREDIBILITY.plot(kind = 'bar')
I know, I know: but we have to remember, the model has been trained on a different dataset.
# Let's do the same for individual websites:
z_url = pd.concat([data_w[data_w.pred == "FAKE"].url.value_counts(),data_w[data_w.pred == "REAL"].url.value_counts()], join='outer', axis = 1)
z_url.fillna(0, inplace = True)
z_url.columns = ['FAKE', 'REAL']
z_url['TOTAL'] = z_url.REAL + z_url.FAKE
z_url['CREDIBILITY'] = z_url.REAL/(z_url.FAKE +z_url.REAL)
z_url.to_csv('data/z_url_pred.csv')
z_url[z_url.TOTAL>10].sort_values(by = 'CREDIBILITY', na_position='last', ascending=0).head(10)
z_url[z_url.TOTAL>10].sort_values(by = 'CREDIBILITY', na_position='last', ascending=0).tail(10)
Below I will merge the two datasets, re-fit the model, and check its accuracy.
print(data_f.shape)
data_f.head(3)
print(data_w.shape)
data_w.head(3)
data_fw = pd.concat([data_f, data_w], axis=0, ignore_index=True)
data_fw.drop('pred', axis = 1, inplace = True)
print(data_fw.shape)
data_fw.head(3)
# Checking how balanced the observation groups are.
data_fw.label.value_counts()
#Set `y`
y = data_fw.label
# Make training and test sets
X_train, X_test, y_train, y_test = train_test_split(data_fw['text'], y, test_size=0.33, random_state=53)
print('X_train - {}, y_train - {}'.format(X_train.shape, y_train.shape))
print('X_test - {}, y_test - {}'.format(X_test.shape, y_test.shape))
# Initialize the `tfidf_vectorizer`
from sklearn.feature_extraction import text
# Combine the built-in English stop words with a few extra terms;
# passing a plain list here would otherwise replace the English stop-word list entirely.
tfidf_vectorizer = TfidfVectorizer(stop_words=list(text.ENGLISH_STOP_WORDS.union(['indian', 'chinese', 'korean'])), max_df=0.8)
# Fit and transform the training data
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
# Transform the test set
tfidf_test = tfidf_vectorizer.transform(X_test)
# Get the feature names of `tfidf_vectorizer`
print(tfidf_vectorizer.get_feature_names()[-10:])
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())
print(tfidf_df.shape)
tfidf_df.head()
clf = MultinomialNB()
# Decided to run cross-validation to mix up the data a little
from sklearn.model_selection import cross_val_score
clf_score = cross_val_score(clf, tfidf_train, y_train, cv=10)
print(clf_score)
print("Accuracy = {}".format(clf_score.mean()))
clf.fit(tfidf_train, y_train)
pred = clf.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
# Transform the whole set
tfidf_fw = tfidf_vectorizer.transform(data_fw.text)
pred_fw = clf.predict(tfidf_fw)
pred_fw.shape
data_fw['pred'] = pred_fw
data_fw.head()
print('Fake news correctly predicted - {}'.format(
    data_fw['url'][(data_fw.label == "FAKE") & (data_fw.pred == "FAKE")].count() /
    float(data_fw['url'][data_fw.label == "FAKE"].count())))
print('Real news correctly predicted - {}'.format(
    data_fw['url'][(data_fw.label == "REAL") & (data_fw.pred == "REAL")].count() /
    float(data_fw['url'][data_fw.label == "REAL"].count())))
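The same per-class rates can also be read off sklearn's classification report, which is just another view of the numbers above:
# Precision and recall per class for the combined dataset.
print(metrics.classification_report(data_fw.label, data_fw.pred))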
z_real = data_fw[data_fw.label == 'REAL']
# Now let's see how it looks in terms of political affiliations
z_fw = pd.concat([z_real[z_real.pred == "FAKE"].wing.value_counts(),z_real[z_real.pred == "REAL"].wing.value_counts()], join='inner', axis = 1)
z_fw.columns = ['FAKE', 'REAL']
z_fw['TOTAL'] = z_fw.REAL + z_fw.FAKE
z_fw['CREDIBILITY'] = z_fw.REAL/(z_fw.FAKE +z_fw.REAL)
z_fw.to_csv('data/z_fw_pred.csv')
z_fw.sort_index(ascending=True,inplace=True)
<font color = "red"> It is strange and I am very curious why the scraped news produce such a strange result for the left and extreme right wings. </font>
plt.figure(figsize=(10,6))
z_fw.CREDIBILITY.plot()
z_fw_url = pd.concat([z_real[z_real.pred == "FAKE"].url.value_counts(),z_real[z_real.pred == "REAL"].url.value_counts()], join='inner', axis = 1)
z_fw_url.columns = ['FAKE', 'REAL']
z_fw_url['TOTAL'] = z_fw_url.REAL + z_fw_url.FAKE
z_fw_url['CREDIBILITY'] = z_fw_url.REAL/(z_fw_url.FAKE +z_fw_url.REAL)
z_fw_url[z_fw_url.TOTAL>20].sort_values(by = 'CREDIBILITY', na_position='last', ascending=0).head(15)
The initial idea, to use Naive Bayes the way it is used to identify spam emails, helped to confirm the initial fake set but proved barely reliable in estimating the credibility of the web-scraped news. Identifying fake news should instead be approached from the point of view of fact checking. In other words, it is important to separate opinions from facts: opinions may vary from extreme left to extreme right, and even staying neutral in opinion remains in the realm of opinion and cannot be fact-checked.
Fact checking, unlike opinion classification, requires a different approach: more advanced machine learning, topic identification, and fact comparison.
#Here I will try to run SVM to better understand how to read my findings
texts = data_fw.tokens.tolist()
y = data_fw.label.tolist()
vectorizer = TfidfVectorizer(min_df=5, max_df=0.8)
X = vectorizer.fit_transform(texts)
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
Feature Selection
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
total_features = len(vectorizer.get_feature_names())
print('{} total features prior to selection'.format(total_features))
ch2 = SelectKBest(chi2, k=500)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
X = ch2.transform(X)
X.shape
feature_names = list(vectorizer.get_feature_names())
mask = ch2.get_support()  # boolean mask of selected features
new_features = []  # the list of the K best features
for selected, feature in zip(mask, feature_names):
    if selected:
        new_features.append(feature)
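An equivalent one-liner, using the index form of the support mask:
# Same list of selected feature names, built in one line.
new_features = [feature_names[i] for i in ch2.get_support(indices=True)]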
data_frame = pd.DataFrame(data=X.todense(), columns=new_features)
data_frame.describe()
#Train and predict
classifier_SV = LinearSVC()
%time classifier_SV.fit(X_train, y_train)
print('Accuracy: {}'.format(round(classifier_SV.score(X_test, y_test), 3)))
classifier_RFC = RandomForestClassifier()
%time rfc = classifier_RFC.fit(X_train, y_train)
print('Accuracy: {}'.format(round(classifier_RFC.score(X_test, y_test), 3)))
classifier_GNB = GaussianNB()
%time classifier_GNB.fit(X_train.toarray(), y_train)
print('Accuracy: {}'.format(round(classifier_GNB.score(X_test.toarray(), y_test), 3)))
classifier_GBC = GradientBoostingClassifier()
%time classifier_GBC.fit(X_train.toarray(), y_train)
print('Accuracy: {}'.format(round(classifier_GBC.score(X_test.toarray(), y_test), 3)))
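To compare the four classifiers side by side, a small summary loop (assumes the fitted objects above are still in memory):
# Re-print the test accuracies of the four fitted models in one place.
for name, model, dense in [('LinearSVC', classifier_SV, False),
                           ('RandomForest', classifier_RFC, False),
                           ('GaussianNB', classifier_GNB, True),
                           ('GradientBoosting', classifier_GBC, True)]:
    X_eval = X_test.toarray() if dense else X_test
    print('{:<18} accuracy: {:.3f}'.format(name, model.score(X_eval, y_test)))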
I decided to go with the Random Forest Classifier and see how the web-scraped news performed in the test.
pred_RFC = classifier_RFC.predict(X)
data_fw['pred_RFC'] = pred_RFC
data_fw.head()
print('Fake news correctly predicted - {}'.format(
    data_fw['url'][(data_fw.label == "FAKE") & (data_fw.pred_RFC == "FAKE")].count() /
    float(data_fw['url'][data_fw.label == "FAKE"].count())))
print('Real news correctly predicted - {}'.format(
    data_fw['url'][(data_fw.label == "REAL") & (data_fw.pred_RFC == "REAL")].count() /
    float(data_fw['url'][data_fw.label == "REAL"].count())))
importances_rfc = pd.DataFrame(list(zip(new_features, rfc.feature_importances_)), columns=["words", "importance"])
importances_rfc.sort_values(by="importance",ascending=False).head(10)
z_real = data_fw[data_fw.label == 'REAL']
# Now let's see how it looks in terms of political affiliations
z_fw = pd.concat([z_real[z_real.pred_RFC == "FAKE"].wing.value_counts(), z_real[z_real.pred_RFC == "REAL"].wing.value_counts()], join='inner', axis=1)
z_fw.columns = ['FAKE', 'REAL']
z_fw['TOTAL'] = z_fw.REAL + z_fw.FAKE
z_fw['CREDIBILITY'] = z_fw.REAL/(z_fw.FAKE +z_fw.REAL)
z_fw.to_csv('data/z_fw_pred.csv')
z_fw.sort_index(ascending=True,inplace=True)
z_fw
plt.figure(figsize=(10,6))
z_fw.CREDIBILITY.plot()
z_fw_url = pd.concat([z_real[z_real.pred_RFC == "FAKE"].url.value_counts(),z_real[z_real.pred_RFC == "REAL"].url.value_counts()], join='inner', axis = 1)
z_fw_url.columns = ['FAKE', 'REAL']
z_fw_url['TOTAL'] = z_fw_url.REAL + z_fw_url.FAKE
z_fw_url['CREDIBILITY'] = z_fw_url.REAL/(z_fw_url.FAKE +z_fw_url.REAL)
z_fw_url[z_fw_url.TOTAL>20].sort_values(by = 'CREDIBILITY', na_position='last', ascending=0).head(10)
z_fw_url[z_fw_url.TOTAL>20].sort_values(by = 'CREDIBILITY', na_position='last', ascending=0).tail(10)
# Below is the code to check which news were predicted as fake. Very interesting.
# print list(data_fw[(data_fw.url == 'sctimes.com')&(data_fw.pred_RFC == 'FAKE')].text)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
digits = load_digits()
X, y = digits.data, digits.target
param_range = [4,8,16,32]
train_scores, test_scores = validation_curve(
RandomForestClassifier(), X, y, param_name="max_features", param_range=param_range,
cv=10, scoring="accuracy", n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.title("Validation Curve with RandomForest")
plt.xlabel("Max Features")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(param_range, train_scores_mean, label="Training score",
color="darkorange", lw=lw)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2,
color="darkorange", lw=lw)
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
color="navy", lw=lw)
plt.fill_between(param_range, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.2,
color="navy", lw=lw)
plt.legend(loc="best")
plt.show()
# The code below is borrowed from sklearn documentation:
# http://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
digits = load_digits()
X, y = digits.data, digits.target
param_range = [1,10,100,500]
train_scores, test_scores = validation_curve(
RandomForestClassifier(), X, y, param_name="min_samples_leaf", param_range=param_range,
cv=10, scoring="accuracy", n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.title("Validation Curve with RandomForest")
plt.xlabel("Min Samples Leaf")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(param_range, train_scores_mean, label="Training score",
color="darkorange", lw=lw)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2,
color="darkorange", lw=lw)
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
color="navy", lw=lw)
plt.fill_between(param_range, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.2,
color="navy", lw=lw)
plt.legend(loc="best")
plt.show()
I selected 15 news agencies and aggregators with a high level of credibility; the source of the credibility ranking is mediabiasfactcheck.com. Given their high credibility ratings, I will use their articles as "REAL" news to add to the "FAKE" news dataset later.
import newspaper
url = ['http://hpenews.com', 'http://latimes.com', 'http://usatoday.com',
'http://politifact.com/', 'http://chicagotribune.com', 'http://huffingtonpost.com/',
'http://sctimes.com/', 'http://justfactsdaily.com/', 'http://bustle.com/',
'http://rabble.ca/', 'http://justfactsdaily.com/', 'http://timesofsandiego.com/',
'http://postbulletin.com/', 'http://natmonitor.com/','http://politico.com/',
'http://cnn.com', 'http://www.breitbart.com/']
webdata = pd.DataFrame(columns=['url', 'authors', 'title', 'text'])
Below is the news web-scraping loop. I saved it as a Markdown cell so that it does not run again.
for i in url:
    paper = newspaper.build(i)
    for article in paper.articles[:500]:
        time.sleep(1)
        article.download()
        article.parse()
        article_url = article.url
        authors = article.authors
        title = article.title
        text = article.text
        webdata = webdata.append({'url': article_url, 'authors': authors,
                                  'title': title, 'text': text}, ignore_index=True)
webdata = webdata[webdata['title'] != 'Error']
webdata.shape
webdata = webdata[['url', 'title','text']]
webdata.head(2)
I scraped several times across several resources. To avoid keeping duplicates of the same news, below is the check for duplicates.
webdata[webdata.duplicated() == True].shape
webdata.drop_duplicates(keep='first',inplace=True)
webdata.shape
It seems smart to just save the result as a CSV file, since the computer started behaving unreliably.
webdata.to_csv('../capstone_data/scraped_news.csv', sep=',', encoding = 'UTF-8')