Fake News Analysis: Natural Language Processing (NLP) using Python

There was a time when it was difficult to find out whether a news story was fake or real.

Every day a lot of news is posted on social media or broadcast on news channels and in newspapers, and it is not easy to identify which stories are fake and which are real. Fake news creates rumours and a lot of discontent. When a story first appears, people have little or no knowledge of the subject. If the story is fake, it can mislead readers or damage a person’s or company’s image; some fake stories are created deliberately for political or financial gain. Fake news can also cause disturbance, discontent, and violence among the masses, with serious consequences such as damage to public property, riots, and mass killing. This has a harmful impact on current and future generations.

Some of the examples are:

1. The top 20 fake news stories about the 2016 US Presidential Election, which received a lot of engagement on social media sites such as Facebook and were spread to gain political favour.

2. In November 2016 the Indian Government implemented demonetization in the country. At that time a fake story circulated about the INR 2000 note. According to this story, every INR 2000 note contained a magnetic chip that could track people who had accumulated the notes in large amounts. It created a lot of discontent and anger among the public.

Data Collection

I collected data from different sources and built a model on that data to predict real and fake news. I am using different packages in Python to build the model. Because text data requires special preparation before it can be used as input to any model, I pre-processed the data as depicted in the flow diagram.


I used the count vectorizer and tf-idf feature extraction methods and implemented passive aggressive and naive Bayes models to predict real and fake news.
The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction on our data.

CountVectorizer for word count

CountVectorizer is used to tokenize a collection of text documents and build a vocabulary of known words. The following steps show how it works.

  1. Create an instance of the CountVectorizer class.
  2. Call the fit() function to learn a vocabulary from one or more documents.
  3. Encode each document as a vector with the transform() function.

Tf-idf Vectorizer

Tf-idf stands for term frequency-inverse document frequency. It is used in information retrieval and text mining to measure how important a word is to a document in a corpus. The importance increases with the number of times a word appears in the document and is offset by the frequency of the word in the corpus. Tf-idf is used for stop-word filtering in various subject fields, including text summarization and classification.

Using tf-idf, we convert a collection of raw documents to a matrix of tf-idf features.

Tf: Term Frequency. It measures how often a word occurs in a document. Because documents differ in length, a word may occur many more times in a long document than in a short one, so the word count is divided by the document length.

Idf: Inverse Document Frequency. It measures how important a word is. When Tf is calculated, all words are given equal importance. However, words such as “is”, “of”, and “that” occur very often yet carry little importance, so Idf gives frequent words a lower weight and rare words a higher one.
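The two definitions above can be worked through by hand. The sketch below assumes the textbook form, tf = count / document length and idf = log(N / document frequency); the three short documents are invented. scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalises each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three made-up, already tokenized documents.
docs = [
    "the note is fake".split(),
    "the news is real".split(),
    "the chip story is fake news".split(),
]

def term_frequency(term, doc):
    # Count of the term divided by the document length.
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, docs):
    # The term must occur in at least one document, otherwise df is zero.
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

score_the = tf_idf("the", docs[0], docs)    # "the" occurs in every document
score_note = tf_idf("note", docs[0], docs)  # "note" occurs in one document only

print(score_the, score_note)
```

A word that appears in every document gets an idf of log(1) = 0, which is why a common word such as "the" ends up with no weight while the rarer "note" scores higher.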

Passive Aggressive Classifier

Passive Aggressive algorithms are online learning algorithms. The classifier is called passive aggressive because it remains passive when a classification is correct but responds aggressively when a classification is incorrect, updating and adjusting its weights. Unlike most other algorithms, it does not converge. Its objective is to make updates that correct the loss while changing the norm of the weight vector as little as possible.
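To make the passive and aggressive behaviour concrete, here is a minimal NumPy sketch of a single update step. It assumes the PA-I variant with hinge loss and labels encoded as +1 and -1; the function name and the parameter C are illustrative and are not the author's code or scikit-learn's implementation.

```python
import numpy as np

def passive_aggressive_update(w, x, y, C=1.0):
    """One PA-I step for a single non-zero example x with label y in {+1, -1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this example
    if loss == 0.0:
        return w                              # passive: margin satisfied, no change
    tau = min(C, loss / np.dot(x, x))         # aggressive: step size, capped by C
    return w + tau * y * x                    # smallest change that reduces the loss

w = np.zeros(2)
x = np.array([1.0, 2.0])

w_after_mistake = passive_aggressive_update(w, x, y=1)                # weights move
w_after_correct = passive_aggressive_update(w_after_mistake, x, y=1)  # weights stay

print(w_after_mistake, w_after_correct)
```

The first call changes the weights because the example is on the wrong side of the margin; the second call leaves them as they are because the same example is now classified with enough margin.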

Naive Bayes Classifiers

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is assumed to be independent of each other, given the class. Naive Bayes can be used for binary (two-class) and multi-class classification problems, and it is easiest to understand when the inputs are binary or categorical. It needs only a small amount of data to estimate the necessary parameters, and it is very fast compared to more sophisticated methods. Bayes’ theorem gives the probability of an event given that another event has already occurred.
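As a small illustration of the idea, the sketch below fits scikit-learn's MultinomialNB on word counts from four invented headlines and asks for the class probabilities of a new one. The texts and labels are assumptions made for the example only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training headlines and labels.
train_texts = [
    "secret chip tracks your money",
    "shocking secret chip found",
    "bank issues new note",
    "government issues new budget",
]
train_labels = ["FAKE", "FAKE", "REAL", "REAL"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)

# Words not seen during training (here "story") are simply ignored.
X_new = vectorizer.transform(["shocking secret chip story"])
predicted = model.predict(X_new)[0]
probabilities = dict(zip(model.classes_, model.predict_proba(X_new)[0]))

print(predicted, probabilities)
```

The classifier multiplies the per-word probabilities for each class, which is where the independence assumption enters, and picks the class with the larger result.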

Here I have used packages such as NumPy, Pandas, CountVectorizer, TfidfVectorizer, PassiveAggressiveClassifier, MultinomialNB, word_tokenize, and matplotlib.pyplot to build the model. After that, I imported the data from my news.csv file.

import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from nltk import word_tokenize
import matplotlib.pyplot as plt
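The import of news.csv mentioned above can be sketched as follows. The layout of the author's file is not shown, so the sketch reads a small in-memory stand-in instead: Headline_Category and label are column names used later in this post, while date and text are assumed names and all rows are invented.

```python
import io
import pandas as pd

# Stand-in for news.csv so the sketch runs on its own. With the real file,
# pd.read_csv("news.csv") would be used instead. The rows are made up.
sample_csv = io.StringIO(
    "date,Headline_Category,text,label\n"
    "2019-01-05,politics,Minister denies chip in new note,REAL\n"
    "2019-01-05,business,Secret chip tracks your cash,FAKE\n"
    "2019-02-10,politics,Election results announced,REAL\n"
)
df = pd.read_csv(sample_csv, parse_dates=["date"])

print(df.shape)
print(df["label"].value_counts())
```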

Data Exploration

In this part I describe each variable I need to analyse for my model.

I generated a line plot with date values on the x-axis to display how many news items arrived per day.
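The per-day counts behind such a line plot can be computed with pandas. This is a sketch on invented rows; the column name date is an assumption.

```python
import pandas as pd

# Made-up rows; "date" is an assumed column name.
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-05", "2019-01-05", "2019-01-06", "2019-01-08"]),
    "label": ["REAL", "FAKE", "REAL", "REAL"],
})

# Number of news items per day, ordered by date.
news_per_day = df["date"].value_counts().sort_index()
print(news_per_day)

# news_per_day.plot(kind="line") would draw the chart with matplotlib.
```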

After that I created a column chart to show the news categories and their frequencies. This helps to identify which categories contain more fake news.

Finding the number of real and fake news items for each category

Once the news was grouped by category, I created pivot reports of Headline_Category against label to find the counts of real and fake news.

After generating the column chart, I calculated what percentage of the data is real news and what percentage is fake. A pie chart shows this split clearly.
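The category counts and the overall percentages can be sketched with pandas as below. The rows are invented, and pd.crosstab is used here as one way to build such a pivot; the author's own pivot code is not shown.

```python
import pandas as pd

# Made-up rows using the two column names mentioned in the text.
df = pd.DataFrame({
    "Headline_Category": ["politics", "politics", "business", "sports", "business"],
    "label": ["FAKE", "REAL", "REAL", "REAL", "FAKE"],
})

# Rows are categories, columns are labels, cells are counts.
category_counts = pd.crosstab(df["Headline_Category"], df["label"])

# Share of each label in the whole data set, in percent.
label_share = df["label"].value_counts(normalize=True) * 100

print(category_counts)
print(label_share)
```

The percentages in label_share are the values a pie chart of real versus fake news would display.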

I generated a pie chart showing which months produce the most fake news. After getting each month’s fake news count, I drew a column chart for each month. I also drew line charts for fake news and real news.

This is the REAL News Graph.

This is the Fake News Graph.

I split the news data set into train (80%) and test (20%) sets.
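An 80/20 split of this kind can be sketched with train_test_split. The data below is a placeholder, and both random_state and stratify are assumptions added here for reproducibility and balanced classes; the post does not say whether the author used them.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: ten made-up headlines with alternating labels.
texts = [f"headline {i}" for i in range(10)]
labels = ["FAKE", "REAL"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # assumed seed, only to make the split repeatable
    stratify=labels,    # assumed: keep the FAKE/REAL ratio in both sets
)

print(len(X_train), len(X_test))
```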

After creating the train and test datasets, I plotted column charts of them, which gave me the fake and real news counts in each.

Feature Extraction Method 1: Tf-idf

First I used Feature Extraction Method 1, the Tf-idf method, and created two models:

  • Passive_Aggressive_Classifier
  • Naïve Bayes Classifier

Using Passive_Aggressive_Classifier I got 89.5% accuracy.

Second, using the Naïve Bayes Classifier I got 94.25% accuracy.
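The workflow described above can be sketched as follows. The headlines are invented, so the accuracies it prints say nothing about the figures reported in this post; the stop_words and random_state settings are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up training and test headlines.
train_texts = [
    "secret chip hidden in new currency note",
    "shocking cure doctors do not want you to know",
    "celebrity secretly replaced by a clone",
    "central bank issues new currency note",
    "parliament passes the annual budget",
    "city council opens new public library",
]
train_labels = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
test_texts = ["secret chip found in bank note", "council passes library budget"]
test_labels = ["FAKE", "REAL"]

# Learn the tf-idf vocabulary on the training set only, then reuse it.
tfidf = TfidfVectorizer(stop_words="english")
X_train = tfidf.fit_transform(train_texts)
X_test = tfidf.transform(test_texts)

models = {
    "PassiveAggressiveClassifier": PassiveAggressiveClassifier(random_state=0),
    "MultinomialNB": MultinomialNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(X_test))

print(scores)
```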

Feature Extraction Method 2: CountVectorizer

Next I used Feature Extraction Method 2, the CountVectorizer method, and again created two models:

  • Passive_Aggressive_Classifier
  • Naïve Bayes Classifier

With Passive_Aggressive_Classifier I got 88.38% accuracy.

With the Naïve Bayes Classifier I got 94.15% accuracy.
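The same two models can be paired with CountVectorizer. The sketch below uses scikit-learn's make_pipeline to bundle the vectorizer and the classifier, which is a design choice made for this example and not necessarily how the author wired them together; the headlines are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up headlines and labels.
texts = [
    "secret chip hidden in new currency note",
    "celebrity secretly replaced by a clone",
    "central bank issues new currency note",
    "parliament passes the annual budget",
]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

# Each pipeline counts words first and then classifies the count vectors.
pa_pipeline = make_pipeline(CountVectorizer(), PassiveAggressiveClassifier(random_state=0))
nb_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

for pipeline in (pa_pipeline, nb_pipeline):
    pipeline.fit(texts, labels)

new_headline = ["secret chip in the note"]
pa_prediction = pa_pipeline.predict(new_headline)
nb_prediction = nb_pipeline.predict(new_headline)

print(pa_prediction, nb_prediction)
```

A pipeline accepts raw text directly, so the same fitted object can be used for both training and prediction without calling transform() by hand.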


I analysed the data with two feature extraction methods: TF-IDF and CountVectorizer. For each method I built two models, Passive_Aggressive_Classifier and Naïve Bayes Classifier.

With the TF-IDF method, Passive_Aggressive_Classifier reached 89.5% accuracy and the Naïve Bayes Classifier reached 94.25%.

With the CountVectorizer method, Passive_Aggressive_Classifier reached 88.38% accuracy and the Naïve Bayes Classifier reached 94.15%.

With both methods the Naïve Bayes Classifier gave higher accuracy, but it does not predict fake news.

With both methods Passive_Aggressive_Classifier gives lower accuracy than the Naïve Bayes Classifier, but it is able to predict fake news.
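A classifier can score well on accuracy and still never predict the fake class, which a confusion matrix makes visible. The labels below are invented to show the effect and are not the author's results: a hypothetical model that always answers REAL is evaluated on nine real stories and one fake one.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up ground truth: nine real stories and one fake one.
y_true = ["REAL"] * 9 + ["FAKE"]
# A hypothetical model that always answers REAL.
y_pred = ["REAL"] * 10

accuracy = accuracy_score(y_true, y_pred)

# Rows are the true labels, columns the predicted labels, in the order given.
matrix = confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])

print(accuracy)
print(matrix)
```

The accuracy is 90%, yet the FAKE column of the matrix is all zeros: the one fake story was missed. Checking the confusion matrix alongside accuracy is therefore a reasonable basis for preferring a model that actually detects fake news.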

So I chose the TF-IDF method with Passive_Aggressive_Classifier for predicting fake news; this model has an accuracy of 89.5%.


  1. https://twitter.com/rsprasad/status/799874766991523840?lang=sv

Written by:

Amit Kumar Verma:

LinkedIn ID: https://www.linkedin.com/in/amit-kumar-verma-20070b189

Amit Kumar Verma holds a B.Tech in Computer Science from AVIT College, Chennai. He currently works as a Data Scientist with NikhilGuru Consulting Service LLP (Nikhil Analytics), Bangalore.
