Fake News: Exploratory Data Analysis
Stage F of the HamoyeHQ Data Science Internship 2020, which ended a few weeks ago, had me working in a team on an open source project hosted on GitHub. My team collaborated on developing a fake news classifier using datasets hosted on Kaggle. One of my responsibilities in the team was to perform exploratory data analysis (EDA) and visualization on the datasets, and this post shares my findings.
“The rise of fake news during the 2016 U.S. Presidential Election highlighted not only the dangers of the effects of fake news but also the challenges presented when attempting to separate fake news from real news. Fake news may be a relatively new term but it is not necessarily a new phenomenon. Fake news has technically been around at least since the appearance and popularity of one-sided, partisan newspapers in the 19th century. However, advances in technology and the spread of news through different types of media have increased the spread of fake news today”.
Two datasets were merged in the project: Fake news and True news. The Fake news dataset has 17,903 unique entries and the True news dataset has 20,826, each with 8 columns: ‘index’, ‘fake/true’, ‘title’, ‘text’, ‘subject’, ‘date’, ‘year’ and ‘month’.
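A minimal sketch of how two such tables could be labelled and merged with pandas. The tiny in-memory frames and the helper name `merge_news` are illustrative assumptions, not the project's actual code; the real data would be loaded from the Kaggle files instead.

```python
import pandas as pd


def merge_news(fake: pd.DataFrame, true: pd.DataFrame) -> pd.DataFrame:
    """Label each table, stack them, and derive 'year' and 'month' from 'date'."""
    fake = fake.assign(**{"fake/true": "fake"})
    true = true.assign(**{"fake/true": "true"})
    merged = pd.concat([fake, true], ignore_index=True)
    # Dates that fail to parse become NaT instead of raising an error.
    merged["date"] = pd.to_datetime(merged["date"], errors="coerce")
    merged["year"] = merged["date"].dt.year
    merged["month"] = merged["date"].dt.month
    return merged


# Toy stand-ins for the two datasets (hypothetical rows).
fake_df = pd.DataFrame({
    "title": ["Fake headline"],
    "text": ["Some fabricated story."],
    "subject": ["news"],
    "date": ["2016-08-01"],
})
true_df = pd.DataFrame({
    "title": ["Real headline"],
    "text": ["Some verified story."],
    "subject": ["world news"],
    "date": ["2017-09-15"],
})

news = merge_news(fake_df, true_df)
print(news[["fake/true", "subject", "year", "month"]])
```

Keeping the label in its own column means every later comparison between Fake and True news becomes a simple group-by on the merged table.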
EDA revealed that True news covers only two subjects: politics news and world news. Fake news has six: government news, Middle East news, news, US news, left news and politics news. You will agree with me that some of the Fake news subjects are rather ambiguous, for instance ‘news’ and ‘left news’. What exactly does ‘left news’ mean? Of course, news can lean left, right or center, but a political leaning hardly qualifies as a news subject. We assumed that articles under the subject ‘news’ are those that do not belong to any clear-cut subject category.
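A subject breakdown like the one above can be produced with a cross-tabulation. This is only a sketch on a few hypothetical rows; the column names follow the post, but the counts are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the real dataset has thousands of articles.
news = pd.DataFrame({
    "fake/true": ["fake", "fake", "fake", "true", "true"],
    "subject": ["news", "left news", "politics news",
                "politics news", "world news"],
})

# Rows: label, columns: subject, values: number of articles.
subject_counts = pd.crosstab(news["fake/true"], news["subject"])
print(subject_counts)

# Number of distinct subjects that appear under each label.
n_subjects = news.groupby("fake/true")["subject"].nunique()
print(n_subjects)
```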
The campaign and election season in the United States of America typically peaks from around August until November of every election year. Our analysis shows that the fake news in the dataset predates the 2016 campaign season. That is, the propagation of fake news in the USA did not begin with the 2016 elections. In fact, the fake news captured in the data spans 2015 to 2018, while the true news spans 2016 and 2017.
However, between 2015 and 2018, the year that recorded the highest volume of Fake news was 2016. The number of fake news articles rose from about 2,500 in 2015 to about 12,000 in 2016. On these rounded figures, that is an increase of roughly 380%, nearly five times the 2015 volume. We also discovered that even though Fake news increased greatly in 2016, True news did not keep pace. In 2016, at least two of every three articles in the dataset were Fake news.
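The year-over-year change can be checked with simple arithmetic. The sketch below uses the rounded counts quoted above (the exact yearly counts are not given in this post, so the result is only approximate); `percent_increase` is a hypothetical helper written for illustration.

```python
def percent_increase(old: float, new: float) -> float:
    """Return the percentage increase from old to new."""
    if old == 0:
        raise ValueError("old must be non-zero")
    return (new - old) / old * 100.0


# Rounded counts quoted in the post, not exact values.
fake_2015 = 2_500
fake_2016 = 12_000

change = percent_increase(fake_2015, fake_2016)
ratio = fake_2016 / fake_2015
print(f"{change:.0f}% increase, {ratio:.1f}x the 2015 volume")
```

Note that a percentage increase and a multiple are different quantities: a count that grows to 4.8 times its earlier value has increased by 380%, not 480%.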
By 2017, the total number of true news articles had quadrupled while fake news fell by 17%. The most surprising observation was that in 2018 the amount of fake news fell even further, and here I was thinking fake news would keep increasing year after year. Why the reduction? Remember, I said earlier that the fake news count was also low in 2015. This raises the question of whether the surge in fake news in the USA was linked to the elections. Although the chart above shows that the Fake news subject ‘politics’ ranks next to the ambiguous ‘news’ subject, the data available to us does not yet allow a conclusion on whether the increase in fake news in the USA is directly linked to the elections.
Further analysis by month reveals that the total number of Fake news articles was steady from January to December. True news was also steady for a period, even though Fake news articles were twice as many as True news articles. By August, however, the amount of True news suddenly shot up and exceeded that of Fake news. This rise continued until November before declining in December.
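A monthly comparison of this kind could be computed as sketched below. The rows are hypothetical and only illustrate the shape of the calculation, not the actual counts.

```python
import pandas as pd

# Hypothetical rows standing in for the merged dataset.
news = pd.DataFrame({
    "fake/true": ["fake", "fake", "true", "true", "true", "fake"],
    "month": [1, 8, 8, 9, 9, 12],
})

# Rows: month, columns: label, values: number of articles.
monthly = (
    news.groupby(["month", "fake/true"])
    .size()
    .unstack(fill_value=0)
)
print(monthly)

# Months in which True news outnumbers Fake news.
true_leads = monthly[monthly["true"] > monthly["fake"]].index.tolist()
print(true_leads)
```

Filling missing month and label combinations with zero keeps the comparison valid for months in which one of the two labels has no articles.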
This gets quite interesting because the sudden surge in True news happened in the few months leading up to the election, which are also periods of rigorous campaigning, and then dropped again at the end of November (the election month). One would expect ‘politics news’ to be the subject behind the swell, but on a closer look it was ‘world news’. This got me thinking: why would world news increase during an election period, at a time when fake news about the political atmosphere was gaining ground in the USA? Why was the focus of True news on the outside world rather than on the country itself? Could it be because the whole world was so keen on what was happening in the USA at that point in time? Again, these are questions the data we have cannot answer.
To support the observations above, the charts below show the most common words (excluding conjunctions and pronouns) in the title and text columns of both Fake and True news. For Fake news, the most common words were “Trump”, “Video”, “Obama” and “Hillary”. For True news, “Trump” was also prominent, alongside “North Korea”, “White House” and “Russia”.
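Word frequencies like these can be counted with the standard library alone. The sketch below is an assumption about the approach, not the project's actual code: the stop-word list is a small illustrative one, the titles are invented, and `top_words` is a hypothetical helper.

```python
import re
from collections import Counter

# A small illustrative stop-word list (an assumption); a fuller list
# would normally come from an NLP library.
STOP_WORDS = {
    "a", "an", "and", "but", "or", "the", "of", "to", "in", "on",
    "for", "with", "is", "he", "she", "it", "they", "his", "her",
}


def top_words(texts, n=3):
    """Return the n most frequent words across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        # Lowercase so that 'Trump' and 'trump' are counted together.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Invented example titles.
titles = [
    "Trump attacks Obama in new video",
    "Video: Trump and Hillary clash",
    "Trump video goes viral",
]
print(top_words(titles, n=2))
```

Counting single words will not surface two-word phrases such as “North Korea” or “White House”; those would require counting adjacent word pairs instead.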
In conclusion, an important fact about the dataset is how it was collected. Although it comes from real-world sources, the True news articles were obtained by crawling Reuters.com, while the Fake news articles were collected from different websites flagged as unreliable by Wikipedia and PolitiFact. PolitiFact is a fact-checking website in the USA run by the Poynter Institute in St. Petersburg, Florida. It is worth noting that the articles in the “Fake” category could therefore come from practically anywhere.