INTRODUCTION
In the digital age, online forums have emerged as vibrant hubs of conversation, ideas, and community engagement. Nairaland.com, Africa’s largest online forum, stands as a testament to this, hosting tons of discussions that span various topics relevant to Nigeria, the African continent and beyond. However, the exponential growth of the forum also brings a significant challenge: the need to efficiently manage and categorize the vast amount of user-generated content, while simultaneously reducing the manual workload of moderators tasked with overseeing the diverse range of topics.
This article delves into the process of developing and fine-tuning a machine learning model that is efficient and culturally aware. This model is set to transform the way content is organized and accessed on the platform. By the end, readers will gain insight into the technical aspects of building a state-of-the-art text classifier for a unique platform like Nairaland, and will also appreciate the wider impact of this technology in enhancing digital communication across diverse communities.
DATA COLLECTION
The foundation of any machine learning model lies in the quality and relevance of its data. For the intelligent text classifier designed for Nairaland, this step is critical. The data collection process involves meticulously gathering the necessary information that reflects the diverse and dynamic nature of the forum’s content. This phase is not just about quantity but focuses on obtaining data that can impart meaningful insights and patterns for the classifier to learn from.
Web Scraping and Disclaimer
Web scraping is a powerful tool in data science, used to extract large amounts of data from websites. In this project, web scraping is employed to collect relevant information from Nairaland. It’s important to note that web scraping must be conducted responsibly and ethically. The methods demonstrated here are for educational purposes only and should not be used without considering the legal and ethical implications, particularly respecting the terms of service of the website in question.
Approach to Scraping
The approach to scraping for this project is methodical and tailored to the structure of Nairaland.com. Given the site’s vast and varied content, the scraping is conducted in two distinct parts. The first part focuses on gathering links to articles featured on the forum’s frontpage, as these represent the most engaging and quality-driven content. The second part involves iterating over these links to scrape detailed information from each post. This dual-part strategy ensures not only the acquisition of high-quality data but also the capture of a comprehensive range of attributes for each post, including title, body, comments, and more. The objective is to create a rich dataset that provides a deep understanding of the content dynamics on the forum.
Detailed Explanation of the Scraping Process
Part One: Gathering Links
The first part of the scraping process is focused on collecting the URLs of posts that have made it to the frontpage of Nairaland.com. This is a crucial step as the frontpage typically features the most relevant and engaging content, which is indicative of the quality that the model aims to learn.
- Method: The script iterates through a sequence of pages on the Nairaland news section.
- Tools Used: A combination of the cloudscraper library and BeautifulSoup is employed. cloudscraper is essential here to bypass Cloudflare’s anti-bot measures, ensuring uninterrupted access to the site’s data.
- Data Extraction: For each post, the script extracts the URL, along with the time and date of posting. Regular expressions (re) are used to clean and format the URLs correctly.
- Data Storage: Extracted data is periodically saved to a CSV file. This ensures that the data is not lost in case of any interruptions during the scraping process.
- Efficiency Measures: The script includes a delay mechanism (time.sleep) to prevent overwhelming the server and to mimic human browsing patterns, reducing the risk of being blocked.
!pip install cloudscraper

from bs4 import BeautifulSoup
import cloudscraper
import csv
import re
import time
import pandas as pd
import random

# Use a single path so the header row and the data rows land in the same file
links_path = '/content/drive/MyDrive/links_file.csv'
list_of_links = []

# Write the headers to the CSV file (names match the columns read in Part Two)
with open(links_path, mode='w', newline='', encoding='utf-8') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Url', 'Time Posted', 'Date Posted'])

for i in range(0, 150, 1):
    url = f"https://www.nairaland.com/news/{i}"
    scraper = cloudscraper.create_scraper(interpreter='js2py')
    raw_html = scraper.get(url).text
    soup = BeautifulSoup(raw_html, "html.parser")

    # Select every link that directly follows an <hr> element
    post_elements = soup.select("hr + a")
    for elem in post_elements:
        post_link = elem["href"]
        # Drop a trailing "/<digits>" page number from the link
        post_link = re.sub(r"/\d+(?=#|$)", "", post_link)
        time_and_date = elem.find_next_siblings("b")
        if len(time_and_date) < 2:
            continue  # skip entries that lack a time and a date
        post_time = time_and_date[0].text
        post_date = time_and_date[1].text
        # Entries without an explicit year are assumed to be from 2023
        if len(time_and_date) > 2 and time_and_date[2].text.isdigit():
            post_year = time_and_date[2].text
        else:
            post_year = '2023'
        list_of_links.append([post_link, post_time, f"{post_date}, {post_year}"])

    # Sleep for a random time between 2 and 5 minutes
    time.sleep(random.uniform(120, 300))

    # Save this page's links, then empty the buffer
    with open(links_path, mode='a', newline='', encoding='utf-8') as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerows(list_of_links)
    list_of_links.clear()
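To see what the URL-cleaning regular expression does in isolation, here is a minimal sketch. The helper name strip_page_suffix and the example links are made up for illustration; only the pattern itself comes from the script above.

```python
import re

def strip_page_suffix(link: str) -> str:
    """Remove a trailing '/<digits>' page number, keeping any '#anchor'."""
    return re.sub(r"/\d+(?=#|$)", "", link)

# Hypothetical links, shaped like forum topic URLs
print(strip_page_suffix("https://www.nairaland.com/1234567/some-topic/2"))
print(strip_page_suffix("https://www.nairaland.com/1234567/some-topic/2#98765"))
print(strip_page_suffix("https://www.nairaland.com/1234567/some-topic"))
```

The first two links lose their "/2" page number, while the third is returned unchanged. Note that a link ending in a bare numeric segment would also be trimmed, so the pattern relies on topic links ending in a text slug.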
Part Two: Extracting Post Details
The second part involves visiting each URL gathered in Part One and scraping detailed information from each post.
- Method: The script iterates over the list of collected URLs, accessing each post’s content.
- Detailed Data Extraction: For each post, the script extracts several pieces of information: the article title, body, category, number of comments, total views, and the actual comments themselves.
- Custom Functions: Several functions (get_title, get_body, get_comments_and_count, get_total_views) are defined to handle the extraction of different data elements from the posts. This modular approach allows for more manageable and readable code.
- Handling Pagination: The comments section of a post can span multiple pages. The script navigates through these pages (if any) to collect all comments.
- Data Consolidation: The extracted data, including the title, body, comments, and other details, are consolidated into a single record for each post and saved to a CSV file. This creates a structured dataset ready for analysis and modeling.
- Error Handling: The script includes error handling to deal with any issues encountered while scraping individual posts. This ensures that the process continues smoothly even if some posts are not accessible or present unexpected formats.
Through these two carefully structured parts, a comprehensive dataset is created, capturing the essence and diversity of content on Nairaland.com. This dataset forms the backbone of the subsequent machine learning model development.
def get_title(soup):
    # Split the <h2> text on ' - ': the first part is used as the title,
    # the second-to-last part as the category
    heading = soup.find('h2')
    if heading is None:
        return 'PLACEHOLDER', 'PLACEHOLDER'
    title_parts = heading.text.strip().split(' - ')
    category = title_parts[-2] if len(title_parts) >= 2 else 'PLACEHOLDER'
    return title_parts[0], category

def get_total_views(soup):
    # Look for text such as "(12,345 Views)" and return the number as an int
    views_element = soup.find('p', class_='bold')
    if views_element:
        views_text = views_element.text.strip()
        views_match = re.search(r'\((\d{1,3}(,\d{3})*|\d+)\s*Views\)', views_text)
        if views_match:
            return int(views_match.group(1).replace(',', ''))
    return 0

def get_body(soup):
    # The first 'narrow' div on the page is taken as the post body
    body = soup.find('div', {'class': 'narrow'})
    return body.text.strip() if body is not None else 'PLACEHOLDER'

def get_comments_and_count(soup, scraper, url):
    all_comments = []
    previous_comments = None
    page = 0
    while True:
        current_url = f"{url}/{page}" if page > 0 else url
        raw_html = scraper.get(current_url).text
        soup = BeautifulSoup(raw_html, 'html.parser')
        comments = []
        for div in soup.find_all('div', class_='narrow'):
            # Drop quoted text so only the commenter's own words remain
            if div.blockquote:
                div.blockquote.extract()
            comments.append(div.text.strip())
        # Stop once a page repeats the previous one, i.e. there are no more pages
        if comments == previous_comments:
            break
        all_comments.extend(comments)
        previous_comments = comments
        page += 1
    comments_joined = ';;;'.join(all_comments)
    comment_count = len(all_comments)
    return comments_joined, comment_count

# Load the links gathered in Part One (path assumed to match the file written there)
links_file = pd.read_csv('/content/drive/MyDrive/links_file.csv')

# Define the filename and the header of the CSV file
filename = '/content/drive/MyDrive/Nairaland Dataset Part16.csv'
header = ['Title', 'Body', 'Category', 'Comments', 'Comment Count', 'Time Posted', 'Date Posted', 'Total Views', 'Url']

# Loop through the list of URLs and write the data to the CSV file
with open(filename, mode='a', encoding='utf-8', newline='') as file:
    writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # In append mode the header is written again on every run;
    # the repeated header rows are removed later during data cleaning
    writer.writerow(header)
    counter = 0  # Counts the number of iterations
    for index, row in links_file.iterrows():
        url = row['Url']
        post_time = row['Time Posted']
        post_date = row['Date Posted']
        scraper = cloudscraper.create_scraper(interpreter='js2py')
        try:
            raw_text = scraper.get(url).text
            soup = BeautifulSoup(raw_text, 'html.parser')
            Title, Category = get_title(soup)
            Body = get_body(soup)
            comments_joined, comment_count = get_comments_and_count(soup, scraper, url)
            total_views = get_total_views(soup)
            data = [Title, Body, Category, comments_joined, comment_count, post_time, post_date, total_views, url]
            writer.writerow(data)
        except Exception as e:
            # Report the failure and carry on with the next URL
            print(f"Error occurred while processing {url}: {e}")
        # Add delay after every 3 iterations to avoid overloading the server
        if counter % 3 == 0 and counter != 0:
            print(f"Pausing for 10 seconds after {counter} iterations")
            time.sleep(10)
        counter += 1
The scraped data covers news articles from November 2005 to October 2023. It comprises over 319,000 entries across nine columns: Title, Body, Category, Comments, Comment Count, Time Posted, Date Posted, Total Views, and URL.
LOADING OF DATA
The data was loaded using the pandas read_csv() function. A quick inspection with df.info() revealed rows with null values in a few of the columns; these are handled during the data cleaning phase.
DATA CLEANING
The data cleaning process began with the removal of duplicate header rows, a necessary step to ensure the integrity and accuracy of the dataset. This was followed by an assessment of missing values, a common challenge in real-world data. The dataset was then refined by dropping rows that had null values in essential columns like Title, Body, and Category, as these rows held limited value for meaningful analysis. Additionally, rows where the Category was set to ‘PLACEHOLDER’—a marker for unsuccessfully scraped articles—were also removed. This step was crucial in maintaining the quality and relevance of the data for the classifier.
1. Dropping duplicates
2. Removing duplicate header rows
3. Handling missing values: the rows with missing values in the vital columns are dropped
4. Removing rows of articles set to PLACEHOLDER
Further cleaning involved addressing the ‘Comment Count’ and ‘Total Views’ columns. Missing values in these columns were filled with zeros, and the data was converted to numeric format to facilitate accurate analysis. The final step in this cleaning process was converting these columns to integer data types, ensuring consistency and ease of interpretation in subsequent analyses. It is important to note that this article covers only a subset of the data cleaning steps implemented. The comprehensive process, detailed in the project’s notebook, includes additional layers of cleaning and preprocessing to refine the dataset further. Readers interested in a deeper dive into these steps can refer to the notebook linked at the end of this article, providing a more exhaustive view of the data preparation journey for this ambitious project.
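The cleaning steps described above can be sketched on a tiny in-memory CSV. The rows are invented for illustration and the exact calls are an assumption about how the steps might be written with pandas, not a copy of the project's notebook. Here the count columns are coerced to numbers first and the resulting gaps filled with zero, which gives the same result as filling and then converting.

```python
import io
import pandas as pd

# Tiny stand-in for the scraped CSV (hypothetical rows); note the repeated header row
raw_csv = io.StringIO(
    "Title,Body,Category,Comment Count,Total Views\n"
    "Fuel price rises,Some text,Politics,12,3400\n"
    "Title,Body,Category,Comment Count,Total Views\n"
    "Fuel price rises,Some text,Politics,12,3400\n"
    "PLACEHOLDER,PLACEHOLDER,PLACEHOLDER,0,0\n"
    "New phone review,,Phones,3,150\n"
    "Match report,Some text,Sports,,\n"
)
df = pd.read_csv(raw_csv)

df = df.drop_duplicates()                             # 1. exact duplicate rows
df = df[df["Title"] != "Title"]                       # 2. repeated header rows
df = df.dropna(subset=["Title", "Body", "Category"])  # 3. nulls in vital columns
df = df[df["Category"] != "PLACEHOLDER"]              # 4. unsuccessfully scraped posts

# Coerce the count columns to numbers, fill gaps with zero, then cast to int
for col in ["Comment Count", "Total Views"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)

print(df)
```

Only the "Fuel price rises" and "Match report" rows survive, and the missing counts on the latter become zeros.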
EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis (EDA) is a critical step in machine learning, offering initial insights through visual and statistical examination of the data. It’s essential for identifying patterns, anomalies, and determining the right preprocessing strategies. However, for the sake of brevity, this article will discuss only a few key aspects of the EDA performed. Readers interested in a more comprehensive analysis can refer to the full EDA detailed in the project’s notebook linked at the end of the article.
1. Number of News Article in the Dataset
Output:
There are a total of 283122 news articles on Nairaland front page
2. Number of Distinct Categories in the News Article
Output:
There are 61 distinct categories of News/Articles on the Frontpage
3. Category Distribution
Output:
Politics: 97183
Celebrities: 38645
Crime: 21497
Romance: 12387
Sports: 11636
Education: 9761
Religion: 9046
Travel: 7621
Business: 7459
Family: 7327
Health: 6392
Jobs/Vacancies: 5005
Phones: 4225
Car Talk: 3876
Career: 3809
Foreign Affairs: 3358
TV/Movies: 3358
European Football (EPL, UEFA, La Liga): 3001
General: 2836
Culture: 2340
Music/Radio: 2233
Islam for Muslims: 2154
Fashion: 1968
Properties: 1892
Food: 1883
Literature: 1590
Webmasters: 1586
NYSC: 1461
Events: 1327
Science/Technology: 1066
Jokes Etc: 874
Agriculture: 712
Investment: 613
Computers: 531
Art, Graphics & Video: 438
Pets: 390
Programming: 306
Autos: 300
Forum Games: 255
Gaming: 245
Entertainment: 162
Poems For Review: 87
Technology Market: 68
Dating And Meet-up Zone: 55
Adverts: 50
Satellite TV Technology: 30
Phone/Internet Market: 19
Music Business: 13
Rap Battles: 13
Fashion/Clothing Market: 7
Software/Programmer Market: 6
Business To Business: 5
Certification And Training Adverts: 4
Educational Services: 3
Travel Ads: 3
Nairaland Ads: 3
Graphics/Video Market: 2
Computer Market: 2
Web Market: 2
Top Pages: 1
Literature/Writing Ads: 1
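Counts like the three outputs above can be produced with a few pandas calls. The sketch below uses a made-up six-row DataFrame, and the variable names are illustrative rather than taken from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the cleaned dataset
df = pd.DataFrame({
    "Category": ["Politics", "Politics", "Crime", "Sports", "Politics", "Crime"],
})

n_articles = len(df)
n_categories = df["Category"].nunique()
distribution = df["Category"].value_counts()  # sorted from most to least frequent

print(f"There are a total of {n_articles} news articles")
print(f"There are {n_categories} distinct categories")
for category, count in distribution.items():
    print(f"{category}: {count}")
```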
4. Top 10 Categories (According to Number of Published News Articles)
The distribution of articles across categories is analyzed, focusing on the 10 most frequent categories, and the result is visualized using a bar plot.
5. Bottom 10 Categories (According to number of Published News Article)
Output:
6. Total Views Garnered on all the Articles
Output:
Total views of all the news articles on frontpage: 9,244,902,581
7. Categories with the Most Total Views
8. Categories with the least Total Views
Output:
Category
Fashion/Clothing Market 245,068 Views
Rap Battles 234,308 Views
Educational Services 211,273 Views
Software/Programmer Market 39,756 Views
Certification And Training Adverts 34,478 Views
Top Pages 17,041 Views
Computer Market 13,516 Views
Web Market 11,891 Views
Graphics/Video Market 10,255 Views
Literature/Writing Ads 0 Views
Name: Total Views, dtype: object
9. Top 10 Categories (According to Average views per Post)
Category
Adverts 2,015,243 Views Per frontpage post
Satellite TV Technology 292,494 Views Per frontpage post
Nairaland Ads 113,913 Views Per frontpage post
Travel Ads 97,794 Views Per frontpage post
Business To Business 94,325 Views Per frontpage post
Educational Services 70,424 Views Per frontpage post
Investment 61,920 Views Per frontpage post
Technology Market 55,867 Views Per frontpage post
TV/Movies 55,151 Views Per frontpage post
Literature 50,677 Views Per frontpage post
Name: Total Views, dtype: object
10. Bottom 10 Categories (According to Average Views per Post)
Output:
Average Lowest Views Per Category
Category
Programming 16,475 Views per Frontpage Post
Islam for Muslims 15,355 Views per Frontpage Post
Webmasters 11,605 Views per Frontpage Post
Certification And Training Adverts 8,620 Views per Frontpage Post
Poems For Review 8,588 Views per Frontpage Post
Computer Market 6,758 Views per Frontpage Post
Software/Programmer Market 6,626 Views per Frontpage Post
Web Market 5,946 Views per Frontpage Post
Graphics/Video Market 5,128 Views per Frontpage Post
Literature/Writing Ads 0 Views per Frontpage Post
Name: Total Views, dtype: object
11. Descriptive Statistics of Post Views
Output:
count 283122.00
mean 32653.42
std 111693.01
min 0.00
25% 15856.25
50% 27178.00
75% 41725.00
max 26157898.00
Name: Total Views, dtype: float64
12. Box Plot Visualization of Post Views
A box plot is a type of visualization that helps detect outliers in data. The descriptive statistics above show that the average number of views is 32,653, yet the box plot shows that a few posts have extremely high view counts, which pull the average upward. In this case, the median (27,178) is a better measure of central tendency.
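A short worked example shows why a handful of extreme values moves the mean far more than the median. The view counts below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical view counts: five typical posts plus one viral outlier
views = [15_000, 22_000, 27_000, 31_000, 42_000, 26_000_000]

print(f"mean:   {mean(views):,.0f}")
print(f"median: {median(views):,.0f}")
```

The single outlier lifts the mean into the millions, while the median stays among the typical posts.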
13. News Article with the highest Views (Category by Category)
14. Trends of Average Views over the Years
It is not surprising that 2020 is the year with the highest average views per frontpage post. This is most likely due to the COVID-19 lockdown.
COMMENT ANALYSIS
15. The Total Number of Comments in all News Articles
Total number of comments on all frontpage articles: 26,996,332
16. Overall the Top 20 Posts with the highest number of Comments
17. Box Plot Visualization of Overall Comments
18. Comment Distribution Histogram
From the histogram above, it is clear that over 90% of the posts have fewer than 300 comments, with the peak at around 60 to 100 comments.
ANALYSIS ON DATE OF ARTICLE PUBLISHED
19. Number of Article Published Per Year
Results:
Total Number of Articles Published by Year
2005: 597 articles
2006: 1943 articles
2007: 1537 articles
2008: 704 articles
2009: 2293 articles
2010: 3166 articles
2011: 6498 articles
2012: 12211 articles
2013: 14181 articles
2014: 6769 articles
2015: 27826 articles
2016: 36102 articles
2017: 41202 articles
2018: 45187 articles
2019: 42206 articles
2020: 12612 articles
2021: 8974 articles
2022: 7778 articles
2023: 11303 articles
20. Number of News Articles Published According to Months
Results
Total Number of Articles Published by Month
Jan: 25531 articles
Feb: 21059 articles
Mar: 22118 articles
Apr: 25052 articles
May: 22509 articles
Jun: 22978 articles
Jul: 26463 articles
Aug: 23401 articles
Sep: 24940 articles
Oct: 22222 articles
Nov: 23396 articles
Dec: 23420 articles
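Per-year and per-month totals like these can be derived by parsing the date strings and counting. This is a minimal sketch with made-up dates; the "Mon DD, YYYY" layout is an assumption based on how the scraper assembles the Date Posted value.

```python
from collections import Counter
from datetime import datetime

# Hypothetical values; the "Mon DD, YYYY" layout is an assumption
date_posted = ["Nov 14, 2005", "Jan 03, 2019", "Jan 21, 2019", "Jul 09, 2020"]

parsed = [datetime.strptime(d, "%b %d, %Y") for d in date_posted]

per_year = Counter(d.year for d in parsed)
per_month = Counter(d.strftime("%b") for d in parsed)

for year in sorted(per_year):
    print(f"{year}: {per_year[year]} articles")
print(per_month.most_common())
```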
21. HeatMap Visualization of the Number of Article Published Month by Month
22. Distribution of Posts by Hour of the Day
23. Relationship Between the Time of Publish vs the Number of Comment a post Received
24. Relationship Between time of publish vs Total View a post garnered
ANALYSIS OF NEWS ARTICLE SOURCES
25. Percentage of News Articles with/without Source Url
Output:
Percentage of posts with Source URL: 75.46%
Percentage of posts without Source URL: 24.54%
26. Top Cited Websites
Output:
www.youtube.com 12029
www.vanguardngr.com 8593
punchng.com 6580
www.instagram.com 6205
www.nationalhelm.co 5052
www.trezzyhelm.com 4812
thenationonlineng.net 3951
www.punchng.com 3946
twitter.com 3846
dailypost.ng 3495
www.premiumtimesng.com 3136
www.dailytrust.com.ng 2994
gistmore.com 2656
saharareporters.com 2630
www.lailasblog.com 2483
mobile.twitter.com 2390
www.google.com 2246
www.nairaland.com 2085
m.facebook.com 1998
www.nationalhelm.net 1965
Name: Domain, dtype: int64
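One way to obtain source domains like those listed above is to find the first URL in each post body and keep its host. The function name first_domain, the URL pattern, and the sample bodies are assumptions for illustration, not the notebook's code.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")

def first_domain(body: str):
    """Return the host of the first URL in the text, or None if there is none."""
    match = URL_PATTERN.search(body)
    return urlparse(match.group(0)).netloc if match else None

# Hypothetical post bodies
bodies = [
    "Full story at https://punchng.com/some-article",
    "Watch: https://www.youtube.com/watch?v=abc123",
    "No source given here.",
    "Source: https://www.punchng.com/another-article",
]

domains = [first_domain(b) for b in bodies]
with_source = sum(d is not None for d in domains)

print(f"Percentage of posts with Source URL: {100 * with_source / len(bodies):.2f}%")
print(Counter(d for d in domains if d).most_common())
```

Hosts with and without the "www." prefix are counted separately here, which matches the list above, where punchng.com and www.punchng.com appear as distinct entries.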
27. Percentage of Posts containing Source Url (By Category)
Output:
Category
Crime 93.366516
Celebrities 92.137478
Foreign Affairs 86.148347
Politics 85.646662
Sports 78.581865
Travel 76.220472
Health 75.391114
Science/Technology 75.328330
Business 73.233204
Entertainment 72.839506
Events 70.610399
TV/Movies 69.446099
Music/Radio 69.086022
Car Talk 66.847265
Nairaland Ads 66.666667
Culture 66.082121
Education 63.018752
Investment 59.706362
Agriculture 59.410112
NYSC 59.069131
Religion 58.993919
Fashion 58.333333
Pets 55.641026
Phones 54.924242
Properties 52.642706
Web Market 50.000000
Career 49.317227
Webmasters 49.054224
Jobs/Vacancies 46.433566
Art, Graphics & Video 44.977169
Autos 44.666667
Family 44.370138
Food 43.706851
Romance 43.008235
General 42.630465
Islam for Muslims 41.179201
Satellite TV Technology 40.000000
Phone/Internet Market 36.842105
European Football (EPL, UEFA, La Liga) 33.566900
Travel Ads 33.333333
Computers 33.145009
Literature 30.943396
Gaming 28.979592
Technology Market 26.470588
Programming 25.163399
Certification And Training Adverts 25.000000
Jokes Etc 21.878580
Rap Battles 15.384615
Fashion/Clothing Market 14.285714
Adverts 14.000000
Music Business 7.692308
Forum Games 7.058824
Poems For Review 6.976744
Dating And Meet-up Zone 3.636364
Business To Business NaN
Computer Market NaN
Educational Services NaN
Graphics/Video Market NaN
Literature/Writing Ads NaN
Software/Programmer Market NaN
Top Pages NaN
dtype: float64
TEXTUAL ANALYSIS
In the “Textual Analysis” subsection, I will delve into the intricacies of the dataset’s language through N-GRAM analysis and word cloud visualization. This approach allows us to uncover the most prevalent words and phrases, providing a visual and quantitative representation of the key themes and topics emerging from the forum’s discussions. This insight is vital for understanding the dominant textual patterns that characterize the content on Nairaland.com.
To begin, we first need to define a function that cleans the text to make it suitable for analysis.
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Requires the NLTK 'stopwords' and 'wordnet' resources (see nltk.download)

# Built once and reused on every call
wn = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
non_informative_words = {'was', 'has', 'said', 'like', 'one', 'dey', 'na', 'will', 'wey', 'and', 'to', 'of'}
punct_tag = re.compile(r'[^\w\s]')
emoji_clean = re.compile("["
                         u"\U0001F600-\U0001F64F"  # emoticons
                         u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                         u"\U0001F680-\U0001F6FF"  # transport & map symbols
                         u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                         u"\U00002702-\U000027B0"
                         u"\U000024C2-\U0001F251"
                         "]+", flags=re.UNICODE)

def clean_text(data):
    if pd.isna(data):
        return ""
    data = str(data)  # Convert data to string
    data = data.lower()
    # Remove URLs
    data = re.sub(r'http\S+|www\S+|https\S+', '', data, flags=re.MULTILINE)
    # Remove HTML tags
    data = re.sub(r'<.*?>', '', data)
    # Remove punctuations and emojis
    data = punct_tag.sub(r'', data)
    data = emoji_clean.sub(r'', data)
    # Remove non-informative words
    words = data.split()
    data = ' '.join([word for word in words if word.lower() not in non_informative_words])
    # Remove standalone numbers, keeping any four-digit number (presumed to be a year)
    data = re.sub(r'\b(?!\d{4}\b)\d+\b', '', data)
    # Lemmatization
    lemmatized_words = []
    for w in data.split():
        if w.lower() == "lagos":  # Keep "Lagos" as-is, without lemmatizing
            lemmatized_words.append(w)
        else:
            lemmatized_words.append(wn.lemmatize(w))
    data = ' '.join(lemmatized_words)
    # Remove stopwords
    words = data.split()
    data = ' '.join([word for word in words if word.lower() not in stop_words])
    return data

data['Title'] = data['Title'].apply(clean_text)
data['Body'] = data['Body'].apply(clean_text)
The provided code block is a text cleaning function, clean_text(), designed to preprocess and sanitize text data. Here are the main cleaning steps performed:
- Conversion to Lowercase: The text is converted to lowercase to ensure uniformity, making the data case-insensitive.
- Removing URLs: All types of URLs are removed from the text. This step cleans out web addresses, which are typically not informative for text analysis.
- Removing HTML Tags: Any HTML tags present in the text are stripped out. This is crucial for texts scraped from web pages.
- Removing Punctuations and Emojis: The code removes all punctuation marks and emojis. These elements are often not useful for standard text analysis and can introduce noise.
- Filtering Out Non-Informative Words: A set of specified non-informative words (like common verbs and prepositions) are removed. Additionally, specific colloquial terms relevant to the dataset’s context (e.g., ‘dey’, ‘na’) are also excluded.
- Removing Numbers (Except Years): All standalone numbers are removed, except for four-digit numbers, which are presumed to be years.
- Lemmatization: The text is lemmatized, which involves converting words to their base or dictionary form. Notably, the word “Lagos” is kept as-is, indicating a specific treatment for this term.
- Removing Stopwords: Common stopwords from the English language are removed. These are typically high-frequency words that don’t contribute to the specific meaning of the text.
This function collectively cleans and standardizes the text data, making it more suitable for further analysis such as machine learning or natural language processing tasks.
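The number-removal step is the least obvious of these, so here is the same pattern applied on its own. The wrapper name drop_numbers_except_years and the sample sentences are illustrative only.

```python
import re

def drop_numbers_except_years(text: str) -> str:
    # Remove standalone digit runs unless they are exactly four digits long
    text = re.sub(r"\b(?!\d{4}\b)\d+\b", "", text)
    return " ".join(text.split())  # collapse the gaps left behind

print(drop_numbers_except_years("in 2019 he paid 500 naira for 12 items"))
print(drop_numbers_except_years("order 12345 shipped with 4000 units"))
```

The negative lookahead skips any number that is exactly four digits long, so "2019" survives, but so does "4000": the pattern cannot tell a year from any other four-digit number.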
The next step is to define a set of functions that will be used in processing and visualizing the text data:
- Lowercasing the Titles: This is a common preprocessing step in text analysis to ensure consistency, as it treats words like “Nigeria,” “nigeria,” and “NIGERIA” as the same word.
data['Title'] = data['Title'].str.lower()
- Custom Stopwords Definition: A list of custom stopwords is defined by extending the default English stopwords list (ENGLISH_STOP_WORDS) with a custom set of terms.
custom_stopwords = list(ENGLISH_STOP_WORDS) + ['com', 'http', 'https', 'www', 'youtube', 'instagram']
- URL Cleaner Function: This function (url_cleaner) uses regular expressions to remove URLs and any “@” followed by non-space characters (such as user handles or the domain part of an email address) from the text. It is part of the preprocessing that cleans the text data.
def url_cleaner(text):
    text = re.sub(r'https?://\S+|www\.\S+|@\S+', '', text)
    return text
- N-Gram Display and Visualization Function: N-grams are contiguous sequences of ‘n’ items from a given sample of text or speech. The ‘items’ can be phonemes, syllables, letters, words, or base pairs according to the application. The ‘n’ in n-grams represents the number of elements in the sequence. N-grams are used in various fields of linguistic analysis and text mining.
For example, the phrase “My name is Timothy” has:
4 unigrams, i.e. ‘My’, ‘name’, ‘is’, ‘Timothy’
3 bigrams, i.e. ‘My name’, ‘name is’, ‘is Timothy’
2 trigrams, i.e. ‘My name is’, ‘name is Timothy’
The function display_and_visualize_ngrams takes a DataFrame, a column name, an n-gram value (ngram_val), and the number of top n-grams to display (top_n). It uses CountVectorizer to count the frequency of n-grams in the specified column. It then displays the top n-grams and their frequencies in both a printed list and a horizontal bar chart. The n-grams are sequences of n words; for example, bigrams are sequences of two words.
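The counting idea behind n-grams can be shown in plain Python. This sketch is not the article's display_and_visualize_ngrams function (which relies on CountVectorizer and also draws a chart); the helper names ngrams and top_ngrams and the sample titles are hypothetical.

```python
from collections import Counter

def ngrams(text: str, n: int):
    """Return the n-word sequences in text, in order of appearance."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def top_ngrams(texts, n: int, top_n: int):
    """Count n-grams across many texts and return the top_n most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower(), n))
    return counts.most_common(top_n)

print(ngrams("My name is Timothy", 2))

titles = ["boko haram attack", "boko haram member arrested", "new photo"]
print(top_ngrams(titles, n=2, top_n=1))
```

Calling ngrams on “My name is Timothy” with n set to 1, 2, and 3 yields 4, 3, and 2 items respectively, matching the example above.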
28. Top 15 Unigrams, Bigrams and Trigrams in News Article Title
Unigrams
photo 42363
nigerian 14093
nigeria 13752
buhari 11633
man 10577
lagos 10092
state 8404
new 7119
video 6951
apc 6435
picture 6261
woman 6192
lady 5926
wife 5836
pic 5687
Bigrams
boko haram 2625
president buhari 1704
graphic photo 1639
new photo 1196
state photo 1091
lagos photo 1061
peter obi 1034
throwback photo 897
akwa ibom 889
tiwa savage 829
photo video 828
world cup 807
nigerian man 776
river state 772
dino melaye 766
Trigrams
job recruitment position 480
state graphic photo 223
boko haram attack 189
big brother naija 171
river state photo 165
got people talking 150
boko haram member 141
boko haram terrorist 138
killed boko haram 118
orji uzor kalu 116
latest job recruitment 115
delta state photo 114
akwa ibom state 112
best graduating student 110
akwa ibom photo 109
29. WordCloud of News Article Titles
30. Unigram, Bigram and Trigram Analysis of News Articles Body
Unigrams
state 320582
nigeria 182838
people 180344
government 162423
president 155480
nigerian 151304
time 122262
year 118035
governor 117412
party 107168
national 92861
country 91328
mr 91318
police 88124
know 87582
Bigrams
federal government 28839
local government 24214
state government 21487
lagos state 21324
muhammadu buhari 19695
state governor 18553
president muhammadu 18040
boko haram 17874
national assembly 15623
progressive congress 15236
social medium 15095
democratic party 14667
people democratic 14335
government area 13662
river state 13340
Trigrams:
president muhammadu buhari 14764
people democratic party 13611
local government area 13538
progressive congress apc 10192
democratic party pdp 8888
president goodluck jonathan 6848
federal high court 5461
economic financial crime 5146
financial crime commission 4831
independent national electoral 4741
national electoral commission 4577
state police command 4470
state house assembly 4415
public relation officer 4215
central bank nigeria 3774
31. WordCloud Visualization of News Article Bodies
33. Retaining only the Text Columns
In the next phase of preparing for the text classifier, the focus narrows down to the most critical pieces of data that will fuel the model’s learning. As shown in the code snippet below, the first step is to streamline the dataset by selecting only the columns that contain text: ‘Title’, ‘Body’, and ‘Category’. This reduction serves a dual purpose: it simplifies the dataset to the essential attributes that contain natural language information and discards non-textual data that won’t be used in the classification process. By retaining only these text-based columns, the preparation ensures that the upcoming text analysis and feature extraction are concentrated on the relevant data, setting a solid foundation for building an accurate and efficient text classifier
34. Data Pruning and Distribution Analysis
Here the categories were filtered based on article count and their distribution visualized. Any category with fewer than 1,000 articles was eliminated, which reduces the dataset to the most significant categories.
Output:
Category
Business 7457
Car Talk 3876
Career 3808
Celebrities 38639
Crime 21497
Culture 2338
Education 9759
European Football (EPL, UEFA, La Liga) 2997
Events 1327
Family 7327
Fashion 1968
Food 1883
Foreign Affairs 3357
General 2836
Health 6392
Islam for Muslims 2154
Jobs/Vacancies 5005
Literature 1590
Music/Radio 2232
NYSC 1461
Phones 4224
Politics 97176
Properties 1892
Religion 9045
Romance 12386
Science/Technology 1066
Sports 11635
TV/Movies 3358
Travel 7620
Webmasters 1586
dtype: int64
As can be seen from the visualization above, the data is hugely imbalanced. A model trained on such data might be biased and unable to generalize well to unseen data. The next step is therefore crucial in fixing the imbalance.
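The pruning step that produced the counts above could be sketched as follows. The DataFrame is a toy example and the threshold is lowered from the article's 1,000 so the effect is visible; the variable names are illustrative.

```python
import pandas as pd

MIN_ARTICLES = 3  # the article uses 1000; lowered here for the toy data

# Hypothetical miniature dataset with one rare category
df = pd.DataFrame({
    "Title": list("abcdefgh"),
    "Category": ["Politics"] * 4 + ["Crime"] * 3 + ["Pets"],
})

counts = df["Category"].value_counts()
keep = counts[counts >= MIN_ARTICLES].index  # categories that meet the threshold
pruned = df[df["Category"].isin(keep)]

print(pruned["Category"].value_counts().sort_index())
```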
35. Sampling
Sampling strategies are crucial in preparing datasets for machine learning, as they directly influence the model’s ability to learn and generalize. These strategies, whether aiming to reflect the natural class distribution or to create a balanced environment, play a pivotal role in the predictive performance and robustness of the resulting classifier.
For this project, a mix of over-sampling and under-sampling was used. Majority classes were reduced to a predetermined threshold to prevent their overwhelming influence, while minority classes were augmented to ensure sufficient representation for learning. The desired sample size was set to 10,000 articles per category, and the RandomUnderSampler() and RandomOverSampler() classes from the ‘imbalanced-learn’ library were used to achieve this sampling strategy.
Output:
Category
Business 10000
Car Talk 10000
Career 10000
Celebrities 10000
Crime 10000
Culture 10000
Education 10000
European Football (EPL, UEFA, La Liga) 10000
Events 10000
Family 10000
Fashion 10000
Food 10000
Foreign Affairs 10000
General 10000
Health 10000
Islam for Muslims 10000
Jobs/Vacancies 10000
Literature 10000
Music/Radio 10000
NYSC 10000
Phones 10000
Politics 10000
Properties 10000
Religion 10000
Romance 10000
Science/Technology 10000
Sports 10000
TV/Movies 10000
Travel 10000
Webmasters 10000
dtype: int64
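To illustrate the strategy itself, the sketch below brings every class to a fixed target using only the standard library. It is a stand-in for the idea, not the imbalanced-learn calls used in the project: the function name resample_to_target, the toy data, and the seed are all assumptions.

```python
import random
from collections import defaultdict

def resample_to_target(rows, labels, target, seed=42):
    """Return rows and labels with exactly `target` samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    out_rows, out_labels = [], []
    for label, items in sorted(by_class.items()):
        if len(items) >= target:
            # Under-sample: pick `target` rows without replacement
            chosen = rng.sample(items, target)
        else:
            # Over-sample: keep every row, then add random repeats
            chosen = items + rng.choices(items, k=target - len(items))
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels

rows = [f"post {i}" for i in range(12)]
labels = ["Politics"] * 9 + ["Pets"] * 3
new_rows, new_labels = resample_to_target(rows, labels, target=5)
print(sorted(set(new_labels)), len(new_rows))
```

Random over-sampling only repeats existing rows, so the minority classes gain weight but no new information.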
36. TRAIN/TEST/VALIDATION SPLIT
The train/test/validation split is a fundamental practice in machine learning that serves several key purposes: The training set is used to fit the model; it learns to recognize patterns in this data. The validation set acts as a proxy for the test set, allowing you to evaluate the model’s performance during the tuning process without contaminating the test set. This helps in making decisions about model adjustments without overfitting to the test data. The test set provides an unbiased evaluation of a final model fit on the training dataset. It’s only used once the model has been trained and validated, to test its performance. This is critical for assessing how well the model is likely to perform on unseen data.
Before splitting, the Title and the Body of each news article are joined together, as both contain useful features for training the classifier.
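A sketch of the join and the three-way split is given below. The article does not state the split proportions, so the 80/10/10 ratio, the stratification, the seed, and the function name combine_and_split are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def combine_and_split(df: pd.DataFrame, seed: int = 42):
    """Join Title and Body, then split 80/10/10 (assumed ratio)."""
    # Join the two text fields into a single input string per article.
    text = (df["Title"].fillna("") + " " + df["Body"].fillna("")).str.strip()
    labels = df["Category"]

    # First cut: 80% for training, 20% held back.
    X_train, X_rest, y_train, y_rest = train_test_split(
        text, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    # Second cut: split the held-back 20% evenly into validation and test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

Stratifying on the label keeps the category proportions the same in all three subsets, which matters less after balancing but costs nothing.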
37. LABEL ENCODING
Label encoding transforms the categorical labels into a numerical format.
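One common way to do this is scikit-learn's LabelEncoder, shown here on a toy list of labels. Whether the project used this exact class is an assumption; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Fit on the training labels only, then reuse the same mapping elsewhere.
y_train = ["Politics", "Sports", "Business", "Sports"]
y_train_ids = encoder.fit_transform(y_train)

# Validation and test labels are transformed with the already fitted encoder.
y_test_ids = encoder.transform(["Business", "Politics"])

# The mapping is alphabetical and can be reversed to recover category names.
print(list(encoder.classes_))
print(list(encoder.inverse_transform(y_train_ids)))
```

Keeping the fitted encoder around is useful later, because the model outputs integer class indices that have to be mapped back to category names.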
38. EMBEDDING LAYER SET-UP
Here, a pre-trained text embedding model from TensorFlow Hub is loaded as a Keras layer. This layer converts text inputs into fixed-size numerical vectors, which can then be used for further processing in the neural network. The trainable=True parameter indicates that the weights of this embedding layer can be updated during training.
39. MODEL DEFINITIONS
Here, the neural network model is defined as a sequence of layers. It adds the previously defined embedding layer, a dense layer with 16 units and ReLU activation, and an output dense layer with a number of units equal to the number of classes, using softmax activation.
Output:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
keras_layer (KerasLayer) (None, 50) 48190600
dense (Dense) (None, 16) 816
dense_1 (Dense) (None, 30) 510
=================================================================
Total params: 48191926 (183.84 MB)
Trainable params: 48191926 (183.84 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
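The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short calculation below reproduces the figures, taking the embedding layer's count as given and assuming 32-bit floats for the size estimate.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


EMBEDDING_PARAMS = 48_190_600  # as reported for keras_layer in the summary
EMBEDDING_DIM = 50             # output size of the embedding layer
HIDDEN_UNITS = 16
NUM_CLASSES = 30

hidden_params = dense_param_count(EMBEDDING_DIM, HIDDEN_UNITS)
output_params = dense_param_count(HIDDEN_UNITS, NUM_CLASSES)
total_params = EMBEDDING_PARAMS + hidden_params + output_params

# Assuming float32 weights, each parameter occupies 4 bytes.
size_mb = total_params * 4 / 1024**2

print(hidden_params, output_params, total_params, round(size_mb, 2))
# prints: 816 510 48191926 183.84
```

Almost all of the model's capacity sits in the embedding layer, which is why setting it to trainable makes each epoch slow.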
40. MODEL COMPILATION
Here, the model is compiled for training. ‘adam’ is set as the optimizer, sparse categorical crossentropy is used as the loss function (appropriate for multi-class classification tasks with integer labels), and ‘accuracy’ is specified as the metric to be tracked.
Setting Up Early Stopping Callback
Early stopping is a form of regularization used to avoid overfitting by stopping training before the model learns the noise in the training data. It can also save time and computational resources by reducing unnecessary training epochs.
Here’s what each parameter is set to achieve:
- monitor='val_loss': This tells the callback to monitor the validation loss. The validation loss is used as a performance metric to determine when to stop training.
- patience=3: This sets the number of epochs with no improvement after which training will be stopped. In this case, if the validation loss does not improve for three consecutive epochs, the training process will be halted.
- verbose=1: This enables verbose output in the log, meaning that messages will be shown in the console when the training is stopped early.
- restore_best_weights=True: This parameter ensures that once training is stopped, the model’s weights are rolled back to those that achieved the lowest validation loss.
Model Training:
Here the training data and labels, the number of epochs (iterations over the entire dataset), and the batch size (number of samples processed before the model is updated) are specified. The model is validated on a separate dataset, and early stopping is used to halt training if there is no improvement in validation loss for several epochs. The start and end times are recorded to calculate and print the total training time.
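A minimal sketch of this training step, with the early stopping callback configured as listed above, is shown here. The function name train_with_early_stopping is illustrative, and the default batch size of 512 is a guess rather than a value stated in the article.

```python
import time

import tensorflow as tf


def train_with_early_stopping(model, X_train, y_train, X_val, y_val,
                              epochs=20, batch_size=512):
    """Fit the model, stop early on stalled val_loss, and time the run."""
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",         # watch the validation loss
        patience=3,                 # tolerate 3 epochs without improvement
        verbose=1,                  # log a message when stopping early
        restore_best_weights=True,  # roll back to the best epoch's weights
    )

    start = time.time()
    history = model.fit(
        X_train, y_train,
        epochs=epochs,
        batch_size=batch_size,      # assumed value, not given in the article
        validation_data=(X_val, y_val),
        callbacks=[early_stopping],
    )
    elapsed = time.time() - start
    print(f"Time taken to train the model: {elapsed} seconds")
    return history, elapsed
```

The returned history object holds the per-epoch loss and accuracy values, which is what a training curve is normally plotted from.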
Output:
Epoch 1/20
469/469 [==============================] - 259s 550ms/step - loss: 1.7662 - accuracy: 0.5722 - val_loss: 0.9420 - val_accuracy: 0.7590
Epoch 2/20
469/469 [==============================] - 263s 561ms/step - loss: 0.7405 - accuracy: 0.8078 - val_loss: 0.6894 - val_accuracy: 0.8194
Epoch 3/20
469/469 [==============================] - 262s 558ms/step - loss: 0.5204 - accuracy: 0.8646 - val_loss: 0.5832 - val_accuracy: 0.8476
Epoch 4/20
469/469 [==============================] - 262s 558ms/step - loss: 0.3912 - accuracy: 0.8997 - val_loss: 0.5237 - val_accuracy: 0.8654
Epoch 5/20
469/469 [==============================] - 261s 557ms/step - loss: 0.3015 - accuracy: 0.9241 - val_loss: 0.4932 - val_accuracy: 0.8781
Epoch 6/20
469/469 [==============================] - 261s 556ms/step - loss: 0.2346 - accuracy: 0.9430 - val_loss: 0.4777 - val_accuracy: 0.8861
Epoch 7/20
469/469 [==============================] - 259s 552ms/step - loss: 0.1825 - accuracy: 0.9568 - val_loss: 0.4723 - val_accuracy: 0.8927
Epoch 8/20
469/469 [==============================] - 257s 548ms/step - loss: 0.1415 - accuracy: 0.9682 - val_loss: 0.4822 - val_accuracy: 0.8943
Epoch 9/20
469/469 [==============================] - 257s 548ms/step - loss: 0.1086 - accuracy: 0.9769 - val_loss: 0.4972 - val_accuracy: 0.8979
Epoch 10/20
469/469 [==============================] - ETA: 0s - loss: 0.0824 - accuracy: 0.9836Restoring model weights from the end of the best epoch: 7.
469/469 [==============================] - 259s 552ms/step - loss: 0.0824 - accuracy: 0.9836 - val_loss: 0.5217 - val_accuracy: 0.8980
Epoch 10: early stopping
Time taken to train the model: 2599.22061085701 seconds
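The stopping point in this log can be traced by hand. The snippet below is a plain-Python imitation of the patience rule, not Keras's own implementation, applied to the validation losses printed above; the function name simulate_early_stopping is illustrative.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch), both 1-based.

    stopped_epoch is None if training would have run to completion.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Validation losses for epochs 1 to 10, copied from the log.
logged_val_loss = [0.9420, 0.6894, 0.5832, 0.5237, 0.4932,
                   0.4777, 0.4723, 0.4822, 0.4972, 0.5217]

stopped_epoch, best_epoch = simulate_early_stopping(logged_val_loss, patience=3)
print(stopped_epoch, best_epoch)
# prints: 10 7
```

The validation loss bottoms out at epoch 7 and then rises for three epochs while the training loss keeps falling, which is the usual signature of overfitting.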
Training stopped at epoch 10, after the validation loss had failed to improve for three successive epochs. The performance during training is then plotted.
Output: [plot of the model’s performance during training]
MODEL EVALUATION
The model.predict() method was used to generate predictions on the pre-processed test data, and the resulting metrics were printed.
Output:
938/938 [==============================] - 136s 145ms/step
Accuracy: 0.8901
Precision: 0.8880
Recall: 0.8901
F1-score: 0.8883
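A sketch of how such metrics can be computed from the output of model.predict() is shown below, using a tiny hand-made probability matrix in place of real predictions. Weighted averaging is an assumption; it is suggested by the reported recall being equal to the accuracy, which is always the case for support-weighted recall. The function name summarise_predictions is illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def summarise_predictions(y_true, probabilities):
    """Turn class probabilities into accuracy, precision, recall and F1."""
    # The predicted class is the one with the highest probability.
    y_pred = np.argmax(probabilities, axis=1)
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy stand-in for model.predict(X_test): 4 samples, 3 classes.
toy_probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
toy_labels = np.array([0, 1, 2, 1])

for name, value in summarise_predictions(toy_labels, toy_probabilities).items():
    print(f"{name}: {value:.4f}")
```

With balanced classes, as in this project, weighted and macro averages are nearly identical, so the choice of averaging has little effect on the headline numbers.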
HYPERPARAMETER TUNING
Hyperparameter tuning is a critical process in machine learning that involves searching for the hyperparameter values that give a model its best performance. This process is essential for fine-tuning a model’s behavior to align with the specific characteristics and complexities of the data.
In this project, hyperparameter tuning is implemented for a text classification model using TensorFlow and Keras. The process is outlined below:
Model Builder Function (build_model): This function is designed to create the model. It takes hyperparameters as input and incorporates them into the model’s architecture. The hyperparameters being tuned are the number of units in the dense layer (hp_units) and the learning rate of the optimizer (hp_learning_rate). Additionally, an embedding layer is used, sourced from TensorFlow Hub.
Initializing the Tuner (RandomSearch): The RandomSearch tuner from Keras Tuner is used, which explores different combinations of hyperparameters randomly. The tuner is set to optimize for validation accuracy (val_accuracy), with a maximum of 5 trials and 1 execution per trial.
Early Stopping: An EarlyStopping callback is used during training to prevent overfitting. It monitors validation loss and stops the training if there’s no improvement, restoring the best weights achieved.
Training with Hyperparameter Tuning: The tuner.search method is called to start the training process. It trains the model on the training data, validating on a separate validation dataset, and uses the early stopping callback.
Output:
Trial 5 Complete [01h 23m 48s]
val_accuracy: 0.8161666393280029
Best val_accuracy So Far: 0.9069333076477051
Total elapsed time: 05h 26m 35s
Results and Best Model Selection: After the tuning process, tuner.results_summary() displays the best hyperparameters found. The best performing model is retrieved and evaluated on the test data to report its loss and accuracy.
Output:
Results summary
Results in my_dir/hyperparameter_tuning
Showing 10 best trials
Objective(name="val_accuracy", direction="max")
Trial 1 summary
Hyperparameters:
units: 48
learning_rate: 0.001
Score: 0.9069333076477051
Trial 3 summary
Hyperparameters:
units: 64
learning_rate: 0.001
Score: 0.9067999720573425
Trial 2 summary
Hyperparameters:
units: 48
learning_rate: 0.0001
Score: 0.8416666388511658
Trial 4 summary
Hyperparameters:
units: 16
learning_rate: 0.0001
Score: 0.8161666393280029
Trial 0 summary
Hyperparameters:
units: 32
learning_rate: 1e-05
Score: 0.5175999999046326
Output:
938/938 [==============================] - 3s 3ms/step - loss: 0.4422 - accuracy: 0.9074
Test Loss: 0.44224512577056885
Test Accuracy: 0.9073666930198669
Saving the Best Model: The best model is saved in TensorFlow format for future use or deployment.
LOCAL DEPLOYMENT
To deploy the model locally, a workspace was set up in Visual Studio Code within the directory containing the saved model. A virtual environment named ‘classification’ was created and activated, encapsulating the project’s dependencies and ensuring an isolated execution environment.
With the environment prepared, essential Python libraries such as Streamlit, TensorFlow, TensorFlow Hub, NLTK, and Scikit-learn were installed to support the model and web application functionality. The app.py file was crafted to serve as the backbone of the web application. It was equipped with a text cleaning function, model loading routines, and a user interface for inputting and classifying news articles.
Upon executing streamlit run app.py, a local web server was initiated, presenting a user-friendly interface through a web page on localhost. The interface, titled ‘NAIRALAND NEWS CLASSIFICATION’, prompted users to enter a news article’s title and body. Upon submission, the model processed the inputs, applying text cleaning and leveraging the trained classifier to predict the category of the news article, displaying the result in real-time.
This local deployment served as an instrumental step towards operationalizing the machine learning model, allowing for immediate and interactive predictions in a user-centric manner.
RECOMMENDED FUTURE WORK
My recommendations for future work to enhance the Nairaland News Classification project would be focused on scalability, robustness, and continuous improvement. Here are some suggestions:
- Model Versioning and Experiment Tracking: Implementing tools like MLflow or DVC for model versioning, experiment tracking, and reproducibility. This will enable the team to keep track of various experiments, model versions, and their corresponding performances.
- Automated Retraining Pipeline: Establishing an automated retraining pipeline that can periodically retrain the model with new data. This can be orchestrated using CI/CD tools like Jenkins or GitHub Actions, which would also include automated testing to ensure model quality.
- Advanced Model Monitoring: Setting up advanced model monitoring capabilities to track model drift, data quality issues, and prediction performance in production. Consider using tools like Prometheus, Grafana, or custom dashboards to visualize monitoring metrics.
- Feature Store Implementation: Building or integrating a feature store to manage, share, and reuse features across different models. This can help in maintaining consistency, speeding up the development of new models, and facilitating more complex analyses.
- Expand Model Interpretability: Incorporating model interpretability tools like SHAP or LIME to provide insights into the model’s decision-making process. This can help in identifying biases, improving model trustworthiness, and facilitating better decision-making.
- Ensemble and Multi-Model Approaches: Exploring ensemble methods or multi-model architectures to improve prediction performance and robustness. This might involve combining different types of models or using a microservices architecture to deploy multiple models.
- User Feedback Loop: Creating a user feedback system where predictions can be reviewed and corrected by users, with the corrected labels used to further train the model, enhancing its accuracy over time.
- Model Serving Enhancements: Considering the use of advanced model serving solutions like TensorFlow Serving or TorchServe that can provide more efficient model updates, A/B testing capabilities, and can scale according to request load.
- Cloud-Native Technologies: Leveraging cloud-native technologies such as Kubernetes for orchestrating containerized applications, enabling the system to be more resilient and scalable.
- Data Privacy and Security: Ensure data privacy and security best practices are in place, especially when handling user-generated content. This could involve data anonymization, secure data storage, and adherence to GDPR or other relevant regulations.
- Cross-Platform Deployment: Working on cross-platform deployment strategies, ensuring the model can be deployed to various environments, from cloud platforms to edge devices, increasing the model’s accessibility.
- Expand Language Support: Given Nairaland’s diverse user base, incorporating NLP models that better handle local dialects and pidgin, potentially using transfer learning with language models pre-trained on African languages and dialects.
These suggestions aim to ensure that the project not only maintains its relevance but also continues to evolve with technological advancements and user needs.