INTRODUCTION

In the digital age, online forums have emerged as vibrant hubs of conversation, ideas, and community engagement. Nairaland.com, Africa’s largest online forum, stands as a testament to this, hosting a vast number of discussions spanning topics relevant to Nigeria, the African continent, and beyond. However, the forum’s exponential growth also brings a significant challenge: the need to efficiently manage and categorize the vast amount of user-generated content, while simultaneously reducing the manual workload of the moderators tasked with overseeing such a diverse range of topics.

This article delves into the process of developing and fine-tuning a machine learning model that is both efficient and culturally aware, one that can transform the way content is organized and accessed on the platform. By the end, readers will gain insight into not only the technical aspects of building a state-of-the-art text classifier for a unique platform like Nairaland, but also the wider impact of this technology in enhancing digital communication across diverse communities.

DATA COLLECTION

The foundation of any machine learning model lies in the quality and relevance of its data. For the intelligent text classifier designed for Nairaland, this step is critical. The data collection process involves meticulously gathering the necessary information that reflects the diverse and dynamic nature of the forum’s content. This phase is not just about quantity but focuses on obtaining data that can impart meaningful insights and patterns for the classifier to learn from.

Web Scraping and Disclaimer

Web scraping is a powerful tool in data science, used to extract large amounts of data from websites. In this project, web scraping is employed to collect relevant information from Nairaland. It’s important to note that web scraping must be conducted responsibly and ethically. The methods demonstrated here are for educational purposes only and should not be used without considering the legal and ethical implications, particularly respecting the terms of service of the website in question.

Approach to Scraping

The approach to scraping for this project is methodical and tailored to the structure of Nairaland.com. Given the site’s vast and varied content, the scraping is conducted in two distinct parts. The first part focuses on gathering links to articles featured on the forum’s frontpage, as these represent the most engaging and quality-driven content. The second part involves iterating over these links to scrape detailed information from each post. This dual-part strategy ensures not only the acquisition of high-quality data but also the capture of a comprehensive range of attributes for each post, including title, body, comments, and more. The objective is to create a rich dataset that provides a deep understanding of the content dynamics on the forum.

Detailed Explanation of the Scraping Process

Part One: Gathering Links

The first part of the scraping process is focused on collecting the URLs of posts that have made it to the frontpage of Nairaland.com. This is a crucial step as the frontpage typically features the most relevant and engaging content, which is indicative of the quality that the model aims to learn.

  • Method: The script iterates through a sequence of pages on the Nairaland news section.

  • Tools Used: A combination of the cloudscraper library and BeautifulSoup is employed. cloudscraper is essential here to bypass Cloudflare’s anti-bot measures, ensuring uninterrupted access to the site’s data.

  • Data Extraction: For each post, the script extracts the URL, along with the time and date of posting. Regular expressions (re) are used to clean and format the URLs correctly.

  • Data Storage: Extracted data is periodically saved to a CSV file. This ensures that the data is not lost in case of any interruptions during the scraping process.

  • Efficiency Measures: The script includes a delay mechanism (time.sleep) to prevent overwhelming the server and to mimic human browsing patterns, reducing the risk of being blocked.

!pip install cloudscraper
from bs4 import BeautifulSoup
import cloudscraper
import csv
import re
import time
import pandas as pd
import random


list_of_links = []
LINKS_CSV = '/content/drive/MyDrive/links_file.csv'  # single output path, reused when appending below

# Write the headers to the CSV file
with open(LINKS_CSV, mode='w', newline='', encoding='utf-8') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Url', 'Time Posted', 'Date Posted'])

for i in range(0, 150, 1):
    url      = f"https://www.nairaland.com/news/{i}"
    scraper  = cloudscraper.create_scraper(interpreter='js2py')
    raw_html = (scraper.get(url).text)
    soup     = BeautifulSoup(raw_html, "html.parser")
    post_elements = soup.select("hr + a")

    for elem in post_elements:
        post_link = elem["href"]
        post_link = re.sub(r"/\d+(?=#|$)", "", post_link)

        time_and_date = elem.find_next_siblings("b")
        post_time     = time_and_date[0].text
        post_date     = time_and_date[1].text

        if len(time_and_date) > 2 and time_and_date[2].text.isdigit():
            post_year = time_and_date[2].text
        else:
            post_year = '2023'

        list_of_links.append([post_link, post_time, f"{post_date}, {post_year}"])

    # Sleep for a random time between 2 to 5 minutes
    time.sleep(random.uniform(120, 300))

    with open(LINKS_CSV, mode='a', newline='', encoding='utf-8') as csvfile:
        csv_writer = csv.writer(csvfile)

        for item in list_of_links:
            csv_writer.writerow(item)

    list_of_links.clear()

Part Two: Extracting Post Details

The second part involves visiting each URL gathered in Part One and scraping detailed information from each post.

  • Method: The script iterates over the list of collected URLs, accessing each post’s content.

  • Detailed Data Extraction: For each post, the script extracts several pieces of information: the article title, body, category, number of comments, total views, and the actual comments themselves.

  • Custom Functions: Several functions (get_title, get_body, get_comments_and_count, get_total_views) are defined to handle the extraction of different data elements from the posts. This modular approach allows for more manageable and readable code.

  • Handling Pagination: The comments section of a post can span multiple pages. The script navigates through these pages (if any) to collect all comments.

  • Data Consolidation: The extracted data, including the title, body, comments, and other details, are consolidated into a single record for each post and saved to a CSV file. This creates a structured dataset ready for analysis and modeling.

  • Error Handling: The script includes error handling to deal with any issues encountered while scraping individual posts. This ensures that the process continues smoothly even if some posts are not accessible or present unexpected formats.

Through these two carefully structured parts, a comprehensive dataset is created, capturing the essence and diversity of content on Nairaland.com. This dataset forms the backbone of the subsequent machine learning model development.

def get_title(soup):
    title_and_views = soup.find('h2')
    if title_and_views is None:
        return 'PLACEHOLDER', 'PLACEHOLDER'
    title_parts = title_and_views.text.strip().split(' - ')
    return title_parts[0], title_parts[-2] if len(title_parts) >= 2 else 'PLACEHOLDER'


def get_total_views(soup):
    views_element = soup.find('p', class_='bold')
    if views_element:
        views_text = views_element.text.strip()
        views_match = re.search(r'\((\d{1,3}(,\d{3})*|\d+)\s*Views\)', views_text)
        if views_match:
            return int(views_match.group(1).replace(',', ''))
    return 0

def get_body(soup):
    body = soup.find('div', {'class': 'narrow'})
    return body.text.strip() if body is not None else 'PLACEHOLDER'


def get_comments_and_count(soup, scraper, url):
    all_comments = []
    previous_comments = None
    page = 0
    while True:
        current_url = f"{url}/{page}" if page > 0 else url
        raw_html = scraper.get(current_url).text
        soup = BeautifulSoup(raw_html, 'html.parser')

        comments = []
        for div in soup.find_all('div', class_='narrow'):
            if div.blockquote:
                div.blockquote.extract()
            comments.append(div.text.strip())

        if comments == previous_comments:
            break

        all_comments.extend(comments)
        previous_comments = comments
        page += 1

    comments_joined = ';;;'.join(all_comments)
    comment_count = len(all_comments)
    return comments_joined, comment_count


# Define the filename and the header of the CSV file
filename = '/content/drive/MyDrive/Nairaland Dataset Part16.csv'
header = ['Title', 'Body', 'Category', 'Comments', 'Comment Count', 'Time Posted', 'Date Posted', 'Total Views', 'Url']


# Load the links collected in Part One
links_file = pd.read_csv(LINKS_CSV)

# Loop through the list of URLs and write the data to the CSV file
with open(filename, mode='a', encoding='utf-8', newline='') as file:
    writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    writer.writerow(header)  # note: appending re-writes the header on each run; duplicate header rows are removed during cleaning
    counter = 0  # Counts the number of iterations

    for index, row in links_file.iterrows():
        url = row['Url']
        post_time = row['Time Posted']
        post_date = row['Date Posted']

        scraper = cloudscraper.create_scraper(interpreter='js2py')

        try:  # Add try block here
            raw_text = scraper.get(url).text
            soup = BeautifulSoup(raw_text, 'html.parser')

            Title, Category = get_title(soup)
            Body = get_body(soup)
            comments_joined, comment_count = get_comments_and_count(soup, scraper, url)
            total_views = get_total_views(soup)

            data = [Title, Body, Category, comments_joined, comment_count, post_time, post_date, total_views, url]
            writer.writerow(data)


        except Exception as e:  # Add except block here
            print(f"Error occurred while processing {url}: {e}")


        # Add delay after every 3 iterations to avoid overloading the server
        if counter % 3 == 0 and counter != 0:
            print(f"Pausing for 10 seconds after {counter} iterations")
            time.sleep(10)
        counter += 1

The scraped data covers news articles from November 2005 to October 2023. It comprises a rich collection of over 319,000 entries spanning nine columns: Title, Body, Category, Comments, Comment Count, Time Posted, Date Posted, Total Views, and URL.

LOADING OF DATA

This was done using the pandas read_csv() function. A quick inspection of the data using df.info() revealed rows with null values in a few of the columns; these are addressed during the crucial data cleaning phase.
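
As a minimal sketch (the exact file paths differ across the scraped parts, which are later concatenated), the loading and inspection step looks like this:

import pandas as pd

# Assumed path: one of the scraped CSV parts
df = pd.read_csv('/content/drive/MyDrive/Nairaland Dataset Part16.csv')
df.info()  # shows column dtypes and reveals the null values mentioned above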

DATA CLEANING

The data cleaning process began with the removal of duplicate header rows, a necessary step to ensure the integrity and accuracy of the dataset. This was followed by an assessment of missing values, a common challenge in real-world data. The dataset was then refined by dropping rows that had null values in essential columns like Title, Body, and Category, as these rows held limited value for meaningful analysis. Additionally, rows where the Category was set to ‘PLACEHOLDER’—a marker for unsuccessfully scraped articles—were also removed. This step was crucial in maintaining the quality and relevance of the data for the classifier.

  1. Dropping Duplicate Rows

  2. Removing Duplicated Header Rows

  3. Handling Missing Values

     The rows with missing values in the vital columns (Title, Body, Category) are dropped.

  4. Removing Rows with the Category Set to PLACEHOLDER

Further cleaning involved addressing the ‘Comment Count’ and ‘Total Views’ columns. Missing values in these columns were filled with zeros, and the data was converted to numeric format to facilitate accurate analysis. The final step in this cleaning process was converting these columns to integer data types, ensuring consistency and ease of interpretation in subsequent analyses. It is important to note that this article covers only a subset of the data cleaning steps implemented. The comprehensive process, detailed in the project’s notebook, includes additional layers of cleaning and preprocessing to refine the dataset further. Readers interested in a deeper dive into these steps can refer to the notebook linked at the end of this article, providing a more exhaustive view of the data preparation journey for this ambitious project.
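
A condensed sketch of the cleaning steps listed above (column names follow the scraper's header row) is shown below:

# 1. Drop exact duplicate rows
df = df.drop_duplicates()

# 2. Remove duplicated header rows appended on repeated scraping runs
df = df[df['Title'] != 'Title']

# 3. Drop rows missing the essential text columns
df = df.dropna(subset=['Title', 'Body', 'Category'])

# 4. Discard articles whose category could not be scraped
df = df[df['Category'] != 'PLACEHOLDER']

# Fill missing counts with zero and convert to integers
for col in ['Comment Count', 'Total Views']:
    df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)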

EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis (EDA) is a critical step in machine learning, offering initial insights through visual and statistical examination of the data. It’s essential for identifying patterns, anomalies, and determining the right preprocessing strategies. However, for the sake of brevity, this article will discuss only a few key aspects of the EDA performed. Readers interested in a more comprehensive analysis can refer to the full EDA detailed in the project’s notebook linked at the end of the article.

1. Number of News Articles in the Dataset

Output:

There are a total of 283122 news articles on Nairaland front page

2. Number of Distinct Categories of News Articles

Output:

There are 61 distinct categories of News/Articles on the Frontpage

3. Category Distribution

Output:

Politics: 97183
Celebrities: 38645
Crime: 21497
Romance: 12387
Sports: 11636
Education: 9761
Religion: 9046
Travel: 7621
Business: 7459
Family: 7327
Health: 6392
Jobs/Vacancies: 5005
Phones: 4225
Car Talk: 3876
Career: 3809
Foreign Affairs: 3358
TV/Movies: 3358
European Football (EPL, UEFA, La Liga): 3001
General: 2836
Culture: 2340
Music/Radio: 2233
Islam for Muslims: 2154
Fashion: 1968
Properties: 1892
Food: 1883
Literature: 1590
Webmasters: 1586
NYSC: 1461
Events: 1327
Science/Technology: 1066
Jokes Etc: 874
Agriculture: 712
Investment: 613
Computers: 531
Art, Graphics & Video: 438
Pets: 390
Programming: 306
Autos: 300
Forum Games: 255
Gaming: 245
Entertainment: 162
Poems For Review: 87
Technology Market: 68
Dating And Meet-up Zone: 55
Adverts: 50
Satellite TV Technology: 30
Phone/Internet Market: 19
Music Business: 13
Rap Battles: 13
Fashion/Clothing Market: 7
Software/Programmer Market: 6
Business To Business: 5
Certification And Training Adverts: 4
Educational Services: 3
Travel Ads: 3
Nairaland Ads: 3
Graphics/Video Market: 2
Computer Market: 2
Web Market: 2
Top Pages: 1
Literature/Writing Ads: 1

4. Top 10 Categories (According to Number of Published News Articles)

The distribution of articles across the different categories in the dataset is analyzed, focusing on the top 10 most frequent categories, and the result is visualized using a bar plot.

5. Bottom 10 Categories (According to Number of Published News Articles)

Output:

6. Total Views Garnered on all the Articles

Output:

Total views of all the news articles on frontpage: 9,244,902,581

7. Categories with the Most Total Views

8. Categories with the least Total Views

Output:

Category
Fashion/Clothing Market               245,068 Views
Rap Battles                           234,308 Views
Educational Services                  211,273 Views
Software/Programmer Market             39,756 Views
Certification And Training Adverts     34,478 Views
Top Pages                              17,041 Views
Computer Market                        13,516 Views
Web Market                             11,891 Views
Graphics/Video Market                  10,255 Views
Literature/Writing Ads                      0 Views
Name: Total Views, dtype: object

9. Top 10 Categories (According to Average views per Post)

Category
Adverts                    2,015,243 Views Per frontpage post
Satellite TV Technology      292,494 Views Per frontpage post
Nairaland Ads                113,913 Views Per frontpage post
Travel Ads                    97,794 Views Per frontpage post
Business To Business          94,325 Views Per frontpage post
Educational Services          70,424 Views Per frontpage post
Investment                    61,920 Views Per frontpage post
Technology Market             55,867 Views Per frontpage post
TV/Movies                     55,151 Views Per frontpage post
Literature                    50,677 Views Per frontpage post
Name: Total Views, dtype: object

10. Bottom 10 Categories (According to Average Views per Post)

Output:

Average Lowest Views Per Category
Category
Programming                           16,475 Views per Frontpage Post
Islam for Muslims                     15,355 Views per Frontpage Post
Webmasters                            11,605 Views per Frontpage Post
Certification And Training Adverts     8,620 Views per Frontpage Post
Poems For Review                       8,588 Views per Frontpage Post
Computer Market                        6,758 Views per Frontpage Post
Software/Programmer Market             6,626 Views per Frontpage Post
Web Market                             5,946 Views per Frontpage Post
Graphics/Video Market                  5,128 Views per Frontpage Post
Literature/Writing Ads                     0 Views per Frontpage Post
Name: Total Views, dtype: object

11. Descriptive Statistics of Post Views

Output:

count     283122.00
mean       32653.42
std       111693.01
min            0.00
25%        15856.25
50%        27178.00
75%        41725.00
max     26157898.00
Name: Total Views, dtype: float64

12. Box Plot Visualization of Post Views

A box plot is a type of visualization that helps detect the presence of outliers in data. From the statistics above, the average number of views is 32,653. However, the box plot below clearly shows that a few posts have extremely high view counts, which inflate this average. In such a case, the median is a better measure of central tendency.

13. News Article with the highest Views (Category by Category)

14. Trends of Average Views over the Years

It is not surprising that 2020 is the year with the highest average views per frontpage post; this is most likely due to the COVID-19 lockdown.

COMMENT ANALYSIS

15. The Total Number of Comments in all News Articles

Total number of comments on all frontpage articles: 26,996,332

16. The Top 20 Posts with the Highest Number of Comments Overall

17. Box Plot Visualization of Overall Comments

18. Comment Distribution Histogram

From the histogram above, it is clear that over 90% of the posts have fewer than 300 comments, with the peak of the distribution falling roughly between 60 and 100 comments.

ANALYSIS OF ARTICLE PUBLICATION DATES

19. Number of Articles Published Per Year

Results:

Total Number of Articles Published by Year
2005: 597 articles
2006: 1943 articles
2007: 1537 articles
2008: 704 articles
2009: 2293 articles
2010: 3166 articles
2011: 6498 articles
2012: 12211 articles
2013: 14181 articles
2014: 6769 articles
2015: 27826 articles
2016: 36102 articles
2017: 41202 articles
2018: 45187 articles
2019: 42206 articles
2020: 12612 articles
2021: 8974 articles
2022: 7778 articles
2023: 11303 articles

20. Number of News Articles Published by Month

Results

Total Number of Articles Published by Month
Jan: 25531 articles
Feb: 21059 articles
Mar: 22118 articles
Apr: 25052 articles
May: 22509 articles
Jun: 22978 articles
Jul: 26463 articles
Aug: 23401 articles
Sep: 24940 articles
Oct: 22222 articles
Nov: 23396 articles
Dec: 23420 articles

21. Heatmap Visualization of the Number of Articles Published Month by Month

22. Distribution of Posts by Hour of the Day

23. Relationship Between Time of Publication and the Number of Comments a Post Received

24. Relationship Between Time of Publication and the Total Views a Post Garnered

ANALYSIS OF NEWS ARTICLE SOURCES

25. Percentage of News Articles with/without Source Url

Output:

Percentage of posts with Source URL: 75.46%
Percentage of posts without Source URL: 24.54%

26. Top Cited Websites

Output:

www.youtube.com           12029
www.vanguardngr.com        8593
punchng.com                6580
www.instagram.com          6205
www.nationalhelm.co        5052
www.trezzyhelm.com         4812
thenationonlineng.net      3951
www.punchng.com            3946
twitter.com                3846
dailypost.ng               3495
www.premiumtimesng.com     3136
www.dailytrust.com.ng      2994
gistmore.com               2656
saharareporters.com        2630
www.lailasblog.com         2483
mobile.twitter.com         2390
www.google.com             2246
www.nairaland.com          2085
m.facebook.com             1998
www.nationalhelm.net       1965
Name: Domain, dtype: int64

27. Percentage of Posts containing Source Url (By Category)

Output:

Category
Crime                                     93.366516
Celebrities                               92.137478
Foreign Affairs                           86.148347
Politics                                  85.646662
Sports                                    78.581865
Travel                                    76.220472
Health                                    75.391114
Science/Technology                        75.328330
Business                                  73.233204
Entertainment                             72.839506
Events                                    70.610399
TV/Movies                                 69.446099
Music/Radio                               69.086022
Car Talk                                  66.847265
Nairaland Ads                             66.666667
Culture                                   66.082121
Education                                 63.018752
Investment                                59.706362
Agriculture                               59.410112
NYSC                                      59.069131
Religion                                  58.993919
Fashion                                   58.333333
Pets                                      55.641026
Phones                                    54.924242
Properties                                52.642706
Web Market                                50.000000
Career                                    49.317227
Webmasters                                49.054224
Jobs/Vacancies                            46.433566
Art, Graphics & Video                     44.977169
Autos                                     44.666667
Family                                    44.370138
Food                                      43.706851
Romance                                   43.008235
General                                   42.630465
Islam for Muslims                         41.179201
Satellite TV Technology                   40.000000
Phone/Internet Market                     36.842105
European Football (EPL, UEFA, La Liga)    33.566900
Travel Ads                                33.333333
Computers                                 33.145009
Literature                                30.943396
Gaming                                    28.979592
Technology Market                         26.470588
Programming                               25.163399
Certification And Training Adverts        25.000000
Jokes Etc                                 21.878580
Rap Battles                               15.384615
Fashion/Clothing Market                   14.285714
Adverts                                   14.000000
Music Business                             7.692308
Forum Games                                7.058824
Poems For Review                           6.976744
Dating And Meet-up Zone                    3.636364
Business To Business                            NaN
Computer Market                                 NaN
Educational Services                            NaN
Graphics/Video Market                           NaN
Literature/Writing Ads                          NaN
Software/Programmer Market                      NaN
Top Pages                                       NaN
dtype: float64

TEXTUAL ANALYSIS

In the “Textual Analysis” subsection, I will delve into the intricacies of the dataset’s language through N-GRAM analysis and word cloud visualization. This approach allows us to uncover the most prevalent words and phrases, providing a visual and quantitative representation of the key themes and topics emerging from the forum’s discussions. This insight is vital for understanding the dominant textual patterns that characterize the content on Nairaland.com.

To begin, we first need to define a function that cleans the text and makes it suitable for the analysis.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
# Requires nltk.download('wordnet') and nltk.download('stopwords') on first run

def clean_text(data):
    if pd.isna(data):
        return ""

    data = str(data)  # Convert data to string
    data = data.lower()

    # Remove URLs
    data = re.sub(r'http\S+|www\S+|https\S+', '', data, flags=re.MULTILINE)

    # Remove HTML tags
    data = re.sub(r'<.*?>', '', data)

    # Remove punctuations and emojis
    punct_tag = re.compile(r'[^\w\s]')
    emoji_clean = re.compile("["
                             u"\U0001F600-\U0001F64F"  # emoticons
                             u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                             u"\U0001F680-\U0001F6FF"  # transport & map symbols
                             u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                             u"\U00002702-\U000027B0"
                             u"\U000024C2-\U0001F251"
                             "]+", flags=re.UNICODE)

    data = punct_tag.sub(r'', data)
    data = emoji_clean.sub(r'', data)

    # Remove non-informative words
    non_informative_words = {'was', 'has', 'said', 'like', 'one','dey', 'na', 'will', 'said', 'wey', 'and', 'to', 'of'}
    words = data.split()
    data = ' '.join([word for word in words if word.lower() not in non_informative_words])

    # Remove numbers except for years (assuming years to be 4-digit numbers from 1000 to current year)
    data = re.sub(r'\b(?!\d{4}\b)\d+\b', '', data)

    # Lemmatization
    wn = WordNetLemmatizer()
    lemmatized_words = []
    for w in data.split():
        if w.lower() == "lagos":  # Check for the word "Lagos" in a case-insensitive manner
            lemmatized_words.append(w)  # Append the word "Lagos" as-is without lemmatizing
        else:
            lemmatized_words.append(wn.lemmatize(w))
    data = ' '.join(lemmatized_words)


    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = data.split()
    data = ' '.join([word for word in words if word.lower() not in stop_words])

    return data

data['Title'] = data['Title'].apply(clean_text)
data['Body'] = data['Body'].apply(clean_text)

The provided code block is a text cleaning function, clean_text(), designed to preprocess and sanitize text data. Here are the main cleaning steps performed:

  1. Conversion to Lowercase: The text is converted to lowercase to ensure uniformity, making the analysis case-insensitive.

  2. Removing URLs: All types of URLs are removed from the text. This step cleans out web addresses, which are typically not informative for text analysis.

  3. Removing HTML Tags: Any HTML tags present in the text are stripped out. This is crucial for texts scraped from web pages.

  4. Removing Punctuations and Emojis: The code removes all punctuation marks and emojis. These elements are often not useful for standard text analysis and can introduce noise.

  5. Filtering Out Non-Informative Words: A set of specified non-informative words (like common verbs and prepositions) are removed. Additionally, specific colloquial terms relevant to the dataset’s context (e.g., ‘dey’, ‘na’) are also excluded.

  6. Removing Numbers (Except Years): All standalone numbers are removed, except for four-digit numbers, which are presumed to be years.

  7. Lemmatization: The text is lemmatized, which involves converting words to their base or dictionary form. Notably, the word “Lagos” is kept as-is, indicating a specific treatment for this term.

  8. Removing Stopwords: Common stopwords from the English language are removed. These are typically high-frequency words that don’t contribute to the specific meaning of the text.

This function collectively cleans and standardizes the text data, making it more suitable for further analysis such as machine learning or natural language processing tasks.

The next step is to define a set of functions that will be used in processing and visualizing the text data.

  • Lowercasing the Titles: This is a common preprocessing step in text analysis to ensure consistency, as it treats words like “Nigeria”, “nigeria”, and “NIGERIA” as the same word.

data['Title'] = data['Title'].str.lower()
  • Custom Stopwords Definition: It defines a list of custom stopwords by extending the default English stopwords list (ENGLISH_STOP_WORDS) with a custom set of terms.

custom_stopwords = list(ENGLISH_STOP_WORDS) + ['com', 'http', 'https', 'www', 'youtube', 'instagram']
  • URL Cleaner Function: This function (url_cleaner) uses regular expressions to remove URLs and email-like patterns from the text. It’s a part of preprocessing to clean the text data.

def url_cleaner(text):
    text = re.sub(r'https?://\S+|www\.\S+|@\S+', '', text)
    return text
  • N-Gram Display and Visualization Function: N-grams are contiguous sequences of ‘n’ items from a given sample of text or speech. The ‘items’ can be phonemes, syllables, letters, words, or base pairs according to the application. The ‘n’ in n-grams represents the number of elements in the sequence. N-grams are used in various fields of linguistic analysis and text mining.

    For example the phrase “My name is Timothy”

    has 4 unigrams i.e ‘My’, ‘name’, ‘is’, ‘Timothy’.

    has 3 bigrams i.e ‘My name’, ‘name is’, ‘is Timothy’

    has 2 trigrams i.e ‘My name is’, ‘name is Timothy’

    The function display_and_visualize_ngrams takes a DataFrame, a column name, an n-gram value (ngram_val), and the number of top n-grams to display (top_n). It uses CountVectorizer to count the frequency of n-grams in the specified column. It then displays the top n-grams and their frequencies in both a printed list and a horizontal bar chart. The n-grams are sequences of n words; for example, bigrams are sequences of two words.
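
The full implementation lives in the notebook; a minimal sketch of display_and_visualize_ngrams, assuming the custom_stopwords and url_cleaner defined above, could look like this:

from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

def display_and_visualize_ngrams(df, column, ngram_val=1, top_n=15):
    # Count n-gram frequencies in the chosen text column
    vectorizer = CountVectorizer(ngram_range=(ngram_val, ngram_val),
                                 stop_words=custom_stopwords,
                                 preprocessor=url_cleaner)
    counts = vectorizer.fit_transform(df[column].fillna(''))
    frequencies = counts.sum(axis=0).A1
    vocabulary = vectorizer.get_feature_names_out()

    # Keep the top_n most frequent n-grams
    top = sorted(zip(vocabulary, frequencies), key=lambda x: x[1], reverse=True)[:top_n]
    for ngram, freq in top:
        print(ngram, freq)

    # Horizontal bar chart of the top n-grams
    labels, values = zip(*top)
    plt.barh(labels[::-1], values[::-1])
    plt.xlabel('Frequency')
    plt.title(f'Top {top_n} {ngram_val}-grams in {column}')
    plt.show()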

28. Top 15 Unigrams, Bigrams and Trigrams in News Article Title

Unigrams

photo 42363
nigerian 14093
nigeria 13752
buhari 11633
man 10577
lagos 10092
state 8404
new 7119
video 6951
apc 6435
picture 6261
woman 6192
lady 5926
wife 5836
pic 5687

Bigrams

boko haram 2625
president buhari 1704
graphic photo 1639
new photo 1196
state photo 1091
lagos photo 1061
peter obi 1034
throwback photo 897
akwa ibom 889
tiwa savage 829
photo video 828
world cup 807
nigerian man 776
river state 772
dino melaye 766

Trigrams

job recruitment position 480
state graphic photo 223
boko haram attack 189
big brother naija 171
river state photo 165
got people talking 150
boko haram member 141
boko haram terrorist 138
killed boko haram 118
orji uzor kalu 116
latest job recruitment 115
delta state photo 114
akwa ibom state 112
best graduating student 110
akwa ibom photo 109

29. WordCloud of News Article Titles

30. Unigram, Bigram and Trigram Analysis of News Articles Body

Unigrams

state 320582
nigeria 182838
people 180344
government 162423
president 155480
nigerian 151304
time 122262
year 118035
governor 117412
party 107168
national 92861
country 91328
mr 91318
police 88124
know 87582

Bigrams

federal government 28839
local government 24214
state government 21487
lagos state 21324
muhammadu buhari 19695
state governor 18553
president muhammadu 18040
boko haram 17874
national assembly 15623
progressive congress 15236
social medium 15095
democratic party 14667
people democratic 14335
government area 13662
river state 13340

Trigrams:

president muhammadu buhari 14764
people democratic party 13611
local government area 13538
progressive congress apc 10192
democratic party pdp 8888
president goodluck jonathan 6848
federal high court 5461
economic financial crime 5146
financial crime commission 4831
independent national electoral 4741
national electoral commission 4577
state police command 4470
state house assembly 4415
public relation officer 4215
central bank nigeria 3774

31. WordCloud Visualization of News Article Bodies

33. Retaining only the Text Columns

In the next phase of preparing for the text classifier, the focus narrows down to the most critical pieces of data that will fuel the model’s learning. As shown in the code snippet below, the first step is to streamline the dataset by selecting only the columns that contain text: ‘Title’, ‘Body’, and ‘Category’. This reduction serves a dual purpose: it simplifies the dataset to the essential attributes that contain natural language information and discards non-textual data that won’t be used in the classification process. By retaining only these text-based columns, the preparation ensures that the upcoming text analysis and feature extraction are concentrated on the relevant data, setting a solid foundation for building an accurate and efficient text classifier.
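
The referenced snippet is essentially a single column selection (a minimal sketch):

# Keep only the natural-language columns needed for classification
df = df[['Title', 'Body', 'Category']]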

34. Data Pruning and Distribution Analysis

Here the categories were filtered based on article count and their distribution visualized. Any category with fewer than 1,000 articles was eliminated, ensuring the dataset was reduced to the most significant categories.
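
A minimal sketch of this pruning step, assuming the text-only DataFrame from the previous section:

# Keep only categories with at least 1,000 published articles
category_counts = df['Category'].value_counts()
major_categories = category_counts[category_counts >= 1000].index
df = df[df['Category'].isin(major_categories)]

print(df.groupby('Category').size())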

Output:

Category
Business                                   7457
Car Talk                                   3876
Career                                     3808
Celebrities                               38639
Crime                                     21497
Culture                                    2338
Education                                  9759
European Football (EPL, UEFA, La Liga)     2997
Events                                     1327
Family                                     7327
Fashion                                    1968
Food                                       1883
Foreign Affairs                            3357
General                                    2836
Health                                     6392
Islam for Muslims                          2154
Jobs/Vacancies                             5005
Literature                                 1590
Music/Radio                                2232
NYSC                                       1461
Phones                                     4224
Politics                                  97176
Properties                                 1892
Religion                                   9045
Romance                                   12386
Science/Technology                         1066
Sports                                    11635
TV/Movies                                  3358
Travel                                     7620
Webmasters                                 1586
dtype: int64

As can be seen from the visualization above, the data is hugely imbalanced. A model trained on such data might be biased and unable to generalize well to unseen data. The next step is therefore crucial in correcting this imbalance.

35. Sampling

Sampling strategies are crucial in preparing datasets for machine learning, as they directly influence the model’s ability to learn and generalize. These strategies, whether aiming to reflect the natural class distribution or to create a balanced environment, play a pivotal role in the predictive performance and robustness of the resulting classifier.

For this project, a mix of over-sampling and under-sampling was used: the majority classes were reduced to a predetermined threshold to prevent their overwhelming influence, while the minority classes were augmented to ensure sufficient representation for learning. The desired sample size was set to 10,000 articles per category, after which the RandomUnderSampler() and RandomOverSampler() classes from the imbalanced-learn library were used to achieve this sampling strategy, as sketched below.
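
A sketch of this combined strategy, assuming the imbalanced-learn samplers and the pruned DataFrame above:

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

TARGET = 10_000  # desired number of articles per category
counts = df['Category'].value_counts()

# Reduce every category above the target, then lift the rest up to it
under_strategy = {cat: TARGET for cat, n in counts.items() if n > TARGET}
over_strategy  = {cat: TARGET for cat, n in counts.items() if n < TARGET}

X, y = df[['Title', 'Body']], df['Category']
X, y = RandomUnderSampler(sampling_strategy=under_strategy, random_state=42).fit_resample(X, y)
X, y = RandomOverSampler(sampling_strategy=over_strategy, random_state=42).fit_resample(X, y)

print(y.value_counts())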

Output:

Category
Business                                  10000
Car Talk                                  10000
Career                                    10000
Celebrities                               10000
Crime                                     10000
Culture                                   10000
Education                                 10000
European Football (EPL, UEFA, La Liga)    10000
Events                                    10000
Family                                    10000
Fashion                                   10000
Food                                      10000
Foreign Affairs                           10000
General                                   10000
Health                                    10000
Islam for Muslims                         10000
Jobs/Vacancies                            10000
Literature                                10000
Music/Radio                               10000
NYSC                                      10000
Phones                                    10000
Politics                                  10000
Properties                                10000
Religion                                  10000
Romance                                   10000
Science/Technology                        10000
Sports                                    10000
TV/Movies                                 10000
Travel                                    10000
Webmasters                                10000
dtype: int64

36. TRAIN/TEST/VALIDATION SPLIT

The train/test/validation split is a fundamental practice in machine learning that serves several key purposes:

  • Training set: used to fit the model; this is where it learns to recognize patterns in the data.

  • Validation set: acts as a proxy for the test set, allowing the model’s performance to be evaluated during the tuning process without contaminating the test set. This helps in making decisions about model adjustments without overfitting to the test data.

  • Test set: provides an unbiased evaluation of the final model fit on the training data. It is used only once the model has been trained and validated, which is critical for assessing how well the model is likely to perform on unseen data.

Before splitting, the Title and the Body of each news article are joined into a single text field, as both contain useful features for training the classifier.
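
A sketch of the join and the split, assuming an 80/10/10 partition (consistent with the roughly 240,000 training rows implied by the training logs later on):

from sklearn.model_selection import train_test_split

# Combine title and body into a single text feature
X_text = X['Title'] + ' ' + X['Body']

# 80% train, 10% validation, 10% test (assumed proportions)
X_train, X_temp, y_train, y_temp = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)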

37. LABEL ENCODING

This is used to transform the categorical labels into numerical format.
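
A minimal sketch, assuming scikit-learn’s LabelEncoder:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_val_enc   = label_encoder.transform(y_val)
y_test_enc  = label_encoder.transform(y_test)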

38. EMBEDDING LAYER SET-UP

Here, a pre-trained text embedding model from TensorFlow Hub is being loaded as a Keras layer. This layer will convert text inputs into fixed-size numerical vectors, which can then be used for further processing in the neural network. The trainable=True parameter indicates that the weights of this embedding layer can be updated during training.
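
A sketch of this set-up; the exact TensorFlow Hub handle is an assumption, chosen because its 50-dimensional output and roughly 48M parameters match the model summary below:

import tensorflow as tf
import tensorflow_hub as hub

# Assumed handle: a 50-dimensional token-based text embedding from TF Hub
embedding_layer = hub.KerasLayer(
    "https://tfhub.dev/google/nnlm-en-dim50/2",
    input_shape=[], dtype=tf.string, trainable=True)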

39. MODEL DEFINITIONS

Here, the neural network model is defined as a sequence of layers. It adds the previously defined embedding layer, a dense layer with 16 units and ReLU activation, and an output dense layer with a number of units equal to the number of classes, using softmax activation.
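
Putting the pieces together (a sketch, with num_classes taken from the label encoder, 30 categories after pruning and sampling):

num_classes = len(label_encoder.classes_)  # 30

model = tf.keras.Sequential([
    embedding_layer,
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.summary()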

Output:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 keras_layer (KerasLayer)    (None, 50)                48190600  
                                                                 
 dense (Dense)               (None, 16)                816       
                                                                 
 dense_1 (Dense)             (None, 30)                510       
                                                                 
=================================================================
Total params: 48191926 (183.84 MB)
Trainable params: 48191926 (183.84 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

40. MODEL COMPILATION

Here, the model is compiled for training. ‘adam’ is set as the optimizer, sparse categorical crossentropy is used as the loss function (appropriate for multi-class classification tasks with integer labels), and ‘accuracy’ is specified as the metric to be tracked.
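
In code, the compilation step is a one-liner:

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])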

Setting Up Early Stopping Callback

Early stopping is a form of regularization used to avoid overfitting by stopping training before the model learns the noise in the training data. It can also save time and computational resources by reducing unnecessary training epochs.

Here’s what each parameter is set to achieve:

  1. monitor='val_loss': This tells the callback to monitor the validation loss, which is used as the performance metric for deciding when to stop training.

  2. patience=3: This sets the number of epochs with no improvement after which training will be stopped. In this case, if the validation loss does not improve for three consecutive epochs, the training process is halted.

  3. verbose=1: This enables verbose output in the log, meaning a message is printed to the console when training is stopped early.

  4. restore_best_weights=True: This parameter ensures that once training is stopped, the model's weights are rolled back to those that achieved the lowest validation loss.
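
The callback with the parameters described above:

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, verbose=1, restore_best_weights=True)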

Model Training:

Here the training data and labels, the number of epochs (iterations over the entire dataset), and the batch size (number of samples processed before the model is updated) are specified. The model is validated on a separate dataset, and early stopping is used to halt training if there is no improvement in validation loss for several epochs. The start and end times are recorded to calculate and print the total training time.
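
A sketch of the training call; the batch size of 512 is an assumption, consistent with the 469 steps per epoch in the log below:

import time

start = time.time()
history = model.fit(
    X_train, y_train_enc,
    epochs=20,
    batch_size=512,                      # assumed batch size
    validation_data=(X_val, y_val_enc),
    callbacks=[early_stopping])
print("Time taken to train the model: ", time.time() - start, "seconds")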

Output:

Epoch 1/20
469/469 [==============================] - 259s 550ms/step - loss: 1.7662 - accuracy: 0.5722 - val_loss: 0.9420 - val_accuracy: 0.7590
Epoch 2/20
469/469 [==============================] - 263s 561ms/step - loss: 0.7405 - accuracy: 0.8078 - val_loss: 0.6894 - val_accuracy: 0.8194
Epoch 3/20
469/469 [==============================] - 262s 558ms/step - loss: 0.5204 - accuracy: 0.8646 - val_loss: 0.5832 - val_accuracy: 0.8476
Epoch 4/20
469/469 [==============================] - 262s 558ms/step - loss: 0.3912 - accuracy: 0.8997 - val_loss: 0.5237 - val_accuracy: 0.8654
Epoch 5/20
469/469 [==============================] - 261s 557ms/step - loss: 0.3015 - accuracy: 0.9241 - val_loss: 0.4932 - val_accuracy: 0.8781
Epoch 6/20
469/469 [==============================] - 261s 556ms/step - loss: 0.2346 - accuracy: 0.9430 - val_loss: 0.4777 - val_accuracy: 0.8861
Epoch 7/20
469/469 [==============================] - 259s 552ms/step - loss: 0.1825 - accuracy: 0.9568 - val_loss: 0.4723 - val_accuracy: 0.8927
Epoch 8/20
469/469 [==============================] - 257s 548ms/step - loss: 0.1415 - accuracy: 0.9682 - val_loss: 0.4822 - val_accuracy: 0.8943
Epoch 9/20
469/469 [==============================] - 257s 548ms/step - loss: 0.1086 - accuracy: 0.9769 - val_loss: 0.4972 - val_accuracy: 0.8979
Epoch 10/20
469/469 [==============================] - ETA: 0s - loss: 0.0824 - accuracy: 0.9836Restoring model weights from the end of the best epoch: 7.
469/469 [==============================] - 259s 552ms/step - loss: 0.0824 - accuracy: 0.9836 - val_loss: 0.5217 - val_accuracy: 0.8980
Epoch 10: early stopping
Time taken to train the model:  2599.22061085701 seconds

Training was stopped after the validation loss showed no improvement for three successive epochs; the graph of the performance during training is plotted below.

Output:

MODEL EVALUATION

The model.predict() method was used to generate predictions on the pre-processed test data, and the evaluation metrics were computed and printed.
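
A sketch of this evaluation step; weighted averaging is an assumption for the precision, recall, and F1 figures:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

probs = model.predict(X_test)
y_pred = probs.argmax(axis=1)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test_enc, y_pred, average='weighted')
print(f"Accuracy: {accuracy_score(y_test_enc, y_pred):.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")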

Output:

938/938 [==============================] - 136s 145ms/step
Accuracy: 0.8901
Precision: 0.8880
Recall: 0.8901
F1-score: 0.8883

HYPERPARAMETER TUNING

Hyperparameter tuning is a critical process in machine learning that involves finding the most optimal parameters for a model, enhancing its performance and accuracy. This process is essential for fine-tuning a model’s behavior to align with the specific characteristics and complexities of the data.

In this project, hyperparameter tuning is implemented for the text classification model using TensorFlow and Keras Tuner. The process is outlined below.

Model Builder Function (build_model): This function is designed to create the model. It takes hyperparameters as input and incorporates them into the model’s architecture. The hyperparameters being tuned are the number of units in the dense layer (hp_units) and the learning rate of the optimizer (hp_learning_rate). Additionally, an embedding layer is used, sourced from TensorFlow Hub.

Initializing the Tuner (RandomSearch): The RandomSearch tuner from Keras Tuner is used, which explores different combinations of hyperparameters randomly. The tuner is set to optimize for validation accuracy (val_accuracy), with a maximum of 5 trials and 1 execution per trial.

Early Stopping: An EarlyStopping callback is used during training to prevent overfitting. It monitors validation loss and stops the training if there’s no improvement, restoring the best weights achieved.

Training with Hyperparameter Tuning: The tuner.search method is called to start the training process. It trains the model on the training data, validating on a separate validation dataset, and uses the early stopping callback.
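
A sketch of the tuning set-up, with the search space inferred from the trial summaries below (units in steps of 16, learning rates of 1e-3, 1e-4, and 1e-5):

import keras_tuner as kt

def build_model(hp):
    hp_units = hp.Int('units', min_value=16, max_value=64, step=16)
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-3, 1e-4, 1e-5])

    model = tf.keras.Sequential([
        hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2",
                       input_shape=[], dtype=tf.string, trainable=True),
        tf.keras.layers.Dense(hp_units, activation='relu'),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    return model

tuner = kt.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=1,
    directory='my_dir',
    project_name='hyperparameter_tuning')

stop_early = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True)

tuner.search(X_train, y_train_enc,
             epochs=20,
             validation_data=(X_val, y_val_enc),
             callbacks=[stop_early])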

Output:

Trial 5 Complete [01h 23m 48s]
val_accuracy: 0.8161666393280029

Best val_accuracy So Far: 0.9069333076477051
Total elapsed time: 05h 26m 35s

Results and Best Model Selection: After the tuning process, tuner.results_summary() displays the best hyperparameters found. The best performing model is retrieved and evaluated on the test data to report its loss and accuracy.

Output

Results summary
Results in my_dir/hyperparameter_tuning
Showing 10 best trials
Objective(name="val_accuracy", direction="max")

Trial 1 summary
Hyperparameters:
units: 48
learning_rate: 0.001
Score: 0.9069333076477051

Trial 3 summary
Hyperparameters:
units: 64
learning_rate: 0.001
Score: 0.9067999720573425

Trial 2 summary
Hyperparameters:
units: 48
learning_rate: 0.0001
Score: 0.8416666388511658

Trial 4 summary
Hyperparameters:
units: 16
learning_rate: 0.0001
Score: 0.8161666393280029

Trial 0 summary
Hyperparameters:
units: 32
learning_rate: 1e-05
Score: 0.5175999999046326

Output:

938/938 [==============================] - 3s 3ms/step - loss: 0.4422 - accuracy: 0.9074
Test Loss: 0.44224512577056885
Test Accuracy: 0.9073666930198669

Saving the Best Model: The best model is saved in TensorFlow format for future use or deployment, for example:
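
best_model = tuner.get_best_models(num_models=1)[0]
best_model.save('nairaland_classifier')  # placeholder path; TensorFlow SavedModel format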

LOCAL DEPLOYMENT

To do this, a local workspace was set up in Visual Studio Code within the directory containing the saved model. A virtual environment named ‘classification’ was created and activated, encapsulating the project’s dependencies and ensuring an isolated execution environment.

With the environment prepared, essential Python libraries such as Streamlit, TensorFlow, TensorFlow Hub, NLTK, and Scikit-learn were installed to support the model and web application functionality. The app.py file was crafted to serve as the backbone of the web application. It was equipped with a text cleaning function, model loading routines, and a user interface for inputting and classifying news articles.
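
A stripped-down sketch of what such an app.py can look like (names and paths are illustrative; clean_text and the fitted label_encoder are assumed to be re-defined or reloaded from the training pipeline):

import streamlit as st
import tensorflow as tf
import tensorflow_hub as hub

@st.cache_resource
def load_model():
    # Pass hub.KerasLayer as a custom object in case the saved format requires it
    return tf.keras.models.load_model(
        'nairaland_classifier', custom_objects={'KerasLayer': hub.KerasLayer})

model = load_model()

st.title('NAIRALAND NEWS CLASSIFICATION')
title = st.text_input('Article title')
body = st.text_area('Article body')

if st.button('Classify'):
    text = clean_text(f'{title} {body}')               # clean_text as defined during training
    probs = model.predict(tf.constant([text]))
    category = label_encoder.inverse_transform([probs.argmax()])[0]
    st.write(f'Predicted category: {category}')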

Upon executing streamlit run app.py, a local web server was initiated, presenting a user-friendly interface through a web page on localhost. The interface, titled ‘NAIRALAND NEWS CLASSIFICATION’, prompted users to enter a news article’s title and body. Upon submission, the model processed the inputs, applying text cleaning and leveraging the trained classifier to predict the category of the news article, displaying the result in real-time.

This local deployment served as an instrumental step towards operationalizing the machine learning model, allowing for immediate and interactive predictions in a user-centric manner.

RECOMMENDED FUTURE WORK

My recommendations for future work to enhance the Nairaland News Classification project would be focused on scalability, robustness, and continuous improvement. Here are some suggestions:

  1. Model Versioning and Experiment Tracking: Implementing tools like MLflow or DVC for model versioning, experiment tracking, and reproducibility. This will enable the team to keep track of various experiments, model versions, and their corresponding performances.

  2. Automated Retraining Pipeline: Establishing an automated retraining pipeline that can periodically retrain the model with new data. This can be orchestrated using CI/CD tools like Jenkins or GitHub Actions, which would also include automated testing to ensure model quality.

  3. Advanced Model Monitoring: Setting up advanced model monitoring capabilities to track model drift, data quality issues, and prediction performance in production. Consider using tools like Prometheus, Grafana, or custom dashboards to visualize monitoring metrics.

  4. Feature Store Implementation: Building or integrating a feature store to manage, share, and reuse features across different models. This can help in maintaining consistency, speeding up the development of new models, and facilitating more complex analyses.

  5. Expand Model Interpretability: Incorporating model interpretability tools like SHAP or LIME to provide insights into the model’s decision-making process. This can help in identifying biases, improving model trustworthiness, and facilitating better decision-making.

  6. Ensemble and Multi-Model Approaches: Exploring ensemble methods or multi-model architectures to improve prediction performance and robustness. This might involve combining different types of models or using a microservices architecture to deploy multiple models.

  7. User Feedback Loop: Creating a user feedback system where predictions can be reviewed and corrected by users, with the corrected labels used to further train the model, enhancing its accuracy over time.

  8. Model Serving Enhancements: Considering the use of advanced model serving solutions like TensorFlow Serving or TorchServe that can provide more efficient model updates, A/B testing capabilities, and can scale according to request load.

  9. Cloud-Native Technologies: Leveraging cloud-native technologies such as Kubernetes for orchestrating containerized applications, enabling the system to be more resilient and scalable.

  10. Data Privacy and Security: Ensure data privacy and security best practices are in place, especially when handling user-generated content. This could involve data anonymization, secure data storage, and adherence to GDPR or other relevant regulations.

  11. Cross-Platform Deployment: Working on cross-platform deployment strategies, ensuring the model can be deployed to various environments, from cloud platforms to edge devices, increasing the model’s accessibility.

  12. Expand Language Support: Given Nairaland’s diverse user base, incorporating NLP models that better handle local dialects and pidgin, potentially using transfer learning with language models pre-trained on African languages and dialects.

These suggestions aim to ensure that the project not only maintains its relevance but also continues to evolve with technological advancements and user needs.

IF YOU ENJOY THIS ARTICLE, KINDLY SHARE

By Timothy Adegbola

Timothy Adegbola hails from the vibrant land of Nigeria but currently navigates the world of Artificial Intelligence as a postgraduate student in Manchester, UK. Passionate about technology and with a heart for teaching, he pens insightful articles and tutorials on data analysis, machine learning, AI, and the intricate dance of mathematics. And when he's not deep in tech or numbers? He's cheering for Arsenal! Connect and geek out with Timothy on Twitter and LinkedIn.
