Beyond the Numbers: How Machine Learning Differs from Theoretical Statistics

Machine learning has revolutionized the way we interact with data, enabling us to extract valuable insights from massive datasets that were previously considered too complex to analyze. Despite its impressive capabilities, some skeptics argue that machine learning is simply an extension of statistics, and that its perceived sophistication is nothing more than an illusion. In this article, I will explore why reducing machine learning to statistics falls short of capturing the true potential and power of this field, and why machine learning should be considered a distinct discipline in its own right.

A misleading meme depicting machine learning as ‘statistics in disguise’

Machine learning uses large amounts of data to learn patterns and make predictions. It is similar to statistics, but it typically deals with larger and more complex datasets. At that scale, some of the methods that traditional statistics uses to analyze data are not available to it, so it has to rely on other approaches. Machine learning is more like engineering than a purely theoretical field: it is about using real-world results to build better software and hardware.

But first, let’s take a look at the role statistics plays in a machine learning model that detects sarcasm in supplied text. The code is given below:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB


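# Load the dataset: one JSON record per line
data = pd.read_json("Sarcasm.json", lines=True)
print(data.head())

# Replace the numeric labels with readable class names
data["is_sarcastic"] = data["is_sarcastic"].map({0: "Not Sarcasm", 1: "Sarcasm"})
print(data.head())

# Keep only the headline text (input) and its label (target)
data = data[["headline", "is_sarcastic"]]
x = np.array(data["headline"])
y = np.array(data["is_sarcastic"])

# Turn each headline into a vector of word counts
cv = CountVectorizer()
X = cv.fit_transform(x)
# Hold out 20% of the headlines for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Train the classifier and print its accuracy on the held-out headlines
model = BernoulliNB()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))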

to_continue = 1
while to_continue == 1:
    user = input("Enter a Text: ")
    # Vectorize the input with the vocabulary learned during training
    features = cv.transform([user])
    output = model.predict(features)
    print(output)
    to_continue = int(input("Press '1' to continue or '0' to terminate the program: "))

The code above uses the pandas library to read a JSON file of sarcastic and non-sarcastic headlines and prepares the data for modeling. It then uses the scikit-learn library to create a Bernoulli Naive Bayes classifier and train it on the data. Finally, it allows the user to input their own text and the model predicts whether the text is sarcastic or not based on the previously trained model. The program runs until the user decides to terminate it.

The Bernoulli Naive Bayes classifier is a probabilistic algorithm mainly used in text classification problems, where the features are the presence or absence of a particular word in a document. In such problems, the algorithm calculates the probability that a document belongs to a certain class based on the presence or absence of particular words.
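To make the "presence or absence" idea concrete, here is a minimal sketch on a handful of toy headlines. The headlines and labels are invented for illustration and are not taken from the dataset above. Passing binary=True to CountVectorizer makes each feature a 1 (word present) or a 0 (word absent).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy headlines, invented for illustration only
headlines = [
    "area man wins argument with himself",
    "area man shocked water is wet",
    "council approves new school budget",
    "council opens new city library",
]
labels = ["Sarcasm", "Sarcasm", "Not Sarcasm", "Not Sarcasm"]

# binary=True records only whether a word occurs (1) or not (0)
vectorizer = CountVectorizer(binary=True)
features = vectorizer.fit_transform(headlines)

model = BernoulliNB()
model.fit(features, labels)

# Words the vectorizer has never seen (here "of") are simply ignored
new_headline = ["area man approves of himself"]
new_features = vectorizer.transform(new_headline)

prediction = model.predict(new_features)[0]
probabilities = dict(zip(model.classes_, model.predict_proba(new_features)[0]))
print(prediction, probabilities)
```

With so little data the probabilities mean very little, but the mechanics are the same as in the full program: each headline becomes a row of zeros and ones, and the classifier weighs which words are present and which are absent.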

The algorithm above is based on a statistical concept known as Bayes’ theorem, which states that the probability of a class given a set of features (denoted as P(C | F1, F2, …, Fn)) can be computed using the following formula:

P(C | F1, F2, …, Fn) = P(C) * P(F1, F2, …, Fn | C) / P(F1, F2, …, Fn)

where C is the class, and F1, F2, …, Fn are the features.
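The "naive" part of Naive Bayes is one extra assumption: given the class, the features are treated as independent of one another, so P(F1, F2, …, Fn | C) becomes the product of the individual P(Fi | C) terms. The sketch below applies the formula by hand under that assumption. The function name and all the probabilities are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods, observed):
    """Return P(C | F1..Fn) for every class C, assuming independent features.

    priors:      {class: P(C)}
    likelihoods: {class: [P(Fi = 1 | C) for each feature i]}
    observed:    list of 0/1 values, one per feature
    """
    joint = {}
    for cls, prior in priors.items():
        # Naive assumption: P(F1..Fn | C) is the product of per-feature terms
        per_feature = [
            p if value == 1 else 1 - p
            for p, value in zip(likelihoods[cls], observed)
        ]
        joint[cls] = prior * prod(per_feature)
    evidence = sum(joint.values())  # P(F1..Fn), the denominator of the formula
    return {cls: value / evidence for cls, value in joint.items()}


# Hypothetical numbers: two classes, two binary word features
priors = {"Sarcasm": 0.5, "Not Sarcasm": 0.5}
likelihoods = {"Sarcasm": [0.8, 0.1], "Not Sarcasm": [0.2, 0.4]}

# The first word is present, the second is absent
posterior = naive_bayes_posterior(priors, likelihoods, observed=[1, 0])
print(posterior)
```

The denominator is obtained by summing the numerator over all classes, which is why the resulting probabilities add up to 1.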

If you are not a former student of math, statistics, or engineering, don’t worry; I’ll explain it in a more beginner-friendly way. Imagine you have a box of chocolates in different colors: red, green, or blue. You want to know the chance of getting a blue chocolate if you pick one at random from the box.

Now suppose you learn something about the chocolate you picked, for example the color of its inside. Call that piece of evidence event A, and call "the chocolate is blue" event B. Bayes’ theorem says that the chance of B given A is equal to the chance of A given B, multiplied by the chance of B, divided by the chance of A. In symbols: P(B | A) = P(A | B) * P(B) / P(A).
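Here is a small worked example with made-up numbers. I take event A to be "the chocolate has a caramel filling", which is a hypothetical choice of evidence, and the counts in the box are invented. Using exact fractions makes it easy to check that Bayes’ theorem agrees with simply counting.

```python
from fractions import Fraction

# Hypothetical box: color -> (total chocolates, how many have a caramel filling)
box = {"red": (10, 2), "green": (6, 1), "blue": (4, 3)}

total = sum(count for count, _ in box.values())
caramel_total = sum(caramel for _, caramel in box.values())

p_b = Fraction(box["blue"][0], total)                    # P(B): chocolate is blue
p_a = Fraction(caramel_total, total)                     # P(A): caramel filling
p_a_given_b = Fraction(box["blue"][1], box["blue"][0])   # P(A | B)

# Bayes' theorem: P(B | A) = P(A | B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a

# Cross-check by counting: blue caramel chocolates out of all caramel ones
direct = Fraction(box["blue"][1], caramel_total)

print(p_b_given_a, direct)
```

Before looking at the filling, a blue chocolate has a 1 in 5 chance. Once you know the filling is caramel, the chance rises to 1 in 2, because most of the caramel chocolates in this box happen to be blue.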

An end-to-end machine learning project usually involves several steps, including:

Data collection: Collect a large dataset of sarcastic and non-sarcastic statements or sentences, along with their labels (i.e. “sarcastic” or “not sarcastic”). This dataset will be used to train the machine learning model.
Data pre-processing: Clean and pre-process the collected data to remove any irrelevant information and convert it into a format suitable for training the model.
Feature extraction: Extract relevant features from the pre-processed data, such as the use of certain words, tone of voice, or sentence structure.
Model selection: Select a suitable machine learning algorithm, such as a support vector machine (SVM), a decision tree, or a neural network, to train the sarcasm detection model.
Model training: Train the selected model on the pre-processed data using the extracted features, tuning its parameters and adjusting the algorithm as needed.
Model evaluation: Evaluate the performance of the trained model by testing it on a separate dataset, and measuring its accuracy, precision, recall, and other metrics.
Model deployment: Deploy the trained and evaluated model in sarcasm detection software, where it can be used to automatically detect sarcasm in real-world text data.
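The steps above, apart from deployment, can be sketched in a few lines with scikit-learn. This is only an illustration under stated assumptions: the eight sentences and their labels are invented, the cleaning step is deliberately trivial, and Bernoulli Naive Bayes stands in for whichever algorithm a real project would select.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Data collection: hypothetical labelled sentences, invented for illustration
texts = [
    "Oh great, another Monday morning meeting",
    "Wow, I just love being stuck in traffic",
    "Sure, because that plan worked so well last time",
    "Fantastic, my phone died right before the call",
    "The meeting starts at nine on Monday",
    "Traffic was light on the way to work",
    "The plan was approved by the committee",
    "My phone battery lasts about a day",
]
labels = ["sarcastic"] * 4 + ["not sarcastic"] * 4

# Data pre-processing: lowercase and collapse repeated whitespace
cleaned = [" ".join(text.lower().split()) for text in texts]

# Hold out a quarter of the data, keeping both classes in each split
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.25, random_state=42, stratify=labels
)

# Feature extraction and model selection, chained into one object
pipeline = make_pipeline(CountVectorizer(binary=True), BernoulliNB())

# Model training
pipeline.fit(X_train, y_train)

# Model evaluation on the held-out sentences
predictions = pipeline.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, predictions),
    "precision": precision_score(
        y_test, predictions, pos_label="sarcastic", zero_division=0
    ),
    "recall": recall_score(
        y_test, predictions, pos_label="sarcastic", zero_division=0
    ),
}
print(metrics)
```

Chaining the vectorizer and the classifier in one pipeline means the vocabulary is learned from the training split only, so the test sentences stay unseen until evaluation. With a dataset this small the scores themselves are not meaningful; the point is the shape of the workflow.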

Only one or two of those steps require a deep knowledge of statistical methods. Other technical skills required to complete such a project include software engineering and version control, data analysis and visualization, cloud computing, database management, and natural language processing, among others. Domain knowledge matters too. For instance, a data scientist or ML engineer working in healthcare may need a strong understanding of biology and medical research methods, while those working in finance may need expertise in financial modelling and risk management.

Conclusion

While statistics undoubtedly plays a critical role in machine learning, it is essential to recognize that machine learning is much more than just a set of statistical techniques. It is related to statistics but differs from it in several important ways, in the same sense that medicine is related to chemistry but cannot be reduced to chemistry. The power of machine learning lies in its ability to learn from data and improve its performance over time, making it a valuable asset in a world where data is increasingly important.

By Timothy Adegbola
