

Sarcasm Detection with Natural Language Processing

Intro to Neural Networks

Neural Networks are a class of machine learning models that can infer information from data, including text. They have evolved significantly over the years, becoming a cornerstone of many NLP tasks thanks to their ability to learn complex patterns and representations from large datasets.

How They Work

A Neural Network is a collection of interconnected nodes, each representing a simple equation. These nodes, or neurons, compute a weighted sum of their inputs and pass it through an activation function; the weights are adjusted during training.

During training, the network compares its output to a gold-standard labeled dataset to measure accuracy. It then propagates errors backward through the network (a process known as backpropagation) to update the weights and improve future predictions. This iterative process enables the network to learn from data. However, the features and weights a neural network learns are typically not interpretable by humans, which can be a limitation when trying to understand the model's decisions.
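To make this concrete, here is a minimal sketch of that loop for a single neuron (essentially logistic regression trained by gradient descent); the toy data, learning rate, and epoch count are illustrative, not values from our project:

```python
import numpy as np

# Toy data: four examples with two features each, and binary labels (an AND-like target).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
w = rng.normal(size=2)  # weights, adjusted during training
b = 0.0                 # bias term
lr = 0.5                # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):
    p = sigmoid(X @ w + b)            # forward pass: weighted sum through an activation
    error = p - y                     # compare output to the labeled "gold standard"
    w -= lr * (X.T @ error) / len(y)  # backward pass: push errors into weight updates
    b -= lr * error.mean()

print(np.round(sigmoid(X @ w + b), 2))  # predictions drift toward [0, 0, 0, 1]
```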

Neural Networks for Natural Language Processing

Neural Networks are particularly well-suited for NLP tasks because they can capture nuances and contextual relationships in text that simpler algorithms might miss. For instance, in tasks such as sentiment analysis, translation, and text generation, neural networks, especially deep learning models like Recurrent Neural Networks (RNNs) and Transformers, have shown remarkable performance.

Text Classification

Text classification involves tagging text with predefined labels. It is a fundamental NLP task that provides a basis for more complex operations. By using pre-tagged datasets, we can train models and easily verify their accuracy, making it a straightforward entry point into NLP applications.
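As a small illustration of this entry point, the sketch below trains a classifier on a handful of invented, pre-tagged headlines (1 = sarcastic, 0 = not) with scikit-learn; every string and label here is made up for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny pre-tagged dataset; the labels are invented for illustration.
texts = [
    "oh great, another monday",
    "stocks rose two percent today",
    "wow, what a totally shocking result",
    "the city council met on tuesday",
]
labels = [1, 0, 1, 0]

# Vectorize the text and fit a classifier in one pipeline.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Because the data is pre-tagged, accuracy is easy to verify.
print(model.score(texts, labels))
```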

Sarcasm Detection

Project Members

Alexander Sitzman

Chris Farrer

Maxwell Schultz

Problem Statement

We are attempting to build a set of models that can accurately predict whether a news headline is sarcastic. We are finding ways to effectively normalize and vectorize the data before finding optimal weights for a variety of ML models that solve this binary classification problem.

Ethical and Social Impact

This project has the potential to deepen our linguistic understanding of subtle pragmatic structures such as sarcasm. It also demonstrates the potential of ML machinery paired with NLP methodologies, and how real social phenomena can be modeled with them.

Expected Learning Outcome

We expect to learn more about the reliability of the given dataset, the difficulty of the problem, and the suitability of various models for this type of problem. We also expect to better understand the process of setting up these models and to gain deeper insight into the methodology of hyperparameter tuning.

Datasets

News Headlines Dataset For Sarcasm Detection (kaggle.com)

A collection of over 28,000 news headlines from real and satirical news outlets. Each data point is tagged as either sarcastic or non-sarcastic and includes a link to the original article. The advantage of this dataset is that it is self-contained: unlike a tweet corpus, its data points do not rely on connections to previous posts for context.

Sarcasm detection (kaggle.com)

Another popular dataset from Kaggle is the Sarcasm Detection dataset. It provides similar examples of headlines tagged as sarcastic or not sarcastic. This dataset is newer and claims to be smaller and more refined. It may be interesting to compare results between the two datasets.

These two datasets cannot both be used at the same time, as they may contain overlapping data. The strongest approach is to split each dataset into a training set and a test set, then evaluate the effectiveness of each dataset independently.

Technical Approach

To accurately predict sarcasm in news headlines, we will implement and compare multiple models, including both traditional machine learning approaches and neural network approaches. The process will include the following steps:

Data Preprocessing

Normalization: Standardizing the text data by converting all headlines to lowercase, removing punctuation, and stemming/lemmatizing words to ensure uniformity (both preprocessing steps are sketched after this list).

Vectorization: Transforming text data into numerical vectors using techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings like Word2Vec.
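A minimal sketch of both steps, assuming scikit-learn and a simple regex cleaner (stemming/lemmatization would be layered on top, e.g. with NLTK, and Word2Vec embeddings with gensim):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder headlines; the real runs use the Kaggle data.
headlines = [
    "Area Man Wins Lottery, Loses Ticket",
    "Scientists Discover Water Is Wet",
]

def normalize(text):
    text = text.lower()                  # lowercase everything
    return re.sub(r"[^\w\s]", "", text)  # strip punctuation

cleaned = [normalize(h) for h in headlines]

# TF-IDF turns each headline into a sparse numerical vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)
print(X.shape)  # (number of headlines, vocabulary size)
```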

Model Selection and Training

Traditional Machine Learning Models: We will implement models such as Logistic Regression and Support Vector Machines (SVM) to establish baseline performance metrics (a minimal baseline sketch follows this list).

Neural Networks: We will explore more advanced models like Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
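A compact baseline sketch with scikit-learn; the headlines and labels below are placeholders for the real Kaggle data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder labeled headlines (1 = sarcastic, 0 = not).
texts = [
    "local man heroically finishes entire sandwich",
    "senate passes budget bill after long debate",
    "area cat elected mayor in landslide",
    "rainfall expected across the region this weekend",
]
labels = [1, 0, 1, 0]

# Fit each baseline on TF-IDF features and report training accuracy.
for clf in (LogisticRegression(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.score(texts, labels))
```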

Model Evaluation

Performance Metrics: Evaluating models using accuracy, precision, recall, and F1 score, as illustrated below.
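All four metrics come straight from scikit-learn; the labels and predictions below are hypothetical:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and model predictions on a test set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```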

Comparative Analysis

We will compare the performance of different models to determine the most effective approach for sarcasm detection. Additionally, we will analyze the differences in model performance between the two datasets to gain insights into how dataset characteristics influence model efficacy.

Our Findings

After examining our data, we discovered that our dataset was flawed and is more accurately described as a fake news detection dataset than a sarcasm detection dataset. Consequently, we adjusted our goal to focus on detecting the validity of news articles.

Updated Goal: Detecting Fake News

After realizing the limitations of our initial dataset for sarcasm detection, we pivoted our model's objective to determining the validity of news articles and acquired a new dataset tagged with fake or real news, which provides a more accurate context for our analysis. The original dataset used headlines from The Onion, a satirical news website, as its sarcastic data points, but we concluded that this labeling was flawed. Satire, while similar to sarcasm, is more akin to fake news because it presents a fabricated reaction to a fictional context rather than a sarcastic reaction to a real one. A satirical headline paired with a fake article reads as a genuine response rather than a sarcastic one.

To build a true sarcasm dataset, the data would need to come from a different source, such as forum posts marked with a "/s" suffix, a common tag online users append to denote sarcasm. Data collected in this manner could provide the context needed for a functional sarcasm detection model.
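A rough sketch of what that collection step might look like; the comments list is a stand-in for real scraped forum data, and the important detail is stripping the marker so a model cannot simply learn to spot "/s":

```python
# Hypothetical scraped forum comments; "/s" marks intended sarcasm.
comments = [
    "yeah, because that always works /s",
    "the game releases next friday",
    "oh sure, politicians never exaggerate /s",
]

dataset = []
for text in comments:
    text = text.rstrip()
    is_sarcastic = text.endswith("/s")
    # Remove the marker so the model must rely on the language itself.
    cleaned = text.removesuffix("/s").rstrip()
    dataset.append({"text": cleaned, "label": int(is_sarcastic)})

print(dataset)
```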

As a result, we decided to pivot our current dataset focus to satire and fake news detection for the purposes of this model. We plan to revisit and reevaluate sarcasm detection in the future with a more appropriate dataset.

Data Splitting:

We split our Kaggle dataset into 85% training and 15% test sets, and further split the training set into 85% training and 15% validation sets. This approach gives us a validation set for hyperparameter tuning and an untouched test set for measuring the final model's performance.
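With scikit-learn, this two-stage split is two calls to train_test_split; the placeholder data below stands in for the Kaggle CSV:

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the Kaggle headlines and labels.
headlines = [f"headline {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First split: 85% train+validation, 15% untouched test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    headlines, labels, test_size=0.15, random_state=42)

# Second split: 85% train, 15% validation for hyperparameter tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```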

Models Implemented:

Neural Bag of Words: Achieved 96.2% accuracy on the test set (a minimal sketch of this architecture appears after this list).

Recurrent Neural Network.

Transformer Model: Achieved 99.2% accuracy on the test set.

Model Performance: The Transformer model was the most accurate, correctly identifying real and fake news articles. However, it still showed some blind spots, particularly in detecting satire.
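For reference, here is a minimal sketch of the neural bag-of-words architecture in PyTorch; the vocabulary size, embedding dimension, and token IDs are illustrative rather than our tuned values:

```python
import torch
import torch.nn as nn

class NeuralBOW(nn.Module):
    """Average word embeddings, then classify with a linear layer."""
    def __init__(self, vocab_size=20000, embed_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        pooled = self.embedding(token_ids, offsets)  # one vector per headline
        return self.fc(pooled)                       # logits: real vs. fake

model = NeuralBOW()
# Two headlines packed into one flat tensor; offsets mark where each starts.
token_ids = torch.tensor([3, 17, 42, 7, 101, 5])
offsets = torch.tensor([0, 3])
print(model(token_ids, offsets).shape)  # torch.Size([2, 2])
```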

Key Takeaways:

Sarcasm detection requires highly nuanced data. In practice, the language used in sarcasm can be quite similar to lies. What often separates sarcasm from lies is the delivery, which is challenging to convey through text alone. The only reliable indicator is the context provided, which was inadequate in our datasets.

Fake news detection also faces significant challenges due to the lack of real cultural context within the model. A fake news article can appear real to someone unfamiliar with the subject matter. This limitation can lead to false positives or negatives in the model's predictions.

Our Transformer model demonstrated high accuracy but had notable blind spots. For instance, it failed to identify the satirical nature of an article titled "Nation Forgets What It Was They Didn’t Like About O.J. Simpson." The article humorously suggested that 340 million Americans suddenly forgot why they disliked O.J. Simpson, portraying a fictional collective amnesia about his well-known criminal history. The model's inability to recognize the satirical context highlights its limitation in understanding deeper cultural and historical nuances.

References

News Headlines Dataset For Sarcasm Detection

Sarcasm Detection

Fake News Detection

The Onion

Thanks for reading!!!

Alexander Sitzman

Author