Building a Sentiment Analysis Tool with Python

Sentiment analysis is one of the most practical applications of natural language processing. Businesses use it to monitor brand perception on social media, analyze customer reviews, and gauge public opinion on products and policies. In this tutorial, we will build a complete sentiment analysis pipeline using Python and NLTK (Natural Language Toolkit). We will cover text preprocessing, tokenization, and the VADER sentiment analyzer, which is specifically tuned for social media text and handles slang, emojis, and informal language remarkably well.

Setting Up NLTK and Required Data

Start by installing NLTK and downloading the necessary data packages. NLTK ships as a lightweight library, with corpora and models available as separate downloads:

pip install nltk pandas matplotlib

With the packages installed, download the data packages from a Python session:

import nltk

# Download required NLTK data
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("vader_lexicon")
nltk.download("wordnet")

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

print("NLTK setup complete!")

The vader_lexicon contains the sentiment scores that VADER uses. The punkt tokenizer models handle sentence and word boundary detection, while stopwords provides common words like “the”, “is”, and “at” that we can filter out during preprocessing.


Text Preprocessing Pipeline

Raw text is messy. Before analyzing sentiment, we need to clean and normalize it. A solid preprocessing pipeline handles case normalization, punctuation removal, tokenization, stopword removal, and lemmatization:

import re
from typing import List

class TextPreprocessor:
    def __init__(self):
        self.stop_words = set(stopwords.words("english"))
        # Keep negation words -- they flip sentiment
        self.stop_words -= {"not", "no", "nor", "neither",
                            "never", "nobody", "nothing"}
        self.lemmatizer = WordNetLemmatizer()

    def clean_text(self, text: str) -> str:
        """Remove URLs, mentions, special characters."""
        text = re.sub(r"http\S+|www\S+", "", text)    # URLs
        text = re.sub(r"@\w+", "", text)               # @mentions
        text = re.sub(r"#(\w+)", r"\1", text)           # keep hashtag text
        text = re.sub(r"[^a-zA-Z\s!?.]", "", text)     # keep !, ?, .
        text = re.sub(r"\s+", " ", text).strip()
        return text

    def tokenize(self, text: str) -> List[str]:
        """Split text into individual word tokens."""
        return word_tokenize(text.lower())

    def remove_stopwords(self, tokens: List[str]) -> List[str]:
        """Remove common words that don't carry sentiment."""
        # The length filter would silently drop "no", so exempt it
        return [t for t in tokens if t not in self.stop_words
                and (len(t) > 2 or t == "no")]

    def lemmatize(self, tokens: List[str]) -> List[str]:
        """Reduce words to their base form."""
        return [self.lemmatizer.lemmatize(t) for t in tokens]

    def preprocess(self, text: str) -> str:
        """Full preprocessing pipeline."""
        cleaned = self.clean_text(text)
        tokens = self.tokenize(cleaned)
        tokens = self.remove_stopwords(tokens)
        tokens = self.lemmatize(tokens)
        return " ".join(tokens)


# Demonstrate the pipeline
preprocessor = TextPreprocessor()

sample = "@TechCo I absolutely LOVE the new update!!! #amazing https://t.co/abc"
print(f"Original:     {sample}")
print(f"Preprocessed: {preprocessor.preprocess(sample)}")

Notice that we deliberately exclude negation words from the stopword set so they survive filtering. Words like "not" and "never" completely flip the sentiment of a sentence; removing them would cause "not happy" to be scored the same as "happy", which would be a critical error.

Sentiment Analysis with VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analyzer that is particularly effective on short, informal text. It produces four scores: positive, negative, neutral, and a compound score that summarizes overall sentiment on a scale from -1 (most negative) to +1 (most positive):

class SentimentAnalyzer:
    def __init__(self):
        self.analyzer = SentimentIntensityAnalyzer()

    def analyze(self, text: str) -> dict:
        """Analyze sentiment of a single text."""
        # Pass raw text to VADER: capitalization, punctuation, and
        # emoticons all carry signal, so we skip preprocessing here
        scores = self.analyzer.polarity_scores(text)

        # Classify based on compound score thresholds
        compound = scores["compound"]
        if compound >= 0.05:
            label = "positive"
        elif compound <= -0.05:
            label = "negative"
        else:
            label = "neutral"

        return {
            "text": text,
            "label": label,
            "compound": compound,
            "positive": scores["pos"],
            "negative": scores["neg"],
            "neutral": scores["neu"]
        }

    def analyze_batch(self, texts: List[str]) -> List[dict]:
        """Analyze sentiment for a list of texts."""
        return [self.analyze(text) for text in texts]


# Test with various examples
sa = SentimentAnalyzer()

test_texts = [
    "This product is absolutely wonderful! Best purchase I ever made.",
    "Terrible customer service. I waited 3 hours and got no help.",
    "The package arrived on Tuesday as expected.",
    "I'm not happy with the quality, but the price was fair.",
    "AMAZING!!! This exceeded all my expectations :)",
    "Meh, it's okay I guess. Nothing special."
]

for text in test_texts:
    result = sa.analyze(text)
    print(f"[{result['label']:>8}] ({result['compound']:>6.3f}) {text}")

VADER handles capitalization (which amplifies sentiment), exclamation marks (which increase intensity), and common emoticons and slang. The compound score is computed by summing the valence scores of each word, adjusting for rules like negation and intensifiers, and normalizing to the -1 to +1 range.

Building a Complete Analysis Pipeline

Let us combine everything into a pipeline that can analyze a dataset of reviews and produce a summary report:

import pandas as pd

class SentimentPipeline:
    def __init__(self):
        self.analyzer = SentimentAnalyzer()

    def analyze_reviews(self, reviews: List[str]) -> pd.DataFrame:
        """Analyze a batch of reviews and return a DataFrame."""
        results = self.analyzer.analyze_batch(reviews)
        df = pd.DataFrame(results)
        return df

    def generate_report(self, df: pd.DataFrame) -> dict:
        """Generate a summary report from analyzed reviews."""
        total = len(df)
        label_counts = df["label"].value_counts()

        report = {
            "total_reviews": total,
            "positive_count": label_counts.get("positive", 0),
            "negative_count": label_counts.get("negative", 0),
            "neutral_count": label_counts.get("neutral", 0),
            "positive_pct": label_counts.get("positive", 0) / total * 100,
            "negative_pct": label_counts.get("negative", 0) / total * 100,
            "avg_compound": df["compound"].mean(),
            "most_positive": df.loc[df["compound"].idxmax(), "text"],
            "most_negative": df.loc[df["compound"].idxmin(), "text"],
        }

        return report

    def run(self, reviews: List[str]) -> None:
        """Execute the full pipeline and print results."""
        df = self.analyze_reviews(reviews)
        report = self.generate_report(df)

        print("=" * 60)
        print("SENTIMENT ANALYSIS REPORT")
        print("=" * 60)
        print(f"Total reviews analyzed: {report['total_reviews']}")
        print(f"Positive: {report['positive_count']} ({report['positive_pct']:.1f}%)")
        print(f"Negative: {report['negative_count']} ({report['negative_pct']:.1f}%)")
        print(f"Neutral:  {report['neutral_count']}")
        print(f"Average compound score: {report['avg_compound']:.3f}")
        print(f"\nMost positive: \"{report['most_positive']}\"")
        print(f"Most negative: \"{report['most_negative']}\"")


# Run the pipeline
reviews = [
    "Absolutely love this app! The interface is clean and intuitive.",
    "Crashed three times today. Uninstalling immediately.",
    "Works as described. Does what it needs to do.",
    "The new update ruined everything. Bring back the old version!",
    "Best tool I've found for project management. Highly recommend!",
    "Customer support was friendly but couldn't resolve my issue.",
    "Downloaded yesterday. Pretty decent so far, no complaints.",
    "This is a game changer for our team's productivity!"
]

pipeline = SentimentPipeline()
pipeline.run(reviews)

The pipeline processes each review through the VADER analyzer, collects results into a pandas DataFrame for easy manipulation, and generates a summary report with distribution statistics and the most extreme reviews. In a production setting, you would read reviews from a database or API and store results for dashboarding.

Sentence-Level Analysis for Nuanced Reviews

Some reviews contain mixed sentiment. Analyzing at the sentence level gives a more nuanced understanding:

def analyze_by_sentence(text: str) -> List[dict]:
    """Break a review into sentences and analyze each one."""
    sa = SentimentAnalyzer()
    sentences = sent_tokenize(text)
    results = []

    for sentence in sentences:
        result = sa.analyze(sentence)
        results.append(result)

    return results


mixed_review = ("The build quality is excellent and feels premium. "
                "However, the battery life is disappointing. "
                "It barely lasts 4 hours. "
                "The camera makes up for it though with stunning photos.")

print(f"Full review: {mixed_review}\n")
print("Sentence-level breakdown:")
for result in analyze_by_sentence(mixed_review):
    print(f"  [{result['label']:>8}] ({result['compound']:>6.3f}) {result['text']}")

This approach reveals that a review with an overall neutral compound score might actually contain strongly positive and strongly negative sentences. This granularity is valuable for product teams who want to know exactly what customers love and what they dislike.

Conclusion

We have built a complete sentiment analysis pipeline from text preprocessing through analysis to reporting. VADER is an excellent choice for quick, rule-based analysis that works well on social media and review text without needing any training data. For more demanding applications where you need higher accuracy on domain-specific text, consider fine-tuning a transformer model like BERT or using the Hugging Face transformers library. But for many practical use cases, VADER combined with good preprocessing delivers reliable results with minimal setup and zero training time.
