Summary of "Text Representation | NLP Lecture 4 | Bag of Words | Tf-Idf | N-grams, Bi-grams and Uni-grams"
Overview / main goal
The lecture explains why and how we convert raw text into numeric features for machine learning (feature extraction / text representation). It covers basic techniques — one‑hot encoding, bag‑of‑words (count vectors), n‑grams, and TF‑IDF — plus custom/hand‑crafted features. Word embeddings (Word2Vec and other deep‑learning embeddings) are presented as the next topic. The IMDB 50k movie‑reviews dataset is used as a running example.
Why text representation is needed
- Machine learning models accept numeric input, not raw text. Converting text into numeric vectors is required to feed algorithms.
- Good feature representations matter more than algorithmic complexity: Garbage In → Garbage Out. Features should preserve the semantic signal needed for the task (e.g., sentiment).
Desiderata for a good text representation
A useful text representation should:
- Produce fixed-size numeric vectors suitable for model input.
- Preserve meaningful information (semantics, important signals) as much as possible.
- Be robust to vocabulary size, out‑of‑vocabulary words, and avoid becoming unworkably sparse or huge.
Techniques explained
1) One‑hot encoding (token level)
What it is:
- Each unique token in the corpus is a separate dimension. For a document, set the dimension(s) corresponding to tokens that appear to 1 (or to counts if desired).
How to create:
- Build the vocabulary = set of unique tokens across the corpus.
- For each document, create a vector of length |vocab| and set indices for tokens to 1 (or increment for multiple occurrences).
Pros:
- Very simple and intuitive; easy to implement.
Cons:
- Extremely high dimensional and sparse for real corpora.
- Doesn’t capture similarity between words (no semantics).
- Vocabulary / vector length can change as new words appear (unless fixed).
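The steps above can be sketched in plain Python on a toy corpus (the token names and documents here are illustrative, not from the lecture):

```python
# A minimal sketch of token-level one-hot encoding over a toy corpus.
corpus = ["the movie was good", "the movie was bad"]

# Vocabulary = sorted set of unique tokens across the corpus.
vocab = sorted({tok for doc in corpus for tok in doc.split()})
index = {tok: i for i, tok in enumerate(vocab)}

def one_hot(doc):
    """Return a presence/absence vector of length |vocab| for one document."""
    vec = [0] * len(vocab)
    for tok in doc.split():
        vec[index[tok]] = 1
    return vec

print(vocab)                          # ['bad', 'good', 'movie', 'the', 'was']
print(one_hot("the movie was good"))  # [0, 1, 1, 1, 1]
```

Note how every document becomes a |vocab|-length vector, which is exactly why the representation explodes in dimensionality on a real corpus.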
2) Bag‑of‑Words (BoW) / Count vectors
What it is:
- Represent each document as a vector of token counts (word order is ignored).
Key intuition:
- Documents with similar word-frequency patterns are likely similar in class/content.
How to create (scikit‑learn style):
- Preprocess text (lowercase, optional stopword removal, basic cleaning).
- Build vocabulary from the training corpus.
- For each document, count occurrences of each vocabulary token → vector of counts.
- Use `CountVectorizer.fit` on training text and `transform` on new text (unknown tokens are ignored).
Important configurable parameters (typical in CountVectorizer):
- `binary=True` → presence/absence (0/1) instead of counts.
- `ngram_range=(min_n, max_n)` → include n‑grams.
- `max_df` / `min_df` / `max_features` → limit or filter the vocabulary.
- `lowercase`, `stop_words`, etc.
Pros:
- Intuitive and widely used; strong baseline for many classification tasks.
- Fixed output shape after fitting vocabulary (stable model input size).
Cons:
- Ignores word order and context.
- Large vocabularies → high dimensional, sparse features → more computational cost and risk of overfitting.
- Unknown words at prediction time are ignored (information loss).
3) N‑grams (bigrams, trigrams, etc.)
What it is:
- Extend the token vocabulary to include contiguous sequences of n tokens (e.g., bigrams capture “not good”).
Why use n‑grams:
- Capture some local word order and short phrases; help disambiguate cases where single tokens lose meaning (e.g., “not good” vs “good”).
How to create:
- Use `ngram_range` in `CountVectorizer`/`TfidfVectorizer`, or generate n‑gram tokens manually. The vocabulary then contains unigrams and/or n‑grams depending on the range.
Pros:
- Better captures phrase‑level information and short local order; often improves classification (helps with negation and fixed phrases).
Cons:
- Vocabulary size increases rapidly with n, increasing sparsity and computational cost.
- Still cannot capture long‑range dependencies or deep semantics.
- OOV problem remains for unseen n‑grams at prediction time.
4) TF‑IDF (Term Frequency — Inverse Document Frequency)
What it is:
- A weighting scheme: TF (term frequency in a document) multiplied by IDF (inverse document frequency across the corpus) to downweight very common terms and upweight discriminative (rare) terms.
Formulas:
- TF(t, d) = (count of term t in document d) / (total number of terms in document d)
- IDF(t) = log(N / df(t)) where N = total number of documents and df(t) = number of documents containing term t
- Smoothing variants are common; e.g., scikit‑learn's default (`smooth_idf=True`) uses IDF(t) = log((1 + N) / (1 + df(t))) + 1, which avoids division by zero and zero IDF weights.
- TF‑IDF(t, d) = TF(t, d) * IDF(t)
Why the log?
- Log smooths IDF values so extremely rare terms do not produce excessively large weights; it compresses wide ranges into manageable scales.
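The unsmoothed formulas above can be computed directly on a toy corpus (the documents are illustrative, not from the lecture):

```python
import math

# Toy corpus of pre-tokenized documents.
docs = [
    ["good", "movie", "good"],   # d0
    ["bad", "movie"],            # d1
    ["good", "plot"],            # d2
]
N = len(docs)

def tf(term, doc):
    # Term frequency: count in the document, normalized by document length.
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log of (total docs / docs containing the term).
    df = sum(1 for d in docs if term in d)
    return math.log(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(round(idf("movie"), 3))            # in 2 of 3 docs -> log(3/2) ≈ 0.405
print(round(idf("plot"), 3))             # in 1 of 3 docs -> log(3/1) ≈ 1.099
print(round(tfidf("good", docs[0]), 3))  # (2/3) * log(3/2) ≈ 0.27
```

The common term "movie" gets a low weight while the rarer "plot" gets a high one, which is exactly the downweighting/upweighting behavior described above.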
How to compute:
- Fit a `TfidfVectorizer` on the training corpus (it computes IDF per vocabulary term and transforms documents into TF‑IDF vectors).
- Transform new documents using the fitted vectorizer (unknown terms are ignored).
- Key parameters: `use_idf`, `smooth_idf`, `sublinear_tf` (e.g., use log(1 + tf)), `ngram_range`, `max_df`/`min_df`, `norm`.
Pros:
- Emphasizes discriminative terms; commonly used in information retrieval and many classification tasks.
Cons:
- Still high dimensional and sparse for large vocabularies.
- Does not solve synonymy or semantic similarity between words.
- OOV words are ignored at prediction time.
5) Custom / hand‑crafted features
What they are:
- Task-specific numeric features engineered from text using domain knowledge.
Examples:
- Counts of positive/negative words (using a sentiment lexicon).
- Number of exclamation marks, uppercase words, emoticons.
- Average word length, document length, punctuation counts.
- Frequency of specific keywords (e.g., “excellent”, “terrible”).
How to use:
- Compute these features per document and concatenate them with vectorized features (BoW / TF‑IDF) to create hybrid feature vectors.
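The concatenation step can be sketched with `scipy.sparse.hstack`; the two hand-crafted features here (exclamation count, all-caps word count) are hypothetical examples, not the lecture's exact choices:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Great movie!!!", "terrible plot", "GOOD GOOD GOOD"]

# Hypothetical hand-crafted features: exclamation count and all-caps word count.
def custom_features(doc):
    return [doc.count("!"), sum(w.isupper() for w in doc.split())]

tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(docs)
X_custom = csr_matrix(np.array([custom_features(d) for d in docs], dtype=float))

# Concatenate columns: TF-IDF features followed by the custom features.
X = hstack([X_text, X_custom])
print(X.shape)  # (n_docs, |vocab| + 2)
```

Keeping everything sparse via `hstack` avoids densifying the (typically huge) TF‑IDF matrix just to append a couple of extra columns.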
Pros:
- Can capture signals not well represented by BoW/TF‑IDF; often improves model performance.
Cons:
- Require domain knowledge and manual effort; may not generalize across domains.
Practical pipeline placement and preprocessing tips
Preprocessing (before vectorization):
- Lowercase, tokenize, optional stopword removal, punctuation stripping, optional stemming/lemmatization depending on the task.
Typical workflow:
- Fetch data (e.g., IMDB reviews).
- Preprocess text.
- Choose a representation: `CountVectorizer` (BoW) or `TfidfVectorizer` (TF‑IDF) with a chosen `ngram_range` and other params, or custom features + a vectorizer.
- Fit the vectorizer on training data → learn vocabulary and IDF.
- Transform training and test sets into fixed-size numeric matrices.
- Train ML model and evaluate.
Handling OOV / vocabulary mismatch:
- During prediction, new tokens not in the training vocabulary are ignored (no numeric column exists).
- To mitigate unknown tokens, consider subword features, character n‑grams, the hashing trick, or switch to embeddings.
Implementation notes and scikit‑learn pointers
- Use `CountVectorizer` for BoW / counts; `TfidfVectorizer` for TF‑IDF.
- Important parameters to try: `ngram_range`, `binary=True`, `max_df`/`min_df`, `max_features`, `stop_words`, `lowercase`, `smooth_idf`, `sublinear_tf`.
- Common practice: fit the vectorizer on the training set and reuse it for test/prediction so the feature space remains fixed.
Advantages / disadvantages summary (by technique)
- One‑hot
- Advantage: simple, interpretable.
- Disadvantage: extremely high dimensional, sparse; no semantics; vocabulary growth and OOV problems.
- Bag‑of‑Words (Count)
- Advantage: simple, strong baseline for classification; fixed-size after fitting; easy to implement.
- Disadvantage: ignores order/context; high dimensionality and sparsity; OOV ignored.
- N‑grams
- Advantage: capture local order and phrases (helps with negation and phrase meaning).
- Disadvantage: dramatically increases vocabulary size and sparsity; still not semantic.
- TF‑IDF
- Advantage: downweights common words and highlights discriminative ones; valuable in IR and classification.
- Disadvantage: still sparse, high dimensional; does not capture synonyms or semantics.
Overall: classical methods do not capture deep semantics (synonymy, long‑range dependencies). Embeddings and deep learning approaches address many of these shortcomings.
Assignments (practical exercises with IMDB 50k)
Tasks to practice / deliver:
- Apply preprocessing (lowercasing, tokenization, optional stopword removal/cleaning).
- Compute corpus statistics: total number of tokens across corpus and vocabulary size (unique tokens).
- Implement one‑hot / BoW (`CountVectorizer`) and examine the vocabulary and feature matrix shape.
- Build n‑gram vocabularies (e.g., unigrams vs. bigrams) and compare vocabulary sizes and computational impact.
- Compute TF‑IDF vectors, print IDF values for terms and inspect/interpret them (try smoothing variants).
- Optionally implement parts of the logic manually (compute TF and IDF by code rather than relying solely on libraries) to deepen understanding.
Objective: verify how vocabulary size grows, see differences between unigrams / n‑grams / TF‑IDF, and understand practical parameter effects.
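The corpus-statistics part of the assignment can be sketched on a toy corpus (substitute the IMDB reviews for the two sample strings):

```python
# Minimal sketch of computing total token count and vocabulary size.
corpus = [
    "A great movie with a great cast",
    "A dull movie",
]

tokens = [tok for doc in corpus for tok in doc.lower().split()]
total_tokens = len(tokens)       # every token occurrence counts
vocab_size = len(set(tokens))    # unique tokens only

print(total_tokens)  # 10
print(vocab_size)    # 6
```

On the real 50k-review dataset these two numbers make the gap between corpus size and vocabulary size, and hence the sparsity of BoW vectors, concrete.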
Other practical remarks emphasized
- Vectorizer vocabulary size can explode on real datasets → increased sparsity and computational cost.
- Many hyperparameters (`binary`, `ngram_range`, `max_df`/`min_df`, `max_features`, smoothing) materially affect behavior; experiment with them.
- For sentiment analysis, binary presence features sometimes work well; TF‑IDF is commonly used; n‑grams help with negation/phrases.
- Combining custom features with vectorizers often yields better real‑world results than any single technique alone.
What comes next
The next lecture covers word embeddings and deep learning approaches (Word2Vec, contextual embeddings) to better capture semantics and long‑range relations.
Speakers / sources featured
- Lecture / instructor (YouTube channel presenting the lecture)
- IMDB 50k movie reviews dataset (used as the running example)
- scikit‑learn vectorizers: `CountVectorizer`, `TfidfVectorizer`
- Upcoming topics / sources: Word2Vec and other embedding / deep‑learning approaches