Summary of "Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 1 - Intro and Word Vectors"
Course Introduction and Logistics
- The course is popular and well-attended.
- Instructor: Christopher Manning; Head TA absent due to health; several TAs present.
- Main communication and resources are on the course website and Ed discussion board.
- First assignment is a warm-up, already live, due next Tuesday.
- Office hours start immediately; a Python and NumPy tutorial is offered for those needing a refresher.
- Course grading:
  - Four assignments (about 50% of the grade).
  - Final project (custom or default option, ~50%).
  - Participation counts for a small percentage.
- Collaboration policy emphasizes doing your own work; AI tools can be used for assistance but not for answering assignment questions outright.
- Final project mentors are either found by students themselves or assigned by TAs, who balance expertise and student distribution.
Course Learning Goals
- Foundations and Current Methods of Deep Learning for NLP:
  - Start with basics such as word vectors, feed-forward neural networks, recurrent networks, and attention.
  - Move on to Transformers, encoder-decoder models, large language models, pre-/post-training, adaptation, interpretability, and agents.
- Understanding Human Language and Its Challenges:
  - Convey linguistic concepts and the difficulties computers face in understanding and generating language.
- Building Practical NLP Systems:
  - Equip students to build real-world NLP applications (e.g., text classification, information extraction).
Human Language and Its Role
- Humans are distinguished from other primates primarily by language.
- Language enables communication and advanced thought/planning.
- Writing (~5,000 years old) allowed knowledge sharing across time and space, accelerating human progress.
- Language is flexible, socially used, imprecise, and constantly evolving—especially among younger generations.
- Quote from Herb Clark: Language use is more about people and intentions than just words and meanings.
Advances in NLP and Deep Learning
- NLP research began in the 1950s; major progress in the last decade due to deep learning.
- Neural machine translation became commercially viable around 2014-2016, drastically changing global communication.
- Modern search engines use neural networks to retrieve, rerank, and synthesize answers, moving beyond keyword matching.
- Large language models (LLMs) like GPT-2 (2019) demonstrated fluent text generation and contextual understanding.
- Current models like GPT-4 and ChatGPT enable interactive, multimodal AI applications (text, images, etc.).
- Foundation models generalize LLM technology across modalities (images, sound, bioinformatics, seismic data).
- Example: DALL·E generates images from text prompts with iterative refinement (adding context, style, elements).
Meaning and Word Representation
- Traditional linguistic meaning (denotational semantics): pairing of symbol (word) and idea/thing it denotes.
- Early computational approaches (e.g., WordNet) captured word relations (synonyms, hyponyms) but lacked nuance and coverage.
- Limitations of WordNet include incompleteness, lack of slang, and inability to capture subtle meaning differences.
- Alternative: Distributional semantics ("You shall know a word by the company it keeps")—meaning derived from the context words appear in.
- Words are represented as dense vectors (embeddings) that capture semantic similarity, unlike sparse one-hot vectors, under which all distinct words are orthogonal (see the sketch after this list).
- Embeddings place similar words close in a high-dimensional vector space (typically 100-2000 dimensions).
- Visualizations use dimensionality reduction techniques like t-SNE to project embeddings into 2D.
- Embeddings capture fine-grained semantic clusters (e.g., countries, verbs with similar meanings).
- Current focus: learning one embedding per word type (not per context); contextual embeddings addressed later in the course.
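To make the contrast between sparse one-hot vectors and dense embeddings concrete, here is a minimal NumPy sketch (the 3-dimensional embedding values below are invented purely for illustration; real embeddings are learned from data and have far more dimensions):

```python
import numpy as np

# One-hot vectors: every pair of distinct words is orthogonal,
# so their dot product (and cosine similarity) is always 0.
vocab = ["motel", "hotel", "banana"]
one_hot = np.eye(len(vocab))
print(one_hot[0] @ one_hot[1])  # motel . hotel = 0.0 -- no notion of similarity

# Toy dense embeddings (values invented for illustration only):
# similar words get vectors that point in similar directions.
emb = {
    "motel":  np.array([0.9, 0.1, 0.4]),
    "hotel":  np.array([0.8, 0.2, 0.5]),
    "banana": np.array([-0.3, 0.9, -0.2]),
}

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb["motel"], emb["hotel"]))   # high: close in vector space
print(cosine(emb["motel"], emb["banana"]))  # low/negative: far apart
```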
Word2Vec Algorithm (Mikolov et al., 2013)
- A simple, fast method to learn word embeddings from large text corpora.
- Key concepts:
  - Corpus: a large body of text.
  - Word type: a unique word in the vocabulary (e.g., "bank").
  - Word token: a specific occurrence of a word in the text.
  - Each word type is assigned a vector.
  - Context window: the words surrounding a center word (e.g., 2 words to the left and right).
- Objective:
  - Maximize the probability of the words that actually appear in the context window around each center word.
  - Use the dot product of word vectors to estimate similarity and, from it, the probability of co-occurrence.
- Probability model:
  - Use the softmax function to convert dot products (arbitrary real numbers) into probabilities (0 to 1).
  - The probability of an outside word given the center word is its exponentiated dot product, normalized over the whole vocabulary (written out after this list).
- Training:
  - Start with random vectors.
  - Minimize the negative log-likelihood of the observed context words via gradient descent.
  - Compute gradients of the objective with respect to each word vector and update the vectors iteratively (see the sketch after this list).
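In symbols, using the standard word2vec (skip-gram) notation, where $v_c$ is the center-word vector, $u_o$ the outside-word vector, $T$ the number of tokens in the corpus, $m$ the window size, and $V$ the vocabulary, the objective and the softmax probability described above are:

```latex
% Average negative log-likelihood of the observed context words
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)

% Softmax: probability of outside word o given center word c
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```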
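And a minimal NumPy sketch of the training loop described above, assuming a full softmax and plain gradient descent on a toy corpus (the corpus, dimensionality, learning rate, and epoch count are invented for illustration; real implementations use efficiency tricks such as negative sampling):

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, d, window, lr, epochs = len(vocab), 10, 2, 0.05, 200

# Two vectors per word type: center (v) and outside (u), initialized at random.
V_center = rng.normal(scale=0.1, size=(V, d))
U_outside = rng.normal(scale=0.1, size=(V, d))

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

for _ in range(epochs):
    for t, center_word in enumerate(corpus):
        c = word2id[center_word]
        # every outside word within the context window around position t
        for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if j == t:
                continue
            o = word2id[corpus[j]]
            # P(outside word | center word) over the whole vocabulary
            # (recomputed each time here purely for clarity)
            probs = softmax(U_outside @ V_center[c])
            # gradients of -log P(o | c):
            #   d/dv_c = U^T (probs - y_o),  d/dU = outer(probs - y_o, v_c)
            err = probs.copy()
            err[o] -= 1.0
            V_center[c] -= lr * (U_outside.T @ err)
            U_outside -= lr * np.outer(err, V_center[c])

# On such a tiny corpus the learned vectors are noisy, but the mechanics
# match the description above: similarity is measured by (normalized) dot product.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(V_center[word2id["quick"]], V_center[word2id["lazy"]]))
```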