Summary of "Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 1 - Intro and Word Vectors"
Course Introduction and Logistics
- The course is popular and well-attended.
- Instructor: Christopher Manning; Head TA absent due to health; several TAs present.
- Main communication and resources are on the course website and Ed discussion board.
- First assignment is a warm-up, already live, due next Tuesday.
- Office hours start immediately; a Python and NumPy tutorial is offered for those needing a refresher.
- Course grading:
  - Four assignments (about 50% of the grade).
  - Final project (custom or default option, ~50%).
  - Participation counts for a small percentage.
- Collaboration policy emphasizes doing your own work; AI tools can be used for assistance but not for answering assignment questions outright.
- Final project mentors are either found by students themselves or assigned by TAs, who balance expertise and student distribution.
Course Learning Goals
- Foundations and Current Methods of Deep Learning for NLP:
  - Start with basics such as word vectors, feed-forward neural networks, recurrent networks, and attention.
  - Move on to Transformers, encoder-decoder models, large language models, pre-/post-training, adaptation, interpretability, and agents.
- Understanding Human Language and Its Challenges:
  - Convey linguistic concepts and the difficulties computers face in understanding and generating language.
- Building Practical NLP Systems:
  - Equip students to build real-world NLP applications (e.g., text classification, information extraction).
Human Language and Its Role
- Humans are distinguished from other primates primarily by language.
- Language enables communication and advanced thought/planning.
- Writing (~5,000 years old) allowed knowledge sharing across time and space, accelerating human progress.
- Language is flexible, socially used, imprecise, and constantly evolving—especially among younger generations.
- Quote from Herb Clark: Language use is more about people and intentions than just words and meanings.
Advances in NLP and Deep Learning
- NLP research began in the 1950s; major progress in the last decade due to deep learning.
- Neural machine translation became commercially viable around 2014-2016, drastically changing global communication.
- Modern search engines use neural networks to retrieve, rerank, and synthesize answers, moving beyond keyword matching.
- Large language models (LLMs) like GPT-2 (2019) demonstrated fluent text generation and contextual understanding.
- Current models like GPT-4 and ChatGPT enable interactive, multimodal AI applications (text, images, etc.).
- Foundation models generalize LLM technology across modalities (images, sound, bioinformatics, seismic data).
- Example: DALL·E generates images from text prompts with iterative refinement (adding context, style, elements).
Meaning and Word Representation
- Traditional linguistic meaning (denotational semantics): pairing of symbol (word) and idea/thing it denotes.
- Early computational approaches (e.g., WordNet) captured word relations (synonyms, hyponyms) but lacked nuance and coverage.
- Limitations of WordNet include incompleteness, lack of slang, and inability to capture subtle meaning differences.
- Alternative: Distributional semantics ("You shall know a word by the company it keeps")—meaning derived from the context words appear in.
- Words are represented as dense vectors (embeddings) that capture semantic similarity, unlike sparse one-hot vectors, under which all distinct words are orthogonal (see the sketch after this list).
- Embeddings place similar words close in a high-dimensional vector space (typically 100-2000 dimensions).
- Visualizations use dimensionality reduction techniques like t-SNE to project embeddings into 2D.
- Embeddings capture fine-grained semantic clusters (e.g., countries, verbs with similar meanings).
- Current focus: learning one embedding per word type (not per context); contextual embeddings addressed later in the course.
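To make the contrast between sparse one-hot vectors and dense embeddings concrete, here is a minimal NumPy sketch (the 3-dimensional embedding values below are invented purely for illustration; real embeddings are learned from data and have far more dimensions):

```python
import numpy as np

# One-hot vectors: every pair of distinct words is orthogonal,
# so their dot product (and cosine similarity) is always 0.
vocab = ["motel", "hotel", "banana"]
one_hot = np.eye(len(vocab))
print(one_hot[0] @ one_hot[1])  # motel . hotel = 0.0 -- no notion of similarity

# Toy dense embeddings (values invented for illustration only):
# similar words get vectors that point in similar directions.
emb = {
    "motel":  np.array([0.9, 0.1, 0.4]),
    "hotel":  np.array([0.8, 0.2, 0.5]),
    "banana": np.array([-0.3, 0.9, -0.2]),
}

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb["motel"], emb["hotel"]))   # high: close in vector space
print(cosine(emb["motel"], emb["banana"]))  # low/negative: far apart
```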
Word2Vec Algorithm (Mikolov et al., 2013)
- A simple, fast method to learn word embeddings from large text corpora.
- Key concepts:
  - Corpus: a large body of text.
  - Word type: a unique word in the vocabulary (e.g., "bank").
  - Word token: a specific occurrence of a word in the text.
  - Each word type is assigned a vector.
  - Context window: the words surrounding a center word (e.g., 2 words to the left and right).
- Objective:
  - Maximize the probability of the words that actually appear in the context window around each center word.
  - Use the dot product of word vectors to estimate similarity and, from it, the probability of co-occurrence.
- Probability model:
  - Use the softmax function to convert dot products (arbitrary real numbers) into probabilities (0 to 1).
  - The probability of an outside word given the center word is its exponentiated dot product, normalized over the whole vocabulary (written out after this list).
- Training:
  - Start with random vectors.
  - Minimize the negative log-likelihood of the observed context words via gradient descent.
  - Compute gradients of the objective with respect to each word vector and update the vectors iteratively (see the sketch after this list).
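In symbols, using the standard word2vec (skip-gram) notation, where $v_c$ is the center-word vector, $u_o$ the outside-word vector, $T$ the number of tokens in the corpus, $m$ the window size, and $V$ the vocabulary, the objective and the softmax probability described above are:

```latex
% Average negative log-likelihood of the observed context words
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)

% Softmax: probability of outside word o given center word c
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```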
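And a minimal NumPy sketch of the training loop described above, assuming a full softmax and plain gradient descent on a toy corpus (the corpus, dimensionality, learning rate, and epoch count are invented for illustration; real implementations use efficiency tricks such as negative sampling):

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, d, window, lr, epochs = len(vocab), 10, 2, 0.05, 200

# Two vectors per word type: center (v) and outside (u), initialized at random.
V_center = rng.normal(scale=0.1, size=(V, d))
U_outside = rng.normal(scale=0.1, size=(V, d))

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

for _ in range(epochs):
    for t, center_word in enumerate(corpus):
        c = word2id[center_word]
        # every outside word within the context window around position t
        for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if j == t:
                continue
            o = word2id[corpus[j]]
            # P(outside word | center word) over the whole vocabulary
            # (recomputed each time here purely for clarity)
            probs = softmax(U_outside @ V_center[c])
            # gradients of -log P(o | c):
            #   d/dv_c = U^T (probs - y_o),  d/dU = outer(probs - y_o, v_c)
            err = probs.copy()
            err[o] -= 1.0
            V_center[c] -= lr * (U_outside.T @ err)
            U_outside -= lr * np.outer(err, V_center[c])

# On such a tiny corpus the learned vectors are noisy, but the mechanics
# match the description above: similarity is measured by (normalized) dot product.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(V_center[word2id["quick"]], V_center[word2id["lazy"]]))
```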