Summary of "Stanford CS230 | Autumn 2025 | Lecture 2: Supervised, Self-Supervised, & Weakly Supervised Learning"
Overview and Instructor Introduction
- Instructor: Kian Katanforoosh, co-creator and co-lecturer of CS230 alongside Andrew Ng.
- Brings industry experience, leading a company using AI for skill measurement.
- The course emphasizes practical, industry-relevant examples beyond academic theory.
- Topics this quarter include:
  - Decision making in AI projects
  - Adversarial attacks and defenses
  - Deep reinforcement learning
  - Retrieval-augmented generation
  - AI agents and multi-agent systems
Lecture Structure
- Recap of Week 1: Basics of neurons, layers, and deep neural networks
- Supervised Learning Projects:
  - Day and night classification
  - Trigger word detection
  - Face verification
- Self-Supervised and Weakly Supervised Learning:
  - Introduction to embeddings and their significance
- Adversarial Attacks and Defenses (if time permits)
Core Concepts and Lessons
1. Recap of Supervised Learning
- Model Components:
  - Architecture (the blueprint or skeleton)
  - Parameters (weights and biases)
- Learning Process: Gradient descent optimization minimizes the loss by adjusting parameters based on prediction errors.
- Input/Output Variability: Inputs can be images, text, audio, video, or structured data. Outputs can be classification labels, regression values, or generative outputs (e.g., image super-resolution).
- Neural Network Structure: From simple multi-layer perceptrons to CNNs, RNNs, and transformers.
- Loss Functions: Critical for model feedback and training. Example: YOLO's complex loss function designed for object detection.
- Feature Learning vs. Feature Engineering: Deep learning automates feature extraction (feature learning) instead of relying on manual feature engineering.
- Encoding vs. Embedding:
  - Encoding: any vector representation of the input.
  - Embedding: an encoding in which distances between vectors carry semantic meaning.
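The encoding/embedding distinction can be made concrete with a toy distance check: in a useful embedding space, vectors for semantically related inputs sit closer together than vectors for unrelated ones. The 3-d vectors below are made-up illustrations, not outputs of any real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0.0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: "king" and "queen" should be close, "apple" far.
king  = [0.90, 0.80, 0.10]
queen = [0.85, 0.82, 0.15]
apple = [0.10, 0.20, 0.95]

print(cosine_similarity(king, queen) > cosine_similarity(king, apple))  # True
```

An arbitrary encoding (say, raw pixel values) gives no such guarantee; the point of a learned embedding is that this comparison becomes meaningful.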
2. Supervised Learning Case Studies
a. Day and Night Classification
- Problem: Classify images as day or night.
- Data Collection: Diverse images from different locations and lighting conditions. Human experiments to determine the minimum resolution needed (e.g., 64×64×3).
- Model Design Decisions:
  - Input resolution impacts compute and accuracy.
  - Output: binary classification (day=1, night=0).
  - Architecture: shallow network with convolutional layers recommended.
  - Loss function: binary cross-entropy (logistic loss).
- Challenges: Variability in lighting, indoor/outdoor settings, dawn/dusk ambiguity. Hardware constraints affect model size and training speed.
- Human Proxy Experiments: Using humans to test hypotheses about data and model requirements.
b. Trigger Word Detection
- Problem: Detect a specific word (e.g., “activate”) in 10-second audio clips.
- Data Collection: Positive samples (the word "activate"), negative samples (similar words like "deactivate"), and background noise. Synthetic data augmentation by mixing words and noise. Importance of diverse accents, ages, speaking speeds, and background noises.
- Model Inputs: Audio sequences preprocessed (e.g., with a Fourier transform).
- Output: Binary sequence indicating presence of the trigger word at each time step.
- Architecture: Likely recurrent neural networks (RNNs) due to the sequential nature of the data.
- Loss Function: Sequential binary cross-entropy applied at each time step.
- Labeling Strategies: Tradeoff between coarse labeling (presence/absence) and fine-grained labeling (exact timing). Balanced datasets avoid skewed predictions.
- Practical Tips: Leveraging expert advice for architecture. Using licensing knowledge to collect free data. Human experiments to test labeling schemes.
c. Face Verification
- Problem: Verify if two face images belong to the same person (e.g., student ID verification).
- Data Considerations: High-resolution images (e.g., 412×412×3) needed for fine details. Variations in lighting, angle, accessories (glasses, hats), and age. Preprocessing such as cropping to align faces.
- Traditional Approach: Pixel-wise comparison fails due to lighting, translation, and scale differences.
- Neural Network Approach: Use a deep network to encode images into vectors (embeddings). Compare embeddings using distance metrics (e.g., Euclidean distance, cosine similarity).
- Training Method: Triplet loss over triplets of images: anchor, positive (same person), negative (different person). Objective: minimize the distance between anchor and positive; maximize the distance between anchor and negative.
- Extensions:
  - Face identification: match a single image against a database using K-nearest neighbors on embeddings.
  - Face clustering: group images of the same person using unsupervised clustering (e.g., k-means).
- Key Insight: No explicit feature engineering required; the model learns meaningful facial features automatically.
3. Self-Supervised and Weakly Supervised Learning
- Motivation: Labeling is expensive and time-consuming.
- Self-Supervised Learning: Uses the data itself to generate supervisory signals without manual labels. Examples:
  - Contrastive learning: create pairs of augmented images (e.g., rotated, cropped) and train the model to produce similar embeddings for augmented versions of the same image.
  - Next-token prediction in text (e.g., GPT): predict the next word from the previous words.
- Benefits: Enables training on massive unlabeled datasets. Models learn generalizable features and emergent behaviors (e.g., semantic understanding, reasoning).
- Emergent Behaviors: Unexpected capabilities arising from large-scale training on simple objectives.
- Other Modalities:
  - Audio: Mask parts of the audio and predict the missing segments.
  - Video: Predict missing frames.
  - Biology: Predict masked portions of protein or DNA sequences.
  - Multimodal learning: Connect different data types (e.g., images and text captions, audio and video).
- Weakly Supervised Learning: Uses naturally occurring paired data (e.g., images with captions on social media) without explicit labeling.
- Example Paper: ImageBind, which aligns multiple modalities (text, image, audio, thermal, etc.) into a shared embedding space.
- Key Idea: Text often serves as the central pivot connecting the various modalities.
4. Practical Advice and Takeaways
- Use human experiments as proxies for model capabilities and data requirements.
- Data labeling strategies critically impact model training efficiency.
- Architecture search is less frequent now but important for specialized constraints.
- Leveraging expert knowledge accelerates project development.
- Understand the tradeoffs between model capacity, data size, and hardware constraints.
- Embrace self-supervised and weakly supervised methods to scale learning without heavy labeling.
- Multimodal embeddings are a major frontier in connecting diverse data sources.
Speakers / Sources Featured
- Kian Katanforoosh: Primary lecturer, AI industry practitioner, co-creator of CS230.
- Andrew Ng: Co-lecturer (mentioned but not directly speaking in the transcript).
- Students / Participants: Various students contributed ideas and questions during the interactive lecture.
Detailed Methodologies / Instructions Presented
Supervised Learning: Day and Night Classification
- Collect diverse labeled images (day=1, night=0).
- Choose consistent input resolution balancing information and compute.
- Design shallow CNN architecture.
- Use sigmoid activation for output.
- Use binary cross-entropy loss.
- Conduct human proxy experiments to validate resolution choice.
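The sigmoid output and binary cross-entropy steps above can be sketched in plain Python. The logit values and the single-example form are illustrative choices, not the course's exact setup.

```python
import math

def sigmoid(z):
    """Squash a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Logistic loss for one example; eps guards against log(0)."""
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# A confident, correct "day" prediction incurs low loss...
low = binary_cross_entropy(1, sigmoid(4.0))    # sigmoid(4.0) ≈ 0.982
# ...while a confident, wrong one is penalized heavily.
high = binary_cross_entropy(1, sigmoid(-4.0))  # sigmoid(-4.0) ≈ 0.018
print(low < high)  # True
```

In practice the loss is averaged over a batch and minimized with gradient descent, as in the recap above.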
Supervised Learning: Trigger Word Detection
- Collect positive, negative, and background noise audio samples.
- Use synthetic data augmentation by mixing words and noise.
- Preprocess audio (e.g., Fourier transform).
- Label sequences with binary vectors indicating word presence at each time step.
- Use RNN architectures.
- Use sequential binary cross-entropy loss.
- Balance dataset to avoid skewed labels.
- Leverage expert advice for architecture selection.
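The synthetic-mixing step above can be sketched as a toy function over raw sample arrays. The random offset, the `label_len` parameter, and the convention of labeling the time steps just after the word ends are illustrative assumptions, not the course's exact recipe.

```python
import random

def synthesize_clip(background, word, label_len=1, seed=None):
    """Overlay `word` samples onto `background` at a random offset and return
    (audio, labels); labels mark the steps right after the word ends."""
    rng = random.Random(seed)
    audio = list(background)
    labels = [0] * len(background)
    start = rng.randrange(0, len(background) - len(word))
    for i, sample in enumerate(word):
        audio[start + i] += sample            # mix the word into the noise
    end = start + len(word)
    for t in range(end, min(end + label_len, len(labels))):
        labels[t] = 1                         # "trigger word just occurred"
    return audio, labels

background = [0.0] * 100                      # stand-in for a noise clip
word = [1.0] * 10                             # stand-in for "activate" samples
audio, labels = synthesize_clip(background, word, label_len=5, seed=0)
```

Because the insertion point is known exactly, the per-time-step labels come for free, which is what makes synthetic mixing so much cheaper than hand-labeling real recordings.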
Supervised Learning: Face Verification
- Gather multiple images per person with labels.
- Preprocess images for alignment.
- Train deep neural network to encode images into embeddings.
- Use triplet loss with anchor, positive, and negative images:
- Minimize distance(anchor, positive)
- Maximize distance(anchor, negative)
- Use distance threshold for verification.
- Extend to identification with K-nearest neighbors on embeddings.
- Use clustering algorithms for face clustering.
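The triplet objective in the steps above can be written out directly. The margin value and the 3-d vectors are illustrative; a real system would feed the loss embeddings produced by the trained encoder.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: zero once the positive is closer to the
    anchor than the negative is, by at least `margin`."""
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin, 0.0)

# Hypothetical embeddings: two images of the same person (anchor, positive)
# and one image of a different person (negative).
anchor   = [0.10, 0.90, 0.20]
positive = [0.12, 0.88, 0.22]
negative = [0.80, 0.10, 0.70]

print(triplet_loss(anchor, positive, negative))  # 0.0 — already well separated
```

At verification time, the same `squared_distance` is compared against a threshold; identification replaces the threshold with a nearest-neighbor search over the database of embeddings.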
Self-Supervised Learning: Contrastive Learning
- Generate augmented pairs of the same image (rotations, crops, noise).
- Train network to produce similar embeddings for augmented pairs.
- Push embeddings of different images apart.
- No manual labels required.
- Extend to text with next token prediction.
- Apply similar principles to audio, video, biology, and multimodal data.
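A minimal sketch of the contrastive objective above, assuming an NT-Xent-style loss over precomputed similarity scores (the similarity values and temperature are made up for illustration, and a real implementation would compute them from embeddings in a batch):

```python
import math

def nt_xent(sim_pos, sims_all, temperature=0.5):
    """Normalized temperature-scaled cross-entropy for one anchor:
    -log(exp(s_pos/t) / sum_j exp(s_j/t)), where sims_all includes s_pos."""
    num = math.exp(sim_pos / temperature)
    den = sum(math.exp(s / temperature) for s in sims_all)
    return -math.log(num / den)

# Similarities between an anchor and three candidates: its own augmented
# view (0.9) versus two unrelated images (0.1 and 0.05).
good = nt_xent(0.9, [0.9, 0.1, 0.05])   # low loss: augmented pair is closest
bad  = nt_xent(0.05, [0.9, 0.1, 0.05])  # high loss: wrong pair is closest
print(good < bad)  # True
```

Minimizing this loss pulls augmented views of the same image together and pushes other images apart, with no manual labels involved.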
This lecture provided foundational understanding of supervised, self-supervised, and weakly supervised learning through practical case studies, emphasizing data strategies, model design, loss functions, and the importance of embeddings and multimodal learning in modern AI systems.