Summary of "I Trained My Own AI... It beat ChatGPT"
High-level overview
- The creator trained and fine-tuned an open-source coding LLM locally and benchmarked it against several public models (ChatGPT (GPT‑4), Qwen, DeepSeek, Llama variants, Google Gemini).
- Early sensational claims (that the creator’s model “beat ChatGPT”) were later qualified: results depended heavily on data quality, benchmark/harness format, contamination, and which base model/version was used.
Project goal and approach
- Objective: build a small/affordable model that performs strongly on coding benchmarks by using supervised fine‑tuning (SFT) and data engineering, rather than training a base model from scratch.
- Base models used: the Qwen family (Qwen 32B / Qwen 2.5 / Qwen 3 are mentioned; rendered as “Gwen” in the auto‑captions). The creator initially used a Qwen 32B Coder variant but at one point accidentally trained a non‑coder variant, which produced poor scores until corrected.
- Training method:
- Gather a large, curated dataset of coding tasks.
- Augment, enrich, and synthesize additional samples.
- Run supervised fine‑tuning (multiple epochs).
- Do lightweight post‑training on high‑quality samples.
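The steps above all funnel into one artifact: a file of prompt/response pairs in the trainer's expected format. A minimal sketch of assembling such a JSONL file (the field names and samples are illustrative; the video does not show its exact schema):

```python
import json

# Illustrative curated coding tasks; a real run would use thousands of samples.
samples = [
    {"prompt": "Write a Python function that reverses a string.",
     "response": "def reverse(s):\n    return s[::-1]"},
    {"prompt": "Write a Python function that tests whether a number is even.",
     "response": "def is_even(n):\n    return n % 2 == 0"},
]

# SFT trainers commonly consume JSONL: one JSON object per line.
with open("sft_train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Round-trip to confirm the file is well-formed.
with open("sft_train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # prints 2
```

Because `json.dumps` escapes embedded newlines, each sample stays on a single line, which is what makes the format easy to stream and filter during later hygiene passes.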
Data collection, augmentation, and validation
- Data sources considered:
- The Stack (large public code corpus), public datasets, GitHub scraping (with licensing/ethical concerns), synthetic generation.
- Synthetic generation tools and approaches:
- OSS‑Instruct / Magicoder (seeding new instructions from open‑source code snippets) and Evol‑Instruct (rewriting prompts into progressively harder variants and target formats).
- Synthetic data produced many target‑format examples but required careful validation because models can fabricate incorrect outputs.
- Data hygiene problems encountered:
- Garbage or poorly formatted code (whitespace, syntax errors).
- Harness/test format mismatches.
- Contamination: benchmark examples leaked into training data.
- Actions taken:
- Contamination detection and removal.
- Validation of synthetic outputs and filtering low‑quality items.
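Two of the hygiene steps above, validating synthetic outputs and checking for benchmark contamination, can be sketched with the standard library. This is a simplified sketch (the video does not show its actual tooling): syntax-checking Python samples with `ast`, and flagging training samples that share too many token n‑grams with benchmark items.

```python
import ast

def is_valid_python(code):
    """Reject synthetic samples that do not even parse (truncated or garbled generations)."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def ngrams(text, n=8):
    """Token n-grams; 8-grams are a common unit for contamination checks."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_sample, benchmark_items, n=8, threshold=0.5):
    """Flag a training sample that shares too many n-grams with any benchmark item."""
    grams = ngrams(train_sample, n)
    if not grams:
        return False
    return any(len(grams & ngrams(item, n)) / len(grams) >= threshold
               for item in benchmark_items)

benchmark = ["Write a function that computes the greatest common divisor of two integers"]
leaked = "Write a function that computes the greatest common divisor of two integers"
clean = "Explain how a hash map resolves collisions with separate chaining in practice"

print(is_valid_python("def add(a, b):\n    return a + b"))  # True
print(is_contaminated(leaked, benchmark))                   # True
print(is_contaminated(clean, benchmark))                    # False
```

Real decontamination pipelines also normalize whitespace and punctuation before hashing n‑grams, since near-duplicates rarely match verbatim.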
Benchmarks, formats, and evaluation details
- Primary benchmark: Aider Polyglot (appears as “Ader Polyglot” in the auto‑generated captions). It:
- Evaluates coding across multiple languages.
- Supports two task formats: “diff” (edit an existing file) and “whole” (generate an entire file).
- Important effects:
- Models often performed very differently between the two formats.
- Fixing format/harness (converting to expected diff/whole format) had large impacts on scores.
- Several harness/test issues existed (e.g., C++ and JavaScript not properly tested), which skewed early results until fixed.
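The “diff” vs “whole” distinction can be illustrated with Python's standard library: a “whole” answer is the complete rewritten file, while a “diff” answer expresses only the edit. This is just a sketch of the shape of the two formats; Aider's actual edit formats differ in detail.

```python
import difflib

# The file the model is asked to edit.
original = 'def greet(name):\n    print("Hello " + name)\n'.splitlines(keepends=True)

# "whole" format: the model emits the entire rewritten file.
rewritten = 'def greet(name):\n    print(f"Hello {name}")\n'.splitlines(keepends=True)

# "diff" format: only the changed lines, here rendered as a unified diff.
diff = "".join(difflib.unified_diff(original, rewritten,
                                    fromfile="greet.py", tofile="greet.py"))
print(diff)
```

A model fine-tuned only on whole-file completions can score poorly when the harness demands diffs (and vice versa), which is why fixing the format mismatch moved scores so much.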
Performance timeline (key numbers extracted)
- Baseline behavior (before fixes): the Qwen variant scored very low (about 8% in one format, 16% in the other).
- ChatGPT baseline cited: ~18.2% on this benchmark (creator’s initial target).
- Results after iterative improvements:
- 16% (ceiling after initial fixes)
- 17% after small SFT on high‑quality samples
- 19.6% on a contaminated run (invalid due to contamination)
- 25% after switching to the correct coder model and retraining
- 36% after fixing the harness for missing languages and re‑running
- 39.1% after post‑training on 1,500 high‑quality decontaminated samples and more epochs
- Important caveat: Qwen 3 (a newer upstream model) scored ~40% on the same benchmark, so the creator’s result, while competitive, does not conclusively surpass the state of the art.
- Benchmark wins are fragile: format, contamination, harness bugs, model version differences, and run randomness can change results substantially.
Techniques that improved results
- Fixing benchmark format and harness (ensuring correct diff vs whole formats and that all languages are tested).
- Adding step‑by‑step reasoning / chain‑of‑thought style explanations to training samples to improve problem solving.
- Curating and decontaminating datasets (removing leaked benchmark items).
- Synthetic data generation and targeted augmentation to match the desired input/output format.
- Supervised fine‑tuning (multiple epochs) followed by focused post‑training on the best samples.
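Of the techniques above, reasoning augmentation is purely a data transformation: each sample's target text gains a short step-by-step explanation before the code. A hypothetical before/after (the exact phrasing used in the video's data is not shown):

```python
plain = {
    "prompt": "Write a function that returns the nth Fibonacci number.",
    "response": ("def fib(n):\n    a, b = 0, 1\n"
                 "    for _ in range(n):\n        a, b = b, a + b\n    return a"),
}

# Reasoning-augmented variant: train the model to explain first, then code.
augmented = {
    "prompt": plain["prompt"],
    "response": ("Step 1: iterate n times, carrying the last two Fibonacci values.\n"
                 "Step 2: after the loop, the first of the pair is fib(n).\n\n"
                 + plain["response"]),
}
print(augmented["response"].endswith(plain["response"]))  # True
```

The prompt is unchanged; only the completion grows, so the same SFT pipeline consumes both variants.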
Hardware, compute, and practical problems
- Training run on a local DIY rig with multiple GPUs (including second‑hand/Chinese 4090s that were undervolted and unstable).
- Frequent hardware failures and constraints:
- A GPU died after an electrical event.
- Power/cabling was overloaded; the creator improvised (circuit tapping, swapping cables) to get more compute.
- System crashes and instability limited dataset size and number of training runs.
- Compute/resource limits were a major bottleneck.
Limitations, reliability, and next steps
- Limitations:
- Single‑benchmark improvements are fragile and easily invalidated by contamination or harness issues.
- Results vary with base model/version (e.g., Qwen 3 outperformed the creator’s tuned run).
- Planned next steps:
- Test on additional coding benchmarks (SWE‑bench and others) to validate generality.
- Continue improving data hygiene, evaluation harnesses, and cross‑benchmark validation.
Tools, datasets, courses, and sponsors mentioned
- Models/platforms: Qwen family (Qwen 32B / Qwen 2.5 / Qwen 3), DeepSeek, Llama 4 Maverick, ChatGPT (GPT‑4), Google Gemini / Gemini Pro.
- Datasets/sources: The Stack (~60 TB), GitHub scraping, public datasets.
- Synthetic data tools: OSS‑Instruct, Magicoder, Evol‑Instruct.
- Benchmarks: Aider Polyglot (primary), SWE‑bench (planned).
- Learning resources: boot.dev (Linux course; “Create your own AI agent in Python” course) — recommended by the creator.
- Sponsor: NordVPN (ad read about using VPN on public Wi‑Fi).
Key takeaways / lessons
- Data quality, harness correctness, contamination checks, and using the correct base model/version are at least as important as model size.
- Adding reasoning‑style training samples and carefully curated SFT data can materially improve coding benchmark performance.
- Local small‑team experiments are feasible thanks to open research and tooling, but they’re sensitive to many failure modes (hardware, data, evaluation).
- Iterative learning and embracing failure are important; practical experience and careful validation matter more than flashy claims.
Main speakers / sources referenced
- Video creator: self‑identified as “Felix” (narrator and experimenter).
- Mentioned models and groups: DeepSeek (China), Qwen (32B / 2.5 / 3), Meta (Facebook) / Llama 4 Maverick, OpenAI (ChatGPT / GPT‑4), Google (Gemini / Gemini Pro).
- Tools/communities: the open‑source community, OSS‑Instruct, Magicoder, Evol‑Instruct.
- Educational sponsor: boot.dev; VPN sponsor: NordVPN.
Optional follow‑ups (available)
- A concise checklist of steps and pitfalls for someone attempting a similar fine‑tuning project (data sourcing, contamination checks, harness testing, hardware tips).
- A detailed timeline of experiments and hyperparameters (epochs, dataset sizes) extracted from the subtitles into a training log.