Summary of "I Trained My Own AI... It beat ChatGPT"
High-level overview
- The creator trained and fine-tuned an open-source coding LLM locally and benchmarked it against several public models (ChatGPT (GPT‑4), Qwen, DeepSeek, Llama variants, Google Gemini).
- Early sensational claims (that the creator’s model “beat ChatGPT”) were later qualified: results depended heavily on data quality, benchmark/harness format, contamination, and which base model/version was used.
Project goal and approach
- Objective: build a small/affordable model that performs strongly on coding benchmarks by using supervised fine‑tuning (SFT) and data engineering, rather than training a base model from scratch.
- Base models used: the Qwen family (Qwen 32B / Qwen 2.5 / Qwen 3 are mentioned; rendered as “Gwen” in the auto‑captions). The creator initially used a Qwen 32B Coder variant but at one point accidentally trained a non‑coder variant, which produced poor scores until corrected.
- Training method:
- Gather a large, curated dataset of coding tasks.
- Augment, enrich, and synthesize additional samples.
- Run supervised fine‑tuning (multiple epochs).
- Do lightweight post‑training on high‑quality samples.
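The steps above all funnel into one artifact: a file of prompt/response pairs in the trainer's expected format. A minimal sketch of assembling such a JSONL file (the field names and samples are illustrative; the video does not show its exact schema):

```python
import json

# Illustrative curated coding tasks; a real run would use thousands of samples.
samples = [
    {"prompt": "Write a Python function that reverses a string.",
     "response": "def reverse(s):\n    return s[::-1]"},
    {"prompt": "Write a Python function that tests whether a number is even.",
     "response": "def is_even(n):\n    return n % 2 == 0"},
]

# SFT trainers commonly consume JSONL: one JSON object per line.
with open("sft_train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Round-trip to confirm the file is well-formed.
with open("sft_train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # prints 2
```

Because `json.dumps` escapes embedded newlines, each sample stays on a single line, which is what makes the format easy to stream and filter during later hygiene passes.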
Data collection, augmentation, and validation
- Data sources considered:
- The Stack (large public code corpus), public datasets, GitHub scraping (with licensing/ethical concerns), synthetic generation.
- Synthetic generation tools and approaches:
- OSS‑Instruct / Magicoder (seeding new instructions from open‑source code snippets) and Evol‑Instruct (rewriting prompts into progressively harder variants and target formats).
- Synthetic data produced many target‑format examples but required careful validation because models can fabricate incorrect outputs.
- Data hygiene problems encountered:
- Garbage or poorly formatted code (whitespace, syntax errors).
- Harness/test format mismatches.
- Contamination: benchmark examples leaked into training data.
- Actions taken:
- Contamination detection and removal.
- Validation of synthetic outputs and filtering low‑quality items.
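Two of the hygiene steps above, validating synthetic outputs and checking for benchmark contamination, can be sketched with the standard library. This is a simplified sketch (the video does not show its actual tooling): syntax-checking Python samples with `ast`, and flagging training samples that share too many token n‑grams with benchmark items.

```python
import ast

def is_valid_python(code):
    """Reject synthetic samples that do not even parse (truncated or garbled generations)."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def ngrams(text, n=8):
    """Token n-grams; 8-grams are a common unit for contamination checks."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_sample, benchmark_items, n=8, threshold=0.5):
    """Flag a training sample that shares too many n-grams with any benchmark item."""
    grams = ngrams(train_sample, n)
    if not grams:
        return False
    return any(len(grams & ngrams(item, n)) / len(grams) >= threshold
               for item in benchmark_items)

benchmark = ["Write a function that computes the greatest common divisor of two integers"]
leaked = "Write a function that computes the greatest common divisor of two integers"
clean = "Explain how a hash map resolves collisions with separate chaining in practice"

print(is_valid_python("def add(a, b):\n    return a + b"))  # True
print(is_contaminated(leaked, benchmark))                   # True
print(is_contaminated(clean, benchmark))                    # False
```

Real decontamination pipelines also normalize whitespace and punctuation before hashing n‑grams, since near-duplicates rarely match verbatim.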
Benchmarks, formats, and evaluation details
- Primary benchmark: Aider Polyglot (appears as “Ader Polyglot” in the auto‑generated captions). It:
- Evaluates coding across multiple languages.
- Supports two task formats: “diff” (edit an existing file) and “whole” (generate an entire file).
- Important effects:
- Models often performed very differently between the two formats.
- Fixing format/harness (converting to expected diff/whole format) had large impacts on scores.
- Several harness/test issues existed (e.g., C++ and JavaScript not properly tested), which skewed early results until fixed.
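The “diff” vs “whole” distinction can be illustrated with Python's standard library: a “whole” answer is the complete rewritten file, while a “diff” answer expresses only the edit. This is just a sketch of the shape of the two formats; Aider's actual edit formats differ in detail.

```python
import difflib

# The file the model is asked to edit.
original = 'def greet(name):\n    print("Hello " + name)\n'.splitlines(keepends=True)

# "whole" format: the model emits the entire rewritten file.
rewritten = 'def greet(name):\n    print(f"Hello {name}")\n'.splitlines(keepends=True)

# "diff" format: only the changed lines, here rendered as a unified diff.
diff = "".join(difflib.unified_diff(original, rewritten,
                                    fromfile="greet.py", tofile="greet.py"))
print(diff)
```

A model fine-tuned only on whole-file completions can score poorly when the harness demands diffs (and vice versa), which is why fixing the format mismatch moved scores so much.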
Performance timeline (key numbers extracted)
- Baseline behavior (before fixes): the Qwen variant scored very low (about 8% in one format, 16% in the other).
- ChatGPT baseline cited: ~18.2% on this benchmark (creator’s initial target).
- Results after iterative improvements:
- 16% (ceiling after initial fixes)
- 17% after small SFT on high‑quality samples
- 19.6% on a contaminated run (invalid due to contamination)
- 25% after switching to the correct coder model and retraining
- 36% after fixing the harness for missing languages and re‑running
- 39.1% after post‑training on 1,500 high‑quality decontaminated samples and more epochs
- Important caveat: Qwen 3 (a newer upstream model) scored ~40% on the same benchmark, so the creator’s result, while competitive, does not conclusively surpass the state of the art.
- Benchmark wins are fragile: format, contamination, harness bugs, model version differences, and run randomness can change results substantially.
Techniques that improved results
- Fixing benchmark format and harness (ensuring correct diff vs whole formats and that all languages are tested).
- Adding step‑by‑step reasoning / chain‑of‑thought style explanations to training samples to improve problem solving.
- Curating and decontaminating datasets (removing leaked benchmark items).
- Synthetic data generation and targeted augmentation to match the desired input/output format.
- Supervised fine‑tuning (multiple epochs) followed by focused post‑training on the best samples.
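Of the techniques above, reasoning augmentation is purely a data transformation: each sample's target text gains a short step-by-step explanation before the code. A hypothetical before/after (the exact phrasing used in the video's data is not shown):

```python
plain = {
    "prompt": "Write a function that returns the nth Fibonacci number.",
    "response": ("def fib(n):\n    a, b = 0, 1\n"
                 "    for _ in range(n):\n        a, b = b, a + b\n    return a"),
}

# Reasoning-augmented variant: train the model to explain first, then code.
augmented = {
    "prompt": plain["prompt"],
    "response": ("Step 1: iterate n times, carrying the last two Fibonacci values.\n"
                 "Step 2: after the loop, the first of the pair is fib(n).\n\n"
                 + plain["response"]),
}
print(augmented["response"].endswith(plain["response"]))  # True
```

The prompt is unchanged; only the completion grows, so the same SFT pipeline consumes both variants.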
Hardware, compute, and practical problems
- Training run on a local DIY rig with multiple GPUs (including second‑hand/Chinese 4090s that were undervolted and unstable).
- Frequent hardware failures and constraints:
- A GPU died after an electrical event.
- Power/cabling was overloaded; the creator improvised (circuit tapping, swapping cables) to get more compute.
- System crashes and instability limited dataset size and number of training runs.
- Compute/resource limits were a major bottleneck.
Limitations, reliability, and next steps
- Limitations:
- Single‑benchmark improvements are fragile and easily invalidated by contamination or harness issues.
- Results vary with base model/version (e.g., Qwen 3 outperformed the creator’s tuned run).
- Planned next steps:
- Test on additional coding benchmarks (SWE‑bench and others) to validate generality.
- Continue improving data hygiene, evaluation harnesses, and cross‑benchmark validation.
Tools, datasets, courses, and sponsors mentioned
- Models/platforms: Qwen family (Qwen 32B / Qwen 2.5 / Qwen 3), DeepSeek, Llama 4 Maverick, ChatGPT (GPT‑4), Google Gemini / Gemini Pro.
- Datasets/sources: The Stack (~60 TB), GitHub scraping, public datasets.
- Synthetic data tools: OSS‑Instruct, Magicoder, Evol‑Instruct.
- Benchmarks: Aider Polyglot (primary), SWE‑bench (planned).
- Learning resources: boot.dev (Linux course; “Create your own AI agent in Python” course) — recommended by the creator.
- Sponsor: NordVPN (ad read about using VPN on public Wi‑Fi).
Key takeaways / lessons
- Data quality, harness correctness, contamination checks, and using the correct base model/version are at least as important as model size.
- Adding reasoning‑style training samples and carefully curated SFT data can materially improve coding benchmark performance.
- Local small‑team experiments are feasible thanks to open research and tooling, but they’re sensitive to many failure modes (hardware, data, evaluation).
- Iterative learning and embracing failure are important; practical experience and careful validation matter more than flashy claims.
Main speakers / sources referenced
- Video creator: self‑identified as “Felix” (narrator and experimenter).
- Mentioned models and groups: DeepSeek (China), Qwen (32B / 2.5 / 3), Meta (Facebook) / Llama 4 Maverick, OpenAI (ChatGPT / GPT‑4), Google (Gemini / Gemini Pro).
- Tools/communities: the open‑source community, OSS‑Instruct, Magicoder, Evol‑Instruct.
- Educational sponsor: boot.dev; VPN sponsor: NordVPN.
Optional follow‑ups (available)
- A concise checklist of steps and pitfalls for someone attempting a similar fine‑tuning project (data sourcing, contamination checks, harness testing, hardware tips).
- A detailed timeline of experiments and hyperparameters (epochs, dataset sizes) extracted from the subtitles into a training log.