Summary of "Applying random forest classifiers to single-cell RNAseq data"
Concise summary — main ideas
- The video demonstrates applying a Random Forest classifier to single-cell RNA-seq (scRNA-seq) data to:
- predict cell identity across datasets, and
- distinguish cells from infected versus non-infected patients.
- Emphasis is on practical preprocessing, building balanced training sets, training scikit-learn’s RandomForestClassifier, extracting feature importance (genes driving classification), and evaluating predictions on held-out test data.
- Two concrete use-cases shown:
- Classify endothelial cells vs other cells (including transferring a classifier across datasets).
- Classify whether AT2 epithelial cells came from COVID-infected vs control patients.
Detailed methodology — step-by-step
Setup
- Install and import required libraries: pandas, scanpy (imported as sc), scikit-learn (RandomForestClassifier, metrics), plus others as needed (e.g., random).
- Load datasets:
- A SARS‑CoV‑2 study lung dataset (training; mixed infected & healthy).
- A second dataset for testing (Tabula Sapiens lung data in the demo).
Initial preprocessing (performed before most downstream steps)
- Remove doublets and perform basic cell-level QC.
- Keep the raw count matrix (adata.X), but add UMAP coordinates and cell-type annotations (adata.obs) for visualization and labeling.
- Filter genes: keep genes expressed in at least N cells (demo used >=100 cells).
- Normalize each cell to total counts (e.g., 10,000 counts per cell) and log-transform (log1p).
- Important normalization caveat: any normalization or transformation that borrows information across cells (batch/global normalization, scaling using global statistics, etc.) can introduce information leakage. Either split train/test before such normalizations, or process train and test independently but with the same pipeline, to avoid inflated accuracy.
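The preprocessing steps above can be sketched with NumPy in place of an AnnData object; in a real workflow these correspond to scanpy's sc.pp.filter_genes, sc.pp.normalize_total, and sc.pp.log1p (the toy matrix and thresholds here are illustrative, not from the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy count matrix standing in for adata.X (cells x genes)
X = rng.poisson(0.5, size=(500, 200)).astype(float)

# Filter genes: keep genes detected in at least min_cells cells
min_cells = 100
keep = (X > 0).sum(axis=0) >= min_cells
X = X[:, keep]

# Normalize each cell to 10,000 total counts, then log1p-transform
counts_per_cell = X.sum(axis=1, keepdims=True)
X = np.log1p(X / counts_per_cell * 1e4)
```

Note that the gene filter and the per-cell total both depend only on the data they are computed from, so running this pipeline separately on train and test avoids the leakage described above.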
Toy example — simple endothelial vs not-endothelial classifier (within the same dataset)
- Create binary labels Y: 1 if cell type == “endothelial”, else 0.
- Initialize RandomForestClassifier (default n_estimators=100); optionally set n_jobs for parallelism.
- Fit model on adata.X (cells × genes) and Y.
- Extract feature importances via model.feature_importances_. Map to adata.var_names (genes), sort, and display/top-plot important genes.
- (Optional) Predict on the same data to visualize predictions on UMAP (note: this is biased since train/test are identical).
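The toy classifier above can be sketched as follows, using synthetic data in place of adata.X and hypothetical gene names in place of adata.var_names (label construction and names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_cells, n_genes = 300, 50
X = rng.normal(size=(n_cells, n_genes))
# Binary labels standing in for "endothelial vs not"; here gene 0 alone
# determines the label, so it should dominate the importances
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X, y)

# Map importances to gene names and sort, as one would with adata.var_names
gene_names = [f"gene_{i}" for i in range(n_genes)]
ranked = sorted(zip(gene_names, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
```

With real data, the top-ranked genes play the role of candidate marker genes for the positive class.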
Building a balanced training set for cross-dataset classification (endothelial vs others)
- Address class imbalance (e.g., 6,714 endothelial cells out of ~90k):
- Randomly sample an equal number of non-endothelial cells to match endothelial count (downsampling majority class).
- Combine endothelial and sampled non-endothelial barcodes to form a balanced training subset.
- Harmonize features between datasets:
- Concatenate train and test AnnData objects with anndata.concat (join='inner' by default, i.e., the intersection of genes).
- Optionally select a reduced feature set such as highly variable genes (HVGs) — demo used 2,000 HVGs.
- Ensure train and test have identical gene order (model expects consistent feature order).
- Train RandomForest on the balanced training subset.
- Ensure test data was processed similarly but independently.
- Predict on test.X and add predictions to test.obs for visualization.
- Obtain ground-truth labels from test.obs (map endothelial subtypes → 1, others → 0).
- Evaluate with sklearn.metrics.accuracy_score(true_labels, predicted_labels). Demo reported ~95% accuracy for cross-dataset endothelial classification.
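A minimal sketch of the balanced-training workflow, with synthetic arrays standing in for the two datasets (class sizes, the five "informative genes", and the effect size are all invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Imbalanced training pool: 100 "endothelial" cells among 1,000 total
X = rng.normal(size=(1000, 30))
y = np.zeros(1000, dtype=int)
y[:100] = 1
X[y == 1, :5] += 2.0  # give the minority class a detectable signal

# Downsample the majority class to match the minority count
pos_idx = np.flatnonzero(y == 1)
neg_idx = rng.choice(np.flatnonzero(y == 0), size=len(pos_idx), replace=False)
train_idx = np.concatenate([pos_idx, neg_idx])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[train_idx], y[train_idx])

# Stand-in for the second dataset; columns must be in the same gene order
X_test = rng.normal(size=(200, 30))
y_test = np.zeros(200, dtype=int)
y_test[:100] = 1
X_test[y_test == 1, :5] += 2.0

acc = accuracy_score(y_test, model.predict(X_test))
```

In the real cross-dataset setting, the test matrix comes from Tabula Sapiens after gene intersection and reordering, not from the same generative process as here.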
Example — predicting infected vs non-infected cells (AT2 cells)
- Subset to a single cell type (AT2) to control for cell-type differences.
- Filter genes (present in >=100 cells) and compute HVGs for feature selection.
- Split data by sample (not by individual cells) to avoid leakage across individuals:
- Use one infected sample and one control sample as the test set (all cells from those samples).
- Use remaining samples for training.
- Create binary labels from sample names (e.g., sample name containing “COV” → infected = 1, else control = 0).
- Train RandomForest on train.X and train labels; predict on test.X.
- Evaluate with accuracy_score. Demo reported very high accuracy (~98–99%) for infected vs control classification within AT2 cells.
- Extract model.feature_importances_ to find genes driving the classification (example top gene: AGBL1).
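The sample-level split above can be sketched as follows; the sample names, cell counts, and infection signal are hypothetical stand-ins for adata.obs["sample"] and the real AT2 expression matrix:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Hypothetical sample IDs: names containing "COV" denote infected patients
samples = ["COV01", "COV02", "CTRL01", "CTRL02"]
sample_of_cell = rng.choice(samples, size=400)  # stands in for adata.obs
y = np.array(["COV" in s for s in sample_of_cell], dtype=int)

X = rng.normal(size=(400, 40))
X[y == 1, :5] += 1.5  # synthetic infection signal

# Hold out whole samples (one infected, one control), not individual cells,
# so no patient contributes cells to both train and test
test_mask = np.isin(sample_of_cell, ["COV02", "CTRL02"])
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[~test_mask], y[~test_mask])
acc = accuracy_score(y[test_mask], model.predict(X[test_mask]))
```

Splitting by cell instead of by sample would let the model memorize patient-specific effects and inflate accuracy, which is exactly the leakage the demo guards against.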
Tips, caveats, and suggestions
- Random Forests are easy to use, fast, and provide feature importance (less of a black box than some methods).
- Reduce features (e.g., HVGs) when the number of features is much larger than samples to reduce overfitting and speed training.
- Balance classes in the training set to avoid biased classifiers (downsample majority or upsample minority).
- Process train and test in the same way but independently to prevent information leakage.
- Consider advanced normalization/integration (e.g., scVI) to improve cross-dataset transferability.
- Tune Random Forest hyperparameters (n_estimators, max_depth, etc.) and experiment with different feature sets for improved performance.
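Hyperparameter tuning can be done with a cross-validated grid search; this sketch uses a tiny illustrative grid and synthetic data (a real search would cover wider ranges and use the actual expression matrix):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Small, illustrative grid over two common hyperparameters
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
    cv=3,
)
grid.fit(X, y)
```

grid.best_params_ and grid.best_score_ then report the winning combination and its cross-validated accuracy.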
Key outcomes / results shown
- RandomForestClassifier can extract biologically meaningful marker genes for cell identity (feature importance aligned with known endothelial markers in the toy example).
- Cross-dataset cell-type labeling (training on one dataset, testing on a differently processed dataset) achieved ~95% accuracy in the demo.
- Distinguishing infected vs non-infected cells (same cell type, split by samples) achieved very high accuracy (~98–99%), with top features identified (example: AGBL1).
Speakers, datasets and tools referenced
- Speaker: the video’s presenter/instructor (unnamed; single primary speaker).
- Datasets:
- SARS‑CoV‑2 study lung dataset (training data; mixed infected and healthy samples).
- Tabula Sapiens lung dataset (external test dataset used in the demo).
- Software/tools/packages:
- scanpy (AnnData, UMAP, concatenation, HVG selection)
- pandas
- scikit-learn (RandomForestClassifier, metrics.accuracy_score)
- scVI (mentioned as an alternative normalization/integration approach)
- Example gene mentioned as important: AGBL1
(End of summary.)