Summary of "Applying random forest classifiers to single-cell RNAseq data"

Concise summary — main ideas

Detailed methodology — step-by-step

  1. Setup

    • Install and import the required libraries: pandas, scanpy (imported as sc), scikit-learn (RandomForestClassifier and the metrics module), and others as needed (e.g., Python's random module).
    • Load datasets:
      • A SARS‑CoV‑2 study lung dataset (training; mixed infected & healthy).
      • A second dataset for testing (Tabula Sapiens lung data in the demo).
  2. Initial preprocessing (performed before most downstream steps)

    • Remove doublets and perform basic cell-level QC.
    • Keep the raw count matrix (adata.X), and add UMAP coordinates (conventionally stored in adata.obsm, e.g. adata.obsm["X_umap"]) and cell-type annotations (adata.obs) for visualization and labeling.
    • Filter genes: keep genes expressed in at least N cells (demo used >=100 cells).
    • Normalize each cell to total counts (e.g., 10,000 counts per cell) and log-transform (log1p).
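
The gene filter and per-cell normalization above can be sketched with plain NumPy. This is a minimal illustration on synthetic counts, not the demo's code: the matrix, the `min_cells` value of 10, and all variable names are assumptions chosen so the example runs on a small array.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic raw counts standing in for adata.X (cells x genes); not real data.
counts = rng.poisson(lam=0.5, size=(500, 50)).astype(float)
counts[:, :5] = 0.0   # make the first five genes undetected everywhere...
counts[:3, 0] = 1.0   # ...except gene 0, which is seen in only 3 cells

# Hypothetical threshold for this small matrix (the demo used 100 cells).
min_cells = 10

# Keep genes detected (count > 0) in at least `min_cells` cells.
cells_per_gene = (counts > 0).sum(axis=0)
keep = cells_per_gene >= min_cells
filtered = counts[:, keep]

# Scale every cell to the same total (10,000 counts), then apply log1p.
target_sum = 1e4
totals = filtered.sum(axis=1, keepdims=True)
totals[totals == 0] = 1.0   # guard against division by zero for empty cells
normalized = filtered / totals * target_sum
logged = np.log1p(normalized)

print(filtered.shape, logged.max())
```

On an AnnData object, the corresponding scanpy calls would typically be `sc.pp.filter_genes(adata, min_cells=100)`, `sc.pp.normalize_total(adata, target_sum=1e4)`, and `sc.pp.log1p(adata)`.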

Important normalization caveat: Any normalization or transformation that borrows information across cells (batch or global normalization, scaling with global statistics, etc.) can leak information from the test set into training. Either split into train and test before applying such steps, or process train and test independently with the same pipeline; otherwise the reported accuracy can be inflated.
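
One way to respect this caveat is to split first and learn any global statistics from the training cells only. The sketch below uses scikit-learn's `StandardScaler` purely as a stand-in for "a step that uses statistics computed across cells"; the data and labels are synthetic assumptions. Random forests themselves do not need scaled features, so this shows the principle rather than a required step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 20))  # synthetic expression values
y = rng.integers(0, 2, size=200)                    # synthetic binary labels

# Split first, so no statistic computed below can see the test cells.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Per-gene mean and standard deviation are learned from training cells only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# ...and the already-fitted transform is then applied to the test cells.
X_test_scaled = scaler.transform(X_test)
```

Calling `fit` on the full matrix before splitting would be the leaky variant, because the test cells would then contribute to the per-gene means and variances.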

  3. Toy example: simple endothelial vs. not-endothelial classifier (within the same dataset)

    • Create binary labels Y: 1 if cell type == “endothelial”, else 0.
    • Initialize RandomForestClassifier (default n_estimators=100); optionally set n_jobs for parallelism.
    • Fit model on adata.X (cells × genes) and Y.
    • Extract feature importances via model.feature_importances_. Map them to adata.var_names (genes), sort in descending order, and display or plot the top-ranked genes.
    • (Optional) Predict on the same data to visualize predictions on UMAP (note: this is biased since train/test are identical).
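
The toy classifier steps can be sketched as follows. The expression matrix, gene names, and cell-type labels are synthetic stand-ins for adata.X, adata.var_names, and a cell-type column in adata.obs, and `gene_0` is an artificial marker planted for illustration; none of this is the demo's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_cells, n_genes = 600, 30

# Stand-ins for adata.var_names and a cell-type column of adata.obs.
gene_names = np.array([f"gene_{i}" for i in range(n_genes)])
cell_type = rng.choice(["endothelial", "fibroblast", "AT2"], size=n_cells)

# Stand-in for adata.X, with gene_0 planted as an endothelial marker.
X = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
X[cell_type == "endothelial", 0] += 10

# Binary labels: 1 for endothelial cells, 0 for everything else.
y = (cell_type == "endothelial").astype(int)

# n_jobs=-1 uses all available cores; random_state makes the run repeatable.
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X, y)

# Map importances back to gene names and rank them.
importances = pd.Series(model.feature_importances_, index=gene_names)
importances = importances.sort_values(ascending=False)
print(importances.head(5))

# Predicting on the training cells is only useful for plotting, not evaluation.
train_pred = model.predict(X)
```

With an AnnData object, `train_pred` could be stored as a new column in adata.obs and colored on the UMAP.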
  4. Building a balanced training set for cross-dataset classification (endothelial vs others)

    • Address class imbalance (e.g., 6,714 endothelial cells out of ~90k):
      • Randomly sample an equal number of non-endothelial cells to match endothelial count (downsampling majority class).
      • Combine endothelial and sampled non-endothelial barcodes to form a balanced training subset.
    • Harmonize features between datasets:
      • Concatenate train and test AnnData objects with scanpy.concat (the default join="inner" keeps only the genes shared by both datasets).
      • Optionally select a reduced feature set such as highly variable genes (HVGs) — demo used 2,000 HVGs.
      • Ensure train and test have identical gene order (model expects consistent feature order).
    • Train RandomForest on the balanced training subset.
    • Ensure the test data went through the same preprocessing pipeline, applied independently of the training data.
    • Predict on test.X and add predictions to test.obs for visualization.
    • Obtain ground-truth labels from test.obs (map endothelial subtypes → 1, others → 0).
    • Evaluate with sklearn.metrics.accuracy_score(true_labels, predicted_labels). Demo reported ~95% accuracy for cross-dataset endothelial classification.
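
The balancing, feature harmonization, and evaluation steps above can be sketched with pandas tables in place of AnnData objects. Everything here is an assumption for illustration: the `make_dataset` helper, the gene names, the class fractions, and the planted marker `gene_0`. The shared genes are selected by hand rather than through a concat call.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_dataset(n_cells, genes, frac_endo, prefix):
    """Hypothetical helper: synthetic cells x genes table plus 0/1 labels."""
    is_endo = rng.random(n_cells) < frac_endo
    X = rng.poisson(1.0, size=(n_cells, len(genes))).astype(float)
    X[is_endo, genes.index("gene_0")] += 8  # planted endothelial marker
    barcodes = [f"{prefix}_{i}" for i in range(n_cells)]
    expr = pd.DataFrame(X, columns=genes, index=barcodes)
    return expr, pd.Series(is_endo.astype(int), index=barcodes)

# The two datasets share only some genes, and in a different order.
train_genes = [f"gene_{i}" for i in range(40)]
test_genes = [f"gene_{i}" for i in range(29, -1, -1)] + [f"extra_{i}" for i in range(5)]
train_expr, train_y = make_dataset(3000, train_genes, 0.07, "train")
test_expr, test_y = make_dataset(1000, test_genes, 0.10, "test")

# Downsample the majority class to the size of the endothelial class.
endo_ids = train_y.index[train_y == 1].to_numpy()
other_ids = train_y.index[train_y == 0].to_numpy()
sampled_other = rng.choice(other_ids, size=len(endo_ids), replace=False)
balanced_ids = np.concatenate([endo_ids, sampled_other])

# Keep the intersection of genes, in one fixed order for both datasets.
test_gene_set = set(test_genes)
shared_genes = [g for g in train_genes if g in test_gene_set]
X_train = train_expr.loc[balanced_ids, shared_genes]
X_test = test_expr[shared_genes]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train.to_numpy(), train_y.loc[balanced_ids].to_numpy())

pred = model.predict(X_test.to_numpy())
acc = accuracy_score(test_y.to_numpy(), pred)
print(f"cross-dataset accuracy: {acc:.3f}")
```

Indexing both tables with the same `shared_genes` list is what guarantees identical feature order; the model has no notion of gene names and only sees column positions.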
  5. Example: predicting infected vs non-infected cells (AT2 cells)

    • Subset to a single cell type (AT2) to control for cell-type differences.
    • Filter genes (present in >=100 cells) and compute HVGs for feature selection.
    • Split data by sample (not by individual cells) to avoid leakage across individuals:
      • Use one infected sample and one control sample as the test set (all cells from those samples).
      • Use remaining samples for training.
    • Create binary labels from sample names (e.g., sample name containing “COV” → infected = 1, else control = 0).
    • Train RandomForest on train.X and train labels; predict on test.X.
    • Evaluate with accuracy_score. Demo reported very high accuracy (~98–99%) for infected vs control classification within AT2 cells.
    • Extract model.feature_importances_ to find genes driving the classification (example top gene: AGBL1).
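
The sample-level split can be sketched as below. The sample names, cell counts, and the planted signal in gene index 3 are hypothetical; only the rule "name contains COV means infected" comes from the summary. The point is that every cell from a held-out sample goes to the test set, so no sample contributes to both sides.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical sample names; each sample contributes 150 synthetic AT2 cells.
samples = ["COV_1", "COV_2", "COV_3", "CTRL_1", "CTRL_2", "CTRL_3"]
cells_per_sample, n_genes = 150, 25

obs = pd.DataFrame({"sample": np.repeat(samples, cells_per_sample)})

# Binary label derived from the sample name.
obs["infected"] = obs["sample"].str.contains("COV").astype(int)

# Synthetic expression with an infection-responsive gene at column 3.
X = rng.poisson(1.0, size=(len(obs), n_genes)).astype(float)
X[obs["infected"].to_numpy() == 1, 3] += 6

# Hold out one infected and one control sample in full.
test_samples = ["COV_3", "CTRL_3"]
is_test = obs["sample"].isin(test_samples).to_numpy()

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[~is_test], obs.loc[~is_test, "infected"])

pred = model.predict(X[is_test])
acc = accuracy_score(obs.loc[is_test, "infected"], pred)

# Column index of the most important feature.
top_gene = int(np.argmax(model.feature_importances_))
print(f"accuracy: {acc:.3f}, top gene index: {top_gene}")
```

A random split over individual cells would instead place cells from the same sample in both train and test, which lets the model exploit sample-specific effects rather than infection status.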

Tips, caveats, and suggestions

Key outcomes / results shown

Speakers, datasets and tools referenced

(End of summary.)
