Summary of "Applying random forest classifiers to single-cell RNAseq data"
Concise summary — main ideas
- The video demonstrates applying a Random Forest classifier to single-cell RNA-seq (scRNA-seq) data to:
- predict cell identity across datasets, and
- distinguish cells from infected versus non-infected patients.
- Emphasis is on practical preprocessing, building balanced training sets, training scikit-learn’s RandomForestClassifier, extracting feature importance (genes driving classification), and evaluating predictions on held-out test data.
- Two concrete use-cases shown:
- Classify endothelial cells vs other cells (including transferring a classifier across datasets).
- Classify whether AT2 epithelial cells came from COVID-infected vs control patients.
Detailed methodology — step-by-step
Setup
- Install and import required libraries: pandas, scanpy (imported as sc), scikit-learn (RandomForestClassifier, metrics), plus others as needed (e.g., random).
- Load datasets:
- A SARS‑CoV‑2 study lung dataset (training; mixed infected & healthy).
- A second dataset for testing (Tabula Sapiens lung data in the demo).
Initial preprocessing (performed before most downstream steps)
- Remove doublets and perform basic cell-level QC.
- Keep the raw count matrix (adata.X), but add UMAP coordinates and cell-type annotations (adata.obs) for visualization and labeling.
- Filter genes: keep genes expressed in at least N cells (demo used >=100 cells).
- Normalize each cell to total counts (e.g., 10,000 counts per cell) and log-transform (log1p).
- Important normalization caveat: any normalization or transformation that borrows information across cells (batch/global normalization, scaling using global statistics, etc.) can introduce information leakage. Either split train/test before such normalizations, or process train and test independently but with the same pipeline, to avoid inflated accuracy.
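The preprocessing steps above can be sketched with NumPy in place of an AnnData object; in a real workflow these correspond to scanpy's sc.pp.filter_genes, sc.pp.normalize_total, and sc.pp.log1p (the toy matrix and thresholds here are illustrative, not from the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy count matrix standing in for adata.X (cells x genes)
X = rng.poisson(0.5, size=(500, 200)).astype(float)

# Filter genes: keep genes detected in at least min_cells cells
min_cells = 100
keep = (X > 0).sum(axis=0) >= min_cells
X = X[:, keep]

# Normalize each cell to 10,000 total counts, then log1p-transform
counts_per_cell = X.sum(axis=1, keepdims=True)
X = np.log1p(X / counts_per_cell * 1e4)
```

Note that the gene filter and the per-cell total both depend only on the data they are computed from, so running this pipeline separately on train and test avoids the leakage described above.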
Toy example — simple endothelial vs not-endothelial classifier (within the same dataset)
- Create binary labels Y: 1 if cell type == “endothelial”, else 0.
- Initialize RandomForestClassifier (default n_estimators=100); optionally set n_jobs for parallelism.
- Fit model on adata.X (cells × genes) and Y.
- Extract feature importances via model.feature_importances_. Map to adata.var_names (genes), sort, and display/top-plot important genes.
- (Optional) Predict on the same data to visualize predictions on UMAP (note: this is biased since train/test are identical).
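The toy classifier above can be sketched as follows, using synthetic data in place of adata.X and hypothetical gene names in place of adata.var_names (label construction and names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_cells, n_genes = 300, 50
X = rng.normal(size=(n_cells, n_genes))
# Binary labels standing in for "endothelial vs not"; here gene 0 alone
# determines the label, so it should dominate the importances
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X, y)

# Map importances to gene names and sort, as one would with adata.var_names
gene_names = [f"gene_{i}" for i in range(n_genes)]
ranked = sorted(zip(gene_names, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
```

With real data, the top-ranked genes play the role of candidate marker genes for the positive class.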
Building a balanced training set for cross-dataset classification (endothelial vs others)
- Address class imbalance (e.g., 6,714 endothelial cells out of ~90k):
- Randomly sample an equal number of non-endothelial cells to match endothelial count (downsampling majority class).
- Combine endothelial and sampled non-endothelial barcodes to form a balanced training subset.
- Harmonize features between datasets:
- Concatenate train and test AnnData objects with anndata.concat (join='inner' by default, i.e., the intersection of genes).
- Optionally select a reduced feature set such as highly variable genes (HVGs) — demo used 2,000 HVGs.
- Ensure train and test have identical gene order (model expects consistent feature order).
- Train RandomForest on the balanced training subset.
- Ensure test data was processed similarly but independently.
- Predict on test.X and add predictions to test.obs for visualization.
- Obtain ground-truth labels from test.obs (map endothelial subtypes → 1, others → 0).
- Evaluate with sklearn.metrics.accuracy_score(true_labels, predicted_labels). Demo reported ~95% accuracy for cross-dataset endothelial classification.
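A minimal sketch of the balanced-training workflow, with synthetic arrays standing in for the two datasets (class sizes, the five "informative genes", and the effect size are all invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Imbalanced training pool: 100 "endothelial" cells among 1,000 total
X = rng.normal(size=(1000, 30))
y = np.zeros(1000, dtype=int)
y[:100] = 1
X[y == 1, :5] += 2.0  # give the minority class a detectable signal

# Downsample the majority class to match the minority count
pos_idx = np.flatnonzero(y == 1)
neg_idx = rng.choice(np.flatnonzero(y == 0), size=len(pos_idx), replace=False)
train_idx = np.concatenate([pos_idx, neg_idx])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[train_idx], y[train_idx])

# Stand-in for the second dataset; columns must be in the same gene order
X_test = rng.normal(size=(200, 30))
y_test = np.zeros(200, dtype=int)
y_test[:100] = 1
X_test[y_test == 1, :5] += 2.0

acc = accuracy_score(y_test, model.predict(X_test))
```

In the real cross-dataset setting, the test matrix comes from Tabula Sapiens after gene intersection and reordering, not from the same generative process as here.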
Example — predicting infected vs non-infected cells (AT2 cells)
- Subset to a single cell type (AT2) to control for cell-type differences.
- Filter genes (present in >=100 cells) and compute HVGs for feature selection.
- Split data by sample (not by individual cells) to avoid leakage across individuals:
- Use one infected sample and one control sample as the test set (all cells from those samples).
- Use remaining samples for training.
- Create binary labels from sample names (e.g., sample name containing “COV” → infected = 1, else control = 0).
- Train RandomForest on train.X and train labels; predict on test.X.
- Evaluate with accuracy_score. Demo reported very high accuracy (~98–99%) for infected vs control classification within AT2 cells.
- Extract model.feature_importances_ to find genes driving the classification (example top gene: AGBL1).
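The sample-level split above can be sketched as follows; the sample names, cell counts, and infection signal are hypothetical stand-ins for adata.obs["sample"] and the real AT2 expression matrix:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Hypothetical sample IDs: names containing "COV" denote infected patients
samples = ["COV01", "COV02", "CTRL01", "CTRL02"]
sample_of_cell = rng.choice(samples, size=400)  # stands in for adata.obs
y = np.array(["COV" in s for s in sample_of_cell], dtype=int)

X = rng.normal(size=(400, 40))
X[y == 1, :5] += 1.5  # synthetic infection signal

# Hold out whole samples (one infected, one control), not individual cells,
# so no patient contributes cells to both train and test
test_mask = np.isin(sample_of_cell, ["COV02", "CTRL02"])
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[~test_mask], y[~test_mask])
acc = accuracy_score(y[test_mask], model.predict(X[test_mask]))
```

Splitting by cell instead of by sample would let the model memorize patient-specific effects and inflate accuracy, which is exactly the leakage the demo guards against.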
Tips, caveats, and suggestions
- Random Forests are easy to use, fast, and provide feature importance (less of a black box than some methods).
- Reduce features (e.g., HVGs) when the number of features is much larger than samples to reduce overfitting and speed training.
- Balance classes in the training set to avoid biased classifiers (downsample majority or upsample minority).
- Process train and test in the same way but independently to prevent information leakage.
- Consider advanced normalization/integration (e.g., scVI) to improve cross-dataset transferability.
- Tune Random Forest hyperparameters (n_estimators, max_depth, etc.) and experiment with different feature sets for improved performance.
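Hyperparameter tuning can be done with a cross-validated grid search; this sketch uses a tiny illustrative grid and synthetic data (a real search would cover wider ranges and use the actual expression matrix):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Small, illustrative grid over two common hyperparameters
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
    cv=3,
)
grid.fit(X, y)
```

grid.best_params_ and grid.best_score_ then report the winning combination and its cross-validated accuracy.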
Key outcomes / results shown
- RandomForestClassifier can extract biologically meaningful marker genes for cell identity (feature importance aligned with known endothelial markers in the toy example).
- Cross-dataset cell-type labeling (training on one dataset, testing on a differently processed dataset) achieved ~95% accuracy in the demo.
- Distinguishing infected vs non-infected cells (same cell type, split by samples) achieved very high accuracy (~98–99%), with top features identified (example: AGBL1).
Speakers, datasets and tools referenced
- Speaker: the video’s presenter/instructor (unnamed; single primary speaker).
- Datasets:
- SARS‑CoV‑2 study lung dataset (training data; mixed infected and healthy samples).
- Tabula Sapiens lung dataset (external test dataset used in the demo).
- Software/tools/packages:
- scanpy (AnnData, UMAP, concatenation, HVG selection)
- pandas
- scikit-learn (RandomForestClassifier, metrics.accuracy_score)
- scVI (mentioned as an alternative normalization/integration approach)
- Example gene mentioned as important: AGBL1
(End of summary.)