5 Steps to Effectively Use Scrublet for Single-Cell RNA-Seq Data Analysis

2025-04-17

howtookop

Unleash the power of clean, reliable single-cell RNA sequencing data. Tired of ambiguous doublets polluting your precious datasets and confounding your groundbreaking research? Scrublet, a powerful and readily accessible tool, provides a robust solution for identifying and removing these unwanted artifacts. In the following sections, we will delve into the practical application of Scrublet, guiding you step-by-step through the process of integrating it into your single-cell analysis workflow. Moreover, we will explore best practices for optimizing Scrublet’s performance and interpreting its results, ultimately empowering you to extract meaningful biological insights with confidence. Finally, we’ll discuss some advanced techniques and considerations for tackling challenging datasets and adapting Scrublet to your specific research needs. Prepare to transform your single-cell analyses and unlock the true potential of your data.

First, you’ll need to install Scrublet within your Python environment. This can typically be accomplished via pip, a popular Python package installer. Subsequently, import the necessary libraries, including Scrublet itself, along with data manipulation tools such as SciPy and NumPy. Furthermore, load your single-cell data into a suitable format, such as an AnnData object, commonly used in single-cell analysis. Once your data is loaded, initialize a Scrublet object, specifying relevant parameters such as the expected doublet rate. Next, run the Scrublet algorithm, which will generate doublet scores for each cell in your dataset. These scores reflect the likelihood of a cell being a doublet, based on the distribution of simulated doublets. Critically, visualize the doublet scores using a histogram or scatter plot to assess the overall quality of your data and identify potential thresholds for doublet removal. Ultimately, by carefully examining the distribution of doublet scores, you can make informed decisions about which cells to exclude from downstream analysis.

While Scrublet offers a robust solution for doublet detection, it’s crucial to understand its limitations and potential pitfalls. For instance, its accuracy can be influenced by factors such as sequencing depth and the complexity of the cell population being analyzed, so keep these in mind when interpreting its results. Additionally, Scrublet works by simulating doublets from random pairs of observed cells, which makes it best at finding doublets formed from two transcriptionally distinct cell types. Doublets formed from two cells of the same or a closely related type look much like singlets and are hard to detect, and cell aggregates may not resemble the simulated doublets at all. In such cases, it’s advisable to explore alternative doublet detection methods or combine Scrublet with other quality control measures. Furthermore, remember that removing doublets is just one step in the broader single-cell analysis workflow. Consequently, it’s important to integrate Scrublet into a comprehensive quality control pipeline that includes other steps such as filtering low-quality cells and, once Scrublet has been run on the raw counts, normalizing gene expression data. By taking a holistic approach to quality control, you can improve the reliability and reproducibility of your single-cell analyses.

Installing Scrublet

Alright, so you want to get started with Scrublet? No problem! The first thing you need to do is make sure you have Python installed. Scrublet is a Python library, so it needs Python to run. Ideally, you should be using Python 3.6 or a newer version. Older versions *might* work, but they aren’t officially supported, so you might run into some unexpected issues. To check your Python version, just open up your terminal or command prompt and type python --version or python3 --version and hit enter. If you don’t have Python, or have an older version, head over to the official Python website (python.org) and download the latest installer for your operating system.

Once Python is all set up, the easiest way to install Scrublet is using pip, Python’s package installer. Just open your terminal or command prompt again and type pip install scrublet. This will download and install the latest stable version of Scrublet and all its necessary dependencies. If you’re using a virtual environment (which is generally a good practice for managing project dependencies), make sure you’ve activated it first before running the install command. If you run into any permission errors, try adding --user to the end of the command, like this: pip install scrublet --user. This will install Scrublet in your user directory instead of the system directory, which often resolves permission issues.

Now, sometimes you might be working with a particularly large dataset, and the standard Scrublet installation might not cut it. In those cases, you can consider installing the development version of Scrublet directly from GitHub. This version might include performance improvements or bug fixes that aren’t yet available in the stable release. To install from GitHub, use this command in your terminal: pip install git+https://github.com/swolock/scrublet.git. This will fetch the most recent code from the GitHub repository and install it on your system. Keep in mind that the development version might be less stable than the release version, so use it with a bit of caution.

Finally, let’s quickly verify that Scrublet installed correctly. Open a Python interactive session (just type python or python3 in your terminal) and type import scrublet. If you don’t see any error messages, you’re good to go! You’ve successfully installed Scrublet and are ready to start cleaning up your single-cell RNA-seq data.
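If you’d rather check from a script than from an interactive session, here’s a minimal sketch that uses only the standard library. The helper name is_installed is made up for this example; it simply asks Python whether a module can be located, without importing it.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Scrublet plus the libraries this tutorial uses to load data.
for name in ("scrublet", "numpy", "scipy"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

Any line that prints MISSING points at a package to install into the environment you’re currently using.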

Loading Your Data

Okay, so you’ve got Scrublet installed, now let’s get your data loaded and ready to go. Scrublet works with single-cell RNA-seq data in the form of a counts matrix. This matrix should have cells as rows and genes as columns, with the values representing the raw counts of UMIs (Unique Molecular Identifiers) or reads for each gene in each cell. Scrublet accepts the matrix as a SciPy sparse matrix or a NumPy array, and data stored in an AnnData object can be used by pulling out its matrix. Here’s a quick breakdown of each:

Library	Description	Example
anndata	Recommended for storing single-cell data, including metadata.	`import anndata as ad; adata = ad.read_h5ad("your_data.h5ad"); counts_matrix = adata.X`
scipy.sparse	Efficient for storing sparse matrices (lots of zeros).	`import scipy.sparse as sp; counts_matrix = sp.load_npz("your_data.npz")`
numpy	Suitable for dense matrices, but can be memory-intensive.	`import numpy as np; counts_matrix = np.load("your_data.npy")`

Make sure you choose the appropriate library based on your data format and how it’s currently stored. Once you’ve loaded your counts matrix into one of these formats, you’re almost ready to start using Scrublet. Just remember to double-check that your matrix is oriented correctly, with cells as rows and genes as columns. Otherwise Scrublet will treat genes as cells and the results will be meaningless. Some formats, such as the Matrix Market files written by Cell Ranger, store genes as rows, so a transpose may be needed. AnnData objects already keep cells as rows, but if the object has been processed, adata.X may hold normalized values rather than raw counts.
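Here’s a small sketch of the sanity checks described above, using only NumPy and SciPy. The function names as_cells_by_genes and looks_like_raw_counts are hypothetical helpers written for this article, and the sketch assumes you already know how many cells your dataset contains. Note that it can’t resolve a square matrix (as many genes as cells), which it leaves untouched.

```python
import numpy as np
import scipy.sparse as sp


def as_cells_by_genes(counts, n_cells):
    """Return a CSC sparse matrix with cells as rows, transposing if needed."""
    counts = sp.csc_matrix(counts)
    if counts.shape[0] == n_cells:
        return counts
    if counts.shape[1] == n_cells:
        return sp.csc_matrix(counts.T)
    raise ValueError(f"neither axis of shape {counts.shape} has length {n_cells}")


def looks_like_raw_counts(counts) -> bool:
    """True if every stored value is a non-negative whole number."""
    values = sp.csc_matrix(counts).data
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# Toy matrix stored the "wrong" way round: 4 genes x 3 cells.
demo = np.array([[0, 2, 1], [5, 0, 0], [1, 1, 0], [0, 0, 7]])
counts_matrix = as_cells_by_genes(demo, n_cells=3)
print(counts_matrix.shape)                   # (3, 4)
print(looks_like_raw_counts(counts_matrix))  # True
```

If looks_like_raw_counts returns False, the matrix has probably been normalized or log-transformed already, and you’ll want to go back to the raw counts.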

Creating a Scrublet Object

First things first, you need to bring Scrublet into your Python environment. This assumes you’ve already installed it (using pip install scrublet). You start by importing the Scrublet module. Then, to actually use it, you create a Scrublet object. Think of this object as your dedicated Scrublet workspace. You feed it your gene expression matrix, the raw data you want to analyze. This matrix should be in a format where rows represent cells and columns represent genes.

Here’s how you do it in code:

import scrublet as scr
scrub = scr.Scrublet(counts_matrix)

In this snippet, counts_matrix is your gene expression matrix. It’s worth noting that Scrublet is designed to work directly with raw counts, not normalized data. So, make sure your counts_matrix holds raw counts of gene expression, typically from a single-cell RNA sequencing experiment.

Setting Parameters

Scrublet comes with a handful of parameters you can tweak to fine-tune the doublet detection process. Understanding these parameters gives you more control over how Scrublet identifies doublets in your data. While Scrublet often works well with default settings, adjusting these parameters can be beneficial, especially if you’re dealing with datasets that have unique characteristics or if the default settings aren’t yielding the expected results.

Expected Doublet Rate

One crucial parameter is expected_doublet_rate. This tells Scrublet roughly what fraction of your cells you anticipate being doublets. This isn’t always easy to know precisely, but a reasonable starting point is often around 0.05 or 0.06 (meaning you expect 5-6% of your cells to be doublets). If you don’t set it, Scrublet uses 0.1. This parameter feeds into how the doublet scores are calculated, and Scrublet’s documentation notes that results are usually not very sensitive to it. If you have prior knowledge about the doublet rate in your experiment (perhaps from the library preparation protocol), you can use that information here. For instance, some microfluidic platforms provide estimates of expected doublet rates.

Minimum Counts, Minimum Cells, and Minimum Gene Variability

Scrublet also filters genes before scoring, using the parameters min_counts, min_cells, and min_gene_variability_pctl. These are passed to the scrub_doublets() method rather than to the Scrublet object itself. Together, min_counts and min_cells keep only genes that have at least min_counts counts in at least min_cells cells. min_gene_variability_pctl then keeps only the most variable genes: with the default of 85, genes above the 85th percentile of variability are retained. These filters are helpful for removing genes that might add noise to the doublet detection process, since genes with very low expression are unlikely to be informative for identifying doublets.

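Simulated Doublets

Scrublet always creates artificial doublets to calibrate its detection process; this isn’t something you switch on or off. What you can control is how many it creates, through the sim_doublet_ratio parameter of the Scrublet object, which sets the number of simulated doublets relative to the number of observed cells (the default of 2.0 means twice as many simulated doublets as cells). These simulated doublets act as a reference group, allowing Scrublet to learn the characteristics of doublets within the context of your specific dataset.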
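To make the idea of an artificial doublet concrete, here’s a rough NumPy sketch of the core step: pick random pairs of observed cells and add their raw counts together. This is an illustration of the concept only, not Scrublet’s actual code, and the function name and seed argument are made up for the example.

```python
import numpy as np


def simulate_doublets(counts, sim_doublet_ratio=2.0, seed=0):
    """Sum the raw counts of randomly chosen pairs of cells (rows).

    Returns the simulated doublet matrix and the row indices of each pair.
    """
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    n_sim = int(n_cells * sim_doublet_ratio)
    # Each simulated doublet has two "parent" cells drawn at random.
    parents = rng.integers(0, n_cells, size=(n_sim, 2))
    doublets = counts[parents[:, 0]] + counts[parents[:, 1]]
    return doublets, parents


# Toy data: 20 cells x 50 genes of Poisson-distributed counts.
toy_counts = np.random.default_rng(1).poisson(1.0, size=(20, 50))
sim, parents = simulate_doublets(toy_counts)
print(sim.shape)  # (40, 50)
```

Because each simulated doublet is a sum of two real cells, it carries roughly twice the counts of a typical singlet and a blend of both parents’ expression profiles, which is what makes it a useful reference.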

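Parameter	Description	Default Value
`expected_doublet_rate`	Estimated proportion of doublets in the data (set on the Scrublet object)	0.1
`sim_doublet_ratio`	Number of simulated doublets relative to the number of observed cells (set on the Scrublet object)	2.0
`min_counts`	Minimum counts a gene must reach in at least `min_cells` cells (passed to `scrub_doublets()`)	3
`min_cells`	Minimum number of cells in which a gene must reach `min_counts` (passed to `scrub_doublets()`)	3
`min_gene_variability_pctl`	Percentile cutoff for gene variability; genes above it are kept (passed to `scrub_doublets()`)	85

Here’s how you might set these parameters in code:

scrub = scr.Scrublet(counts_matrix, expected_doublet_rate=0.05, sim_doublet_ratio=2.0)
doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2, min_cells=3, min_gene_variability_pctl=85)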

By understanding and adjusting these parameters, you can ensure Scrublet is optimally configured for your data and you’re getting the most accurate doublet identification possible.
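One easy mistake is passing a setting to the wrong place. As far as I can tell from Scrublet’s source, expected_doublet_rate and sim_doublet_ratio belong to the Scrublet constructor, while the gene filters and n_prin_comps belong to scrub_doublets(). The sketch below keeps the two groups in separate dictionaries so the split stays visible. The names INIT_KWARGS, RUN_KWARGS, and run_scrublet are my own, and the values are placeholders rather than recommendations.

```python
# Settings for the Scrublet constructor: dataset-level expectations.
INIT_KWARGS = {
    "expected_doublet_rate": 0.06,
    "sim_doublet_ratio": 2.0,
}

# Settings for scrub_doublets(): gene filtering and PCA.
RUN_KWARGS = {
    "min_counts": 2,
    "min_cells": 3,
    "min_gene_variability_pctl": 85,
    "n_prin_comps": 30,
}


def run_scrublet(counts_matrix):
    """Run Scrublet on a cells x genes matrix of raw counts."""
    # Imported here so the settings above can be inspected without Scrublet installed.
    import scrublet as scr

    scrub = scr.Scrublet(counts_matrix, **INIT_KWARGS)
    doublet_scores, predicted_doublets = scrub.scrub_doublets(**RUN_KWARGS)
    return scrub, doublet_scores, predicted_doublets
```

Keeping the settings in one place also makes it easier to record exactly which values produced a given set of doublet calls.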

Running the Scrublet Score Calculation

Scrublet is a handy tool used to detect doublets (artificial cells created during single-cell RNA sequencing) in your data. A key step in using Scrublet involves calculating a ‘doublet score’ for each cell. This score represents the likelihood of a cell being a doublet. Higher scores indicate a higher probability of being a doublet. Let’s break down how to run this score calculation.

Setting up Scrublet

First things first, you’ll need to import the Scrublet library into your Python environment. If you haven’t already installed it, you can do so using pip: pip install scrublet. Once installed, import it into your script with import scrublet as scr. After that, you’ll need your single-cell data loaded as a count matrix. This is a matrix where rows correspond to cells and columns correspond to genes. You might be storing it in a specialized object like an AnnData object (in which case pass the matrix in adata.X), or as a simple NumPy array. Ensure this matrix contains raw counts, not normalized data.

Creating the Scrublet Object

Next, create a Scrublet object by passing your count matrix to the scr.Scrublet() constructor. For example: scrub_object = scr.Scrublet(counts_matrix). There’s an optional expected_doublet_rate parameter you can adjust. This tells Scrublet roughly what fraction of doublets you anticipate in your data. If you don’t set it, Scrublet uses 0.1. A value such as 0.06 is a reasonable starting point for many datasets, but if you have prior knowledge about your experiment, adjust this accordingly. Higher values are appropriate when you suspect a higher doublet rate, and lower values when you expect fewer doublets.

Calculating Doublet Scores

Now comes the core part: calculating the doublet scores. Use the scrub_object.scrub_doublets() method for this. This method does the heavy lifting, simulating doublets and comparing them to your actual data. The simplest way to run it is: doublet_scores, predicted_doublets = scrub_object.scrub_doublets(). This will return two important pieces of information: an array of doublet scores (one for each cell) and an array of boolean values indicating whether each cell is predicted to be a doublet (True) or not (False) based on the threshold Scrublet picks automatically. You can override this threshold afterwards with scrub_object.call_doublets(threshold=...). The scrub_doublets() method accepts a few optional parameters that provide finer control over the doublet detection process:

Parameter	Description
`min_counts`	Used together with `min_cells`: a gene is kept only if it has at least this many counts in at least `min_cells` cells. Helps filter out low-expressed genes.
`min_cells`	Minimum number of cells in which a gene must reach `min_counts`. Also assists in filtering.
`min_gene_variability_pctl`	Keeps only highly variable genes. Value represents a percentile; setting it to 75, for example, keeps the 25% most variable genes.
`n_prin_comps`	Number of principal components to use in the doublet detection process. Using more components can capture more complex variations but can be computationally more expensive.

One related setting, sim_doublet_ratio, controls how many artificial doublets Scrublet generates for comparison. It is set when you create the Scrublet object, not in scrub_doublets(). Increasing it can improve accuracy but also increases computational cost, and the default value of 2.0 is usually sufficient.

For example, to run the doublet score calculation with specific minimum counts and cells parameters and a custom simulated doublet ratio, you would create the object with scrub_object = scr.Scrublet(counts_matrix, sim_doublet_ratio=3) and then execute: doublet_scores, predicted_doublets = scrub_object.scrub_doublets(min_counts=10, min_cells=3).

Once the scores are calculated, you can access them directly. For instance, doublet_scores[0] would give you the doublet score of the first cell. You can also visualize the distribution of these scores to help you identify the appropriate threshold for classifying doublets. A histogram is particularly useful for this. After you’ve identified a good threshold, you can then use the predicted_doublets output or apply the threshold yourself to categorize cells as doublets or singlets. This allows you to then proceed with further analysis, confident that the impact of doublets has been mitigated. Remember to always inspect your results visually and adjust parameters as needed based on your specific data’s characteristics.
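Applying a threshold yourself comes down to a boolean mask over the rows of the matrix. Here’s a minimal sketch with made-up scores and a made-up cutoff of 0.25; remove_doublets is a hypothetical helper, not part of Scrublet. With a real Scrublet object, calling call_doublets(threshold=0.25) should give you the equivalent boolean calls.

```python
import numpy as np


def remove_doublets(counts_matrix, doublet_scores, threshold):
    """Return the cells (rows) scoring at or below the threshold, plus the doublet mask."""
    is_doublet = np.asarray(doublet_scores) > threshold
    return counts_matrix[~is_doublet], is_doublet


# Toy example: 4 cells x 3 genes, one cell with a clearly elevated score.
counts = np.array([[3, 0, 1], [0, 2, 2], [4, 4, 4], [1, 0, 0]])
scores = np.array([0.05, 0.10, 0.62, 0.08])

singlets, is_doublet = remove_doublets(counts, scores, threshold=0.25)
print(f"removed {is_doublet.sum()} of {len(scores)} cells")  # removed 1 of 4 cells
```

It’s worth keeping the mask alongside your cell metadata rather than discarding the flagged cells immediately, so you can revisit the decision later.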

Interpreting the Doublet Scores

Scrublet assigns each cell a ‘doublet score’, which represents the probability that the cell is a doublet. Understanding these scores and how to use them to identify doublets is key to successfully applying Scrublet.

What is a Doublet Score?

The doublet score is a value between 0 and 1. A score closer to 1 suggests the cell looks more like a doublet, while a score closer to 0 suggests it’s a singlet (a genuine, single cell). It’s tempting to read a score of 0.9 as “a 90% chance that this cell is actually two cells stuck together”, but the scores depend on settings such as the expected doublet rate and on how well the simulated doublets match real ones, so it’s safer to treat them as a ranking of how doublet-like each cell is than as exact probabilities.

Setting a Threshold for Doublet Identification

Scrublet does make an automatic call: it picks a threshold from the distribution of simulated doublet scores and uses it to fill in the predicted doublets. That automatic threshold isn’t always well placed, though, so you, the researcher, should check it and be ready to decide on a cutoff yourself. This means choosing a doublet score above which you’ll consider a cell a doublet, which you can apply with call_doublets(threshold=...). There’s no one-size-fits-all magic number. The best threshold depends on your specific dataset and the expected doublet rate. Scrublet’s automatic threshold, together with the doublet rate it reports, can be a good starting point.

The Histogram: A Visual Aid

Scrublet can plot a histogram (via its plot_histogram() method) visualizing the distribution of doublet scores across all cells in your data, alongside the scores of the simulated doublets. This histogram is extremely helpful in choosing a threshold. Typically, you’ll see a peak near 0, representing the majority of singlet cells. Look for a smaller, secondary “bump” or elevated area at higher doublet scores, which often indicates the presence of doublets. You’ll want to set your threshold at the lower edge of this elevated region, so that these likely doublets fall above it.

Factors Influencing Doublet Scores and Threshold Selection

Selecting the right doublet score threshold is crucial for accurate doublet identification. Here’s a deeper dive into interpreting doublet scores and how to select an appropriate threshold, considering various factors:

The Estimated Doublet Rate: After calling doublets, Scrublet reports the fraction of cells it flagged along with an estimate of the overall doublet rate. Compare that estimate with what you expect from your experimental setup. For example, if you loaded a very high concentration of cells, you would anticipate a high doublet rate. If Scrublet’s estimate is far from your expectation in either direction, the threshold may be misplaced.

The Histogram Shape: Look carefully at the distribution of doublet scores. A clear separation between a peak of low scores (singlets) and a separate “bump” of higher scores (potential doublets) simplifies threshold selection. Set the threshold somewhere within this “bump” region. If the separation isn’t clear, you may need to be more conservative with your threshold, or investigate other factors impacting your data.

Manual Inspection and Downstream Analysis: After setting a threshold, examine marker gene expression in the identified doublets. Do they express markers of multiple distinct cell types? If so, this strengthens the case for their classification as doublets. Conversely, if the “doublets” consistently express markers of a single cell type, you may need to re-evaluate your threshold. Consider the potential impact of removing doublets on downstream analyses like clustering or differential expression. Are you losing biologically relevant populations? If so, you might want to try a higher threshold, which removes fewer cells. This process often involves some iteration and refinement.
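The marker check can be turned into a quick number: what fraction of the flagged cells express markers of two different cell types at once? The sketch below assumes you know the column indices of two marker genes in your cells-by-genes matrix. The function name, the toy counts, and the choice of columns 0 and 1 as markers are all invented for illustration.

```python
import numpy as np


def coexpression_fraction(counts, cell_mask, marker_a, marker_b, min_count=1):
    """Fraction of the masked cells with at least min_count counts of both marker genes."""
    sub = counts[cell_mask]
    if sub.shape[0] == 0:
        return float("nan")
    both = (sub[:, marker_a] >= min_count) & (sub[:, marker_b] >= min_count)
    return float(both.mean())


# Toy data: column 0 marks one cell type, column 1 marks another.
counts = np.array([
    [5, 0, 1],
    [0, 4, 0],
    [3, 6, 2],
    [2, 3, 0],
])
is_doublet = np.array([False, False, True, True])

print(coexpression_fraction(counts, is_doublet, marker_a=0, marker_b=1))   # 1.0
print(coexpression_fraction(counts, ~is_doublet, marker_a=0, marker_b=1))  # 0.0
```

A much higher co-expression fraction among flagged cells than among the rest supports the doublet calls. Similar fractions in both groups suggest the threshold, or the choice of markers, deserves a second look.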

Example Scenarios and Threshold Adjustments:

Scenario	Suggested Threshold Adjustment
High expected doublet rate (e.g., high cell loading)	Expect more cells above the threshold. If far fewer cells are flagged than you expect, consider lowering the threshold.
Unclear separation in the histogram	Consider a more conservative (higher) threshold.
Loss of biologically relevant populations after doublet removal	Explore a less aggressive (higher) threshold so that fewer cells are removed.
“Doublets” express markers of a single cell type	Re-evaluate threshold, consider other quality control metrics.

Ultimately, choosing the best threshold requires careful consideration of these factors, combined with an understanding of your specific experimental context and downstream analysis goals.

Visualizing Doublet Scores and Thresholds

Scrublet assigns a ‘doublet score’ to each cell, representing the probability that it’s a doublet. A key step in using Scrublet involves visualizing these scores to determine an appropriate threshold for classifying cells as doublets. This visualization helps us understand the distribution of doublet scores and identify a cutoff point that separates true singlets from likely doublets.

Using Histograms and Scatter Plots

Histograms and scatter plots are powerful tools for visualizing doublet scores. A histogram displays the distribution of doublet scores across all cells. Ideally, you’ll see a bimodal distribution: a peak representing the majority of singlets with low doublet scores, and a smaller peak (or a long tail) representing potential doublets with higher scores.

Scatter plots can further enhance our understanding. We typically plot the doublet scores against other cell-specific metrics, like the number of unique molecular identifiers (UMIs) or the number of expressed genes. Doublets often exhibit higher UMI counts and express more genes than singlets, so this visualization helps confirm potential doublets identified by their high doublet scores.
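Before plotting, you need the per-cell metrics themselves. Here’s a minimal NumPy sketch that computes total counts and the number of detected genes for each cell, then compares the medians between flagged and unflagged cells. The helper per_cell_qc and the toy numbers are made up for this example, and it assumes a dense cells-by-genes array.

```python
import numpy as np


def per_cell_qc(counts):
    """Return (total counts per cell, genes detected per cell) for a cells x genes array."""
    counts = np.asarray(counts)
    total_umis = counts.sum(axis=1)
    genes_detected = (counts > 0).sum(axis=1)
    return total_umis, genes_detected


counts = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 0],
    [4, 2, 3, 5],  # the flagged cell
    [1, 0, 0, 1],
])
is_doublet = np.array([False, False, True, False])

total_umis, genes_detected = per_cell_qc(counts)
print("median UMIs, flagged:  ", np.median(total_umis[is_doublet]))
print("median UMIs, unflagged:", np.median(total_umis[~is_doublet]))
```

The two arrays can be passed straight to a scatter plot against the doublet scores, using whichever plotting library you prefer.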

Interpreting the Visualization

The threshold for classifying doublets isn’t always clear-cut. Look for a dip or ‘valley’ in the histogram between the singlet and doublet peaks. This dip suggests a natural separation point. In the scatter plot, look for a cluster of cells with high doublet scores and elevated UMI/gene counts. These cells are strong doublet candidates.

Setting the Threshold - Automated and Manual Approaches

Scrublet offers an automated threshold, which it derives from the distribution of simulated doublet scores. However, it’s crucial to manually inspect the visualizations and adjust the threshold if needed. The automated suggestion serves as a good starting point, but the optimal threshold depends on the specific dataset and experimental conditions. Don’t be afraid to tweak the threshold to best separate singlets from doublets in your data.

Threshold Examples Based on Different Expected Doublet Rates

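The optimal doublet score threshold will vary depending on the expected doublet rate in your experiment. The ranges below are rough, illustrative starting points rather than values from the Scrublet documentation, and the shape of your own histogram should always take precedence:

Expected Doublet Rate	Approximate Threshold Range	Considerations
Low (<5%)	0.05 - 0.10	A low cutoff flags cells readily, so check that it sits clear of the singlet peak to limit false positives.
Moderate (5-15%)	0.10 - 0.20	Balance between sensitivity and specificity.
High (>15%)	0.20 - 0.30 or higher	Scores tend to shift upwards when more doublets are expected; confirm that the flagged fraction is plausible for your loading.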

Remember that these are just guidelines. Always visualize your data and adjust the threshold accordingly. It’s better to err on the side of caution and potentially miss a few doublets than to misclassify a large number of singlets as doublets.

Refining the Threshold with Simulated Doublets

Scrublet generates simulated doublets as part of its analysis. These simulated doublets help inform the threshold selection. By comparing the doublet scores of real cells to those of simulated doublets, we can better estimate the appropriate threshold. Ideally, the threshold should be set such that most simulated doublets are classified as doublets, while the majority of real cells are classified as singlets.
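This comparison is easy to compute by hand. Given the scores of the real cells and of the simulated doublets, the sketch below reports what fraction of each group lands above a candidate threshold. With a real Scrublet object, these two arrays should be available as the doublet_scores_obs_ and doublet_scores_sim_ attributes after scrub_doublets() has run. The function name threshold_report and the toy values are my own.

```python
import numpy as np


def threshold_report(scores_obs, scores_sim, threshold):
    """Fraction of observed cells and of simulated doublets scoring above the threshold."""
    scores_obs = np.asarray(scores_obs)
    scores_sim = np.asarray(scores_sim)
    return {
        "observed_called": float((scores_obs > threshold).mean()),
        "simulated_called": float((scores_sim > threshold).mean()),
    }


# Toy scores for four real cells and four simulated doublets.
observed = [0.05, 0.10, 0.60, 0.07]
simulated = [0.50, 0.70, 0.10, 0.90]

report = threshold_report(observed, simulated, threshold=0.25)
print(report)  # {'observed_called': 0.25, 'simulated_called': 0.75}
```

A good threshold gives a high simulated_called value while keeping observed_called close to the doublet rate you expect. If raising the threshold barely changes observed_called but sharply lowers simulated_called, you’ve probably gone too far.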

Advanced Visualization Techniques

Beyond histograms and scatter plots, other visualization techniques can be helpful. For example, you can use a knee plot, plotting the doublet scores against the number of cells identified as doublets at each threshold. The “knee” point in the plot, where the curve starts to flatten, can suggest a suitable threshold. Additionally, visualizing the data in a reduced dimensionality space, such as using t-SNE or UMAP, and coloring the cells by their doublet scores can reveal clusters of potential doublets. These advanced techniques provide a more nuanced view of the data and can aid in fine-tuning the threshold for optimal doublet identification.
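The data behind such a knee plot is just a count of flagged cells at each candidate threshold. Here’s a small sketch with invented scores; the function name is hypothetical, and plotting the result is left to your preferred library.

```python
import numpy as np


def doublets_called_per_threshold(scores, thresholds):
    """Number of cells scoring above each candidate threshold."""
    scores = np.asarray(scores)
    return np.array([int((scores > t).sum()) for t in thresholds])


scores = np.array([0.02, 0.05, 0.07, 0.10, 0.12, 0.55, 0.60, 0.71])
thresholds = np.linspace(0.0, 1.0, 11)
n_called = doublets_called_per_threshold(scores, thresholds)

for t, n in zip(thresholds, n_called):
    print(f"threshold {t:.1f}: {n} cells flagged")
```

In this toy example the count drops quickly at low thresholds and then holds steady across the gap between the two groups of scores, and that flat stretch is where a threshold is least sensitive to small changes.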

A Practical Guide to Using Scrublet for Single-Cell RNA-Seq Data Analysis

Scrublet is a powerful tool for identifying doublets in single-cell RNA sequencing (scRNA-seq) data. Doublets, which arise when two or more cells are captured in the same droplet or well, can confound downstream analyses by creating artificial cell types or masking true biological variation. Effectively using Scrublet involves understanding its underlying principles and carefully considering its parameters to optimize doublet detection within your specific dataset. This guide outlines key considerations for implementing Scrublet effectively.

First, ensure your data is appropriately prepared. This includes quality control steps such as filtering out low-quality cells, but not normalization: Scrublet operates on raw gene expression counts and handles its own normalization internally, so normalize only after doublet detection. Second, understand the expected_doublet_rate parameter. This value should reflect the platform and experimental protocol used. Scrublet’s documentation notes that results are usually not very sensitive to it, but a large overestimate can still inflate false positives, while a large underestimate can reduce sensitivity.

Third, carefully examine the Scrublet output. The doublet scores assigned to each cell represent the probability of being a doublet. While Scrublet provides a suggested threshold, it is crucial to visually inspect the doublet score histogram and potentially adjust the threshold based on the distribution. Consider integrating Scrublet results with other doublet detection methods or independent validation, especially in critical applications. Finally, remember that Scrublet, like any computational tool, has limitations. It performs best when doublets exhibit distinct transcriptional profiles. In cases where doublets are transcriptionally similar to genuine cell types, Scrublet’s accuracy might be reduced.