Data Handling
This tutorial describes how to start working with the MPX output data format. The primary data output file of pixelator is the PXL file. Here, you will learn how to read and interact with it. We will go through the different components of a single sample PXL file and also how to aggregate results from multiple samples. You can read more about the PXL file format here.
After completing this tutorial, you will be able to:
- Load PXL files in R and access the multi-modal data contents including protein counts, metadata and spatial scores.
- Understand and work with spatial polarity (protein clustering patterns).
- Interpret and work with colocalization scores (protein co-clustering).
- Directly access the spatial graph structure through the edge list.
- Aggregate multiple PXL files into an integrated data object with sample identities.
- Save aggregated data objects for later reuse in analyses.
Setup
To start with, we need to load some packages and functions that we will need.
library(pixelatorR)
library(SeuratObject)
library(dplyr)
library(stringr)
library(here)
library(tibble)
Loading data
We begin by locating the output from the nf-core-pixelator
pipeline. The input for
downstream analysis is the PXL file contained in the layout directory.
DATA_DIR <- "path/to/local/folder"
Sys.setenv("DATA_DIR" = DATA_DIR)
Run the following code in the terminal to download the dataset.
BASEURL="https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/next/technote-v1-vs-v2-immunology-II"
curl -L -O -C -  --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
And than proceed in R.
# The folders the data is located in:
filename <- file.path(DATA_DIR, "Sample05_V2_PBMC_r1.layout.dataset.pxl")
pg_data <- ReadMPX_Seurat(filename)
Component meta data
In addition to the counts data, the object contains meta data of each
MPX graph component, each corresponding to a cell. This table contains
information that can be useful in quality control, with information such
as how many protein molecules were detected (molecules), the
sequencing depth (reads), and the graph connectivity of each component
(mean_b_pixels_per_a_pixel).
as_tibble(pg_data[[]], rownames = "component")
# A tibble: 1,125 × 19
   component     a_pixel_b_pixel_ratio a_pixels antibodies b_pixels leiden
   <chr>                         <dbl>    <int>      <int>    <int> <fct> 
 1 RCVCMP0000000                  1.80     3749         79     2084 0     
 2 RCVCMP0000001                  2.33     3338         79     1431 1     
 3 RCVCMP0000002                  1.81     2724         78     1503 0     
 4 RCVCMP0000004                  1.92     2159         79     1126 2     
 5 RCVCMP0000005                  1.81     6413         79     3545 3     
 6 RCVCMP0000007                  2.34     2953         78     1264 4     
 7 RCVCMP0000009                  3.42     1382         76      404 9     
 8 RCVCMP0000011                  2.65     6791         79     2564 3     
 9 RCVCMP0000014                  2.11     5128         79     2436 6     
10 RCVCMP0000016                  2.09     2432         77     1161 13    
# ℹ 1,115 more rows
# ℹ 13 more variables: mean_a_pixels_per_b_pixel <dbl>,
#   mean_b_pixels_per_a_pixel <dbl>, mean_molecules_per_a_pixel <dbl>,
#   mean_reads_per_molecule <dbl>, median_a_pixels_per_b_pixel <dbl>,
#   median_b_pixels_per_a_pixel <dbl>, median_molecules_per_a_pixel <dbl>,
#   median_reads_per_molecule <dbl>, molecules <int>, pixels <int>,
#   reads <int>, tau <dbl>, tau_type <chr>
The antibody count matrix can be accessed from the Seurat object using
the LayerData method.
# Display the 10 first rows and 10 first columns of the antibody counts
LayerData(pg_data, layer = "counts")[1:10, 1:10]
10 x 10 sparse Matrix of class "dgCMatrix"
                                                      
B2M   4690 3379 3070 2014 6731 5090 106 3926 5993 3225
CD102  349   94   39   26  227  155  32  110  154   58
CD11a  509  262  202   63  240  366  65  262  495  140
CD11b   81  156   58   92  185  102  88  416  314   20
CD11c  679   50   35   18  112   28  51  168  757   30
CD123    2   15    5    1    5    4   .   15   19    4
CD14    50   78   42   19   79   26  40  178  613   16
CD150   23  105   30   27  134   14  53  155   22   48
CD152   13   21   17   14   66   12  35  108   25   11
CD154   31   34   36   16   60   22  59  137   34   23
Spatial data
In addition to count data corresponding to protein abundance, MPX also generates spatial data which be used to visualize individual cells, and to find spatial structures of proteins across cells. The PXL file comes with MPX polarity scores, describing the degree of spatial clustering of each protein on each cell, and MPX colocalization scores, which describe the degree of spatial colocalization of pairs of proteins on each cell. If you want to read more about spatial metrics in MPX data, click here.
pixelatorR Assay classes
SeuratObject provides the Assay class and more recently the Assay5
class to store single-cell abundance data in. PixelatorR provides
additional classes CellGraphAssay and CellGraphAssay5 to store both
abundance data and spatial metrics in. These classes inherit the
corresponding SeuratObject classes but also contain additional slots for
the spatial metrics. Here for instance, pg_data contains a
CellGraphAssay5 object called “mpxCells” which can be accessed as
follows:
pg_data[["mpxCells"]]
CellGraphAssay (v5) data with 84 features for 1125 cells
Top 10 variable features:
 B2M, CD102, CD11a, CD11b, CD11c, CD123, CD14, CD150, CD152, CD154 
Layers:
 counts 
Loaded CellGraph objects:
 0
In the output above, we can confirm that the object is of class
CellGraphAssay5 and that it currently contains 0 loaded CellGraph
objects. More about that later.
MPX Polarity scores
The spatial metrics can be accessed from the CellGraphAssay5 object
with:
PolarizationScores(pg_data)
# A tibble: 84,022 × 6
   marker  morans_i morans_z morans_p_value morans_p_adjusted component    
   <chr>      <dbl>    <dbl>          <dbl>             <dbl> <chr>        
 1 ACTB   -0.00827   -1.42          0.0774             0.300  RCVCMP0000000
 2 B2M    -0.00589   -1.77          0.0380             0.206  RCVCMP0000000
 3 CD102  -0.00151   -0.0780        0.469              0.494  RCVCMP0000000
 4 CD11a   0.00856    1.46          0.0721             0.290  RCVCMP0000000
 5 CD11b   0.00410    1.01          0.156              0.393  RCVCMP0000000
 6 CD11c   0.00319    0.851         0.197              0.417  RCVCMP0000000
 7 CD14    0.0131     2.89          0.00190            0.0221 RCVCMP0000000
 8 CD150  -0.00459   -0.712         0.238              0.435  RCVCMP0000000
 9 CD152   0.0100     0.998         0.159              0.395  RCVCMP0000000
10 CD154   0.000512   0.328         0.372              0.473  RCVCMP0000000
# ℹ 84,012 more rows
MPX colocalization scores
Similarly, pixelatorR provides a method to access the colocalization scores:
ColocalizationScores(pg_data)
# A tibble: 3,111,941 × 15
   marker_1 marker_2  pearson pearson_mean pearson_stdev pearson_z
   <chr>    <chr>       <dbl>        <dbl>         <dbl>     <dbl>
 1 ACTB     B2M      -0.00456    -0.0127          0.0259    0.316 
 2 ACTB     CD102    -0.139      -0.00133         0.0222   -6.21  
 3 B2M      CD102    -0.207      -0.0403          0.0273   -6.11  
 4 ACTB     CD11a    -0.0803     -0.00523         0.0247   -3.04  
 5 B2M      CD11a     0.0335     -0.0453          0.0194    4.06  
 6 CD102    CD11a    -0.147      -0.0179          0.0230   -5.64  
 7 ACTB     CD11b    -0.00205    -0.000778        0.0240   -0.0528
 8 B2M      CD11b    -0.0842     -0.0187          0.0217   -3.03  
 9 CD102    CD11b    -0.00990    -0.00598         0.0266   -0.147 
10 CD11a    CD11b     0.0690     -0.00677         0.0209    3.63  
# ℹ 3,111,931 more rows
# ℹ 9 more variables: pearson_p_value <dbl>, pearson_p_value_adjusted <dbl>,
#   jaccard <dbl>, jaccard_mean <dbl>, jaccard_stdev <dbl>, jaccard_z <dbl>,
#   jaccard_p_value <dbl>, jaccard_p_value_adjusted <dbl>, component <chr>
Edgelist
The edgelist is a list of uniquely identified antibodies per each cell, and simultaneously constitutes the bipartite graph which makes up an MPX experiment. Each row contains an edge (a uniquely identified antibody) which is identified by a UMI, UPIA, and a UPIB. The UPIA is the unique identifier for the specific A pixel, while the UPIB is the unique identifier for the B pixel. Both A pixels and B pixels may contain multiple edges, resulting in a spatial graph which can be used to infer spatial relationships between antibody molecules, to calculate spatial statistics, and to visualize the cell.
edge_list <- 
  ReadMPX_edgelist(filename)
edge_list
# A tibble: 26,003,773 × 8
   upia       upib  umi   marker sequence count unique_molecules_count component
   <fct>      <fct> <fct> <fct>  <fct>    <int>                  <int> <fct>    
 1 CGGGTTAGT… ATTG… CGTC… CD19   ACCAACTT    15                      1 RCVCMP00…
 2 CGTGATCAG… GCAG… GAAG… CD19   ACCAACTT    15                      1 RCVCMP00…
 3 CTTTTAGTG… CGTT… TGCG… CD19   ACCAACTT    15                      1 RCVCMP00…
 4 ACCCAGTTC… GATG… GAGT… CD19   ACCAACTT    14                      1 RCVCMP00…
 5 TTGGTGTAC… AGCG… TAGT… CD19   ACCAACTT    14                      1 RCVCMP00…
 6 GTGAAAAAT… ATTA… GTGC… CD19   ACCAACTT    13                      1 RCVCMP00…
 7 TATCCCAGC… AGTC… TGTC… CD19   ACCAACTT    13                      1 RCVCMP00…
 8 TTGATTGTT… CTAG… CCCG… CD19   ACCAACTT    13                      1 RCVCMP00…
 9 AATCGATAG… TGGC… ATGG… CD19   ACCAACTT    12                      1 RCVCMP00…
10 TTACTTGTT… CATA… CATC… CD19   ACCAACTT    12                      1 RCVCMP00…
# ℹ 26,003,763 more rows
Seurat objects and PXL files
For some analytical tasks, we will need access to the edgelist. For
instance, if we want to load component graphs for visualization, we need
access to the edgelist to build these graphs. However, the edgelist is
not stored in the Seurat object, it’s only available in the PXL file.
The CellGraphAssay5 that was mentioned earlier in this tutorial keeps
the path to the PXL file so that we can access it when needed. This has
the advantage that we can keep the size of the Seurat object relatively
small until we decide to load the graphs. On the down side, it also
means that the path to the PXL file need to remain unchanged. If the PXL
file is moved or renamed, we can no longer access it.
This has an important implication when saving and sharing Seurat objects containing MPX data. If you want to export a Seurat object as an RDS file and share it with someone else, you should also share associated the PXL file. In addition, the path to the PXL file has to be updated in the Seurat object.
If this is unclear at the moment, don’t worry. We will come back to this topic in the tutorials about 2D and 3D cell visualization.
Aggregating data
For cases when we want to process and analyze samples in parallel, it is
convenient to merge them into a single object. Here, we read multiple
files and aggregate them into a single object. When doing so, the cell
ID and sample ID are merged to form the new cell ID in the merged object
(e.g. “S1_RCVCMP0000830”). When merging Seurat objects, you can provide
add.cell.ids to control the sample prefixes.
In the code chunk below, we will create a merged object from four PXL files. These PXL files represent a resting and PHA stimulated PBMC sample, both in duplicate. We will create a merged object and update the object’s metadata to indicate the condition and replicate number. The resulting object will be explored further in the next tutorial on Quality Control.
BASEURL="https://pixelgen-technologies-datasets.s3.eu-north-1.amazonaws.com/mpx-datasets/pixelator/next/technote-v1-vs-v2-immunology-II"
curl -L -O -C -  --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample05_V2_PBMC_r1.layout.dataset.pxl"
curl -L -O -C -  --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample06_V2_PBMC_r2.layout.dataset.pxl"
curl -L -O -C -  --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl"
curl -L -O -C -  --create-dirs --output-dir $DATA_DIR "${BASEURL}/Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl"
# The folders the data is located in:
data_files <- 
  c(
    # Resting
    resting_r1 = file.path(DATA_DIR, "Sample05_V2_PBMC_r1.layout.dataset.pxl"),
    resting_r2 = file.path(DATA_DIR, "Sample06_V2_PBMC_r2.layout.dataset.pxl"),
    # PHA
    stimulated_r1 = file.path(DATA_DIR, "Sample07_V2_PHA_PBMC_r1.layout.dataset.pxl"),
    stimulated_r2 = file.path(DATA_DIR, "Sample08_V2_PHA_PBMC_r2.layout.dataset.pxl")
    )
# Read PXL files as a list of Seurat objects
obj_list <- lapply(data_files, ReadMPX_Seurat)
# Combine files into a single Seurat object
pg_data_combined <- 
  merge(obj_list[[1]],
        y = obj_list[-1],
        add.cell.ids = names(obj_list)) 
# Add meta data describing the sample and donor origin
pg_data_combined <- 
  AddMetaData(pg_data_combined, 
              metadata = data.frame(id = colnames(pg_data_combined)) %>% 
                mutate(
                        condition = str_remove(id, "_.*"),
                        sample = str_remove(id, ".{17}$"), 
                        replicate = str_extract(id, "_r[0-9]_") %>% str_remove_all("_")
                      ) %>% 
                column_to_rownames("id"))
pg_data_combined
An object of class Seurat 
84 features across 4375 samples within 1 assay 
Active assay: mpxCells (84 features, 84 variable features)
 4 layers present: counts.1, counts.2, counts.3, counts.4
As you can see in the object description above, the count layers of a merged object will be kept separate as individual layers (“counts.1”, “counts.2”, …). For many downstream applications, one needs a singled merged counts matrix. This layer can be created with the following command.
pg_data_combined <- JoinLayers(pg_data_combined)
pg_data_combined
An object of class Seurat 
84 features across 4375 samples within 1 assay 
Active assay: mpxCells (84 features, 84 variable features)
 1 layer present: counts
Save data
After combining the samples, we can save the merged data to an RDS file to use later. Remember that a Seurat object containing MPX data is “connected” to a PXL file, so you need to share the PXL file along with the RDS file to ensure that component graphs can be loaded. If component graphs are not needed for subsequent analysis, the PXL file is no longer needed.
# This is an optional pause step; if you prefer to continue to the next tutorial without saving the object, that will also work.
saveRDS(pg_data_combined, file.path(DATA_DIR, "combined_data.rds"))
We have now seen how to load MPX data, inspect its key properties and prepare the data for integrated analysis across experimental samples. In the next tutorial we will apply these skills and demonstrate how to quality control MPX data.