# Introduction to cytometree

cytometree is a package that implements a binary tree algorithm for the analysis of cytometry data.

Its core algorithm is based on the construction of a binary tree, whose nodes represents cell subpopulations.

## Binary tree construction

1. At each node, observed cells (or “events”) and markers are modeled by both a normal distribution (so unimodal), and a mixture of 2 normal distributions (so bimodal).

2. Splitting of the events at each node is done according to a normalized difference of AIC between the two distributional fit (unimodal or bimodal), allowing to pick the marker that best splits those data.

3. When AIC normalized differences $$D$$ are not significant anymore, the tree has been constructed, and the cells have been automatically gated (i.e. partitioned).

## Post-hoc annotation

Given the unsupervised nature of the binary tree, some of the available markers may not be used to find the different cell populations present in a given sample. To recover a complete annotation, we defined, as a post processing procedure, an annotation method which allows the user to distinguish two (or three) expression levels per marker.

# Examples of analysis with cytometree

## Diffuse large B-cell lymphoma dataset with 3 markers

In this example, we will use a diffuse large B-cell lymphoma dataset (from the FlowCAP-I challenge, with only 3 markers.

### Automatic gating step

First, we need to load the package cytometree and we can have a look at the data:

library(cytometree)
data(DLBCL)
dim(DLBCL)
## [1] 5524    4
head(DLBCL)
##   FL1 FL2 FL4 label
## 1 416 251 293     2
## 2 210  93 184     2
## 3 387 300 233     2
## 4 403 438 288     2
## 5 399 128 116     2
## 6 538 288 189     2

We have 3 markers measured FL1, FL2, FL4 as well as the label obtained from manual gating, and 5524 cells.

# getting the only the cell events with the 3 markers measured:
cellevents <- DLBCL[, c("FL1", "FL2", "FL4")]
# storing the maanual gating reference from FlowCAP-I:
manual_labels <- DLBCL[, "label"]

The function CytomeTree builds the binary tree according to our algorithm, that gives the automatic gating (given in the resulting labels attribute):

# Build the binary tree:
Tree <- CytomeTree(cellevents, minleaf = 1, t = 0.1)
# Retreive the resulting partition (i.e. automatic gating):
Tree_Partition <- Tree$labels One can fiddle with the t threshold to change the desired depth of the tree. The function plot_graph plots a graph representing this binary tree, showing which markers were used in which order to splits the events: # Plot a graph of the tree (with specific graphical parameters): plot_graph(Tree, edge.arrow.size = 0.3, Vcex = 0.8, vertex.size = 30) The function plot_nodes allows to graphically evaluate the fit of the gaussian family distribution at each node. It can also be used to investigate a particular node: # Plot the distribution fit for each node: plot_nodes(Tree) # Plot the distribution fit for a particular node: plot_nodes(Tree, "FL4.1") ### Post-processing annotation of automatically gated subpopulations The function Annotation annotates the subpopulations partitioned by the binary tree: # Run the annotation algorithm: Annot <- Annotation(Tree, plot = FALSE) Annot$combinations
##   FL1 FL2 FL4 leaves count   prop
## 1   0   1   0      1  4905 0.8879
## 2   0   0   1      2   590 0.1068
## 3   0   1   1      3    25 0.0045
## 4   1   1   1      4     4 0.0007

The post-processing annotation can be compared to the incomplete annotation obtained from the tree alone:

# Annotation gotten from the tree:
Tree$annotation ## FL1 FL2 FL4 labels count prop ## 1 NA NA 0 1 4905 0.8879 ## 2 NA 0 1 2 590 0.1068 ## 3 0 1 1 3 25 0.0045 ## 4 1 1 1 4 4 0.0007 This completed annotation eases the search for specific cell types of interest, where 1 represents a positive measurement of a marker, while 0 corresponds to its absence: # seeked cell type: FL2+FL4+ pheno_ex1 <- RetrievePops(Annot, phenotypes = list(rbind(c("FL2", 1), c("FL4", 1)))) pheno_ex1$phenotypesinfo
## [[1]]
## [[1]]$phenotype ## [1] "FL2-1" "FL4-1" ## ## [[1]]$proportion
## [1] 0.0052
##
## [[1]]$Mergedlabels ## [1] 3 4 ## ## [[1]]$Newlabel
## [1] 5
# One can look for several cell types at once:
phenotypes <- list()
# FL2+FL4-
phenotypes[[1]] <- rbind(c("FL2", 1), c("FL4", 0))
# FL2-FL4+
phenotypes[[2]] <- rbind(c("FL2", 0), c("FL4", 1))
# Retreive corresponding cell populations:
PhenoInfos <- RetrievePops(Annot, phenotypes)
PhenoInfos$phenotypesinfo ## [[1]] ## [[1]]$phenotype
## [1] "FL2-1" "FL4-0"
##
## [[1]]$proportion ## [1] 0.8879 ## ## [[1]]$label
## [1] 1
##
##
## [[2]]
## [[2]]$phenotype ## [1] "FL2-0" "FL4-1" ## ## [[2]]$proportion
## [1] 0.1068
##
## [[2]]$label ## [1] 2 ### Comparison between manual and automatic gating Finally, we can compare the automatic gating obtained using cytometree to the original manual gating (which had 47 events inconsistently gated across different operators performing the manual gating on those very same data - so those 47 cells were identified as outliers in the reference label and were kept out of the evaluation criteria in the FlowCAP-I challenge): ## PBMC sample from the Human Immunology Project In this example, we will use a dataset from the HIPC (Human Immunology Project Consortium program where a PBMC sample from patient 12828 was analyzed by Stanford. ### Automatic gating step First, we need to load the package cytometree and we can have a look at the data: data(HIPC) dim(HIPC) ## [1] 33992 7 head(HIPC) ## CCR7 CD4 CD45RA HLADR CD38 CD8 label ## [1,] 1561.2429 1052.6527 3169.944 889.99200 1326.2701 2733.810 1 ## [2,] 1362.2820 978.1219 2826.469 1810.43335 1294.4667 3054.555 1 ## [3,] 1321.3787 1180.9497 2780.128 -57.02759 211.0515 2450.657 1 ## [4,] 779.8231 1097.8448 3054.914 -67.70583 579.6187 1898.014 1 ## [5,] 956.2596 942.8704 2612.188 60.35059 1776.3430 3262.799 1 ## [6,] 1100.1621 1080.4724 2560.610 702.39624 1179.5389 2836.238 1 We have 6 markers measured CCR7, CD4, CD45RA, HLADR, CD38, CD8 as well as the label obtained from manual gating, and 33992 cell. # getting the only the cell events with the 3 markers measured: cellevents <- HIPC[, c("CCR7", "CD4", "CD45RA", "HLADR", "CD38", "CD8")] # storing the maanual gating reference from FlowCAP-I: manual_labels <- HIPC[, "label"] The function CytomeTree builds the binary tree according to our algorithm, that gives the automatic gating (given in the resulting labels attribute): # Build the binary tree: Tree <- CytomeTree(cellevents, minleaf = 1, t = 0.2) # Retreive the resulting partition (i.e. automatic gating): Tree_Partition <- Tree$labels

One can fiddle with the t threshold to change the desired depth of the tree.

The function plot_graph plots a graph representing this binary tree, showing which markers were used in which order to splits the events:

# Plot a graph of the tree (with specific graphical parameters):
plot_graph(Tree, edge.arrow.size = 0.4, Vcex = 0.45)

The function plot_nodes allows to graphically evaluate the fit of the gaussian family distribution at each node. It can also be used to investigate a particular node:

### Post-processing annotation of automatically gated subpopulations

The function Annotation annotates the subpopulations partitioned by the binary tree:

# Run the annotation algorithm:
Annot <- Annotation(Tree, plot = FALSE)
Annot$combinations ## CCR7 CD4 CD45RA HLADR CD38 CD8 leaves count prop ## 5 1 1 1 0 1 0 5 12124 0.3567 ## 4 1 1 0 0 0 0 4 8218 0.2418 ## 3 1 0 1 0 1 1 3 4588 0.1350 ## 8 0 1 0 0 0 0 8 4071 0.1198 ## 1 0 0 0 0 0 1 1 3138 0.0923 ## 2 0 0 1 0 0 1 2 830 0.0244 ## 9 0 1 0 0 1 0 9 475 0.0140 ## 6 0 0 0 1 1 1 6 193 0.0057 ## 10 0 1 0 1 1 0 10 142 0.0042 ## 7 1 0 0 1 1 1 7 102 0.0030 ## 11 0 1 0 1 0 0 11 83 0.0024 ## 12 0 1 1 1 0 0 12 28 0.0008 This completed annotation eases the search for specific cell types of interest, where 1 represents a positive measurement of a marker, while 0 corresponds to its absence: # One can look for several cell types at once: phenotypes <- list() # CD8-CD4+CCR7-CD45RA+ CD4effector <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 0), c("CD45RA", 1)) phenotypes[[1]] <- CD4effector # CD8-CD4+CCR7+CD45RA+ CD4naive <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 1), c("CD45RA", 1)) phenotypes[[2]] <- CD4naive # CD8-CD4+CCR7+CD45RA- CD4CM <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 1), c("CD45RA", 0)) phenotypes[[3]] <- CD4CM # CD8-CD4+CCR7+CD45RA+ CD4effectorM <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 0), c("CD45RA", 0)) phenotypes[[4]] <- CD4effectorM # Retreive corresponding cell populations: PhenoInfos <- RetrievePops(Annot, phenotypes) # One can find informations about the seeked phenotypes in PhenoInfos$phenotypesinfo
## [[1]]
## [[1]]$phenotype ## [1] "CD8-0" "CD4-1" "CCR7-0" "CD45RA-1" ## ## [[1]]$proportion
## [1] 8e-04
##
## [[1]]$label ## [1] 12 ## ## ## [[2]] ## [[2]]$phenotype
## [1] "CD8-0"    "CD4-1"    "CCR7-1"   "CD45RA-1"
##
## [[2]]$proportion ## [1] 0.3567 ## ## [[2]]$label
## [1] 5
##
##
## [[3]]
## [[3]]$phenotype ## [1] "CD8-0" "CD4-1" "CCR7-1" "CD45RA-0" ## ## [[3]]$proportion
## [1] 0.2418
##
## [[3]]$label ## [1] 4 ## ## ## [[4]] ## [[4]]$phenotype
## [1] "CD8-0"    "CD4-1"    "CCR7-0"   "CD45RA-0"
##
## [[4]]$proportion ## [1] 0.1404 ## ## [[4]]$Mergedlabels
## [1]  8  9 10 11
##
## [[4]]$Newlabel ## [1] 13 # According to PhenoInfos$phenotypesinfo, CD8-CD4+CCR7-CD45RA+ population is
# labeled 12 According to PhenoInfos$phenotypesinfo, CD8-CD4+CCR7+CD45RA+ # population is labeled 5 According to PhenoInfos$phenotypesinfo,
# CD8-CD4+CCR7+CD45RA- population is labeled 4 According to
# PhenoInfos$phenotypesinfo, CD8-CD4+CCR7+CD45RA+ population is comprised of 4 # clusters labeled 8,9,10, 11. They were merged into a new cluster labeled 13. # The new partition, with merged clusters is to be found in # PhenoInfos$Mergedleaves.
auto_labels_merged <- PhenoInfos\$Mergedleaves

# Get the indices of the subpopulation comprised of classes labeled 12, 5, 4,13.
subpopulation_indices <- auto_labels_merged %in% c(12, 5, 4, 13)

# Scatter plot the data.
CD45RA <- cellevents[subpopulation_indices, "CD45RA"]
CCR7 <- cellevents[subpopulation_indices, "CCR7"]
auto_lab <- factor(auto_labels_merged[subpopulation_indices])
levels(auto_lab) <- viridis::viridis(length(levels(auto_lab)))
auto_lab <- as.character(auto_lab)
plot(CD45RA, CCR7, col = auto_lab, main = "Automatic gating")

## Semi-supervised gating

It is possible to force the markers which will be used to gate the data first, using the force_first_markers argument of the CytomeTree() function. Once those forced markers are used, automatic behavior of the algorithm is resumed for the remaining markers.

# Build the binary tree, forcing the first node to be CD4, and the second ones
"HLADR"))
# Plot a graph of the tree (with specific graphical parameters):
plot_graph(Tree, edge.arrow.size = 0.4, Vcex = 0.45)