1 Overview of MACP package

Systematic mapping of multiprotein complexes formed by protein-protein interactions (PPIs) can improve our mechanistic understanding of how proteins function in cells. Co-fractionation coupled with mass spectrometry (CF-MS) is gaining momentum as a cost-effective strategy for charting protein assemblies under native conditions using high-resolution chromatographic separation techniques (e.g., size-exclusion and ion-exchange), without the need for antibodies or tagging of individual proteins. CF-MS was initially developed for detecting native soluble human protein complexes from the cytosolic and nuclear extracts of cultured cells, and was later adapted to create mitochondrial connectivity maps using the mitochondrial extracts of chemically cross-linked cultures of neuronal-like cells. To capture high-quality PPIs from CF-MS co-elution profiles, we have developed a standardized and fully automated CF-MS data analysis toolkit, MACP (Macromolecular Assemblies from the Co-elution Profile), distributed as an open-source R package. It covers the workflow from processing of raw co-elution data to reconstruction of high-confidence PPI networks via supervised machine learning.

Unlike existing software tools (EPIC, PrInCE), MACP facilitates CF-MS data analysis with flexible functions for data filtering and interaction scoring that can be tailored to the user's needs. In addition to the similarity scoring measures used in EPIC, MACP utilizes ten other co-elution profile similarity metrics, computed over the entire co-elution profile, for scoring and predicting native macromolecular assemblies. MACP also offers an individual or an ensemble classifier to enhance the quality of predicted PPIs. Notably, the MACP toolkit includes independent functions for creating the predictive model and for handling an imbalanced distribution of the training data set (i.e., interacting and non-interacting pairs). Unlike the aforementioned tools, MACP automatically estimates prediction performance via k-fold cross-validation on the training set. Lastly, owing to its modular architecture, MACP (1) allows users to evaluate model performance using external test sets, (2) provides an optimized clustering procedure that improves the quality of predicted complexes using a grid optimization strategy, and (3) offers functionality for predicting either mitochondrial or non-mitochondrial PPIs from various biological samples.

2 MACP computational core

For each CF-MS experiment, the output of the database search is summarized into a matrix containing the relative quantification of proteins across the collected fractions. In this matrix, rows correspond to identified proteins, each labelled with a unique name, and columns correspond to fractions, holding the relative quantification of each protein in that fraction. After the search results are converted into this format, MACP applies several pre-processing steps to improve prediction quality, including missing value imputation, noise reduction, and normalization. MACP then creates all possible protein pairs for each elution experiment and computes their co-elution scores using up to 18 similarity metrics that emphasize different aspects of profile similarity. For classification, literature-curated co-complex PPIs derived from a public database (e.g., CORUM) are mapped onto the feature matrix (i.e., the matrix containing similarity scores for potential protein pairs), which in turn serves as input for the built-in machine learning algorithms that score potential protein pairs. The resulting probabilistic interaction network is further denoised and finally partitioned using parametrized unsupervised approaches to predict putative complexes.
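
To make the pair-generation and scoring step concrete, here is a minimal base R sketch. It is not MACP's implementation: the toy matrix, the protein names, and the two features (a Pearson correlation and a simple shared-apex indicator) are assumptions chosen purely for illustration.

```r
# Toy elution matrix: 4 hypothetical proteins x 6 fractions (made-up values)
elution <- rbind(
  P1 = c(0, 2, 8, 3, 0, 0),
  P2 = c(0, 1, 9, 4, 0, 0),
  P3 = c(5, 1, 0, 0, 0, 0),
  P4 = c(0, 0, 0, 1, 7, 2)
)
colnames(elution) <- paste0("FR.", 1:6)

# All unordered protein pairs, one pair per row
pairs <- t(combn(rownames(elution), 2))

# Feature 1: Pearson correlation between the two elution profiles
pcc <- apply(pairs, 1, function(p) cor(elution[p[1], ], elution[p[2], ]))

# Feature 2: do both profiles peak in the same fraction?
same_apex <- apply(pairs, 1, function(p) {
  unname(which.max(elution[p[1], ])) == unname(which.max(elution[p[2], ]))
})

# Feature matrix: one row per candidate pair, one column per score
feature_matrix <- data.frame(
  PPI = paste(pairs[, 1], pairs[, 2], sep = "~"),
  pcc = pcc,
  same_apex = as.integer(same_apex)
)

# Pairs with the most similar profiles first
feature_matrix[order(-feature_matrix$pcc), ]
```

Each additional similarity metric would simply add another column to this table, which is the shape a classifier expects as input.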

3 Preparation

3.1 Installing MACP

Installation from CRAN:

install.packages('MACP')

To install the development version in R, run:

if(!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
devtools::install_github("BabuLab-UofR/MACP")

Load the package and other libraries for data manipulation:

library(MACP)
library(dplyr)
library(tidyr)

3.2 Data preparation

The data files accepted by MACP are simple tables containing the relative quantification of proteins across the collected fractions. Rows correspond to identified proteins, each labelled with a unique name, and columns correspond to fractions. To demonstrate the use of MACP, we will use a demo co-elution data set derived from mitochondrial (mt) extracts of mouse brain culture, fractionated by size-exclusion chromatography (SEC). This example CF-MS input data, bundled with the MACP package, can be loaded with the following command:

# Loading the demo data
data(exampleData)
dim(exampleData)
## [1] 284  83
# Inspect the data 
glimpse(exampleData)
##  num [1:284, 1:83] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:284] "Q99L13" "P05202" "Q9CXJ4" "P55096" ...
##   ..$ : chr [1:83] "FR.1.1" "FR.2.1" "FR.3.1" "FR.4.1" ...
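
For users starting from their own file, the following sketch shows one way to turn a delimited table into a numeric matrix with proteins as row names. The file, its column names, and the values are made up for illustration; only the general layout (proteins in rows, fractions in columns) comes from the description above.

```r
# Write a tiny made-up table to a temporary CSV so the example is self-contained
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(
  protein = c("Q99L13", "P05202", "Q9CXJ4"),
  FR.1 = c(0, 3, 0),
  FR.2 = c(5, 7, 0),
  FR.3 = c(2, 0, 4)
)
write.csv(toy, tmp, row.names = FALSE)

# Read it back, using the first column (protein names) as row names
tab <- read.csv(tmp, row.names = 1, check.names = FALSE)
elution_matrix <- as.matrix(tab)

# Sanity checks: numeric values and unique protein names
stopifnot(is.numeric(elution_matrix))
stopifnot(!anyDuplicated(rownames(elution_matrix)))
dim(elution_matrix)
```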

4 MACP computational workflow: step-by-step analysis

In the following section, we describe the major steps in MACP, emphasizing the arguments that users can adjust. The steps are: (1) Data pre-processing, (2) Protein-protein interactions (PPIs) scoring, (3) Prediction and network denoising, and (4) Network-based prediction of protein complexes.

4.1 Pre-processing

  • data_filtering removes proteins measured in only one fraction (i.e., one-hit-wonders) from further analysis. Besides the one-hit-wonders, MACP removes frequent flyers (i.e., proteins observed in >80% of fractions) and, for mouse and human only, common contaminants (e.g., keratins).
data_p1 = data_filtering(exampleData)
# Inspect the number of retained proteins 
dim(data_p1)
## [1] 249  83
  • impute_MissingData handles missing values, which arise for proteins either not detected by MS or not expressed in the cell, by replacing each missing value with the average of its adjacent neighbors (i.e., neighboring fractions).
x <- data_p1
# Set column 10 to NA to simulate missing values
x[,10] <- NA
data_p2 <- impute_MissingData(x)
  • scaling performs column- and row-wise normalization of protein co-elution profile matrix to correct for sample injection variation and fraction bias using the command line below:
data_p3 <- scaling(data_p1)
  • Optional: keepMT removes all non-mitochondrial proteins by mapping the co-eluted proteins from the chromatography fractions to the MitoCarta database, using the command below. Note that this function is only applicable to mouse or human.
data_p3 <- keepMT(data_p3)
# Inspect the number of retained proteins 
dim(data_p3)
## [1] 211  83
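
The three core pre-processing ideas above (filtering, neighbor-based imputation, and column- then row-wise scaling) can be sketched in base R as follows. This is an illustrative approximation under stated assumptions, not the code behind data_filtering, impute_MissingData, or scaling: the toy matrix is made up, the imputation uses only the immediate left and right fractions, and the scaling divides by column and row sums.

```r
# Toy matrix: 5 hypothetical proteins x 6 fractions (made-up values)
m <- rbind(
  A = c(0, 4, 9, 3, 0, 0),
  B = c(0, 0, 6, 0, 0, 0),   # detected in one fraction only
  C = c(2, 3, 4, 5, 3, 2),   # detected in every fraction
  D = c(0, 1, NA, 3, 0, 0),  # one missing value
  E = c(0, 0, 0, 2, 8, 1)
)

# 1. Filtering: drop one-hit-wonders and proteins seen in > 80% of fractions
n_detected <- rowSums(m > 0, na.rm = TRUE)
keep <- n_detected > 1 & n_detected / ncol(m) <= 0.8
filtered <- m[keep, , drop = FALSE]

# 2. Imputation: replace each NA by the mean of its left and right neighbors
impute_row <- function(x) {
  for (i in which(is.na(x))) {
    idx <- c(i - 1, i + 1)
    idx <- idx[idx >= 1 & idx <= length(x)]  # stay inside the profile
    x[i] <- mean(x[idx], na.rm = TRUE)
  }
  x
}
imputed <- t(apply(filtered, 1, impute_row))

# 3. Scaling: divide each column by its sum, then each row by its sum
col_scaled <- sweep(imputed, 2, colSums(imputed), "/")
col_scaled[is.nan(col_scaled)] <- 0          # fractions with no signal at all
scaled <- sweep(col_scaled, 1, rowSums(col_scaled), "/")

rownames(filtered)
round(scaled, 3)
```

In this toy case, B is dropped as a one-hit-wonder and C as a frequent flyer, and every retained profile sums to 1 after scaling.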

4.2 Protein-protein interactions (PPIs) scoring

The next step is to compute similarity scores for each protein pair based on their co-elution profiles, as proteins belonging to the same protein complex are expected to co-elute in the same or neighboring fractions and thus show similar elution profiles (i.e., a high similarity score). Prior to calculating similarity metrics from co-elution profiles, MACP discards proteins that do not occur in the same fraction across CF-MS experiments. By default, MACP considers 18 metrics (for details, see the calculate_PPIscore documentation) to compute the similarity of two protein elution profiles for all possible protein pairs in each CF-MS experiment. To reduce computational run time, MACP lets users apply a co-fractionation correlation score cut-off via the corr_cutoff argument (apex and pcc_p are not included) when the corr_removal argument is set to TRUE.

The following command can be executed to compute features for pre-processed data:

set.seed(100)
scored_PPI <- calculate_PPIscore(data_p3,
                                corr_removal = FALSE)

4.3 Prediction and network denoising

4.3.1 Build reference data set

MACP uses supervised machine learning to infer interactions; it therefore requires a set of class labels for training. To obtain these labels, MACP first retrieves the gold-standard reference set from the CORUM database with the getCPX function, and then generates a reference set of true positive (i.e., intra-complex) and negative (i.e., inter-complex) interactions with the generate_refInt function, using the protein pairs from scored_PPI as input. Alternatively, users can supply their own list of reference complexes. The refcpx object bundled with the MACP package was extracted from the CORUM database.

  1. Load reference complexes:
data("refcpx")
  2. Generate class labels for interactions:
# separate the interaction pairs
PPI_pairs <-
  scored_PPI %>%
  separate(PPI, c("p1", "p2"), sep = "~") %>% select(p1,p2)

# Generate reference set of positive and negative interactions
class_labels <-
    generate_refInt(PPI_pairs[,c(1,2)],refcpx)
table(class_labels$label)
## 
## Negative Positive 
##      118      130
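
The labelling rule described above (intra-complex pairs are positive, inter-complex pairs are negative) can be illustrated with a small base R sketch. The complexes, the protein names, and the decision to leave pairs with a protein outside the reference set unlabeled are assumptions made for this example; generate_refInt may differ in its details.

```r
# Hypothetical reference complexes (made-up protein names)
complexes <- list(
  cpx1 = c("A", "B", "C"),
  cpx2 = c("D", "E")
)

# Candidate pairs in a "p1", "p2" layout
candidates <- data.frame(
  p1 = c("A", "A", "B", "D", "A", "F"),
  p2 = c("B", "D", "C", "E", "E", "A"),
  stringsAsFactors = FALSE
)

# Long table of protein-to-complex membership
membership <- stack(complexes)             # columns: values (protein), ind (complex)
membership$ind <- as.character(membership$ind)

share_complex <- function(a, b) {
  length(intersect(membership$ind[membership$values == a],
                   membership$ind[membership$values == b])) > 0
}

candidates$label <- unname(mapply(function(a, b) {
  # Pairs involving a protein absent from the reference stay unlabeled
  if (!(a %in% membership$values) || !(b %in% membership$values)) {
    return(NA_character_)
  }
  if (share_complex(a, b)) "Positive" else "Negative"
}, candidates$p1, candidates$p2))

table(candidates$label, useNA = "ifany")
```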

Optional: Note that if the ratio of positive to negative pairs in the training set is imbalanced, users can apply an under-sampling technique (downSample) to balance it, as prediction normally works best with a 1:1 or 1:5 ratio of positive to negative protein pairs.
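
As a rough illustration of under-sampling, the base R sketch below keeps all positive pairs and randomly samples negatives to a chosen ratio. The helper downsample_negatives and the simulated labels are hypothetical and not part of MACP; the caret package also provides a downSample function for 1:1 balancing.

```r
set.seed(1)

# Hypothetical labelled pairs: 20 positives, 200 negatives (made-up)
labels <- data.frame(
  PPI = paste0("pair", 1:220),
  label = rep(c("Positive", "Negative"), times = c(20, 200)),
  stringsAsFactors = FALSE
)

# Keep all positives and sample negatives to reach a target ratio
downsample_negatives <- function(df, neg_per_pos = 1) {
  pos <- df[df$label == "Positive", ]
  neg <- df[df$label == "Negative", ]
  n_keep <- min(nrow(neg), neg_per_pos * nrow(pos))
  rbind(pos, neg[sample(nrow(neg), n_keep), ])
}

balanced_1to1 <- downsample_negatives(labels, neg_per_pos = 1)
balanced_1to5 <- downsample_negatives(labels, neg_per_pos = 5)

table(balanced_1to1$label)
table(balanced_1to5$label)
```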

Optional: Furthermore, if the training set is deemed insufficient, we urge users to provide MACP with suitable reference datasets, either defined from biochemical evidence collected from the literature or other public databases (e.g., IntAct, Reactome, GO), or derived by one-to-one orthologous protein mapping between human and the test species of interest using the orthMappingCpx function provided in the MACP package, as follows:

# for example to convert mouse complexes to human complexes
## load the mouse complexes
data("refcpx")
orth_mapping <- orthMappingCpx (refcpx,
  input_species = "mouse",
  output_species = "human",
  input_taxid = "10090",
  output_taxid = "9606")

4.3.2 Protein-protein interactions (PPIs) prediction

Using the input data scored_PPI and the training set class_labels, MACP creates a single composite probability score for each protein pair by combining its scores from the 18 similarity measures, through either an individual supervised machine-learning model or an ensemble of models (RF, GLM, SVM); other base classifiers provided in the R caret package can also be used to enhance PPI prediction quality. MACP then evaluates the performance of the resulting predictions through k-fold cross-validation, using measures such as Recall (Sensitivity), Specificity, Accuracy, Precision, F1-score, and Matthews correlation coefficient (MCC).

The corresponding formulae are as follows:

\[ Recall=Sensitivity=TPR=\frac{TP}{TP+FN} \]

\[ Specificity=1-FPR=\frac{TN}{TN+FP} \]

\[ Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \]

\[ Precision=\frac{TP}{TP+FP} \]

\[ F1=2 \times \frac{Precision \times Recall}{Precision + Recall} \]

\[ MCC=\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP) \times (TP+FN) \times (TN+FP) \times (TN+FN)}} \]
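
As a worked example, the formulae can be evaluated directly in R from the four confusion-matrix counts. The counts below are made up for illustration and do not come from the demo data.

```r
# Hypothetical confusion-matrix counts (made-up)
TP <- 40; FP <- 10; TN <- 35; FN <- 15

recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
f1          <- 2 * precision * recall / (precision + recall)
mcc         <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

round(c(recall = recall, specificity = specificity, accuracy = accuracy,
        precision = precision, f1 = f1, mcc = mcc), 3)
```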

The ensemble_model function provided in the MACP package takes the following parameters:

  • features A data frame with protein-protein interactions (PPIs) in the first column, and the features to be passed to the classifier in the remaining columns. Note that this data frame includes both unknown and known PPIs.
  • gd A data frame with gold_standard PPIs and class label indicating if such PPIs are positive or negative.
  • classifier Type of classifiers.
  • cv_fold Number of partitions for cross-validation.
  • plots Logical value, indicating whether to plot the performance of the predictive learning algorithm using k-fold cross-validation.
  • filename A character string indicating the location and name of the output PDF file for the performance plots. Defaults to the tempdir() directory.

Predicting interactions with the ensemble algorithm is as simple as running the following command:

set.seed(101)
predPPI_ensemble <- 
  ensemble_model(scored_PPI,
                  class_labels,
                  classifier = c("glm", "svmRadial", "ranger"),
                  cv_fold = 5,
                  plots = FALSE,
                  verboseIter = FALSE,
                  filename=file.path(tempdir(),"plots.pdf"))

# Subset predicted interactions 
pred_interactions <- predPPI_ensemble$predicted_interactions
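
To show what k-fold cross-validation of an ensemble means in practice, here is a self-contained base R sketch on simulated data. It is only a conceptual illustration and makes several assumptions: the two features are simulated, the base learners are two logistic regressions rather than the classifiers named above, and the ensemble simply averages their predicted probabilities, which may not match how ensemble_model combines models.

```r
set.seed(42)

# Simulated feature table: two made-up similarity scores and a class label
n <- 200
label <- rep(c(1, 0), each = n / 2)          # 1 = Positive, 0 = Negative
toy <- data.frame(
  label = label,
  score_a = rnorm(n, mean = label),          # tends to be higher for positives
  score_b = rnorm(n, mean = 0.5 * label)
)

k <- 5
fold <- sample(rep(1:k, length.out = n))     # random, equally sized folds

cv_accuracy <- sapply(1:k, function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  # Two base learners, each a logistic regression on one feature
  m_a <- glm(label ~ score_a, data = train, family = binomial)
  m_b <- glm(label ~ score_b, data = train, family = binomial)
  # Ensemble: average the predicted probabilities of the base learners
  p <- (predict(m_a, test, type = "response") +
        predict(m_b, test, type = "response")) / 2
  # Accuracy on the held-out fold at a 0.5 probability threshold
  mean((p > 0.5) == (test$label == 1))
})

cv_accuracy
mean(cv_accuracy)
```

Each pair is scored exactly once by a model that never saw it during training, which is what makes the averaged accuracy an honest estimate of performance on new pairs.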

When the plots argument is set to TRUE, the ensemble_model function generates one PDF file containing three figures indicating the performance of the RF classifier using k-fold cross-validation.

  • The first plot shows the Receiver Operating Characteristic (ROC) curve.

    Figure 1: ROC curve.

  • The second plot shows the Precision-Recall (PR) curve.

    Figure 2: Precision-Recall (PR) curve.

  • The third plot shows the accuracy (ACC), F1-score, positive predictive value (PPV), sensitivity (SE), and Matthews correlation coefficient (MCC) of the ensemble classifier vs. the selected individual classifiers.