MACP 0.1.0
Systematic mapping of multiprotein complexes formed by protein-protein interactions (PPIs) can deepen our mechanistic understanding of how proteins function in cells. Co-fractionation coupled with mass spectrometry (CF-MS) is gaining momentum as a cost-effective strategy for charting protein assemblies under native conditions using high-resolution chromatographic separation techniques (e.g., size-exclusion and ion-exchange) without the need for antibodies or tagging of individual proteins. CF-MS was initially developed for detecting native soluble human protein complexes from the cytosolic and nuclear extracts of cultured cells, and was later adapted to create mitochondrial connectivity maps using the mitochondrial extracts of chemically cross-linked cultures of neuronal-like cells. To capture high-quality PPIs from CF-MS co-elution profiles, we have developed a standardized and fully automated CF-MS data analysis toolkit, MACP (Macromolecular Assemblies from the Co-elution Profile), distributed as an open-source R package. It covers the full workflow, from the processing of raw co-elution data to the reconstruction of high-confidence PPI networks via supervised machine learning.
Unlike existing software tools (EPIC, PrInCE), MACP facilitates CF-MS
data analysis with flexible functions for data filtering and interaction
scoring that can be tailored to users' needs. In addition to the similarity
scoring measures used in EPIC, MACP utilizes ten other co-elution profile
similarity correlation metrics over the entire co-elution profile for scoring
and predicting native macromolecular assemblies. MACP also offers an individual
or an ensemble classifier to enhance the quality of predicted PPIs. Notably,
the MACP toolkit includes independent functions for creating the predictive
model and for handling an imbalanced distribution of the training data set
(i.e., interacting and noninteracting pairs). Unlike the aforementioned tools,
MACP automatically estimates prediction performance via k-fold
cross-validation using the training set. Lastly, due to its modular
architecture, MACP: (1) allows users to evaluate model performance
using external test sets, (2) provides an optimized clustering procedure to
improve the quality of predicted complexes using a grid optimization strategy,
and (3) offers functionalities that can be used to predict either
mitochondrial or non-mitochondrial PPIs from various biological samples.
For each CF-MS experimental dataset, the output of the database search of the co-elution proteomic experiment is summarized into a matrix containing the relative quantification of proteins across the collected fractions. In this matrix, rows correspond to identified proteins, each labelled with a unique name, and columns correspond to fractions holding the relative quantification of those proteins. After the search results are converted into a compatible format, MACP applies several pre-processing steps to improve prediction quality, including missing value imputation, noise reduction, and data normalization. Following data processing, MACP creates all possible protein pairs for each elution experiment and computes their co-elution scores using up to 18 similarity metrics that emphasize different aspects of profile similarity. For classification purposes, literature-curated co-complex PPIs derived from a public database (e.g., CORUM) are then mapped onto the feature matrix (i.e., the matrix containing similarity scores for potential protein pairs), which in turn is used as the input set for the built-in machine learning algorithms to score potential protein pairs. The probabilistic interaction network is further denoised and finally partitioned using parametrized unsupervised approaches to predict putative complexes.
Installation from CRAN:
install.packages('MACP')
To install the development version in R, run:
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("BabuLab-UofR/MACP")
Load the package and other libraries for data manipulation:
library(MACP)
library(dplyr)
library(tidyr)
The data files accepted by MACP are simple tables containing the relative quantification of proteins across collected fractions. The rows correspond to identified proteins, each labelled with a unique name, and the columns correspond to fractions. To demonstrate the use of MACP, we will use a demo co-elution dataset derived from mitochondrial (mt) extracts of mouse brain culture, fractionated by size-exclusion chromatography (SEC). This example CF-MS input data, bundled with the MACP package, can be loaded with the following command:
# Loading the demo data
data(exampleData)
dim(exampleData)
## [1] 284 83
# Inspect the data
glimpse(exampleData)
## num [1:284, 1:83] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:284] "Q99L13" "P05202" "Q9CXJ4" "P55096" ...
## ..$ : chr [1:83] "FR.1.1" "FR.2.1" "FR.3.1" "FR.4.1" ...
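For users preparing their own input, the following minimal sketch builds a small matrix with the same layout as exampleData (proteins in rows, fractions in columns). The protein identifiers and intensities are made up for illustration; only the fraction-naming pattern is borrowed from the demo data.

```r
# Toy co-elution matrix: rows are proteins, columns are fractions.
# All identifiers and values are illustrative, not real measurements.
toy <- matrix(
  c(0, 5, 9, 2, 0,
    0, 4, 8, 3, 0,
    7, 1, 0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("protA", "protB", "protC"),   # unique protein names
    paste0("FR.", 1:5, ".1")        # fraction labels
  )
)
dim(toy)
```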
In the following sections, we describe the major steps in MACP, emphasizing the arguments that can be adjusted by the user. These include: (1) Data pre-processing, (2) Protein-protein interaction (PPI) scoring, (3) Prediction and network denoising, and (4) Network-based prediction of protein complexes.
The data_filtering function removes proteins measured in only one
fraction (i.e., one-hit-wonders) from further analysis. Besides the
one-hit-wonders, MACP removes frequent flyers (i.e., proteins observed in >80%
of fractions) and, for mouse and human organisms only, common contaminants
(e.g., keratins).
data_p1 <- data_filtering(exampleData)
# Inspect the number of retained proteins
dim(data_p1)
## [1] 249 83
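To make the two detection-based filters concrete, here is a minimal base R sketch. The helper filter_profiles and its max_frac argument are hypothetical and are not part of MACP; the sketch also leaves out contaminant removal, which requires an organism-specific list.

```r
# Hypothetical helper, not the MACP implementation.
filter_profiles <- function(m, max_frac = 0.8) {
  n_detected <- rowSums(m > 0, na.rm = TRUE)     # fractions with signal
  one_hit    <- n_detected <= 1                  # detected in at most one fraction
  frequent   <- n_detected / ncol(m) > max_frac  # detected in >80% of fractions
  m[!(one_hit | frequent), , drop = FALSE]
}

# Illustrative profiles across ten fractions.
m <- rbind(
  keep     = c(0, 3, 5, 2, 0, 0, 0, 0, 0, 0),
  one_hit  = c(0, 0, 7, 0, 0, 0, 0, 0, 0, 0),
  frequent = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 0)
)
filtered <- filter_profiles(m)
rownames(filtered)
```

Only the profile named keep survives: it is detected in three of ten fractions, whereas the other two are detected in one and nine fractions, respectively.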
The impute_MissingData function imputes missing values for proteins either not
detected by MS or not expressed in the cell, replacing each missing value with
the average of the values in the adjacent fractions.
x <- data_p1
# Set column 10 to NA to simulate missing values
x[,10] <- NA
data_p2 <- impute_MissingData(x)
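The neighbor-averaging idea can be sketched in base R as follows. The helper impute_neighbors is hypothetical and only approximates the behaviour described above; MACP's own handling of edge fractions and of runs of consecutive missing values may differ.

```r
# Hypothetical helper, not the MACP implementation.
impute_neighbors <- function(m) {
  out <- m
  for (j in seq_len(ncol(m))) {
    # Adjacent fractions (only one neighbor exists at the edges).
    nb   <- intersect(c(j - 1, j + 1), seq_len(ncol(m)))
    miss <- is.na(m[, j])
    if (any(miss)) {
      out[miss, j] <- rowMeans(m[miss, nb, drop = FALSE], na.rm = TRUE)
    }
  }
  out
}

m <- rbind(p1 = c(2, NA, 4, 6),
           p2 = c(1, 3, NA, 7))
imputed <- impute_neighbors(m)
imputed
```

In this toy example the missing value of p1 is replaced by the mean of 2 and 4, and that of p2 by the mean of 3 and 7. If both neighbors are missing as well, this sketch returns NaN for that cell.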
The scaling function performs column- and row-wise normalization of the protein
co-elution profile matrix to correct for sample injection
variation and fraction bias using the command below:
data_p3 <- scaling(data_p1)
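As an illustration of what a column- then row-wise normalization can look like, the sketch below divides each fraction by its total signal and then rescales each protein profile to sum to one. This particular choice of normalization and the helper name normalize_profiles are assumptions made for the example; the exact transformation applied by scaling is documented in its help page.

```r
# Hypothetical helper; the normalization scheme is an assumption.
normalize_profiles <- function(m) {
  col_tot <- colSums(m)
  col_tot[col_tot == 0] <- 1        # avoid division by zero
  m <- sweep(m, 2, col_tot, "/")    # column-wise: divide by fraction totals
  row_tot <- rowSums(m)
  row_tot[row_tot == 0] <- 1
  sweep(m, 1, row_tot, "/")         # row-wise: each profile sums to 1
}

m <- rbind(p1 = c(2, 6, 2),
           p2 = c(2, 2, 8))
norm <- normalize_profiles(m)
rowSums(norm)
```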
The keepMT function removes all non-mitochondrial proteins by mapping the
co-eluted proteins from chromatography fractions to the MitoCarta database using
the command below. Note that this function is only
applicable to mouse or human organisms.
data_p3 <- keepMT(data_p3)
# Inspect the number of retained proteins
dim(data_p3)
## [1] 211 83
The next step is to compute similarity
scores for each protein pair based on their co-elution profiles, as proteins
belonging to the same protein complex are expected to co-elute in the
same or neighboring fractions, and thus will show similar elution profiles
(i.e., high similarity score). Prior to calculating correlation similarity
metrics from co-elution profiles, MACP discards proteins that do not occur in
the same fraction across CF-MS experiments. By default, MACP considers 18
metrics (for details, see the calculate_PPIscore documentation) to compute the
similarity of two protein elution profiles for all possible protein pairs in
each CF-MS experiment. To further minimize the computational run time, MACP
provides users with an option to choose an appropriate co-fractionation
correlation score cut-off using the corr_cutoff argument (apex and pcc_p are
not included) if the corr_removal argument is set to TRUE.
The following command can be executed to compute features for pre-processed data:
set.seed(100)
scored_PPI <- calculate_PPIscore(data_p3,
corr_removal = FALSE)
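To illustrate what a single similarity feature looks like, the sketch below computes the Pearson correlation for every protein pair in a toy matrix and labels each pair with a "~"-separated identifier, the same separator used to split the PPI column later in this document. It covers one metric only and is not a substitute for calculate_PPIscore; the protein names and values are made up.

```r
# Toy profiles: pA and pB co-elute, pC elutes elsewhere.
m <- rbind(
  pA = c(0, 2, 8, 3, 0, 0),
  pB = c(0, 1, 9, 4, 0, 0),
  pC = c(6, 3, 0, 0, 0, 1)
)

pairs <- t(combn(rownames(m), 2))   # all unordered protein pairs
scores <- data.frame(
  PPI = paste(pairs[, 1], pairs[, 2], sep = "~"),
  pcc = apply(pairs, 1, function(p) cor(m[p[1], ], m[p[2], ]))
)
scores
```

The co-eluting pair pA~pB receives a high correlation, while pairs involving pC receive negative values.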
MACP uses a supervised machine learning algorithm to infer interactions; thus it
requires a set of associated external attributes (i.e., class labels) for
training purposes. To obtain class labels for training machine learning
classifiers, MACP first retrieves the gold reference set from the CORUM database
using the getCPX function, and then generates a reference set of true positive
(i.e., intra-complex) and negative (i.e., inter-complex) interactions from
scored_PPI using the generate_refInt function. Users can also submit their own
list of reference complexes. The refcpx object, bundled with the MACP package,
is extracted from the CORUM database.
data("refcpx")
# separate the interaction pairs
PPI_pairs <-
  scored_PPI %>%
  separate(PPI, c("p1", "p2"), sep = "~") %>%
  select(p1, p2)
# Generate reference set of positive and negative interactions
class_labels <-
  generate_refInt(PPI_pairs[, c(1, 2)], refcpx)
table(class_labels$label)
##
## Negative Positive
## 118 130
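The intra- versus inter-complex labelling rule can be sketched with base R as below. The complexes and protein names are placeholders, and the code illustrates the labelling idea only; it is not the implementation of generate_refInt.

```r
# Hypothetical reference complexes; identifiers are placeholders.
cpx <- list(C1 = c("a", "b", "c"),
            C2 = c("d", "e"))

members <- unique(unlist(cpx))
cand <- t(combn(members, 2))        # all candidate pairs among members

# A pair is positive if at least one complex contains both proteins.
same_cpx <- apply(cand, 1, function(p)
  any(vapply(cpx, function(x) all(p %in% x), logical(1))))

ref <- data.frame(p1 = cand[, 1], p2 = cand[, 2],
                  label = ifelse(same_cpx, "Positive", "Negative"))
table(ref$label)
```

With five proteins there are ten candidate pairs: four fall within a complex (three from C1, one from C2) and six span the two complexes.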
Optional: Note that if the ratio of positive to negative pairs in the training set is imbalanced, users can apply an under-sampling technique (downSample) to balance it, as prediction normally works best with a 1:1 or 1:5 ratio of positive to negative protein pairs.
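A minimal base R sketch of under-sampling to a 1:1 ratio is shown below. The label table is simulated, and the code uses plain random sampling rather than any specific package function.

```r
set.seed(1)
# Simulated, imbalanced label table (10 positives, 50 negatives).
labels <- data.frame(
  id    = seq_len(60),
  label = rep(c("Positive", "Negative"), times = c(10, 50))
)

pos <- labels[labels$label == "Positive", ]
neg <- labels[labels$label == "Negative", ]

# Randomly keep as many negatives as there are positives (1:1 ratio).
neg_down <- neg[sample(nrow(neg), nrow(pos)), ]
balanced <- rbind(pos, neg_down)
table(balanced$label)
```

For a 1:5 ratio, the same approach applies with 5 * nrow(pos) negatives sampled instead, provided enough negatives are available.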
Optional: Furthermore, if the training set is
deemed insufficient, we urge users to provide MACP with suitable
reference datasets defined by biochemical approaches and collected from the
literature or other public databases (e.g., IntAct, Reactome, GO), or to use
one-to-one orthologous protein mapping between human and the test species of
interest via the orthMappingCpx function provided in the MACP package as
follows:
# for example to convert mouse complexes to human complexes
## load the mouse complexes
data("refcpx")
orth_mapping <- orthMappingCpx(refcpx,
                               input_species = "mouse",
                               output_species = "human",
                               input_taxid = "10090",
                               output_taxid = "9606")
Using the input data scored_PPI and the training data set class_labels, MACP
creates a single composite probability score by combining the scored
interactions from the 18 different similarity measures, either through an
individual or an ensemble of supervised machine-learning models (RF, GLM, SVM),
including other base classifiers provided in the R caret package, to enhance
PPI prediction quality.
MACP then evaluates the quality of the resulting networks through
k-fold cross-validation using several performance measures,
such as Recall (Sensitivity), Specificity, Accuracy, Precision, F1-score,
and Matthews correlation coefficient (MCC).
The corresponding formulae are as follows:
\[ Recall=Sensitivity=TPR=\frac{TP}{TP+FN} \]
\[ Specificity=1-FPR=\frac{TN}{TN+FP} \]
\[ Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \]
\[ Precision=\frac{TP}{TP+FP} \]
\[ F1=2 \times \frac{Precision \times Recall}{Precision + Recall} \]
\[ MCC=\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP) \times (TP+FN) \times (TN+FP) \times (TN+FN)}} \]
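As a worked example of these formulae, the sketch below evaluates each measure for a hypothetical confusion matrix; the counts are invented for illustration.

```r
# Hypothetical confusion-matrix counts.
TP <- 40; FP <- 10; TN <- 45; FN <- 5

recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
f1          <- 2 * precision * recall / (precision + recall)
mcc         <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

round(c(recall = recall, specificity = specificity, accuracy = accuracy,
        precision = precision, f1 = f1, mcc = mcc), 3)
```

With these counts, accuracy is 85/100 = 0.85 and precision is 40/50 = 0.8.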
The ensemble_model function provided in the MACP package takes the
following parameters:
features: A data frame with protein-protein interactions (PPIs) in the first column, and features to be passed to the classifier in the remaining columns. Note that this data includes both unknown and known PPIs.
gd: A data frame with gold-standard PPIs and a class label indicating whether each PPI is positive or negative.
classifier: Type of classifiers.
cv_fold: Number of partitions for cross-validation.
plots: Logical value indicating whether to plot the performance of the predictive learning algorithm using k-fold cross-validation.
filename: A character string indicating the location and output pdf filename for performance plots. Defaults to the temporary directory (tempdir()).
Predicting interactions with the ensemble algorithm is as simple as the following command:
set.seed(101)
predPPI_ensemble <-
  ensemble_model(scored_PPI,
                 class_labels,
                 classifier = c("glm", "svmRadial", "ranger"),
                 cv_fold = 5,
                 plots = FALSE,
                 verboseIter = FALSE,
                 filename = file.path(tempdir(), "plots.pdf"))
# Subset predicted interactions
pred_interactions <- predPPI_ensemble$predicted_interactions
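For readers unfamiliar with k-fold cross-validation, the sketch below shows the general procedure with a single logistic regression fitted by base R's glm on simulated data. It is a simplified stand-in for illustration only: it uses one feature and one model, whereas ensemble_model combines several caret classifiers, and all data and variable names here are made up.

```r
set.seed(42)
n <- 200
# Simulated similarity score: higher on average for positive pairs.
score <- c(rnorm(n / 2, mean = 0.7, sd = 0.2),
           rnorm(n / 2, mean = 0.3, sd = 0.2))
d <- data.frame(score = score, label = rep(c(1, 0), each = n / 2))

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
prob <- numeric(n)
for (i in seq_len(k)) {
  test_idx <- fold == i
  # Fit on k - 1 folds, predict on the held-out fold.
  fit <- glm(label ~ score, data = d[!test_idx, ], family = binomial)
  prob[test_idx] <- predict(fit, newdata = d[test_idx, ], type = "response")
}

cv_accuracy <- mean((prob > 0.5) == (d$label == 1))
cv_accuracy
```

Every pair is scored exactly once by a model that did not see it during training, which is what makes the resulting performance estimate less optimistic than an evaluation on the training data itself.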
When the plots argument is set to TRUE, the ensemble_model function generates
one pdf file containing three figures indicating the performance of the
RF classifier using k-fold cross-validation.
The first plot shows the Receiver Operating Characteristic (ROC) curve.
Figure 1: ROC curve.
The second plot shows the Precision-Recall (PR) curve.
Figure 2: Precision-Recall (PR) curve.
The third plot shows the accuracy (ACC), F1-score, positive predictive value
(PPV), sensitivity (SE), and Matthews correlation coefficient (MCC)
of the ensemble classifier vs. selected individual classifiers.