Contents

1 Introduction

Ambiguity of protein identifications is an important issue in shotgun proteomics. It is due to the presence of shared peptides, which can generate quite complex peptide-to-protein mapping structures. These structures can be efficiently represented using bipartite graphs, with peptides and proteins as vertices and with edges featuring peptide to protein membership. Ambiguity in protein identifications can be assessed using graph connected components (CCs), i.e., the largest subgraphs in which any two vertices are connected to each other by a path and not connected to any other of the vertices in the supergraph. Proteins sharing one or more peptides are gathered in the same CC (multi-protein CCs), while proteins exclusively identified by specific peptides represent CCs with a single protein vertex (single-protein CCs). The proportion of multi-protein CCs and their size (i.e., number of protein members) can be used to measure the level of ambiguity of protein identifications.

Nowadays, an increasing number of sample-matched proteomic and transcriptomic datasets is available, which can be exploited to deal with ambiguity of protein identifications. Indeed, according to the central dogma of biology, there can be no protein without the corresponding transcript. Hence, protein identifications for which the corresponding transcript is identified in the sample-matched transcriptome are more likely to be correct than protein identifications with no expressed transcript. This rationale can be used to filter proteomic identifications based on transcriptome information and reduce ambiguity of protein identifications.

The net4pg package implements two main functionalities:

1. a pipeline to build, from proteomic identifications, a bipartite graph of peptide-to-protein mappings and calculate graph CCs to assess ambiguity of protein identifications. CCs represent a peptide-centric strategy to group proteins and it is independent from the variety of protein-centric strategies of protein grouping and protein inference. As such, it does not require protein inference and it is widely applicable, reproducible and transparent.
2. a transcriptome-informed filtering strategy which exploits sample-matched transcriptome information to reduce ambiguity of protein identifications in shotgun proteomics. This strategy fundamentally consists in the removal of proteins for which the corresponding transcript is not detected in the sample-matched transcriptome. The impact of the filtering strategy can be then assessed by building a new graph of filtered peptide-to-protein mappings, calculating the proportion of multi-protein CCs and visually inspecting peptide-to-protein mappings for ambiguous protein identifications of interest.

This vignette illustrates how to use the above described methods implemented in this package for the following applications:

2 Build a graph from shotgun proteomic identifications and calculate its connected components (CCs) to quantify ambiguity of protein identifications

The required input are proteomic identifications obtained by shotgun proteomics and, more precisely, all valid peptide identifications and their corresponding proteins, prior to any protein inference strategy. The required input must be in the form of a tab-delimited incidence matrix with peptides along the rows and proteins along the columns and 1 or 0 cell values to indicate whether or not a peptide maps on the corresponding protein. Protein identifiers (column names) are to be provided, one per line, in a separate file and should be in the Ensembl format (i.e. ENSPXXXXXXXXXXX for human, ENSMUSPXXXXXXXXXXX for mouse) or, if protein contaminants, in any format followed by a unique tag to indicate them as contaminants. Peptide identifiers (row names) are to be provided, one per line, in a separate file and can be in any format (amino acid sequence, numeric identifiers, …).

The pipeline to generate a bipartite graph of peptide-to-protein mapping and calculate their connected components consists in the following steps:

  1. Read in input the incidence matrix describing peptide-to-protein mappings;
  2. Reduce the size of the incidence matrix;
  3. Calculate the adjacency matrix describing protein-to-protein connections via shared peptides;
  4. Generate a graph of protein-to-protein connections and calculate its connected components (CCs);
  5. Assess the ambiguity of protein identifications based on the proportion of multi-protein CCs and shared peptides;
  6. Visualize peptide-to-protein mappings for ambiguous protein identifications of interest;

2.1 Read in input the incidence matrix describing peptide-to-protein mappings

Incidence matrices generated from proteomic datasets can be quite large but they can be easily read in input chunk by chunk using the read_inc_matrix() function. This function requires three inputs:
* the name of the tab-delimited file containing the incidence matrix (peptide-to-protein mappings), with no column or row names;
* the name of the file containing the matrix column names (protein identifiers), one per line;
* the name of the file containing the matrix row names (peptide identifiers), one per line.

Read the incidence matrix describing peptides identified in a shotgun proteomic experiment and their and corresponding proteins

incM_filename <- system.file("extdata"
                        , "incM_example"
                        , package = "net4pg"
                        , mustWork = TRUE)
rownames_filename <- system.file("extdata"
                        , "peptideIDs_incM_example"
                        , package = "net4pg"
                        , mustWork = TRUE)
colnames_filename <- system.file("extdata"
                        , "proteinIDs_incM_example"
                        , package = "net4pg"
                        , mustWork = TRUE)
incM <- read_inc_matrix(incM_filename = incM_filename
                , colnames_filename = colnames_filename
                , rownames_filename = rownames_filename)

Check the size of the input incidence matrix:

dim(incM)
#> [1] 100 100

2.2 Reduce the size of the incidence matrix

Reduce the size of the input incidence matrix by removing all proteins not sharing peptides and all peptides exclusively mapping to these proteins. Only ambiguous protein identifications are left, which is proteins connected by shared peptides, hence belonging to a multi-protein CC. Reducing data size allows to decrease the computational cost of calculating graph connected components.

incM_reduced <- reduce_inc_matrix(incM)
dim(incM_reduced) # check the size of the reduced incidence matrix
#> [1] 57 73

2.3 Calculate the adjacency matrix describing protein-to-protein connections via shared peptides

The adjacency matrix is calculated by cross-product of the reduced incidence matrix and it constitutes a more compact representation than an incidence matrix; it scales linearly with the number of proteins and it is independent of the number of peptides, which is convenient to speed up connected components calculation.

adjM <- get_adj_matrix(incM_reduced)
dim(adjM) # check the size of the adjacency matrix:
#> [1] 73 73

2.4 Generate a graph of protein-to-protein connections and calculate its connected components (CCs)

Build a graph representing protein-to-protein connections by shared peptides, as described in the adjacency matrix, and calculate graph connected components, using the get_cc() function. The function provides in output a list of two elements:
i. a graph of the protein connections (by shared peptides) described in the adjacency matrix;
ii. a list of n vectors (one per connected component) enumerating protein members of each connected component

multProteinCC <- get_cc(adjM)

Extract the number of multi-protein CCs

cc.multProteins <- multProteinCC$ccs
length(cc.multProteins)
#> [1] 12

2.5 Assess the ambiguity of protein identifications based on the proportion of multi-protein CCs and shared peptides

The proportion of single- or multi-protein CCs and the size of multi-protein CCs represent a measure for the level of ambiguity of protein identifications and they can be obtained by the cc_stats() function, provided with the above calculated CCs and the original incidence matrix. The above calculated CCs exclusively include multi-protein CCs and no single-protein CC, since they were calculated on the reduced incidence matrix from step 2, which only contains proteins sharing peptides. This is specified by the reducedIncM parameter in the cc_stats() function. Single-protein CCs are calculated by cc_stats() as all those proteins from the original incidence matrix which do not belong to multi-protein CCs.

# Calculate CCs size and percentage of single- vs multi-protein CCs
CCstatsOut <- cc_stats(incM = incM
                       , cc.proteins = cc.multProteins
                       , reducedIncM = TRUE)

# Number of single-protein CCs:
CCstatsOut$N_singleProtCC
#> [1] 27

# Number of multi-protein CCs
CCstatsOut$N_multiProtCC
#> [1] 12

# Total number of CCs
totCCs <- CCstatsOut$N_singleProtCC + CCstatsOut$N_multiProtCC
totCCs
#> [1] 39

# Percentage of single-protein CCs:
PercSingleP <- round(CCstatsOut$N_singleProtCC / totCCs * 100, digits = 2)
PercSingleP
#> [1] 69.23

# View table of CC size distribution
CCstatsOut$NproteinsDistribution
#>    N_proteins N_CCs
#> 1           1    27
#> 2           2     4
#> 3           3     0
#> 4           4     3
#> 5           5     0
#> 6           6     2
#> 7           7     0
#> 8           8     0
#> 9           9     0
#> 10         10     2
#> 11        >10     1

# Plot CC size distribution
plot(factor(CCstatsOut$NproteinsDistribution$N_proteins
       , levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10"))
     , as.numeric(as.vector(CCstatsOut$NproteinsDistribution$N_CC))
     , type = "s"
     , xlab = "N_proteins"
     , ylab = "N_CCs")

The proportion of specific or shared peptides also represents a measure for the level of ambiguity of protein identifications. It can be calculated using the peptide_stats() function.

peptideStatsOut <- peptide_stats(incM = incM)

# Number of shared peptides
peptideStatsOut$nbShared
#> [1] 57

# Number of specific peptides
peptideStatsOut$nbSpecific
#> [1] 43

# Percentage of specific peptides
peptideStatsOut$percSpecific
#> [1] "43%"

2.6 Visualize peptide-to-protein mappings for ambiguous protein identifications of interest

Given an ambiguous protein identification of interest (a protein with shared peptides), find the multi-protein CC it belong to, identify all protein members of that CC and all their peptides and represent this information as a bipartite graph of peptide-to-protein mappings.

First, extract all peptides and peptide-to-protein mapping information for each CC using the cc.peptides.incM() function. This function generates in output a list of two elements:
i. a list of vectors enumerating peptides included in each CC;
ii. a list of incidence matrices (or vectors if only one peptide) describing peptide-to-protein mappings.

cc.peptides.incM <- cc_composition(cc.multProteins, incM = incM)

Then plot a bipartite graph representing the CC which includes the protein of interest:

# Generate the bipartite graph
prot <- "ENSP261"
subgraphCC <- plot_cc(prot = prot
        , cc.proteins = cc.multProteins
        , cc.subincM = cc.peptides.incM$cc.subincM
        , incM = incM
        , tagProt = "ENSP"
        , tagContam = "Contam")
#> [1] "Protein ENSP261 is member of a multi-protein CC"
# Plot it
plot.igraph(subgraphCC$g
            , layout = layout_as_bipartite
            , edge.width = 1
            ,  edge.arrow.width = 0.3
            , vertex.size = 35
            , edge.arrow.size = 0.5
            , vertex.size2 = 35
            , vertex.label.cex = 1
            , asp = 0.25
            , margin = -0.1) +
title(paste0("Protein ", prot, " in CC#", subgraphCC$cc_id), line = -1)

#> integer(0)

3 Perform a transcriptome-informed filtering of shotgun proteomic identifications to reduce ambiguity of protein identifications

The required input is the same as in section 2: a tab-delimited incidence matrix describing all valid peptide identifications from shotgun proteomics and their corresponding proteins, prior to any protein inference strategy.

Filtering can be performed in three different ways, established by the remove parameter of the transcriptome_filter() function:
i. remove all proteins for which no transcript is expressed according to the sample-matched transcriptome; then remove all (specific or shared) peptides exclusively mapping on them (remove = “all”);
ii. remove proteins for which no transcript is expressed, if they are not identified by any specific peptide; then remove all (shared) peptides exclusively mapping on them (remove = “sharedOnly”);
iii. remove proteins for which no transcript is expressed, if they are not identified by any specific peptide and if their peptides are shared with at least one protein for which the corresponding transcript is expressed (i.e. not filtered out); hence, no peptide needs to be removed (remove = “sharedNoRemove”).

3.1 Read in input the incidence matrix describing peptide-to-protein mappings

Read the incidence matrix describing peptides identified in a shotgun proteomic experiment and their corresponding proteins

incM_filename <- system.file("extdata"
                        , "incM_example"
                        , package = "net4pg"
                        , mustWork = TRUE)
rownames_filename <- system.file("extdata"
                        , "peptideIDs_incM_example"
                        , package = "net4pg"
                        , mustWork = TRUE)
colnames_filename <- system.file("extdata"
                        , "proteinIDs_incM_example"
                        , package = "net4pg"
                        , mustWork = TRUE)
incM <- read_inc_matrix(incM_filename = incM_filename
                , colnames_filename = colnames_filename
                , rownames_filename = rownames_filename)

Check the size of the input incidence matrix:

dim(incM)
#> [1] 100 100

3.2 Perform transcriptome-informed filtering

Perform transcriptome-informed post-hoc filtering using the transcriptome_filter() function, which requires three inputs:
i. the name of the file containing the incidence matrix of peptide-to-protein mappings
ii. the name of the file containing identifiers of transcripts detected in the sample-matched transcriptome
iii. the name of the file containing for each protein identifier the corresponding transcript identifier. 1 It can be easily obtained from the Ensembl protein sequence database file (e.g., Homo_sapiens.GRCh38.pep.all.fa for the GRCh38 human genome assembly), publicly available on Ensembl. The following bash command generates a tab-delimited file containing protein identifier and corresponding transcript identifier: cat Homo_sapiens.GRCh38.pep.all.fa | grep ">" | awk -F ' ' '{print $1,$5}' | sed 's/>//g' | sed 's/transcript://g' | sed 's/ /\t/' >> transcriptToProteinIDs_GRCh38.txt

# Read input file names
exprTranscriptsFile <- system.file("extdata"
                        , "expressed_transcripts.txt"
                        , package = "net4pg"
                        , mustWork = TRUE)
protein2transcriptFile <- system.file("extdata"
                        , "protein_to_transcript"
                        , package = "net4pg"
                        , mustWork = TRUE)

# Perform filtering
incM_filt <- transcriptome_filter(incM
                            , exprTranscriptsFile = exprTranscriptsFile
                            , proteinToTranscriptFile = protein2transcriptFile
                            , tagContam = "Contam"
                            , remove = "sharedOnly")

# Check size after transcriptome-informed filtering
dim(incM_filt)
#> [1] 87 37

3.3 Assess the impact of transcriptome-informed filtering on ambiguity of protein identifications

To measure the impact of transcriptome-informed filtering on ambiguity of protein identifications:
3.3.1 Compare the proportion and size of multi-protein CCs obtained before and after filtering
3.3.2 Compare the proportion of shared peptides obtained before and after filtering
3.3.3 Visualize peptide-to-protein mappings for ambiguous protein identifications of interest

3.3.1 Compare the proportion and size of multi-protein CCs obtained before and after filtering

Calculate the proportion and size of multi-protein CCs obtained after filtering and compare it to that obtained from the original proteomic identifications, prior to transcriptome-informed filtering.

Calculate CCs on the graph of proteomic identifications obtained after transcriptome-informed filtering.

# Reduce incidence matrix size to accelerate downstream computation
incM_filt_reduced <- reduce_inc_matrix(incM_filt)

# Calculate the adjacency matrix describing protein-to-protein connections
adjM_filt <- get_adj_matrix(incM_filt_reduced)

# Generate a graph of protein-to-protein connections by shared peptides and
# calculate its CCs (i.e., sets of proteins connected by shared peptides
multProteinCC_filt <- get_cc(adjM_filt)

# Extract the list of vectors enumerating protein members in each CC 
cc.multProteins_filt <- multProteinCC_filt$ccs

# Calculate CCs size and % of single- vs multi-protein CCs obtained after
# transcriptome-informed filtering
CCstatsOut_filt <- cc_stats(incM = incM_filt
                                , cc.proteins = multProteinCC_filt$ccs
                                , reducedIncM = TRUE)

# Number of single-protein CCs
CCstatsOut_filt$N_singleProtCC
#> [1] 15

# Number of multi-protein CCs
CCstatsOut_filt$N_multiProtCC
#> [1] 8

# Total number of CCs
totCCs_filt <- CCstatsOut_filt$N_singleProtCC + CCstatsOut_filt$N_multiProtCC
totCCs_filt
#> [1] 23

# Percentage of single-protein CCs
PercSingleP_filt <- round(CCstatsOut_filt$N_singleProtCC / totCCs_filt * 100
                          , digits = 2)

# View table of CC size distribution
CCstatsOut_filt$NproteinsDistribution
#>    N_proteins N_CCs
#> 1           1    15
#> 2           2     5
#> 3           3     0
#> 4           4     3
#> 5           5     0
#> 6           6     0
#> 7           7     0
#> 8           8     0
#> 9           9     0
#> 10         10     0
#> 11        >10     0

# Plot CC size distribution
plot(factor(CCstatsOut_filt$NproteinsDistribution$N_proteins
       , levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10"))
     , as.numeric(as.vector(CCstatsOut_filt$NproteinsDistribution$N_CC))
     , type = "s"
     , xlab = "N_proteins"
     , ylab = "N_CCs")

Compare the percentage of single-protein CCs obtained before and after transcriptome-informed filtering.

comp <- as.data.frame(cbind(as.character(as.vector(c("before_filter"
                                                  , "after_filter")))
                         , as.numeric(as.vector(c(PercSingleP
                                                  , PercSingleP_filt)))))
colnames(comp) <- c("Filter", "Perc_SingleP")

ggplot(data = comp
       , aes(x = as.factor(Filter), y = as.numeric(as.vector(Perc_SingleP)))) +
      geom_bar(stat = "identity") +
      theme_classic() +
      xlab("") +
      ylab("% single-prot CCs") +
      ylim(0, 100) +
      coord_flip() +
      geom_text(aes(label = as.numeric(as.vector(Perc_SingleP)))
                , hjust = 1.5, color = "white", size = 4)

Plot CC size distribution before and after transcriptome-informed filtering.

old.par <- par(no.readonly = TRUE) # save default par values

ymax_before <- as.numeric(as.vector(CCstatsOut$NproteinsDistribution$N_CC))
ymax_after <- as.numeric(as.vector(CCstatsOut_filt$NproteinsDistribution$N_CC))

ymax <- max(max(ymax_before), max(ymax_after))

par(mfrow = c(1, 2))
plot(factor(CCstatsOut$NproteinsDistribution$N_proteins
        , levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10"))
      , as.numeric(as.vector(CCstatsOut$NproteinsDistribution$N_CC))
      , type = "s"
      , xlab = "N_proteins"
      , ylab = "N_CCs"
      , ylim = c(0, ymax)
      , main = "before filtering")
plot(factor(CCstatsOut_filt$NproteinsDistribution$N_proteins
      , levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10"))
      , as.numeric(as.vector(CCstatsOut_filt$NproteinsDistribution$N_CC))
      , type = "s"
      , xlab = "N_proteins"
      , ylab = "N_CCs"
      , ylim = c(0, ymax)
      , main = "after filtering")


par(old.par) # restore default par values

In the above example, transcriptome-informed filtering does not lead to an increased percentage of single-protein CCs (i.e., percentage of non ambiguous protein identifications); however, it yields smaller multi-protein CCs (i.e., fewer protein members per CC), which corresponds to reduced, although not solved, ambiguity of protein identifications.

3.3.2 Compare the proportion of shared peptides obtained before and after filtering

Calculate the proportion of shared peptides obtained after transcriptome-informed filtering

peptideStatsOut_filt <- peptide_stats(incM = incM_filt)

Compare the proportion of shared peptides obtained before and after transcriptome-informed filtering

comp <- as.data.frame(cbind(
                      as.character(as.vector(c("before_filter"
                                            , "after_filter")))
                    , as.numeric(as.vector(c(peptideStatsOut$nbShared
                                        , peptideStatsOut_filt$nbShared)))))
colnames(comp) <- c("Filter", "Perc_sharedPeptides")

ggplot(data = comp
    , aes(x = as.factor(Filter)
          , y = as.numeric(as.vector(Perc_sharedPeptides)))) +
      geom_bar(stat = "identity") +
      theme_classic() +
      xlab("") +
      ylab("% shared peptides") +
      ylim(0, 100) +
      coord_flip() +
      geom_text(aes(label = as.numeric(as.vector(Perc_sharedPeptides)))
                , hjust = 1.5, color = "white", size = 4)

3.3.3 Visualize peptide-to-protein mappings for ambiguous protein identifications of interest

To visually inspect the impact of transcriptome-informed filtering on any ambiguous protein identification of interest, plot the bipartite graph representing the CC which contains the protein of interest. Compare the CC before or after transcriptome-informed filtering.

# Extract peptides and peptide-to-protein mappings for each CC after filtering
cc.peptides.incM_filt <- cc_composition(cc.multProteins_filt
                                            , incM = incM_filt)

# Generate a bipartite graph of the CC which contains the protein of interest,
# before and after transcriptome-informed filtering.
prot <- "ENSP261"
subgraphCC_beforeFilter <- plot_cc(prot = prot
                      , cc.proteins = cc.multProteins
                      , cc.subincM = cc.peptides.incM$cc.subincM
                      , incM = incM
                      , tagProt = "ENSP"
                      , tagContam = "Contam")
#> [1] "Protein ENSP261 is member of a multi-protein CC"

subgraphCC_afterFilter <- plot_cc(prot = prot
                     , cc.proteins = cc.multProteins_filt
                     , cc.subincM = cc.peptides.incM_filt$cc.subincM
                     , incM = incM_filt
                     , tagProt = "ENSP"
                     , tagContam = "Contam")
#> [1] "Protein ENSP261 is member of a multi-protein CC"

# Plot
old.par <- par(no.readonly = TRUE) # save default par values

par(mfrow = c(2, 1))
plot.igraph(subgraphCC_beforeFilter$g
            , layout = layout_as_bipartite
            , edge.width = 1
            , edge.arrow.width = 0.3
            , vertex.size = 35
            , edge.arrow.size = 0.5
            , vertex.size2 = 35
            , vertex.label.cex = 1
            , asp = 0.45
            , margin = -0.1) +
title(paste0("Protein "
             , prot
             , " in CC #"
             , subgraphCC_beforeFilter$cc_id
             , " before filtering")
      , line = -1)
#> integer(0)
plot.igraph(subgraphCC_afterFilter$g
            , layout = layout_as_bipartite
            , edge.width = 1
            , edge.arrow.width = 0.3
            , vertex.size = 35
            , edge.arrow.size = 0.5
            , vertex.size2 = 35
            , vertex.label.cex = 1
            , asp = 0.45
            , margin = -0.1) +
title(paste0("Protein "
             , prot, " in CC #"
             , subgraphCC_beforeFilter$cc_id
             , " after filtering")
      , line = -1)