The package contains tools to calculate similarity score matrix for DNA k-mers. The pairwise similarity score is calculated using PAM or BLOSUM substitution matrix. The results are evaluated by similarity score calculated by Needleman-Wunsch (global) (Needleman and Wunsch 1970) or Smith-Waterman (local) alignment. (Smith and Waterman 1981). Higher similarity score indicates more similar sequences for BLOSUM and less similar sequences for PAM matrix; 30, 40, 70, 120, 250 and 62, 45, 50, 62, 80, 100 matrix versions are available for PAM and BLOSUM, respectively.

```
# Import the package
library(kmeRs)
```

Simply apply the **kmeRs_similarity_matrix** function and mark as an input the vector contains the nucleotides letters for witch the score should be calculated.

```
# Simple BLOSUM62 similarity matrix for all DNA nucleotides
result <- kmeRs_similarity_matrix(kmers_given = c("A", "T", "C", "G"), submat = "BLOSUM62")
# Fancy knitr table
knitr::kable(result)
```

A | T | C | G | |
---|---|---|---|---|

A | 4 | 0 | 0 | 0 |

T | 0 | 5 | -1 | -2 |

C | 0 | -1 | 9 | -3 |

G | 0 | -2 | -3 | 6 |

In this example, the most *‘different’* k-mer to “GATTACA” sequence will be indicated from given set of heptamers. Here, 7 heptamer (being an anagram of the movie title “GATTACA”) are given, as follow:

```
# Given hexamers
kmers_given <- c("GATTACA", "ACAGATT", "GAATTAC", "GAAATCT", "CTATAGA", "GTACATA", "AACGATT")
# Matrix calculation
result <- kmeRs_similarity_matrix(kmers_given = c("GATTACA"), compare_to = kmers_given , submat = "BLOSUM62")
# Fancy knitr table
knitr::kable(result)
```

GATTACA | |
---|---|

GATTACA | 37 |

ACAGATT | 1 |

GAATTAC | 15 |

GAAATCT | 19 |

CTATAGA | 7 |

GTACATA | 12 |

AACGATT | 4 |

Now, applying **kmeRs_score_and_sort** function the total score is calculated and the matrix is sorted by decreasing score value. The lowest value (in case of BLOSUM) indicates the most *‘different’* sequence from given k-mers, in contrast to the highest value which indicates the most similar one.

```
# Score and sort the matrix
result <- kmeRs_score_and_sort(result)
# Fancy knitr table
knitr::kable(result)
```

GATTACA | score_total | |
---|---|---|

ACAGATT | 1 | 1 |

AACGATT | 4 | 4 |

CTATAGA | 7 | 7 |

GTACATA | 12 | 12 |

GAATTAC | 15 | 15 |

GAAATCT | 19 | 19 |

GATTACA | 37 | 37 |

As can be observed, the most *‘different’* sequence to GATTACA is ACAGATT with total score equal to 1 and the most similar to GATTACA sequence is of course GATTACA sequence with the highest score equal to 37.

In this example, the most *‘different’* k-mer to whole given set of heptamers will be indicated. The same heptamers as in example 2 are used.

```
# Given hexamers
kmers_given <- c("GATTACA", "ACAGATT", "GAATTAC", "GAAATCT", "CTATAGA", "GTACATA", "AACGATT")
# Matrix calculation
result <- kmeRs_similarity_matrix(kmers_given = kmers_given, submat = "BLOSUM62")
# Score the matrix and sort by decreasing score
result <- kmeRs_score_and_sort(result)
# Fancy knitr table
knitr::kable(result)
```

GATTACA | ACAGATT | GAATTAC | GAAATCT | CTATAGA | GTACATA | AACGATT | score_total | |
---|---|---|---|---|---|---|---|---|

CTATAGA | 7 | 3 | 6 | -2 | 37 | 11 | 0 | 62 |

AACGATT | 4 | 24 | 1 | 8 | 0 | 6 | 37 | 80 |

ACAGATT | 1 | 37 | 1 | 8 | 3 | 9 | 24 | 83 |

GAATTAC | 15 | 1 | 37 | 18 | 6 | 9 | 1 | 87 |

GTACATA | 12 | 9 | 9 | 9 | 11 | 37 | 6 | 93 |

GATTACA | 37 | 1 | 15 | 19 | 7 | 12 | 4 | 95 |

GAAATCT | 19 | 8 | 18 | 37 | -2 | 9 | 8 | 97 |

As can be observed, the most *‘different’* sequence to all given heptamers is CTATAGA with score equal to 62 and the most similar sequence is GAAATCT with the highest score equal to 97.

Applying function **kmeRs_statistics** to the result matrix (here, result matrix from example 3) the basic statistics can be calculated as additional columns. When **summary_statistics_only** is set to TRUE only summary table is returned. It is much more elegant way to present results, especially in case of *‘big data’* output.

```
# Score the matrix and sort by decreasing score
result <- kmeRs_statistics(result)
# Fancy knitr table
knitr::kable(result[ , 1:(length(result[1, ])-4)])
```

GATTACA | ACAGATT | GAATTAC | GAAATCT | CTATAGA | GTACATA | AACGATT | score_total | |
---|---|---|---|---|---|---|---|---|

CTATAGA | 7.00 | 3.00 | 6.00 | -2.00 | 37.00 | 11.00 | 0.00 | 62.00 |

AACGATT | 4.00 | 24.00 | 1.00 | 8.00 | 0.00 | 6.00 | 37.00 | 80.00 |

ACAGATT | 1.00 | 37.00 | 1.00 | 8.00 | 3.00 | 9.00 | 24.00 | 83.00 |

GAATTAC | 15.00 | 1.00 | 37.00 | 18.00 | 6.00 | 9.00 | 1.00 | 87.00 |

GTACATA | 12.00 | 9.00 | 9.00 | 9.00 | 11.00 | 37.00 | 6.00 | 93.00 |

GATTACA | 37.00 | 1.00 | 15.00 | 19.00 | 7.00 | 12.00 | 4.00 | 95.00 |

GAAATCT | 19.00 | 8.00 | 18.00 | 37.00 | -2.00 | 9.00 | 8.00 | 97.00 |

Min | 1.00 | 1.00 | 1.00 | -2.00 | -2.00 | 6.00 | 0.00 | 62.00 |

Max | 37.00 | 37.00 | 37.00 | 37.00 | 37.00 | 37.00 | 37.00 | 97.00 |

Mean | 13.57 | 11.86 | 12.43 | 13.86 | 8.86 | 13.29 | 11.43 | 85.29 |

Sd | 12.08 | 13.64 | 12.62 | 12.40 | 13.16 | 10.63 | 13.83 | 12.04 |

Needleman, Saul B., and Christian D. Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” *Journal of Molecular Biology* 48 (3): 443–53. doi:10.1016/0022-2836(70)90057-4.

Smith, T.F., and M.S. Waterman. 1981. “Identification of Common Molecular Subsequences.” *Journal of Molecular Biology* 147 (1): 195–97. doi:10.1016/0022-2836(81)90087-5.

BioTesseract Cambridge Bioinformatics Solutions, Cambridgeshire, Cambridge, UK↩