MicroSEC logo

MicroSEC pipeline for FFPE artifacts

This pipeline is designed for filtering sequence errors found in formalin-fixed and paraffin-embedded (FFPE) samples.
This repository contains all the codes to regenerate results from our paper:
“MicroSEC: Sequence error filtering pipeline for formalin-fixed and paraffin-embedded samples”

Supplementary Code

M. Ikegami et al., “MicroSEC: Sequence error filtering pipeline for formalin-fixed and paraffin-embedded samples”, in preparation.



This pipeline is designed for filtering mutations found in formalin-fixed and paraffin-embedded (FFPE) samples.
The MicroSEC filter utilizes a statistical analysis, and the results for mutations with less than 10 supporting reads are not reliable.
Four files are necessary for the analysis: mutation information file, BAM file, and mutation supporting read ID information file.

File 1: mutation information file

This excel file should contain at least these contents:

Sample       Gene       HGVS.c  HGVS.p Mut_type Total_QV>=20   %Alt   Chr  Pos         Ref Alt SimpleRepeat_TRF Neighborhood_sequence                      Transition  
SL_1010-N6-B SLC25A24    _      _      1-snv    366            1.0929 chr1 108130741   C   T   N                CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC  C>T_t

File 2: BAM file

This file should contain at least these contents:

File 3: mutation supporting read ID information tsv file

This file should contain at least these contents:

Chr  Pos       Ref Alt        Mut_ID                                                                                                Mut
chr1 2561609   T   A          _;ID001-1:579185f,ID004-1:1873933f;ID006-1:1131647f,ID001-1:570086f,ID008-1:1953407r,ID002-2:749570r  .;A;N
chr6 346429    G   GACACACAC  _;ID005-2:545593f,ID006-1:1132212f,ID006-1:1132212r;ID004-1:991476r,ID004-2:516896f;ID002-1:670026f   .;.+ACACACAC;.-ACAC;N

File 4: sample information tsv file

Seven or eight columns are necessary.
The file contains no header.

[sample name] [mutation information excel file] [BAM file]          [read ID information directory] [read length] [adapter sequence read 1]         [optional: adapter sequence read 2] [reference genome]  
PC9           /mnt/source/CCLE.xlsx             /mnt/source/PC9.bam /mnt/source/PC9_Cell_line       127           AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT   hg38  

Filtering detail

This pipeline contains 8 filtering processes.

Supporting lengths are adjusted considering small repeat sequences around the mutations.

Results are saved in a excel file.
The explatation of the results is written in detail in the second sheet of the excel file.

github url: https://github.com/MANO-B/MicroSEC

System Requirements

Hardware Requirements

The scripts requires only a standard computer with enough RAM to support the operations defined by a user. For minimal performance, this will be a computer with about 8 GB of RAM. For optimal performance, we recommend a computer with the following specs:

RAM: 32+ GB
CPU: 4+ cores, 4.2+ GHz/core

The runtimes below are generated using a computer with the recommended specs (32 GB RAM, 4 cores@4.2 GHz) and internet of speed 100 Mbps.

Software Requirements

R language

This script files runs on R for Windows, Mac, or Linux, which requires the R version 3.4.0 or later.

Package dependencies

Users should install the following packages prior to use the scripts, from an R terminal:

if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages(c('tidyr', 'openxlsx', 'data.table', 'R.utils', 'stringr', 'magrittr', 'dplyr', 'gtools', 'devtools'), dependencies = TRUE)
BiocManager::install(c("Rsamtools", "Biostrings", "GenomicAlignments", "GenomeInfoDb"), update=FALSE)

# install necessary genomes
BiocManager::install("BSgenome.Hsapiens.UCSC.hg38", update=FALSE)
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19", update=FALSE)
BiocManager::install("BSgenome.Mmusculus.UCSC.mm10", update=FALSE)

which will install in about 30 minutes on a recommended machine.

Package Versions

All packages are in their latest versions as they appear on CRAN on Oct. 31, 2020. Users can check CRAN snapshot for details. The versions of software are, specifically:

> packageVersion("tidyr")
[1] '1.1.2'
> packageVersion("openxlsx")
[1] '4.2.2'
> packageVersion("data.table")
[1] '1.13.2'
> packageVersion("R.utils")
[1] '2.10.1'
> packageVersion("stringr")
[1] '1.4.0'
> packageVersion("magrittr")
[1] '1.5'
> packageVersion("dplyr")
[1] '1.0.2'
> packageVersion("Biostrings")
[1] '2.54.0'
> packageVersion("BSgenome.Hsapiens.UCSC.hg38")
[1] '1.4.1'
> packageVersion("BSgenome.Hsapiens.UCSC.hg19")
[1] '1.4.0'
> packageVersion("BSgenome.Mmusculus.UCSC.mm10")
[1] '1.4.0'
> packageVersion("GenomicAlignments")
[1] '1.22.1'
> packageVersion("Rsamtools")
[1] '2.0.3'
> packageVersion("gtools")
[1] '3.8.2'
> packageVersion("devtools")
[1] '2.3.2'
> packageVersion("GenomeInfoDb")
[1] '1.22.1'

Instructions for Use

devtools::install_github("MANO-B/MicroSEC", upgrade="never")  
# download only once
wget https://raw.githubusercontent.com/MANO-B/MicroSEC/main/MicroSEC.R

# Style 1
Rscript MicroSEC.R [working/output directory] [sample information tsv file] [progress bar Y/N]
# Style 2 (for pipeline use)
Rscript MicroSEC.R [output filename] [sample name] [mutation information tsv.gz] [BAM file] [read ID information directory] [read length] [adapter sequence read 1] [adapter sequence read 2] [sample type: Human or Mouse]
Rscript MicroSEC.R /mnt/HDD8TB/MicroSEC /mnt/HDD8TB/MicroSEC/source/Sample_list.txt Y  
Rscript MicroSEC.R /mnt/result/post_filter/SAMPLE.gz /mnt/result/mutation/SAMPLE.gz /mnt/result/mutation/SAMPLE.gz /mnt/result/BAM/SAMPLE.bam /mnt/result/ID 150    AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT hg38
## Necessary packages

wd = "/mnt/HDD8TB/MicroSEC" # set your working/output directory
sample_list = "/mnt/HDD8TB/MicroSEC/source/Sample_list.txt"
progress_bar = "Y"
# load sample information tsv file
sample_info = read.csv(sample_list,
                       header = FALSE,
                       stringsAsFactors = FALSE,
                       sep = "\t")

# initialize
msec = NULL
homology_search = NULL
mut_depth = NULL

for (sample in seq_len(dim(sample_info)[1])) {
  sample_name = sample_info[sample, 1]
  mutation_file = sample_info[sample, 2]
  bam_file = sample_info[sample, 3]
  read_list = sample_info[sample, 4]
  read_length = as.integer(sample_info[sample, 5])
  adapter_1 = sample_info[sample, 6]
  if (sample_info[sample, 7] %in%
      c("Human", "Mouse", "hg19", "hg38", "mm10")) {
    adapter_2 = adapter_1
    organism = sample_info[sample, 7]
  } else{
    adapter_2 = sample_info[sample, 7]
    organism = sample_info[sample, 8]

  # load mutation information
  df_mutation = fun_load_mutation(mutation_file, sample_name)
  df_bam = fun_load_bam(bam_file)
  df_mut_call = fun_load_id(read_list)

  # load genomic sequence
  ref_genome = fun_load_genome(organism)
  chr_no = fun_load_chr_no(organism)

  # analysis
  result = fun_read_check(df_mutation = df_mutation,
                          df_bam =  df_bam,
                          df_mut_call = df_mut_call,
                          ref_genome = ref_genome,
                          sample_name = sample_name,
                          read_length = read_length,
                          adapter_1 = adapter_1,
                          adapter_2 = adapter_2,
                          short_homology_search_length = 4,
                          progress_bar = progress_bar)
  msec = rbind(msec, result[[1]])
  homology_search = rbind(homology_search, result[[2]])
  mut_depth = rbind(mut_depth, result[[3]])
# search homologous sequences
msec = fun_homology(msec,
                    min_homology_search = 40,
                    progress_bar = progress_bar)

# statistical analysis
msec = fun_summary(msec)
msec = fun_analysis(msec,
                    short_homology_search_length = 4,
                    min_homology_search = 40,
                    threshold_p = 10 ^ (-6),
                    threshold_hairpin_ratio = 0.50,
                    threshold_soft_clip_ratio = 0.90,
                    threshold_short_length = 0.8,
                    threshold_distant_homology = 0.2,
                    threshold_low_quality_rate = 0.1,
                    homopolymer_length = 15)

# save the results
fun_save(msec, sample_info[1,1], wd)
MicroSEC-result_[Sample name]_[Date].xlsx
MicroSEC_[Sample name].tsv

like following:


Conversion of mutation information files

MicroSEC utilizes a mutation information file. The format of the file is written differently from common HGVS nomenclature.
A conversion function is prepared.

# Conversion example
# Set column names correctly.
# Before (hg19)
# Sample Hugo_Symbol    Chr   Start_Position   End_Position Variant_Type Reference Tumor_Seq Protein_Change
# LS411N BRAF           7     140453136        140453136    SNP          A         T         p.V600E
# LS411N CELSR1         22    46931227         46931227     DEL          C         -         p.G615fs
# LS411N APC            5     112175951        112175952    INS          -         A         p.E1554fs
# After
# Sample Gene     Chr   Pos          Mut_type  Ref   Alt HGVS.p       Neighborhood_sequence
# LS411N BRAF     7     140453136    1-snv     A     T   p.V600E      ACCCACTCCATCGAGATTTCTCTGTAGCTAGACCAAAATCA
# LS411N CELSR1   22    46931226     1-del     GC    G   p.G615fs     TCTTAGGCCCAGCGCTGCCGCCCCCAGAAAGGTGGAGGCC
# LS411N APC      5     112175951    1-ins     G     GA  p.E1554fs    AAAACCAAGAGAAAGAGGCAGAAAAAAACTATTGATTCTGAA    


mutation_file = "/mnt/HDD8TB/MicroSEC/mutation.xlsx"
organism = "hg19"

df_mutation = fun_convert(mutation_file = mutation_file,
                          organism = organism)

write.xlsx(df_mutation, "/mnt/HDD8TB/MicroSEC/mutation_modified.xlsx")


The source code is available in MicroSEC.R. Source data will be available at the Japanese Genotype-Phenotype Archive (http://trace.ddbj.nig.ac.jp/jga), which is hosted by the DNA Data Bank of Japan (the accession number is written in our paper). Each filtering process takes about 5–120 minutes per sample on a recommended machine, according to the depth and mutation amount of the sample.