Tutorial: discovering dataPreparation functionalities

2020-11-13

This vignette introduces the dataPreparation package (v1.0.0), what it offers, and how simple it is to use.

1 Introduction

1.1 Package presentation

Built on the data.table package, dataPreparation lets you perform most of the painful data preparation for a data science project with a minimal amount of code.

This package is fast and RAM efficient, thanks to its data.table backbone.

data.table and other dependencies are handled at installation.
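If you do not have the package yet, a typical setup (assuming installation from CRAN, as mentioned in the conclusion) looks like this:

```r
# Install the package from CRAN (dependencies such as data.table come along),
# then load it for the session:
install.packages("dataPreparation")
library(dataPreparation)
```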

1.2 Main preparation steps

Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are the following: correcting mistakes, transforming variables, filtering useless rows and columns, pre-model manipulation, and shaping the set for your algorithm.

Here are the functions available in this package to tackle those issues:

Correct                      Transform                  Filter                     Pre model manipulation  Shape
un_factor                    generate_date_diffs        fast_filter_variables      fast_handle_na          shape_set
find_and_transform_dates     generate_factor_from_date  which_are_constant         fast_discretization     same_shape
find_and_transform_numerics  aggregate_by_key           which_are_in_double        fast_scale              set_as_numeric_matrix
set_col_as_character         generate_from_factor       which_are_bijection        one_hot_encoder
set_col_as_numeric           generate_from_character    remove_sd_outlier
set_col_as_date              fast_round                 remove_rare_categorical
set_col_as_factor            target_encode              remove_percentile_outlier

All of those functions are integrated in the full pipeline function prepare_set.

In this tutorial we will detail all those steps, and how to handle them with this package, using an example data set.

1.3 Tutorial data

For this tutorial, we are going to use a messy version of the adult database.

data(messy_adult)
print(head(messy_adult, n = 4))
#        date1      date2        date3              date4    num1   num2 constant
# 1:      <NA> 1510441200  24-Mar-2017     26-march, 2017  1.9309 0,0864        1
# 2: 2017-26-9 1490482800  01-Feb-2017  03-february, 2017 -0.4273 0,6345        1
# 3:      <NA> 1510614000  18-Sep-2017 20-september, 2017  0.6093 1,8958        1
# 4:  2017-6-1         NA  25-Jun-2017      27-june, 2017 -0.5138 0,4505        1
#                                mail    num3 age    type_employer fnlwgt
# 1:          pierre.caroline@aol.com  1,9309  39        State-gov  77516
# 2:           pierre.lucas@yahoo.com -0,4273  50 Self-emp-not-inc  83311
# 3: caroline.caroline@protonmail.com  0,6093  38          Private 215646
# 4:         marie.caroline@gmail.com -0,5138  53          Private 234721
#    education education_num            marital        occupation  relationship
# 1: Bachelors            13      Never-married      Adm-clerical Not-in-family
# 2: Bachelors            13 Married-civ-spouse   Exec-managerial       Husband
# 3:   HS-grad             9           Divorced Handlers-cleaners Not-in-family
# 4:      11th             7 Married-civ-spouse Handlers-cleaners       Husband
#     race  sex capital_gain capital_loss hr_per_week       country income
# 1: White Male         2174            0          40 United-States  <=50K
# 2: White Male            0            0          13 United-States  <=50K
# 3: White Male            0            0          40 United-States  <=50K
# 4: Black Male            0            0          40 United-States  <=50K

We added 9 really ugly columns to the data set: four date columns in different formats (date1 to date4), three numeric columns with inconsistent decimal separators (num1 to num3), a constant column, and a mail column. Some columns also contain the same information under different representations (e.g. education and education_num).

2 Correct functions

2.1 Identifying factors that shouldn't be

When reading a data set, R often turns string columns into factors even when it shouldn't. In this tutorial's data set, mail is a factor but shouldn't be. This is automatically detected and fixed by the un_factor function:

print(class(messy_adult$mail))
# "factor"
messy_adult <- un_factor(messy_adult)
# "date1"         "date2"         "date3"         "date4"        
# "num1"          "num2"          "constant"      "num3"         
# "age"           "fnlwgt"        "education_num" "capital_gain" 
# "capital_loss"  "hr_per_week"  
# [1] "un_factor: c(\"date1\", \"date2\", \"date3\", \"date4\", \"num1\", \"num2\", \"constant\", \"num3\", \"age\", \"fnlwgt\", \"education_num\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor i do nothing for those variables."
# [1] "un_factor: I will identify variable that are factor but shouldn't be."
# [1] "un_factor: I unfactor mail."
# [1] "un_factor: It took me 0s to unfactor 1 column(s)."
print(class(messy_adult$mail))
# "character"

2.2 Identifying and transforming date columns

The next thing to do is to identify the columns that are dates (the first four) and transform them.

messy_adult <- find_and_transform_dates(messy_adult)
# "find_and_transform_dates: It took me 0.7s to identify formats"
# "find_and_transform_dates: It took me 0.08s to transform 4 columns to a Date format."
Let's have a look at the transformation performed on those 4 columns:

date1_prev  date2_prev  date3_prev   date4_prev          transfo  date1       date2                date3       date4
NA          1510441200  24-Mar-2017  26-march, 2017      =>       NA          2017-11-12 00:00:00  2017-03-24  2017-03-26
2017-26-9   1490482800  01-Feb-2017  03-february, 2017   =>       2017-09-26  2017-03-26 00:00:00  2017-02-01  2017-02-03
NA          1510614000  18-Sep-2017  20-september, 2017  =>       NA          2017-11-14 00:00:00  2017-09-18  2017-09-20
2017-6-1    NA          25-Jun-2017  27-june, 2017       =>       2017-01-06  NA                   2017-06-25  2017-06-27
NA          1494457200  26-Jan-2017  28-january, 2017    =>       NA          2017-05-11 01:00:00  2017-01-26  2017-01-28
2017-18-7   1494370800  04-Apr-2017  06-april, 2017      =>       2017-07-18  2017-05-10 01:00:00  2017-04-04  2017-04-06

As one can see, even though the formats were different and somewhat ugly, they were all handled.
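Under the hood, each of those columns maps to a standard strptime-style date format. As a base-R sketch of the same parsing (the format strings are my reading of the columns shown above; %b and %B assume an English locale):

```r
# Each messy column corresponds to one strptime-style format:
as.Date("2017-26-9", format = "%Y-%d-%m")        # year-day-month
as.Date("24-Mar-2017", format = "%d-%b-%Y")      # abbreviated month name
as.Date("26-march, 2017", format = "%d-%B, %Y")  # full month name
# date2 holds Unix timestamps (seconds since 1970-01-01):
as.POSIXct(1510441200, origin = "1970-01-01", tz = "UTC")
```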

2.3 Identifying and transforming numeric columns

And now the same thing with numerics:

messy_adult <- find_and_transform_numerics(messy_adult)
# "find_and_transform_numerics: It took me 0s to identify 3 numerics column(s), i will set them as numerics"
# "set_col_as_numeric: I will set some columns as numeric"
# "set_col_as_numeric: I am doing the column num1."
# "set_col_as_numeric: 0 NA have been created due to transformation to numeric."
# "set_col_as_numeric: I will set some columns as numeric"
# "set_col_as_numeric: I am doing the column num2."
# "set_col_as_numeric: 0 NA have been created due to transformation to numeric."
# "set_col_as_numeric: I am doing the column num3."
# "set_col_as_numeric: 0 NA have been created due to transformation to numeric."
# "find_and_transform_numerics: It took me 0.05s to transform 3 column(s) to a numeric format."
num1_prev  num2_prev  num3_prev  transfo  num1     num2     num3
1.9309     0,0864     1,9309     =>       1.9309   0.0864   1.9309
-0.4273    0,6345     -0,4273    =>       -0.4273  0.6345   -0.4273
0.6093     1,8958     0,6093     =>       0.6093   1.8958   0.6093
-0.5138    0,4505     -0,5138    =>       -0.5138  0.4505   -0.5138
1.0563     1,342      1,0563     =>       1.0563   1.3420   1.0563
-0.9377    -0,0421    -0,9377    =>       -0.9377  -0.0421  -0.9377
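The num2/num3 fix boils down to swapping the decimal comma for a dot before coercing to numeric. A minimal base-R sketch of the idea (not the package's actual implementation):

```r
# Convert comma-decimal strings to numerics:
x <- c("0,0864", "-0,4273", "1,8958")
as.numeric(gsub(",", ".", x, fixed = TRUE))
# 0.0864 -0.4273 1.8958
```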

So now our data set is a bit less ugly.

3 Filter functions

The idea now is to identify useless columns:

3.1 Look for constant variables

constant_cols <- which_are_constant(messy_adult)
# "which_are_constant: constant is constant."
# "which_are_constant: it took me 0s to identify 1 constant column(s)"
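Conceptually, a constant column is simply one with at most one distinct value. A hedged base-R sketch of such a check (not necessarily how the package implements it internally):

```r
# A column is constant when it has at most one distinct value:
is_constant <- function(col) length(unique(col)) <= 1
is_constant(rep(1, 100))   # TRUE
is_constant(c(1, 2, 1))    # FALSE
```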

3.2 Look for columns in double

double_cols <- which_are_in_double(messy_adult)
# "which_are_in_double: it took me 0s to identify 1 column(s) to drop."

3.3 Look for columns that are bijections of one another

bijections_cols <- which_are_bijection(messy_adult)
# "which_are_bijection: it took me 0.21s to identify 3 column(s) to drop."
To check this, let's have a look at the columns concerned:

constant  date3       date4       num1     num3     education  education_num
1         2017-03-24  2017-03-26  1.9309   1.9309   Bachelors  13
1         2017-02-01  2017-02-03  -0.4273  -0.4273  Bachelors  13
1         2017-09-18  2017-09-20  0.6093   0.6093   HS-grad    9
1         2017-06-25  2017-06-27  -0.5138  -0.5138  11th       7
1         2017-01-26  2017-01-28  1.0563   1.0563   Bachelors  13
1         2017-04-04  2017-04-06  -0.9377  -0.9377  Masters    14

Indeed: constant is constant, num3 is identical to num1 (a column in double), date4 is always date3 shifted by two days (a bijection), and education_num is a bijection of education.
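To make the idea concrete: two columns are in bijection when their distinct values map one-to-one. A hedged base-R sketch of such a test (which_are_bijection is more elaborate, but the principle is the same):

```r
# Two columns are in bijection when each value of one corresponds to
# exactly one value of the other (distinct pairs == distinct values):
is_bijection <- function(a, b) {
  n <- length(unique(a))
  n == length(unique(b)) && n == nrow(unique(data.frame(a, b)))
}
is_bijection(c(13, 13, 9, 7), c("Bachelors", "Bachelors", "HS-grad", "11th"))  # TRUE
is_bijection(c(1, 2, 3), c("a", "a", "b"))                                     # FALSE
```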

3.4 Filter them all

To directly filter all of them:

ncols <- ncol(messy_adult)
messy_adult <- fast_filter_variables(messy_adult)
print(paste0("messy_adult now has ", ncol(messy_adult), " columns; so ", ncols - ncol(messy_adult), " fewer than before."))
# "fast_filter_variables: I check for constant columns."
# "fast_filter_variables: I delete 1 constant column(s) in data_set."
# "fast_filter_variables: I check for columns in double."
# "fast_filter_variables: I delete 1 column(s) that are in double in data_set."
# "fast_filter_variables: I check for columns that are bijections of another column."
# "fast_filter_variables: I delete 2 column(s) that are bijections of another column in data_set."
# "messy_adult now has 20 columns; so 4 fewer than before."

Four useless columns have been deleted. Without them, your machine learning algorithm will at least be faster, and might even give better results.

4 Transform functions

Before sending this to a machine learning algorithm, a few transformations should be performed.

The idea with the functions presented here is to perform those transformations in a RAM efficient way.

4.1 Dates differences

Since no machine learning algorithm handles Dates, one needs to transform or drop them. One way to transform dates is to compute the differences between every pair of date columns.

We can also provide an analysis date, to compare each date with the date your data is from. For example, if you have a birth date you may want to compute an age by performing today - birth date.
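The computation behind such a feature is simply a date difference. A minimal base-R illustration (the birth date here is a made-up value):

```r
# Age in days, as the difference between an analysis date and a birth date:
birth_date <- as.Date("1985-06-15")     # hypothetical value
analysis_date <- as.Date("2018-01-01")
as.numeric(difftime(analysis_date, birth_date, units = "days"))
```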

messy_adult <- generate_date_diffs(messy_adult, cols = "auto", analysis_date = as.Date("2018-01-01"), units = "days")
# "num1"          "num2"          "mail"          "age"          
# "type_employer" "fnlwgt"        "education"     "marital"      
# "occupation"    "relationship"  "race"          "sex"          
# "capital_gain"  "capital_loss"  "hr_per_week"   "country"      
# "income"       
# [1] "generate_date_diffs: c(\"num1\", \"num2\", \"mail\", \"age\", \"type_employer\", \"fnlwgt\", \"education\", \"marital\", \"occupation\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"country\", \"income\") aren't columns of types date i do nothing for those variables."
# [1] "generate_date_diffs: I will generate difference between dates."
# [1] "generate_date_diffs: It took me 0.01s to create 6 column(s)."
...  date1.Minus.date3  date1.Minus.analysis.date  date2.Minus.date3  date2.Minus.analysis.date  date3.Minus.analysis.date
...  NA                 NA                         232.95833          -50                        -282.9583
...  237                -96.95833                  52.95833           -281                       -333.9583
...  NA                 NA                         56.95833           -48                        -104.9583
...  -170               -359.95833                 NA                 NA                         -189.9583
...  NA                 NA                         104.95833          -235                       -339.9583
...  105                -166.95833                 35.95833           -236                       -271.9583

4.2 Transforming dates into aggregates

Another way to work with dates is to aggregate them at some level. This time drop is set to TRUE in order to drop the original date columns.

messy_adult <- generate_factor_from_date(messy_adult, cols = "auto", type = "quarter", drop = TRUE)
# "num1"                      "num2"                     
# "mail"                      "age"                      
# "type_employer"             "fnlwgt"                   
# "education"                 "marital"                  
# "occupation"                "relationship"             
# "race"                      "sex"                      
# "capital_gain"              "capital_loss"             
# "hr_per_week"               "country"                  
# "income"                    "date1.Minus.date2"        
# "date1.Minus.date3"         "date1.Minus.analysis.date"
# "date2.Minus.date3"         "date2.Minus.analysis.date"
# "date3.Minus.analysis.date"
# [1] "generate_factor_from_date: c(\"num1\", \"num2\", \"mail\", \"age\", \"type_employer\", \"fnlwgt\", \"education\", \"marital\", \"occupation\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"country\", \"income\", \"date1.Minus.date2\", \"date1.Minus.date3\", \"date1.Minus.analysis.date\", \"date2.Minus.date3\", \"date2.Minus.analysis.date\", \"date3.Minus.analysis.date\") aren't columns of types date i do nothing for those variables."
# [1] "generate_factor_from_date: I will create a factor column from each date column."
# [1] "generate_factor_from_date: It took me 0.03s to transform 3 column(s)."
...  date1.quarter  date2.quarter  date3.quarter
...  QNA            Q4             Q1
...  Q3             Q1             Q1
...  QNA            Q4             Q3
...  Q1             QNA            Q2
...  QNA            Q2             Q1
...  Q3             Q2             Q2
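Base R has a similar building block: quarters() maps a date to its quarter label, which is essentially what the generated factors above contain (with QNA standing in for missing dates):

```r
# Map dates to quarter labels, as in the generated columns above:
quarters(as.Date(c("2017-03-24", "2017-11-12")))
# "Q1" "Q4"
```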

4.3 Generate features from character columns

Character columns are not handled by machine learning algorithms, so one should transform them. The generate_from_character function builds new features from them, and can then drop them.

messy_adult <- generate_from_character(messy_adult, cols = "auto", drop = TRUE)
# "num1"                      "num2"                     
# "age"                       "type_employer"            
# "fnlwgt"                    "education"                
# "marital"                   "occupation"               
# "relationship"              "race"                     
# "sex"                       "capital_gain"             
# "capital_loss"              "hr_per_week"              
# "country"                   "income"                   
# "date1.Minus.date2"         "date1.Minus.date3"        
# "date1.Minus.analysis.date" "date2.Minus.date3"        
# "date2.Minus.analysis.date" "date3.Minus.analysis.date"
# "date1.quarter"             "date2.quarter"            
# "date3.quarter"            
# [1] "generate_from_character: c(\"num1\", \"num2\", \"age\", \"type_employer\", \"fnlwgt\", \"education\", \"marital\", \"occupation\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"country\", \"income\", \"date1.Minus.date2\", \"date1.Minus.date3\", \"date1.Minus.analysis.date\", \"date2.Minus.date3\", \"date2.Minus.analysis.date\", \"date3.Minus.analysis.date\", \"date1.quarter\", \"date2.quarter\", \"date3.quarter\") aren't columns of types character i do nothing for those variables."
# [1] "generate_from_character: it took me: 0.02s to transform 1 character columns into, 3 new columns."
mail.notnull  mail.num  mail.order
TRUE          200       1
TRUE          200       1
TRUE          200       1
TRUE          200       1
TRUE          200       1
TRUE          200       1

4.4 Aggregate according to a key

To model something by country, one would want to compute an aggregation of this table in order to have one line per country.

agg_adult <- aggregate_by_key(messy_adult, key = "country")
# "aggregate_by_key: I start to aggregate"
# "aggregate_by_key: 139 columns have been constructed. It took 0.27 seconds. "
country   max.age  type_employer.Without-pay  education.Assoc-acdm  marital.Married-AF-spouse  ...
?         90       0                          10                    0                          ...
Cambodia  65       0                          0                     0                          ...
Canada    80       0                          1                     0                          ...
China     75       0                          0                     0                          ...
Columbia  75       0                          4                     0                          ...
Cuba      82       0                          3                     0                          ...

Whenever you have more than one line per individual, this function comes in really handy.
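The idea is the classic split-apply-combine pattern. A tiny base-R analogue using aggregate() on toy data (not the adult set, and only one aggregate instead of the 139 built above):

```r
# One line per key, with an aggregate of each numeric column:
df <- data.frame(country = c("Cuba", "Cuba", "Canada"),
                 age     = c(40, 82, 80))
aggregate(age ~ country, data = df, FUN = max)
# Canada -> 80, Cuba -> 82
```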

4.5 Rounding

One might want to round numeric variables in order to save some RAM, or for algorithmic reasons:

messy_adult <- fast_round(messy_adult, digits = 2)
# "type_employer" "education"     "marital"       "occupation"   
# "relationship"  "race"          "sex"           "country"      
# "income"        "date1.quarter" "date2.quarter" "date3.quarter"
# "mail.notnull" 
# [1] "fast_round: c(\"type_employer\", \"education\", \"marital\", \"occupation\", \"relationship\", \"race\", \"sex\", \"country\", \"income\", \"date1.quarter\", \"date2.quarter\", \"date3.quarter\", \"mail.notnull\") aren't columns of types numeric or integer i do nothing for those variables."
num1   num2   age  type_employer  fnlwgt  education     ...
0.59   -0.50  60   Private        173960  Bachelors     ...
NA     -0.60  25   Private        371987  Bachelors     ...
NA     0.48   26   Private        94936   Assoc-acdm    ...
0.02   2.83   28   Private        166481  7th-8th       ...
-0.87  -0.39  45   Self-emp-inc   197332  Some-college  ...
1.20   -0.74  31   Private        244147  HS-grad       ...

5 Handling NA values

Now let's handle NAs. fast_handle_na lets you choose how NAs are filled. By default, numeric columns are filled with 0, boolean columns with FALSE, and character columns with "".

messy_adult <- fast_handle_na(messy_adult)
#    num1  num2 age type_employer   ...       country income date1.Minus.date2
# 1: 0.59 -0.50  60       Private   ... United-States  <=50K           -173.96
# 2: 0.00 -0.60  25       Private   ... United-States  <=50K             23.04
# 3: 0.00  0.48  26       Private   ... United-States  <=50K            -73.96
# 4: 0.02  2.83  28       Private   ...   Puerto-Rico  <=50K           -234.96
#    date1.Minus.date3 date1.Minus.analysis.date date2.Minus.date3
# 1:                65                   -293.96            238.96
# 2:              -117                   -334.96           -140.04
# 3:               -33                   -138.96             40.96
# 4:              -228                   -336.96              6.96
#    date2.Minus.analysis.date date3.Minus.analysis.date date1.quarter
# 1:                      -120                   -358.96            Q1
# 2:                      -358                   -217.96            Q1
# 3:                       -65                   -105.96            Q3
# 4:                      -102                   -108.96            Q1
#    date2.quarter date3.quarter mail.notnull mail.num mail.order
# 1:            Q3            Q1         TRUE      200          1
# 2:            Q1            Q2         TRUE      200          1
# 3:            Q4            Q3         TRUE      200          1
# 4:            Q3            Q3         TRUE      200          1

If you want to fill with specific values (constants, or even a function, for example the mean of values), check the fast_handle_na documentation.
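For instance, filling the NAs of a numeric vector with its mean is, in base R, just this (a sketch of the operation itself, not of fast_handle_na's internals):

```r
# Replace NAs in a numeric vector by the mean of the observed values:
x <- c(1, NA, 3)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x
# 1 2 3
```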

6 Shape functions

There are two types of machine learning algorithms in R: those which accept data.tables and factors, and those which only accept a numeric matrix.

Transforming a data set into something acceptable for a machine learning algorithm could be tricky.

The shape_set function does it for you: you just have to choose whether you want a data.table or a numerical_matrix.

First with data.table:

clean_adult <- shape_set(copy(messy_adult), final_form = "data.table", verbose = FALSE)
print(table(sapply(clean_adult, class)))
# 
#  factor integer numeric 
#      12       1      15

As one can see, only factors, integers and numerics remain.

Now with numerical_matrix:

clean_adult <- shape_set(copy(messy_adult), final_form = "numerical_matrix", verbose = FALSE)
num1   num2   age  type_employer?  type_employerFederal-gov  type_employerLocal-gov  ...
0.59   -0.50  60   0               0                         0                       ...
0.00   -0.60  25   0               0                         0                       ...
0.00   0.48   26   0               0                         0                       ...
0.02   2.83   28   0               0                         0                       ...
-0.87  -0.39  45   0               0                         0                       ...
1.20   -0.74  31   0               0                         0                       ...

As one can see, with final_form = "numerical_matrix" every character and factor column has been binarized.
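This binarization is the usual one-hot (dummy) encoding; base R's model.matrix does the same thing, shown here on a toy factor (the - 1 drops the intercept so that every level gets its own indicator column):

```r
# One indicator column per factor level:
f <- factor(c("State-gov", "Private", "Private"))
model.matrix(~ f - 1)
```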

7 Full pipeline

Doing it all with one function is possible.

To do that, we reload the ugly data set and also perform the aggregation.

data("messy_adult")
agg_adult <- prepare_set(messy_adult, final_form = "data.table", key = "country", analysis_date = Sys.Date(), digits = 2)
# "prepare_set: step one: correcting mistakes."
# "fast_filter_variables: I check for constant columns."
# "fast_filter_variables: I delete 1 constant column(s) in data_set."
# "fast_filter_variables: I check for columns in double."
# "fast_filter_variables: I check for columns that are bijections of another column."
# "fast_filter_variables: I delete 3 column(s) that are bijections of another column in data_set."
#  [1] "date1"        "date2"        "date4"        "num2"         "num3"        
#  [6] "age"          "fnlwgt"       "capital_gain" "capital_loss" "hr_per_week" 
# "un_factor: c(\"date1\", \"date2\", \"date4\", \"num2\", \"num3\", \"age\", \"fnlwgt\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor i do nothing for those variables."
# "un_factor: I will identify variable that are factor but shouldn't be."
# "un_factor: I unfactor mail."
# "un_factor: It took me 0.02s to unfactor 1 column(s)."
# "find_and_transform_numerics: It took me 0s to identify 2 numerics column(s), i will set them as numerics"
# "set_col_as_numeric: I will set some columns as numeric"
# "set_col_as_numeric: I am doing the column num2."
# "set_col_as_numeric: 0 NA have been created due to transformation to numeric."
# "set_col_as_numeric: I am doing the column num3."
# "set_col_as_numeric: 0 NA have been created due to transformation to numeric."
# "find_and_transform_numerics: It took me 0.03s to transform 2 column(s) to a numeric format."
# "find_and_transform_dates: It took me 0.45s to identify formats"
# "find_and_transform_dates: It took me 0.05s to transform 3 columns to a Date format."
# "prepare_set: step two: transforming data_set."
#  [1] "num2"          "mail"          "num3"          "age"          
#  [5] "type_employer" "fnlwgt"        "education"     "marital"      
#  [9] "occupation"    "relationship"  "race"          "sex"          
# [13] "capital_gain"  "capital_loss"  "hr_per_week"   "country"      
# [17] "income"       
# "prepare_set: c(\"num2\", \"mail\", \"num3\", \"age\", \"type_employer\", \"fnlwgt\", \"education\", \"marital\", \"occupation\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"country\", \"income\") aren't columns of types date i do nothing for those variables."
# "generate_date_diffs: I will generate difference between dates."
# "generate_date_diffs: It took me 0s to create 6 column(s)."
# "generate_factor_from_date: I will create a factor column from each date column."
# "generate_factor_from_date: It took me 0.34s to transform 3 column(s)."
#  [1] "date1"                     "date2"                    
#  [3] "date4"                     "num2"                     
#  [5] "num3"                      "age"                      
#  [7] "type_employer"             "fnlwgt"                   
#  [9] "education"                 "marital"                  
# [11] "occupation"                "relationship"             
# [13] "race"                      "sex"                      
# [15] "capital_gain"              "capital_loss"             
# [17] "hr_per_week"               "country"                  
# [19] "income"                    "date1.Minus.date2"        
# [21] "date1.Minus.date4"         "date1.Minus.analysis.date"
# [23] "date2.Minus.date4"         "date2.Minus.analysis.date"
# [25] "date4.Minus.analysis.date"
# "prepare_set: c(\"date1\", \"date2\", \"date4\", \"num2\", \"num3\", \"age\", \"type_employer\", \"fnlwgt\", \"education\", \"marital\", \"occupation\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"country\", \"income\", \"date1.Minus.date2\", \"date1.Minus.date4\", \"date1.Minus.analysis.date\", \"date2.Minus.date4\", \"date2.Minus.analysis.date\", \"date4.Minus.analysis.date\") aren't columns of types character i do nothing for those variables."
# "generate_from_character: it took me: 0s to transform 1 character columns into, 3 new columns."
# "aggregate_by_key: I start to aggregate"
# "aggregate_by_key: 164 columns have been constructed. It took 0.25 seconds. "
# "prepare_set: step three: filtering data_set."
# "fast_filter_variables: I check for constant columns."
# "fast_filter_variables: I delete 2 constant column(s) in result."
# "fast_filter_variables: I check for columns in double."
# "fast_filter_variables: I delete 1 column(s) that are in double in result."
# "fast_filter_variables: I check for columns that are bijections of another column."
# "fast_filter_variables: I delete 35 column(s) that are bijections of another column in result."
# "country"
# "fast_round: country aren't columns of types numeric or integer i do nothing for those variables."
# "prepare_set: step four: handling NA."
# "prepare_set: step five: shaping result."
# "set_col_as_factor: I will set some columns to factor."
# "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor."
# "shape_set: Transforming numerical variables into factors when length(unique(col)) <= 10."
# "shape_set: Previous distribution of column types:"
# col_class_init
#  factor numeric 
#       1     125 
# "shape_set: Current distribution of column types:"
# col_class_end
#  factor numeric 
#      37      89

As one can see, all the previous steps have been performed.

Let's have a look at the result:

# "126 columns have been built; for 42 countries."
country   nbr_lines  mean.num2  sd.num2  mean.num3  sd.num3  min.age  ...
?         529        0          0        0          0        17       ...
Cambodia  16         0.08       0.78     0          0        25       ...
Canada    108        0          0        0          0        17       ...
China     67         0          0        0          0        22       ...
Columbia  53         0          0        0          0        21       ...
Cuba      88         0          0        0          0        21       ...

8 Description

Finally, to generate a description file from this data set, the description function is available.

It will describe the set and its variables. Here we set level = 0 to get some global descriptions:

description(agg_adult, level = 0)
# "data_set is a data.table-data.frame"
# [1] "data_set contains 42 rows and 126 cols."
# [1] "Columns are of the following classes:"
# 
#  factor numeric 
#      37      89

9 Conclusion

We presented some of the functions of the dataPreparation package. A few more are available, and they have additional parameters to make them easier to use. So if you liked it, please check the package documentation (by installing it, or on CRAN).

We hope that this package is helpful and that it helped you prepare your data faster.

If you would like to give us feedback, report issues, or suggest features for this package, please tell us on GitHub. And if you want to contribute, please don't hesitate to contact us.