The dataset R Package

The primary aim of dataset is to create well-referenced, well-described, interoperable datasets from data.frames, tibbles or data.tables that translate well, and in a reproducible manner, into the W3C DataSet definition within the Data Cube Vocabulary. The data cube model itself originated in the Statistical Data and Metadata eXchange (SDMX), and it is almost fully harmonized with the Resource Description Framework (RDF), the standard model for data interchange on the web [1].

The development version of the dataset package differs substantially from the CRAN release, and the documentation has not been rewritten yet. You can follow the discussion of this package on rOpenSci.

library(dataset)
iris_ds <- dataset(
  x = iris,
  title = "Iris Dataset",
  author = person("Edgar", "Anderson", role = "aut"),
  publisher = "American Iris Society",
  source = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  date = 1935,
  language = "en",
  description = "This famous (Fisher's or Anderson's) iris data set."
)

A title and an author are mandatory for a dataset. If the date is not specified, the current date is added.

Because the dataset has only just been created and is not published yet, the identifier receives the default :tba (to be announced) value, the version is set to 0.1.0, and the publisher field defaults to :unas (unassigned).

The dataset behaves as expected, with all data.frame methods applicable. If the dataset was originally a tibble or data.table object, it retains all methods of those S3 classes, because the dataset class only adds methods that work on the attributes of the original object.

summary(iris_ds)
#> Anderson E (2023). "Iris Dataset."
#> Further metadata: describe(iris_ds)
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#>        Species  
#>  setosa    :50  
#>  versicolor:50  
#>  virginica :50  
#>                 
#>                 
#> 
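
The mechanism described above, a subclass that keeps its metadata in attributes while inheriting the data.frame methods, can be illustrated in base R. This is only a minimal sketch and not the package's implementation; the class name my_dataset and the attribute name Title are hypothetical.

```r
# Hypothetical sketch in base R: "my_dataset" and "Title" are illustrative
# names and are not part of the dataset package.
df <- iris
attr(df, "Title") <- "Iris Dataset"      # metadata stored as an attribute
class(df) <- c("my_dataset", class(df))  # prepend a new S3 class

inherits(df, "data.frame")  # the object is still a data.frame
dim(df)                     # data.frame methods keep working
attr(df, "Title")           # the metadata travels with the object
```

Because no method is defined for my_dataset here, calls such as summary(df) fall through to the data.frame method, which is the behaviour the paragraph above describes.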

A brief description of the extended metadata attributes:

describe(iris_ds)
#> Iris Dataset 
#> Dataset with 150 observations (rows) and 5 variables (columns).
#> Description: This famous (Fisher's or Anderson's) iris data set.
#> Creator: Edgar Anderson [aut]
#> Publisher: American Iris Society
paste0("Publisher:", publisher(iris_ds))
#> [1] "Publisher:American Iris Society"
paste0("Rights:", rights(iris_ds))
#> [1] "Rights::unas"

The descriptive metadata are added to a utils::bibentry object which has many printing options (see ?bibentry).
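
To get a feel for these printing options independently of the package, a bibentry can also be built by hand with base R. The sketch below uses only utils functions; the field values are illustrative and merely mirror the iris example.

```r
# Base R only: a hand-made bibentry, independent of the dataset package.
# The field values are illustrative and mirror the iris example.
entry <- utils::bibentry(
  bibtype = "Misc",
  title   = "Iris Dataset",
  author  = utils::person("Edgar", "Anderson", role = "aut"),
  year    = 1935
)

print(entry, style = "text")    # plain-text citation
print(entry, style = "Bibtex")  # BibTeX record

# The BibTeX record is also available as a character vector.
bibtex_lines <- utils::toBibtex(entry)
```

The dataset package creates such an object from the metadata of the dataset: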

mybibentry <- dataset_bibentry(iris_ds)
print(mybibentry, "text")
#> Anderson E (2023). "Iris Dataset."
print(mybibentry, "Bibtex")
#> @Misc{,
#>   title = {Iris Dataset},
#>   author = {Edgar Anderson},
#>   publisher = {American Iris Society},
#>   year = {2023},
#>   resourcetype = {Dataset},
#>   identifier = {:tba},
#>   version = {0.1.0},
#>   description = {This famous (Fisher's or Anderson's) iris data set.},
#>   language = {en},
#>   format = {application/r-rds},
#>   rights = {:unas},
#> }
rights(iris_ds) <- "CC0"
rights(iris_ds)
#> [1] "CC0"
rights(iris_ds, overwrite = FALSE) <- "GNU-2"
#> The dataset has already a rights field: CC0

Some important metadata fields are protected from accidental overwriting, unless they still hold the default :unas (unassigned) or :tba (to be announced) value.

rights(iris_ds, overwrite = TRUE)  <- "GNU-2"
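
The overwrite protection shown above can be sketched as a base R replacement function. This is an assumption about how such a guard might work, not the package's code: the function name my_rights<-, the attribute name Rights, and the default overwrite = FALSE are all hypothetical.

```r
# Hypothetical sketch: `my_rights<-` is NOT the rights<- of the dataset
# package; the attribute name and the default for `overwrite` are assumed.
`my_rights<-` <- function(x, overwrite = FALSE, value) {
  current <- attr(x, "Rights")
  # A missing or placeholder value may always be replaced.
  is_placeholder <- is.null(current) || current %in% c(":unas", ":tba")
  if (is_placeholder || overwrite) {
    attr(x, "Rights") <- value
  } else {
    message("The dataset has already a rights field: ", current)
  }
  x
}

df <- iris
attr(df, "Rights") <- ":unas"
my_rights(df) <- "CC0"                      # placeholder, so it is replaced
my_rights(df) <- "GNU-2"                    # protected, only a message is shown
protected_value <- attr(df, "Rights")       # still the earlier value
my_rights(df, overwrite = TRUE) <- "GNU-2"  # explicit overwrite succeeds
```

Treating the placeholder values as always replaceable keeps the first real assignment free of extra arguments, while later changes require an explicit overwrite = TRUE.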

Code of Conduct

Please note that the dataset package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

The rOpenSci Community Contributing Guide, a guide to help people find ways to contribute to rOpenSci, also applies, because dataset is under software review for potential inclusion in rOpenSci.


  1. RDF Data Cube Vocabulary, W3C Recommendation, 16 January 2014, https://www.w3.org/TR/vocab-data-cube/; Introduction to SDMX data modeling, https://www.unescap.org/sites/default/files/Session_4_SDMX_Data_Modeling_%20Intro_UNSD_WS_National_SDG_10-13Sep2019.pdf