The primary aim of dataset is to create well-referenced, well-described, interoperable datasets from data.frames, tibbles or data.tables that translate well into the W3C DataSet definition within the Data Cube Vocabulary, in a reproducible manner. The data cube model itself originated in the Statistical Data and Metadata eXchange (SDMX), and it is almost fully harmonized with the Resource Description Framework (RDF), the standard model for data interchange on the web [1].
The development version of the dataset package differs substantially from the CRAN release, and the documentation has not been rewritten yet. You can follow the discussion of this package on rOpenSci.
library(dataset)
iris_ds <- dataset(
  x = iris,
  title = "Iris Dataset",
  author = person("Edgar", "Anderson", role = "aut"),
  publisher = "American Iris Society",
  source = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  date = 1935,
  language = "en",
  description = "This famous (Fisher's or Anderson's) iris data set."
)
A title and an author are mandatory for a dataset. If the date is not specified, the current date is added.
Because the dataset has only just been created and is not yet published, the identifier receives the default :tba (to be announced) value, the version is set to 0.1.0, and the publisher field defaults to :unas (unassigned).
The dataset behaves as expected, with all data.frame methods applicable. If the dataset was originally a tibble or data.table object, it retains all methods of these S3 classes, because the dataset class only adds further methods that work on the attributes of the original object.
summary(iris_ds)
#> Anderson E (2023). "Iris Dataset."
#> Further metadata: describe(iris_ds)
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
#>        Species
#>  setosa    :50
#>  versicolor:50
#>  virginica :50
#>
#>
#>
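The reason data.frame methods keep working can be illustrated with a small base R sketch. This is not the implementation used by dataset; the helper `as_annotated()` and the class name "annotated" are hypothetical and only show the general idea of storing metadata in attributes while keeping the original classes.

```r
# Hypothetical helper, NOT part of the dataset package: it stores a title
# in an attribute and prepends a new class to the existing class vector.
as_annotated <- function(x, title) {
  stopifnot(is.data.frame(x))
  attr(x, "title") <- title             # metadata lives in an attribute
  class(x) <- c("annotated", class(x))  # the original classes are kept
  x
}

iris_annotated <- as_annotated(iris, title = "Iris Dataset")

inherits(iris_annotated, "data.frame")  # the data.frame class is retained
dim(iris_annotated)                     # data.frame methods still dispatch
attr(iris_annotated, "title")           # the metadata travels with the object
```

Because the new class is only prepended, any method that is not defined for it falls through to the methods of the underlying classes.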
A brief description of the extended metadata attributes:
describe(iris_ds)
#> Iris Dataset
#> Dataset with 150 observations (rows) and 5 variables (columns).
#> Description: This famous (Fisher's or Anderson's) iris data set.
#> Creator: Edgar Anderson [aut]
#> Publisher: American Iris Society
paste0("Publisher:", publisher(iris_ds))
#> [1] "Publisher:American Iris Society"
paste0("Rights:", rights(iris_ds))
#> [1] "Rights::unas"
The descriptive metadata are added to a utils::bibentry object, which has many printing options (see ?bibentry).
mybibentry <- dataset_bibentry(iris_ds)
print(mybibentry, "text")
#> Anderson E (2023). "Iris Dataset."
print(mybibentry, "Bibtex")
#> @Misc{,
#> title = {Iris Dataset},
#> author = {Edgar Anderson},
#> publisher = {American Iris Society},
#> year = {2023},
#> resourcetype = {Dataset},
#> identifier = {:tba},
#> version = {0.1.0},
#> description = {This famous (Fisher's or Anderson's) iris data set.},
#> language = {en},
#> format = {application/r-rds},
#> rights = {:unas},
#> }
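For readers who have not used utils::bibentry before, the following base R sketch builds a comparable entry by hand, without the dataset package. The field values are copied from the example above; choosing the "Misc" entry type and passing `publisher` as an extra field are assumptions made for illustration and need not match what dataset_bibentry() does internally.

```r
# A hand-made bibentry using only base R (utils is attached by default).
# Assumption: "Misc" is used here because it has no required fields.
manual_entry <- utils::bibentry(
  bibtype = "Misc",
  title = "Iris Dataset",
  author = person("Edgar", "Anderson", role = "aut"),
  year = "1935",
  publisher = "American Iris Society"
)

# format() returns the citation as a character string in the chosen style.
entry_text <- format(manual_entry, style = "text")

# toBibtex() returns the BibTeX representation as a character vector.
entry_bibtex <- toBibtex(manual_entry)

print(manual_entry, style = "text")
print(entry_bibtex)
```

The same object can be printed in several other styles; ?bibentry lists them.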
rights(iris_ds) <- "CC0"
rights(iris_ds)
#> [1] "CC0"
rights(iris_ds, overwrite = FALSE) <- "GNU-2"
#> The dataset has already a rights field: CC0
Some important metadata is protected from accidental overwriting, except where the field still holds one of the default values :unas (unassigned) or :tba (to be announced).
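The overwrite protection described above can be sketched with a base R replacement function. The function `set_rights<-` below is hypothetical and is not the package's `rights<-`; its default of `overwrite = TRUE` is an assumption. It only illustrates the pattern of refusing to replace a non-default value when overwrite = FALSE.

```r
# Hypothetical replacement function, NOT the dataset package's `rights<-`.
# It writes to a "rights" attribute unless a real value is already present
# and the caller asked not to overwrite it.
`set_rights<-` <- function(x, overwrite = TRUE, value) {
  current <- attr(x, "rights")
  # A missing value or one of the placeholder defaults may always be replaced.
  is_default <- is.null(current) || current %in% c(":unas", ":tba")
  if (is_default || overwrite) {
    attr(x, "rights") <- value
  } else {
    message("The dataset already has a rights field: ", current)
  }
  x
}

df <- data.frame(a = 1:3)
set_rights(df) <- "CC0"                       # no value yet, so it is set
set_rights(df, overwrite = FALSE) <- "GNU-2"  # refused, "CC0" is kept
attr(df, "rights")
```

Routing the check through the replacement function keeps the guard in one place, so every assignment goes through the same rule.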
Please note that the dataset package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Furthermore, the rOpenSci Community Contributing Guide, a guide to help people find ways to contribute to rOpenSci, also applies, because dataset is under software review for potential inclusion in rOpenSci.
[1] RDF Data Cube Vocabulary, W3C Recommendation, 16 January 2014, https://www.w3.org/TR/vocab-data-cube/; Introduction to SDMX data modeling, https://www.unescap.org/sites/default/files/Session_4_SDMX_Data_Modeling_%20Intro_UNSD_WS_National_SDG_10-13Sep2019.pdf