In the R language, datasets are usually contained in a data.frame() object, or in one of its modernized versions. For example, tibble::tibble() and data.table::data.table() both inherit from the base data.frame().
This documentation has not yet been updated for the development version of the [dataset] package.
The base data.frame() constructor, like most base R types, is very flexible. It allows the use of any kind of metadata attached to the object.
For reproducible research, publication, or linking resources on the
web, the standardization of metadata is critically important. The
dataset() class is a modernized data.frame designed to meet this need.
According to the RDF Data Cube Vocabulary, a DataSet is a collection of statistical data that corresponds to a defined structure. The data in a data set can be roughly described as belonging to one of the following kinds:
Observations: these are the measured values, and the cells of a data frame object in R.
Organizational structure: To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation. Datasets can have additional organizational structure in the form of slices as described in section 7.2.
Structural metadata: Metadata to interpret the data. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated? These metadata are provided as attributes and can be attached to individual observations, or to higher levels.
Reference metadata: Metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, or an endpoint where it can be accessed.
| Datacube component | Where it lives in the R object |
|---|---|
| dimensions | first column section of the dataset |
| measurements | second column section of the dataset |
| attributes | third column section of the dataset |
| reference | attributes of the R object |
rd_e_gerdtot <- eurostat::get_eurostat('rd_e_gerdtot')
head(rd_e_gerdtot)
#> # A tibble: 6 × 5
#>   sectperf unit    geo   time       values
#>   <chr>    <chr>   <chr> <date>      <dbl>
#> 1 BES      EUR_HAB AT    2022-01-01 1098.
#> 2 BES      EUR_HAB BE    2022-01-01 1195.
#> 3 BES      EUR_HAB BG    2022-01-01   64.2
#> 4 BES      EUR_HAB CY    2022-01-01   96.2
#> 5 BES      EUR_HAB CZ    2022-01-01  331.
#> 6 BES      EUR_HAB DE    2022-01-01  983.
Dimensions are usually needed in data analysis because
they are used for subsetting (slicing) the dataset. They typically contain
information about the reference time period and geographical area.
In a dataset that has homogeneous dimensions (all data relate to the year 2022 and the area of the United States), you could move the dimensions into the attributes of the R object, or simply omit them. However, dimensions are critically important for filtering out the observations (measurements) that you want to work with or to correctly join (integrate) datasets. If you want to create a composite indicator from two datasets that are related to the United States and the year 2022, you do not want to match measurements about 2021 or Canada accidentally.
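The role of dimensions in slicing and joining can be sketched in base R. This is only an illustration: the two data frames, their column names (geo, time, value), and all numbers are invented for the example and do not come from any real source.

```r
# Two hypothetical long-form datasets that share the 'geo' and 'time' dimensions.
# All values are made up for illustration.
gdp <- data.frame(
  geo   = c("US", "US", "CA", "CA"),
  time  = c(2021L, 2022L, 2021L, 2022L),
  value = c(23.3, 25.4, 2.0, 2.1)
)
population <- data.frame(
  geo   = c("US", "US", "CA", "CA"),
  time  = c(2021L, 2022L, 2021L, 2022L),
  value = c(332, 333, 38, 39)
)

# Slice both datasets on the same dimension values.
gdp_us_2022 <- subset(gdp, geo == "US" & time == 2022L)
pop_us_2022 <- subset(population, geo == "US" & time == 2022L)

# Join on the dimensions, so only observations about the same
# area and period are combined into the composite indicator.
joined <- merge(gdp_us_2022, pop_us_2022,
                by = c("geo", "time"),
                suffixes = c("_gdp", "_pop"))
joined$gdp_per_capita <- joined$value_gdp / joined$value_pop
```

Because the join key is the full set of dimensions, a 2021 or Canadian measurement cannot end up in the composite indicator by accident.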
The measurements are the actual observed values. In a long-form tidy dataset you usually have only one measurement column: ‘value’.
Attributes are similar to dimensions, but they can be fully static and constant in a dataset. You may have measurements for the same reference area and time available in both kilograms and tons in the same dataset, in which case you will likely filter on the correct unit of measure when you do analytical work or join (integrate) the data.
If your measurement unit is always centimeters (like in the iris dataset), it is tempting to treat this as a dataset-wide constant (and therefore move it to the attributes of the data frame R object), but we do not recommend this approach. Imagine that you want to join this dataset with some other data that is measured in millimeters or inches, or a dataset that has values in both millimeters and centimeters. To correctly match your data, you will be filtering on attributes, too.
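A small base R sketch shows why keeping the unit in a column helps. The data frames, the specimen identifiers, and the unit codes ("MM", "CM") are hypothetical and chosen only for this example.

```r
# A hypothetical dataset that reports the same lengths in two units.
measurements <- data.frame(
  specimen = c("s1", "s1", "s2", "s2"),
  value    = c(51, 5.1, 49, 4.9),
  unit     = c("MM", "CM", "MM", "CM")
)

# A second hypothetical dataset, recorded only in centimeters.
other <- data.frame(
  specimen = c("s1", "s2"),
  value    = c(3.5, 3.0),
  unit     = c("CM", "CM")
)

# Filter on the unit attribute first, then join on it as well,
# so values in different units are never matched to each other.
measurements_cm <- subset(measurements, unit == "CM")
matched <- merge(measurements_cm, other,
                 by = c("specimen", "unit"),
                 suffixes = c("_length", "_width"))
```

If the unit had been moved into the attributes of the R object, the filter and the join key above would not be available.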
Attributes that may vary across observations (rows) should remain in
the dataset in the datacube model. To avoid confusion with the base R
attributes() function, we named the function that sets the
attributes within a dataset to
The dataset R package aims to increase the
Findability, Accessibility, Interoperability, and
Reuse of digital assets, particularly datacubes and datasets
used in statistics and data analysis. The FAIR principles
“…emphasize machine-actionability (i.e., the capacity of computational
systems to find, access, interoperate, and reuse data with none or
minimal human intervention) because humans increasingly rely on
computational support to deal with data as a result of the increase in
volume, complexity, and creation speed of data.”
This is the role of the
reference metadata in the RDF Data Cube
Vocabulary and the SDMX data cube model. We generally keep the
reference metadata as
attributes() of the R object, because
they do not relate to the rows (observations) of the data, but the
entire set of data. However, omitting all reference metadata from the
columns is not a good practice if you want your data to be used in a
knowledge graph (or the semantic web, or Web 3.0).
library(dataset)  # provides datacite() and subject_create()

# Look up the Eurostat table of contents entry for this dataset
toc <- eurostat::get_eurostat_toc()
rd_e_gerdtot_reference <- toc[which(toc$code == "rd_e_gerdtot"), ]

# Describe the dataset as a whole with DataCite-style reference metadata
rd_e_gerdtot_bibentry <- datacite(
  Title = 'GERD by sector of performance',
  Creator = person("Daniel", "Antal"),
  Identifier = 'eurostat_rd_e_gerdtot',
  Publisher = 'Eurostat',
  PublicationYear = substr(rd_e_gerdtot_reference$`last update of data`, 7, 11),
  Subject = subject_create("Research",
                           subjectScheme = "LC Subject Headings",
                           schemeURI = "http://id.loc.gov/authorities/subjects",
                           valueURI = "http://id.loc.gov/authorities/subjects/sh85113021"),
  Language = "English")
Bibtex (????). "list(title = "GERD by sector of performance", author = list(list(given = "Daniel", family = "Antal", role = NULL, email = NULL, comment = NULL)), identifier = "eurostat_rd_e_gerdtot", publisher = "Eurostat", year = "2023", date = ":tba", language = "English", alternateidentifier = ":unas", relatedidentifier = ":unas", format = ":tba", rights = ":tba", geolocation = ":unas", fundingreference = ":unas")."
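The idea of keeping reference metadata as attributes of the R object can also be sketched with base R alone. This is not how the dataset package stores its metadata internally. It only illustrates the mechanism with attr(), and the attribute names title and publisher are chosen here for illustration.

```r
# A small data frame with two of the observations shown earlier.
gerd <- data.frame(
  geo    = c("AT", "BE"),
  values = c(1098, 1195)
)

# Attach reference metadata to the object as a whole, not to its rows.
# The attribute names are illustrative assumptions.
attr(gerd, "title")     <- "GERD by sector of performance"
attr(gerd, "publisher") <- "Eurostat"

# The metadata travels with the object and can be read back.
attr(gerd, "publisher")
names(attributes(gerd))
```

The metadata describes the entire set of data, so it adds no columns and does not change the number of rows.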
Following the datacube model, our datasets are data frames with
clearly defined dimensions (
sex), measurements (
value), and attributes
status). In this
example, all dimensions and values follow the SDMX attribute
definition, i.e., they have a standardized, natural-language-independent
codelist. (To use these codelists, use the statcodelist data
R objects that inherit from the base
data.frame() have row
(observation) identifiers as
row.names() attributes. This
works well if you work with a single data frame, but this approach is
not sufficient to identify observations if you work with several data
frames and you want to organize them into a database, join them into
new tables, or make them available on a knowledge graph.
When joining data tables or working in a relational database, you need unique identifiers for each unique observation unit in your system. If you want to broaden the usability of your data to the entire semantic web, and use it as linked data, you need a truly unique identifier (URI) for each observation.
We recommend the use of an explicit row identifier. Popular
modern R data frames, such as
data.table::data.table(), use row identifiers.
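An explicit row identifier can be added as an ordinary first column. The sketch below uses base R. The column name rowid and the "obs" prefix are assumptions made for illustration, not names prescribed by the package.

```r
# Three observations taken from the example output shown earlier.
obs <- data.frame(
  geo    = c("AT", "BE", "BG"),
  values = c(1098, 1195, 64.2)
)

# Add an explicit identifier column instead of relying on row.names().
# 'rowid' and the 'obs' prefix are illustrative choices.
obs <- data.frame(
  rowid = paste0("obs", seq_len(nrow(obs))),
  obs
)

# The identifier must be unique within the dataset.
stopifnot(!anyDuplicated(obs$rowid))
```

Unlike row.names(), the identifier is a regular column, so it survives joins and exports to other formats.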
One of the advantages of using an explicit row identifier is that it can form the root for minting a URI for the entire dataset by collapsing all dimensions and attributes into a concatenated string starting with the row identifier. This will make your dataset ready to be used in triples, a strict, tidy, three-column long-form dataset used in linked open data applications. As mentioned earlier, in homogeneous (or homogeneously subsetted) datasets, you could move the dimensions and the attributes out of the data frame cells into the descriptive attributes. However, if you want to work with linked data, you must have all structural information present in the data cells, because this makes it possible to link the data of different publishers without a utopian, global database map.
In the following example, we concatenate the
into a single URI. We can do this because in a well-organized dataset
the combination of dimensions is unique (otherwise, we would simply be
duplicating an observation). However, adding the attributes to
the URI would be superfluous because their combination is not unique in
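A minimal base R sketch of this concatenation follows. It assumes that sectperf, geo, and time are the dimensions of the small sample below and that unit is an attribute. The base URI under example.org is hypothetical.

```r
# A small sample modelled on the Eurostat output shown earlier.
obs <- data.frame(
  sectperf = c("BES", "BES", "BES"),
  unit     = c("EUR_HAB", "EUR_HAB", "EUR_HAB"),
  geo      = c("AT", "BE", "BG"),
  time     = as.Date(c("2022-01-01", "2022-01-01", "2022-01-01")),
  values   = c(1098, 1195, 64.2)
)

# Assumption: these columns are the dimensions; 'unit' is an attribute.
dimension_cols <- c("sectperf", "geo", "time")

# Hypothetical namespace for the observations of this dataset.
base_uri <- "https://example.org/dataset/rd_e_gerdtot/"

# Collapse the dimension values of each row into one key.
key_parts <- unname(lapply(obs[dimension_cols], as.character))
keys <- do.call(paste, c(key_parts, sep = "_"))

# The combination of dimensions has to be unique to serve as an identifier.
stopifnot(!anyDuplicated(keys))

obs$uri <- paste0(base_uri, keys)
```

The uniqueness check fails early if the chosen columns do not identify each observation, which usually means a dimension is missing from dimension_cols.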
The From dataset To RDF vignette article shows you how to organize your data into strict, tidy, three-column triples that can be serialized into RDF data.
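As a rough illustration of the three-column shape only, and not of the package's own conversion functions, a data frame with an explicit row identifier can be reshaped in base R. The column names s, p, and o (subject, predicate, object) and the rowid column are assumptions made for this sketch.

```r
# Two observations with an explicit, illustrative row identifier.
obs <- data.frame(
  rowid  = c("obs1", "obs2"),
  geo    = c("AT", "BE"),
  values = c(1098, 1195)
)

# One block of triples per column: the row identifier is the subject,
# the column name is the predicate, and the cell value is the object.
value_cols <- setdiff(names(obs), "rowid")
triples <- do.call(rbind, lapply(value_cols, function(col) {
  data.frame(
    s = obs$rowid,
    p = col,
    o = as.character(obs[[col]])
  )
}))
```

In real linked data, the subject, the predicate, and often the object would be URIs or typed literals rather than the plain strings used here.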