readr 2.0.1

readr 2.0.0

second edition changes

readr 2.0.0 is a major release of readr and introduces a new second edition parsing and writing engine implemented via the vroom package.

This engine takes advantage of lazy reading, multi-threading and performance characteristics of modern SSD drives to significantly improve the performance of reading and writing compared to the first edition engine.

We will continue to support the first edition for a number of releases, but eventually this support will be first deprecated and then removed.

You can use the with_edition() or local_edition() functions to temporarily change the edition of readr for a section of code.

e.g.

Lazy reading

Edition two uses lazy reading by default. When you first call a read_*() function the delimiters and newlines throughout the entire file are found, but the data is not actually read until it is used in your program. This can provide substantial speed improvements for reading character data. It is particularly useful during interactive exploration of only a subset of a full dataset.

However this also means that problematic values are not necessarily seen immediately, only when they are actually read. Because of this a warning will be issued the first time a problem is encountered, which may happen after initial reading.

Run problems() on your dataset to read the entire dataset and return all of the problems found. Run problems(lazy = TRUE) if you only want to retrieve the problems found so far.

Deleting files after reading is also impacted by laziness. On Windows open files cannot be deleted as long as a process has the file open. Because readr keeps a file open when reading lazily this means you cannot read, then immediately delete the file. readr will in most cases close the file once it has been completely read. However, if you know you want to be able to delete the file after reading it is best to pass lazy = FALSE when reading the file.

Reading multiple files at once

Edition two has built-in support for reading sets of files with the same columns into one output table in a single command. Just pass the filenames to be read in the same vector to the reading function.

First we generate some files to read by splitting the nycflights dataset by airline.

library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ { .x$carrier[[1]]; vroom::vroom_write(.x, glue::glue("flights_{.y}.tsv"), delim = "\t") }
)

Then we can efficiently read them into one tibble by passing the filenames directly to readr.

files <- fs::dir_ls(glob = "flights*tsv")
files
readr::read_tsv(files)

If the filenames contain data, such as the date when the sample was collected, use id argument to include the paths as a column in the data. You will likely have to post-process the paths to keep only the relevant portion for your use case.

Delimiter guessing

Edition two supports automatic guessing of delimiters. Because of this you can now use read_delim() without specifying a delim argument in many cases.

x <- read_delim(readr_example("mtcars.csv"))

Literal data

In edition one the reading functions treated any input with a newline in it or vectors of length > 1 as literal data. In edition two vectors of length > 1 are now assumed to correspond to multiple files. Because of this we now have a more explicit way to represent literal data, by putting I() around the input.

readr::read_csv(I("a,b\n1,2"))

License changes

We are systematically re-licensing tidyverse and r-lib packages to use the MIT license, to make our package licenses as clear and permissive as possible.

To this end the readr and vroom packages are now released under the MIT license.

Deprecated or superseded functions and features

Other second edition changes

Additional features and fixes

readr 1.4.0

Breaking changes

New features

Additional features and fixes

readr 1.3.1

readr 1.3.0

Breaking Changes

Blank line skipping

readr’s blank line skipping has been modified to be more consistent and to avoid edge cases that affected the behavior in 1.2.0. The skip parameter now behaves more similar to how it worked previous to readr 1.2.0, but in addition the parameter skip_blank_rows can be used to control if fully blank lines are skipped. (#923)

tibble data frame subclass

readr 1.3.0 returns results with a spec_tbl_df subclass. This differs from a regular tibble only that the spec attribute (which holds the column specification) is lost as soon as the object is subset (and a normal tbl_df object is returned).

Historically tbl_df’s lost their attributes once they were subset. However recent versions of tibble retain the attributes when subetting, so the spec_tbl_df subclass is needed to ensure the previous behavior.

This should only break compatibility if you are explicitly checking the class of the returned object. A way to get backwards compatible behavior is to call subset with no arguments on your object, e.g. x[].

Bugfixes

readr 1.2.1

This release skips the clipboard tests on CRAN servers

readr 1.2.0

Breaking Changes

Integer column guessing

readr functions no longer guess columns are of type integer, instead these columns are guessed as numeric. Because R uses 32 bit integers and 64 bit doubles all integers can be stored in doubles, guaranteeing no loss of information. This change was made to remove errors when numeric columns were incorrectly guessed as integers. If you know a certain column is an integer and would like to read them as such you can do so by specifying the column type explicitly with the col_types argument.

Blank line skipping

readr now always skips blank lines automatically when parsing, which may change the number of lines you need to pass to the skip parameter. For instance if your file had a one blank line then two more lines you want to skip previously you would pass skip = 3, now you only need to pass skip = 2.

New features

Melt functions

There is now a family of melt_*() functions in readr. These functions store data in ‘long’ or ‘melted’ form, where each row corresponds to a single value in the dataset. This form is useful when your data is ragged and not rectangular.

data <-"a,b,c
1,2
w,x,y,z"

readr::melt_csv(data)
#> # A tibble: 9 x 4
#>     row   col data_type value
#>   <dbl> <dbl> <chr>     <chr>
#> 1     1     1 character a    
#> 2     1     2 character b    
#> 3     1     3 character c    
#> 4     2     1 integer   1    
#> 5     2     2 integer   2    
#> 6     3     1 character w    
#> 7     3     2 character x    
#> 8     3     3 character y    
#> 9     3     4 character z

Thanks to Duncan Garmonsway (@nacnudus) for great work on the idea an implementation of the melt_*() functions!

Connection improvements

readr 1.2.0 changes how R connections are parsed by readr. In previous versions of readr the connections were read into an in-memory raw vector, then passed to the readr functions. This made reading connections from small to medium datasets fast, but also meant that the dataset had to fit into memory at least twice (once for the raw data, once for the parsed data). It also meant that reading could not begin until the full vector was read through the connection.

Now we instead write the connection to a temporary file (in the R temporary directory), than parse that temporary file. This means connections may take a little longer to be read, but also means they will no longer need to fit into memory. It also allows the use of the chunked readers to process the data in parts.

Future improvements to readr would allow it to parse data from connections in a streaming fashion, which would avoid many of the drawbacks of either method.

Additional new features

Bug Fixes

readr 1.1.1

readr 1.1.0

New features

Parser improvements

Whitespace / fixed width improvements

Writing to connections

Miscellaneous features

Bugfixes

readr 1.0.0

Column guessing

The process by which readr guesses the types of columns has received a substantial overhaul to make it easier to fix problems when the initial guesses aren’t correct, and to make it easier to generate reproducible code. Now column specifications are printing by default when you read from a file:

challenge <- read_csv(readr_example("challenge.csv"))
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )

And you can extract those values after the fact with spec():

spec(challenge)
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )

This makes it easier to quickly identify parsing problems and fix them (#314). If the column specification is long, the new cols_condense() is used to condense the spec by identifying the most common type and setting it as the default. This is particularly useful when only a handful of columns have a different type (#466).

You can also generating an initial specification without parsing the file using spec_csv(), spec_tsv(), etc.

Once you have figured out the correct column types for a file, it’s often useful to make the parsing strict. You can do this either by copying and pasting the printed output, or for very long specs, saving the spec to disk with write_rds(). In production scripts, combine this with stop_for_problems() (#465): if the input data changes form, you’ll fail fast with an error.

You can now also adjust the number of rows that readr uses to guess the column types with guess_max:

challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1500)
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )

You can now access the guessing algorithm from R. guess_parser() will tell you which parser readr will select for a character vector (#377). We’ve made a number of fixes to the guessing algorithm:

We have made a number of improvements to the reification of the col_types, col_names and the actual data:

Column parsing

The date time parsers recognise three new format strings:

%y and %Y are now strict and require 2 or 4 characters respectively.

Date and time parsing functions received a number of small enhancements:

parse_number() is slightly more flexible - it now parses numbers up to the first ill-formed character. For example parse_number("-3-") and parse_number("...3...") now return -3 and 3 respectively. We also fixed a major bug where parsing negative numbers yielded positive values (#308).

parse_logical() now accepts 0, 1 as well as lowercase t, f, true, false.

New readers and writers

Minor features and bug fixes

readr 0.2.2

readr 0.2.1

readr 0.2.0

Internationalisation

readr now has a strategy for dealing with settings that vary from place to place: locales. The default locale is still US centric (because R itself is), but you can now easily override the default timezone, decimal separator, grouping mark, day & month names, date format, and encoding. This has lead to a number of changes:

See vignette("locales") for more details.

File parsing improvements

Column parsing improvements

Readr gains vignette("column-types") which describes how the defaults work and how to override them (#122).

As well as improvements to the parser, I’ve also made a number of tweaks to the heuristics that readr uses to guess column types:

Minor improvements and bug fixes