Interactive and reproducible data cleaning

Launches the datacleanr app for interactive and reproducible cleaning. See Details for more information.

dcr_app(dframe, browser = TRUE)

Arguments

dframe: Character, a string naming a data.frame, tbl or data.table in the environment or a path to a .Rds file. Note that data.tables are converted to tibbles internally.
browser: logical, should app start in OS's default browser? (default TRUE)

Value

When datacleanr is ended by clicking on Close in the app's navigation bar, a list is invisibly returned with the following items:

df_name: character, object name/file path passed into dcr_app
dcr_df: tibble, filtered data set with additional columns .dcrkey, .dcrindex, .annotation - the latter is NA for non-outliers, an empty string for outliers without annotation, and a custom string for annotated outliers
dcr_selected_outliers: data.frame, contains the outlier .dcrkey, the .annotation and a selection_count (integer, count incrementer) column
dcr_groups: character, a vector defining the groups (via group_by) used throughout datacleanr
dcr_condition_df: tibble, with columns filter (character, statement used for filtering) and group (list, of integers), defining groups that correspond to .dcrindex
dcr_code: character string, containing Reproducible Recipe
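
As a rough illustration of how the returned list might be used, the sketch below builds a small hypothetical stand-in for the dcr_df item (the values are invented for illustration and are not app output) and separates outliers from retained observations using the .annotation convention described above.

```r
# Hypothetical stand-in for the `dcr_df` item; values are made up for
# illustration and are not produced by datacleanr.
dcr_df <- data.frame(
  value       = c(1.2, 9.9, 1.4, 8.7),
  .dcrkey     = 1:4,
  .dcrindex   = c(1L, 1L, 2L, 2L),
  .annotation = c(NA, "", NA, "sensor drift"),
  stringsAsFactors = FALSE
)

# Non-outliers carry NA in .annotation
kept <- dcr_df[is.na(dcr_df$.annotation), ]

# Outliers carry a string: empty if unannotated, custom text otherwise
outliers  <- dcr_df[!is.na(dcr_df$.annotation), ]
annotated <- outliers[nzchar(outliers$.annotation), ]
```

With a real result, the same subsetting would presumably be applied to the dcr_df item of the list returned by dcr_app.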

Details

datacleanr provides an interactive data overview, and allows reproducible filtering and (manual, interactive) visual outlier detection and annotation across multiple app tabs:

Overview and Set-up: set groups (see below) and generate an exploratory summary of dframe
Filtering: Provide and apply filter statements (groupwise, see below and filter_scoped_df)
Visualization and Annotating: interactive visualization allowing outlier highlighting, annotating and before/after histograms of displayed (numeric) variables
Extraction: generates Reproducible Recipe and outputs

For data sets exceeding 1.5 million rows, we suggest splitting the data, if possible, by a grouping factor. This is because at this volume interactive visualizations using plotly stretch the limits of what modern web browsers can handle. A simple example using iris is:

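# split iris into one data.frame per species
iris_split <- split(iris, iris$Species)
# launch the app on a single subset
dcr_app(iris_split[[1]])
# or launch the app for each subset
lapply(iris_split, dcr_app)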

Extensive documentation for individual procedures is provided via help links on each of the tabs. datacleanr relies on 1) generating a column of unique IDs (.dcrkey) and 2) subsetting dframe into sub-groups (generated in-app, added as column .dcrindex) for filtering and visualization. These groups are composed of unique combinations of columns in the data set (which must be factors), are passed to group_by, and are carried through the app for exploratory analyses (tab Overview and Set-up), filtering (tab Filtering) and plotting (tab Visualization). Ideally, the groups should be chosen to facilitate a convenient filtering and viewing/cleaning process. For example, a data set with time series of multiple sensors could be grouped by sensor and/or additional columns, such that periods of interest can be visualized and cleaned simultaneously in the interactive plot.
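
Because grouping columns must be factors, it can help to convert them before launching the app. The following is a minimal sketch in base R; the data frame and its column names (sensor, site, temp) are hypothetical and only serve to illustrate the preparation step.

```r
# Hypothetical sensor data; column names are illustrative only
sensors <- data.frame(
  sensor = c("a", "a", "b", "b"),
  site   = c("north", "north", "south", "south"),
  temp   = c(20.1, 20.4, 18.9, 35.0),
  stringsAsFactors = FALSE
)

# Grouping columns must be factors, so convert character columns first
group_cols <- c("sensor", "site")
sensors[group_cols] <- lapply(sensors[group_cols], factor)

# Number of unique group combinations present in the data
n_groups <- nrow(unique(sensors[group_cols]))
```

The prepared data frame could then be passed to dcr_app, with the grouping columns selected in the Overview and Set-up tab.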

Filtering is achieved by providing expressions that evaluate to TRUE or FALSE. These can be applied to the entire data set, or to individual or all groups via scoped filtering (see filter_scoped_df).
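
To make the idea of a filter statement and its scope more concrete, here is a base R analogy using iris. It is not how filter_scoped_df is implemented; in particular, the assumption that rows outside the selected group are left untouched is an interpretation of "scoped" for this sketch.

```r
# A filter statement is any expression yielding one logical value per row
keep_all <- with(iris, Sepal.Length > 5)

# The same statement scoped to a single group (assumed semantics): rows
# outside the chosen group are kept untouched, rows inside it must
# satisfy the statement
in_group    <- iris$Species == "setosa"
keep_scoped <- !in_group | keep_all

filtered_all    <- iris[keep_all, ]
filtered_scoped <- iris[keep_scoped, ]
```

In this sketch the scoped result keeps every versicolor and virginica row, while the unscoped result applies the condition to all species.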

The interactive visualization allows selecting and deselecting points with the lasso and box select tools, interactive zooming (via the toolbar, or by clicking on legend items or the group overview table; see the in-app tab), and panning (via the toolbar, or by hovering over the plot's axes). Supported data formats are:

Observational (numeric), time series (POSIXct) and categorical data in x and y dimensions/axes
Observational (numeric) data in z dimension (point size)
Spatial data, when lon and lat in decimal degrees are present in x and y.

Displaying spatial data requires a Mapbox account, from which an access token needs to be copied into your .Renviron (e.g. MAPBOX_TOKEN=your_copied_token).
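
Before plotting spatial data, it may be worth checking that the token is visible to the current R session. A small sketch using base R only; it checks the MAPBOX_TOKEN variable named above and does not validate the token itself.

```r
# Check whether a Mapbox token is visible to the current R session.
# MAPBOX_TOKEN is the variable name given in the text above.
has_token <- nzchar(Sys.getenv("MAPBOX_TOKEN"))
if (!has_token) {
  message("MAPBOX_TOKEN is not set; add it to .Renviron and restart R.")
}
```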

Note that when a column .dcrflag (logical, TRUE or FALSE) is present in dframe, the respective observations are given contrasting symbols (FALSE = circle, TRUE = star-triangle). This column serves as a cross-referencing tool, e.g., for other outlier detection or data-processing algorithms that were applied beforehand.
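
One way such a column might be created is sketched below. The 1.5 * IQR rule and the helper flag_iqr are hypothetical stand-ins for whatever detection step was applied beforehand; only the column name .dcrflag and its logical type come from the description above.

```r
# Flag observations outside 1.5 * IQR of Sepal.Width within each species.
# The IQR rule is only an example of a prior outlier-detection step.
flag_iqr <- function(x) {
  q     <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

iris_flagged <- iris
# ave() applies flag_iqr per species and returns numeric 0/1 values,
# which are converted back to logical here
iris_flagged$.dcrflag <- as.logical(
  ave(iris$Sepal.Width, iris$Species, FUN = flag_iqr)
)
```

Passing iris_flagged to dcr_app should then show the flagged observations with the contrasting symbol.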

The tab Extraction provides code to reproduce the entire procedure (a Reproducible Recipe), which

can be copied, or sent directly to an active RStudio script when used interactively (i.e. when dframe is an object in R's environment),
can be saved to disk with intermediate outputs (filter statements and selected outliers), where file names are based on the input file and configurable suffixes when dframe is a path.