Data cleaning and outlier detection

This function launches a Shiny application that (1) visualizes raw and outlier-free time series interactively (using plotly), (2) highlights automatically detected outliers, (3) allows the user to revise the automatically detected outliers and manually select additional data points as outliers, and (4) exports the original data, the automatically selected outliers, the manually selected outliers, and the outlier-free time series in an is.trex-compliant object that can be further processed.

outlier()

Value

The function does not return a value, but allows the user to save a list containing the raw and outlier-free data, as well as the automatically and manually selected outliers, in separate items. Once the user is satisfied with the selected outliers, the ‘Download Cleaned Time Series’ button exports this list as an ".Rds" file. The file can subsequently be read and assigned to an object using readRDS. The list contained in this file is called trex_outlier_output and holds four data.frames: series_input with the raw data, select_auto with the automatically selected outliers, select_manual with the manually selected outliers, and series_cleaned with the outlier-free time series. Each of these data.frames has a column with the timestamp and a column with the sensor values.
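As an illustration of how the saved file can be handled afterwards, the following minimal sketch builds a mock list with the four items described above, writes it to a temporary ".Rds" file, and reads it back with readRDS. The column names (timestamp, value) and the mock values are assumptions made for this sketch only; the columns in an actual export may be named differently.

```r
# Mock object mimicking the described structure (hypothetical column names and values)
ts  <- as.POSIXct("2014-02-01 00:00", tz = "GMT") + 3600 * 0:5
raw <- data.frame(timestamp = ts, value = c(10.1, 10.2, 55.0, 10.0, 9.9, 10.3))

mock_output <- list(
  series_input   = raw,        # raw data
  select_auto    = raw[3, ],   # pretend the third point was detected automatically
  select_manual  = raw[0, ],   # no manual selections in this mock
  series_cleaned = raw[-3, ]   # raw data without the flagged point
)

# Write to a temporary file and read it back, as one would with the downloaded file
rds_path <- tempfile(fileext = ".Rds")
saveRDS(mock_output, rds_path)

my_cleaned_data <- readRDS(rds_path)
names(my_cleaned_data)
str(my_cleaned_data$series_cleaned)
```

With a real export, only the readRDS call and the item names given above should be needed; inspect the column names with str() before relying on them.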

Details

Note that, due to the interactive nature of the application, the reactive graphs can become rather slow to update. We hence suggest breaking long time series into smaller chunks that do not strain the available memory too much. Trial and error is useful here, but we generally suggest working on at most one year of data at a time.

Once the application is launched, the user can load an .RData file containing a data.frame with a timestamp and sensor data (multiple sensor columns are supported). The timestamp in this data.frame should be of class POSIXct. Users can select the x and y axes of the interactive time series plots. In addition, the user can provide the units of the imported data (e.g., degrees \(C\) or \(mV\) for \(\Delta T\) or \(\Delta V\), respectively).

A parameter (alpha) for automatic outlier detection can be supplied. More specifically, the automatic identification of outliers is based on a two-step procedure: i) Tukey’s method (Tukey, 1977) is applied to detect statistical outliers as values falling outside the range \([q_{0.25} - alpha \cdot IQR,\; q_{0.75} + alpha \cdot IQR]\), where \(IQR\) is the interquartile range (\(q_{0.75} - q_{0.25}\)), \(q_{0.25}\) denotes the lower quartile (25th percentile), \(q_{0.75}\) the upper quartile (75th percentile), and alpha is a user-defined parameter (default value alpha = 3; visual inspection through the interactive plots allows for adjusting alpha and optimizing the automatic detection of outliers), and ii) the lag-1 differences of the raw data are calculated, and data points with lag-1 differences greater than the mean of the raw input time series are excluded.

The raw input data from the provided .RData file are depicted as black points in the first plot, titled ‘Raw and automatic detection’, while the automatically detected outliers are highlighted in red in the same plot. The user can adjust the parameter alpha and visually inspect the automatically detected outliers in order to achieve the optimal automatic outlier selection. This plot is also interactive: hovering the mouse over its upper right corner reveals the available interactive tools (e.g., zoom in/out). In addition, the lower subpanel of this plot provides an overview of the temporal extent of the data and allows the user to select a narrower time window for a more thorough data inspection.
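The two-step procedure described above can be sketched in a few lines of base R. This is not the code used inside the application: the function name detect_outliers_sketch is hypothetical, the default quantile definition of stats::quantile is assumed, and step ii is interpreted here as comparing the absolute lag-1 difference of each point to its predecessor with the mean of the raw series. The text above does not specify these details.

```r
# Minimal sketch of the two-step automatic detection (hypothetical helper)
detect_outliers_sketch <- function(x, alpha = 3) {
  # Step i: Tukey's fences based on the interquartile range
  q     <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr   <- q[2] - q[1]
  tukey <- x < q[1] - alpha * iqr | x > q[2] + alpha * iqr

  # Step ii (assumption): absolute lag-1 difference larger than the mean of
  # the raw series; the first point has no predecessor and is never flagged
  jump <- c(FALSE, abs(diff(x)) > mean(x, na.rm = TRUE))

  list(tukey = tukey, jump = jump, outlier = tukey | jump)
}

# Mock series with a single spike at position 5
x   <- c(10, 10.1, 9.9, 10.2, 50, 10, 9.8)
res <- detect_outliers_sketch(x, alpha = 3)

which(res$tukey)    # points outside Tukey's fences
which(res$outlier)  # points flagged by either step
```

Under these assumptions, step ii can also flag the point directly after a spike, because the drop back to normal values is as large as the jump up. Whether the application treats such points the same way is not stated above, so the plots remain the reference when tuning alpha.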

Once the user is satisfied with the automatically selected data points, they can proceed to the manual outlier selection. The second interactive plot (titled ‘Filtered and manual selection’) presents the raw data after removal of the automatically detected outliers from the previous step, and allows the user to select data points manually (point, rectangular, and lasso selections are supported). Selecting a point for the first time marks it as an outlier to be removed, and its color changes to red. Selecting the same point a second time undoes its classification as an outlier, and its color is set back to black (i.e., not an outlier). The red data points are the manually selected outliers, which are removed from the data in addition to those identified by the automatic detection.
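The toggling behaviour described above (first selection marks a point, second selection unmarks it) amounts to a symmetric difference between the current selection and the newly selected points. The following sketch illustrates that logic on row indices; the helper toggle_selection and all values are hypothetical and do not reproduce the internals of the application.

```r
# Toggle points: newly selected points are added, already selected ones are dropped
toggle_selection <- function(selected, clicked) {
  c(setdiff(selected, clicked), setdiff(clicked, selected))
}

manual <- integer(0)
manual <- toggle_selection(manual, c(3L, 7L))  # first selection marks rows 3 and 7
manual <- toggle_selection(manual, 7L)         # selecting row 7 again unmarks it

# Mock series and a mock automatic selection (hypothetical values)
values <- c(10, 10.1, 30, 9.9, 10.2, 10, 10.4, 9.8, 10.1, 10, 9.9, 60, 10.2, 10, 10.1)
auto   <- 12L

# Cleaned series: drop the automatically and the manually selected rows
cleaned <- values[!seq_along(values) %in% union(auto, manual)]
```

Logical indexing is used for the final step because negative indexing with an empty selection (values[-integer(0)]) would return an empty vector instead of the full series.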

Examples

if (FALSE) {
# find example file path
system.file("exdata", "example.RData", package = "TREX", mustWork = TRUE)
# either copy-paste this into the navigation bar of the file selection window
# or navigate here manually for selection

# launch shiny application
outlier()

# after saving the output, run e.g.:

my_cleaned_data <- readRDS("./cleaned_file.Rds")

## With full workflow:

# get an example time series
raw   <- example.data(type="doy")
input <- is.trex(raw, tz="GMT", time.format="%H:%M",
                 solar.time=TRUE, long.deg=7.7459, ref.add=FALSE, df=FALSE)

# clip a period of interest
input<-dt.steps(input,time.int=60,start="2014-02-01 00:00",
                end="2014-05-01 00:00",max.gap=180,decimals=15)

# organise a data.frame
input_df <- data.frame(date = zoo::index(input), data = zoo::coredata(input))

# save the RData file to e.g. a temp file, or your project root directory

# temp_file_path <- tempfile(fileext = ".RData")
# save(input_df, file = temp_file_path)

# project_file_path <- file.path(".", "test.RData")
# save(input_df, file = project_file_path)


# call the outlier function and navigate to where the "test.RData" is stored
outlier()


}