This function launches a Shiny application that
(1) visualizes raw and outlier-free time series interactively (using plotly),
(2) highlights automatically detected outliers,
(3) allows the user to revise the automatically detected outliers
and manually flag additional data points as outliers, and
(4) exports the original data, the automatically selected outliers,
the manually selected outliers, and the outlier-free time series
in an is.trex-compliant object that can be further processed.
outlier()
The function does not return a value,
but allows the user to save a list
containing the raw and outlier-free data,
as well as the automatically and manually selected outliers in separate items.
Once the user is satisfied with the selected outliers,
the ‘Download Cleaned Time Series’ button exports this list
as an ".Rds" file. This file can subsequently be read into an R object using readRDS.
The list contained in this file is called trex_outlier_output
and has four data.frames,
namely series_input with the raw data,
select_auto with the automatically selected outliers,
select_manual with the manually selected outliers,
and series_cleaned with the outlier-free time series.
Each of these data frames has a column with the timestamp and a column for the sensor values.
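As an illustration of the list structure described above, the following sketch builds a small mock-up of trex_outlier_output, writes it with saveRDS, and reads it back with readRDS. The column names (timestamp, value) and the example values are assumptions made for this sketch; the file produced by the application may name its columns differently.

```r
# Mock-up of the exported list (hypothetical column names and values).
ts <- as.POSIXct("2014-02-01 00:00", tz = "GMT") + 3600 * 0:5
raw <- data.frame(timestamp = ts, value = c(10, 11, 95, 12, 11, 40))

trex_outlier_output <- list(
  series_input   = raw,              # raw data
  select_auto    = raw[3, ],         # automatically selected outliers
  select_manual  = raw[6, ],         # manually selected outliers
  series_cleaned = raw[-c(3, 6), ]   # outlier-free time series
)

# Stand-in for the file written by the download button.
rds_path <- tempfile(fileext = ".Rds")
saveRDS(trex_outlier_output, rds_path)

# Read the file back and inspect its four items.
cleaned_output <- readRDS(rds_path)
names(cleaned_output)
str(cleaned_output$series_cleaned)

# Sanity check: raw rows minus removed rows equals cleaned rows.
n_removed <- nrow(cleaned_output$select_auto) + nrow(cleaned_output$select_manual)
rows_match <- nrow(cleaned_output$series_input) - n_removed ==
  nrow(cleaned_output$series_cleaned)
```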
Note that, due to the interactive nature of the application, the reactive graphs can become
rather slow to update. We hence suggest breaking long time series into smaller chunks
that do not strain the available memory too much. Trial and error is useful here, but we
generally suggest working on at most one year of data at a time.
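One possible way to prepare such chunks is sketched below: a long data.frame is split by calendar year and each chunk is saved to its own .RData file for loading into the application. The object names (long_df, input_df), the column names, and the file naming scheme are hypothetical choices for this example.

```r
# Hypothetical long series spanning three calendar years (daily values).
ts <- seq(as.POSIXct("2013-06-01 00:00", tz = "GMT"),
          as.POSIXct("2015-03-01 00:00", tz = "GMT"), by = "day")
long_df <- data.frame(timestamp = ts, value = sin(seq_along(ts) / 20))

# Split into one data.frame per calendar year.
chunks <- split(long_df, format(long_df$timestamp, "%Y"))

# Save each chunk to a separate .RData file (here in a temporary directory).
out_dir <- tempdir()
chunk_files <- vapply(names(chunks), function(yr) {
  input_df <- chunks[[yr]]
  path <- file.path(out_dir, paste0("series_", yr, ".RData"))
  save(input_df, file = path)
  path
}, character(1))

chunk_files
```

Each file can then be processed in a separate session of the application, and the cleaned series combined afterwards.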
Once the application is launched,
the user can load an .RData file containing a data.frame
with a timestamp column and sensor data (multiple sensor columns are supported).
The timestamp in this data.frame should be of class POSIXct.
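A minimal sketch of preparing such an input file is given below. It assumes the timestamps are initially stored as character strings; the column names (timestamp, sensor_1, sensor_2), the values, and the time zone are illustrative assumptions only.

```r
# Hypothetical raw table with character timestamps and two sensor columns.
raw_df <- data.frame(
  timestamp = c("2014-02-01 00:00", "2014-02-01 01:00", "2014-02-01 02:00"),
  sensor_1  = c(0.61, 0.63, 0.60),
  sensor_2  = c(0.58, 0.59, 0.57),
  stringsAsFactors = FALSE
)

# Convert the timestamp column to class POSIXct.
raw_df$timestamp <- as.POSIXct(raw_df$timestamp,
                               format = "%Y-%m-%d %H:%M", tz = "GMT")
class(raw_df$timestamp)

# Save the data.frame to an .RData file that can be selected in the application.
input_path <- tempfile(fileext = ".RData")
save(raw_df, file = input_path)
```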
Users can select the x and y axes of the interactive time series plots.
In addition, the user can provide the units of the imported data
(e.g., degrees \(C\) or \(mV\) for \(\Delta T\) or \(\Delta V\), respectively).
A parameter (alpha) for automatic outlier detection can be supplied.
More specifically, the automatic identification of outliers is based on a
two-step procedure:
i) Tukey’s method (Tukey, 1977) is applied to detect statistical outliers
as values falling outside the range
\([q_{0.25} - \mathrm{alpha} \cdot IQR,\; q_{0.75} + \mathrm{alpha} \cdot IQR]\),
where \(IQR\) is the interquartile range
(\(q_{0.75} - q_{0.25}\)),
with \(q_{0.25}\) denoting the lower quartile (25th percentile) and \(q_{0.75}\) the
upper quartile (75th percentile), and alpha is a user-defined parameter
(default value alpha = 3; visual inspection through the interactive plots allows for adjusting
alpha and optimizing the automatic detection of outliers),
and ii) the lag-1 differences of the raw data are calculated,
and data points with lag-1 differences greater
than the mean of the raw input time series are excluded.
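The two-step procedure can be sketched in a few lines of R. This is an illustration of the description above, not the code used inside the application: the function name flag_outliers is hypothetical, and the use of absolute lag-1 differences (with the later point of each pair being flagged) is an assumption, since the text does not specify these details.

```r
# Sketch of the two-step automatic detection (hypothetical helper function).
flag_outliers <- function(x, alpha = 3) {
  # Step i: Tukey fences based on the quartiles and the interquartile range.
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  tukey <- !is.na(x) & (x < q[1] - alpha * iqr | x > q[2] + alpha * iqr)

  # Step ii (assumption): absolute lag-1 differences larger than the mean
  # of the raw series; the later point of each pair is flagged.
  lag1 <- c(NA, abs(diff(x)))
  jump <- !is.na(lag1) & lag1 > mean(x, na.rm = TRUE)

  data.frame(value = x, tukey = tukey, jump = jump, outlier = tukey | jump)
}

x <- c(10, 11, 10, 12, 11, 60, 11, 10, 12, 11)
res <- flag_outliers(x, alpha = 3)
res[res$outlier, ]
```

In this toy series, the spike at position 6 is caught by both steps. Under the absolute-difference assumption, the point immediately after the spike is also flagged by step ii, which shows why visual revision of the automatic selection is useful.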
The raw input data from the provided .RData file are depicted as
black points in the first plot, titled ‘Raw and automatic detection’,
while the automatically detected outliers are highlighted in red in the same plot.
The user can adjust the parameter alpha and visually inspect the
automatically detected outliers in order to achieve the optimal automatic outlier selection.
This plot is also interactive (hovering the mouse over the upper right corner
reveals the available interactive tools, e.g., zoom in/out).
In addition, the lower subpanel of this plot provides a better overview of the temporal extent
of the data and allows the user to select a narrower time window for a more thorough data inspection.
Once the user is satisfied with the automatically selected data points, they can proceed to the manual outlier selection. The second interactive plot (titled ‘Filtered and manual selection’) presents the raw data after removing the automatically detected outliers of the previous step, and allows the user to manually select data points (point, rectangular, and lasso selections are supported). The first selection marks points as outliers to be removed, and their color changes to red. Selecting a point a second time undoes its classification as an outlier, and its color is set back to black (i.e., not an outlier). The red data points are the manually selected outliers, which are removed from the data in addition to those identified by the automatic detection.
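The select/deselect behaviour described above amounts to toggling membership in a set of flagged points. The sketch below illustrates that logic with row indices; the function name toggle_selection is hypothetical and this is not taken from the application's source.

```r
# Toggle the flagged state of the selected points (hypothetical helper):
# points not yet flagged become flagged, points already flagged are released.
toggle_selection <- function(current, selected) {
  sort(c(setdiff(current, selected), setdiff(selected, current)))
}

manual <- integer(0)                                 # no manual outliers yet
manual <- toggle_selection(manual, c(4L, 9L, 15L))   # first selection: flagged (red)
manual <- toggle_selection(manual, 9L)               # second selection of point 9: back to black
manual
```

After these two selections, points 4 and 15 remain flagged as manual outliers.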
if (FALSE) {
# find example file path
system.file("exdata", "example.RData", package = "TREX", mustWork = TRUE)
# either copy-paste this into the navigation bar of the file selection window
# or navigate here manually for selection
# launch shiny application
outlier()
# after saving the output, run e.g.:
my_cleaned_data <- readRDS("./cleaned_file.Rds")
## With full workflow:
# get an example time series
raw <- example.data(type = "doy")
input <- is.trex(raw, tz = "GMT", time.format = "%H:%M",
                 solar.time = TRUE, long.deg = 7.7459, ref.add = FALSE, df = FALSE)
# clip a period of interest
input <- dt.steps(input, time.int = 60, start = "2014-02-01 00:00",
                  end = "2014-05-01 00:00", max.gap = 180, decimals = 15)
# organise a data.frame
input_df <- data.frame(date = zoo::index(input), data = zoo::coredata(input))
# save the RData file to e.g. a temp file, or your project root directory
# temp_file_path <- tempfile(pattern = "test", fileext = ".RData")
# save(input_df, file = temp_file_path)
# project_root_path <- file.path(".", "test.RData")
# save(input_df, file = project_root_path)
# call the outlier function and navigate to where the "test.RData" is stored
outlier()
}