Title: | Toolbox for Model Building and Forecasting |
---|---|
Description: | Implements functions and instruments for regression model building and its application to forecasting. The main scope of the package is variable selection and model specification for cases of time series data. This includes promotional modelling, selection between different dynamic regressions with non-standard distributions of errors, selection based on cross validation, solutions to the fat regression model problem and more. Models developed in the package are tailored specifically for forecasting purposes. As a result, there are several methods that allow producing forecasts from these models and visualising them. |
Authors: | Ivan Svetunkov [aut, cre] (Senior Lecturer at the Centre for Marketing Analytics and Forecasting, Lancaster University, UK), Yves R. Sagaert [ctb] (Visiting Researcher at the Centre for Marketing Analytics and Forecasting, Lancaster University, UK) |
Maintainer: | Ivan Svetunkov <[email protected]> |
License: | LGPL-2.1 |
Version: | 2.0.3.41005 |
Built: | 2024-11-08 05:45:33 UTC |
Source: | https://github.com/config-i1/greybox |
Function produces error measures for the provided object and the holdout values of the response variable. Note that instead of the parameters x and test, the function accepts the vector of values in holdout. Also, the parameters d and D are not supported: MASE is always calculated via division by first differences.
## S3 method for class 'greybox'
accuracy(object, holdout = NULL, ...)
## S3 method for class 'predict.greybox'
accuracy(object, holdout = NULL, ...)
object |
The estimated model or a forecast from the estimated model generated via either the predict() or forecast() methods. |
holdout |
The vector of values of the response variable in the holdout (test) set. If not provided, then the function will return the in-sample error measures. |
... |
Other variables passed to the measures function. |
The function is a wrapper for the measures function and is implemented for convenience.
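To make the MASE convention above concrete, here is a minimal sketch of the calculation in base R, with the denominator based on in-sample first differences (the vectors insample, holdout and forecast are hypothetical, used for illustration only):

# MASE scaled by the mean absolute first differences of the in-sample data
insample <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119)
holdout <- c(104, 118, 115)
forecast <- c(110, 112, 114)
mean(abs(holdout - forecast)) / mean(abs(diff(insample)))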
Ivan Svetunkov, [email protected]
xreg <- cbind(rlaplace(100,10,3),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rlaplace(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
ourModel <- alm(y~x1+x2+trend, xreg, subset=c(1:80), distribution="dlaplace")
predict(ourModel,xreg[-c(1:80),]) |> accuracy(xreg[-c(1:80),"y"])
This is a simple method that returns the values of the response variable of the model
actuals(object, all = TRUE, ...)
## Default S3 method:
actuals(object, all = TRUE, ...)
## S3 method for class 'lm'
actuals(object, all = TRUE, ...)
## S3 method for class 'alm'
actuals(object, all = TRUE, ...)
## S3 method for class 'predict.greybox'
actuals(object, all = TRUE, ...)
object |
Model estimated using one of the functions of the smooth package. |
all |
If FALSE, then in the case of the occurrence model, only the non-zero values of the response variable are returned. |
... |
Other parameters to pass to the method. Currently nothing is supported here. |
The vector of the response variable.
Ivan Svetunkov, [email protected]
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
ourModel <- stepwise(xreg)
actuals(ourModel)
This function extracts AICc / BICc from models. It can be applied to a wide variety of models that use the logLik() and nobs() methods (including the popular lm, forecast and smooth classes).
AICc(object, ...)
BICc(object, ...)
object |
Time series model. |
... |
Other parameters passed to the function. |
AICc was proposed by Sugiura (1978) and is used for small samples with models that have normally distributed residuals. BICc was derived in McQuarrie (1999) and is used in similar circumstances.
IMPORTANT NOTE: both of the criteria can only be used for univariate models (regression models, ARIMA, ETS etc) with normally distributed residuals! In case of multivariate models, both criteria need to be modified. See Bedrick & Tsai (1994) for details.
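As an illustration of the idea (a minimal sketch, not the package internals), AICc can be reproduced manually from logLik() and nobs() using the standard small-sample correction:

# AICc = AIC + 2*k*(k+1)/(n-k-1), where k counts all estimated
# parameters (including the scale) and n is the sample size
ourModel <- lm(mpg~cyl+disp, mtcars)
k <- attr(logLik(ourModel), "df")
n <- nobs(ourModel)
AIC(ourModel) + 2*k*(k+1)/(n-k-1)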
This function returns numeric value.
Ivan Svetunkov, [email protected]
Burnham Kenneth P. and Anderson David R. (2002). Model Selection and Multimodel Inference. A Practical Information-Theoretic Approach. Springer-Verlag New York. DOI: [10.1007/b97636](http://dx.doi.org/10.1007/b97636).
McQuarrie, A. D. (1999). A small-sample correction for the Schwarz SIC model selection criterion. Statistics & Probability Letters, 44(1), 79–86. [10.1016/S0167-7152(98)00294-6](https://doi.org/10.1016/S0167-7152(98)00294-6).
Sugiura Nariaki (1978) Further analysts of the data by Akaike's information criterion and the finite corrections, Communications in Statistics - Theory and Methods, 7:1, 13-26, doi:10.1080/03610927808827599
Bedrick, E. J., & Tsai, C.-L. (1994). Model Selection for Multivariate Regression in Small Samples. Biometrics, 50(1), 226. doi:10.2307/2533213
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
ourModel <- stepwise(xreg)
AICc(ourModel)
BICc(ourModel)
The function applies several models on the provided time series and identifies what type of demand it is based on an information criterion.
aid(y, ic = c("AICc", "AIC", "BICc", "BIC"), level = 0.99, loss = "likelihood", ...)
y |
The vector of the data. |
ic |
Information criterion to use. |
level |
The confidence level used in stockouts identification. |
loss |
The type of loss function to use in model estimation. See alm for possible options. |
... |
Other parameters passed to the alm function. |
In the first step, the function creates inter-demand intervals and fits a LOWESS-based model to them, assuming the Geometric distribution. The outliers from this model are treated as potential stockouts.
In the second step, the function creates explanatory variables based on a LOWESS of the original data, then applies the Normal and Normal + Bernoulli models and selects the one that has the lowest IC. Based on that, it decides what type of demand the data corresponds to: regular or intermittent. Finally, if the data is count data, the function identifies that as well.
Class "aid" is returned, which contains:
y - The original data;
models - All fitted models;
ICs - Values of information criteria;
type - The type of the identified demand;
stockouts - List with start and end ids of potential stockouts;
new - Binary variable showing whether the data starts with an abnormal number of zeroes, suggesting a new product;
obsolete - Binary variable showing whether the data ends with an abnormal number of zeroes, suggesting a discontinued (obsolete) product.
Ivan Svetunkov, [email protected]
# Data from Poisson distribution
y <- rpois(120, 0.7)
aid(y)
Function estimates a model based on the selected distribution.
alm(formula, data, subset, na.action, distribution = c("dnorm", "dlaplace", "ds", "dgnorm", "dlogis", "dt", "dalaplace", "dlnorm", "dllaplace", "dls", "dlgnorm", "dbcnorm", "dinvgauss", "dgamma", "dexp", "dfnorm", "drectnorm", "dpois", "dnbinom", "dbinom", "dgeom", "dbeta", "dlogitnorm", "plogis", "pnorm"), loss = c("likelihood", "MSE", "MAE", "HAM", "LASSO", "RIDGE", "ROLE"), occurrence = c("none", "plogis", "pnorm"), scale = NULL, orders = c(0, 0, 0), parameters = NULL, fast = FALSE, ...)
formula |
an object of class "formula" (or one that can be coerced to
that class): a symbolic description of the model to be fitted. Can also include trend, which would add the global trend. |
data |
a data frame or a matrix, containing the variables in the model. |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The factory-fresh default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful. |
distribution |
what density function to use in the process. The full name of the distribution should be provided here. Values with "d" in the beginning of the name refer to the density function, while "p" stands for "probability" (cumulative distribution function). The names align with the names of distribution functions in R. For example, see dnorm. |
loss |
The type of Loss Function used in optimization.
In case of LASSO / RIDGE, the variables are not normalised prior to the estimation, but the parameters are divided by the standard deviations of the explanatory variables inside the optimisation. As a result, the parameters of the final model have the same interpretation as in the case of classical linear regression. Note that in this case the user is expected to provide the parameter lambda.
A user can also provide their own function here as well, making sure that it accepts the parameters actual, fitted and B and returns a scalar value of the loss.
See vignette("alm","greybox") for details. |
occurrence |
what distribution to use for the occurrence variable. Can be "none", "plogis" or "pnorm"; alternatively, a previously estimated occurrence model of class "alm" can be provided here.
If this is not "none", then the model is estimated in two steps: the occurrence part and the demand sizes part. |
scale |
formula for the scale parameter of the model. If NULL, then the scale is assumed to be constant. |
orders |
the orders of ARIMA to include in the model. Only non-seasonal orders are accepted. |
parameters |
vector of parameters of the linear model. When provided, the parameters are not estimated, and the provided values are used instead. |
fast |
if TRUE, then the function won't check whether the data has variability and whether the regressors are correlated. This speeds the estimation up, but might cause trouble in cases of multicollinearity. |
... |
additional parameters to pass to distribution functions. This includes, for example: alpha - the asymmetry parameter for distribution="dalaplace"; size - the size parameter for distribution="dnbinom"; nu - the number of degrees of freedom for distribution="dt".
You can also pass parameters to the optimiser, for example: B - the vector of starting values of parameters; maxeval - the maximum number of evaluations; xtol_rel - the relative tolerance of the optimiser.
You can read more about these parameters by running the function nloptr.print.options. |
This is a function, similar to lm, but using likelihood for the cases of several non-normal distributions. These include:
dnorm - Normal distribution,
dlaplace - Laplace distribution,
ds - S-distribution,
dgnorm - Generalised Normal distribution,
dlogis - Logistic Distribution,
dt - T-distribution,
dalaplace - Asymmetric Laplace distribution,
dlnorm - Log-Normal distribution,
dllaplace - Log-Laplace distribution,
dls - Log-S distribution,
dlgnorm - Log-Generalised Normal distribution,
dfnorm - Folded normal distribution,
drectnorm - Rectified normal distribution,
dbcnorm - Box-Cox normal distribution,
dinvgauss - Inverse Gaussian distribution,
dgamma - Gamma distribution,
dexp - Exponential distribution,
dlogitnorm - Logit-normal distribution,
dbeta - Beta distribution,
dpois - Poisson Distribution,
dnbinom - Negative Binomial Distribution,
dbinom - Binomial Distribution,
dgeom - Geometric Distribution,
plogis - Cumulative Logistic Distribution,
pnorm - Cumulative Normal distribution.
This function can be considered as an analogue of glm, but with the focus on time series. This is why, for example, the function has the orders parameter for ARIMA and produces time series analysis plots with plot(alm(...)).

This function is slower than lm, because it relies on likelihood estimation of parameters, hessian calculation and matrix multiplication. So think twice when using distribution="dnorm" here.

The estimation is done via the maximisation of likelihood of the selected distribution, so the number of estimated parameters always includes the scale. Thus the number of degrees of freedom of the model in case of alm will typically be lower than in the case of lm.
See more details and examples in the vignette for "ALM": vignette("alm","greybox")
Function returns model - the final model of the class "alm", which contains:
coefficients - estimated parameters of the model,
FI - Fisher Information of parameters of the model. Returned only when FI=TRUE,
fitted - fitted values,
residuals - residuals of the model,
mu - the estimated location parameter of the distribution,
scale - the estimated scale parameter of the distribution. If a formula was provided for scale, then an object of class "scale" will be returned.
distribution - distribution used in the estimation,
logLik - log-likelihood of the model. Only returned when loss="likelihood" or loss="ROLE" and in several special cases of distribution and loss combinations (e.g. loss="MSE", distribution="dnorm"),
loss - the type of the loss function used in the estimation,
lossFunction - the loss function, if the custom is provided by the user,
lossValue - the value of the loss function,
res - the output of the optimisation (nloptr function),
df.residual - number of degrees of freedom of the residuals of the model,
df - number of degrees of freedom of the model,
call - how the model was called,
rank - rank of the model,
data - data used for the model construction,
terms - terms of the data. Needed for some additional methods to work,
occurrence - the occurrence model used in the estimation,
B - the value of the optimised parameters. Typically, this is a duplicate of coefficients,
other - the list of all the other parameters either passed to the function or estimated in the process, but not included in the standard output (e.g. alpha for Asymmetric Laplace),
timeElapsed - the time elapsed for the estimation of the model.
Ivan Svetunkov, [email protected]
stepwise, lmCombine,
xregTransformer
### An example with mtcars data and factors
mtcars2 <- within(mtcars, {
  vs <- factor(vs, labels = c("V", "S"))
  am <- factor(am, labels = c("automatic", "manual"))
  cyl <- factor(cyl)
  gear <- factor(gear)
  carb <- factor(carb)
})
# The standard model with Log-Normal distribution
ourModel <- alm(mpg~., mtcars2[1:30,], distribution="dlnorm")
summary(ourModel)
plot(ourModel)
# Produce table based on the output for LaTeX
xtable(summary(ourModel))
# Produce predictions with the one sided interval (upper bound)
predict(ourModel, mtcars2[-c(1:30),], interval="p", side="u")
# Model with heteroscedasticity (scale changes with the change of qsec)
ourModel <- alm(mpg~., mtcars2[1:30,], scale=~qsec)
### Artificial data for the other examples
xreg <- cbind(rlaplace(100,10,3),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rlaplace(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
# An example with Laplace distribution
ourModel <- alm(y~x1+x2+trend, xreg, subset=c(1:80), distribution="dlaplace")
summary(ourModel)
plot(predict(ourModel,xreg[-c(1:80),]))
# And another one with Asymmetric Laplace distribution (quantile regression)
# with optimised alpha
ourModel <- alm(y~x1+x2, xreg, subset=c(1:80), distribution="dalaplace")
# An example with AR(1) order
ourModel <- alm(y~x1+x2, xreg, subset=c(1:80), distribution="dnorm", orders=c(1,0,0))
summary(ourModel)
plot(predict(ourModel,xreg[-c(1:80),]))
### Examples with the count data
xreg[,1] <- round(exp(xreg[,1]-70),0)
# Negative Binomial distribution
ourModel <- alm(y~x1+x2, xreg, subset=c(1:80), distribution="dnbinom")
summary(ourModel)
predict(ourModel,xreg[-c(1:80),],interval="p",side="u")
# Poisson distribution
ourModel <- alm(y~x1+x2, xreg, subset=c(1:80), distribution="dpois")
summary(ourModel)
predict(ourModel,xreg[-c(1:80),],interval="p",side="u")
### Examples with binary response variable
xreg[,1] <- round(xreg[,1] / (1 + xreg[,1]),0)
# Logistic distribution (logit regression)
ourModel <- alm(y~x1+x2, xreg, subset=c(1:80), distribution="plogis")
summary(ourModel)
plot(predict(ourModel,xreg[-c(1:80),],interval="c"))
# Normal distribution (probit regression)
ourModel <- alm(y~x1+x2, xreg, subset=c(1:80), distribution="pnorm")
summary(ourModel)
plot(predict(ourModel,xreg[-c(1:80),],interval="p"))
Function returns the matrix of measures of association for different types of variables.
association(x, y = NULL, use = c("na.or.complete", "complete.obs", "everything", "all.obs"), method = c("auto", "pearson", "spearman", "kendall", "cramer"))
assoc(x, y = NULL, use = c("na.or.complete", "complete.obs", "everything", "all.obs"), method = c("auto", "pearson", "spearman", "kendall", "cramer"))
x |
Either a data.frame or a matrix. |
y |
The numerical variable. |
use |
What observations to use. See cor function for details.
The only option that is not available here is "pairwise.complete.obs". |
method |
Which method to use for the calculation of measures of association.
By default this is "auto", in which case the method is selected automatically based on the types of the variables.
Be aware that the wrong usage of measures of association might give misleading results. |
The function looks at the types of the variables and calculates different measures depending on the result:
If both variables are numeric, then Pearson's correlation is calculated;
If both variables are categorical, then Cramer's V is calculated;
Finally, if one of the variables is categorical, and the other is numeric, then multiple correlation is returned.
After that the measures are wrapped up in a matrix.
Function also calculates the p-values associated with the respective measures (see the return).
See details in the vignette "Marketing analytics with greybox": vignette("maUsingGreybox","greybox").
assoc() is just a short name for association().
The following list of values is returned:
value - Matrix of the coefficients of association;
p.value - The p-values for the parameters;
type - The matrix of the types of measures of association.
Ivan Svetunkov, [email protected]
table, tableplot, spread,
cramer, mcor
association(mtcars)
These are the basic methods for the alm and greybox models that extract coefficients, their covariance matrix and confidence intervals, or generate the summary of the model. If a non-likelihood-related loss was used in the process, then it is recommended to use bootstrap (which is slow, but more reliable).
## S3 method for class 'greybox'
coef(object, bootstrap = FALSE, ...)
## S3 method for class 'alm'
confint(object, parm, level = 0.95, bootstrap = FALSE, ...)
## S3 method for class 'scale'
confint(object, parm, level = 0.95, bootstrap = FALSE, ...)
## S3 method for class 'alm'
vcov(object, bootstrap = FALSE, ...)
## S3 method for class 'scale'
vcov(object, bootstrap = FALSE, ...)
## S3 method for class 'alm'
summary(object, level = 0.95, bootstrap = FALSE, ...)
object |
The model estimated using alm or other greybox function. |
bootstrap |
Logical, determining whether to use bootstrap in the process or not. |
... |
Parameters passed to coefbootstrap function. |
parm |
The parameters that need to be extracted. |
level |
The confidence level for the construction of the interval. |
The coef() method returns the vector of parameters of the model. If bootstrap=TRUE, then the coefficients are calculated as the mean values of the bootstrapped ones.

The vcov() method returns the covariance matrix of parameters. If bootstrap=TRUE, then the bootstrap is done using the coefbootstrap function.

The confint() method constructs the confidence intervals for parameters. Once again, this can be done using bootstrap=TRUE.

Finally, the summary() method returns the table with parameters, their standard errors, confidence intervals and general information about the model.
Depending on the used method, different values are returned.
Ivan Svetunkov, [email protected]
# An example with ALM ourModel <- alm(mpg~., mtcars, distribution="dlnorm") coef(ourModel) vcov(ourModel) confint(ourModel) summary(ourModel)
The function does the bootstrap for parameters of models and returns covariance matrix together with the original bootstrapped data.
coefbootstrap(object, nsim = 1000, size = floor(0.75 * nobs(object)), replace = FALSE, prob = NULL, parallel = FALSE, method = c("dsr", "cr"), ...)
## S3 method for class 'lm'
coefbootstrap(object, nsim = 1000, size = floor(0.75 * nobs(object)), replace = FALSE, prob = NULL, parallel = FALSE, method = c("dsr", "cr"), ...)
## S3 method for class 'alm'
coefbootstrap(object, nsim = 1000, size = floor(0.75 * nobs(object)), replace = FALSE, prob = NULL, parallel = FALSE, method = c("dsr", "cr"), ...)
object |
The model estimated using either lm, or alm, or glm. |
nsim |
Number of iterations (simulations) to run. |
size |
A non-negative integer giving the number of items to choose (the sample size), passed to the sample function in R. If not provided and the model contains ARIMA components, this value will be selected at random on each iteration. This is only used with method="cr". |
replace |
Should sampling be with replacement? Also passed to the sample function in R. Only used with method="cr". |
prob |
A vector of probability weights for obtaining the elements of the vector
being sampled. This is passed to sample as well. Only used with method="cr". |
parallel |
Either a logical, specifying whether to do the calculations in parallel, or the number, specifying the number of cores to use for the parallel calculation. |
method |
Which bootstrap method to use. Currently two options are supported: "dsr" - the Data Shape Replication bootstrap (implemented in the dsrboot function) and "cr" - the case resampling. |
... |
Parameters passed to the dsrboot function. |
The function applies the same model as in the provided object to a smaller sample in order to get the estimates of parameters and capture the uncertainty about them. The "cr" option is a simple implementation of case resampling, which assumes that the observations are independent.
Class "bootstrap" is returned, which contains:
vcov - the covariance matrix of parameters;
coefficients - the matrix with the bootstrapped coefficients.
nsim - number of runs done;
size - the sample size used in the bootstrap;
replace - whether the sampling was done with replacement;
prob - a vector of probability weights used in the process;
parallel - whether the calculations were done in parallel;
model - the name of the model used (the name of the function);
timeElapsed - the time that was spent on the calculations.
Ivan Svetunkov, [email protected]
# An example with ALM
ourModel <- alm(mpg~., mtcars, distribution="dlnorm", loss="HAM")
# A fast example with 10 iterations. Use at least 100 to get better results
coefbootstrap(ourModel, nsim=10)
Function calculates Cramer's V for two categorical variables based on the table function
cramer(x, y, use = c("na.or.complete", "complete.obs", "everything", "all.obs"), unbiased = TRUE)
x |
First categorical variable. |
y |
Second categorical variable. |
use |
What observations to use. See cor function for details.
The only option that is not available here is "pairwise.complete.obs". |
unbiased |
Determines whether to calculate the biased version of Cramer's V or the one with the small sample correction. |
The function calculates Cramer's V and also returns the associated statistics from Chi-Squared test with the null hypothesis of independence of the two variables.
See details in the vignette "Marketing analytics with greybox":
vignette("maUsingGreybox","greybox")
The following list of values is returned:
value - The value of Cramer's V;
statistic - The value of Chi squared statistic associated with the Cramer's V;
p.value - The p-value of Chi squared test associated with the Cramer's V;
df - The number of degrees of freedom from the test.
Ivan Svetunkov, [email protected]
Wicher Bergsma (2013), A bias-correction for Cramér's V and Tschuprow's T. Journal of the Korean Statistical Society, 42, pp. 323-328. doi:10.1016/j.jkss.2012.10.002.
table, tableplot, spread,
mcor, association
cramer(mtcars$am, mtcars$gear)
Density, cumulative distribution, quantile functions and random number generation for the Asymmetric Laplace distribution with the location parameter mu, scale and the asymmetry parameter alpha.
dalaplace(q, mu = 0, scale = 1, alpha = 0.5, log = FALSE)
palaplace(q, mu = 0, scale = 1, alpha = 0.5)
qalaplace(p, mu = 0, scale = 1, alpha = 0.5)
ralaplace(n = 1, mu = 0, scale = 1, alpha = 0.5)
q |
vector of quantiles. |
mu |
vector of location parameters (means). |
scale |
vector of scale parameters. |
alpha |
value of asymmetry parameter. Varies from 0 to 1. |
log |
if TRUE, then the density is returned in logarithms. |
p |
vector of probabilities. |
n |
number of observations. Should be a single number. |
When mu=0 and scale=1, the Asymmetric Laplace distribution becomes standardized. The distribution has the following density function:
f(x) = alpha (1-alpha) / scale exp(-(x-mu)/scale (alpha - I(x<=mu))),
where I(.) is the indicator function (equal to 1 if the condition is satisfied and zero otherwise).
When alpha=0.5, then the distribution becomes Symmetric Laplace, where scale = 1/2 MAE.
This distribution function aligns with the quantile estimates of parameters (Geraci & Bottai, 2007).
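A convenient consequence of this parametrisation (a quick sketch to check it, not package internals) is that mu is the alpha-th quantile of the distribution, which is what makes it useful for quantile regression:

library(greybox)
# In this parametrisation, P(X <= mu) = alpha
palaplace(5, mu=5, scale=2, alpha=0.2)   # should return 0.2
qalaplace(0.2, mu=5, scale=2, alpha=0.2) # should return 5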
Finally, both palaplace and qalaplace are returned for the lower tail of the distribution.
Depending on the function, various things are returned (usually either vector or scalar):
dalaplace returns the density function value for the provided parameters.
palaplace returns the value of the cumulative function for the provided parameters.
qalaplace returns quantiles of the distribution. Depending on what was provided in p, mu and scale, this can be either a vector or a matrix, or an array.
ralaplace returns a vector of random variables generated from the Asymmetric Laplace distribution. Depending on what was provided in mu and scale, this can be either a vector or a matrix or an array.
Ivan Svetunkov, [email protected]
Geraci Marco, Bottai Matteo (2007). Quantile regression for longitudinal data using the asymmetric Laplace distribution. Biostatistics (2007), 8, 1, pp. 140-154 doi:10.1093/biostatistics/kxj039
Yu, K., & Zhang, J. (2005). A three-parameter asymmetric laplace distribution and its extension. Communications in Statistics - Theory and Methods, 34, 1867-1879. doi:10.1080/03610920500199018
x <- dalaplace(c(-100:100)/10, 0, 1, 0.2)
plot(x, type="l")
x <- palaplace(c(-100:100)/10, 0, 1, 0.2)
plot(x, type="l")
qalaplace(c(0.025,0.975), 0, c(1,2), c(0.2,0.3))
x <- ralaplace(1000, 0, 1, 0.2)
hist(x)
Density, cumulative distribution, quantile functions and random number generation for the distribution that becomes normal after the Box-Cox transformation. Note that this is based on the original Box-Cox paper.
dbcnorm(q, mu = 0, sigma = 1, lambda = 0, log = FALSE)
pbcnorm(q, mu = 0, sigma = 1, lambda = 0)
qbcnorm(p, mu = 0, sigma = 1, lambda = 0)
rbcnorm(n = 1, mu = 0, sigma = 1, lambda = 0)
q |
vector of quantiles. |
mu |
vector of location parameters (means). |
sigma |
vector of scale parameters. |
lambda |
the value of the Box-Cox transform parameter. |
log |
if TRUE, then the density is returned in logarithms. |
p |
vector of probabilities. |
n |
number of observations. Should be a single number. |
The distribution has the following density function:
f(y) = y^(lambda-1) 1/sqrt(2 pi) exp(-((y^lambda-1)/lambda -mu)^2 / (2 sigma^2))
Both pbcnorm and qbcnorm are returned for the lower tail of the distribution.
In case of lambda=0, the values of the log normal distribution are returned. In case of lambda=1, the values of the normal distribution are returned with mu=mu+1.
All the functions are defined for non-negative values only.
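These special cases can be checked directly (a small sketch under the parametrisation above):

library(greybox)
x <- c(0.5, 1, 2, 5)
# lambda=0 reduces to the log-normal distribution
dbcnorm(x, mu=0, sigma=1, lambda=0)
dlnorm(x, meanlog=0, sdlog=1)
# lambda=1 reduces to the normal distribution with mu=mu+1
dbcnorm(x, mu=0, sigma=1, lambda=1)
dnorm(x, mean=1, sd=1)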
Depending on the function, various things are returned (usually either vector or scalar):
dbcnorm returns the density function value for the provided parameters.
pbcnorm returns the value of the cumulative function for the provided parameters.
qbcnorm returns quantiles of the distribution. Depending on what was provided in p, mu and sigma, this can be either a vector or a matrix, or an array.
rbcnorm returns a vector of random variables generated from the bcnorm distribution. Depending on what was provided in mu and sigma, this can be either a vector or a matrix or an array.
Ivan Svetunkov, [email protected]
Box, G. E., & Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society. Series B (Methodological), 26(2), 211–252. Retrieved from https://www.jstor.org/stable/2984418
x <- dbcnorm(c(-1000:1000)/200, 0, 1, 1)
plot(c(-1000:1000)/200, x, type="l")
x <- pbcnorm(c(-1000:1000)/200, 0, 1, 1)
plot(c(-1000:1000)/200, x, type="l")
qbcnorm(c(0.025,0.975), 0, c(1,2), 1)
x <- rbcnorm(1000, 0, 1, 1)
hist(x)
Functions to detect when the Daylight Saving Time change and the leap year start and finish.
detectdst(object)
detectleap(object)
object |
Either a zoo / xts object or a vector of dates / times in POSIXt / Date
class. Note that in order for the functions to work correctly, the object should contain valid dates / times. |
The detectdst function detects when the change for the DST starts and ends. It assumes that the frequency of data is not lower than hourly.
The detectleap function does a similar thing for the leap year, flagging the 29th of February as the starting date and the 28th of February of the next year as the ending date.
In order for the methods to work, the object needs to be of either zoo / xts or POSIXt class and should contain valid dates.
List containing:
start - data frame with id (number of observation) and the respective dates, when the DST / leap year start;
end - data frame with id and dates, when DST / leap year end.
Ivan Svetunkov, [email protected]
xregExpander, temporaldummy,
outlierdummy
# Create hourly time stamps and detect the DST and leap year changes
x <- as.POSIXct("2004-01-01")+0:(365*24*8)*60*60
detectdst(x)
detectleap(x)
Function produces coefficients of determination for the provided data
determination(xreg, bruteforce = TRUE, ...)
determ(object, ...)
xreg |
Data frame or a matrix, containing the exogenous variables. |
bruteforce |
If |
... |
Other values passed to cor function. |
object |
The object, for which to calculate the coefficients of determination. |
The function calculates coefficients of determination (aka R^2) between all the provided variables. The higher the coefficient for a variable is, the higher the potential multicollinearity effect in the model with the variable will be. Coefficients of determination are connected directly to Variance Inflation Factor (VIF): VIF = 1 / (1 - determination). Arguably it is easier to interpret, because it is restricted with (0, 1) bounds. The multicollinearity can be considered as serious, when determination > 0.9 (which corresponds to VIF > 10).
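To illustrate the connection, here is a minimal sketch converting the coefficients of determination into VIFs via the formula above:

library(greybox)
# Coefficients of determination for the explanatory variables of mtcars
determValues <- determination(mtcars[,-1])
# The corresponding Variance Inflation Factors: VIF = 1 / (1 - R^2)
1 / (1 - determValues)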
The determ method can be applied to a wide variety of classes, including lm, glm and alm.
See details in the vignette "Marketing analytics with greybox":
vignette("maUsingGreybox","greybox")
Function returns the vector of determination coefficients.
Ivan Svetunkov, [email protected]
### Simple example
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("x1","x2","x3","Noise")
determination(xreg)
Density, cumulative distribution, quantile functions and random number generation for the folded normal distribution with the location parameter mu and the scale sigma (which corresponds to standard deviation in normal distribution).
dfnorm(q, mu = 0, sigma = 1, log = FALSE)
pfnorm(q, mu = 0, sigma = 1)
qfnorm(p, mu = 0, sigma = 1)
rfnorm(n = 1, mu = 0, sigma = 1)
q |
vector of quantiles. |
mu |
vector of location parameters (means). |
sigma |
vector of scale parameters. |
log |
if TRUE, then the density is returned in logarithms. |
p |
vector of probabilities. |
n |
number of observations. Should be a single number. |
The distribution has the following density function:
f(x) = 1/sqrt(2 pi) (exp(-(x-mu)^2 / (2 sigma^2)) + exp(-(x+mu)^2 / (2 sigma^2)))
Both pfnorm and qfnorm are returned for the lower tail of the distribution.
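The density above is simply a normal density folded over zero, which can be verified directly (a small sketch):

library(greybox)
x <- c(0, 0.5, 1, 2)
dfnorm(x, mu=1, sigma=2)
# Equivalent: sum of the two normal densities for x >= 0
dnorm(x, mean=1, sd=2) + dnorm(x, mean=-1, sd=2)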
Depending on the function, various things are returned (usually either vector or scalar):
dfnorm returns the density function value for the provided parameters.
pfnorm returns the value of the cumulative function for the provided parameters.
qfnorm returns quantiles of the distribution. Depending on what was provided in p, mu and sigma, this can be either a vector or a matrix, or an array.
rfnorm returns a vector of random variables generated from the fnorm distribution. Depending on what was provided in mu and sigma, this can be either a vector or a matrix or an array.
Ivan Svetunkov, [email protected]
Wikipedia page on folded normal distribution: https://en.wikipedia.org/wiki/Folded_normal_distribution.
x <- dfnorm(c(-1000:1000)/200, 0, 1)
plot(x, type="l")
x <- pfnorm(c(-1000:1000)/200, 0, 1)
plot(x, type="l")
qfnorm(c(0.025,0.975), 0, c(1,2))
x <- rfnorm(1000, 0, 1)
hist(x)
Density, cumulative distribution, quantile functions and random number generation for the Generalised Normal distribution with the location mu, a scale and a shape parameters.
dgnorm(q, mu = 0, scale = 1, shape = 1, log = FALSE)
pgnorm(q, mu = 0, scale = 1, shape = 1, lower.tail = TRUE, log.p = FALSE)
qgnorm(p, mu = 0, scale = 1, shape = 1, lower.tail = TRUE, log.p = FALSE)
rgnorm(n, mu = 0, scale = 1, shape = 1)
q |
vector of quantiles |
mu |
location parameter |
scale |
scale parameter |
shape |
shape parameter |
log, log.p |
logical; if TRUE, probabilities p are given as log(p) |
lower.tail |
logical; if TRUE (default), probabilities are P(X <= x), otherwise P(X > x). |
p |
vector of probabilities |
n |
number of observations |
A generalized normal random variable with parameters location mu, scale and shape has density:

f(x) = shape / (2 scale Gamma(1/shape)) exp(-(abs(x-mu)/scale)^shape).

The mean of the distribution is mu and the variance is scale^2 Gamma(3/shape) / Gamma(1/shape).
The functions are based on the functions from the gnorm package of Maryclare Griffin (the package has been abandoned since 2018).
The quantile and cumulative functions use a uniform approximation for the cases with shape>100. This is needed because otherwise it is not possible to calculate the values correctly due to scale^(shape)=Inf in R.
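As a quick check of the parametrisation (a sketch): shape=2 reduces to the normal distribution with sd = scale/sqrt(2), and shape=1 reduces to the Laplace distribution with the same scale:

library(greybox)
x <- seq(-2, 2, by=1)
# shape=2: normal distribution with sd = scale/sqrt(2)
dgnorm(x, mu=0, scale=sqrt(2), shape=2)
dnorm(x, mean=0, sd=1)
# shape=1: Laplace distribution with the same scale
dgnorm(x, mu=0, scale=1, shape=1)
dlaplace(x, mu=0, scale=1)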
Maryclare Griffin and Ivan Svetunkov
dgnorm, pgnorm, qgnorm and rgnorm are all parametrized as in Version 1 of the Generalized Normal Distribution Wikipedia page, which uses the parametrization given in Nadarajah (2005). The same distribution was described much earlier by Subbotin (1923) and named the exponential power distribution by Box and Tiao (1973).
Box, G. E. P. and G. C. Tiao. "Bayesian inference in Statistical Analysis." Addison-Wesley Pub. Co., Reading, Mass (1973).
Nadarajah, Saralees. "A generalized normal distribution." Journal of Applied Statistics 32.7 (2005): 685-694.
Subbotin, M. T. "On the Law of Frequency of Error." Matematicheskii Sbornik 31.2 (1923): 296-301.
# Density function values for standard normal distribution
x <- dgnorm(seq(-1, 1, length.out = 100), 0, sqrt(2), 2)
plot(x, type="l")
# CDF of standard Laplace
x <- pgnorm(c(-100:100), 0, 1, 1)
plot(x, type="l")
# Quantiles of S distribution
qgnorm(c(0.025,0.975), 0, 1, 0.5)
# Random numbers from a distribution with shape=1000 (approximately uniform)
x <- rgnorm(1000, 0, 1, 1000)
hist(x)
The greybox package implements several distribution functions. In this document, I list the main functions and provide links to related resources.
Generalised normal distribution (with a kurtosis parameter by Nadarajah, 2005): qgnorm, dgnorm, pgnorm, rgnorm.
S distribution (a special case of Generalised Normal with shape=0.5): qs, ds, ps, rs.
Laplace distribution (special case of Generalised Normal with shape=1): qlaplace, dlaplace, plaplace, rlaplace.
Asymmetric Laplace distribution (Yu & Zhang, 2005): qalaplace, dalaplace, palaplace, ralaplace.
Logit Normal distribution (Mead, 1965): qlogitnorm, dlogitnorm, plogitnorm, rlogitnorm.
Box-Cox Normal distribution (Box & Cox, 1964): qbcnorm, dbcnorm, pbcnorm, rbcnorm.
Rectified Normal distribution: qrectnorm, drectnorm, prectnorm, rrectnorm.
Three parameter Log Normal distribution (Sangal & Biswas, 1970): qtplnorm, dtplnorm, ptplnorm, rtplnorm.
Ivan Svetunkov, [email protected]
Ivan Svetunkov, [email protected]
Nadarajah, Saralees. "A generalized normal distribution." Journal of Applied Statistics 32.7 (2005): 685-694.
Wikipedia page on Laplace distribution: https://en.wikipedia.org/wiki/Laplace_distribution.
Yu, K., & Zhang, J. (2005). A three-parameter asymmetric laplace distribution and its extension. Communications in Statistics - Theory and Methods, 34, 1867-1879. doi:10.1080/03610920500199018
Mead, R. (1965). A Generalised Logit-Normal Distribution. Biometrics, 21 (3), 721–732. doi: 10.2307/2528553
Box, G. E., & Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society. Series B (Methodological), 26(2), 211–252. Retrieved from https://www.jstor.org/stable/2984418
Sangal, B. P., & Biswas, A. K. (1970). The 3-Parameter Distribution Applications in Hydrology. Water Resources Research, 6(2), 505–515. doi:10.1029/WR006i002p00505
Distributions from the stats package.
Density, cumulative distribution, quantile functions and random number generation for the Laplace distribution with the location parameter mu and the scale parameter (which is equal to Mean Absolute Error, aka Mean Absolute Deviation).
dlaplace(q, mu = 0, scale = 1, log = FALSE)
plaplace(q, mu = 0, scale = 1)
qlaplace(p, mu = 0, scale = 1)
rlaplace(n = 1, mu = 0, scale = 1)
q |
vector of quantiles. |
mu |
vector of location parameters (means). |
scale |
vector of mean absolute errors. |
log |
if TRUE, then the density is returned in logarithms. |
p |
vector of probabilities. |
n |
number of observations. Should be a single number. |
When mu=0 and scale=1, the Laplace distribution becomes standardized. The distribution has the following density function:
f(x) = 1/(2 scale) exp(-abs(x-mu) / scale)
Both plaplace and qlaplace are returned for the lower tail of the distribution.
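Because the scale of this distribution equals the mean absolute error, it can be estimated directly from the data; a small simulation sketch:

library(greybox)
x <- rlaplace(10000, mu=10, scale=3)
# The scale is the expected absolute deviation from mu
mean(abs(x - 10))   # should be close to 3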
Depending on the function, various things are returned (usually either vector or scalar):
dlaplace returns the density function value for the provided parameters.
plaplace returns the value of the cumulative function for the provided parameters.
qlaplace returns quantiles of the distribution. Depending on what was provided in p, mu and scale, this can be either a vector or a matrix, or an array.
rlaplace returns a vector of random variables generated from the Laplace distribution. Depending on what was provided in mu and scale, this can be either a vector or a matrix or an array.
Ivan Svetunkov, [email protected]
Wikipedia page on Laplace distribution: https://en.wikipedia.org/wiki/Laplace_distribution.
x <- dlaplace(c(-100:100)/10, 0, 1)
plot(x, type="l")
x <- plaplace(c(-100:100)/10, 0, 1)
plot(x, type="l")
qlaplace(c(0.025,0.975), 0, c(1,2))
x <- rlaplace(1000, 0, 1)
hist(x)
Density, cumulative distribution, quantile functions and random number generation for the distribution that becomes normal after the Logit transformation.
dlogitnorm(q, mu = 0, sigma = 1, log = FALSE)
plogitnorm(q, mu = 0, sigma = 1)
qlogitnorm(p, mu = 0, sigma = 1)
rlogitnorm(n = 1, mu = 0, sigma = 1)
q |
vector of quantiles. |
mu |
vector of location parameters (means). |
sigma |
vector of scale parameters. |
log |
if TRUE, then the density is returned in logarithms. |
p |
vector of probabilities. |
n |
number of observations. Should be a single number. |
The distribution has the following density function:
f(y) = 1/(sqrt(2 pi) y (1-y)) exp(-(logit(y) -mu)^2 / (2 sigma^2))
where y is in (0, 1) and logit(y) =log(y/(1-y)).
Both plogitnorm and qlogitnorm are returned for the lower tail of the distribution.
All the functions are defined for the values between 0 and 1.
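Since the distribution is just a normal one on the logit scale (a sketch of this relation, assuming the parametrisation above), its quantiles are the logistic transform of the normal quantiles:

library(greybox)
qlogitnorm(c(0.025, 0.5, 0.975), mu=0, sigma=1)
# Equivalent: transform normal quantiles with the logistic function
plogis(qnorm(c(0.025, 0.5, 0.975), mean=0, sd=1))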
Depending on the function, various things are returned (usually either vector or scalar):
dlogitnorm returns the density function value for the provided parameters.
plogitnorm returns the value of the cumulative function for the provided parameters.
qlogitnorm returns quantiles of the distribution. Depending on what was provided in p, mu and sigma, this can be either a vector or a matrix, or an array.
rlogitnorm returns a vector of random variables generated from the logitnorm distribution. Depending on what was provided in mu and sigma, this can be either a vector or a matrix or an array.
Ivan Svetunkov, [email protected]
Mead, R. (1965). A Generalised Logit-Normal Distribution. Biometrics, 21 (3), 721–732. doi: 10.2307/2528553
x <- dlogitnorm(c(-1000:1000)/200, 0, 1)
plot(c(-1000:1000)/200, x, type="l")
x <- plogitnorm(c(-1000:1000)/200, 0, 1)
plot(c(-1000:1000)/200, x, type="l")
qlogitnorm(c(0.025,0.975), 0, c(1,2))
x <- rlogitnorm(1000, 0, 1)
hist(x)
Density, cumulative distribution, quantile functions and random number generation for the Rectified Normal distribution.
drectnorm(q, mu = 0, sigma = 1, log = FALSE)
prectnorm(q, mu = 0, sigma = 1)
qrectnorm(p, mu = 0, sigma = 1)
rrectnorm(n = 1, mu = 0, sigma = 1)
q |
vector of quantiles. |
mu |
vector of location parameters (means). |
sigma |
vector of scale parameters. |
log |
if TRUE, then the density is returned in logarithms. |
p |
vector of probabilities. |
n |
number of observations. Should be a single number. |
If x~N(mu, sigma^2) then y = max(0, x) follows Rectified Normal distribution: y~RectN(mu, sigma^2), which can be written as:
f_y = 1(x<=0) F_x(mu, sigma) + 1(x>0) f_x(x, mu, sigma),
where F_x is the cumulative distribution function and f_x is the probability density function of normal distribution.
Both prectnorm and qrectnorm are returned for the lower tail of the distribution.
All the functions are defined for non-negative values only.
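The definition y = max(0, x) suggests a direct way to mimic the random generation (a sketch checking the point mass at zero):

library(greybox)
# Rectified normal values via censoring normal draws at zero
x <- pmax(0, rnorm(10000, mean=0.5, sd=1))
mean(x == 0)             # empirical mass at zero
pnorm(0, mean=0.5, sd=1) # theoretical mass at zero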
Depending on the function, various things are returned (usually either vector or scalar):
drectnorm returns the density function value for the provided parameters.
prectnorm returns the value of the cumulative function for the provided parameters.
qrectnorm returns quantiles of the distribution. Depending on what was provided in p, mu and sigma, this can be either a vector or a matrix, or an array.
rrectnorm returns a vector of random variables generated from the RectN distribution. Depending on what was provided in mu and sigma, this can be either a vector or a matrix or an array.
Ivan Svetunkov, [email protected]
x <- drectnorm(c(-1000:1000)/200, 0, 1)
plot(c(-1000:1000)/200, x, type="l")
x <- prectnorm(c(-1000:1000)/200, 0, 1)
plot(c(-1000:1000)/200, x, type="l")
qrectnorm(c(0.025,0.975), 0, c(1,2))
x <- rrectnorm(1000, 0, 1)
hist(x)
Density, cumulative distribution, quantile functions and random number generation for the S distribution with the location parameter mu and a scaling parameter scale.
ds(q, mu = 0, scale = 1, log = FALSE)
ps(q, mu = 0, scale = 1)
qs(p, mu = 0, scale = 1)
rs(n = 1, mu = 0, scale = 1)
q |
vector of quantiles. |
mu |
vector of location parameters (means). |
scale |
vector of scale parameters (which are equal to ham/2). |
log |
if TRUE, then the density is returned in logarithms. |
p |
vector of probabilities. |
n |
number of observations. Should be a single number. |
When mu=0 and ham=2, the S distribution becomes standardized with scale=1 (this is because scale=ham/2). The distribution has the following density function:
f(x) = 1/(4 scale^2) exp(-sqrt(abs(x-mu)) / scale)
The S distribution has fat tails and a large excess kurtosis.
Both ps and qs are returned for the lower tail of the distribution.
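The link between the scale and the Half Absolute Moment (HAM) can be checked by simulation (a sketch, using scale = ham/2 from above):

library(greybox)
x <- rs(100000, mu=0, scale=1)
# ham is the mean square root of absolute deviations; scale = ham/2
mean(sqrt(abs(x))) / 2   # should be close to 1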
Depending on the function, various things are returned (usually either vector or scalar):
ds returns the density function value for the provided parameters.
ps returns the value of the cumulative function for the provided parameters.
qs returns quantiles of the distribution. Depending on what was provided in p, mu and scale, this can be either a vector or a matrix, or an array.
rs returns a vector of random variables generated from the S distribution. Depending on what was provided in mu and scale, this can be either a vector or a matrix or an array.
Ivan Svetunkov, [email protected]
x <- ds(c(-1000:1000)/10, 0, 1)
plot(x, type="l")
x <- ps(c(-1000:1000)/10, 0, 1)
plot(x, type="l")
qs(c(0.025,0.975), 0, 1)
x <- rs(1000, 0, 1)
hist(x)
The function implements a bootstrap inspired by the Maximum Entropy Bootstrap
dsrboot(y, nsim = 100, intermittent = FALSE, type = c("multiplicative", "additive"), kind = c("nonparametric", "parametric"), lag = frequency(y), sd = NULL, scale = TRUE)
## S3 method for class 'dsrboot'
plot(x, sorted = FALSE, legend = TRUE, ...)
y |
The original time series |
nsim |
Number of iterations (simulations) to run. |
intermittent |
Whether to treat the demand as intermittent or not. |
type |
Type of bootstrap to use. |
kind |
A kind of the bootstrap to do: nonparametric or parametric. The latter relies on the normal distribution, while the former uses the empirical distribution of differences of the data. |
lag |
The lag to use in the calculation of differences. Should be 1 for non-seasonal data. |
sd |
Standard deviation to use in the normal distribution. Estimated as mean absolute differences of the data if omitted. |
scale |
Whether or not to scale the bootstrapped series so that they have a variance similar to the original data. |
x |
The object of the class dsrboot. |
sorted |
Whether the sorted (TRUE) or the original (FALSE) series should be used in the plot. |
legend |
Whether to produce the legend on the plot. |
... |
Other parameters passed to the plot function. |
The "Data Shape Replication" bootstrap reproduces the shape of the original time series by creating randomness around it. It is done in the following steps:
1. Sort the data in the ascending order, recording the original order of elements;
2. Take first differences of the original data and sort them;
3. Generate random numbers from the uniform distribution between 0 and 1;
4. Get the smoothed differences that correspond to the random numbers (randomly extract empirical quantiles). This way we take the empirical density into account when selecting the differences;
5. Add the random differences to the sorted series from (1) to get a new time series;
6. Sort the new time series in the ascending order;
7. Reorder (6) based on the initial order of series;
8. Centre the data around the original series;
9. Scale the data to make sure that the variance is constant over time.
If the multiplicative bootstrap is used then logarithms of the sorted series are used and at the very end, the exponent of the resulting data is taken. This way the discrepancies in the data have similar scale no matter what the level of the original series is. In case of the additive bootstrap, the trended series will be noisier when the level of series is low.
The function returns:
call - the call that was used originally;
data - the original data used in the function;
boot - the matrix with the new series in columns and observations in rows;
type - type of the bootstrap used;
sd - the value of sd used in case of parametric bootstrap;
scale - whether the scaling was needed;
smooth - the smoothed ordered actual data.
Ivan Svetunkov, [email protected]
Vinod HD, López-de-Lacalle J (2009). "Maximum Entropy Bootstrap for Time Series: The meboot R Package." Journal of Statistical Software, 29(5), 1–19. doi:10.18637/jss.v029.i05.
dsrboot(AirPassengers) |> plot()
Density, cumulative distribution, quantile functions and random number generation for the 3 parameter log normal distribution with the location parameter mu, scale sigma (which corresponds to standard deviation in normal distribution) and shifting parameter shift.
dtplnorm(q, mu = 0, sigma = 1, shift = 0, log = FALSE)
ptplnorm(q, mu = 0, sigma = 1, shift = 0)
qtplnorm(p, mu = 0, sigma = 1, shift = 0)
rtplnorm(n = 1, mu = 0, sigma = 1, shift = 0)
q |
vector of quantiles. |
mu |
vector of location parameters (means). |
sigma |
vector of scale parameters. |
shift |
vector of shift parameters. |
log |
if TRUE, then the density is returned in logarithms. |
p |
vector of probabilities. |
n |
number of observations. Should be a single number. |
The distribution has the following density function:
f(x) = 1/(x-a) 1/sqrt(2 pi) exp(-(log(x-a)-mu)^2 / (2 sigma^2))
Both ptplnorm and qtplnorm are returned for the lower tail of the distribution.
The function is based on the lnorm functions from stats package, introducing the shift parameter.
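That relation can be verified directly (a sketch): shifting the argument reduces the functions to the standard lnorm ones:

library(greybox)
x <- c(1.5, 2, 3)
dtplnorm(x, mu=0, sigma=1, shift=1)
# Equivalent: the log-normal density of the shifted values
dlnorm(x - 1, meanlog=0, sdlog=1)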
Depending on the function, various things are returned (usually either vector or scalar):
dtplnorm returns the density function value for the provided parameters.
ptplnorm returns the value of the cumulative function for the provided parameters.
qtplnorm returns quantiles of the distribution. Depending on what was provided in p, mu and sigma, this can be either a vector or a matrix, or an array.
rtplnorm returns a vector of random variables generated from the tplnorm distribution. Depending on what was provided in mu and sigma, this can be either a vector or a matrix or an array.
Ivan Svetunkov, [email protected]
Sangal, B. P., & Biswas, A. K. (1970). The 3-Parameter Lognormal Distribution and Its Applications in Hydrology. Water Resources Research, 6(2), 505–515. doi:10.1029/WR006i002p00505
x <- dtplnorm(c(-1000:1000)/200, 0, 1, 1) plot(c(-1000:1000)/200, x, type="l") x <- ptplnorm(c(-1000:1000)/200, 0, 1, 1) plot(c(-1000:1000)/200, x, type="l") qtplnorm(c(0.025,0.975), 0, c(1,2), 1) x <- rtplnorm(1000, 0, 1, 1) hist(x)
This function allows extracting error type from any model.
errorType(object, ...)
object |
Model estimated using one of the functions of smooth package. |
... |
Currently nothing is accepted via ellipsis. |
errorType
extracts the type of error from the model
(either additive or multiplicative).
Either "A"
for additive error or "M"
for multiplicative.
In both cases, the value is returned as a character string.
Ivan Svetunkov, [email protected]
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") ourModel <- alm(y~x1+x2,as.data.frame(xreg)) errorType(ourModel)
Functions extract the scale and the standard error of the residuals. These are mainly needed when working with a model estimated via sm.
extractScale(object, ...) ## Default S3 method: extractScale(object, ...) ## S3 method for class 'greybox' extractScale(object, ...) extractSigma(object, ...) ## Default S3 method: extractSigma(object, ...) ## S3 method for class 'greybox' extractSigma(object, ...)
object |
The model estimated using lm / alm / etc. |
... |
Other parameters (currently nothing). |
In the case of a simpler model, the functions will return a scalar using the sigma() method. If the scale model was estimated, extractScale() and extractSigma() will return the conditional scale and the conditional standard error of the residuals respectively.
One of the two is returned, depending on the type of the estimated model:
Scalar from the sigma() method if the variance is assumed to be constant.
Vector of values if the scale model was estimated.
Ivan Svetunkov, [email protected]
# Generate the data xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+sqrt(exp(0.8+0.2*xreg[,1]))*rnorm(100,0,1), xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") # Estimate the location and scale model ourModel <- alm(y~., xreg, scale=~x1+x2) # Extract scale extractScale(ourModel) # Extract standard error extractSigma(ourModel)
The function makes a standard linear graph using the provided actuals and forecasts.
graphmaker(actuals, forecast, fitted = NULL, lower = NULL, upper = NULL, level = NULL, legend = TRUE, cumulative = FALSE, vline = TRUE, parReset = TRUE, ...)
actuals |
The vector of actual values |
forecast |
The vector of forecasts. Should be a ts object that starts at the end of the actuals. |
fitted |
The vector of fitted values. |
lower |
The vector of lower bound values of a prediction interval. Should be a ts object that starts at the end of the actuals. |
upper |
The vector of upper bound values of a prediction interval. Should be a ts object that starts at the end of the actuals. |
level |
The width of the prediction interval. |
legend |
If TRUE, then the legend is produced on the plot. |
cumulative |
If |
vline |
Whether to draw the vertical line, splitting the in-sample and the holdout sample. |
parReset |
Whether to reset par() after plotting things or not.
If FALSE, then par() is not reset, which allows adding further elements to the plot. |
... |
Other parameters passed to |
Function uses the provided data to construct a linear graph. It is strongly
advised to use ts
objects to define the start of each of the vectors.
Otherwise the data may be plotted incorrectly.
Function does not return anything.
Ivan Svetunkov
xreg <- cbind(y=rnorm(100,100,10),x=rnorm(100,10,10)) almModel <- alm(y~x, xreg, subset=c(1:90)) values <- predict(almModel, newdata=xreg[-c(1:90),], interval="prediction") graphmaker(xreg[,1],values$mean,fitted(values)) graphmaker(xreg[,1],values$mean,fitted(values),legend=FALSE) graphmaker(xreg[,1],values$mean,fitted(values),legend=FALSE,lower=values$lower,upper=values$upper) # Produce the necessary ts objects from an arbitrary vectors actuals <- ts(c(1:10), start=c(2000,1), frequency=4) forecast <- ts(c(11:15),start=end(actuals)[1]+end(actuals)[2]*deltat(actuals), frequency=frequency(actuals)) graphmaker(actuals,forecast) # This should work as well graphmaker(c(1:10),c(11:15)) # This way you can add additional elements to the plot graphmaker(c(1:10),c(11:15), parReset=FALSE) points(c(1:15)) # But don't forget to do dev.off() in order to reset the plotting area afterwards
Toolbox for working with univariate models for the purposes of analysis and forecasting
Package: | greybox |
Type: | Package |
Date: | 2018-02-13 - Inf |
License: | LGPL-2.1 |
The following functions are included in the package:
pointLik - point likelihood of the function.
pAIC, pAICc, pBIC, pBICc - point versions of respective information criteria.
coefbootstrap - Method that uses a simple implementation of the case resampling to get bootstrapped estimates of parameters of the model.
dsrboot - Bootstrap inspired by the meboot package that creates bootstrapped series based on the provided one.
determination - Coefficients of determination between different exogenous variables.
temporaldummy - Matrix with seasonal dummy variables.
outlierdummy - Matrix with dummies for outliers.
alm - Advanced Linear Model - regression, estimated using likelihood with specified distribution (e.g. Laplace or Chi-Squared).
sm - Scale Model - Regression model for scale of distribution
(e.g. for Variance of Normal distribution). Requires an lm()
or alm()
model.
stepwise - Stepwise based on information criteria and partial correlations. Efficient and fast.
xregExpander - Function that expands the provided data into the data with lags and leads.
xregTransformer - Function produces mathematical transformations of the variables, such as taking logarithms, square roots etc.
xregMultiplier - Function produces cross-products of the matrix of the provided variables.
lmCombine - Function combines lm models from the estimated based on information criteria weights.
lmDynamic - Dynamic regression based on point AIC.
ro - Rolling origin evaluation.
Distributions - document explaining the distribution functions of the greybox package.
spread - function that produces scatterplots / boxplots / tableplots, depending on the types of variables.
assoc - function that calculates measures of association, depending on the types of variables.
Ivan Svetunkov, [email protected]
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") stepwise(xreg)
hm()
function estimates half moment from some predefined constant
C
. ham()
estimates the Half Absolute Moment. asymmetry()
function returns Asymmetry coefficient, while extremity()
returns the coefficient of Extremity, both based on hm()
. Finally,
cextremity()
returns the Complex Extremity coefficient, based on hm()
.
hm(x, C = mean(x, na.rm = TRUE), ...) ham(x, C = mean(x, na.rm = TRUE), ...) asymmetry(x, C = mean(x, na.rm = TRUE), ...) extremity(x, C = mean(x, na.rm = TRUE), ...) cextremity(x, C = mean(x, na.rm = TRUE), ...)
x |
A variable based on which HM is estimated. |
C |
Centring parameter. |
... |
Other parameters passed to mean function. |
NA
values of x
are excluded on the first step of calculation.
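To give an idea of the underlying calculation, the half moment can be sketched via complex square roots of the centred data. This is an illustration of the concept rather than the package's exact code:

x <- rnorm(100, 0, 1)
# Negative deviations from C contribute to the imaginary part,
# positive deviations to the real part
mean(sqrt(as.complex(x - mean(x))))
# Compare with the value produced by the package
hm(x)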
A complex variable is returned for the hm()
and cextremity()
functions, and real values are returned for ham()
,
asymmetry()
and extremity()
.
Ivan Svetunkov, [email protected]
Svetunkov I., Kourentzes N., Svetunkov S. "Half Central Moment for Data Analysis". Working Paper of Department of Management Science, Lancaster University, 2023:3, 1–21.
x <- rnorm(100,0,1) hm(x) ham(x) asymmetry(x) extremity(x) cextremity(x)
The function implants the scale model into the location model. It currently works with alm / adam and the sm() method.
implant(location, scale, ...)
location |
Model estimated using either |
scale |
The scale model estimated via the sm() method. |
... |
Currently nothing. Implemented for flexibility. |
The function is needed in order to treat the scale of the model correctly in methods like forecast().
The model of the same class as the location model, but with scale
from the estimated model via sm()
. This is needed to produce
appropriate forecasts in case of scale model and to take into account
the correct number of estimated parameters.
Ivan Svetunkov, [email protected]
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+sqrt(exp(0.8+0.2*xreg[,1]))*rnorm(100,0,1), xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") # Estimate the location model ourModel <- alm(y~.,xreg) # Estimate the scale model ourScale <- sm(ourModel,formula=~x1+x2) # Implant scale into location ourModel <- implant(ourModel, ourScale)
Functions to check if an object is of the specified class
is.greybox(x) is.alm(x) is.occurrence(x) is.greyboxC(x) is.greyboxD(x) is.rollingOrigin(x) is.rmc(x) is.scale(x)
x |
The object to check. |
The list of functions includes:
is.greybox()
tests if the object was produced by a greybox function
(e.g. alm / stepwise / lmCombine
/ lmDynamic);
is.alm()
tests if the object was produced by alm()
function;
is.occurrence()
tests if an occurrence part of the model was produced;
is.greyboxC()
tests if the object was produced by lmCombine()
function;
is.greyboxD()
tests if the object was produced by lmDynamic()
function;
is.rmc()
tests if the object was produced by rmc()
function;
is.rollingOrigin()
tests if the object was produced by ro()
function;
is.scale()
tests if the object is of the class "scale" (produced by
alm or sm in the case of a heteroscedastic model);
TRUE
if this is the specified class and FALSE
otherwise.
Ivan Svetunkov, [email protected]
xreg <- cbind(rlaplace(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rlaplace(100,0,3),xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") ourModel <- alm(y~x1+x2, xreg, distribution="dnorm") is.alm(ourModel) is.greybox(ourModel) is.greyboxC(ourModel) is.greyboxD(ourModel)
Function combines parameters of linear regressions of the first variable on all the other provided data.
lmCombine(data, ic = c("AICc", "AIC", "BIC", "BICc"), bruteforce = FALSE, silent = TRUE, formula = NULL, subset = NULL, distribution = c("dnorm", "dlaplace", "ds", "dgnorm", "dlogis", "dt", "dalaplace", "dlnorm", "dllaplace", "dls", "dlgnorm", "dbcnorm", "dinvgauss", "dgamma", "dexp", "dfnorm", "drectnorm", "dpois", "dnbinom", "dbeta", "dlogitnorm", "plogis", "pnorm"), parallel = FALSE, ...)
data |
Data frame containing dependent variable in the first column and the others in the rest. |
ic |
Information criterion to use. |
bruteforce |
If TRUE, then all the possible models are generated and combined. Otherwise, only the models around the best one are constructed and combined. |
silent |
If FALSE, then the progress of the estimation is printed out. Otherwise the function works quietly. |
formula |
If provided, then the selection will be done from the listed variables in the formula after all the necessary transformations. |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
distribution |
Distribution to pass to alm(). |
parallel |
If TRUE, then the estimation is done in parallel (requires the foreach package and either doMC or doParallel). |
... |
Other parameters passed to alm(). |
The algorithm uses alm() to fit different models and then combines the models based on the selected IC. The parameters are combined so that if they are not present in some of the models, it is assumed that they are equal to zero. Thus, there is a shrinkage effect in the combination.
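The weights themselves follow the standard information criteria logic (Burnham & Anderson, 2002). A small sketch of how such weights can be computed for a vector of IC values (illustrative only, with made-up numbers):

# Hypothetical IC values of three candidate models
ICs <- c(102.3, 100.1, 104.8)
# Shift by the minimum for numerical stability, then normalise
ICWeights <- exp(-0.5*(ICs - min(ICs)))
ICWeights <- ICWeights / sum(ICWeights)
ICWeights
# The combined parameters are then weighted sums across the models,
# with parameters absent from a model treated as zeros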
Some details and examples of application are also given in the vignette
"Greybox": vignette("greybox","greybox")
Function returns model
- the final model of the class
"greyboxC". The list of variables:
coefficients - combined parameters of the model,
vcov - combined covariance matrix of the model,
fitted - the fitted values,
residuals - residual of the model,
distribution - distribution used in the estimation,
logLik - combined log-likelihood of the model,
IC - the values of the combined information criterion,
ICType - the type of information criterion used,
df.residual - number of degrees of freedom of the residuals of the combined model,
df - number of degrees of freedom of the combined model,
importance - importance of the parameters,
combination - the table, indicating which variables were used in every model construction and what were the weights for each model,
timeElapsed - the time elapsed for the estimation of the model.
Ivan Svetunkov, [email protected]
Burnham Kenneth P. and Anderson David R. (2002). Model Selection and Multimodel Inference. A Practical Information-Theoretic Approach. Springer-Verlag New York. DOI: [10.1007/b97636](http://dx.doi.org/10.1007/b97636).
McQuarrie, A. D. (1999). A small-sample correction for the Schwarz SIC model selection criterion. Statistics & Probability Letters, 44(1), 79–86. [10.1016/S0167-7152(98)00294-6](https://doi.org/10.1016/S0167-7152(98)00294-6).
### Simple example xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") inSample <- xreg[1:80,] outSample <- xreg[-c(1:80),] # Combine all the possible models ourModel <- lmCombine(inSample,bruteforce=TRUE) predict(ourModel,outSample) plot(predict(ourModel,outSample)) ### Fat regression example xreg <- matrix(rnorm(5000,10,3),50,100) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(50,0,3),xreg,rnorm(50,300,10)) colnames(xreg) <- c("y",paste0("x",c(1:100)),"Noise") inSample <- xreg[1:40,] outSample <- xreg[-c(1:40),] # Combine only the models close to the optimal ourModel <- lmCombine(inSample, ic="BICc",bruteforce=FALSE) summary(ourModel) plot(predict(ourModel, outSample)) # Combine in parallel - should increase speed in case of big data ## Not run: ourModel <- lmCombine(inSample, ic="BICc", bruteforce=TRUE, parallel=TRUE) summary(ourModel) plot(predict(ourModel, outSample)) ## End(Not run)
Function combines parameters of linear regressions of the first variable on all the other provided data using pAIC weights. This is an extension of the lmCombine function, which relies upon the idea that the combination weights might change over time.
lmDynamic(data, ic = c("AICc", "AIC", "BIC", "BICc"), bruteforce = FALSE, silent = TRUE, formula = NULL, subset = NULL, distribution = c("dnorm", "dlaplace", "ds", "dgnorm", "dlogis", "dt", "dalaplace", "dlnorm", "dllaplace", "dls", "dlgnorm", "dbcnorm", "dfnorm", "dinvgauss", "dgamma", "dpois", "dnbinom", "dlogitnorm", "plogis", "pnorm"), parallel = FALSE, lowess = TRUE, f = NULL, ...)
data |
Data frame containing dependent variable in the first column and the others in the rest. |
ic |
Information criterion to use. |
bruteforce |
If TRUE, then all the possible models are generated and combined. Otherwise, only the models around the best one are constructed and combined. |
silent |
If FALSE, then the progress of the estimation is printed out. Otherwise the function works quietly. |
formula |
If provided, then the selection will be done from the listed variables in the formula after all the necessary transformations. |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
distribution |
Distribution to pass to alm(). |
parallel |
If TRUE, then the estimation is done in parallel (requires the foreach package and either doMC or doParallel). |
lowess |
Logical defining, whether LOWESS should be used to smooth the
dynamic weights. By default it is TRUE. |
f |
the smoother span for LOWESS. This gives the proportion of points in
the plot which influence the smooth at each value. Larger values give more
smoothness. If |
... |
Other parameters passed to alm(). |
The algorithm uses alm() to fit different models and then combines the models
based on the selected point IC. The combination weights are calculated for each
observation based on the point IC and then smoothed via LOWESS if the respective
parameter (lowess
) is set to TRUE.
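Conceptually, this is the same weighting scheme as in lmCombine, but applied per observation. A sketch of the idea with made-up pAIC values (illustrative only):

# Hypothetical matrix of pAIC values: 100 observations, 3 models
pAICs <- matrix(rnorm(300, 100, 5), 100, 3)
# Per-observation IC weights
weights <- exp(-0.5*(pAICs - apply(pAICs, 1, min)))
weights <- weights / rowSums(weights)
# Smooth the weights of each model over time via LOWESS and renormalise
weightsSmooth <- apply(weights, 2, function(w) lowess(w)$y)
weightsSmooth <- weightsSmooth / rowSums(weightsSmooth)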
Some details and examples of application are also given in the vignette
"Greybox": vignette("greybox","greybox")
Function returns model
- the final model of the class
"greyboxD", which includes time varying parameters and dynamic importance
of each variable. The list of variables:
coefficients - the mean (over time) parameters of the model,
vcov - the combined covariance matrix of the model,
fitted - the fitted values,
residuals - the residuals of the model,
distribution - the distribution used in the estimation,
logLik - the mean (over time) log-likelihood of the model,
IC - dynamic values of the information criterion (pIC),
ICType - the type of information criterion used,
df.residual - mean number of degrees of freedom of the residuals of the model,
df - mean number of degrees of freedom of the model,
importance - dynamic importance of the parameters,
call - call used in the function,
rank - rank of the combined model,
data - the data used in the model,
mu - the location value of the distribution,
scale - the scale parameter if alm() was used,
coefficientsDynamic - table with parameters of the model, varying over the time,
df.residualDynamic - dynamic df.residual,
dfDynamic - dynamic df,
weights - the dynamic weights for each model under consideration,
timeElapsed - the time elapsed for the estimation of the model.
Ivan Svetunkov, [email protected]
Burnham Kenneth P. and Anderson David R. (2002). Model Selection and Multimodel Inference. A Practical Information-Theoretic Approach. Springer-Verlag New York. DOI: [10.1007/b97636](http://dx.doi.org/10.1007/b97636).
McQuarrie, A. D. (1999). A small-sample correction for the Schwarz SIC model selection criterion. Statistics & Probability Letters, 44(1), 79–86. [10.1016/S0167-7152(98)00294-6](https://doi.org/10.1016/S0167-7152(98)00294-6).
### Simple example xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") inSample <- xreg[1:80,] outSample <- xreg[-c(1:80),] # Combine all the possible models ourModel <- lmDynamic(inSample,bruteforce=TRUE) predict(ourModel,outSample) plot(predict(ourModel,outSample))
Function calculates multiple correlation between y and x, constructing a linear regression model
mcor(x, y, use = c("na.or.complete", "complete.obs", "everything", "all.obs"))
x |
Either data.frame or a matrix |
y |
The numerical variable. |
use |
What observations to use. See cor function for details.
The only option that is not available here is "pairwise.complete.obs". |
This is based on the linear regression model with the set of variables in x. The returned value is just a coefficient of multiple correlation from regression, the F-statistics of the model (thus testing the null hypothesis that all the parameters are equal to zero), the associated p-value and the degrees of freedom.
See details in the vignette "Marketing analytics with greybox":
vignette("maUsingGreybox","greybox")
The following list of values is returned:
value - The value of the coefficient;
statistic - The value of F-statistics associated with the parameter;
p.value - The p-value of F-statistics associated with the parameter;
df.residual - The number of degrees of freedom for the residuals;
df - The number of degrees of freedom for the data.
Ivan Svetunkov, [email protected]
table, tableplot, spread,
cramer, association
mcor(mtcars$am, mtcars$mpg)
Functions allow calculating different types of errors for point and interval predictions:
ME - Mean Error,
MAE - Mean Absolute Error,
MSE - Mean Squared Error,
MRE - Mean Root Error (Kourentzes, 2014),
MIS - Mean Interval Score (Gneiting & Raftery, 2007),
MPE - Mean Percentage Error,
MAPE - Mean Absolute Percentage Error (See Svetunkov, 2017 for the critique),
MASE - Mean Absolute Scaled Error (Hyndman & Koehler, 2006),
RMSSE - Root Mean Squared Scaled Error (used in M5 Competition),
rMAE - Relative Mean Absolute Error (Davydenko & Fildes, 2013),
rRMSE - Relative Root Mean Squared Error,
rAME - Relative Absolute Mean Error,
rMIS - Relative Mean Interval Score,
sMSE - Scaled Mean Squared Error (Petropoulos & Kourentzes, 2015),
sPIS - Scaled Periods-In-Stock (Wallstrom & Segerstedt, 2010),
sCE - Scaled Cumulative Error,
sMIS - Scaled Mean Interval Score,
GMRAE - Geometric Mean Relative Absolute Error.
ME(holdout, forecast, na.rm = TRUE) MAE(holdout, forecast, na.rm = TRUE) MSE(holdout, forecast, na.rm = TRUE) MRE(holdout, forecast, na.rm = TRUE) MIS(holdout, lower, upper, level = 0.95, na.rm = TRUE) MPE(holdout, forecast, na.rm = TRUE) MAPE(holdout, forecast, na.rm = TRUE) MASE(holdout, forecast, scale, na.rm = TRUE) RMSSE(holdout, forecast, scale, na.rm = TRUE) rMAE(holdout, forecast, benchmark, na.rm = TRUE) rRMSE(holdout, forecast, benchmark, na.rm = TRUE) rAME(holdout, forecast, benchmark, na.rm = TRUE) rMIS(holdout, lower, upper, benchmarkLower, benchmarkUpper, level = 0.95, na.rm = TRUE) sMSE(holdout, forecast, scale, na.rm = TRUE) sPIS(holdout, forecast, scale, na.rm = TRUE) sCE(holdout, forecast, scale, na.rm = TRUE) sMIS(holdout, lower, upper, scale, level = 0.95, na.rm = TRUE) GMRAE(holdout, forecast, benchmark, na.rm = TRUE)
holdout |
The vector or matrix of holdout values. |
forecast |
The vector or matrix of forecasts values. |
na.rm |
Logical, defining whether to remove the NAs from the provided data or not. |
lower |
The lower bound of the prediction interval. |
upper |
The upper bound of the prediction interval. |
level |
The confidence level of the constructed interval. |
scale |
The value that should be used in the denominator of MASE. Can be anything but advised values are: mean absolute deviation of in-sample one step ahead Naive error or mean absolute value of the in-sample actuals. |
benchmark |
The vector or matrix of the forecasts of the benchmark model. |
benchmarkLower |
The lower bound of the prediction interval of the benchmark model. |
benchmarkUpper |
The upper bound of the prediction interval of the benchmark model. |
In the case of sMSE, scale needs to be a squared value. A typical one is the squared mean of the in-sample actuals.
If all the measures are needed, then measures function can help.
There are several other measures, see details of pinball and hm.
All the functions return the scalar value.
Ivan Svetunkov, [email protected]
Kourentzes N. (2014). The Bias Coefficient: a new metric for forecast bias https://kourentzes.com/forecasting/2014/12/17/the-bias-coefficient-a-new-metric-for-forecast-bias/
Svetunkov, I. (2017). Naughty APEs and the quest for the holy grail. https://openforecast.org/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/
Fildes R. (1992). The evaluation of extrapolative forecasting methods. International Journal of Forecasting, 8, pp.81-98.
Hyndman R.J., Koehler A.B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22, pp.679-688.
Petropoulos F., Kourentzes N. (2015). Forecast combinations for intermittent demand. Journal of the Operational Research Society, 66, pp.914-924.
Wallstrom P., Segerstedt A. (2010). Evaluation of forecasting error measurements and techniques for intermittent demand. International Journal of Production Economics, 128, pp.625-636.
Davydenko, A., Fildes, R. (2013). Measuring Forecasting Accuracy: The Case Of Judgmental Adjustments To Sku-Level Demand Forecasts. International Journal of Forecasting, 29(3), 510-522. doi:10.1016/j.ijforecast.2012.09.002
Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. doi:10.1198/016214506000001437
y <- rnorm(100,10,2) testForecast <- rep(mean(y[1:90]),10) MAE(y[91:100],testForecast) MSE(y[91:100],testForecast) MPE(y[91:100],testForecast) MAPE(y[91:100],testForecast) # Measures from Petropoulos & Kourentzes (2015) MASE(y[91:100],testForecast,mean(abs(y[1:90]))) sMSE(y[91:100],testForecast,mean(abs(y[1:90]))^2) sPIS(y[91:100],testForecast,mean(abs(y[1:90]))) sCE(y[91:100],testForecast,mean(abs(y[1:90]))) # Original MASE from Hyndman & Koehler (2006) MASE(y[91:100],testForecast,mean(abs(diff(y[1:90])))) testForecast2 <- rep(y[91],10) # Relative measures, from and inspired by Davydenko & Fildes (2013) rMAE(y[91:100],testForecast2,testForecast) rRMSE(y[91:100],testForecast2,testForecast) rAME(y[91:100],testForecast2,testForecast) GMRAE(y[91:100],testForecast2,testForecast) #### Measures for the prediction intervals # An example with mtcars data ourModel <- alm(mpg~., mtcars[1:30,], distribution="dnorm") ourBenchmark <- alm(mpg~1, mtcars[1:30,], distribution="dnorm") # Produce predictions with the interval ourForecast <- predict(ourModel, mtcars[-c(1:30),], interval="p") ourBenchmarkForecast <- predict(ourBenchmark, mtcars[-c(1:30),], interval="p") MIS(mtcars$mpg[-c(1:30)],ourForecast$lower,ourForecast$upper,0.95) sMIS(mtcars$mpg[-c(1:30)],ourForecast$lower,ourForecast$upper,mean(mtcars$mpg[1:30]),0.95) rMIS(mtcars$mpg[-c(1:30)],ourForecast$lower,ourForecast$upper, ourBenchmarkForecast$lower,ourBenchmarkForecast$upper,0.95) ### Also, see pinball function for other measures for the intervals
Function calculates several error measures using the provided forecasts and the data for the holdout sample.
measures(holdout, forecast, actual, digits = NULL, benchmark = c("naive", "mean"))
holdout |
The vector of the holdout values. |
forecast |
The vector of forecasts produced by a model. |
actual |
The vector of actual in-sample values. |
digits |
Number of digits of the output. If NULL, then no rounding is done. |
benchmark |
The character variable, defining what to use as
benchmark for relative measures. Can be either "naive" or "mean". |
The function returns the named vector of errors:
ME,
MAE,
MSE,
MPE,
MAPE,
MASE,
sMAE,
RMSSE,
sMSE,
sCE,
rMAE,
rRMSE,
rAME,
asymmetry,
sPIS.
For the details on these errors, see Errors.
Ivan Svetunkov, [email protected]
Svetunkov, I. (2017). Naughty APEs and the quest for the holy grail. https://openforecast.org/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/
Fildes R. (1992). The evaluation of extrapolative forecasting methods. International Journal of Forecasting, 8, pp.81-98.
Hyndman R.J., Koehler A.B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22, pp.679-688.
Petropoulos F., Kourentzes N. (2015). Forecast combinations for intermittent demand. Journal of the Operational Research Society, 66, pp.914-924.
Wallstrom P., Segerstedt A. (2010). Evaluation of forecasting error measurements and techniques for intermittent demand. International Journal of Production Economics, 128, pp.625-636.
Davydenko, A., Fildes, R. (2013). Measuring Forecasting Accuracy: The Case Of Judgmental Adjustments To Sku-Level Demand Forecasts. International Journal of Forecasting, 29(3), 510-522. doi:10.1016/j.ijforecast.2012.09.002
y <- rnorm(100,10,2) ourForecast <- rep(mean(y[1:90]),10) measures(y[91:100],ourForecast,y[1:90],digits=5)
nparam()
returns the number of estimated parameters in the model,
while nvariate()
returns number of variates for the response
variable.
nparam(object, ...) nvariate(object, ...)
object |
Time series model. |
... |
Some other parameters passed to the method. |
nparam() is a very basic and simple function which does what it says: extracts the number of estimated parameters in the model. nvariate() returns the number of variates (dimensions, columns) of the response variable (1 for the univariate regression).
Both functions return numeric values.
Ivan Svetunkov, [email protected]
### Simple example xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") ourModel <- lm(y~.,data=as.data.frame(xreg)) nparam(ourModel) nvariate(ourModel)
Function detects outliers and creates a matrix with dummy variables. Only point outliers are considered (no level shifts).
outlierdummy(object, ...) ## Default S3 method: outlierdummy(object, level = 0.999, type = c("rstandard", "rstudent"), ...) ## S3 method for class 'alm' outlierdummy(object, level = 0.999, type = c("rstandard", "rstudent"), ...)
object |
Model estimated using one of the functions of smooth package. |
... |
Other parameters. Not used yet. |
level |
Confidence level to use. Everything lying outside the bounds constructed based on it is flagged as an outlier. |
type |
Type of residuals to use: either standardised or studentised. Ignored if a count distribution is used. |
The detection is done based on the type of distribution used and the confidence level specified by the user.
The class "outlierdummy", which contains the list:
outliers - the matrix with the dummy variables, flagging outliers;
statistic - the value of the statistic for the normalised variable;
id - the ids of the outliers (which observations have them);
level - the confidence level used in the process;
type - the type of the residuals used;
errors - the errors used in the detection. In case of count distributions, probabilities are returned.
Ivan Svetunkov, [email protected]
# Generate the data with S distribution xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rs(100,0,3),xreg) colnames(xreg) <- c("y","x1","x2") # Fit the normal distribution model ourModel <- alm(y~x1+x2, xreg, distribution="dnorm") # Detect outliers xregOutlierDummy <- outlierdummy(ourModel)
This function returns a vector of AIC values for the in-sample observations
pAIC(object, ...) pAICc(object, ...) pBIC(object, ...) pBICc(object, ...)
object |
Time series model. |
... |
Other parameters passed to the method. |
This is based on the pointLik function. The formula for this is: pAIC_t = 2 * k - 2 * T * l_t, where k is the number of parameters, T is the number of observations and l_t is the point likelihood. This way we preserve the property that AIC = mean(pAIC).
The function returns the vector of point AIC values.
Ivan Svetunkov, [email protected]
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") ourModel <- alm(y~x1+x2,as.data.frame(xreg)) pAICValues <- pAIC(ourModel) mean(pAICValues) AIC(ourModel)
Function calculates partial correlations between the provided variables
pcor(x, y = NULL, use = c("na.or.complete", "complete.obs", "everything", "all.obs"), method = c("pearson", "spearman", "kendall"))
x |
Either data.frame or a matrix with numeric values. |
y |
The numerical variable. |
use |
What observations to use. See cor function for details.
The only option that is not available here is "pairwise.complete.obs". |
method |
Which method to use for the calculation of the partial correlations. This can be either Pearson's, Spearman's or Kendall's coefficient. See cor for details. |
The calculation is done based on multiple linear regressions. The function calculates them for each pair of variables based on the residuals of linear models of those variables from the other variables in the dataset.
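For example, the partial correlation between two variables, given all the others, can be reproduced by correlating the residuals of the two auxiliary regressions:

# Partial correlation of mpg and cyl in mtcars, controlling for the rest:
# regress each of them on all the other variables and correlate the residuals
residualsMpg <- residuals(lm(mpg ~ . - cyl, data=mtcars))
residualsCyl <- residuals(lm(cyl ~ . - mpg, data=mtcars))
cor(residualsMpg, residualsCyl)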
The following list of values is returned:
value - Matrix of the coefficients of partial correlations;
p.value - The p-values for the parameters;
method - The method used in the calculations.
Ivan Svetunkov, [email protected]
pcor(mtcars)
The function returns the value from the pinball function for the specified level and the type of loss
pinball(holdout, forecast, level, loss = 1, na.rm = TRUE)
holdout |
The vector or matrix of the holdout values. |
forecast |
The forecast of a distribution (e.g. quantile or expectile). It should be the same length as the holdout. |
level |
The level associated with the forecast (e.g. level of quantile). |
loss |
The type of loss to use. The number which corresponds to L1, L2 etc. L1 implies the loss for quantiles, while L2 is for the expectile. |
na.rm |
Logical, defining whether to remove the NAs from the provided data or not. |
The function returns the scalar value.
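For loss=1, the returned value corresponds to the conventional quantile (pinball) loss, which can be sketched as follows (an illustration; the exact scaling in the package may differ):

# Quantile (pinball) loss for a given level
pinballSketch <- function(holdout, forecast, level){
    mean((holdout - forecast) * (level - (holdout < forecast)))
}
pinballSketch(rnorm(10, 10, 2), rep(10, 10), level=0.975)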
Ivan Svetunkov, [email protected]
# An example with mtcars data ourModel <- alm(mpg~., mtcars[1:30,], distribution="dnorm") # Produce predictions with the interval ourForecast <- predict(ourModel, mtcars[-c(1:30),], interval="p") # Pinball with the L1 (quantile value) pinball(mtcars$mpg[-c(1:30)],ourForecast$upper,level=0.975,loss=1) pinball(mtcars$mpg[-c(1:30)],ourForecast$lower,level=0.025,loss=1) # Pinball with the L2 (expectile value) pinball(mtcars$mpg[-c(1:30)],ourForecast$upper,level=0.975,loss=2) pinball(mtcars$mpg[-c(1:30)],ourForecast$lower,level=0.025,loss=2)
The function produces diagnostics plots for a greybox
model
## S3 method for class 'greybox' plot(x, which = c(1, 2, 4, 6), level = 0.95, legend = FALSE, ask = prod(par("mfcol")) < length(which) && dev.interactive(), lowess = TRUE, ...)
x |
Estimated greybox model. |
which |
Which of the plots to produce. The possible options (see details for explanations):
|
level |
Confidence level. Defines width of confidence interval. Used in plots (2), (3), (7), (8), (9), (10) and (11). |
legend |
If |
ask |
Logical; if |
lowess |
Logical; if |
... |
The parameters passed to the plot functions. Recommended to use with separate plots. |
The list of produced plots includes:
Actuals vs Fitted values. Allows analysing whether there are any issues in the fit. Does the variability of actuals increase with the increase of fitted values? Is the relation well captured? The grey line on the plot corresponds to the perfect fit of the model.
Standardised residuals vs Fitted. Plots the points and the confidence bounds
(red lines) for the specified confidence level
. Useful for the analysis of outliers;
Studentised residuals vs Fitted. This is similar to the previous plot, but with the residuals divided by the scales with the leave-one-out approach. Should be more sensitive to outliers;
Absolute residuals vs Fitted. Useful for the analysis of heteroscedasticity;
Squared residuals vs Fitted - similar to (3), but with squared values;
Q-Q plot with the specified distribution. Can be used in order to see if the
residuals follow the assumed distribution. The type of distribution depends on the one used
in the estimation (see distribution
parameter in alm);
Fitted over time. Plots actuals (black line), fitted values (purple line) and
prediction interval (red lines) of width level
, but only in the case when there
are some values lying outside of it. Can be used in order to make sure that the model
did not miss any important events over time;
Standardised residuals vs Time. Useful if you want to see whether there is autocorrelation or heteroscedasticity over time. This also shows when the outliers happen;
Studentised residuals vs Time. Similar to previous, but with studentised residuals;
ACF of the residuals. Are the residuals autocorrelated? See acf for details;
PACF of the residuals. No, really, are they autocorrelated? See pacf for details;
Cook's distance over time. Shows influential observations. The 0.5, 0.75 and 0.95 quantile lines from the F distribution are also plotted. If a value lies above them, then the respective observation is influential. This does not work well for non-normal distributions;
Absolute standardised residuals vs Fitted. Similar to the previous, but with absolute values. This is more relevant to the models where scale is calculated as an absolute value of something (e.g. Laplace);
Squared standardised residuals vs Fitted. This is an additional plot needed to diagnose
heteroscedasticity in a model with varying scale. The variance on this plot will be constant if
the adequate model for scale
was constructed. This is more appropriate for normal and
the related distributions.
Which of the plots to produce is specified via the which parameter. The plots 2, 3, 7, 8 and 9 also use the parameter level, which specifies the confidence level for the intervals.
The function produces the number of plots specified in the parameter which.
Ivan Svetunkov, [email protected]
xreg <- cbind(rlaplace(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rlaplace(100,0,3),xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") ourModel <- alm(y~x1+x2, xreg, distribution="dnorm") par(mfcol=c(4,4)) plot(ourModel, c(1:14))
This function returns a vector of logarithms of likelihoods for each observation
pointLik(object, log = TRUE, ...)
object |
Time series model. |
log |
Whether to take logarithm of likelihoods. |
... |
Other parameters passed to the method. |
Instead of taking the expected log-likelihood for the whole series, this function calculates the individual value for each separate observation. Note that these values are biased, so you would possibly need to take the number of degrees of freedom into account in order to have an unbiased estimator.
This value is based on the general likelihood (not its concentrated version), so the sum of these values may slightly differ from the output of logLik.
This function returns a vector.
Ivan Svetunkov, [email protected]
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") ourModel <- alm(y~x1+x2,as.data.frame(xreg)) pointLik(ourModel) # Bias correction pointLik(ourModel) - nparam(ourModel) # Bias correction in AIC style 2*(nparam(ourModel)/nobs(ourModel) - pointLik(ourModel)) # BIC calculation based on pointLik log(nobs(ourModel))*nparam(ourModel) - 2*sum(pointLik(ourModel))
The function accepts two vectors with the parameters for the polynomials and returns the vector of parameters after their multiplication. This can be especially useful when working with ARIMA models.
polyprod(x, y)
x |
The vector of parameters of the first polynomial. |
y |
The vector of parameters of the second polynomial. |
The function returns a matrix with one column with the parameters for the polynomial, starting from the 0-order.
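Since polynomial multiplication is a convolution of the coefficient vectors, the result can be cross-checked with base R:

# Coefficients of (1 - 2B - B^2) and (1 + 0.5B + 0.3B^2), 0-order first;
# the convolution should match polyprod(x, y)
x <- c(1, -2, -1)
y <- c(1, 0.5, 0.3)
convolve(x, rev(y), type="open")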
Ivan Svetunkov, [email protected]
## Not run: polyprod(c(1,-2,-1),c(1,0.5,0.3))
The functions allow producing forecasts based on the provided model and newdata.
## S3 method for class 'alm' predict(object, newdata = NULL, interval = c("none", "confidence", "prediction"), level = 0.95, side = c("both", "upper", "lower"), occurrence = NULL, ...) ## S3 method for class 'greybox' predict(object, newdata = NULL, interval = c("none", "confidence", "prediction"), level = 0.95, side = c("both", "upper", "lower"), ...) ## S3 method for class 'scale' predict(object, newdata = NULL, interval = c("none", "confidence", "prediction"), level = 0.95, side = c("both", "upper", "lower"), ...) ## S3 method for class 'greybox' forecast(object, newdata = NULL, h = NULL, ...) ## S3 method for class 'alm' forecast(object, newdata = NULL, h = NULL, ...)
object |
Time series model for which forecasts are required. |
newdata |
The new data needed in order to produce forecasts. |
interval |
Type of intervals to construct: either "confidence" or "prediction". Can be abbreviated |
level |
Confidence level. Defines width of prediction interval. |
side |
What type of interval to produce: |
occurrence |
If occurrence was provided, then a user can provide a vector of future values via this variable. |
... |
Other arguments passed to |
h |
The forecast horizon. |
predict
produces predictions for the provided model and newdata
. If
newdata
is not provided, then the data from the model is extracted and the
fitted values are reproduced. This might be useful when confidence / prediction
intervals are needed for the in-sample values.
forecast
function produces forecasts for h
steps ahead. There are four
scenarios in this function:
If the newdata
is not provided, then it will produce forecasts of the
explanatory variables to the horizon h
(using es
from smooth package
or using Naive if smooth
is not installed) and use them as newdata
.
If h
and newdata
are provided, then the number of rows to use
will be regulated by h
.
If h
is NULL
, then it is set equal to the number of rows in
newdata
.
If both h
and newdata
are not provided, then it will use the
data from the model itself, reproducing the fitted values.
After forming the newdata
the forecast
function calls
predict
, so you can provide parameters interval
, level
and
side
in the call for forecast
.
predict.greybox()
returns object of class "predict.greybox",
which contains:
model
- the estimated model.
mean
- the expected values.
fitted
- fitted values of the model.
lower
- lower bound of prediction / confidence intervals.
upper
- upper bound of prediction / confidence intervals.
level
- confidence level.
newdata
- the data provided in the call to the function.
variances
- conditional variance for the holdout sample.
In case of interval="prediction"
includes variance of the error.
predict.alm()
is based on predict.greybox()
and returns
object of class "predict.alm", which in addition contains:
location
- the location parameter of the distribution.
scale
- the scale parameter of the distribution.
distribution
- name of the fitted distribution.
forecast()
functions return the same "predict.alm" and
"predict.greybox" classes, with the same set of output variables.
Ivan Svetunkov, [email protected]
xreg <- cbind(rlaplace(100,10,3),rnorm(100,50,5)) xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rlaplace(100,0,3),xreg,rnorm(100,300,10)) colnames(xreg) <- c("y","x1","x2","Noise") inSample <- xreg[1:80,] outSample <- xreg[-c(1:80),] ourModel <- alm(y~x1+x2, inSample, distribution="dlaplace") predict(ourModel,outSample) predict(ourModel,outSample,interval="c") plot(predict(ourModel,outSample,interval="p")) plot(forecast(ourModel,h=10,interval="p"))
RMCB stands for "Regression for Multiple Comparison with the Best", referring to the comparison of forecasting methods. This is a regression-based version of the Nemenyi / MCB test (Demsar, 2006), which relies on the ranks of variables. It transforms the data into ranks and then constructs a regression on them of the type:
rmcb(data, level = 0.95, outplot = c("mcb", "lines", "none"), select = NULL, ...) ## S3 method for class 'rmcb' plot(x, outplot = c("mcb", "lines"), select = NULL, ...)
data |
Matrix or data frame with observations in rows and variables in columns. |
level |
The width of the confidence interval. Default is 0.95. |
outplot |
What type of plot to use after the calculations. This can be
either "MCB" ( |
select |
What column of data to highlight on the plot. If NULL, then the method with the lowest value is selected. |
... |
Other parameters passed to rank function. |
x |
The produced rmcb model. |
y = b' X + e,
where y is the vector of the ranks of the provided data (as.vector(data)), X is the matrix of dummy variables for each column of the data (forecasting method), b is the vector of coefficients for the dummies and e is the error term of the model. Given that the data is ranked, the function tests the differences in medians between the methods and then produces plots based on that.
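A sketch of this regression in base R terms (illustrative, not the exact internal code):

# ourData: observations in rows, methods in columns
ourData <- matrix(rnorm(200,0,1), 50, 4,
                  dimnames=list(NULL, paste0("Method",1:4)))
ranks <- t(apply(ourData, 1, rank))
rankData <- data.frame(y=as.vector(ranks),
                       method=factor(rep(colnames(ourData), each=nrow(ourData))))
# Regress the ranks on the method dummies; confidence intervals of the
# parameters then give the MCB-style bounds
fit <- lm(y ~ method, data=rankData)
confint(fit)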
There is also a plot()
method that allows producing either "mcb" or "lines"
style of plot. This can be regulated via plot(x, outplot="lines")
.
If outplot!="none"
, then the function plots the results after all
the calculations using the plot.rmcb() function.
Function returns a list of a class "rmcb", which contains the following variables:
mean Mean values for each method.
interval Confidence intervals for each method.
vlines Coordinates used for outplot="l", marking the groups of methods.
groups The table containing the groups. TRUE
- methods are in the
same group, FALSE
- they are not.
methods Similar to group
parameter, but with a slightly different
presentation.
p.value p-value for the test of the significance of the model. This is the value from the F test of the linear regression.
level Confidence level.
model lm model produced for the calculation of the intervals.
outplot Style of the plot to produce.
select The selected variable to highlight.
Ivan Svetunkov, [email protected]
Demsar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7, 1-30. https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf
N <- 50 M <- 4 ourData <- matrix(rnorm(N*M,mean=0,sd=1), N, M) ourData[,2] <- ourData[,2]+4 ourData[,3] <- ourData[,3]+3 ourData[,4] <- ourData[,4]+2 colnames(ourData) <- c("Method A","Method B","Method C - long name","Method D") ourTest <- rmcb(ourData, level=0.95) # See the mean ranks: ourTest$mean # The same is for the intervals: ourTest$interval # You can also reproduce plots in different styles: plot(ourTest, outplot="lines") # Or you can use the default "mcb" style and set additional parameters for the plot(): par(mar=c(2,2,4,0)+0.1) plot(ourTest, main="Four methods")
The function applies the rolling origin evaluation to any forecasting function.
ro(data, h = 10, origins = 10, call, value = NULL, ci = FALSE, co = TRUE,
   silent = TRUE, parallel = FALSE, ...)
data | Data vector or ts object with the response variable passed to the function.
h | The forecasting horizon.
origins | The number of rolling origins.
call | The call that is passed to the function. The call must be in quotes, e.g. "predict(arima(x=data,order=c(0,1,1)),n.ahead=h)". Here data marks where the sample is inserted and h marks where the horizon goes.
value | The variable or set of variables returned by the call that need to be recorded, e.g. "pred".
ci | Defines whether the in-sample window size should be constant. If TRUE, then with each origin the sample starts one observation later, keeping the in-sample size fixed.
co | Defines whether the holdout sample window size should be constant. If TRUE, then each holdout contains exactly h observations.
silent | If TRUE, then no output about the progress is printed in the console.
parallel | If TRUE, then the rolling origin is done in parallel (see the details below on the required packages).
... | This is temporary and is needed in order to capture the "silent" parameter if it is provided.
This function produces rolling origin forecasts using the data and the call passed as parameters. The function can do all of that either in serial or in parallel, but it needs the foreach package and either doMC (Linux only) or doParallel installed in order to do the latter.

This is a dangerous function, so be careful with the call that you pass to it, and make sure that it is well formulated before the execution. Also, do not forget to provide the value that needs to be returned, or you might end up with very messy results.

For more details and more examples of usage, please see the vignette for the function. In order to do that, just run the command: vignette("ro","greybox")
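To make the logic concrete, here is a minimal manual sketch of what one rolling origin evaluation with a constant holdout does (hypothetical values of h and origins; this illustrates the principle, not the implementation of ro()):

y <- rnorm(100,0,1)
h <- 5
origins <- 5
for(origin in 1:origins){
    # The in-sample window grows by one observation with each origin
    trainEnd <- length(y) - h - origins + origin
    ourFit <- arima(y[1:trainEnd], order=c(0,1,1))
    ourForecast <- predict(ourFit, n.ahead=h)$pred
    # Compare ourForecast with y[trainEnd+(1:h)] to measure accuracy per origin
}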
Function returns the following variables:

actuals - the data provided to the function.
holdout - the matrix of actual values corresponding to the produced forecasts from each origin.
value - the matrices / array / lists with the produced data from each origin. The name of each object corresponds to the names in the parameter value.
Yves Sagaert
Ivan Svetunkov, [email protected]
Tashman, L. J. (2000). Out-of-sample tests of forecasting accuracy: an analysis and review. International Journal of Forecasting, 16(4), 437-450. doi:10.1016/S0169-2070(00)00065-0.
y <- rnorm(100,0,1)
ourCall <- "predict(arima(x=data,order=c(0,1,1)),n.ahead=h)"
# NOTE that "data" needs to be used in the call, not "y".
# This way we tell the function where "y" should be used in the call.

# The default call and values
ourValue <- "pred"
ourRO <- ro(y, h=5, origins=5, ourCall, ourValue)

# We can now plot the results of this evaluation:
plot(ourRO)

# You can also use the dollar sign
ourValue <- "$pred"
# And you can have a constant in-sample size
ro(y, h=5, origins=5, ourCall, ourValue, ci=TRUE)

# You can ask for several values
ourValue <- c("pred","se")
# And you can have a constant holdout size
ro(y, h=5, origins=20, ourCall, ourValue, ci=TRUE, co=TRUE)

#### The following code will give exactly the same result as above,
#### but computed in parallel using all but 1 core of the CPU:
## Not run:
ro(y, h=5, origins=20, ourCall, ourValue, ci=TRUE, co=TRUE, parallel=TRUE)
## End(Not run)

#### If you want to use functions from the forecast package, please note that you need to
#### set the values that need to be returned explicitly. There are two options for this.
# Example 1:
## Not run:
ourCall <- "forecast(ets(data), h=h, level=95)"
ourValue <- c("mean", "lower", "upper")
ro(y,h=5,origins=5,ourCall,ourValue)
## End(Not run)

# Example 2:
## Not run:
ourCall <- "forecast(ets(data), h=h, level=c(80,95))"
ourValue <- c("mean", "lower[,1]", "upper[,1]", "lower[,2]", "upper[,2]")
ro(y,h=5,origins=5,ourCall,ourValue)
## End(Not run)

#### A more complicated example using the for loop and
#### several time series
x <- matrix(rnorm(120*3,0,1), 120, 3)

## Form an array for the forecasts we will produce
## We will have 4 origins with 6-steps ahead forecasts
ourForecasts <- array(NA,c(6,4,3))

## Define models that need to be used for each series
ourModels <- list(c(0,1,1), c(0,0,1), c(0,1,0))

## This call uses specific models for each time series
ourCall <- "predict(arima(data, order=ourModels[[i]]), n.ahead=h)"
ourValue <- "pred"

## Start the loop. The important thing here is to use the same variable 'i' as in ourCall.
for(i in 1:3){
    ourData <- x[,i]
    ourForecasts[,,i] <- ro(data=ourData,h=6,origins=4,call=ourCall,
                            value=ourValue,co=TRUE,silent=TRUE)$pred
}

## ourForecasts array now contains rolling origin forecasts from specific
## models.

##### An example with exogenous variables
x <- rnorm(100,0,1)
xreg <- matrix(rnorm(200,0,1),100,2,dimnames=list(NULL,c("x1","x2")))

## 'counti' is used to define in-sample size of xreg,
## 'counto' - the size of the holdout sample of xreg
ourCall <- "predict(arima(x=data, order=c(0,1,1), xreg=xreg[counti,,drop=FALSE]), n.ahead=h, newxreg=xreg[counto,,drop=FALSE])"
ourValue <- "pred"
ro(x,h=5,origins=5,ourCall,ourValue)

##### Poisson regression with alm
x <- rpois(100,2)
xreg <- cbind(x,matrix(rnorm(200,0,1),100,2,dimnames=list(NULL,c("x1","x2"))))
ourCall <- "predict(alm(x~., data=xreg[counti,,drop=FALSE], distribution='dpois'), newdata=xreg[counto,,drop=FALSE])"
ourValue <- "mean"
testRO <- ro(xreg[,1],h=5,origins=5,ourCall,ourValue,co=TRUE)
plot(testRO)

## 'countf' is used to take xreg of the size corresponding to the whole
## sample on each iteration
## This is useful when working with functions from the smooth package.
## The following call will return the forecasts from the es() function of smooth.
## Not run:
ourCall <- "es(data=data, h=h, xreg=xreg[countf,,drop=FALSE])"
ourValue <- "forecast"
ro(x,h=5,origins=5,ourCall,ourValue)
## End(Not run)
This method produces a model for the scale of the distribution for the provided pre-estimated model. The model can be estimated either via lm or alm.
sm(object, ...)

## Default S3 method:
sm(object, formula = NULL, data = NULL, parameters = NULL, ...)

## S3 method for class 'lm'
sm(object, formula = NULL, data = NULL, parameters = NULL, ...)

## S3 method for class 'alm'
sm(object, formula = NULL, data = NULL, parameters = NULL, ...)
object | The pre-estimated alm or lm model.
... | Other parameters to pass to the method, including those explained in alm (e.g. parameters for the optimiser).
formula | The formula for scale. It should start with ~ and contain all variables that should impact the scale.
data | The data on which the scale model needs to be estimated. If not provided, then the data used in the object is reused.
parameters | The parameters to use in the model. Only needed if you know the parameters in advance or want to test yours.
This function is useful when you suspect heteroscedasticity in your model and want to fit a model for the scale of the pre-specified distribution. This function is complementary to lm or alm.
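As a rough intuition for what a scale model captures, the sketch below regresses the log of squared residuals of a location model on the explanatory variables. This is only an illustration of the principle under assumed normal errors, not what sm() does internally:

xreg <- data.frame(x1=rnorm(100,10,3), x2=rnorm(100,50,5))
xreg$y <- 100 + 0.5*xreg$x1 - 0.75*xreg$x2 + sqrt(exp(0.8+0.2*xreg$x1))*rnorm(100,0,1)
# Location model for the mean
locationModel <- lm(y~x1+x2, xreg)
# Rough scale sketch: which variables drive the variance of the residuals?
scaleSketch <- lm(log(residuals(locationModel)^2)~x1+x2, xreg)
summary(scaleSketch)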
Ivan Svetunkov, [email protected]
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+sqrt(exp(0.8+0.2*xreg[,1]))*rnorm(100,0,1),
              xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")

# Estimate the location model
ourModel <- alm(y~.,xreg)
# Estimate the scale model
ourScale <- sm(ourModel,formula=~x1+x2)
# Summary of the scale model
summary(ourScale)
Function constructs the plots depending on the types of variables in the provided matrix / data frame.
spread(data, histograms = FALSE, log = FALSE, lowess = FALSE, ...)
data | Either matrix or data frame with the data.
histograms | If TRUE, then histograms and barplots are produced on the diagonal of the plot matrix.
log | If TRUE, then logarithms are applied to the numerical variables before plotting.
lowess | If TRUE, then LOWESS lines are added to the scatterplots.
... | Other parameters passed to the plot function. Currently only the "main" parameter is accepted.
If both variables are in metric scale, then the classical scatterplot is constructed. If one of them is either integer (up to 10 values) or categorical (aka 'factor'), then boxplots (with grey dots corresponding to mean values) are constructed. Finally, for the two categorical variables the tableplot is returned (see tableplot function for the details). All of this is packed in a matrix.
See details in the vignette "Marketing analytics with greybox":
vignette("maUsingGreybox","greybox")
Function does not return anything. It just plots things.
Ivan Svetunkov, [email protected]
### Simple example
spread(mtcars)
spread(mtcars,log=TRUE)
Function selects variables that give a linear regression with the lowest information criterion. The selection is done stepwise (forward) based on partial correlations. This should be a simpler and faster implementation than the step() function from the 'stats' package.
stepwise(data, ic = c("AICc", "AIC", "BIC", "BICc"), silent = TRUE,
         df = NULL, formula = NULL, subset = NULL,
         method = c("pearson", "kendall", "spearman"),
         distribution = c("dnorm", "dlaplace", "ds", "dgnorm", "dlogis", "dt",
                          "dalaplace", "dlnorm", "dllaplace", "dls", "dlgnorm",
                          "dbcnorm", "dinvgauss", "dgamma", "dexp", "dfnorm",
                          "drectnorm", "dpois", "dnbinom", "dbeta", "dlogitnorm",
                          "plogis", "pnorm"),
         occurrence = c("none", "plogis", "pnorm"), ...)
data | Data frame containing the dependent variable in the first column and the others in the rest.
ic | Information criterion to use.
silent | If TRUE, then nothing is printed out during the selection.
df | Number of degrees of freedom to add (should be used if stepwise is used on residuals).
formula | If provided, then the selection will be done from the listed variables in the formula after all the necessary transformations.
subset | An optional vector specifying a subset of observations to be used in the fitting process.
method | Method of correlations calculation. The default is Pearson's correlation, which should be applicable to a wide range of data in different scales.
distribution | Distribution to pass to alm(). See alm for details.
occurrence | What distribution to use for the occurrence part. See alm for details.
... | This is temporary and is needed in order to capture the "silent" parameter if it is provided.
The algorithm uses alm() to fit different models and cor() to select the next regressor in the sequence.
Some details and examples of application are also given in the vignette "Greybox": vignette("greybox","greybox")
Function returns model - the final model of the class "alm". See alm for details of the output.
Ivan Svetunkov, [email protected]
Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference. A Practical Information-Theoretic Approach. Springer-Verlag New York. doi:10.1007/b97636.

McQuarrie, A. D. (1999). A small-sample correction for the Schwarz SIC model selection criterion. Statistics & Probability Letters, 44(1), 79-86. doi:10.1016/S0167-7152(98)00294-6.
### Simple example
xreg <- cbind(rnorm(100,10,3),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
stepwise(xreg)

### Mixture distribution of Log Normal and Cumulative Logit
xreg[,1] <- xreg[,1] * round(exp(xreg[,1]-70) / (1 + exp(xreg[,1]-70)),0)
colnames(xreg) <- c("y","x1","x2","Noise")
ourModel <- stepwise(xreg, distribution="dlnorm",
                     occurrence=stepwise(xreg, distribution="plogis"))
summary(ourModel)

### Fat regression example
xreg <- matrix(rnorm(20000,10,3),100,200)
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rnorm(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y",paste0("x",c(1:200)),"Noise")
ourModel <- stepwise(xreg,ic="AICc")
plot(ourModel$ICs,type="l",ylim=range(min(ourModel$ICs),max(ourModel$ICs)+5))
points(ourModel$ICs)
text(c(1:length(ourModel$ICs))+0.1,ourModel$ICs+5,names(ourModel$ICs))
Function constructs a plot for two categorical variables based on the table() function.
tableplot(x, y = NULL, labels = TRUE, legend = FALSE, points = TRUE, ...)
x | First categorical variable. Can be either vector, factor, matrix or a data frame. If x is a matrix or a data frame, then the first two variables are plotted against each other.
y | Second categorical variable. If not provided, then only x is plotted.
labels | Whether to print table labels inside the plot or not.
legend | If TRUE, then a colour legend is produced for the plot.
points | Whether to plot points in the areas. They help in understanding how many values lie in specific categories.
... | Other parameters passed to the plot function.
The function produces the plot of the table() function with colour densities corresponding to the respective frequencies of appearance. If a value appears more often than another (e.g. 0.5 vs 0.15), then its area will be darker. The frequency of 0 corresponds to the white colour, the frequency of 1 corresponds to the black one.
See details in the vignette "Marketing analytics with greybox":
vignette("maUsingGreybox","greybox")
Function does not return anything. It just plots things.
Ivan Svetunkov, [email protected]
tableplot(mtcars$am, mtcars$gear)
Function generates the matrix of dummy variables for the months / weeks / days / hours / minutes / seconds of year / month / week / day / hour / minute.
temporaldummy(object, ...)

## Default S3 method:
temporaldummy(object, type = c("month", "quarter", "week", "day", "hour",
    "halfhour", "minute", "second"), of = c("year", "quarter", "month",
    "week", "day", "hour", "minute"), factors = FALSE, h = 0, ...)

## S3 method for class 'ts'
temporaldummy(object, type = c("month", "quarter", "week", "day", "hour",
    "halfhour", "minute", "second"), of = c("year", "quarter", "month",
    "week", "day", "hour", "minute"), factors = FALSE, h = 0, ...)

## S3 method for class 'Date'
temporaldummy(object, type = c("month", "quarter", "week", "day", "hour",
    "halfhour", "minute", "second"), of = c("year", "quarter", "month",
    "week", "day", "hour", "minute"), factors = FALSE, h = 0, ...)

## S3 method for class 'POSIXt'
temporaldummy(object, type = c("month", "quarter", "week", "day", "hour",
    "halfhour", "minute", "second"), of = c("year", "quarter", "month",
    "week", "day", "hour", "minute"), factors = FALSE, h = 0, ...)

## S3 method for class 'zoo'
temporaldummy(object, type = c("month", "quarter", "week", "day", "hour",
    "halfhour", "minute", "second"), of = c("year", "quarter", "month",
    "week", "day", "hour", "minute"), factors = FALSE, h = 0, ...)
object | Either a ts / msts / zoo / xts / tsibble object or a vector of dates.
... | Other parameters.
type | Specifies what type of frequency to produce. For example, type="month" together with of="year" produces dummies for the months of the year.
of | Specifies the frequency of what is needed. Used together with type, as in the example above.
factors | If TRUE, the function returns a categorical variable (factor) instead of the matrix of dummy variables.
h | If not zero, the dummy variables are also produced for the additional h observations ahead (the forecast horizon).
The function extracts dates from the provided object and returns a matrix with dummy variables for the specified frequency type, with the number of rows equal to the length of the object + the specified horizon. If a numeric vector is provided, then the function will produce dummies based on typical values (e.g. 30 days in a month), so it is recommended to use proper classes with this method.
Several notes on how the dummies are calculated in some special cases:

In case of weeks of years, the first week is defined according to ISO 8601.

Note that not all the combinations of type and of are supported. For example, there is no such thing as dummies for months of week. Also note that some combinations are not very useful and would take a lot of memory (e.g. minutes of year).

The function will return all the dummy variables. If you want to avoid the dummy variables trap, you will need to exclude one of them manually (see the sketch below).

If you want to have a different type of dummy variables, let me know, I will implement it.
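For example, dropping the first column gives a reference category and avoids the trap (a small sketch with assumed monthly data; the exact column order depends on your object):

# Monthly series: drop the first dummy so that it becomes the reference month
x <- ts(rnorm(36,100,1), frequency=12)
monthDummies <- temporaldummy(x)[,-1]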
One of the two is returned, depending on the value of the factors variable:

factors=FALSE: Class "dgCMatrix" with all the dummy variables is returned in case of a numeric variable. Feel free to drop one (making it a reference variable) or convert the object into a matrix (this will consume more memory than the returned class). In other cases an object of the same class as the provided one is returned.

factors=TRUE: The categorical variable (factor) containing specific values for each observation.
Ivan Svetunkov, [email protected]
xregExpander, xregMultiplier, outlierdummy
# Generate matrix with dummies for a ts object
x <- ts(rnorm(100,100,1),frequency=12)
temporaldummy(x)

# Generate matrix with monthly dummies for a vector of dates
x <- as.Date("2003-01-01")+0:99
temporaldummy(x, type="month", of="year", h=10)
Function expands the provided matrix or vector of variables, producing values with lags and leads specified by the lags variable.
xregExpander(xreg, lags = c(-frequency(xreg):frequency(xreg)), silent = TRUE,
             gaps = c("auto", "NAs", "zero", "naive", "extrapolate"))
xreg | Vector / matrix / data.frame, containing variables that need to be expanded. In case of vector / matrix it is recommended to provide a ts object, so the frequency of the data is taken into account.
lags | Vector of lags / leads that we need to have. Negative values mean lags, positive ones mean leads.
silent | If TRUE, then nothing is printed out in the console.
gaps | Defines how to fill in the gaps in the data: "auto", "NAs", "zero", "naive" or "extrapolate".
This function could be handy when you want to check if lags and leads of a variable influence the dependent variable. It can be used together with xregDo="select" in adam, es, ces and ssarima. All the missing values in the beginning and at the end of lagged series are substituted by mean forecasts produced using adam.
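As a quick illustration of the lags argument (hypothetical values; the returned columns follow an xLag / xLead naming pattern, but check the output on your data):

x <- ts(rnorm(24,100,1), frequency=12)
# Two lags and one lead of the variable, alongside the original one
xregExpander(x, lags=c(-2,-1,1))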
ts matrix with the expanded variables is returned.
Ivan Svetunkov, [email protected]
# Create matrix of two variables, make it a ts object and expand it
x <- cbind(rnorm(100,100,1),rnorm(100,50,3))
x <- ts(x,frequency=12)
xregExpander(x)
Function generates the cross-products of the provided exogenous variables.
xregMultiplier(xreg, silent = TRUE)
xreg | Matrix or data.frame, containing variables that need to be expanded. This matrix needs to contain at least two columns.
silent | If TRUE, then nothing is printed out in the console.
This function might be useful if you have several variables and want to introduce their cross-products. This might be useful when introducing the interactions between dummy and continuous variables.
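To see what the cross-products mean, consider a dummy and a continuous variable: their element-wise product switches the slope on and off (a small sketch; the exact column names in the output may differ):

d <- rep(c(0,1), 50)
x <- rnorm(100, 10, 2)
# Returns the original columns together with their cross-product d*x
xregMultiplier(cbind(d=d, x=x))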
ts matrix with the transformed and the original variables is returned.
Ivan Svetunkov, [email protected]
es, stepwise, xregExpander, xregTransformer
# Create matrix of two variables and expand it
x <- cbind(rnorm(100,100,1),rnorm(100,50,3))
xregMultiplier(x)
Function transforms each variable in the provided matrix or vector, producing non-linear values, depending on the selected pool of functions.
xregTransformer(xreg, functions = c("log", "exp", "inv", "sqrt", "square"),
                silent = TRUE)
xreg | Vector / matrix / data.frame, containing variables that need to be expanded. In case of vector / matrix it is recommended to provide a ts object, so the frequency of the data is taken into account.
functions | Vector of names of the functions used for the transformations.
silent | If TRUE, then nothing is printed out in the console.
This function could be useful when you want to automatically select the necessary transformations of the variables. It can be used together with xregDo="select" in es, ces, gum and ssarima. However, this might be dangerous, as it can lead to overfitting. So be reasonable when you produce the transformed variables.
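A quick illustration of the pool of transformations (assuming "inv" stands for 1/x and "square" for x^2; a sketch using positive data so that the logarithm is defined):

x <- abs(rnorm(100, 10, 2))
# Original variable plus its logarithm and inverse
head(xregTransformer(x, functions=c("log","inv")))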
ts matrix with the transformed and the original variables is returned.
Ivan Svetunkov, [email protected]
# Create matrix of two variables and transform it
x <- cbind(rnorm(100,100,1),rnorm(100,50,3))
xregTransformer(x)