ALM stands for “Augmented Linear
Model”. The word “augmented” is used to reflect that the model
introduces aspects that extend beyond the basic linear model. In some
special cases alm()
resembles the glm()
function from stats package, but with a higher focus on forecasting
rather than on hypothesis testing. You will not get p-values anywhere from the alm() function and won't see R2 in the outputs. The most you can count on is confidence intervals for the parameters or for the regression line. The other important difference
from glm()
is the availability of distributions that are
not supported by glm()
(for example, Folded Normal or
Box-Cox Normal distributions), and it allows optimising non-standard parameters (e.g. α in the Asymmetric Laplace distribution). Finally, alm()
supports
different loss functions via the loss
parameter, so you can
estimate parameters of your model via, for example, likelihood
maximisation or via minimisation of MSE / MAE, using LASSO / RIDGE, or by minimising a loss provided by the user.
Although alm()
supports various loss functions, the core
of the function is the likelihood approach. By default the estimation of
parameters in the model is done via the maximisation of likelihood
function of a selected distribution. The standard errors are calculated based on the Hessian of the distribution. And at the centre of all of that are information criteria that can be used for model comparison.
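For example, two models estimated via likelihood can be compared via their information criteria (the models below are fitted to the built-in mtcars data purely for illustration):
almSmall <- alm(mpg~wt, mtcars, distribution="dnorm")
almLarge <- alm(mpg~wt+hp+am, mtcars, distribution="dnorm")
# The model with the lower value of the criterion is preferred
AICc(almSmall)
AICc(almLarge)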
This vignette contains the following sections:
All the supported distributions have specific functions which form
the following four groups for the distribution
parameter in
alm()
:
All of them rely on respective d- and p- functions in R. For example,
Log-Normal distribution uses dlnorm()
function from
stats
package.
The alm()
function also supports occurrence
parameter, which allows modelling non-zero values and the occurrence of
non-zeroes as two different models. The combination of any distribution
from (1) - (3) for the non-zero values and a distribution from (4) for
the occurrence will result in a mixture distribution
model, e.g. a mixture of Log-Normal and Cumulative Logistic or a
Hurdle Poisson (with Cumulative Normal for the occurrence part).
Every model produced using alm()
can be represented as:
where yt
is the value of the response variable, xt is the vector
of exogenous variables, B is
the vector of the parameters, μt is the
conditional mean (produced based on the exogenous variables and the
parameters of the model), ϵt is the error
term on the observation t and
f(⋅) is the distribution
function that does a transformation of the inputs into the output. In
case of a mixture distribution the model becomes slightly more
complicated: where ot is the binary
variable, pt is the
probability of occurrence, zt is the vector
of exogenous variables and A
is the vector of parameters for the pt.
In addition, the function supports scale model, i.e. the model that predicts the values of scale of distribution (for example, variance in case of normal distribution) based on the provided explanatory variables. This is discussed in some detail in a separate section.
The alm() function returns, along with the set of variables common for lm() (such as coefficients and fitted.values), the variable mu, which corresponds to the conditional mean used inside the distribution, and scale – the second parameter, which usually corresponds to the standard error or the dispersion parameter. The values of these two variables vary from distribution to distribution. Note, however, that the model variable returned by the lm() function is renamed into data in alm(), and that alm() does not return terms and the QR decomposition.
Given that the parameters of any model in alm()
are
estimated via likelihood, it can be assumed that they have
asymptotically normal distribution, thus the confidence intervals for
any model rely on the normality and are constructed based on the
unbiased estimate of variance, extracted using sigma()
function.
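As a small sketch (the model here is fitted to mtcars just for illustration), the standard deviation and the intervals can be extracted via the usual methods:
exampleModel <- alm(mpg~wt+hp, mtcars, distribution="dnorm")
# bias-corrected standard deviation of the residuals
sigma(exampleModel)
# confidence intervals for the parameters
confint(exampleModel)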
In almost all cases, the covariance matrix of parameters is calculated as the inverse of the Hessian of the respective distribution function. The exceptions are the Normal, Log-Normal, Poisson, Cumulative Logistic and Cumulative Normal distributions, which use analytical solutions.
The alm() function also supports factors in the explanatory variables, creating a set of dummy variables from them. In case of ordered variables (ordinal scale, is.ordered()), the ordering is removed and the set of dummies is produced. This is done in order to avoid the built-in behaviour of R, which creates linear, quadratic, cubic etc. contrasts for ordered variables, making the interpretation of the parameters difficult.
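For example, wrapping a numeric variable in factor() makes alm() expand it into a set of dummy variables (a sketch on the mtcars data):
# cyl takes the values 4, 6 and 8, so two dummy variables are created
modelWithFactor <- alm(mpg~wt+factor(cyl), mtcars, distribution="dnorm")
summary(modelWithFactor)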
When the number of estimated parameters is calculated, in case of
loss=="likelihood"
the scale is considered as one of the
parameters as well, which aligns with the idea of the maximum likelihood
estimation. For all the other losses, the scale does not count (this
aligns, for example, with how the number of parameters is calculated in
OLS, which corresponds to loss="MSE"
).
Although the basic principles of estimation of models and predictions from them are the same for all the distributions, each of the distributions has its own features. So it makes sense to discuss them individually. We discuss the distributions in the four groups mentioned above.
This group of functions includes:
For all the functions in this category resid()
method
returns et = yt − μt.
The density of normal distribution 𝒩(μt, σ) is: where σ is the standard deviation of the error term. This PDF has a very well-known bell shape:
alm()
with Normal distribution
(distribution="dnorm"
) is equivalent to lm()
function from stats
package and returns roughly the same
estimates of parameters, so if you are concerned with the time of
calculation, I would recommend reverting to lm()
.
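A brief sketch demonstrating this equivalence on the mtcars data:
# The estimates of parameters should be roughly the same
coef(alm(mpg~wt+hp, mtcars, distribution="dnorm"))
coef(lm(mpg~wt+hp, mtcars))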
Maximising the likelihood of the model is equivalent to the estimation of the basic linear regression using Least Squares method: where ϵt ∼ 𝒩(0, σ2).
The variance σ2
is estimated in alm()
based on likelihood: where T is the sample size. Its square
root (standard deviation) is used in the calculations of
dnorm()
function, and the value is then returned via the scale variable. This value does not have a bias correction.
However the sigma()
method applied to the resulting model,
returns the bias corrected version of standard deviation. And
vcov()
, confint()
, summary()
and
predict()
rely on the value extracted by
sigma()
.
μt is
returned as is in mu
variable, and the fitted values are
set equivalent to mu
.
In order to produce confidence intervals for the mean
(predict(model, newdata, interval="confidence")
) the
conditional variance of the model is calculated using: where V(B) is the covariance
matrix of the parameters returned by the function vcov
.
This variance is then used for the construction of the confidence intervals of the required level α using Student's distribution:
where $\tau_{df,\frac{1+\alpha}{2}}$ is
the upper ${\frac{1+\alpha}{2}}$-th
quantile of the Student’s distribution with df degrees of freedom
(e.g. with α = 0.95 it will be
0.975-th quantile, which, for example, for 100 degrees of freedom will
be ≈ 1.984).
Similarly for the prediction intervals
(predict(model, newdata, interval="prediction")
) the
conditional variance of the yt is
calculated: where s2 is the bias-corrected
variance of the error term, calculated using: where k is the number of estimated
parameters (including the variance itself). This value is then used for
the construction of the prediction intervals of a specified level, also using Student's distribution, in a similar manner as with the confidence intervals.
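Here is a small sketch of the two types of intervals in practice (mtcars is used only for illustration):
modelNormal <- alm(mpg~wt+hp, mtcars[1:27,], distribution="dnorm")
# intervals for the regression line vs intervals for the actual values
predict(modelNormal, mtcars[28:32,], interval="confidence")
predict(modelNormal, mtcars[28:32,], interval="prediction")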
Laplace distribution has some similarities with the Normal one: where s is the scale parameter, which, when estimated using likelihood, is equal to the mean absolute error: So maximising the likelihood is equivalent to estimating the linear regression via the minimisation of s. When estimating a model via minimising s, the assumption imposed on the error term is ϵt ∼ ℒ𝒶𝓅𝓁𝒶𝒸ℯ(0, s). The main difference of Laplace from Normal distribution is its fatter tails; the PDF has the following shape:
alm()
function with distribution="dlaplace"
returns mu
equal to μt and the
fitted values equal to mu
. s is returned in the
scale
variable. The prediction intervals are derived from
the quantiles of Laplace distribution after transforming the conditional
variance into the conditional scale parameter s using the connection between the
two in Laplace distribution: where σ2 is substituted either
by the conditional variance of μt or yt.
The kurtosis of Laplace distribution is 6, making it suitable for modelling rarely occurring events.
Asymmetric Laplace distribution can be considered as two Laplace distributions with different scale parameters s for the left and right sides. There are
several ways to summarise the probability density function, the one used
in alm()
relies on the asymmetry parameter α (Yu and
Zhang 2005): where s is
the scale parameter, α is the
skewness parameter and I(yt ≤ μt)
is the indicator function, which is equal to one, when the condition is
satisfied and to zero otherwise. The scale parameter s estimated using likelihood is
equal to the quantile loss: Thus maximising the likelihood is equivalent
to estimating the linear regression via the minimisation of α quantile, making this equivalent
to quantile regression. So quantile regression models assume indirectly
that the error term is ϵt ∼ 𝒜ℒ𝒶𝓅𝓁𝒶𝒸ℯ(0, s, α)
(Geraci and Bottai 2007). The advantage of
using alm()
in this case is in having the full
distribution, which allows to do all the fancy things you can do when
you have likelihood.
Graphically, the PDF of asymmetric Laplace is:
In case of α = 0.5 the function reverts to the symmetric Laplace where $s=\frac{1}{2}\text{MAE}$.
alm()
function with
distribution="dalaplace"
accepts an additional parameter
alpha
in ellipsis, which defines the quantile α. If it is not provided, then the
function will estimate it by maximising the likelihood and return it as
the first coefficient. alm()
returns mu
equal
to μt and
the fitted values equal to mu
. s is returned in the
scale
variable. The parameter α is returned in the variable
other
of the final model. The prediction intervals are
produced using qalaplace()
function. In order to find the
values of s for the holdout
the following connection between the variance of the variable and the
scale in Asymmetric Laplace distribution is used: where σ2 is substituted either
by the conditional variance of μt or yt.
NOTE: in order for the Asymmetric Laplace to work well, you might need to have large samples. This is inherited from the pinball score of the quantile regression. If you fit the model on 40 observations with α = 0.05, you will only have 2 observations below the line, which does not help very much with the fit. Similarly, the covariance matrix, produced via the Hessian might not be adequate in this situation (because there is not enough variability in the data due to extreme value of α). The latter can be partially addressed by using bootstrap, but do not expect miracles on small samples.
The S distribution has the following density function: where s is the scale parameter. If estimated via maximum likelihood, the scale parameter is equal to: which corresponds to the minimisation of a half of “Mean Root Absolute Error” or “Half Absolute Moment”.
S distribution has a kurtosis of 25.2, which makes it a “severe excess” distribution (thus the name). It might be useful in cases of randomly occurring incidents and extreme values (Black Swans?). Here is how the PDF looks:
alm()
function with distribution="ds"
returns μt
in the same variables mu
and fitted.values
,
and s in the
scale
variable. Similarly to the previous functions, the
prediction intervals are based on the qs()
function from
greybox
package and use the connection between the scale
and the variance: where once again σ2 is substituted either
by the conditional variance of μt or yt.
The Generalised Normal distribution is a generalisation, which has Normal, Laplace and S as special cases. It has the following density function: where s is the scale and β is the shape parameter. If estimated via maximum likelihood, the scale parameter is equal to: In the special cases, this becomes either $\sqrt{2}\times$RMSE (β = 2), or MAE (β = 1) or a half of HAM (β = 0.5). It is important to note that although in case of β = 2, the distribution becomes equivalent to Normal, the scale of it will differ from the σ (this follows directly from the formula above). The relation between the two is s2 = 2σ2.
The kurtosis of Generalised Normal distribution is determined by β and is equal to $\frac{\Gamma(5/\beta)\Gamma(1/\beta)}{\Gamma(3/\beta)^2}$.
alm()
function with distribution="dgnorm"
returns μt
in the same variables mu
and fitted.values
,
s in the scale
variable and β in
other$beta
. Note that if beta
is not provided
in the function, then it will estimate it. However, the estimates of
β are known not to be
consistent and asymptotically normal if it is less than 2. So,
use with care! As for the intervals, they are based on the
qgnorm()
function from greybox
package and use
the connection between the scale and the variance: where once again
σ2 is substituted
either by the conditional variance of μt or yt, depending on
what type of interval is needed.
The density function of Logistic distribution is: where s is the scale parameter, which is
estimated in alm()
based on the connection between the
parameter and the variance in the logistic distribution: Once again, the maximisation of the likelihood implies the estimation of the linear model, where ϵt ∼ ℒℴℊ𝒾𝓈𝓉𝒾𝒸(0, s).
Logistic is considered a fat tailed distribution, but its tails are not as fat as in Laplace. Kurtosis of standard Logistic is 4.2.
alm()
function with distribution="dlogis"
returns μt
in mu
and in fitted.values
variables, and
s in the scale
variable. Similar to Laplace distribution, the prediction intervals use
the connection between the variance and scale, and rely on the
qlogis
function.
The Student t distribution has a difficult density function: where ν is the number of degrees of freedom, which can also be considered as the scale parameter of the distribution. It has the following connection with the in-sample variance of the error (but only for the case, when ν > 2):
Kurtosis of Student t distribution depends on the value of ν, and for the cases of ν > 4 it is equal to $\frac{6}{\nu-4}$. When ν → ∞, the distribution converges to the Normal.
alm()
function with distribution="dt"
estimates the parameters of the model along with the ν (if it is not provided by the user
as a nu
parameter) and returns μt in the
variables mu
and fitted.values
, and ν in the scale
variable. Both prediction and confidence intervals use qt()
function from stats
package and rely on the conventional
number of degrees of freedom T − k. The intervals are
constructed similarly to how it is done in Normal distribution (based on
qt()
function).
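A quick sketch of its use, with ν estimated from the data (mtcars is used here just as an example):
modelT <- alm(mpg~wt+hp, mtcars, distribution="dt")
# the estimated number of degrees of freedom
modelT$scale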
In order to see how this works, we will create the following data:
set.seed(41, kind="L'Ecuyer-CMRG")
xreg <- cbind(rnorm(200,10,3),rnorm(200,50,5))
xreg <- cbind(500+0.5*xreg[,1]-0.75*xreg[,2]+rs(200,0,3),xreg,rnorm(200,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
inSample <- xreg[1:180,]
outSample <- xreg[-c(1:180),]
ALM can be run either with data frame or with matrix. Here’s an example with normal distribution and several levels for the construction of prediction interval:
ourModel <- alm(y~x1+x2, data=inSample, distribution="dnorm")
summary(ourModel)
#> Response variable: y
#> Distribution used in the estimation: Normal
#> Loss function used in estimation: likelihood
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 383.9826 69.5496 246.7240 521.2412 *
#> x1 0.0055 2.2403 -4.4159 4.4269
#> x2 1.6701 1.2779 -0.8519 4.1920
#>
#> Error standard deviation: 88.1329
#> Sample size: 180
#> Number of estimated parameters: 4
#> Number of degrees of freedom: 176
#> Information criteria:
#> AIC AICc BIC BICc
#> 2127.157 2127.386 2139.929 2140.523
plot(predict(ourModel,outSample,interval="p",level=c(0.9,0.95)))
And here’s an example with Asymmetric Laplace and predefined α = 0.95:
ourModel <- alm(y~x1+x2, data=inSample, distribution="dalaplace",alpha=0.95)
summary(ourModel)
#> Warning: Choleski decomposition of hessian failed, so we had to revert to the simple inversion.
#> The estimate of the covariance matrix of parameters might be inaccurate.
#> Warning: Sorry, but the hessian is singular, so we could not invert it.
#> Switching to bootstrap of covariance matrix of parameters.
#> Warning in log(yTransformed): NaNs produced
#> Warning in log(ySorted): NaNs produced
#> Response variable: y
#> Distribution used in the estimation: Asymmetric Laplace with alpha=0.95
#> Loss function used in estimation: likelihood
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 566.7139 41.3321 485.1437 648.2842 *
#> x1 0.0051 0.0038 -0.0023 0.0125
#> x2 0.8491 0.7832 -0.6965 2.3947
#>
#> Error standard deviation: 167.8399
#> Sample size: 180
#> Number of estimated parameters: 4
#> Number of degrees of freedom: 176
#> Information criteria:
#> AIC AICc BIC BICc
#> 2348.783 2349.012 2361.555 2362.148
plot(predict(ourModel,outSample))
There are currently three distributions in this group:
They allow the response variable to be positive or zero. Note however that the PDF of the Box-Cox Normal distribution is equal to zero in case of yt = 0, which might cause some issues in the estimation.
Box-Cox Normal distribution used in the greybox
package
is defined as a distribution that becomes normal after the Box-Cox
transformation. This means that if $x=\frac{y^\lambda-1}{\lambda}$ and x ∼ 𝒩(μ, σ2),
then y ∼ BC𝒩(μ, σ2).
The density function of the Box-Cox Normal distribution in this case is:
where the variance estimated using likelihood is: Depending on the value
of λ, we will get different
shapes of the density function:
When λ = 0 the distribution transforms to the Log-Normal one.
Estimating the model with Box-Cox Normal distribution is equivalent to estimating the parameters of a linear model after the Box-Cox transform: where ϵt ∼ 𝒩(0, σ2) or:
alm()
with distribution="dbcnorm"
does not
transform the provided data and estimates the density directly using
dbcnorm()
function from greybox
with the
estimated mean μt and the
variance . The μt is returned
in the variable mu
, the σ2 is in the variable
scale
, while the fitted.values
contains the
exponent of μt, which, given
the connection between the Normal and Box-Cox Normal distributions,
corresponds to median of distribution rather than mean. Finally,
resid()
method returns $e_t =
\frac{y_t^{\lambda}-1}{\lambda} - \mu_t$. The lambda
parameter can be provided by the user via the lambdaBC
in
ellipsis.
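A sketch of such a call on the data generated earlier, with λ fixed by the user (drop lambdaBC to have it estimated instead):
modelBCNorm <- alm(y~x1+x2, inSample, distribution="dbcnorm", lambdaBC=0.5)
summary(modelBCNorm)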
Folded Normal distribution is obtained when the absolute value of normally distributed variable is taken: if x ∼ 𝒩(μ, σ2), then |x| ∼ Fold𝒩(μ, σ2). The density function is: which can be graphically represented as:
Conditional mean and variance of Folded Normal are estimated in
alm()
(with distribution="dfnorm"
) similarly
to how this is done for Normal distribution. They are returned in the
variables mu
and scale
respectively. In order
to produce the fitted value (which is returned in
fitted.values
), the following correction is done: where
Φ(⋅) is the CDF of Normal
distribution.
The model that is assumed in the case of Folded Normal distribution can be summarised as:
The conditional variance of the forecasts is calculated based on the
elements of vcov()
(as in all the other functions), the
predicted values are corrected in the same way as the fitted values ,
and the prediction intervals are generated from the
qfnorm()
function of greybox
package. As for
the residuals, resid()
method returns et = yt − μt.
Rectified Normal distribution is obtained when all the negative values of normally distributed variable are set to zero: if x ∼ 𝒩(μ, σ), then y = max (0, x) ∼ Rect𝒩(μ, σ). The density function is:
where Φx(0, μ, σ) is the CDF and ϕx(yt, μ, σ) is the PDF of the Normal distribution. This can be graphically represented as:
This distribution can be useful in modelling intermittent demand, when the demand sizes are not integer.
Conditional location and scale of Rectified Normal are estimated in
alm()
(with distribution="drectnorm"
)
similarly to how this is done for Normal distribution. They are returned
in the variables mu
and scale
respectively. In
order to produce the fitted value (which is returned in
fitted.values
), the following formula is used:
The model that is assumed in the case of Rectified Normal distribution is:
The conditional variance of the forecasts is calculated based on the
elements of vcov()
(as in all the other functions), the
predicted values are corrected in the same way as the fitted values ,
and the prediction intervals are generated from the
qrectnorm()
function of greybox
package. As
for the residuals, resid()
method returns et = yt − μt.
This group includes:
Log-Normal distribution appears when a normally distributed variable is exponentiated. This means that if x ∼ 𝒩(μ, σ2), then exp x ∼ log𝒩(μ, σ2). The density function of Log-Normal distribution is: where the variance estimated using likelihood is: The PDF has the following shape:
Estimating the model with Log-Normal distribution is equivalent to estimating the parameters of log-linear model: where ϵt ∼ 𝒩(0, σ2) or:
alm()
with distribution="dlnorm"
does not
transform the provided data and estimates the density directly using
dlnorm()
function with the estimated mean μt and the
variance . If you need a log-log model, then you would need to take
logarithms of the external variables. The μt is returned
in the variable mu
, the σ2 is in the variable
scale
, while the fitted.values
contains the
exponent of μt, which, given
the connection between the Normal and Log-Normal distributions,
corresponds to median of distribution rather than mean. Finally,
resid()
method returns et = log yt − μt.
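For example, a sketch on the data generated earlier:
modelLogNorm <- alm(y~x1+x2, inSample, distribution="dlnorm")
# the fitted values are exponents of mu, i.e. conditional medians
plot(predict(modelLogNorm, outSample, interval="prediction"))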
Inverse Gaussian distribution is an interesting distribution, which
is defined for positive values only and has some properties similar to
the properties of the Normal distribution. It has two parameters:
location μt and scale
ϕ (aka “dispersion”). There
are different ways to parameterise this distribution, we use the
dispersion-based one. The important thing that distinguishes the
implementation in alm()
from the one in glm()
or in any other function is that we assume that the model has the
following form: and that ϵt ∼ ℐ𝒢(1, ϕ).
This means that $y_t \sim
\mathcal{IG}\left(\mu_t, \frac{\phi}{\mu_t} \right)$, implying
that the dispersion of the model changes together with the conditional
expectation. The density function for the error term in this case is:
where the dispersion parameter is estimated via maximising the
likelihood and is calculated using: Note that in our formulation μt = exp (xt′B),
so that the mean is always positive. This implies that we deal with a
pure multiplicative model. In addition, we assume that μt is just a
scale for the distribution, otherwise yt would not
follow the Inverse Gaussian distribution. The density function has
following shapes depending on the values of parameters:
alm()
with distribution="dinvgauss"
estimates the density for yt using
dinvgauss()
function from statmod
package. The
μt is
returned in the variables mu
and
fitted.values
, the dispersion ϕ is in the variable
scale
. resid()
method returns $e_t = \frac{y_t}{\mu_t}$. Finally, the
prediction interval for the regression model are generated using
qinvgauss()
function from the statmod
package.
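A minimal sketch on the data generated earlier:
modelInvGauss <- alm(y~x1+x2, inSample, distribution="dinvgauss")
# the residuals are multiplicative and should vary around one
summary(residuals(modelInvGauss))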
Another popular distribution, defined for positive values only is called “Gamma”. It is parametrised via the shape k and scale σ2 and has closed forms for mean and variance: E(x) = kσ2, V(x) = kσ4.
The important thing that distinguishes the implementation in
alm()
from the one in glm()
or in any other
function is that we assume that the model has the following form
(similar to the Inverse Gaussian model in alm):
and that ϵt ∼ Γ(σ−2, σ2),
implying that E(ϵt) = kσ2 = 1
and V(ϵt) = σ2.
This means that yt ∼ Γ(σ−2, σ2μt),
meaning that the variance of the model changes together with the
conditional expectation. The density function for the error term in this
case is: where the scale parameter σ2 can be estimated via
the method of moments based on its relation to the variance: Note that
in our formulation μt = exp (xt′B),
so that the mean is always positive, which implies that we deal with a
pure multiplicative model. In addition, we assume that μt is just a
scale for the distribution, otherwise yt would not
follow Gamma distribution. All of this makes the model restrictive, but
arguably reasonable - otherwise the mean of the distribution might
behave uncontrollably.
The density function has following shapes depending on the values of parameters:
alm()
with distribution="dgamma"
estimates
the density for yt using
dgamma()
function from stats
package. The
μt is
returned in the variables mu
and
fitted.values
, the scale σ2 is in the variable
scale
. resid()
method returns $e_t = \frac{y_t}{\mu_t}$. Finally, the
prediction interval for the regression model are generated using
qgamma()
function from the stats
package.
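A similar sketch for the Gamma case:
modelGamma <- alm(y~x1+x2, inSample, distribution="dgamma")
plot(predict(modelGamma, outSample, interval="prediction"))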
One peculiar and very specific distribution, which can also be used in modelling, is the Exponential distribution. It only has one parameter, λ, which regulates both the mean and the variance: It might be useful in cases when one wants to model inter-arrival times.
The implementation in alm()
relies on the model, similar
to the Inverse Gaussian and Gamma models: where ϵt ∼ Exp(1),
implying that E(ϵt) = V(ϵt) = 1.
This is a very restrictive model, which only works in some special
cases. If for some reason the variance and mean are not equal to one in
the empirical distribution, then the Exponential one would not be
appropriate. But in general the model formulated as above implies that
$y_t \sim \mathrm{Exp}\left( \frac{1}{\mu_t}
\right)$, meaning that the variance of the model changes together
with the conditional expectation. The density function for the error
term in this case is: Note that in our formulation μt = exp (xt′B),
so that the mean is always positive, which implies that we deal with a
pure multiplicative model. In addition, we assume that μt is just a
scale for the distribution, otherwise yt would not
follow Exponential distribution.
The density function has the following shapes depending on the values of the expectation:
alm()
with distribution="dexp"
estimates
the density for yt using
dexp()
function from stats
package. The μt is returned
in the variables mu
and fitted.values
, the
scale is assumed to be equal to one. resid()
method returns
$e_t = \frac{y_t}{\mu_t}$. Finally, the
prediction interval for the regression model are generated using
qexp()
function from the stats
package.
NOTE that if the assumption of E(ϵt) = V(ϵt) = 1 does not hold, the model will produce unreasonable quantiles.
This is based on the exponent of Laplace distribution, which means that the PDF in this case is: The model implemented in the package has similarity with Log-Normal distribution. The MLE scale is: The density function of Log-Laplace has the following shapes:
Estimating the model with Log-Laplace distribution is equivalent to estimating the parameters of log-linear model: where ϵt ∼ ℒ𝒶𝓅𝓁𝒶𝒸ℯ(0, σ2). This distribution might be useful if the data has a strong skewness (larger than in case of Log-Normal distribution).
alm()
with distribution="dllaplace"
uses
dlaplace()
function with the logarithm of actual values,
estimated mean μt and the scale
. The μt
is returned in the variable mu
, the s is in the variable
scale
, while the fitted.values
contains the
exponent of μt, which
corresponds to median of distribution rather than mean. Finally,
resid()
method returns et = log yt − μt.
This is based on the exponent of S distribution, giving the PDF: The model implemented in the package has similarity with Log-Normal and Log-Laplace distributions. The MLE scale is: The shape of the density function of Log-S is similar to Log-Laplace but with even more extreme values:
Estimating the model with Log-S distribution is equivalent to estimating the parameters of a log-linear model: where ϵt ∼ 𝒮(0, σ2). This distribution can be used for severe, seldom-occurring right-tail cases.
alm()
with distribution="dls"
uses
ds()
function with the logarithm of actual values,
estimated mean μt and the scale
. The μt
is returned in the variable mu
, the s is in the variable
scale
, while the fitted.values
contains the
exponent of μt, which
corresponds to median of distribution rather than mean. Finally,
resid()
method returns et = log yt − μt.
This is based on the exponent of Generalised Normal distribution, giving the PDF: The model implemented in the package has similarity with Log-Normal, Log-Laplace and Log-S distributions. The MLE scale is: The shapes of the distribution depend on the value of parameters, giving it in some cases very long right tail:
Estimating the model with Log-Generalised Normal distribution is equivalent to estimating the parameters of log-linear model: where ϵt ∼ 𝒢𝒩(0, s, β).
alm()
with distribution="dlgnorm"
uses the
dgnorm()
function from greybox
package with
the logarithm of actual values, estimated mean μt, the scale
and either provided or estimated shape parameter β. The μt is returned
in the variable mu
, the s is in the variable
scale
and β is in
other$beta
, while the fitted.values
contains
the exponent of μt, which
corresponds to median of distribution rather than mean. Finally,
resid()
method returns et = log yt − μt.
There is currently only one distribution in this group:
A random variable follows the Logit-Normal distribution if its logit transform follows the Normal distribution: where y ∈ (0, 1), y ∼ logit𝒩(μ, σ2) and z ∼ 𝒩(μ, σ2). The bounds are not supported, because at them the variable z becomes infinite. The density function of y is: which has the following shapes: Depending on the values of location and scale, the distribution can be either unimodal or bimodal and can be positively or negatively skewed. Because of its connection with the Normal distribution, the Logit-Normal has formulae for density, cumulative and quantile functions. However, the moment generating function does not have a closed form.
The scale of the distribution can be estimated via the maximisation of likelihood and has some similarities with the scale in Log-Normal distribution:
Estimating the model with Logit-Normal distribution is equivalent to estimating the parameters of a logit-linear model: where ϵt ∼ 𝒩(0, σ2) or: where $\mathrm{logit}^{-1}(z)=\frac{\exp(z)}{1+\exp(z)}$ is the inverse logit transform.
alm()
with distribution="dlogitnorm"
does
not transform the provided data and estimates the density directly using
dlogitnorm()
function from greybox
package
with the estimated mean μt and the
variance . The μt is returned
in the variable mu
, the σ2 is in the variable
scale
, while the fitted.values
contains the
inverse logistic transform of μt, which, given
the connection between the Normal and Logit-Normal distributions,
corresponds to median of distribution rather than mean. Finally,
resid()
method returns et = logit(yt) − μt.
Beta distribution is a distribution for a continuous variable that is defined on the interval of (0, 1). Note that the bounds are not included here, because the probability density function is not well defined on them. If the provided data contains either zeroes or ones, the function will modify the values using: and it will warn the user about this modification. This correction makes sure that there are no boundary values in the data, and it is quite artificial and needed for estimation purposes only.
The density function of Beta distribution has the form: where αt is the first
shape parameter and βt is the second
one. Note the indices for both shape parameters. This is what makes the
alm()
implementation of Beta distribution different from
any other. We assume that both of them have underlying deterministic
models, so that: and where A
and B are the vectors of
parameters for the respective shape variables. This allows the function
to model any shapes depending on the values of exogenous variables. The
conditional expectation of the model is calculated using: while the
conditional variance is: Beta distribution has shapes similar to the
ones of Logit-Normal one, but with shape parameters regulating
respectively the left and right tails of the distribution:
alm()
function with distribution="dbeta"
returns ŷt
in the variables mu
and fitted.values
, and
V(yt) in
the scale
variable. The shape parameters are returned in
the respective variables other$shape1
and
other$shape2
. You will notice that the output of the model
contains twice as many parameters as the number of variables in the
model. This is because of the estimation of two models: αt and βt - instead of
one.
Respectively, when predict()
function is used for the
alm
model with Beta distribution, the two models are used
in order to produce predicted values for αt and βt. After that
the conditional mean mu
and conditional variance
variances
are produced using the formulae above. The
prediction intervals are generated using qbeta
function
with the provided shape parameters for the holdout. As for the
confidence intervals, they are produced assuming normality for the
parameters of the model and using the estimate of the variance of the
mean based on the variances
(which is weird and probably
wrong).
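A sketch of how this can be used, rescaling the previously generated response into the (0, 1) region purely for illustration:
inSampleBeta <- as.data.frame(inSample)
# squeeze the response into (0, 1)
inSampleBeta$yScaled <- (inSampleBeta$y - min(inSampleBeta$y) + 0.5) /
  (max(inSampleBeta$y) - min(inSampleBeta$y) + 1)
modelBeta <- alm(yScaled~x1+x2, inSampleBeta, distribution="dbeta")
# note the doubled number of parameters: one set per shape parameter
summary(modelBeta)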
This group includes:
These distributions should be used in cases of count data.
Poisson distribution used in ALM has the following standard probability mass function (PMF): where λt = μt = σt2 = exp (xt′B). As it can be noticed, here we assume that the variance of the model varies in time and depends on the values of the exogenous variables, which is a specific case of heteroscedasticity. The exponent of xt′B is needed in order to avoid the negative values in λt.
Here are several examples of the PMF of Poisson with different values of parameters λ:
alm()
with distribution="dpois"
returns
mu
, fitted.values
and scale
equal
to λt. The
quantiles of distribution in predict()
method are generated
using qpois()
function from stats
package.
Finally, the returned residuals correspond to yt − μt,
which is not really helpful or meaningful.
Negative Binomial distribution implemented in alm()
is
parameterised in terms of mean and variance: where μt = exp (xt′B)
and σ2 is estimated
separately in the optimisation process. These values are then used in
the dnbinom()
function in order to calculate the
log-likelihood based on the distribution function.
Here are some examples of PMF of Negative Binomial distribution with different sizes and probabilities:
alm()
with distribution="dnbinom"
returns
μt in
mu
and fitted.values
and σ2 in scale
.
The prediction intervals are produced using the qnbinom()
function. Similarly to Poisson distribution, resid()
method
returns yt − μt.
The user can also provide size
parameter in ellipsis if it
is reasonable to assume that it is known.
The PMF of Binomial distribution is written as: where n is the size parameter, μt = exp (xt′B)
$p_t = \frac{1}{1+\mu_t}$ and E(yt) = n × pt.
The size parameter is always known and can be calculated as a number of
unique values in y minus one.
So, if the data takes one of the three values: 0, 1 and 2, the size will
be n = 2. The values of pt and n are then used in the
dbinom()
function to calculate the log-likelihood.
Visually, the PMF of the Binomial distribution has the following shapes with different values of probability and size:
alm()
with distribution="dbinom"
returns
μt in
mu
and E(yt) in
fitted.values
, the scale
and
other$size
both contain the same value of n. The prediction intervals are
produced using the qbinom()
function. resid()
method returns yt − E(yt).
There is also Geometric distribution implemented in
alm()
. It has the following PMF: where $p_t = \frac{1}{1+\exp(x_t' B)}$. The
conditional expectation in this model is calculated as $\mu_t = \frac{1}{p_t} - 1 = \exp(x_t' B)$.
The probability is used in the calculation of the log-likelihood via the
dgeom()
function.
Several PMFs of the Geometric distribution are shown in the following figure:
alm()
with distribution="dgeom"
returns
μt in
mu
and in fitted.values
. The
scale
does not contain anything useful, because the
Geometric distribution has a time varying variance, which depends on the
probability. The prediction intervals are produced using the qgeom() function. Similarly to the other count distributions, resid()
method returns yt − μt.
Round up the response variable for the next example:
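One simple way of doing that is the following sketch:
# round the response up so that it becomes integer-valued
inSample[,1] <- ceiling(inSample[,1])
outSample[,1] <- ceiling(outSample[,1])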
Negative Binomial distribution:
ourModel <- alm(y~x1+x2, data=inSample, distribution="dnbinom")
summary(ourModel)
#> Response variable: y
#> Distribution used in the estimation: Negative Binomial with size=33.8042
#> Loss function used in estimation: likelihood
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 5.9876 0.1417 5.7080 6.2671 *
#> x1 -0.0009 0.0046 -0.0100 0.0081
#> x2 0.0034 0.0026 -0.0018 0.0086
#>
#> Error standard deviation: 82.4676
#> Sample size: 180
#> Number of estimated parameters: 4
#> Number of degrees of freedom: 176
#> Information criteria:
#> AIC AICc BIC BICc
#> 2116.695 2116.924 2129.467 2130.060
And an example with predefined size:
ourModel <- alm(y~x1+x2, data=inSample, distribution="dnbinom", size=30)
summary(ourModel)
#> Response variable: y
#> Distribution used in the estimation: Negative Binomial with size=30
#> Loss function used in estimation: likelihood
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 6.0010 0.1501 5.7048 6.2972 *
#> x1 -0.0010 0.0048 -0.0105 0.0086
#> x2 0.0032 0.0028 -0.0023 0.0087
#>
#> Error standard deviation: 82.4643
#> Sample size: 180
#> Number of estimated parameters: 4
#> Number of degrees of freedom: 176
#> Information criteria:
#> AIC AICc BIC BICc
#> 2117.149 2117.378 2129.921 2130.514
The final class of models includes two cases:
In both of them it is assumed that the response variable is binary
and can be either zero or one. The main idea for this class of models is
to use a transformation of the original data and link a continuous
latent variable with the binary one. As a reminder, all the models
eventually assume that: where ot is the binary
response variable and g(⋅) is
the cumulative distribution function. Given that we work with the
probability of occurrence, the predict()
method produces
forecasts for the probability of occurrence rather than the binary
variable itself. Finally, although many other cumulative distribution
functions can be used for this transformation
(e.g. plaplace()
or plnorm()
), the most
popular ones are logistic and normal CDFs.
Given that the binary variable has Bernoulli distribution, its log-likelihood is: So the estimation of parameters for all the CDFs can be done maximising this likelihood.
In all the functions it is assumed that the probability pt corresponds to some sort of unobservable `level’ qt = xt′A, and that there is no randomness in this level. So the aim of all the functions is to estimate correctly this level and then get an estimate of probability based on it.
The error of the model is calculated using the observed occurrence variable and the estimated probability p̂t. In a way, in this calculation we assume that ot = 1 happens mainly when the respective estimated probability p̂t is very close to one. So, the error can be calculated as: However this error is not useful and should be somehow transformed into the original scale of qt. Given that both ot ∈ (0, 1) and p̂t ∈ (0, 1), the error will lie in (−1, 1). We therefore standardise it so that it lies in the region of (0, 1):
This transformation means that, when ot = p̂t, then the error ut = 0.5, when ot = 1 and p̂t = 0 then ut = 1 and finally, in the opposite case of ot = 0 and p̂t = 1, ut = 0. After that this error is transformed using either Logistic or Normal quantile generation function into the scale of qt, making sure that the case of ut = 0.5 corresponds to zero, the ut > 0.5 corresponds to the positive and ut < 0.5 corresponds to the negative errors. The distribution of the error term is unknown, but it is in general bimodal.
We have previously discussed the density function of logistic
distribution. The standardised cumulative distribution function used in
alm()
is: where q̂t = xt′A
is the conditional mean of the level, underlying the probability. This
value is then used in the likelihood in order to estimate the parameters
of the model. The error term of the model is calculated using the
formula: This way the error varies from −∞ to ∞ and
is equal to zero, when ut = 0.5.
The alm()
function with
distribution="plogis"
returns qt in
mu
, standard deviation, calculated using the respective
errors in scale
and the probability p̂t based on in
fitted.values
. resid()
method returns the
errors discussed above. predict()
method produces point
forecasts and the intervals for the probability of occurrence. The
intervals use the assumption of normality of the error term, generating
respective quantiles (based on the estimated qt and variance
of the error) and then transforming them into the scale of probability
using Logistic CDF. This method for intervals calculation is
approximate and should not be considered as a final solution!
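A small sketch on the built-in mtcars data, where am is already a binary variable:
modelLogistic <- alm(am~wt+hp, mtcars, distribution="plogis")
# the fitted values contain the estimated probabilities of am=1
head(fitted(modelLogistic))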
The case of cumulative Normal distribution is quite similar to the cumulative Logistic one. The transformation is done using the standard normal CDF: where qt = xt′A. Similarly to the Logistic CDF, the estimated probability is used in the likelihood in order to estimate the parameters of the model. The error term is calculated using the standardised quantile function of Normal distribution: It acts similar to the error from the Logistic distribution, but is based on the different set of functions. Its CDF has similar shapes to the logit:
Similar to the Logistic CDF, the alm()
function with
distribution="pnorm"
returns qt in
mu
, the standard deviation, calculated based on the errors, in scale and the probability p̂t based on it in
fitted.values
. resid()
method returns the
errors discussed above. predict()
method produces point
forecasts and the intervals for the probability of occurrence. The
intervals are also approximate and use the same principle as in Logistic
CDF.
Finally, mixture distribution models can be used in
alm()
by defining distribution
and
occurrence
parameters. Currently only plogis()
and pnorm()
are supported for the occurrence variable, but
all the other distributions discussed above can be used for the
modelling of the non-zero values. If occurrence="plogis"
or
occurrence="pnorm"
, then alm()
is fit two
times: first on the non-zero data only (defining the subset) and second
- using the same data, substituting the response variable by the binary
occurrence variable and specifying distribution=occurrence
.
As an alternative option, an occurrence alm()
model can be
estimated separately and then provided as a variable in
occurrence
.
As an example of mixture model, let’s generate some data:
set.seed(42, kind="L'Ecuyer-CMRG")
xreg <- cbind(rnorm(200,10,3),rnorm(200,50,5))
xreg <- cbind(500+0.5*xreg[,1]-0.75*xreg[,2]+rs(200,0,3),xreg,rnorm(200,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
xreg[,1] <- round(exp(xreg[,1]-400) / (1 + exp(xreg[,1]-400)),0) * xreg[,1]
# Sometimes the generated data contains huge values
xreg[is.nan(xreg[,1]),1] <- 0;
inSample <- xreg[1:180,]
outSample <- xreg[-c(1:180),]
First, we estimate the occurrence model (it will complain that the response variable is not binary, but it will work):
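A call along these lines does the job:
modelOccurrence <- alm(y~x1+x2+Noise, inSample, distribution="plogis")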
And then use it for the mixture model:
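For example:
modelMixture <- alm(y~x1+x2+Noise, inSample, distribution="dlnorm", occurrence=modelOccurrence)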
The occurrence model will be returned in the respective variable:
summary(modelMixture)
#> Response variable: y
#> Distribution used in the estimation: Mixture of Log-Normal and Cumulative logistic
#> Loss function used in estimation: likelihood
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 6.3599 0.3203 5.7278 6.9921 *
#> x1 -0.0060 0.0034 -0.0127 0.0008
#> x2 -0.0005 0.0019 -0.0043 0.0032
#> Noise -0.0003 0.0010 -0.0023 0.0016
#>
#> Error standard deviation: 0.128
#> Sample size: 180
#> Number of estimated parameters: 9
#> Number of degrees of freedom: 171
#> Information criteria:
#> AIC AICc BIC BICc
#> 1938.476 1939.534 1967.212 1969.961
summary(modelMixture$occurrence)
#> Response variable: y
#> Distribution used in the estimation: Cumulative logistic
#> Loss function used in estimation: likelihood
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) -29.5784 10.7590 -50.8118 -8.3451 *
#> x1 -0.0942 0.0885 -0.2690 0.0805
#> x2 0.0165 0.0533 -0.0887 0.1216
#> Noise 0.1079 0.0345 0.0398 0.1760 *
#>
#> Error standard deviation: 1.0606
#> Sample size: 180
#> Number of estimated parameters: 4
#> Number of degrees of freedom: 176
#> Information criteria:
#> AIC AICc BIC BICc
#> 130.8196 131.0482 143.5914 144.1849
We can also do regression diagnostics using plots:
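For example, the basic call producing the standard set of diagnostic plots:
plot(modelMixture)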
After that we can produce forecasts using the data from the holdout sample (in this example we also ask for several confidence levels):
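For example, along these lines:
plot(predict(modelMixture, outSample, interval="prediction", level=c(0.9,0.95)))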
If you expect autoregressive elements in the data, then you can
specify the order of ARIMA via the respective parameter (note that the
MA part is not supported yet. Use adam
function from
smooth
package for that):
modelMixtureAR <- alm(y~x1+x2+Noise, inSample, distribution="dlnorm", occurrence=modelOccurrence, orders=c(1,0,0))
summary(modelMixtureAR)
#> Response variable: y
#> Distribution used in the estimation: Mixture of Log-Normal and Cumulative logistic
#> Loss function used in estimation: likelihood
#> ARIMA(1,0,0) components were included in the model
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 5.9291 0.5836 4.7773 7.0809 *
#> x1 -0.0058 0.0034 -0.0125 0.0010
#> x2 -0.0006 0.0019 -0.0044 0.0031
#> Noise -0.0003 0.0010 -0.0023 0.0016
#> yLag1 0.0699 0.0798 -0.0875 0.2274
#>
#> Error standard deviation: 0.1281
#> Sample size: 180
#> Number of estimated parameters: 10
#> Number of degrees of freedom: 170
#> Information criteria:
#> AIC AICc BIC BICc
#> 2126.098 2127.400 2158.028 2161.408
plot(predict(modelMixtureAR, outSample, interval="p", side="u"))
If the explanatory variables are not available for the holdout
sample, the forecast()
function can be used:
plot(forecast(modelMixtureAR, h=10, interval="p", side="u"))
#> Warning: No newdata provided, the values will be forecasted
It will produce forecasts for each of the explanatory variables based
on the available data using es()
function from
smooth
package (if it is available; otherwise, it will use
Naive) and use those values as the new data.
Alternatively, a user might know when demand occurs in future and can provide the vector of zeroes and ones so that function takes them into account properly:
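For example, something along these lines (this is only a sketch, assuming that the occurrence values for the holdout can be passed via the occurrence parameter of predict(); check the documentation of the method):
# hypothetical occurrence values for the 20 observations of the holdout
occurrenceFuture <- c(0,1,0,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1)
plot(predict(modelMixtureAR, outSample, interval="prediction", occurrence=occurrenceFuture))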
Similarly, a vector of zeroes and ones can be provided in
occurrence
variable in alm()
to let the function
know that the occurrence is not stochastic (e.g. zeroes in the data
appear because we do not sell products over weekends).
Finally, alm()
supports the scale model, estimating the scale of the assumed distribution via a specified set of variables (this is inspired by the GAMLSS model). This might be handy if you suspect that some variables cause heteroscedasticity. However, this can only be done if the likelihood approach is used for the model estimation, and it is done via the scale parameter, which can accept either a previously estimated model (e.g. via the sm() method) or a formula for the model. While the details will differ from one distribution to another, the main idea will be the same, so here we discuss this using the example of the Normal distribution, for which the model becomes:
where xt and
B are defined as above, zt is
the vector of explanatory variables for the scale and c is the vector of
parameters for this part of model. The exponent in the formula above is
needed to make sure that the resulting values are always positive. The
resulting value σ̂t2 = exp (zt′c)
will be the variance of the residuals conditional on the values of
explanatory variables. The same model can be rewritten as (in case of
normal distribution): where ϵt ∼ 𝒩(0, 1).
Given that the location and scale of distribution are independent, we
can construct the complete model in two steps: first by estimating the
parameters of location and then the parameters of scale. The parameters
are initialised based on the residuals of the model and then are
optimised via maximisation of likelihood of the specified probability
density function. This is done using the generic method
sm()
after constructing a model. For example, you can
estimate a model based on lm()
and then construct scale
model for it (sm()
will assume Normal distribution in this
case and will use likelihood):
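A sketch of such a pair of calls (the location formula here is just an example):
# location model estimated via lm()
lmModel <- lm(mpg~., mtcars)
# scale model, regressing the scale on qsec and wt
scaleModel <- sm(lmModel, ~qsec+wt)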
In sm()
, the estimation starts with a least squares of
logarithms of squared residuals on the selected explanatory variables.
The obtained parameters are then used in the optimiser (using
nloptr()
function) and are modified to maximise the
likelihood function of the Normal distribution.
The function returns the object of class scale
, which
contains several potentially useful variables, including the fitted
value (σ̂t2 = exp (zt′c))
and residuals (ϵt). The formula
for residuals would differ from one distribution to another. The
residuals can be used for model diagnostics to see if the potential
problems of the location model have been resolved
(e.g. heteroscedasticity removed). The resulting model supports several
methods. Here is an example:
summary(scaleModel)
#> Scale model for the variable: mpg
#> Distribution used in the estimation: Normal
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) -5.7567 2.9613 -11.8132 0.2998
#> qsec 0.2920 0.1456 -0.0059 0.5899
#> wt 0.7292 0.3125 0.0899 1.3684 *
#>
#> Sample size: 32
#> Number of estimated parameters: 3
#> Number of degrees of freedom: 29
#> Information criteria:
#> AIC AICc BIC BICc
#> 154.5043 155.3615 158.9015 160.3868
The diagnostic plots can be produced as well. The error term in this case would be standardised.
If the scale model needs to be taken into account in forecasting,
then the implant()
method can be used to implant it into
the previously estimated model. This only works with alm()
function:
locationModel <- alm(mpg~., mtcars)
scaleModel <- sm(locationModel,~qsec+wt)
locationModel <- implant(locationModel,scaleModel)
When it comes to alm()
, the model can be estimated in
the same function:
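For example, something like:
# location and scale parts estimated together via the scale parameter
almModel <- alm(mpg~., mtcars, scale=~qsec+wt)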
In this case, the summary of the model would differ from the basic
call of alm()
- it would include the output for the scale
model together with the location one:
summary(almModel)
#> Response variable: mpg
#> Distribution used in the estimation: Normal
#> Loss function used in estimation: likelihood
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 18.6535 20.6895 -25.4451 62.7521
#> cyl -0.1946 1.1409 -2.6264 2.2372
#> disp 0.0111 0.0200 -0.0315 0.0537
#> hp -0.0291 0.0280 -0.0888 0.0307
#> drat 0.4069 1.7893 -3.4069 4.2206
#> wt -3.3747 2.1548 -7.9676 1.2182
#> qsec 0.7124 0.8109 -1.0160 2.4409
#> vs 0.0451 2.4014 -5.0732 5.1635
#> am 4.1402 2.4729 -1.1307 9.4110
#> gear -0.0861 1.7330 -3.7798 3.6076
#> carb -0.0150 1.0666 -2.2883 2.2584
#>
#> Coefficients for scale:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) -2.5873 3.1123 -9.2210 4.0464
#> qsec 0.1820 0.1576 -0.1539 0.5179
#> wt 0.2159 0.2398 -0.2952 0.7270
#>
#> Error standard deviation: 2.8746
#> Sample size: 29
#> Number of estimated parameters: 14
#> Number of degrees of freedom: 15
#> Information criteria:
#> AIC AICc BIC BICc
#> 152.7057 182.7057 171.8479 222.3573
Forecasting of the overall model (with location and scale) can be done in the same way as for the basic model:
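For example, along these lines:
# the scale model is used when constructing the width of the interval
plot(predict(almModel, mtcars, interval="prediction"))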
If the prediction interval is required, the function would first produce forecasts of the scale, and then use them for the construction of interval via the standard procedure, depending on the assumed distribution.
As mentioned earlier, depending on the distribution, we would have
slight differences in the function sm()
, with different
transformations of residuals of the location model (needed for the
initialisation) and transformations of scale depending on the used
distribution. For example, for the multiplicative-error distributions (such as the Inverse Gaussian, Gamma and Exponential), it is assumed in alm() that E(et) = 1, and σ̂t is used in the density function.
Note that the scale model is a relatively new feature, so it might not produce perfect results. Let me know if you have ideas of how to improve it via https://github.com/config-i1/greybox/issues.
There are several loss functions implemented in the function and
there is an option for a user to provide their own. If the
loss
is not "likelihood"
, then the
distribution is only needed for inference. Keep in mind that this
typically means that the likelihood is not maximised, so the inference
might be wrong and the results can be misleading. However, there are
several cases when the likelihood is still maximised:
loss="MSE"
,
distribution=c("dnorm","dlnorm","dbcnorm","dlogitnorm")
;loss="MAE"
,
distribution=c("dlaplace","dllaplace")
;loss="HAM"
,
distribution=c("ds","dls")
;The likelihoods of the distributions above are maximised, when the
respective losses are minimised, so all the inference discussed above
holds in these three situations. In all the other cases, the
alm()
function will not return logLik
and
might complain that the derivations can be misleading.
Also, when loss="likelihood"
(and for the three cases
above) the number of estimated parameters includes the scale of
distribution (e.g. standard deviation in case of Normal distribution),
but it is not counted in the number of parameters of all the other
losses.
Finally, we can provide custom loss functions. Here is an example:
lossFunction <- function(actual, fitted, B, xreg){
return(mean(abs(actual-fitted)^3));
}
modelLossCustom <- alm(y~x1+x2+Noise, inSample, distribution="dnorm", loss=lossFunction)
summary(modelLossCustom)
#> Warning: You used the non-likelihood compatible loss, so the covariance matrix
#> might be incorrect. It is recommended to use bootstrap=TRUE option in this
#> case.
#> Response variable: y
#> Distribution used in the estimation: Normal
#> Loss function used in estimation: custom
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 639.6606 1.1436 637.4036 641.9177 *
#> x1 -11.9618 0.0134 -11.9883 -11.9353 *
#> x2 -5.8942 0.0071 -5.9082 -5.8802 *
#> Noise 0.5565 0.0036 0.5493 0.5636 *
#>
#> Error standard deviation: 176.0577
#> Sample size: 180
#> Number of estimated parameters: 4
#> Number of degrees of freedom: 176
Keep in mind that it is important to have parameters
actual
, fitted
, B
and
xreg
in the custom loss function, even if they are not
used. xreg
is the matrix of the expanded explanatory
variables (including the intercept in the first column if it is
needed).
You might also notice that there are LASSO and RIDGE available as
options for loss
parameter. These are experimental. When
they are used, the explanatory variables are not normalised, but the
parameters (all but intercept) are divided by the standard deviations of
explanatory variables, which has a similar (but not the same) effect.
Also, in order for this to work appropriately, you need to
provide the parameter lambda
, which should lie between
0 and 1, where lambda=0
means that there is no shrinkage
and lambda=1
implies that there is only shrinkage, and the
MSE is not used at all. The MSE part of the loss is divided by V(Δyt),
where Δyt = yt − yt − 1,
bringing it closer to the scale of parameters and hopefully making λ a bit more meaningful.
Note that due to the formulation of alm()
, you can use
LASSO for Logistic
(distribution="plogis"
), Poisson
(distribution="dpois"
) and Negative
Binomial (distribution="dnbinom"
) regressions as well.
Keep in mind that in order to get better results, you might need to tune
the parameters of the optimiser (e.g. set
maxeval
to a higher value or increase
ftol_rel
).
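A sketch of such a call on the data generated earlier:
modelLASSO <- alm(y~x1+x2+Noise, inSample, distribution="dnorm", loss="LASSO", lambda=0.9)
coef(modelLASSO)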
When working with non-conventional loss functions do not rely on the
default standard errors of parameters, because they are not what they
seem: they only work in case of loss="likelihood"
. A
correct method for calculating the standard errors is the bootstrap,
which is now available via the bootstrap=TRUE
parameter.
Here is an example:
summary(modelLossCustom, bootstrap=TRUE, nsim=100)
#> Response variable: y
#> Distribution used in the estimation: Normal
#> Loss function used in estimation: custom
#> Bootstrap was used for the estimation of uncertainty of parameters
#> Coefficients:
#> Estimate Std. Error Lower 2.5% Upper 97.5%
#> (Intercept) 639.6606 65.9097 463.8722 707.1389 *
#> x1 -11.9618 1.4146 -14.4401 -9.4744 *
#> x2 -5.8942 0.6039 -7.2645 -4.8199 *
#> Noise 0.5565 0.2089 0.3818 1.1303 *
#>
#> Error standard deviation: 176.0577
#> Sample size: 180
#> Number of estimated parameters: 4
#> Number of degrees of freedom: 176
Where nsim
is the parameter passed to the method
coefbootstrap()
, which is used in order to generate
coefficients of models and then extract the necessary statistics. See
?coefbootstrap
to see what parameters are accepted by the
function. There are several methods that can use
coefbootstrap()
and will use bootstrapped covariance
matrix: vcov()
, coefint()
,
summary()
, predict()
and
forecast()
.
There are several parameters in the optimiser that can be regulated by the user. Here is the list:
B - the vector of starting values of parameters for the optimiser, which should correspond to the ordering of the explanatory variables. In order to see the order of parameters, you can fit a model to the data and extract B from it via ourModel$B;
algorithm - the algorithm to use in optimisation ("NLOPT_LN_SBPLX" by default);
maxeval - maximum number of iterations to carry out (the default is 200 for simpler models, 500 for Logistic, Probit, Poisson and Negative Binomial, and 1000 in case of loss=c("LASSO","RIDGE"));
maxtime - stop when the optimisation time (in seconds) exceeds this value;
xtol_rel - the relative precision of the optimiser (the default is 1E-6);
xtol_abs - the absolute precision of the optimiser (the default is 1E-8);
ftol_rel - the stopping criterion in case of the relative change in the loss function (the default is 1E-4);
ftol_abs - the stopping criterion in case of the absolute change in the loss function (the default is 0 - not used);
print_level - the level of output for the optimiser (0 by default). If equal to 41, then the detailed results of the optimisation are returned.

You can read more about these parameters by running the function nloptr.print.options() from the nloptr package. Typically, if your model does not work as expected, you might need to tune some of these parameters in order to make sure that the optimum of the loss is reached to a satisfactory extent.