Business Forecasting Principles: 15. Forecast Selection and Combination
By Lancaster CMAF
Summary
## Key takeaways

- **Avoid In-Sample Fit for Forecast Selection**: Never select a forecast model based on its in-sample fit, as this leads to overfitting the noise in the data and results in worse out-of-sample performance. [04:01], [06:34]
- **Information Criteria: Useful but Limited**: Information criteria like the AIC and BIC penalize model complexity, offering a better alternative to in-sample fit, but they cannot be used to compare different model classes. [06:40], [08:46]
- **Holdout Performance: The Gold Standard**: Selecting forecasts based on their performance on a holdout sample is the most reliable method, as it directly measures out-of-sample accuracy and avoids overfitting. [09:10], [12:05]
- **Forecast Combination Outperforms Selection**: Combining multiple forecasts, often by simple averaging, empirically tends to outperform selecting a single best forecast because reality is complex and no single model is perfect. [12:40], [15:20]
- **The Forecast Combination Puzzle**: Surprisingly often, simple unweighted forecast combinations outperform sophisticated methods that attempt to optimize weights, possibly due to the variance introduced by estimating weights. [16:57], [18:15]
Topics Covered
- Why You Must Avoid In-Sample Fit for Forecast Selection.
- Holdout Performance: The Gold Standard for Forecast Selection.
- Why Combining Forecasts Beats Picking a Single Best Model.
- The Forecast Combination Puzzle: Why Simple Averages Win.
Full Transcript
Welcome everybody to another video in our series on forecasting. I'm Stephan Kolassa, a data science expert at SAP Switzerland, and I'm also affiliated with the Lancaster University Centre for Marketing Analytics and Forecasting. Today we'll talk about forecast selection and combination.
So the question we're going to talk about today is how to deal with multiple forecasts: multiple possible forecasts for the same point in time in the future. These could come from different model classes. We might have an ARMA forecast, we might have an exponential smoothing forecast, we might have boosting, deep learning, what have you. We might have judgmental forecasts, that is, forecasts that just come in from humans who threw something into the pot. How do we deal with all of these?
Or we might have different models within a class. We might have ARMA forecasts, but within ARMA there is a plethora, actually infinitely many, of different models, say an ARIMA(1,0,1) versus an ARIMA(2,0,0). Your experts can explain all the details about how these differ in excruciating detail. We might have seasonal versus non-seasonal exponential smoothing, or any of the rest of the zoo of exponential smoothing methods. And of course, in boosting or deep learning we might have different parameterizations, different hyperparameters, or what have you. So we might have lots of different forecasts, all for the same point in time. Picture a room full of forecasters, everybody screaming their favorite number at you for one point in time in the future. How do you deal with those?
How do we get a single forecast to act on? Here's a little time series which I took from the M3 forecasting competition, and I calculated three different forecasts; I didn't really pick ones that made a lot of sense. But there's one benchmark that you should always keep in mind, which is the historical mean, which is surprisingly often very effective: just take the historical data, take the average, and forecast that out. Of course, that's a flat line, and people often complain that flat lines can't be sophisticated. Well, actually, often enough they are. It's a question of knowing when the flat line is best. You could also call the historical mean forecast an ARIMA(0,0,0) forecast. That sounds much more sophisticated, but it's the exact same thing. You could also have something like an ARIMA(3,0,3) model with a seasonal (1,0,0) component. That sounds much more sophisticated, and the forecast also looks more sophisticated because it wiggles up and down. That doesn't mean it's going to be more accurate. Finally, a third contender would be an exponential smoothing forecast with multiplicative error, no trend and additive seasonality; I just pulled that out of a hat. So now we have three different candidates. How do we deal with these? We want one forecast to act upon.
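As a minimal sketch of what these three candidates might look like in code, assuming the statsmodels library and a synthetic monthly series standing in for the M3 series shown in the video (the ARIMA orders here mirror the examples above but are otherwise arbitrary):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.exponential_smoothing.ets import ETSModel

# Synthetic monthly series standing in for the M3 series from the video
rng = np.random.default_rng(42)
y = 100 + 10 * np.sin(np.arange(60) * 2 * np.pi / 12) + rng.normal(0, 5, 60)
h = 12  # forecast horizon

# Candidate 1: historical mean -- a flat line, equivalent to ARIMA(0,0,0) with a constant
f_mean = np.full(h, y.mean())

# Candidate 2: a seasonal ARIMA; the orders are picked purely for illustration
f_arima = ARIMA(y, order=(3, 0, 3), seasonal_order=(1, 0, 0, 12)).fit().forecast(h)

# Candidate 3: exponential smoothing with multiplicative error, no trend, additive seasonality
f_ets = ETSModel(y, error="mul", trend=None, seasonal="add",
                 seasonal_periods=12).fit(disp=False).forecast(h)
```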
There are essentially two ways of going about this: forecast selection and forecast combination. Let's look at these in turn.

Forecast selection is just what it says on the tin: it's selecting one of these three forecasts and going with that one. Combination, later on, will probably not surprise you: it means we combine them. Let's stick with selection for now. I'm just going to select one of these three forecasts. But how do we do that? There are three different ways I'm going to present today: one goes by in-sample fit, one by information criteria, one by holdout performance. Let's look at these in turn.
Selection by in-sample fit. If we have a time series, most statistical methods, and most machine learning methods too, will give you an in-sample fit. They will tell you what the model thinks the average demand, or whatever you're forecasting, would have been in sample, on top of which there is also some unavoidable noise, because a time series always consists of a systematic component that we're trying to forecast and something that we can't capture, which we call noise. The red lines in our little pictures here are the in-sample fits: the systematic component in the training sample. Now we might want to say, well, let's just pick the one that fitted best in sample, because that one apparently understood the data in the training sample best. Then of course you'd say that the flat line at the very top doesn't do a very good job, because there is a lot of wiggling that the flat line doesn't capture, and the seasonal methods are much better. So we should be picking one of those: just use your favorite error measure and pick the one that scores lowest on that error measure. It's actually very simple.
The problem is that you should not do this. We should really not do this. Let me repeat: you should not do this. Why? The problem is that given any model, you can always make it more complicated by adding components, by adding autoregressive orders. You can go from an AR(1) to an AR(2). In an exponential smoothing method, you can take a method with no trend and add a trend component. Making a model more complicated will always improve the in-sample fit, simply because the model has more leeway, more ways of twisting things around and fitting in here and fitting in there. It's more flexible, and that means it can adapt better to the in-sample time series. The problem is that at some point it will start fitting the noise. It will start thinking that it understands stuff that is simply noise, simply unforecastable random variation. And the problem is you will not know, because in sample, making a model more and more complicated will look like it's doing a better and better job, while at some point out of sample it will get worse. So do not do that. Do not rely on in-sample fit as a forecast selection method. This is overfitting. Don't do that.
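To see overfitting in numbers, here is a toy sketch, not from the video, using polynomial trend models of increasing degree as stand-ins for increasingly complicated forecasting models. The in-sample error can only shrink as complexity grows, while the holdout error eventually blows up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_train = 48, 36
t = np.linspace(0.0, 1.0, n)            # rescaled time index for numerical stability
y = 100 + 20 * t + rng.normal(0, 3, n)  # linear trend plus unforecastable noise

for degree in (1, 2, 4, 8):             # increasingly "complicated" models
    coefs = np.polyfit(t[:n_train], y[:n_train], degree)  # fit on the head of the series only
    fit_mse = np.mean((np.polyval(coefs, t[:n_train]) - y[:n_train]) ** 2)
    out_mse = np.mean((np.polyval(coefs, t[n_train:]) - y[n_train:]) ** 2)
    print(f"degree {degree}: in-sample MSE {fit_mse:7.2f}   holdout MSE {out_mse:10.2f}")
```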
right. Happily enough, there's other
methods and one alternative is
information criteria. Uh so we again
have the insample fits here. Um but we
also have uh things cryptical
abbreviations like the AIC and the BIC.
If you have a statistical forecasting
method like the ARMA or exponential
smoothing, these will also give you
information criteria. So these are KPIs
key performance indicators of a
statistical model applied to data. It's
the there's different flavors here.
There's a kaius information criteria and
that the basian or schwarz information
criteria and that's the AIC and BIC that
we have in here. And those are
essentially the insample errors that we
just talked about, but they're penalized
by the accur by the complexity of the
model. So you don't just look at the
insample accuracy or insample model fit,
but you penalize this number by the
complexity. So at some point in time,
this complexity penalty outweighs the
insample fit. And so what you're doing
then is you select the forecast with the
lowest information criterion. The
absolute numbers don't really make a lot
of difference. They differ by
implementation, but it's just if you
have multiple forecasts, just select the
one with the lowest information
criterion. Can deal with AIC, you can
use BIC. It doesn't really make all that
much of a difference. Sometimes one
performs better, sometimes the other. It
doesn't really make that much of a
difference. It's much more informative
than just in sample fit because of this
penalty term for complexity, which makes
uh which makes sure that you don't
overfit too badly. can still overfit
with information criteria but it's much
harder and you have an in built-in break
that stops you from being too complex
for your own good. Problem uh there's
always problems here. The problem here
with information criteria is that you
cannot compare them between different
model classes. You can't compare you
cannot compare the AIC between an ARMA
model and the AIC for an exponential
smoothing model simply for statistical
reasons. There's completely different
things. you're comparing apples and
oranges. Um, and it's simply you can't
do it. It's not meaningful. Can't do it.
Don't try it. Also, you can't compare
ARMA models with different integration
orders. Uh, so can only compare ARMA
models with the same integration order
on information criteria. So, they're
limited. They're really limited in in
application, but better than in sample
fit.
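A minimal sketch of selection by information criterion, assuming statsmodels and staying within one model class and one integration order, as described above; the candidate orders and the simulated series are purely illustrative:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate an AR(1)-like series; in practice y would be your historical data
rng = np.random.default_rng(1)
y = np.empty(120)
y[0] = 0.0
for t in range(1, 120):
    y[t] = 0.6 * y[t - 1] + rng.normal()

# Candidate (p, q) orders within one model class and one integration order (d = 0)
candidates = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
fits = {(p, q): ARIMA(y, order=(p, 0, q)).fit() for p, q in candidates}

for (p, q), res in fits.items():
    print(f"ARMA({p},{q}): AIC = {res.aic:8.2f}")

best = min(fits, key=lambda k: fits[k].aic)  # lowest information criterion wins
print("Selected:", best)
```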
Happily enough, there is a third option, and it's almost always available: selection by holdout performance. How does that work? We again have our historical time series. Selection by holdout performance is not so much a selection of forecasts but more one of forecasting methods. Why? In looking at holdout performance, we use what we call a holdout sample. We take the last couple of observations; 20% of the time series is a rough rule of thumb. So if you have five years of data, you hold out the last year. You take the first four years, fit your model to those four years, and forecast out into the period that you just held out, that you hid from the model. Then you can assess whether your forecasting method did a good job forecasting into this holdout sample. The advantage is that since we're not looking at in-sample fit but at true out-of-sample performance, we're not overfitting so much; we can still overfit here, but it's much harder. You do that for every single method that you have, whether it's the mean forecast or the ARMA or your boosting method or whatever else. You assess the quality and then pick the method that performed best on this holdout sample, and now you have one winning method that you're going to stick with. Of course, you don't use the model that you just fitted to only the head of the time series; you refit it to the entire history, forecast that out, and you're done.

The advantage here is that this is always feasible if you can refit your methods, whether you're comparing boosting, neural networks, or anything else. It's a little harder to use this for selecting among judgmental forecasts, because it's very hard to tell somebody who is producing a judgmental forecast to disregard the end of the time series and only look at the beginning; people will always peek. You really have to make sure that they don't see the period they're forecasting: show them only the beginning of the time series, and if they have never seen the rest, you can ask them to forecast judgmentally and run this approach with judgmental forecasts too. But for anything that's purely statistical, purely in your computer, you can always tell your computer: don't look at this, take the beginning, fit your model, forecast, tell me what you forecasted, I'll compare it to the actuals that I hid from you, I'll tell you how good you were, and I'll select the one that performs best. This is pretty much the gold standard in forecast selection.
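A minimal sketch of holdout-based selection, using only NumPy and two simple illustrative methods (a historical-mean forecast and a seasonal naive benchmark, which are my stand-ins rather than the exact models from the video):

```python
import numpy as np

def mean_forecast(history, h):
    """Flat-line forecast: the historical mean repeated h steps ahead."""
    return np.full(h, history.mean())

def seasonal_naive(history, h, m=12):
    """Repeat the last observed seasonal cycle of length m."""
    reps = int(np.ceil(h / m))
    return np.tile(history[-m:], reps)[:h]

rng = np.random.default_rng(2)
y = 100 + 15 * np.sin(np.arange(60) * 2 * np.pi / 12) + rng.normal(0, 4, 60)

holdout = 12                          # roughly the last 20% of five years of monthly data
train, test = y[:-holdout], y[-holdout:]

methods = {"historical mean": mean_forecast, "seasonal naive": seasonal_naive}
scores = {name: np.mean(np.abs(f(train, holdout) - test)) for name, f in methods.items()}
winner = min(scores, key=scores.get)  # best holdout accuracy wins
print(scores, "->", winner)

# Refit the winning method on the entire history and forecast out
final_forecast = methods[winner](y, 12)
```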
So to summarize, there are three main ways of forecast selection. There's in-sample fit: do not use this. There are information criteria, which are relevant but of limited usability, because you can only use them for statistical methods, not, or mostly not, for machine learning, and even then only in a limited way to compare statistical forecasts. And third, there's holdout performance, and that is always, or almost always, feasible.
Now let's move on to forecast combination. If it's so hard to select among forecasts, perhaps we can just combine them, learn from all of them, not throw information away, and use it all. There are different aspects to look at here. We're going to talk about combination by averaging and combination by pooling. We'll also talk a little about why this works at all, and about something that's been called the forecast combination puzzle in the literature.
Onward: combination by averaging. That's actually the very simplest way of combining forecasts. We have our time series and the forecasts from three different methods, and we just combine them per time bucket. So we look at the forecasts one step ahead, we have three different forecasts from our three different methods, we take the average, and we're done. Hooray. It's very simple, and it's very understandable and explainable: you can explain it to people with no statistical background at all. And you get a forecast out that looks like this. It's a little seasonal, because two of our component forecasts are seasonal, but the seasonality in the forecast has been dampened, because we're mixing it with a non-seasonal flat-line forecast.

You can take the arithmetic mean, you can take the median, you can trim the extreme forecasts or use winsorized means, or anything else. There are different variations here, but the core idea is really just to take the average of your forecasts.
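A minimal sketch of combination by averaging per time bucket; the three forecast rows below are made-up numbers purely for illustration:

```python
import numpy as np

# Made-up forecasts for the same three future periods from three methods
forecasts = np.array([
    [105.0, 105.0, 105.0],   # historical mean (flat line)
    [ 96.0, 112.0, 101.0],   # seasonal ARIMA
    [ 98.0, 110.0, 103.0],   # exponential smoothing
])

combo_mean   = forecasts.mean(axis=0)        # simple average per time bucket
combo_median = np.median(forecasts, axis=0)  # robust alternative; with more methods you
                                             # could also trim or winsorize the extremes
print(combo_mean, combo_median)
```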
There is also a related concept, which is combination by pooling, and that really consists of first removing some forecasts. If we have a good idea that we simply don't trust some forecasts, that they're probably not very good, let's just remove them and then average everything else. How do we remove forecasts? Well, here we might find out by some analysis that this time series is really seasonal; then perhaps it doesn't make sense to include a non-seasonal flat-line forecast, so we just remove it and average the seasonal forecasts. We could also look at the holdout performance of all forecasting methods and remove the one that performs worst on the holdout sample, or we could again look at information criteria; there are different ways of dealing with this. In this case we might select only the bottom two, the seasonal forecasts, combine them by averaging, and we again get a nice seasonal forecast that might perform rather well.
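A minimal sketch of pooling: drop the worst performer(s) on the holdout sample and average whatever is left. The method names, holdout errors, and forecasts below are hypothetical:

```python
import numpy as np

# Hypothetical holdout MAEs (smaller = better) and one-step-ahead forecasts for five methods
holdout_mae = {"mean": 9.1, "arima": 4.3, "ets": 4.8, "boosting": 5.2, "judgmental": 7.5}
forecasts   = {"mean": 105.0, "arima": 98.5, "ets": 99.2, "boosting": 101.0, "judgmental": 97.0}

# Pooling: drop the worst performer(s) on the holdout sample, then average the rest
n_drop = 1
keep = sorted(holdout_mae, key=holdout_mae.get)[:-n_drop]
pooled = np.mean([forecasts[m] for m in keep])
print("kept:", keep, "pooled forecast:", round(pooled, 1))
```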
Now, why does this work? Why can't we just select the best forecasting method? Why does it make sense to look at all of them? Why does this very often outperform model selection? That's an empirical fact: very often it works better than selection. Why is that? The point is that usually there is no data generating process that we just have to find. It's not a question of finding the process that generated our data. Reality is complex, reality is noisy, reality has nonlinearities and influencing factors that we have never dreamed of. Usually there is no method somewhere out there in the platonic ether that generated our data, such that our job is to find that statistical process and project it out. No, typically our job is to find something that is usually wrong, usually not the data generating process, but that is useful. There's this famous quote by George Box: all models are wrong, but some are useful. We have to find the most useful one. And combination works by reducing the effect of a very wrong model slipping through a selection step. If we select, we always have a good chance of selecting, just by random noise, a model that simply does not do a good job at forecasting. Averaging essentially averages out these possibilities for failure and makes our forecasts more stable and less variable. So there's also something that statisticians like to call the bias-variance trade-off that comes into play here.
All right. Finally, there is something we like to call the forecast combination puzzle; that's actually a term that's been used in scientific work. What does this refer to? Well, we just averaged forecasts, using an arithmetic mean or something similar. If you think about it, we could also do a weighted combination. Some of these forecasts are going to perform better and some are going to perform worse, so let's give those that perform better a higher weight than the others. For instance, you can look at information criteria, or again at holdout performance, it's always the same idea: figure out which ones are better and which ones are worse, give the better ones a higher weight than the worse ones, and then use a weighted average.
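A minimal sketch of one such scheme, with weights proportional to inverse holdout error, next to the plain unweighted average; the errors and forecasts are made-up numbers:

```python
import numpy as np

# Made-up holdout errors (MAE) and one-step-ahead forecasts from three methods
holdout_mae = np.array([4.0, 6.0, 9.0])
forecasts   = np.array([101.0, 97.0, 110.0])

# Unweighted combination: the plain average
f_unweighted = forecasts.mean()

# Weighted combination: weights proportional to inverse holdout error
w = 1.0 / holdout_mae
w /= w.sum()
f_weighted = np.dot(w, forecasts)

print(f"unweighted: {f_unweighted:.1f}  weighted: {f_weighted:.1f}  weights: {np.round(w, 2)}")
```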
And the empirical observation is that unweighted combinations very often outperform clever schemes for setting weights. Most people who claim to find optimal weights in some way find that it's very hard to beat the unweighted combination. That's what people call the forecast combination puzzle: why is it that an unweighted combination very often works better than finding optimal weights?
One possible solution for this has been suggested by a 2016 paper that I like very much, which points to the fact that if we want to optimize weights, then we have uncertainty in these weights: the weights have a variance, they depend on the data. This noise in selecting, picking, or optimizing the weights goes directly through to our forecasts and makes them more wiggly and more variable. And variability is usually, statistically, not a good thing; we like to remove variability because it's unpredictable and it makes things harder. In this case, the variance in the estimation of weights may lead to worse forecasts at the end of the line. So, bottom line: even if forecast combination with weights sounds like a good idea in theory, in practice it may not work. I'm not saying not to look at it; by all means, do look at weights. Just be open to the possibility that an unweighted combination, or pooling, where you just remove the very worst forecasts, might outperform optimizing the weights.
All right, to conclude. If you have multiple forecasts, and who doesn't? We're drowning in a plethora of forecasts because there are so many people who will give their forecasts. If you have multiple forecasts for the same point in time, you need to figure out a single forecast to work with, to decide upon. You can select your forecast by in-sample fit: no, you shouldn't do that, I'm repeating myself, don't use in-sample fit as a forecast selection criterion. You could use information criteria, or what is probably the gold standard, holdout performance. However, you can also look at forecast combination, and that, just empirically, it's just an observation, very often outperforms selection. So do think about combination: you can do a simple combination by averaging, or a simple combination by pooling, by removing some forecasts. It works for, well, reasonable reasons, let's call it that. And finally, surprisingly often, an unweighted combination will outperform setting optimal weights. So do think about not optimizing weights and just going with an unweighted combination. Thank you very much.