Business Forecasting Principles: 15. Forecast Selection and Combination
By Lancaster CMAF
Summary
## Key takeaways

- **Avoid In-Sample Fit for Forecast Selection**: Never select a forecast model based on its in-sample fit, as this leads to overfitting the noise in the data and results in worse out-of-sample performance. [04:01], [06:34]
- **Information Criteria: Useful but Limited**: Information criteria like the AIC and BIC penalize model complexity, offering a better alternative to in-sample fit, but they cannot be used to compare different model classes. [06:40], [08:46]
- **Holdout Performance: The Gold Standard**: Selecting forecasts based on their performance on a holdout sample is the most reliable method, as it directly measures out-of-sample accuracy and avoids overfitting. [09:10], [12:05]
- **Forecast Combination Outperforms Selection**: Combining multiple forecasts, often by simple averaging, empirically tends to outperform selecting a single best forecast because reality is complex and no single model is perfect. [12:40], [15:20]
- **The Forecast Combination Puzzle**: Surprisingly often, simple unweighted forecast combinations outperform sophisticated methods that attempt to optimize weights, possibly due to the variance introduced by estimating weights. [16:57], [18:15]
Topics Covered
- Why You Must Avoid In-Sample Fit for Forecast Selection.
- Holdout Performance: The Gold Standard for Forecast Selection.
- Why Combining Forecasts Beats Picking a Single Best Model.
- The Forecast Combination Puzzle: Why Simple Averages Win.
Full Transcript
Welcome everybody to another video in our series on forecasting. I'm Stephan Kolassa, a data science expert at SAP Switzerland, and I'm also affiliated with the Lancaster University Centre for Marketing Analytics and Forecasting. Today we'll talk about forecast selection and combination.
So the question we're going to talk about today is how to deal with multiple forecasts: multiple possible forecasts for the same point in time in the future. These could come from different model classes. We might have an ARMA forecast, we might have an exponential smoothing forecast, we might have boosting, deep learning, what have you. We might have judgmental forecasts, that is, forecasts that just come in from humans who threw something into the pot. How do we deal with all of these?
Or we might have different models within a class. We might have ARMA forecasts, but within ARMA there is a plethora, actually infinitely many, of different models, say an ARIMA(1,0,1) versus an ARIMA(2,0,0). Your experts can explain all the details about how these differ in excruciating detail. We might have seasonal versus non-seasonal exponential smoothing, or any of the rest of the zoo of exponential smoothing methods. And of course, in boosting or deep learning we might have different parameterizations, different hyperparameters, or what have you. So we might have lots of different forecasts, all for the same point in time. Picture a room full of forecasters, everybody screaming their favorite number at you for one point in time in the future. How do you deal with those?
How do we get a single forecast to act on? Here's a little time series which I took from the M3 forecasting competition, and I calculated three different forecasts; I didn't really pick ones that made a lot of sense. But there's one benchmark that you should always keep in mind, which is the historical mean, which is surprisingly often very effective: just take the historical data, take the average, and forecast that out. Of course, that's a flat line, and people often complain that flat lines can't be sophisticated. Well, actually, often enough they are. It's a question of knowing when the flat line is best. You could also call the historical mean forecast an ARIMA(0,0,0) forecast. That sounds much more sophisticated, but it's the exact same thing. You could also have something like an ARIMA(3,0,3) model with a seasonal (1,0,0) component. That sounds much more sophisticated, and the forecast also looks more sophisticated because it wiggles up and down. That doesn't mean it's going to be more accurate. Finally, a third contender would be an exponential smoothing forecast with multiplicative error, no trend and additive seasonality; I just pulled that out of a hat. So now we have three different candidates. How do we deal with these? We want one forecast to act upon.
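As a minimal sketch of what these three candidates might look like in code, assuming the statsmodels library and a synthetic monthly series standing in for the M3 series shown in the video (the ARIMA orders here mirror the examples above but are otherwise arbitrary):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.exponential_smoothing.ets import ETSModel

# Synthetic monthly series standing in for the M3 series from the video
rng = np.random.default_rng(42)
y = 100 + 10 * np.sin(np.arange(60) * 2 * np.pi / 12) + rng.normal(0, 5, 60)
h = 12  # forecast horizon

# Candidate 1: historical mean -- a flat line, equivalent to ARIMA(0,0,0) with a constant
f_mean = np.full(h, y.mean())

# Candidate 2: a seasonal ARIMA; the orders are picked purely for illustration
f_arima = ARIMA(y, order=(3, 0, 3), seasonal_order=(1, 0, 0, 12)).fit().forecast(h)

# Candidate 3: exponential smoothing with multiplicative error, no trend, additive seasonality
f_ets = ETSModel(y, error="mul", trend=None, seasonal="add",
                 seasonal_periods=12).fit(disp=False).forecast(h)
```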
There are essentially two ways of going about this: forecast selection and forecast combination. Let's look at these in turn.

Forecast selection is just what it says on the tin: it's selecting one of these three forecasts and going with that one. Combination, later on, will probably not surprise you: it means we combine them. Let's stick with selection for now. I'm just going to select one of these three forecasts. But how do we do that? There are three different ways I'm going to present today: one goes by in-sample fit, one by information criteria, one by holdout performance. Let's look at these in turn.
Selection by in-sample fit. If we have a time series, most statistical methods, and most machine learning methods too, will give you an in-sample fit. They will tell you what the model thinks the average demand, or whatever you're forecasting, would have been in sample, on top of which there is also some unavoidable noise, because a time series always consists of a systematic component that we're trying to forecast and something that we can't capture, which we call noise. The red lines in our little pictures here are the in-sample fits: the systematic component in the training sample. Now we might want to say, well, let's just pick the one that fitted best in sample, because that one apparently understood the data in the training sample best. Then of course you'd say that the flat line at the very top doesn't do a very good job, because there is a lot of wiggling that the flat line doesn't capture, and the seasonal methods are much better. So we should be picking one of those: just use your favorite error measure and pick the one that scores lowest on that error measure. It's actually very simple.
The problem is that you should not do this. We should really not do this. Let me repeat: you should not do this. Why? The problem is that given any model, you can always make it more complicated by adding components, by adding autoregressive orders. You can go from an AR(1) to an AR(2). In an exponential smoothing method, you can take a method with no trend and add a trend component. Making a model more complicated will always improve the in-sample fit, simply because the model has more leeway, more ways of twisting things around and fitting in here and fitting in there. It's more flexible, and that means it can adapt better to the in-sample time series. The problem is that at some point it will start fitting the noise. It will start thinking that it understands stuff that is simply noise, simply unforecastable random variation. And the problem is you will not know, because in sample, making a model more and more complicated will look like it's doing a better and better job, while at some point out of sample it will get worse. So do not do that. Do not rely on in-sample fit as a forecast selection method. This is overfitting. Don't do that.
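To see overfitting in numbers, here is a toy sketch, not from the video, using polynomial trend models of increasing degree as stand-ins for increasingly complicated forecasting models. The in-sample error can only shrink as complexity grows, while the holdout error eventually blows up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_train = 48, 36
t = np.linspace(0.0, 1.0, n)            # rescaled time index for numerical stability
y = 100 + 20 * t + rng.normal(0, 3, n)  # linear trend plus unforecastable noise

for degree in (1, 2, 4, 8):             # increasingly "complicated" models
    coefs = np.polyfit(t[:n_train], y[:n_train], degree)  # fit on the head of the series only
    fit_mse = np.mean((np.polyval(coefs, t[:n_train]) - y[:n_train]) ** 2)
    out_mse = np.mean((np.polyval(coefs, t[n_train:]) - y[n_train:]) ** 2)
    print(f"degree {degree}: in-sample MSE {fit_mse:7.2f}   holdout MSE {out_mse:10.2f}")
```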
right. Happily enough, there's other
methods and one alternative is
information criteria. Uh so we again
have the insample fits here. Um but we
also have uh things cryptical
abbreviations like the AIC and the BIC.
If you have a statistical forecasting
method like the ARMA or exponential
smoothing, these will also give you
information criteria. So these are KPIs
key performance indicators of a
statistical model applied to data. It's
the there's different flavors here.
There's a kaius information criteria and
that the basian or schwarz information
criteria and that's the AIC and BIC that
we have in here. And those are
essentially the insample errors that we
just talked about, but they're penalized
by the accur by the complexity of the
model. So you don't just look at the
insample accuracy or insample model fit,
but you penalize this number by the
complexity. So at some point in time,
this complexity penalty outweighs the
insample fit. And so what you're doing
then is you select the forecast with the
lowest information criterion. The
absolute numbers don't really make a lot
of difference. They differ by
implementation, but it's just if you
have multiple forecasts, just select the
one with the lowest information
criterion. Can deal with AIC, you can
use BIC. It doesn't really make all that
much of a difference. Sometimes one
performs better, sometimes the other. It
doesn't really make that much of a
difference. It's much more informative
than just in sample fit because of this
penalty term for complexity, which makes
uh which makes sure that you don't
overfit too badly. can still overfit
with information criteria but it's much
harder and you have an in built-in break
that stops you from being too complex
for your own good. Problem uh there's
always problems here. The problem here
with information criteria is that you
cannot compare them between different
model classes. You can't compare you
cannot compare the AIC between an ARMA
model and the AIC for an exponential
smoothing model simply for statistical
reasons. There's completely different
things. you're comparing apples and
oranges. Um, and it's simply you can't
do it. It's not meaningful. Can't do it.
Don't try it. Also, you can't compare
ARMA models with different integration
orders. Uh, so can only compare ARMA
models with the same integration order
on information criteria. So, they're
limited. They're really limited in in
application, but better than in sample
fit.
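A minimal sketch of selection by information criterion, assuming statsmodels and staying within one model class and one integration order, as described above; the candidate orders and the simulated series are purely illustrative:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate an AR(1)-like series; in practice y would be your historical data
rng = np.random.default_rng(1)
y = np.empty(120)
y[0] = 0.0
for t in range(1, 120):
    y[t] = 0.6 * y[t - 1] + rng.normal()

# Candidate (p, q) orders within one model class and one integration order (d = 0)
candidates = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
fits = {(p, q): ARIMA(y, order=(p, 0, q)).fit() for p, q in candidates}

for (p, q), res in fits.items():
    print(f"ARMA({p},{q}): AIC = {res.aic:8.2f}")

best = min(fits, key=lambda k: fits[k].aic)  # lowest information criterion wins
print("Selected:", best)
```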
Happily enough, there is a third option, and it's almost always available: selection by holdout performance. How does that work? We again have our historical time series. Selection by holdout performance is not so much a selection of forecasts but more one of forecasting methods. Why? In looking at holdout performance, we use what we call a holdout sample. We take the last couple of observations; 20% of the time series is a rough rule of thumb. So if you have five years of data, you hold out the last year. You take the first four years, fit your model to those four years, and forecast out into the period that you just held out, that you hid from the model. Then you can assess whether your forecasting method did a good job forecasting into this holdout sample. The advantage is that since we're not looking at in-sample fit but at true out-of-sample performance, we're not overfitting so much; we can still overfit here, but it's much harder. You do that for every single method that you have, whether it's the mean forecast or the ARMA or your boosting method or whatever else. You assess the quality and then pick the method that performed best on this holdout sample, and now you have one winning method that you're going to stick with. Of course, you don't use the model that you just fitted to only the head of the time series; you refit it to the entire history, forecast that out, and you're done.

The advantage here is that this is always feasible if you can refit your methods, whether you're comparing boosting, neural networks, or anything else. It's a little harder to use this for selecting among judgmental forecasts, because it's very hard to tell somebody who is producing a judgmental forecast to disregard the end of the time series and only look at the beginning; people will always peek. You really have to make sure that they don't see the period they're forecasting: show them only the beginning of the time series, and if they have never seen the rest, you can ask them to forecast judgmentally and run this approach with judgmental forecasts too. But for anything that's purely statistical, purely in your computer, you can always tell your computer: don't look at this, take the beginning, fit your model, forecast, tell me what you forecasted, I'll compare it to the actuals that I hid from you, I'll tell you how good you were, and I'll select the one that performs best. This is pretty much the gold standard in forecast selection.
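A minimal sketch of holdout-based selection, using only NumPy and two simple illustrative methods (a historical-mean forecast and a seasonal naive benchmark, which are my stand-ins rather than the exact models from the video):

```python
import numpy as np

def mean_forecast(history, h):
    """Flat-line forecast: the historical mean repeated h steps ahead."""
    return np.full(h, history.mean())

def seasonal_naive(history, h, m=12):
    """Repeat the last observed seasonal cycle of length m."""
    reps = int(np.ceil(h / m))
    return np.tile(history[-m:], reps)[:h]

rng = np.random.default_rng(2)
y = 100 + 15 * np.sin(np.arange(60) * 2 * np.pi / 12) + rng.normal(0, 4, 60)

holdout = 12                          # roughly the last 20% of five years of monthly data
train, test = y[:-holdout], y[-holdout:]

methods = {"historical mean": mean_forecast, "seasonal naive": seasonal_naive}
scores = {name: np.mean(np.abs(f(train, holdout) - test)) for name, f in methods.items()}
winner = min(scores, key=scores.get)  # best holdout accuracy wins
print(scores, "->", winner)

# Refit the winning method on the entire history and forecast out
final_forecast = methods[winner](y, 12)
```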
So to summarize, there are three main ways of forecast selection. There's in-sample fit: do not use this. There are information criteria, which are relevant but of limited usability, because you can only use them for statistical methods, not, or mostly not, for machine learning, and even then only in a limited way to compare statistical forecasts. And third, there's holdout performance, and that is always, or almost always, feasible.
Now let's move on to forecast combination. If it's so hard to select among forecasts, perhaps we can just combine them, learn from all of them, not throw information away, and use it all. There are different aspects to look at here. We're going to talk about combination by averaging and combination by pooling. We'll also talk a little about why this works at all, and about something that's been called the forecast combination puzzle in the literature.
Onward: combination by averaging. That's actually the very simplest way of combining forecasts. We have our time series and the forecasts from three different methods, and we just combine them per time bucket. So we look at the forecasts one step ahead, we have three different forecasts from our three different methods, we take the average, and we're done. Hooray. It's very simple, and it's very understandable and explainable: you can explain it to people with no statistical background at all. And you get a forecast out that looks like this. It's a little seasonal, because two of our component forecasts are seasonal, but the seasonality in the forecast has been dampened, because we're mixing it with a non-seasonal flat-line forecast.

You can take the arithmetic mean, you can take the median, you can trim the extreme forecasts or use winsorized means, or anything else. There are different variations here, but the core idea is really just to take the average of your forecasts.
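A minimal sketch of combination by averaging per time bucket; the three forecast rows below are made-up numbers purely for illustration:

```python
import numpy as np

# Made-up forecasts for the same three future periods from three methods
forecasts = np.array([
    [105.0, 105.0, 105.0],   # historical mean (flat line)
    [ 96.0, 112.0, 101.0],   # seasonal ARIMA
    [ 98.0, 110.0, 103.0],   # exponential smoothing
])

combo_mean   = forecasts.mean(axis=0)        # simple average per time bucket
combo_median = np.median(forecasts, axis=0)  # robust alternative; with more methods you
                                             # could also trim or winsorize the extremes
print(combo_mean, combo_median)
```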
There is also a related concept, which is combination by pooling, and that really consists of first removing some forecasts. If we have a good idea that we simply don't trust some forecasts, that they're probably not very good, let's just remove them and then average everything else. How do we remove forecasts? Well, here we might find out by some analysis that this time series is really seasonal; then perhaps it doesn't make sense to include a non-seasonal flat-line forecast, so we just remove it and average the seasonal forecasts. We could also look at the holdout performance of all forecasting methods and remove the one that performs worst on the holdout sample, or we could again look at information criteria; there are different ways of dealing with this. In this case we might select only the bottom two, the seasonal forecasts, combine them by averaging, and we again get a nice seasonal forecast that might perform rather well.
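A minimal sketch of pooling: drop the worst performer(s) on the holdout sample and average whatever is left. The method names, holdout errors, and forecasts below are hypothetical:

```python
import numpy as np

# Hypothetical holdout MAEs (smaller = better) and one-step-ahead forecasts for five methods
holdout_mae = {"mean": 9.1, "arima": 4.3, "ets": 4.8, "boosting": 5.2, "judgmental": 7.5}
forecasts   = {"mean": 105.0, "arima": 98.5, "ets": 99.2, "boosting": 101.0, "judgmental": 97.0}

# Pooling: drop the worst performer(s) on the holdout sample, then average the rest
n_drop = 1
keep = sorted(holdout_mae, key=holdout_mae.get)[:-n_drop]
pooled = np.mean([forecasts[m] for m in keep])
print("kept:", keep, "pooled forecast:", round(pooled, 1))
```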
Now, why does this work? Why can't we just select the best forecasting method? Why does it make sense to look at all of them? Why does this very often outperform model selection? That's an empirical fact: very often it works better than selection. Why is that? The point is that usually there is no data generating process that we just have to find. It's not a question of finding the process that generated our data. Reality is complex, reality is noisy, reality has nonlinearities and influencing factors that we have never dreamed of. Usually there is no method somewhere out there in the platonic ether that generated our data, such that our job is to find that statistical process and project it out. No, typically our job is to find something that is usually wrong, usually not the data generating process, but that is useful. There's this famous quote by George Box: all models are wrong, but some are useful. We have to find the most useful one. And combination works by reducing the effect of a very wrong model slipping through a selection step. If we select, we always have a good chance of selecting, just by random noise, a model that simply does not do a good job at forecasting. Averaging essentially averages out these possibilities for failure and makes our forecasts more stable and less variable. So there's also something that statisticians like to call the bias-variance trade-off that comes into play here.
All right. Finally, there is something we like to call the forecast combination puzzle; that's actually a term that's been used in scientific work. What does this refer to? Well, we just averaged forecasts, using an arithmetic mean or something similar. If you think about it, we could also do a weighted combination. Some of these forecasts are going to perform better and some are going to perform worse, so let's give those that perform better a higher weight than the others. For instance, you can look at information criteria, or again at holdout performance, it's always the same idea: figure out which ones are better and which ones are worse, give the better ones a higher weight than the worse ones, and then use a weighted average.
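A minimal sketch of one such scheme, with weights proportional to inverse holdout error, next to the plain unweighted average; the errors and forecasts are made-up numbers:

```python
import numpy as np

# Made-up holdout errors (MAE) and one-step-ahead forecasts from three methods
holdout_mae = np.array([4.0, 6.0, 9.0])
forecasts   = np.array([101.0, 97.0, 110.0])

# Unweighted combination: the plain average
f_unweighted = forecasts.mean()

# Weighted combination: weights proportional to inverse holdout error
w = 1.0 / holdout_mae
w /= w.sum()
f_weighted = np.dot(w, forecasts)

print(f"unweighted: {f_unweighted:.1f}  weighted: {f_weighted:.1f}  weights: {np.round(w, 2)}")
```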
And the empirical observation is that unweighted combinations very often outperform clever schemes for setting weights. Most people who claim to find optimal weights in some way find that it's very hard to beat the unweighted combination. That's what people call the forecast combination puzzle: why is it that an unweighted combination very often works better than finding optimal weights?
One possible solution for this has been suggested by a 2016 paper that I like very much, which points to the fact that if we want to optimize weights, then we have uncertainty in these weights: the weights have a variance, they depend on the data. This noise in selecting, picking, or optimizing the weights goes directly through to our forecasts and makes them more wiggly and more variable. And variability is usually, statistically, not a good thing; we like to remove variability because it's unpredictable and it makes things harder. In this case, the variance in the estimation of weights may lead to worse forecasts at the end of the line. So, bottom line: even if forecast combination with weights sounds like a good idea in theory, in practice it may not work. I'm not saying not to look at it; by all means, do look at weights. Just be open to the possibility that an unweighted combination, or pooling, where you just remove the very worst forecasts, might outperform optimizing the weights.
All right, to conclude. If you have multiple forecasts, and who doesn't? We're drowning in a plethora of forecasts because there are so many people who will give their forecasts. If you have multiple forecasts for the same point in time, you need to figure out a single forecast to work with, to decide upon. You can select your forecast by in-sample fit: no, you shouldn't do that, I'm repeating myself, don't use in-sample fit as a forecast selection criterion. You could use information criteria, or what is probably the gold standard, holdout performance. However, you can also look at forecast combination, and that, just empirically, it's just an observation, very often outperforms selection. So do think about combination: you can do a simple combination by averaging, or a simple combination by pooling, by removing some forecasts. It works for, well, reasonable reasons, let's call it that. And finally, surprisingly often, an unweighted combination will outperform setting optimal weights. So do think about not optimizing weights and just going with an unweighted combination. Thank you very much.