Tackling the Cold Start Challenge in Demand Forecasting - PyCon DE Berlin 2024
By paretos
Summary
Topics Covered
- Cold Start Forecasting Needs Global Models
- Impute History from Similar Items
- Use True Cold Backtesting
- Complex Models Beat Simple Baselines Only Marginally
Full Transcript
[Applause] Yes, thank you very much for the introduction. It's actually very cool to be here, and we are really looking forward to sharing a bit of our experience from everyday work with demand forecasting, especially this topic of the cold start problem, which we think is somewhat under-represented: it is often addressed, but not in a very concise way, we feel. This is our contribution to give a little bit of structure and condensed content to this topic.

That's my colleague and me; we both work for a company called paretos, a Heidelberg-based decision intelligence startup. Just a little bit of advertisement, bear with me: our mission statement is "no more bad decisions". What we often observe is that, if you look into businesses and how decision processes are handled, a lot of critical things are, even in 2024, in times of AI and machine learning, still done in Microsoft Excel, with a lot of manual work and a lot of error-prone steps that don't scale well. This is something we address, because we're convinced we can do better, and demand forecasting is one part of that.
For today's talk we're going to stick to a specific example, namely demand forecasting for, let's say, a fashion retailer, like our small, nicely AI-generated fashion retailer COA. As a typical fashion retailer, you sell fashion items and you have a supply chain behind that, so you have suppliers. The question gets especially interesting once you want to launch a new collection: maybe summer is coming up and you have new items that you want to sell. The question is: how many do I have to order for each of those items? On the one hand, I want to order enough to meet the demand my customers will have; on the other hand, I don't want to order too much, because then I will overstock, which induces all sorts of bad consequences like warehouse costs and so on. Somewhere in between there is a sweet spot, and this is actually a classical use case for decision intelligence, where you have an optimization problem and what optimality means depends on you: on your KPIs, on your risk affinity, on what's important to your business. The point is that having a good demand forecast, even for the scenario where you have a new collection, is a very critical part of that process of looking into the future.
Now, the example we are looking at here is actually drawn from a real dataset; we'll give you the reference later, so this is also something you can play with and try out. It's called VISUELLE, a dataset consisting of around 5,000 items from a real fashion company, an Italian one. It contains data from around four years, so all those product images are actually taken from the dataset; the items have descriptions and so on, and most notably they contain the time series data of the historic sales that you want to learn from.

Speaking of time series: does anyone here have practical experience with time series? Maybe raise your hands. Nice, that's quite a few. Okay, then for those of you who didn't raise a hand we'll start with a little recap, but I guess we can keep this one a bit shorter.
So, what is a time series? You probably know. You can see two examples in this image: there are dates on the x-axis, and for each date we have some number of sales, so here you see the target on the y-axis, which is the number of pieces sold. A time series means there is a time dimension, and there is a point we can think of as the prediction time, which divides the past and the future. We stand at this prediction time, and our task is to say what is going to happen: how many pieces will we sell for each item. In this case we have two different items, maybe a shirt and shorts or something.
What else do we have in terms of data? We have static covariates. What is a covariate? You can think of it as a feature that you feed to the model, which learns from it, and static means that it doesn't change over time. For instance, it can be an image of a product, or textual descriptions like size, which category it is (say, shirt), color, fabric and so on. Besides that, you can also have time-varying covariates; these are features that change over time. For instance, calendar features, which are known both for the past and for the future, because you know which day of the week it will be in 15 weeks, and you may also have information about prices and the weather.
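A minimal sketch of this data layout, target plus static and time-varying covariates in long format; all column names and values are illustrative, not taken from the VISUELLE dataset:

```python
import pandas as pd

# Target and time-varying covariates: one row per item and week.
sales = pd.DataFrame({
    "item_id": ["shirt_1"] * 3 + ["shorts_7"] * 3,
    "date": pd.to_datetime(["2024-03-04", "2024-03-11", "2024-03-18"] * 2),
    "units_sold": [12, 18, 15, 7, 9, 11],           # the target
    "price": [29.9, 29.9, 24.9, 19.9, 19.9, 19.9],  # time-varying covariate
})

# Static covariates: one row per item, constant over time.
static = pd.DataFrame({
    "item_id": ["shirt_1", "shorts_7"],
    "category": ["shirt", "shorts"],
    "color": ["blue", "black"],
})

# Calendar features are time-varying but known for the future as well.
df = sales.merge(static, on="item_id")
df["day_of_week"] = df["date"].dt.dayofweek
print(df)
```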
Now let's imagine traveling back in time three months, so that we stand at this prediction time. What is notable in this example is that one item doesn't have any past target, but we still need to make a forecast. This is what we call a cold situation, the cold start problem, and it means we don't have enough target data.

What are the challenges that appear here? First of all, standard solutions don't exist. If you refer to a book that probably everyone with time series experience knows, by Rob Hyndman: "judgmental forecasting is usually the only available method for new product forecasting". What is meant by judgmental forecasting? It means we have a human expert (this is an AI-generated picture of an expert), a domain expert who uses their own judgment and tells us how many pieces we will sell. That is probably a fine approach, but it's hard to scale: what if we have 10,000 items for which we want to make a forecast every week? That is probably not doable, and it's also hard because in fashion the horizon you usually want to forecast is 12 to 20 weeks into the future, which is really hard. The second challenge is performance evaluation: if we make a forecast, how do we guarantee that it's good? What if it's just random numbers? We need to make sure that we can really tackle the cold start forecast without seeing any historical sales.
For that, we can suggest some basic ideas on which to base our methods for tackling the cold start. We don't have a past target and we see that as a problem, but let's not forget that we have other available data: we have images, and we have time-varying features, or covariates. We can also be creative with features and think about what else could help us, even if it's not in our dataset or not directly related; for instance, Google Trends, why not use that? And of course, arguably the most important piece of information in our dataset is the historic target of the other items. If we don't know anything about the sales of this shirt, because it hasn't been launched yet, we can look at our dataset and see how many sales, and with which seasonality patterns, similar shirts had in the past. We have that to learn from.

To summarize: we want to search for available data, we want to be creative with features, and we want to make sure that we have similar items with a past in the data. Because if you sell vegetables and fruit and then suddenly want to start selling merchandise shirts, that history is probably not going to help; you're probably not going to forecast the new items well with your data. Finally, we also want to think in terms of a global model, global meaning one model that has access to all items, to all time series, and is able to learn from other items in order to forecast new launches. We also have a reference showing that this is, in general, the better approach compared to a local one, that is, one model per time series.
series and having all this data how can we proceed uh so in general you see that it has has different modalities we have images we have textual information and we have
numerical information so with numerical one is very simple you just feed it to your model and it learns then with checkol we can do one H en coding and to
reduce it to vectors and for images we can apply Vector embedding and this has been done in two references that we add here for you um so it is also from demand
forecasting they take pictures of products and they do Vector embedding and essentially what you end up with from all these data sets from all these different modalities are vectors and
having vectors as you know from math you can always measure distance so now you can know what is the most similar product to the short for which you want
to make a forecast or maybe how many of them you have that are similar and how similar they are now let's think about models that require P Target the most most obvious
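A minimal sketch of that idea: once every product is represented as a vector (image or text embeddings plus encoded attributes), cosine similarity gives a ranking of the most similar historical items. The `embeddings` and `new_item_vec` arrays here are random placeholders standing in for whatever encoder you actually use.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
item_ids = ["shirt_1", "shirt_2", "dress_3", "shorts_7"]
embeddings = rng.normal(size=(4, 128))      # one 128-d vector per known item
new_item_vec = rng.normal(size=(1, 128))    # the not-yet-launched item

# Cosine similarity of the new item against every historical item.
sims = cosine_similarity(new_item_vec, embeddings)[0]
ranking = sorted(zip(item_ids, sims), key=lambda x: -x[1])
print(ranking[:3])   # top-3 most similar historical items
```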
Now, let's think about models that require a past target. The most obvious thing you can do is simply take the item most similar to the shirt for which you want to make a forecast but which hasn't been launched yet, take its history, and pretend that it's actually the history of the new item. Then you can train on it and make a forecast as usual. You can also be a bit more creative here: you can take the mean sales of all the similar t-shirts, or you can apply a nearest-neighbor imputation algorithm, which we also reference here, and essentially impute the past.
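A rough sketch of that history imputation: pretend the new item's past equals the average of its k most similar items, aligned by weeks since launch. The `weekly_sales` frame and the `top_k` list (e.g. taken from the similarity ranking above) are illustrative, not the speakers' actual pipeline.

```python
import pandas as pd

# Long-format sales of historical items only.
weekly_sales = pd.DataFrame({
    "item_id": ["shirt_1"] * 4 + ["shirt_2"] * 4,
    "weeks_since_launch": [0, 1, 2, 3] * 2,
    "units_sold": [30, 42, 38, 25, 22, 35, 31, 20],
})

top_k = ["shirt_1", "shirt_2"]   # the k most similar items to the new one

# Average the similar items' sales per week-since-launch and use it as the
# pretend history of the cold item.
pseudo_history = (
    weekly_sales[weekly_sales["item_id"].isin(top_k)]
    .groupby("weeks_since_launch")["units_sold"]
    .mean()
    .rename("imputed_units_sold")
)
print(pseudo_history)
```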
Alternatively, you can do dummy padding and masking. What does that mean? Whatever was there, you tell your model: don't learn on it, it's not real data. You may add a feature that informs the model that this part is imputed, or you can go deeper into the core of the model and say that this data should not be used in the loss function. For instance, we add a reference here to a paper presented by Zalando: they applied this dummy padding and masking in an encoder-decoder architecture, so the model didn't learn on the padded zeros. And something practical: there is a library by Nixtla, written in Python. With it you can very easily, in a couple of lines, use neuralforecast models, and in only a couple of lines you can add this masking part, which will do everything else for you.
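A rough sketch of the padding-and-masking idea with Nixtla's neuralforecast: the missing past of the cold item is zero-padded, and an `available_mask` column marks those rows so they are excluded from the training loss. Treat the `available_mask` column and the exact model hyperparameters as assumptions to verify against the current neuralforecast documentation; the data and model choice are placeholders, not the setup shown in the talk.

```python
import pandas as pd
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS

# One warm item with real history, one cold item whose missing first 20 weeks
# are zero-padded and masked out (available_mask = 0 means "not real data").
warm = pd.DataFrame({
    "unique_id": "shirt_1",
    "ds": pd.date_range("2023-01-02", periods=30, freq="W"),
    "y": range(10, 40),
    "available_mask": 1,
})
cold = pd.DataFrame({
    "unique_id": "new_shirt",
    "ds": pd.date_range("2023-01-02", periods=30, freq="W"),
    "y": [0.0] * 20 + [5, 8, 7, 9, 6, 8, 10, 9, 7, 8],  # padded, then real sales
    "available_mask": [0] * 20 + [1] * 10,
})
df = pd.concat([warm, cold])

nf = NeuralForecast(models=[NHITS(h=8, input_size=16, max_steps=10)], freq="W")
nf.fit(df=df)
print(nf.predict())
```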
Then we can also think about models that don't even require a past target, so for them this missing part doesn't have any consequences. For instance, we can think of tree-based models like XGBoost, which can easily handle NaNs, and there are also models whose architecture doesn't require a past target at all. To illustrate this better, let's look at an example with a tabular representation of the data. We have a target which we want to forecast; we have the time dimension, here the week number; we have some features like the day of the week, what the weather was, and which color the product has (which doesn't change, because it's a static covariate); and then we have the sales of yesterday, which could also be the sales two days ago, one week ago and so on. This is what we call a lag. In the cold start case, this past sales information has limited availability. What can we do? First of all, to recap: maybe the model can already support this column with NaNs; maybe we can fill it with the sales of a similar item and pretend it's the history of this particular item; we can do dummy padding and masking; and finally, we can just imagine this column doesn't exist. We don't use it, we don't learn on it, and it should still work.
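A minimal sketch of that tabular setup: XGBoost handles NaN in the lag column natively, so the cold start rows can simply keep their missing lags. The feature names and values here are illustrative.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

train = pd.DataFrame({
    "day_of_week": [0, 1, 2, 3, 4, 5],
    "color_code": [1, 1, 2, 2, 3, 3],
    "sales_lag_1": [12.0, 15.0, np.nan, 9.0, np.nan, 14.0],  # NaN = no history yet
    "units_sold": [14, 16, 8, 10, 11, 15],
})

# XGBoost learns a default split direction for missing values, so NaN lags
# are handled without any imputation.
model = xgb.XGBRegressor(n_estimators=50, max_depth=3)
model.fit(train.drop(columns="units_sold"), train["units_sold"])

cold_item = pd.DataFrame(
    {"day_of_week": [0], "color_code": [2], "sales_lag_1": [np.nan]}
)
print(model.predict(cold_item))
```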
Now, we have talked a lot about different methods for forecasting, but there is still the evaluation part. Even if we use one of these methods, how can we know how good it is? How do we get guarantees that we can handle the cold start problem, and how can we say which method is better than another? For that, I think Alexander can help us with evaluation.

Yes, exactly. In the time series context, the first thing that may come to mind when we talk about evaluating the performance of a model is backtesting. If you're not familiar with it, think of backtesting as a kind of cross-validation scheme that accounts for the time dimension and makes sure that you don't leak anything you are not supposed to know from the future into your testing data. And if you don't have any data, then you're going to have a hard time doing backtesting; it's by definition a tricky one. So classical backtesting is difficult for cold starts. But on the other hand, whenever you model something and deploy it, you have to make sure how well it actually works. The good news is that there is something we can do, and we want to have a look at two general strategies for tackling this, two backtesting strategies.
The first one, which you can use in the cold start context, is what we call pseudo-cold backtesting. The basic idea is that you can artificially make your data cold. What does that mean? Let's say, for the sake of simplicity, we have some time series that actually has quite a long history, so it's not cold, indicated by this flame. But we can artificially chop away the history and just pretend it's a cold start time series, now indicated by this pseudo-cold icon. Now I have a testing sample to work with: I can place my cutoffs here, I don't really have much data in the past, and I have backtest samples that look like cold start data, artificially generated.

That's obviously something you can do rather easily given your data, just chop some things away, so this approach is generally very flexible and straightforward. On the other hand, just discarding the history does not give you the true start of an item. If you think about a product launch in the fashion example, a launch is often accompanied by certain advertisement, maybe a new season, maybe special market interest, and just cutting away the history in this technical way will in general not reflect that. It's something you can use for the technical part of your evaluation, but arguably it doesn't reflect the true starts of your items. And without going into detail: you also have to stratify this a little bit, especially if you have product hierarchies; you can't, or at least should not, just randomly pick any items. So it's a little unclear, or at least difficult, how to actually draw these subsamples, but it's possible.
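A minimal sketch of pseudo-cold backtesting: pick some items with a long history, chop everything before an artificial "launch" cutoff, and then evaluate the model as if those items were cold. Column names, the toy data and the cutoff are illustrative.

```python
import pandas as pd

# Toy long-format sales data: item_id, ds (date), y (units sold).
df = pd.DataFrame({
    "item_id": ["shirt_1"] * 6,
    "ds": pd.date_range("2023-01-02", periods=6, freq="W"),
    "y": [10, 12, 9, 14, 13, 15],
})

def make_pseudo_cold(data, item_ids, cutoff):
    """Drop all history before `cutoff` for the selected items, so they look
    like newly launched (pseudo-cold) series."""
    cutoff = pd.Timestamp(cutoff)
    drop = data["item_id"].isin(item_ids) & (data["ds"] < cutoff)
    return data[~drop]

pseudo_cold = make_pseudo_cold(df, ["shirt_1"], cutoff="2023-02-06")
print(pseudo_cold)  # shirt_1 now has no observations before the cutoff
# Train on this truncated data, forecast shirt_1 from the cutoff onwards,
# and score against the actuals you chopped away.
```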
The next thing is what we call true cold backtesting, and the idea here is pretty simple. Say you again have this item with a long history: every item, no matter how old it is, has been cold at some point in the past; we just have to travel back long enough to get to that point. That's what true cold backtesting does: you go back as far as you can afford with your data to capture the actual beginnings of the time series. This solves the problem in the sense that it reflects the true dynamics of your item starts, because by definition you go back to the item start and that's what you test on. On the other hand, depending on your data situation and on how many cold starts you have observed in your historical data, this may be expensive in the sense that you only have a limited amount of testing data available. But it does seem like the cleaner approach.
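A minimal sketch of true cold backtesting: every item was genuinely cold at its actual launch, so the first observed date per item becomes the cutoff, and the model is tested on forecasts made with zero history for that item. Data and names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "item_id": ["shirt_1"] * 4 + ["shorts_7"] * 4,
    "ds": list(pd.date_range("2023-01-02", periods=4, freq="W"))
        + list(pd.date_range("2023-05-01", periods=4, freq="W")),
    "y": [10, 12, 9, 14, 5, 7, 6, 8],
})

launch_dates = df.groupby("item_id")["ds"].min()

for item, launch in launch_dates.items():
    train = df[df["ds"] < launch]                                  # known before launch
    test = df[(df["item_id"] == item) & (df["ds"] >= launch)]      # the item's real start
    # Fit the global model on `train`, which contains zero history for `item`,
    # then compare its forecasts for `item` against test["y"].
    print(item, "launch:", launch.date(),
          "| train rows:", len(train), "| test rows:", len(test))
```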
Okay, what now? What am I supposed to do? As per usual, it depends. The general rule of thumb is: use true cold backtesting if possible, and "if possible" means if you can afford it with your data. If you are, for example, a fashion retailer and every couple of months you have a new product launch, then the cold start is probably something you are very familiar with: you have a lot of samples in your data, and you can jump on this true cold approach. If you don't have that, if cold starts are more the exception, then (a) you're probably going to have a hard time fitting them anyway, and (b) there's no chance of doing the true cold approach, because you just don't have the data for it.

One point we would generally like to stress in this discussion, not on a quantitative level but qualitatively, and I think this goes beyond cold starts: whatever you do in backtesting, it should reflect the real scenario that you expect as closely as possible. What do we mean by that? Daria already mentioned the counter-example of someone who sells fruit and vegetables and all of a sudden starts dreaming: "well, I could just as well start my own fashion brand and sell t-shirts and sweaters." Yes, he may have the data to do cold backtests with his fruit, but this doesn't give you any indication of how plausible that will be for fashion products. This is an extreme example, of course, but it is a general thing to think about.
Okay, so towards the very end: we have seen methods for how to deal with the cold start, and we have seen a few evaluation schemes for judging how good or bad those methods are, and now we want to jump into a quick example that we did. This slide is actually also a good opportunity to give a quick shout-out to our colleague Simon, who assisted us in designing all these slides; that's the reason why they look so fancy. It wasn't our work, which is obvious, because this slide is the one we did last minute without Simon, as you can see. This slide illustrates the dataset we are considering, the VISUELLE dataset. We want to do some evaluation, and we are going to use the true cold backtesting scheme. The testing setup here is indeed interesting, because the testing data consists of cold starts only; it's an extreme case. It's a couple of hundred products that we consider in the testing period, we have a few thousand products to train on, and there are only cold starts: for whatever the model has to predict, there is no history at all. Due to time I will keep it a bit short; I don't want to go into the details of the models, but I want to give you the high-level idea.
Which models did we consider and what was the outcome? We have three concrete examples, three fashion items; in shaded gray you can see the respective true time series, plotted in time steps after their launch. We then considered different methods, and one thing that we think is always a good idea is to start as simply as you can, with very simple statistical or naive baselines that you understand. So we did that here. The most naive thing we could think of is to just take the historic average of everything, and that line is the first baseline we consider. The next, slightly more sophisticated step is to take the average per product category, and a little more sophisticated still, the average per product category and per time step after launch. These are the simple baselines.
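A minimal sketch of the three naive baselines, assuming a long-format frame with the target plus category and weeks-since-launch columns; the column names and values are ours, not the exact VISUELLE schema.

```python
import pandas as pd

hist = pd.DataFrame({
    "item_id": ["shirt_1", "shirt_1", "shirt_2", "shirt_2", "shorts_7", "shorts_7"],
    "category": ["shirt", "shirt", "shirt", "shirt", "shorts", "shorts"],
    "weeks_since_launch": [0, 1, 0, 1, 0, 1],
    "y": [30, 42, 22, 35, 12, 18],
})

# Baseline 1: the historic average of everything.
b1 = hist["y"].mean()

# Baseline 2: the average per product category.
b2 = hist.groupby("category")["y"].mean()

# Baseline 3: the average per category and per time step after launch.
b3 = hist.groupby(["category", "weeks_since_launch"])["y"].mean()

print(b1, b2, b3, sep="\n\n")
```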
We then compared them against an architecturally rather sophisticated deep learning model with image embeddings, text embeddings, and enrichment from Google Trends, so a very nice architecture, and we also included some rather old-school machine learning approaches, CatBoost and XGBoost, relying only on the tabular data. We got some results, and aside from the numbers, I think there are two messages here that are important. First: yes, it was worth putting in the effort, because the deep learning model was the one that wins. But if you look at the numbers, the margin is rather small, so it's not a clear winner; it's rather a case where even with very simple statistical baselines you can achieve results that even an XGBoost cannot beat, and I think that is an important message. Overall, solving the cold start problem is something that is possible: you can apply different things, you have to evaluate them properly, and don't forget about the simple baselines. And the cool thing for you is that, if you want, you will be able to take a deep dive and replicate these results rather soon. Yes, as Alexander said, you can also give it a try very soon.
We are going to publish the code that we used to produce these experiment results, and we also want to collect there all the literature review that we went through over the last years, because we couldn't find a place where all this cold start literature is really stored and findable. We're going to keep updating it as we go, and besides that, you will also find there all the references that we used in this presentation.
[Applause] Thanks a lot for that nice presentation. There are a lot of questions, so the time series and forecasting people here are really interested in this stuff. Okay, let's start: which model types or approaches do you use for time series with a lot of zeros, i.e. intermittent time series?

That's actually a very good question, and it's a problem you face often. I would say there are two dimensions to it. One is the data dimension, because the first question you should ask is whether this is really the right level at which you want to forecast, if you have very sparse time series, because these low-level series often contain too much noise. That aside, there is a whole model suite for this; intermittent time series are an active branch of research. There are classical baseline models which we think you should always use and include, like ADIDA and so on, but you can also think about specialized regression approaches that you can plug into any kind of model. So, as per usual, it depends, I would say. Thank you.
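As a rough illustration of the classical intermittent-demand baselines mentioned here (ADIDA, Croston), Nixtla's statsforecast ships implementations; the exact constructor and forecast signatures may differ between versions, so treat this as a sketch to check against the statsforecast docs. The data is illustrative.

```python
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import ADIDA, CrostonClassic

df = pd.DataFrame({
    "unique_id": ["item_1"] * 8,
    "ds": pd.date_range("2023-01-02", periods=8, freq="W"),
    "y": [0, 0, 3, 0, 1, 0, 0, 2],   # many zeros: intermittent demand
})

sf = StatsForecast(models=[ADIDA(), CrostonClassic()], freq="W")
print(sf.forecast(df=df, h=4))
```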
Is there research on whether vector embeddings of clothing-item images and their similarity correspond well to the similarity perceived by humans? I think it's fair to say: I don't know.

Okay. What advantages does Nixtla provide versus other Python libraries for forecasting, such as Darts or skforecast?
Generally a good question, I would say. What we observed with Nixtla is that the community aspect is really nice; over the last couple of years it has grown rapidly, especially in the deep learning community. For a lot of models you find, where papers have just been published, there is already a branch somewhere for neuralforecast where the model has been implemented, which is quite cool, and it's a unified kind of interface. As for the differentiation from Darts or the other libraries, I'm not so sure, to be honest. I would add that it's really fast compared to the others.

So, the questions keep changing as people vote them up; I'll try to keep that in mind.
Should my key takeaway of this session be: for cold starts, just be naive? I don't think so. The key takeaway we wanted to get across is more that you should start naive and always compare against something that you understand. A couple of weeks or months ago there were these blog posts going viral about some time series Transformer model, and two weeks later it was shown that, well, actually some seasonal exponential smoothing is better. That's unfortunate, and I think this is really the point: include this simple statistical stuff in your analysis, and if you notice that you are not really better than that, then either you continue and ask why that is (which is something that we, in this case, didn't do, also due to time constraints; we just stopped at some point), or maybe you raise the honest question of whether it's even possible to do better, because the data situation is not good. So that's definitely not the takeaway; the takeaway is more: include the baselines and do the rest as well. Definitely.
So now a question about the latest developments like TimeGPT etc.: have you tried transfer learning in the time series domain, and if yes, what's your experience with it?

We didn't. We want to, but we didn't. We're also interested in that field; we've read a lot about it, and it's something we want to dive more into. I think what we've seen is that for the cold start it may be a little bit tricky, because a lot of those models depend on the history as the most informative thing for prediction. Including covariates is, I think, even more challenging in this case, because how do you unify semantically different covariates across your data? That is obviously a challenge. So I'm not sure what there is to gain for the cold start problem, but it's definitely something we're interested in going deeper into; maybe a talk for next year.
How is training on the data from similar products different from just using the forecasting models of those products and optionally applying transfer learning? I would say that the suggestion is also a valid approach; it can be thought of as another method and can be evaluated. But how different it is, that's hard to say. In a way, this is the most naive form of transfer learning: you just use something that already exists and impute it in place of the missing time series.
Okay, now the last question (there are still a lot of open questions): how can your models predict fashion trends, items that are popular for only one season in a single year?

I'm not sure if I got the question right. If there's a new trend, say that yellow is now the color to wear, will we pick that up? I would say no, because this goes more into the direction of market and marketing research, and I would argue that's a different use case than the replenishment forecasting we have shown. Still, people are doing that, and it's very interesting, but then I think you need to think a lot more about external data sources, because if there is any signal you want to pick up, it has to be somewhere; the question is where. I don't know.

Thanks a lot for your talk, thanks a lot for your answers. [Applause]