
Tackling the Cold Start Challenge in Demand Forecasting - PyCon DE Berlin 2024

By paretos

Summary

Topics Covered

  • Cold Start Forecasting Needs Global Models
  • Impute History from Similar Items
  • Use True Cold Backtesting
  • Simple Baselines Nearly Match Complex Models

Full Transcript

[Applause] Yes, thank you very much for the introduction. It's actually very cool to be here, and we are really looking forward to sharing a little bit of our everyday experience with demand forecasting, and especially with this topic of the cold start problem, which we think is somewhat under-represented: it's often addressed, but not in a very concise way, we feel, and this is our contribution to give a little bit of structure and condensed content to the topic.

That's my colleague and me; we both work for a company called paretos, a Heidelberg-based decision intelligence startup. Just a little bit of advertisement, bear with me: our mission statement is "no more bad decisions". What we often observe is that if you look into businesses and how decision processes are run, a lot of critical things are, even in 2024, in times of AI and machine learning, still done in Microsoft Excel, with a lot of manual work and a lot of error-prone steps that don't scale well. This is something we try to address, because we're convinced we can do better, and demand forecasting is one part of that.

For today's talk we're going to stick to a specific example, namely demand forecasting for a fashion retailer. Here we have our small, nicely AI-generated fashion retailer, COA.

As a typical fashion retailer, you sell fashion items and you have a supply chain behind it, so you have some suppliers, and the question gets especially interesting once you want to launch a new collection. Maybe summer is coming up and you have new items that you want to sell; the question is: how many do I have to order for each of those items? On the one hand I want to order enough to meet the demand that my customers will have; on the other hand I don't want to order too much, because then I will overstock, which induces all sorts of bad consequences like warehousing costs and so on. Somewhere in between there is a sweet spot, and this is actually a classical use case for decision intelligence, where you have some optimization problem, and what optimality means here actually depends on you: it depends on your KPIs, on your risk affinity, on what's important to your business. The point being that having a good demand forecast, even for that scenario where you have a new collection, is a very critical part of that process of looking into the future.

Now, the example we are looking at here is actually drawn from a real dataset; we'll also give you the reference later, so it's something you can play with and try out yourself.

It's called VISUELLE. It's a dataset consisting of around 5,000 items from a real fashion company, an Italian one, and it contains data from around four years. The product images you see here are actually taken from the dataset; the items have descriptions and so on, and most notably the dataset contains the time series of the historic sales that you want to learn from.

Speaking of time series: does anyone here have any practical experience with time series? Maybe raise your hands. Nice, that's quite a few. Okay, then for those of you who didn't raise their hand we'll start with a little recap, but I guess we can keep it a bit shorter.

So what is a time series? You probably know: you can see two examples in this image. There is the date on the x-axis, and for each date we have some number of sales; in this case you see the target on the y-axis, which is the number of pieces sold. A time series means there is a time dimension, and there is a point, which we can think of as the prediction time, that divides the past from the future. We stand at this prediction time, and our task is to say what is going to happen: how many pieces we will sell of each item. In this case we have two different items, maybe a shirt and shorts or something like that.

What else do we have in terms of data? We have static covariates. What is a covariate? You can think of it as a feature that you feed to the model, which learns from it, and static means that it doesn't change over time. For instance, it can be an image of a product, or textual descriptions like the size, which category it is (say, a shirt), the color, the fabric, and so on. Besides that, you can also have time-varying covariates; these are features that change over time. For instance, these can be calendar features, which are known both for the past and for the future, because you probably know which day of the week it will be in 15 weeks, and you also have information about prices and the weather.
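As a minimal sketch of this long data format (the column names and values below are made up for illustration, not the VISUELLE schema), with the target, static covariates, and time-varying covariates side by side:

    import pandas as pd

    # One row per item and week: the target (sales), static covariates repeated
    # along the time axis (category, color), and time-varying covariates
    # (price, calendar features that are also known for the future).
    df = pd.DataFrame({
        "item_id":      ["shirt_1"] * 3 + ["shorts_7"] * 3,
        "date":         list(pd.date_range("2024-01-07", periods=3, freq="W")) * 2,
        "sales":        [12, 15, 9, 30, 28, 35],               # target, known only for the past
        "category":     ["shirt"] * 3 + ["shorts"] * 3,        # static covariate
        "color":        ["blue"] * 3 + ["red"] * 3,            # static covariate
        "price":        [19.9, 19.9, 14.9, 24.9, 24.9, 24.9],  # time-varying covariate
        "week_of_year": [1, 2, 3, 1, 2, 3],                    # calendar feature
    })
    print(df)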

Now let's think about traveling back in time three months, so that we stand at this prediction time. What is notable in this example is that one item doesn't have any past target, but we still need to make a forecast. This is what we call a cold situation, the cold start problem, and it means we don't have enough target data. What are the challenges that appear here?

First of all, standard solutions don't exist. If you refer to the book from Rob Hyndman, which probably everyone with time series experience knows: judgmental forecasting is usually the only available method for new product forecasting. What is meant by judgmental forecasting? It means we have a human expert (this is an AI-generated picture of an expert), a domain expert who uses their own judgment and tells us how many pieces we will sell. This is probably a good approach, but it's hard to scale: what if we have 10,000 items for which we want to make a forecast every week? That is probably not doable, and it's also hard because in fashion the forecast horizon is usually 12 to 20 weeks into the future, which is really difficult. The second challenge is performance evaluation: okay, we make a forecast, but how do we guarantee that it's good? What if it's just random numbers? We need to make sure that we can really tackle the cold start forecast without ever seeing any historical sales.

For that, we can suggest some basic ideas on which to base our methods for tackling the cold start. We don't have a past target, and we see that as a problem, but let's not forget that we have other available data: we have images, we have time-varying features, or covariates, and we can also be creative with features and think about what else could help us. Even if something is not in our dataset, or not directly related, we can think about what else we could use, for instance Google Trends; why not use that? And of course, arguably the most important piece of information in our dataset is the historic target of other items. If we don't know anything about the sales of this shirt, because it's not launched yet, we can look at our dataset and see how many sales, and with which seasonality patterns, similar shirts had in the past; we have those to learn from.

To summarize: we want to search for available data, we want to be creative with features, and we want to make sure that we have similar items with a past in the data, because if you sell vegetables and fruits and then suddenly want to start selling merchandise like shirts, your data is probably not going to help you forecast that. Finally, we also want to think in terms of a global model, global meaning that we have one model that has access to all items, to all time series, and is able to learn from other items to forecast for new launches. We also have a reference showing that, in general, this is the better approach: a global model rather than a local one, i.e., one model per time series.

Having all this data, how can we proceed? In general, you see that it has different modalities: we have images, we have textual information, and we have numerical information. The numerical part is very simple: you just feed it to your model and it learns. For categorical features we can do one-hot encoding to reduce them to vectors, and for images we can apply vector embeddings; this has been done in two references that we add here for you, also from demand forecasting, where they take pictures of products and compute vector embeddings. Essentially, what you end up with from all these different modalities are vectors, and with vectors, as you know from math, you can always measure distance. So now you can know which product is the most similar to the shirt for which you want to make a forecast, or maybe how many similar ones you have and how similar they are.
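As a minimal sketch of this "turn everything into vectors and measure distance" idea (the covariates and embeddings below are made up; in practice the image embeddings would come from a pretrained vision model):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical static covariates of four catalogue items.
    categories = np.array([["shirt"], ["shirt"], ["shorts"], ["dress"]])
    # Stand-in for image embeddings produced by a pretrained model.
    image_embeddings = np.random.default_rng(0).normal(size=(4, 8))

    # One-hot encode the categorical covariate and concatenate it with the
    # image embeddings, so every item becomes a single feature vector.
    onehot = OneHotEncoder().fit_transform(categories).toarray()
    item_vectors = np.hstack([onehot, image_embeddings])

    # Fit nearest neighbours on the catalogue; a new, not-yet-launched item is
    # embedded the same way and queried against it.
    nn = NearestNeighbors(n_neighbors=2).fit(item_vectors)
    new_item_vector = item_vectors[:1]   # placeholder for the new item's vector
    distances, indices = nn.kneighbors(new_item_vector)
    print(indices, distances)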

Now let's think about models that require a past target. The most obvious thing you can do is to simply take the item most similar to the shirt for which you want to make a forecast but which hasn't been launched yet, take its history, and pretend that it's actually the history of this new item; you can train on it and then forecast as usual. You can also be a bit more creative here: you can take the mean sales of all the similar t-shirts, or you can apply a nearest-neighbour imputation algorithm, which we also reference here, and essentially impute the past.
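A minimal sketch of this imputation step, assuming the nearest neighbours have already been found as above (the sales numbers are invented):

    import numpy as np

    # Hypothetical weekly sales histories of the k most similar existing items.
    neighbour_histories = np.array([
        [10, 12, 15,  9, 11],
        [ 8, 14, 13, 10,  9],
        [11, 10, 16, 12, 10],
    ])

    # Simplest variant: pretend the single most similar item's history is the
    # new item's history.
    imputed_history = neighbour_histories[0]

    # Slightly more robust: average over the k nearest neighbours, i.e. a
    # k-NN style imputation of the missing past target.
    imputed_history_mean = neighbour_histories.mean(axis=0)

    # Either series can now be fed to any model that expects a past target.
    print(imputed_history, imputed_history_mean)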

Alternatively, you can do dummy padding and masking. What does that mean? It doesn't matter what value you put there; you tell your model not to learn from it, because it's not real data. You may add a feature that informs the model that this is the imputed part, or you can go deeper into the model, to its very core, and say that this data should not be used in the loss function, for instance. We add a reference here to a paper presented by Zalando: what they did is apply this dummy padding and masking in an encoder-decoder architecture, so the model didn't learn on the padded zeros. And something practical: there is a library, Nixtla, written in Python, and with it you can very easily, in a couple of lines, use neuralforecast models, and in only a couple more lines you can add this masking part, which will do everything else for you.
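A rough sketch of how that can look with Nixtla's neuralforecast (check the library's documentation for the exact current API; the data, model choice, and hyperparameters here are placeholders): the training frame can carry an available_mask column, and rows marked 0 are treated as unavailable rather than learned from.

    import pandas as pd
    from neuralforecast import NeuralForecast
    from neuralforecast.models import NHITS

    # Long-format input: unique_id, ds (date), y (sales). The first weeks of the
    # cold item are dummy-padded with zeros and flagged with available_mask = 0,
    # so they should not contribute to the training loss; real observations get 1.
    df = pd.DataFrame({
        "unique_id": ["new_shirt"] * 16,
        "ds": pd.date_range("2024-01-07", periods=16, freq="W"),
        "y": [0, 0, 0, 0] + [14, 18, 15, 20, 17, 22, 19, 16, 21, 18, 20, 23],
        "available_mask": [0, 0, 0, 0] + [1] * 12,
    })

    nf = NeuralForecast(models=[NHITS(h=4, input_size=8, max_steps=50)], freq="W")
    nf.fit(df=df)
    print(nf.predict().head())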

Then we can also think about models that don't even require a past target, so that for them this missing part has no consequences. For instance, we can think of tree-based models like XGBoost, which can easily handle NaNs, and there are also models whose architecture doesn't require a past target at all. To illustrate this better, let's look at an example with a tabular representation of the data. We have a target which we want to forecast, we have the time dimension, which here is the number of the week, and we have some features like the day of the week, what the weather was, and which color the product has (which doesn't change, because it's a static covariate). Then we have this "sales yesterday" column; it can also be sales two days ago, sales one week ago, and so on. This is what we call a lag. In the cold start case, this sales information from the past has limited availability. What can we do? First of all, to recap: maybe some model can already support this column with NaNs; maybe we can fill it with the sales of a similar item and pretend it's the history of this particular item; we can do dummy padding and masking; and finally, we can just imagine this column doesn't exist, so we don't use it, we don't learn on it, and it should still work.
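A minimal sketch of this tabular setup, assuming a long-format sales table like before (everything here is illustrative): the lag column of a cold item is simply NaN, which XGBoost handles out of the box.

    import numpy as np
    import pandas as pd
    import xgboost as xgb

    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "item_id": np.repeat(["a", "b", "c"], 20),
        "week": np.tile(np.arange(20), 3),
        "sales": rng.poisson(10, 60).astype(float),
    })

    # Lag feature: sales one week ago, computed per item. The first row of every
    # item (and the whole column of a brand-new item) is NaN.
    df["sales_lag_1"] = df.groupby("item_id")["sales"].shift(1)

    X = df[["week", "sales_lag_1"]]
    y = df["sales"]
    model = xgb.XGBRegressor(n_estimators=50, max_depth=3)
    model.fit(X, y)

    # Forecast for a cold item: no past sales, so the lag stays NaN.
    new_item = pd.DataFrame({"week": [0, 1], "sales_lag_1": [np.nan, np.nan]})
    print(model.predict(new_item))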

Now we have talked a lot about different methods for forecasting, but there is still the evaluation part. Even if we use one method, how can we know how good it is? How do we get guarantees that we can handle the cold start problem, and how can we say that one method is better than another? For that, I think Alexander can help us with evaluation.

Yes, exactly. In the time series context, the first thing that may come to mind when we talk about evaluating the performance of a model is backtesting. If you're not familiar with it, think of backtesting as a kind of cross-validation scheme that accounts for the time dimension and makes sure that you don't leak anything you are not supposed to know from the future into your testing data. But if you don't have any data, then you're going to have a hard time doing backtesting; it's by definition tricky. So classical backtesting is difficult for cold starts, but on the other hand, whenever you model something and deploy it, you have to know how well it actually works. The good news is that there is something we can do, and we want to have a look at two general strategies, two backtesting strategies, that you can use in the cold start context.

The first one is what we call pseudo-cold backtesting. The basic idea is that you can artificially make your data cold. What does that mean? Let's say, for the sake of simplicity, we have some time series that actually has quite a long history, so it's not cold (indicated by this flame), but we can artificially chop away the history and pretend it's a cold start time series (now indicated by this pseudo-cold icon). Now I have a testing sample to work with: I can do my cutoffs here, I don't really have much data in the past, and I have backtest samples that look like cold start data, artificially generated.

That's obviously something you can do rather easily given your data: just chop some things away. So this approach is generally very flexible and straightforward. On the other hand, just discarding the history does not reproduce the true start of an item. If you think about the fashion example, a product launch is often accompanied by certain advertisement, maybe it's a new season, maybe there is special market interest, and so on, and just cutting away the history in this technical way will in general not reflect that. It's something you can use for the technical part of your model evaluation, but arguably it doesn't reflect the true starts of your items. I don't want to go into detail here, but you also have to stratify this a little bit, especially if you have product hierarchies; you have to be careful, you can't (or shouldn't) just randomly pick any items. So it's also a little bit unclear, or at least difficult, how to actually draw these subsamples, but it's possible.
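A minimal sketch of the pseudo-cold idea, assuming a long-format sales frame as before (a real backtest would stratify the sampled items and iterate over several cutoffs):

    import pandas as pd

    def make_pseudo_cold(df, sampled_items, cutoff):
        """Artificially turn the sampled items into cold starts: everything they
        sold before the cutoff is discarded, so at the cutoff they look like
        brand-new products, while all other items keep their full history."""
        chop = df["item_id"].isin(sampled_items) & (df["date"] < cutoff)
        return df[~chop]

    # Usage sketch (names are illustrative): build the pseudo-cold frame, train on
    # the part before the cutoff, and score forecasts for the sampled items after it.
    # cutoff = pd.Timestamp("2024-03-01")
    # train = make_pseudo_cold(sales_df, ["item_17", "item_42"], cutoff)
    # train = train[train["date"] < cutoff]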

The next thing would be what we call true cold backtesting, and the idea here is pretty simple. Say you again have this item with a long history: every item, no matter how old it is, has been cold at some point in the past. We just have to travel back far enough to get to that point, and that's what true cold backtesting does: you go back as far as you can afford with your data to capture the actual beginnings of the time series. Yes, this solves the problem; it reflects the true dynamics of your item starts, because by definition you go back to the item start, and that's what you test on. On the other hand, depending on your data situation, depending on how many cold starts you have observed in your historic data, this may be expensive in the sense that you only have a limited amount of testing data available, but it seems like the cleaner approach.
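And a corresponding sketch for the true-cold variant, where the test window is each chosen item's actual launch rather than an artificial cut (again assuming the same illustrative frame):

    import pandas as pd

    def true_cold_split(df, test_items, horizon_weeks=12):
        """Build a true-cold test set from the genuine launches of the chosen
        items: their first weeks on sale form the test target, and they are
        dropped from training entirely. In practice you would also cut the
        training data at the launch dates to avoid peeking into the future."""
        launch = df.groupby("item_id")["date"].transform("min")
        is_test_item = df["item_id"].isin(test_items)
        test = df[is_test_item & (df["date"] < launch + pd.Timedelta(weeks=horizon_weeks))]
        train = df[~is_test_item]
        return train, test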

Okay, what now? As per usual, it depends. What am I supposed to do? The general rule of thumb is: use the true cold approach if possible, and "if possible" means if you can afford it with your data. If you are, for example, a fashion retailer and you have new product launches every couple of months, then the cold start is probably something you are very familiar with; you have a lot of samples in your data and you can jump on the true cold approach. If you don't have that, if cold starts are more the exception, then (a) you're probably going to have a hard time fitting that anyway, and (b) there is no chance of doing the true cold approach, because you just don't have the data for it.

One point we would generally like to stress in this discussion, not on a quantitative level but qualitatively, and I think this goes beyond cold starts: whatever you do in backtesting should reflect the real scenario you expect as closely as possible. What do we mean by that? Daria, you already mentioned it: think about the counter-example of a guy who's selling fruit and vegetables and all of a sudden starts dreaming that he could as well start his own fashion brand and sell t-shirts and sweaters and whatnot. Yes, he may have the data to do cold backtests with his fruits, but this doesn't give you any indication of how plausible that will be for fashion products. This is an extreme example, of course, but it's a general thing to think about.

Okay, so toward the end: we have seen methods for dealing with the cold start, and we have seen a few evaluation schemes for judging how good or bad those methods are, and now we want to jump into a quick example that we did. This slide is also a good opportunity to give a quick shout-out to our colleague Simon, who assisted us in designing all these slides; that's the reason why they look so fancy, and it obviously wasn't our own work, because this particular slide is the last-minute one we did without Simon, as you can see. It illustrates the dataset we are considering, the VISUELLE dataset. We want to do some evaluation, and we are going to use the true cold backtesting scheme. The testing here is indeed interesting because the testing data consists of cold starts only; it's an extreme case. We are really considering only a couple of hundred products in the testing period, we have a few thousand products to train on, and there are only cold starts, so whatever the model has to predict, it always has no history at all. Due to time I will keep it a bit short; I don't want to go into the details of the models, but I want to give you the high-level idea of which models we considered and what the outcome was.

We have three concrete examples, three fashion items. In shaded gray you can see the true time series, plotted in time steps after their respective launch. We considered different methods, and one thing that we think is always a good idea is to start as simple as you can, with very simple statistical or naive baselines that you understand. So the most naive thing we could think of here is to just take the historic average of everything; that's this line, and that's the first baseline we are considering. The next, slightly more sophisticated step is to take the average per product category, and then, a bit more sophisticated still, to take the average per product category and per time step after launch.
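The three baselines are straightforward to write down; a minimal sketch with pandas (the training frame and the precomputed weeks_since_launch column are illustrative):

    import pandas as pd

    # Tiny illustrative training frame; weeks_since_launch counts time steps
    # after each item's launch.
    train = pd.DataFrame({
        "category": ["shirt", "shirt", "shorts", "shorts", "shirt", "shorts"],
        "weeks_since_launch": [0, 1, 0, 1, 0, 1],
        "sales": [20, 12, 35, 25, 24, 30],
    })

    # Baseline 1: one global historic average over all training sales.
    global_avg = train["sales"].mean()

    # Baseline 2: historic average per product category.
    category_avg = train.groupby("category")["sales"].mean()

    # Baseline 3: average per category AND per time step after launch, which
    # captures the typical launch curve of a category.
    launch_curve_avg = train.groupby(["category", "weeks_since_launch"])["sales"].mean()

    # A cold item is forecast by looking up its category (and, for the third
    # baseline, its week after launch) in these tables.
    print(global_avg, category_avg, launch_curve_avg, sep="\n")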

These are the simple baselines, and we compared them against an architecturally rather sophisticated deep learning model with image embeddings, text embeddings, and enrichment from Google Trends, so a very nice architecture, and we also included some rather old-school machine learning approaches, CatBoost and XGBoost, relying only on the tabular data. We got some results, and aside from the exact numbers, which I don't want to go into, there are two messages here that I think are important. The first is: yes, it was worth putting in the effort, because the deep learning model was the one that wins. But if you look at the numbers, the margin is rather small, so it's not a clear winner; it's rather a case where, even with very simple statistical baselines, you can achieve results that even something like XGBoost cannot beat, and I think this is an important message overall. Solving the cold start problem is possible, you can apply different things, you have to evaluate them properly, and you shouldn't forget about the simple baselines.

The cool thing for you is that, if you want, you will be able to take a deep dive and replicate these results rather soon. Yes, as Alexander said, you can give it a try very soon: we're going to publish the code that we used to produce these experiment results, and we also want to collect there all the literature we went through over the last years, because we couldn't find a place where all this cold start literature is really stored and findable, so we're going to update it as we go. Besides that, you will also find there all the references that we used in this presentation.

[Applause] Thanks a lot for that nice presentation. There are a lot of questions, so the time series and forecasting people here are really interested in this stuff. Okay, let's start: which model types or approaches do you use for time series with a lot of zeros, intermittent time series?

That's actually a very good question, and a problem you face often. I would say there are two dimensions to it. One is the data dimension, because the first question you should ask, if you have very sparse time series, is whether this is really the right level at which you want to forecast, because these low-level series often contain too much noise. But that aside, there is a whole model suite for this; intermittent time series is an active branch of research. There are classical baseline models which we think you should always include, like ADIDA and so on, but you can also think about regression-based approaches that you can plug into any kind of model. So, as per usual, it depends, I would say.

Thank you. Is there research on whether the similarity of vector embeddings of clothing-item images corresponds well to the similarity perceived by humans? I think it's fair to say: I don't know.

Okay. What advantages does Nixtla provide versus other Python libraries for forecasting, such as Darts or skforecast? Generally a good question, I would say. With Nixtla, what we observed is that the community aspect is really nice; over the last couple of years it has grown rapidly, especially in the deep learning community. For a lot of models you find, where papers have just been published, there is often already a branch for neuralforecast where the model has been implemented, so this is quite cool; it's a kind of unified interface. As for the differentiation from Darts or other libraries, I'm not so sure, to be honest. I would add that it's really fast compared to the others. (The question order keeps changing with the voting, so I'll try to keep track.)

Should my key takeaway of this session be: for cold starts, just be naive? I don't think so. The key takeaway we wanted to convey is more that you should start naive, always comparing against something that you understand. A couple of weeks or months ago there were blog posts going viral about some time series Transformer model, and two weeks later it was shown that, well, actually some seasonal exponential smoothing is better. That's unfortunate, because I think this is really something you should do: include the simple statistical stuff in your analysis, and if you notice that you are not really better than that, then either you continue and ask why that is (something that we in this case didn't do, also due to time constraints; we just stopped at some point), or you raise the honest question of whether it's even possible to get better than that, because the data situation is not good. So that's definitely not the takeaway; the takeaway is more: include the baselines, do that as well, definitely.

Now a question on the latest developments, like TimeGPT etc.: have you tried transfer learning in the time series domain, and if yes, what's your experience with it?

We didn't; we want to, but we didn't. We're also interested in that field, we read a lot about it, and it's something we want to dive more into. I think what we've seen is that for the cold start it may be a little bit tricky, because a lot of these models depend on the history as the most informative input for predicting. Including covariates is, I think, even more challenging in this case, because how do you unify semantically different covariates across your data? That's obviously a challenge. So I'm not sure what there is to gain for the cold start problem, but it's definitely something we are interested in looking into more deeply; maybe a talk for next year.

How is training on the data from similar products different from just using the forecasting models of those products and optionally applying transfer learning?

I would say that suggestion is also a valid approach; it can be thought of as another method and can be evaluated. But how different it is, is hard to say. In a way, what we showed is the most naive way of doing transfer learning: you just reuse something that already exists and impute it in place of the missing time series.

Okay, now the last question; there are still a lot of open questions. How can your models predict fashion trends, items that are popular for only one season in a single year?

I'm not sure I got the question right: if there's a new trend, say yellow is now the color to wear, will you pick that up? I would say no, because this goes more in the direction of marketing research, and I would argue it's a different use case than the replenishment forecasting we have shown. Still, people are doing that, and it's very interesting, but then I think you need to think a lot more about external data sources, because if there is any signal you want to pick up, it has to be somewhere, and the question is where; I don't know.

Thanks a lot for your talk, thanks a lot for your answers. [Applause]
