Abhishek Murthy-Applying Foundational Models for Time Series Anomaly Detection-PyData Boston 2025
By PyData
Summary
Topics Covered
- Foundation Models Enable Zero-Shot Novelty Detection
- TimeGPT Detects Anomalies via Forecast Disagreement
- Moment Reconstructs via Fixed-Size Windows
- PRTS Rewards Partial Anomaly Overlaps
Full Transcript
So, I think we'll let folks continue to trickle in, but hello everyone. I hope you've had a wonderful third day. Our next speaker is Abhishek. He is a Senior Principal Machine Learning and AI Architect at Schneider Electric. I had to read that one. And an instructor at Northeastern. If you want to learn everything about machine learning algorithms for IoT systems, that's the title of his course; you have to become a Northeastern student. But today we are going to learn about using foundation models for anomaly detection. Take it away.
>> Okay. Hi everyone. Thank you for the introduction, and thank you PyData for having me and for organizing this really nice event. I'll be talking about applying foundational large models to the time series anomaly detection problem.
This is a little bit about me. As Adam mentioned, I work for Schneider Electric. It is a French company and a global leader in energy management and industrial automation, based in the greater Boston area. I also teach some of these topics, classification, anomaly detection, and forecasting of time series, in a course at Northeastern. I primarily work with IoT data, which comes in the form of time series, so all the problems I just mentioned are very interesting to me. These are some other places I have been associated with in the past. Connect with me on LinkedIn and we can chat about everything time series.

Cool. With that, my agenda is as follows. I'll start with a gentle introduction and then lay out the promise of zero-shot novelty detection for time series data. We'll go over a couple of models, TimeGPT and Moment; Moment is an open-source foundational model. I'll look at how they perform on a couple of toy data sets and then conclude with a summary and recommendations going forward.
Time series. We often understand a time series as a collection of ordered measurements. These measurements can be multi-dimensional in nature, and they often tend to have some repeated patterns in them. Consider this univariate ECG signal: it has a repeating train of pulses, which is the repeating pattern in the time series. Any subsequence that deviates from this repeated pattern is called an anomaly, and it's often very interesting. So in the case of an ECG, you could have a slight delay in one of these pulses, which becomes a subsequence anomaly, or it could be a point value that is either too high or too low, often referred to as a point anomaly. When there are multiple dimensions, the invariant often happens to be the correlation between some of the features. The relationship between the various dimensions is preserved, and if there is a breakdown at any point, these are called correlation anomalies; they are very interesting in various domains.

There is a growing market for time series anomaly detection, as shown in this market intelligence report, and correspondingly there is a pretty rich body of literature on the topic. This is just a snapshot of the various techniques that have been applied to this problem, taken from a survey published in VLDB a few years ago. But as we'll see, the foundational-model community has generally not looked at time series so much. They're starting to wake up to this opportunity, and this talk is all about that.
That was anomaly detection in general. In this talk I'm specifically interested in a variant of anomaly detection known as novelty detection. In novelty detection, the machine learning model does not have access to anomalies during training. In my world of industrial IoT, this can happen because you have just rolled out a new family of assets, so we haven't had any anomalies to observe yet, or it could be that we just haven't kept a good track record of observed anomalies in the past, which happens quite often.

So in novelty detection, the machine learning model learns only from nominal data. In that sense it is a one-class classifier, which we have all studied. The model essentially learns to recognize what nominal, normal patterns look like, and therefore can recognize any deviations from nominal behavior. That's the definition of novelty detection: identifying new patterns that we had not seen during training.

How can we use this? At least in the IoT world, it enables what is known as predictive maintenance. If an asset was working fine till now and you start observing deviations from normal, and the model picks up some of these deviations, those deviations could be an early indicator of a fault developing. Instead of waiting to react when the asset actually breaks down, you can use novelty detection to detect subtle deviations early on, do service interventions, and make sure the asset is available to the maximum extent. Predictive maintenance is a big use case in industrial IoT, and novelty detection is often how it is solved.

Inference can happen in a batch-wise fashion or in a streaming fashion. Once you have the novelty detection model ready, you start streaming in data. As new data comes in, you detect any deviations that could be there, and if there are no deviations, the novelty detection model will say everything's fine. It's all about detecting novel patterns.
How has this problem been traditionally solved? Conventionally, what we have done is create a dedicated model for every use case. For every use case, you do the usual workflow of collecting nominal data from the asset of interest, whatever physics you're interested in. This is IoT data in the form of time series: it could be vibration measurements, or electrical measurements if you're interested in electrical equipment. Then we do the usual exploratory data analysis and feature engineering, and go ahead and train the novelty detection model on our data. The novelty detection model, as I just mentioned, learns what normal looks like, and it's able to say whether a new, previously unseen data point is normal or not. It's a one-class classifier.

Once trained, we can start inferring on the kind of data we just trained on. Once we get the engineered features, we feed them to the model, and the model tells us whether this new test point is similar to what it had seen during training, quantifying the similarity with an anomaly score. The usual interpretation is: if the score is low, the data point you just showed me is similar to what I had seen and therefore nominal; if the anomaly score is very high, it often means this is a new data point that I have not seen before and therefore could be indicative of an anomaly. The important point to note is that conventionally this has been done in a bespoke fashion: for every use case, for every kind of asset, you train these models.

Here's an instantiation of how this is done with autoencoder neural networks. Autoencoder neural networks contain two parts, the encoder and the decoder. Typically, you provide them some time series data, and the encoder learns to reduce the data into some lower-dimensional representation. The decoder network takes that lower-dimensional embedding and reconstructs the original data. We can then compare the reconstructed data with the original data to see how well the reconstruction occurred. The idea is that the autoencoder network can reconstruct the data well if and only if the data was nominal, because you show the autoencoder all kinds of nominal patterns, so it only learns to reconstruct nominal data.

Once the encoder and decoder are trained, you send them for inference, and inference works as follows. You take the new, previously unseen data point, pass it through the encoder, get the low-dimensional embedding, try to reconstruct the data from it, and compare the reconstruction error to what you saw during training. If the reconstruction error is low, then probably what you saw is similar to the training data and therefore nominal. But if the autoencoder network fails to reconstruct this data point well, it could be a new data point that it had not seen during training. The autoencoder is trained on the specific data for the use case, and it can reconstruct the data well if and only if the data is nominal, because it was shown only nominal data during training. This is just a way to instantiate the workflow on the left with autoencoder networks.
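As a concrete illustration of that workflow, here is a minimal sketch of reconstruction-based novelty detection using a linear autoencoder (PCA computed via NumPy's SVD). The synthetic data, the linear encoder, and the max-error threshold are all assumptions for the example, not the speaker's setup:

```python
import numpy as np

def fit_linear_autoencoder(X, k=2):
    """'Train' a linear autoencoder: the top-k principal directions of the
    nominal data act as shared encoder/decoder weights."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k].T                      # (d, k) projection matrix
    return mu, W

def reconstruction_error(X, mu, W):
    Z = (X - mu) @ W                  # encode: low-dimensional embedding
    X_hat = Z @ W.T + mu              # decode: reconstruct the input
    return np.linalg.norm(X - X_hat, axis=1)

rng = np.random.default_rng(0)
# nominal "windows" live near a 2-D subspace of a 10-D feature space
basis = rng.normal(size=(2, 10))
train = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))

mu, W = fit_linear_autoencoder(train, k=2)
threshold = reconstruction_error(train, mu, W).max()  # worst nominal error

nominal = rng.normal(size=(1, 2)) @ basis   # same structure as training
novel = rng.normal(size=(1, 10)) * 5.0      # off-subspace, unseen pattern
print(reconstruction_error(nominal, mu, W)[0] <= threshold)  # True
print(reconstruction_error(novel, mu, W)[0] > threshold)     # True
```

The one-class logic is all in the last two lines: low error means "similar to training data, nominal", high error means "previously unseen, possibly anomalous".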
But of course the paradigm is changing. The next step is proposed to be these time series foundational models (TSFMs). How do they work? Just like in the LLM community, we don't have to train our own models on our data. Here is a TSFM: you throw the data at it and it automatically generates an anomaly score for you. There are no bespoke models anymore; you can use one of these TSFMs to start doing anomaly detection from the get-go, from day zero. You provide it with some data; if the data looks normal, the TSFM will say so, and if there are any anomalies in the recent past, it'll throw up a flag saying, I think something's wrong. So, no training required. And of course, if you want better performance, you can fine-tune one of these TSFMs on your domain, on the data set you are playing with. That'll give you better performance.

How do these TSFMs work? Obviously, pre-training. That's the main ingredient. These TSFMs train on massive amounts of time series data, and by doing so they essentially learn the core structures of time series. Time series are often characterized by trend and seasonality, and by looking at a massive number of data sets from diverse domains, they're able to pick up on all these important structures. These structures can also exist at different scales. This multiscale nature is very important, because you want to be able to apply this to any domain: to an ECG data set, which lives on a milliseconds time scale or even less, or to a power demand data set, which probably works on the order of hours. Massive pre-training data sets, used to learn these different structures, are what make the TSFMs foundational in nature. The same model can be used for different time series tasks: you can do forecasting with it, you can do anomaly detection with it, and you can do imputation, filling in missing data, with it. With the same model you can do all three tasks, and that makes it foundational.

Finally, you can always fine-tune the model to your domain. If you're playing with energy management data, you can use that data to do these tasks better, and if you're working with wearables, you can use your accelerometer and PPG sensor data to fine-tune these models. The whole idea is: if you can complete sentences using an LLM, you can forecast the next value of a time series, and if you can forecast the next value of a time series, you can do anomaly detection, and so on.

Several players have proposed TSFMs; Google, Salesforce, and Amazon have all come up with their own models. Today we'll be focusing on two of them, Nixtla's TimeGPT and Carnegie Mellon University's Moment, which is open source, and look at how they work. The whole idea is that we're trying to move away from bespoke models to foundational models which can work in a zero-shot sense, or with minimal fine-tuning can work very well, so you can get going pretty quickly. That's the whole idea of disruption in time series foundational models: you can get going pretty quickly on the foundational tasks of forecasting, anomaly detection, and denoising. So let's look at TimeGPT and Moment next.
TimeGPT is from Nixtla. Here is a QR code you can scan to get to the model. People who work in the time series forecasting world know that Nixtla has some very interesting libraries for forecasting. They have now come up with a proprietary transformer-based model, very similar to what the LLM community uses. The difference is that they have trained it on a massive data set: they claim about 100 billion data points. They don't tell us how many parameters are in the model today. These data sets come from very diverse domains, from finance and electricity all the way to tourism. The goal is to forecast the time series; forecasting is the foundational task they excel at. So we can provide the time series of interest, any time series you want, and if you have exogenous variables, you can use them as well. Provide it to the API given by Nixtla, and it'll tell you what the next values are.
We can use TimeGPT for anomaly detection; they have native support, with an API specifically for anomaly detection. The idea of using TimeGPT for novelty detection is as follows. You take your data and start creating forecasts with it. As the data streams through, at every point, when you generate the forecast, you also create a confidence interval around it. Essentially, the forecasting is so good that when you observe a data point that disagrees with your forecast, it's most likely an anomaly. And remember, you're not training a forecasting model here; that comes with the TSFM. You forecast the data point, and if the observation disagrees with your forecast too much, then most likely it's an anomaly. How do we characterize "too much" difference? It's done using confidence intervals.

This is an example we'll look at later in the slides. Here, the black curve is the observed data, the pink curve represents the forecasts generated by TimeGPT, and the shaded region represents the confidence interval. A 99.9% confidence interval essentially ensures that the true value will lie within this region 99.9% of the time. That is a pretty strong guarantee, and therefore the confidence interval is wide. If you reduce this number, say to 50%, then you only need to ensure that the true value is within this region 50% of the time, so the confidence interval shrinks. The consequence of reducing this number is that you become more aggressive, more trigger-happy, when it comes to anomaly detection, because it's easy to fall outside that narrow range. Therefore there's a higher chance of false positives, and your precision can go down. But if you are very conservative and use a very strong guarantee like a 99.9% confidence interval, you may take a hit on your recall. That's how it works: at every point you forecast the next few values, then you observe the new data that came in. If the observation is very different from your forecast, then it's suspect.
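The forecast-disagreement rule can be sketched in a few lines of NumPy. This is an illustration of the idea only, not Nixtla's implementation; the Gaussian confidence band and the known residual standard deviation are assumptions for the example:

```python
import numpy as np
from statistics import NormalDist

def flag_anomalies(observed, forecast, resid_std, level=0.99):
    """Flag observations outside forecast +/- z * sigma, where z is the
    two-sided normal quantile for the requested confidence level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # ~2.58 for level=0.99
    return np.abs(observed - forecast) > z * resid_std

t = np.arange(200)
forecast = np.sin(2 * np.pi * t / 24)            # a "perfect" forecast
obs = forecast + 0.05 * np.random.default_rng(1).normal(size=t.size)
obs[150] += 1.0                                  # inject a point anomaly

flags = flag_anomalies(obs, forecast, resid_std=0.05, level=0.99)
print(np.flatnonzero(flags))   # the injected spike at index 150 is flagged
```

Raising `level` toward 0.999 widens the band and suppresses the occasional noise-driven false positive, at the cost of recall, exactly the trade-off described above.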
Going one level deeper, let me show you the function signature for using TimeGPT for anomaly detection. This is the pandas data frame that you provide. You need to state explicitly what the time scale of your data is. `h` and `step_size` pertain to the forecasts being made: these are overlapping forecast windows of eight time steps, advanced with a step size of six. `level` is the confidence interval threshold; it essentially ensures that the true value will lie within your confidence interval 99% of the time. And `detection_size` is the tail end of the data you provide, the part you are interested in finding anomalies in.

Going back to this figure: this is your entire data frame, and you're interested in finding anomalies in the last few data points. This is the window that's going to slide through your data, and you're interested in finding anomalies in this tail end. Because there could be anomalies present here, we do not use this window to produce the confidence interval; this window is where you compare against the confidence interval that was produced from all the previous data. So that's the function signature and a little bit about how TimeGPT works. We'll look at how it works on concrete data in just a bit.
Next I want to introduce Moment, the open-source foundational model from Carnegie Mellon University. The Auton Lab folks proposed it, and the lead developer is Mononito Goswami. Here is a QR code, if you're interested, that will take you to the model's GitHub page. To start with, they created their own data set, which is also public. It's an interesting data set called the Time Series Pile. It has about 13 million different time series spanning about 1.13 billion timestamps. In a modern sense, it's a very small data set; it fits on your laptop. It's about 20 gigs, but they claim to have covered, again, diverse domains that let the model pick up on all the fundamental structures I talked about. They train the Moment model on this data set.

So what does the model look like? They have done some interesting things. Firstly, they do not explicitly handle multiple features; they handle one feature at a time. So if you have a multivariate time series data set, it essentially works feature by feature. We don't have to specify the frequency, the time scale, of our data set explicitly. And remember, these data sets can have very different amplitudes: some time series can be of smaller amplitude, others of much larger amplitude. That's where a technique called reversible instance normalization has become popular in the time series community as a normalization step, and they use it in their model. Moreover, they internally use windows of a fixed length of 512. In my experience, window size selection is a tough problem for time series; there have been papers written just on this. Essentially, how do you automatically find the characteristic length of a time series, how much of it you analyze at a time? That is the problem of window size selection. Interestingly, the authors settled on 512 and always use it, no matter which data set you are using.

Internally, the transformer they use is, again, a vanilla transformer from the LLM community. And finally, they use a so-called lightweight prediction head, which comes in different flavors. If you're interested in reconstruction, they give you a reconstruction head; if you're interested in forecasting, they give you a forecasting head. One way to fine-tune is what is known as linear probing: you fine-tune just the final layer, this prediction head, to your task, instead of training all the weights. That's the idea of modularizing the model and giving you this prediction head. We'll be specifically interested in the reconstruction head in just a sec. And this model is available from Hugging Face, so you can start using it from the get-go.
How does Moment work? It works essentially as an autoencoder, the one we just looked at earlier in the presentation. Here the reconstruction-specific head is being requested, and you can get your model going like this. What we're going to do with this model is take the time series we are playing with and chop it up into windows. Contiguous windows work, but with overlapping windows you of course get more robust performance. Here we're just going to window the data using tumbling (contiguous) windows and then use the model to create the low-dimensional embedding. We're going to encode our windows using the model we get from Moment and reconstruct them; they have functions for encoding and reconstructing. Then we compare the reconstructed windows from Moment with the original windows and look at the reconstruction error.

The reconstruction errors look something like this. I'm just going to measure how far the reconstruction is from the original. This signal may be noisy, as shown here, so instead of using, say, the max reconstruction error I saw during training, I can use slightly more robust statistics of it to characterize what normal data looks like. All of this is done on normal data initially: I pass in a bunch of nominal data, because this is novelty detection and we have access to normal data, and I study the reconstruction errors of Moment on normal data. That's step one. Then, during inference, I take the data points I get, do the same process, and see how the test data's reconstruction error compares to what I saw during training. I know the typical error Moment makes on normal data, and for new data I want to infer on, if it reconstructs very badly, then maybe it's an anomaly. So treat it as if someone has handed you the autoencoder: that is how Moment works for anomaly detection. TimeGPT is mainly forecasting-based, Moment is mainly reconstruction-based, and both can be used for novelty detection. Moment is open source, and TimeGPT is proprietary.
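The whole Moment-style recipe, tumbling 512-sample windows, per-window reconstruction error on nominal data, then a robust threshold, can be sketched as follows. The moving-average `reconstruct` function is only a stand-in for Moment's reconstruction head, and the three-times-the-98th-percentile threshold is the one the speaker mentions later for the UCR example:

```python
import numpy as np

WIN = 512  # Moment's fixed internal window length

def tumbling_windows(x, win=WIN):
    """Chop a 1-D series into contiguous windows, dropping the ragged tail."""
    n = (len(x) // win) * win
    return x[:n].reshape(-1, win)

def reconstruct(w):
    """Stand-in for the model's reconstruction head: a moving average that
    tracks smooth nominal structure but cannot reproduce abrupt changes."""
    return np.convolve(w, np.ones(9) / 9, mode="same")

def window_errors(x):
    wins = tumbling_windows(x)
    recons = np.array([reconstruct(w) for w in wins])
    return np.mean((wins - recons) ** 2, axis=1)   # per-window MSE

rng = np.random.default_rng(0)
t = np.arange(WIN * 8)
nominal = np.sin(2 * np.pi * t / 64) + 0.01 * rng.normal(size=t.size)

# step 1: robust threshold from reconstruction errors on nominal data
threshold = 3 * np.percentile(window_errors(nominal), 98)

# step 2: inference on test data containing one anomalous subsequence
test = nominal.copy()
test[2 * WIN + 100 : 2 * WIN + 160] += 3.0
flags = window_errors(test) > threshold
print(np.flatnonzero(flags))   # only the window containing the anomaly
```

Swapping `reconstruct` for Moment's actual encode/reconstruct calls leaves the surrounding bookkeeping, windowing, error statistics, and thresholding, unchanged.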
Having seen these two, let's look at how they work on some actual data. But before that, I want to digress a little, because before we can see how they work on time series data, we need to clarify how to measure performance. When it comes to time series anomaly detection, precision and recall have been refined a bit. With classical tabular data, say bank transactions or credit card transactions, you could train a model, and when a new data point comes in, the model predicts whether the transaction looks weird or not. For tabular data, these predictions are unambiguous: you unambiguously know whether the prediction was correct. Either your prediction agrees with the ground truth or it does not.

For time series, this is not the case. What can happen is that your anomaly spans from 10:00 to 10:30 in the morning, but your model predicts the anomaly began at 10:15 and went on till 10:35. This partial overlap between the ground truth and the prediction makes it difficult to use the classical notions of precision and recall; you get weird behavior from them.

People thought about what to do, and back in 2018 the folks at MIT and Intel came up with a paper where they refined these notions to account for the partial overlap that occurs when you're playing with time series data. So how do we define precision and recall when a prediction cannot be unambiguously called a true positive or a false positive, but there is partial overlap? Here's a QR code that will take you to the open-source GitHub repo for the package, which can be installed and used as follows. The package is called PRTS; you can go ahead and install it, and instead of the classic sklearn-based precision_score and recall_score, we can measure what are known as TS precision and TS recall.

What do these notions quantify? Firstly, you want to reward the algorithm if it caught at least some of your anomaly. As long as even a sample of the ground-truth anomaly was predicted by the algorithm, you want to reward it. That's known as existence: as long as the algorithm detects the existence of the anomaly, you reward it. In this case the algorithm gets the existence reward; otherwise it does not. If your prediction completely misses the real anomaly, you give it no existence reward.

The rest can be customized. There are parameters corresponding to size, position, and cardinality, and for your application you can tune these parameters and use the functions for measuring precision and recall.

What is size? Going back to the concept: if the real anomaly spanned this whole interval of time and the algorithm picked up about 60% of it, you want to treat that differently from a prediction that had only 10% overlap with your ground truth. The size parameter allows you to penalize an algorithm on this aspect.

Moving on, we may be interested in the start of the anomaly more than the end. If you're in some kind of defense application, or something really bad happened, you may care about the initial part of the anomaly more than the tail end. The exact position, which part of the anomaly you caught, is an important aspect, and the position parameter allows you to tune that.

And finally, cardinality. If you made a very fragmented prediction, you want to penalize the algorithm. For example, if you had a wearable and you played a sport from 10:00 to 11:00, but instead of detecting one big activity it detected three sub-activities, you would not be happy. Cardinality penalizes the algorithm for predicting the anomaly in a fragmented fashion.

So in this package there are all these different knobs you can tune to customize the measurement of anomaly detection performance to suit your application.
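As a toy illustration of the existence and overlap rewards, here is a simplified sketch. This is not the actual `prts` implementation; the `alpha` weighting, the coverage term, and the assumption of non-overlapping predictions are my own simplifications:

```python
def overlap(a, b):
    """Length of the intersection of two half-open index ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def toy_ts_recall(real, pred, alpha=0.5):
    """Toy range-based recall: each real anomaly range earns
    alpha * existence reward (any overlap at all) plus
    (1 - alpha) * the fraction of the range that was covered.
    Assumes the predicted ranges do not overlap each other."""
    scores = []
    for r in real:
        covered = sum(overlap(r, p) for p in pred)
        exists = 1.0 if covered > 0 else 0.0
        scores.append(alpha * exists + (1 - alpha) * covered / (r[1] - r[0]))
    return sum(scores) / len(scores)

real = [(600, 660)]                        # ground-truth anomaly range
print(toy_ts_recall(real, [(615, 635)]))   # partial hit: 0.5 + 0.5 * 20/60
print(toy_ts_recall(real, [(0, 100)]))     # complete miss: 0.0
```

Even a 20-of-60-point overlap scores well above zero because of the existence reward, which is exactly the behavior classical point-wise recall cannot express; `prts` adds the size, position, and cardinality knobs on top of this basic idea.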
Right? So please uh take a look at PRTS.
I think it's an important one for the TAD community. Going on let's quickly go
TAD community. Going on let's quickly go through a few examples of how time GPT and moment worked. Uh so we're going to look at the UCR data set for time series. Again like it's a very
series. Again like it's a very interesting and an important data set.
There's a very interesting backstory from Professor Eamonn Keogh at UC Riverside. Watch the YouTube video linked from that QR code for the backstory and for why this dataset is important to the time series community. This is an air temperature dataset which essentially measures periodic variations, and the anomaly is right here: there is a periodic pattern, and you see a different pattern on these days, which is considered an anomaly. And this is what TimeGPT gives us: the red dots are the predicted anomalies, but the actual anomaly is somewhere here. The way this dataset typically works is that the first 4,000 data points are considered normal, the rest of the data is test data, and the anomaly lies between indices 606 and 654, so
this is usually considered a good benchmark for anomaly detection. When we looked at TimeGPT, it did produce quite a few false positives, but if you look at it from a high level, on normal data the forecasts look pretty good and the confidence intervals are tight and decent. And whenever there is an anomaly, as in the example I showed earlier in the presentation, the anomalous observation does lie outside the confidence interval, as we expect it to. When we use time series precision and recall, we get pretty decent recall, along with quite a few false positives.
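The flagging rule described above, an observation falling outside the forecast's confidence interval counts as a novelty, can be sketched in a few lines of NumPy. This is not the TimeGPT API, just the underlying idea; `z=2.58` is an assumed stand-in for a roughly 99% interval under a normal-residual assumption:

```python
import numpy as np

def interval_anomalies(y, forecast, resid_std, z=2.58):
    # Flag points falling outside forecast +/- z * resid_std, where resid_std
    # is the typical forecast error measured on nominal (training) data.
    lower = forecast - z * resid_std
    upper = forecast + z * resid_std
    return (y < lower) | (y > upper)

# Toy periodic signal with an injected anomaly around t = 300.
rng = np.random.default_rng(0)
t = np.arange(500)
clean = np.sin(2 * np.pi * t / 50)
y = clean + rng.normal(0.0, 0.05, t.size)
y[300:310] += 2.0                       # injected anomaly
flags = interval_anomalies(y, clean, resid_std=0.05)
```

The injected stretch lands outside the band while nominal noise mostly stays inside. In practice `resid_std` would come from forecast residuals on held-out normal data, and widening `z` is exactly the trigger-happy-versus-conservative knob discussed later in the talk.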
ECG data is difficult to forecast for this model, and as we can see, it produces quite a few false positives. The precision is pretty low, and we can see why that is happening: the forecasts are not really tracking the signal. Of course, the goal of this is not to say that TimeGPT is not good. The point is no free lunch: you need to tune the parameters and make it work for your use case. Zero-shot may not really be zero; you still need to give it a few shots and then it'll work.
Moment, on the other hand, is very interesting. This is the same dataset and the anomaly we talked about, and these are the reconstruction errors of the autoencoder-based approach from Moment that I described. What we do is look at the 98th percentile of the reconstruction error on nominal data and multiply it by three. So if I see a data point that I reconstruct so poorly that the error is three times the 98th percentile I saw on nominal data, I think it's reasonable to say it's anomalous. That's the approach we're taking. Here it's really important to determine the size of the windows. Internally Moment uses fixed-size windows, but you can always use different sizes, so window size and threshold tuning are important parameters. When tested, there is a huge spike around the anomaly, so the reconstruction for that segment was really bad, as we expect, and everywhere else the reconstruction was reasonably okay. There are a few spikes that almost became anomalous, and when we looked a bit deeper, we think maybe they were actually anomalies that are not labeled properly; that was just an interesting artifact I noticed. Finally, on the ECG dataset, the reconstruction errors are pretty bad around the spikes, and consequently we get massive thresholds with this data, but it still manages to do better, at least in a zero-shot sense.
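The thresholding recipe above, three times the 98th percentile of the reconstruction error seen on nominal data, is easy to sketch. The error distributions here are synthetic stand-ins, not Moment outputs, and the index range 606–655 just mirrors the UCR convention mentioned earlier:

```python
import numpy as np

def reconstruction_threshold(train_errors, q=98.0, k=3.0):
    # Threshold = k times the q-th percentile of nominal reconstruction error.
    return k * np.percentile(train_errors, q)

rng = np.random.default_rng(1)
train_err = rng.chisquare(2, size=4000) * 0.01   # stand-in for nominal errors
test_err = rng.chisquare(2, size=1000) * 0.01
test_err[606:655] += 1.0                         # anomalous segment reconstructs badly
thr = reconstruction_threshold(train_err)
flags = test_err > thr
```

The percentile `q` and multiplier `k` are the threshold knobs the talk refers to, alongside the window size used to produce the reconstruction errors in the first place.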
When you pass in the data, these are the reconstruction errors on test data, and there is definitely a spike around the actual anomaly. But we also see spikes other than the anomaly of interest; these are false positives, and that's what drives down the precision of the algorithm. I think this is mainly because those spikes on an ECG signal were harder to reconstruct, just as they were harder to forecast for TimeGPT. With that, let me stop with this summary. The most important
points to note are that TimeGPT is a forecasting-based novelty detection approach and Moment is a reconstruction-error-based anomaly detection approach. They both use transformer networks. This one is completely open source; even the dataset it was trained on is available to do other things with. The important knobs for Moment are the window size and the threshold, and on the TimeGPT side there are also a few knobs you can change. By the way, both of these models allow you to fine-tune them; I have not used any fine-tuning, so these results are really zero-shot in that sense.
And there's a tendency, when we look at time series, to preprocess them with some kind of scaler, such as a standard scaler. You don't need to do that with these models; you can just throw the data at them and they'll do something for you. In fact, Moment explicitly tells you not to preprocess, because it uses the RevIN normalization technique internally.
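The RevIN idea mentioned here, reversible per-window (instance) normalization, can be sketched roughly as follows. This is a simplified version: the actual method also includes learnable affine parameters, which are omitted here:

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    # Standardize each window by its OWN mean and std (last axis), keeping
    # the statistics so model outputs can later be de-normalized. This is
    # what makes the normalization reversible and robust to level shifts.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True) + eps
    return (x - mu) / sigma, (mu, sigma)

def revin_denormalize(x_norm, stats):
    # Invert the per-window standardization using the saved statistics.
    mu, sigma = stats
    return x_norm * sigma + mu

# Two windows at very different levels -- i.e., non-stationary data.
windows = np.array([[1.0, 2.0, 3.0], [100.0, 110.0, 120.0]])
normed, stats = revin_normalize(windows)
restored = revin_denormalize(normed, stats)
```

Because every window is standardized against itself, windows at wildly different levels look alike to the model, which is why this style of normalization helps with the non-stationary data discussed in the Q&A below.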
Cool. So this is the summary, and I'll open up for questions.
[applause]
>> Yeah. Oh, thank you. So some of these anomalies are rare events, right? They don't happen every day or every week or every month. How do you approach that type of thing, anomaly detection of rare events?
>> So from a zero-shot sense, if you're using foundational models, they don't know what domain you're coming from or what the anomalies are.
>> No, but you're a practitioner. I understand that this methodology might not know, but I need to apply it to my time series.
>> That's what I'm coming to. So
depending on whether you have some prior domain knowledge, say that there are more anomalies on certain days than others, or under certain conditions, you have these knobs you can tune. You can bring down the error thresholds, which will make your algorithm more trigger-happy if you're expecting more anomalies in certain time periods, or you can make it more conservative. So it's not completely zero-shot; you still have to tune these knobs based on your specific dataset.
>> Sure. But we have so few examples, it's hard to turn those knobs.
>> Right, but that's the whole point of novelty detection: it does not need any anomalies. The whole idea is, can you just use your normal data to understand what your typical forecast errors and typical reconstruction errors are? Novelty detection does not require any labeled anomalies. This is precisely why we use this approach, where the model just learns to detect what is normal, and I'm hoping that normal data is available; it should be.
>> I love these questions. We have time for one or two more questions, and then I'm sure in the hallway Abhishek will be able to answer more.
>> Thank you. Do you think it's worth looking at these approaches for non-periodic, non-stationary data like sequences? I feel like zero-shot obviously wouldn't work, but is that type of data represented in the training data for something like Moment, so that you could use the transformer outputs for dealing with that type of data?
>> Correct. So specifically, that RevIN normalization works with non-stationary data, and it's not just Moment that came up with it; that kind of normalization technique is very popular out there, specifically for non-stationary aspects.
>> Great talk. I have two quick questions. In your experience, what's a typical failure mode that these two models actually run into, for example outlier masking or sparse sampling? The other thing is that these provide an anomaly score, but do you have anything that can enhance the explainability? Thank you.
>> Yeah. The explainability part, at least for Moment, is relatively easy, because you use it on a feature-by-feature basis, so you know exactly which feature is causing it, and that feature will be associated with some measurement you can go investigate. TimeGPT is all about forecasting, so I need to check whether they give some kind of feature importances that can be used, or whether SHAP-based approaches are possible. As for failure modes, I think it primarily comes down to this: you can tell whether your model is working well or not just from the forecasting errors or the reconstruction errors, so it at least allows you to figure out very quickly whether it's working, and then there are these knobs available. But I'm not aware of a systematic analysis of the various failure modes of these models.
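The feature-by-feature attribution described above can be sketched in plain NumPy (this is not the Moment API): each feature contributes its own reconstruction error, and the largest contributor at a flagged timestep points to the measurement to investigate:

```python
import numpy as np

def feature_attribution(x, x_hat):
    # x, x_hat: (timesteps, features) actual and reconstructed values.
    err = (x - x_hat) ** 2
    score = err.sum(axis=1)       # overall anomaly score per timestep
    culprit = err.argmax(axis=1)  # feature contributing most at each timestep
    return score, culprit

# Hypothetical multivariate example: feature 1 misbehaves at t = 2.
x = np.zeros((5, 3))
x_hat = np.zeros((5, 3))
x[2, 1] = 4.0
score, culprit = feature_attribution(x, x_hat)
```

Here `score` peaks at the anomalous timestep and `culprit` identifies feature 1 as the driver, which is the "feature associated with some measurement" idea from the answer above.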
>> Okay.
>> But at least for Moment it can be done.
>> Okay.
>> When someone builds SHAP for time series models, we'll have to come together again.
>> Thank you so much. One more round of applause for Abhishek.
>> Thank you so much. [applause]