
Abhishek Murthy - Applying Foundational Models for Time Series Anomaly Detection - PyData Boston 2025

By PyData

Summary

Topics Covered

  • Foundation Models Enable Zero-Shot Novelty Detection
  • TimeGPT Detects Anomalies via Forecast Disagreement
  • Moment Reconstructs via Fixed-Size Windows
  • PRTS Rewards Partial Anomaly Overlaps

Full Transcript

So, I think we'll let folks continue to trickle in, but hello everyone. I hope you've had a wonderful third day.

Our next speaker is Abhishek. He is a senior principal machine learning and AI architect at Schneider. I had to read that one. And an instructor at Northeastern. If you want to learn everything about machine learning algorithms for IoT systems, that's the title of his course; you have to become a Northeastern student. But today we are going to learn about using foundation models for anomaly detection. Take it away.

>> Okay. Hi everyone. Thank you for the introduction, and thank you PyData for having me and organizing this really nice event. I'll be talking about applying foundational large models to the time series anomaly detection problem.

This is a little bit about me. As Adam mentioned, I work for Schneider Electric. It is a French company and a global leader in energy management and industrial automation, and I'm based in the greater Boston area. I also teach some of these topics, classification, anomaly detection, and forecasting of time series, in a course at Northeastern. I primarily work with IoT data, which is in the form of time series, and so all the problems that I just mentioned are very interesting to me. These are some other places that I have been associated with in the past. Connect with me on LinkedIn and we can chat about everything time series.

Cool. So with that, my agenda is as follows. I'll start with a gentle introduction, and then lay out the promise of zero-shot novelty detection for time series data. We'll go over a couple of models, TimeGPT and MOMENT (MOMENT is an open-source foundational model), look at how they perform on a couple of toy datasets, and then conclude with a summary and recommendations going forward.

Time series. We often understand a time series as a collection of ordered measurements. These measurements can be multi-dimensional in nature, and they often tend to have some repeated patterns in them. Consider this univariate ECG signal: it has a repeating train of pulses, which is the repeating pattern in the time series.

Any subsequence that deviates from this repeated pattern is called an anomaly, and it's often very interesting. In the case of an ECG, you could have a slight delay in one of these pulses, which becomes a subsequence anomaly, or it could be a point value which is either too high or too low, often referred to as a point anomaly. When there are multiple dimensions, often the invariant happens to be the correlation between some of these features. The relationship between the various dimensions is often preserved, and if there is a breakdown at any point, these are called correlation anomalies, and they are very interesting in various domains.

There is a growing market for time series anomaly detection, as shown in this market intelligence report, and correspondingly there is a pretty rich body of literature on time series anomaly detection. This is just a snapshot of various techniques that have been applied to this problem, taken from a survey published in VLDB a few years ago. But as we'll see, the community of foundational models has generally not looked at time series so much. They're starting to wake up to this opportunity, and this talk is all about that.

That was time series anomaly detection in general. I'm specifically interested in a variant of anomaly detection in this talk known as novelty detection. In novelty detection, the machine learning model does not have access to anomalies during training. In my world of industrial IoT, this can happen because you have just rolled out a new family of assets, so we haven't had any anomalies to observe yet, or it could be the case that we just haven't kept a good track record of observed anomalies in the past, which happens quite often.

So in novelty detection, the machine learning model learns only from nominal data. In that sense it is a one-class classifier, which we have all studied. The model essentially learns to recognize what nominal, normal patterns are, and therefore can recognize any deviations from nominal behavior. That's the definition of novelty detection: identifying new patterns that we had not seen during training.

How can we use this? At least in the IoT world, it enables what is known as predictive maintenance. If an asset was working fine till now and you start observing deviations from the normal, and the model picks up some of these deviations, those deviations could be an early indicator of a fault developing. So instead of waiting to react when the asset actually breaks down, you can use novelty detection to detect subtle deviations early on, do service interventions, and make sure the asset is available to the maximum extent. Predictive maintenance is a big use case in industrial IoT, and novelty detection is often how it is solved in the IoT world.

The inference can happen in a batchwise fashion or in a streaming fashion. Once you have the novelty detection model ready, you start streaming in data. As new data comes in, you detect any deviations that could be in there. If there are no deviations, the novelty detection model will say everything's fine. So it's all about detecting novel patterns.
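The streaming loop just described can be sketched in a few lines. This is a toy illustration of the pattern (fit on nominal data only, then score points as they arrive), not any particular product or library; the z-score threshold of 4 is an arbitrary choice:

```python
import statistics

class StreamingNoveltyDetector:
    """Toy one-class detector: learns a nominal range from training
    data, then scores each new point by its distance from that range."""

    def fit(self, nominal_values):
        # Learn what "normal" looks like from nominal data only.
        self.mean = statistics.fmean(nominal_values)
        self.std = statistics.pstdev(nominal_values)

    def score(self, x):
        # Anomaly score: standardized distance from the nominal mean.
        return abs(x - self.mean) / self.std

    def is_novel(self, x, threshold=4.0):
        return self.score(x) > threshold

detector = StreamingNoveltyDetector()
detector.fit([10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7])

for x in [10.1, 9.9, 15.0]:          # stream points in one at a time
    print(x, detector.is_novel(x))   # only 15.0 is flagged
```

Low score means "similar to what I saw during training, therefore nominal"; a high score means a previously unseen pattern.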

How has this problem been traditionally solved in the past? Conventionally, what we have done is create a dedicated model for every use case. For every use case you do your usual workflow of collecting nominal data from the asset of interest, whatever physics you're interested in. This is IoT data in the form of time series: it could be some kind of vibration measurements, or some electrical measurements if you're interested in electrical equipment. Then we do the usual exploratory data analysis and feature engineering, and go ahead and train this novelty detection model on our data. The novelty detection model, as I just mentioned, learns what normal looks like, and it's able to essentially say whether the new, previously unseen data point you have given it is normal or not normal. So it's a one-class classifier.

Once trained, we can start inferring on the kind of data that we just trained on. Once we get the engineered features, we feed them to the model, and the model usually tells us whether this new test point is similar to what it had seen during training. It quantifies the similarity using an anomaly score, and usually the interpretation is: if the score is low, then the data point you just showed me is similar to what I had seen and therefore is nominal, but if the anomaly score is very high, it often means this is a new data point that I have not seen before and therefore could be indicative of an anomaly. The important point to note is that conventionally this has been done in a bespoke fashion, in the sense that for every use case, for every kind of asset, you train these models.

So here's an instantiation of how this is done for autoencoder neural networks.

Autoencoder neural networks contain two parts, the encoder and the decoder. Typically, the way they work is you provide them some time series data, and the encoder learns to reduce the data into some lower-dimensional representation. The decoder network takes that lower-dimensional embedding and then reconstructs the original data back. We can then compare this reconstructed data to the original data to see how well the reconstruction occurred. The idea is that the autoencoder network can reconstruct the data well if and only if the data was nominal, because you show the autoencoder all kinds of nominal patterns, so it only learns to reconstruct nominal data.

Once the encoder and decoder are trained, you send the model for inference, and the inference works as follows. You take the new, previously unseen data point, pass it through the encoder, get the low-dimensional embedding, then try to reconstruct the data from that, and compare the reconstruction error to what you saw during training. If the reconstruction error is low, then probably what you saw is similar to the training data and therefore nominal. But if the autoencoder network fails to reconstruct this data point well, it means this could be a potentially new data point that it had not seen during training. The autoencoder is trained on the specific data for the use case, and it can reconstruct the data well if and only if it is nominal, because it was shown only nominal data during training. This is just a way to instantiate the workflow on the left for autoencoder networks.

But of course the paradigm is changing. The next step is proposed to be these time series foundational models (TSFMs). How do they work? Just like with the LLM community, we don't have to train our own models on our data. Here is a TSFM: you throw the data at it and it will automatically generate an anomaly score for you. There are no bespoke models anymore; you can use one of these TSFMs to start doing anomaly detection from the get-go. From day zero you can provide it with some data. If the data looks normal, the TSFM will say so, and if there are any anomalies it has observed in the recent past, it'll throw a flag up saying, I think something's wrong. So, no training required. And of course, if you want better performance, you can fine-tune one of these TSFMs for your domain, for the dataset you are playing with. That'll give you better performance.

So how do these TSFMs work? Obviously pre-training; that's the main ingredient. These TSFMs train on massive amounts of time series data, and by doing so, they essentially learn the core structures of time series. Time series are often characterized by trend and seasonality, and by looking at a massive number of datasets from diverse domains, they're able to pick up on all these important structures. The structures can also exist at different scales. This multiscale nature is also very important, because you want to be able to apply this to any domain: to an ECG dataset, which lives on a milliseconds time scale or even less, or to a power demand dataset, which probably works on the order of hours. So massive datasets used for pre-training to learn different structures enable these TSFMs to be foundational in nature. The same model can be used for different types of time series tasks: you can do forecasting with it, you can do anomaly detection with it, and of course you can do imputation, filling in missing data, with it as well. With the same model you can do all three of these different tasks, and that makes it foundational.

And finally, you can always fine-tune this model to your domain. If you're playing with energy management data, you can use it to do these tasks better for your data, and if you're working with wearables, you can use your accelerometer and PPG sensor data to fine-tune these models. The whole idea is: if you can complete sentences using an LLM, you can forecast the next value of a time series, and if you can forecast the next value of a time series, you can do anomaly detection, and so on and so forth.

Several players have proposed TSFMs; Google, Salesforce, and Amazon have all come up with their own models. Today we'll be focusing on a couple of them, Nixtla's TimeGPT and Carnegie Mellon University's MOMENT, which is open source, and look at how these work. The whole idea is that we're trying to move away from bespoke models to foundational models, which can work in a zero-shot sense or, with minimal fine-tuning, can work very well, so you can get going pretty quickly. That's the whole idea of disruption in time series foundational models: you can get going pretty quickly for these foundational tasks of forecasting, anomaly detection, and denoising. So let's look at TimeGPT and MOMENT next.

TimeGPT is from Nixtla. Here is a QR code you can scan to get to the model. People who work in the time series forecasting world know that Nixtla has some very interesting libraries for forecasting. They have now come up with a proprietary transformer-based model, very similar to what the LLM community uses. The difference is that they have trained it on a massive dataset; they claim to have trained it on about 100 billion data points. They don't tell us how many parameters are in the model today. These datasets come from very diverse domains, from finance and electricity all the way to tourism. The goal is to be able to forecast time series; forecasting is the foundational task that they excel at. So we can provide the time series of interest: you can take any time series you want, and if you have some exogenous variables, you can use them as well. Provide it to the API given by Nixtla and it'll tell you what the next values are.

We can use TimeGPT for anomaly detection; it has native support. They provide an API specifically for anomaly detection. The idea of using TimeGPT for novelty detection is as follows. You just take your data and start creating forecasts with it. As the data streams through, at every point, when you generate the forecast you also create a confidence interval around it. Essentially, your forecasting is so good that when you observe a data point, if it disagrees with your forecast, then it's most likely an anomaly. And remember, you're not training a forecasting model here; that comes with the TSFM. So you forecast the data point, and if the observation disagrees with your forecast too much, then most likely it's an anomaly. How do we characterize "too much" difference? It's done using confidence intervals.

This is an example that we'll look at later in the slides. Here, the black curve is the observed data, the pink curve represents the forecasts generated by TimeGPT, and the shaded region represents the confidence interval. A 99.9% confidence interval essentially ensures that the true value will lie within this region 99.9% of the time. This is a pretty strong guarantee, and therefore the confidence interval is wide. If you reduce this number, say to 50%, then you only need to ensure that the true value is within the region 50% of the time, so the confidence interval shrinks. The consequence of reducing this number is that you are more aggressive, more trigger-happy, when it comes to anomaly detection, because it's easy to be outside that narrow range. Therefore there's a higher chance of false positives, and your precision can go down. But if you are very conservative and use a very strong guarantee like a 99.9% confidence interval, you may take a hit on your recall. That's how it works: at every point you forecast the next few values, then you observe the new data that came in. If the observation is very different from your forecast, then it's suspect.
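The forecast-disagreement idea, and the precision/recall trade-off of the confidence level, can be illustrated with a toy example. Here a naive seasonal forecaster (previous-cycle value) stands in for TimeGPT; this is my own sketch, not Nixtla's API, and the band half-widths come from standard two-sided normal z-scores:

```python
import statistics

def detect_by_forecast_disagreement(series, period, level):
    """Flag points that fall outside a confidence band around a forecast.
    A naive seasonal forecaster (previous-cycle value) stands in for
    TimeGPT here; the disagreement logic is the same."""
    residuals = [series[i] - series[i - period] for i in range(period, len(series))]
    sigma = statistics.pstdev(residuals)
    # Wider bands for stronger guarantees (two-sided normal z-scores).
    z = {50: 0.67, 99: 2.58, 99.9: 3.29}[level]
    return [i for i in range(period, len(series))
            if abs(series[i] - series[i - period]) > z * sigma]

# Periodic signal with light deterministic noise, a subtle anomaly
# (+1.0 at index 20) and an obvious one (+3.0 at index 37).
series = [(i % 8) + 0.05 * ((2 * i) % 5 - 2) for i in range(48)]
series[20] += 1.0
series[37] += 3.0

conservative = detect_by_forecast_disagreement(series, period=8, level=99.9)
aggressive = detect_by_forecast_disagreement(series, period=8, level=50)
print(conservative)  # only the obvious anomaly (plus its one-period echo)
print(aggressive)    # also catches the subtle one, at the cost of more echoes
```

The narrow 50% band catches the subtle deviation the 99.9% band misses, but it is also easier to fall outside of, which is exactly the precision/recall trade-off described above. (The "echoes" one period later are an artifact of this naive stand-in forecaster, not of the method.)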

Going one level deeper, the whole idea of using TimeGPT for anomaly detection: let me just show you the function signature. This is the data frame that we provide, a pandas DataFrame. You need to tell it explicitly what the time scale of your data is. h and step_size pertain to the forecast that is being made: these are overlapping windows of eight time steps that overlap by a size of six. level is the confidence interval threshold; this essentially ensures that the true value will lie within your confidence interval 99% of the time. And detection_size is the tail end of the data you give it, the part you are interested in finding anomalies in.

Going back to this figure: this is your entire data frame, and you're interested in finding anomalies in the last few data points. This is the window that's going to slide through your data, and you're interested in finding anomalies in this tail end. Of course, because there could be anomalies present here, we do not use this window to produce the confidence interval. This window is where you are comparing against the confidence interval that was produced based on all the previous data. So that's the function signature and a little bit about how TimeGPT works. We are going to look at how it works on concrete data in just a bit.

Next I want to introduce MOMENT, which is the open-source foundational model from Carnegie Mellon University, proposed by the Auton Lab folks. The lead developer is Mononito Goswami. Here is the QR code, if you're interested, that will take you to the GitHub page for this model.

To start with, they created their own dataset, which is also public. It's an interesting dataset called the Time Series Pile. It has about 13 million different time series spanning about 1.13 billion timestamps. In a modern sense, it's a very small dataset; it fits on your laptop. It's 20 gigs, but they claim to have covered, again, diverse domains they can use to pick up on all the fundamental structures that I talked about. They trained the MOMENT model on this dataset.

So how does the model look? They have done some interesting things. Firstly, they do not explicitly handle multiple features; they handle one feature at a time. So if you have a multivariate time series dataset, it essentially works feature by feature. We also don't have to specify the frequency, the time scale, of our dataset explicitly. And remember, these datasets can have very different amplitudes: some time series can be of a smaller amplitude, other time series could be of much larger amplitude, and that's where this technique called reversible instance normalization has become popular in the time series community as a normalization step.
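Roughly, reversible instance normalization standardizes each window by its own statistics and keeps them so the model's output can be mapped back to the original scale. A minimal sketch of the idea, not MOMENT's actual implementation:

```python
import statistics

def revin_normalize(window, eps=1e-8):
    """Standardize one window by its own statistics and keep them,
    so the transformation can be inverted on the model's output."""
    mu = statistics.fmean(window)
    sigma = statistics.pstdev(window) + eps
    normalized = [(x - mu) / sigma for x in window]
    return normalized, (mu, sigma)

def revin_denormalize(window, stats):
    mu, sigma = stats
    return [x * sigma + mu for x in window]

# Two series with wildly different amplitudes land on the same scale...
small = [0.01, 0.02, 0.03, 0.02]
large = [1000.0, 2000.0, 3000.0, 2000.0]
norm_small, stats_small = revin_normalize(small)
norm_large, stats_large = revin_normalize(large)
print(norm_small)  # ~[-1.41, 0.0, 1.41, 0.0]
print(norm_large)  # same values (up to eps): scale information removed

# ...and the stored statistics restore the original scale afterwards.
print(revin_denormalize(norm_large, stats_large))
```

This is what lets one pre-trained model see millivolt ECGs and megawatt power demand as the same kind of object.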

They use that in their model. Moreover, they internally use window sizes of a fixed length of 512. In my experience, window size selection is a tough problem for time series; there have been papers written just on this. Essentially, how do you automatically find the characteristic length of a time series, the span that you analyze at a time? That is the problem of window size selection. Interestingly, the authors settled on 512, and they always use it no matter which dataset you are using. Internally,

the transformer they use is a simple, vanilla transformer from the LLM community. And finally, they use these so-called lightweight prediction heads, which come in different flavors. If you're interested in reconstruction, they give you a reconstruction head; if you're interested in forecasting, they give you a forecasting head. And a way to fine-tune is what is known as linear probing: you can fine-tune just the final layer, this particular head, for your task, instead of training the entire set of weights. That's the idea of modularizing it and giving you this prediction head. We'll be specifically interested in the reconstruction head in just a sec. And yeah, this model is available from Hugging Face, and you can start using it from the get-go.

How does MOMENT work? It works essentially as an autoencoder, the one we just looked at earlier in the presentation. Here is how you request the reconstruction-specific head, and you can get your model going like this. What we're going to do with this model is take the time series we are playing with and chop it up into windows. Contiguous windows could work, but if you have overlaps you of course get more robust performance. Here we're just going to window this data using tumbling, contiguous windows, and then use the model to create the low-dimensional embedding.

So we're going to encode our windows using this model that we get from MOMENT, and we're going to reconstruct them. They have functions for encoding and reconstructing, and we're going to compare the reconstructed windows from MOMENT with the original windows and look at the reconstruction error.

The reconstruction errors look something like this. I'm just going to measure how far the reconstruction is from the original, and that signal may be noisy, as shown here. So instead of using, say, the max reconstruction error that I saw during training, I can use a slightly more robust statistic of it, essentially to characterize what normal data looks like. All this is done for normal data initially: I'm going to pass in a bunch of nominal data, because this is novelty detection and we have access to normal data, and I can study the reconstruction errors of MOMENT on normal data. That's step one. Then, during inference, I take the data points that I get, do the same process, and see how the test data's reconstruction error compares to what I saw during training. I know the typical error MOMENT makes on normal data, and for new data that I want to infer on, if it reconstructs very badly, then maybe it's an anomaly. So treat it as if someone has handed you the autoencoder: that is how MOMENT works for anomaly detection. So: TimeGPT, which is mainly forecasting-based, and MOMENT, which is mainly reconstruction-based, can both be used for novelty detection. MOMENT is open source, and TimeGPT is proprietary.
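Putting the two steps together, here is a sketch of that reconstruction-error workflow. MOMENT's reconstruction head is replaced by a trivial placeholder (linear interpolation across each window) so the example is self-contained; the windowing and robust-threshold logic is the part that carries over:

```python
def tumbling_windows(series, size):
    # Contiguous, non-overlapping windows (the talk notes overlapping
    # windows are more robust; tumbling keeps the sketch short).
    return [series[i:i + size] for i in range(0, len(series) - size + 1, size)]

def reconstruct(window):
    # Placeholder for MOMENT's reconstruction head. Here: linear
    # interpolation between the window's endpoints, i.e. a "model"
    # that can only reproduce roughly linear nominal behavior.
    n = len(window)
    return [window[0] + (window[-1] - window[0]) * i / (n - 1) for i in range(n)]

def window_error(window):
    return max(abs(a - b) for a, b in zip(window, reconstruct(window)))

# Step 1: study reconstruction errors on nominal data only, and set the
# threshold at a robust high percentile instead of the noisy maximum.
nominal = [0.5 * i + 0.01 * ((3 * i) % 7) for i in range(320)]   # ramp + wiggle
train_errors = sorted(window_error(w) for w in tumbling_windows(nominal, 16))
threshold = train_errors[int(0.98 * (len(train_errors) - 1))]    # 98th pct

# Step 2: score previously unseen windows against that threshold.
clean = [0.5 * i for i in range(16)]            # reconstructs perfectly
anomalous = [0.5 * i for i in range(16)]
anomalous[7] += 4.0                             # injected subsequence break
print(window_error(clean) > threshold)      # False
print(window_error(anomalous) > threshold)  # True
```

Swapping `reconstruct` for a real model call leaves steps 1 and 2 untouched, which is exactly the "treat it as a ready-made autoencoder" framing above.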

Having seen these two, let's look at how they work on some actual data. But before that, I actually want to digress a little bit, because before we can see how they work on time series data, we need to clarify how you measure performance. When it comes to time series anomaly detection, precision and recall have been refined a little bit. When we have classical tabular data, this could be some bank transactions or credit card transactions we are looking at. You could train a model on this, and essentially, when a new data point comes in, the model predicts whether this transaction looks weird or not. For tabular data, these predictions are unambiguous: you unambiguously know whether the prediction was correct or not. Either your prediction agrees with the ground truth or it does not. For time series, this is not the case. What could happen is that your anomaly spans from 10:00 to 10:30 in the morning, but your model predicts the anomaly began at 10:15 and went on till 10:35. This partial overlap between the ground truth and the prediction makes it difficult to use the classical notions of precision and recall; you get weird behavior from them.

People thought about what to do about this, and back in 2018 the folks at MIT and Intel came up with a paper where they refined these notions to account for the partial overlap that occurs when you're playing with time series data. How exactly do we define precision and recall when the prediction cannot be unambiguously called a true positive or a false positive, but there is partial overlap? Here's a QR code that will take you to the open-source GitHub repo for this package, which can be installed and used as follows. The package is called PRTS: you can go ahead and install it, and instead of using the classic sklearn-based precision_score and recall_score, we can measure what are known as TS precision and TS recall.

So what do these notions do? They are essentially quantifying the following aspects. Firstly, you want to reward the algorithm if it predicted and caught at least some of your anomaly. As long as even a sample of your ground-truth anomaly was predicted by the algorithm, you want to reward it. That's known as existence: as long as the algorithm is able to detect the existence of the anomaly correctly, you reward it. In this case the algorithm gets the existence reward; otherwise it will not. If your prediction completely misses the real anomaly, you don't give it any existence reward. The rest of the terms can actually be customized: there are parameters corresponding to size, position, and cardinality, and for your application you can tune these parameters and use the functions for measuring precision and recall.

So what is size? Going back to the concept: if the real anomaly went on during this span of time and the algorithm picked up about 60% of it, you want to treat that differently from a prediction that had only a 10% overlap with your ground truth. The size parameter allows you to penalize an algorithm on this aspect.

Moving on: we may be interested in the start of the anomaly more than the end of the anomaly. If you're working on some kind of defense application, or something really bad happened, you may be interested in the initial part of the anomaly more than the tail end. So the exact position, which part of the anomaly you caught, is an important aspect, and the position parameter allows you to tune that.

And finally, cardinality. If you made a very fragmented prediction, you want to penalize the algorithm. For example, if you had a wearable and you played a sport from 10:00 to 11:00, but instead of detecting one big activity it detected three sub-activities, you would not be happy. So cardinality penalizes the algorithm if it's predicting the anomaly in a fragmented fashion. In this package there are all these different knobs that you can tune to customize it and measure the performance of anomaly detection in a way that suits your application.
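To make the existence and size ideas concrete, here is a simplified range-based recall written from scratch for illustration; it follows the spirit of the 2018 formulation but is not the PRTS package's exact implementation (no position or cardinality terms, and predicted ranges are assumed disjoint):

```python
def overlap(a, b):
    """Length of the intersection of two [start, end) index ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def range_recall(real, predicted, alpha=0.5):
    """Simplified range-based recall: for each real anomaly range, an
    existence reward (did any prediction touch it?) weighted by alpha,
    plus an overlap reward (what fraction was covered?) weighted by
    1 - alpha. Predicted ranges are assumed not to overlap each other."""
    total = 0.0
    for r in real:
        covered = sum(overlap(r, p) for p in predicted)
        existence = 1.0 if covered > 0 else 0.0
        size = covered / (r[1] - r[0])
        total += alpha * existence + (1 - alpha) * size
    return total / len(real)

# Ground truth: one anomaly from index 600 to 660.
real = [(600, 660)]
# A prediction that catches only the middle third still gets credit...
print(range_recall(real, [(620, 640)]))   # 0.5*1 + 0.5*(20/60) ≈ 0.667
# ...while a prediction that misses entirely gets none.
print(range_recall(real, [(700, 720)]))   # 0.0
```

Classical point-wise recall would score the first prediction at 1/3; the existence term is what rewards it for having found the anomaly at all.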

So please take a look at PRTS; I think it's an important one for the TSAD community.

Moving on, let's quickly go through a few examples of how TimeGPT and MOMENT worked. We're going to look at the UCR dataset for time series. It's a very interesting and important dataset.

There's a very interesting backstory from Professor Eamonn Keogh at UC Riverside, so watch the YouTube video linked from that QR code for the backstory and why this data set matters to the time series community. This is an air temperature data set, which essentially shows periodic variations, and the anomaly is right here: there is some periodic pattern, and on these days you see a different pattern, which is considered the anomaly. And this is what TimeGPT gives us.

The red dots are the predicted anomalies, but the actual anomaly is somewhere here. The way this data set typically works is that the first 4,000 data points are considered normal, the rest is testing data, and the anomaly lies between indices 606 and 654, so this is usually considered a good benchmark for anomaly detection. When we looked at TimeGPT, it did produce quite a few false positives, but viewed at a high level, the forecasts on normal data look pretty good and the confidence intervals are also pretty tight and decent. And whenever there is an anomaly, as in the example I showed earlier in the presentation, the anomalous observation does lie outside the confidence interval, as we expect it to. When we use time series precision and recall, we get pretty decent recall, along with quite a few false positives.
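The forecast-disagreement idea described here can be sketched as follows. This is a hypothetical stand-in, not Nixtla's actual TimeGPT implementation: a synthetic seasonal series plays the role of the model's point forecast, and any observation outside the prediction interval gets flagged.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
y = np.sin(2 * np.pi * t / 50) + rng.normal(0.0, 0.1, t.size)
y[300:310] += 1.5                       # injected anomalous segment

# Stand-ins for what a forecasting model would produce: point forecasts
# plus an estimate of the forecast error scale.
forecast = np.sin(2 * np.pi * t / 50)
sigma = 0.1
z = 2.58                                # ~99% interval under Gaussian errors
lower, upper = forecast - z * sigma, forecast + z * sigma

# Novelty detection: points that disagree with the forecast beyond the interval.
anomalies = np.where((y < lower) | (y > upper))[0]
```

The injected segment lands well outside the interval, while a handful of noisy points leak through as false positives, the same pattern described above: decent recall, imperfect precision.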

ECG data, on the other hand, is difficult for this model to forecast, and as we can see, it produces quite a few false positives; the precision is pretty low. We can see why that is happening: the forecasts themselves are not accurate. Of course, the goal here is not to say that TimeGPT is not good. The goal is: no free lunch. You need to tune the parameters and make it work for your use case. Zero-shot may not really be zero; you still need to give it a few shots, and then it will work.

Moment, on the other hand, is very interesting. This is the same data set, with the anomaly we talked about, and these are the reconstruction errors of the autoencoder-based approach from Moment that I described. What we do is look at the 98th percentile of the reconstruction error on nominal data and multiply it by three. So if I see a data point that I reconstruct so poorly that the error is three times the 98th percentile I saw on nominal data, I think it's reasonable to call it anomalous. That's the approach we're taking.
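That thresholding rule is our own heuristic for this experiment, not something built into Moment, and it is simple to sketch; the synthetic errors below stand in for the autoencoder's per-window reconstruction errors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in reconstruction errors on nominal (normal) data.
nominal_errors = np.abs(rng.normal(0.0, 0.05, 4000))

# Threshold: 3x the 98th percentile of the nominal reconstruction error.
threshold = 3 * np.percentile(nominal_errors, 98)

# Stand-in test-time errors, with one badly reconstructed segment.
test_errors = np.abs(rng.normal(0.0, 0.05, 1000))
test_errors[600:650] += 1.0

# Windows whose reconstruction error exceeds the threshold are anomalous.
flagged = np.where(test_errors > threshold)[0]
```

Tightening the multiplier or the percentile makes the detector more trigger-happy; loosening them makes it more conservative, which is exactly the knob-tuning discussed later in the Q&A.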

And here, it's really important to determine the size of those windows. Internally, Moment uses fixed-size windows, but you can always use different sizes, so window size and threshold tuning are important parameters. When we tested this one, there is a huge spike around the anomaly, so the reconstruction for that segment was really bad, which we expect, and everywhere else the reconstruction was reasonably okay. There are a few spikes that almost became anomalous, and when we looked a bit deeper, we think maybe they were actual anomalies that we were picking up but that were not labeled properly. That was just an interesting artifact I noticed. Finally, on the ECG data set again, the reconstruction errors are pretty bad around the spikes, and consequently we get massive thresholds with this data, but it still manages to do better, at least in a zero-shot sense.

When you quickly pass in the data, these are the reconstruction errors on test data, and there is definitely a spike around the actual anomaly. But we also see spikes other than the anomaly of interest; these are false positives, and that's what drives down the precision of the algorithm. I think this is mainly because those spikes in an ECG signal were as hard to reconstruct as they were to forecast for TimeGPT.

With that, let me stop with this summary. The most important points to note: TimeGPT is a forecasting-based novelty detection approach, while Moment is a reconstruction-error-based anomaly detection approach. They both use transformer networks. Moment is completely open source; even the data set it was trained on is available for other uses. The important knobs for Moment are the window size and the threshold, and on the TimeGPT side there are a few knobs you can change as well. By the way, both models allow you to fine-tune them; I have not used any fine-tuning here, so these are really zero-shot in that sense. And there's a tendency, when we look at time series, to preprocess them with some kind of scaler. You don't need to do that with these models; you can just throw the raw data at them and they'll do something for you. In fact, Moment explicitly tells you not to preprocess, because it uses the RevIN normalization technique internally.
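For context, the core idea of RevIN (reversible instance normalization) can be sketched as below. This is a simplified illustration of the concept, not Moment's actual implementation; the published method also includes learnable affine parameters.

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    """Normalize each series instance by its own mean and std (reversibly)."""
    mu, sigma = x.mean(), x.std() + eps
    return (x - mu) / sigma, (mu, sigma)

def revin_denormalize(y, stats):
    """Reverse the normalization on the model's output."""
    mu, sigma = stats
    return y * sigma + mu

x = np.array([10.0, 12.0, 9.0, 11.0])
z, stats = revin_normalize(x)              # the model sees z, not raw x
x_restored = revin_denormalize(z, stats)   # outputs map back to the original scale
```

Because each instance carries its own statistics, distribution shift between series (or over time) is absorbed before the model sees the data, which is why external scalers are unnecessary.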

Cool. So, this is a summary and I'll open up for questions.

[applause] >> Yeah.

>> Oh, thank you. So, some of these anomalies are rare events, right? They don't happen every day or every week or every month. How do you approach that type of thing, anomaly detection of rare events?

>> So, in a zero-shot sense, if you're using foundational models, they don't know what domain you're coming from or what the anomalies are.

>> No, but you're a practitioner. I understand that this methodology might not know; I'm saying I need to apply it to my time series.

>> That's what I'm coming to, right? So

depending on whether you have some prior domain knowledge, say, that there are more anomalies on certain days than others, or under certain conditions, you have these knobs you can tune. You can bring down the error thresholds, which will make your algorithm more trigger-happy when you're expecting more anomalies in certain time periods, or you can make it more conservative. So it's not completely zero-shot; you still have to tune these knobs based on your specific data set.

>> Sure. But we have so few examples that it's hard to tune those knobs.

>> Right, but that's the whole point of novelty detection: it does not need any anomalies. The whole idea is, can you use just your normal data to understand what your typical forecast errors and reconstruction errors are? Novelty detection does not require any labeled anomalies. This is precisely why we use this approach, where the model just learns to detect what is normal, and I'm hoping normal data is available.

>> I love these questions. We have time for one or two more, and then I'm sure Abhishek will be able to answer more in the hallway.

>> Thank you. Do you think it's worth looking at these approaches for non-periodic, non-stationary data, like sequences? I feel like zero-shot obviously wouldn't work, but is that type of data represented in the training data for something like Moment, so that you could use the transformer outputs to deal with that type of data?

>> Correct. So specifically, that RevIN normalization works with non-stationary data, and it's not just Moment; that kind of normalization technique is very popular out there, specifically for non-stationary aspects.

>> Great talk. I have two quick questions. First, in your experience, what's a typical failure mode for those two models, for example, outlier masking or sparse sampling? Second, these provide an anomaly score, but do you have something that can enhance the explainability? Thank you.

>> Yeah. The explainability part, at least for Moment, is relatively easy, because you use it on a feature-by-feature basis, so you know exactly which feature is causing the anomaly, and that feature will be associated with some measurement you can act on. TimeGPT is all about forecasting, so I need to check whether they expose some kind of feature importances that could be used, or whether SHAP-based approaches are possible. When it comes to error modes, I think you can understand whether your model is working well just from the forecasting or reconstruction errors, so it at least lets you figure out very quickly whether it's working, and then there are these knobs available to tune. But I'm not aware of a systematic analysis of the various failure modes for these models.

>> Okay.

>> But at least for Moment, it can be done.

>> Okay.

>> When someone does SHAP for time series models, we'll have to come together again.

>> Thank you so much. One more round of applause for Abhishek.

>> Thank you so much. [applause]
