Module 3 - AI in Medical Education: Types of Clinical Data & Asking the Right Questions
By NBE IT Department (IT Section)
Summary
Key takeaways
- **AI Learns from Input-Output Pairs**: Unlike traditional programming, where you code rules like 'if temperature >100°F then fever positive', in AI you provide input-output data pairs, such as 102°F with the label 'fever positive', so the computer learns the rule itself. [03:40], [04:46]
- **Training Data Bias Causes Inference Failures**: If training data for chest X-rays only includes patients over 75 years, predictions for younger patients will be biased; groups underrepresented in training data, such as by race, get mismatched predictions during inference, for example underpredicting care needs for Black patients. [07:52], [09:06]
- **AI Life Cycle Starts with Need Identification**: Before developing AI, identify the need (such as reducing chest X-ray report time from 1 hour to 30 minutes), describe the existing workflow, set the target state, implement, monitor performance, and update or deimplement as needed. [11:22], [12:35]
- **Episodic vs Continuum Data Distinction**: Episodic data treats a specific condition at a specific time, like one visit; continuum data collects everything from birth immunization to the present across patient, provider, and payer sources to enable broader questions. [25:08], [25:47]
- **90% of AI Effort Goes to Data Quality**: 90% of AI development involves data cleanup for quality; even then, models are only locally accurate, dropping 20-30% when tested on different sites due to quality shifts and protocol differences. [38:07], [38:47]
- **Ask Questions Matching Available Data**: With a single chest X-ray, asking whether AI can predict lung cancer likelihood fails without continuum data like smoking history; the right question is 'pneumonia positive or not', because that information exists in the data. [44:40], [45:11]
Topics Covered
- AI Learns Rules from Data Pairs
- Data Bias Emerges in Inference
- AI Lifecycle Starts with Need
- Shift to Continuum Data Required
- Ask Data-Matched Questions
Full Transcript
again to the AI in healthcare education. Now we have two eminent people with us. We have seen the teaser. Professor Fendra Yalawati is going to take us through the types of clinical data and asking the right question. He is a professor at IC Bangalore. After his masters, he continued on to a PhD in the biosciences and worked in industry at Samsung and GE Healthcare as well. He has 100-plus publications and a lot of patents. He is one of the pioneers in bridging tech with medicine. And to moderate the session, we have Professor Andra Kagural, who is an eminent personality. He worked in national institutes like IGAB and is currently serving as dean at Ashoka University in the biosciences.
Uh, over to both of you. Over to Fidra.
>> Okay, thank you very much, Dr. Suresh. I would like to first of all thank NBE for giving me this opportunity, and I would also like to thank him for being the moderator. So let's get started. I have a disclaimer to start off: anybody who tells you that they are going to teach you about clinical data and asking the right questions for AI in about 45 minutes is absolutely lying. There is no way I can do this in the 45 minutes given to me. So what I will try to do is give snippets of what is important and what is not, and hopefully this gives you some sense of how to handle healthcare data. To start off, I will just say that I work with a lot of companies, so these are some disclaimers, but one thing I wanted to mention is that there is no competing financial interest for this lecture.

As a broad overview, I would like to start off with the AI revolution and the role of data in AI models, then talk a little bit about the AI life cycle, and then come to the data aspects of healthcare, what is called episodic versus continuum data. Then I am going to quickly go through creating your own data set, because I think a lot of you are interested in creating your own data set for AI models. I will just touch upon an overview of creating such a data set, and then I will leave you with some challenges. I don't have any solutions for the data challenges, but I would like you to think about them as well.

So let me say first of all that the AI revolution is not only in healthcare; it is happening across the globe, across sectors, and the AI revolution currently happening is part of Industry 4.0, that is, after the third revolution, which started post-1970 around automated production, electronics, and computers. You generate a lot of digital data, and because of the digital data you are also having this revolution around big data and AI. And if you really look at it, the health sector up to 1990 went through a different phase of what I call more prescriptive-type care. Post-1990 the health sector has progressed a little bit more into wellness and prevention. These have become very critical parts of it. So in a broad spectrum, the health sector's progress post-1990 has been phenomenal. The majority of clinical practices have become mainstream post-1990.
And I just want to point out the difference between traditional programming and AI. This is a very important one from the data aspect. For example, let's imagine that you have a program where, if I give a temperature, it will tell you whether it is fever positive or not. Assume that the temperature is 102°F. That 102°F you give is what we call the input, and the program has a simple matching rule: if it is greater than 100°F, then we want to say the output is positive. So in this case you are giving an input, and there is code in the computer which checks whether it is above or below 100°F. If it is above, it will say fever positive. In the AI world, what we want to do is give input and output. So in this case we want to give inputs like 102 and then give what is called the output, or the label, which says fever positive. That means we give input-output data pairs to a computer, and we would like it to come up with that rule and that program. The program has to say that anything above 100 is fever positive and anything below 100 is fever negative.

So the fundamental aspect of AI is all about data. Let me show you a little bit more; as we go along you will understand this, but if you look at the advancement of AI, there are three things we say have advanced it: one is data, the other is algorithms, and the third one is compute. Fundamentally, data has been the foundation for the advancement. You have digitization, and you have new sources of data. For example, 10 years back there was no wearables data; today it is estimated that 40% of the population are using some kind of wearable. Now you have new algorithms, like deep learning algorithms and foundation models such as ChatGPT. You have good optimizations which have come up, and you also have computing which has become mainstream. For example, the most valuable company in the world is Nvidia, which is making all these GPUs used for AI development. So in some sense all three have contributed, but the foundation of AI has always been the data. That's why it is very, very important.
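The contrast just described, a hand-coded rule versus a rule learned from input-output pairs, can be sketched in a few lines. The toy threshold search below merely stands in for real model training, and the temperatures and labels are invented for illustration:

```python
# Traditional programming: the rule is hand-coded by a human.
def fever_rule(temp_f: float) -> str:
    return "fever positive" if temp_f > 100.0 else "fever negative"

# AI-style: the rule is *learned* from input-output pairs.
def learn_threshold(pairs):
    """Find the cutoff that best separates the labeled examples."""
    candidates = sorted(t for t, _ in pairs)
    best_t, best_correct = None, -1
    for t in candidates:
        correct = sum(
            (temp > t) == (label == "fever positive") for temp, label in pairs
        )
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

training_pairs = [
    (98.6, "fever negative"), (99.5, "fever negative"),
    (100.0, "fever negative"), (102.0, "fever positive"),
    (103.4, "fever positive"),
]
threshold = learn_threshold(training_pairs)
print(fever_rule(102.0))    # hand-coded rule -> "fever positive"
print(threshold)            # learned cutoff recovers 100.0 from the data
```

The point is that in the second case the programmer never wrote "greater than 100"; the data pairs carried that rule, which is exactly why biased or sparse data produces a wrong rule.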
Let me start off with a slightly more nuanced application. I work in medical imaging, so a lot of my examples are going to revolve around medical imaging. Let's say that you want to train a model which will look at chest X-rays and say whether each is pneumonia positive or negative. That means you need two things: one is the chest X-ray, and the other is what we call the corresponding labels. The bluish ones here are the labels, which tell you positive or negative. We give these two to the training part of the algorithm, and it gives us what is called a model. Then, when we get a new chest X-ray which is not part of any of the training data, we give it to this trained model and ask it to predict, or infer, whether it is pneumonia positive or not. So an AI model has two phases. One is the training phase. The other you can call the inferencing phase, testing phase, or prediction phase, where you give the model unseen data and ask it to predict the outcome, in this case positive. Most of the metrics you see for algorithms are at this inference phase, what we call predictions. Now, the data which you have given to train the model is what is typically called a data set.
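The two phases can be made concrete with a deliberately trivial stand-in: `MajorityClassModel` is a made-up name, and real training would of course look at the images, not just at the labels, but the fit/predict split is the same shape:

```python
# Two phases of an AI model: training (fit) and inference (predict).
class MajorityClassModel:
    def fit(self, labels):
        # Training phase: "learn" from the labeled data set
        # (here, just memorize the most common label).
        self.prediction = max(set(labels), key=labels.count)
        return self

    def predict(self, _unseen_image=None):
        # Inference phase: score data the model has never seen.
        return self.prediction

train_labels = ["pneumonia positive", "pneumonia negative", "pneumonia positive"]
model = MajorityClassModel().fit(train_labels)
print(model.predict("new_chest_xray.dcm"))  # -> "pneumonia positive"
```

Any accuracy number you quote should come from the predict side, on data that was never shown to fit; that is the "metrics at the inference phase" point from the lecture.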
And because of that, if the data you have carries a bias, then what we are going to get is a biased prediction. For example, let's imagine a case where all the chest X-ray data I have is from patients above the age of 75 years. Then for any prediction you do below 75 years, you don't have any representation of such data, so that is a bias variable. Or there might be other data cleaning you have done; for example, in the chest X-rays you somehow removed or suppressed all the ribs. That means you have done a biased data cleaning: anything present with the ribs is suppressed, and we call this a domain bias. And in the labeling itself, for the label we said, pneumonia positive or negative, suppose you got a clinician or labeler who has only very limited experience; then your labels also might not be accurate. In any of these cases you get biased AI outcomes, and in fact there are a lot of biases. I am not going into all of them, but traditionally this bias shows up most of the time when we are trying to do inference. You may not know the bias while training, but during inference you see it. For example, here the bias has been towards what we call race. If you see here, for the white patients, the actually required care versus the predicted care more or less matched. But when it came to the Black population, which was not represented in the training data, the model was predicting that they don't require that much care, whereas the actual requirement was much higher. This is just to show these examples.
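The mechanism behind such a failure can be illustrated with a toy base-rate "model"; every number below is made up purely for illustration, not taken from the study the lecture cites:

```python
# A biased training set: only over-75 patients, with a 60% pneumonia rate.
elderly_train_labels = [1] * 60 + [0] * 40   # 1 = pneumonia positive
# The population the model later sees at inference: younger patients, ~20% rate.
young_truth = [1] * 20 + [0] * 80

# A naive model that just learns the training base rate...
learned_rate = sum(elderly_train_labels) / len(elderly_train_labels)
# ...is systematically miscalibrated on the unrepresented group.
actual_young = sum(young_truth) / len(young_truth)
print(learned_rate, actual_young)  # 0.6 vs 0.2 -> persistent over-prediction
```

Nothing in the training loss would flag this; the mismatch only appears when the model is evaluated on the group that was missing from the data set, which is exactly why bias tends to surface at inference time.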
So now, whenever we talk about AI solutions, you also have to think that AI solutions span the whole healthcare community. They are not only in clinical diagnosis; they are in drug discovery, in administrative operations, in patient care, in emergency medicine, in preventive health. That means you can think about applications for each part of the healthcare continuum, but if you want to develop a continuum application, then you require continuum data as well. This will become a little clearer when I talk about episodic versus continuum data.

One of the points I wanted to mention, for all of you who are doctors: this is a study which was done in the US between 2023 and 2024. In 2023, one in three doctors had used AI to some extent; that is what we call the adoption rate. By 2024 it had become two in three. What is expected is that in the next five years, 90% of doctors will use it to some extent. So it's also important for us, and thanks to NBE, which has put up a program to bring that awareness of what the limitations of AI are; in some sense, in this session we are talking about the limitations around the data.

I also want to point out that there are key focus areas for each region. For example, North America, which is mostly the US, has more than a 50% share, and AI there caters mainly to medical imaging and diagnostic tools, whereas we are all part of Asia-Pacific, where most AI has been towards mobile health apps and telemedicine. So depending on the geographic area, your AI activity might change, but overall the adoption rates are going up; that's the point I wanted to make.

So this is the AI life cycle. In fact, if you ask me what the most important slide of this lecture is, I will say this is it. First of all, if you want to develop AI, or use AI in the clinic, the first important thing is to identify the need; remember that AI is not needed everywhere. Let's take the chest X-ray example, where we are telling whether it is pneumonia positive or negative. You identify the need, saying that currently report generation for a chest X-ray takes 1 hour, and your need is to bring it down to half an hour. That means you want the AI-generated or AI-assisted report to come out within half an hour. Then you describe the existing workflow. What happens after the chest X-ray is acquired? Does it go into the PACS system? Does the radiologist read it? You describe that, and then you say what the desired target state is. As I said, the desired target state need not be about accuracy; it need not be about positive or negative predictive value. It could simply be saying that I want report generation to go down from 1 hour to 30 minutes. Then you develop or acquire an AI system and implement it in the target setting. When you say implement, you integrate it into the existing workflow, and you monitor the performance. When we say monitor the performance, what we mean is that we already know the target we are looking for; this is what we were calling earlier the testing phase, our inference. You do that ongoing performance check: have you reached the desired time? Have you reduced report generation for a chest X-ray from 1 hour to half an hour? Or maybe you have gone from 1 hour to, let's imagine, 45 minutes. Then you might need to go back and update the model, or update the need, or maybe update your target state, because 45 minutes is still better than 60 minutes. So you might have to do all of these: update your metrics, update the current workflow, or maybe update the data. Or in some cases, let's say that implementing AI has taken you from 1 hour to, let's imagine, 2 hours because of the implementation; then we deimplement. We say no, we cannot implement AI in this target setting, because earlier it was taking one hour. Remember that each one of these steps is actually linked to the need. When you have a need, if you want to develop an AI system, you require input-output pairs. If you want to know the performance, again you require some test data. How do you know the performance? You need to know the existing time versus the current time, in this case. So in some sense the whole AI life cycle, at least in healthcare, revolves around the data. Unless you have this data, you cannot do any of this management.

Let me give you some simple examples. These are not my examples, and they are a little older, but they still make sense. This one is from way back in 2019 and was published in Nature. They looked at the electronic health records of pediatric patients, about 1.3 million EHRs. They extracted what are called symptoms, signs, and history, along with the corresponding diagnoses. These are called data points; they extracted about 101 million points. Then they trained an AI algorithm, a natural language processing model. That means that from the EHRs they built what we call a structured database, with symptoms on one side and the corresponding diagnosis on the other, and gave it that data so that when new data came in with symptoms, it could assess the likely diseases. I am not going into all of it, but one thing I wanted to point out: compared against the ground truth, which was established by experienced physician groups, physician groups three, four, and five all got around 90% accuracy, whereas physician groups one and two got only 83%, which is less than the AI model's accuracy of 88%. So this also showed that maybe the junior physicians needed this AI model while the senior physicians may not need it. That is also one of the things you should understand: AI is not meant for everybody. There is also an experience level required for it, and we call that the intended use as well.

The other one is from way back in 2010, where they looked at breast cancer screening using mammograms. Breast cancer is one of the most common forms of cancer among women; one in nine is affected. They just wanted to distinguish between malignant and benign. For those of you who know mammograms, there is something called the BI-RADS score: one, two, three is considered benign, and four to six is considered malignant. So they just wanted to do a binary classification: given a mammogram, can we say whether it is benign or malignant? In a very simple sense, early detection is the key here. They looked at 60,000 mammograms with corresponding BI-RADS scores telling whether each was benign or malignant, and they trained an artificial neural network. The neural network gave 96% accuracy, whereas regular radiologists have an accuracy of around 94%, so more or less comparable. But this is being driven by a machine; the AI model is nothing but a computer model you have generated, so it doesn't get tired. You can do more and more readings, whereas a typical radiologist might get tired. So that was a good concept.
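Going back to the lifecycle slide for a moment, the monitor-then-update-or-deimplement decision can be sketched as a tiny function. The stage names and minute values are purely illustrative, taken from the 1-hour-to-30-minutes chest X-ray example:

```python
# The lifecycle stages described in the lecture, in order.
LIFECYCLE = ["identify need", "describe workflow", "set target state",
             "implement", "monitor performance", "update or deimplement"]

def monitor(target_minutes, observed_minutes, baseline_minutes=60):
    """Decide the next lifecycle action from the monitored report time."""
    if observed_minutes <= target_minutes:
        return "keep (consider tightening target)"
    if observed_minutes > baseline_minutes:
        return "deimplement"            # AI made the workflow slower than before
    return "update model or workflow"   # better than baseline, short of target

print(" -> ".join(LIFECYCLE))
print(monitor(30, 45))    # 45 min: beat the old 60 min, missed 30 -> update
print(monitor(30, 120))   # 2 hours: worse than the old workflow -> deimplement
```

Note that every branch is driven by measured data (observed minutes versus the need), which is the lecturer's point that the whole cycle revolves around data.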
So if you ask me what the current challenges for AI in healthcare are: I am going to talk only about the data part, but there are other challenges in terms of models and results. For every AI model development you require data. So the first question to ask is: how many patients' data do we need for an AI algorithm? In the examples we have seen, the pediatric case had 1.3 million patients' data, and the mammographic case had 60,000 mammograms. So how do you determine how much data we need? Again, I don't have a solution; it's a big research problem, and there are people who work their lifetime to answer this question. Now let's assume, if you are like me working in medical imaging, that somehow for the chest X-ray problem you require 60,000 cases, 60,000 chest X-rays. What happens if you don't have 60,000? What happens if you are working on the diagnosis of one of the rare diseases? What do you do then? Or you have the images, but you don't have the corresponding annotations; remember that the annotations have to come from what we are calling the experts. What happens if there are not that many experts available in that particular disease or spectrum? Again, there are no easy answers here; each one of them is a big question. How do you determine how much data we need? I would argue that you keep adding more and more data and keep checking whether there is an improvement in the model accuracy. If the model accuracy is plateauing, that means you have enough representation. What happens if you don't have enough data? You try to generate something like synthetic data using the existing data. What happens if you don't have enough annotation? There are semi-supervised and unsupervised models which you can develop. Again, each one of these is a big area; let's not get into it. But I just wanted to point out that these three are the biggest questions in terms of clinical data and in terms of model development as well.
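The add-data-until-accuracy-plateaus heuristic just described can be written down directly; the accuracy numbers and the tolerance are invented for illustration:

```python
# "Enough data" heuristic: stop growing the data set once the accuracy
# curve flattens out.
def has_plateaued(accuracies, tol=0.005):
    """True if the latest accuracy gain is below tol (curve has flattened)."""
    return len(accuracies) >= 2 and (accuracies[-1] - accuracies[-2]) < tol

# Hypothetical model accuracy at growing training-set sizes:
curve = [0.71, 0.80, 0.85, 0.87, 0.872]
print(has_plateaued(curve[:3]))  # still climbing by 0.05 -> False
print(has_plateaued(curve))      # last gain only 0.002 -> True
```

In practice one would evaluate on a held-out set at each size (a learning curve) rather than comparing two raw numbers, but the stopping logic is the same.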
So now let's switch back and look at what data we have right now. Healthcare is a slightly complex one. Even in India, and this is an increasing trend, you have three Ps which are part of healthcare. One is the patient, who is actually the one whose data we are all collecting. Then you have the provider; when I say provider, these are the clinics which are maintaining what you call the electronic medical record, or electronic health record, or the clinical records. And then you have a payer, which is like an insurance agency doing the claim processing. So in some sense, even though the patient is the one getting a service from the provider, the payment might come from a third party, the payer; it could be an insurance agency, or it could be something like the Ayushman scheme. And then you also have other stakeholders which are part of healthcare: think about pharma companies which produce the drugs, or medical device companies which make the devices. They also sit on a lot of data. So you can see that we have a lot of data sitting with each one of them, and they are all part of the healthcare continuum. For example, patients might hold all the wearables data, all the electronic health records are with the providers, the payers have all the financial data, and the stakeholders have what are called clinical trial or randomized clinical trial data, which might give you adverse reaction data and other things, all of which are very important as well. Ideally we want to use all this data to benefit everybody in the right way. In a nutshell, the famous statement for AI in healthcare is: could I provide the right intervention to the right patient at the right time? That means we want that to be fully data-driven. Of course, this is the biggest north star we have; that is our ultimate goal. We may not reach it, or we might take another 30 years to reach it, but that's the question we are interested in answering. And again, if I look at all the data: it is estimated that a 500-bed hospital alone generates about 1 petabyte of data per day, and 90% of that healthcare data by volume is images, and these images tend to be in what we call a structured format. That's why any data-driven intervention is going to come in the imaging world first and then go to the other areas.
So that means that if you look at the AI developments or AI tools which exist in healthcare, a lot of them are related to radiology; in fact, 75% of them are. That is because, by volume, that is the data we have. Again, I wanted to point this out and say that if you want to think philosophically about the data point, I will argue that a data point needs to have something we call symbolization. The data itself which exists right now symbolizes certain people, events, things, and ideas. For example, I can call a name the data.
Now, the name represents a person, right? And if I have a chest X-ray, that is also symbolizing the person. What typically happens is that humans really don't use the data directly, in the sense that what humans use is what we call information, and what computers use is data. In a nutshell, as a human, what you do is try to convert data into information; we convert it into information as long as we understand it. That means that when you present the data in a format that is easy for humans to understand, you make data into information. Now, what format is required for computers or machines? That has been the biggest black box for us: how are the machines able to get to that information? And even for humans, how you convert data into information has also been what I call a still-unsolved problem.

Again, in a nutshell, when we do a lot of these things, what happens is that one person's data becomes another person's information. That means that the human, at least in healthcare, has learned from past data and is then using it. For example, in the case we were talking about, beyond 100°F being fever positive has been traditional knowledge which has been passed on, which has been observed on particular patients and has become knowledge.
Healthcare also handles data in two different settings. We call them the episodic part and the continuum part. Episodic means that you are treating a specific condition at a specified time; for that instance, you have the data. Continuum means: can you have data over time? You can think about having all the data starting from the immunization which happens post-birth, all the way to the present. Can I collect all the data which has been generated? When I say all the data, remember that we have the patient data, the provider data, and the payer data. Could we collect all of that and put it into a continuum? We call it a conveyor belt: can you have that conveyor-belt type of data? Because unless you have that data, you may not be able to answer the questions which come up in the continuum. I will talk about the questions in a while.
But let me go back to the data. This is going a little more into the data science world than the healthcare world, but let me just put it this way: data typically comes in three types. The first we call unstructured data, the second semi-structured data, and the third structured data. In a very simple sense, structured data, as the name suggests, is an organized form of data. It has rows and columns; we say there is a relationship that exists, and a computer program can easily understand it. For those of you who work in the imaging domain: an image is a very structured data format. You have the imaging data, and you have a header, the DICOM header, which tells you who the patient is, the age of the patient, what scanner was used, what kind of protocol was followed, what the resolution is, and what the pixel spacing is. So it gives you all kinds of metadata. That makes it very structured, and that's why, as I already mentioned, most AI development has been towards imaging, at least in the healthcare world; we want that structured data. Now, if the data is slightly disorganized, you have something called semi-structured: it roughly conforms to your data model but doesn't have a very good structure. The best example is your health records. You have an EHR platform where I can type in things as free-form text. We call that as belonging to the patient, but it doesn't really carry the metadata on what context that EHR entry was in. For example, did the patient walk in with leg pain, but then you wrote that the leg is swollen? What was the chief complaint? If you have not captured it, then we call it semi-structured; you don't have all the information you need. Unstructured is, for example, when you go to Google or any other search engine and search for literature about a disease; that is what we call unstructured. In fact, at best you can expect that the healthcare world has semi-structured data; not all the data collected, whether by the patient, the provider, or the payer, is actually structured data.
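As a rough illustration of the three types, here is one record in each form; all field names below are hypothetical examples, not a real DICOM or EHR schema:

```python
# Structured: every field is named and typed, rows-and-columns style
# (what a DICOM-like header gives you for an image).
structured = {
    "patient_id": "P001",
    "age": 64,
    "modality": "CR",
    "pixel_spacing_mm": 0.143,
}

# Semi-structured: attached to the patient, but the clinical context
# lives in free text (the chief complaint was never captured as a field).
semi_structured = {
    "patient_id": "P001",
    "note": "leg is swollen",
}

# Unstructured: plain text with no data model at all.
unstructured = "Search results: articles about pneumonia diagnosis ..."

print(type(structured["age"]).__name__)        # a typed, queryable field
print("chief_complaint" in semi_structured)    # False: context is missing
```

A program can filter the structured record by age directly; for the semi-structured note it would first have to extract meaning from the text, which is exactly why imaging data attracted AI development first.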
Now this brings us to a fine concept; a lot of the time it is missed. Of course we use these health record terms interchangeably as per our convenience, but I just wanted to point out that there are three different aspects of health records. The first one is the EMR, the electronic medical record. An electronic medical record typically gets curated by a particular department in a particular provider facility. That means if you have a cardiology department, the cardiology department has one EMR; if you have a neurology department, it has another EMR for the same patient. When we aggregate all the EMRs coming from different departments, we call that the EHR, the electronic health record. So in some sense the electronic health record is a combination of all the EMRs. And then you have the PHR, the personal health record, which is aggregated by the patient: for example, somebody has a wearable, or somebody has a CGM, a continuous glucose monitor, and that data comes into the PHR. Ideally all the data belongs to the patient, but the part the patient accumulates is less regulated, whereas the EHR and EMR are highly regulated. In a very simple sense, from a quality point of view, the EMR gets collected in a controlled environment, and there are quality checks, or it is administered in such a way that quality is maintained, whereas the PHR is what the patient is accumulating, the part outside the EHR, so you cannot guarantee the quality around it. That's the point I wanted to make here. Also, the privacy rules don't apply the same way; this shows up in
multiple instances. Now, before all these things we now call datasets, earlier we were collecting them as patient registries. For those who come from a registries background, I will say you are not really that far behind: all the AI datasets today are nothing but patient registries. Of course, there are emerging data types coming up: a lot of it has become digital, and now you are also collecting things like genetic data.
Again, what I want to point out: remember that data quality is one of the important metrics for us. I already told you about bias; biased data gives you a biased model. A lot of data quality is also about the question we are trying to answer. For example, imagine I have the patients' data and I just want to answer the question: what will be the expected OPD count tomorrow for a particular hospital? Here you are not interested in whether you have made the correct diagnosis, or whether patients left satisfied; you are just interested in knowing how many patients will visit the OPD tomorrow. So here the data quality metric is: as long as you are able to capture all the patients walking in, and you have that count right, you have the quality. But then you ask the question: were all the people who walked into the OPD correctly diagnosed? Or: did all of them need to come to the OPD? Now your quality metric has changed. Now you are talking about diagnostic accuracy, about whether the patient's condition has been correctly diagnosed, and that means you need a ground truth. What is the ground truth for it? That means you need multiple checks on the patient. So did we do that?
Traditionally, all the clinical data that exists has been episodic. Typically people walked into a provider facility if they had a complaint, and once the complaint was addressed they walked away from what we call the provider facility. Whereas today, what we are talking about is big data, where you want to accumulate data over the continuum of the life of the person, so that you can really ask broader questions; so the criteria for quality have also shifted. When I say continuum, you might want to combine the CGM data which I am collecting as a person with all the capillary blood glucose tests which were done at a hospital facility or a diagnostic center. That is what makes it a continuum. That means I cannot guarantee the data quality across the continuum; that's a point I wanted to mention. Big data also has these four characteristics: volume, velocity, variety, and veracity. Veracity is one of the important characteristics: when you talk about big data, you also expect a lot of errors. When I say a lot of errors, the typical percentage we talk about is anywhere between 20 to 30%; that is very common in the big data world. And of course what you are trying to bring with all this data is, as I told you, the aim: can we do the right intervention for the right person at the right time? That means you are trying to bring value to this data, and remember, big data as such, in the four Vs, has no value associated with it.
Now, I have already mentioned this, so I'm going to skip ahead. If I think about big data, you have all these types of data: you have pharmacy claims; you have government health plans, of which Ayushman Bharat is a good example; you have private players (almost 60 to 70% of current healthcare in India is delivered privately); you have patient registries; now you are doing genomics; you have hospital EHRs; you have retail outlets doing purely diagnostic tests; you have mobile data and wearables, of which CGM is a good example; and you also have a lot of claims that have gone to insurance agencies. These are all part of the big data, out of which you want to make some sense. Again, this is just to say that there is a variety of data that exists, and
the question is how we really take that and convert it into what we call usability. I want to give a few examples of what has been done. Remember we talked about the pediatric condition: what we did was take the data, write down all the symptoms, and also write the corresponding diseases. This we call structured EHR: you somehow made unstructured data structured. Most of the clinical notes that exist in EHRs are unstructured; even if it is unstructured, can you make some sense out of it? For example, think about ChatGPT: the data which exists on the internet, as I already told you, is all unstructured, but it has learned from it, or made some sense out of it. So could we also do that? And for medical imaging data, as I already told you in the chest X-ray example, you have training data, you have wearable data, you have physician dictation data; how do we really put that into training and come up with models? It all depends on what the need is for us, what the existing workflow is, and how we put a target state around it. So not every data can be used for every end. At the same time, just because you have data from a particular episode, you may not be able to answer the question. For example, you have one instance of a blood glucose level: can you determine what the complications of diabetes will be? That is not possible with one instance of the data; you require continuum data for that. That's the part I wanted to emphasize. I'm going to skip this part, but I just wanted to point out that
whenever we have data, the quality becomes very, very important. If you have data which is garbage, then when you put that garbage into the model, the model learns from the data, and you get only garbage out. In the healthcare world, you put garbage in and you might take it as gospel. Remember that we are all trying to do diagnosis, or trying to do the right intervention for the patient, so in some sense that becomes a big shift: quality is most important. So do you have metrics for quality? How you define quality also depends on the question you are going to ask. In the OPD case I told you about: if you are interested in just determining how many people are going to visit a particular hospital's OPD tomorrow, that quality metric is different from how many people were correctly diagnosed. So in some sense you want that data quality metric very clearly defined. You want to standardize the entry; you want to describe or design the data elements to avoid errors, which means well-designed interfaces; you want to document your inclusion criteria and exclusion criteria; and you want to build the human capacity around it. So in some sense, quality is not really the biggest strength of what we call healthcare
data, because a lot of you who are doctors are used to working with imperfect data, but an AI model is not like that. What you have is information, but AI development is all about data, because the computer consumes the data. If your data doesn't have the quality, then the AI model you are going to get will always be biased, or it will be garbage. In fact, I can tell you, having worked in model development for more than 15 years: invariably, 90% of the AI development we do is about this data quality. That means we do a lot of cleanup of data. And in fact, even after that, what we get is that 70% of our AI algorithms are only locally accurate. That means if I collect data from one particular site, then take that algorithm and try to test it on another site, because there is a lot of shift in the quality metrics and the protocols, we get, I will argue, at least 20 to 30% lower accuracy.
That has been a fundamental problem, and it all revolves around the data part. So again, how do you become good with the data? I will argue that there is no shortcut for this. Everybody has to learn from experience. You have to come up with those metrics; you have to practice them. There is no shortcut; you learn with the exercise. So
again, I just wanted to point out that there is a concept called bias and variance. If the data is biased, you can correct it; if it has large variability, there is no way you can correct that. You cannot apply corrective measures for variance other than going back and making sure the data you are collecting doesn't have that much variance. That means, as I told you, the upfront thing you have to do is train the people so that you can get good quality data. And if there is biological variance, you also want to be aware of that: if a test itself has 5% error, you cannot expect the data you have to have an error of less than that. That's a very simple way of understanding it.
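The point about test error setting a floor can be made concrete with a small simulation (the numbers are hypothetical): if the reference test itself mislabels 5% of cases, even a perfect model cannot measure much better than 95% accuracy against those labels.

```python
import random

def measured_accuracy(n=100_000, label_error=0.05, seed=0):
    """Simulate a perfect predictor evaluated against noisy labels.

    Even a model that always predicts the true condition appears wrong
    whenever the reference test itself mislabeled the case, so measured
    accuracy is capped near 1 - label_error.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        truth = rng.random() < 0.5                             # true disease status
        observed = truth if rng.random() >= label_error else not truth
        prediction = truth                                     # perfect model
        correct += prediction == observed
    return correct / n
```

With a 5% label error the measured accuracy comes out close to 0.95, never near 1.0, which is exactly the ceiling the speaker describes.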
Typically, if it is numerical data, what we try to do is plot it. We plot what we call the 25th percentile to the 75th percentile, which you can relate to roughly the 1-sigma to 2-sigma range, and if you plot it, it should nicely follow a Gaussian curve. Anything in what we call the tails, the ends of this Gaussian curve, should not exceed about 3% on each side; that means about 6% in total is what you can expect beyond the 1-sigma to 2-sigma range. If more than 6% of the data you have collected, or certainly more than 10%, is going into those tails, then you have to go back and check your quality. That is what we try to do if it is numerical data.
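As a sketch of that numerical check, here is one way to flag a dataset when too much mass sits in the tails, using the common 1.5×IQR outlier fence as a stand-in for the speaker's sigma cut-offs (the fence and the thresholds are illustrative choices, not his exact method):

```python
import statistics

def tail_fraction(values):
    """Fraction of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    For roughly Gaussian numerical data, only a few percent should land
    in the tails; a much larger fraction suggests quality problems.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lo or v > hi]
    return len(outliers) / len(values)

def needs_quality_review(values, threshold=0.10):
    # Speaker's rule of thumb: more than ~6-10% in the tails means
    # go back and check how the data was collected.
    return tail_fraction(values) > threshold
```

Running `needs_quality_review` on a column of lab values would then tell you whether that variable deserves a second look before training.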
So I have almost come to the end, but I just wanted to point out that if at all you want to create your own dataset, one important point is that the medical data which is out there is never readily available. You should start like this: you have retrospective or prospective data; you first get ethics board approval (every hospital today has an ethics board); you get access to the data; you query for the data you need; you do something called deidentification and bring the deidentified data into your system; you do a bit of quality checking (I already spoke about quality); and you make it into a structured format. Remember, in the pediatric case they had charts, but they listed all the symptoms and all the diseases, and then you put the labels on them; that is what gets it to model development. There is also a nice paper written around this; I will say it is a must-read for all of you.
The only thing I want to point out is that there are new methodologies for this deidentification. In the healthcare world you don't really anonymize the data; you deidentify it, so that you can reidentify the patient if required. There are two approaches: Safe Harbor and expert determination. Safe Harbor is typically applied to fairly structured data like medical images, while expert determination is what you typically apply if you have multimodal data, meaning you have text data, waveforms, and so on. Safe Harbor nicely lists its identifiers, so you can write a simple program to remove all those identifiers from the data. So again, I just wanted to point that out, and then stop there with some challenges.
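A sketch of the kind of "simple program" mentioned here, scrubbing a few Safe Harbor identifier types from free text with regular expressions. The patterns cover only a handful of the 18 HIPAA identifier classes and are illustrative, not a compliant deidentification pipeline:

```python
import re

# Illustrative patterns for a few Safe Harbor identifier types.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}

def deidentify(text):
    """Replace each matched identifier with a bracketed tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text
```

In practice a real pipeline would handle names, addresses, and the remaining identifier classes, and would be validated against expert review rather than trusted on regexes alone.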
For example, here is a case from a study I was involved in, where we asked experts to mark lung nodules. Each of the colors you see here, blue, green, or violet, was done by a different expert. Now, obviously, none of them match. This is the fundamental thing: which one will you take as the ground truth for training your model? Will you take an average? Will you take whoever is more experienced? Or you can ask a fundamentally different question and say: are these variations really clinically significant? Do they change anything about the diagnosis? Those questions are still unsolved, but in some sense I want you to think about them. So I'm going to leave you with
are there. So um I'm going to leave with some challenges. So I already mentioned
some challenges. So I already mentioned this saying that existing medical data is never you don't have input output pairs. You have a chest X-ray then you
pairs. You have a chest X-ray then you have some report right the report doesn't clearly mention whether is whether it is pneumonia positive or negative. They write saying that not
negative. They write saying that not positive for pneumonia right if you search for positive pneumonia then you are going to get all these cases which are also not positive for so in some sense you have to do
little bit of clean up for it and prepare the data for the AIS and then lot of the healthcare data that exist in our systems are largely episodic they're
all fragmented right so you don't have a continuum data like like anywhere else of course ashman the digital mission and all trying to get into this array ated
part of it and we also have errors in terms of data entries. So what we are entering into any of these EHRs there is large number of errors it is it is well
known that 30% of our that entry are more than that and you also have issues around data quality lot of missing fields and other things and what is the ground truth we take what is the ground
truth we wish to take right is it consensia there's lot of biological variability I just now showed you a case of lung nodule marking right which one do you take as a
So I wanted to leave you with a question. If you have a single chest X-ray of a patient, what is the right question to ask? Can you answer the question "Can AI predict the likelihood of this patient developing lung cancer?" with only that data, or do you require more data, for example whether they are a smoker or a non-smoker? Or can you ask the question "Can AI predict whether this patient is pneumonia positive or not?" Will that information exist in that data? Remember, we process information, whereas the computer processes data. Here the right question is "pneumonia positive or not?", because that is the information present in the data. In some sense we always want to ask that question. This is where I am going to stop. There are new challenges coming in. For example, the DPDP Act was passed in 2025, about three months back, so you have new privacy rules; you have new multimodal models; you have data harmonization issues (for example, the COVID-19 code did not exist until 2019, and now you have COVID-19 as a new subclass of codes); and you also have continuous data, where newer and newer wearables keep arriving; CGM was non-existent 10 years back, and today it exists. So there is more and more data. That is where I am going to stop. Thank you for patiently listening to me.

>> Thank you very much; that was a wonderful introduction for the doctors, not only on the data itself but on the right type of questions to ask. As you can expect, many of the questions were around the data itself, and many were about how to use AI and
the importance of building AI. The first question I think we should discuss is: when you say annotation of data, as a computer scientist, what is it that you need from annotation of medical data? And number two: is it really important to have Indian data to make sure that AI solutions for doctors in India actually work? Let's take these two questions first.
>> Okay, so I can answer the first question. In my view, and this is purely my view, data by itself doesn't have any value; what you require is data with expertise. Remember that as humans we process information, so I need that data plus information: I want to know the process by which a clinician has taken the data and converted it into information. This is a very important concept. I call data the foundation, but nobody is going to give you any money if you have only a foundation; you cannot rent it. What you build on top of it is what is usable and what makes revenue. In the same way, data is the foundation we require; without it we cannot start, but we also require the expertise. Whenever we ask for annotation to be done, typically the way we follow is to get at least two people to look at the data and do the annotation, and then we try to bring some consensus between them. Even though it is very hard to bring clinicians to the same consensus, we try to take somewhat of a middle path, an average of both; if it is an annotation like these lung nodule markings, we try to take an average. And if there is no consensus, we go with the most experienced clinician; we take that as the ground truth.
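The consensus rule just described can be sketched for a numeric annotation such as a nodule diameter; the 20% disagreement tolerance and the data shapes are hypothetical choices for illustration:

```python
def consensus(a, b, experience_a, experience_b, tolerance=0.20):
    """Combine two annotators' numeric annotations.

    If the annotators roughly agree, take the middle path (the average);
    if they disagree beyond the tolerance, defer to the more
    experienced clinician, as the speaker describes.
    """
    disagreement = abs(a - b) / max(a, b)
    if disagreement <= tolerance:
        return (a + b) / 2                      # middle path: the average
    # no consensus: go with the more experienced clinician
    return a if experience_a >= experience_b else b
```

For example, two diameters of 10.0 mm and 11.0 mm would average to 10.5 mm, while 10.0 mm versus 20.0 mm would fall back to whichever reader has more years of experience.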
Okay, so now the second question: what you have asked is a very important question. Let me give you a counterexample. Today in Delhi, I think the current air quality index is about 500. So if somebody comes in with a cough, you know it could be because of the bad air quality; we cannot assume the same conditions that sit behind Western data. Singapore's air quality index, for example, is about 75. So if a model has been trained on cough analysis to determine, let's say, TB, these are two different settings. We have a lot of environmental factors and a lot of personalized factors that come into it. In some sense, if you don't have Indianized data, an Indian data set, well, I already told you that almost all the algorithms are only locally accurate, so when you bring them to another site, another place, their accuracy drops. In a very simple sense, if you don't have Indian data, we cannot
really use AI for our patients.

>> Thanks. And I think the very important and clear derivation from that is that expert Indian doctors annotating high-quality Indian data, and making it available to researchers in India, is pretty much the only way forward for high-quality AI solutions to be built. Now, along the same lines: we are talking about AI solutions and how well they work or do not work. How do you measure how well an AI solution works? How do we deal with errors? Who takes responsibility? Any thoughts on that from your side?
>> So the data quality metric varies largely based on the question you ask. There is a new concept called the trust metric; we call it a trust metric. Let me explain it in non-medical terms, Anurag, and this goes for Dr. Suresh as well: if I ask you to give me one rupee, you are not going to think much before giving me that one rupee; but if I ask you to give me one lakh, then you have to think about whether you trust me with that one lakh or not. Similarly in healthcare: if you are only determining how many people are going to walk into your OPD, the stakes are very low; if you have to do cardiac surgery, your stakes are much higher, and that is a different metric altogether. In a very simple sense, the quality we talk about depends on the trust we require, and the trust in an AI model comes from the trust in the data as well. That means if the stakes are high, we have to set a higher bar; if the stakes are low, you can have a lower bar. And remember this very important thing: somehow, in our fancy world, we think AI is about self-driving cars. But if you look at any self-driving car video, they all show you a nice sunny day on a very clear highway; it is not jam-packed city traffic, and it is not raining at all, and that is what self-driving is. That means the benchmark for that AI to work is an average condition at an average time, whereas in the healthcare world our benchmark is that all of you get trained for at least 10 years before you practice. So in some sense we are saying our AI should work on par with a race-car driver; but is there any race-car driver that AI is competing with? The bar is very high, at least in the clinical world, because medicine is always evidence-based, and the evidence we are building is posed against the most experienced clinicians.
>> Two further questions along this line. One is about extracting large amounts of data and checking its quality. Take the example of electronic health records: to extract a large amount of data from both the structured and unstructured data contained within them, we will often have to use other AI; for example, we have all seen LLMs being used now to parse through electronic health records and recreate more organized, structured data for use. So the use of AI to generate data for AI, and to check quality for AI training, gets circular at some point. How do you see this going forward?

>> So, technically,
anybody who is using an LLM to extract data from an EHR, I would say please be careful, because any data you put into the LLM, the LLM may also use for its own training purposes. In some sense you are violating patient privacy by uploading the data into the LLM, because there is no on-premises guarantee that we are aware of; at the very least, if you are using one, please download the model locally and then use it; there are, for example, medical models with local versions that can be downloaded and used. That's the first part I wanted to mention. Now, in this space of what we call circular data, where an LLM extracts some data and we use that data to build an AI model, there are two AI models, one giving input to the other. In this process, one of the quality checks we do is to leave out some data. That means, even if you are using on-premises data extraction tools, we ask: can we do this data extraction manually and compare the performance of the data we extracted versus what the AI model extracted? Of course, to compare them you need some metrics, so you need to clearly define those metrics: for example, for images from which you are extracting lung nodules, we have metrics like signal-to-noise ratio, and you have metrics around what we call calibration errors. You have to clearly define those metrics and compare them. That means you always want to think about what the human alternative is, and for at least some part of the data, 5 to 10%, you try to do it with the human alternative and compare, so that you don't have hallucinations, you don't have anything that gets fabricated. In fact, for those of you who don't know, there is something called ZeroGPT: if you put a paper into it, it will tell you, for example, whether a reference is a hallucinated reference or an actual reference that exists in public. You want to come up with these kinds of tools as a metric. But the primary goal is: is there a human alternative? If it exists, please use it.
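The leave-out check described above, extracting 5 to 10% of the records manually and comparing against the AI extraction, can be sketched like this; the function and record shapes are hypothetical, and a real comparison would use task-specific metrics rather than exact-match agreement:

```python
import random

def agreement_on_holdout(records, ai_extract, human_extract,
                         sample_frac=0.10, seed=0):
    """Measure field-level agreement between AI and human extraction.

    Draw a random 5-10% holdout, extract it both ways, and return the
    fraction of records where the two extractions match exactly.
    """
    rng = random.Random(seed)
    k = max(1, int(len(records) * sample_frac))
    holdout = rng.sample(records, k)
    matches = sum(ai_extract(r) == human_extract(r) for r in holdout)
    return matches / k
```

A low agreement score on the holdout is the signal to stop and investigate hallucinations before trusting the extracted dataset for training.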
>> Continuing along the same lines: if a critical part of doing this is evaluating the errors, then it automatically stands to reason that people with domain expertise must be part of this type of data extraction, structuring, and organization, and that their presence, with their inputs on which data might be more valuable and where the corruption is coming from, in discussions with the data science team members, is probably very important for creating high-quality products. Your thoughts?
>> Absolutely. I already mentioned that in my view data by itself has zero value. What has value is data plus the expertise that domain experts bring, because only the domain expert can understand the context, whereas data scientists like us don't understand the context very well. Healthcare, in a nutshell, works in context, while most data science works out of context. That means if you are doing data extraction, you want to define the quality metrics clearly; for example, I showed you the question of whether a variation is clinically significant or not. That is a very important question, and who will determine the clinical significance? Only the domain experts can. So in my view it has to be a synergy between all of them. If you are not involving the clinical experts or domain experts in your data extraction or dataset creation process, I will say that your model, whether you want to believe it or not, is going to be biased, and you will end up with what we called earlier: garbage in, gospel out.
>> Now, on that note of context, let me move to broader questions on AI in general. One of them, of course, is that medical AI, including some products that are free, is already being used by patients and doctors. Do you think it is a dangerous trend that products which have never actually been approved or examined are being used directly by doctors and patients? And if so, what do you see as the biggest danger there?

>> So I want to tell you what the regulator says, and then I will answer this question. Currently the regulator says that AI is like a calculator. For example, as a doctor, are you not allowed to use a calculator? I will argue that you are allowed to use a calculator, but do you know how to use it? That's the important question. And with a calculator you have clear boundaries. In the same way, if you can understand what the AI's boundaries are, I will argue that you are free to use it; but if you don't know the boundaries of the AI and you try to use it anyway, that is like me expecting a calculator to come up with an AI model given input-output data. So in some sense you need to know the boundaries; there is nothing wrong in using AI as long as you know what the AI can do. That's my view.

>> The calculator is a very good example, because many people fear that the use of
calculators will make us bad at maths.
Many people fear that the use of smartphones will make us bad at remembering things. Yes, to some degree we become worse at some things we used to do well earlier, but by and large maths does not become weaker because of calculators, and the world does not stop functioning. A general principle, of course, is that people using AI for a given task should ask themselves: can I do this task without the AI? That is probably one of the ways you know for certain that you are a safe user of AI: you can estimate the output.
>> Nice.
>> I think we're pretty much at the end of the hour. This was a lovely talk, and I'll go back to Dr. Suresh for anything final that he has to add.
>> Thank you very much, Dr. Andra and Dr. Fendra; it was a wonderful lecture. First of all, there is a lot of appreciation all over, and I'm sure we can't cover all the questions, because there are more than 180 relevant questions and close to 200 other questions which are partly relevant to our topic. We can't take all of those, but one common thing people have given as feedback today, for both of you and for us at NBEMS, is: is there any way these people can reach out to you if they have a specific question? Can we make your mail IDs, or any of them, available if you are interested in doing so? I will leave that question for you to answer at the end. They also told us that the legal and other aspects were very well covered; I don't think I can add anything other than a few specific interests, which are coming in by the hundreds, to learn more about this specific class. Unstructured data is one question I think we can touch upon, which a gentleman has asked.
>> Yes. Do you want to go ahead and answer?

>> I will argue that if the data is coming from the provider, which is mostly the clinics, most of it will be in a semistructured format: you have a patient ID, and corresponding to it you have the data. Where the difficulty lies is when a particular individual is collecting their own data, say CGM-type data coming into the apps on your mobile phone; that tends to be unstructured. And remember, if it is in an unstructured format, you also want to be a little cognizant of the fact that it may not have the quality it deserves, because a lot of it is coming from an uncontrolled environment, and this uncontrolled environment adds a bit of uncertainty to the data. That's the part I wanted to mention: for unstructured data, you also want good quality metrics when you convert it into structured form.
>> Right. And in our experience, and I'm going to re-emphasize this because you're right, any work you do with patient data should be with downloaded models, as opposed to feeding it into online models. Otherwise it's a violation. But assuming you have access to facilities where you can download and run the models: if you take semi-structured data, and you have your own high-quality annotators who provide examples of semi-structured-to-structured conversions, then with few-shot learning it is possible to further tune some of the downloadable models to do that job for you. But it's a continuous learning process. In something as simple as extracting drug names and labels, after scraping the entire Tata 1MG databases, we found almost 15 to 20% hallucinations using accepted models. After few-shot training, it went down to 2 to 3%. But we could never get rid of that last 2 to 3%, and you have to be careful about these things. So this is a huge challenge for everyone, and it's a great area of ongoing work.
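[Editor's illustration] The workflow described here — annotated example pairs used to prompt a locally downloaded model, then checking extracted drug names against a trusted list — can be sketched in a minimal way. This is a hypothetical sketch, not the speakers' actual pipeline: `build_few_shot_prompt` and `hallucination_rate` are invented names, the example records are made up, and the assembled prompt would be sent to a local model (not shown).

```python
import json

def build_few_shot_prompt(examples, new_record):
    """Assemble a few-shot prompt from annotated (semi-structured, structured)
    pairs, as described in the talk: high-quality human annotations teach a
    locally downloaded model the conversion."""
    parts = ["Convert each semi-structured clinical record to JSON."]
    for raw, structured in examples:
        parts.append(f"Record: {raw}\nJSON: {json.dumps(structured)}")
    parts.append(f"Record: {new_record}\nJSON:")  # model completes this line
    return "\n\n".join(parts)

def hallucination_rate(extracted_names, reference_vocabulary):
    """Fraction of extracted drug names not found in a trusted reference
    list (e.g. a scraped formulary) -- a rough proxy for the hallucination
    figures mentioned in the talk."""
    if not extracted_names:
        return 0.0
    unknown = [n for n in extracted_names if n.lower() not in reference_vocabulary]
    return len(unknown) / len(extracted_names)

# Hypothetical annotated example pair and a new record to convert.
examples = [
    ("Pt 1023, temp 102F, cough 3 days",
     {"patient_id": "1023", "temperature_f": 102, "symptom": "cough"}),
]
prompt = build_few_shot_prompt(examples, "Pt 2047, temp 99F, headache 1 day")

# One of two extracted names is absent from the reference list.
vocab = {"paracetamol", "amoxicillin"}
rate = hallucination_rate(["Paracetamol", "Amoxicilin"], vocab)
```

The quality check is the important half: as the speakers note, the residual hallucination rate never reaches zero, so a reference-vocabulary check like this would run continuously, not once.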
>> Thank you very much, everyone. That was indeed a very great lecture. Just one note from our side at NBMS: there was some quality issue in the audio; we apologize for that, and we will fix it in the next video. It was wonderful having both of you here, sir, and on behalf of all the participants we thank both the speaker and the moderator for sparing their valuable time. Thank you.
>> Thank you all very much; it was a pleasure to be here. Thank you all for joining us, and thank you, Dr. S, for coordinating the session.
>> Thank you, sir. Thank you so much.