Azure Databricks Interview Questions 2025 [WITH REAL-TIME SCENARIOS]
By Ansh Lamba
Summary
Topics Covered
- Interviews Demand Hands-On Solutions
- Unity Catalog Centralizes Governance
- Delta Location Hierarchy Determines Storage
- Autoloader Ensures Exactly-Once Loading
- Z-Ordering Enables Data Skipping
Full Transcript
There are so many job postings for Databricks, and the numbers are growing with time. But why is it so difficult to actually crack a Databricks interview? Because the dynamics of this application are changing rapidly, and you need to be aligned with the latest trends and the latest interview questions. That's why I have created this three-hour-long video covering all the latest Databricks interview questions, covering all the real-time scenarios plus the conceptual questions as well. So if your aim is to actually crack a Databricks interview this year, then let's get started with this video and be serious. What's up, my data fam? How is your Sunday going so far? I know it's really, really good, and if not now, it will be good. So basically, after receiving so many comments and so many messages, now it's time to talk about Databricks interview questions.
This is kind of an evergreen topic, and I personally feel there should be a new video on it at least every two to three months, because the dynamics of Databricks are changing rapidly. And it's not just about the dynamics of Databricks; it's about the dynamics of the whole data engineering industry. Just tell me one thing: if you are sitting in an interview, are you still being asked the same questions you would have been asked, let's say, two to three years back? I'm not talking about the fundamental questions; obviously the fundamentals will remain the same, like what is distributed computing and how the Spark architecture works. But now you will see that questions are evolving more toward delivering results using these technologies, such as Databricks and Azure. Interviewers are more inclined toward the use of the technology, because companies obviously want to hire data engineers who have foundational knowledge and really deep knowledge of the concepts. All of that is fine, but at the end of the day, after covering those fundamentals, engineers and developers need to develop solutions using these technologies. And trust me, the industry is so hectic right now that the moment you are hired, you will land directly on a project. How can you expect to land as a Databricks data engineer and say, "Okay, I will just learn this technology while I'm at the company"? No, bro, you have to develop the solution, because that's why the company is hiring you, and that's why they are paying you. And if you are hired into a consultancy, then obviously you can expect to be working on projects from day one or day two, or at most after one week. So you need to have deep hands-on experience, and that's why interview questions right now revolve more around the technologies along with your solutions. Makes sense? Makes sense.
So that is why, in this particular video, we'll be discussing so many of the latest questions, and we are going to discuss all the latest things, because obviously Databricks has so many brand-new features available right now. I will try to include all those scenarios, and we are not going to cover simple theoretical questions. No, we are going to focus on end-to-end questions that will involve your design skills, questions where you need to think through a solution. It's not just straightforward; you can develop the same solution using a different approach, maybe, but you need to think that way: okay, we can approach it like this, or like this, or like this as well. So in this particular video, the number of questions will totally depend on the flow as we go along, because my intent is not to cover 100 or 200 questions of the sort "hey, just tell me what is Databricks; hey, just tell me what is a data lake; hey, just tell me what is a Delta table." No, those are not the interview questions. My intent is not just to cover lots of questions; it is to cover all the latest questions, end-to-end questions, so that if the interviewer asks you follow-up questions, you are well prepared: just ask me anything. And these questions tie in with projects as well. Let's say you have built a project and you have used a technology called Databricks. The interviewer can ask you some follow-up questions, and those questions would be very obvious ones. Those questions will be covered in this video as well.
So without delaying further, let me get started with this video. And now, yes, you will be asking, "Hey Ansh, what is the prerequisite for this video?" Just a basic understanding of Databricks. Obviously, I have a Databricks video on my channel; if you have watched that video, you are all set for this one. If not, just go and check out that video first, and drop a lovely comment on this video as well as on that one. Now let's actually get started. In this particular video we'll be creating an Azure Databricks resource, because earlier I was thinking of going with the open-source version, but the thing is, you would not actually grasp all the concepts that way. So we'll be using Azure Databricks, and yes, you'll be learning a lot in this video. Just be with me, bring some excitement and some enthusiasm, and you're all set, bro. You're all set. Okay, so let's get started with the video and create our first thing, which is an Azure account, and then obviously Databricks.
Let's see. In order to create your free Azure account, the steps are very, very simple. And if you are my data fam, hey, by the way, if you haven't clicked on the subscribe button, just do it right now; and if you haven't shared my videos with your friends, that means you are not a true friend, because if you shared the video, you know that friend would gain a lot of knowledge, right? So just spread the positivity. I know everyone is talking about how there's so much competition, this and that. Okay, everything is fine; this is life. Life is not easy. We know there are challenges, but we know there's a way. We know how to walk, we know how to overcome the hurdles, we know how to make sprints, we know how to run the marathon, and we know how to stop. Stop? Not really; just walk. So the thing is, we know there are challenges, we know there's competition, we know AI is here, and still we are going to win. Why? Because you are my data fam. So first of all, let's create our free Azure account. The steps are very simple: open incognito mode and search for "Azure free account."
Just click on the first link, and then, yeah, perfect. Here you will see "Try Azure for free." By the way, you can even click on the other one, "Pay as you go," but you do not need to spend any money. Just click on "Try Azure for free," and it will simply redirect you to a page. So this is the page, and the steps are very simple. You simply need to put in your email ID, a Microsoft email ID, not a Gmail ID. The moment you put in your Microsoft email ID, simply click Next. Now, some of you will say, "Hey, we do not have any Microsoft email ID." Don't worry, click on "Create one." But now, my personal advice: do not create your new account from here. Sometimes it gets you stuck on the quiz part. What is the quiz part? You need to verify that you are not a robot, and if you do this step on a mobile phone, it's fine, it works fine; but on a laptop it sometimes gets stuck. So simply click on "Create one" on your mobile, and once the account is created, you can put it in here. The moment you enter your email account, it will simply ask you to fill out a form, just to give your name, address, phone number, and so on, and at the end you simply need to click Sign Up. The moment you click the Sign Up button, it will ask you for some more details, such as card details. Don't worry, that is just for confirmation: Microsoft confirms that you are the one who will be using the services, because if you are, you should have those financial details, right? So that's why it confirms; that's it. Just fill that in, and your account is free to use for 30 days. You do not need to worry at all. Awesome.
So now let me take you to the Azure portal, because the Azure portal is the place where we create everything, all of our resources. Don't worry, this is not an Azure masterclass, but we do need to go into Azure to create our Databricks. So simply go to Google and search portal.azure.com. This is the link; simply hit Enter, and then it will ask you to put in the email ID, the one you just created for your registration. The moment you put it in, it's done: you will land on the Azure portal. Let me show you how it looks. So this is my Azure portal, and I know it could look different in your case. Why? Because these are the Azure services that I have used so far, or have at least clicked on, so these show up as recent ones. Do not worry, the rest of the things should be the same. Let me give you a quick overview. We do not have anything special here on the homepage, but there are some things that you should know. First of all, simply click on this ribbon, and the most important things are these two tabs: All Resources and Resource Groups. By the way, what is a resource group? You should know about this, bro; you're sitting in interviews, right, bro? Okay, no worries. A resource group is basically just a folder in which we store all of our resources. Now, don't ask me what resources are; let me tell you. Resources are just the services that we use from Azure. For example, you are using Azure Databricks: that is a service. You are using Data Factory: that is a service. You are using a data lake: that is a service. Very good. So for everything we create, we need to have a resource group. That's it. The rest of the items are the most commonly used resources and services on a daily basis, and then we have some monitoring capabilities and some cost management and billing management. You will not be taking care of those right now, because you're just using a free account. When you land the role as a data engineer, someone will take care of that; okay, maybe not you, I think it will be a data architect or someone from the data governance team, but you can still take part, because if you work with a startup, that's a good thing: you can wear so many hats, with the possibility of so many new learnings. Right? Very good. Okay, once everything is done, let's create our resource group.
Back in the portal: how can we create a resource group? Simply click on search and type "resource group." Okay, perfect. Now, I have so many resource groups already, and I'm so lazy: I haven't deleted any of them. Very well done, Ansh Lamba, good going. So simply click on Create, and I will name it, let's say, "databricks-interview." And the region? Let's pick a region. Which country do you want? UK South? Let's pick UK South. Okay, perfect. Click on Review + Create, and that's it: click Create, and here is your resource group, created. You can simply search for it, and it should be here. Simply click Refresh, and where's that? Go to Home, go here, and simply search "db-interview." Yeah, perfect, here you can see the resource group. Now this is your resource group. Now, what is the first thing that we need to create? Obviously, we need to have a data lake.
Why? Because, bro, Databricks is not a storage solution. It is your transformation layer, your data processing layer on top of your data, and a data warehousing layer on top of your data. Yeah, they are investing so much in improving their data warehousing capabilities, and at the recent summit they announced, I think, their own data models. I haven't gone through the documentation from the summit yet, because it's still going on, so I can take a look later, but it's really, really amazing. They are planning, I think, to build something related to data reporting; obviously a really cool feature. Okay, let's create our data lake quickly: simply click on Create, click on the plus button, go to Marketplace, and simply search "storage account."
We have a storage account here; simply click on the Microsoft one. Then click Create, and see: now your resource group field is automatically filled in. Why? Don't worry, that is not an interview question, but yeah, I have a tip, a tip that could be your interview question. Let me show you. First of all, we need to set the storage account name. What storage account name should we provide? I will simply pick "dbinterviewlake." It's a good name, "dbinterviewlake." And one thing: you cannot pick the same name that I'm picking here, because it should be unique across the whole network. You could simply put, say, "ilovedbinterviewlake." I'm just kidding; you do not need to write that. So you can simply say "dbinterviewlake," and that's it. Very good. Then the region is obviously picked automatically, and the primary service is fine as it is; still, I can show you: what we need to create is Azure Data Lake Storage Gen2. Obviously that's what we create, but if you just leave it as it is, it's not a big deal. So now, what do we need to do? We can pick Standard or Premium; just go with Standard. Then, this is important: redundancy. Basically, we have four different policies: LRS, GRS, ZRS, and GZRS. The cheapest one is LRS, locally redundant storage, in which the replicas of your data are created within the same data center.
Simply pick this one and then click Next. And this is important: in order to create the data lake, you need to check this box (enable hierarchical namespace); otherwise it will simply create a blob storage. Blob storage is not a data lake; a data lake is built on top of blob storage. Basically, the difference is that in blob storage you cannot create hierarchical folders, while in a data lake you can. So check this box, then simply click on Review + Create and hit the Create button, and it will deploy your data lake. It will hardly take a few seconds, trust me, hardly a few seconds. Okay, let me click Refresh if it is not done, but I think it should be done. Yeah, so as you can see, these are the deployment details. It has just started, and the storage account status is "Accepted," which means Azure has accepted our request to create that particular data lake and is now creating it. Let me see. Yeah, perfect. Now it is done; as you can see, it has deployed. Go to Resource: you can either click on this, or simply go to Home, search your resource group, and you will see your data lake created, which is called "dbinterviewlake." So this is our data lake.
And now the second thing: we need to create the Azure Databricks resource. That's it; we just need two resources, and that's it. I was thinking of not creating an external data lake, but in real-time scenarios, in real-world interviews, they will ask you questions about external data lakes, because we do not use the managed data lake. That is why I chose this one; see how much I care about you. So let's create our Databricks resource right now, without delaying. In order to create your Databricks resource, simply go to Marketplace, search "databricks," hit Enter, pick Azure Databricks, and click Create. Yeah, perfect.
So as you can see, here as well we need to give the workspace name. I will simply say "db-interview," so that it is aligned with the naming convention we are using, and I will simply append "workspace." Okay. And then the region is UK South. Now the question is: which tier do we need to pick, Premium, Standard, or Trial? Basically, Standard is not good, because with Standard we do not get all the features. With Premium, yes, we get all the features, but it is paid. You are using a free account, so it is not a big deal, but for those who are using a paid account and do not want to spend much, they can simply pick Trial (Premium). What is that? It is just like Premium, but only for 14 days; that's it, just the trial version. And then we simply need to put in the managed resource group name, which is not mandatory, by the way. You can be asked in interviews: why do we have this, and what's its role? So here comes your first question. Yes, the unofficial first question, and that's my way of presenting questions: I just bring them up whenever it is natural, not as a rigid "what is, how is, when is" quiz. That's it. So, what is this managed resource group, and why do we need to worry about it? Basically, within Databricks we have two different planes: one is the control plane and the other is the compute plane. In the control plane live all the interfaces, all the web UI and UX, everything that we'll be doing in the Databricks workspace; that all goes into the Databricks workspace area. But all your managed tables and all the VMs, the virtual machines, go to the compute plane, inside your own Azure subscription. Whenever we create, let's say, job clusters, you will actually see those virtual machines and those hard drives being created in that particular area, and that area is your managed resource group. After the introduction of Unity Catalog, we do not use the managed resource group for managed tables anymore; we just use it for the clusters. That's it. Now, what is Unity Catalog? Don't worry, we have a dedicated question on that. So for now, you can either put in your own dedicated name, or otherwise it will simply pick a default name for you. Then click Next, and everything is fine: you can directly click Review + Create, scroll down, and click Create. That's it. That's your Databricks workspace.
And I'm really, really excited to tell you our first question for today. This first question is really important, because as a developer it's very important to know how to set up the environment. It is very important, and let me tell you, bro, we are not going to set up a basic environment in Databricks. We have to use the modern features that we use in Databricks, and that's the intent of this video. So I will be using something called Unity Catalog. Unity Catalog is nothing but, you can say, the modern way of governing your resources within Databricks. And how can we enable Unity Catalog? We need to enable something called a Unity metastore. Okay, makes sense. So, while the workspace is deploying (it will take just a few seconds), let me quickly show you how we deal with the Unity Catalog architecture. Okay, let me go to Google and look up Unity Catalog. Let's pick this documentation page.
I just want to show you this image, because it is really, really nice, the best one for understanding. So this is the setup we had before Unity Catalog: we were managing independent workspaces, and in every independent workspace we were managing compute, obviously a metastore, and, if we were governing things, we had to manage the access for each one too, right? After the introduction of Unity Catalog, this is the Unity Catalog model: we enable something called a Unity metastore, and the Unity metastore comes in at the top level. Then, within it, we create things called catalogs, and those catalogs make up Unity Catalog. Why? Because they are united: that catalog you'll be creating shortly (don't worry, this is just a high-level overview) can be accessed from different Databricks workspaces as well. Wow. So it is totally united; that's why it is called Unity Catalog. And within that, your compute is there; it stays independent. Obviously we can govern it, we can manage it, check the lineage, everything is there, but the compute resides in the dedicated workspace's default resource group, which is also called the managed resource group. Okay. And then let me show you this architecture. So this is the architecture. See, at the top we have the metastore; this is called the Unity metastore. Earlier you were using the Hive metastore, but now we have the Unity metastore. What is a metastore? Again, this can be a question; we are going to cover literal questions like this, so just be informal while learning and very formal while answering. A metastore is nothing but the repository where you store all the data information: data about data, metadata. So let's say you're creating a table, a database, a schema, volumes; everything, everything will be recorded in the metastore. Earlier, we were creating independent Databricks workspaces, and if I was creating anything, a table, a database, a schema, that information was written to the managed resource group, the one we just saw while creating the workspace. We were storing all that information there, and if we wanted to create a managed table, the data obviously went there too. But then we said: hey, wait. Let's say our organization has 20 Databricks workspaces, 20, and that is a very common scenario for an organization. Those 20 workspaces would have 20 different storage accounts, just to store the data and metadata of your managed tables: 20 storage accounts. So what did we do? We said: hey, just hang on. We will simply create one metastore, which will be linked to many Databricks workspaces, and one metastore obviously has only one root storage location. And now that storage will be managed by us, which is why we point it at, let's say, the external lake that we have created. So this metastore is there; don't worry, we'll be creating that metastore, and we will link it with our data lake as well. And it is really a best practice to create an external location for our Unity metastore; it is optional, but we should always do it.
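To make the hierarchy concrete, here is a minimal sketch of Unity Catalog's three-level namespace (catalog.schema.table), written as Python notebook cells; the names db_interview and bronze are just placeholders for illustration:

```python
# Minimal sketch of the Unity Catalog three-level namespace.
# Assumes a Unity Catalog-enabled workspace; the names are placeholders.
spark.sql("CREATE CATALOG IF NOT EXISTS db_interview")        # catalog lives under the metastore
spark.sql("CREATE SCHEMA IF NOT EXISTS db_interview.bronze")  # schema (database) lives in the catalog
spark.sql("""
    CREATE TABLE IF NOT EXISTS db_interview.bronze.demo (id INT, name STRING)
""")                                                          # tables are catalog.schema.table
```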
Okay, so that is the high-level overview of what we have within Unity Catalog, and your basics should be really, really clear first; only then can we do anything. That's common sense, and I know common sense is not really common nowadays, so I'm taking care of everyone. So now that we know the concept: let's say I have this metastore. Now, whatever I do, say I create a managed table, a schema, a catalog, anything, all of that information will go into this external lake instead of the managed resource group. And not only for this particular workspace: every workspace attached to this metastore will write all of its information to this external lake. That's it; that's the concept of it. Let's see if our resource is ready. Yeah, it has deployed; it is ready. So now what can we do? We can simply go to Home, go to our resource group, and here are our two resources. Okay.
So let me show you question number one, and let me actually get started with real-time interview question number one. This would be your hottest question right now, and I have already given you some groundwork for it, so you will understand it. Now, let me present the question using a scenario. So, you have a Databricks workspace. Okay, and let's say this is your Databricks workspace. And you are a developer; let's use the blue color. Let's say you are a developer, he or she, whatever. Okay, let's drop this here. Yeah, perfect, perfect. So let's say you are a developer and you have this Databricks workspace. Okay, makes sense. Now, you need to set up this workspace in such a way that, first of all, it can access this particular lake, which is a data lake. But the thing is, this is an external data lake; that means this is not a managed data lake assigned to this particular Databricks workspace. So you need to first assign this particular data lake to this Databricks workspace. And there are some conditions: you cannot just build something on your own. I'm giving you some points that you have to consider. The first point is: you need to use something called an external location. Why? So that you can actually reuse this particular access. Okay, so first is this. The second one is: you need to allow whatever you build within this Databricks workspace, let's say schemas, databases, tables, functions, volumes, everything, to be used across multiple workspaces. Yes: you need to develop a shared setup that can be used across multiple Databricks workspaces, and obviously you need to take care of the data governance and everything about that particular workspace as well. And just a hint for you. "What hint, bro? You will be showing us anyway, right?" Yeah, but still I want you to try it on your own, and obviously just look at the solution if you cannot do it. So this is a kind of situational question that you need to tackle. Okay, so the hint is: you need to use Unity Catalog; you need to enable a Unity metastore. That is your hint. Okay, sorted, sorted, sorted. So now let's see how we can actually tackle this question and how we can approach it, and I will guide you step by step through everything. What is necessary? Let's see how we can solve this. So let's go to our Databricks workspace for this. Okay.
Simply click on it, click Launch Workspace, and then pick an account. So this one is just your normal account, and this one is the "ext" account. You would know: when you go to Entra ID, you get to your accounts and users, and when you click on that account, you see this long email ID. By the way, this is very handy: whenever you want to use the Databricks workspace, you should always use your normal account, but whenever you want to use the account console page, you need to use that long account. So this is our Databricks workspace. Wow, that looks so cool. Right now they have just changed the UI; earlier it used to be black or gray here, and I used to like that particular version more. So yeah, no worries, no worries. Okay, so this is our Databricks workspace. Let me just increase the screen size a little bit. And perfect.
So now, what do we need to do? First thing, just a quick overview; nothing special, since obviously you have some knowledge already, which is why you are watching interview questions, right? So nothing special; everything is the same. We have this left pane with Workspace, Recents, Catalog, Workflows, Compute, and Marketplace. Compute was really important before, but now we have one thing available by default, which is serverless compute. So we do not need to worry about an all-purpose cluster right now (a job cluster is for production; I mean an all-purpose cluster, not a job cluster); we will simply use serverless, and it will be ready for us to use. Wow, simple. And then we have Genie, which is just like a chatbot by Databricks. Okay, and then we have some machine learning things. If you are into AI and GenAI, obviously go to the Playground, pick your model, build something using the API, build something cool. Okay, so that is all about Databricks for now. I'm telling you again: a lot of changes are happening right now in all the applications, not just Databricks, in all the applications. Okay, so now, what do we need to do?
applications. Okay. So now what we need to do we need to simply create our meta store and how we can just create meta store we simply need to click on this dropdown and simply click on manage
account. So when you click on this you
account. So when you click on this you will see all the workspaces okay and simply click on manage account. If you
do not see this button you can simply watch my this YouTube video which is this one datab bricks unity catalog and you can simply watch this video from um
5600. So this is the exact time stamp.
5600. So this is the exact time stamp.
See, I'm just saving you a lot of time.
So, simply say I love you in the comments. Simply say, "Yeah, it's up to
comments. Simply say, "Yeah, it's up to you if you want to say." Okay. So, when
you just click on it, you will simply land on the admin console level. So, the
thing is when you just log for the first time with databicks, you will only see manage account button with your default account with your default default account. And right now, my default
account. And right now, my default account is this one. So if you're just logging in to the database using default account, you should see it. But if you do not see it, do not worry. You can
simply click on that particular page and it will just ask you to just put your ID and if you just want to go to man like manage account, you cannot simply go to
using this normal Gmail account. You
have to use this long Gmail account because this is the one registered with your default directory. Okay. So these
are some admin things you should know.
Now I will simply click on Manage Account, and here you will see I'm obviously already logged in, because I am super smart. No, I have just logged in here before. So if I click on it, see: this is not my normal account; this is the "ext" one. So ideally, what I prefer is to simply create a new user in my Entra ID, and I just prefer using that particular account. I have one; let me just go to the account console page and check User Management. See, I have created a dedicated account for my Databricks Unity Catalog. I use this one, and if I want to make any changes, I just prefer using it. And in order to make any changes in the Databricks workspace, this account should be, what's the name of that role, I think Global Administrator. Yeah. So if you just go to Entra ID, this account should be the Global Administrator. You can check the roles: click on Roles and see, Global Administrator. By the way, you will get all of this in that particular video. These are just admin things; do not worry at all, ignore them. So now,
the main thing: as you can see, in the Workspaces tab we have all the workspaces listed, right? Very good. So now what we need to do is create a new Unity metastore. Simply go to Catalog and click Create Metastore. Perfect. So quickly, quickly create a new Unity metastore; I'll simply call it "db-interview-metastore." Makes sense? Now, for the region, you can pick any region; I will pick UK South. Why? Because, oh, another interview question: how many Unity metastores can you create within a region? Only and only one. Okay, just keep this thing in your mind.
Now, what is the default storage account location? We need to provide this, and before that, it is asking us to provide an Access Connector ID. Yes. So basically, I told you this is a situational question in which multiple questions are embedded. So now just tell me one thing: this is your Databricks workspace, and this is your data lake. How will they communicate with each other? Do they even know each other? Obviously not, right? They are different entities: this one is owned by Azure, this one by Databricks. How? How, bro? How? So here comes the role of the Databricks access connector. We need to create a Databricks access connector, and it is the only way to connect when you're working with
Unity Catalog. Okay. And how can we create that? Simply go to Azure, go to Home, and then search here. Oh, actually, first go to your resource group, because it will save you a lot of time. Click Create, and then simply search "access," and you will see a Spider-Man-like logo. Now just search "access connector." Oh, nice. See, "Access Connector for Azure Databricks"; that is the logo. And then click Create. By the way, I already have so many access connectors; I don't know why I do not delete them after creating the videos. Ansh Lamba, just do something; you should delete those. Okay. So now you just need to name it; I will simply say "db-interview-access." Perfect. Region: US East? No bro, UK South. Yeah, perfect. Click Review + Create. So what will this do? Nothing by itself; it is just a kind of credential, a managed identity, that we need to use. We are just allowing this particular connector to use our storage accounts, and then that particular connector can actually be integrated with Databricks. That is the solution to part one of question number one, in which we need to create the connection between your Databricks and your external data lake. Okay.
Click Create. And see, that's why I focus more on the real real-time scenarios. There's a difference between real-time scenarios and real real-time scenarios: real real-time scenarios involve all these admin tasks. Bro, just get this: a Databricks interview is not just about "hey, what is a volume? hey, what is a cluster? hey, what is a job cluster?" No, they will give you a situation, and you need to tell them, "hey, I will do this." Okay, it is created. Now what will I do? I will go to the resource; I will show you. See, this is my resource. So now we need to go to our storage account, and go to Access Control (IAM), because we are going to grant contributor access on this particular data lake to that connector. Okay. Click Add, then Add role assignment, and then simply search "storage blob" below. Sometimes it gets stuck and does not populate immediately; see, now it is coming: Storage Blob Data Contributor. Click on this, click Next, and now click Managed Identity, click Select Members, and simply pick your access connector. See, I have 14 access connectors. Ansh Lamba, what are you doing, man? I will just pick this one, the Databricks access connector, and I promise I will delete all these access connectors after this. Okay, click Select, and click Review + Assign. That's it. Now we can integrate this particular access connector with our Databricks in order to access this particular data lake. Understood the link? Understood the relation? Okay. A Databricks interview question is not just about writing PySpark code. Databricks is a wide technology, used by the whole organization, not just by data engineers, not just by PySpark developers.
Okay. So now you can simply go to your resource group and click on this resource. Now you will simply need this resource's Resource ID. That's it: go back to the metastore form and paste it in as the Access Connector ID. That's it. Now it is asking us to provide the ADLS Gen2 path. Simply go back, click on your resource group, which is "db-interview," click on the storage account, click Containers, and create a container here, because obviously you'll be creating a metastore container. So just name it "metastore." This is basically the container that is dedicated to the Unity metastore only. Do not touch it. Do not, do not, because it is reserved exclusively for that particular Unity metastore. That's it. It's like: yes, this is your property, but you are not living here. So now what do we need to do? Click Properties and note that this is the storage account name, which is "dbinterviewlake." Okay. Now go back, and you need to paste the location: the container name, which is "metastore," then the at sign (@), then the storage account name. By the way, this format is basically called the abfss protocol, Azure Blob File System Secure; it is the protocol we use to access the data lake in Azure, and you should also know about this. Okay. Then simply write "dfs.core.windows.net," and then that's everything, because we are not specifying any folder.
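To keep the pattern straight, here is the general shape of an abfss URI, noted down in Python; the container and storage account names are the ones used in this video:

```python
# General form of an ADLS Gen2 path over the abfss protocol:
#   abfss://<container>@<storage-account>.dfs.core.windows.net/<optional/folder>
metastore_path = "abfss://metastore@dbinterviewlake.dfs.core.windows.net/"
raw_path = "abfss://raw@dbinterviewlake.dfs.core.windows.net/parquet_data"
```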
Use any folder, bro; it's all up to you. Click Create. That's it. This is your connection, done so far, if everything is going smoothly. Okay, let's see. Okay. By the way, don't worry: these red marks are not errors.
They are saying we cannot assign. Why? Because the metastore is created, and now we need to assign the workspace, right? And we cannot assign a workspace that is already assigned to a different metastore, and these ones are already assigned to different metastores. So we can only assign this one. Makes sense. Click Assign, and it will say, "hey, do you want to enable Unity Catalog?" We will say yes. Click Enable. That's it. That's it.
Congratulations! And click Close. It is done, it's done. Yeah, but the setup is not completely done yet. You will see this page after this: see, "Metastore Admin." Click on this edit button; currently this particular "ext" account is the admin. But you are using your normal account, right? So you need to make that particular account the admin; only then will you be able to use and create Unity catalogs. Okay, so this is a very small step, but interviewers can hook you on this scenario. And obviously, if you have watched my videos, you will answer like a pro. Okay, then simply click on this dropdown, type your email ID, and click Save. That's it. That's it, it is done. Now you can simply close this. Okay, now we are here; simply refresh the screen. So we have successfully set up the Unity metastore, and we can now get on with things. Now we can easily create all the things that we need to create, and our Databricks workspace is connected to the external data lake. Okay, sorted. Very good.
So now, what do we need to do next? Just to confirm, simply go to the Catalog tab. Yeah. So how can you confirm that you have enabled the Unity metastore? Simply click on this plus button, and you will see "Create a catalog." Earlier you could not see this, because you cannot create catalogs in workspaces without Unity Catalog. Okay, cool, cool. Our first question is done. Now let's see what we have in our second one; let's talk about question number two. In question number two, what do we need to achieve? This is the day-to-day activity that you, as a data engineer using Databricks, will be doing, and it was the core reason I picked an external data lake. So the thing is, this is again a scenario where you have a data lake, into which your manager, let's say, or any system, any pipeline, will be dumping the data in Parquet format, and this sits in the Azure data lake. Okay. Now, you need to push this data to a sink location, or let's say a destination. And you do not just need to push this data: you actually need to convert it into Delta format, and on top of it you even need to create a table, which will be a Delta table. So you need to do this task, and you will be doing everything with the help of Databricks. That's it. The source is in Parquet format; you need to perform a kind of, you can say, data file-type conversion, and land that data in Delta format, which is the most widely used open table format right now. And not only this: you also need to create this Delta table on top of it. And there's one more feature that you need to add: every time, you need to do a full refresh of the data. This is the requirement: a full refresh of the data. Every time data comes in here, because we are pulling all the data from, let's say, webMethods or any API, we are doing a full refresh every time. How can you achieve this? Let me show you, and this is a really, really common scenario; lots of scenarios will be revolving around this, trust me. Maybe it will come up directly or indirectly, but it will be there. Okay, so let me show you how you can achieve this. In order to achieve this, obviously we should first have some data. So how can you grab the data? I have uploaded all the data files to my GitHub repository, and you can simply check it out; this is the link that you can refer to. Okay, let me increase the screen. I will also put the link in the description. Okay, and this is the repository, and in it we have these folders. Okay,
sorted. So first of all, let's go to our data lake, because we need to set it up. Okay, and within this, what do we need to do? We need to simply create a container, which will be called "raw," and one more, which will be for, let's say, the destination. Okay, perfect. So in the raw container I will be uploading files. I will create a directory and simply name it "parquet_data," click Save, and within this I will upload that Parquet file, which you can also download from the repository. So I think it is here. Perfect. So, perfect: I have already uploaded this demo.parquet here. So what will I do? I will simply read this data. How? That is the question, right? Yeah. And it is not that straightforward. Okay, you need to do some things before that. Okay, let's
see. So, let's go to Databricks. And obviously, we are following the Unity Catalog architecture, so you need to do everything keeping that in mind. Okay. So, in order to read this data, you will need something. Simply go to your Catalog tab, and simply go to this plus button, or simply click on External Data; you need to create something called an external location. Okay, because only then can you read the data sitting in the data lake. Okay. And even before that, you need to go to Credentials: you need to actually create a credential. Wow. So we need to do these things before reading the data from the data lake. Yes, that is why I picked an external data lake, bro. That is why. So simply click Create Credential. By the way, what is a credential? A credential is nothing but just a fancy name for your access connector. Really? Yeah. Click on Create Credential, and for the credential name I will simply say "ansh-credits." Now, see: Access Connector ID. I told you, this is exactly the same thing. So simply go to Azure, go to your resource group (Home, then Resource Groups), or directly go to this resource, which is the access connector; simply copy the Resource ID, that's it, and paste it here. And that's it, click Create. A credential is just a fancy name for the connector; again, this can trip you up in the interview, so just be prepared. Okay, so now we are all set to
create our external location. Click on this, click Create External Location, and you can simply name it; I will simply say "raw." Another interview question! Okay, so many interview questions. Yeah, all interlinked, because these are the follow-up questions that the interviewer can ask: "hey, what is this? hey, what is that?" So you should be all prepared. Whenever you create an external location, you always create it down to the container level. So, just a quick IQ question for you: we have two containers; how many external locations do we need to create? The answer is two, because we create external locations at the container level. Very good. So now it is asking me to provide the URL. I will simply write "abfss://"; this is just the protocol, and here we need to put the container name, which is "raw," then the at sign plus the storage account name, which is "dbinterviewlake," then ".dfs.core.windows.net." And then we do not need to worry about anything further, because we are providing the external location at the container level. We could even make it more fine-grained, but it is always recommended to grant the external location access at the container level. Okay. Now it is asking for the storage credential: you have already created "ansh-credits." That's it, click Create. That's it, it is done. You can even click Test Connection, and it will simply test it; you can see Read, List, Write, Delete, Path Exists, Hierarchical Namespace, everything is enabled. Very good. Click Done.
Similarly, create one more external location for the other container, the destination container; otherwise you cannot write your data, and half of the question would be left pending. Okay. Click on External Data one more time, and I will simply name this one "destination." And the URL we already know: "abfss://", this time "destination" at the rate "dbinterviewlake," then ".dfs.core.windows.net." Okay, that's it, bro. The credential we already have. Click Create. That's it. Click Test Connection. Perfect, baby. So now we are all set.
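By the way, if you prefer code over clicks, external locations can also be created with Databricks SQL from a notebook. A rough sketch, assuming the storage credential we just created is named ansh_credits (underscores instead of hyphens, since they are easier in SQL identifiers):

```python
# Sketch: create both external locations in SQL instead of the UI.
# Assumes a storage credential named `ansh_credits` already exists.
for name, container in [("raw", "raw"), ("destination", "destination")]:
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {name}
        URL 'abfss://{container}@dbinterviewlake.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL ansh_credits)
    """)
```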
Now go to Workspace, click Create, and create a folder. Why? It's always good practice. I'll simply name it "DB interview." Okay, sorted. Now, within this, create a notebook, and we will simply call it "notebook one." And don't worry, I will upload all the notebooks to the GitHub repository, but, but, but: just try to write the code on your own. I have seen lots of learners say, "hey, can you just upload the notebook?" because "we are so, so, so lazy; we do not want to write the code, but we want to crack the data engineering interviews." Bro, have some water. Just show some enthusiasm to write the code, bro. I will upload the notebooks just for reference, in case you see some errors, and because I just want to upload them; that's why. Because I don't know why you would need to refer to the notebooks: you have everything on the screen. Just type it, bro. You see errors, you complain: "hey, I'm seeing errors." So what? So what? Let me tell you, if you are not aware of this: if you want to become a data engineer, more than half of your job will go into just debugging. Do not expect that you will be building a thousand pipelines in a day. No, you will build maybe one pipeline in a day, hardly, and for the next four days you'll just be debugging it. So just have this clarity, bro; otherwise reality will hit you, and you will say, "hey, which field have I entered?" So set your mind accordingly. Write the code. Okay, do not say, "hey, upload this, upload that." I can, but why don't I? Because I want you to write the code; I want you to grow and get the success. Okay, I know in the beginning it's not really easy. Make it a habit. Okay, enough psychological talk, philosophical talk. Okay, simply name it
"notebook one." Now let's start our development. So first of all, we know that we always connect a cluster to our notebook. Click on this Connect button, and this time you will see, hey, there's already one there. "Ansh Lamba, did you create a cluster and not show us?" No, bro: as per the latest update by Databricks, you already get one serverless cluster, which is always running for you. You can simply pick it and actually run your code, boom; no need to wait 10 or 15 minutes to spin up your cluster, because when you are learning, you would not want to waste your time on cluster creation, right? So simply pick this, and I will write a markdown cell. And I hope that I do not need to give an overview of notebooks, because this is just a notebook, right? And you are practicing interview questions, so I assume that you know some things about notebooks, right? Okay, and if you don't: in that video, everything is covered from scratch, bro, everything. So let me just create a
markdown cell and let me just say, hey, reading parquet data, because first we need to read it. So in order to read the data we can simply use the Spark API: spark.read.format. Okay, baby. Ansh Lamba, mind your language. What? Baby is a good word. I am a baby. What's wrong with this? Okay. So, spark.read.format, and the format is parquet. In case you're using any other file type, you can use CSV, JSON, any file; parquet is what I am using, so I will simply say parquet. After this, I will simply say .load. Why no schema? Because when we work with parquet files, the schema of the data is actually stored in the footer of the file, so I do not need to worry about defining the schema or anything. It's the best thing that I love about parquet. Now in the load section, obviously, I need to define the location. What's the location? We already know: abfss, container name is raw, at the storage account db interview lake, .dfs.core.windows.net. Within this I have one folder, right, it's called parquet data. Very good. Simply load this, and click on this close button, because we know our trial will be ending in 14 days, not a big deal. Okay, so it will simply load the data if everything is fine, and I hope everything is fine with the location and all. It takes some time with the first cell; don't say, hey, Spark is very slow. So now our data is loaded. In order to display it we can simply say display(df), just to make sure the data is fine. Okay. So it is saying display(df), and perfect.
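Put together, a minimal sketch of this read cell; the container, account, and folder names are just the illustrative ones from this walkthrough:

# Read the parquet files; no schema needed because parquet stores its
# schema in the file footer. Names below are illustrative.
df = (spark.read
          .format("parquet")
          .load("abfss://raw@dbinterviewlake.dfs.core.windows.net/parquet_data"))
display(df)  # quick visual check that the load worked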
So one thing to note: whenever you're using serverless compute, you cannot go to the Spark web UI; you can only see the performance view. But when you create your all-purpose cluster, you can actually see the jobs, click on one, and it will take you to the Spark web UI. Okay, so this is your data. For me it's good. So now what
I need to do: I need to create a kind of solution which will do a full refresh every time on the sink side, the destination side, plus I want to create a Delta table on top of it. How can we do that? It's very simple. You will simply say df.write.format, and this time you will write delta. Okay. Then you need to set the mode, and the mode is called overwrite. So basically we have four modes. Append is used when we just want to insert the data only. Overwrite will do a full refresh every time. We have a third one as well, called error: it will simply throw an error if any data is already there. The fourth is ignore: if data is already there, it will simply skip the write and do nothing.
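As a quick reference, a hedged sketch of those save modes on the writer; the path is illustrative:

# Full refresh every run, the behaviour described above.
path = "abfss://destination@dbinterviewlake.dfs.core.windows.net/parquet_data"
df.write.format("delta").mode("overwrite").save(path)
# The other modes, for reference:
#   .mode("append")  -> only insert new rows on top of existing data
#   .mode("error")   -> a.k.a. errorifexists (the default): fail if data exists
#   .mode("ignore")  -> silently skip the write if data already exists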
Okay. So the mode is done. Now what do we need to do? We simply need to say option. We need to define the path where you want to write the data, so I will simply say destination at, what is the storage account name, db interview. Perfect. Within this I want to create a folder called parquet data. Okay, the game is not over. You need to write something like .saveAsTable. When you write saveAsTable, it will simply create a Delta table on top of that parquet data. Now you actually have two options: one, you can create the table while writing; second, you can create the table once the data is there. You have both options, and I will show you both. Okay, so first of all we will simply write the data. I will simply say .save. It will simply write the data there and we are good. It is running, but I know it
will... oh, an error. Nice. What's wrong, bro? Path must be absolute? What do you mean? So this is our container name, this is db interview lake, .dfs.core.windows.net, and this is our folder. Okay, let me just scroll down. What's wrong? It is saying the problem is here, at save. Why? It is just a save call. Uh... okay, Ansh Lamba, who will put the protocol? I forgot the abfss:// prefix. Now it is fine. Okay, when I was creating those external locations the UI was not asking me for it, so, human error. So now the data is there. In order to validate this, simply go to your Azure, go to your data lake, open containers, and here in the destination container you should see this folder. Perfect.
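For reference, a hedged sketch of the corrected write with the protocol in place; names are the illustrative ones from this demo:

# The corrected write: note the abfss:// scheme at the front of the path.
(df.write
     .format("delta")
     .mode("overwrite")
     .option("path", "abfss://destination@dbinterviewlake.dfs.core.windows.net/parquet_data")
     .save())  # save() with no argument uses the "path" option set above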
And this is in Delta format. Now, in order to create a table on top of it, what do you need? First of all you need a catalog, then you need a database, or schema, same thing, and then obviously a table name. So first you would create a catalog: simply go to catalogs, click on this plus button, and click on create a catalog. The catalog name will be, let's say, DB catalog, that's it, because we need to use this name everywhere, so we cannot keep it very long. Now it is asking me for a storage account location. This is another interview question. Okay, we are creating a catalog, and we are not providing any location to this catalog. Now let's say I am creating a managed table. Creating what? Managed. Okay, I will discuss this in a separate question, because I would like to cover all the scenarios. It will be really good, and it can be asked in your interviews as well. So just for now we are not providing any location, and don't worry, in the next question I will cover all the scenarios. It's really important. Okay, simply click on create, and by covering those scenarios you will become a master of this hierarchical structure of catalog, schema, and so on. Then
it is saying catalog created, configure catalog, and it says: limit the workspaces in which users can access this catalog. I will say all workspaces have access, so all the workspaces which are attached to that particular Unity metastore will have access. I am fine with that. If I uncheck this box, I can pick assign to workspace instead and say, hey, just use it there. Owner: the owner is Ansh Lamba. Okay. Then it says grant privileges: choose which users or groups can access this catalog. All account users are granted browse by default. In this way you can pick the users; for now we are saying all account users, which means whatever users we have in this account can access this particular catalog, and we are not restricting anything. You can also pick a principal and say revoke: because currently all account users have access, if I click on the principal and revoke, I can actually take the access away. Obviously I do not want that, so all account users keep the access. Simply click on this and click on next; by the way, by default it is granted. Now the metadata step: it is fine, because we are not taking care of categorization within the catalogs. So our catalog is done. Now we can create a schema, either from here or from code as well; the code is simple, CREATE SCHEMA plus the schema name. I'll simply create it from here. The schema name is, let's say, DB schema. Okay. And again we are not providing a path here; don't worry, I will cover that particular thing in the separate question. So now let me just go
separate question. So now let me just go back to my notebook by clicking on recents and this is my notebook. Perfect. Now you will see something. Click on this ribbon and
see something. Click on this ribbon and click on this button. Now we have this catalog right? DB catalog. Click on this
catalog right? DB catalog. Click on this and in this we have schema. Perfect. In
this particular schema I want to create a table. Okay. I will simply say create
a table. Okay. I will simply say create table. Then I will put catalog name
table. Then I will put catalog name because we use three level naming space in Unity catalog. Catalog database table sorted. Okay. DB
sorted. Okay. DB
catalog. Okay. Dot DB
schema dot table. I would simply say parket data. Why? Because it is always
parket data. Why? Because it is always recommended to have the same name of your table and your folder.
some best practices. Okay. Now, create
table is done. Do we need to define the schema? It's up to me. It's up to me.
schema? It's up to me. It's up to me.
But I would not create the schema. Why?
Because in the delta log schema is already there. So, I do not need to
already there. So, I do not need to worry at all. Okay. I will simply say,
hey, just create table using delta.
Now using delta is I put I. Now again do we need to put single quotes or not?
Maybe not. Yeah. So in using delta as well it is also optional because now data bricks has made it default that you do not need to put delta. If you do not put anything it will still work because
it understands that you want to create a table using Delta. But I as a developer usually like to put it; it promotes readability. Okay. Now I will simply add one more thing: LOCATION, because we need to define where the data is on which we are creating the table. Now again, an interview question. See, I told you I am going to cover so many questions in this video, all real-time scenarios, with follow-up questions as well. The interviewer can ask you: hey, if I create a table on top of a location, let's say XYZ, it will simply create a Delta table on that location, makes sense. But what if I want a table on this particular location, ABC, and there is no data residing in that folder, in that location? What will happen? Tell me. What it will do is simply create a blank table on that location, and if you provide a schema here, it will create a blank table with that schema on that location. Then the moment that location receives files, the table will show the data. Okay, remember this. So I'm going to cover so many follow-up questions as well. Okay.
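A hedged sketch of the DDL being described here; the catalog, schema, and path are the illustrative names from this walkthrough:

# Create a table over an existing (or even empty) Delta location.
# If the folder is empty, this yields a blank table that starts showing
# rows as soon as Delta data lands in that path.
spark.sql("""
    CREATE TABLE IF NOT EXISTS db_catalog.db_schema.parquet_data
    USING DELTA
    LOCATION 'abfss://destination@dbinterviewlake.dfs.core.windows.net/parquet_data'
""")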
Because I know it's my responsibility: you have clicked on this video, and that means you will be getting lots of knowledge. It's my responsibility, it's my love for you, it's my everything for you. Okay, now for LOCATION I will simply copy this path and paste it here. Perfect. Simply run this, and do not show me any errors, otherwise... please. Oh, bro. What's the error, by the way? What is it saying? Oh, we need to add one more S. Hey, by the way, how did it work here? Oh, because that was just the write. Okay, let me just add the second S, ABFSS, because with Azure Data Lake Storage we use the secure ABFSS scheme, not plain ABFS. Okay, so just make sure; just a silly mistake. But yeah, again one follow-up question: why do we use ABFSS? It is recommended by Microsoft. Just talk to them. Is this the answer to the question? Is this a way to talk to your interviewer? Yes. Why not? If you have skills, just talk to the person like this. What's the big deal? Thousands of companies are waiting for you. Okay? If you do not have skills, then you need to think twice, or maybe at least ten times, before saying anything to your interviewer. If you have skills, just be okay. Okay? It's just a company.
You are the one who is an asset to this world. Okay? Just understand this thing, bro. The world is really changing. See, let's take an example of any big personality, Bill Gates. Okay, there were not many companies at that time, but he had the skills, so he had to do something with them. He opened a new company, a new organization, and now it's Microsoft. Now we have so many companies, so we just think that a job is the only possibility, the only thing important for your survival. No, no. Just think big and realize that the world is really big. Okay. And there is a world that revolves after 5:00 p.m. and before 9:00 a.m.; just live in that world as well. Okay. Again, personal choice. By the way, if you want to live in that world, you will suffer a lot. Okay? Sometimes you will be skipping food for days. So, if you're ready for that world, welcome. Otherwise, 9 to 5 is very good. Sorted. Sorted life. Okay. So, it's about choices. Okay. So, now what do we need
to do? Simply validate it. I will simply run a SELECT statement on top of it: SELECT * FROM this table name, and I should see the data. Okay. And then we will simply jump on to our next question. So this is my table, and this should show me some data. Perfect. Now again, another interview question. What's that?
We simply created a Delta table, makes sense, and we simply queried that particular table. Okay. But what if I had not created this table and had only written my data in this Delta format? How would you see the data then? Hm. Good question. The answer is very simple. We have something called delta dot and then the location; we can actually query the data directly, because we have the Delta connector. So what you need to write is simply SELECT * FROM, then delta dot, and now just put a backtick. What are backticks? Go one key above your Tab key, just to the left of your 1 digit; that key is called the backtick. So now simply write the location which you want to read inside the backticks. I want to read this location. Simply run this, and you can actually query the files directly instead of creating tables. And I'm not lying, see: same result. So, just a follow-up question. Not a follow-up question, this is an individual question. Okay. And this feature was not available before. It was added, not recently recently, yeah, but these kinds of features were not there before. Okay.
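A hedged sketch of that direct-path query; the path is the illustrative one from this demo:

# Query Delta files directly by path, no table registration needed.
display(spark.sql("""
    SELECT *
    FROM delta.`abfss://destination@dbinterviewlake.dfs.core.windows.net/parquet_data`
"""))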
Now what is our next question? See, I am covering so many questions within question number two. So do not feel like we are covering only a few questions; there are embedded questions within each category. So I would simply say category 1, category 2, category 3 instead of question one, question two, question three. Okay. I will divide it that way, otherwise people will just scroll the video and say, hey, only five to six questions are covered. Bro, just click on the video and watch the video. Okay, do not judge a book by its cover. You can judge, but obviously you would be a fool if you're judging a book by its cover, right? Okay, so what is our next question, or rather category, and what do we need to cover? Let me just show you what we have in the next category. I know that we want to discuss those scenarios of the hierarchical structure: what will happen if you provide the location at the metastore level, then at your catalog level, then at the schema level. Let's discuss that as our next question and see what we have after that. Now let's talk about question number three. Basically this is one question with multiple scenarios, and you need to tell what will happen in each scenario. Yes, it can be quick-fire questions: hey, just tell me what will happen in this scenario. So without
wasting any time, let me just tell you. First of all, we know that our Unity Catalog metastore has a location. If it has a location, I will simply make a blue sign here; a blue sign means ADLS. Okay, remember this, because we'll be using these conventions in the later scenarios. Scenario one: the Unity metastore has a location, the Unity catalog doesn't have any kind of location, and our database, or schema, also doesn't have any kind of location. Now I am creating, or let's say your interviewer says, I am creating a table. Obviously we need to use the three-level namespace: catalog, database, and then table name. Okay. Now, obviously this is a managed table, because if you were creating an external table it would simply go to its own location; a managed table is the one for which we do not need to provide the location. I am not providing a location at the table level. Which location will it pick? Tell me where it will go. The answer: it will simply go to the metastore location, because first it will go to the database and say, hey, do you have any kind of location? The database will say no, bro. So it will go to the catalog: hey, do you have any kind of location? It will say no, bro. Then it will ask the Unity metastore: do you have any kind of location where I have to put my data? It will say yeah, bro. So it will simply use that particular location. Sorted, scenario number one. Scenario two: now I am creating ADLS here as well, so basically I am providing a location while creating the catalog, a catalog with its own dedicated storage, which is not using the Unity metastore location. So whatever I'll be creating in that catalog will go to a dedicated location, and obviously, in order to do that, we use the access connector; we already know that. This time my managed table, which does not have any kind of location, will ask the database: bro, do you have any kind of location? It will say no. So it will go to the catalog: hey, do you have any kind of location, I need to put my data. This time it will say: yeah, I do. So it will simply stop there, and the data will go to this particular catalog's location. It will not go to the metastore's ADLS at all. That's it. Don't worry, I'll show you a quick demo, because obviously you need some content while answering this; just a quick demo, not very long, for the catalog, or let's say for the database. Scenario three: now let's say our database also has resources and says, hey, I also want to be an external database, so it also has a dedicated container. This time the table will ask: hey bro, I know you are very poor, but still, let me ask again, do you have any kind of location? This time the database will say: hey bro, I have, I have. So it will simply stop right there. It will not go to the catalog's location at all, and it will not go to the metastore's location either. No, it will simply be saved at the schema's location. Wow. Yes. Do you want to test it? Let's test it. These are the three scenarios that you should be aware of. And the fourth scenario is very simple: when the table itself also has a location, it is an external table, and the data will obviously be saved directly there. It will not ask anyone else.
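A hedged sketch of how scenarios two and three look in SQL; the catalog, schema, container, and account names are all illustrative:

# Catalog with its own managed storage (scenario two) and a schema with its
# own managed storage (scenario three). Names are illustrative assumptions.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS db_catalog
    MANAGED LOCATION 'abfss://catalogstore@dbinterviewlake.dfs.core.windows.net/'
""")
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS db_catalog.db_schema
    MANAGED LOCATION 'abfss://db_container@dbinterviewlake.dfs.core.windows.net/'
""")
# A managed table resolves its storage bottom-up -- table, then schema, then
# catalog, then metastore -- stopping at the first level that has a managed
# location (here, the schema).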
Okay, makes sense? So now let me just show you how you can actually do this with your data. In order to test this, simply go to your data lake: in the containers we have this container for the metastore, obviously, and it is empty because we do not have any kind of managed table so far. Okay. So what I will do is simply create one container; I'll simply say DB container, because this is the DB container. And Ansh Lamba, who told you to use an underscore? Okay, sorry. Okay, now, to test this we obviously need to create a database, and in order to create the database we need to provide the location. You can code it here, like: CREATE SCHEMA, and the schema name will be DB catalog dot, I will simply say, DB container. Okay, and I'll simply say LOCATION and put the location. In order to do that we should have an external location, and for now we do not have one, so we will simply create it. I'll simply go to this catalog, right-click on it, and open the link in a new tab, simply because I do not want to pause this. Okay, simply go to catalog, and see, these things are really detailed, and that's why it's really hard to actually crack the Databricks interview: everyone is focusing on writing the code, on developing the code. It's not just like that; you need to understand all the ins and outs, because it's really important. Okay. So now what will we do? We will simply create an external location under external data. But first, let me show you whatever error it throws if I run the schema creation right away. I will simply say abfss, and then my container is DB container, okay, and then at DB interview lake, and within that, I'm okay: I just want to create this directly within this particular location, and it should work. So now if I just run this, we should see something. And yeah, an error; it was expected. What is it saying? CREATE SCHEMA in Unity Catalog must use MANAGED LOCATION, not LOCATION. Oh, this is a different error. So we simply need to write MANAGED LOCATION; we need to add this MANAGED keyword. Now we should see the error regarding that location. Yeah, perfect. It is saying the external location doesn't exist. Perfect, because that location actually doesn't exist, because we have not created it, but I will create it right now. I will simply go to external data and say create external location. The external location name will be DB location.
Perfect. And what is the location? I'll simply copy the path from the code, and I'll simply say, hey, this is the location. Storage credential: obviously, this is the one, because this storage credential has Storage Blob Data Contributor on the whole data lake, on all the containers. Click on create, and: failed to access cloud storage. Why? Why? Oh, simply remove this trailing slash. Click on create. Wait, wait, wait, I know this error. Let me just fix the URL: abfss, db container, at db interview lake, .dfs.core.windows.net. I am so sure it is regarding the storage account location, the bucket path. Let me just see if it is the same. What error are you giving me? Failed to access cloud storage. Yeah, I know that; this is something related to this location. 'Enter the bucket path you want to use as the external location.' Is the spelling correct? Oh, see, Ansh Lamba is using an underscore, and that underscore is not residing in the container name. Now you will say, hey Ansh Lamba, it is also there; so that means we are right here. Oh, and this would throw an error anyway, because this container doesn't have any external location yet. See, trust me, trust this guy. Now, what's wrong? DB container, at DB interview lake. And again, I think, just a typo: it's not 'conotainer', it's 'container'. See, trust this guy. And we can simply run test connection. Everything is done. Very well done. Now you can simply run the schema creation again; it should run fine, because now we have the location in place. It sometimes just takes some time, usually like 1 to 2 minutes. Yeah, and I was just talking right now, and it works.
So now what I need to do is create a new table here. I will simply say CREATE TABLE, and the table will be DB catalog dot DB container dot, let's say, test table. Okay, test table, and I'll simply provide the schema, let's say id INT, that's it, just one column, because we just want to test. I'll simply say USING DELTA, okay, and LOCATION, obviously not, because we are creating a managed table. So run this and you will see what happens; and by the way, this is db container, not db contain. Okay, so what will happen? It will create a managed table, but not in the metastore: in our DB container, because that is assigned to that schema. And see, the metastore container is empty, because the data will not go there, but our DB container should have this Unity storage folder. Obviously we do not have any data yet, and that's why it shows like this: within this we have schemas, within this we have tables, this is the table ID, and this is the data. And what do we have in the table location? Just an empty data folder and the delta log, in which we store the schema, the metadata. That's it. Okay, makes sense? So this is the location, this is the hierarchy, and it is validated that the data simply goes to that database's, that schema's, location only.
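If you want to verify this from the notebook instead of the Azure portal, a hedged check; the three-level table name is the illustrative one used above:

# DESCRIBE EXTENDED shows the table's Location row, which should point at
# the schema's container rather than the metastore root.
display(spark.sql("DESCRIBE EXTENDED db_catalog.db_container.test_table"))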
See, now you know all the things. Now let's actually jump on to the next category of our questions, which is more about processing the data, and the next question is really interesting; it is like the bread and butter of nowadays, because of the excessive, or let's say massive, data that we process. Let me just give you a hint: it is related to incremental loading. Okay, let's see what we have in that particular question. So we have a second category of questions where we are more inclined towards processing of data: how we can process the data, and how we can process it effectively. Great question. And here comes the role of that massive hero of nowadays in Databricks. It's called Auto Loader, and it has made incremental data processing so easy and everything automated. Yes, there are some scenarios where you need to tackle some things, such as schema failures, schema changes, schema evolution, all those things, so don't worry, I will take some follow-up questions as well. You need to understand Auto Loader, and it's really important, because it is built on top of Spark Structured Streaming, and obviously you know Structured Streaming is really important nowadays. So it is built on top of it, and it enables so many cool features when you work with Databricks. And what's the scenario?
First of all, the scenario is simple. Let's say this is your source. Okay, in your source, you will be continuously receiving files: let's say file number one, file number two, file number three, file number four, and so on. Now, it's your responsibility to incrementally load the data to this particular destination. Okay? But you need to take care of one special thing: it's called idempotency. Don't worry, it's not what you're thinking. It's about exactly-once. Exactly-once means: let's say this file is here, okay, and we ingested this data here. On the second day, we have this new file, the second file. Now, instead of processing both files, you simply need to process only the new file and not consider the first one. That's the concept of idempotency, or exactly-once: once data is processed, you are not processing that data again. Now, a quick question the interviewer can ask you: hey, okay, everything is done by Auto Loader, but how does it do it behind the scenes? How does it achieve that? We have something like a little repository: it's called RocksDB. RocksDB is a key-value store, kept in a folder, which takes care of all the metadata of the files: which file is ingested, which file is not processed, and all those things. So it takes care of everything in that particular folder, and yes, this folder you'll be creating at the time you create your Auto Loader query. Don't worry, I'll show you where that folder resides and how it looks; everything is done by this RocksDB. By the way, we have two ways to detect new files with Auto Loader. One is file notification, and the second is directory listing through API calls. File notification is something similar to your storage events, triggered automatically when the storage receives a new file. For that you need to enable the storage events; it's not enabled by default, and you need to grant quite a few permissions in order to use storage events from Databricks. The second, simpler option is the listing API, and it works closely with RocksDB as well, so you do not need to worry about anything. And now let's see how we can work with Auto Loader in Databricks. Let me show you, and don't worry:
bricks. Let me show you and don't worry I have incremental files and it is in the repository. You can download it and
the repository. You can download it and let me just upload the data one by one and I will just show you how does it work. Okay. Now let's see. So let's go
work. Okay. Now let's see. So let's go to our Azure. Okay. And let's go to our containers and let's go to raw container. Let's create a new directory
container. Let's create a new directory and I'll simply say autoloader. Okay. Click on
autoloader. Okay. Click on
save and I will simply go inside this and I will simply upload the data. And
before that I can simply download the data. I think I already have. So I will
data. I think I already have. So I will simply you can see raw data first raw data second raw data third. So these are basically the files that we'll be ingesting one by one and yeah there will
be so much fun. Okay so simply go here upload just upload one file for now because we'll be incrementally loading the data. Okay so so I have uploaded my
the data. Okay so so I have uploaded my first data raw data first. Okay click on upload. So this is my first file. Now
upload. So this is my first file. Now
let's create our new notebook.
Okay, go to your workspace and simply click on create and then notebook.
Perfect. Let me just name it. I'll
simply say notebook 2. H that's fine. Okay. So now let's
2. H that's fine. Okay. So now let's attach this to serverless. And see it's so easy. So now I will simply add the
so easy. So now I will simply add the heading.
I'll simply say autoloader, or incrementally loading the files. Okay, perfect. So, in order to create your Auto Loader query, okay, don't worry, I'll show you the documentation as well; there is very nice code written there. You can simply copy and paste it, that is fine, because whenever you are writing the code you can refer to the documentation just for the parameters and all, and the code is really easy. How do we do that? You simply say df equals spark.read.format. What will be the format? You will simply say, Ansh Lamba, CSV. Use your common sense. We know that common sense is not common, but you can use your common sense. So, bro, the format is not CSV. What's the format then? The format is cloudFiles. Wow, what's that? So basically, hey, first of all, why are you giving us spoilers? Whenever we work with Auto Loader, we need to pick the format called cloudFiles. Don't worry, we will then simply say .option, cloudFiles.format, and there we'll simply say CSV. So we define CSV, but as a cloud file. Okay. Sorted. Very good. Now we have another .option, and it is very handy: it's called schemaHints. This is very handy. Why? Because with schema hints you do not need to provide the schema for the whole set of columns. You can choose to provide the schema for just your first column; that's it. You can simply say id int, and that's it, you do not need to provide the schema for the other columns. It's fine. You can just do that.
Obviously I'm not providing any schema, but just for your reference you should know about this. Okay. Now, once we have this, we need to create something called a checkpoint location. Checkpoint location, what's that? Basically, it's not the real checkpoint location of Structured Streaming; it's a checkpoint location for your schema. It is also a kind of checkpoint, but only for your schema, because in Structured Streaming we need to capture the schema of every file, and the option is called cloudFiles. See? Spoiler. I was just about to write the code and flex that I know the code, but, see, oh man. cloudFiles.schemaLocation, and I'll simply write it; I will not hit Tab, I want to write it out. So basically it's a kind of checkpoint location, but do not get confused: it is not the real checkpoint location. The real checkpoint location is something else, which keeps track of your current state and the previous state of the table, the metadata of the files, everything; your RocksDB, everything will be there. But we try to store this particular schema location inside the checkpoint location, so I just call it the checkpoint location for schema, because ideally you should not create different locations. No, you should pick the same parent location and, obviously, different folders. That's it. That's the best practice. Okay, because obviously in the interview they can ask you about the management side as well, like which best practices you would follow to tackle this. Okay. So, cloudFiles.schemaLocation: I'll simply say abfss, and then destination, or let's say raw, because I think the source is in raw, but it's up to us how and where we store it; it's not a big deal. I'll simply say destination: I want to store it in destination. So destination at DB interview lake, .dfs.core.windows.net. Yeah. And then within this I will simply say checkpoint. Yep, I'll simply say checkpoint. Now, one most important thing. It will be discussed, by the way, in the next question, because it is just a follow-up question, but just a hint: we will be mentioning how to tackle the schema evolution mode. By default it is addNewColumns, and the code for it is another .option; and it can be your interview question: what is the default schema evolution mode? I know that I'm not writing any schema evolution mode, but still, what is applied to my code by default? It is this: cloudFiles.schemaEvolutionMode, and by default it is addNewColumns. So whether I write this or not, it is the same, because it is already applied by default. Don't worry, I will show you everything before running this command. Trust me. And then, once it is done, we can simply say .load. And from where do we need to read the files? It is in the raw container, okay, at DB interview lake; and the folder, not sales, CSV, bro, it's called autoloader. Where is that? Yeah, autoloader. Perfect. And that's it.
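Putting the pieces together, a minimal sketch of the read side, assuming the paths from this demo; note readStream rather than read, a detail that comes up again in a moment:

# Auto Loader read; paths and the commented hints are illustrative.
df = (spark.readStream
          .format("cloudFiles")                     # Auto Loader source
          .option("cloudFiles.format", "csv")       # the actual file format
          .option("cloudFiles.schemaLocation",
                  "abfss://destination@dbinterviewlake.dfs.core.windows.net/checkpoint")
          # .option("cloudFiles.schemaHints", "id int")                # optional partial schema
          # .option("cloudFiles.schemaEvolutionMode", "addNewColumns") # the default anyway
          .load("abfss://raw@dbinterviewlake.dfs.core.windows.net/autoloader"))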
Let me just now show you the Auto Loader documentation. Search Auto Loader Databricks, what is Auto Loader, this one. So now just have a look at the code, and you will say, hey, where's the code, bro? Let me just show you. This page is just telling you what Auto Loader is; click on this, and click on this one, I would say, schema inference. Yeah. So now this is the code: as you can see, we have all these things, cloudFiles, then the format, then cloudFiles.schemaLocation, then load. And after that, when writing, we simply provide the checkpoint location, because whenever we want to read the data, we refer to it: hey, this is our checkpoint location where you need to write our schema. Yeah, it will simply put that particular file there before reading anything. Okay. Now, how does Auto Loader schema inference work? It is really important. I will talk about it deeply in the next question, like how it fails and what we need to do. Okay. And I'm just about to show you the default mode. Yeah, perfect. So, these are the modes: addNewColumns is the default mode, rescue is another very famous mode, and failOnNewColumns, which is not convenient, because obviously in day-to-day activities you cannot fail the stream if your system is pulling new columns on a daily basis. Again, it totally depends upon the design. Okay. And that's it; this is just the mode that we wanted to talk about, and I'm just finding that particular reference that we write. See, yeah, perfect: this is the one, cloudFiles.schemaEvolutionMode, and your boy has already put this particular code here. Simple.
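For quick reference, a hedged summary of the evolution modes as they appear in code; the path variables are illustrative assumptions:

# The documented cloudFiles.schemaEvolutionMode values:
#   "addNewColumns"    -> default: stop the stream, merge new columns into the
#                         stored schema, succeed on restart
#   "rescue"           -> keep the sink schema fixed; unexpected values land in
#                         the _rescued_data column
#   "failOnNewColumns" -> fail, and keep failing until the schema is updated
#   "none"             -> ignore new columns entirely
checkpoint_path = "abfss://destination@dbinterviewlake.dfs.core.windows.net/checkpoint"  # illustrative
source_path = "abfss://raw@dbinterviewlake.dfs.core.windows.net/autoloader"              # illustrative
df_rescue = (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "csv")
                 .option("cloudFiles.schemaLocation", checkpoint_path)
                 .option("cloudFiles.schemaEvolutionMode", "rescue")
                 .load(source_path))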
So now I will simply run this, and let's see if we have any errors. It's fine, because errors are good. Don't worry: whenever you develop something, you are not a machine, and even machines make mistakes while writing code. You write the code, you see the errors, you debug it. And when you do not see the errors, that means you are just kidding. Errors are good, bro. Okay. So this is my query, but nothing actually happened so far, really. Yeah, because this is, you can say, streaming. Okay. And by the way, if you have observed one thing, because I have observed: we simply need to use readStream, not just read. Okay. So simply say readStream. Perfect. So now, let me just run this. It will simply define the stream, but it will not initiate it, because in order to initiate anything we need to provide the action. And what is the action? And I'm receiving a call; let me just pick that call and continue, and if this is such a scam call that broke the flow, don't worry, I'll simply report it. Okay, so I was saying that this streaming query needs an action in order to be performed, so I will provide the action right now. I will simply say df.writeStream, okay, and then .format: in which format do I want to write? I will simply say I want to write my data in the Delta format, let's say, okay, in Delta format, and you can simply refer to the documentation code; as you can see, where is that, writeStream, yeah, perfect. By the way, if you do not provide any format, it will simply write the data in Delta format anyway. And then you simply need to write .start, that's it. You can even write your data into a table; for that you need to use an action called .toTable, if I'm not wrong. Okay, so I will simply say, yeah, .format delta, then .option, and the most important thing is the one that we discussed: checkpointLocation. Okay, and now we do not need to write cloudFiles dot anything, because here it is understood that it is just the checkpoint location. Okay. And I will simply pick the same parent path with the same folder, because I told you it is really important. Then, oh, I just hit the run button by mistake, so do not take that as completion. Okay. Now I will simply say .start, because we do not have anything else. Okay, I will simply say .start. And now we need to give the location where I want to write the data. So I will simply say: write my data to this particular location. Obviously not in the checkpoint; oh, did you just... obviously I do not want to write my data directly in the checkpoint location. Okay, let's write it, not a big deal; I just want to write my data in the destination and not in the checkpoint. So it is fine. Or I can simply create a folder called data. That's it. Okay, makes sense to me. Let's run this, and now it should initiate the query without any errors.
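A hedged sketch of the write side as assembled here; the checkpoint and data paths mirror the illustrative ones used above:

# Auto Loader write: delta sink plus the mandatory checkpointLocation.
(df.writeStream
     .format("delta")  # delta is also the default sink format
     .option("checkpointLocation",
             "abfss://destination@dbinterviewlake.dfs.core.windows.net/checkpoint")
     .start("abfss://destination@dbinterviewlake.dfs.core.windows.net/data"))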
Trigger type. Oh yeah, very good. So now it is saying, hey, this trigger type is not supported for this cluster type. So we have to create a kind of all-purpose cluster in order to perform this. We will simply go to our compute and quickly create one all-purpose cluster. Click on create compute. Single node, it's fine. Unrestricted, not personal compute, because I don't want to fill in all the boxes. And which runtime should I pick? 16.3, yeah. And node type, obviously the minimum one. Terminate after 40 minutes, fine. And click on create compute. It will take, I think, just 3 to 4 minutes to create your cluster. Once it is green, we can actually attach our notebook to that cluster, and it is fine. But now I would just take you to the next question and assume that our notebook, this notebook two in recents, will be running fine; it will be loading the data incrementally as it is. We just need to run it and it will run fine, I know. Okay. Now the next question is... or let's process this first. Okay, let's go slow. Let's wait and actually see what we have, and let's validate this first. Okay, this is just processing the data incrementally, because we have CSV second and CSV third, and let's see how we can actually incrementally load the data. Let's see. As you can see, simply attach it and click on confirm. And obviously we need to run this one more time, and let's click on this as well.
Oh, so obviously it will take some time to run the first cell, because it is just warming up the machines, yeah. Okay. So now is the best thing: you can see stream initializing. What is that? It is the Spark Structured Streaming graph, which we can actually watch going up and down: how data is coming in and how we are processing more and more data. Bro, it is really helpful, because you will see the data being processed in real time. As you can see, the graph goes up, boom. Why? Because there was a lot of data compared to zero, so the graph went really up, and you can see the processing rate and input rate. Now it goes down, because there is not as much data; now it is zero again. Now it will be running continuously. Why? Because this is a cluster, bro. This is a cluster and we are performing streaming, and in some solutions we need to perform streaming, right? So now I will simply query the data first of all. Okay. And how can we query the data? I have already told you: SELECT * FROM delta, dot, backtick, then abfss and destination at DB interview lake, and then the data folder. Simply query this and you should see the data without any errors. Okay, what's that? Again a typo: destination at DB interview lakea; it's lake. Okay: path does not exist. It does not exist? Wow, where are you writing the data, man? Okay, so you are writing the data within... oh, this was a bad one. Let me just stop this; by the way, not a big deal, because what it is doing is simply writing the data into the destination container directly, instead of inside an autoloader folder. So we have destination, okay, and we have checkpoint within this location, as you can see. This is our checkpoint, and we are not actually writing the data within an autoloader folder, and yeah, it's fine. I thought I would be writing the data inside an autoloader path, but no; and I should have chosen the autoloader container, because I specially created it, but it's fine, it's not a big deal. So within this you will see something called data; this is my real data, right? So I can obviously query data, and you should see the data. Okay, so you will see the data. And why am I querying this data? Just to confirm the number of records, so that you can see the count go up while idempotency holds. Okay, 95
rows. Okay, perfect. Now let's add one more file, and the moment I'll be writing, or let's say inserting, the data... I will simply go to my source. What's my source? It's raw, autoloader. Yeah, this one here. I will just upload one more data file. So, I have uploaded the second file; let me just click on upload, and you will see the magic here. See, now I uploaded the data and it is processing this data. See, the graph is going up again: real-time data and idempotency. As you can see, the graph is up. Can you see that blue line? See this one. Now let's query this data one more time, and you should see 95 plus some records; I think it was 31 records, if I'm not wrong, for a total of 126 records. And I know it has just processed those 31 records, that's it, because we only have 126 records; otherwise we would have at least double the 95 records. Common maths, right? I have done my bachelor's in mathematics, so I'm really smart in mathematics. Really? No. So, 126 records. Okay. Now we need to cover the second question, which is strongly aligned to this category. But yes, it is totally a different category, or let's say a totally different question, because that question can fit in any category: it is schema evolution. So
the next question is strongly aligned here. Let's say I am adding my data here in the source: this is my first file, this is my second file. Let's say on the third day someone has uploaded the data, in CSV format obviously, but in that particular CSV we have a schema mismatch. Whoa, what will happen in that particular scenario, and how can we tackle it? So the thing is this. Simply go to, not raw, just go to destination, go to checkpoint, and within this we have these folders created. Click on sources, and here we have RocksDB; as I told you, in this we have all the tracking files. So if you go to sources, this is your zero folder, and if you go to the checkpoint metadata, here it is just writing the metadata. Click on edit; obviously, we cannot really read it, and this is the ID. We can only open it in the editor, but yeah, it's a good way to preview it. And if you want to see the schema, it is in a folder which is created by the cloudFiles.schemaLocation option. Don't worry, I will validate all this information. Okay. Simply go to schemas, and this is your schema. Go to edit, see, this is your schema; the schemas folder is created by cloudFiles for schema evolution. Okay, so this is your schema. So now what will happen? Let me just add one more file and show you what will happen, how it will happen, and how we can tackle it. Okay, all the how and what, everything will be answered right now, in just a few minutes.
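For orientation, a rough map of what that checkpoint folder typically contains after a run; the exact layout can vary by Databricks runtime version:

# checkpoint/
#   sources/0/          -> Auto Loader's RocksDB file-tracking state
#   metadata            -> stream id and metadata (mostly opaque)
#   offsets/, commits/  -> Structured Streaming progress bookkeeping
#   _schemas/0          -> inferred schema versions used for evolution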
So, what we need to do: we will first upload the data in the raw zone, but before that, let me just make some changes in the code. First of all, let me click on interrupt, because I want to stop this query. I want to say: hey, if my source file has schema evolution, I want to rescue those columns. That means on the destination side I do not want to add any further columns, but at the same time I want to store those values. I want to store those columns, and how can we actually do that? How? In this particular scenario, we simply need to say rescue. The moment we say rescue, what will it do? Let me show you the data first, in which we have the rescued data column, which is created by default for you. So all the data which does not match the schema will be dumped there in JSON format, so that all your data can fit in one column, in the form of, obviously, key-value pairs. Okay. So in my third file I have a schema mismatch; I have done that intentionally so that I can show you. And all those values will be going into the rescued data column, and we do not need to change any schema on the sink side. So this is just a kind of overview of how it happens behind the scenes and how it is actually being taken care of. So what I will do right now is show you the documentation and what will happen, step by step. It is really important, because I want to go really deep into this. It's really important because it is a new topic, and you can be hooked like this if you do not know everything. It's very easy to hook you like this, like this, like this.
what will happen, and how does Autoloader schema evolution work? Let's say you have a new file. Okay. So what will Autoloader do? Autoloader detects the addition of new columns as it processes your data. So when it processes my third file, it will detect the new columns. Okay. Then it will simply stop the query with an error which says unknown field exception, because we do not have anything in place which says: hey, just ignore this. But before your stream throws this error, what will happen? Who does it? Autoloader. Autoloader performs schema inference on the latest file and updates the schema location with the latest schema by merging the new columns to the end of the schema. That means it will simply go to my data lake, update that _schemas folder, and add the new schema there, even before throwing the error.
Now, a very good question. If the interviewer is skillful, he or she should ask this: if my schema is updated before throwing the error, why am I receiving the error? Why? And I spent, I think, the whole day to figure this out when I was learning it. I was like: why? If my schema location is updated and it knows the updated schema, why is it throwing the error? Why? You would say: you would just need to write rescue, or addNewColumns. No, no, addNewColumns is already there by default.
So the thing is, listen to me carefully. Whenever we start the query, it caches the schema which is there in the source, which is there in the _schemas location. Okay. Now, on the next run, if the schema has evolved, Autoloader will update the schema location, but it will not update the cached schema. Okay. So that is why it says: the cached schema is this, and the stored schema is this, and then it throws the error, because they mismatch. Okay. And it does that just before the write stream step; let's say whenever we call .writeStream, it checks this before entering the writing zone. Okay, then it throws the error, and then we simply need to rerun the query. That's it. The rerun will simply update the cached version of your schema. This is really, really deep.
Okay. And everything is written here in the documentation. So if you say I'm wrong, you'd have to say the documentation is wrong. Okay, this was really, really deep. I spent the whole day on it. I still remember that day. I was like: what's going on, man? I thought: what's going on? But you do not need to rack your brain, because you have subscribed to my channel. If not, bro, don't talk to me. And if you have, just drop a lovely comment in the comment section right now. It really makes me feel happy. And do you want me to feel happy? If yes, just drop a lovely comment. What are you waiting for?
So what will we do? We will simply upload one more file here, in the destination... or sorry, in the raw zone. And then we will simply say Autoloader and third file; let's upload the third file with the updated schema. Let me just click on upload. Okay. Now let me run this query, and let me run the writing part as well. This is again the streaming initialization, and let's see how many records we have. Obviously, we are expecting idempotency. And this time... let me just click on this graph, because I really like seeing this graph. So it has already processed the data. And why didn't we see any error? Because we just updated the cache by rerunning the query, and this time the evolution mode is called rescue. So let's see, do we have anything in rescue? Let's see. Simply run this, and it should show 226 total records this time instead of 126. And yes, 226: obviously, we have 100 more records.
And this time we have rescued data. And again, just common sense: this column will only be populated for the new records, because for the previous records there is no rescued data. So now, if in the future we receive any file with new columns, they will simply go here. And what is the new column? It is called return_flag, and that's right, I have added return_flag. So now, a quick question for you. Let's say I have everything in place. If I now add a new file with a new schema, will this query fail? Yes. And now you know the reason: because the schema is cached and we have not updated it. Okay. So how can we tackle this scenario in production? Obviously, we can use something like a try/except block in Python, in which we simply say: hey, if the stream fails, rerun it, because we just need to refresh the cache.
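A rough sketch of that production pattern, assuming a start_stream() helper of your own that builds the Autoloader query above and waits for it to finish; the single retry is what refreshes the cached schema:

    # Sketch: restart the stream once if schema evolution stops it
    # (start_stream is a hypothetical helper wrapping the readStream/writeStream above)
    try:
        start_stream()
    except Exception as err:
        # Autoloader has already written the evolved schema to the schema location,
        # so one rerun picks up the new cached schema and processes the file.
        print(f"Stream stopped, retrying once: {err}")
        start_stream()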
Okay. So now you know everything about the data processing side of things.
Okay. Now we need to cover something amazing, which is a newer feature in Databricks. It is called volumes. What is that? Let me show you. Now it's time to talk about another category of catalog objects, and it's more about managing and governing files. What does that mean? So let's say in your interview, your interviewer asks: hey, you are only ever governing tables. Every time you use a catalog and a schema, you are only creating tables, right? You're only creating tables. Is there any way to actually govern files directly, instead of tables, using your schemas and catalogs? Yes, the answer is volumes. We have something called volumes.
And don't worry, I'll tell you everything about volumes. Simply click on create and click on notebook, and let's create a new notebook. I will simply call it notebook 3. By the way, I had already recorded this particular lecture and the next one as well, and then I realized that the mic was not working. Wow. Wow. Wow. Simply pick this cluster. So I'm just re-recording, just for you. Don't worry.
So now, what do you know? With the help of a volume, we can do a lot of stuff. We can manage files, and we can actually do a lot of things. But how does it work? So, just to explain this, and so that you can efficiently answer everything about this topic in your interviews, simply go to your containers. As I just mentioned, I already recorded this, so I'm re-recording it. So basically, go to your destination container, okay, and simply create a folder with any name. Any name. Make sense? Okay. Then go inside it and create another folder. Okay. And then simply upload the file that we are currently using. It is a parquet file, or you can upload any file. Now what will we do? We will simply go to Databricks, and, first of all, let me just make some space. Okay.
And let me simply write volumes. Okay. So this is a volume. Now, as you can see, I want to govern that particular folder, which is called my volume. So what can I do? For now, if you just click on this catalog and click on this schema, you will only see these two tables. That's it. We do not have anything else. So we can create something called a volume. And let me show you volumes in Databricks and how you can create them. Okay. Simply click on this documentation. So basically, we again have two types of volumes, managed volumes and external volumes, similar to tables. And you will feel that a volume is actually a table, but it is not a table. Basically, it behaves like a table, but it holds files. Okay. And you will address it the same way: catalog, schema, and then the volume name.
Okay. So now, first of all, let's create an external volume, because we know that we have data in our external data lake; after that, we will create our managed volume. And the code is exactly the same. The code is simple: CREATE EXTERNAL VOLUME and then a LOCATION, that's it, similar to your table creation. Okay, now let me just do that. I will simply say CREATE VOLUME, and then obviously the volume name, and the volume name will be db_catalog.db_schema.my_volume. I want to create the volume with this name. And obviously the location: I want to pick abfss, then destination, at the rate the storage account name, which is dbinterviewlake, then .dfs.core.windows.net. Perfect. Very good. Then, within this, I have a folder called my volume. Perfect. Simply run this. Wow.
What is this? Oh, I forgot to mention EXTERNAL. Simply run this again. Because when we are creating an external table, we do not need to actually write EXTERNAL; by default, a table with a location is created as external. But here, for volumes, we need to put EXTERNAL explicitly. Okay.
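Putting the two attempts together, the working statement looks roughly like this; the catalog, schema, storage account, and folder names are the ones from this demo, so substitute your own:

    # Sketch: external volume over an existing ADLS folder (names from this demo)
    spark.sql("""
        CREATE EXTERNAL VOLUME db_catalog.db_schema.my_volume
        LOCATION 'abfss://destination@dbinterviewlake.dfs.core.windows.net/my_volume'
    """)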
So now just refresh this catalog, and you will see something called volumes. See, now we can actually see this volume, and if you click on its dropdown, you will see that folder, parquet data. That means we can now govern these files the same way we govern our tables, and this is really, really nice. By the way, in Microsoft Fabric as well, we have something called files. So this is exactly the same as that.
Okay. So this is really, really nice. And now you will be saying: hey, for tables we simply use SELECT * FROM table name, but what can we use for a volume? What is the way of querying it? So, in order to query the volume, there is a special path structure provided by the documentation; I will also show you the simple way, don't worry. So first of all, you simply need to type SELECT * FROM, then /Volumes/, then your db_catalog, then your db_schema, then your volume name, which is my_volume, and then you simply need to provide the path. My path is... oh, sorry, not my volume, it's parquet data. Perfect. So you can simply run this, and obviously you need to say parquet. Oh, let me just add it here: parquet dot. Perfect, because our data is in the parquet format. So we
have to tell it this. So now you'll be saying: hey, what's the difference between this and when we just query the delta files directly? First difference: those files cannot be governed; these files can be governed. Second difference: that method only works well with delta, while this way we can use any file format. But that's not the main difference; the main difference is that we can actually govern the files. Okay. So very good.
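As a sketch, here is that query in full; the path follows the /Volumes/<catalog>/<schema>/<volume>/... convention from the docs, and the folder name is this demo's:

    # Sketch: read a parquet folder that lives inside the volume
    spark.sql(
        "SELECT * FROM parquet.`/Volumes/db_catalog/db_schema/my_volume/parquet_data/`"
    ).show()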
So now you'll be saying: okay, okay, we have understood external volumes. What about managed volumes? Very good. So let's create a managed volume. CREATE MANAGED VOLUME, and it will be my_managed_volume. Okay. And this time I will simply add db_catalog, obviously, then db_schema. Let's run this. MANAGED... wait, wait, wait. Oh, so we do not need to write MANAGED. Bro, what is this way of doing it? You should just keep it consistent: MANAGED or EXTERNAL, pick one style. What is this? So, okay, this is done. Let me just refresh this. So, it is already refreshing for me.
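So, for reference, the corrected statement is simply the same command without the EXTERNAL keyword and without a LOCATION clause, which is exactly what makes it managed; a sketch with this demo's names:

    # Sketch: managed volume -- no EXTERNAL keyword, no LOCATION clause,
    # so the files land in the metastore-managed storage
    spark.sql("CREATE VOLUME db_catalog.db_schema.my_managed_volume")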
Nice, nice, nice. Because I will show you how you can actually upload files into the managed volume. Volumes are a new addition in Databricks; they were not there if you compare with the previous versions. So they were not there. And I don't know why it is taking so long to refresh; obviously, it is a managed volume, so Databricks needs to take care of all the data itself. So now you'll be saying: hey, where will the data be going? Obviously, to the metastore container, because we are not providing any kind of location to the catalog. So: db_schema, and then you can see two volumes.
This is my managed volume, and this is my_volume, which is an external volume; the other is an internal, or managed, volume. So now, let's say you want to put some data into it. How can you do this? You can simply click on these three dots and click on upload to volume. This will simply go to this particular location. And where will the managed volume be saved? The same way: managed volumes are equal to managed tables. If there is a location on the schema, it will take that. If there is a location on the catalog, it will take that. Otherwise, it will take the location of your metastore. That's it. Exactly the same. Okay. Because Databricks does not hold anything itself. No. Yeah, in the free versions it does, just for your sake. That's it.
And now, the easiest way to get the location. So let me just show you how we can upload the data, because otherwise you will complain: hey, you didn't show us how to upload. Simply click on these three dots, and I will first create a directory in the volume, because it's always good practice to create a directory. I will simply call it parquet data. Click on create, and perfect. Now let me click on these three dots, upload to volume, and let me upload the file. Perfect. So I can query this the same way, and I will show you the easier way: simply write SELECT * FROM parquet. and then click on the file and click on these two arrows; it will insert the location for you right here. So you do not need to worry at all. Simply run this, and you will see that the data is here.
And where is the data residing? In the managed location. Where is that location? We already know: in the metastore container. See, this is my volumes folder, in the metastore. This is parquet data, and this is my file. Ta-da. See, this is the illustration that you need for the interviews, and you need to be really, really confident. So as you can see, the data is written here. So this is just the catalog, then we have this volume, like, this is the volume ID, and this is the folder. That's it. Okay. So just be aware of this in the interviews.
So now, this was all about our volumes, data processing, and how we can manage different kinds of data. Now, let's cover some questions related to optimization techniques. Or, forget about optimization techniques for a moment: first we should know how we can tackle problems related to rollbacks. Let's say you are in a prod environment and you just did some wonders, and now you want to go back to the previous version. So you need to know data versioning, you need to know time travel, and all these features are really, really important, because you cannot work without them; you will make mistakes, and you need to be able to roll back to your previous version. So how can you do that? Don't worry, I'll show you. These are the hottest topics right now: time travel, data versioning, all these things, because they strongly align with the open table formats as well. So this is the hottest topic right now. So I will simply go to my workspace, click on create, and create a new notebook. Okay. And simply name it notebook 4.
Okay. Now, first of all, let's connect. And I don't think we need that other cluster anymore, so I will simply say: hey, just terminate this cluster. Simply terminate this cluster. I don't know why they do this; serverless was better. Yeah, just terminate this. They should just add those capabilities to serverless compute, like streaming. Yeah. So yeah, our notebook is connected to this serverless compute. Okay. So now let's write our heading, and it will be data versioning and time travel.
So for that, let's create our table. This lecture was also recorded before, and there was a table; the mic was not working. Okay, don't worry. Let's create another table. So what do we need to do? We need to first create a table, then we will insert some data, then we will make some mistakes intentionally, and then we will try to roll back those changes to the previous version of the delta table. Make sense? Very good. Let's do that. Let's create a table.
So I will simply say CREATE TABLE db_catalog.db_schema.my_delta_table. Okay. Then simply say id INT, then maybe name STRING, and then salary, and this is also an INT. Okay. And this will be an external table, so I will give it a location: abfss, destination, at the rate dbinterviewlake.dfs.core.windows.net, and then let's say data/delta; it's a good name. Okay, let's create this table. And now, quickly, let's insert some data. I'm just recreating the exact interview environment, okay? So the interviewer will say: you have this table and you want to insert some data. You will simply say: okay, INSERT INTO db_catalog.db_schema.my_delta_table. Perfect. Then you say VALUES, and you insert values such as: 1, name can be, let's say, Nora, and then salary, let's say 1,000. Okay. Then a second one: any name, Rahul, salary is 900. Why? Then a third one... Ansh? No, Ansh is out of the league. Um, Sophie. Who is Sophie? And let's say she's earning more, 2,000. Let's insert these records.
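Pulling that walkthrough together, the setup is roughly these two statements; table, column, and path names are the demo's own:

    # Sketch: external delta table plus the three demo rows
    spark.sql("""
        CREATE TABLE db_catalog.db_schema.my_delta_table (id INT, name STRING, salary INT)
        LOCATION 'abfss://destination@dbinterviewlake.dfs.core.windows.net/data/delta'
    """)

    spark.sql("""
        INSERT INTO db_catalog.db_schema.my_delta_table
        VALUES (1, 'Nora', 1000), (2, 'Rahul', 900), (3, 'Sophie', 2000)
    """)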
So your interviewer will say: hey, you have this delta table. Delta table is a keyword here, so pay attention. Okay, you have this delta table, you have this data, and you want to delete Rahul. Bro, Rahul is already earning the least; why do you want to delete that person? So, no personal hit, okay, just a hypothetical table. So, we want to delete Rahul, like, id equals 2, Rahul. And the interviewer will say: hey, just write the code. And you will say: okay, DELETE FROM db_catalog.db_schema.my_delta_table WHERE id... and the interviewer will say: hey, stop, stop. Simply run this command without the WHERE condition. And you will say: bro, are you mad or what? No, obviously you will not say this; you will simply say: okay, let's run this. And now the interviewer will say: hey, what have you done? Then obviously you should say: bro, are you mad? You just told me to run it. No, don't talk to him or her like that. The intent behind this is: now the interviewer will ask you: hey, just bring back the records that you have just deleted. Yes. And for this, you might say: okay, let me just insert the records again. No, no, you cannot answer like this. Okay. Okay.
So, what's the solution? The solution is that you need to know how to time travel. And in order to perform time travel, you need to know how to find the versions of the table. So, in order to find the versions of the table, there is a very simple trick. We have something called DESCRIBE HISTORY. Pick the SQL language, and then you just need to give the table name: db_catalog.db_schema.my_delta_table. That's it. And this way you will see all the versions of this table. And that's what we want.
Perfect. So we have version zero, we have version one, we have version two. That means we want to go to version one, right? Because version two is the delete operation. Makes sense. So how can we go back to a previous version, or any particular version? We have something called the RESTORE command. We can simply say RESTORE TABLE, and then the table name, which is my catalog... not my catalog, db_catalog.db_schema.my_delta_table. Very good. Simply pick SQL and say TO VERSION AS OF 1, because I can pick any version of my choice, and I know version number one is exactly the one I want. And this way you will see that your data gets restored.
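So the whole rollback, end to end, is roughly these three statements (same demo table as above):

    # Sketch: the "mistake", inspecting versions, and rolling back
    spark.sql("DELETE FROM db_catalog.db_schema.my_delta_table")                       # version 2: oops
    spark.sql("DESCRIBE HISTORY db_catalog.db_schema.my_delta_table").show()           # list versions
    spark.sql("RESTORE TABLE db_catalog.db_schema.my_delta_table TO VERSION AS OF 1")  # undo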
Yes, see, it is done. Now, if you query this table, you should see all the records. And now you can easily reply to that person and simply say: hey, here is your data, the task is done. Where's my offer letter? So, obviously, bro, just show some confidence. It's always good to be overconfident in interviews. Trust me, it's always better to be overconfident instead of underconfident. Just mark my words, okay? Just be overconfident in life. Just be overconfident. There are scenarios where you need to be underconfident, but in the corporate world, in interviews, just be overconfident. Just behave like you know everything. Why? Because there's no one who knows everything. So what? So just behave like you know everything, and that way you will come to know a lot. Okay, just philosophy classes along with the data engineering classes. Okay. Okay. Okay.
Okay. So this was all about the data versioning and time travel interview questions that can be asked of you, and now I hope you can answer any question related to this. Now, let's talk about some data optimization questions that are being asked right now. Why? Because with the rise of data, every organization wants to optimize performance. Earlier, these concepts were new, so organizations were going crazy testing so many new things. Now they have data, now they're running the queries, and the queries are running really, really slow. And if you know how to optimize those things, you are going to be an asset for them. They cannot ignore you. They cannot say: hey, we cannot hire you. Because if you know how to optimize the queries, if you know how to optimize the performance... they are looking for the person who can optimize their existing solutions. Trust me, every single organization is facing optimization issues right now, because data is huge, data is growing rapidly, solutions are new, concepts such as the lakehouse are really, really new, engines are still developing, and if you know these hacks, bro, trust me, this is kind of an X factor. Okay, let's talk about those questions.
Let me just create a new notebook... or let's continue in this notebook, not a big deal. So now, let's say the interviewer says: hey, you have this table, okay, you need to optimize the performance, and you have so many files under the hood for this table. How can you do that? So basically, the first thing you can say is that we have several options available in order to optimize tables. The first approach is the OPTIMIZE command itself. Okay, it's the OPTIMIZE command. And what does this command do? You need to explain that as well, like why we need it. So the first way is running the OPTIMIZE
command. So what can we do? We can simply say: hey, optimize the table. We'll simply say OPTIMIZE and then the table name, db_catalog.db_schema.my_delta_table. Okay, simply pick SQL. So this is the code for it. And what does it do? It performs a compaction operation on your parquet files. So let's say you have 1, 2, 3, 4, 5, or, let's say, many, many small files. When you run the OPTIMIZE command, it will merge those small files and produce parquet files of the ideal size, and the ideal size is like 1 gig, okay, around 1 GB per file. So this is one optimization.
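As a sketch, with this demo's table name:

    # Sketch: compact many small parquet files into ~1 GB target files
    spark.sql("OPTIMIZE db_catalog.db_schema.my_delta_table")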
Okay, the second optimization is called the ZORDER BY command. It's called Z-Ordering, basically. Now, what is this Z-Ordering? Z-Ordering cannot be applied independently; we have to use the ZORDER BY clause with the OPTIMIZE command. So we'll simply say OPTIMIZE db_catalog.db_schema.my_delta_table and then ZORDER BY, let's say, the id column, or any column; it totally depends upon the situation. And we usually put here, in the ZORDER BY clause,
the columns we are using to prune our data. Okay, so what is this? Let's take the same example: you have so many files. Okay, that is the issue in companies right now. OPTIMIZE will create files of the ideal size, that's true, but along with that, what will Z-Ordering do? It will apply sorting on the data. Okay, it will simply sort the data. And what happens when it applies the sorting? Basically, there is a concept called data skipping: we can skip some files based on the data. So let's say you are reading some data, and the IDs you want reside only in this particular file, and not in that one. Okay. So the engine will read this file only, and it will not read that file; it will simply skip it. That improves the performance.
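Combined, a sketch assuming id is the column you filter on most:

    # Sketch: compact AND co-locate rows by id, so min/max stats let files be skipped
    spark.sql("OPTIMIZE db_catalog.db_schema.my_delta_table ZORDER BY (id)")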
This should be your answer, delivered with confidence, bro. Okay. And now the person can ask a follow-up question. He or she will say: how does it decide that it needs to skip this data, or that this ID is not residing here? It's because of the statistics of the first 32 columns. So basically, in delta tables, there is a feature, on by default, that collects statistics for the first 32 columns. So your id column should also be within those first 32 columns. Okay. And if it is there, the statistics of that particular column will be there for each file: you will have the minimum value, the maximum value, everything. So the engine can simply see: hey, this is the minimum value and this is the maximum value for this file, so we do not need to go to that particular file. So just try to explain everything. And I would say, even if the person is not asking these things, if you know that you know, and if you are confident that you can explain, just try to put in some extra points, just try to highlight that you know much more than what they are looking for.
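If you want to back this up with code, the number of stats-indexed columns is controlled by the delta.dataSkippingNumIndexedCols table property (32 by default); a sketch of widening it, with the value 40 chosen purely for illustration:

    # Sketch: collect file-level min/max stats beyond the default first 32 columns
    spark.sql("""
        ALTER TABLE db_catalog.db_schema.my_delta_table
        SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40')
    """)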
Okay. If the person is just asking: hey, what is the ZORDER BY command? You will tell that person what ZORDER BY is, and then steer the topic a little and say: hey, this works because it calculates the statistics of the first 32 columns, due to which data skipping happens, due to which it reads the data faster. The person would think: this person knows things really, really well and generally has deep skills. Okay. So again, like: Ansh, are you a philosopher? No, no, no, no, no. I just want to, um... okay, okay, I should not talk about that. Okay, okay, okay, I have already talked about that. I was just about to mention TEDx. So, okay. Okay. Okay.
So, this was all about ZORDER BY, the OPTIMIZE command, or, you can say, the data optimization techniques that we have within Databricks. And there's one more, a new one, which is called liquid clustering. It creates dynamic clusters on top of your data, according to the columns and the dynamic query behavior. So let's say you are running one kind of query more often, in which columns A, B, C are being used, obviously in a pruning condition. It will try to create a cluster of that particular data, so whenever you query the data, it will simply go to that cluster and grab the data. So that is liquid clustering: dynamic clustering. Okay, let me just take you to the documentation. Liquid clustering Databricks.
So, see: delta liquid clustering replaces table partitioning and ZORDER BY. That means we do not need to use ZORDER BY... oh, sorry, ZORDER BY with the OPTIMIZE command, when we are using liquid clustering. Okay. Then, what does it do? Liquid clustering provides the flexibility to redefine clustering keys. That is something else: obviously, you can just alter the table to do that. And then: liquid clustering applies to both streaming tables and materialized views. That is important. And obviously, Databricks recommends using runtime 15.2 and above with liquid clustering. It was in preview; now it is in general availability. And obviously, there are some recommendations. And how can we enable liquid clustering? It is very easy: it can be enabled at the time of table creation. You simply need to write CLUSTER BY and then put the columns. As you can see here, let me show you. I think it is here. Yeah, see: CREATE TABLE and then simply put CLUSTER BY, that's it. And this one is a CTAS command; if you don't know, a CTAS command is just like CREATE TABLE, okay, but with a location and the data to fill it. If you just run a plain CREATE TABLE with a location, it will create a table on that particular location, and it will be an empty table; but if you add AS SELECT, it will also put the data in that location.
Again, extra knowledge, but it is really handy. CTAS: create table as select. Okay, so CREATE TABLE AS SELECT is popularly known as CTAS; here, because this is an external table, it first moves the data to the location and then creates the table on top of it. Okay. Extra knowledge, extra knowledge, extra knowledge. Okay. And this is the code if you want to alter your table: if your table is already created and you have not enabled liquid clustering, you can enable it with ALTER TABLE as well. And there is a new feature, in public preview, which is automatic liquid clustering. That means you do not even need to define the columns; you simply say CLUSTER BY AUTO, that's it. I can just show you, see: it will simply say, hey, pick the columns automatically, we are not hardcoding anything. So this is all about the optimization techniques that we have.
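A sketch of the variants mentioned here, per the liquid clustering docs; the lc_table and lc_ctas names and the cluster keys are illustrative:

    # Sketch: liquid clustering at creation time, via CTAS, redefined, and automatic
    spark.sql("CREATE TABLE db_catalog.db_schema.lc_table (id INT, name STRING) CLUSTER BY (id)")
    spark.sql("""
        CREATE TABLE db_catalog.db_schema.lc_ctas CLUSTER BY (id)
        AS SELECT * FROM db_catalog.db_schema.my_delta_table
    """)
    spark.sql("ALTER TABLE db_catalog.db_schema.lc_table CLUSTER BY (name)")  # redefine clustering keys
    spark.sql("ALTER TABLE db_catalog.db_schema.lc_table CLUSTER BY AUTO")    # public preview: automatic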
These are basically, you can say, the latest questions, and I personally felt that these questions were not actually being covered, because the topics are really new and the situations are really new. So, what we need to cover, we will cover; that is why I decided to cover these particular questions. And now you might think: okay, these questions were more inclined towards Databricks. That's correct. But whenever you sit in Databricks interviews, or, let's say, Databricks data engineer interviews, it is very obvious that the person will be asking PySpark-related questions as well; that is common sense. But the point is: everyone is covering PySpark questions, and that's good, but the Databricks questions were not out there. So I decided: let's take the responsibility, and let's help you actually crack the interview. And now you are really good with Databricks interview questions. Obviously, in the future, we'll be covering more and more questions, because, as I told you, Databricks is evolving really, really rapidly. So what's your next step? What's your next step? You are all set for Databricks. You know almost all the functionalities. Okay. You have tackled so many real-time scenarios as well. Okay: schema evolution, incremental loading, so many things.
Right now, your next task should be preparing for the PySpark questions, because in Databricks interviews, 60%... or let's say 50-50... or let's say 40%, and 40% is not a small number, okay, 40% will be from these Databricks areas, because these Databricks areas are really, really new. And now it's time to actually cover the PySpark questions, including the PySpark coding round. How can you prepare for that? I have created a dedicated video on exactly that: how you can prepare for PySpark coding interviews. Let me just show you. Let me go to incognito mode and search YouTube. Search 'Ansh Lamba', and let me just see... yeah, this one: PySpark interview questions. So this video, this one, just search for it. This video has covered all the PySpark coding questions using PySpark functions, window functions, ranking functions, or, you can say, Spark SQL functions, everything. So it is pure coding questions, pure coding. So simply go there. Again, I have created dedicated, real-time scenarios in the coding round as well. If you want to enjoy and learn a lot, simply go there and cover all those questions, and obviously the Databricks questions are also done. You are all set. Trust me. And just carry one more thing into your interviews:
confidence. Confidence. Okay. Just confidence. Just go there, okay, and just say: I'm going to kill it in this interview. Okay? So just make up your mind. Okay? And trust me, you will be clearing your interview this year, and I'm just waiting for your message, okay? Saying: yay, I have just cracked the interview. I feel so happy when you send me messages, or, let's say, when you comment on the video that you have cracked the interview. I feel really, really happy. So, I'm just waiting for your comment. Just drop a lovely comment for now, that you have learned a lot, and then, obviously, once you crack the interview, simply come back to this video and comment. Why? It's our love, right? Okay.