
Azure Databricks Interview Questions 2025 [WITH REAL-TIME SCENARIOS]

By Ansh Lamba

Summary

Topics Covered

  • Interviews Demand Hands-On Solutions
  • Unity Catalog Centralizes Governance
  • Delta Location Hierarchy Determines Storage
  • Autoloader Ensures Exactly-Once Loading
  • Z-Ordering Enables Data Skipping

Full Transcript

There are so many job postings for Databricks, and the numbers are growing with time. But why is it so difficult to actually crack a Databricks interview? Because the dynamics of this application are changing rapidly, and you need to be aligned with the latest trends and the latest interview questions. That's why I have created this three-hour-long video covering all the latest Databricks interview questions, all the real-time scenarios, plus the conceptual questions as well.

So if your aim is to actually crack a Databricks interview this year, then let's get started with this video, and be serious. What's up, my data fam? How is your Sunday going so far? I know it's really, really good, and if not, it will be. So basically, after receiving so many comments and so many messages, it's now time to talk about Databricks interview questions.

This is an evergreen topic, and I personally feel there should be a new video on it at least every two to three months, because the dynamics of Databricks are changing rapidly. And it's not just Databricks; it's the dynamics of the whole data engineering industry. Just tell me one thing: if you are sitting in an interview, are you still being asked the same questions you would have been asked, say, two to three years back? I'm not talking about the fundamental questions; obviously the fundamentals will remain the same, like what distributed computing is or how the Spark architecture works. I'm not talking about those things. But now you would see

that questions are evolving more toward delivering results using these technologies, such as Databricks and Azure. Interviewers are more inclined toward use of the technology, because they want to hire data engineers who obviously have the foundational knowledge and deep conceptual knowledge; all of that is fine, but at the end of the day, after covering the fundamentals, engineers and developers need to build solutions using these technologies. And trust me, the industry right now is so hectic that the moment you are hired, you will land directly on a project. How can you expect to land as a Databricks data engineer and think, "okay, I will just learn this technology while at the company"? No, bro, you have to develop the solution, because that's why the company is hiring you and that's why they are paying you. And if you are hired into a consultancy, you can expect to be working on projects from day one or day two, or at most after one week. So what you need is deep hands-on experience, and that's why interview questions right now revolve more around the technologies along with your

solutions. Makes sense? So in this particular video we'll be discussing many of the latest questions, because Databricks has so many new features available right now. I will try to include all those scenarios, and we are not going to cover only simple theoretical questions. We are going to focus on end-to-end questions which will involve your design skills, questions where you need to think through a solution. It's not just straightforward; you can develop the same solution using a different approach, but you need to think, "okay, we could approach the solution this way, or this way, or this way as well." In this video, the number of questions will totally depend on the flow as we go along,

because my intent is not to cover 100 or 200 questions like "hey, just tell me what Databricks is," "tell me what a data lake is," "tell me what a Delta table is." No, those are not the interview questions. My intent is not to cover lots of questions; it is to cover all the latest questions, end to end, so that if the interviewer asks follow-up questions, you are well prepared: ask me anything. These questions relate to your projects as well. Let's say you have built a project and you have used a technology called Databricks. The interviewer can then ask you some follow-up questions, and those questions are very obvious ones. They will be covered in this video as well.

So without delaying further, let me get started with this video. Now you will be asking, "Hey Ansh, what is the prerequisite for this video?" Just a basic understanding of Databricks. Obviously, I have a Databricks video on my channel; if you have watched that video, you are all set for this one. If not, go and check that video first, and just drop a lovely comment on this video and on that one as well. Now let's actually get started. In this particular video we'll be creating an Azure Databricks resource, because earlier I was thinking of going with the open-source version, but the thing is, you would not actually grasp all the concepts that way. So we'll be using Azure Databricks, and yes, you'll be learning a lot in this video. Just be with me, with some excitement and some enthusiasm, and you're all set, bro.

You're all set. Okay, so let's get started and create our first thing, which is an Azure account, and then, obviously, Databricks. Let's see. In order to create your free Azure account, the steps are very, very simple. And if you are my data fam: hey, by the way, if you haven't clicked the subscribe button, just do it right now. And if you haven't shared my videos with your friends, that means you are not a true friend, because if you shared the video, that friend would gain a lot of knowledge, right? So just spread the positivity. I know everyone is talking about, "Hey, there's so much competition,"

and this, and that. Okay, everything is fine. This is life, and life is not easy. We know there are challenges, but we know there's a way. We know how to walk, how to overcome the hurdles, how to make sprints, how to run the marathon, and how to stop. Stop? Not really; just walk. So the thing is: we know there are challenges, we know there's competition, we know AI is here, and still we are going to win. Why? Because you are my data fam. So first of all, let's create our free Azure account, and the steps are very simple: in incognito mode, search for "Azure free account," click on the first link, and perfect, here you will see "Try Azure for free." By the way,

you can even click on this one, "Pay as you go," but you do not need to spend any money. Just click on "Try Azure for free," and it will simply redirect you to a page. So this is the page, and the steps are very simple. You need to put in your email ID: a Microsoft email ID, not a Gmail ID. The moment you enter your Microsoft email ID, click Next. Now, some of you will say, "Hey, we do not have a Microsoft email ID." Don't worry: click on "Create one."

Now, my personal advice: do not create your new account from here. Sometimes it gets you stuck on the quiz part. What is the quiz part? You need to verify that you are not a robot. If you do this step on a mobile phone, it works fine, but on a laptop it sometimes gets stuck. So you simply need to click "Create one" on your mobile, and once the account is created, you can simply put it here. The

moment you enter your email account, it will ask you to fill in a form with your name, address, phone number, and so on, and at the end you simply click Sign up. The moment you click the Sign up button, it will ask for some more details, such as card details. Don't worry, it is just for confirmation: Microsoft confirms that you are the one who will be using the services, because obviously, if you are, you should have some financial details, right? Just fill that in, and your account is free to use for 30 days. You do not need to worry at all. Awesome. So now

let me take you to the Azure portal, because the Azure portal is the place where we create everything, all our resources. Don't worry, this is not an Azure masterclass, but we do need to go inside Azure to create our Databricks workspace. So simply go to Google, search portal.azure.com (this is the link), and hit Enter.

It will then ask you to put in the email ID you just created for your registration. The moment you enter it, it's done: you will land on the Azure portal. Let me show you how it looks. So this is my Azure portal, and I know it could look different in your case.

Why? Because these are the Azure services I have used so far, or maybe just clicked on; these are some of my recent ones. So do not worry: the rest of the things should be the same. Let me give you a quick overview. There is nothing special on the homepage, but there are some things you should know. First of all, click on this ribbon; the most important things are these two tabs, "All resources" and "Resource groups." By the way, what is a resource group? You should know about this, bro. You're sitting in interviews, right, bro? Okay, no worries. Okay, so

a resource group is basically just a folder in which we store all our resources. Now, don't ask me what resources are; let me just tell you. Resources are simply the services we use from Azure. For example, Azure Databricks is a service, Data Factory is a service, Data Lake is a service. Very good. So for everything we create, we need a resource group. That's it. The rest of the portal shows things like the most commonly used services, some monitoring capabilities, and cost management and billing management.

You will not be taking care of those right now, because you're just using a free account. And when you land the role of data engineer, you will take care of them there; or actually, maybe it will be a data architect or someone from the data governance team, but you can still take part. If you want to work with a startup, that's a good thing, because you can wear so many hats, with the possibility of so many new learnings. Right? Very good. Okay, once everything is done, now

let's create our resource group. How can we create one? Simply click on search and type "resource group." Perfect. Now, I have so many resource groups so far, and I'm so lazy that I haven't deleted any of them. Very well done, Ansh Lamba, good going. So simply click Create, and I will name it, let's say, "databricks-interview." And the region: let's pick one. Which country do you want? UK South? Let's pick UK South. Perfect. Click "Review + create," and that's it; click Create, and here is your resource group created. You can simply search for it, and it should be there. Simply click refresh,

and where is it? Go to Home and simply search "DB interview." Yeah, perfect, here you can see the resource group. Now, what is the first thing that we need to create? Obviously,

we need to have a data lake. Why? Because, bro, Databricks is not a storage solution. It is your transformation layer, your data processing layer on top of your data, and a data warehousing layer on top of your data too; yeah, they are investing so much in improving their data warehousing capabilities. At the recent summit they announced, I think, their own data models. I haven't gone through the documentation yet, because the summit is still going on, so I can have a look later, but it sounds really amazing; they are planning, I think, to build something related to data reporting, which is a really cool feature. Okay, let's create our data lake quickly: click the plus button, go to Marketplace, and simply search "storage account."

We have a storage account here; simply click the Microsoft one, then click Create. See, now your resource group field is automatically filled. Don't worry, that is not an interview question, but I do have a tip that could be one. Let me show you.

First of all, we need to give a storage account name. What name should we provide? I will simply pick "dbinterviewlake." It's a good name. One thing: you cannot pick the same name I'm picking here, because the name has to be globally unique. You could put "ilovedbinterviewlake"; I'm just kidding, you do not need to write that. So you can simply say "dbinterviewlake," and that's it. Very good.
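As an aside, the format rules behind that "unique name" constraint can be checked before you ever touch the portal. A minimal sketch (the regex encodes Azure's documented format rules for storage account names; actual global uniqueness can only be confirmed by Azure itself, for example with `az storage account check-name`):

```python
import re

# Azure storage account naming rules: 3-24 characters, lowercase letters
# and digits only. This checks format only, not global uniqueness.
NAME_RE = re.compile(r"^[a-z0-9]{3,24}$")

def is_valid_storage_account_name(name: str) -> bool:
    """Return True if `name` satisfies the format rules for storage accounts."""
    return bool(NAME_RE.fullmatch(name))

print(is_valid_storage_account_name("dbinterviewlake"))    # True
print(is_valid_storage_account_name("DB-Interview-Lake"))  # False
```

A name that passes this check can still be rejected if someone else already owns it, which is exactly why you cannot reuse the name from the video.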

Then the region is automatically picked, and the primary service setting is fine as it is; still, I can show you that what we want to create is Azure Data Lake Storage Gen2, and if you just leave this field as it is, it's not a big deal. Next, performance: Standard or Premium; just go with Standard. Then this is important: redundancy. Basically, we have four different redundancy options: LRS, ZRS, GRS, and GZRS. The cheapest one is LRS, locally redundant storage, in which the replicas of your data are kept within the same data center. Simply pick that one and click Next.

And this is important: in order to create a data lake, you need to check this box ("Enable hierarchical namespace"); otherwise it will simply create blob storage. Blob storage is not a data lake; a data lake is built on top of blob storage. The basic difference is that in blob storage you cannot create hierarchical folders, while in a data lake you can.
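To make that difference concrete, here is a toy sketch (with made-up file names) of how a flat blob namespace behaves: "folders" are just shared key prefixes, which is why directory-level operations are expensive without the hierarchical namespace.

```python
# Toy model of a FLAT blob namespace: no real directories exist, only
# object keys that happen to contain '/' characters.
blobs = [
    "raw/sales/2025/01.csv",
    "raw/sales/2025/02.csv",
    "raw/hr/employees.csv",
]

def list_flat(prefix: str) -> list[str]:
    """'Listing a folder' in flat blob storage is just prefix-matching keys."""
    return [key for key in blobs if key.startswith(prefix)]

print(list_flat("raw/sales/"))
# With the hierarchical namespace enabled (ADLS Gen2), 'raw' and 'raw/sales'
# become real directories that can be renamed or permissioned as one object,
# instead of touching every key that shares the prefix.
```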

Okay, check the box, then click "Review + create" and hit the Create button. It will deploy your data lake, and it will hardly take a few seconds, trust me. Let me click refresh in case it is not done, but I think it should be. Yeah, as you can see, these are the deployment details, and it has just started; the storage account status is "Accepted," which means Azure has accepted our request to create that particular data lake and is now creating it. Let me see. Yeah, perfect, now it is done; it has deployed. You can either click "Go to resource" or simply go to Home, search your resource group, and you will see your data lake created, called "dbinterviewlake." So this is our data lake.

Now the second thing: we need to create the Azure Databricks resource. That's it; we just need these two resources. I was thinking of not creating an external data lake, but in real-time scenarios, in real-world interviews, they will ask you questions about external data lakes, because we do not use the managed data lake. That is why I chose this approach; see how protective I am of you. So let's create our Databricks resource right now; without delaying, let's actually create

it. So, in order to create your Databricks resource, simply go to Marketplace, search "databricks," hit Enter, pick Azure Databricks, and click Create. Yeah, perfect. As you can see, here as well we need to give a workspace name. I will simply say "db-interview-workspace" so that it aligns with the naming convention we are using. Then the region is UK South. Now here is the thing: which tier do we need to pick, Premium, Standard, or Trial?

Basically, Standard is not good, because with Standard we do not get all the features. With Premium, yes, we get everything, but it is paid. You are using a free account, so it is not a big deal for you, but those using a paid account who do not want to spend much can simply pick Trial (Premium). What is that? It is just like Premium, but only for 14 days; just the trial version. And

then we need to put in a managed resource group name, which is not mandatory, by the way. You can be asked in interviews why we have this and what its role is. So here comes your first question, the unofficial first question, and that's my way of presenting it: I'll raise a question whenever it fits naturally, rather than working through a rigid "what is, how is, when is" list. So, what is this managed resource group, and why do we need to care about it? So basically,

we have two different planes within Databricks: the control plane and the compute plane. The control plane holds all the interfaces, the web UI and UX, everything you do in the Databricks workspace; that lives in the Databricks-managed area. But all your virtual machines, and historically your managed tables' storage, go into the compute plane, in your own Azure subscription. Whenever we create, say, job clusters, you will actually see those virtual machines and hard drives being created in that particular area, and that area is your managed resource group.

After the introduction of Unity Catalog, we no longer use the managed resource group for managed tables; we just use it for the clusters. That's it. Now, what is Unity Catalog? Don't worry, we have a dedicated question on that. So for now, you can either put in your own dedicated name for it, or it will simply pick a default name for you. Then click Next; everything else is fine, so you can directly click "Review + create," scroll down, and click Create. That's it; that's your Databricks workspace.
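For reference, the same three resources can be scripted instead of clicked through the portal. This is only a sketch that assembles the Azure CLI commands as strings, using the names from this video; verify the flags against your installed `az` version, and note that `az databricks` requires the databricks CLI extension.

```python
# Sketch: Azure CLI equivalents of the portal steps in this section.
# The strings are only printed here; run them in a shell with the Azure CLI
# installed ('az databricks' needs: az extension add --name databricks).
rg, loc = "databricks-interview", "uksouth"

commands = [
    f"az group create --name {rg} --location {loc}",
    # --enable-hierarchical-namespace is the 'data lake' checkbox from the portal
    f"az storage account create --name dbinterviewlake --resource-group {rg} "
    f"--location {loc} --sku Standard_LRS --kind StorageV2 "
    f"--enable-hierarchical-namespace true",
    f"az databricks workspace create --name db-interview-workspace "
    f"--resource-group {rg} --location {loc} --sku premium",
]

for cmd in commands:
    print(cmd)
```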

And I'm really, really excited to tell you our first question for today. This first question is really important, because as a developer it's very important to know how to set up the environment. And let me tell you, bro, we are not going to set up a basic environment in Databricks; we have to use the modern features of Databricks, and that's the intent of this video.

So I will be using something called Unity Catalog. Unity Catalog is nothing but, you could say, the modern way of governing your resources within Databricks. And how do we enable Unity Catalog? We need to enable something called a Unity Catalog metastore.

Okay, makes sense. While the workspace deploys, which will take just a few seconds, let me quickly show you how the Unity Catalog architecture works. Let me go to Google, search for Unity Catalog, and pick this documentation page. I just want to show you this image, because it is really nice, the best one for understanding. So

this is the setup we were referring to before Unity Catalog: here we were managing independent workspaces, and in every independent workspace we were managing compute, a metastore, and obviously, if we were governing things, the access as well. After the introduction of Unity Catalog, this is the Unity Catalog model.

catalog. This is unity catalog mode.

Actually we are enabling something called as unity meta store. So unity

meta store comes at the top level. Okay.

Then within that we create something called as cataloges and those cataloges are called as unity catalog. Why?

Because they are united and that particular unity catalog that you'll be creating right now. Don't worry it is just a highle overview. So that unity catalog you will be creating can be

accessed through different databicks workspace as well. Wow. So it is totally united. That's why it is called unity

united. That's why it is called unity catalog. And within that your compute

catalog. And within that your compute will be there. It is independent. It

obviously we can govern it. Obviously we

can just uh manage it check the lineage everything is there but compute is residing in the dedicated workspace like the dedicated workspace resource group

manager or let's say default resource group which is also called as manage resource group. Okay. And then let me

resource group. Okay. And then let me show you this architecture. So this is the

architecture. So this is the architecture. See at the top we have

architecture. See at the top we have metas store. This is called as unity

metas store. This is called as unity metas store. Earlier you were using hive

metas store. Earlier you were using hive metas store but now we have unity metas store. What is a metas store? Again this

store. What is a metas store? Again this

can be a question. So we are going to cover literal questions like this. Okay, just be informal while learning and very formal while answering them. A metastore is nothing but the repository where you store all the information about your data: data about data, metadata. So let's say you're creating a table, a database, a schema, volumes; everything will be recorded in the metastore.
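To pin down "data about data": a metastore is essentially a registry keyed by object names. A purely illustrative sketch (the real Unity Catalog metastore is a managed service, not a Python dict; the names and path below are made up):

```python
# Toy metastore: a mapping from a fully qualified three-level name
# (catalog.schema.table) to metadata ABOUT the object, never the data itself.
metastore: dict[str, dict] = {}

def register_table(catalog: str, schema: str, table: str, location: str) -> str:
    """Record metadata for a table and return its fully qualified name."""
    fqn = f"{catalog}.{schema}.{table}"
    metastore[fqn] = {"type": "TABLE", "format": "delta", "location": location}
    return fqn

fqn = register_table(
    "db_interview", "bronze", "sales",
    "abfss://data@dbinterviewlake.dfs.core.windows.net/bronze/sales",
)
print(fqn)             # db_interview.bronze.sales
print(metastore[fqn])  # metadata only: type, format, and storage location
```

Dropping an entry from this dict would not delete any files, which is exactly the metadata-versus-data distinction the question is probing.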

Earlier, we were creating independent Databricks workspaces, and if I created anything, a table, a database, a schema, I was writing that information into the managed resource group, the one we saw while creating the workspace. We stored all that metadata there, and if we created a managed table, the data itself used to go there too. But then think about it: let's say your organization has 20 Databricks workspaces, 20, which is a very common scenario. Those 20 workspaces would have 20 different storage accounts just to store the data and metadata of your managed tables. Twenty storage accounts. So what did we do? We said,

storage accounts. So what we did we said hey just hang on. So what we need to do now we will simply create one metas store which will be linked to so many

database workspaces and one metas store obviously would only have one managed resource group. Okay. And now this

resource group. Okay. And now this resource group will be managed by us. So

that is why we will call it as let's say in the external link that we have created. So this meta store is there.

created. So this meta store is there.

Don't worry we'll be creating that metas store and we will just link that metas store with our data lake as well. And is

really a best practice that we should create external uh location with our unity meta store. It is optional but we should always.
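Once Unity Catalog is enabled, that external-location setup is usually expressed in a few SQL statements run inside the workspace. Here is a sketch that just assembles those statements as strings; the credential, location, catalog, and table names are illustrative assumptions, and the syntax follows Databricks' Unity Catalog SQL as I understand it, so check it against the current docs before running.

```python
# Sketch: Unity Catalog SQL one might run in a workspace (via a notebook or
# spark.sql) to wire the metastore to an external data lake. Every name and
# URL below is a hypothetical placeholder; substitute your own.
url = "abfss://metastore@dbinterviewlake.dfs.core.windows.net/"

statements = [
    # A reusable, governed pointer into the lake
    "CREATE EXTERNAL LOCATION IF NOT EXISTS interview_loc "
    f"URL '{url}' WITH (STORAGE CREDENTIAL interview_cred)",
    # A catalog whose managed tables default to that location
    f"CREATE CATALOG IF NOT EXISTS db_interview MANAGED LOCATION '{url}'",
    "CREATE SCHEMA IF NOT EXISTS db_interview.bronze",
    "CREATE TABLE IF NOT EXISTS db_interview.bronze.sales (id INT, amount DOUBLE)",
]

for stmt in statements:
    print(stmt + ";")
```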

Okay. So that is just the high-level overview of what we have within Unity Catalog, and your basics should be really clear first; only then can we do anything. That's common sense, and I know it is not really common nowadays, so I'm just taking care of everyone. Okay, so now that we know the concept, let's say I have this metastore.

Now, whatever I do, let's say I create a managed table, a schema, a catalog, anything: all that information will go into this external lake instead of the managed resource group. And not only from this particular workspace; every workspace attached to this metastore will write its information to this external lake. That's it; that's the concept of it. Let's see if our resource is ready. Yeah, it has deployed; it is ready. So now what can we do? We can

simply go to Home, go to our resource group, and there are our two resources. So let me show you question number one and actually get started with real-time interview question number one. This will be your hottest question right now, and I have already given you some groundwork for it, so you will understand it.

Now, the question: let me tell you using a scenario. So you have a Databricks workspace. Okay, and let's say this is your Databricks workspace, and you are a developer (let's use blue color for the developer), he or she, whoever. Okay, perfect. So you are a developer and you have this Databricks workspace. Makes sense. Now you

need to set up this workspace in such a way that, first of all, it can access this particular lake, which is a data lake. But the thing is, this is an external data lake; that means it is not a managed data lake assigned to this particular Databricks workspace. So you first need to attach this data lake to the Databricks workspace. And there are some

conditions; you cannot just develop something on your own. I'm giving you some points that you have to consider. First, you need to use something called an external location. Why? So that you can actually reuse this particular access. Second, you need to allow whatever you build within this Databricks workspace, say schemas, databases, tables, functions, volumes, everything, to be used across multiple Databricks workspaces; you need to develop a shared setup that can be used across multiple workspaces, and obviously you need to take care of the data governance and everything for it as well. And a hint for you. "What hint, bro? You will be showing us anyway, right?" Yes, but I still want you to try it on your own, and look at the solution only if you cannot do it. So this is a kind of situational question that you need to tackle. The hint is: you need to use Unity Catalog, you need to enable a Unity Catalog metastore, and that is your hint. Sorted, sorted, sorted. So now

let's see how we can actually tackle and approach this question, and I will guide you step by step through everything that is necessary. So let's solve it. Let's go to our Databricks workspace: simply click on it, click "Launch workspace," and then pick an account. This is just your normal account, and this is the external one; you would know, when you go to Microsoft Entra ID, under Users you see this account, and when you click on it you see this long email ID. By the way, this is very handy: whenever you want to use the Databricks workspace, you should always use your normal account, but whenever you want to use the account console page, you need to use that long account.

long account. So this is our database workspace. Wow, that looks so cool.

workspace. Wow, that looks so cool.

Right now they have just changed the UI.

Earlier used to black or gray here. I

used to like that particular version more. So yeah, no worries. No worries.

more. So yeah, no worries. No worries.

No worries. Okay. So this is our database workspace. Let me just increase

database workspace. Let me just increase the screen size a little bit. And

Perfect. So now, first thing, just a quick overview. Nothing special; obviously you would have some knowledge, that's why you are watching interview questions, right? So nothing special, everything is the same. We have this left pane for workspace, recents, catalog, workflows, compute, marketplace. Compute was really important before, but now we have one thing available by default, which is serverless compute. So we do not need to worry about a job cluster right now (job clusters are for production) or an all-purpose cluster; we will simply use serverless, and it will be ready for us to use. Wow, simple. And then we have Genie, which is just like a chatbot by Databricks. Okay. And then we have some machine learning things. If you are into AI and GenAI, obviously go to Playground, pick your model, build something using the API, build something cool. Okay. So that is all about Databricks right now. I'm telling you again, a lot of changes are happening right now, and not just in Databricks; in all the applications.

Okay. So now we need to simply create our metastore. How can we create the metastore? We simply need to click on this dropdown and click on manage account. When you click on this you will see all the workspaces, okay, and simply click on manage account. If you do not see this button, you can simply watch my YouTube video on Databricks Unity Catalog from 5600; that is the exact timestamp. See, I'm just saving you a lot of time. So simply say "I love you" in the comments.

Yeah, it's up to you if you want to say it. Okay. So when you click on it, you will land on the admin console. The thing is, when you log in for the first time with Databricks, you will only see the manage account button with your default account. And right now my default account is this one. So if you're logging in to Databricks using the default account, you should see it. But if you do not see it, do not worry. You can simply click on that particular page and it will ask you to put in your ID. And if you want to go to manage account, you cannot simply go using the normal Gmail account. You have to use the long email account, because that is the one registered with your default directory. Okay. So these are some admin things you should know.

Now I will simply click on manage account, and here you will see I'm already logged in, because I am super smart. No, I have just logged in here before. So if I click on it, see, this is not my normal Gmail account; this is the ext one. So ideally what I prefer: I simply create a new user in my Entra ID and I prefer using that particular account. I have one, I think; let me just go to the console page and see user management. See, I have created a dedicated account for my Databricks Unity Catalog. I use this, and if I want to make any changes, I prefer using this one. And in order to make any changes in the Databricks workspace, this account should be, what's the name of that, I think global admin. Yeah. So if you go to Entra ID, this account should be the global admin. You can check the roles: click on roles and see global admin. By the way, you will get everything about this in that particular video. These are just admin things, so do not worry at all.

So now the main thing: as you can see, in the workspaces tab we have all the workspaces listed, right? Very good. So now we need to create a new Unity metastore. Simply go to catalog and click on create metastore. Perfect. So quickly, quickly create a new Unity metastore, and I'll simply call it DB interview metastore. Makes sense? Now the region: you can pick any region; I will pick UK South. Why? Because, oh, another interview question: how many Unity metastores can you create within a region? Only and only one. Okay, just keep this thing in your mind.

Now, what is the default storage account location? We need to give this, and before that it is asking us to provide an access connector ID. Yes. So basically, I told you this is a situational question in which multiple questions are embedded. So now just tell me one thing. This is your Databricks workspace and this is your data lake. How will they be communicating with each other? Do they even know each other? Obviously not, right? They are different entities: this one is owned by Azure, this one is owned by Databricks. How? How, bro? How? So here comes the role of the Databricks access connector.

We need to create a Databricks access connector, and it is the only way to connect if you're working with Unity Catalog. Okay. How can we create that? Simply go to Azure, go to home, and then search here. Oh, first of all go to the resource group, because it will save you a lot of time. Click on create and then search "access", and you will see a Spider-Man logo. Now search "access connector". Oh, nice. See: Access Connector for Azure Databricks; this is the logo. And then click on create. By the way, I already have so many access connectors. I don't know why I do not delete them after creating a video. Ansh Lamba, just do something; you should delete those. Okay. So now you just need to name it; I will simply say DB interview access. Perfect. Region: US East? No bro, UK South. Yeah, perfect. Click on review plus create. So what will this do? Nothing. It is just a kind of credential that we need to use.

We are just allowing this particular connector to use our storage accounts, and then that particular connector can be integrated with Databricks. That is the solution to part one of question number one, in which we need to create the connection between your Databricks and your external data lake. Okay. Click on create. And see, that's why I focus more on the real real-time scenarios. There's a difference between real-time scenarios and real real-time scenarios.

Real real-time scenarios involve all those admin tasks. Bro, just get this: a Databricks interview is not just about "hey, what is a volume? What is a cluster? What is a job cluster?" No, they will give you a situation, and you need to tell them, "hey, I will do this." Okay, it is created. Now what will I do? I will go to the resource; I will just show you.

See, this is my resource. So now we need to go to our storage account, okay, and go to access control, because we are granting contributor access on this particular data lake to that connector. Okay. Click on add, add role assignment, and then search "storage" below. Sometimes it gets stuck and does not populate immediately. So see, now it is coming: Storage Blob Data Contributor. Click on this. Click on next. Now click on managed identity, click on select members, and simply pick your access connector. Obviously, see, I have 14 access connectors. Ansh Lamba, what are you doing, man? I will just pick this one, the Databricks access connector, and I promise I will delete all these access connectors after this. Okay, click on select and click review plus assign. That's it.

Now we can integrate this particular access connector with our Databricks in order to access this particular data lake. Understood the relation? Okay. A Databricks interview question is not just about writing PySpark code. Databricks is a wide technology used by the whole organization, not just by data engineers, not just by PySpark developers. Okay. So now simply go to your resource group and click on this resource. Now you will need this resource ID. That's it. Go back and put the access connector ID here. That's it.

Now it is asking us to provide the ADLS Gen2 path. Simply go here and click on your resource group, which is DB interview. Click on this storage account, click on containers, and here just create a container, because obviously you'll be creating a metastore container. So just say "metastore". This is basically the container which is dedicated to the Unity metastore only. Do not touch this. Do not touch this, okay, because it is only available for that particular Unity metastore. That's it. This is not your property. Yeah, it is your property, but you are not living here. So now, click on properties and simply remember that this is the storage account name, which is dbinterviewlake. Okay.

Now go back, and you need to paste the location. What is it? The container name, which is metastore, at the storage account name, which is dbinterviewlake. By the way, this is basically called the ABFSS protocol, Azure Blob File System Secure. It is the protocol that we use to access the data lake on Azure, and you should also know about this. Okay. Then simply write dfs.core.windows.net, and that's everything, because we are not specifying any folder. Use any folder, bro, it's all up to you. Click on create. That's it. This is your connection, done so far if everything is going smoothly. Okay. And let's see.
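That hand-typed path follows a fixed pattern, so it can help to sketch it as a tiny helper. This is only an illustration; the container and storage account names below are the ones used in this walkthrough:

```python
def abfss_path(container: str, storage_account: str, folder: str = "") -> str:
    """Build an ABFSS URI: abfss://<container>@<account>.dfs.core.windows.net[/<folder>]."""
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    return f"{base}/{folder}" if folder else base

# The metastore root container created above:
print(abfss_path("metastore", "dbinterviewlake"))
# prints abfss://metastore@dbinterviewlake.dfs.core.windows.net
```

The same helper also covers the raw and destination containers used later in this question.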

By the way, don't worry, these red marks are not errors. This is saying we cannot assign, because now that the metastore is created, we need to assign the workspace, right? And we cannot assign a workspace which is already assigned to a different metastore, and this one is already assigned to a different metastore. So we can only assign this one. Makes sense. Click on assign. And it will say, hey, do you want to enable Unity Catalog? We will say yes. Click on enable. That's it. That's it. Congratulations. And click on close.

It is done. It's done. Yeah, but the setup is not completely done. You will see this page after this. See: metastore admin. Click on this edit button; currently this particular account is the admin. But you are using the Gmail account, right? So you need to make that particular account the admin; only then will you be able to use and create Unity Catalog objects. Okay. So this is a very short one, but interviewers can hook you with that scenario. And obviously, if you have watched my videos, you will answer like a pro. Okay. Then simply click on this dropdown, type your email ID, and click on save. That's it. It is done. Now you can simply close this.

Okay. Now we are here. Simply refresh the screen. So we have successfully set up the Unity metastore and we can now move on. Now we can easily create all those things that we need to create, and our Databricks workspace is connected to the external data lake. Okay. Sorted. Very good. Now, just to confirm, simply go to the catalog. Here is how you can confirm that you have enabled the Unity metastore: simply click on this plus button and you will see "create a catalog". Earlier you could not see this, because you cannot create catalogs in native Databricks workspaces.

Okay, cool, cool. Our first question is done. Now let's talk about question number two. In question number two, what do we need to achieve? This is the day-to-day activity that you, as a data engineer using Databricks, will be doing, and that was the core reason for me picking an external data lake. So the thing is, this is again a scenario where you have a data lake into which your manager, or any system, any pipeline you can say, will be dumping data in the Parquet format, okay, and this is in the Azure data lake.

Now you need to push this data to a sink location, or let's say a destination. And you do not need to just push this data; you actually need to convert it into the Delta format, and on top of it you even need to create a table, which will be a Delta table. So you need to do this task, and you will be doing everything with the help of Databricks. That's it. The source is in the Parquet format; you need to do a kind of data file type conversion and land that data in the Delta format, which is the most widely used open table format right now. And not only this, you also need to create this Delta table on top of it. And there's one more feature that you need to add: every time, you need to do a full refresh of the data. This is the requirement. Every time data is coming here, because we are pulling all the data from, let's say, webMethods or any API, every time we are doing a full refresh. How can you achieve this? Let me show you. And this is a really, really common scenario; lots of scenarios will revolve around this, trust me. Maybe it will come up directly or indirectly, but it will be there. Okay, so let me show you how you can achieve this. In order to achieve this, obviously we should first have some data.

So how can you grab the data? I have uploaded all the data files to my GitHub repository; you can simply check it out, and this is the link that you can refer to. Okay, let me just increase the screen. I will also put the link in the description. Okay, and this is the repository, and in it we have these folders.

Sorted. So let's go first of all to our data lake, because we need to set it up. Okay, and within this, we need to simply create a container which will be called raw, and one more which will be for, let's say, destination. Okay, perfect. So into the raw container I will be uploading files. I will create a directory and I will simply say parquet data. Click on save, and within this I will upload that Parquet file; you can also download it from here. So I think it is here. Perfect. So now I have already uploaded this demo.parquet here.

So what will I do? I will simply read this data. How? That is the question, right? Yeah. And it is not that straightforward. You need to do some things before that. Okay, let's see. So, let's go to Databricks. Obviously, we are following the Unity Catalog architecture, so you need to do everything keeping that in mind. Okay. So, in order to read this data, you would need something. Simply go to your catalog, go to this plus button, or simply click on this external data, and then you need to create something called an external location. Okay, because only then can you read the data sitting in the data lake. And even before that, you need to go to credentials; you need to actually create a credential. Wow. So we need to do these things before reading the data from the data lake. Yes. That is why I picked an external data lake, bro.

That is why. So simply click on create credential. By the way, what is a credential? A credential is nothing but a fancy name for your access connector. Really? Yeah. Click on create credential, and for the credential name I will simply say an credits. Now, see: access connector ID. I told you, this is exactly the same thing. So simply go to Azure, go to your resource group (home, then resource groups, or directly go to the access connector resource), copy the resource ID, that's it, and paste it here. And that's it, click on create. It is just a fancy name for the connector; again, this can prick you in the interview, so just be prepared. Okay, so now we are all set.

Now create the external location: click on this, click on create external location, and you can name it; I will simply say raw. Another interview question! Okay, so many interview questions. Yeah, all interlinked, because these are the follow-up questions that the person can ask: hey, what is this? Hey, what is that? So you should be all prepared. Whenever you create an external location, you always create it down to the container level. So just a quick IQ question for you: we have two containers; how many external locations do we need to create? The answer is two, because we create an external location per container. Very good.

container level. Very good. So now it is asking me to provide the URL. I will

simply say abss. This is just the protocol and here

abss. This is just the protocol and here we need to just put the container name which is raw. Add the storage account

name which is db interview lake. Then

dfs.co windows.net. And then we do not need to

windows.net. And then we do not need to worry about anything because we are just providing uh external location till container level. We can even fine grain

container level. We can even fine grain it but it is always recommended to just provide the access or let's say external like access to the external location till the container level. Okay. Now it

is saying storage credential you have already created an credits. That's it.

Click on create. That's it. It is done. You can

create. That's it. It is done. You can

even click on test connection and it will simply test it. And you can see read list write delete path exist hierarchical, name, space, everything is enabled. Very good. Click on done. So

enabled. Very good. Click on done. So
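For reference, the same external location the UI just created can also be expressed as Unity Catalog SQL from a notebook. This is a hedged sketch; the location name `raw` matches this walkthrough, and `an_credits` stands in for the credential created above (its exact identifier in your workspace is an assumption):

```python
# Sketch only: the UI's "create external location" step as Unity Catalog DDL.
# Assumes a Databricks notebook where `spark` is predefined.

CREATE_RAW_LOCATION = """
CREATE EXTERNAL LOCATION IF NOT EXISTS raw
  URL 'abfss://raw@dbinterviewlake.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL an_credits)
"""

def create_raw_location(spark):
    # Requires the CREATE EXTERNAL LOCATION privilege on the metastore.
    spark.sql(CREATE_RAW_LOCATION)
```

Knowing the SQL form is handy in interviews, since the follow-up "can you do that without the UI?" is common.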

Similarly, create one more location for the other container, the destination container; otherwise you cannot write your data, and half of the question would be pending. Okay. Click on external data one more time, and I will simply say destination. The URL we already know: abfss, this time it is destination at the storage account, dbinterviewlake.dfs.core.windows.net. Okay, that's it, bro. The credential we already have. Click on create. That's it. Click on test connection. Perfect, baby. So now we are all set. Click on workspace.

Click on create. Create a folder. Why? It's always good. Simply say DB interview. Okay, sorted. Now within this, create a notebook, and we will simply call it notebook one. And don't worry, I will upload all the notebooks to the GitHub repository, but just try to write the code on your own. I have seen lots of learners say, hey, can you just upload the notebook, because we are so, so lazy; we do not want to write the code, but we want to crack the data engineering interviews. Bro, have some water. Just show some enthusiasm to write the code. I will upload the notebooks just for reference, in case you see some errors. I don't know why you need to refer to the notebooks; you have everything on the screen. Just type it, bro. You see errors, you complain: hey, I'm seeing errors. So what?

Let me just tell you, if you are not aware of this: if you want to become a data engineer, more than half of your job will go into just debugging. Do not expect you will be building a thousand pipelines in a day. No, you will hardly build one pipeline in a day, and the next four days you'll be just debugging it. So just have this clarity, bro. Otherwise reality will hit you and you will say, hey, what field have I entered? So set your mind accordingly. Write the code. Okay, do not say, hey, upload this, upload that. I can, but why don't I? Because I want you to write the code. I want you to grow and get the success. Okay, I know in the beginning it's not really easy. Make it a habit. Huh, enough psychological talks, philosophical talks. Okay, simply say notebook one.

Now let's start our development. So first of all, we know that we always connect a cluster to our notebook. Click on this connect button, and this time you will see, hey, this is already on. Ansh Lamba, did you create a cluster and not show us? No bro, as per the latest update by Databricks, you already get one serverless cluster which is always running for you; you can simply pick it and run your code. Boom. No need to wait 10 or 15 minutes to turn on your cluster, because when you are learning you would not want to waste your time on cluster creation, right? So simply pick this, and I will write a markdown cell. And I hope that I do not need to give an overview of the notebook, because this is just a notebook, right, and you are practicing interview questions, so I assume that you know some things about notebooks. Okay, and obviously, if you don't, everything is covered from scratch in that video, bro, everything.

So let me just create a markdown cell and say, hey, reading Parquet data, because first we need to read it. In order to read the data, we can simply use the Spark API: spark.read.format. Okay, baby. Ansh Lamba, mind your language. What? Baby is a good word. I am a baby. What's wrong with this? Okay. So, spark.read.format("parquet"). The format is Parquet; in case you're using any other file, you can use CSV, JSON, any format. Parquet is what I am using. After this, I will simply say .load. Why no schema? Because when we work with Parquet files, the schema of the data is actually stored in the footer of the file, so I do not need to worry about defining the schema or anything. It's the thing I love best about Parquet. Now in the load section, obviously, I need to define the location. What's the location? We already know: abfss, the container name raw, at dbinterviewlake, then dfs.core.windows.net; within this I have one folder, right, called parquet data. Very good. Simply load this, and click on this close button, because we know our trial will be ending in 14 days, not a big deal. Okay, so it will simply load the data if everything is fine, and I hope everything is fine with the location and all. It takes some time with the first cell; don't say, hey, Spark is very slow. So now our data is loaded. In order to display it, we can simply say display(df), just to make sure the data is fine.
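The read cell dictated above comes out to roughly this. It is a sketch assuming a Databricks notebook, where `spark` and `display` are predefined; the path uses this walkthrough's container and storage account names:

```python
# Read the raw Parquet folder. Parquet keeps its schema in the file footer,
# so no explicit schema needs to be supplied.
RAW_PATH = "abfss://raw@dbinterviewlake.dfs.core.windows.net/parquet_data"

def read_raw(spark):
    return spark.read.format("parquet").load(RAW_PATH)

# In the notebook:
#   df = read_raw(spark)
#   display(df)
```

Swapping `"parquet"` for `"csv"` or `"json"` covers the other file types mentioned above, though those formats may need schema or header options that Parquet does not.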

Okay. So it is showing display(df), and perfect. One thing to note: whenever you're using serverless compute, you cannot go to the Spark web UI; you can only see the performance tab. But when you create your all-purpose cluster, you can actually see the jobs, click on one, and it will take you to the Spark web UI. Okay, so this is your data. For me it's good. So now I need to create a kind of solution which will, every time, do a full refresh on the sink side, on the destination side, plus I want to create a Delta table on top of it. How can we do that? It's very simple.

You will simply say df.write.format, and this time you will write delta. Okay. Then you need to set the mode, and the mode is something called overwrite. So basically we have four modes. Append is used when we just want to insert data only. Overwrite will do a full refresh every time. We have a third one as well, called error: it will simply throw an error if any data is already there. The fourth is ignore: if the data is already there, it will simply skip the write.
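As a quick reference, here are the four save modes just listed. One note from my side: in Spark the canonical string for the error mode is `errorifexists`, with `error` accepted as an alias:

```python
# Summary of the four DataFrameWriter save modes described above.
SAVE_MODES = {
    "append":    "insert the new rows, keep existing data",
    "overwrite": "full refresh: replace whatever is at the target",
    "error":     "fail if data already exists at the target (alias of errorifexists)",
    "ignore":    "silently skip the write if data already exists",
}

for mode, behavior in SAVE_MODES.items():
    print(f"{mode}: {behavior}")
```

For this question's "full refresh on every run" requirement, `overwrite` is the mode that fits.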

Okay. So the mode is done. Now we simply need to say option. We need to define the path where you want to write the data. So I will say destination, at, what is the storage account name, dbinterviewlake. Perfect. Within this, I want to create a folder called parquet data. Okay, but the game is not over. You could write something like .saveAsTable: when you write saveAsTable, it will simply create a Delta table on top of the written data. So you actually have two options: one, you can create the table while writing; two, you can create the table once the data is there. You have both options, and I will show you both. Okay. So first of all, we will simply write the data. I will simply say .save. It will write the data there and we are good. Um, it is running, but I know it will... oh, error. Nice. What's wrong, bro? Path must be

absolute? What do you mean? So this is our container name, okay: destination at dbinterviewlake.dfs.core.windows.net, and this is our folder. Okay. Hmm, let me just scroll down. What's wrong? It is saying it is wrong here, at save. Why is it saying that is wrong? It is just the save call. Uh... okay, Ansh Lamba, who will put the protocol? Now it is fine. Okay, okay. I was just creating those external locations, and there it was not asking me for the protocol. So, human error. So now it is there. In order to validate this information, simply go to your Azure, go to your data lake, see containers, and here in the destination you should see this folder. Perfect. And this is in the Delta format.
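Put together, the write just walked through, including the `abfss://` protocol that caused the "path must be absolute" error, looks roughly like this. It is a sketch assuming a Databricks notebook where `spark` exists, using this walkthrough's paths; the three-level table name is a hypothetical placeholder:

```python
# Full refresh: read Parquet from raw, overwrite the destination as Delta.
SOURCE = "abfss://raw@dbinterviewlake.dfs.core.windows.net/parquet_data"
DEST   = "abfss://destination@dbinterviewlake.dfs.core.windows.net/parquet_data"

def full_refresh(spark):
    df = spark.read.format("parquet").load(SOURCE)
    # Option 1 (done first in the video): just write the Delta files.
    # Omitting the abfss:// protocol here triggers "path must be absolute".
    df.write.format("delta").mode("overwrite").save(DEST)

def full_refresh_with_table(spark):
    df = spark.read.format("parquet").load(SOURCE)
    # Option 2: register an external Delta table while writing.
    (df.write.format("delta")
        .mode("overwrite")
        .option("path", DEST)                              # external location
        .saveAsTable("db_catalog.default.parquet_data"))   # name is an assumption
```

Option 2 needs the catalog and schema from the next steps to exist first, which is why the video writes the files before creating the table.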

Now, in order to create a table on top of it, you first of all need a catalog, then a database/schema (same thing), then obviously a table name. So in order to do that, you would first create a catalog. Simply go to Catalogs, click on this plus button, and click on create a catalog. The catalog name will be, let's say, DB interview catalog, or just DB catalog; that's it, because we need to use this name, so we cannot keep it very long. Um, now it is asking me for a storage account location. This is another interview question. Okay. Now we are creating a catalog, and we are not providing any location to this catalog. Now let's say I am creating a managed table. Creating what? Managed. Okay, I will discuss this in a separate question, because I would like to cover all the scenarios. It will be really, really good, and it can be asked in your interviews as well. So just for now we are not providing any location, and don't worry, in the next question I will cover all the scenarios. It's really, really important. Okay, simply click on create, and by covering those scenarios you will become a master of this hierarchical structure within the

catalog schema and blah blah. Okay. Then

it is saying "catalog created — configure catalog," and it is asking: hey, limit the workspaces in which users can access this catalog. I will say all workspaces have access, so all the workspaces attached to that particular Unity metastore will have access. I am fine with that. If I just uncheck this box, I can instead pick "assign to workspace." The owner is Ansh Lamba. Okay. Then simply grant the privileges — choose which users or groups can access this catalog. "All account users granted, browse by default." So in this way you can pick the users; for now we are saying all account users, meaning whatever users we have in this particular account can access this particular catalog, and we are not restricting anything. You can also pick a principal and say revoke — because currently all account users have access, if I click on the principal and revoke, I can actually revoke the access. Okay, obviously I do not want that, so all account users keep the access. Simply click on Next — by the way, by default it is granted. The metadata step is fine, because we are not taking care of categorization within the catalogs. So our catalog is done. Now we can create a schema either from here or from code as well — the code is simply CREATE SCHEMA schema_name. I'll simply create it from here. The schema name is, let's say, db_schema. Okay. And again we are not providing a path here; don't worry, I will cover that particular thing in a separate question. So now let me just go back to my notebook by clicking on Recents, and this is my notebook. Perfect. Now you will see

something: click on this ribbon and click on this button. Now we have this catalog, right — db_catalog. Click on this, and in this we have the schema. Perfect. In this particular schema I want to create a table. Okay. I will simply say CREATE TABLE, then I will put the catalog name, because we use a three-level namespace in Unity Catalog: catalog, database, table. Sorted. Okay: db_catalog dot db_schema dot the table name — I would simply say parquet_data. Why? Because it is always recommended to have the same name for your table and your folder — some best practices. Okay. Now, CREATE TABLE is done. Do we need to define the schema? It's up to me. But I would not define the schema. Why? Because in the Delta log the schema is already there, so I do not need to worry at all. Okay, I will simply say: hey, just create the table USING DELTA. Now, do we need quotes around delta or not? Maybe not. USING DELTA itself is also optional, because Databricks has made Delta the default — if you do not put anything, it will still work, because it understands that you want to create a table using Delta. But I as a developer usually like to put it; it promotes readability. Okay. Now I will simply add one more thing — it's called LOCATION, because we need to define where the data is on which we are creating the table. Now, again, an interview question — see, I told you I am going to cover so many questions in this video, and all are real-time scenarios with follow-up questions as well. The interviewer

can ask you: hey, if I create a table on top of a location, let's say XYZ, it will simply create a Delta table on top of that location — makes sense. But what if I want a table on a particular location ABC and there is no data residing in that folder? What will happen? Tell me. What it will do is simply create a blank table at that location — and if you provide a schema, it will create a blank table with that schema. So the moment that location receives files, it will show the data. Okay, remember this thing. I'm going to cover many follow-up questions like this as well. Okay.
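Spelled out, the statement being typed above looks roughly like this. The catalog, schema, table, and path are reconstructed from the audio, so treat them as placeholders:

```python
# Hypothetical names mirroring the demo; adjust to your workspace.
catalog, schema, table = "db_catalog", "db_schema", "parquet_data"
location = "abfss://destination@dbinterviewlake.dfs.core.windows.net/parquet_data"

# Three-level namespace: catalog.schema.table
ddl = (
    f"CREATE TABLE {catalog}.{schema}.{table} "
    f"USING DELTA "            # optional since Delta is the default; kept for readability
    f"LOCATION '{location}'"   # external table: the Delta log already carries the schema
)
print(ddl)
# In a Databricks notebook: spark.sql(ddl)
# Per the demo: if the location is empty, you would list the columns and get
# a blank table that starts showing rows once files land in the folder.
```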

Because I know it's my responsibility — you have clicked on this video, which means you will be getting lots of knowledge. It's my responsibility, okay, it's my love for you. Now, for the location, I will simply copy this path and paste it here. Perfect. Simply run this, and do not show me any errors, otherwise...

Please. Oh, bro. What's the error, by the way? What is it saying? Oh, we need to add one more S. Hey, by the way, how did it work here? Oh, because there it was just writing. Okay, let me just add ABFSS — ABFS is the driver name, but abfss is its secure (TLS) scheme, and with Azure Data Lake Storage Gen2 we use abfss. Okay, so just a silly mistake. But yeah, again, one follow-up question: why do we use ABFSS? It is recommended by Microsoft.

Just talk to them. Is this the answer to the question? Is this a way to talk to your interviewer? Yes — why not? If you have skills, just talk to the person like this. What's the big deal? Thousands of companies are waiting for you. Okay? If you do not have skills, then you need to think twice, or maybe at least ten times, before saying anything to your interviewer. If you have skills, just be okay. Okay? It's just a company. You are the one who is an asset to this world. Okay? Just understand this thing, bro.

The world is really changing. Let's take the example of any big personality — Bill Gates. Okay, there were not many companies at that time, but he had the skills, so he had to do something with them. He just opened a new organization, and now it's Microsoft. Now we have so many companies, so we just think that a job is the only possibility for survival. No, no — just think big and realize that the world is really big. Okay? And there is a world that revolves after 5:00 p.m. and before 9:00 a.m. — just live in that world as well. Okay, again, personal choice. By the way, if you want to live in that world, you will suffer a lot; sometimes you will be skipping food for many days. So if you're ready for that world, welcome. Otherwise, 9-to-5 is very good. Sorted. Sorted life. Okay, so it's about choices.

Okay. So now what do we need to do? Simply validate it. I will run a SELECT statement on top of it: SELECT * FROM this table name, and I should see the data. Okay, and then we will jump on to our next question. So this is my table, and this should show me some data. Perfect. Now, again, another interview question — what's that for? We simply created a Delta table — makes sense — and we simply queried this particular table. Okay. What if I had not created this table and had only written my data in this delta format? How would you see the data then? Hm. Good question.

So the answer is very simple. We have something called delta-dot-location: we can query the data directly — yeah, we have the Delta connector. What you need to write is simply SELECT * FROM, then delta dot, then backticks. What are backticks? Go one key above your Tab key, just to the left of the 1 digit — that's the backtick. So now simply write the location you want to read inside the backticks — I want to read this location. Simply run this, and you can query the files directly instead of creating tables. And I'm not lying — see, same result.
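In SQL form, the direct query being typed is roughly the following; the path is a placeholder modeled on the demo:

```python
# Querying Delta files in place, without registering a table.
# The path is hypothetical; substitute your own abfss:// location.
location = "abfss://destination@dbinterviewlake.dfs.core.windows.net/parquet_data"

# Backticks let the path stand where a table name would normally go.
query = f"SELECT * FROM delta.`{location}`"
print(query)
# In a Databricks notebook: display(spark.sql(query))
```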

So — just a follow-up question? Actually not a follow-up; this is an individual question. Okay. And this feature was not available before — it was added fairly recently; these kinds of features were not there before. Okay, now what is our next question? See, I am covering so many questions within question number two, so do not feel like we are covering only a few — there are embedded questions within each category. So I would simply say category 1, category 2, category 3 instead of question one, two, three. I will divide it that way, otherwise people will just scroll the video and say, hey, only five to six questions are covered. Bro, just click on the video and watch it, okay? Do not judge a book by its cover — obviously you would be fooling yourself if you judged a book by its cover, right? Okay, so what is our next category, and what do we need to cover? I know that we want to discuss the hierarchical structure: what will happen if you provide the location at the metastore level, then at the catalog level, then at the schema level. Let's discuss that as our next question and see what we have after that. Now let's talk about question number three. Basically this is one question with multiple scenarios, and you need to tell what will happen in each scenario. Yes, it can be a set of quick questions, like a rapid-fire round: hey, just tell me what will happen in this scenario. So without

wasting any time, let me just tell you. First of all, we know that our Unity Catalog metastore has a location. If it has a location, I will simply draw a blue sign here — blue sign means ADLS. Okay, remember this, because we'll be using these conventions in the later scenarios.

Scenario one: the Unity metastore has a location, the catalog doesn't have any location, and our database/schema also doesn't have any location. Now I am creating — or let's say your interviewer says, I am creating — a table. Okay. Obviously we use the three-level namespace: catalog, database, and then table name. And obviously this is a managed table, because if you were creating an external table it would simply go to its own location; a managed table is one where we do not need to provide the location. I am not providing a location at the table level — which location will it pick? Tell me, where will it go? The answer: it will go to the metastore location. First it asks the database: hey, do you have any location? The database says, no, bro. So it goes to the catalog: hey, do you have any location? It says, no, bro. So it asks the Unity metastore: do you have any location where I can put my data? It says, yeah, bro. So it uses that location. Sorted — scenario number one.

Scenario two: now I am creating an ADLS location for the catalog as well. This catalog has its own dedicated storage and is not using the Unity metastore location, so whatever I create in this catalog will go to that dedicated location (and obviously, to set that up, we use an access connector — we already know that). This time my managed table, which has no location, asks the database: bro, do you have any location? It says no. So it goes to the catalog: hey, do you have any location? This time the catalog says: yeah, I do. So it stops there, and the data goes to the catalog's location — it does not go up to the metastore's ADLS at all. That's it. Don't worry, I'll show you a quick demo — not very long — just for the catalog, or let's say just for the database.

Scenario three: let's say our database also wants its own resources — it says, hey, I also want dedicated storage, so it has a dedicated container. This time the table asks: hey bro, I know you are usually very poor, but let me ask again — do you have any location? This time the database says: hey bro, I have one. So it stops right there; it does not go to the catalog's location, and not to the metastore's either. The data is saved at the schema level. Wow. Yes! Do you want to test it? Let's test it. These are the three scenarios that you should be aware of — and the fourth scenario is very simple: when the table itself has a location, obviously the data is saved directly there. It will not ask anyone else.
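The lookup order just described — table, then schema, then catalog, then metastore — can be sketched as a tiny resolver. This is an illustration of the rule, not Databricks' actual code; the location strings are placeholders.

```python
from typing import Optional

def resolve_managed_location(
    table_loc: Optional[str],
    schema_loc: Optional[str],
    catalog_loc: Optional[str],
    metastore_loc: str,
) -> str:
    """Return the storage root a table's data lands in.

    Unity Catalog stops at the first level (bottom-up) that
    defines a location; the metastore root is the final fallback.
    """
    for loc in (table_loc, schema_loc, catalog_loc):
        if loc is not None:
            return loc
    return metastore_loc

meta = "abfss://metastore@lake.dfs.core.windows.net/"
cat = "abfss://catalog@lake.dfs.core.windows.net/"
sch = "abfss://schema@lake.dfs.core.windows.net/"

# Scenario 1: nothing below the metastore has a location.
print(resolve_managed_location(None, None, None, meta))  # metastore root
# Scenario 2: the catalog has its own location -> lookup stops there.
print(resolve_managed_location(None, None, cat, meta))   # catalog root
# Scenario 3: the schema wins over the catalog.
print(resolve_managed_location(None, sch, cat, meta))    # schema root
# Scenario 4: a LOCATION on the table itself (an external table) wins outright.
```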

Okay, makes sense? So now let me show you how you can actually test this. Simply go to your data lake: in the containers we have the container for the metastore, obviously, and it is empty because we do not have any managed table so far. Okay. So what I will do is create one container. I'll simply call it dbcontainer — and Ansh Lamba, who told you to use an underscore? Okay, sorry. Now, to use it we need to create a database, and to give the database its own storage we need to provide the location. You can code it, like: CREATE SCHEMA db_catalog.dbcontainer with a location. But to do that we should have an external location registered, and for now we do not, so we will create it. I'll simply go to the catalog, right-click on it, and open the link in a new tab, because I do not want to pause this. And see, these things are really detailed — that's why it's really hard to crack the Databricks interview: everyone is focusing on writing and developing the code. It's not just like that; you need to understand all the ins and outs, because it's really important. Okay. So now we will create an external location under External Data. But first let me show you whatever error it throws if I just run the schema right away: I will say abfss, then the container, dbcontainer, then at-the-rate dbinterviewlake, and I just want to create the schema directly within this particular location — it should work. So if I run this, we should see something — and yeah, an error, as expected. What is it saying? "Create schema in Unity Catalog must use MANAGED LOCATION, not LOCATION." Oh, this is a different error. So we simply need to write MANAGED LOCATION — add the MANAGED keyword. Now we should see the error regarding that external location. Yeah,

perfect. It is saying the external location doesn't exist — because it actually doesn't; we have not created it. But I will create it right now. I will go to External Data and say Create external location. The external location name will be db_location. Perfect. And what is the location? I'll simply copy the path here, and then pick the storage credential — obviously this one, because this storage credential has Storage Blob Data Contributor on the whole data lake, on all the containers. Click on Create and... "failed to access cloud storage." Why? Why? Oh, simply remove this trailing slash. Click on Create. Wait, wait, wait — I know this error. Uh... let me just check the path: abfss, the container, at-the-rate dbinterviewlake.dfs.core.windows.net. I am so sure it is something regarding the storage account path. "Enter the bucket path you want to use as the external location." Is the spelling correct? Oh, see — Ansh Lamba is using an underscore, and that underscore is not in the container name. Now you will say: hey, Ansh Lamba, that other path exists too. But that one would still throw an error, because that container doesn't have any external location. See — trust me, trust this guy. Now, what's wrong? dbcontainer, at-the-rate dbinterviewlake... and again, I think just a typo: it's not "dbconainer," it's "dbcontainer." See, trust this guy. And we can simply say Test connection — everything is done. Very well done. Now run the CREATE SCHEMA again; it should run fine, because now we have the location in place. It sometimes takes some time, usually like one to two minutes... yeah, and while I was just talking, it worked.
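Put together, the schema DDL that finally succeeded looks roughly like this. The names are reconstructed from the audio, so treat them as placeholders:

```python
# Hypothetical names mirroring the demo.
catalog, schema = "db_catalog", "dbcontainer"
managed_root = "abfss://dbcontainer@dbinterviewlake.dfs.core.windows.net/"

# Unity Catalog requires MANAGED LOCATION (not LOCATION) for schemas,
# and the path must be covered by a registered external location.
ddl = f"CREATE SCHEMA {catalog}.{schema} MANAGED LOCATION '{managed_root}'"
print(ddl)
# In Databricks: spark.sql(ddl)
```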

Now what I need to do is create a new table here. So I will say CREATE TABLE, and the table will be db_catalog.dbcontainer.test_table. Okay, and I'll provide the schema — let's say just id INT, one column, because we just want to test — then USING DELTA, and no LOCATION, obviously, because we are creating a managed table. Run this and you will see what happens (and by the way, it's dbcontainer, not "db contain," okay). So what will happen? It will create a managed table — but not in the metastore container; in our dbcontainer, because that is the location assigned to that schema. And indeed the metastore container stays empty, because the data will not go there, but our dbcontainer now has this Unity storage folder. Obviously we have not inserted any data, which is why it looks like this: within it we have the schema folder, within that the table ID, and in the table location just empty data plus the _delta_log, in which the schema, or metadata, is stored. That's it. Okay, makes sense? So this is the location, this is the hierarchy, and it is validated that the data goes to that schema's location only. See, now you know all the

things. Now let's jump on to the next category of questions, which is more towards processing the data, and the next question is really interesting — it is the bread and butter of nowadays, because of the massive data that we process. Let me give you a hint: it is related to incremental loading. Okay, let's see what we have in that particular question. So, we have a second category of questions where we are more inclined towards processing data: how we can process the data, and how we can process it effectively. Great question. And here comes the role of that massive hero of nowadays in Databricks — it's called Autoloader, which has made incremental data processing so, so easy, with everything automated. And yes, there are some scenarios where you need to tackle things such as schema failures, schema changes, schema evolution, all those things — so don't worry, I will take some follow-up questions as well. You need to understand Autoloader, and it's really important, because it is built on top of Spark Structured Streaming — and obviously you know Spark Structured Streaming is really important nowadays. So it is built on top of it, and it enables so many cool features when you work with Databricks. And what's the scenario? First of all, the scenario is simple: let's say this is your source. In your source you will be continuously receiving files — file number one, file number two, file number three, file number four, and so on. Now it's your responsibility to incrementally load the data to this particular destination. Okay?

But you need to take care of one special thing. It's called idempotency — don't worry, it's not what you're thinking. It means exactly-once. Exactly-once means: let's say this file is here, and we ingested it. On the second day we have a new, second file. Now, instead of processing both files, you need to process only the new file and not reconsider the first one. That's the concept of idempotency, or exactly-once: once data is processed, you do not process it again. Now, a quick question the interviewer can ask you: hey, okay, everything is done by Autoloader, but how does it do it behind the scenes — how does it achieve that? We have a kind of repository; it's called RocksDB. RocksDB keeps all the metadata of the files — which file is ingested, which file is not yet processed, all those things — inside a folder, and yes, you'll be creating that folder at the time you write your Autoloader query. Don't worry, I'll show you where that folder resides and what it looks like; everything is handled by this RocksDB state. By the way, we have two ways to detect new files with Autoloader. One is file notification, which is similar to storage events: it is triggered automatically when a new file arrives. For that you need to enable storage events — it's not enabled by default, and you need to grant quite a few permissions to use storage events from Databricks. The second option is directory listing, where Autoloader discovers files through API calls against the path; it is simpler, and it works hand in hand with RocksDB as well.

So you do not need to worry about anything. Now let's see how we can work with Autoloader in Databricks. Let me show you — and don't worry, I have incremental files in the repository; you can download them. Let me upload the data one by one and show you how it works. Okay, let's go to our Azure, go to our containers, go to the raw container, and create a new directory — I'll simply call it autoloader. Click on Save, go inside it, and upload the data. Before that, I can simply download the data — I think I already have it: raw data first, raw data second, raw data third. These are the files we'll be ingesting one by one, and yeah, it will be fun. So go here and upload just one file for now, because we'll be incrementally loading the data. I have uploaded my first file, raw data first — click on Upload. So this is my first file. Now let's create a new notebook.

Okay, go to your workspace, click on Create, and then Notebook. Perfect. Let me just name it — I'll simply say Notebook 2. That's fine. Now let's attach this to serverless — see, it's so easy. Now I will add the heading; I'll simply say "Autoloader — incrementally loading the files." Okay, perfect. Now, to create your Autoloader query — don't worry, I'll show you the documentation as well; there is very nice code written there, and you can simply copy and paste it. Whenever you are writing the code, you can refer to the documentation just for the parameters and such; the code itself is really easy. How do we do it? You simply say df equals spark dot readStream dot format. What will be the format? You will say, "Ansh Lamba, CSV." Use your common sense — we know that common sense is not common, but you can use it. Bro, the format is not CSV. What's the format then? The format is cloudFiles. Wow — what's that? (Hey, first of all, why are you giving us spoilers?) Whenever we work with Autoloader, we need to pick the format called cloudFiles. Don't worry — we will then say dot option, cloudFiles dot format, and there we'll say CSV. So we define CSV, but as a cloud file. Okay. Sorted. Very good. So

now we have something called as dot option and it's called basically it is very very handy. It's called schema hints. So let's say you want to say

hints. So let's say you want to say schema hints. This is very handy. Why? Because

hints. This is very handy. Why? Because

in the schema hence you do not need to actually provide the schema for the whole columns or let's say whole set of columns. You can either choose to

columns. You can either choose to provide schema for your just first column. That's it. That's it. You can do

column. That's it. That's it. You can do that. You can simply say id int and

that. You can simply say id int and that's it. You do not need to just

that's it. You do not need to just provide the schema for other columns.

It's fine. So you can just do that.

Obviously I'm not providing any schema hints here, but just for your reference, you should know about this. Okay. Now, once we have this, we need to create something called a checkpoint location. A checkpoint location, what's that? Well, it's not the real checkpoint location of Structured Streaming. It's a checkpoint location for your schema. It is also a kind of checkpoint, but only for your schema, because in Structured Streaming we need to capture the schema of every file, and the option is called... cloudFiles, see? Spoiler. I was just about to write the code and flex that I know it. Oh man. cloudFiles.schemaLocation. I'll write it out by hand; I will not hit Tab, I want to write it.

So basically it's a kind of checkpoint location, but do not get confused: it is not the real checkpoint location. The real checkpoint location is something else, which keeps track of your current state and the previous state of the table, the metadata of the files, your RocksDB, everything. Everything will be there. But we store this schema location inside the checkpoint location, so I just call it the checkpoint location for the schema. Ideally you should not create scattered locations. No, you should pick the same parent location and, obviously, different folders. That's it.

That's the best practice. Okay. And in the interview they can ask you about the management side as well, like which best practices you would apply here. So, simply: cloudFiles.schemaLocation. I'll write abfss:// and then the container, destination, or let's say raw, because I think it's in raw... but it's up to us where we store it; it's not a big deal. I'll say destination; I want to store it in destination. So: destination@dbinterviewlake.dfs.core.windows.net. Yeah. And within this I will simply say checkpoint. Yep, checkpoint.

Now, one most important thing. It will be discussed in the next question, because it is a follow-up question, but just a hint: we will cover how to handle the schema evolution mode. By default it is addNewColumns, and the code for it is .option, and this can be an interview question in itself: what is the default schema evolution mode? I know that I'm not writing any schema evolution mode, but what is applied to my code by default? It is cloudFiles.schemaEvolutionMode set to addNewColumns. So whether I write this or not, it is the same, because it is already applied by default. Don't worry, I will show you everything before running this command. Don't worry, trust me.

And then, once that is done, we can simply say .load. And from where do we need to read the files? From the raw container: abfss://raw@dbinterviewlake.dfs.core.windows.net... not sales, bro, it's called autoloader. Where is that? Yeah, autoloader. Perfect. And that's it.
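Putting those pieces together, the read side we just assembled can be sketched like this in PySpark. A minimal sketch: the container and storage-account names mirror this demo and stand in for your own, and an active SparkSession named spark is assumed.

```python
# Option map for the Autoloader read assembled above. Paths mirror this
# demo's containers and are placeholders for your own storage account.
AUTOLOADER_OPTIONS = {
    "cloudFiles.format": "csv",          # source files are CSV, read via cloudFiles
    "cloudFiles.schemaLocation":
        "abfss://destination@dbinterviewlake.dfs.core.windows.net/checkpoint",
    "cloudFiles.schemaHints": "id INT",  # hint only the columns you know
}

def build_autoloader_reader(spark):
    """Attach every cloudFiles option to a streaming reader (assumes `spark`)."""
    reader = spark.readStream.format("cloudFiles")
    for key, value in AUTOLOADER_OPTIONS.items():
        reader = reader.option(key, value)
    return reader  # then .load("abfss://raw@dbinterviewlake.dfs.core.windows.net/autoloader")
```

Keeping the options in one map also makes them easy to reuse across notebooks.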

Now let me show you the Auto Loader documentation.

Search "autoloader databricks". What is Auto Loader? This one. Now have a look at the code, and you will say: hey, where's the code, bro? Let me show you. That page just explains what Auto Loader is; click through, and I would say this one, schema inference. So this is the code. As you can see, we have all these pieces: the cloudFiles format, then cloudFiles.schemaLocation, then load. And after that, when writing, we simply provide the checkpoint location, because whenever we read the data we refer to it: this is the checkpoint location where the schema needs to be written. Auto Loader will put that schema file there before reading anything. Okay. Now, how does Auto Loader schema inference work? It is really important. I will talk about it deeply in the next question, like how it fails and what we need to do about it. And I'm just about to show you the default mode.

Yeah, perfect. So these are the modes. addNewColumns is the default mode. rescue is another very famous mode. And there is failOnNewColumns, which is not convenient, because in day-to-day operations you cannot keep failing your stream if your system is pulling new columns on a daily basis. Again, it totally depends on the design. Okay. And that's it; these are the modes we wanted to talk about. I'm just finding that particular reference that we write... see, yeah, perfect. This is the one, cloudFiles.schemaEvolutionMode, and your boy has already put that code here. Simple.
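For quick reference, the schema evolution modes from that documentation page can be jotted down like this. The one-line summaries are my sketch; the mode names themselves are the real option values.

```python
# Auto Loader schema evolution modes; "addNewColumns" is what applies
# when you don't set the option at all.
SCHEMA_EVOLUTION_MODES = {
    "addNewColumns": "default: evolve the stored schema; the stream fails once and a restart picks it up",
    "rescue": "never evolve the sink schema; mismatched values land in the rescued-data column",
    "failOnNewColumns": "fail the stream on new columns and do not evolve the schema",
    "none": "ignore new columns entirely",
}
DEFAULT_MODE = "addNewColumns"
```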

So now I will simply run this, and let's see if we have any errors. It's fine, because errors are good. Don't worry: whenever you develop something, you are not a machine, and even machines make mistakes while writing code. You write the code, you see the errors, you debug them. And if you never see errors, that means you are just kidding yourself. Errors are good, bro. Okay. So this is my query, but nothing has actually happened so far, really. Yeah, just go to your... whatever your thing is, because this is a kind of streaming.

Okay. And by the way, if you have observed one thing (because I have observed it), we need to use readStream, not just read. Okay, so simply say readStream. Perfect. So now what will it do? Let me just run this. It will simply define the stream; it will not initiate it, because in order to initiate anything we need to provide an action. And what is the action? And I'm receiving a call; let me pick it up and then continue, and if it is such a scam call that it broke the flow, don't worry, I'll simply report it. Okay, so I was saying: this streaming query needs an action in order to run, so I will provide the action right now. I will simply say df.writeStream, and then .format. In which format do I want to write? Let's say I want to write my data in the Delta format. And you can refer to the code, as you can see... where is that... writeStream, yeah, perfect. So you can simply say .option, and actually, if you do not provide any format, it will write the data in Delta format anyway. Then you simply need to write .start. That's it. You can even write your data into a table; for that you need to use an action called .toTable, if I'm not wrong.

Okay, so I will simply say .format("delta"), then .option, and the most important thing is the one we discussed: checkpointLocation. And here we do not need to write cloudFiles.checkpointLocation, because it is understood that this is the checkpoint location. Okay. And I will pick the same parent path, because I told you it is really important. Then... oh, I just hit the run button by mistake, so do not take that as completion. Okay. Now I will simply say .start, because we don't have anything else. And now we need the location where I want to write the data. So: write my data to this particular location. Obviously not into the checkpoint... oh, did I just... obviously I do not want to write my data directly into the checkpoint location. Okay, not a big deal. I just want to write my data in the destination container, not in the checkpoint folder. Or I can simply create a folder called data. That's it. Okay, makes sense. Makes sense to me. Let's run this, and now it should initiate the query without any errors.
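The write side we just typed can be sketched the same way. Note how the checkpoint and the data folder share one parent container, which is the best practice mentioned earlier. The paths are this demo's placeholders, and df is assumed to be the streaming DataFrame from the Autoloader read.

```python
# Schema location, checkpoint, and data share one parent container,
# in different folders (demo paths, placeholders for your own).
PARENT = "abfss://destination@dbinterviewlake.dfs.core.windows.net"
CHECKPOINT_PATH = f"{PARENT}/checkpoint"
DATA_PATH = f"{PARENT}/data"

def start_autoloader_write(df):
    """Start the Delta write for the Autoloader stream (assumes a streaming df)."""
    return (df.writeStream
              .format("delta")                               # Delta is also the default sink format
              .option("checkpointLocation", CHECKPOINT_PATH) # tracks state between runs
              .start(DATA_PATH))                             # the action that kicks off the stream
```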

Trigger type... oh yeah, very good. Now it is saying: hey, this processing trigger type is not supported for this cluster type. So we have to create an all-purpose cluster in order to perform this. We will simply go to Compute and quickly create one all-purpose cluster. Click on Create compute. Single node, it's fine. Unrestricted, not Personal Compute, because I don't want to fill in all the boxes. And which runtime should I pick? 16.3, yeah. And the node type: obviously the minimum one. Terminate after 40 minutes, fine. And click on Create compute. It will take, I think, just 3 to 4 minutes to create your cluster. Once it is green, we can actually attach our notebook to that cluster, and it is fine. But I was about to take you to the next question and just assume that this notebook, notebook two, will run fine and load the data incrementally as it is. We just need to run it and it will run fine, I know. Okay... or let's process this first. Let's go slow, let's wait and actually see what we have, and validate this first. This will process the data incrementally, because we have a second and a third CSV, so let's see how the data gets loaded incrementally. As you can see, simply attach the cluster and click on Confirm. And obviously we need to run this one more time, and let's click on this as well. Oh, obviously it will take some time to run the first cell, because it is just warming up the machines. Okay.

So now comes the best part. You can see "Stream initializing". What is that? It is the Spark Structured Streaming graph, which we can watch going up and down: how data is coming in and how we are processing more and more of it. It is really helpful, bro, because you see the data being processed in real time. As you can see, the graph goes up, boom. Why? Because there was a lot of data compared to zero, so the graph went really high, and you can see the processing rate and the input rate. Then it goes down, because there is not as much data, and now it is zero again. And now it will keep running continuously. Why? Because this is a cluster, bro; this is a cluster and we are processing streaming data, and in some solutions we do need streaming. So now I will simply query the data first of all. And how can we query it? I have already told you: SELECT * FROM delta. and then the backtick-quoted path, abfss://destination@dbinterviewlake.dfs.core.windows.net and then the autoloader folder. Query this, and you should see the data without any errors.

What's that? Again a typo: destination@dbinterviewlake, it's "lake". Okay... "Path does not exist"? It does not exist? Wow, where are you writing the data, man? Okay, so you are writing the data within... oh, this was a bad one. Let me just stop this, by the way; not a big deal. What it is doing is writing the data into the destination container directly, instead of inside an autoloader folder. So we have destination, and within this location we have checkpoint, as you can see. So this is our checkpoint, and we are not actually writing the data inside an autoloader folder, and yeah, it's fine. I thought I would be writing the data inside the autoloader part, but no. I should have chosen the autoloader container, because I specially created it, but it's fine, it's not a big deal. So within this you will see something called data. This is my real data, right? So I can simply query data, and you should see the data, okay baby. And why am I querying this data? Just to confirm the number of records, so that you can later see the record count go up and confirm that idempotency is there.

Okay: 95 rows. Perfect. Now let's add one more file, and the moment I insert the data... I will go to my source. What's my source? The raw container, autoloader, yeah, this one here. I will upload one more data file. So I have the second file; let me click on Upload, and you will see the magic here. See, I uploaded the data and it is processing it: the graph is going up again. Real-time data, and idempotency, and as you can see, the graph is up. Can you see that blue line? This one. Now let's query the data one more time, and you should see 95 plus some records. I think it was 31 records, so 126 records in total, if I'm not wrong. And I know it has only processed those 31 new records, because we have 126 records in total; otherwise we would have at least double the 95. Common maths, right? I have done my bachelor's in mathematics, so I'm really smart at mathematics. Really? No. So, 126 records.

Okay. Now we need to cover the second question, which is strongly aligned with this category. But yes, it is really its own question, because it could fit in any category. It is about schema evolution. So the next question, strongly aligned here: let's say I am adding my data here in the source. This is my first file, this is my second file. Let's say on the third day someone uploads the data, in CSV format obviously, but in that particular CSV we have a schema mismatch. Whoa. What will happen in that scenario, and how can we tackle it?

So, the thing is, what we need to do: simply go to raw... no, not raw, go to destination, go to checkpoint, and within this we have these folders created. Click on sources, and we have RocksDB here; as I told you, this holds all the zip files. So if you go into sources, this is your 0; and if you go to checkpoint, this is your metadata, and it is writing the metadata here. Click on Edit. Obviously we cannot really read it, and this is the ID; we can only open it in the editor, but it's a good way to preview it. And if you want to see the schema, there is a folder created by cloudFiles.schemaLocation. Don't worry, I will validate all this information. Simply go to _schemas, and this is your schema. Go to Edit; see, this is your schema. The _schemas folder is created by the cloudFiles schema tracking. Okay, so this is your schema. So now, what will happen? Let me add one more file and show you what will happen, how it will happen, and how we can tackle it. All the how and what, everything, will be answered right now, in just a few minutes.

So, what we will do: we will first upload the data into the raw zone, but before that, let me make some changes in the code. First of all, let me click on Interrupt, because I want to stop this query. I want to say: hey, if my source file brings schema evolution, I want to rescue those columns. That means on the destination side I do not want to add any further columns, but at the same time I want to keep those values. And how can we actually do that? In this particular scenario, we simply need to say rescue. The moment we say rescue (let me show you the data first), we get a rescued-data column, created by default for you. All the data which does not match the schema will be dumped there in JSON format, so that it can all fit in one column, in the form of key and value pairs. Okay. So in my third file I have a schema mismatch; I have done that intentionally so that I can show you. All those values will go into the rescued-data column, and we do not need to change any schema on the sink side. So this is just a kind of overview of how it happens behind the scenes and how it is actually taken care of. So what I will do right now: I will show you the documentation and what will happen, step by step. It is really important.
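To make the rescued-data column concrete, an illustrative cell value could look like the JSON below. This is a hypothetical payload, not captured from the actual run; the return_flag key matches the column this demo adds later, and the extra file-path key is the kind of metadata Auto Loader can include alongside the rescued values.

```python
import json

# Hypothetical value of the rescued-data column for one mismatched row:
# every value that didn't fit the known schema, stored as key/value pairs.
rescued_cell = '{"return_flag": "Y", "_file_path": "abfss://raw/.../sales_3.csv"}'
parsed = json.loads(rescued_cell)
print(parsed["return_flag"])  # prints Y: the unmatched column survives as a string
```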

I want to go really, really deep into this, because it is a new topic and you can be grilled on it if you do not know everything; it's very easy to trip you up with one follow-up after another. So what will happen, and how does Autoloader schema evolution work? Let's say you have a new file. What will Autoloader do? Autoloader detects the addition of new columns as it processes your data. So when it processes my third file, it will detect the new columns. Okay. Then it will stop the query with an error that says UnknownFieldException, because we do not have anything in place which says: hey, just ignore this. But before your stream throws this error... before it throws this error, what happens? Autoloader performs schema inference on the latest file and updates the schema location with the latest schema, merging the new columns to the end of the schema. That means it will go to my data lake, update that _schemas folder, and add the new schema there, even before throwing the error. Now, a very good question, and if the interviewer is skillful, he or she should ask it: if my schema is updated before the error is thrown, why am I receiving the error at all? Why? I spent, I think, the whole day finding this out when I was learning it. I was like: why? If my schema location is updated, and it knows the updated schema, why is it throwing the error?

Why? You would say: you just need to write rescue, or addNewColumns. No, no, addNewColumns is already there by default. So the thing is... listen to me carefully. Whenever we start the query, it caches the schema which is in the source, which is in the _schemas location. Okay. Now, on the next run, if the schema has changed, Autoloader updates the schema location, but it does not update the cached schema. So it says: the cached schema is this, and the stored schema is that, and then it throws the error, because they mismatch. And it does this just before the write-stream step: whenever we say .writeStream, it checks the schema before entering the writing zone, and then it throws the error. Then we simply need to rerun the query. That's it. The rerun updates the cached version of your schema. This is really, really deep. Okay. And everything is written in the documentation, so if you say I'm wrong, you would have to say the documentation is wrong. This was really, really deep. I spent the whole day on it; I still remember that day. I was like: what's going on, man? I thought: what's going on?

But you do not need to rack your brain, because you have subscribed to my channel. If not, bro, don't talk to me. And if you have, drop a lovely comment in the comment section right now; it really makes me feel happy. And do you want me to feel happy? If yes, just drop a lovely comment. What are you waiting for? So what will we do? We will upload one more file here, in the destination... sorry, in the raw zone, in the autoloader folder. Let's upload the third file, with the updated schema.

Let me just click on Upload. Okay. Now let me run this query, the writing part, and this is again the streaming initialization. Let's see how many records we have; obviously we are expecting idempotency. Let me click on this graph, because I really like watching it. So it has already processed the data. And why didn't we see any error? Because we just refreshed the cache by rerunning this, and this time the evolution mode is rescue. So let's see: do we have anything in rescue? Simply run this, and it should show 226 records this time instead of 126; obviously we have about 100 records more. And this time we have rescued data. Again, just common sense: this column will only be populated for the new records, because for the previous records there is nothing to rescue. So from now on, if we receive any file with new columns, those values will simply land here. And what is the new column? It is called return_flag. That's right, I added return_flag. So now a quick question for you: let's say I have everything in place; if I now add a new file with yet another new schema, will this query fail? Yes, and now you know the reason: because the schema is cached and we have not updated it. Okay. So how can we tackle this scenario in production? Obviously, we need to use something like a try/except, in which we can simply say: hey, if it fails, rerun it, because we just need to refresh the cache. Use Python; Python is made for this. Okay. So now you know everything about the data processing side.
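That production pattern can be sketched as a small retry wrapper. The names are mine and the error matching is deliberately loose; real code would match the exact exception class your runtime raises.

```python
# Retry-once wrapper: under addNewColumns, the first run after a schema
# change fails with an unknown-field error, and a plain rerun succeeds
# because the restart re-reads the (already updated) schema location.
def run_with_schema_retry(start_stream, max_retries=1):
    attempts = 0
    while True:
        try:
            return start_stream()  # e.g. a function that builds and starts the stream
        except Exception as exc:
            attempts += 1
            # normalize so "UnknownFieldException" and "UNKNOWN_FIELD" both match
            is_schema_error = "unknownfield" in str(exc).lower().replace("_", "")
            if attempts > max_retries or not is_schema_error:
                raise
            # fall through: rerun so the cached schema is refreshed
```

Anything that is not a schema-evolution failure is re-raised immediately, so genuine bugs still surface on the first attempt.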

Okay. Now we need to cover something amazing, which is a newer feature in Databricks. It is called volumes. What is that? Let me show you. Now it's time to talk about another category of questions, and it's more about managing and governing files. What does that mean? Let's say in your interview the interviewer asks: hey, every time, you are focusing only on tables. Whenever you use a catalog and a schema, you are only creating tables, right? Is there any way to actually govern files directly, instead of tables, using your schemas and catalogs? Yes: the answer is volumes. We have something called volumes, and don't worry, I'll tell you everything about them. Simply click on Create and click on Notebook, and let's create a new notebook; I will simply call it notebook 3. By the way, I had already recorded this particular lecture and the next one as well, and then realized that the mic was not working. Wow. Wow. Wow.

Simply pick this cluster. So I'm re-recording, just for you; don't worry. So, you now know that with the help of volumes we can do a lot of stuff: we can manage files and do a lot of things. But how does it work? To explain this, so that you can answer everything about this topic efficiently in your interviews, go to your containers. As I just mentioned, I already recorded this once, so I'm re-recording it. Basically, go to your destination container and simply create a folder, with any name. Makes sense? Then go inside it and create another folder. Then upload the file that we are currently using. It is a Parquet file, or you can upload any file. Now we will go to Databricks, and first of all let me make some space. Okay.

And let me simply write: volumes. So this is a volume. Now, as you can see, I want to govern that particular folder, which is called "my volume". For now, if you click on this catalog and then on this schema, you will only see these two tables. That's it; we do not have anything else. So we can create something called volumes. Let me show you volumes in Databricks and how to create them. Okay, simply click on this documentation. Basically, we again have two types of volumes, managed volumes and external volumes, similar to tables. And you will feel that a volume is actually a table, but it is not a table: it behaves like a table, but it is files. And you will query it the same way: catalog, schema, and then the volume name. Okay. So first let's create an external volume, because we know we have data in our external data lake, and then we will create our managed volume. The code follows exactly the same pattern; it is simply CREATE EXTERNAL VOLUME and then a LOCATION. That's it, similar to your table creation. Okay.

Now let me do that. I will simply say CREATE VOLUME, and then obviously the volume name, which will be db_catalog.db_schema.my_volume. I want to create the volume with this name, and for the location I will pick abfss://destination@dbinterviewlake.dfs.core.windows.net. Perfect, very good. And within this, I have the folder called my volume. Perfect, simply run this. Wow, what is this? Oh, I forgot to mention EXTERNAL. Run it again. Because when we create an external table we do not actually need to write EXTERNAL (by default, with a location, it creates an external table), but here we do need to put EXTERNAL. Okay. So now just refresh this catalog, and you will see something called Volumes. See, now we can actually see this volume, and if you click on the dropdown you will see that folder, the parquet data. That means we can now govern these files the same way we govern our tables, and this is really, really nice. By the way, in Microsoft Fabric as well we have something called Files; this is quite similar to that.

Okay, so this is really, really nice. And now you will say: hey, for tables we simply use SELECT * FROM table_name, but what do we use for a volume? What is the way of querying it? To query a volume there is a special path structure given in the documentation, and I will also show you the simple way, don't worry. First of all, you need to type SELECT * FROM, then the format, then the volume path: /Volumes/, then your db_catalog, then your db_schema, then your volume name, which is my_volume, and then you provide the folder path. My path is... oh sorry, not "my volume", it's the parquet data folder. Perfect. And obviously you need to say parquet; let me add it here, parquet dot, because our data is in the Parquet format, so we have to tell it that. So now you'll ask: hey, what's the difference between this and querying the Delta files directly by path? First difference: those files cannot be governed, while these files can be governed. Second difference: that method only works well with Delta, while this way we can use any file format. But that's not really the main difference; the main difference is that we can actually govern. Okay, so very good.
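The two query shapes being compared can be put side by side. A sketch using this demo's names (db_catalog, db_schema, my_volume, and the parquet data folder rendered here as a hypothetical parquet_data path segment):

```python
# Direct-path query: works well for Delta, but the files aren't governed.
direct_query = (
    "SELECT * FROM delta."
    "`abfss://destination@dbinterviewlake.dfs.core.windows.net/autoloader/data`"
)

# Volume-path query: any file format, and the files sit under Unity Catalog
# governance; the format prefix (here parquet) tells Spark how to read them.
volume_query = (
    "SELECT * FROM parquet."
    "`/Volumes/db_catalog/db_schema/my_volume/parquet_data/`"
)
```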

So now you'll say: okay, okay, we have understood external volumes. What about managed volumes? Very good. So let's create a managed volume: CREATE MANAGED VOLUME, and it will be my_managed_volume. Okay, and this time I will add db_catalog, obviously, then db_schema. Let's run this. "managed"... wait, wait, wait. Oh, so we do not need to write MANAGED. Bro, what is this way of doing it? You should keep it consistent: managed or external. What is this? So, okay, this is done. Let me refresh this... it is already refreshing for me.
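That asymmetry (EXTERNAL is a required keyword, while MANAGED is not a keyword at all) can be captured in a small helper. A hypothetical sketch; the names follow this demo.

```python
# External volumes need the EXTERNAL keyword plus a LOCATION;
# managed volumes are created with plain CREATE VOLUME and no location.
def create_volume_sql(catalog, schema, name, location=None):
    fqn = f"{catalog}.{schema}.{name}"
    if location is not None:
        return f"CREATE EXTERNAL VOLUME {fqn} LOCATION '{location}'"
    return f"CREATE VOLUME {fqn}"
```

In a notebook you would pass the result to spark.sql; the helper just keeps the two statement shapes in one place.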

Nice, nice, nice. Because I will show you how you can upload files into the managed volume. Volumes are a newer addition in Databricks; they were not there if you compare with the previous versions. And I don't know why it is taking so long just to refresh this; obviously it is a managed volume, so Databricks needs to take care of all the data itself. So now you'll ask: hey, where will the data go? Obviously to the metastore location, because we are not providing any location to the catalog. So under db_schema you can now see two volumes: this is my_managed_volume and this is my_volume. One is an external volume; the other is an internal, or managed, volume. So now let's say you want to put some data in it. How can you do that? You can simply click on these three dots and click on "Upload to volume". That will go straight to this particular location. And where will a managed volume be saved? The same way as managed tables: if there is a managed location on the schema, it will take that; if there is one on the catalog, it will take that; and if there is one on your metastore, it will take that. That's it. Exactly the same.

Okay. Because datab bricks do not hold anything.

No. Yeah. In the free versions, it does just for your uh sake. That's it. And

now the easiest way to put the you can say location. So let me just show you

say location. So let me just show you like how we can just upload the data because you will complain hey you didn't show us like how we can upload. Simply

click on these three dots and I will simply first create the directory in this because it's always good practice to create the directory. I will simply say pocket data. Click on create and

perfect. Now let me just click on these

perfect. Now let me just click on these three dots upload to volume and let me just upload this. Perfect. So same way I can just query this and I will just show

you the easier way and easier way is simply say select a str

from parket dot tick and now just click on this and then just click on these two arrows it will simply insert this location for you here. So you simply do

not need to worry at all. Simply run

this and you will see that data is here.

And where data is residing in the manage location. Where is that location? We

location. Where is that location? We

already know in the meta store container.

See this is my volumes. See in the meta store. This is pocket data and this is

store. This is pocket data and this is my file. Tada.

file. Tada.
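Pulled together, the clicks and queries above map to a few SQL statements. This is a hedged sketch: the catalog/schema names follow the demo (db_catalog, db_schema), while the external path, directory, and file names are illustrative, not confirmed from the video:

```sql
-- Managed volume: no LOCATION clause, so files land under the managed
-- storage location (schema -> catalog -> metastore, whichever is set first).
CREATE VOLUME db_catalog.db_schema.managed_volume;

-- External volume: you supply the LOCATION yourself (path is hypothetical).
CREATE EXTERNAL VOLUME db_catalog.db_schema.external_volume
LOCATION 'abfss://destination@dbinterviewlake.dfs.core.windows.net/volumes';

-- Query a parquet file uploaded into the managed volume
-- (directory and file names are assumptions for illustration).
SELECT *
FROM parquet.`/Volumes/db_catalog/db_schema/managed_volume/parquet_data/sample.parquet`;
```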

See, this is the illustration that you need in the interviews, and you need to be really, really confident. So as you can see, the data is written here: this is just the catalog, then we have the volume — this is the volume's ID — and this is the folder. That's it. Okay, so just be aware of this in the interviews.

So this was all about our volumes, data processing, and how we manage different data types. Now let's cover some questions related to optimization techniques — or actually, forget optimization techniques for a moment. First we should know how to tackle problems related to rollbacks. Let's say you are in a prod environment and you just did some "wonders", and now you want to go back to the previous version. So you need to know data versioning, you need to know time travel, and all these features are really, really important, because you cannot work without them — you will make mistakes, and you need a way to roll back to a previous version. Don't worry, I'll show you. These are the hottest topics right now — time travel, data versioning, all of these — and they strongly align with the open table formats as well. So I will simply go to my workspace, click on create, and create a new notebook. Okay. And simply name it Notebook 4.

Okay. Now, first of all, let's connect — and I don't think we need this cluster anymore, so I will simply say, hey, terminate this cluster. I don't know why they do this; serverless was better. They should just add those capabilities — like streaming — to serverless compute. Yeah. So our notebook is connected to serverless. Okay. Now let's write our heading, and it will be "Data Versioning and Time Travel". So for that, let's create a table.

So — this lecture was recorded once before, and there was a table, but the mic was not working. Okay, don't worry, let's create another table. So what do we need to do? We will first create a table, then insert some data, then make some mistakes intentionally, and then try to roll those changes back to a previous version of the Delta table. Makes sense? Very good. Let's do that.

I will simply say CREATE TABLE db_catalog.db_schema.my_delta_table, with columns id INT, name STRING, and salary, also INT. Okay. And this will be an external table, so I will give it a location: abfss://destination@dbinterviewlake.dfs.core.windows.net/ and then a folder, let's say data_delta — it's a good name. Okay, let's create this table. And now quickly let's insert some data — I'm just recreating the exact environment your interviewer will give you. You have this table and you want to insert some data, so you simply say INSERT INTO db_catalog.db_schema.my_delta_table VALUES, and insert some rows: (1, 'Nora', 1000); then a second one, any name — (2, 'Rahul', 900) — why so little?; and then (3, 'Sophie', 2000) — who is Sophie? Let's say she's earning more. Let's insert these records.

So your interviewer will say: hey, you have this Delta table — "Delta table" is a keyword here, so pay some attention. Okay, you have this table, you have this data, and you want to delete Rahul. Bro, Rahul is already earning the least — why do you want to delete that person? No personal hit, okay, just a hypothetical table. So we want to delete Rahul, ID equals two. And the interviewer will say, hey, just write the code, and you will write: DELETE FROM db_catalog.db_schema.my_delta_table WHERE id... and they will say, hey, stop — run this command without the WHERE condition. And you will think, bro, are you mad or what? No, obviously you will not say that; you will simply say okay and run it. And now the interviewer will say, hey, what have you done? Obviously you want to reply, bro, you just told me to run this — no, don't talk to him or her like that. The intent behind this is that the interviewer will now ask: hey, bring me back the records you just deleted. And if your answer is, "okay, let me just insert the records again" — no, no, that is not an acceptable answer. Okay.

So, what's the solution? The solution is that you need to know how to time travel. And in order to perform time travel, you need to know how to find the versions of the table. There is a very simple trick: we have something called DESCRIBE HISTORY. Pick SQL as the cell language, and then you just need to give the table name: DESCRIBE HISTORY db_catalog.db_schema.my_delta_table. That's it — this shows you all the versions of the table, and that's what we want. Perfect. So we have version zero, version one, version two. That means we want to go back to version one, right? Because version two is the delete operation. Makes sense.

So how do we go back to a previous version — or any particular version? We have something called the RESTORE command. We simply say RESTORE TABLE — and the table name is db_catalog (not my_catalog!) dot db_schema dot my_delta_table — and then TO VERSION AS OF 1, because I can pick any version of my choice, and I know version one is exactly the one I want. And this way your data gets restored. Yes — see, it is done. Now if you query the table, you should see all the records again. And now you can easily reply to that interviewer: hey, here is your data, the task is done — where's my offer letter?

So, obviously, bro — obviously — just show some confidence. It's always good to be overconfident in interviews. Trust me, it's always better to be overconfident than underconfident. Just mark my words, okay? Just be overconfident in life. There are scenarios where you need to be underconfident, but in the corporate world — in interviews — just be overconfident. Just behave like you know everything. Why? Because there is no one who knows everything. So behave like you know everything, and along the way you will actually come to know a lot. Okay — philosophy classes along with the data engineering classes.

Okay. So this was all about the data versioning and time travel interview questions that can be asked of you, and now I hope you can answer any question related to this.
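The whole rollback drill above, end to end, looks roughly like this in SQL. The table and schema names follow the demo; the ADLS path is an approximation of what was spoken aloud, so treat it as illustrative:

```sql
-- External Delta table used in the demo (path reconstructed, not verbatim)
CREATE TABLE db_catalog.db_schema.my_delta_table (
  id INT, name STRING, salary INT
)
LOCATION 'abfss://destination@dbinterviewlake.dfs.core.windows.net/data_delta';

INSERT INTO db_catalog.db_schema.my_delta_table
VALUES (1, 'Nora', 1000), (2, 'Rahul', 900), (3, 'Sophie', 2000);

-- The "mistake": a DELETE without a WHERE clause wipes every row
DELETE FROM db_catalog.db_schema.my_delta_table;

-- Find the version to roll back to (every write creates a new version)
DESCRIBE HISTORY db_catalog.db_schema.my_delta_table;

-- Version 2 is the delete, so restore version 1
RESTORE TABLE db_catalog.db_schema.my_delta_table TO VERSION AS OF 1;

-- Time travel also works on reads, without restoring anything:
SELECT * FROM db_catalog.db_schema.my_delta_table VERSION AS OF 1;
```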

Now let's talk about some data optimization questions that are being asked right now. Why? Because with the rise of data, every organization wants to optimize performance. Earlier these concepts were new, so teams were going a little crazy, testing so many new things. Now they have the data, they're running the queries, and the queries are running really, really slowly. And if you know how to optimize those things, you are going to be an asset for them. They cannot ignore you. They cannot say, "hey, we cannot hire you," because they are looking for exactly the person who can optimize their existing solutions. Trust me, every single organization is facing optimization issues right now: data is huge and growing rapidly, the solutions are new, concepts such as the lakehouse are really new, and the engines are still developing. If you know these hacks, bro, trust me, this is kind of an X factor. Okay, let's talk about those questions. Let me create a new notebook — or let's just continue in this notebook, not a big deal.

Now let's say the interviewer says: hey, you have this table, you need to optimize its performance, and you have so many files under the hood for this table. How can you do that? Basically, you can answer that we have several options available to optimize tables. The first approach is the OPTIMIZE command itself — and you need to explain why we need it, too. We simply say OPTIMIZE db_catalog.db_schema.my_delta_table (pick SQL). So what does this command do? It performs a compaction — a coalesce-like operation — on your parquet files. Let's say you have one, two, three, four, five... or really many small files. When you run the OPTIMIZE command, it merges those small files into files of the ideal size, which is around 1 GB. So this is one optimization.

Okay, the second optimization is called ZORDER BY — Z-ordering, basically. Now what is this Z-ordering? Z-ordering cannot be applied independently; we have to use ZORDER BY together with the OPTIMIZE command. So we say OPTIMIZE db_catalog.db_schema.my_delta_table ZORDER BY (id) — or any column; it totally depends on the situation, and we usually put the column we use to prune our data in ZORDER BY. So what does it do? Same example: you have so many files — that is the issue in companies right now. It will still create ideal-sized files, that's true, but along with that it applies a sorting — a clustering — to the data. And what happens when the data is sorted? There is a concept called data skipping: the engine can skip some files entirely based on the data. Let's say you are reading some IDs, and those IDs reside only in this particular file and not in that one. Then it will read only this file and simply skip the other one. That improves the performance. This should be your answer, delivered with confidence, bro. Okay. And now the interviewer can ask a follow-up question.

He or she will say: how does it decide that it can skip this file — that this ID does not reside here? It's because of the statistics of the first 32 columns. Basically, Delta tables have a feature by which, by default, statistics are collected for the first 32 columns. So your id column should be within those first 32 columns, and if it is, the statistics of that column — the minimum value, the maximum value, everything — are stored for each file. So the engine can simply see: hey, this is the minimum value and this is the maximum value for this file, so we do not need to go to that particular file at all. So just try to explain everything — and I would say, even if the person is not asking these things, if you know them and you are confident you can explain them, just try to put in some extra points and highlight that you know much more than what they are looking for.
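Spelled out, the two commands discussed above look like this — same demo table, and the choice of id as the Z-order column is situational, not a rule:

```sql
-- Compact many small parquet files into fewer, ideally-sized ones
OPTIMIZE db_catalog.db_schema.my_delta_table;

-- Compaction plus Z-order clustering on the column you filter on most,
-- so per-file min/max statistics let the engine skip files (data skipping)
OPTIMIZE db_catalog.db_schema.my_delta_table
ZORDER BY (id);
```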

Okay. If the person just asks, "hey, what is the ZORDER BY command?", you will tell them what ZORDER BY is — and then steer the topic and say, hey, this works because Delta calculates statistics for the first 32 columns, due to which data skipping happens, due to which it reads the data faster. The person will think: this person knows really, really well, and generally has deep skills. Okay — so again, "Ansh, are you a philosopher?" No, no, no. I was just about to talk about TEDx... okay, okay, okay, I should not talk about that. So, this was all about ZORDER BY — or, you can say, the OPTIMIZE command, the data optimization techniques we have within Databricks.

And there's one more, a new one, called liquid clustering, which creates dynamic clusters on top of your data according to the columns and your query behavior. Let's say you mostly run a kind of query in which columns A, B, C are used, obviously in a pruning condition. It will try to create clusters of that particular data, so whenever you query it, the engine simply goes to that cluster and grabs the data. So it is dynamic clustering. Okay, let me take you to the documentation — search for "liquid clustering Databricks".

So, see: Delta liquid clustering replaces table partitioning and ZORDER BY. That means we do not need to use ZORDER BY and the OPTIMIZE command when we are using liquid clustering. Then: liquid clustering provides the flexibility to redefine clustering keys — obviously, you can just alter the table to do that. And liquid clustering applies to both streaming tables and materialized views — that is important. And Databricks recommends using Runtime 15.2 and above with liquid clustering. It was in preview; now it is generally available. And obviously there are some recommendations.

And how can we enable liquid clustering? It is very easy: it can be enabled at table creation time — you simply need to write CLUSTER BY and then put the columns. As you can see here — let me show you — yeah, see: CREATE TABLE, then simply put CLUSTER BY, that's it. And this particular example is a CTAS command. If you don't know, CTAS — CREATE TABLE AS SELECT — is just like a CREATE TABLE with a location: you run it, and it creates the table at that particular location. Normally the table would be empty, but because you write AS SELECT, it puts the data into that location as well. Because this is an external table, it first moves the data to the location and then creates the table on top of it. Again, extra knowledge, but it is really handy.

And this is the code if your table is already created and you have not yet enabled liquid clustering — you can enable it with an ALTER TABLE as well. And there is a new feature, in public preview, called automatic liquid clustering: you do not even need to define the columns; you simply say CLUSTER BY AUTO, that's it — see, it will automatically pick the columns; we are not hard-coding anything. So this is all about some of the optimization techniques that we have.
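As a sketch of the three variants walked through in the docs — CLUSTER BY at creation time (here via CTAS), enabling it on an existing table, and the newer automatic mode. Table and column names are illustrative, reusing the demo's objects:

```sql
-- Liquid clustering declared at creation time, via CTAS
CREATE TABLE db_catalog.db_schema.clustered_table
CLUSTER BY (id)
AS SELECT * FROM db_catalog.db_schema.my_delta_table;

-- Enable liquid clustering on an already-created table
ALTER TABLE db_catalog.db_schema.my_delta_table
CLUSTER BY (id);

-- Automatic liquid clustering (public preview): let Databricks pick the keys
CREATE TABLE db_catalog.db_schema.auto_clustered
CLUSTER BY AUTO
AS SELECT * FROM db_catalog.db_schema.my_delta_table;
```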

These are basically the latest questions, and I personally felt they were not actually being covered anywhere, because the topics are really new and the situations are really new — so whatever needs to be covered, we will cover. That is why I decided to cover these particular questions. Now you might think: okay, these questions were more inclined towards Databricks. That's correct — but whenever you sit in a Databricks data engineer interview, it is very obvious that the interviewer will ask PySpark-related questions as well; that is common sense. The thing is, everyone is already covering PySpark questions — that's good — but the Databricks questions were missing. So I decided: let's take the responsibility, and let's help you actually crack the interview. And now you are really good with Databricks interview questions. Obviously, in the future we'll cover more and more, because as I told you, Databricks is evolving really rapidly. So what's your next step? You are all set for Databricks — you know almost all the functionalities, and you have tackled so many real-time scenarios as well: schema evolution, incremental loading, so many things.

Right now your next task should be preparing for the PySpark questions, because in Databricks interviews, 60% — or let's say 50/50, or even 40%, and 40% is not a small number — will be from these Databricks areas, because these areas are really new, and the rest will be PySpark, including the PySpark coding round. How can you prepare for that? I have created a dedicated video on exactly that. Let me show you — let me go to incognito mode and search YouTube for "Ansh Lamba PySpark interview questions". Yeah, this one. This video covers all the PySpark coding questions using PySpark functions, window functions, ranking functions — or you can say Spark SQL functions — everything. It is pure coding, and again with dedicated real-time scenarios in the coding round as well. If you want to enjoy and learn a lot, simply go there and cover all those questions — and obviously the Databricks questions are also done. You are all set, trust me.

And just carry one more thing into your interviews: confidence. Confidence. Okay? Just go in there and say, "I'm going to kill it in this interview." Just make up your mind, okay? And trust me, you will be clearing your interview this year, and I'm just waiting for your message saying, "Yay, I have cracked the interview." I feel so happy when you send me messages, or comment on the video, that you have cracked the interview. So I'm just waiting for your comment — drop a lovely comment for now saying you have learned a lot, and then once you crack the interview, come back to this video and comment. Why? It's our love, right? Okay.
