Fundamentals Of Data Engineering Masterclass

By Darshil Parmar

Summary

Topics Covered

  • Data Beats Intuition Every Time

Full Transcript

In this three-hour Data Engineering Master Class, you will learn about what Data Engineering is, the Data Engineering life cycle, data generation and storage, database management systems, data modeling, SQL versus NoSQL, data processing systems like OLTP versus OLAP, ETL pipelines (Extract, Transform, Load), data architecture, and I will give you a complete guide on how to build

the architecture from scratch. We'll cover data warehousing, dimensional modeling, slowly changing dimensions, data marts, data lakes, data lake versus data warehouse, big data landscape, data engineering on cloud, top AWS services you should learn for data engineering, and we will understand

real-world case study architectures on AWS, GCP data services, and Azure data services. We will

also explore the modern data stack, important tools for data engineering that you should learn, understanding Python and SQL for data engineering, understanding data warehouse tools like Snowflake, BigQuery, understanding Apache Spark with Databricks, understanding Apache Airflow and Apache Kafka for data engineering, and many more things. So sit tight, get your notebooks, pen and paper, and start taking notes so that you can remember this for a longer period of

time. And before you move forward, make sure to hit the like button and subscribe to the channel if you are new here. Let's get started with the Fundamentals of Data Engineering Master Class.

The Fundamentals of Data Engineering Okay, we'll start by understanding what Data Engineering is because if we want to understand different fundamental concepts, we need to have our basics clear. Now, if you have been following me on this channel for the past few years, then you might already know what Data Engineering is because we keep talking about this.

But if you're seeing me for the first time or if you're just getting started with Data Engineering, it is important for you to understand what Data Engineering is. So let's start with that.

Okay, now we already know, right? Everything mainly happens on the internet, okay, because this is where Data Engineering happens, on the internet. All of these are the businesses.

Okay, the businesses are, let's say, Amazon. Okay, what is the business of Amazon? Amazon

is an e-commerce company. What do they do? They give you the ability to purchase products online, okay, from your home. Now, this is the business of Amazon. What is the business of, let's say, Netflix? Okay, the business of Netflix is to give you exclusive content. You buy the premium, and they give you the exclusive content. On top of that, they also give recommendations and all of the other things. Okay, this is the business of Netflix. What is the business of, let's say, Zomato? Okay, this is a food delivery app in India. From your home, you can order food, okay, and the order will get delivered to you within, like, half an hour to an hour. Okay,

there are multiple companies doing businesses on the internet. Now, all of these companies, okay, have certain goals and visions for the business, right? They want to understand the customer.

Why do they want to understand the customer? So that they can provide better services. Okay,

they want to increase their profit. Okay, I want to increase my profit. Okay, this is one of the goals, that I want to increase my profit, understand my customer. They also want to detect some of the bottlenecks they might have in the business, okay, so improve the process, okay,

improve the business process. And like this, there might be multiple goals a company might have.

Now, if they want to achieve all of these goals, they need to understand how these things are happening, and one of the best ways companies can do that is by understanding the data. Now,

most of the time, all of these decisions are taken based on assumptions, right? A business person, let's say, who is working in the shipping department of Amazon, okay, is actually working on the ground and has knowledge about this particular segment—the shipping, okay? Now, he already has

some business knowledge to take direct decisions on this particular segment of the business, okay, because he's an expert. He's been working in this particular field for, like, 15 to 20 years, so he understands what might be the problem. But a lot of times, even as humans, we might miss out on some of the information that we don't know. And the best way to understand all of this information

is by understanding what the data says. You can assume certain things, and you can be right for some time, but if you want to be right most of the time, the best way is to be sure about it.

The only way you can be sure about all of these things is by understanding what the data says, okay? And this is where the entire picture of Data Engineering, Data Science, Machine Learning, AI, all of these come into the picture. So let's start by understanding all of these things one by one.

Okay, I just painted you the picture. The reason we are doing Data Engineering and Data Science in the first place is that companies want to understand, okay? They want to understand, they want to improve their business, they want to provide better services to customers, they want to, like, remove the challenges they might have in the business by using the data because data gives

you the direct answer. It gives you the factual understanding rather than you just assuming things, okay? So this is the understanding of why we need the data-driven system.

Now, how do all of these things happen? Okay, we already know this, okay? At the end, we want to have the final outcome. It can be—we already understood, right?—improve business revenue, and it can be, like, recommendation and overall things.

So these are my business goals, okay? Every

single thing you do in your data ecosystem, or in general in the engineering ecosystem online, is for this only, okay? Anything you do should create value for the business. Even if you use, like, the most advanced algorithm, if it doesn't impact the final outcome of the business, it is completely useless, okay? It should help the business in some way; it should help the business

to save costs, it should help the business to improve the process, it should help the business to understand the customer—whatever it can be. If it can provide the final value, then it is useful; otherwise, it is completely useless, okay? So this is very important—everything that you do should create value for the business. If this is clear, let's start by understanding

the entire pipeline of Data Engineering and the entire pipeline of the overall internet system, okay? Before we just understand the Data Engineering life cycle, we need to understand how different things or different fields come together to make the complete system, okay? So we have the company—this is my company over here—which is, let's say, this is my company, okay? And the company is doing, like, it can be Amazon or whatever. We'll take one example, okay?

Now, at the front end, we usually have the application, okay? This is my application, okay?

This might be my mobile—there's a button, and this is my application. And the user interacts with the application, okay? I'm the user—I have Instagram installed on my phone, I have Facebook, I have whatever, okay? I might be using LinkedIn—I have the application. And whenever I interact with this application, data gets generated, okay? Whenever I click on any application, whenever I, like, like something, when I comment on something, every single thing that I do, even if I go to Amazon, if I click on a certain product, every single thing, okay, every single thing generates data,

okay? Now, all of this data will get stored, okay, inside the DBMS, okay? These are called database management systems. Now, there are different types of database management systems we will understand in this video, but just try to understand every single thing that we do gets stored inside a

DBMS, a database management system, okay? Now, these systems are usually designed for storing this kind of data, right? You can store this data easily. There is something called CRUD operation, okay, which is called Create, Read, Update, and Delete. We'll understand

that later in the video, but all of these databases are called relational databases, specially designed for storing all of these things. Now, once you store all of these things, alright, we have the data available. Now, data might be coming from multiple places, but let's understand—from the application, our data gets stored inside the DBMS, and from there, our entire

Data Engineering pipeline starts, okay? The Data Engineering happens, Data Science happens, Machine Learning or Data Analytics might happen over here, and then there might be a final dashboard, okay? There might be some dashboard or some charts available here, okay? Businesses use this, or there might be a machine learning model, okay? So this is like a robot, okay? I'm bad at drawing, but this is one of the robots or machine learning models that might help in understanding all of

these different things, okay? Just trying to, like, just trying to paint a simple picture of the entire ecosystem—there are many different things that go here, okay? The application development, there might be DevOps who might be deploying the application, but in general, from application to DBMS, whenever we have any data available, okay? This is where internet companies come into the

picture because you can store all of the data inside the DBMS, database management system, okay? And then you can utilize all of this data for this kind of workload, okay?
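The CRUD operations mentioned above (Create, Read, Update, Delete) can be sketched with Python's built-in sqlite3 module; the `orders` table and its columns here are made up purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB just for the demo
cur = conn.cursor()

# Create: the application writes a new row for every user action
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, status TEXT)")
cur.execute("INSERT INTO orders (product, status) VALUES (?, ?)", ("headphones", "placed"))

# Read: fetch the row back
row = cur.execute("SELECT product, status FROM orders WHERE id = 1").fetchone()
print(row)  # ('headphones', 'placed')

# Update: change the order's status
cur.execute("UPDATE orders SET status = ? WHERE id = 1", ("shipped",))

# Delete: remove the row
cur.execute("DELETE FROM orders WHERE id = 1")
conn.commit()
```

This is exactly the kind of workload a transactional DBMS is built for: many small creates, reads, updates, and deletes.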

Once you have the data generated, this is where the Data Engineering starts, because without data, you don't have Data Engineering, Data Science, Machine Learning—because they work fundamentally on the data. If you have the data, then you can do something about it; if you don't have the data, then you can't do anything about it, okay? So the fundamental concept of a data-driven system

is having a data generation in place, and this is what the data generation looks like, okay?

You have the application, the data is getting generated, there might be other things such as sensor data, okay? A truck is moving from one location, okay? This is one of the trucks, okay, that is moving from one location, and it is going from location A to location B. Now, in between,

from B, it might go to C location, okay? Now, the truck goes from here to here to here. Now, we need to capture all of this data, and all of this data gets captured by the sensors, right? The truck might have sensors, just like we interact with the application. Just like this, we have the stock market data, we have data coming from numerous places, okay? So we understand how the data is getting generated, sent to the system, and all of the other things. So this is the fundamental concept of Data Engineering, which is where Data Engineering sits in the first place, okay?
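The kinds of generated data described above, an app click and a truck's sensor reading, usually arrive as small event records. A minimal sketch in Python; every field name here is invented for illustration:

```python
import json
from datetime import datetime, timezone

# A clickstream event from the application side
click_event = {
    "user_id": 42,
    "action": "click",
    "item": "product_123",
    "ts": datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc).isoformat(),
}

# A GPS reading from the truck's sensor
sensor_event = {
    "truck_id": "TRK-7",
    "lat": 23.0225,
    "lon": 72.5714,
    "ts": datetime(2024, 6, 1, 12, 0, 5, tzinfo=timezone.utc).isoformat(),
}

# Both typically travel as JSON over the wire before being stored
print(json.dumps(click_event))
```

Whatever the source, the shape is similar: who or what produced the event, what happened, and when.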

As we move forward, we will understand all of the different parts of Data Engineering individually, but just try to understand where Data Engineering really fits into the entire cycle, okay? It is

between the application/database side and the Data Science, Machine Learning, dashboarding side. So whenever your data gets generated, okay, it is over here—this is my application side, and this is my Data Science, Machine Learning, dashboarding side. Data Engineering sits in between. It is kind of like a plumber, okay? I'm connecting one thing to the second thing by transforming data and some of the other things that we will understand, and then I pass the data to the next end, okay? I get the data from one source, and I pass my data to the next source. How do Data Engineers do that? What are the different features, functionality, and frameworks they use? We will talk about all of these things one by one

in this video, so don't worry about it, okay? I hope you understood the basics until now.

Okay, so now that we understood where Data Engineering sits, what is the role of Data Engineers in this place? Because application developers, so we have the software engineers, okay? The general role of software engineering is to develop the app, web app—it can be a web application, it can be writing code, developing, or deploying some of

the things, okay? Then we might have the DBA. This thing can also be done by the software engineers in smaller companies, but if you're working in a big company, a DBA is a Database Administrator who designs and manages the database, right? They build the different tables, they build the different columns, and all of the other things. These are built by DBAs. Usually, Data Engineers can also do that, or the software engineers can also do that—depends on the company's

size and your job profile—but let's understand. We might have a DBA who will build a database, okay? So this guy will be writing a database. Now, who do we have? We have Data Engineers, okay?

Data Engineers. The roles of Data Engineers—there are many different things I'll tell you, but the main one is to write the ETL pipeline (Extract, Transform, Load). We have a dedicated section on this, but ETL is basically we extract data from one end, we transform that data, and we load that data,

okay? Then, it can be also building a database or a data warehouse, okay? Data Engineers can also do that, okay? They can build relational databases or dimensional modeling—we'll understand that. Working with big data, and processing all of this data using Spark, Hadoop, using different frameworks or Kafka, okay? To process batch data or real-time

data, we can also do that. Data integration, data integration—so again, data is coming from the API, data is coming from the sensors, data is coming from the RDBMS, so we want to integrate all the data, so Data Engineers have the responsibility. There are other responsibilities such as quality check of the data and governance, how to organize all of this data properly, so these

are the core use cases of the Data Engineers. Now, after that, we have Data Science people, okay? Data Science or Data Analysts, okay? Usually, the difference between Data Science and Data Analysts is basically that Data Analysts usually answer questions about what has happened in the past, okay? How can we, like, what was the revenue of this particular product last year compared to the last five years, right? They are trying to find the pattern from the past and find

some of the answers. The role of Data Science is to predict what can happen in the future, right? We did a product sale for this particular product X amount for the last one year—what will be the product sale for this particular product for the next six months? This is what Data Science answers, right? They try to predict what will happen in the future based on past patterns, and we have the Machine Learning Engineers who can basically automate all of the other things. So on

Amazon, we have the recommendation system, right? All of this recommendation system is done by Machine Learning Engineers. They deploy the machine learning models onto the production system so that a system can learn by itself and generate the right output for the user. So you

can predict what is happening inside your system, or you can predict how the users are behaving and recommend them the right information. Like on Instagram, you go to Reels, you see the right reels as per your interest, okay? They don't only recommend things based on your interest—they also recommend random things just to understand whether you like them or not. So they are just trying

to train the machine learning algorithm based on your usage on the application, okay?

Now, the difference between DS and ML is quite thin, okay? You might see a Data Science person might do the ML work, or an ML person might do the Data Science work, but in larger organizations, they might have individual work to do, okay? They have core responsibilities, but in smaller organizations, they might have to do all of these things by themselves,

so do not get caught up in the title like, "Oh, what does a Data Science person do? What does

a Machine Learning Engineer do?" Just try to understand their core responsibility from the top level. In the actual organization, when you go to work, okay, when you start working, you might have to do everything by yourself because the role is just a name, okay? But this is the core distinction between all of these roles. There are other roles such as DevOps, DataOps—these are just fancy names, but on a fundamental level, you might be doing similar work, okay?
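The ETL responsibility described earlier (Extract, Transform, Load) can be sketched as three tiny functions; the rows and the in-memory "warehouse" are toy stand-ins, not a real pipeline:

```python
def extract():
    # Extract: pull raw rows from a source (an API, an RDBMS dump, etc.)
    return [
        {"order_id": 1, "amount": "250.0"},
        {"order_id": 2, "amount": "99.5"},
    ]

def transform(rows):
    # Transform: apply business logic, e.g. cast string amounts to numbers
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(rows, target):
    # Load: write the cleaned rows into the destination store
    target.extend(rows)
    return target

warehouse = []  # stand-in for a real warehouse table
load(transform(extract()), warehouse)
print(warehouse[0]["amount"])  # 250.0
```

Real pipelines swap each function for a connector, a Spark job, or a SQL statement, but the extract-transform-load shape stays the same.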

So we understood what Data Engineering is, okay? The role of Data Engineering is to take data from one source, okay? It can be any data from, like, RDBMS, API, do some transformation, and pass this data to Data Science or Machine Learning guys so that they can build dashboards or they can, you know, build machine learning models. Now, all of these things that we do,

okay, there is a proper approach to it, okay? You can't directly get the data from one source and directly push it to the Data Science person—there has to be a step-by-step approach that is designed properly so that the entire pipeline that you generate has some purpose to serve,

okay? And this is what we will understand, okay? So this is what we call a Data Engineering life cycle. This is taken from the book Fundamentals of Data Engineering. I have recommended this book to so many people, and it is one of the best books if you want to understand the fundamentals of Data Engineering. A lot of the material that I have learned about the fundamentals is from that book, and some of the material I also added in this video, so you will get the understanding, okay?

So the first step here is data generation, okay? Now, this thing we already talked about, right? Data generation—data is getting generated from multiple places. We already know data comes from what? APIs, okay? RDBMS, it comes from sensors, it comes from analytics like Google Analytics or all of the other things, okay? So data is coming from multiple places. Now,

all of this data that is coming from these different places, we need to aggregate this data together and ingest it into the system, okay? Now, this is what the next step is over here. Let me just remove this, okay, this part, okay? The data ingest, okay? We are getting the data generated from one place, then we need to ingest this data to one particular system. The

ingestion can be setting up the connection with the API, setting the connection with the RDBMS, building a system that can read the data from sensors, and then automatically ingest this data into our Data Engineering system, okay? We will understand what this entire Data Engineering system feels like when we actually look at the project example, but these are the fundamentals,

okay? We have the data generation, and that data is getting ingested into some kind of system, okay? And we just build a programmatic connection between this. So whenever any data gets added to the RDBMS, okay, it should automatically get ingested into our system. There are

multiple approaches to do that, but these are the fundamentals. Once the data is getting generated, we ingest this into our system. Then the data that got ingested will get stored, okay? There's some

kind of storage layer we have, so every data that is coming from multiple places, we have to store all of this data at some location, okay? It should get stored at some location at least, so this is where the storage happens, okay? We are storing this data at some location. Now,

between this ingestion and the serving, okay? Serving is basically we are serving our data to machine learning, analytics, and reporting, okay? The thing that we understood over here, okay? After the Data Engineering happens, we have Data Science, Machine Learning persons who are building a dashboard or who are building a machine learning model. The same thing here is that this is the part, okay? The Machine Learning or the analytics—we have reporting, dashboarding—all of these things happen over here. This is where the data is ingested, and this is where the data is

getting stored, okay? Between that is the core of Data Engineering that is called a transformation.

Transformation is basically the set of business logic, alright, that we need to convert our raw data—this is usually what we call raw data because it is coming from the system, okay? So this is my raw data. This is my raw data, okay? Here, this is my transformed data when we serve this, okay?

When we serve this, this is the transformed data, this is the raw data, and everything that happens between this and this is called a transformation. Transformation is a set of business logic, and it can be anything, okay? So consider this example. Let me just explain this part. Now, we have data

coming from the API, okay? I have data coming from the API, and I have data coming from the RDBMS, okay? Now, in both of the data, I have a date column, okay? I have a date column here and a date column here, and the format of the date in the API is YYYY-MM-DD, okay? It is like 2024-06-01, okay?

The first of June, 2024—the date is something like this. Now, in RDBMS, okay, the date format is like MM-DD-YYYY, okay? Something like 06-01-2024, alright? Now we have a date coming. Now, what we need to do is we need to join this system because at the end, we need to find the analysis. There might be some ID column here, okay, and there might be one more ID column available over here. We need to join these two data together. Now, when

do we join it? Okay, when we join data coming from the API, this might be, let's say, product date, okay? This is a product date, okay? And this is an order date—it can be anything like this, okay? Now, when we join this information, we need to transform this data into one particular logic, alright, that can be formatted as this particular format or this particular format—it can be anything. This is the decision that business people or you can take, like I want to transform this data based on this format only, so any information that is coming from any other sources, okay, it should be transformed into the YYYY-MM-DD format for the date, okay? So if we are getting this data after the transformation block, okay, so we will have our transformation block here. I

will have my transformation block, okay? This data will go inside this and this, okay? Transformation

can be done by Python, PySpark, Scala, whatever it is, okay? We will understand all of these things, okay? How do we do the transformation? And at the end of this, I will get this data into YYYY-MM-DD, okay? The date values will be converted into one single thing. This is what we call a transformation, okay? This is one example, but transformation can be anything, okay? It can be removing duplicate values, it can be removing the null values, okay? It can be aggregating the data, it can be merging two data sets, it can be generating a new column based on the two different

columns, concatenating—it can be anything, okay? It can be filtering—whatever it is, transformation is basically a set of business logic that you have to write inside the code or inside the SQL query or use any tool to do that to generate a suitable outcome so that the Data Science person or the Machine Learning person can build a model or build a dashboard to find the relevant answer,

okay? So as a Data Engineer, my role is to organize the data into the proper structure so that we can easily visualize this or we can easily understand what is going on inside the data, so that is my job. I want to make the data into the proper structure, and that usually happens in the transformation layer, okay? Now that we understood what is going on, we are

getting data generated from one source—it can be many sources, APIs, sensors, whatever, okay? All

of this data is getting ingested into one system. Ingestion basically means making a connection in such a way that any time a new data is getting generated, we automatically fetch this data, okay, and store it inside our storage system, okay? This is what we understood. Now, once we have this data available, we need to make sure the data that is coming from all of these different systems

passes through a certain transformation logic so that our data gets structured. Once that is done, we serve this data to a user. A user can be a Machine Learning Engineer, a Data Analyst, or some dashboard expert—it can be anything, okay? They are using this data so that they can understand, build machine learning models. This is the entire Data Engineering life cycle that we are

talking about, okay? There are some undercurrents that we will understand in further videos, so don't worry about it, but I hope you understand the complete Data Engineering life cycle from a fundamental point of view because this is really important, right? You can use any tools, right, to do all of these things, but if you understand the fundamental side of it,

then it doesn't matter which tool you use—you already know what needs to be done, so you can pick the shittiest tool in the market, okay, and still make this entire pipeline work, okay?
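As one concrete illustration, the date-format transformation discussed above, API dates in YYYY-MM-DD and RDBMS dates in MM-DD-YYYY normalized to one format before a join, can be done with nothing but the Python standard library; the column names are hypothetical:

```python
from datetime import datetime

api_rows = [{"id": 1, "product_date": "2024-06-01"}]   # API side: YYYY-MM-DD
rdbms_rows = [{"id": 1, "order_date": "06-01-2024"}]   # RDBMS side: MM-DD-YYYY

def normalize(date_str, fmt):
    # Parse with the source's format, emit the agreed YYYY-MM-DD format
    return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")

for r in rdbms_rows:
    r["order_date"] = normalize(r["order_date"], "%m-%d-%Y")

# After normalization both sides share one date format, so we can join on id
joined = [
    {**a, **b}
    for a in api_rows
    for b in rdbms_rows
    if a["id"] == b["id"]
]
print(joined[0]["order_date"])  # 2024-06-01
```

The same logic could be written in SQL, pandas, or PySpark; the fundamentals, a parse rule per source and one agreed target format, do not change with the tool.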

That is the power you have as a Data Engineer because once you understand the fundamentals, you don't really need to know anything else. You can learn tools within 30 minutes, okay? It doesn't take time to learn any new modern tool—it's very simple. Even to learn Spark and how to write the Spark code, it's very easy, okay? You just need to understand some of the functions and execute. There are some angles to Spark, such as the internal and the understanding of executors, drivers, and all of these other things that you need to understand to become a better engineer, but to do this entire job is not that difficult, okay? You just need to make—you just need to understand how to make connections between systems and execute the entire thing, okay?
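The whole generation, ingestion, storage, transformation, serving flow can be wired together in a few lines of plain Python; every piece here is a toy stand-in for a real system, just to show how the stages connect:

```python
# Generation: new raw records keep appearing at the source
source = [{"user": "a", "amount": "10"}, {"user": "b", "amount": "20"}]

storage = []  # Storage: a stand-in for a data lake / warehouse table

def ingest(src, store):
    # Ingestion: pull whatever the source produced into our storage layer
    store.extend(src)

def transform(store):
    # Transformation: business logic, here just casting amounts to numbers
    return [{"user": r["user"], "amount": float(r["amount"])} for r in store]

def serve(rows):
    # Serving: hand an aggregate to the analytics / ML consumers
    return sum(r["amount"] for r in rows)

ingest(source, storage)
print(serve(transform(storage)))  # 30.0
```

Swap the list for Kafka, the functions for Airflow tasks and Spark jobs, and the sum for a dashboard query, and you have the same life cycle at production scale.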

Now that you understood, we can go forward and start talking about the individual components, right? How can I do the generation? How can I do the ingestion? What can I use for the transformation? How can I do the serving? What is used for storage, okay? Machine Learning,

Analytics, Reporting—every single thing that we will talk about, and we will also talk about this part further down the video, okay? Now this is understood, let's talk about the data generation and data storage one more time. Alright, so we got the basics until now—data is

generated from multiple places. Data is coming from transactional systems. Transactional systems, okay, these are called RDBMS, okay? There are multiple types of transactional systems that we will talk about, so don't worry about it. Data is coming from IoT devices, so we have the IoT

devices, okay? It is coming from there. It is also coming from web and social media, okay?

We understand data is coming from logs and machine data, okay? This is also important because, again, we are running the technical machines, so they are also generating logs, and if you want to improve the utilization of this technical machine, we can also use this log data to understand what is going

on and save costs over there also, okay? Then we might have some API data—API or third-party data, okay? Third-party data. Sorry for the bad handwriting, but this is where the data

okay? Third-party data. Sorry for the bad handwriting, but this is where the data is getting generated, okay? Now, once we have the data available, we have to store this data, okay?

For storing the data, we basically use a relational database. This is the same transactional, relational data: from the application to the RDBMS, data is generated. You can put the RDBMS under data generation because it is connected to the application, and you can also put it in the storage layer because data is stored inside the RDBMS; it doesn't really matter. From the Data Engineering point of view, we usually consider the RDBMS a data generation source; from the application point of view, we usually consider it a storage layer. It sounds tricky, but it's simple: an RDBMS can count as both generation and storage. We also have NoSQL databases, which we will get to. For data storage, we also have data warehouses. So we can store our data in an RDBMS, NoSQL, a data warehouse, or object storage. Object storage means things like S3, Google Cloud Storage, and Azure Blob Storage, which we will also cover; you can also call these a data lake. These are the storage systems. We understood how the data gets generated and where our data will be stored.

Now let's understand the DBMS. The transactional and RDBMS systems we were talking about, used for data generation and data storage, are in reality DBMS: Database Management Systems. These are systems specially designed for storing your data in a structured way so that you can easily query it.

Now, understand this: you can also store your data in MS Excel or Google Sheets. As you already know, you have rows and columns there, so you can store data. But if you want to store, let's say, millions or billions of records, and you want to find a specific record, MS Excel will not be able to handle that. Finding a specific record among a thousand lines, or a hundred thousand rows, will be very difficult. DBMS systems are specially designed for this kind of workload: you can store your data and easily retrieve and update it as per your requirements. There are different types of DBMS available. We have PostgreSQL and MySQL, which are open source, and Microsoft SQL Server and Oracle, which are enterprise-level. If you want to get started, Postgres and MySQL are the easiest.

To work with all of these systems, we have a language called SQL, which stands for Structured Query Language. This is the language we use to communicate with the database. You might already know about it because you've been following me, or you have heard about it somewhere, but if you're new to Data Engineering or to the data space in general, SQL is the language we use to communicate with the database.

Now, what can we do with SQL? Multiple things. We can select data (we have the SELECT query to fetch data), we can insert data, we can update data, and we can delete data. All of this data is stored inside tables. A table looks something like this: it has column names, and the actual data is stored underneath them.

Take student data as an example. There is a table, Student. What will Student have? An ID, a name, an age, and maybe a city where the student lives. So the ID can be 1, the name can be, let's say, D, the age 26, and the city Mumbai. There might be another person, let's say Akash, age 25, living in Delhi. Like this, we have data stored inside our table. Now I can select specific data, say where the student ID equals 2, by writing SQL queries. I can insert new data as ID 3, I can delete data if I want, and I can update data, say the age or the name. There are many more SQL use cases; if you want to learn SQL in depth, I have a course, but this is the fundamental concept of SQL.

So this is SQL, the language used for working with DBMS systems. Now we have the concept of data modeling.
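To make the CRUD operations concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The Student table and the sample rows are the ones from the example above; everything else is just illustrative:

```python
import sqlite3

# In-memory database, purely for demonstration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create the Student table from the example
cur.execute("CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, city TEXT)")

# INSERT: add the two sample rows
cur.execute("INSERT INTO Student VALUES (1, 'D', 26, 'Mumbai')")
cur.execute("INSERT INTO Student VALUES (2, 'Akash', 25, 'Delhi')")

# SELECT: fetch the record where the student ID equals 2
row = cur.execute("SELECT name, city FROM Student WHERE id = 2").fetchone()
print(row)  # ('Akash', 'Delhi')

# UPDATE: change an age at the row level
cur.execute("UPDATE Student SET age = 27 WHERE id = 1")

# DELETE: remove a row
cur.execute("DELETE FROM Student WHERE id = 2")

remaining = cur.execute("SELECT COUNT(*) FROM Student").fetchone()[0]
print(remaining)  # 1
conn.close()
```

The same four statements work against Postgres or MySQL; only the connection line changes.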

Now, this is where we slowly dive into the Data Engineering fundamentals one by one. We have covered the foundation of Data Engineering; now we are diving into the individual concepts that are important for understanding the entire life cycle.

Data modeling. Whenever we are designing an application, or thinking about how to build and store our data, we need to design a data model. Data modeling is basically a visual representation of how our data looks. Let's take an example we all understand: Amazon. We are building the data model for Amazon. Just use your general knowledge and common sense to think about what information Amazon will store. Data modeling is charting out a visual representation of how our data will be stored inside the RDBMS; that is the entire goal of it. So I need to think about what kinds of tables, or what kinds of data, I want to store for my system. For Amazon, I might store information about orders. I might store information about users, the users on my website. Then the products. What else? Payments. Shipping information. Sellers, the sellers selling on my platform. In the actual Amazon there might be hundreds of tables, but these are the basic ones. Say I'm starting my own e-commerce company and designing a data model from scratch: Amazon doesn't exist, nothing exists, and I'm the first person starting an e-commerce company on this entire planet. Initially, my data model will have tables like these; these are the pieces of information I want to capture for my system. We are talking from the application side right now, and slowly moving toward data engineering, step by step; these are all concepts you really need to understand if you want to become a data engineer.

So we have orders, users, products, payments, shipping, and sellers. Let's say I'm satisfied with all of this information that I want to capture; the first thing I will do is design a data model for it. It will look something like this. First, the orders: I will create an order table. The order table will have a lot of fields; let's start with order ID, order name, and order date and be satisfied with that. Then we have the user: the user table will have user ID, name, age, address, and everything else a normal user has. Then the product: the product table has a product ID, the primary key or unique key that identifies which product it is, plus product name, product category, product description, product quantity, product weight, product unit size, and lots of other things we could store. Then the payment: payment ID, payment amount, and payment date; we'll keep just these three. Then shipping: shipping ID and shipping date; just these two. And the sellers: seller ID, seller name, age, location, or whatever it is.

So we've figured out the tables we want for our database. Now we need to join them: all of these tables only make sense if they have relationships with each other. How does a relationship happen? A user orders a product, so the order table holds all the information about what is being ordered on the platform. On the order, we also have a user ID; this is a foreign key that joins back to the user table. A user can order one or many products, so the order records which user ID ordered a product. Which product did they order? We also need a product ID in the order table, joined to the product table. So from the order table we know which user ordered which particular product. We can also add payment information: a payment ID on the order, so the payment can be tracked down easily. What was the payment ID for this order? How much did that particular user pay? We can answer that too. Then for the seller: which seller is selling which product? We can add a seller ID inside the product table, so we understand which seller is selling which particular product, and make a connection between those two tables as well. And then the shipping information: shipping holds the order ID, telling us which order is getting shipped, so we can join that too. So all of these tables are connected together. Again, this is the worst way to draw it, but I just want to show you the fundamental side of it, because if you search Google for data model pictures, you will find a lot of data models.

In reality, a data model really looks like this. There are applications such as draw.io, and some database-specific tools, for making this kind of diagram. I teach all of these things in my SQL courses, so you can check the description if you want to know more. But this is the fundamental concept of data modeling: I go in-depth in my courses, but here I just want to give you a good overview.
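To make the whiteboard model concrete, here is a minimal sketch of the same tables as SQL DDL, run through Python's sqlite3. The table and column names mirror the example above; the exact columns are illustrative, not a production schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on

conn.executescript("""
CREATE TABLE users    (user_id    INTEGER PRIMARY KEY, name TEXT, age INTEGER, address TEXT);
CREATE TABLE sellers  (seller_id  INTEGER PRIMARY KEY, name TEXT, location TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT,
                       seller_id  INTEGER REFERENCES sellers(seller_id));
CREATE TABLE payments (payment_id INTEGER PRIMARY KEY, amount REAL, payment_date TEXT);
CREATE TABLE orders   (order_id   INTEGER PRIMARY KEY, order_date TEXT,
                       user_id    INTEGER REFERENCES users(user_id),
                       product_id INTEGER REFERENCES products(product_id),
                       payment_id INTEGER REFERENCES payments(payment_id));
CREATE TABLE shipping (shipping_id INTEGER PRIMARY KEY, ship_date TEXT,
                       order_id    INTEGER REFERENCES orders(order_id));
""")

# The foreign keys encode the relationships from the diagram:
# orders -> users, products, payments; products -> sellers; shipping -> orders.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

The REFERENCES clauses are exactly the arrows drawn between the boxes in the diagram.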

Okay, now we understand data modeling. These are what we usually call SQL tables, because relational databases have a specific schema defined. In this data model, every single piece of information has some kind of schema attached to it. The schema is basically the data type. So the order ID will be an integer, the order name will be a string, the order date will be a date value, and the user ID will be an integer again. Just like this, each and every column has a schema, or data type, attached to it. This is called a SQL or relational database table because it is properly structured, every schema is properly defined, and you use SQL queries to work with it.

After that, we have something called a NoSQL database. In SQL, we store our data in row-and-column format, but in a NoSQL database, we can store our data in different types of formats. One of the formats is key-value: if you know the basics of Python or JSON, it's something like this. We have a key, ID, with a value attached, say 1; then a key, name, with a value, say D; and an age, say 26. All of this information is stored as keys and values, so if you want to find a particular piece of information, you can just look it up by name, age, or something like that. Then we have the column family, where the data is actually stored by column. We have the document database, and we have graph data, which is used for representing relationships. We won't deep dive into these; I just want to give you an overview that these kinds of databases also exist for certain workloads.
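A tiny sketch of the key-value idea in plain Python, with a dict standing in for a key-value store and a JSON string standing in for a document store. The record is just the one from the example; real systems like Redis or MongoDB add persistence and scale on top of the same idea:

```python
import json

# A key-value record: keys map directly to values, no fixed table schema
record = {"id": 1, "name": "D", "age": 26}

# Lookup by key, no row scan involved
print(record["name"])  # D

# The same record serialized as a JSON document, as a document store would hold it
doc = json.dumps(record)
print(doc)
```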

After this, here is the usual comparison I want to talk about: SQL versus NoSQL. SQL is relational, which basically means that in the data model we talked about, everything is properly stored and has relationships. As you can see, one table is connected to another: the shipping table is connected to the orders, the product table is connected to the sellers, the user table is connected to the orders. They have relationships between them through specific primary and foreign keys. This is called a SQL relational database. Then we have the analytical side, which is usually OLAP, or data warehousing. Data warehouses are what we will talk about further down the video, but these are also SQL databases. Then we have NoSQL: graph, wide column, document, key-value. If you want to understand all of these, you can just Google them and you will understand most of it. We won't spend much time on NoSQL because we will mainly be focusing on SQL. That is what you will mostly be working with in the real world, because most data is actually stored in SQL databases, and you will be using data warehouses. So, let's talk about that one by one.

Okay, now in SQL, the two things we talked about, relational and analytical, are two different data processing systems, and I want to talk about that. We have two data processing systems: one is called OLTP, and the other is called OLAP. OLTP means Online Transactional Processing, and OLAP means Online Analytical Processing. In SQL, we have the relational and the analytical: the relational side is the OLTP system, and the analytical side is the OLAP system. One is the relational database, and the other is the data warehouse. As a data engineer, you will be juggling between these two. Now we are slowly deep diving into data engineering, so pay attention.

Okay, now the OLTP system has its use case, and the OLAP system has its use case. It's not that OLTP is better or OLAP is better; they both have their own places in the overall system. The use case of OLTP is usually processing transactional data. What does transactional data mean? When you send money from your account to another person's account, that is considered a transaction. When you purchase something on Amazon, the information that this user purchased this particular product and made a payment of this amount is a single transaction stored inside the OLTP system. These systems are mainly designed for this kind of workload. When you want fast inserts, fast updates, or quick reads of data at the individual level, these are the best systems; they are very fast for those operations. We talked about the CRUD operations, Create, Read, Update, Delete, and OLTP is very useful for that kind of workload. So the use case of OLTP is at the transaction level: whenever you have a lot of transactions happening on an e-commerce website or in banking. A transaction doesn't only mean a money transaction; it can be any event, such as buying a product or returning a product. All of these are individual row-level records being stored. But if you want to understand what is happening at a larger scale, say analyze the last five years of data, the OLTP system won't serve you well, and I'll explain the reason behind it. For that, we have the OLAP system. The name literally says it: Online Analytical Processing. The reason OLAP systems are good is that they are mainly designed for analysis workloads, so if you want to analyze the last five years of data, you can easily do that using an OLAP system.

Let me explain this individually so that you have a better understanding. The OLTP system is mostly row-based: every record you store is stored inside a row. This is my ID, this is my name, this is my age, this is the payment I made, something like that, and all of this information is stored together in an individual row. The OLTP system is used for transactions, so this layout is really good for row-level operations. If you want to do something at the row level, such as update a date of birth, update an age, or delete a particular row, this is very easy. But let's say I want to analyze the entire dataset. Say the payments made are 10 rupees, 20, and 30, and I want to aggregate them, and there are millions of rows like this. If I write a query such as SELECT SUM(payment) FROM users and run it, the way it gets executed is that it will first fetch all of the individual rows into the result set, one by one. Then, from that entire result set, it will pick out just the single payment column, and then it will do the sum. Scanning every row from start to end is wasted work for this operation. Understand this: I just want the sum of the payments, the payment information only, so why am I scanning each and every individual row? Because OLTP databases are stored at the row level. Every single piece of information is stored in a row, so even if I only want the payment information, I have to scan all of the data from start to end and then select the one single column. As I said, this layout is only good for row-level transactions, like updating or deleting a specific row.

On the other hand, OLAP systems are column-based. Everything else is the same: ID, name, date of birth, age, and the payment. But in an OLAP system, or the data warehouse, storage is column-based most of the time. In the row-based case, the first row stores all of its fields together (ID 1, the name, age 25), then the next row attaches after it, so everything is stored at the row level. In the column-based case, everything is stored at the column level: the IDs are stored together as 1, 2, 3; then the names are stored inside one single column; then the payment information is stored inside its own column, say 23 dollars, 25 dollars, 26 dollars. Every single thing stored internally is at the column level. Just try to understand and visualize this. So when I run the same query on the OLAP system, instead of scanning entire rows and then extracting the column, it goes directly to the payment column and gives me the sum. The wasted work of scanning the IDs or the names is not needed; we go directly to the payment level and fetch the result we need. This is the difference between OLTP and OLAP.
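The row-based versus column-based difference can be sketched in plain Python. A list of dicts stands in for row storage and a dict of lists for column storage; the point is only what a SUM over the payment column has to touch in each layout (the values are made up for illustration):

```python
# Row-based (OLTP-style): each record keeps all its fields together
rows = [
    {"id": 1, "name": "D",     "payment": 10},
    {"id": 2, "name": "Akash", "payment": 20},
    {"id": 3, "name": "Ravi",  "payment": 30},
]

# SUM(payment) in a row store: visit every row, then pick out one field from each
total_row_store = sum(r["payment"] for r in rows)

# Column-based (OLAP-style): each column is stored contiguously on its own
columns = {
    "id":      [1, 2, 3],
    "name":    ["D", "Akash", "Ravi"],
    "payment": [10, 20, 30],
}

# SUM(payment) in a column store: read just the payment column, skip everything else
total_column_store = sum(columns["payment"])

print(total_row_store, total_column_store)  # 60 60
```

Both layouts give the same answer; the difference is how much data each one has to read to get there, which is exactly why analytical queries are faster on columnar storage.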

Now, understand this as a data engineer: you will be taking data from OLTP systems to OLAP systems, and in between, you will be writing transformations. In terms of what we covered earlier, data generation and storage is the OLTP system. That is where the data is generated and where it is stored; then comes the transformation, and then the analysis, which is where the data warehouse comes into the picture. The data analyst writes queries to understand the data and then builds dashboards, ML models, AI models, whatever you want to call them, on top of the OLAP system, the data warehouse, or whatever storage layer we have. We will come back to data storage and object storage later, so don't worry about it. This is the fundamental idea; right now we are just zooming into the individual components to understand what is going on.

So, data engineering is basically taking this data and moving it somewhere else. We take the data from OLTP systems and APIs, ingest it into the system, do some transformation, apply some logic, and load it into the data warehouse. This is the core of data engineering. But how do we do this? How does this entire pipeline happen? We have something called ETL: Extract, Transform, Load. You might already know this; everyone keeps talking about it. It is the same thing we talked about in the lifecycle. The data engineering lifecycle is, in a sense, just ETL: we extract data, we transform data, and the serving layer is the loading of data. The lifecycle is the conceptual architecture of how things work; in the real world, we build ETL pipelines. We extract the data, we transform the data, and we load the data. Where do we extract all of this data from? From DBMS, analytics, sensors, APIs, and many other sources. Then the data comes in, and we do the transformation. We understood transformation too: it is about removing duplicates, handling null values, and bringing all of the information onto the same scale. If one source stores age as a string and another stores it as an integer, we bring both to the integer level. If dates come in different formats, we bring them to the same format. And then we load our data. The load can go anywhere: into a data warehouse, such as Snowflake, BigQuery, Redshift, and many others, or into object storage, like S3, Google Cloud Storage (GCS), or Azure Data Lake. This is the core concept of ETL, which we will also talk about one by one. So now you understand the upper layer.
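Here is a toy sketch of exactly the ETL steps just described: dedupe, null handling, age cast to integer, and dates brought to one format. The source records and field names are made up for illustration; a real pipeline would extract from a DBMS or API and load into a warehouse instead of printing:

```python
from datetime import datetime

# EXTRACT: pretend these rows arrived from two different sources
raw = [
    {"id": 1, "age": "26", "signup": "2024-01-05"},  # age stored as a string
    {"id": 2, "age": 25,   "signup": "05/02/2024"},  # different date format
    {"id": 2, "age": 25,   "signup": "05/02/2024"},  # duplicate record
    {"id": 3, "age": None, "signup": "2024-03-09"},  # null value
]

def parse_date(s):
    """Bring both source date formats to ISO (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(s, fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError(f"unknown date format: {s}")

# TRANSFORM: drop duplicates, drop nulls, cast age to int, normalize dates
seen, clean = set(), []
for r in raw:
    if r["id"] in seen or r["age"] is None:
        continue
    seen.add(r["id"])
    clean.append({"id": r["id"], "age": int(r["age"]), "signup": parse_date(r["signup"])})

# LOAD: here we just print; a real job would write to a warehouse or object storage
for r in clean:
    print(r)
```

Dropping null rows is just one policy; filling defaults or quarantining bad records are equally valid choices depending on the pipeline.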

We did all of this work just to understand the top layer of the data engineering lifecycle; everything we have done until now was the top layer. Now I want to cover the bottom layer of the data engineering lifecycle: the undercurrents of security, data management, data architecture, orchestration, and software engineering. These undercurrents are also important.

Security: just from the name, you understand that our data should be secure. That basically means controlling who is able to access our data and the system. We need to make sure only the right person with the right authorization can access our data; we should not give access to every single person working in the company. This is the importance of security.

Data management: that basically means data governance. Data governance means we should be able to easily find the data that we need. Think about this, right? I was working at an e-commerce company in Europe and the US for furniture. They had tables—more than thousands of tables—in the system. Now, if I had to find particular data, where this data is stored,

I had to go through the documentation they created to understand, okay, this data can be found at this particular location. This is what we call data governance: the ability to find data. Then

the definition: what each and every single column means. Think about it, if you have thousands of tables, and if you access one of the tables from that pool, and that particular single table has, let's say, hundreds of columns, and you want to understand what the sixth column means. It

could be something like the payment gateway ID or XYZ, something like that. I don't know what this particular column means. This is the use of definitions, understanding what the data is, what

type of data is stored. This is very important for data governance. Accountability: who owns this data? Which user created the table? If I don't really understand the purpose of a table, I can go to that user and ask.

If I am working in the shipping department, I am an engineer over there, and I created the entire shipping table. Now, if any person from, let's say, the order department or the return department wants to understand what is going on inside this table, they can directly reach out to me. I am accountable for that particular data. That is what accountability means.
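The governance ideas above—findability (where the data lives), definitions (what each column means), and accountability (who owns the table)—are exactly what a data catalog records. A minimal sketch, with hypothetical table names and owners:

```python
# A tiny data catalog capturing findability (location), definitions
# (column meanings), and accountability (owner). Entries are hypothetical.
CATALOG = {
    "shipping_events": {
        "location": "warehouse.logistics.shipping_events",
        "owner": "shipping-team@example.com",
        "columns": {
            "payment_gateway_id": "ID of the gateway that processed the payment",
            "shipped_at": "UTC timestamp when the parcel left the warehouse",
        },
    },
}

def describe(table: str, column: str) -> str:
    """Answer: what does this column mean, and who do I ask about it?"""
    entry = CATALOG[table]
    return f"{column}: {entry['columns'][column]} (owner: {entry['owner']})"

print(describe("shipping_events", "payment_gateway_id"))
```

Real catalogs (DataHub, the AWS Glue Data Catalog, and others) store this same kind of metadata at scale, but the questions they answer are the ones in this sketch.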

Then we have data modeling, which we already understood. Data integrity: making sure every piece of data makes sense; every piece of data is proper. It basically means the data is correct; it should not have any random information. DataOps: you might already know about DevOps.

DevOps is basically automating the entire process of deploying your application using best practices. DataOps is somewhat similar: you monitor data governance, observability, and incident reporting. Everything that is happening inside your data system, you should be able to monitor, and you should be able to report the incidents that happen. All of these things should be automated, and that is the fundamental concept of DataOps, data operations. When you deploy something, is it working fine or not? I should be able to get the error message. I should be able to observe how my data pipelines are working. I should be able to monitor what is going on. All of this is a part of DataOps.
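The monitor-and-alert idea can be sketched as a small wrapper that reports failures instead of letting them pass silently. The `notify` function here is a stand-in for a real email or Slack integration:

```python
# DataOps sketch: run one pipeline step, log success, and alert on
# failure. `notify` stands in for a real email/Slack integration.
def notify(message: str) -> None:
    print(f"ALERT: {message}")

def run_monitored(step_name, step_fn):
    """Run a pipeline step; on failure, send an alert and re-raise."""
    try:
        result = step_fn()
        print(f"{step_name}: OK")
        return result
    except Exception as exc:
        notify(f"{step_name} failed: {exc}")
        raise

run_monitored("load_orders", lambda: 42)      # prints "load_orders: OK"
try:
    run_monitored("load_payments", lambda: 1 / 0)
except ZeroDivisionError:
    pass                                      # the alert was already sent
```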

Data architecture: we have a detailed section after this about data architecture where you analyze the information, analyze the trade-offs, and add value to the business by designing the proper architecture for the system. We'll understand this.

Orchestration: this is used for coordination, for scheduling jobs, and managing tasks. In data

engineering, we have multiple data pipelines working. Data pipelines are basically ETL jobs. It is just a fancy name for extracting, transforming, and loading the data to some location. This entire operation is called a data pipeline. Now, there might be hundreds of data pipelines deployed in the organization. I need to orchestrate all

of these things. Let's say once the first data pipeline completes, I should only run the second data pipeline because the second data pipeline is dependent on the first data pipeline.

All of these things are called orchestration. We have a tool called Apache Airflow for this kind of workload, and we will look at orchestration in more detail later.
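In Airflow you would declare this dependency roughly as `first_pipeline >> second_pipeline` inside a DAG. To show just the core idea without Airflow, here is a minimal pure-Python sketch of dependency-ordered execution (the pipeline names are hypothetical):

```python
# Orchestration sketch: run each pipeline only after the pipelines it
# depends on have completed. No cycle detection—this is just the idea.
def run_in_order(pipelines, dependencies):
    """pipelines: {name: callable}; dependencies: {name: [upstream names]}."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in dependencies.get(name, []):
            run(upstream)  # upstream pipelines must finish first
        pipelines[name]()
        done.add(name)
        order.append(name)

    for name in pipelines:
        run(name)
    return order

order = run_in_order(
    {"extract": lambda: None, "transform": lambda: None, "load": lambda: None},
    {"transform": ["extract"], "load": ["transform"]},
)
print(order)  # ['extract', 'transform', 'load']
```

Airflow adds scheduling, retries, and monitoring on top of exactly this kind of dependency graph.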

Software engineering: software engineering is basically programming, software design, testing, and debugging. You have to apply the best practices of software engineering when you write the ETL transformation job using code. You should use software engineering design patterns for scalability, and you should use testing and debugging approaches to test your data

pipelines. So, all of these are the fundamental concepts. When building a data pipeline, you should remember that security is important, data management is important, DataOps is important, architecture is important, and so are orchestration and software engineering. These are fundamental concepts, good to know. You don't need to deep dive into them right now; as you move through your career, you will understand them one by one.

The next thing I want to talk about is data architecture. If you want to become a good data engineer, you should understand data architecture, and we will be referring to one of the newsletters that I wrote, "Data Architect 101 for Data Engineers." So, let's jump into that. Before we move forward, I just want to say that I am re-recording this part of the segment because I was recording it yesterday and my disk got full. I ran out of space, my OBS stopped recording midway, and the entire file, a one-and-a-half-hour file, got corrupted. So, I'm re-recording this part of the video just

to have one complete video. If you're still watching this video till here, I'll urge you to at least like this video because it takes a lot of effort, and do comment something so that it increases the reach of this video and it reaches more and more people.

Okay, let's start with the video. Till now, we have understood the basics of data engineering: what data engineering is, where it fits in the entire pipeline, the data engineering lifecycle, the different parts of ETL, and OLAP versus OLTP. So, we have cleared the basic fundamentals required to understand core data engineering. Now, I want to take you on a journey to understand how data engineering happens in the real world: how the architecture is actually built from the ground up, how the thought pattern is developed, how you understand the business side, how to choose the right technology, and how to put all of these individual components together.

Okay, so let's start. Now, I want to make you understand data architecture first. Because

before we even understand the different parts of data engineering, it is really important that you understand how to build the basic architecture as a data engineer. Because this is the core skill set, and we'll be learning about that, right? So, I published this particular newsletter.

If you are interested, you can also subscribe to it. Just go to DataVidhya.substack.com to get high-quality data engineering blogs. Okay, so: Data Architect 101 for Data Engineers.

Now, till now, we have understood that the goal of every data project is to solve a business problem. From the start of the video, I've been saying this again and again: everything you do as a data engineer, or as an engineer in general, you are doing for the business. It can be anything from reducing the current system cost to building a full-fledged data system to help businesses make data-driven decisions. Now,

I want to take you on a journey to understand how to think about building data architecture from the data engineering point of view. Because as you grow in your career, you should have the basic understanding of how to design the architecture and how to build data systems. What is data architecture? So, from the definition of the fundamentals of data engineering, data

architecture is a design of systems to support the evolving data needs of an enterprise. Evolving

data needs are achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs. We'll unpack this technical definition, but in simple terms, it is basically like before you construct a building, right? You have to build a blueprint of the building. If

you're trying to build, let's say, a 12-floor building, you have to first build the blueprint.

Inside the blueprint, you have to add some of the things, such as the foundation, floor plans, elevation, elevator, stairs, office, restroom—all of these things you have to first plan, and then you can start building the entire construction. Data architecture has a similar concept. Instead

of foundation, floor plans, elevation, and elevators, you'll have to think about storage, the different software you have to use, how the data actually flows, interfaces, how you write the transformations, the staging areas, data warehouses, reporting systems, and many more. Just like you plan an entire construction project, you also have to think about what different components you need in order to build the entire data system. This is how we start.

Now, as per the technical definition that we just read, it says that decisions should be flexible and reversible, which means like each and every component that you put inside the architecture, in case something goes wrong, you should be able to easily replace it with something else, and it should be easily reversible. So, every decision that you take, if it goes in the wrong

direction, it should be easily reversible so that you can make it right. This is what is meant by flexible and reversible decisions reached through a careful evaluation of trade-offs. Trade-offs basically mean that, based on your requirements, you have to understand which technologies you can choose. We'll understand all of this step by step.

Now, building data architecture is divided into two different parts. One is business needs, and the second is technological integration, basically the operational architecture and the technical architecture. Let's try to understand both of these right now, and then we'll deep dive into them individually. We focus on the business goals and requirements inside the operational architecture. Again, we understood, right? Everything that we are

doing is for the business only. So, before you think about choosing the right technologies or writing code and all of the other things, first, you need to define what the business even needs in the first place. Because once you know that, then you can think about the technological side.

So, the first step in building data architecture, or even if you're building your own personal project, is to understand the operational side or the business side. For example, in an e-commerce platform: what is the impact of the XYZ product category? I want to find information about this particular product; that is one business goal. Why is there a delay in product shipping? I want to understand what is happening with shipping and why it is delayed; that is another business goal.

How do we manage data quality from third-party vendors? In e-commerce, we work with different third parties, such as FedEx or other shipping partners, and the data might be coming from multiple places. How do we manage data quality while working with these vendors? These

are the different business goals that we have. So, while building technical architecture, we need to think in this particular direction. These are different things that the business needs. So,

now I have to build my technical architecture to fulfill all of these different requirements.

In the technical architecture, we focus on the technical side: solving how to ingest, store, and transform data, and what happens when we have a sudden order spike—say, during a festival. So, we also think about scalability. This is more of a system design exercise. One side is the business side, where you focus on what the business needs.

The second is the technical side, where you think about what are the different technologies that you can use. Let's try to understand all of these things in a little more detail with examples.

The operational architecture ensures that your data practice aligns closely with the business objectives. It is the "why" behind every piece of data you collect, process, and store—why you are doing this entire activity, why you are even building everything: to support the business in achieving its goals. Here are some insights to think about when building the operational architecture or defining the business goals. First, start with

the end in mind. Always begin by understanding the business problem you are trying to solve.

This clarity will guide your decisions and ensure that your data architecture directly contributes to the business outcome. This is very important—start with the end in mind. Understand what the business goals are before you even think about building the architecture or choosing technologies. Once you define what the business needs, you can easily build the technological side. If

you don't know what the business needs, you will be stuck building the architecture and will never get out of it. Second, iterate and evolve. The business keeps changing—every six months a new product line comes up, priorities shift, product strategies change. So, when you design your architecture, it should be able to iterate and evolve quickly as the business changes.

And focus on impact. Everything you do should generate value for the business. Every data

solution you architect should have a clear line of sight to its business impact. It can be improving customer satisfaction, streamlining operations, or enhancing decision-making. The value of your data initiative should be measurable and aligned with business priorities. This is operational architecture and aligning with business goals. Now let's talk about the technical architecture,

the building block. This is where the actual execution happens. While operational architecture is about "why," the technical architecture is the "how" of the equation. By focusing on specific technologies and methodologies, you'll be able to meet your operational goals. So, what do we do? We

use technologies—technology is our "how" to meet the business goals, which is basically the "what" we want to achieve. Very simple to understand. If you want to build the technologies, we have thousands of tools available in the market. This is the big data landscape, and you can see there are so many different tools available that you can't even see them all until

you zoom in. You don't need to understand every single tool; just know that different tools exist for different kinds of workloads, and that there is a proper framework for choosing technologies as per your business use case. Now, you can't choose any random technology and think, "Okay, I'll use Snowflake, I'll use Apache Spark, I'll use these fancy tools just to

solve my business problem." The specific tool doesn't matter. You can even use a simple Python script as long as it solves the problem and helps you reach your business goals. Technology is

not about choosing fancy tools or something everyone is using in the market. As a business, you should be thinking about saving costs and reaching your business needs. Whatever technology

helps you, whether it is an enterprise-level technology or an open-source technology, as long as it solves your business problem, you're good. Now, let's try to understand that one by one. How

do you build the technical architecture? Simplicity is key—the aim is to keep your technical architecture as simple as possible while meeting your needs. This approach makes your system more maintainable, scalable, and less prone to error. The simpler you keep things, the easier it is to maintain, scale, and quickly identify errors. The more complex

the system, the harder it is to debug errors. Second is choosing the right tools for the job.

There is no one-size-fits-all solution in data architecture. The right storage, processing, and analysis tools depend entirely on your requirements and the specific use case. If you have structured data, you can go with a data warehouse. If you only have millions of rows, you might not need Snowflake or another expensive warehouse; a basic ad-hoc query interface like Amazon Athena will be good to go. All of these different

decisions should be made based on your business understanding. It's not about choosing fancy tools; it's about solving your business problem. Third is building for scale and flexibility. Even

if you are not dealing with billions of rows right now, in the future your business will grow. If you

are projecting that growth, you should be planning the architecture to scale all the systems. For example, currently, you're using Python to process millions of rows, but you know you'll have billions of rows tomorrow. You should keep the system ready in the backend for that growth.

For instance, you can use distributed processing like Apache Spark and scale up the cluster as needed. Start with a smaller cluster and then think about scaling up as you move forward. It's

not that everything is perfect when you start; you start small and evolve as you move forward.
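The "start small, keep the design ready to scale" idea can be made concrete: write the aggregation over chunks so memory stays flat, because a chunked group-and-sum maps directly onto a distributed engine like Spark later. A sketch with hypothetical order data:

```python
# Scale sketch: aggregate revenue per product in fixed-size chunks so
# memory use stays flat regardless of row count. The same group-by-sum
# translates directly to Spark when one machine is no longer enough.
from collections import defaultdict

def chunked(rows, size):
    """Yield successive chunks of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def revenue_by_product(rows, chunk_size=1000):
    totals = defaultdict(float)
    for chunk in chunked(rows, chunk_size):
        for product, amount in chunk:
            totals[product] += amount
    return dict(totals)

rows = [("chair", 50.0), ("desk", 200.0), ("chair", 50.0)]
print(revenue_by_product(rows, chunk_size=2))  # {'chair': 100.0, 'desk': 200.0}
```

In a real pipeline the chunks would stream from a file or database cursor rather than a list, but the aggregation logic stays the same.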

Fourth is embedding automation. A lot of times, you might monitor different systems manually, try to solve different errors manually, or build data pipelines manually. Instead,

you should generate scripts and automation to do these things. In case an error occurs, you should get an email or a Slack notification, depending on your system integration. Instead of checking every single day whether your data pipeline is working, you should have an alerting mechanism in place so that you don't have to check manually. Finally, prioritize data

security and governance. In the digital age, data leakage is quite common, so you should properly secure your database, encrypt your data, and keep your data secure within the network. These

are the different things you need to consider while building your technical architecture.

Now, let's bring all of these different things together to understand how this happens in the real world. Let's take the example of the data architecture for an e-commerce platform—pretty easy to understand. The first thing is that we need to understand the business needs. In

this case, let's define the business goals, because this is what we understood first. We

define the operational architecture, like what are the goals of the business. In this case, the first goal is to improve customer experience: improve site navigation, personalize product recommendations, and enhance customer service. Simple to understand. We want to improve the overall site navigation, how customers interact with the application, and build a recommendation

engine and customer service integration. Next is operational efficiency: streamline inventory management, order processing, and shipping to reduce costs and delivery times—reduce order processing time, reduce shipping costs, and shorten delivery times. Then, marketing insights: we want to understand how customers are behaving so we can improve product placement and increase sales.

Vendor management: we might be working with different vendors, so we also want to build a strategy for better product availability, pricing strategies, and quality control.

And fifth, compliance and security: in an e-commerce platform, people will be making payments, so there are compliance requirements we need to follow. For example,

we don't capture credit card information, or if we do, we should mask it so that it doesn't get leaked. These are some of the compliance requirements we have to follow.
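The masking idea can be sketched as keeping only the last four digits of a card number. This is a simplified illustration—real PCI-DSS compliance involves much more than masking (tokenization, encryption at rest, access controls):

```python
# Compliance sketch: mask a card number so only the last four digits
# survive. Simplified—real PCI-DSS compliance goes far beyond this.
def mask_card(card_number: str) -> str:
    digits = card_number.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - 4) + digits[-4:]

print(mask_card("4111 1111 1111 1234"))  # ************1234
```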

So, these are the business goals, right? We want to increase customer experience, operational efficiency, marketing insight, vendor management, compliance, and security. Now, based on these business goals, we can think about building the architecture—the actual technical architecture.

The first is our data ingestion layer. We are getting data from multiple sources, and the purpose of the ingest is to collect data from various sources such as website interactions, server logs, vendor systems, inventory management, and customer support. We can use technology like

Apache Kafka for real-time data streaming to handle data coming from different sources.

After we capture our data, we need to store it in some object storage for a longer period of time. The purpose is to store collected data in a structured manner for easy access and analysis.

Different components, like object storage (S3 bucket) for unstructured data, or data warehouses like Snowflake or BigQuery for structured data, can be used depending on your business requirements. How do you decide which one to use—Snowflake or Redshift, for example? It depends. If you're already on AWS,

going with Redshift might be a good choice due to integration. But if Redshift is too expensive for your business needs, you can go with Snowflake or even open-source solutions. You need to research, understand your data size and frequency, and do a simple proof of concept (PoC) to

see how different technologies behave with your data. Whatever works best, you can choose that.

So, we might have to structure our data before we put it into the data warehouse—that's where the data processing and transformation layer comes in. This is where we clean, validate, and transform our raw data into a structured format. For this, we can use Apache Spark if we're working with large datasets. If you have a smaller dataset, like a few million rows,

you can go with simple Python scripts. But if you have a large dataset and data coming from multiple sources, you might want to go with Apache Spark, a highly used framework by top companies.

After the data is in the data warehouse, the data analysis and business intelligence layer comes into play. This is where machine learning engineers and data analysts build dashboards and machine learning models for predictions to help the business move forward. This is where the final value comes in—when a person from the business team can look at a dashboard, see issues in shipping, and make the right decisions to improve the overall business.

Business intelligence tools like Tableau and Power BI help us visualize data, and machine learning platforms like TensorFlow and PyTorch help us build recommendation engines and algorithms. There's also the side of data security and compliance, where we ensure that we meet regulatory compliance, such as GDPR and CCPA. These are government regulations you need

to follow when storing data, like encrypting or masking personal information. We'll cover

data masking in more detail later in this video, so don't worry about it.

Lastly, we have the data integration and API layer. We'll be working with multiple vendors and sending data between different systems, so we should build an API for easy integration between systems. So, if we put all of this together, our final architecture might look like this (example architecture shown). This is not the only possible architecture, but it might look like this, and you can improve on top of it.

As you can see, we have data coming from on-premises systems, social media, and stream data. This data is ingested into the system, stored on AWS S3 as a data lake.

We can use transformation layers such as AWS Glue and Lambda to process our data, and then store it on Amazon Redshift. We can also use Amazon Athena as an ad-hoc query interface and SageMaker as a machine learning platform. Visualization is done through tools like Tableau.

This architecture is built to fulfill our business needs. We define the business goals, then define the tools to use, and then build the architecture. If you look at this architecture, it looks similar to the data engineering lifecycle we discussed earlier. There's data collection,

ingestion, storage, transformation, serving, and end users. The data engineering lifecycle is the fundamental block, and this real-world architecture applies those concepts.

You can plug and play—if you want to use Google Cloud Storage instead of S3 as a data lake, you can. If you want to replace Amazon Redshift with Snowflake, you can. If you prefer Databricks over AWS Glue, go for it. Use what best meets your business needs.

That's everything about building architecture. I hope you understood. If this is clear, we can move forward and discuss the other parts. Okay, let me check this. All right, this looks good. Now that we've understood architecture and how it's built, let's try to understand the individual components of the architecture, their use cases, and how the entire execution happens.

Let's start by understanding the data warehouse. This is what the architecture of a data warehouse looks like (architecture shown). So, we have data coming from multiple places, as we discussed. Data

comes from APIs, RDBMS, websites—all these places generate data. This data goes to the streaming engine and gets ingested, and then we write the ETL pipeline. After ETL,

our data gets stored inside the data warehouse. This is the ETL pipeline—what we are doing is extracting data, transforming it, and then loading it onto the data warehouse.
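The extract → transform → load flow can be sketched end to end in a few lines. SQLite stands in here for a real warehouse like Redshift or Snowflake, and the source rows are hypothetical:

```python
# ETL sketch: extract raw rows, transform (fix types, normalize values),
# and load into SQLite, which stands in for a real data warehouse.
import sqlite3

raw = [  # "extract": rows as they might arrive from a source system
    {"order_id": 1, "amount": "20.00", "country": " in "},
    {"order_id": 2, "amount": "5.00", "country": "US"},
]

def transform(row):
    """Cast the amount to a number and normalize the country code."""
    return (row["order_id"], float(row["amount"]), row["country"].strip().upper())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, amount REAL, country TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [transform(r) for r in raw])  # "load"

total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 25.0
```

In production, extraction would read from real sources, the warehouse would be a managed service, and an orchestrator would schedule the run—but the three stages are exactly these.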

There's one more concept called ELT, where instead of transforming the data first,

we extract and load the data into a staging area, or directly into the data warehouse. We then do

the transformation on the fly using SQL queries. This is ELT—extract, load, transform. In ETL,

we extract, transform, and then load it as per our requirement. These are the two ways you can build a data warehouse. In the real world, ETL is highly used because it's the most structured way

to organize your data. ELT is also used, and some newer companies are trying to replace ETL with ELT, where you don't have to do the transformation first—you load your data into the warehouse as it is and then transform it as needed. However, ELT is not as successful because real-world

data is often messy and requires some processing before storing it in the data warehouse. ETL is what you'll be using most of the time, but it's good

to know that ELT also exists for some use cases. When we built the data model in our relational

database part, we understood that data models are normalized—this means we split the data across many tables to reduce duplication, so that proper information is stored across different tables. Let me show you that again for clarity. This is what it looks like (example shown). We have different tables that store different information. If you want to get information about a user who purchased a product, you need to pull the user ID, connect it with the order table to get order information, then connect with the product information, and if you want to track payment information, you'll need to join on the payment ID—joining four different tables to get one outcome. However, relational databases are not designed

for analytical workloads. Even if you join all this data and try to run analysis queries by aggregating user or order information, the OLTP database (Online Transaction Processing database)

will struggle because it's not designed for that kind of workload. It will pull all these rows one by one and then pull one single column for your final analysis—not ideal.

This is where the data warehouse comes in, but you can't just store your data in a data warehouse without following specific methods—that's where dimensional modeling comes in. Just like we have a method to store data in relational databases (data modeling), we have a method to store data

in a data warehouse called dimensional modeling. In dimensional modeling, we have two things: Dimensions and Facts. Dimensions and Facts are the two types of tables you'll create to build your data warehouse. This is called a dimension table, and this is called a fact

table. In a star schema, there is typically one fact table at the center, with multiple dimension tables around it.

The fact table stores information about quantitative data points that can be measured in the business, such as sales amount, product quantity sold,

revenue, profit—all the quantitative values that get stored in the fact table.

It is the center of your dimensional modeling. On the other hand, there are multiple dimension tables, each representing different business categories. For example, you might have a product dimension, a date dimension, and an order dimension. Each dimension table stores information about the categories or descriptive attributes, such as product name,

product category, user name, user city—all descriptive attributes related to the dimension.

If you want to understand how all this happens in detail, I have a course available on data warehousing with Snowflake, where I go deep into this. For now, I'm just providing a fundamental overview. Dimensional modeling is built using two concepts:

star schema and snowflake schema. These are the two methodologies or concepts used to build a

dimension model. Let me show you (example shown). This is what a star schema looks like—there's a fact table in the center with different dimension tables attached to it. It looks like a star,

hence the name "star schema." The snowflake schema is a more normalized version, where there are sub-dimension tables attached to the dimension table. It kind of resembles a relational data

model but still has a fact table in the middle, with different dimension tables attached to it.

In the star schema, you have the fact table in the center and dimension tables attached to it, forming a star shape. The snowflake schema is similar, but with sub-dimension tables

added to the main dimension tables. Note that the snowflake schema is unrelated to Snowflake, the company that offers a cloud data warehouse as a service. Let's look at an example. Let's say we're working with an e-commerce company. We'll have a fact table in the center, such as an order fact table, which stores all transactional information. This will have a unique ID and quantitative

attributes like price, quantity, and weight—all measurable attributes in the business. Then,

you'll have different dimension tables, like order dimension, product dimension, and date dimension.

Each dimension table will store descriptive values like product name, product category, and other

relevant information. You can join these tables using a common key, such as product ID, to get the final analysis. This makes analysis easier because if you want to get information about a product and its quantity, you just need to join two tables. This join happens in the data warehouse, and the OLAP database (Online Analytical Processing database) will handle this more efficiently.
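As a rough sketch of the join described above—using SQLite with hypothetical table and column names (`fact_orders`, `dim_product`), not any specific warehouse—this is how a fact table and a dimension table come together for analysis:

```python
import sqlite3

# Minimal star-schema sketch: one fact table plus one dimension table.
# Table and column names are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE fact_orders (
    order_id   INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    price      REAL
);
INSERT INTO dim_product VALUES (1, 'Keyboard', 'Electronics'),
                               (2, 'Notebook', 'Stationery');
INSERT INTO fact_orders VALUES (100, 1, 2, 25.0),
                               (101, 1, 1, 25.0),
                               (102, 2, 10, 2.0);
""")

# Analysis query: join the fact to the dimension on the shared key,
# then aggregate the quantitative column by the descriptive attribute.
rows = conn.execute("""
    SELECT d.product_name, SUM(f.quantity) AS total_qty
    FROM fact_orders f
    JOIN dim_product d ON d.product_id = f.product_id
    GROUP BY d.product_name
    ORDER BY d.product_name
""").fetchall()
print(rows)  # [('Keyboard', 3), ('Notebook', 10)]
```

The same shape scales out: more dimension tables around the fact table, each joined on its own key.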

If you want to understand this in more depth, I go hands-on with this in my data warehouse course on Snowflake. I teach these concepts using real datasets.

Now that we've covered facts and dimensions, I want to talk about Slowly Changing Dimensions (SCDs). We know that these facts, such as quantity, product weight,


and price, keep changing. Quantity changes, product prices change, and these changes

need to be reflected in the system. We understood that the data flows from sources, like APIs or RDBMS systems, through ETL to the data warehouse, where it gets updated daily, hourly,

or however frequently it's scheduled. But these dimensions, like product name and user address,

don't change frequently—these are dimension values that don't change for long periods. However,

when they do change, how do we handle that? This is where the concept of Slowly

Changing Dimensions (SCDs) comes in. SCDs deal with handling dimension values that change slowly over time. There are different strategies for handling SCDs, categorized into different types like SCD1, SCD2, and SCD3, each with its own approach to handling these changes.

In SCD Type 1, the values are overwritten, and no history is maintained. For example, if we

overwrite data without keeping the previous value, we are using SCD1. If a customer's city changes

from New York to New Jersey, we simply overwrite the New York value with New Jersey. In this case, there's no way to know what the previous value was—this approach can be used for some use cases.
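A minimal sketch of SCD Type 1, using a hypothetical customer record held as a plain dict—the point is simply that the overwrite destroys the old value:

```python
# SCD Type 1 sketch: overwrite in place, no history kept.
# The customer record and field names are hypothetical.
customer = {"customer_id": 7, "name": "Asha", "city": "New York"}

def scd1_update(record, field, new_value):
    """Overwrite the old value; the previous value is lost for good."""
    record[field] = new_value
    return record

scd1_update(customer, "city", "New Jersey")
print(customer["city"])  # New Jersey — no trace of New York remains
```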

In SCD Type 2, we maintain a complete history of changes. Every time there is a change,

we add a new row with all the details without deleting the previous value. There are multiple ways to handle this, such as using a flag approach. For instance, if the city was New

York and then changes to New Jersey, we'll add a new row with an "is active" flag to indicate the

current value. If there are further changes, like moving to Miami, we'll add another row,


keeping the history intact. We can also use version numbers or date ranges to track changes.
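The flag approach above can be sketched like this—a hypothetical dimension table modeled as a list of dicts with an `is_active` flag, not any particular warehouse's syntax:

```python
# SCD Type 2 sketch: append a new row per change, flag the current one.
# The table layout (list of dicts with an "is_active" flag) is hypothetical.
customer_dim = [
    {"customer_id": 7, "city": "New York", "is_active": True},
]

def scd2_update(table, customer_id, new_city):
    """Close out the active row and append a fresh one, keeping full history."""
    for row in table:
        if row["customer_id"] == customer_id and row["is_active"]:
            row["is_active"] = False
    table.append({"customer_id": customer_id, "city": new_city, "is_active": True})

scd2_update(customer_dim, 7, "New Jersey")
scd2_update(customer_dim, 7, "Miami")

history = [row["city"] for row in customer_dim]
current = [row["city"] for row in customer_dim if row["is_active"]]
print(history, current)  # ['New York', 'New Jersey', 'Miami'] ['Miami']
```

In a real warehouse you would typically add effective-date columns (start date, end date) alongside or instead of the flag, but the insert-don't-overwrite idea is the same.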

In SCD Type 3, we maintain partial history. For example, we might store the current and previous city in separate columns. If the city changes from New York to New Jersey, we keep New York

in the "previous city" column and New Jersey in the "current city" column.

There are also more advanced types like SCD6, which is a combination of SCD1, SCD2, and SCD3, capturing the current city, previous city, start date, end date, and active flag all together.

These are fundamental concepts, and if you want to do hands-on practice, you can find tutorials or check my course on Snowflake, where I cover these concepts in depth with real datasets. Lastly, there's the concept

of data marts. Let me take a sip of water; you can also drink some water.

Okay, so data marts. Now, data marts are basically a subset of a data warehouse. Okay,

to understand this, the subset part we understand, right? In a data warehouse, we have many different tables available like this. Okay, now these tables can be, let's say, product information, order information, payment information. We have the fact table, like the fact table information.

These are the product dimension, order dimension, payment dimension, user dimension, date dimension. Okay, these are the different tables available in the data warehouse.


Now, like this, there are many different teams working in the organization—a team that handles shipping information, a team that handles refunds, a team that handles payments, a team that handles third parties, accounting, IT. These are the different departments available, and inside these departments, we have different teams.

Now, all of these teams don't really need this entire dataset. Every team wants to solve their own business use case. So, understand this: this is my company. Inside the company, okay,

I have different departments. Inside the departments, we have different teams working on different problems. If you work in a large company, you will always see something like this: the company, inside it the different departments, and inside the departments different teams trying to solve their own department's issues so that they can meet the company's goals. If a team solves its own problem, it is solving the department's problem, which means it is solving the company's problem. And if all of these different departments solve their own problems, the company is moving forward.

Now, in order for these departments to solve their own problems, what they want to do is build their own reporting system, right? The analysis, data science, machine learning models—what they do is that they build the reporting system as per their team's requirement or their department's requirement. And what they do is create a subset of the data warehouse as per the requirement.
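One common way to carve out that subset is a database view over the warehouse tables, keeping only the columns one department needs. Here is a sketch in SQLite with hypothetical shipping-related names (`shipping_mart`, `dim_user`, and so on):

```python
import sqlite3

# Sketch: a "shipping" data mart built as a view over warehouse tables.
# All table, column, and view names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_user    (user_id INTEGER, user_name TEXT, city TEXT, email TEXT);
CREATE TABLE dim_product (product_id INTEGER, product_name TEXT, weight REAL, category TEXT);
CREATE TABLE fact_orders (order_id INTEGER, user_id INTEGER, product_id INTEGER, quantity INTEGER);

INSERT INTO dim_user    VALUES (1, 'Asha', 'Pune', 'asha@example.com');
INSERT INTO dim_product VALUES (10, 'Keyboard', 0.9, 'Electronics');
INSERT INTO fact_orders VALUES (100, 1, 10, 2);

-- The shipping team only needs a handful of columns from these tables,
-- so the mart exposes just those, pre-joined.
CREATE VIEW shipping_mart AS
SELECT o.order_id, u.user_name, u.city, p.product_name, p.weight, o.quantity
FROM fact_orders o
JOIN dim_user    u ON u.user_id    = o.user_id
JOIN dim_product p ON p.product_id = o.product_id;
""")

mart_rows = conn.execute("SELECT * FROM shipping_mart").fetchall()
print(mart_rows)  # [(100, 'Asha', 'Pune', 'Keyboard', 0.9, 2)]
```

In practice a mart is often a materialized table refreshed on a schedule rather than a live view, but the idea—a department-sized slice of the warehouse—is the same.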


So, for example, the shipping department just needs information about users, payments, and products—just those three tables. So, they will create their own new table,

using these three tables, and they will choose the columns that they need for reporting and all of the other things. Let's say there are 300 columns available in these three tables—they will just pick 100 columns from these three tables, okay, and they will build their own reporting system for their own department. This is called a data mart. Right, I am building this—this is

a subset of a data warehouse. I'm building my own reporting system; I'm building my own table as per my requirement. Data mart, a subset of the data warehouse—pretty simple to understand. I solve


my department's problem; that helps the company solve its problem. Simple. Okay, understood

the data mart concept? Let's move forward. Now, the data lake. This is a newer term that emerged because of object storage. Before we store our data in the data warehouse, we understood what we have to do—we have to process the data through ETL,

and then you can build a data warehouse and store your data. Everything you store inside the data warehouse is stored in a structured format. So that means every time you want to store new data or make any changes, you have to change the structure, and that is quite difficult to do. If you have a table that already has millions of rows and five columns, think about this: I have a table,

1, 2, 3, 4, 5—it has five columns and millions of rows. Now, tomorrow

I decide to add one more column, okay, inside the data. So all of these values will be null, okay? I

will have to change the entire structure and then start adding new rows as per the new data. So,

changing the structure, changing the schema, is quite difficult in a data warehouse because you have to take a lot of things into consideration. Now, here is what the data lake says: you don't worry about the ETL, you don't worry about writing transformations and structuring your data. What you can do is use a data lake, like S3—you store all of your data in the data lake. A data lake is basically a storage location.

You can use S3 as a data lake storage, okay? It is a centralized repository where you dump all of your data as it is, right? I will store all of my CSV data, I will store all of my Parquet data, I will store all of my JSON data as it is onto the different folder structures in my data lake.

Now, different teams from different departments—we understood, right, as in the data mart, there are different teams working here—they want their own columns, they want to generate their own reporting. So, what does the data lake say? Dump all of your

data as it is into the data lake, okay? As per your requirement, you can query the data from the data lake itself. Basically, you can directly read the data from the S3 file storage system, okay, object storage system, as per your requirement. This is called schema on read. Again, concepts

are getting quite heavy, but I'm trying to keep it easy. Take a break if you want—just pause the video if it is getting heavy and come back again—but if you can follow, just keep going. So, a data lake is basically a centralized repository. You can use S3, Azure Blob Storage or Azure Data Lake Storage, or Google Cloud Storage as a data lake—these are object storage services where you dump all of your data as it is,

as raw. And on the other side of the data lake, there are users or teams who can read this data. This is called schema on read. They say: I want to read this column from this file, these columns from that file—across all of the different files. I will read this data as per my requirement and build a table on, let's say, Athena, or any ad hoc query interface. That is up to me. They build their own table, or they can also pull the data directly from the data lake and put it into a structured format. So here, we only process the data that we need.

Instead of processing all of the data over here in the ETL and data warehouse part, we will only process the data that we need and then store that data in the data warehouse for querying purposes.
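The dump-as-is plus schema-on-read flow can be sketched with the local filesystem standing in for S3 (folder layout and file names are hypothetical):

```python
import csv
import json
import os
import tempfile

# Sketch of a data lake on the local filesystem (stand-in for S3 object storage).
lake = tempfile.mkdtemp()
os.makedirs(os.path.join(lake, "raw", "orders"), exist_ok=True)
os.makedirs(os.path.join(lake, "raw", "events"), exist_ok=True)

# Dump data as-is: CSV and JSON land in the lake with no upfront transformation.
with open(os.path.join(lake, "raw", "orders", "orders.csv"), "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["order_id", "user", "amount"],
                      [1, "alice", 120.0],
                      [2, "bob", 80.0]])
with open(os.path.join(lake, "raw", "events", "events.json"), "w") as f:
    json.dump([{"event": "click", "user": "alice"}], f)

# Schema on read: a consumer decides, at query time, which columns to pull
# and how to interpret them — nothing was enforced when the files were written.
with open(os.path.join(lake, "raw", "orders", "orders.csv"), newline="") as f:
    amounts = [float(row["amount"]) for row in csv.DictReader(f)]

print(sum(amounts))  # 200.0
```

On a real lake, tools like Athena or Spark apply the same idea at scale: the schema lives in the query (or a catalog), not in the stored files.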

Now, it is not that, okay, data warehouse is bad because it requires a lot of processing or that data lake is good. Both of these systems have their own place in the architecture because data warehouses give you the flexibility of structured data so you can do the analysis, whereas data lakes give you the ability to access any data anytime you want as per your requirement.

Okay, let's understand the difference between a data lake and a data warehouse, okay? Inside the

data warehouse, data is structured, as you can see here. Let me just zoom in. The users are business analysts, and it is used for batch processing, BI reporting, and similar things. The data is pre-defined, smaller in volume, and usually relational—columns and rows. Over here, in the data lake, data is unstructured because

you can store JSON data, you can store Parquet, CSV, whatever you want. Alright, users are usually data analysts and data scientists because instead of—think about this, right? Data scientists

want to build their own machine learning models. Now, in the data warehouse, alright, once you have data added, you can only work with the limited data, right, because you defined that as per the business goals, and changing the structure is quite difficult—you have to do a lot of changes inside the pipeline also. So, for data scientists and data analysts, a data

lake is a gold mine because it is completely raw data, stored as files in object storage as-is. It is up to me which data I want to read, which columns I want to read, as per my requirement. I can read it using Python code, I can write Spark code,

I can build a table on top of it as per my requirement, okay? So these are the users. The use

case is for stream processing, machine learning, real-time data analysis—you can use that. Okay,

the data is raw, the data is large, and its schema is undefined—it is not strictly relational. This is the difference between a data lake and a data warehouse. This is what we have understood till now. Now, this is just the fundamental concept. The

actual hands-on part, if you want, I have some projects available for free on the YouTube channel—I will give you the link in the comments. So if you want, you can do those and understand the data warehouse and also the data lake. I also teach all of these things hands-on in my courses, so if you are interested, just check the link in the description for the combo pack. So till now, we have understood a lot of different things. We

started by understanding what data engineering is, where data engineering actually fits into the entire pipeline, okay? We understood about the different roles such as software engineering, DBA, DS, ML, and all of the other things. We understood about the important part, which is the data engineering life cycle, okay? We understood about the ingestion, transformation, serving, how

all of these things happen. The storage part, we understood about why transformation is needed—like how the transformation actually happens. Data generation, data storage, DBMS systems, relational databases, data modeling, okay, how data modeling actually happens, NoSQL databases,

SQL versus NoSQL, data storage and processing such as OLTP versus OLAP, the difference between row-based and column-based databases, why OLTP is needed, why OLAP is needed, and why transformation is needed as we go from OLTP to OLAP. We understood

about ETL processing, and about the undercurrents such as security, data management, DataOps, architecture, and software engineering. We delved deep into the data architecture part—operational architecture and technical architecture—and a lot of things. We understood the data warehouse, the important part: ETL versus ELT, dimensional modeling, the snowflake schema and the star schema, and the difference between fact tables and dimension tables and how to build them—fact tables storing transactional values and dimension tables storing categorical values. We understood about

slowly changing dimensions, why we need them, different types of them, a lot of things. Data

marts—a subset of the data warehouse—and why we need them. We understood the data lake and the difference between a data warehouse and a data lake. We understood a lot of things about data engineering, actually. I was not expecting to go this deep before recording this video—I thought I'd just give an overview, but I got into a flow state and started explaining everything because I really love teaching. So, we understood a lot. If you've reached this section, do let me know by commenting, because it might be around 2 hours by now,

and if you're still watching, salute! Alright, do let me know in the comments that you watched this video till here and are about to complete the entire thing. And I just want to plug my courses—if you love my teaching and the way I teach, then do check out my data engineering courses. I create in-depth data engineering courses. It's

not just about the course—it's about giving you the experience, okay? The understanding of proper technology, how this works in the real world, right? It's not just about learning technologies; it's about understanding where it is used, how to use it, following best practices—all of these things I teach in my courses, so do check them out. You'll find the link in the description.

You'll also find the latest coupon code available with a discount, so go at least check that out.

And yeah, let's continue with our video. Okay, now we understood the fundamentals, and we also looked at this big data landscape. Let me just zoom in—can you see the tools' names? Can you see all the different things available? These are the data warehouses: Snowflake, AWS Redshift might be here, Microsoft's offering, Firebolt, Oracle—there are some new companies here too. This section is for data lakes: there might be S3, Databricks, Cloudera has their own stuff going on. Then there are the NoSQL databases, like MongoDB—there might be Cassandra somewhere,

Couchbase, and all of the others. Real-time databases, graph databases—you see, I was telling you about this, right? For every single use case—visualization, BI platforms, data science notebooks, MLOps, product analytics—there is a set of tools available. Every single technology,

everything that we want to do, has its own toolset—everything we covered while talking about the architecture part. Every single piece needs a set of tools, and we have thousands of tools to pick from.

Now, we will understand these individual tools, what they do, why they exist, right? What are the use cases for them, which tools are the most demanded and used by the industry, okay? So that we will understand, and how to work with them. Let's go one by one.


Now, let's talk about the cloud platforms, right? We understood about the cloud platform.

Cloud platforms are basically giant computers built in some data center, the basement of a company. It can be Amazon, okay, this is Amazon, this is Google, okay, and this is Microsoft.


Now, again, these are the three top cloud providers available in the market. There

are plenty of cloud providers—you have Cloudera, you have IBM Cloud, you have Oracle Cloud. Every

different cloud provider has its own features, but these are the three top cloud providers available.

What is cloud computing? It is basically these companies giving you the computer resources and different services so that you can use them for your work. Before this cloud, what we used to do—we used to build our own servers, okay? Own servers, that means you get your RAM, you

get your hard disk, okay? You get the processing power, processor, okay? You get the GPU if needed, you get all of the wires, you get the ACs to cool down the servers, you get the networking adapters, you get all of these different things, switches, you get the routers—every single thing you get,

you build it on your own. Now, you can do this—a lot of people still do it because they want to save on cloud costs—but it comes with a trade-off, because you have to maintain it all. What if the power goes down? What if my hard disk fails and I lose all of my data? You also have to think about replication and scalability: how do I scale this entire thing? Let's say right now I'm just working with millions of records and a small user base. Tomorrow, my business grows,

so I will have to buy new hardware, okay, and upgrade my system. What if my hard disk fails?

What if my RAM fails, okay? What if the hardware fails? What if an earthquake comes and I lose all of my data center resources? Anything can happen, right? You don't have control over nature. So,

this is the reason people usually go with the cloud providers because I don't want to set up all of these things by myself if I can directly pay to the cloud providers, okay? And these

cloud providers always charge pay-per-use, okay? Pay-per-use means that you only pay for what you

use. That's pretty awesome, right? I will only pay for what I use, whatever resources I consume. So,


if I just use a simple virtual machine—basically an online computer—and run it for two hours for some workload, I only pay for those two hours. In an on-premise data center, by contrast, I have to keep the machines running 24 hours a day, because that is how the whole server is set up: my website is hosted on it, other functions and databases are running, so it has to stay on around the clock. On the cloud, if I just want to run some workload quickly for two hours, I can rent a machine for exactly that and pay only for that use. The cloud has multiple services available for different use cases, and these services are divided into

three different parts: PaaS, SaaS, and IaaS.

This is Platform as a Service, this is Software as a Service, and this is Infrastructure as a Service. What do these three things mean? Platform as a Service means they give you the platform directly, so you don't have to worry about setting things up. For example, on AWS we have a service called AWS Lambda. You can call it Platform as a Service because they give you a platform where you can just

focus on writing your code—they will take care of all of the infrastructure side, such as running the server, all of the backend things, they will take care of the maintenance and everything. You just focus on writing your code. This is called Platform as a Service.


Second is Software as a Service. You can think of Software as a Service as Google Workspace: you have Google Sheets, Google Slides—the entire suite. That is Software as a Service, because they give you direct access to ready-made software for your work, so you can use it and grow your business. Then we have Infrastructure as a Service.

That basically means cloud providers will give you the infrastructure. So, an example of this is EC2 machine—this is basically a virtual machine online. There’s also a concept called EMR—this is like Elastic MapReduce, to run your Spark jobs. These are different infrastructures that they give you so that you can run your workloads, okay? This is how cloud platforms are divided into three

services—they give you these services that you can use to grow your business, right? Now, these

services have different names depending on the cloud provider. If I go to AWS, we have all these services and many more—these are just a few. Don't worry about the names if you are seeing them for the first time; if you already know them, that's good. There's EC2, which is like a virtual machine; Lambda, where you can just write and run your code on a serverless platform; Elastic Container Service, if you want to run a Docker image,

okay? There is Simple Email Service—it is used for notification purposes or email purposes. Aurora


is a database created by AWS—if you want to store relational data, it is a managed service. AWS gives you the service, and you only pay for the hours you use or the resources you consume, so you don't have to build all of this yourself.

Everything is built for you, pay for it, and grow your business. Elasticache, DynamoDB, right? EMR,

VPC, CloudFront, Elastic Load Balancing, Kinesis for real-time data, RDS for relational databases, Redshift for the data warehouse, Elasticsearch for search, services for IoT devices, Simple Storage Service—object storage you can use to build a data lake—Elastic File System, Elastic Block Storage, Cognito, API Gateway, a queue service. You need all of these to build your entire technical architecture,

right? We understood—we have the business goals, and once you define the business goals, you think about how to build your technical architecture. So you start thinking: which cloud platform should I go with? Let's say you are a student right now. You might have a question:

which cloud computing platform is the best and will get me a job? The answer is: pick any one of the three, and there is a high chance you will get a job, because most companies work with these cloud providers. If I were to rank them—this is just my personal opinion, and it can be wrong—this was my ranking about a year back:

I used to rank AWS as one, okay? Azure as two, and Google Cloud as three. Okay, now it is changing, and I'm seeing the trend that Azure can be one because a lot of companies are using Azure due to their new functionality and good services. The services that they provide are specific to

the enterprise level, so Azure is good if you want to target enterprise-level companies. They always

go with Azure, especially in India, because a lot of companies already use the Microsoft suite—Microsoft 365 at the enterprise level, Word, PowerPoint, and all the rest—so they are likely to go with Azure because the integration is quite simple. A lot of startups usually go with AWS because AWS gives you good credits, you can start easily, and a lot of people in the industry know AWS. If you want to find employees with AWS skills, it is quite easy, so a lot of startups pick AWS. Like,

I'm building my data engineering startup, okay? I'm also using AWS for my infrastructure.

The third one, I still say, is Google Cloud. There are some services Google provides that are really good, but this is my personal take—it can be wrong—based on what I see in the industry. If you want to target top companies—and by top companies I mean the enterprise level, like banks—service-based companies like Infosys and TCS can also be in the picture, depending on your goals. For established companies that have already IPO'd, you can just research their architecture, and you will find that a lot of enterprise-level companies use Azure. A lot of Indian startups—if you see Zepto,

if you see CRED, okay—all of these guys are on AWS because it's good for startups, they give it a good ecosystem. So, I say if you want to target startups, learn AWS and GCP. I think it's—I always suggest either learning Azure or AWS unless you want to target a specific company and they tell

you that they require skills in GCP, then go with GCP. Okay, I just answered your question.

If you are a student, then you can go with this. If you are someone who is looking to build the architecture, again, the situation is the same: think about the services that solve your problem.

Okay, we will talk about the different services, but the idea is to think about what services these cloud providers give us that can help us solve our business goals. We understood about operational and technical architecture—now you start thinking from this point of view: if I were to choose AWS,

GCP, and Azure, and if I say, okay, Azure gives me these services, AWS gives me these services, and as per my requirement, I can easily solve all of my business problems using Azure because they have a good service pack together, so I'll go with Azure. Like, I can do a simple small project on Azure and see if that works—if it works, I can move my entire production

workloads onto Azure. Okay, if that doesn't work, there is also the concept of hybrid cloud, so you use some services from Azure, you use the best services from AWS, you use the best services from Google Cloud, okay, and build your system. For example, in my personal opinion,

right? I really love Google BigQuery—this is a data warehouse provided by Google, okay?

And on Azure, I really love the Databricks integration, okay? On AWS, I really love the Glue service, which runs serverless Spark workloads—and I also love S3 as object storage, okay? So, what I can do: I can use S3 as my object storage,

I can use Databricks as my Spark workload, and I can use BigQuery as my data warehouse. So, you can also do cross-cloud integration, but maintaining all of these things is quite difficult. Again,

there are some tools that can help you with that, but these are the different concepts that you can explore. I just want to throw them at you right now so that you can keep that in mind, okay? Let's move forward—let's talk about the services that we understood, okay?

Now, we understood, right, these are the services—so let's say if I go with AWS, and if I build my entire architecture, if I want to build my ETL pipeline, okay, how will I go with that? Let's say this is how it will happen, okay? Let me just remove this, okay. Collect, process, store, and analyze, right? Data engineering lifecycle—the simple

architecture that we've been understanding. I can collect data from S3, Kinesis, DynamoDB, RDS, MSK, whatever, right? This is object storage, this is the real-time data streaming platform, this is the

NoSQL database, this is the relational database. We understood data is coming from multiple places, okay, where we can collect our data and easily ingest it.


So, we can collect the data, okay? Then we can do the event processing. Let's say every time data gets uploaded onto Amazon S3,

I want to run the Lambda function. Okay, Lambda function is basically the compute service, so if you want to run small code, you can do that—I can do this, and then I can do the actual data processing using EMR, which is a Spark workload. I can run the machine learning, I can run AWS Glue,

again the Spark workload, and then I can use these services for analysis. So on AWS itself, I can build my entire data system, right? Instead of going out and picking random tools, AWS gives you a wide range of services that you can pick from that pool and build your entire data system,

okay? This is just an example, okay? Just to help you understand from this entire service tool pack,

right, that AWS gives you—we understood about services. Services can be platforms—they might give you the platform, they might give you the software as a service, they might give you the infrastructure as a service, right? These are the different services that they provide, and using these services, I can build my entire platform,

okay? And it might look something like this, okay? Now just pay attention, okay? Don't get confused,

don't get scared about all of these things—now we're just trying to go a little bit advanced, okay? And this is the architecture of one of the top companies or top startups called Dream11 in

India, okay? Dream11 is like the fantasy betting app. This is the architecture of Dream11 that

they have used to build on AWS. Now, if you see this architecture, you will understand it is not completely AWS, okay? There are some things that they use from AWS, as you can see over here, okay,

and there are some things that they use that are open source, and this is how technical systems are built. This is the final version of Dream11—they went through three different phases to build this

particular architecture. I have posted a LinkedIn post—I will put the link in the description. If I

forget, just remind me in the comments—I will add it. Okay, now let's try to understand, and also remember, our data engineering lifecycle, okay? Even though this architecture looks quite complicated, the fundamental concept—the data engineering lifecycle—is quite the same, okay? First of all, what do we have in the data engineering lifecycle? First,

we have the generation source. Now here, as we understood, our data is coming from multiple places, so we have third-party vendors, okay? As you can see—let me just zoom in. Okay,

our data is coming from third-party vendors, there is some RDBMS, like MySQL, there is some NoSQL, like Cassandra. Where is the streaming data coming from? There is the Cassandra NoSQL database,

okay? And then there's the application, so there's iOS and Android application, and there's

the desktop, Dream11.com, as you can see over here. So, we understood, right, data comes from multiple places. In this case, data is coming from third-party vendors, okay? Data is coming

from third-party vendors, data is coming from the databases, data is coming from the application and iOS. I kept telling you, right? Data is coming from multiple places—this is

what it means. It is coming from multiple places. Now I want to ingest this data into my system, and most of the time, for ingestion, for real-time streaming ingestion, or just ingestion, people use Apache Kafka, okay? Apache Kafka is a real-time data streaming platform, a distributed

real-time data streaming platform, so you can work on large-scale data, okay? And you can easily put Kafka in between to consume all of the data, okay? In Kafka, these are all of the producers, okay? Let me just write it over here. All of these people, okay, are producers who are producing all

of this data, okay? Once the data gets into Kafka over here, okay, everything else that happens is consumers—consumers who are consuming all of this data, okay? Simple to understand. Again, we are

not deep-diving into Kafka—I will be launching a course on Kafka, so you can keep an eye on that, okay, in the future. But data is getting produced and data is getting consumed—here, consumption is basically what I want to do with this data, okay? So, there is a batch pipeline going on over here, as you can see, okay? This is a batch pipeline. First of all, we understood, right, once the data

is ingested, we need to store our data somewhere, right? There was a storage layer below. So, the

data gets stored inside Amazon S3 as a data lake. Now, the concept of the data lake is coming. Now,

the concept of Amazon S3, which is a service on AWS, is coming, right? I kept telling you—you can use S3 as a data lake. I store my data onto the data lake, okay? What happens after this?

This data goes through ETL, okay? As you can see over here, the ETL is happening using Apache Spark—there is an Apache Spark workload available—and then it stores our data onto Amazon Redshift. This is what I kept telling you—this

is a data warehouse, okay? This is my data warehouse service available on AWS, right? This

is my ingestion, this is my data warehouse, this is my data lake, this is my storage, right? There's

one more thing that I told you, right? In a data warehouse, we put our data by transforming and making it into the structured format. Now, there is one more pipeline that goes—it is called ad hoc analysis, okay? And as you can see, it is using Amazon Athena, which is a query engine for ad hoc

analysis, and I told you, right? The Looker, okay, the reporting system or the data science people,

can use the raw data that is coming. I can use this raw data as it is, okay, from the system as per my requirement, or I can also use structured data as per my requirement. So, I get access to both of these things—I can get the proper structured data also, and I also get the raw

data as per my requirement, okay? Again, recall the data engineering lifecycle that we covered—now try to connect every single thing that we have done, right? We

understood data warehouses, we understood data lakes, we understood ETL, we understood ingestion, storage—every single thing is put together into the real-time system of Dream11 case study, right?
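The event-processing step mentioned earlier—run a small piece of code every time a file lands in S3—is exactly what a Lambda handler does. A minimal sketch (the event shape follows the standard S3 put-notification payload; the bucket and key names are invented):

```python
import json

def lambda_handler(event, context):
    """Sketch of an S3-triggered Lambda: pull the bucket and key out of the
    standard S3 event payload and decide what to do next."""
    records = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # In a real pipeline you might kick off a Glue job or EMR step here.
        records.append({"bucket": bucket, "key": key})
    return {"statusCode": 200, "body": json.dumps(records)}

# Simulate the event AWS would send on upload (names are hypothetical).
fake_event = {"Records": [{"s3": {"bucket": {"name": "raw-zone"},
                                  "object": {"key": "2024/01/orders.csv"}}}]}
print(lambda_handler(fake_event, None))
```

Locally simulating the event like this is also how such handlers are usually unit-tested before deployment.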

We're just trying to understand the real-world architecture right now, and how they use the fundamental concept in the real world, okay? Every single thing that we talked about, okay, it makes sense here, right? We have the ETL system for ad hoc analysis, we have the structured data, we have the ingestion system going on, okay? This is just a batch pipeline, okay? This is there. There's one

more thing—we have the real-time pipeline going on over here, okay? For the real-time pipeline, what they are using is Apache Flink, okay? Apache Flink is used for the streaming engine, so if they want to understand data on a real-time basis, they can use this and analyze it. So,

from the streaming engine, we go to Elasticsearch, there might be some notification service, there might be some visualization available over here—not sure about that, but this is the entire pipeline. And the fundamental concept that we use is the data engineering lifecycle, okay? And all of the concepts that we use. So, every time I store my data onto Redshift,

this is the data warehouse. I might use dimensional modeling, okay? After the ETL, I use Apache Spark to transform my data. I use S3 as my data lake storage, and I use Amazon Athena for the ad hoc query. I use Looker for my visualization, I use Jupyter Notebook for my data science workload,

I use Kafka for my ingestion—these are all of my sources. I use Apache Flink for handling real-time data streaming—this is the real architecture. This is everything we did in the last two hours just to understand this particular thing. Once you have understood this, you get a good gist of data engineering. Now you know, like, yeah, I am a data engineer because I understand this architecture

and what is going on. This is the fundamental part, right? Once you understand the fundamentals, you can understand any architecture. Now, once you complete this entire video, you can understand any architecture in the world because you will know, okay, there is some ingestion going on, there is some transformation happening, there is some loading happening, there is some ETL happening.
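Stripped of specific tools, that ingest → transform → load flow can be sketched in a few lines of plain Python (the records are invented; in production each function would be a service such as Kafka, Spark, or Redshift):

```python
# Each lifecycle stage as a tiny function; the shape stays the same no
# matter which tool implements each step.

def extract():
    # Ingestion: pretend these rows arrived from an app or a vendor feed.
    return [{"user": "a", "amount": "10"}, {"user": "b", "amount": "5"},
            {"user": "a", "amount": "7"}]

def transform(rows):
    # Processing: cast types and aggregate per user.
    totals = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0) + int(r["amount"])
    return totals

def load(totals, warehouse):
    # Storage/serving: write the structured result into the "warehouse".
    warehouse.update(totals)
    return warehouse

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse)  # {'a': 17, 'b': 5}
```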

I understand this—the tool is different, right? I can replace this tool with Snowflake, I can replace this tool with Databricks, right? I can use something else over here—it doesn't matter, okay? It will work the same. The features might be different, the performance might be different,

but fundamentally, it will give the same output. But for their use cases, for Dream11, they might have tried multiple things and then finally came up with the final architecture that is currently working for their system, right? Everything that you see on their application,

everything you see on your app as Dream11, there is this kind of system behind that, making it possible, okay? It's not some magic. Alright, we understood AWS—now let's understand GCP. Just like AWS, right, we understood that we have different services available on GCP also:

ingestion, okay? For ingestion, we have App Engine, Cloud Pub/Sub, Cloud Transfer Service,

BigQuery, Cloud Function. Now, I want to tell you this: most of the cloud providers have similar services available. For example, this is a data warehouse available on GCP called BigQuery, which

is the same as Redshift on AWS. Okay, there’s Cloud Function available, which is the same as AWS Lambda—fundamentally, they give you a similar platform to perform your workload. The name is

different, the features are different, the cost is different, but fundamentally, it is the same, right? Just like we have Cloud Storage—this is basically the S3 of AWS,

object storage. We have Cloud SQL—this is the RDS, this is the Relational Database

Service that we talked about, right? BigQuery is a data warehouse that we understood. Data Prep,

Dataproc is the same as EMR, okay? Elastic MapReduce on AWS. So, if you understand, services are the same—most of the cloud providers have similar, overlapping services. It's

always about choosing the best service for your use case. So, for ingestion, they have this many, for storage, they have this many, for processing, they have this many, and for exploration. Again,

the concept of the data engineering lifecycle: I have to ingest something, I have to store something, I have to process something, I have to serve something. Okay, this is the simple architecture on the GCP. Same fundamental concept applies—I have data coming from multiple places, I ingest this data, I store this data, there is a pipeline that is running right now. Again,

I store some data, I store the data onto BigQuery, okay? And then there’s some privacy on Oracle Cloud, like identity campaign running—this is like the end-user part, right? Customer platform—there

is customer data, and data destinations such as web apps, customer service, marketing messaging.
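The point that every provider ships a counterpart under a different name can be captured in a small lookup table. This is a rough sketch based on the equivalences discussed above; the Azure column lists commonly cited counterparts and is approximate:

```python
# Rough equivalence table across providers. Roles and the first two columns
# follow the mapping discussed in the text; the Azure column is approximate.
EQUIVALENTS = {
    #  role               AWS          GCP                Azure (approx.)
    "object storage": ("S3",        "Cloud Storage",   "ADLS / Blob Storage"),
    "data warehouse": ("Redshift",  "BigQuery",        "Synapse Analytics"),
    "functions":      ("Lambda",    "Cloud Functions", "Azure Functions"),
    "relational db":  ("RDS",       "Cloud SQL",       "Azure SQL Database"),
    "managed spark":  ("EMR/Glue",  "Dataproc",        "Databricks on Azure"),
}

def equivalent(role, cloud):
    """Look up the named role's service on a given cloud."""
    idx = {"aws": 0, "gcp": 1, "azure": 2}[cloud]
    return EQUIVALENTS[role][idx]

print(equivalent("data warehouse", "gcp"))  # BigQuery
```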

Same fundamental concept: data source, collect, process, store, and give it to something. This is

where the entire data engineering is happening, right? I get the data, I ingest it properly, I store it, I process it, I give it back. Okay, now this is done on GCP. Again,

let's look at the Azure level also, okay? These are the developer services. We have compute, okay? For the compute, we have virtual machine, cloud machine, batch storage—again, the same.

We have the web and mobile app. For data, we have the SQL database, Redis Cache, we have SQL Data Warehouses—that is also available. For analytics, we have Data Lake Analytics, Data Lake Store, Stream Analytics, Machine Learning, Data Factory, okay? IoT, we have media, we have identity access.

In my opinion, as a data engineer, Azure has very good services for data engineering workloads, okay? There are three services that I really like on Azure. One is Databricks,

okay? I really like Databricks because it is properly integrated with Azure, and Databricks

is basically the environment to run Apache Spark workloads, okay? Second is Data Factory, and third, I like Synapse Analytics, okay? Most of the

services—and there’s a new service available I haven't explored called Fabric. Microsoft Fabric

is basically the combination of these multiple services where you can do everything in one place, okay? It is especially designed for data engineering workloads,

making your life much easier. I have a project on this on my YouTube channel available for free, okay? I'll put the link in the description—if I forget, do let me know by commenting, I will put

that. If you want, you can explore that. I also teach about all of these things in my courses,

so we do have projects available on that—you can explore that by going to the website.

Okay, now, this is the architecture side of the same thing, just like AWS, okay? I can replicate this entire architecture on GCP also and also on Azure. What I have to do is

basically just replace—let's say if I'm replacing this entire thing onto GCP, what I will do,

instead of S3, I will use Cloud Storage, okay? Instead of Redshift, I will use BigQuery, okay? I can put DataProc here, okay? I can also put BigQuery here if I want. For streaming engine,

I can also put Pub/Sub and Dataflow, okay? Um, what else? For Kafka, I can put Pub/Sub, but I'll say I'll go with Kafka—Kafka is best. Okay, Looker is good, this is good, everything

else seems fine. So, I can also convert this—my AWS architecture—to GCP. Performance might

be different, the costing might be different, the UI might be different, the integration might be different, but I can do that. Okay, I can also do the same for Azure as well, simple. Okay, and

this is what the Azure architecture says, right? What do we have? We have the customer stream data, we have the customer batch files, okay? We are ingesting this from the external sources and landing it in ADLS, which is Azure Data Lake Storage (Gen2), a service available on Azure. Okay, now there's a Data Factory running, okay? It is like data is coming

from on-premise sources, and some stream data goes to the Data Factory. It gets entered into the raw zone, okay? The raw data is getting entered. We use Databricks over here to process this

raw data and store it in the processed folder. After that, it goes to the analytical zone, and it goes to the SQL pool, which is a SQL data warehouse. Now from here, customers can use this to build the Power BI dashboard and get insights, okay? It can also be integrated

with the desktop application if needed. Same concept: collect, ingest, store, transform, serve, and use it. The same thing is happening, and there are some undercurrents. As you can see, we are using Azure Key Vault to securely store our keys, Log Analytics, Azure Purview,

and Azure DevOps to properly operationalize our entire integration for the scripts.
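The raw → processed → analytical zone flow described above is a general pattern, not an Azure-specific one. A minimal simulation with in-memory "files" (the data and the validation rule are invented; in Azure these zones would be ADLS folders and the cleaning step would run on Databricks):

```python
import csv, io

# Simulate the raw -> processed -> analytical zones with in-memory data.
raw_zone = "user,amount\na,10\nb,oops\na,7\n"   # raw: as ingested, messy

def to_processed(raw_csv):
    # "Databricks" step: drop rows that fail validation, cast types.
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    return [{"user": r["user"], "amount": int(r["amount"])}
            for r in rows if r["amount"].isdigit()]

def to_analytical(processed):
    # Warehouse-style aggregate, ready for a Power BI dashboard.
    out = {}
    for r in processed:
        out[r["user"]] = out.get(r["user"], 0) + r["amount"]
    return out

print(to_analytical(to_processed(raw_zone)))  # {'a': 17}
```

Note that the bad row (`b,oops`) is silently dropped here; a real pipeline would usually route rejects to a quarantine zone instead.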

These are the different services that can be used together. So, these are the three different pipelines that we have used till now. Okay, I just showed you using AWS, I showed you using GCP, and I showed you using Azure. Now, let's look at the modern data architecture, right? This is

also modern, just especially built on the cloud. This is the modern data architecture, right?

Modern data architecture is basically where new companies are coming into the market and saying, "Okay, the tools that you guys are using are old now. They don't work with the new data workloads, the new volume, and the approach is very old, okay? And I, as a new startup, I am a modern

data company. I will make your life easier." So instead of you doing the ETL, remove the ETL,

okay? I will say directly load the data into my product, okay? And I will directly transform it

for you as per your requirement, so that you can directly save time on the ETL and start querying the data. This is what the modern company says. They all have different requirements, so they

directly give you the integration between your different sources, as you can see here, right?

I have different data coming from sources like Stripe, Google, PostgreSQL, Google Play. What

they say is that they have the integration with all of these different sources. These are the applications, right? Fivetran, Airbyte, Stitch, okay? These are usually used for ingestion. You

can also use Python and SQL, which is also the modern way. Before this, we had Hadoop and all of the other workloads. This company comes and says, "Okay, use our system because we have made all of these things easier for you. Directly connect with these multiple sources, we'll pull the

data for you, and we'll directly load it onto the data warehouse so that you can do everything as per your requirement." There is a new tool called DBT (Data Build Tool) that is used for transformation inside the warehouse.

People say that it is going to replace SQL. Not going to happen. Most of the time, a lot of companies come and want to replace SQL, but still, SQL is the king of data, right? You should

always learn SQL. DBT is also gaining a lot of popularity—it lets you divide your data into multiple stages. The thing that I told you about, like ELT, right? We were doing ETL till now: Extract, Transform, Load. Now we will

do EL—that is, we will extract our data and directly load it into the data warehouse,

different landing zones, okay? This is modern data architecture, right? We create the landing area, we create the staging area, we create the warehouse layer, and we create the mart layer. Same

fundamental concept. If you see, there is a data mart, there's a data warehouse, there's object storage, and there's a landing area to store the raw data, right? I store my raw data, I store the staging area after some transformation, I store my data in the warehouse, and this is my mart layer. All

of these things you can create inside DBT, and you can store your data directly onto Snowflake, Redshift, and all of the other things. Same thing. Then it can be consumed by the BI people and machine learning people; they can build dashboards on different tools, and you can do the analysis on different tools. There's also the concept of reverse ETL. Companies are using

that. Basically, that means I have transformed my data, I can put back this data onto the

source system and get more insights from the transformed data by ingesting that data back into the system again. This is a totally different concept—I'll cover it in some other videos,

but there's also a concept of reverse ETL that we also saw in the data engineering life cycle, okay?
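The landing → staging → warehouse → mart layering described above—the T of ELT happening inside the warehouse, which is what DBT models typically implement—can be sketched with plain SQL against an embedded database (the table names and data are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Landing: raw data loaded as-is (the E and L of ELT, no upfront transform).
con.execute("CREATE TABLE landing_orders (user_id TEXT, amount TEXT)")
con.executemany("INSERT INTO landing_orders VALUES (?, ?)",
                [("a", "10"), ("b", "5"), ("a", "7")])
# Staging: clean and cast inside the warehouse itself (the T of ELT).
con.execute("""CREATE TABLE stg_orders AS
               SELECT user_id, CAST(amount AS INTEGER) AS amount
               FROM landing_orders""")
# Mart layer: an aggregate ready for BI tools.
con.execute("""CREATE TABLE mart_user_totals AS
               SELECT user_id, SUM(amount) AS total
               FROM stg_orders GROUP BY user_id""")
print(con.execute("SELECT * FROM mart_user_totals ORDER BY user_id").fetchall())
# [('a', 17), ('b', 5)]
```

In DBT each of these CREATE TABLE ... AS SELECT steps would be its own model file, with the dependency order handled for you.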

Modern data architecture—we understood about GCP, we understood about Azure, AWS, and the modern data tools. A combination of these different tools and AWS and Azure can build the modern

data platform. Now, again, we talked about this. Here we have like thousands of tools available,

right? How do I decide which tool is best for me? First, I look at the business requirement. Does

this solve my business problem? Yes. If it does, then I should use that tool. If it doesn't, it should be reversible—okay, I can easily remove it. Let's say this tool is costing me too much and is not really even solving my problem: I can directly remove it and go with another tool. If that tool is also not working, I can directly go with Spark because it

is going to work because it is open source, right? All of these things are going to cost you. Spark

is going to cost you for the server, so you have to choose between them—this one is easy, that one might be quite difficult to set up. So, as a company, if you are a startup, people usually go with these things because it saves time, okay? You have the money, but you want to save time, so you can go with this. This will solve your problem, this will also solve your problem, okay? Whatever

solves your problem, whatever helps you reach your business goals, you can go with that, okay? Now,

I just want to take a break, so I'll have some water, and I'll come back in a minute. Alright,

till now we have understood a lot of things. Now, this is kind of like the end of the video, and again, I can't cover every single thing, but I want to leave you with some of the important tools that you can learn for data engineering and some of the concepts at the end, okay?

So that is important for you in your career. So let's start with that, okay? Important tools for data engineering. Now, first of all, if you want to become a data engineer, you have to learn a few

things. First of all is the programming language. You have three choices: Python, Scala, and

Java. Now, if you want to learn any programming language, I always suggest starting with Python,

okay? It's the easiest to learn, mostly used by industry, because if you want to write the

ETL scripts, if you want to write the Kafka ingestion engine, and all of the other things—Python has a lot of packages that make your life much easier, and even industries use Python for all of these workloads, so you should always go with Python. If not Python, you can also go

with Java. Java also has good support, because most of the open-source frameworks, like Apache

Spark, run on the JVM, okay? So you can go with Java also, but my suggestion is to go with Python, okay? Important things to learn in Python: the basics, such as variables, operators, and basic data structures like dictionaries, lists, and all of the

other things. Important things to learn include how to work with date and time formats and how to

other things. Important things to learn include how to work with date and time formats, how to, uh, like there's a package inside Python called Pandas, so you should learn how to transform the basics of data, how to work with different file formats also, like CSV, JSON, Avro, okay?

This is what you can learn in Python. I have already created a detailed roadmap for this, so I'll also put the link to that particular video in the description. If I forget, do let me know, and I will add it, okay? SQL—again, SQL is the backbone of your data career. You

cannot skip this. This is how you communicate with the databases. We understood everything, so you have to learn SQL. This is non-negotiable, okay? You cannot skip SQL. You cannot skip Python.
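The kind of everyday SQL you cannot avoid—a join, an aggregate, a filter—shown here through Python's built-in sqlite3 module (the tables and data are invented):

```python
import sqlite3

# Everyday data-engineering SQL: join two tables, aggregate, filter groups.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.execute("CREATE TABLE orders (user_id INTEGER, amount INTEGER)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "asha"), (2, "bo")])
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 30), (1, 20), (2, 5)])
big_spenders = con.execute("""
    SELECT u.name, SUM(o.amount) AS total
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.name HAVING total >= 25
""").fetchall()
print(big_spenders)  # [('asha', 50)]
```

The same SELECT would run nearly unchanged on Redshift, BigQuery, or Snowflake, which is exactly why SQL is the non-negotiable skill.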

This is the foundation, so you have to learn this. After this, you can understand Linux commands, because you will be working with cloud providers or Linux machines.
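Since you will be driving those machines from a terminal, a minimal session with the kinds of commands the video goes on to list might look like this (the directory and file names are invented):

```shell
mkdir -p demo && cd demo    # make a working directory and move into it
echo "hello" > notes.txt    # create a file
cat notes.txt               # view the file
cp notes.txt backup.txt     # copy it
find . -name "*.txt"        # find files matching a pattern
cd .. && rm -r demo         # go back up and clean up
```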

Something like 80 to 90% of the servers online run on Linux, so you should learn to interact with them, because there is no GUI, right? You don't have a graphical user interface. You'll be

accessing it using the terminal. You can learn commands like cd, how to clear, how to copy, how to exit, how to find something, how to view the file, okay? These are the different commands that you can learn. You can just search on YouTube, Basic Linux Commands, and you will get a good tutorial, okay? Now, we have data warehouses. Again, you don't have to learn all of

the data warehouses, okay? You have AWS Redshift available, you have BigQuery, we have Hive available, SAP Analytics, and Snowflake. My suggestion is to learn Snowflake because it is not dependent on the cloud platform, okay? It is cloud-independent, so you can learn this.

Also, this is highly in demand in the market, so you can easily learn this and add it to your skill set. There's one more that I love personally, which is BigQuery, okay? Because I've worked with BigQuery for the last three to four years, and I've really enjoyed

this service—it is one of my favorite services on GCP. So, my suggestion is to go with Snowflake because it is cloud-independent, okay? If you

are working with a specific cloud, you are going to learn its warehouse anyway: if you're learning AWS, you are anyway going to learn about Redshift, and if you're learning GCP, you are anyway going to learn about BigQuery, right? So, my

suggestion is to just learn Snowflake, because you will pick up those others while learning the cloud itself. Hive is an open-source tool that not many people use; it is mostly used as the metastore for Apache Spark or Apache Hadoop workloads, to store some of that information, but it is not really worth learning separately. Just learn the basics, and if a requirement comes up, you can learn it on the go. It will take you one

to two days if you have the basics clear, okay? Data processing. This is interesting, okay? For

different workloads, you can use Apache Spark for batch and streaming. You have to learn Spark. This

is very, very important, okay? You cannot skip Spark either, because top companies use it to process big data. You also have to learn Kafka, because it is very important for processing real-time data, okay? Out of these three, I would say learn Apache Flink for real-time analytics.

You can use Kafka for streaming. You can use Flink for analytics. There's NiFi and Apache Beam also.
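If the batch-versus-streaming distinction still feels abstract, here's a toy illustration in plain Python: a running total that updates per event is, conceptually, the kind of state a stream processor like Flink maintains at scale. The numbers are made up, and this is only an analogy, not how these engines are actually used:

```python
# Batch processing: the whole dataset is available up front
batch = [4, 7, 1, 3]
batch_total = sum(batch)

# Stream processing: events arrive one at a time and state updates per event
def running_totals(events):
    total = 0
    for value in events:   # in a real stream, this loop never has to end
        total += value
        yield total        # emit an updated result after every event

stream_results = list(running_totals(batch))
print(batch_total, stream_results)
```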

If you learn GCP, you will automatically pick up Apache Beam as well. So, my suggestion is to learn Apache Spark and Apache Kafka only; the rest, not right now. If you ever have to use one of them somewhere, you will learn it on the go, okay? Just add Kafka and Spark to your skill set. Data orchestration.

Okay, we have many tools available. Out of these, you should use Apache Airflow, one of the most widely used tools in the market, okay? Then we have these modern data tools. Uh,

these tools take roughly 30 minutes to an hour to learn if you have your fundamentals clear, right? You can just watch one video and understand 80 to 90% of the tool. It is

very simple. I learned about Mage in just one hour, okay? It didn't take me more than that. So,

there's one project available on our channel also, so if you want, you can learn that. These modern

tools are created to make your life easier, okay? Apache Airflow, by contrast, is quite complicated to learn; it will take some time to understand the gist of it, and we have a course on that, which I'll tell you about shortly. But you can learn about Dagster, Mage, and Prefect within one hour.

I don't think that will take much time, okay? And these are the modern data tools available; they are all part of the modern data stack. As you can see, for ingestion we have Airbyte and Fivetran. For data storage, we have BigQuery, Snowflake, Databricks. For BI, we have Looker and Data Studio. For data transformation, we have DBT. And for data orchestration, if you

want to orchestrate your entire thing, we have Airflow. There are some data quality frameworks, Great Expectations, and there are metadata platforms like OpenLineage and DataHub. Again,

you can just search about the tool name, and you will get what they do, okay? When

we talk about the modern data stack, it is really important to understand why these tools exist in the market, meaning what problem they solve. Fivetran solves the problem of data ingestion: it takes data from one source and pushes it to another. DBT gives you modern data transformation. Airflow is for orchestration, so if you want to orchestrate and build a data pipeline, you can do that. Uh, this

is for data quality and governance, so these are some of the tools available. Just search online, and you'll find plenty of resources. Alright, uh, now I want to cover these individual things, right? Uh, what do you need to learn about Python? What do you need to learn about SQL?

What do you need to learn about data warehouses, Spark, Apache Airflow, and Kafka, okay? So I just want to cover these individual things. Again, I already have the roadmap available, but I'll just quickly go through this part. Let me just open this, right? Uh, this is available here. So

learning Python is one thing, and learning Python for data engineering is another thing, right? You

can learn Python for free online, but if you want to learn Python for data engineering, you have to learn certain things. I'll just show you quickly because I have it on my website itself. So this

is my Python for Data Engineering course. I'll just go through the modules. You don't have to take this course; you can learn these things for free online as well. I created the courses just to give you a structured learning approach so that you don't get distracted, okay? All of these modules are open, so if you want, you can learn them. You can start with strings, numbers, and data types, then data structures like lists, dictionaries, sets, and tuples. You can learn about conditional statements like if-else and about loops (for loop, while loop), then go to the intermediate level, such as understanding Python packages, how to import them,

list comprehensions, exception handling. We have to learn how to work with text files, basics of Lambda functions, and object-oriented programming. There are some advanced concepts such as NumPy, understanding the NumPy package, Pandas basics, how to use Pandas for transformation, then working with date-time formats—very important if you want to work as a data engineer—how to work

with different file formats like JSON, CSV, Excel, Avro, okay? And these are the basics.
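To make a few of those topics concrete (data structures, list comprehensions, date-time parsing, JSON), here's a tiny self-contained snippet; the records are invented for illustration:

```python
import json
from datetime import datetime

# Raw event records, like rows pulled from an API (made-up data)
events = [
    {"user": "a", "amount": "10.5", "ts": "2024-01-15"},
    {"user": "b", "amount": "3.2", "ts": "2024-01-16"},
]

# List comprehension plus type conversion: a very common cleaning step
amounts = [float(e["amount"]) for e in events]

# Working with date-time formats
dates = [datetime.strptime(e["ts"], "%Y-%m-%d") for e in events]

# Serialize the cleaned result to JSON, as a pipeline step might
summary = {"total": sum(amounts), "first_day": dates[0].strftime("%d %b %Y")}
print(json.dumps(summary))
```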

In my course, I have included one project for Python, okay? It is a Spotify data pipeline project; I'll tell you about that at the end, if you're interested. Then we have SQL. Inside SQL, what do you have to learn? You can pick one DBMS. We are going with PostgreSQL because PostgreSQL is open-source, easy to learn, and easy to set up. Learn the important SQL keywords such as SELECT, INSERT, UPDATE, and so on. Learn about data types, how to create tables, how to create a database, and the different types of queries available,

okay? Like DML and DDL, that is, Data Manipulation Language and Data Definition Language; you can learn about that. You can learn about operators in SQL, the ALTER query, database constraints, primary and foreign keys, ACID properties, normalization, INSERT and UPDATE statements, joins (inner, left, right, outer, cross), ORDER BY, GROUP BY, the HAVING clause, and aggregation functions like MIN and MAX. Also, understand the advanced

topics like subqueries, Common Table Expressions, window functions, analytical functions like RANK, DENSE_RANK, ROW_NUMBER, LEAD, LAG, set operations, working with date-time, case statements, and stored procedures. Then learn about data modeling; we covered the basics of it, like ER modeling. So learn about that and just try to build your own data model. Like

you can pick one company, like e-commerce or Instagram or Netflix, and build a data model as a project, right? It looks something like this, as you can see over here: this is an Instagram data model, and this is an e-commerce data model. After this, you can learn about data warehouses. In data warehouses, you can start with the basics: understand what a data warehouse is, understand OLTP vs OLAP (we covered this), understand the difference between data warehouses and data lakes and the ETL process, and learn the basics of Snowflake; just create an account on Snowflake. We have tutorials

on Snowflake on the YouTube channel as well. Then deep dive into dimensional modeling: understanding what dimensional modeling is, fact tables, dimension tables, star schema, snowflake schema, types of fact tables, how to create fact tables, factless fact tables, surrogate keys, and the date dimension.
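As a sketch of what a star schema means in practice, here's a minimal fact-plus-dimension example using Python's built-in sqlite3; the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes, one row per product
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Keyboard", "Electronics"), (2, "Notebook", "Stationery")])

# Fact table: measurable events, pointing at dimensions via foreign keys
cur.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(101, 1, 2), (102, 1, 1), (103, 2, 5)])

# A typical analytical query: join the fact to the dimension, then aggregate
rows = cur.execute("""
    SELECT d.category, SUM(f.qty) AS total_qty
    FROM fact_sales f
    JOIN dim_product d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(rows)
```

The fact table holds the measurable events and foreign keys; the dimension table holds the descriptive attributes. That split is exactly what dimensional modeling formalizes.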

So these are the things you can learn about dimensional modeling. You can learn about SCD, Slowly Changing Dimensions. You can learn about ETL—these are the concepts that you can cover in the Snowflake database, okay? Like staging, copy command, file formats, handling unstructured data, how to work with them, virtual warehouses, caching, clustering, storage integration,

Snowpipe, time travel, how to undrop things, how to recover data from the past, types of tables, zero-copy cloning, data sharing, materialized views—these are the concepts that you can learn in Snowflake, right? I'm just trying to give you an overview of the things that you can learn. For me,

I have created this step-by-step roadmap. I'm still building this entire thing, so you can go to this website, DataVidhya, and you will see that I'm trying to build a course—first is Python, then the second one is SQL, third one is data warehouses, fourth one is Spark with Databricks, fifth one is workflow orchestration. I'm currently working on the Kafka course, okay?

And then there will be a dedicated cloud computing course in the future, okay? So, after this, we have Apache Spark. This is very important. In Apache Spark, understand what Apache Spark is, why we need Apache Spark, understand the architecture, understand concepts such as DataFrame, transformations, actions, lazy evaluation in Apache Spark, okay? Learn how to install

Apache Spark, very important. Then deep dive into the structured API in Apache Spark. We have two things: the structured API and the lower-level API. Learn the basics of the structured API: how to define user-defined functions, the data types of Apache Spark, data sources, partitioning, bucketing, and how to work with external tables. Then we also have the lower-level API, such as the Resilient Distributed Dataset (RDD). Also learn about production applications: how to run Spark on a cluster and on Databricks,

okay? These are topics that you can also cover; you can just screenshot this, or you can visit the DataVidhya website to get an understanding of the modules, okay? You

can learn all of these things for free online, okay? You don't really have to go through my courses; I'm only walking through them because it makes this whole thing easier to explain. For Airflow also, you can just go through this section: what you need to cover, the few concepts that are important, and then the projects you can build like this. So I just quickly showed you the different topics that you can cover from the website. So instead of writing each and every single thing onto this page, uh,

that will just increase the time of the video, and I'm also, uh, feeling pain inside my throat, uh, because I've been recording this thing for the last 3 hours, okay? Uh, so I just quickly showed you that particular thing. These are the two different topics that I also wanted to cover: data security and data masking, okay? Data security is important—we talked about this at the initial

stage. Uh, in data security, we have to take care of three things: confidentiality, integrity,

stage. Uh, in data security, we have to take care of three things: confidentiality, integrity, and availability, right? Ensure your data is accessible only to authorized users, so you don't give access to your data to every user—only the authorized users should be able to access it.

Integrity basically means maintaining the accuracy and completeness of your data, so your data should be accurate and complete all the way to the final output, and availability means that your data is available to authorized users whenever it's needed, okay? These are the three important things in data security. These are the measures you should take: first, you should encrypt your data,

okay? Encryption should happen so that, if the data goes over the network, other people are not able to understand what it is. Access control: only give the data to specific users. Data

classification: classify your data, for example as confidential or not. And network security: secure your data at the network level. The one concept that I wanted to talk about is data masking, okay?
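Masking is easy to show in code. Here's a minimal Python sketch that keeps only the last four digits of a sensitive field visible; the field formats are just illustrative:

```python
def mask_ssn(ssn: str) -> str:
    """Mask a Social Security number, revealing only the last four digits."""
    last4 = ssn[-4:]
    return f"XXX-XX-{last4}"

def mask_card(card: str) -> str:
    """Mask a credit card number the same way."""
    digits = card.replace("-", "").replace(" ", "")
    return "X" * (len(digits) - 4) + digits[-4:]

print(mask_ssn("123-45-6789"))           # XXX-XX-6789
print(mask_card("4111 1111 1111 1234"))  # XXXXXXXXXXXX1234
```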

Uh, I talked about this at the governance level. So usually what happens is that you have, say, an employee table, and there are governance restrictions, regulations by governments, that say you should not store sensitive information about users in plain form: credit card numbers, addresses, social security numbers, and so on. So

when you do store it, make sure you mask them. Masking is basically a technique. So basically,

this is the ID of the user, right? If I want to mask it, I replace it with some random number. Or let's say this is my Social Security number: I will mask it with something like XXX-XX- and reveal only the last four digits, like this, okay? You can also do this

for credit cards. This is called masking, okay? Now, these are the different file formats you can use for big data; the common ones are JSON, CSV, Parquet, and ORC, and every file format has its own use case. I don't want to deep dive into this right now. I covered some of it in my courses, or you can just Google it, and you will understand most of what you want to learn, okay? So, till now, we have covered a lot of different things. I might have missed some of the topics; I cannot cover every single topic in a single video. I'm not sure how long this will be. Once I sit and edit this video,

I'll get to know about the timeline, but approximately, this might be a 3-hour video. I

might have missed some of the topics, so what you can do is comment the topics that you want to learn. Just the fundamental topics, right? For hands-on, we will have the projects. What I will do is club all of these topics into part two, and I will create a video like this, a long, three-hour video that you can watch, okay? Now, you understood all of these things. If you like the way I teach and you really want to learn data engineering, you can go to the website. I will put the link in

the description: DataVidhya/combo-pack. In this combo pack, you will get five courses, because till now only five courses have launched; I'm currently working on the Apache Kafka course, as you can see, which shows 'Not Available' for now. You will also get access to notes like these if you enroll, because I created all of these notes myself,

okay? So that you can revise any time you want. So you will get access to all of these notes. This is my Zero-to-Hero Data Engineering Combo Pack; it comes with the five courses. Now,

in the future, when new courses launch, you can enroll in them separately, and I will also create a new combo pack; at that time, you might see GCP added into it as well. In this combo pack, you will get around 14+ projects, and I will teach you how to take one project and make it the best project.

You will get a step-by-step approach. As you can see over here, in the Python for Data Engineering course, you will build this particular project, okay? In Snowflake,

you will build a similar project, but instead of using the Glue Crawler, the Glue Catalog, and Amazon Athena, we will be using Snowflake, okay? Now, in the Spark course, we will take the part built with Python on Lambda and replace it with Apache Spark. This way, you will understand how to evolve one simple project and how to plug and play with different toolsets. This is what you

will learn in the entire combo pack, right? How to take one simple project and make it the best production-level project as we go forward, okay? We start with the basics, we'll replace some components, we will add Spark, then we will also add Apache Airflow, okay? Inside Airflow,

we will use the same project, and we will use Docker and Apache Airflow to orchestrate the entire pipeline. We will also create a similar project using Apache Airflow only, okay? So as you can see, we will build one simple project in five to six different ways, so that you get an understanding that data engineering is not just about using tools; it's about the fundamentals. The fundamentals that we covered, you will actually implement over here like this. You will also get projects on Apache NiFi and real-time data streaming, and there's one project on Twitter data analysis, which is also available on

YouTube. Uh, there's a project available on GCP also. There's one project on Azure. Uh,

the GCP project is also available; let me just show you over here. Yeah, this is a crypto data pipeline project, available in the Apache Airflow course, okay? So you will learn about Azure, AWS, and GCP just by doing these five courses. And

then, in the future, we will have in-depth courses on the individual clouds also. So you will get around 14 different projects here, across five courses; you can get the information about all of them just by clicking this, okay? These are the reviews from our students, previous students who have built their own projects, so if you want to check, you can go through them and see that they have built some amazing projects. You

can just click over here, okay? And you will be redirected to the link of the project. I hope

this is working, okay? Or you can go here as well. Yeah, as you can see over here, this student actually built the Airbnb project using Azure. So just like this, you can build your own project and put it on your resume too. This course is for everyone: cloud engineers, web developers, data engineers, technical consultants. It doesn't matter who you are; you

can learn this. What you will get from this course—you will get the code template, okay? You

will get everything about the code that you can use. You'll get access to the interactive Discord community, and you will get support: if you are stuck with any doubts or errors, you can ask on the Discord channel, and someone will help you out, or I will. Also,

in the future, you will get early access and a huge discount on future courses. So, let's say I launch the Kafka course: you will get a huge discount for that course as well. These are some of the reviews from our students, so you can go through them, and these are some of the commonly asked questions, which you can also go through. So,

if you're interested, you can go through this. If you're not, you can also learn by yourself, okay? My voice is breaking, but the best part about this particular data engineering roadmap

is that every single course is in-depth, so you will learn most of the things. Most of the bootcamps available in the market just give you surface-level knowledge: if you go to any website that offers data engineering courses, they teach you all of these topics, but for each and

every module, they might have like two to three videos added to their module, and they are done.

I will give you all of these things in detail, and with that, you also get access to notes like this, okay? If I show you Obsidian, you can see this interactive graph environment with the Apache Spark topics available here. What are the different topics connected to? Topics are also connected across courses, so a concept like partitioning or transformation also applies to the data warehouse and Apache Kafka. So

as you can see: data warehouse, SQL, and the basic topics as well. All of these topics are here, and you can easily interact with the graph. You can directly search for a specific topic such as partitioning, okay? And I

can see that partitioning is available over here, and also over here, so I can easily search and learn about the different things, okay? After this, there's one more thing: you will get the detailed notes. If I were to show you the notes, let's say this is the basics of Docker; you can go and search for it, or for the Airflow basics and UI, or, say, writing my first DAG. This will give me every single instruction, right? How to write my first DAG, what the code is, everything that I have to do; every single thing you will get here, okay? So this will make your life so much easier, because you won't get distracted by looking at different courses or different resources. You

just stick to one single path, and you can become a data engineer. So this is what I wanted to show you about my courses. If you're interested, just check the link in the description. If you're not, totally up to you, uh, you can use multiple resources. I also have free resources available, so you can also check that on my YouTube channel. That's everything about this video. Uh, this is

now almost 3 and a half hours that I've been recording this video. Hopefully, the recording gets saved so that I don't have to re-record the entire thing. That was everything for this video. If you're

still watching this video, do let me know by writing a comment, okay? Because this is a long video. Also, like this video, because I put a lot of hard work into it, and share it with people so that everyone can take advantage of it and grow in their careers.

So, that's everything. Thank you for watching this video. I'll see you in the next one. Thank you so much.
