Learn Databricks in Under 2 Hours
By Alex The Analyst
Summary
## Key takeaways
- **Databricks Free Edition signup**: Sign up for the Databricks Free Edition using just your email, no credit card required, and get immediate access to serverless compute, SQL warehouses, notebooks, and all the AI tools, completely free. [02:14], [11:08]
- **Query CSV files directly**: You can query CSV files directly by file path using the `csv.` prefix, without converting them to tables first, making CSV one of the easiest file formats to work with in Databricks. [13:43], [14:17]
- **Connect external data sources**: Use Databricks connectors to easily sync data from Google Drive, Salesforce, Google Analytics, and more; the sync automatically adds a timestamp column so you can track when data was ingested. [17:14], [19:26]
- **Dual SQL editor and notebooks**: Use the SQL editor for simple querying and saving queries, but switch to notebooks when you need to mix SQL, Python, R, or Scala with markdown notes and share a more comprehensive analysis with your team. [20:54], [21:08]
- **Genie AI natural language queries**: Genie lets business users ask natural-language questions about their data, like "What zip code has the most pickups?", and automatically generates SQL queries, visualizations, and narrative insights. [06:58], [47:36]
- **AI assistant fixes code errors**: The AI assistant diagnoses errors such as thousands separators in numeric columns, suggests fixes like using REPLACE before CAST, and can convert code between SQL and Python or add comments automatically. [52:02], [01:14:07]
Topics Covered
- Full Video
Full Transcript
What's going on, everybody? Welcome back to another video. Today we're going to be learning Databricks in under two hours. Now, for the past 5 weeks, we've been diving into Databricks, and in this video, we're putting all of that together so you can follow along really easily. We're going to start by taking a quick walkthrough of Databricks and seeing everything it has to offer. Then, we're going to start importing data, not just from a flat file, but also by connecting to a data source. Then we're going to start using their SQL editor as well as their notebooks, and visualizing data. After this, we'll be using their AI tools like Genie and their AI assistant. And finally, we'll be building a full project at the end. The best thing about all this is it is completely free. Databricks has something called the Databricks Free Edition. I will leave a link in the description. You can make an account by just using your email. That is it. So, be sure to go and create your free account so you can follow along with this entire video, because we have a lot of things that we are going to cover.
With all that being said, I hope you learn a ton, because Databricks is a fantastic platform. So let's jump right into it. Now, Databricks is an amazing platform. I've been using it for many years, and it's built on top of Apache Spark, which means it is very good at handling large amounts of data. It's designed for entire teams of data engineers, data analysts, and data scientists to work with data collaboratively. So that means ingesting the data, analyzing the data, and visualizing the data all in one place. In this lesson, we're going to dive into Databricks and look at some of the core functionality and features that the Databricks Free Edition has to offer. Then, in the next several lessons, we're going to get really hands-on, digging into these features and building things out. With that being said, let's jump on my screen and get started.

Before we jump into the platform itself, I just want to show you that this is where you sign up. I'm going to have this link down below so you can just click on it. You can create an account and you can sign in. There is no credit card information that they take. You just literally sign in and you are good to go. On this page, you can sign up for the free edition or log in, which we'll do in just a second. But you can also learn a lot more about the Databricks Free Edition and everything it has to offer. You can see that they have built-in AI agents, and AI built in throughout, which is something that we're going to cover in this series. You can also visualize your data and create full dashboards. And of course, you can interact with your data with Python and SQL, either through notebooks or through a code editor. Let's go back up and sign up for the free edition. All we have to do is log in. You can do that with Google, with a Microsoft account, or just an email. I'm going to sign in with my Google account. And all I have to do is say where I'm from. So, this is Alex, we have the free edition, and I'm from the United States. Let's go ahead and click continue. And just like that, we are signed up for the Databricks Free Edition. It was about as seamless as it can possibly be.

Now, there's a lot to cover. Databricks has so many different features, so many different things that you can do, from building interactive dashboards to writing SQL queries and sharing them with your team. You can even build AI agents. There are a lot of things you can do, and some of this might be a little intimidating if you've never used a platform like this before. If you're used to working in something like SQL or R or Tableau, it's a little bit more advanced than that. There are a lot of things all combined into one place, and so we're going to walk through a lot of what you see on this left-hand side to see exactly what you can do in Databricks.

Now, I'm going to start at the very top, because this is a workspace. And a workspace is basically the place that you work in, and that you can collaborate in as well. So you can give your teammates access to your workspace. If you have data in here, code in here, visualizations, whatever it is, you can share that with them and they get access to all of your work. Next, let's go to catalog. You can think of a catalog as like a schema. If you have a schema in a database, you're going to have all of your tables, views, stored procedures, and everything else underneath it. And this is very similar. So if I go into a workspace, or if I go down here into the samples, we have different databases, and we can click into these databases and look at all these different tables. This is where you're going to have access to view all of your data, files, and tables within Databricks.

Next, let's take a look at jobs and pipelines. This is where we start getting into some automation. Here we have an ingestion pipeline, an ETL pipeline, and a job, and each of these does a slightly different thing. If you create an ingestion pipeline, that's like the extraction part of the ETL process. You're extracting data to bring it in.
Then we have our ETL pipeline, which is the entire process of transforming data, loading it, and actually putting it into a table within Databricks. And then we have our jobs, which orchestrate it all. We're able to say when we actually run these pipelines so we can schedule them: daily, weekly, monthly, whenever we want. We don't have any created yet, but when we do, they'll be down here at the bottom.

Next, let's look at compute. Now, compute is very simple in the Free Edition because you don't have any options here. We just have our serverless starter warehouse. This is what you get for free with the Free Edition. If you go to the full Databricks product, you're able to customize your compute to your needs. For this Free Edition, we have the owner, we have the size, which is a 2X-Small, and then we have whether it's active or not. Right now, we're not running anything. We're not looking at any tables, we're not running any queries, so it's not active at the moment. But the second we open something up and start working with real data, it will activate this serverless compute and we'll be able to use it for free. One small nuance to this: there actually is one other type of compute, you just can't see it. That's for when you run Python in a notebook. It's called generic compute, and it's just for that one specific use case. For the most part, you're going to be using this serverless compute, but I thought it was worth mentioning. Notice when I clicked on compute, it also took me right down here to the SQL warehouse. This is our SQL warehouse that we are using. Next, let's
go down to the marketplace. The marketplace is basically a collection of companies that Databricks partners with and has connectors to, and it allows you to work with them a lot more easily. So, here we have Partner Connect integrations. If you want to connect to Fivetran, to Power BI, to Tableau, dbt, Prophecy, you can easily connect to these. I know at my previous job, I connected it with Power BI a lot. These are great connectors where you can search and see, hey, do they connect to this tool that we use, and most of the time they do have that connector. Now, the marketplace has other things as well. If we come over here to products, we have things like tables or files, which is data. You can also search for models and notebooks and all sorts of other things. So we can come in here and search for files, and these are all free resources and files that you can search for and use. If you can't find data here in the marketplace, you can always just go to something like Kaggle, get free data, and bring that into Databricks as well.

Now, let's come over here and go to the SQL editor. You can see we have these larger overarching sections up here: SQL, then data engineering, then AI and machine learning. Right now, we're in the SQL section. So, right in here, we're able to write SQL just like on any other platform. The neat thing, though, is that they have AI integrated into it, which helps a lot with writing some of the base queries that you're going to write. We're going to have a whole lesson on Genie, their AI system within Databricks, where we'll see how you can generate code and do a lot of different things with it. Genie allows you to get insights from your data using natural language. But you'll be able to write SQL right here in this editor, as well as have multiple tabs and choose what workspace you're working in. Right now, we're just on the default, but if you had different workspaces, you'd be able to click on those and select where you want your queries to be pointed at. Next, let's
take a look at our queries. Once we get in and start developing all these queries, we want to save them, and organization is a big piece of that. I know I used to have 20 or 30 different queries saved in my file explorer back at my old job. Then I started working in Databricks and realized, hey, I can just save those within it. I don't have to email someone my query so they can pull it up and copy and paste; it's all here within Databricks, which is really great.

Next, let's go look at dashboards. These are some sample dashboards, and we're going to have a whole lesson on analyzing and visualizing data within Databricks, so we'll be able to build something like this. Let's click into one of these dashboards. Right here, it says my warehouse is starting. That's because it's connecting to the data to be able to visualize it. Let's get rid of this filter really quick. So, this is what our dashboard looks like. You're able to create all these different visualizations and drill down into the data, and this is all connected to data that is already in Databricks, so it's all in one place.

Next, let's go take a look at Genie; they have a whole section devoted just to this. Now, this is something that a lot of platforms are trying to build toward, but Databricks already has it: they have all the data, they have the ability to query and analyze and visualize data all in one place, and they let you use AI with it to get a lot of those things done faster. You can ask it questions, and it's going to prompt you, because this is kind of a starter section. You can ask it questions, and it's going to run queries and get you that information really quickly. In one of our future lessons, we're going to be diving into just Genie: what it can do, how it helps you analyze your data faster, and how it can be really useful to you as a user.

The next thing we're going to take a look at is alerts. An alert is basically like a trigger: when a condition is met, it's going to do something. You can customize these alerts and set these conditions, and they'll email you or message you when the conditions are true. Now, this section up here is where we're going to be spending the majority of our time. There are other things, like job runs, which is where you can create your jobs (part of jobs and pipelines), and of course data ingestion, which allows you to connect to your data. So if you have a data source that you want to point at, maybe Google Analytics, you can just click on this and connect to your data, or if you just want to upload a file, you can do that as well.

Down here, we have our AI and machine learning section. You can click on the playground; this is another area where you can interact with the AI within Databricks, and you can select custom parameters as well as the prompts that you are using. We can also look at experiments. This is where you can build out AI agents and machine learning models and actually test them. It's a fantastic place to learn how to use a lot of these foundational models: you can take your data, deploy it, work through the issues, and really learn hands-on how to work with them. Naturally, they have AI integrated into all of this, so you can work with the Databricks Assistant to get help with all of your coding. As you're going along, if you get stuck, you can always get help with your code.
As I said before, though, we are going to be spending most of our time right up here. Of course, we'll need to ingest our data, but after that, we're going to spend a lot of our time visualizing, seeing how we can work with our data, and using their integrated AI to do our work faster. At the end of this entire series, we'll be building a full project using a lot of the things we're going to look at along the way. I think building and getting hands-on is the best way to learn, so I cannot wait to get started on that.

There are a lot of ways to work with data in Databricks, and we're going to cover a lot of them in this lesson. We're going to upload data like a CSV and a JSON file. Then, we're going to connect to an external data source and bring that data in. Then, we're going to see how we can actually use our data, working in the SQL editor as well as using a notebook. The best way to follow along is to create a Databricks account. I will leave a link in the description.
It's the Databricks Free Edition, so it is completely free. All you have to do is create an account and sign in, and you'll have access to everything. You don't have to put in a credit card at all. It is completely free, which is amazing.
With all that being said, let's jump on my screen and get started. We are starting out fresh. We haven't uploaded or ingested any data into Databricks yet, so you are right where you need to be. Now, there are actually a lot of different ways you can bring data into Databricks. Right now, we're in our workspace, but let's go over to our catalog and let's go into our workspace.
Let's go into default. Right now, we have no data in our default schema. What we're going to do is click into this default, come right over here, and go to create. We have a few different options, and these two are the only ones that are really relevant to what we're doing right now. So, let's create a new volume here. I'm just going to call this "YouTube series", and then I'm going to go ahead and create it.
Now, you'll see right over here under default, we have this volume called "YouTube series". If I click into it, we can upload to this volume. We can do that right here; we don't have to go outside of it or use any of these other options. We just click on "Upload to this volume".
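As a quick aside, anything uploaded this way lands at a predictable path under the volume, and you can list it straight from SQL. The catalog, schema, and volume names below follow this walkthrough (a workspace catalog, a default schema, and a volume saved as youtube_series), but they may differ in your account, so treat the path as a sketch:

```sql
-- List the files stored in a Unity Catalog volume.
-- Path segments are /Volumes/<catalog>/<schema>/<volume>/
LIST '/Volumes/workspace/default/youtube_series/';
```

That same path convention is what you copy when you want to point a query at a file directly.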
So, let's go ahead and click browse. I have all these different files. Now, I created these; they're very simple files. We're not getting crazy in this lesson, but I have a customers CSV and a customers JSON. So, one's a CSV file, one's a JSON file. I also have an orders CSV, an orders JSON, a products CSV, and a products JSON. We're not going to use all of these, although I will have all these files in a GitHub repo; you can find them in a link below, so you can download these exact files if you want to work alongside me. But let's bring in the customers CSV. I'm going to click on this customers CSV, and you can see the destination path that we are using. Let's go ahead and upload this file. And very quickly, in about one second, we now have our customers CSV. Now, I just named it that, so even if "CSV" weren't in the name, we'd still be able to see it.
Let's click into the CSV really quickly. We have our customer ID, first name, last name, country, and signup date, and you can see our data is all separated by commas. Now, this data is just being stored as a file; it isn't actually being stored as a table. You can see it just sits in here as a CSV, and if we go over to our files, it sits there as an actual CSV file. But we can still query this data just as it is, as a file; we don't actually have to have it in a table format. Let's come right over here and copy this path. Then we're going to go down to our SQL editor and open a SQL query. Now, if we want to access this data, we can say select everything, then say from, and I'm going to put the path in here. I should be able to wrap it in backticks, just like this. Let's try running this. Actually, I need to put "csv." right before the path. Now, let's try running this. And as you can see, we were able to read in our data without it actually being in a table. A CSV file is one of the easiest files to work with.
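Putting those pieces together, the query pattern looks something like the sketch below; the path is illustrative, and in practice you use whatever path you copied from your own volume:

```sql
-- Query a raw CSV file in place, no table needed.
-- The csv. prefix tells Databricks to parse the file as CSV,
-- and the backticks around the path are required.
SELECT *
FROM csv.`/Volumes/workspace/default/youtube_series/customers.csv`;
```

The same pattern works for other formats, swapping the prefix to match the file type.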
It's just values that are separated by commas, and so when we say "csv." and then provide the path, we are reading in this data as if it were in a table. Now, if you come from a pure SQL background, this may seem very unintuitive to you, and that's okay. Let's go back to our catalog. We're going to come over here to workspace and go to default. And instead of going into our volume this time, we're going to stay at the default level, because now we're going to create a table. When we go to create a table, we're still using our serverless starter warehouse, but now we have the ability to connect to a data source or to upload a specific file format.
Let's go ahead and click browse. We just uploaded the customers CSV, but let's go down to the orders JSON, because JSON is a totally different format. Let's go ahead and open it up. It's going to read in that JSON and make it tabular, which is fantastic, because JSON by definition is not a tabular format. Being able to read in that data really easily and put it into columns and rows is very helpful. So now we're going to create a new table.
We can name this table anything we want. I'm just going to drop the "JSON" from the name, because once it's read in, it's a table anyway. So, we're going to keep it as "orders". Then, we're going to create this table. Now, you'll notice over here under our default schema, we have tables and we have volumes. They are separated out because they're two totally different things: one stores files in a folder-like format, and the other holds the tables underneath our default schema.
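Once the table exists, it can also be queried by its fully qualified three-part name, catalog.schema.table. A minimal sketch, assuming the default names used in this walkthrough:

```sql
-- Fully qualified name; plain "orders" also works when the
-- editor is already pointed at the workspace.default schema
SELECT *
FROM workspace.default.orders
LIMIT 10;
```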
So, let's come down here to orders, which is right over here. Under our orders table, we can see customer ID, order date, order ID, product ID, quantity, and total amount, and we can get a little bit of metadata on this table. If we come up here, we can create a query. So let's click on create query. It opens up a new query, and now we are just selecting everything from our orders table. It's already pointed at the default schema in our workspace, so we don't have to write out workspace.default.orders. You can, but I'm just going to hit Ctrl+Z here. Let's go ahead and run this. And now we can see all of our data in this really pretty view. So far, we've ingested a CSV file, and we did that just as a CSV file.
We were still able to read that data in, which is really fantastic. But we're also able to just create tables and then read the data in like any other SQL database. Now let's come over here to data ingestion. There are a lot of different ways to access data that is not just sitting in a file. Right off the bat, we have our Databricks connectors: things like Salesforce, Workday, ServiceNow, Google Analytics, and Azure SQL Server. These are a lot of the tools I've used in my actual work. Almost all companies are going to have at least one of these, so connecting to that data source and bringing it in is really common. We can also bring in data from a file like we did before, or create a table from Amazon S3. Then we have our Fivetran connectors right down at the bottom. I know personally, from working at different companies and consulting with companies, that many use Google Drive as their store of information; that's where they keep everything. So let's go over here to Google Drive, and I'm going to actually connect it. I'm going to say I want to put it in my workspace. Let's go ahead and click next. Really quickly: I have a file over here, orderscv.csv, sitting in a Google Drive, and that's what we're going to try to connect to. Let's go ahead and click next. This is the email that I'm using for this Google Drive; we're going to connect it to Fivetran. Since this is a new account for Fivetran, I need to create a password, so I'm going to do that really quick. Next, we're going to come right down here in Google Drive and go to share, and all we have to do is make sure that this is not restricted. So, we're going to say anyone with the link, which includes Fivetran as well. Then we're going to copy this link and put it in our folder URL field. Let's go ahead and save and test this. It looks like our connection test passed. Let's go ahead and click continue. And now all we're going to do is sync our data.
We'll start the initial sync, and it should be very quick, because I do not have much data in this folder. It looks like our connection was successful. If we come over here to our schema, the one file that we have in our Google folder is synced up, so we should be good to go. Now, let's come back to Databricks. We're going to go into our catalog, into our workspace, and now we have default and we also have a Google Drive schema. Let's click on it, and we have this orders CSV.
And you'll notice that we now have it in here as a table. Let's go ahead and create a query for this so we can look at our data. So now we have workspace.google_drive, which is a different schema, and we have select everything from our orders CSV table. Let's run this.
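One thing worth knowing before we look at the results: Fivetran-managed connectors stamp each row with a sync timestamp, so you can check how fresh a synced table is. The schema, table, and column names below are what this kind of Google Drive sync typically produces, so verify them against what actually appears in your own catalog:

```sql
-- Most recent sync time for the table pulled in from Google Drive
SELECT MAX(_fivetran_synced) AS last_synced
FROM workspace.google_drive.orders_csv;
```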
And now we have our data. It also adds in this _fivetran_synced column, which is a really, really useful column, because if you're syncing this data consistently, you really want to know when the data gets put into the table. I promise you, it's really helpful that they put that in there. And then we have all of our data. So that's how we can connect to outside data. In this case, we used Fivetran, but sometimes you'll do a direct connection, depending on your data source. Now, this SQL editor works like any other SQL editor. You're going to be able to do joins and aggregations and all sorts of different things, but there is a different way to interact with your data. Let's come right over here and go to new. We can also use a notebook.
We can add code, we can add text, and you can also use an AI assistant to help with these things. So, we have a markdown cell right here; I can type "this is my first text" right here. And then I have my code down here, so I can start writing and typing my code. I can specify right here whether I want it to be SQL, Scala, R, or Python. For a notebook, you can use any of these. In this cell right here, I have Python. But let's add another one, and I can use SQL in the next one. And in the one after that, I could use R. So you don't have to use just one language; you can use multiple. Now, it depends on what you're doing, whether you want to use the SQL editor or come in here and use a notebook. When I'm just querying data, just looking at it, not doing a lot of transformations, I don't need a big programming language; I'm just querying the data, so I'll use the SQL editor most of the time, and I can always save those queries and pass them along. But if I'm really digging in and I need to break things out, leave notes, and share it with my team, a notebook is the way to go.

So let's look at SQL really quickly. We already have a query for this, albeit a very simple one. Let's come over here and run this SQL query. And now, if we scroll down, we're able to see our data just like we did in the SQL editor. But
now let's go write the same thing, but in Python. In order to do that, let's come up here; this cell is already in Python. So, we'll just say `spark.table()`, and we need to read in the appropriate table, which is just `orders`. Let's go ahead and run this. It reads it in as a DataFrame, so let's assign it to a variable, which we'll call `df`. Then, let's come right down here, say `display(df)`, and run this. And now we're going to get our data in a table, just like before. Of course, the notebook saves all this, and you can always come in here and rename it, export it, or put it into Git. You can share it with your friends, because all your friends really care about the notebooks you're writing. Or you can create a new notebook and start from scratch. But this is a totally different way to interact with your data in Databricks.
This is where most of your work is going to be done: right here querying data, whether it's a SQL query or over here in a notebook writing code, whether that's in Scala, Python, R, or SQL itself. Now, I hope that was really helpful, because in the next lesson we're going to be analyzing data with SQL and then building out visualizations in Databricks. In the last lesson we looked at the SQL editor as well as notebooks. So what we're going to do now is look at some data, analyze it, and then put it into a dashboard with different visualizations. This is actually the dashboard we're going to be creating in this lesson. And I know what you're thinking: "Alex, you put the exact same visualization twice. That makes no sense." But it will make sense once we get to it in the lesson. I highly recommend following along; all you need is a Databricks Free Edition account, and I'll leave a link in the description so you can make that account and follow along. With that being said, let's jump on my screen and get started. All right, so we're getting started right here on the dashboards.
All we're going to do is come up here and create our own dashboard. We start with a blank slate that has all these little arrows, and I recommend you read through them really quickly. This is how you add filters, this is where we get our data (I'll show you that in just a second), and this is how we add visualizations to what they're calling our canvas. Let's come up here and go to our data. Within our dataset, we're able to write SQL queries, and I'll say this is one of my personal favorite things: you can write a query and then use it to create a visualization. We're going to do that in this lesson, because I like being able to visualize my aggregations when I'm doing some type of GROUP BY in SQL; I like to see what the actual output is, and you can really easily do that here. You can also come down here and just select some of your data. We're going to be using a sample dataset that everyone should have; it's right here, called the Bakehouse. So we can also just come in and select one of our datasets. When we select this sales transactions table, it's going to start our warehouse. Now it's connected to our data, our serverless warehouse is running, and we have access to this data. If I click on these three dots, I can click Add to dashboard, and it's going to read this in as a SQL query. If we go back up here, we now have sales transactions as one of our datasets. So we can either write the query ourselves, or come into the catalog, select a dataset, and have it run for us. But before we actually start diving in to analyze and visualize this data, one more note.
There are other ways to analyze data, and we looked at this in a previous lesson. But as it pertains to connecting data to a dashboard, here's what we can do: let's click over to a notebook and say we want to run that exact same query. First, I'll go back to the dashboard and rename it, if I can spell this right, as our transaction dashboard. Now, from my data tab I can copy this query, and right over here we have access to a fresh notebook we haven't used yet. I'm going to set the cell language to SQL, just so you can see how easy this is, and run the exact same query; we get our output right down here. Let's say this is the output we want (it isn't, but let's pretend). If we go to these three dots and scroll down, I can click Add to dashboard, then Add to existing dashboard, click on the transaction dashboard, and import it. The data comes over as an untitled dataset, which we can rename; I'm just going to call it "from notebook". So this is our data from the notebook: the exact same query, and as you can see, the exact same data. The point is, you don't have to create your data only from Create from SQL in the dashboard tab. Now let's start actually working on our dashboard and analyzing the data as we go. Let's just
start by taking a look at our data. So we're working with the sales transactions. We have transaction ID, customer ID, and franchise ID. If we scroll over to the right-hand side, we have the date and time the transaction went through, the product that was purchased, the quantity, the unit price, the total price, the payment method, and the card number. Now, this is not real data, so don't try to steal any of these card numbers, but we are going to use this data to create our visualizations. Typically, when I'm analyzing and then visualizing data, I have an end goal in mind: I know what I'm going to do with the data, so I know how I need to analyze it. Whether I need a GROUP BY or a window function, or whether I need to clean or pivot the data, these are things I'll know ahead of time. In this lesson, we're going to keep it fairly simple and just learn how to build these things. In the last lesson of this series, we'll build a full project with really messy data that we need to dig into. But here, we're going to learn a lot of the fundamentals. Let's start with that first bar chart we saw earlier on.
So we're going to come right here and go to Create from SQL. Let's rename this one before we actually write it; we'll call it product sales descending. What this means is we're going to take the product, and then we're also going to look at this total price to calculate the sales. This is actually going to be quite easy; if you know SQL, it should be pretty straightforward, because we're just using a GROUP BY. So I'm going to select the product (I use Tab for autocomplete here), then a comma, and then the sum of total price, hitting Tab again. Now that we have the product and the sum of total price, all we have to do is come down here and say GROUP BY product. We should also order the results, so we'll add ORDER BY total price, descending. So all we're doing, and let's run this, is taking the product, grouping by it, and taking the sum of all of that total price. And actually, I need to order by the sum of total price; I just took the column itself, but in this query we're using the sum. Let's try running that one more time. Listen, it happens to the best of us. So now we have the product right down here, ordered based on the sum of total price. Now, I'm going to leave this just like this, although typically I would use an alias, something like AS total_price. But I'm not going to, because I want to show you how easily we can rename this in the dashboard; we don't have to use this column name. So now this data is ready for us to visualize. Let's come over to our untitled page and title it; we're just going to name this dashboard.
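As a side note, the aggregation that query performs can be sketched in plain Python. This is just a minimal stand-in to show the GROUP BY plus ORDER BY logic; the product names and prices here are made up, not the actual Bakehouse data:

```python
from collections import defaultdict

# Made-up sample rows: (product, total_price) -- stand-ins for the real table
rows = [
    ("Golden Gate Ginger", 120.0),
    ("Outback Oatmeal", 80.0),
    ("Golden Gate Ginger", 60.0),
]

# Equivalent of: SELECT product, SUM(totalPrice) ... GROUP BY product
totals = defaultdict(float)
for product, total_price in rows:
    totals[product] += total_price

# Equivalent of: ORDER BY SUM(totalPrice) DESC
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # highest-grossing product first
```

Note that the sort key is the summed value, not the raw column, which is exactly the fix made to the query above.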
And we can get rid of these global filters for now, but we will need them in a little bit. Now, you have all these options down at the bottom: move, add a visualization, add a text box, add a filter, undo, and redo. Let's focus on adding a visualization first; we'll worry about the text box later, when we give this a header of "transaction dashboard". You don't have to, but we will for this dashboard. When you first create a visualization, you get this widget on the right-hand side, which is where you basically build out your visualizations. We'll come in here and select the data that we want: the product sales descending dataset we just built. We do want this bar chart, but let's look at all these different options first. There are a lot of visualization types here, and these cover the ones you'll use 99% of the time. So we're going to click on this bar chart. Now we have to select the x-axis and the y-axis for our chart; we have two separate axes, and we need to pick what data goes on each of them. For the x-axis, we'll use the sum of the total price, and for the y-axis, we'll select the product. Now, this looks perfectly fine, right? Let me expand it a little bit for a second. It looks fine as is, but there are a lot of little things we can do to make it a lot better. The first thing I'm noticing is that I'm having a tough time reading it. I'm asking, okay, how much is this one exactly? It's a little over 11,000, and if I hover over it, it'll tell me the exact number. But that's not good for a customer or somebody else to rely on. So I'm going to add labels, and now we can see the exact number right away; we don't have to guesstimate. The next thing I'm noticing is that the order is all over the place. It looks like it's alphabetical here, but if you remember, our data back here was in descending order based on the sum of total price, highest to lowest.
And I want that in our dashboard as well. All we have to do is click on this, go down to the three bars right above product, click By x-axis, and then right here choose descending. Now our data is ordered from highest to lowest; it's no longer alphabetical. There are some other things I want to highlight. We don't have to add these, but we absolutely can. First, let's add a title; we'll just call it "Total price by product" and keep it super simple. We can also add a description. For example, say we wanted to add some context, or highlight that this is our biggest seller. We could say, "Golden Gate Ginger is our highest-selling product eight years in a row." Now, I'm just making that up; it isn't real. But say we had some historical data showing that this is still our best seller, and I wanted to add that as context. You can definitely do that. You don't have to, but you can.
One other thing: let's take a look at these colors, because we don't have to keep them. We can use a custom color, whatever color you think is best for your dashboard. We can also click on this plus sign and create some other options. If we color by product, which is one I don't necessarily recommend, each product gets its own color on each row, and that's a little redundant. So let's go back, get rid of product, and select total price instead. Now we have a gradient scale from blue to white. That isn't my favorite; I actually prefer the green-blue up here. And this is a great way to analyze and visualize at the same time: sometimes you already know what you're going to build, so you can work with the data while you're building your dashboard. And now I can very easily see it: Golden Gate Ginger, man, that is our product; we're killing it with that one. But Orchard Oasis? Nobody likes that. It's doing okay, but it is our worst seller.
Now, just remember, we used this product sales descending query, and if we take a look, that's the only data available to that visualization. Let's come over here and look at our sales transactions; this is all of our data. So there are times when we don't even need to write a custom query for each visualization. Let's use our sales transactions to create a visualization in our dashboard. We'll come down here and create a new visualization, and the next one we're going to look at is payment type. Let's change this dataset to the sales transactions. So now we're using two different datasets; just note that for future reference, because it will come into play. We're going to create a pie chart, and for the angle, we'll use the payment method right here. It's going to give us a count. We don't have to do a count distinct; in fact, we probably shouldn't, because if we hover over it, we'd just see three, since there are only three payment options. But if we do a plain count and hover over it, it's around 3,330.
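The difference between count and count distinct here can be sketched in plain Python. The payment methods match the ones in the video, but the rows themselves are made up:

```python
from collections import Counter

# Made-up transactions, one payment method per row
payments = ["Visa", "Mastercard", "Amex", "Visa", "Visa", "Mastercard"]

# COUNT per slice: how many transactions used each method
counts = Counter(payments)
print(counts)  # e.g. Counter({'Visa': 3, 'Mastercard': 2, 'Amex': 1})

# COUNT(DISTINCT paymentMethod): how many different methods exist -- just 3
distinct_methods = len(set(payments))
print(distinct_methods)
```

That's why count distinct would make every pie chart of this column collapse to the number 3, while the plain count reflects transaction volume.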
So now we're going to be able to see how many transactions use each payment method. Now, the color is actually really important for a pie chart. Let's come in here and base it on the payment method, so it breaks out Mastercard, Amex, and Visa. We can see the split based on the color, but I really like labels; I think they're just super important, so I'm going to add them in right here. Now, here's an example where it says "count of payment method" right there on the dashboard. I don't like that; it's just the default field name. So I'm going to come in and change the display name: I'll just call it "payment method breakdown". And that looks a lot better; it no longer seems like I just tossed it in there, it seems intentional. So we were able to go into this sales transactions table, which has a lot of different fields, a lot of different columns, and just pick out the one we want to use. So far, our dashboard is looking great. We're going to build one last visualization, and then we'll work on filtering. So let's come right down here and create one last visualization. We're going to use sales transactions again, but this time we want to see our transactions over time, right? We have this date column, and we want to use it; let's use a line chart. If we go back to our data, we have this datetime field, and it's really useful: we want to see how many transactions we're having over time. Maybe there's a certain day of the week when people are just making a lot of transactions; that's really useful for someone to know. So let's go back to our dashboard, come down to our x-axis, and choose our datetime column. By default it's going to bucket it monthly, but you can come in here and transform it; it can automatically change that datetime column to basically any granularity you want. Let's just choose daily for now; we can change it later. We just have to choose our y-axis, so let's click the plus sign to see our sales over time and take the quantity; it's going to do the sum of quantity here. And if we go back just to check the data, the quantity is the actual amount we sold: we sold 8, we sold 36, we sold 40. So it's aggregating that data for us, which is really nice.
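That daily transform plus sum is the same idea as truncating each timestamp to its date and summing the quantities per day. Here's a minimal plain-Python sketch; the timestamps and quantities are made-up stand-ins:

```python
from collections import defaultdict
from datetime import datetime

# Made-up (dateTime, quantity) rows
rows = [
    (datetime(2024, 5, 1, 9, 30), 8),
    (datetime(2024, 5, 1, 14, 5), 36),
    (datetime(2024, 5, 2, 11, 0), 40),
]

# Daily transform: truncate each timestamp to its date, then SUM(quantity)
daily = defaultdict(int)
for ts, qty in rows:
    daily[ts.date()] += qty

for day in sorted(daily):
    print(day, daily[day])
```

Switching the dashboard from daily to monthly is just a coarser truncation of the same timestamp before grouping.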
If we come right over here, we can hover over this and see each day and the sum of quantity. Now, again, we want to transform this a little, because "sum of quantity" and "date time" don't display the best. We'll customize the display names: just "Date" to keep it simple, and "Quantity sold" for the sum of quantity. And for the title, because I think this one needs one, we'll say "Quantity of sales over time". There we go. Now we have our dashboard built; it looks great, and I'm super happy with it. One last thing we should add, and this is optional, you don't have to do this: I'm going to add a title really quick. I'll call it our transaction, let me spell that right, transaction dashboard, and format it just a little bit: increase the size, just like that. You can change it to a different color too; it just adds a little bit of finesse, right? We can put that up top and make these smaller if we want to go that route, but it's up to you. The next thing I want to show you, though, is adding a filter. Now, filters are quite important: customers, clients, managers, whoever is using this dashboard, are going to want to filter in some way. They're going to say, "Oh, I want to filter on this product, or this month, or this year," or whatever it is. So adding a filter is really important. We can add a global filter right here. Now, let's come over to our widget, because we do have the ability to change the type of filter. We can allow multiple values, or just a single value. It could be a date picker or a range picker.
So, a range of dates instead of just a single date. It could be a text entry where they search for something, maybe a product. You can also do a range slider. Let's click on this range slider, go down to the sales transactions, and pick the quantity. What this means is we've created a global filter on quantity. Quantity is a numeric data type, so we get a slider where we can basically say, hey, I only want to see transactions where the quantity was over a certain amount; we have a picker where we can select a range for the quantity. Maybe we only want to see transactions with a quantity greater than 30, for example; we'd be able to do that in this dashboard. We can click in here and customize it: let's say the minimum is 0 and the maximum is 100, just to set it for now. Then let's come right over here and say we only want quantities greater than 30. So I'm going to click on that.
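Under the hood, applying that range filter amounts to keeping only the rows whose quantity falls inside the selected bounds. Here's a plain-Python sketch of that idea; the rows and the helper function are made up for illustration:

```python
# Made-up transaction rows: (transaction_id, quantity)
rows = [(1, 8), (2, 36), (3, 40), (4, 25)]

def apply_range_filter(rows, low, high):
    """Keep rows whose quantity is within [low, high] -- what the slider does."""
    return [(tid, qty) for tid, qty in rows if low <= qty <= high]

# Slider set to "greater than 30" within a 0-100 range
filtered = apply_range_filter(rows, 31, 100)
print(filtered)  # [(2, 36), (3, 40)]
```

The key point for what follows: a filter like this can only apply to a dataset that actually contains the filtered column.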
You'll notice that this visualization doesn't actually change at all, but this one changed and this one changed. Let me highlight that a little more by sliding this around: only these ones change, but this one does not. The reason is the dataset we chose. Let's come back up here: we have the product sales descending, and down here we have sales transactions and sales transactions. But what are we actually filtering on? We're filtering on the quantity. And if you remember, if we go back to our product sales descending query, we don't have the quantity in there at all. So this quantity is not connected to that dataset, and when we apply this filter to the dashboard, it is not connected to that visualization. The way we can fix that is to replicate the exact same chart, and this is what I was talking about earlier, which is why we're creating a second version: the exact same thing, but using the other dataset. So now, in sales transactions, we'll create our bar chart: for the x-axis, the sum of total price, and for the y-axis, the product. Let's find it right here, scan all the way over, go down, add our labels, create our custom colors based on the total price, and change the coloring to the green-blue. And you can already see that this one is filtered based on our global filter over here. So let's get rid of the old one and make this the same. And we actually need to do one more thing; we need to go like this.
So now we have the exact same visualization, but this one is connected to our filter. This is a really important thing to understand when you're building out these dashboards: people really like their filters. They want to be able to do that, and it's going to be a big request once you build your dashboard. You'll build it out, it'll be great, and they'll say, "Hey, I want to be able to filter on this, and I want to be able to filter on this." So you're going to have to build that out, and you may have to go back and connect your data in certain ways to accommodate certain filters. That's just something to think about and know. But now this one is connected, and I would actually replace the old visualization with this new one, because I want it connected to all of our other data for these global filters. There are, of course, other things we could do: we could change some of these axes or add some more context. But this is our sample dataset, and this is as far as we're going to go in this lesson. Like I mentioned earlier, in the last video in this series we're going to build out a full project using real raw data, so we'll have to do some data cleaning, really dig in and analyze our data, and visualize and create our dashboard.

The first thing we're going to look at is Genie. Genie is built for business users to get insights from their data using just natural language. Next, we're going to use the AI assistant to help us code; we'll use it in the SQL editor as well as in notebooks. It helps generate code, but it also helps diagnose and fix issues if we run into them. Then we'll use the AI assistant in the dashboards tab, where it helps us create visualizations. Now, what's so amazing is that you can try all of these things completely for free, because in Databricks Free Edition you can use all of their AI tools at no cost. Be sure to use the link in the description to create your account so you can follow along, practice, and actually use all these AI tools. With that being said, let's jump on my screen and get started.

When you first pull up Databricks, we need to come right down here, within the SQL section, to Genie. Let's go ahead and click on it. And this about sums it up for Genie: you can ask questions about your data in natural language. That's what it is, just a way to converse with your data. Let's come over here, go to New, and connect to a data source. We're going to use a sample dataset; let's go to All, and it's within the samples right down here, this New York City taxi dataset. This is the data we'll be using for this video. Let's go ahead and create it; it's going to spin up our SQL warehouse, because that's what we'll use to interact with this data. While it's starting up, and we can get rid of this, we now have this data sitting right over here. Now, I'm just going to give you an introduction to this interface. We're right here in the Configure tab, where we can create custom instructions for whatever we want this Genie to be. If this Genie is supposed to serve a specific team, a specific project, or whatever it is, you can give it those guidelines, because you can then share it with other people and they can interact with it. So if it's people on your team working with a specific data source, everyone can interact with that AI in the same way, and that's pretty cool. If we come over to Settings, we can also name it; I'm just going to call this one "Databricks Genie Taxi". I could add a description, but I'm not going to at the moment. Now, right over here in this space, there's a little bit of prompting: hey, here are some things you can click on, and we can give you that information. We can also add a sample question, so if you want to add one for your team, whoever you're working with, or just for yourself, you can add a question right here. Let's go ahead and save this and get out of here. So we have just this Genie interface right here.
Now, as I mentioned before, Genie is really great for a business user, someone who's going to use natural language to ask questions about the data. If you're a more technical user, you're most likely going to use the AI assistant, which we'll look at in the SQL editor and the notebooks, where you can edit and dive into the code and get into the more programming side of Databricks. Let's ask Genie to explain this data set, because we haven't taken a look at it at all. In fact, I don't even know what it looks like; it's just a New York City taxi data set. It says here's where it's located, and that it contains taxi trips, including pickup time, drop-off time, trip distance, fare amount, and pickup/drop-off zip codes. This is the only table in here. Now, we can come over here to Configure, click on this table, and see a quick sample of it. But I'm going to ask it to show me a sample of the data, please. I like to be polite when working with AI. You just never know, right? I want to make sure I'm being respectful here. So, now it's going to give me a sample of our data.
Now, you'll notice we have this Show code option, and for basically anything Genie does, it will show us what it's doing under the hood, which I really like. You can also edit in here. So, if we want, we can say, okay, let's limit it to 20, because we only have a small sample down here. We can run this again, and it keeps all this information, but we're now rerunning and changing the code as we go. Let's take a look at our data really quickly. This is taxi data, so we have a pickup time, a drop-off time, the trip distance, how much it cost, the pickup zip, and the drop-off zip. Fairly simple data. Now, there's a lot of things I could ask about this data set. I'm going
to keep it pretty simple. As we go further along in this video, we'll get a little more technical and a little more challenging for the AI, to really see what it can do. So, I'm going to ask: what zip code are people being picked up at the most? I just want to know where most people are being picked up. Maybe there's a specific zip code that 99% of these riders are getting picked up at, and that would be really useful information. Let's see what Genie comes up with. It says it right here: the most common pickup zip is 10001, with 1,227 pickups recorded. You can see that down in the results too, but again, remember this is mostly for business users, so I think it's good that they include this information up top as a narrative you can read. Let's look at the code. They're looking at the pickup zip, which is exactly what we would want to do.
They're doing a count of everything, grouping on that pickup zip, then ordering descending on the pickup count and limiting to one. This is how I would have written it as well. It isn't anything crazy complex, but I think it did a really good job answering this question. Now, as I've been testing and working within Genie, it's pretty good at answering a lot of questions about the data. Let's get a little more difficult with this one.
So, let's come down here and say, "What time of day do most people get a ride?" Now, this is kind of how I'd imagine someone who isn't very technical might ask the question, right? It's not very specific, and I'm curious how it's going to handle it. Let's see what code it writes and the output we get. It looks like it used the pickup hour. It says the highest number of rides occur at 18:00, which is 6 p.m., with ,455 rides during that hour. Let's look at the code really quick. It's using the hour from within our datetime, running a count, filtering a little, grouping on the hour, and doing much the same thing it did in our previous query.
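The hour-of-day version of the query just wraps the pickup timestamp in an hour extraction before grouping. A sketch of that pattern, with SQLite's strftime('%H', ...) standing in for the hour() function Databricks SQL would use (toy rows):

```python
import sqlite3

# Toy stand-in for the trips table with made-up pickup timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (tpep_pickup_datetime TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?)",
    [("2016-02-14 18:52:13",), ("2016-02-15 18:05:00",), ("2016-02-16 07:30:00",)],
)

# Rides per hour of day: extract the hour, count, and sort by volume.
busiest = conn.execute("""
    SELECT strftime('%H', tpep_pickup_datetime) AS pickup_hour,
           COUNT(*) AS ride_count
    FROM trips
    GROUP BY pickup_hour
    ORDER BY ride_count DESC
    LIMIT 1
""").fetchone()
print(busiest)  # ('18', 2)
```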
Now, Genie has the ability to create visualizations as well. Let's see if it can visualize this. I'm going to say, can you visualize this? All right, it did exactly what I asked: it's literally visualizing the pickup hour and the count amount. This isn't exactly what I had in mind, though; I actually wanted to see all the pickup times over time. So I'm going to ask it to show me all the pickup hours and the counts and visualize that. I said: write a SQL query to show me all the pickup times and the counts, and then visualize it. Let's see if it can do this one, because the previous visualization was not exactly what I was hoping for, although it gave me exactly what I asked for. It looks like it did this properly. Let's look at it: it's really the same query, but now it isn't ordering descending and limiting to one, so we have all of them. And now we have a visualization of this data. As you can see right here, this is our highest one: 18:00, or 6 p.m. Now, the thing about Genie is that it's kind of its own standalone thing. If we want, we can copy all of this and bring it over to the SQL editor. But again, that's not really the use case here. The use case is that a business user can come in, ask questions about the data, and get answers without having to go into the SQL editor or a notebook, or build a dashboard. So, with that being said, let's come right over here
to our SQL editor and take a look at the AI assistant in a new query window. They have one right here, but you can also see a button up here. They're both the assistant, but they're used in different ways, and I'll demonstrate that in just a bit. The one down here generates code, while the assistant up here lets us ask larger-context questions about the code and the data, request revisions, and get insight into our data. We can't really do that with the one down here, because it's purely for generating code. Let's click on Generate code. Now, we don't have this connected to our data, so I'm just going to say: show me our New York City taxi data, and ask it to generate this. This is very generic; I don't even know if it's connected to this data. Let's see what it does. It does not know exactly what data I'm talking about, and that's because we're in the workspace.default schema.
I just wanted to see if it would pick up on it, but it's not doing that. Let's click over here and go to our samples catalog, and for our schema, let's come down to New York City. I'm going to accept this really quickly because I want to demonstrate something. Let's try to run this. It's not going to work, and that's okay. What we can do is come down here and diagnose this error. When we click Diagnose error, the assistant in the top right is activated and runs a /fix. It says, okay, we need to use the proper schema, and honestly, even if we hadn't switched over, it would have picked this up. The assistant up here is slightly more context-aware, I've noticed, than the code generation down below. So now we can try running this, and we'll see our data at the bottom. I'm going to click this button that says "replace active query content," and it pastes all this code right into the query window.
Let's get rid of this for now. What I'm going to do is edit this code. As you can see, it highlights all of our code, and I'm going to do two things. First, if I type a forward slash, we get some options, like /explain. It tells you what they do on the right-hand side: we can explain the code, improve the formatting, optimize the code, replace parameters, or fix errors in the code. There are a lot of different options, and they're kind of defaults you can use really easily. We just fixed our code, so we're not going to do that right now. Instead, I'm going to ask it to do something totally different. Let's run our code really quickly, because I want you to be able to see the data in here. We have two really useful columns, a pickup time and a drop-off time, and I'm going to ask it to calculate the difference between them and give us the average of those times.
What's the average length of a ride? The data shows the trip distance, but it doesn't tell us how long the trip took. So, I'm going to go into the code and say: I want the average time it took between pickup and drop off. Now, this is a little more difficult than anything we did in Genie. Let's see if it can write this code correctly. We can see what it's getting rid of with the red lines, and with the green lines, what it's actually keeping. We can accept this either by clicking Accept or pressing Tab; I'll press Tab. Basically, what it's doing is taking the drop-off time and the pickup time, subtracting them, and then taking the average. So, let's go ahead and run this.
And in seconds, that works perfectly. But I don't want it in seconds; I want it in minutes. So I'm going to edit this and say: I want the average to be in minutes, not seconds. It rewrites the query to divide by 60, and that's really all it needs. Let's tab to accept and run our code. Now we have the average trip duration in minutes: it's around 15 minutes. Now,
this is really great, but somebody who doesn't know what this is might be very confused. So, I'm going to use one of those slash commands and say: add comments to code. I hit Enter, it starts commenting on the code, and I can click Accept. The comment just says it calculates the average trip duration in minutes by subtracting pickup from drop-off timestamps. Very simple, very straightforward, but it's really helpful for somebody who comes behind you and reads this code.
Now, as I mentioned before, this is for coding, but if you have other questions about the data, you can also come over here. So, I'll say: tell me a bit about this data. This gives a really simple overview of the columns and what's in them. It's pretty straightforward, but now let's ask: give me a simple data dictionary for each column. I could have said for the whole data set; I don't know why I said column. Let's see what it gives us. And just like that, it gives us a data dictionary, basically a description for each column, and this is really useful. It reads in the data and gives you a little information about what's in it, as well as the data type and the column name. Now, as
I mentioned before, this isn't only in the SQL editor; we can also do this within a notebook. So let's come over here and take a look at the assistant within a notebook. You can see we have this Generate option, and it looks very similar to the SQL editor. It can do a lot of similar things, generating code, but now it can do it in different languages: Markdown, Python, SQL, and R. By default, though, it's Python. Let's ask it to give us the sample New York City taxi data and see what it comes up with. It looks like this one is going to be perfect. Let's run it, and it gives us exactly what we want; we have our data right here. Now, one thing I want to highlight is that when you're working within these notebooks, you can use whatever language you'd like, and you can convert between languages. So let's come up to our code, and I'll say: can you convert this to SQL? It takes what was a spark.sql DataFrame that we displayed, and converts it to plain SQL, so now we're reading in just the SQL. Let's tab to accept and run it, and we get the exact same output. That's one of the things you can do in notebooks that, of course, you wouldn't do in the SQL editor, because there you're just using SQL. So that's
how the assistant works within the SQL editor as well as a notebook. But let's head over to our dashboards and create a new dashboard. We're going to pull in the data we were using before from our samples. Let's go to our catalog, go to our samples New York City data, click on it, and say Add to dashboard. Now we have this data right up here; this is our trips data. What we'll do is come over here and add a visualization. The first thing that comes up is the assistant; it says to ask the assistant to create a chart or anything else we want. Let's ask it to create something really simple: a card with the average fare amount. Let's run this. I just wanted it to show us the average price of a taxi fare, and there we go. Super easy, super simple. Now, let's ask a slightly more difficult question: create a line chart with the count of pickup times. Let's run this. It did do it, and what's nice is we have it right over here. It did this based on the month; I'm going to change it manually to the day so we can see a little more easily which days have a lot and which have a little. We're just doing a simple count on the pickup dates; that's it. Now, with both of these, I didn't click the Accept button, but once you do, and you can of course change it, it stays there like a visualization you just created normally.
Now, let's say I want to make a change to this. I'll come up here and say: change the daily (and I need to spell this right) to hourly. Let's run this. There are going to be a lot of people who don't know how, or don't want, to come in here and edit charts directly; they can do this just by writing "change the daily to hourly." I'm going to accept this. It looks wild, but let's look at it; this is looking correct to me. Take away that global filter, and now we're looking at it hourly: a very accurate, specific view of the pickup times hour by hour. This is how we can create different visualizations just by asking. We can of course still come in and edit manually, because it's using the exact same settings we would use to build these. If we want to add labels, I can say: add labels. I wouldn't on this data set; it's going to look terrible. It turns on the labels for you, and it looks horrible, but it did exactly what I requested; the data is just too granular for labels. We'd have to change this to probably weekly, maybe, and add the labels for it to make any sense. There we go, this looks a lot better. We can do a lot of the things we'd do manually just by asking. Now, if you already know how to create these visualizations from memory, you won't need this all the time. But let's say you just can't figure out how to do it yourself; you can always ask. It's there to assist if you need it. You don't have to use it, but it can be useful if you're like, I just can't figure out how to create this. Let me
just ask the assistant and see how it would create it. As you can see, there are a ton of ways you can use AI in Databricks. You can use Genie, you can use the assistant in the SQL editor, you can use the assistant up top, which has a little larger context, and you can use it in notebooks. It's integrated into a lot of the different places you'd work as an analyst, a scientist, or a data engineer. Now, within this entire Databricks series, we've learned a lot of things, and we're going to apply all of that in our final project. If you've never done one of my projects on this channel before, they can be a little long because I leave everything in; I don't cut out a lot of stuff. So if I make mistakes, you're going to see them and can learn from them as well. I'll leave a link in the description so you can go to the Databricks Free Edition and create an account to follow along with this project. I'll also have the data set we're going to use on GitHub, so all you have to do is download it, import it, and use it just like I'm doing. We're going to be working with real raw data, so we'll encounter some issues and some things we'll have to clean up along the way. But we're going to learn a lot, and this is what we'll end up with for the final project: a United States emissions breakdown dashboard. It's going to be an awesome project, and I cannot wait to do it with you. Let's jump onto my screen and get started. Now, before we jump into Databricks and start building out our project, I want to take a look at the data, because there's a lot of data here. And when I say a lot, I
mean, you know, there's only 3,000 rows, but there's a ton of different fields.
So, if we start scrolling over, we'll see there are so many different things we could work on, and I don't want to focus on everything, because it would be impossible to get it all in there. I've highlighted in yellow the columns we're going to work with: the emissions in megatons of CO2, our population, latitude, longitude, the county name, county, state name, and state abbreviation. We're going to focus on emissions. There are so many other things we could look at in this data set, but we just don't have days and weeks to comb through it and build everything out for all the different scenarios and dashboards we could create. So, this is the data set we're going to be working with; again, you can get it from the GitHub. Let's get out of this. We're in our Databricks Free Edition account, and I'm going to create a new catalog. I'm going to call this one our emissions catalog and create it. And now, let's get out of this.
We have our emissions catalog; I'm going to go under my default schema, come over here, and create a table. We're going to drop this entire file into it. Let's browse and go to emissions data 2023. Now, this data is large, so it may take a second to bring it all in, but it's not 10 gigabytes or anything, so it should load just fine. I'm going to rename the table to just emissions data because I think it looks cleaner. You can keep the 2023 if you'd like, but I'm just going to use emissions data.
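As an aside, the upload UI is doing the ingestion for you, but the same idea, reading a CSV and loading its rows into a new table, can be sketched in code. A toy stand-in using Python's csv module and SQLite, with simplified, hypothetical column names in place of the real emissions file's headers:

```python
import csv
import io
import sqlite3

# Tiny stand-in for the 2023 emissions CSV (column names simplified and
# hypothetical; the real file has many more fields).
csv_text = """county_name,state_name,population,ghg_emissions_mtco2e,latitude,longitude
Autauga,Alabama,58805,1.2,32.53,-86.64
Baldwin,Alabama,231767,2.9,30.72,-87.72
"""

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE emissions_data (
        county_name TEXT, state_name TEXT, population INTEGER,
        ghg_emissions_mtco2e REAL, latitude REAL, longitude REAL
    )
""")
reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
conn.executemany("INSERT INTO emissions_data VALUES (?, ?, ?, ?, ?, ?)", reader)

count = conn.execute("SELECT COUNT(*) FROM emissions_data").fetchone()[0]
print(count)  # 2
```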
All right, our data is in here, and this all looks good. We're not going to transform any of it now, although we may need to transform some of it later. It's going to autodetect all of our data types, so if we need to change something, we'll do that after the fact; we're taking the raw data as it is. Now, all we have to do is come down here to Create table, and it creates our emissions data table for us. It also gives us an AI-suggested description, which looks good to me, so I'm just going to accept it to have it on this data set; that was nice to have. Now, let me give you the scenario. You were hired by the EPA to create a dashboard, and they want to focus specifically on emissions: a breakdown of where emissions are actually coming from in the United States, which states and counties are emitting the most, and, in terms of population, how many emissions we have per person per area. These are the things we'll have to dive into, and we're going to write some SQL queries to do it. Let's come over here and go to our SQL editor.
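The per-person metric in that scenario is just the emissions column divided by population, ideally with a guard against zero populations. A sketch of the kind of query we'll be aiming for, with SQLite standing in for Databricks SQL and simplified, hypothetical column names:

```python
import sqlite3

# Toy stand-in for the emissions table (made-up rows, simplified columns).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE emissions_data (
        county_name TEXT, state_name TEXT,
        population INTEGER, emissions_mtco2e REAL
    )
""")
conn.executemany(
    "INSERT INTO emissions_data VALUES (?, ?, ?, ?)",
    [("Autauga", "Alabama", 50000, 1.0), ("Baldwin", "Alabama", 200000, 2.0)],
)

# Emissions per person, highest first; NULLIF avoids division by zero.
per_person = conn.execute("""
    SELECT county_name,
           emissions_mtco2e * 1.0 / NULLIF(population, 0) AS emissions_per_person
    FROM emissions_data
    ORDER BY emissions_per_person DESC
""").fetchall()
print(per_person[0][0])  # Autauga: 1.0/50000 beats 2.0/200000 per person
```

Note how the county with lower total emissions can still top the per-person ranking, which is exactly why the EPA scenario asks for the population-adjusted view.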
Now, this is from a previous lesson, so I'm going to get rid of all these previous queries and start with a fresh new query. We'll do this by selecting our emissions catalog and going to default, and it should be our emissions data. Let's run this. Now, the dashboard is going to be our final product, the thing we hand off, and I want to start with the location data, because I'm personally super interested in it. We of course have the state name and the county name, but those aren't as specific as latitude and longitude, and I want to use that data along with our emissions data. Let's come over here and find the emissions column. It's right here: GHG emissions m tons CO2 E. It's a long name, but that's the column we want. So instead of taking all the data, I'm going to select latitude, pressing Tab to autocomplete, then longitude, and let's see if it can find the CO2 column. There it is, so I'll hit Tab, and I'm going to alias it as emissions; I feel like that's just going to be easier. Let's go ahead and run this. And there we go.
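One detail worth calling out: because the real emissions column name contains spaces, Databricks SQL needs it backtick-quoted before you can alias it. A sketch of that select, with SQLite standing in (it also accepts backtick quoting) and the long column name abbreviated:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A column name with spaces, like the real "GHG emissions m tons CO2 E" field.
conn.execute(
    "CREATE TABLE emissions_data "
    "(latitude REAL, longitude REAL, `GHG emissions mtons CO2e` REAL)"
)
conn.execute("INSERT INTO emissions_data VALUES (32.53, -86.64, 1.2)")

# Backtick-quote the long name once, then work with the short alias.
row = conn.execute("""
    SELECT latitude, longitude, `GHG emissions mtons CO2e` AS emissions
    FROM emissions_data
""").fetchone()
print(row)  # (32.53, -86.64, 1.2)
```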
Now, in terms of actually analyzing this data, latitude and longitude tend to be harder to work with than something like a county or a state, because they're so specific; you can't really aggregate on them. They're more of a visual thing. So let's copy this query and bring it over to a new dashboard. Let's rename it really quickly before we get into the data; I'm going to call it our emissions dashboard. As we build things out, we'll come back and add to it. We don't have to do it that way; we could get all the data up front and go from there, but I'm going to create from SQL and run this. We actually need to specify where the data is coming from, so let me fix that: I'm going to write out emissions.default.emissions_data. I was just tabbing there; it got our catalog, then our schema, then our table, and I just tabbed along the way, which makes it pretty easy to work with. Let's run this, and now we have our data right down here. Like I said, this isn't the easiest data to work with if you don't really know latitude and longitude data, but it's really easy to visualize. So let's come up here; I'm going to get rid of our filters, but I'm going to add a visualization. The data we'll connect with is apparently called our untitled data set.
Let me change this: I'm going to rename it our location data. We'll use some other location data later, but that's state and county, so we should be good. Let's use our location data. What we need is a map, and we didn't build a map when we were creating dashboards in a previous lesson, so this is a new one for us. Let's come down here and go to a point map. You can see we already have the coordinates available: longitude and latitude. So this should be really easy: for the longitude, we'll use longitude, and for the latitude, believe it or not, this is going to sound insane, we'll use latitude. Now, this is a little tricky to work with; sometimes it takes a kind of double click, so I'm going to double-click in, and you can scroll back out. There are some things we need to customize, because this is just super blurry. I'm going to go to the size setting and decrease the point size a little; I think that makes it a lot better.
We can always double-click and zoom out just a little to have enough context. If we zoom out more, you'll see some information over here; I think this is Hawaii (don't cancel me if I'm wrong, geography is not my best area), and then we have Alaska up here. Maybe we want to include all of that, and that would be fine, but I'm going to zoom in to what I want to keep, which is the mainland United States. This to me is really useful, and it makes a lot of sense: some of the most densely populated areas are over here, in kind of the Midwest and the East, while over here emissions are much lower. Even in California, there are a lot of people in California and on the West Coast, but not as many emissions. Most of our emissions, just looking at our visualization, are centralized over on the right-hand side. What is this, like Ohio? Chicago is in Illinois; it's like the Illinois and Ohio area. I'm not a geography expert, don't hate, I just don't know geography that well. So, this is really interesting; I would not have guessed it, but this is official EPA data, so it's pretty useful. Now, real quick, I'm going to come in here and create our title. We're going
to call this the United States emissions breakdown, and I need to spell this properly. Let's put it in the middle, make it bold, and make it larger. Not going to get crazy with it. Maybe one more size up; let me see. That's too big, so let's go back to 24. Underneath it, though, I'm going to add some notes, made a lot smaller. We'll say this data was collected by the EPA, the Environmental Protection Agency (it's always tough having people watch me spell, by the way; I'm not good at spelling "environmental"), in 2023. I think this is fine for now. If we want to add something later, or make it larger, whatever we want to do, we can, but I'm going to keep it just like this. Actually, I do want to make this part at least bigger. It seems like it should be; that just feels better, I don't know why. Don't get on to me. Now, we can add more visualizations to this, and of course we will, but this first one is probably the easiest, because we're not really digging into the data itself.
We're more just trying to visualize it because latitude longitude data is, you know, pretty specific. Now, let's come back over here to our SQL editor. I'm
going to try using some AI here on this next one because uh the next one's going to be a little bit trickier, I think. Uh
let's just create a new query because I think that's perfectly fine. Uh but I want to make sure it's in the right schema. It is not. Let's go to
schema. It is not. Let's go to emissions. Let's go to default. In fact,
emissions. Let's go to default. In fact,
I could have just copied this over. Um
honestly, so I'm going to come over here. Let's paste this in here. Now, I'm
here. Let's paste this in here. Now, I'm
going to edit this. I now want to focus on a new thing that we're looking at. I
want to take a look at some data. Now,
I'm going to do this because this is what I personally do: I'm going to duplicate this tab and bring it back to the catalog, go to emissions, go to default, go to the emissions data, and look at the sample data. The reason I'm doing this is that I want to be able to just one-click over and take a look at my raw data. You can also just create a query and tab over, but that's not my workflow. I like having a separate tab for it. You can do that if you'd like. Just select everything from the emissions data. Now, what we are going to do is use the AI, and what we're going to be looking at is this county name as well as the population. So for each county, and this is the county state name in Alabama, I want to look at this county and this population, and take a look at our emissions. But more specifically, I want to look at the emissions per person. So we're going to have to do a calculation here. So I'm going to ask the AI. I'm going to say: I want to look at the emissions per person in each county. Let's go ahead and generate this and see if it gets exactly what I want. It may or may not. And let's accept this. I'm just going to hit Tab. So, we're casting this as a double, then we're casting the population as a double, and then we're saying that's the emissions per person. Now, this theoretically should work, right? But let's go over to our GHG emissions right here. Now, here's the thing, and this is not a Databricks-specific thing; this happens all the time in almost any platform: this is a number. This is 100% a number, but it has a comma in it, and it should be rendered as a number. But you can see right here, I can just tell you right now, this query is going to fail, because it needs to be changed. Let's go ahead and try to run this.
Now, the reason this is going to fail is that this is a string, right? We have characters that are not numeric in this column, and that is an issue. So what we need to do is convert this, or we can also ask the assistant in here to diagnose this error. I'm going to see if the assistant can do it for us, because if it can, that would be fantastic. Let's see if it's able to. This is kind of a specific fix; it needs to look into the data.

And it looks like it got it perfect. It says you need to remove the thousands separator before you cast it. This looks excellent. Let's actually replace this. Let's get rid of that, and now let's try running this. And wait a second, we have latitude and longitude here.

Let's go back, because I didn't want latitude and longitude in the first place. I was just looking at this and got mixed up. Okay, that's fine, we'll keep this. But before we even run this, I am going to say: I don't want latitude (did I spell it right?) and longitude. I want the county and the population for those columns. I'm just using AI here. Normally I could just fix this myself; I just want to test it, because I think it's an interesting exercise.

All right, let's accept this and try running it and see what happens. All right, we're running into some issues, and I'm just going to tell you that's going to happen, and that's okay. Let's take a look. One, I don't think we even have a county column. Let's go right over here. We have a county name column, which actually should be fine since we're not grouping on anything, but I actually want this column right here. So I'm going to copy this as the column name, and we'll come right back here. Population should be correct. And then we're working with a REPLACE. I believe that we need backticks for this and not brackets.

Let's just go ahead and try that, and let's run it. All right, now this is working. Listen, AI is not perfect. Sometimes we've got to step in and do what we do. So, this looks correct to me.
Let's take a look really quick, because we are breaking it down by the county state name, which is what I would do. I wouldn't want to use this county name by itself, because what if we have another county that's also named Baldwin County, right? When we start trying to visualize this data, it's probably going to try to group that data together, and that'll cause issues.
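That grouping worry is easy to demonstrate. Here is a tiny sketch using SQLite as a stand-in for Databricks SQL, with entirely hypothetical table, column names, and values: two states share a county name, and grouping on the bare county name silently merges them, while the combined county-plus-state key keeps them apart.

```python
import sqlite3

# Hypothetical rows: two different "Washington County" records in
# two different states. Schema and numbers are made up for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emissions_data (county TEXT, state TEXT, emissions REAL)")
conn.executemany("INSERT INTO emissions_data VALUES (?, ?, ?)", [
    ("Washington County", "PA", 100.0),
    ("Washington County", "OR", 250.0),
])

# Grouping on the county name alone collapses the two states into one row.
merged = conn.execute(
    "SELECT county, SUM(emissions) FROM emissions_data GROUP BY county"
).fetchall()
print(merged)  # [('Washington County', 350.0)]

# Grouping on the combined county-plus-state key keeps them separate.
split = conn.execute(
    "SELECT county || ', ' || state AS county_state_name, SUM(emissions) "
    "FROM emissions_data GROUP BY county_state_name"
).fetchall()
print(len(split))  # 2
```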
Now, what we're going to do with this is actually look at the top 10, based on the emissions per person. I think that's all we should do. So, I'm just going to add it this time; I'm not going to ask the AI. I'm going to say ORDER BY the emissions per person, and we'll do descending. And actually, I should limit it to something like 10. I'm going to say LIMIT 10.
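Put together, the finished query follows the pattern below. This is a rough sketch run against SQLite rather than Databricks SQL, with made-up table, column names, and sample rows standing in for the EPA dataset; the point is the REPLACE-before-CAST fix the assistant suggested, plus the ORDER BY ... DESC LIMIT 10 on top.

```python
import sqlite3

# Stand-in table: numbers stored as TEXT with thousands separators,
# which is exactly what broke the original CAST. Names and rows here
# are placeholders, not the real EPA schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE emissions_data (
    county_state_name TEXT, population TEXT, ghg_emissions TEXT)""")
conn.executemany("INSERT INTO emissions_data VALUES (?, ?, ?)", [
    ("Baldwin County, AL", "231,767", "1,482,305"),
    ("Los Angeles County, CA", "10,014,009", "5,000,000"),
])

# A bare CAST chokes on the separator: SQLite silently truncates at
# the comma (Databricks instead errors or returns NULL), so the
# commas have to go either way.
bad = conn.execute(
    "SELECT CAST(ghg_emissions AS REAL) FROM emissions_data").fetchone()[0]
print(bad)  # 1.0, clearly wrong for "1,482,305"

# The fix: REPLACE the comma away, then CAST, then order and limit.
rows = conn.execute("""
    SELECT county_state_name,
           CAST(REPLACE(ghg_emissions, ',', '') AS REAL)
             / CAST(REPLACE(population, ',', '') AS REAL) AS emissions_per_person
    FROM emissions_data
    ORDER BY emissions_per_person DESC
    LIMIT 10
""").fetchall()
print(rows[0][0])  # county with the highest emissions per person
```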
There we go. Now, let's run this. And I'm super interested; this is real data, so I'm really curious. So, emissions per person: this population is only 5,000, but they have a lot of emissions. That's in New England. This one is in North Dakota. New England, North Dakota. Very interesting. So, based on the emissions per person, these are the places, and this is not what I would have guessed. I would have guessed New York or, I don't know, North Carolina or something like that. But instead we've got New England and North Dakota. And then I think MO is Missouri. Don't quote me on that. But this looks good to me. This query was a little bit more challenging than our last one.
I think you would agree. Let's bring this over. We're going to go back to our dashboard. I could also make this another tab. And it looks like the layout reset; we'll place things how we want them at the end. Now, what we actually need to do is put this in our data. So, I'm going to come in here and place this in there. Again, it's right here. The emissions, we can change this, so we don't have to do that. Let's go ahead and run this.
And our data is working. Even though it's underlined, it's still reading it in fine. We do have it done correctly. Like we did before, we can fully qualify the table as emissions.default.emissions_data instead of just the table name, to get rid of those red lines. We can do that. Now we have our data down here. This is really useful. We don't necessarily have to use this column, but we do have the population alongside the emissions. And let's get rid of our LIMIT, because I want to make a specific type of visualization with this: a scatter plot. Scatter plots are really great here, because we'll be able to put the population and the emissions per person on a scatter plot and take a look: hey, as the population increases, does the emissions per person increase as well, or does it decrease? A scatter plot would be great for this. Let's call this one emissions per person. That's what we're going to call this one.
Let's come over here, and we're going to choose our emissions per person dataset. Now, on our x-axis we can choose the emissions per person. We can switch this around and see which way works better. Then we'll also choose the population. Let's scroll down a little bit. Now we need to change this from a bar chart, because that does not look right. We're going to change this into a scatter plot. Now, for the scatter plot, we want the raw data; we don't want any aggregation. So let's say none here, and for the sum of population, none as well. That should be good. We can also quickly change the point size, because it's a little blurry, and I want to see a little more granularly what we're looking at. We have this one up here: the emissions per person is 0.5. Okay, so that's low, but the population is 10 million. I'm curious where this actually is. Let's come down here to the tooltip and add in the county state name. So now when I hover over it (and it's so small), I can see this is Los Angeles County. This is actually really good: there are very low emissions for that amount of people.
And you can see that because it's over on this left-hand side. If we had something over here, that would not be good. This is one of our New England ones. They have emissions per person of 5.95, and that's in, I think, metric tons, so that's a lot of emissions. Their population is 5,160. That's not a lot; that's a little over 5,000, and they have a lot of emissions. So as you can see, the areas that have a higher population tend to have lower emissions per person overall. I mean, these are Cook County, Harris County in Texas, Maricopa County in Arizona, but a lot of these are actually fairly low per person. And the areas that don't have a lot of people actually tend, it looks like, to have higher emissions for the lower population.
I mean, some, of course, are over here with very low emissions per person. It's just really interesting. I'm kind of looking into this as we go. Let's add this title. We'll say emissions versus population. And we could add some type of description. We don't have to, but I'm just going to say: higher populations tend to have lower emissions per person. We could go more in depth and add more notes, and that would be useful, but we're not going to. But I think that's really interesting; I'm kind of fascinated by that myself. So, we have this emissions versus population one. Really quick, I'm going to add a title to the other one. I'm going to say emissions per location, and I'm just going to keep it like that. I think that's perfectly fine.
Let's scroll down really quick. We're going to add in some more visualizations. I'm going to put these here so that we can scroll down a little easier. But now we have these ones in here with us. Let's come back to our SQL editor and add a new tab.
So, I'm going to say a new query here. Now, what I want to look at is the total emissions by state, because I want to see which states are just doing the worst. They have so many emissions; they're the top 10 worst states in America. Actually, I wouldn't say that; that's a bit extreme. We will need to reuse this REPLACE really quickly, so I'll just copy it over. In fact, I feel like I can write this quite quickly. If we're in the emissions default here, what we need to do is come over here and, let's see... I don't want to have to take this and break it out by the state manually. Over here we have the state abbreviation, so let's use it. Let's copy this column name and bring it back. And all we're going to do is keep the state abbreviation. We don't need any of these calculations, so we should be able to get rid of all of that. And let me see what I'm doing wrong here. Okay, it looks like I have an extra parenthesis. Then we're just going to say total emissions. That should be fine. Then we need to group by, because if we look at our data, we have a lot of rows in Alabama, right? We have a lot of counties in each state, so we have to group on this state abbreviation, and then we'll do our sum right here. Right? We'll actually add the sum in just a second. So, we're going to say group by the state abbreviation.
And there we go. And we'll take this. And then right up here, we just need to do the sum. So, I actually did need that extra parenthesis. I got rid of it because I was shortsighted and got too excited. And we'll keep the top 10, actually, because I know that when we start visualizing this, we're not going to need all of them. Although, let me just comment the limit out for a second, because I am curious to look at all the states and see which ones are highest. So the top total emissions are Texas, Florida, Ohio, Illinois, Georgia. You know, that's a lot of the South and Southeast, and not a lot from the West Coast. Let's come down here. We have Vermont, Rhode Island, AK (Alaska), DC, ME (Maine). So, some of the Northeast is actually quite good, along with New Hampshire. But the West Coast and kind of the Midwest tend to have really low emissions, probably because their populations are lower. Although maybe not, given what we looked at earlier. Anyways, let's limit this to 10, and then I'm going to copy this over. Yeah, we'll go back to our dashboards. We go back to the emissions dashboard.
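For reference, the state-totals query we just assembled can be sketched end to end like this. Again this is SQLite standing in for Databricks SQL, with hypothetical names and numbers; the shape is the same: strip the separator, cast, SUM, GROUP BY the state abbreviation, then order and limit.

```python
import sqlite3

# Hypothetical stand-in: emissions stored as TEXT with thousands
# separators, several counties per state.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emissions_data (state_abbrev TEXT, ghg_emissions TEXT)")
conn.executemany("INSERT INTO emissions_data VALUES (?, ?)", [
    ("TX", "1,000"), ("TX", "2,500"), ("AL", "700"), ("AL", "300"),
])

# One row per state: REPLACE + CAST inside the SUM, grouped by the
# state abbreviation, top states first.
rows = conn.execute("""
    SELECT state_abbrev,
           SUM(CAST(REPLACE(ghg_emissions, ',', '') AS REAL)) AS total_emissions
    FROM emissions_data
    GROUP BY state_abbrev
    ORDER BY total_emissions DESC
    LIMIT 10
""").fetchall()
print(rows)  # [('TX', 3500.0), ('AL', 1000.0)]
```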
Let's scroll down.
And actually, we've got to go to the data first. Let's create this and run it. I need to have our emissions and our default. That should be good. Let's run this data. And then I'm going to rename this: I'm going to say total emissions per state. That should be good. So now we have these; these are our top 10. What I actually want to look at this for is the percentages. We're going to have to do some other analysis, because I do want to add something that you can't necessarily visualize. We're just going to have to dig into the data really quickly and perform a little calculation. It's nothing crazy, but we may use AI for it; we'll see if it can work for us. We'll do total emissions by state, and we're going to do a pie chart right here. Now, for the angle, all we have to do is the total emissions, but for our color, we want to break it out by the state abbreviations. We definitely need to add the labels here.
Let's add our labels. There we go. And it has it over there, so we have this little legend right down here. We have Texas; that's 20% of the top 10. These are the total emissions within our top 10. If we tried to add all of them... you know what? Let's just go try it. We can get rid of the LIMIT 10. Let's run this, and we'll have all of our data.
Let's go back to our dashboard.
And now we have all of them. So, it looks like Texas is 10%. But that is a lot, right? We have a lot of different colors, and it's really hard to see almost anything, if I'm being honest. If we want to, and maybe this is a good thing to do, we can add a filter right here. I haven't done this at all in this video yet, but I'm just going to say maybe a range slider might be good. For the field, we can do, let's see, total emissions per state; we'll look at the total emissions. And let's see if I can just kind of scroll up and do something like this. That might work. But then again, it's only doing the calculation, or the percentage, based on that selection. So, we're just reinventing the wheel here. I'm just demonstrating it can be done, but I don't think we need this. All that being said, we're actually going to delete that.
But that's okay. Let's go back, and we're going to add the LIMIT 10 again. Listen, I'm here for education purposes, right? I want you to know what you can do. If you want to do that, you can, but we're not going to. Let's go back here. Now we have this. This looks good, but I don't like this title, actually. Right here, I'm just going to say emissions percentage, or something like that. I feel like that's good enough for a title, even. We can, and I think we should, come in here and add some type of description, some information. So, I'm just going to say: the top 10 states account for X amount of emissions for all of the US. Now, I'm saying X amount because we don't have that information yet; we actually need to write that query. Let's go back to our SQL editor and let's dig into this.
I am going to try to use this assistant, because I feel like it might be able to understand what I'm doing. I'm going to paste this in, and I'm just going to say: I want to know (I need to write this right) what percentage of emissions the top 10 states account for, for the whole country. And let's see if it understands that. I don't know if I even wrote that well, but let's see what it does with it. It's writing a CTE; I'm just going to watch what it writes. It's calculating the top 10, and then it's selecting the sum of the total emissions. Okay, this could work. Let me take this query real quick, and I'm actually going to open up a new tab, or a new query. And let's make sure... oops, I need to get all of this. Oh, it wrote it for a notebook. That's all right; I'm going to get rid of that part, because I'm going to come down here to emissions, to default. Let's paste this in here. And this may work. Let's run this.
Okay. So, this is the top 10 emissions, and this is the top 10 percentage. I believe this is correct; you can validate it. What it's doing is (we don't need this SQL here) creating a CTE, calling it the top 10, then selecting the state abbreviations. We have this query taking the sum of the total emissions for the top 10, and then we're using that later on. So now we're doing a SELECT, and this is the sum of the total emissions from up here. Then we're taking the sum of the total emissions divided by this number right here, which is all of them. Now, wait a second. I may be wrong on this, because we're only pulling from the top 10. So maybe it's just looking at the total emissions from up here as the top 10 emissions... No, you know what? No, it is doing a subquery in here; I missed that. So this subquery is actually selecting all of the emissions from the emissions data. This is great.
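The shape of what the assistant wrote can be sketched like this, again with SQLite standing in for Databricks SQL and entirely hypothetical numbers (top 2 of 3 states instead of top 10, to keep it small): a CTE picks the top states, and the outer query divides their sum by a subquery over all rows.

```python
import sqlite3

# Hypothetical data: three states, emissions stored as TEXT with
# thousands separators.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emissions_data (state_abbrev TEXT, ghg_emissions TEXT)")
conn.executemany("INSERT INTO emissions_data VALUES (?, ?)", [
    ("TX", "6,000"), ("FL", "3,000"), ("VT", "1,000"),
])

# CTE: totals per state, top N only. Outer query: that sum divided by
# a subquery over ALL rows, giving the share of the national total.
pct = conn.execute("""
    WITH top_states AS (
        SELECT state_abbrev,
               SUM(CAST(REPLACE(ghg_emissions, ',', '') AS REAL))
                 AS total_emissions
        FROM emissions_data
        GROUP BY state_abbrev
        ORDER BY total_emissions DESC
        LIMIT 2
    )
    SELECT 100.0 * SUM(total_emissions)
           / (SELECT SUM(CAST(REPLACE(ghg_emissions, ',', '') AS REAL))
              FROM emissions_data) AS pct_of_total
    FROM top_states
""").fetchone()[0]
print(pct)  # 90.0: the top 2 of these 3 states account for 90% here
```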
I'm going to say it accounts for 51%. Great job, AI; I'm glad it did well. I think I am going to make a duplicate of this; that's just my personal workflow. Okay. So, let's come in here, and we're going to say that the top 10 account for 51%. Let's make sure I spell this right. The top 10 states... actually, I'm going to say these 10 states: these 10 states account for 51% of all emissions. Of... what am I doing? Of all emissions in the US. I'm just having a tough time writing. That is a really useful statistic. So, I think this one is done. I think this is a really good visualization. Look, pie charts have their place, and percentages like this, I think, are really useful. Let's come back to our SQL editor. I'm going to have that over here. So, I'm going to have my SQL editor, the emissions dashboard, and the data. Now, what I want to do is name and shame a little bit. Okay? I want to take a look at the county state name: what specific county? We kind of looked at this earlier when we were getting our population data, because we had to group it with a GROUP BY, but now we're actually using the county state name. We're going to take a look at the top 10 counties within the US, and we want to create a bar chart for this. So, let's use this and come right back here. That was by far the hardest query we've written today. AI did well; we gave it a lot of the context, but I think it did a good job.
Let's bring in some of our previous work. This was a really simple one here; actually, let's use this, because it has a lot of the information we're already wanting. Now, all we're doing is looking at the total emissions, not per person. So, let's get rid of this; we're literally just looking at it like this. So, we have the county state name. Let's go to emissions default. We have the county state name from the emissions data. And we don't have to do this... actually, we do need to do it. Let's close this parenthesis and alias this as total emissions. Then we can use this alias to order by it down here. We don't technically have to; we could just copy this whole CAST and REPLACE, but we're not going to. So now we want to look at the top 10. This should work. Now we have the county state names in Arizona, Texas, Illinois, Florida. Geez, just so many emissions. You know, think of the world; think of your country. Let's come in here. No, we need to go back to the data. Let's create from SQL, and we're going to put this in here. Let's go back to our emissions. Let's run this. This should be good. Awesome. So now we have the highest total emissions right here.
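The alias point above (ordering by the alias instead of repeating the whole CAST and REPLACE expression) can be sketched like this, once more with SQLite as a stand-in and made-up county rows:

```python
import sqlite3

# Hypothetical county rows with text-typed emissions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emissions_data (county_state_name TEXT, ghg_emissions TEXT)")
conn.executemany("INSERT INTO emissions_data VALUES (?, ?)", [
    ("Harris County, TX", "4,000"), ("Cook County, IL", "3,500"),
    ("Los Angeles County, CA", "3,000"), ("Baldwin County, AL", "100"),
])

# ORDER BY can reference the SELECT alias, so the REPLACE/CAST
# expression does not have to be written out a second time.
rows = conn.execute("""
    SELECT county_state_name,
           CAST(REPLACE(ghg_emissions, ',', '') AS REAL) AS total_emissions
    FROM emissions_data
    ORDER BY total_emissions DESC
    LIMIT 3
""").fetchall()
print([c for c, _ in rows])  # the three highest-emitting counties
```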
I mean, listen, even though Los Angeles County is in the top 10, it's because their population is four times as much as some of these. It's really interesting. That's why we did the emissions per person earlier: there are outliers, and the total population does matter. Let's come in here and rename this one county shaming. All right, that's what we're going to call it, because that's what we're doing; let's be honest. All right, we're just going to make a bar chart. We're going to go to the county shaming dataset and use a bar chart. For our x-axis, let's use total emissions. Then for our y-axis, we can use the county state name. And I do want to look at that from highest to lowest, so let's set it like that. We can change the coloring if you want. I definitely want tooltips, or not tooltips, sorry, labels; I definitely want labels on this. This is just our total emissions in metric tons. So, I want to add a title. I'm going to say total emissions by metric ton, I think of CO2. And I want to make sure that's correct, because listen, this dashboard, I'm going to send it to the EPA right away. So we have metric tons of CO2, maybe CO2e. I don't know if I should be putting that in here, but let's add it. All right, this looks great. Let's adjust this real quick. I'm actually going to come over this way. Double-click. Double-click. Come out a little bit. This is something that I just have not gotten the hang of. I don't know if I'm doing something wrong, but let's click in on this and have that right there. We need to bring back our emissions versus population and bring this up. Now, I do... oh, what did I do here? Oh, I think I might have clicked on... yeah, I zoomed in. Whoops. Don't mind me. Listen, I make mistakes all the time. I want to clean this up just a little bit.
Things like the total emissions: I just want to say Total Emissions, right? I'm cleaning up these x and y axes. I'm going to call this one the county name. Maybe I should say county state name, I guess, to be more accurate. That looks good. Emissions: this one looks good. Here we need... oops, let me zoom out. We need to rename this one emissions per person. We're going to call this Emissions Per Person. Oops, did I spell that right? It does not look right, but maybe it is. And then I'm going to change this population label to be capitalized; I just think it looks better. Population. And let's see: emissions per location, emissions versus population. I actually should say emissions... where's the title here? Oh, I've got to do it up here. I'm going to say Emissions for Continental US. Did I spell that right? No. Continental US. My spelling is so terrible. But we're only looking at the continental US here, so that might just be worth noting. We are done.
We are done here. We have our United States Emissions Breakdown, and I think this looks great. We dug into some of the data a little deeper than we probably needed to, but I think this was great, and we were able to use AI and build out our dashboard. I am super happy with how this turned out. Remember, this is a real dataset, and there's so much other data in here. So if you want to, you can use the assistant in here and say, "Hey, what other things can I look at for emissions?" and it'll give you suggestions. But this is what I wanted to build, because I was super interested in it. And emissions is just something that everybody understands; people are curious about it. So visualizing it, being able to see it, and kind of shaming some of these counties that have so many emissions: it's worth building it out. So I hope that you enjoyed this and found it helpful. Before we go, I want to give a huge shout-out to the sponsor of this entire series, and that is Databricks. Databricks has been so awesome to work with, because I already love Databricks; it was an easy fit. And what's even better is they have this Databricks Free Edition. If you have not already, if you just watched through this without following along, create a Databricks Free Edition account. Go ahead and do that. I will leave a link in the description so you can build something like this completely for free. It is an amazing, amazing deal. With that being said, thank you guys so much for following along through this entire project. I had a lot of fun, and I hope you did too. If you have not already, be sure to like and subscribe, and I will see you in the next video.