CS50x 2025 - Lecture 7 - SQL
By CS50
Summary
## Key takeaways - **Flat-File Limitations**: Flat-file databases like CSV files store data in simple text format using commas to delineate columns, but they require manual parsing in languages like Python, leading to verbose code for tasks like counting favorites, as seen with 17 lines to tally language preferences from a Google form. [04:22], [05:47] - **SQL's Domain Power**: SQL, a specialized language for databases, condenses complex queries into single lines, such as SELECT COUNT(*) FROM favorites GROUP BY language, replacing Python's loops and conditionals to instantly reveal Python as the top favorite with 243 votes among 313 submissions. [36:05], [01:04:24] - **Relational Design Wins**: Relational databases avoid duplication by using separate tables for entities like shows and people, linked via integer IDs in a stars table, enabling efficient many-to-many relationships, as in IMDb's schema where The Office (ID 386676) connects to multiple actors without repeating titles. [35:22], [01:23:57] - **Joins Unlock Insights**: JOIN operations merge tables on matching keys, like shows.id = genres.show_id, to reveal Catweazle's multiple genres (adventure, comedy, family) in one query, or list Steve Carell's shows without nested subqueries, transforming cryptic IDs into readable results. [01:44:47], [01:55:10] - **Indexes Boost Speed**: Creating indexes on frequently searched columns, like CREATE INDEX name_index ON people(name), accelerates queries from 0.215 seconds to 0.001 seconds for finding Steve Carell's shows, trading upfront build time for massive read efficiency in large datasets like IMDb's 234,000 entries. [02:07:15], [02:10:04] - **Guard Against Injections**: SQL-injection attacks exploit unescaped user input, like entering ' OR '1'='1 in a login form to bypass authentication, but using placeholders (? in CS50's library) prevents malicious code execution, ensuring safe queries even with inputs like malan@harvard.edu' --. [02:30:05], [02:33:31]
Topics Covered
- Specialized Languages Outperform General-Purpose Ones
- CSV Enables Simple Flat-File Databases
- Dictionaries Simplify CSV Data Processing
- SQL Condenses Complex Queries Dramatically
- Indexes Accelerate Database Queries Exponentially
Full Transcript
[MUSIC PLAYING] DAVID J. MALAN: All right, this is CS50.
DAVID J. MALAN: All right, this is CS50.
And this is our week on SQL, or SeQueL, for Structured Query Language, which is a programming language specifically for talking to databases.
Now, over the course of this lecture, what we're going to do is solve some problems in new and different ways by transitioning from that other language, Python, which, it turns out, is very multipurpose.
And, indeed, this is thematic with Python in that you can do so many things with it, but it turns out there are certain things you might want to do in computing that maybe aren't ideal for a language like Python, let alone a language like C. And so you might want a more specialized language that makes it easier for you to get real work done.
And, indeed, this has been true over the past several weeks.
We started with Scratch.
We transitioned to C. We then transition to Python.
And there's actually a lot of overlap in those languages in what you can do.
They're not perfectly complementary.
I can't imagine doing some of the things we did in Scratch using C.
But what we will see today, no pun intended, is that using a domain-specific language, something specific to databases, in this case, you can actually save a lot of time and express yourself a lot more readily with a lot fewer lines of code than something like Python allows.
But, before we do that, I daresay it would be ideal if we have some data that we can play with so we have a data to put into a database.
And let me propose that we go ahead and collect some live data here.
So if you'd like to go to this URL here that's on your screen if you're on a laptop or desktop, or if you simply have your phone in front of you, feel free to scan this code here, which will lead you to a Google form via which we're going to collect some information from everyone here.
I'll go ahead and, in just a moment, transition to just a screenshot of what everyone here is seeing because if you're watching this on demand thereafter, the form probably won't still be open because we can't take data from the future into account today.
But what we'll have here is a Google form via which we're going to collect some information from everyone here in the room virtually.
And this form asks two questions.
One, it asks what's your favorite language?
the options for now being Scratch, C, and Python.
And it also asks, what's your favorite problem from recent problem sets?
So, go ahead, take a moment, and fill out that form.
And, in just a moment, we'll see that these responses are being gradually collected, not in the context of the Google form per se, but in a more useful format for today's purposes, that of a Google spreadsheet.
In fact, let me switch over to my other laptop here where I have full screen all of the responses as they are coming in.
You'll see that all of them are time-stamped for a particular day and time on which we actually recorded this here session.
And you'll see two columns, one for people's preferred language and one for people's preferred problem.
You'll see that a few folks sort jump the gun a few hours even before class began, but that's great, all the more enthusiastic.
But, as I scroll down here, you'll see there's a lot, a lot of data coming and a couple of hundred responses already.
And so let's go ahead and use this data now to actually manipulate it, to search it, to sort it, and actually ask questions about this here data.
And, in order to do this, what I think we're going to do is, first, talk about what are called flat-file databases.
A flat-file database is the simplest type of database in the world, whereby you just use a text file, a file in which you can store maybe some rows and columns.
But that's actually not really possible in text files alone because, of course, in text files you literally just have text.
So you can put any characters you want on the screen.
You can maybe have different lines of or different rows of text, but we probably ideally need some way of delineating the data that's inside of that file.
And one of the most common ways to do this is via file called CSV, Comma-Separated Values, which is a file format via which we can store really a simplistic database in purely an ASCII or, really, a Unicode file.
And what this means is that all of your data in a CSV file is stored row by row by row, or technically line by line by line, so each entry has a line of its own.
But, for the notion of columns, you can do this in a number of ways.
But the simplest way, as in the CSV format, is to just put a comma in between the values that you are actually collecting.
And then, so long as you're using software like a spreadsheet or maybe Python code or something else altogether, you can tell that software to interpret those commas as really delineating where columns-- one column ends and the next column actually begins.
So it's a nice lightweight format, using a very familiar tool like a text editor to actually store some data that we might care about.
And, in fact, what's nice about collecting data via forms nowadays, whether it's with Google Forms or whether it's with Office 365 or the like, is that you can actually export all of the data ultimately in exactly this format, CSV.
So, in fact, let me go back over to my actual browser here, and you'll see that we indeed have some more data, some more rows that have come in over time here.
And if I go up to the very top of the screen and drop down the menu here, you'll see that, under the File menu, under the Download submenu, I can actually download this data set in a bunch of different formats-- Microsoft Excel, with which many of you are probably familiar; PDF-- although that really won't lend itself to manipulating it very easily with code; HTML if I want to make a web page out of it.
But you'll notice, at the bottom, there's actually a couple of options, one of which we've talked about already, namely CSV, for Comma-Separated Values.
There is an alternative called Tab-Separated Values, which also is a convention too, whereby, to delineate columns, you simply hit the Tab key to shift the data over.
But I daresay CSV is probably a little more common, if only because a lot of text editing programs nowadays automatically convert tabs to spaces.
Indeed, cs50.dev and VS Code, as configured, does exactly that.
So CSV is a little more robust.
So I'm going to go ahead and click on Comma-Separated Values.
That's going to drop a file into my downloads folder on my here computer.
I'm going to go over to VS Code now, minimize my window for just a moment.
And then open up my File Explorer, which, initially, has nothing in it for today.
I'm going to go into my macOS downloads folder and simply drag this file over to the Explorer to essentially upload the file to my account.
I'm going to go ahead now and hide my Explorer, maximize my window again, and hide my terminal window.
And you'll see exactly that same data here, but clearly in textual format.
Now, the file is, by default, called favorites dash Form Responses 1 dot csv.
But let me actually simplify that just to make my life easier.
Let me open up my terminal window and the command, recall, for renaming a file can be mv for move.
And I'm going to rename that file Favorites dash Form Responses dot csv to more simply favorites.csv.
Now let's go ahead and open up that same exact file but under a different name now.
And we'll see these rows and columns.
But it's not quite as pretty as Excel or Google Sheets or Apple Numbers would be.
We have to look a little harder for each of the commas.
And, indeed, in each of these columns-- in each of these lines of text, rather, we have commas delineating one piece of data from another.
But the very first row in this file is a little different.
The first row is actually the headers of the columns that we saw visually in Google Sheets a little bit ago.
And so that's actually indicative-- that's actually pretty common in CSV files whereby the first row and only the first row is somehow special.
And most software will interpret that first row as being special, as really describing the data or the attributes or the records within each of those columns that follow line by line.
Now, there is one curiosity here.
Notice that the first line-- or, really, the second line after the header file has three values-- the timestamp at which the form was submitted, comma, the language that person found to be their favorite, comma, and the problem set that person found to be their favorite.
But, notice here, line 3 in the file, there's actually three commas instead of two even though there's still just three values.
Why do we have the quotes around the name of this problem set, Hello, World!
whereas Scratch did not have a value, Tideman did not have a value.
Mario did not have a value?
Because we named that problem set Hell comma World, we need to make sure that the value is essentially escaped somehow, as with double quotes in this case here, to make clear to any software that's processing this data, be it, again, Google Sheets or Microsoft Excel or Apple Numbers or my own Python code, ultimately, that there is something special going on in that string.
And that comma should not be confused with the other commas, which actually have meaning to delineate those columns.
So what can I actually go about doing with this file?
Well, let's actually go ahead and start answering some questions about the data here.
I realize that not everyone got their response in from time because I downloaded the file just a couple of minutes ago, but I think this is enough data for us to play with.
So let me reopen my terminal window here and clear my terminal.
And why don't we go ahead and write a program, in Python for now, that simply iterates over this data toward an end of doing something interesting with it, ultimately.
Let me go ahead and run-- create a program called favorites.py.
And, in this file, let's go ahead and do this.
First, I'm going to go ahead and import a package called CSV.
It turns out, in Python, there's built-in support for CSV files, which is going to save me a huge amount of time because I'm not going to have to figure out how to open a file and iterate over every line and break on all of the commas and somehow load it into memory.
The CSV module is going to do a lot of this for me, so I'm going to go ahead and say with open, quote, unquote, favorites.csv
comma quote unquote "r'.
And this is going to allow me to open a file called favorites.csv, which is the shorter name I gave that same file a moment ago.
favorites.csv, which is the shorter name I gave that same file a moment ago.
Quote, unquote, "r" means open it in read mode.
And I'm going to give this a variable name of file just so that I know how to refer to this file once it's open in memory.
Then if I want to read this file line by line, it turns out, thanks to this Python package called csv, I can do this pretty easily.
For instance, I'm going to create a variable called reader.
I'm going to set that equal to the return value of csv.reader.
So, inside of the csv package, there is a function called reader that is going to give me exactly that, conceptually, some kind of object in the computer's memory that knows how to read CSV files.
I have to tell it what file to read.
So I'm going to pass in that same variable that I just got back from opening the CSV file itself.
Now, for the moment, I want to go ahead and iterate over all of these rows, so for each row in that reader.
And this is wonderfully useful Python syntax for just iterating line by line over this so-called reader.
Let's just go ahead and do something simple print out the second column we see in that file.
So recall that these columns are going to be 0-indexed as our Python lists.
So 0 would be the timestamp, and I don't really care about that.
1 is going to be the favorite language.
And 2 is going to be the favorite problem among our problem sets.
Let me go down now into my terminal window, run python of favorites.py,
cross my fingers as always, and, voila, I seem to get out very quickly on the screen a list of all of the languages in that column.
But if I scroll back to where I just ran that command-- let's go up and up and up and up all the way to that output, which is a ways away now.
Though, I can clearly see visually a theme here among the most popular language in the group today.
Let me keep scrolling back.
Keep scrolling back and, ohps, there it is, my command that I ran earlier.
There's one anomaly here.
The options on that form were Scratch, C, and Python.
And yet it would seem, based on my output here, that someone's favorite language is "language".
But that's a bit of a red herring because recall that, in favorites.csv,
that first row was important.
That first row had Timestamp, capital T, because that's how Google does it; language in lowercase because that's how I named the form; and problem in lowercase as well.
So I seem to be getting a piece of my metadata, so to speak, the column name and not only the languages in question.
So how can I fix this?
Well, there's a bunch of ways to do this, but the simplest way is actually to use a function in Python that just says, next reader, which essentially means skip the first line of this file.
Just next, and move on to the second row instead.
If I now run this command again, let me do python of favorites.py.
And just so I don't have to scroll through hundreds of lines of output, recall this cleverness in the Linux operating system whereby I can pipe the output of one command into the input of another command, like the command less, and, as the name implies, this is going to show me less output from top to bottom.
Let me go ahead and hit Enter.
And, indeed, we see that the very first line of output here is Python now and not language.
So a minor bug fix, if you will, but at least now gets the job done.
So, at this point, we have a program in Python.
It's only a few lines long that seems to know how to open a CSV file, read it line by line by line, and I have the syntax in Python to access particular columns.
Bracket 0 for timestamp, bracket 1 for language, bracket 2 for problem even though, here, I'm clearly only playing around with the column number 1.
Well, what more might I want to do?
Well, let me take a baby step toward making this program more interesting.
Let me clear my terminal window for just a moment.
Let me go into for loop here.
And, instead of just all in one breath printing out row bracket 1, let me actually create a variable just for the sake of discussion called favorite, so the current favorite language, and just set that explicitly equal to row bracket 1 and then print that out as favorite.
So I haven't fundamentally changed anything about this program.
In fact, I've kind of introduced a seemingly unnecessary variable.
But I just want to be pedantic here about how we can make this program even more user or programmer friendly to me.
So this is still going to work.
Let me go ahead and run again python of favorites.py
and pipe the output of that command into the input of less.
And I, again, see the exact same input.
But the reason I did this is because I want to improve this program now.
It turns out that my program as written is a little vulnerable to changes in the CSV file.
If you've used Google Sheets or Microsoft Excel, you probably know that you can click on those programs and drag to move the column to the left or to the right if you want to reorder your data.
And the problem with that is that if I or if a colleague of mine gives me a CSV file whose columns are now in a slightly different order, that's going to create a problem for my code, which is currently assuming that the favorite language is always in bracket 0, so the second column, the middle column among those three.
But it'd be nice in general to make your code a little more robust against changes to the input so that things just don't break, unbeknownst to you.
So I'm actually going to introduce a different feature of the CSV package whereby, instead of using a reader, I'm going to actually use something called a dictionary reader or DictReader, which is going to give me back essentially not a sequence of lists, one per row indexed by numbers 0, 1, 2, but it's, actually, this DictReader,
going to give me back a dictionary or dict for each row so that instead of using numeric indices like 0, 1, and 2, I can actually use strings as keys to get at the data that I actually want.
Specifically, I can use the names of the columns as defined in that first row of the file.
So let me go ahead now and make a couple of other changes.
We no longer need to ignore the first row of the file.
So I'm going to get rid of this call to next because the dictionary reader is going to assume that the very first line of the CSV file has the header names that I care about.
And it's going to consume that row for me automatically before iterating over everything thereafter.
And the reason I introduced this favorite variable a moment ago is because I wanted to be super explicit here that, instead now of indexing into the current row as though it's a list with numeric indices, I instead am going to index into this row using strings as my keys, using the syntax for indexing into a dictionary instead of a list.
And, for that, I can simply do this.
"language", quote unquote, So row now, to be clear, is no longer a list in Python.
It's actually a dictionary, but I can still use the same square bracket notation to index into this dictionary, but this time using the proper name of the column.
So now if I go ahead and clear my terminal, rerun the same command, piping the output through less so we can just see the first few lines, voila!
it's still working as intended.
But if I now were to go back to Google Sheets or Microsoft Excel and futz with the graphical user interface and move those columns around, download a new CSV wherein they're still comma-separated values, but they're sort of in a different order, this code now will still work unless I or a colleague, very difficultly with great difficulty, changes the names of the column headers themselves, at which point,
things are still going to break, but, at least now, this is more robust against minor mistakes like reordering.
Let's tighten this up again.
I didn't strictly need that favorite variable, so let me go ahead and actually get rid of that and, just more succinctly now, print out row bracket, quote, unquote-- row bracket-- I'm confusing the two-- row, quote, unquote, "language" to print out this current row's favorite language.
So fairly straightforward there, but now let's actually solve a more interesting problem because, at the end of the day, all this program is doing is iterating over CSV, printing out all of the favorite languages.
That's not really a step forward because I can already see in Google Sheets itself all of the answers that I actually cared about, just visually with my own eyes.
But it'd be nice now to maybe crunch some numbers.
And it'd be interesting to, among the people that submitted this form in time before I downloaded it to my code space, how many people do prefer Scratch, how many people do prefer Python, how many people do prefer C. Well,
it looked, at a glance, that Python was definitely visually the biggest contender, but let's quantify that now.
So let me go ahead and make this program a little more interesting, whereby I'm still going to import the CSV package up here.
I'm still going to open the file down here, and I'm still going to use my dictionary reader.
But, this time, instead of just blindly iterating over the file and printing things out, let's actually create a few variables.
So, one, let me create a variable called scratch and set it equal to 0.
Let me create a variable called c and set it equal to 0 and a variable called python and set it equal to 0 so that I have three variables via which I can count how many people like Scratch, how many people like C, how many people like Python.
Now, this is a little bulky visually, so, stylistically, there's actually a slightly more elegant way to initialize multiple variables at once that I'm to avail myself of here just to make the point that I can, scratch comma c comma python equals 0 comma 0 comma 0.
And there's other ways still to achieve the same goal, but this is just a little tighter.
Instead of wasting three lines, I'm using one line to do exactly the same thing.
This has nothing to do with CSVs. It's just a coincidence that I'm using commas on this line 5 to separate my variables and these values, but it's a Python trick.
If you want to initialize multiple values at once to a value like 0.
Now, let's bring back that for loop.
So, for each row in that dictionary reader, row by row by row, let's go ahead and grab the current rows, favorite language, as we did earlier.
And now let's use some conditionals to check what that favorite is.
So if that favorite equals equals, quote, unquote, "Scratch", well then let's go ahead and increment our scratch variable by 1.
Else if the favorite value equals equals c, then let's go ahead and increment our c variable by 1, just as with scratch.
Else if the favorite equals equals Python, then let's go ahead and increment our python variable by 1 instead.
Now, technically and logically, I could have just used an else at the bottom of this conditional, but just in case someone slipped in a fourth language that I don't want to talk about yet into the file or maybe the form changed but my code doesn't know about it, I don't want to mistake some random other value for python, so I'm just going to ignore logically any rows that don't equal
Scratch or C or Python in this case.
Now, at the bottom of this program, outside of that loop and outside of the file itself having been opened, which, because of the width, will be automatically closed.
Now let's just print out these three values.
So let's print out using, say, a format string, just so it looks a bit pretty on the screen, that the total value for Scratch is going to be whatever is inside that scratch variable.
Let's print out on the screen another format string that says whatever the value of the c variable is, that's how many people love C. And then, lastly, let's print out another format string with Python as its prefix and print out, through interpolation there of the variable python, how many people really like Python.
So if I didn't make any mistakes here, I think I can reopen my terminal window now, run python of favorites.py.
I'm pretty sure that the most popular language, just based on a glance earlier, is going to be Python, but let's see if we can quantify it by hitting Enter.
And, indeed, it looks like 11 of you like Scratch, 59 of you like C, but 243 of you prefer Python, so glad to see that it's only getting better and better it would seem here over the week.
So how did we go about doing this?
Well, very mechanically, I wrote a decent amount of code here, some 17 or so lines, just to iterate over that file, create some variables, increment them as needed.
And this is totally fine.
It's not an unreasonably long program, but I do think it's getting a little wordy, if you will.
It seems like we're spending an awful lot of time sort of building this up when it would be nice if we could maybe simplify this code a little bit because this is kind of a common paradigm.
You get some data set as input.
Someone, maybe it's your teacher, maybe it's your boss, maybe it's your own curiosity, you're asking questions about that data you'd like to provide answers, well, how can we go about improving this kind of code and doing something a little differently?
Well, let me propose that we revisit this idea of dictionaries themselves.
They've already been useful in the context of this dictionary reader because we can get handed back more user-friendly dictionary objects instead of rows, each of which represents a row of our data.
And recall that a dictionary really is just this-- it's a set of key value pairs, and it's the Swiss army knife, if you will, of programming because it's actually very powerful to just be able to associate one thing a key with another thing, a value.
So, in fact, instead of having, for instance, three separate variables just to keep track of all of those counts, why don't I draw some inspiration from the world of dictionaries and maybe use one dictionary to keep track of all three of those languages.
And this has been the thematic, Way back in C, recall that we initially would often create one, two, or more variables.
And then we realized, no, no, no, there must be a better way, and there was.
In week two, we introduced arrays, which allowed us to have one variable containing multiple similar values.
And now, with Python, because we have these full-fledged dictionary objects that just allow us to associate arbitrary keys with arbitrary values, we can use that a little more readily and keep our code a little cleaner too.
So, back here in VS Code, let me go ahead and propose that, instead of having all three of these variables here and, in fact, all of these conditionals here, let's get rid of most of that code in the loop.
And, instead, let's just create a new variable, for instance, called counts that's going to be an empty dictionary.
And recall that two curly braces, open and closed together with nothing inside, in Python, gives me an empty dictionary, sort of a table that's ready to go.
Looks a little something like this, but with nothing in it initially, but it's now one and only one variable.
All right, now let's go ahead and iterate for each row in that reader.
Just as before, let's go ahead and grab the current rows favorite by using the dictionary we just got back from the reader, indexing into it via the column name language.
And now, instead of having those three conditionals using three different variables, let's be a little more elegant.
Let's just say that we should go into the counts dictionary, index into it at whatever this name of the language is, and increment that by 1.
In other words, use the current language Scratch or C or Python as a key into the counts dictionary and whatever its value already is increment it by 1.
Now, this is close but actually a little buggy.
It turns out if you just blindly index into a dictionary using a key like Scratch or C or Python, but that key isn't yet there.
That's actually a so-called key error.
You can't just blindly use a key that doesn't yet exist and try to increment it because what are you incrementing?
There's no data actually there.
There's no row in this table to go increment the value thereof.
So I do need to be a little more careful here.
Let me instead do this conditionally.
So if the current favorite language is already in the dictionary called counts, then OK, go ahead and index into the counts dictionary using that favorite as your key and increment that value by one.
Else if it is logically the case that that favorite language is not yet in the dictionary, that is to say, there is no key for it yet in this dictionary, OK, no big deal.
Let's index into that dictionary but assign it a value of 1.
So if it's already there, well, then, presumably, we've seen it before, so we want to increment the total value.
But if it's not yet there, we just want to assign it a value.
So, in both cases, we're indexing into the dictionary.
But we have to make sure that we're only incrementing an existing value if it's there.
But we're initializing the value to 1 for this current row's favorite language if we've not seen this same language before.
Now, at the very bottom of this program, let me go ahead and get rid of all three of these printfs and just do something a little simpler.
Let's go ahead and iterate over every favorite key in that counts dictionary.
And, again, this is Pythonic syntax whereby you don't have to worry about, what's the key, what the value is.
If you just iterate over a dictionary using syntax like this, the favorite variable that I'm creating in this here for loop is going to be initialized to 1 by 1 each and every key in that dictionary, presumably three in total given the data we have.
So, inside of that loop, let me go ahead and print out a format string containing the name of the language, using curly braces to plug in that value.
And then let's go ahead and output, with another set of curly braces, whatever the value is of that key by indexing into that dictionary inside of curly braces and then closing the quote on the outside, which should give me language count, language count, language count by iterating over that dictionary.
All right, crossing my fingers as always, let me reopen my terminal.
Let me rerun python of favorites.py, and, voila, we get the same counts.
Now, I will note they're slightly differently ordered.
I think earlier we saw Scratch and then C and then Python, but, in this case, we're actually seeing Python and then C and then Scratch.
And this is subtle, but does anyone have an instinct for why this is the case?
Why did my ordering suddenly change now that I'm using a dictionary instead of three explicit print calls?
It's ordered by default based on the order in which we inserted those keys.
And so the answer actually lies in favorites.csv itself.
If we go back to that file here, you'll see that someone's favorite language, as of the second row in this file was Python, though, curiously, their favorite problem was Scratch.
So I'm quite sure how to reconcile those two, but their favorite language Python.
The next person's favorite language was also Python.
Oh, the third person's language was C. And then if we go down and down and down, I think we'll eventually see Scratch.
And so dictionaries nowadays are ordered in the sense that if you insert one key before another and then you iterate over that dictionary later, you're actually going to say the same ordering thereof whereas I previously in my code had full control over all three print statements.
So it seems to be an opportunity for us to perhaps sort these via key or maybe even value.
In fact, one is a little easier than the other.
And let me go back into my favorites.py here and propose that we do that.
Let's first sort by the counts themselves.
So this is actually a relatively easy change.
Down here in for loop, instead of just blindly iterating over the dictionary from first key to last key, let's actually sort those counts first by using Python's sorted function.
And then, in my terminal window here, let me run Python of favorites.py again.
Enter, and now you'll see that I've actually sorted them by the keys themselves C and then Python, then Scratch.
So it's not the same order in which I printed them.
Now it seems to be doing it alphabetically, but I do seem to have some control over how I'm printing this here file.
All right, but what if I actually want to be a little more clever and print these things out in order of their values, not their keys.
So I don't really care about alphabetization.
At the of the day, I really want to know, what's the most popular?
What's the least popular?
How can I go about doing this?
Well, in my same code here, let me hide my terminal window for a moment and let me add a bit more syntax to the sorted function in Python.
It turns out if you want to do this, you can actually specify what key to use to sort the dictionary.
And I'm going to use a clever trick here whereby I write key equals counts.get.
And that's going to tell the sorted function that I actually want you to get the value from the counts dictionary and use that as your sorting key.
So don't use the actual key in the dictionary as per our slide here.
Instead, use as your key whatever the value is of the counts dictionary itself, so sort by count.
Let's go now and open my terminal window again.
Let me rerun favorites.py, and I'm going to leave the previous run on the screen so we can see now that this one-- argh, so close.
So it's clearly not alphabetically sorted anymore.
So that's good because I didn't care about alphabetization, but it seems to be sorted how?
From smallest to largest, so from least favorite to most favorite.
So that's kind of backwards, but not to worry.
If you consult the documentation for Python sorted function, you'll see actually that there's another argument we can provide sorted.
Let me clear my terminal here and close it for a moment.
And let me say that, you know what?
Whatever you do with sorting, go ahead ultimately and reverse it.
So set reverse equal to true.
Clearly, the default value must be implicitly false.
But if I reverse that order by getting each count, now if I run python, or pythoon, now if I run python of favorites.py, Enter.
Now, finally, all these minutes later, I have a proper analysis of everyone's favorite language, ordered from most favorite to least favorite.
The code itself, not terribly long, still 14 or so lines, but it's kind of a decent amount of effort just to get to this point.
I had to write a decent amount of code, add a decent amount of complexity just to get a simple answer.
And you can imagine that.
You glance at the Google form, or your boss glances over your shoulder and asks, what's everyone's most popular-- what's everyone's favorite language?
It'd be nice to be able to answer that super quickly.
Ironically, just like Google can, just like Microsoft Excel can, just like Apple numbers can-- because, typically, you can highlight the data.
And then, in the bottom-right corner nowadays, you can see a quick summary maybe of the max value, the min value, the sum, the average.
Or you could use functions built into a spreadsheet.
It almost seems like I have to make more work for myself in code, at least with Python, to answer those same kinds of questions.
But not to worry, that's why we're introducing ultimately a new language today, SQL, via which we can actually get back that speed, that flexibility that you already get in graphical interfaces like spreadsheets.
But you can also use that technique programmatically in code as well.
But, before we dive in, let me pause and see if there's any questions on what we've just done here in Python.
A really good question, and we'll come full circle to the answer to that actually toward the end of today's lecture because we will reintroduce Python at some point because, it turns out, as you learn more and more languages and you tackle more and more sophisticated programming projects, it's actually very common and very compelling to mix programming languages sometimes.
And so you can actually use SQL inside of Python, in fact, any number of other languages, because, yes, sometimes you'll find it easier to bring back a more multipurpose language like Python to get some job done.
But you still want the flexibility and the ease of use of querying for your actual data.
The equal signs here.
So, in my code on the screen right now, I'm using equal, a single equal sign quite a bit.
And that's for the assignment operator in Python whereby you assign a value from the right of the expression to the left.
I don't have n equals equals in this here code because recall that equals equals in C as well as in Python is for comparing two values for equality left and right.
So I previously, in my conditional version, did use equals equals to actually compare the favorite value against, quote unquote "Scratch" quote, unquote "C" quote unquote "Python".
But here now I'm only using a single equals sign for assignment, a good question.
All right, so let's transition now to a proper database for this here data.
And, in fact, we'll focus today on what are generally called relational databases, which is a piece of software that allows you to store data, thus a database, but in such a way that you can relate some of your data to other data.
In other words, instead of just throwing in all of this data messily into a folder, much like on your Mac or PC or even your phone, you can actually store it a little more methodically, a little more algorithmically, if you will, by baking into the database some kinds of relationships among pieces of data.
And we'll see in a moment what we mean by that.
In fact, the language we'll talk about ultimately today, SQL, or SeQUeL for short, really stands for Structured Query Language, which is a programming language specifically designed for databases.
So, once we've got a lot of data in a database, SQL makes it easier to get it back out.
And, in fact, among our goals now for the coming time together is going to be to see can we take that 14-line program in Python and whittle it down maybe to 1 line of code in a different language, namely SQL here.
So what does SQL allow us to do?
Even though it's a new programming language, we're only going to spend today on it really, as well as a problem set, but we'll use it again later in the course to solve bigger and better problems as well, But it really only supports four key pieces of functionality that are sort of crudely referred to in the industry as CRUD.
CRUD is a helpful mnemonic for keeping track of what you can do with SQL.
And, in particular, the C stands for Create.
You can create data in a database.
The R stands for Read.
You can read or access data from the database.
U for update, which means you can Update or change data in the database.
And D for Delete, you can, of course, delete data from the database.
So even though we're about to see a new programming language, at the end of the day, it only does these four things.
Now, technically and a little confusingly, it literally does three of those four things using the exact same terminology create, update, and delete.
But it turns out CRUD is a more general concept in databases more generally.
In SQL, specifically, the command we're going to see and use for reading data, getting data from a database, is technically called SELECT.
And if I may, it turns out there's a couple more ways to create and delete data.
You can also insert data, which starts with I.
So it doesn't really fall into our nice acronym there.
You can also drop data or, really, drop whole sets of data or tables of data using D for Drop as well.
So this is to say there's still only four fundamental operations, but the verbiage that we use throughout the language might vary from command to command.
So what can we actually do with SQL?
Well, first, we need a database to start playing with.
And the data in databases are stored in what we're going to call tables.
And tables have rows and columns, which, wonderfully, is exactly the same mental model that we have in the world of spreadsheets, whether you saw them today in CS50 for the first time or have been using spreadsheets for years, whether it's Google's or Microsoft's or Apple's or anything else.
Spreadsheets, of course, also store data in rows and columns.
And if you have multiple sets of data inside the same file in the world of spreadsheets, recall that you can create multiple sheets.
Now, in the world of relational databases that are used for web applications, mobile applications, business applications, more generally, you can have an analog of exactly that same idea.
In the world of relational databases, we call sets of data tables that have rows and columns, which is really the analog of sheets in the world of spreadsheets.
And the command via which you can create such a table is not quite as simple as going to the File menu and adding a new sheet, or clicking a little plus icon somewhere.
This is a programming language, SQL.
So if you want to create a table, you are literally going to express code like this, by typing it at a command prompt and hitting Enter.
But where are you going to actually run that?
Well, it turns out that in the context of cs50.dev and our own VS Code environment and, really, a standard practice in industry as well, would be to use a command line program to access a database.
Specifically, we're going to use a database called SQLite, which, as the name implies, is sort of a lightweight version of SQL.
That doesn't mean it's a toy language or a toy database.
In fact, SQLite databases are used in web applications, mobile applications, and in all of the CS50 applications we're going to build here.
And what's nice about SQLite is that it stores all of your data in a file on your hard drive or a file in the cloud.
It's not a special piece of software per se that needs to be installed and run 24/7, 365 so that you can constantly have access to the data.
That is the case with databases like MySQL or Postgres or Oracle or Microsoft Access or SQL Server.
There are so many SQL databases out there that are syntactically mostly the same, but certain databases have slightly different dialects, so to speak.
They've sort of evolved slightly differently over time, but we'll focus on SQLite, which really is not only light but has a lot of the core features of any SQL database out there.
So, in fact, in the real world, if you start to use some other SQL database, you should be quite fine.
Ultimately having this foundation in SQLite itself.
Now how do you use SQLite on, say, a Mac or a PC or in VS Code in the cloud as with cs50.dev, you just need to install the software, which we've done automatically in advance for you.
And you can run a command line program called sqlite3, where the three just represents the third version of this here program.
So SQLite3 is going to be a command line program via which we can create a database and also execute commands inside of that there database.
The typical syntax for creating a new database with the sqlite3 command is literally sqlite3 space and the name of the database that you want to create.
So, in fact, let me go over to VS Code here and let me go ahead and type sqlite3.
And then let's call this favorites.db.
So there's different file extensions in the database world.
But .db is pretty common in the world of SQLite for storing your actual database.
When I go ahead now and hit Enter, it's going to ask me here in cs50.dev, are you sure you want to create this file?
And that's just as a quick reminder that it doesn't exist yet.
But that's indeed my intent, so y for yes.
And now I'm at a SQLite prompt.
So notice it's no longer dollar sign, which is my terminal windows default prompt.
Now I'm inside of a program that's running continually until I quit out of it-- and that prompt is SQLite-- and then an angled bracket, a greater than sign, just to indicate that I'm not in my terminal per se, my shell, but rather in this here program.
Now, I could type some fairly arcane commands and import that whole CSV file into this dot DB file, but it turns out that sqlite3 comes with some handy commands to just automatically import a CSV file for you.
And what I'm going to do is this.
I'm going to type first .mode, and I'm going to put SQLite into CSV mode.
Long story short, SQLite can import different types of text files like CSVs, TSVs, Tab-Separated Values, and other formats as well.
But I know in advance that I want to import a CSV file, so I'm going to tell SQLite to go into CSV mode.
The fact that I started this command with a dot is just specific to SQLIte itself.
It has nothing to do with SQL per se, and the commands we're about to see and use.
But anything that starts with a dot at the SQLite command just means you are configuring the SQLite program itself.
Now, let's go ahead and import with .import favorites.csv into a table called favorites.
So, to be clear, I already have a file called favorites.csv.
I've just created a new file called favorites.db,
and I want to store inside of that DB file, the database, a table called favorites.
So this is like having a sheet called favorites inside of a file called favorites.xlxs in the world of Microsoft Excel.
I'm just naming everything accordingly.
I'm going to go ahead and hit Enter.
It blinked for a moment but didn't take very long.
And now let me go ahead and just do .quit, Enter.
And .quit gets me back to the dollar sign prompt as usual.
But now if I type ls, not only will I see favorites.csv,
favorites.py, but also now favorites.db, the database that I've just created and into which I've imported all of this data.
Well, let me go ahead now and clear my terminal and rerun sqlite3 on favorites.db, Enter.
on favorites.db, Enter.
That brings me back into the prompt, and let me now start to poke around like.
What is it I just did by importing that CSV file into this database?
Well, I can see as much by typing .schema.
The schema of a database is the design of the database, what table or tables are in there.
So if I do .schema, Enter, I'll see now some fairly arcane syntax.
But this was all automatically done for me when I used the .import command.
Apparently, what that import command did for me was execute an actual SQL command called create table, if not exists, quote, unquote, "favorites", and then import into that table, apparently three columns, a timestamp column, a language column, and a problem column.
And those are all encased in parentheses here.
They are all quoted as you see, and they're separated by commas.
Again, these are now SQLite-- these are SQL commas, not in the file itself.
But what type of data did I import?
Well, the .import command is pretty naive, and if it just sees data, it's just going to assume it's text text text.
So even if it looks like numbers, it might still assume that it's just text text text.
But we're going to get back to that.
In the world of SQL, there's actually going to be proper data types we can do.
So long story short, what was the point of this?
Well, I didn't want to first start our introduction to SQL by typing out this cryptic-looking command and then manually import all of that data.
The .import command just automatically gets me going with a SQLite database containing all of the same data as a CSV.
And, frankly, you could imagine doing this in the real world.
If for work or some future assignment, you've got a CSV file that's part of the assignment or task, you can very quickly import it into your very own SQLite database and instantly start executing the following types of SQL commands on it to answer questions about that data.
So let's do exactly that.
Let me clear my terminal but stay inside of SQLite.
Let me increase the size of my terminal just because we're going to start to see a lot of textual output.
And let me propose that we introduce the first of those CRUD commands via which we can start to read data from a database.
Recall that the first command-- recall that the command for reading data from a database is this, SELECT.
So it kind of has the right semantics.
When you want to select data from a database, read data from a database, you literally write SELECT, and then you specify one or more columns that you want to select data from.
And you specify from what table you want those columns to come from.
So what you see here on the screen is the canonical simplest form of the SELECT statement in SQL.
So let's put this to the task.
Let me go back to VS Code here where I have my full-screen terminal window.
And let's SELECT everything from this favorites table.
Let me go ahead and type literally SELECT *, where * in this case is a wild card representing every column that happens to be in the table.
FROM what table?
Well, the favorites table, semicolon.
So you do need to unfortunately end your SQL thoughts with semicolons, just like in C but unlike in Python.
Now this is a quick and dirty command, so to speak.
And it's really just going to dump out all of the data onto the screen for me.
But, when you're opening a database for the very first time, it's kind of a nice, simple way to get started if you just want to wrap your mind around, what is it I just opened?
kind of like glancing at a file on the screen.
So I'm going to go ahead and hit Enter.
And, very quickly, you see a sort of ASCII art version of that tabular data.
So the commas have been interpreted in advance as delineating different columns.
That's why, visually, we see on the screen these sort of vertical columns or pipes.
And that's just SQLite being user-friendly for me.
In fact, if I go up to the very top of this output here.
You'll see that, immediately below my command, I see three columns named timestamp, language, problem.
So that CSV has been somehow loaded into memory in this three-column format.
Now, suppose I don't really care about timestamp, and, for now, I don't even care about problems. What more can I do?
Well, let me clear my screen and let me select a specific column.
So SELECT only the language column FROM the favorites.
And if I now hit semicolon and enter, now I get back essentially a single-column table, a subset of that there table.
And, in fact, that's what I'm doing with each of these commands.
I'm sort of creating temporarily in memory a virtual table that just contains the data that I care about.
And I'm not doing anything useful with it yet.
I'm just dumping it to the screen.
But I am essentially selecting these temporary tables with these various commands.
Now, this isn't all that much progress just looking at the very data that I've always had access to.
But, besides just selecting the data, turns out that SQL, like Python, like C, like Scratch, comes with a whole bunch of functions built in.
A lot of these are mathematical or statistical or analytical in nature.
And, indeed, SQL is very commonly used in the world of data science and analyzing data, consulting and the like because it lets you, very readily as you'll see, answer questions about data.
And, for instance, you might want to compute averages or AVG for short.
You might want to COUNT the number of rows in some data sets.
You might want to get unique or DISTINCT values from a data set.
You might want to get force everything to lowercase or uppercase.
You might want to get the max value, the min value, and that should evoke recollections of the exercise we just did with trying to find the most popular language in that there data set.
So, for instance, let me go back to VS Code here, and let's try a couple of these out.
In my SQLite prompt, why don't I go ahead and just see how many people submitted this form before I downloaded it into my code space?
So I can do, of course, SELECT * FROM favorites semicolon, and I could just manually visually count all of the rows.
There's got to be a better way when there's hundreds of them, and indeed there is.
Per this list of some of SQL's functions, I can use the COUNT function to answer just this.
Now, I don't really care what I COUNT.
I really just want to COUNT all of the rows.
And so the best way to do that is to actually say, canonically, SELECT COUNT * using the * in parentheses because COUNT is a function, so I'm passing in arguments.
And *, again, is a wild card that represents all of the columns.
And that's fine because I just want to COUNT the total number of rows ultimately.
So FROM favorites semicolon.
And what I'll now get back is not all of the darn data in the table, but rather a single temporary table that contains just one value of interest, 313 form submissions came in before I closed it and downloaded that file.
Now, I made the form, so I know what the candidates for favorite languages are.
But there's other things I can do with this data too.
For instance, if I want to get the unique values of everyone's favorite language, well, I can use DISTINCT as a keyword here too.
This one I actually don't need parentheses for, it turns out but I'm going to do it here as follows.
SELECT all DISTINCT languages FROM the favorites table.
And this is going to now whittle down that long column of all of the languages that everyone selected into just a list of three because, of course, there's a lot of duplicate Pythons.
There's a lot of duplicate Cs.
There's a few duplicate Scratches.
DISTINCT just means I only care about the distinct values of language.
Now, what if you want to count how many unique languages there are?
We can compose these ideas in the same kind of way as in Python and C.
I can do SELECT the COUNT of all of the DISTINCT languages from the favorites table.
So, again, I'm first filtering it down to the DISTINCT languages.
Then I'm COUNTing those rows from the favorites table.
And, as you might expect, I get back 3 as my answer.
So sort of little tiny bit-sized questions I'm asking, but all very quick and all just with a single line of code.
So even though I have the power now of programming, I can still answer questions like my familiar spreadsheet tools all have.
But it gets a little more interesting than this.
It turns out, there's yet other features besides these here functions, and there's some other keywords we can use.
And we'll just start to scratch the surface of these to give you representative use cases of these.
But it turns out that you can GROUP data.
You can look for data that's LIKE a certain value not exactly equal to.
You can limit how much data comes back if you're getting overwhelmed with the number of rows You can ORDER or sort the data.
And you can also use essentially conditionals, filters, otherwise known as predicates, to get back just subsets of data instead.
So, for instance, let me go back to VS Code once more, and let me play around with a few of these new keywords.
And let me propose this-- suppose I care about how many of you really liked C. Well, I could just SELECT all of the data as before
liked C. Well, I could just SELECT all of the data as before and COUNT it up manually, but that seems unnecessary.
Let me do this-- let me go ahead and SELECT the number of people who FROM favorites WHERE the language they gave us an answer to their form was equal to, quote, unquote, 'C'.
And you'll notice this is the first time I'm using literal quotes.
In SQLite, when you want to quote a string value, a textual value, a literal, you use single quotes.
And I realize all of these languages have their own conventions, but, in SQLite, you would use single quotes around values like C.
So what is this asking?
This is saying SELECT the COUNT of all of the rows FROM favorites, where the language in that row happens to equal, quote, unquote, 'C'.
So this should give me the same number as earlier in code.
And, indeed, there's that same count, 59.
But how did I get it?
Notice that I was able to answer this in one line of SQL code, as opposed to 14 or so lines of Python code.
All right, what else might I do?
Suppose I want to be a little more specific and I really want to know how many people liked not only C but Hello, World.
They sort of peaked early in terms of their excitement and really liked Hello, World! and then none of the problems since.
So how can we express that?
Well, let's go ahead and SELECT the COUNT of all of the rows FROM the favorites table WHERE their favorite language is, quote, unquote, 'C' AND the problem that they love is, quote, unquote, 'Hello comma World'.
That was one of our very first problem sets in C.
And so now I'm using this conjunction of checking for the language value AND the problem value.
Similar to Python, the keyword is indeed AND unlike C, where it was &&.
But this is now giving me some Boolean logic.
If I go ahead and enter, we'll see that 8 of you-- OK, I'm not sure how I feel about that-- 8 of you really liked not only C but Hello, World was your favorite problem in question.
But we can get back that answer too with just one line of code.
We don't have to write a whole Python program.
We don't have to debug it.
We can just express ourselves, in a new way, using this here language known as SQL.
What, though, have I been doing very carefully, stylistically?
You'll notice that I keep capitalizing SELECT and COUNT and FROM and WHERE and AND and then using lowercase as appropriate for everything else.
This is a stylistic convention.
We've talked about style, of course, in the context of C and Python quite a bit.
In general, in SQL, we would encourage you stylistically in CS50 and, really, in the real world, use cap-- use uppercase keywords like SELECT and COUNT and FROM and so forth for anything that's built into SQL itself.
And then use the proper capitalization for any tables you yourself have created, for any values that you're searching from.
Why?
Well, frankly, it just tends to be a little more readable when you can distinguish what is SQL code and what is actual table names and values therein.
But this isn't a universal thing.
Even different databases and different programmers might have their own conventions.
So let me pause here and see, thus far, are there any questions on our use of SQL with any of these keywords here or our prior ones?
What is the difference between SELECT and INSERT?
SELECT is just for getting data or reading data.
INSERT is going to be for adding new data.
We haven't gotten there yet, but we will shortly.
So DISTINCT gives you back unique values.
GROUP BY we'll get to in a moment.
But DISTINCT we've seen gives you unique values.
Oh, can you chain the commands together?
You can separate the commands with semicolons just as in Linux.
But, generally, I would say people do them one at a time.
We'll see in a moment how you can nest queries, though, as well.
A good question, does it matter if you use single quotes or double quotes?
The right way to do it in SQLite is indeed with single quotes.
That is the convention.
However, SQLite has been around for a while and is very tolerant of different conventions, so it will often work to use different quotes like double quotes as well.
It even supports backticks in some cases.
But what we'll try to show you here is the "right way" to do things in general.
And, in fact, you'll notice that I'm actually not using quotes around everything.
Strictly speaking, it's conventional in SQLite to use double quotes around keywords like favorites and language and problem because those are symbols that I myself created in my database and in my database table.
However, when there is not a corresponding SQL command that is identically named, it's not strictly necessary to quote values.
So this is to say I could, to be really proper, do SELECT COUNT * FROM "favorites" WHERE "language" equals, quote, unquote, 'C' and "problem" equals, quote, unquote, 'Hello, World'.
And, again, notice I've used single quotes for my string literals, for the values I'm searching for.
But I've used double quotes around all of the symbols from my table name and my column names.
This is the right way to do this here, but I would say, in the real world, it's so much more convenient to omit quotation marks when you don't strictly need them.
So that's the habit that I will generally be in as well.
All right, if I may, let me propose that, besides these commands, we introduce a few other pieces of functionality and then improve on our database as well.
So let me propose that suppose that I care about not just the Hello, World problem, but there was a follow-on problem, recall, in C where it was Hello, it's me, whereby you could actually type in and see your own name printed.
So if I want to check for either of those, how can I express that logically?
Well, it's very similar to Python, and pretty similar to see.
SELECT the COUNT of the rows FROM the favorites table WHERE as before, language equals, quote, unquote, 'C' AND, but now I want to check for two possible favorite programs. So let me use some parentheses logically just like you might use in Python or C where the problem equals, quote, unquote, 'Hello comma World' OR-- and I'm capitalizing it to make clear that it's a SQL keyword--
the problem equals, quote, unquote, 'Hello, It's Me'.
Close quote, close parenthesis, semicolon.
But, before I hit Enter, let me see if anyone can spot the potential bug in this code.
I think I've just created a problem for myself by advocating for one style or another.
The problem is that we seem to have for my 'Hello, It's Me' problem.
There's actually three single quotes involved, one of which I'm using as an apostrophe for It's.
And this is going to be problematic because just like in Python, just like in C, the language, the computer's not going to really understand where the string begins and ends if there's this ambiguity.
Now, in C, and also to some extent in Python, the way we generally solved confusion like this is we would escape the value using backslash.
In SQL, it turns out that you don't just put a backslash in front of the apostrophe in this case to avoid that problem.
Weirdly and admittedly confusingly, the convention is to actually use two apostrophes.
So this does not create one string on the left and one string on the right.
Having two single quotes in a row means you only want one of them logically to be there.
So this assuages that concern there.
So if I now go to the end of my command and hit Enter, this time it worked because now, it turns out, that 8 of you liked Hello, World!
I'm inferring that none of you liked Hello, It's Me any better because the count is still 8, but, at least now, if we get more data over time and I update the CSV and in turn my database, I would be checking for both the language of C and either the problem Hello, World or Hello, It's Me.
But, really, the goal here has been to experiment with some of the slightly different syntax that we need to express some of these here problems. But you know what?
What if I wanted to express this even more succinctly?
And I really just care about any of the Hello problems themselves, be it Hello, World or Hello, It's Me.
Let me propose to use one other keyword that's on our new list here, namely LIKE, where you want to search for not necessarily something that's exactly the same, but LIKE a value, so to speak.
So let me go back to VS Code here and let me do, as before, SELECT COUNT * FROM favorites WHERE language equals, quote unquote 'C'.
And, here's the new part, the problem is not equal to but rather LIKE the following value, Hello comma something.
Now how do I express something?
Well, here there's another piece of syntax that we need to introduce.
We've seen the asterisk or the star being used with COUNT and in other contexts for SELECTing any number of-- SELECTing all of the columns available.
But, when you use the LIKE keyword, if you want a wildcard, you actually use, a little weirdly, a percent sign instead.
So if I now do Hello comma space %, in this context of using the LIKE keyword, SQL will now search for all of the problems that start with H-E-L-L-O comma space and then 0 or more other characters after that.
And because I'm being a little clever here, I know that both of the problems I care about are start with Hello comma something, this should be equivalent to the previous command, but I don't need to have a special list of all of the values I'm looking for.
This will get everything that starts with Hello comma space.
And, indeed, I still get back my 8 rows as before.
All right, so how now might we use some of these other commands too?
Well, it turns out that, with GROUP BY, we can actually group similar values together.
So how might this be useful?
Well, consider this sample table here.
It doesn't have nearly as many rows, but I propose that suppose I collected a small number of form submissions that just had maybe these six submissions.
1 person liked C, 3 people liked Python, and 2 people like SQL.
Well, we haven't yet seen syntax via which we could replicate that earlier Python program, whereby I really want to be able to get a quick analysis of how many people like C, how many people like Python, how many people like SQL.
Obviously, I can just skim the list visually here because it's short.
But if there's hundreds of submissions, let alone thousands or millions, it would be nice to be able to answer that question without having to write a dozen or more lines of Python code for a relatively simple and pretty common question.
So how might I do this?
Well, let me go back to VS Code, and let me introduce that GROUP BY syntax.
Let me go ahead and SELECT the language column and the COUNT for that language from the favorites table.
But how do I want to do this counting?
Well, let's go ahead and group those rows by the language value.
So the only thing that's new here now is group by language.
And what this is going to do for me is the following, if I go back to this here visualization, notice that the possible values are C, Python, and SQL, but there are several of Python and SQL.
So how might I go about analyzing this?
Well, really, notice there's one value for C, there's these three values for Python, and there's these two values for SQL.
Wouldn't it be nice, in your mind's eye, you could take this table and squish it vertically, GROUPing all of the similar values together.
Combine all of the C values, all of the Python values, and all of the SQL values, as we've now left SCRATCH behind.
Well, how can I do that?
What I really want is to get back an answer like this.
I do want the count of something, but I want to group that counting by the specific value for language.
I don't want to count up all of the rows as I keep doing.
I want to group my counts by the value in the language column.
So how might I express this with exactly this code?
If I want to get back a two-column temporary table, language in the left column, the COUNT thereof in the right column, but I want to squish this together in my mind's eye and combine all of the similar languages together and then COUNT those, I do, at the end of this query.
GROUP BY language.
And if I hit Enter now, I get back that same analysis as earlier that we implemented in Python, but using a single line now of SQL code.
How might I improve on this further?
Well, notice here that, currently, what I've just selected is indeed the-- is indeed the COUNTs, really, in the order seen here, C and then Python then Scratch, which seems to be alphabetized in this particular case.
But it would be nice, frankly, if I could actually sort this by count, not so much because I have a lot of data here.
But if you had lots of different values in a much taller table, it might not be as obvious which is the biggest, so it'd be nice to just do the sorting ourselves.
So how might I do this?
Well, let me go ahead and do a similar query.
And, just as in Linux in your terminal window, in SQLite, you can actually hit the up arrow and go back through your recent commands.
So I'll start to do that now to save myself some identical keystrokes.
Let me go ahead and, instead of just ending my thought with a semicolon, let me actually say ORDER BY the COUNT of those rows semicolon.
If I hit Enter now, you'll see the exact same data, but it is indeed sorted by the COUNT column.
Unfortunately, just like with my earlier Python program, I accidentally sorted it backwards-- it's from smallest to largest by default. In fact, what's implicitly happening here is that I'm ordering it by COUNT in ASCending order ASC for short whereby ASC indicates ascending order, so from smallest to largest.
But if what I want to do is actually do this in descending order, I can actually do DESC for DESCending, hit Enter here, and now I actually do get a sorting version of this from largest to smallest instead.
Now, this temporary table I have here is a little weird because one column is called language.
The other column's called COUNT *.
Turns out there's another trick we can use in the context of SQL if we actually want to rename these columns that we're getting back from our queries.
And we can alias these things in the following way.
Let me go ahead and now do this SELECT language comma COUNT *.
But if I want to give this COUNT * column an alias, a nickname, n is our go-to value for numbers, for instance, in programming.
So let me rename this from COUNT * to just n for number.
And then let's do this, as before, FROM favorites.
Let's GROUP BY language.
And let's ORDER BY-- and now this might make a little more sense.
Let's ORDER BY n in descending order semicolon.
And now I get really back the same result, but I indeed see language as the column name at left, n for the column name at right, and that's just a little more explicable.
And, in fact, this will be useful in code if you want to give aliases to the headers, you're getting back so that, in Python or some other language, you don't have to keep typing COUNT *, COUNT *, you can use nicknames or these synonyms instead.
And, just to show you one other query, one other keyword that flashed before our eyes earlier, there's also this LIMIT keyword, which isn't super useful with this relatively small data set.
But if you've got hundreds, thousands, millions of rows, you might want to limit the results to be just the top 10 or the top 1 because you might not care about all of the other data.
Well, I can do that here in SQLite as well.
Let me go ahead and clear my screen, let me go hit up so as to open up the previous command, and let me add to the end of this statement, LIMIT 1 if what I really care about this whole time is, what is everyone's most popular language?
Enter.
Now, I get back just that answer.
And there we have it, the most popular language, with a total of 243 favorites, is indeed Python.
To be fair, our queries have gotten a little longer, a little longer.
But, by just using these building blocks and those we've seen prior, we can express in really single lines of code what previously took us half a dozen or a dozen or more lines of code in, say, Python.
Let me pause here now and see if there are any questions.
Yes, a really good question.
Unfortunately, yes-- and I think, as you're getting started with SQL for the very first time, that's going to be one of the mistakes you and I commonly make by just getting things out of order.
So I've been very careful with my cheat sheet to do things in the correct order, but even I forget sometimes and have to look it up or ask CS50's duck or the like.
A really good question.
Short answer no because the way the Google form was designed was to only store values in the CSV if those options were selected.
If no one liked a certain problem or even a certain language, we would have no evidence thereof.
We would have to go back to the original form and maybe create a new table that contains all of the original authoritative answers but that have nothing to do with what people selected.
So, indeed, as you'll see in Python, you might want to even validate data.
There's probably some hackish ways where a malicious user could get-- submit data into the form that wasn't actually on the form unless Google too is doing validation.
So that's a compelling use case for exactly that same solution.
A really good question.
So I'm following the same paradigm in mathematics, whereby if you are doing like multiplication followed by addition and you want the addition to happen first, between two values, you have to parenthesize that in math.
Similarly here, what I cared about logically was that the language was C AND the problem was Hello this or Hello that.
And so I expressed with parentheses the fact that I only cared about one of those latter being true, but I definitely wanted language to be true.
And so parenthesizing in that way ensures that the operators are handled in the priority I intend.
So, in the case of the form I created, there were only specific values allowed, Hello comma World and Hello comma It's Me.
Names were not supported in the list of options I gave to the user, so I don't think this is a concern in my analysis of that data because I know from the get-go what those values are.
Even, that said, if I were to use that LIKE expression Hello comma space % sign, if there were a third, if there was a fourth, if there were a fifth option in that form that started with Hello comma something, we would also take those into account.
So that might be a little too overly generous and indeed not the best design.
Well, before we move away from this favorites data set, allow me to propose that we look at just a few other features before graduating to a real-world, much larger data set as well.
So, in addition to being able to create a table, which we automated earlier, and, in addition to selecting data, as we just did hands-on, you can indeed insert data into a table as well.
This wasn't relevant in the case of our Google form because you all used a Google form to insert that data.
But, before long, I bet we're going to be writing software, wink, wink, whereby we're going to programmatically encode, need to insert more and more and more data based on what someone might be typing into a form of our own.
The canonical syntax for insertion is indeed this whereby I can INSERT INTO the table name.
And then, in parentheses, I specify one or more columns for which I have values.
Then I literally say values, and then, in another set of parentheses, I specify one or more values that I want to INSERT respectively into those columns from left to right.
So the syntax here is a bit more verbose, but let's go ahead and try this out.
Let me go ahead in my VS Code and go and type INSERT INTO the favorites table.
How about just the language column and the problem column?
I don't really care about the timestamp.
I don't want to figure out what time it is now and write it in the right format.
So I'm only going to insert two values, not three, specifically language and problem.
The values I want to insert are going to be, quote, unquote, "SQL" because I'm a big fan of SQL all of a sudden.
And the value for problem is going to be fiftyville, a problem that awaits you as well.
Then I'll hit semicolon, Enter, and nothing seems to happen.
But if I quickly do SELECT * from favorites-- which, again, is a nice, quick and dirty command for just seeing all of the data in a table-- look at the bottom of the table.
I actually do see a new value, namely SQL and fiftyville.
Now, curiously, there is a value for timestamp, but it's a special value, and we've seen this value by name before in C, that of NULL.
So, in the world of C, recall that N-U-L-L, with two Ls, represented a 0 pointer, an invalid pointer, or just an initialized value that has no valid address.
In SQL, this has nothing to do with pointers per se, but we still use NULL in all caps, N-U-L-L, to represent explicitly the absence of a value.
And this is actually a really nice feature of SQL.
Imagine the world of Google Sheets, Apple Numbers, Microsoft Excel, humans make a mess of their data all the time.
For instance, if you don't have a value in a cell, one of those rectangular boxes in a spreadsheet, does that mean it's deliberately blank?
Does it mean you don't know the value?
Does it mean you're missing it and you're going to come back later?
So what do we humans do?
Well, in English, we might have a convention like writing N/A for Not Available, maybe lowercase, maybe uppercase.
But that's a complete hack because what if you want the cell to actually contain, for whatever reason, N/A, it's ambiguous.
So maybe you color it in a different color, like maybe black it out in the spreadsheet just to indicate, no, no, no, there's nothing here.
But a color is not an explicit value for a cell.
So, in the world of spreadsheets, it's actually harder to express the deliberate absence of data than it is here in SQL.
So, when you have a NULL value, that means I or someone has made a conscious choice to not put a value there.
It is not the same as, quote, unquote, "nothing", the so-called empty string.
It's not a blank value.
It is the conscious, deliberate absence of a value instead.
So that there is a plus.
Now, suppose I want to actually get rid of something now.
Turns out, there's other features indeed of SQL we should explore, not just the insertion of data, but the deletion of data as well.
Now, this one is pretty simple, DELETE FROM table WHERE some condition is true.
But, any time, as in the real world, you can make really bad decisions and make really big mistakes by deleting data that you did not intend.
So, for instance, let me go back to VS Code here, and let me propose that I want to get rid of this SQL value because, aah, it wasn't in the original form.
I'm really just playing around here.
I can start my query with DELETE FROM favorites.
And please, please, please, especially if working as an intern or a full-time engineer in industry, do not ever end this thought with semicolon unless you know what you're doing because if I were to hit ENTER on my keyboard now, this is how you would delete everything from the favorites table.
This has happened before.
There are news articles about employees doing really bad things because they deleted data they shouldn't have.
So let me just-- let me just delete my semicolon and finish this thought, as per the slide, where I'm saying you should really use a WHERE condition.
We used WHERE before for SELECT, where we only wanted to get a subset of data.
Presumably, when deleting stuff, I probably only want to DELETE a subset of my data, maybe one row or a few, hopefully, not ever all.
So let me specifically DELETE FROM favorites WHERE the Timestamp column, capital T because that's how Google does it, is NULL.
And this too is a little weird.
Instead of saying equals NULL, the right way in SQL is to say is NULL or, conversely, is not NULL to express that literally you are checking for that absence of a value.
So this I'm more OK with because I know that all but one of my rows has timestamps.
So if I hit Enter now, nothing seems to happen, but if I do SELECT * FROM favorites semicolon, Enter.
Notice the bottom of my output no longer has that SQL or fiftyville value.
Now, I will say there's other things too we can do with our data set, not just INSERT, not just DELETE, but also UPDATE existing values.
And this syntax a little wordier, but it does what it says.
If you want to UPDATE a table and set one column equal to some value WHERE some condition is true, you can express that with code like this.
You can actually set multiple values at once by separating column equals value comma column equals value comma column equals value and so forth.
But, for now, I'm going to go ahead and set two things to known values.
Suppose that-- and I'm really getting excited about SQL.
I really hear good things about this fiftyville problem.
Let's kind of change this data and decree that everyone's favorite language, as of today or this coming week, is going to be SQL.
And their favorite problem is going to be fiftyville.
So I can do this, albeit destructively.
UPDATE favorites set everyone's language equal to SQL and everyone's problem equal to, quote, unquote, fiftyville semicolon.
I haven't executed this yet, but what this means is that it's going to iterate over the table, top to bottom and because there is no WHERE clause, I am going to change every cell to equal SQL and fiftyville, respectively, in the language and problem column, so this too probably not something you should do in the real world, certainly, not when working for a real-world company
with a real-world database unless you want to blow away all of your data and replace it with SQL and fiftyville, but let's do that, Enter.
Nothing seems to have gone wrong.
Let me do another SELECT * from favorites, and I really hope no one has questions about the previous version of the data set because now everyone loves SQL and fiftyville.
And, unlike most computer software, I cannot just hit Control- or Command-Z to undo.
What you have to do now is exit out of SQLite, hope you made a backup of your database earlier, which I didn't, and then restore somehow from backup.
So beware the power of these commands that you now have at your disposal.
All right, so that was a lot.
And, unfortunately, we're out of data now, the hundreds of rows that we've had.
But that was never a lot of data to begin with.
So, in fact, let's go ahead here and take a few minutes-break.
And, when we come back, we're going to graduate from hundreds of rows to millions of rows of data.
Indeed, we're going to introduce you to a real-world data set known as IMDb, the Internet Movie Database, which has lots and lots of information about movie stars, TV stars, movies, and TV shows, and so much more.
And we're going to use some of that publicly available data to explore SQL all the more but also the design of our databases.
But let's come back in 10 minutes, and we'll pick up there with millions of rows suddenly.
All right, we are back.
And we now dive into the Internet Movie Database, or IMDb, for short, which, again has a lot of real-world movie stars, TV stars, TV shows, movies, writers, directors, and so much more.
And all of that data, it turns out, is downloadable in TSV format, Tab-Separated Values.
And so among the things we've done before class today is we actually wrote a program of our own.
Turns out, it was in Python, but it could've been in any language to automate the process of importing those TSV files into our very own movies and TV shows databases.
For today, we're going to focus on TV shows in the world with which many of you are likely familiar.
So we thought we'd go about showing you how we might design a database for TV shows.
So let me begin with just some quick and dirty spreadsheets, in fact.
If I go back over to my browser here, I've got a Google spreadsheet open wherein I'm storing the title of one of my favorite TV shows, The Office.
And then I have a lot of TV stars who featured in that show Steve Carell, Rainn Wilson, and so forth.
But, even in the world of spreadsheets, I would argue that this way of associating TV stars with the TV shows they were in is maybe not the best design.
Why?
Well, here I have just five of the TV stars from The Office, but there were certainly many others in that ensemble show.
But so where does this design end?
Do I add another star column, another star column, another star column, and the like?
But that doesn't seem like the best design because then, when I add a second TV show to this same design, then I'm going to have a different number of stars for that show and a different number of stars for the next show and so forth.
So this doesn't feel like the best design to use columns to store each and every TV star in a TV show.
Well, let me propose that we look at a second version of this that I whipped up in advance, whereby maybe we stick to just two columns.
I like this better because from the get-go, I've got two and only two columns, and this thing isn't going to grow wider and wider with more and more TV stars.
But I also don't think I love this design.
Why?
Well, because, in the first column, I've got the title of the show, The Office, but I've got it again and again and again and again, one such row for every TV star in the second column here.
So this feels more compact, sure.
But there's a lot of duplication now.
I'm literally repeating The Office, The Office again and again and again.
So the right way to do this, not necessarily in the world of spreadsheets, but, as we'll soon see, in the world of relational databases, is as follows.
In fact, it's just convenient visually for us to use a spreadsheet just temporarily to mock up this idea.
But I daresay the right way to solve this, as we'll soon see in relational databases, is not even with a single sheet but, rather, to start using our database tables or, in this case, temporarily, sheets, to represent real-world entities or relationships therein.
In particular, notice at the bottom of my Google Sheet here I have a shows tab, a people tab, and a stars tab.
In other words, in the world of TV shows, there are, of course, the shows themselves.
That's a real-world entity, each and every show.
But there's also lots of people in that industry, the actors, the writers, the directors, and so forth.
So people is another real-world entity.
But then there are the people specifically who star in those shows, and that's more of a relationship.
So we've got entity, entity, and relationship.
So notice here, in my shows tab, I have the beginnings of a sheet for just all of the TV shows in the world.
There we have, in the second column, The Office, which is the one we keep talking about.
And, just for the sake of discussion, let me propose that it will be advantageous to us and good design to assign every TV show in the world a unique ID.
It doesn't really matter what it is, but some unique number that no other TV show is using is probably helpful.
So, for The Office, for instance, I'm going to arbitrarily, but deliberately we'll soon see, use 386676 as its unique ID.
Meanwhile, if I click on the people tab now, here are the real-world people.
That is to say, people are an entity unto themselves.
And here we have again, Steve Carell, Rainn Wilson, and so forth.
Each of whom has their own unique ID, which is, again, an integer.
And it could be the same numbers as before.
But, in the context of people, these IDs must be unique.
In the context of shows, these IDs must be unique.
So there isn't necessarily a relationship yet.
But if I go to the third and final sheet here, the star sheet, notice, somewhat cryptically, I have no TV shows, I have no people, but I do have the unique identifiers for both.
And the way I propose to maintain relationships between shows and people is by somehow linking their unique identifiers together.
So if I want to associate Steve Carell and Rainn Wilson and other people with The Office, I include their ID, their ID, and so forth.
But, in the left-hand column, I associate those person IDs, so to speak, with the corresponding show ID.
Now, I do concede that there's still some duplication here.
I keep seeing 386676, 386676, which is obviously the ID for The Office.
So there's still some duplication here, but the duplication does not extend to these longer strings of text, The Office, but rather these smaller integers.
And that I'm going to be OK with.
In the world of relational databases, we shall see, it's very common and generally good design to use simple integers but unique as your identifiers for data.
It's OK if you have duplication, therefore, of those integers because, what, it's 32 bits, maybe 64 bits.
It's not a variable number of characters as would be the name of a TV show like The Office or, heck, a TV show with an even longer name.
We can gain efficiency by just using simple integers.
But, unfortunately, this spreadsheet, I do concede, is no longer nearly as useful to look at as the previous versions because if I want to see shows, I have to go here.
If I want to see people, I have to go here.
If I want to figure out who stars in what show, I have to go here.
But these numbers mean nothing to me, the human.
But that's why we're transitioning now away from these spreadsheets again to a proper relational database because, with SQL, this programming language, we can somehow join these disparate data sets together and get back the answers to the questions we want to visualize exactly the data we want.
But, underneath the hood, so to speak, all of the good design will be intact, using integers to relate one piece of data to another.
So, indeed, in the world of IMDb, we're about to see even more data than this.
In fact, what we'll see now is this so-called schema.
And recall from earlier that the schema for a database is the design of that database.
And, it turns out, there's fairly standard ways of drawing databases and database tables therein and the relationships between those tables.
But we'll focus step by step on subsets of this data here.
But the picture you see on the screen here really represents a shows.db database containing lots of information on TV shows from IMDb that's apparently going to have 6 tables.
And among those tables, at top left, are the people table.
And then we have, at bottom left, the stars table.
In the middle, we have the shows table.
Below that, we have a writers table.
And then, above that, in top right, we have the genres table, which will give a genre to each of these shows, and the ratings for each of those shows.
And, indeed, that's among the reasons IMDb is popular is that it offers up ratings or people's evaluations of these various films. But let's indeed focus on just a subset of this for a moment.
Let's focus on the world of TV shows at left, and, at right, the ratings thereof.
Now, technically, these two tables could be combined into one.
But, insofar as ratings are such a prime feature of IMDb and they themselves are a real-world entity, the ratings for shows, IMDb chose to store ratings in a separate table, or, technically, TSV file.
So we've opted to do the same for consistency with the original data set.
But let me go ahead now and open up VS Code again, whereby I still have my files from earlier, but we're done with those, favorites.csv, favorites.py,
and favorites.db.
But, during break.
I also copied into my code space a big database called shows.db, which is like megabytes of data
called shows.db, which is like megabytes of data from IMDb itself, as seen in top left of my Explorer.
Let me go ahead and close the File Explorer, though, for now.
Let me increase the size of my terminal window as before, and let's go ahead and open up this new shows.db file.
I'm again going to use sqlite3 shows.db Enter.
And rather than create this file, which I did the very first time with favorites.db, this one I showed you already exists,
with favorites.db, this one I showed you already exists, so I'm simply opening it by running SQLite and then the file name.
Let's just go ahead and start wrapping our mind around some of this data.
Let me go ahead and do something like SELECT * FROM show semicolon.
Again, my go-to default command just what I want to get a sense of, what is it I have in front of me?
Enter.
And it's taking a while now.
All of our favorites thus far are kind of flew across the screen, but we are looking at decades' worth of TV shows now flying across the screen here.
So I would wager that we've got not even hundreds but thousands, tens of thousands, hundreds of thousands of TV shows.
This is a very large database.
And so we're now going to really be applying the principles of good design to make sure that, when we execute queries, when we design databases, we're indeed doing it well because, you can imagine, for real world websites like IMDb, you can't really afford to make poor design decisions because real-world users are using that site.
They want to get quick results.
You want to minimize the number of servers you have, the amount of money you're spending.
So good design decisions aren't going to just translate into good problem set scores now, but actual time and cost savings in the real world.
In fact, that was a lot of rows flying across the screen.
Let me clear my terminal for a second.
Let me just do a quick SELECT COUNT of * FROM shows semicolon.
And, indeed, we'll see there's 234,000 TV shows in the database as we have saved it here.
If I want to see just the first 10 of those shows, there's other syntax I can write too.
Let me go up in my history.
And if I don't want to see 240,000 rows all at once, well, I can limit myself to the first show or, maybe more usefully, the first 10, just so it fits on the screen.
And, indeed, we have a bunch of shows from the 1970s, one from 1981.
And if I limited it to even more rows, we'd see more and more films. But what is it I'm looking at?
Well, it looks like in the shows table.
I indeed have an ID for the show.
I have a title for the show.
I also seem to have the year in which the show debuted on TV for the first time, and, as of now, the total number of episodes that are associated with that show.
So it's a helpful way to just look at a subset of the data, just to see what kinds of data is there, even though it's non-exhaustive.
How about the ratings?
Let me go ahead and do the same thing, SELECT * FROM ratings LIMIT 10, whereby I just want to wrap my mind around this table as well.
OK, I can infer here that in one column is the show ID, which I'm guessing corresponds to the ID in the shows table that we just looked at.
There's a rating column, which seems to be out of 10, I'm guessing, from 0.0 to 10.0 for an average, and then the number of votes by real-world humans on the internet that contributed to that average rating.
So, just at a glance, I can kind of infer what this data set looks like in the absence of any additional detail.
Now, what you'll see here implicitly is an example of what we would call a one-to-one relationship.
It turns out that, with relational databases, you can indeed have these relationships across tables, and they can be different types of relationships.
One-to-one means that every show has one rating according to the design of IMDb's database.
And I can kind infer that as follows.
If I have, in my show's table, up here, an ID column that I'm inferring and, I could confirm from IMDb's documentation, uniquely identifies each of these shows, when I then see, in another table, a column that by convention is called show ID, the implication is that that's not a unique identifier for this table's rows,
but it's a reference to, if you will, another table's unique IDs.
So this is just one convention-- and there's many in the real world.
But, in the original table where the data is defined, like TV shows, the unique identifier is typically called ID.
But, when that same identifier is used elsewhere in another table for cross-referencing purposes, it is typically written as table name underscore ID, but you change the pluralization to be singular, so show ID means that it probably comes from a table called shows in the ID column.
So I'm just inferring.
But, frankly, again, per the documentation, I bet I could confirm as much.
And if indeed we do go back to the visualization of this database and take me away, you'll see that the arrows drawn between shows and ratings actually implies this kind of relationship, whereby every show has a one-to-one relationship with its one rating in the other table.
Now, how can we go about wrapping our minds a little more formally about what's in this here database?
Well, let me clear my screen and let me use that schema command.
But let me use it slightly differently.
Previously, for my favorites.db, it just showed me the schema for, the design of my favorites database.
That was super simple because that was all automatically generated from that .import command.
.import command.
This database, again, I claim, we created ourselves using IMDb's own data.
So what we did in advance of today is this-- if I want to look at the schema or the design specifically of my shows table, it looks like this.
And we've tried to format this line by line to be a little more readable than the automatically generated one for favorites.
So, at some point, we, the staff ran a command called CREATE TABLE, and we created a table called shows.
Inside of parentheses, we then specified, what are the columns for that table?
The first one we decided to name ID.
And we decreed that it's actually going to be an integer, a number from, say, 1 on up.
The next column is going to be a title column.
It's going to be text because, of course, The Office and other such TV shows have text as their name.
And, in particular, we want to make sure that no TV show's title is null.
After all, if it's null, that means we don't have it.
Why is it in our database?
So that's just kind of a business decision-- I do not want blank data, null data in my database because what would it even mean for a TV show to have no name?
That's not really useful data at all.
Then we have a year field, which, yes, is an integer, but it turns out for things like years, it turns out, that, in SQLite, you can specify more generally that it's a numeric value.
So it's like a number, but it's a special type of number like a year.
It's not just a boring old integer.
Episodes, though, is just going to be an integer because we just want to count 0, 1, 2 on up.
There's nothing really special about that.
And then the last thing here, after the final comma, is a little different, but we'll come back to this, whereby the primary key for this table is ID.
And, for now, that just means that the unique identifier for this here table is going to be the aforementioned ID column.
We're just telling the database what to expect.
Meanwhile, if I do .schema ratings, which is that second table, very similar in spirit, but notice this.
I've created a table in advance called ratings.
And, inside those parentheses, there are these columns, a show ID column, which, like the ID column earlier, is indeed an integer.
But, be careful here, it is not null.
It would also make no sense if we've got this orphaned rating that has no relationship to a show, so we do not want show ID to ever be null.
That rating has, of course, a real number as its value.
So a real is similar to a float in C or in Python, whereby it has a decimal point.
And that's apt for something like an average.
And it too is not just real, but also not null.
And then the number of votes that contributed to that average is an integer because there's some number of humans who voted that two should not be null.
But the interesting part here, and here is the relation in relational database.
Notice that this last line of code here specifies when creating the ratings table that there's a relationship between this table and another.
Specifically, show ID in this table somehow references the ID column in the shows table.
And you'll notice some keywords here.
Primary key is what I said defines the uniqueness of the column.
Foreign key is what imposes the relationship between this table and that.
So, henceforth, what we mean by primary key is a unique identifier for a table's rows.
It is guaranteed to be unique row by row by row.
A foreign key, meanwhile, is just the appearance of a primary key in some other table for the purpose of relating that other table to the original table.
So it's foreign in the sense that it's the same number, but it's in a foreign land.
It's in a different table that's somehow referring back to the original table.
So this is just jargon.
But, at the end of the day, what's kind of nice here, these are just numbers.
And we've tossed on a bunch of terminology here and jargon.
But, really, we're just talking about reusing numbers so that we know, just like my visualization with the Google spreadsheet earlier, that there's a connection somehow between these two.
Now, what about all of these data types?
Well, here's a quick summary-- within the world of SQLite, these are really the data types you have available to you, and it's not that many.
There is our INTEGERS for simple numbers, counting up or down.
There's NUMERIC which are numbers, but things like dates and years and times, so things that are composed of numbers but are a little different semantically.
There's REAL which are floating point values with decimal points.
There's TEXT, which is just text.
And then we haven't seen it yet, but there's BLOBS, which is wonderfully named, and it stands for Binary Large Object.
It's like when you want to just store raw 0's and 1's for some reason in the database.
So that's not an especially common case, at least for our purposes vis-a-vis these others.
But there's these other keywords we should keep in mind, some of which we've already seen, like NULL and NOT NULL, which allows you to specify, when creating a table, what kinds of values can or cannot be there.
And this is how a relational database differs from something like a spreadsheet.
Generally speaking, in a spreadsheet, there is nothing stopping you from typing or deleting any values whatsoever from any of the rows and columns.
In fact, it's pretty easy for you or a colleague to break a spreadsheet in that way.
But one of the features you get from a real-world database, like SQLite, is that the database software itself can prevent you or other people, colleagues, users from removing or adding data to your database that should not be there.
You can specify that this column, this cell in our database table cannot be NULL or it can be NULL.
And you can specify that it must be UNIQUE, or it doesn't have to be UNIQUE.
The database gives you those features for you.
And, indeed, these things called primary and foreign keys are the key to, no pun intended, relating one table's data to another table's data.
So let's go ahead and start playing around with some of this real-world data here.
Let me go back to VS Code, and let's poke around.
If I now do SELECT * FROM ratings, let's go ahead and look at some of the most popular TV shows over the years.
Now, if I just hit semicolon and enter now, that's going to give me all the ratings.
And that's not that useful, but let me do this.
Let me add a WHERE clause, which we saw earlier and say WHERE the rating is greater than or equal to maybe 6.0.
So it's better than 5, so it's at least 6 or higher.
I don't really want to see hundreds or thousands of results, but let me go ahead and just limit this to the first 10.
It's not necessarily the top 10 because I'm not ordering the data, but, just for the sake of discussion, I want to see 10 shows that seem to be pretty good based on their rating.
All right, not that useful because what the heck are these shows?
These are all unique numbers.
I don't see The Offices number in here, but I do see 10 show IDs that I haven't yet seen.
Well, this isn't all that useful yet, but let me whittle this list down.
Instead of selecting everything, let me re-execute that same query, but SELECT just the show ID from ratings, Enter, so same kind of query, but a subset of the answer that I get back.
I just have now a temporary table containing all of those show IDs.
Well, why is this useful?
Well, what I don't want to do is this-- if I want to know, what 10 TV shows should I start watching.
This would be incredibly tedious.
SELECT * FROM shows WHERE the shows ID equals, and I'll copy-paste this, 62614 semicolon, Enter.
OK, so that's the first TV show that debuted in 1981 that has a pretty good rating of 6.0 or higher.
What's the next one?
Well, let me start with that same query but go back up here and copy the second ID semicolon.
All right, Catweazle, so from 1970.
That seems to be pretty popular.
Let me do it again.
Let me delete that ID, grab a third ID here, copy-paste it, enter UFO from 1970, I mean, this is going to take me forever just to come up with my list of 10 TV shows that I should watch based on their rating.
So there must be a better way.
Copy-paste is rarely, if ever, the solution to our problems in CS50, let alone programming more generally.
Can I automate this process somehow?
But, in a way, that shows me useful information because, as the user, I really don't care about the primary keys, the foreign keys, any of those integers, who cares?
They exist solely for the purpose of uniquely identifying data in a database.
But I am a human, and I prefer to think about TV shows by way of their titles.
So how can I do this?
Well, let me go back to that earlier query whereby I got all of the show IDs that are rated at least 6.0 or higher, done.
And now let me use this as kind of a building block to answer another question as follows.
Let me do this.
SELECT * FROM shows WHERE the ID of the shows is in this earlier query.
So, here, I'm going to copy-paste, but just to save myself some keystrokes and make clear that I'm indeed using the same query as earlier, I'm going to copy everything from earlier, paste it here without the semicolon, and I'm going to move the limit 10 to the outside, because what I care about is limiting the number of TV shows, not the show IDs per se.
So it's almost the same, but, just like in math class, because of the parentheses, this is going to be executed first by the database.
That's going to give me back a list of IDs where the ratings are greater than or equal to 6.0.
And then I'm going to use that list of IDs to ask the database a second question, the outermost question here-- give me everything you know about the shows whose ID is in that list of results.
So I'm using one query inside of another.
This is known as a nested query.
And we do this all the time in C, in Python, when you pass the output of one function into the input of another, that's all we are doing here.
So, when I now hit Enter, what I should see is thank you.
Actually, 10 shows titles, not to mention their IDs and their year and their episodes-- and, indeed, notice the title, notice the IDs at left.
Those are the IDs that we selected earlier, but, thankfully, I can see the titles now as well.
And if I don't care to see any of that other information like year and episodes, that's fine.
Let me just hit up again.
And, instead of SELECTing *, let me just SELECT what I care about.
All I care about is the top 10 TV shows that I should watch here.
And now I have just the titles as well.
So, again, even though the queries are getting a little longer, and I could be a little more tidy, and I could move things to the next line and so forth-- and I'll do that as an example-- but it's just composing ideas from earlier into larger and larger queries by solving smaller problems first.
And let me do this now.
Let me actually type this out a little more clearly without wrapping.
If, in SQLite-- whoops, let me do this.
Let me go ahead and make this multiple lines.
So if I were to start this query as follows.
SELECT title FROM shows WHERE id IN, but I know what I'm going to type is actually going to wrap onto multiple lines, that's OK, I can hit Enter here.
Notice my prompt changes to an ellipses, dot dot dot, and then the greater-than sign.
This is SQLite waiting for more of my query.
So I'm going to go ahead and copy-paste it from earlier.
And then I still haven't hit semicolon.
I'm hitting Enter just to put it onto another line.
But now I want to limit it to 10 and now semicolon.
So I mention this because it's actually very easy to accidentally get into a weird place with SQLite because if you have an open parenthesis or an open quotation mark, you might get the dot dot dot prompt, which just means you haven't finished your thought, at least as far as SQLite is concerned.
So you can finish your thought with the semicolon, the parenthesis, the quote mark, hit Enter.
And now you have a three-line command, which might just be a little more readable for you but still executable.
Worst case, if you get into a weird place, Control-c is your friend to interrupt the process somehow to actually quit out of that query itself.
All right, but there's a way now to do this differently.
And I used this terminology earlier.
Wouldn't it be nice if even though we have a shows table and a ratings table and all these other tables, could we somehow join them back together when we want to see more data at once?
And, indeed, if we consider, in SQL, one other keyword, we can do exactly that.
An operative word in SQL is that of literally JOINing two or more tables together.
So what do I mean by this?
Well, let's do this visually first.
So here we have a visualization of a TV show at left and the rating thereof at right.
And, for today's purposes, I don't really care what the actual rating is or the votes were or so forth.
So I've showed just a subset of the data herein.
But these are two separate tables shows at left and ratings at right.
But, notice, even though this is a tiny example, notice that they do have a column in common.
Notice that the ID value at left is the same as the show ID at right.
Wouldn't it be nice if I could take these tables then and kind of line up those identical values and concatenate them together by kind of joining them where those IDs are the same.
And, in fact, just for the sake of visualization, let me propose to do this.
All I did was flip the title and ID column, not because this is technically useful.
But visually, it puts the two IDs up against each other.
In fact, let me squeeze the tables closer together, noting that 386676 is indeed the unique identifier, and it's the same in both the shows table and the ratings table.
So, sure, let's shift things over.
And, heck, I don't really need both of these values here at the same time, so I bet I could somehow literally join these two columns together in code.
In fact, let me go back to VS Code, and let me introduce one other technique here.
SELECT * FROM shows JOIN ratings.
How do I want to join those two tables left and right?
Well, please join them on the shows table's ID column equaling the ratings table's show_id column.
And, just for good measure, WHERE the rating is greater than or equal to 6.0 and LIMIT 10.
So the last part there is not that useful.
It's just borrowing some of the ideas from earlier.
What is useful here is I'm not just SELECTing from one table, now.
I'm SELECTing from two tables being JOINed together.
But I have to tell the database, how do you want to JOIN those together?
How should those two tables be lined up?
Well, I want the show ID column in one to line up with the show-- sorry, I want the ID column in one to line up with the show ID column in the other.
So you have full control over what values are being lined up.
And if I hit Enter, I will see everything together here, the ID column, the title column, the year, the episodes, and also the duplicate show ID column, rating columns, and vote columns, which is pretty much what we just saw.
But my own visualization, I was keeping it simple, dot dot dot.
I don't care about most of those columns, but this is what we have done.
With this JOIN keyword, we have created in memory a temporary table that has all of that data I care about and, in fact, a bit more.
Now, at the end of the day, what I really only care about, if I go back to my visualization here, is I really only care about getting access to, say, the title and the rating itself.
So if I want to whittle this down, this list down even further, that's actually not that hard.
Let me go back to VS Code, and, as always, instead of just SELECTing *, let's SELECT what I actually care about like title comma rating.
And if that's all you want, so long as it's clear to the database that, OK, titles from over here and ratings from over here.
So long as the column names are not duplicative, I can just refer to them by their short names, enter.
Now I get back 10 TV shows that I do want to watch because all of these have ratings, as you can see with your eyes, that are greater than or equal to 6.0, and there are the titles thereof.
So let's return now to the schema for the entire database, focusing not just on the shows table and the ratings table, but a few others as well.
In particular, why don't we focus this time on the genres table, which describes whether a show is a comedy or a drama or something else along those lines.
So let me go back to VS Code here, and let me go ahead and wrap my mind around this new table.
So SELECT * FROM genres, and then let's just LIMIT 10.
I don't really care about everything that's in this table.
I just want to get a quick sense of it.
OK, it's actually a pretty simple table, per the diagram from a moment ago, whereby the genres table has two columns, a show ID and a genre.
And the show ID presumably refers to the ID column in the shows table, and genre appears to be a text field that describes the genre for that particular TV show.
What's worth noting about the genres table is that this is not really practicing what we preached earlier because, indeed, I see comedy, comedy, comedy.
I see it three times.
I see sci-fi twice.
So there's is redundancy in this table, so feels like a missed opportunity to fully normalize these tables by removing those duplicates and having a whole separate table, much like we have a whole separate table for shows and for people themselves.
But more on that perhaps another time.
For now, let's go ahead and just focus on one of these TV shows that we saw earlier, that of Catweazle, one of those earlier UK shows.
Let me go ahead and do SELECT * from shows-- whoops.
Let's do SELECT * FROM shows WHERE the ID of the show is this one here.
Notice that it's-- what's curious about this show here, 63881, is that it's a comedy, an adventure, a comedy, and a family show all together.
Now I happen to know what this one is.
But, as before, let me go ahead and SELECT it by its ID WHERE ID equals 63881 semicolon.
And, indeed, there's the Catweazle show that I've promised us.
So it seems that Catweazle, per the genres table, actually, has many genres associated with it.
And this is worth noting because, indeed, besides one-to-one relationships across tables, you can also have one-to-many relationships whereby, in this case, one show can have many genres associated with it.
So, to see this, let me go back to VS Code here.
Let me clear my screen and do .schema for genres just to make sense of what's in there.
And, as expected, show_id is an integer that can't be NULL that is apparently indeed a foreign key that references show's table's ID column, so quite like the ratings table.
But the genre in here is indeed a textual value for which there might be duplicates, but it can never be null.
All right, let's go ahead now and see if we can't SELECT programmatically the genres for this TV show called Catweazle.
Let me go ahead and first SELECT the genre from the genres table WHERE the show ID equals that same value.
OK, there's that list of three values that I inferred by just looking at the top 10 list earlier.
But indeed, this show, 63881, aka Catweazle, has three genres adventure, comedy, and family.
So how can I go about SELECTing now the genres plural for this show?
Well, frankly, I and no one in the world should really have to know what the unique identifier is for Catweazle.
After all, if you were to go to imdb.com, it would be a bit crazy to have to type in the numeric identifier for a TV show just to search for it.
So let's make this more like a keyword search.
Let me instead say SELECT the ID FROM the shows table WHERE the title of the show equals, quote, unquote Catweazle.
In other words, I know from earlier, using SELECT alone, I can get that unique ID of Catweazle, and I know from earlier that I can compose more sophisticated queries by nesting one SELECT inside of another.
So let me do this.
I'm going to copy this just so I don't have to type it all out again.
And let me go ahead now and SELECT the genre column from the genres table, where the show ID in question happens to equal the results of this query.
And now notice a subtlety.
Previously, when I nested a query, I actually used the preposition IN because I was getting back a whole bunch show IDs, and I wanted to get back the title of any show whose ID was IN that list of IDs.
In this case, Catweazle, by definition, only has one UNIQUE identifier, 63881.
So, instead of saying IN, I'm more precisely saying equals.
I want to SELECT the genre for the show that literally equals this value, and SQL will accommodate that, so you use IN when you want to search across multiple values, and you use equals when there's just to be 1.
If I hit Enter, there we have it.
I now have a query that, much like my first query, searches for all of the genres for the show, but it doesn't arcanely search by a show ID.
It instead searches by the name of the show itself, which I'd argue is much more useful.
Now we can visualize what might have just happened.
Here are simplified versions of these two tables.
We have the shows table at left and the genres table at right.
And, just for the sake of discussion, let me flip around the two columns just so we can see that these do things do line up even though this is not a technical thing.
Let me highlight the fact that 63881 indeed appears everywhere among these IDs.
So let's nudge things closer together, and let's go ahead now and ideally line up all of these common values.
But notice here that it's not a one-to-one.
This is kind of a one-to-three relationship, more generally known as a one-to-many.
So, really, when we line these columns up, what's actually happening-- when you line these tables up, what's really happening is something like this-- SELECT * FROM shows, let's JOIN them together on the genres-- with the genres table on shows.id equaling genres show_id WHERE ID equals 63881.
So what have I just done here?
Well, if my goal is to join this stuff together, these tables together, as we did earlier, not using ratings but genres, but we want to accommodate for a one-to-many relationship, the query is essentially the same-- SELECT everything FROM the shows table by JOINing it with the genres table.
How to JOIN it?
We'll JOIN it on the shows.id column and the genres show_id column.
So that's what I mean by kind of lining them up just right so that those numbers overlap.
But, for the sake of discussion, I only care about one TV show for today.
I could do this for all of them, but let's focus on just one.
And when I hit Enter now, notice what happens is I get, per the * operator, all of the columns from both of the tables, including ID, title, year, and episodes, but also show ID and genre.
Now, the database itself, it's not bothering to flip things around and show me side by side like I did with my slides.
But the idea is exactly the same.
When I try to JOIN these two tables in this way, what's effectively happening is the database is noticing, OK, I've got a value at left 63881 and I've got the same value at right again and again and again.
So, effectively, what's happening is the left-hand table's rows are duplicated verbatim, just so that, when we get back a table, it has the same dimensions, rows and columns.
Height and width work out to the same.
That does mean there's a lot more duplicative data that's coming back, but if you want to be able to see everything about Catweazle for each and every of its genres, this, in this case, is what the JOIN query would be doing.
We get back, yes, duplicate data, but it's fundamentally joining things in precisely the same way.
Now, at the end of the day, I probably only care about, really, the title and the genre.
So I can tighten this up.
If I go back to VS Code here and I SELECT only title comma genre from shows JOINed with genres ON shows.db
equals genres.stars.show ID WHERE the ID is 63881, I'm just going to get back a much smaller table, exactly like the one I depicted a moment ago.
And, heck, if that's even too much, like, I don't need to know the name of the show if I'm already searching for it.
Well, let me hit up, let me go to the beginning of my query, and let's just get rid of title and genre and just SELECT the genre.
And now I have a list, in this temporary table, of all of the genres for this particular show.
If we bounce back meanwhile to the chart itself, to the schema itself, there's even more tables than these.
And, in fact, we won't spend time going through each and every one of them, but I think it's probably worth showing one of them that really represents a third and final type of relationship that you might have in this or any database of your own as well, a many-to-many relationship.
And, indeed, that's the right semantics for something like TV shows and TV stars because, presumably, one show has many TV stars, and one movie-- TV star is, presumably, if they're successful, going to be in many TV shows.
And so you sometimes want to have not to one-to-one, not a one-to-many, but a many-to-many relationship.
And, to do that, it does not suffice to have just two tables like shows and ratings or shows and genres.
You might need a third table.
And this is where we began after the break, by visualizing thing in that spreadsheet temporarily when we have a shows, we have a people table, and maybe we have a stars table that somehow relates the first to the second.
So how can we go about building that?
Well, let me propose that we explore the data set as follows.
Let me go back to VS Code, and let's just start to poke around with The Office again, one of my own favorite shows.
And let's do SELECT * FROM shows WHERE the title of the show equals, quote, unquote, 'The Office'.
Let's just see what we know about the TV show The Office.
Now, some of you might know The Office in the US, with which I'm very familiar, is not the only version of The Office.
In fact, the origin for it was actually a UK version several years prior.
But I think the one that I'm most familiar with in the US is probably this one that started in 2005, if only because it had 188 episodes because it ran for like nine seasons.
So that's probably the one I'm thinking about with Steve Carell, Rainn Wilson, and others.
So if I want to narrow my search results to that, let me start with the same query, but say WHERE the title is The Office AND the year equals 2005 because, for today's purposes, I just want to focus on one of these here Offices.
Now, suppose I wanted to ask a question like who starred in The Office?
Now, that question is very reasonable.
And, honestly, if you go to imdb.com and search for The Office, you'd probably expect to see a list of all of the TV stars in that show, the cast members.
But this data seems to be split across a show's table, a people table, and a stars table.
So how do I get back a simple, common question, given what I'm proposing, is very good database design?
Well, let me propose that we take baby steps here or small bites out of this problem, as always.
Let's first SELECT, how about, the ID only from the shows table WHERE the title of the show is The Office, and the year of that show is 2005.
I don't care about all of the columns, I only care about the ID.
And I could make mental note of 386676 and then copy-paste that.
But, again, that's not the right way to do things.
I should not be hard-coding the IDs literally when the user is going to care about the titles, not these IDs.
But suppose I want to now get everything I know about that show as it relates to people.
Well, let me go back to the schema for these three tables.
Notice that we have the shows table here, which has an ID, a title, a year, and number of episodes.
The people table has an ID, a name of a person, and their birth year.
But this stars table is what we might call a join table, whereby its purpose in life is to implement this relationship between one and the other.
And that table clearly has a show_id and a person_id, and this is where I'm now thinking.
In my current query, I am getting the show_id that I care about.
What I really want to know is who starred in this show.
So I need a bunch of person IDs.
Well, this is the table, the stars table, that associates show ID with person ID.
So maybe I can kind of bootstrap things and use this one show_id that I just found, found all of the corresponding person IDs, and see where that gets us.
So let me go back to VS Code, and let me do this, SELECT all of the person IDs from the stars table WHERE the show_id itself from that table is equal to, and now I need this query from earlier.
So let me go ahead and do this maybe on a second line.
SELECT ID from shows WHERE title equals The Office and year equals 2005.
Close parenthesis, semicolon.
In other words, I've got a nested query again per my parentheses.
I want the unique ID for the office.
Then I want to use that show ID to find all of the related person IDs.
Enter, and I get back a whole bunch of person IDs, but who are these people?
One of them's probably Steve Carell, one's probably Rainn Wilson, and so forth, but this is not useful information.
So let's take this home one step further.
Let's SELECT the name FROM the people table WHERE the ID of that person is IN the following query.
Well, how do I get back all of those people IDs-- well, person IDs?
Let me go ahead and do SELECT person_id FROM stars WHERE the show_id equals, and now let me do another nested query.
SELECT ID from shows WHERE title equals, as before, The Office AND year equals 2005.
Close parenthesis, close parenthesis, semicolon.
So this must feel like a lot if you're seeing SQL for the first time, but realize that I'm just taking a third and final baby step.
The first query I ever executed during this process was the last one on the screen here.
Give me the ID of The Office that I know and love from 2005.
The second line of code on the screen is the second command that I wrote before, whereby I want to select all of the person IDs from the stars table that are associated with that show ID, but that just gave me a list of person IDs, none of which I recognized visually.
So the third and final step, which, on the screen, is the very top line here, SELECT the name FROM the people table where the ID of that person is in that list of person IDs that I didn't recognize earlier.
So if I didn't make any typos and I hit Enter, I should see, thankfully, Creed Bratton, Steve Carell, Jenna Fischer, Kate Flannery, and so forth, some of the top stars from that there version of The Office.
So a big query to execute, but composed of smaller and, hopefully, seemingly simpler queries as well.
All right, let's try one other query, not just searching for all of the stars of the TV show The Office.
Let's just search now, conversely, for all of the TV shows that Steve Carell was actually in.
So one way I could do this is very similar in spirit.
Let me go ahead and SELECT the title FROM the shows table where the ID of the show is in the following list.
SELECT the show ID FROM the stars table where the person_id equals.
And, now, I need Steve Carell's own ID.
So much like I searched for The Office's ID, let me search for Steve Carell's ID as ideas follows.
SELECT ID from the people table where the name of the person is, quote, unquote, 'Steve Carell'.
Close parenthesis, close parenthesis, semicolon.
So same idea, but I'm sort of searching the tables in the opposite order to get back all of Steve Carell's shows instead of all of The Office's TV stars.
Let me hit Enter, and I get back all of the shows, according to IMDb, that Steve Carell has been in.
But turns out there's another way to do these kinds of queries when you do have one, two, three tables involved.
You don't have to rely on subqueries alone.
If you prefer the idea of JOINing the tables, not just two, but three of them, you can do that as well.
Let me do it as follows.
SELECT title FROM shows JOINed with the stars table ON shows.id equals
stars.show_id JOINed once more, so we're sort of combining all three tables at once in your mind's eye the people table on stars.person_id equals
people.ID, where name equals Steve Carell.
Now, this is a different approach where I'm explicitly joining one, two, and three tables based on those common, primary, and foreign keys.
But it's using the JOIN syntax this time to get back all of, all of Steve Carell's TV shows.
And that was actually a little slower because, it turns out, there's some final opportunities we're about to dive into to optimize this database for faster queries.
But let me show you one last way too.
If that felt a little verbose, we can also search for all of Steve Carell's shows as follows.
SELECT title FROM shows, stars, and people.
In other words, you can tighten up a JOIN query and specify all at once the three tables, in this case, that I want to select things FROM.
And I don't even need to explicitly use the JOIN keyword.
I can do it implicitly by saying WHERE shows.id equals,
stars.show.ID and people_id equals stars.person_id and name equals Steve Carell.
In other words, this is just a third way to express the exact same idea.
Give me all of the titles from all of the shows that Steve Carell has been in, JOINing them really implicitly by specifying which of the columns in one table must equal the columns in other tables.
Enter, and, again, with some suspense, we get back now the same list of shows.
Before we move on, let me summarize here that we have not only the tables we've looked at, but even one other as well, namely the writers table.
And if we're curious as to what that is, we can actually poke around too.
If I go back to my SQLite terminal and I do .schema writers,
you'll actually see that the writers table is quite similar to the stars table.
We call-- because recall that the stars table associated people with shows.
So does the writers table associate people with shows as well.
But, as per this one, we have a show_id and person_id, but the table name itself, writers, connotes that the relationship therein is indeed that of writers.
As an aside, there's probably a better way to implement even that idea because what about directors and producers and other people on a set?
Well, more on perhaps that design opportunity another time.
For now, I think it's timely where we try to chip away at the time of these queries and perhaps others.
So it turns out, besides just deciding when making your own relational database, whether it's with an automated import command or manually, as we the staff did in advance for the IMDb data that we downloaded in files from the internet, we can structure our data in the computer's memory in a way that draws inspiration from week 5 of CS50 and week 3 of CS50,
wherein we focused on data structures and algorithms, respectively.
In the world of databases, you, the database designer, can choose in advance to create indexes, so to speak, on your own databases so that you can speed up certain queries.
In fact, let me go ahead and show you, in my VS Code environment with this same database, how long it takes to execute a representative command, I'm going to go ahead and do timer, .timer ON, which is a SQLite command that's just going to count how many seconds or milliseconds it takes to run each and every one of my commands just so I can see it without having to look at my watch and time it manually.
I'm going to now do something simple like SELECT star FROM shows where the title of the show equals, quote, unquote, 'The Office'.
So I'm just curious to know how many seconds or milliseconds does this here command take.
Enter, and it's pretty fast.
In real time, it took definitely less than 1 second.
In real time, it took 0.043 seconds.
So that's actually pretty darn fast, but I bet I can speed it up even faster.
Let me propose to do this.
Let me build my own index inside of the database's memory using syntax like this.
CREATE INDEX and then the name of the index ON the name of a table, these columns specifically.
In other words, if I'm the designer of this database and I know that my users are going to be searching on the names, the titles of TV shows quite a bit, I can, in advance, tell the database to build a fancy tree-like structure in its memory or RAM so as to speed up searches for values therein.
So how do I do this?
Well, let me go back to VS Code.
And let me simply do CREATE INDEX.
I can call it anything I want, but I'll just be clear with title underscore index so I know what I'm indexing.
What table do I want to CREATE this INDEX ON?
Well, the show's table, specifically the title column therein.
So here I have specified CREATE an INDEX so a fancy tree-like structure in memory on the show's table's title column, Enter.
So that itself took a decent amount of time, almost half a second.
But that's a command I'm only going to run once.
So it's an upfront cost, but, hereafter, I can amortize that cost over many more SELECTs.
Let me go back up into my history.
This was the exact same query we executed a moment ago, which, previously, took 0.043 seconds.
But this time, after CREATing the index, 0.001 second.
And you might've thought earlier, 0.43 was already pretty darn fast.
I mean, 0.001 is an order of magnitude faster.
What's the implication of this?
Well, one, the human users of your website or application are going to get answers much more quickly.
And maybe, more importantly, in the real world, if I can handle like 40-- if this query is like 1/40 the speed of the previous query, I can buy 40 times fewer servers to handle my users because, presumably, a database server in the real world can only handle so many users at once.
So if you speed up your query by a factor of 40, you can get away with that many fewer servers and that much less money as a result. So building this index in memory can be super efficient with the result in time.
Now, what's the trade-off?
Well, as in every discussion of efficiency thus far in design, there's going to be a trade-off.
And, when you build one of these trees in memory, typically known as a B-tree, which is not a binary tree, it's literally a B-tree.
It's essentially a tree that's very short or as short as possible.
So it's not a binary tree in that it only-- each node has only two children.
It might have three or more children as well.
And the idea of having lots of children in this tree is that it kind of pulls the height of the tree up, so it might be very wide, but it's very short.
So the amount of time it takes to get from the root node at the top to any leaf node at the bottom tends to be fairly minimalist.
And you don't have to do any of this.
There's no pointers.
There's no C code.
You, the developer, don't have to do any of this.
When you create an index, what's happening underneath the hood is that kind of structure is being created for you instead.
And we can really see this, maybe with one of our same queries from earlier.
Let me go ahead and propose that, in VS Code, I bring back one of our slower queries from before.
Even if you're not 100% comfortable with the syntax, just take on faith that it's the same fairly slow query I executed a bit ago.
SELECT the title FROM the shows table, the stars table, and the people table WHERE the shows table's ID equals stars.show_id
and people_id equals stars.person_id AND the name equals Steve Carell.
I'm going to go ahead and hit Enter.
And because I've turned the timer on, we will now know mathematically how long this earlier query takes thinking thinking thinking.
All right, and here we have 3.483 seconds.
That's annoyingly long, if you will.
That's going to be an annoying website waiting and waiting for the response to come back.
And it's certainly going to cost other people time because if I've got multiple users on the website, they too might be wanting to execute queries as well.
Just to show how impactful these indexes can be, let me go ahead and revisit a query from earlier and show that we can speed up even our more sophisticated queries as well.
Recall that, when we were searching for all of Steve Carell's shows earlier, we did this.
SELECT title FROM shows where the ID of the show is in the following subquery.
SELECT show_id from the stars table where the person_id equals this subquery.
SELECT ID from people WHERE name equals Steve Carell.
Close parenthesis, close parenthesis, semicolon.
I have my timer on now, though, so we'll see how long this one actually takes.
Enter, and it's not too slow, but 0.215 seconds in real time.
OK, can we do better?
Well, notice, in this query, we searched on a couple of columns.
One, I searched for Steve Carell on the name column in the people table, and I also searched for the person_id column in the stars table.
In other words, when I was asking SQL to filter my results by those two WHERE clauses, person_id name, it was probably doing a lot of work.
And you can think back now to week 3 of CS50 and even week 0 of CS50 when we first implemented linear search, searching the phone book page at a time, or searching the lockers one at a time, linear search was slow.
But these SQL tables are really just really tall lists of values.
So if you want to search for a person_id or you want to search for a name, if you do not have a fancy data structure in memory already, one of these so-called B-trees that an index gives you, what is the database going to do?
It's going to literally search over every person ID and every name looking for the values you care about.
So can we help the database out?
If I know I am frequently going to search on person_id's and names, let's create two indexes.
So let me go back to my SQLite prompt and create one index called, say, name_index on the people table's name column because I want to search for names like Steve Carell, semicolon, ENTER.
It's thinking, thinking.
That took over a second, but it's a one-time operation.
Let's create one more INDEX called, say, person_index ON the stars tables person_id.
That, too, is going to take a moment, moment, moment.
Two seconds, but it's a one-time operation.
Now let's rerun my earlier query, remembering that, previously, it took 0.215 seconds, and that earlier query was, as before, SELECT title FROM shows WHERE the ID of the show is in the following subquery.
SELECT show_id FROM stars WHERE person_id equals the following sub-subquery.
SELECT ID from people WHERE name equals, quote, unquote, 'Steve Carell', close quote, close parenthesis, close parenthesis, semicolon.
Previously, again, this query took 0.215 seconds but gave me all the results I cared about.
Now, when I hit Enter, it's instant, 0.001 seconds, so hundreds of times faster than the previous query just by building up these indexes in memory.
So, at the end of the day, what's helpful about these indexes is the massive improvement potentially in speed.
But if we looked at how much space this database is taking up on the system, it's probably going to be more space because we have to store these B-trees somewhere in memory or on the server's hard drive.
And, it turns out, because we've created these indexes, if I go about INSERTing, UPDATing, DELETing data over time, I'm actually going to slow down those operations just a little bit, not by a huge amount, but, presumably, the database has got to keep this tree balanced, as we've discussed in the past too, so that if we INSERT or UPDATE or DELETE things, we make sure that the tree itself is still efficiently stored in memory.
So, even though we might get massive savings with our SELECT queries, we might increase the time required for some of our other operations.
But odds are, if you are a very read-heavy site or application, like IMDb itself probably is, where many more people are using the data than are changing the data, odds are that's going to be a net positive.
All right, with that said, enough about design, let's focus lastly on how we might use these databases and some of the problems we might run into.
Now, it turns out, it's actually very common in the real world to use multiple programming languages together.
Thus far today, I've been focusing really just on SQL in isolation and on Python in isolation.
But, sometimes, it turns out, it would be very compelling to write a Python program that maybe takes user input, but still harnesses the power and the capabilities of SQL by using SQL inside of Python.
And so we can actually do this by using CS50's own library, which, in addition to get ints and other such functions that we implemented for you in both C and Python, in the Python library, we also have support for SQL databases.
And these aren't so much training wheels as they are just very user-friendly features on top of proper SQL databases.
That is to say, it turns out, that, when you use a third-party library to query a SQL database, there's lots of functions that you might have to pick up in this language or that.
In CS50's library, we do simplify it a little bit, but without hiding any of the underlying functionality from you.
And so, in fact, at this URL here, if curious, you can see the documentation ultimately for CS50's own library.
But, within this library, do we have functionality like that allowing us to execute SQL inside of Python?
So let's do that.
Let's go back to VS Code and open up favorites.py
but start from scratch here.
And, instead of importing the CSV package, let me go ahead and import something from CS50's own library as follows.
From cs50 import SQL in all caps.
This is going to give me access to SQL-specific functionality in CS50's own library.
Let me then create a variable called db for short, but I could call it anything I want, set that equal to the return value of that SQL function from CS50's library, and then specify what SQLite file I want to open.
The syntax for this is a little weird, but, inside of quotes, what I do is pass in a string here that's as follows-- sqlite:///favorites.db.
sqlite:///favorites.db.
So it kind of looks like a URL, but there's a third in there.
Follows a similar format overall, though.
That will take care of, now, opening that .db file in Python code instead of having to manually run sqlite3 as I have in the past.
Now let's go ahead and ask the human for their favorite language, for instance, by creating a variable called favorite, set it equal to the return value of the input function in Python, which is like get string and just prompts the user for their favorite.
Let me now call db.execute, which is a function, or method, that's inside now of that database variable because of how the cs50 library opened that file and execute a line of SQL from within my own Python code.
In particular, let me SELECT how many people like a particular language based on the CSV, which is now in our .db file.
SELECT COUNT * AS n FROM favorites WHERE language equals question mark, close quote.
And, in this case here, not to throw too many wildcard characters at us today, but, in the context of cs50's library and in the context of a lot of SQL libraries in the real world, a single question mark, without any quotes around it, is also meant to be a placeholder for a value like the string that the human typed in.
We don't want to use F strings.
We don't want to use the curly braces, in this case, for reasons we'll come back to.
But the question mark is a placeholder that, much like %s in printf from C, will get plugged in with the actual favorite that the human typed in.
All right, now I want to do something with the return value of this function.
And, turns out, when you use a SELECT statement inside of this execute function, what you're going to get back is a list of 0 or more rows.
So if you match a row, you get back a list of one row.
If you match two rows, you get back a Python list of two rows.
If you match nothing, you get back a list with nothing in it, of length 0 in that case.
Now, if I only really care about one answer like the first row that comes back, well, let's do exactly that.
Row equals rows bracket 0.
So I'm just, for the sake of discussion now, assuming that I'm going to get back at least 1 and, frankly, just 1 row.
So let's just grab it with row equals rows bracket 0 because rows is a list.
But row is now a single row.
And, now, what this function db.execute does, per its own documentation, is it returns to you, yes, a list of rows.
But, much like the DictReader in the context of CSV files, each of the elements in that list of rows is itself going to be a dictionary.
And the keys for each of those dictionaries is going to be the name of the columns from the SQL table, just like the DictReader is grabbing the column names from the CSV itself.
So, lastly, let's go ahead and print out the popularity of that there language.
print row, quote, unquote, "n" because I want into index into that one dictionary and get back the value of n, which, c be clear, was the alias that I gave to COUNT * just so I don't have to keep typing COUNT *, COUNT *.
I can, more simply and more succinctly, call it n for number.
Let me go ahead and open up my terminal window again.
Run python of favorites.py, Enter.
And my favorite, say, I'm going to buck the trend here.
And I'm going to say that C was my favorite, enter.
What is this crazy error message that I just induced deliberately for the sake of discussion?
So, in this case here, it's underlining my entire query here under db.execute.
And then I'm being told more specifically that there's more placeholders than values, so maybe a teachable moment indeed.
I spent all that time talking about the placeholder value earlier, but what I failed to point out is how you place a value there.
So it turns out that the db.execute method takes minimally one argument, the query you want to execute, but, ideally, one or more additional arguments for the values you want to plug into those placeholders.
So what language am I searching for?
Well, whatever value is in my favorite variable, which contains the string that the human just typed in a moment ago.
So if I now open up my terminal window-- and pretend that never happened-- by clearing my screen.
Let me now run python of favorites.py again, really crossing my fingers.
That was the problem.
Didn't do that last time.
Enter, I'm asked for my favorite.
I'll type in C again, Enter.
And there is, thankfully, that 59.
So, granted, it's still a decent number of lines of code.
But notice what's not in here.
There's no more loop.
There's no more dictionary reader.
There's no more conditionals.
I have used SQL instead to just declare what question I want an answer to, and I get it back in the form of some Python variables, rows being a list and containing one or more dictionaries in this case.
So, hopefully, that sort of demonstrates how we might combine these two languages.
And, indeed, in the problem sets to come, we're going to do quite a bit of this.
But, in our final moments together, it feels worth dwelling on some of the bad things that can happen in the world of SQL, much like we saw bad things in the world of C and, really, any programming language as well.
So it turns out that, in the world of SQL and, really, programming in general, you can be vulnerable to something called race conditions.
And the story that was once told to me was this.
If you're in college, for instance, and have one roommate, and, between the two of you, you've got one of those little refrigerators in your dormitory.
And both of you really like something like milk, and so you're determined to make sure you always have milk in the fridge.
Well, how might you go about solving this problem if one day you come home and you realize you're out of milk?
You open the refrigerator, you see nothing in there, you close the refrigerator, and then you go off to the store, ideally, picking up a gallon of this here milk.
All right, well, meanwhile, suppose in the story that your roommate comes home, opens the fridge, similarly loves milk, and realizes, oh, we're out of milk and so closes the fridge, goes out, and tries shopping for another jug of milk.
Well, what happens, unfortunately, if both of you have inspected the state of your refrigerator and made a decision based on the value therein?
Well, at some point, both of you are going to get home.
You're going to end up with two jugs of milk, which is way too much milk because you don't really like milk that much, but that, therefore, is a problem.
So how could we have solved this here problem, questionable whether these props were compelling for this example in retrospect, but more for us later.
The problem at hand was that both of you in the story made a decision based on the value of some variable, but you didn't signal to the other person that you're in the process of updating the value of that variable.
Better would have been in this story for the first of you to lock the refrigerator physically, or at least put a note on the refrigerator saying something like gone for milk to prevent the variable from being updated in duplicate, if you will, at the same moment of time.
Now, as silly and as contrived as that story is, it turns out, it's representative of something that can happen quite a bit in the real world.
Consider any popular social media application nowadays or particularly popular posts.
This photo here is, as of this morning, still the most popular post ever on Instagram based on the number of likes that it got, upwards of millions.
But this is actually a hard problem for a company like Meta and any of these social media platforms to keep track of because, presumably, when something goes viral online and lots of people are clicking like like like like, there's a lot of potential changes to a database somewhere all at once because so many people are trying to that post at a time.
And, just for the sake of discussion, let me hypothesize that Meta, the company behind Instagram, maybe is using cs50's library for this, but more, specifically, using Python and SQL together.
But it's really the SQL that's germane for our discussion here.
Let me step out of the way here and note that, in these lines of code, I'm supposing that, first, a company like Meta is executing a SQL statement like this.
SELECT the number of likes from the posts table where the ID of the post is question mark, where question mark represents whatever the unique identifier is for this here post.
So that's how Facebook-- or that's how meta is trying to figure out, how many likes does the post currently get?
Why are they trying to do that?
Well, if we go back to the code here, they're then getting from the first row they get back from the likes column, the current number of likes for that post.
Why?
Because the goal is to increment it simply by 1.
So the third and final line of code here says, UPDATE the posts table.
SET the likes value to this placeholder WHERE the id equals this placeholder.
And, as before, I have not-- unlike before, I have not one but two placeholders here, plugging in likes plus 1, so the value of the variable that we just inspected and then the ID of the post.
So where is the problem here?
Well, a company like Meta and any of these social media platforms has so many servers, so many users, probably so many databases that, odds are, these inputs are going to be coming in at once.
And, much like our refrigerator example, odds are, unless Meta is smart about this, they might potentially figure out on line 1 what the current number of likes is, on line 3, actually, UPDATE that to be the number plus 1.
But what if some other user's like came into the database while these three lines of code were executing.
You can imagine if things are happening so quickly, maybe the first line of code gets executed for me, and then the first line of code get executed for you.
And the queries in the system essentially get interwoven somehow, such that now, even though two people might have clicked on like, they might've happened so quickly together that, just as the first one is in the process of updating current number plus 1, the other one is in the process of updating current number plus 1 when really the math should've been current number plus 2
and they both end up writing the same value instead of 1 that is, ultimately, 2 likes higher than it once was.
In other words, if you have code being executed that can somehow get interrupted by other code being executed on another server, maybe with the same database, something called a race condition can happen, much like both of us inspecting the state of the milk in the refrigerator while someone else was in the process of updating it as well.
So, it turns out, with the wave of the hand, there are features and solutions in the world of SQL to prevent this, and they are called transactions.
A transaction allows you to say to the database, execute all of these queries or none of them, but it has to be all or nothing.
It is not acceptable for part of them to get executed but not the others.
You can effectively, therefore, lock someone out of the database.
So it's a little fancier than that, but it's similar in spirit to putting a lock on the fridge or a note on the fridge to signal, do not make any changes to this database until my queries are entirely done, rather than only partially done.
So, here, then, in green is the right way to do something like this, for at least Meta.
You first begin a transaction.
Then you do whatever database logic you want, ideally, not much because you want to keep it quick and simple.
But then, at the end of your code, you commit all of those changes.
So this is sort of making change, making change, making change, but only saving the results to the database at the very end.
So it's either saved, or it's not saved, so you don't get interrupted partway through.
Now, besides these race conditions, there's another problem in the world of databases that can strike you as well, known as SQL-injection attacks.
And this is actually germane to what people have been calling prompt-injection attacks nowadays in the world of AI, where people are trying to trick AI into misbehaving in certain ways by injecting inputs into them that they weren't expecting.
Same idea with SQL-injection attacks.
And, in fact, these came long before prompt-injection attacks, historically.
With a SQL-injection attack, imagine taking user input via anything, via something like favorites.py or maybe something web based.
So, for instance, when Yale students log in to their website, they log in here using their net ID and password.
When Harvard students do the same, they log in using something called HarvardKey.
In both cases, there are form fields in the application or website prompting for user input.
But here's the catch.
What Harvard, in this case, is expecting me and anyone to do is log in with their email address and their password.
But what if I maliciously, and very cleverly, use some funky characters or punctuation in my email address that are not valid email characters?
But they are valid SQL characters.
And we haven't talked much about these, but single quotes are certainly valid syntax in SQL.
We've seen those.
It turns out that dash dash, two hyphens, are also popular as well.
But, unlike Python and unlike C, which use other symbols, dash dash in SQL denotes the start of a comment.
So, in SQL, if you do dash dash anything, the anything will be ignored thereafter.
So notice what I've done here is I've written my email address mailing at harvard.edu, single quote dash dash.
Now let's see why this is germane.
So here let me hypothesize-- I'll take myself out of the picture-- is how Harvard might be implementing login to this database.
They have a query in their Python code that says SELECT * FROM users WHERE username equals this placeholder and password equals this other placeholder.
And, for the sake of discussion, assume that username and password somehow in Python came from those two text boxes, the username and the password field.
And then what Harvard does is check, if I get any number of rows back, that means that I found a row with that username and that password.
Therefore, let's proceed to log the user into the website or application.
All right, all fairly reasonable.
But what could go wrong?
Well, here is, in green, the query we want to execute.
But let's plug in those potential values from my own form.
If I were to type in malan@harvard.edu single quote dash dash, notice what happens here.
And if I'm using not the placeholders but literally format strings, F strings in Python.
So the point here will be, use the placeholders, use the question marks because what far too many people in the real world do is they don't use a library like cs50's.
They don't use something that solves this here problem.
Here is an F string or format string in Python that formats the same string.
SELECT * FROM users WHERE username equals, quote, unquote, 'something' and password equals, quote, unquote, 'something'.
This is why we don't use format strings for SQL because this demonstrates a SQL-injection attack.
Notice that whoever wrote this code, at Harvard presumably, put single quotes around the username placeholder and single quotes around the password placeholder.
But what if I, a malicious student, use a single quote or an apostrophe in my own email address?
What happens here?
Well, let's plug in malan@harvard.edu single quote dash dash.
Now, it's a little subtle, but, notice, the fact that I added a single quote at the end of malan@harvard.edu kind of closes the first quote that the Harvard programmer wrote.
And then I typed in dash dash.
But recall that dash dash means ignore everything to the right, so it doesn't matter that there's another quote.
It doesn't matter that there's an AND it doesn't matter that there's a mention of a password.
I have essentially tricked the database into a treating the dash dash and everything after it as an ignorable comment.
So, logically, what happens really-- and I'm going to gray out everything after the dash dash-- the database is only being asked to SELECT * FROM users WHERE the username equals malan@harvard.edu.
Everything else is ignored.
The same conditional comes thereafter.
If I get back any rows, log this user in.
And so the implication here is effectively that if you were to log into Harvard's system as malan@harvard.edu single quote dash dash, you could presumably log in as me without even knowing my password because you've essentially tricked the database into ignoring everything after that username.
So this is a SQL-injection attack in the sense that you, the user, have injected code, SQL code into my application by just typing in user input.
And so the takeaway here is that trusting user input, bad.
You should never trust users input because it could be mistyped or misspelled at best, but it could be malicious input at worst.
You should always assume the worst, unfortunately of your here users and somehow defend against this.
So what is the solution?
The right solution here notice-- and it's very subtle-- would be to escape the single quote that I typed in.
And we saw this earlier when I had to escape the apostrophe in Hello comma It's Me.
I could encode-- make sure that no matter what the human types in, I change any single quotes to two single quotes to escape them accordingly.
But that's going to be annoying and hard to remember all the time.
So, really, the right solution is to do indeed what I started out preaching, which is use placeholders, like question marks, which comes with cs50's library.
But, again, this is not a CS50 thing.
It is representative of how many, if not most, SQL libraries out there let you execute SQL code inside of another language like Python.
Now, a little more playfully, this is a sort of meme that's gone around on the internet here.
Here's someone's car with a crazy-long license plate but with a mention of DROP DATABASE TABLE, which, as you might imagine, DROP, being one of the keywords we mentioned earlier, is how you go about not just deleting data, but dropping all of the data in a table.
And if we zoom in here, presumably, what some hacker here is trying to do is trick some of those cameras that are on streets in various countries that try to read your license plate.
And if someone wrote code to read license plates that trust the user input and doesn't just allow for dash dash but allows for DROP DATABASE TABLE, you can do really bad damage to databases by tricking them into to executing DELETE or DROP or any number of other commands as well.
And, in fact, the final note that we thought we'd end on is this here, which sort of inaugurates you, welcomes you into the world of educated computer citizens by the name of someone called little Bobby Tables.
Let me take myself away for just a moment, and let everyone read here this xkcd.
But, henceforth, after today, if you mention little Bobby Tables to most any programmer in the world, with a bit of a smile, they should know who and what you are talking about, an illusion indeed to SQL-injection attacks.
So, with all of this power, indeed, as always, comes great responsibility.
That then was SQL.
And we'll see you next time.
[MUSIC PLAYING]
Loading video analysis...