The Biggest Database Design Mistake

By Boot dev

Summary

## Key takeaways - **Ignore Academic Normalization Definitions**: There's the academic definition of database normalization, but you can mostly just ignore it. The technical category definitions are mostly just an academic exercise. [00:00], [00:12] - **Normalization Cuts Redundancy, Boosts Integrity**: Normalization is the practice of improving the structure or schema of a database so that it has less data redundancy, duplicate data, and more data integrity, which really just means correct data. We want to store data in its simplest form with no copies because copies can lead to bugs. [00:27], [00:59] - **1NF: Unique Keys, No Nested Tables**: First normal form has two rules: every row must have a unique primary key, and there can be no nested tables. To fix duplicate rows like two 'Lane 30' entries, add a unique ID column as the primary key. [01:44], [02:32] - **2NF: Full Key Dependency Required**: Second normal form adds: all columns not part of the primary key must be dependent on the entire primary key, not just part of it. First initial depends only on first name (part of composite key), so extract to separate mapping table. [02:41], [04:59] - **3NF: No Non-Key Dependencies**: Third normal form adds: all columns not in the primary key must be dependent only on the primary key. First initial depends on first name (both non-key), duplicating the L mapping for Lane. [05:22], [06:24] - **BCNF Fixes Key Dependency Edge Case**: Boyce-Codd normal form adds: a column that is part of the primary key may not be dependent on a column not part of the primary key. With release year + sales as composite key, year depends on release date, creating duplicates. [06:20], [07:50]

Topics Covered

Ignore Academic Normalization Details
Normalization Eliminates Duplicate Data
Primary Keys Prevent Duplicate Rows
Dependencies Break on Partial Keys
Always Normalize Unless Performance Demands

Full Transcript

Look, there's the academic definition of database normalization, but you can mostly just ignore it. Normalization is

super important to understand. It

seriously impacts the code that you write in the real world. But the

technical category definitions are mostly just an academic exercise. That

said, let's go over them in excruciating detail, and I'll give you the actual practical rule of thumb about how to use them at the very end. But first, what is normalization? Well, it's really just

normalization? Well, it's really just the practice of improving the structure or schema of a database so that it has less data redundancy, duplicate data, and more data integrity, which really

just means correct data. So, to put it simply, we want to store data in its simplest form with no copies because copies can lead to bugs. For example, we might update one copy and forget to

update the other copy, which gives us sort of this invalid state. Which one

represents the truth? And normalization

is a gradient. You can have a database that's more normalized or one that's less normalized. The more normalized the

less normalized. The more normalized the database is, the less duplicate data you have and the more correct the data is likely to be. Now, academics have defined four different normal forms.

First, second, third and boy cod. As we

move up from first to second to third to BCNF, the database becomes more and more normalized. Now, each form is just a set

normalized. Now, each form is just a set of rules. If the data in the database

of rules. If the data in the database follows the rules, then the database is in that normal form. The rules build on each other. So second normal form has

each other. So second normal form has all the rules from first normal form plus some extra. Third normal form has the rules of second normal form plus some extra. You get the idea. So

some extra. You get the idea. So

anyways, let's start with first normal form. It just has two rules. First,

form. It just has two rules. First,

every row must have a unique primary key. And two, there can be no nested

key. And two, there can be no nested tables. Now rule number two is actually

tables. Now rule number two is actually pretty easy to follow in practice. Most

database systems don't even let you nest tables within each other with the exception of like maybe embedding some JSON. But to understand rule number one,

JSON. But to understand rule number one, let's say we have a user's table with two columns, first name and age. And say

we have a user in that table, Lane, who's 30. If another user signs up with

who's 30. If another user signs up with the same name and age, Lane 30, we get a duplicate row. And this is what first

duplicate row. And this is what first normal form is supposed to prevent because no duplicate rows are allowed.

To fix it, we can just add a unique ID column as the primary key. We'll give

the first row one and the second row ID two. Now, these ID numbers would be

two. Now, these ID numbers would be generated by the application so that they can't be duplicated no matter what the user's name or age is. Okay, so

that's first normal form. Now, on to second normal form. We still have to follow all the rules of first normal form, but there is one additional rule.

All columns that are not part of the primary key must be dependent on the entire primary key, not just part of it.

And a primary key is usually a single ID column, but you can have a table with a primary key that's unique combination of two or three or more columns. For

example, you might have a user's classes table that maps users to the classes they're in. Has a user ID column and a

they're in. Has a user ID column and a class ID column. User one is in class 3.

User two is in class 4. User one is in class 5. Now, there's not much point

class 5. Now, there's not much point here in adding a third ID column because the unique combination of user ID and class ID does the job just fine. you

can't enroll in the same class twice.

That doesn't really make sense. When we

use the term primary key in a software engineering sense, we're usually referring to an ID column or maybe a specific combination of columns that we've created a primary key around. But

in an academic computer science normalization sense, primary key just means the smallest number of columns that can be used to uniquely identify a

row in an actual data set. So say we have a table with three columns. First

name, last name, and first initial. And

then we have Lane small, first initial L. Lane Brewer, first initial L, and

L. Lane Brewer, first initial L, and Allen small, first initial A. Now, this

table technically follows first normal form because there are no duplicate rows and no nested tables. But it doesn't follow second normal form's rules. None

of the columns can be a primary key by themselves because none have completely unique values. But we can choose a pair

unique values. But we can choose a pair that creates a unique combination like first name and last name. Now with that definition of a primary key in mind, let's read the rule again. All columns

that are not part of the primary key, in this case, just first initial must be dependent on the entire primary key, not just a part of it. The problem is that first initial is entirely dependent on

just the first name column, which is a part of the primary key, not the whole primary key. So we've broken the rule.

primary key. So we've broken the rule.

So to solve this, we can create a new table and move the first initial column there. It's a mapping table of first

there. It's a mapping table of first name to first initial. So we've got Lane, L, Allen, A. And on our first table, we just drop that first initial column. Now, both tables follow the

column. Now, both tables follow the rules of second normal form. And if we need to look up a first initial by the user's name, we just use that second table to do so. Now, we don't have a

duplicate mapping of lane to L. We only

store that relationship once. Now, this

may seem like a really silly example.

Why would you store an initial alongside a name? You just calculate it when you

a name? You just calculate it when you need it. And it is. It is kind of a

need it. And it is. It is kind of a simple example but there are times in the real world where you will have a column that kind of depends on another column and rather than storing it many

many times you can extract it out into another table. Now on to third normal

another table. Now on to third normal form. It follows all the rules of first

form. It follows all the rules of first and second normal form but now adds one additional rule. All the columns not in

additional rule. All the columns not in the primary key must be dependent only on the primary key. Let's go back to the same example but add a unique primary

key column ID. So we've got one lane small L, two lane brewer L, three Allen small A. Now you might think that this

small A. Now you might think that this is the same problem as before and it kind of is. But technically this table does follow second normal form because first initial is dependent on first name

and first name is not part of the primary key anymore because that's the new ID column. But fundamentally we're still duplicating data. We know lane maps to L and we don't need to store

that relationship twice. This table

breaks third normal form because a column not in the primary key first initial is dependent on another column not in the primary key first name. And

again we can solve this by moving the first name to first initial mapping to a separate table. Okay. Last normal form

separate table. Okay. Last normal form boy cod normal form. It adds one final rule. A column that is part of the

rule. A column that is part of the primary key may not be dependent on a column that is not part of the primary key. It's important to note that boycod

key. It's important to note that boycod normal form or BCNF was invented after first, second, and third normal forms. Once they realized that there was still a way for duplicate data to slip into

tables even if a table followed third normal form for Boyce COD normal form to matter, the database has to be in a very specific state. Say we have a table with

specific state. Say we have a table with four columns. Release year, release

four columns. Release year, release date, sales, and name. And we fill it up with all of this data. In this table, we have several possible candidate primary keys. Name could be the primary key on

keys. Name could be the primary key on its own, or it could be a combination of release year and sales. The date column could also be a candidate primary key on its own. If we choose the combination of

its own. If we choose the combination of release year and sales as our primary key, the problem is that release year is fully dependent on the release date column. So we have a column in the

column. So we have a column in the primary key that is dependent on a column that's not part of the primary key. That doesn't break second or third

key. That doesn't break second or third normal form, but it does create duplicate data. And boycod normal form

duplicate data. And boycod normal form fixes this edge case. Again, it says a column that's part of primary key cannot be entirely dependent on a column that's not part of the primary key. Now, one

way to fix this is to store year, month, and day in separate columns. But

honestly, we could just get rid of the release year column entirely because we already have the more specific date column. Okay, so these were the

column. Okay, so these were the technical definitions, the academic definitions of first, second, third, and voice COD normal forms. But here's what I want you to remember as a real world

software engineer, not a computer science academic. Unless you have a

science academic. Unless you have a really, really good reason, usually performance related, don't denormalize your data. try to keep it as normalized

your data. try to keep it as normalized as possible and for the most part we just want full normalization boycod normal form which when you ignore all of

the other categories we can talk about it much more simply just don't store redundant data there should only be one way to represent the state that you're

trying to represent in your database schema add unique primary key ID columns and if you do that you'll be just fine

Loading...

Loading video analysis...