The Biggest Database Design Mistake
By Boot dev
Summary
## Key takeaways - **Ignore Academic Normalization Definitions**: There's the academic definition of database normalization, but you can mostly just ignore it. The technical category definitions are mostly just an academic exercise. [00:00], [00:12] - **Normalization Cuts Redundancy, Boosts Integrity**: Normalization is the practice of improving the structure or schema of a database so that it has less data redundancy, duplicate data, and more data integrity, which really just means correct data. We want to store data in its simplest form with no copies because copies can lead to bugs. [00:27], [00:59] - **1NF: Unique Keys, No Nested Tables**: First normal form has two rules: every row must have a unique primary key, and there can be no nested tables. To fix duplicate rows like two 'Lane 30' entries, add a unique ID column as the primary key. [01:44], [02:32] - **2NF: Full Key Dependency Required**: Second normal form adds: all columns not part of the primary key must be dependent on the entire primary key, not just part of it. First initial depends only on first name (part of composite key), so extract to separate mapping table. [02:41], [04:59] - **3NF: No Non-Key Dependencies**: Third normal form adds: all columns not in the primary key must be dependent only on the primary key. First initial depends on first name (both non-key), duplicating the L mapping for Lane. [05:22], [06:24] - **BCNF Fixes Key Dependency Edge Case**: Boyce-Codd normal form adds: a column that is part of the primary key may not be dependent on a column not part of the primary key. With release year + sales as composite key, year depends on release date, creating duplicates. [06:20], [07:50]
Topics Covered
- Ignore Academic Normalization Details
- Normalization Eliminates Duplicate Data
- Primary Keys Prevent Duplicate Rows
- Dependencies Break on Partial Keys
- Always Normalize Unless Performance Demands
Full Transcript
Look, there's the academic definition of database normalization, but you can mostly just ignore it. Normalization is
super important to understand. It
seriously impacts the code that you write in the real world. But the
technical category definitions are mostly just an academic exercise. That
said, let's go over them in excruciating detail, and I'll give you the actual practical rule of thumb about how to use them at the very end. But first, what is normalization? Well, it's really just
normalization? Well, it's really just the practice of improving the structure or schema of a database so that it has less data redundancy, duplicate data, and more data integrity, which really
just means correct data. So, to put it simply, we want to store data in its simplest form with no copies because copies can lead to bugs. For example, we might update one copy and forget to
update the other copy, which gives us sort of this invalid state. Which one
represents the truth? And normalization
is a gradient. You can have a database that's more normalized or one that's less normalized. The more normalized the
less normalized. The more normalized the database is, the less duplicate data you have and the more correct the data is likely to be. Now, academics have defined four different normal forms.
First, second, third and boy cod. As we
move up from first to second to third to BCNF, the database becomes more and more normalized. Now, each form is just a set
normalized. Now, each form is just a set of rules. If the data in the database
of rules. If the data in the database follows the rules, then the database is in that normal form. The rules build on each other. So second normal form has
each other. So second normal form has all the rules from first normal form plus some extra. Third normal form has the rules of second normal form plus some extra. You get the idea. So
some extra. You get the idea. So
anyways, let's start with first normal form. It just has two rules. First,
form. It just has two rules. First,
every row must have a unique primary key. And two, there can be no nested
key. And two, there can be no nested tables. Now rule number two is actually
tables. Now rule number two is actually pretty easy to follow in practice. Most
database systems don't even let you nest tables within each other with the exception of like maybe embedding some JSON. But to understand rule number one,
JSON. But to understand rule number one, let's say we have a user's table with two columns, first name and age. And say
we have a user in that table, Lane, who's 30. If another user signs up with
who's 30. If another user signs up with the same name and age, Lane 30, we get a duplicate row. And this is what first
duplicate row. And this is what first normal form is supposed to prevent because no duplicate rows are allowed.
To fix it, we can just add a unique ID column as the primary key. We'll give
the first row one and the second row ID two. Now, these ID numbers would be
two. Now, these ID numbers would be generated by the application so that they can't be duplicated no matter what the user's name or age is. Okay, so
that's first normal form. Now, on to second normal form. We still have to follow all the rules of first normal form, but there is one additional rule.
All columns that are not part of the primary key must be dependent on the entire primary key, not just part of it.
And a primary key is usually a single ID column, but you can have a table with a primary key that's unique combination of two or three or more columns. For
example, you might have a user's classes table that maps users to the classes they're in. Has a user ID column and a
they're in. Has a user ID column and a class ID column. User one is in class 3.
User two is in class 4. User one is in class 5. Now, there's not much point
class 5. Now, there's not much point here in adding a third ID column because the unique combination of user ID and class ID does the job just fine. you
can't enroll in the same class twice.
That doesn't really make sense. When we
use the term primary key in a software engineering sense, we're usually referring to an ID column or maybe a specific combination of columns that we've created a primary key around. But
in an academic computer science normalization sense, primary key just means the smallest number of columns that can be used to uniquely identify a
row in an actual data set. So say we have a table with three columns. First
name, last name, and first initial. And
then we have Lane small, first initial L. Lane Brewer, first initial L, and
L. Lane Brewer, first initial L, and Allen small, first initial A. Now, this
table technically follows first normal form because there are no duplicate rows and no nested tables. But it doesn't follow second normal form's rules. None
of the columns can be a primary key by themselves because none have completely unique values. But we can choose a pair
unique values. But we can choose a pair that creates a unique combination like first name and last name. Now with that definition of a primary key in mind, let's read the rule again. All columns
that are not part of the primary key, in this case, just first initial must be dependent on the entire primary key, not just a part of it. The problem is that first initial is entirely dependent on
just the first name column, which is a part of the primary key, not the whole primary key. So we've broken the rule.
primary key. So we've broken the rule.
So to solve this, we can create a new table and move the first initial column there. It's a mapping table of first
there. It's a mapping table of first name to first initial. So we've got Lane, L, Allen, A. And on our first table, we just drop that first initial column. Now, both tables follow the
column. Now, both tables follow the rules of second normal form. And if we need to look up a first initial by the user's name, we just use that second table to do so. Now, we don't have a
duplicate mapping of lane to L. We only
store that relationship once. Now, this
may seem like a really silly example.
Why would you store an initial alongside a name? You just calculate it when you
a name? You just calculate it when you need it. And it is. It is kind of a
need it. And it is. It is kind of a simple example but there are times in the real world where you will have a column that kind of depends on another column and rather than storing it many
many times you can extract it out into another table. Now on to third normal
another table. Now on to third normal form. It follows all the rules of first
form. It follows all the rules of first and second normal form but now adds one additional rule. All the columns not in
additional rule. All the columns not in the primary key must be dependent only on the primary key. Let's go back to the same example but add a unique primary
key column ID. So we've got one lane small L, two lane brewer L, three Allen small A. Now you might think that this
small A. Now you might think that this is the same problem as before and it kind of is. But technically this table does follow second normal form because first initial is dependent on first name
and first name is not part of the primary key anymore because that's the new ID column. But fundamentally we're still duplicating data. We know lane maps to L and we don't need to store
that relationship twice. This table
breaks third normal form because a column not in the primary key first initial is dependent on another column not in the primary key first name. And
again we can solve this by moving the first name to first initial mapping to a separate table. Okay. Last normal form
separate table. Okay. Last normal form boy cod normal form. It adds one final rule. A column that is part of the
rule. A column that is part of the primary key may not be dependent on a column that is not part of the primary key. It's important to note that boycod
key. It's important to note that boycod normal form or BCNF was invented after first, second, and third normal forms. Once they realized that there was still a way for duplicate data to slip into
tables even if a table followed third normal form for Boyce COD normal form to matter, the database has to be in a very specific state. Say we have a table with
specific state. Say we have a table with four columns. Release year, release
four columns. Release year, release date, sales, and name. And we fill it up with all of this data. In this table, we have several possible candidate primary keys. Name could be the primary key on
keys. Name could be the primary key on its own, or it could be a combination of release year and sales. The date column could also be a candidate primary key on its own. If we choose the combination of
its own. If we choose the combination of release year and sales as our primary key, the problem is that release year is fully dependent on the release date column. So we have a column in the
column. So we have a column in the primary key that is dependent on a column that's not part of the primary key. That doesn't break second or third
key. That doesn't break second or third normal form, but it does create duplicate data. And boycod normal form
duplicate data. And boycod normal form fixes this edge case. Again, it says a column that's part of primary key cannot be entirely dependent on a column that's not part of the primary key. Now, one
way to fix this is to store year, month, and day in separate columns. But
honestly, we could just get rid of the release year column entirely because we already have the more specific date column. Okay, so these were the
column. Okay, so these were the technical definitions, the academic definitions of first, second, third, and voice COD normal forms. But here's what I want you to remember as a real world
software engineer, not a computer science academic. Unless you have a
science academic. Unless you have a really, really good reason, usually performance related, don't denormalize your data. try to keep it as normalized
your data. try to keep it as normalized as possible and for the most part we just want full normalization boycod normal form which when you ignore all of
the other categories we can talk about it much more simply just don't store redundant data there should only be one way to represent the state that you're
trying to represent in your database schema add unique primary key ID columns and if you do that you'll be just fine
Loading video analysis...