AnyRAID VDEVs
By OpenZFS
Summary
Topics Covered
- Heterogeneous Disks Unlock Full Capacity
- Tiles Enable Flexible Data Placement
- Replicated Tile Maps Guarantee Bootstrapping
- Rebalance Maximizes Post-Expansion Space
- AnyRAID Contraction Frees Devices
Full Transcript
All right, good morning everyone. My name is Paul Dagnelie, and today I'm going to be talking to you about a new feature in ZFS that I've been working on called AnyRAID vdevs. So, who am I? As I said, my name is Paul Dagnelie. I've been working with Klara Systems for about a year now, but this is my 11th or 12th developer summit. I've presented a couple of times in the past, and I'm excited to be doing so again.
So let's talk about ZFS a little bit. ZFS is used by a really broad array of people. We have users up at the enterprise scale deploying racks and racks of servers. They have hard drives by the box; they deploy everything at this huge scale very efficiently, with big budgets, procedures, the whole nine yards. We have mid-scale users, like Capital Auto Group, keeping track of their sensitive and important data at a professional scale, but not quite at the enterprise level of some of the vendors that use ZFS. And then we have amateurs and enthusiasts. I'm sure there are a lot of people here who have a server sitting under a desk somewhere with some hard drives in it that they run for the love of the game. I know I do. And all of these different use cases have different constraints and requirements when it comes to how they use ZFS, what they use it for, and the things they care about most.
One of the things that a lot of people who use ZFS care about is reliability. This is probably one of the main reasons people use ZFS: its ability to protect and preserve your data even when a lot of things go horribly wrong. And ZFS has a lot of different ways to do that. There are a lot of vdev type options. We can do mirror vdevs and preserve your data that way. There are RAID-Z vdevs, which have better space efficiency but have other trade-offs. dRAID is sort of the newer hotness, although it's been around for a little while now, expanding the RAID-Z capability and giving it more interesting features. One odd duck is the copies property, which lets you, at a per-dataset level, store multiple copies of critical data, but it has its own little trade-offs that we'll talk about. And the nice thing about all these reliability tools that ZFS has built into it is that they're simple to use. You create your vdevs, you set them up, and then they just keep working. They're performant: they give you good read speeds, good write speeds, they parallelize everything across the disks. They do everything you want from that perspective. And they're effective: they prevent data loss. You can lose a hard drive and everything keeps working. You can do online resilvers; everything works. It's all effective and efficient.
But there are some restrictions to these things. All of these vdev technologies I talked about require that all of your disks be the same size. If you have a mirror setup, all the disks in the mirror have to be the same size so that you can write to the same place on each of the disks and store all the data redundantly. If you're working at enterprise scale, that's fine. You bought a box of hard drives that are all the same make and model; you can swap them out, you can plug in new ones, you're going to be fine. If you're working at a slightly lower scale, this gets a little more expensive. If I have a server in my apartment, which I do, running a RAID-Z vdev, I can't just swap out a single drive and make it bigger. I have to swap every single drive for a larger one, and so upgrades become a more expensive and more time-consuming process.
Expansion can be slow. If you want to expand all those drives, you have to swap out each one for a bigger disk and resilver each one along the way, and that takes time. Or, if you want to do RAID-Z expansion, we have to reflow all of the data in your RAID-Z vdev. All these things take a little while. You can't shrink these devices if you realize later that you need to take some of your drives and use them in a different appliance or a different application. ZFS storage pools mitigate this to some extent, since you can partition up your space however you want, but if you need to move it somewhere else entirely, you can't really do that without permanently giving up some of your parity in that device. And all this combines to create a higher administrative burden. There are rewards, obviously; these things are worth using, but you do have to take into account that things are going to be a little more complicated to administer. The copies property doesn't have a lot of these same trade-offs, but it also doesn't provide guarantees the way mirrors and RAID-Z do. Copies tries to store the multiple copies of your data on different devices, but it's not guaranteed to do so. And it only provides mirror-like parity, where it can store multiple full copies of the data.
So there's some space in here for improvements. And the solution that we came up with, in particular for the problem of all the disks having to be the same size (but it also has other benefits), is AnyRAID. This is a new vdev type in ZFS. It supports all the normal things you want to do with vdevs: you can read from it, you can write to it, you can do resilvers, you can do all the fun stuff that normal vdevs can do. And it provides either mirror-like parity or RAID-Z-like parity, configurable at the time you create the vdev. The big feature it supports is that you can have heterogeneous disks. You can have a mix of 10-terabyte drives and 5-terabyte drives and 1-terabyte drives all working together to provide space, and you can actually take advantage of almost all the space on all of those different devices.
And then there are some other cool new features it offers, like online expansion, contraction, and some more that we're going to talk about. So, this is the syntax to create a pool using an AnyRAID vdev. You can see it says anyraid because it's AnyRAID, then raidz tells you it's the RAID-Z style of parity rather than mirror, the 1 is because there's one parity drive, and then the colon three: there's one parity drive and three data drives in sort of a conceptual space. But then you can see there are a lot more than four disks here; there are actually seven disks that are part of this vdev. So we're going to talk about how that works and how all that space gets used and combined together to provide this larger vdev.
In order to understand that, we're going to take a little bit of a step back and talk about how mirrors and RAID-Z work at a high level. This is just a quick summary; most of you are probably familiar with it. The way they basically work is that you have the vdev layer of ZFS that takes these logical IOs. The vdev presents some block of space to the rest of the system, the system does its allocations in that space, and then when it comes time to do IOs, they come down into the vdev code, and the vdev code splits those logical IOs into separate physical IOs. On the left you can sort of see what mirrors look like: you take one logical IO and turn it into multiple physical IOs, all the same size, all doing the same thing on the different devices. On the right is roughly how RAID-Z works: you have one larger logical IO that gets broken down into small pieces, and each of those pieces goes to a different drive. The one outlined in red represents the parity drive. This is not how the ZIO tree actually works; the ZIO tree is a nightmarish tangle of weeds sometimes, but this is just a simplified depiction to help understand the high-level concept. If we look at this in a little more detail: when you do a mirror IO, you have your logical IO, and then the physical IOs are sent to separate devices, all at the same offset. They go to drive A at offset X, drive B at offset X, drive C at offset X, and they're all stored in the same place on each of the drives. That's why they all have to be the same size. For RAID-Z, this is sort of accurate as long as everything's in a single row; again, we're simplifying to get the idea across. But again, you can see that all of the different parts of the IO end up within a single row of the RAID-Z vdev. They all end up at roughly the same offset as each other.
So what if we could be a little more flexible? What if we didn't have to store all of the data at the same offset on every disk? What if we had some sort of layer of indirection in the middle that would allow us to move the data around and store it in different places on different disks? And could we use that to get better space efficiency? The solution for that is called tiles. In AnyRAID, every disk in the AnyRAID vdev is broken into these tiles. The tiles are the same size on every child, so the children all look the same from that perspective. Then, when it comes time to do an IO and use these actual tiles, we allocate the physical tiles from the child disks and stripe them together into logical tiles. The logical tiles are combined together to form the block of space that represents the top-level vdev. And since we have this model, where the stripes work together to form the bigger vdev but the physical tile location is more flexible, we can now preferentially allocate the physical tiles from the larger child devices. That way we can use those devices more and try to take advantage of all of the space on all of our different devices.
So let's look a little bit at what this tile process looks like. You have your AnyRAID vdev and you do an allocation on it. You go to write some data at, you know, the first offset on the vdev. The AnyRAID code sees that and goes, okay, I need physical tiles to back that logical space, so let me allocate some tiles, enough tiles to represent the stripe, on our largest disks. In this example and the ones going forward, I'm going to be using that AnyRAID raidz1:3 model, which is a four-wide stripe. And so the first allocation it does is on these four disks. (That's not what I want to be looking at right now. Where did it go? Here it is.) So it comes along, it does its allocations from the first four, largest disks. Everything's hunky-dory. When we do reads and writes to that area of space, we write to those four physical disks.
Some new writes come in; now it's time to allocate another tile. This time, rather than just picking the first four disks, we pick the four disks with the most free space. You can see that's one, two, three, and five, since five now has more free tiles than four does. Time goes on, we do more allocations, and this keeps going. You can see that we're slowly using the larger devices faster than we use the smaller devices. As the tiles get allocated, we build this structure up and allocate more and more space from the bigger devices. Then, eventually, as the smaller devices become the ones with the most free space, we start taking advantage of those as well. So you get this gradual buildup of all these tiles filling up all of the different child devices equally, and that way you get to leverage all the space that's available in the pool.
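To make that placement rule concrete, here is a minimal Python sketch of the policy as described above: each logical stripe takes one physical tile from each of the N children that currently have the most free tiles, and no child contributes two tiles to the same stripe. The names (`Child`, `allocate_stripe`) and structure are purely illustrative, not the actual OpenZFS implementation.

```python
# Illustrative sketch of the AnyRAID tile-placement policy described in the
# talk; names and data structures are hypothetical, not the OpenZFS code.

from dataclasses import dataclass

@dataclass
class Child:
    name: str
    total_tiles: int          # how many physical tiles fit on this disk
    used_tiles: int = 0

    @property
    def free_tiles(self) -> int:
        return self.total_tiles - self.used_tiles

def allocate_stripe(children: list[Child], stripe_width: int) -> list[str]:
    """Allocate one logical stripe: one physical tile from each of the
    stripe_width children with the most free tiles. No child contributes two
    tiles to the same stripe, preserving the mirror/RAID-Z guarantee."""
    candidates = sorted(children, key=lambda c: c.free_tiles, reverse=True)
    picked = [c for c in candidates[:stripe_width] if c.free_tiles > 0]
    if len(picked) < stripe_width:
        raise RuntimeError("not enough children with free tiles for a full stripe")
    for c in picked:
        c.used_tiles += 1
    return [c.name for c in picked]

# Example: a four-wide (raidz1:3-style) stripe over heterogeneous disks.
children = [Child("d1", 10), Child("d2", 10), Child("d3", 5),
            Child("d4", 5), Child("d5", 5), Child("d6", 1), Child("d7", 1)]
for _ in range(5):
    print(allocate_stripe(children, stripe_width=4))
```

With these illustrative capacities, the first stripe lands on disks 1 through 4 and the second on disks 1, 2, 5, and 3, matching the walkthrough above where disk 5 overtakes disk 4 in free tiles.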
You'll notice that a stripe of any given color is always spread across different physical devices, and that's so we provide the reliability guarantees that ZFS cares about: the yellow blocks are all on different devices, the blue blocks are all on different devices, the green blocks are all on different devices. There are only so many colors, so some of them look similar, but I hope the idea comes across clearly. This allows us to preserve the requirements of mirrors and RAID-Z: if you lose any one of these devices, you're only going to lose one tile of a given color. In that way, we preserve the same reliability that RAID-Z would give us, where you can afford to lose any one of your devices and we can still reconstruct all of your data using the parity information that's part of RAID-Z. The same would be true of mirrors if this were a four-way mirror setup.
So there's a map. This diagram is the picture version of something we call the tile mapping, which maps the logical tiles that make up the vdev space to the physical tiles that are stored on the disks. The AnyRAID vdev code uses that tile mapping to convert the ZIOs it receives from the logical layout that the rest of the system sees into the custom physical layout that lets us do this space balancing. That mapping grows as we allocate new tiles, and because it grows as we allocate new tiles, that means you can add new disks and that new space will be taken advantage of when that happens.
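A rough sketch of the logical-to-physical translation that this mapping enables, assuming (purely for illustration) a fixed tile size and one (child, physical tile index) entry per stripe member; the real data structures and the handling of data versus parity within the stripe belong to the mirror/RAID-Z code and are not shown here.

```python
# Hypothetical sketch of translating a logical vdev offset through a tile map;
# the real OpenZFS structures differ, this just shows the indirection.

TILE_SIZE = 16 << 30  # 16 GiB, the default floor mentioned later in the talk

# tile_map[logical_tile] = list of (child_disk, physical_tile_index), one entry
# per member of the stripe (four entries for a four-wide stripe).
tile_map = {
    0: [("d1", 0), ("d2", 0), ("d3", 0), ("d4", 0)],
    1: [("d1", 1), ("d2", 1), ("d3", 1), ("d5", 0)],
}

def translate(logical_offset: int) -> list[tuple[str, int]]:
    """Map a logical offset on the AnyRAID vdev to (child, physical offset)
    pairs; the existing mirror/RAID-Z code then decides how data and parity
    are laid out within the stripe."""
    logical_tile = logical_offset // TILE_SIZE
    offset_in_tile = logical_offset % TILE_SIZE
    stripe = tile_map[logical_tile]
    return [(child, ptile * TILE_SIZE + offset_in_tile) for child, ptile in stripe]

print(translate(17 << 30))  # an offset that lands in logical tile 1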
But the mapping I just talked about is very important. This is the kind of metadata in ZFS that we would usually store in the MOS, because it's a core part of how the pool works, how you read the pool, how you access the pool. But because the MOS is stored in the normal vdev space, you can't read it without this mapping. The mapping is necessary in order to read any of the data off of the pool, so we have to load it first, before any other reads can happen from this vdev. So the question is: how do we store this data? Where do we put it in such a way that it can be accessed before everything else?

There's a new on-disk structure called the tile map. The way this works is that we allocate a region after the first two labels in ZFS and before the data segment where we store all the normal data. We just allocate a chunk of space there at creation time and then use it to store this critical metadata.
We store the mapping between the logical and physical tiles there: the actual "tile one is located in these places, tile two is located in these places," and so on. And then there's also some other critical metadata that you use to parse it, process it, and understand it. Because ZFS is a copy-on-write file system, we never want to overwrite anything in place, so we actually store four full copies of this tile map and rotate through them during each TXG, in the same way that we rotate through uberblocks. So even if something goes wrong while you're syncing out a tile map, we can still roll back to the previous transaction group, and that full tile map is there, untouched and ready to be used. And because this is such critical metadata, we actually store it fully replicated on every single disk: every disk has a full copy of the tile map. If you did lose this data structure somehow, you would be unable to read the data on any of the disks in the pool, and you'd have to do some sort of very complicated reconstruction process that I'm not sure how it would work. So we fully replicate it across all the devices for maximum reliability, since it's so important.
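A minimal sketch of that rotation scheme, assuming four fixed slots per disk and a simple round-robin choice by transaction group number (the real on-disk layout and write path are more involved; `write_slot` is a hypothetical I/O helper):

```python
# Hypothetical sketch of rotating through the four tile-map copies each TXG,
# analogous to how ZFS rotates through uberblocks; not the actual implementation.

TILE_MAP_COPIES = 4

def tile_map_slot(txg: int) -> int:
    """Pick which of the four on-disk tile-map copies to overwrite this TXG."""
    return txg % TILE_MAP_COPIES

def write_slot(child: str, slot: int, txg: int, data: bytes) -> None:
    print(f"writing tile map (txg {txg}) to slot {slot} on {child}")

def sync_tile_map(txg: int, children: list[str], serialized_map: bytes) -> None:
    slot = tile_map_slot(txg)
    for child in children:
        # Every child disk gets a full copy, so losing any subset of disks
        # (up to the parity level) never loses the mapping itself.
        write_slot(child, slot, txg, serialized_map)

# If txg N's write is torn, the slot written in txg N-1 still holds a complete
# map, so import can roll back to the previous TXG, as described above.
sync_tile_map(100, ["d1", "d2", "d3"], b"...")
```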
As for the size of the tile map on any given disk: each copy of the tile map is 64 megabytes, and there are four copies of it, so they combine to about 256 megabytes of space. On modern hard drives that's not that much, so it felt like an acceptable amount of space to use, and it guarantees storage for the full size of the mapping that AnyRAID supports. AnyRAID supports up to 256 drives per top-level AnyRAID vdev, and each of the drives can have up to 65,536 tiles on it. That comes with some limitations that we'll talk about a little bit later, but if you do the math, it all combines together and it all fits in exactly 64 megabytes.
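Working through the numbers just quoted: 256 drives times 65,536 tiles per drive is 16,777,216 possible tiles, and 64 MiB divided by that count is exactly 4 bytes per tile, which is presumably why it "fits exactly"; the 4-bytes-per-entry figure is inferred from these numbers rather than stated directly.

```python
# Arithmetic check of the tile-map sizing figures from the talk.
MAX_DRIVES = 256
MAX_TILES_PER_DRIVE = 65_536
MAP_COPY_BYTES = 64 * 1024 * 1024        # 64 MiB per copy
COPIES = 4

total_tiles = MAX_DRIVES * MAX_TILES_PER_DRIVE
print(total_tiles)                                      # 16,777,216 possible tiles
print(MAP_COPY_BYTES / total_tiles)                     # 4.0 bytes per tile (inferred)
print(COPIES * MAP_COPY_BYTES // (1024 * 1024), "MiB per disk in total")  # 256
```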
I've talked about the tiles and how they're the same size on all the disks. The specific size of those tiles is tunable. There are a bunch of options, but the default is that it's either a sixty-fourth of the size of the smallest disk that's part of the vdev at creation time, or 16 GB, whichever is bigger. We want to store entire metaslabs within a single tile so that we don't have IOs going across tiles and then having to split those and create more complicated ZIO trees. So 16 GB felt like a reasonable floor, and 1 terabyte is a pretty reasonable size for hard drives in the modern day. What that does mean, in conjunction with the 65,536-tile limit I mentioned, is that if you have a sixty-fourth of the smallest disk as the tile size, then the largest disk whose space you can fully use is 1,024 times the size of the smallest disk that was part of the vdev at creation time. So if you create a vdev with, say, some 5 TB drives, a 10-terabyte drive, and a 1 TB drive, that AnyRAID vdev, for now at least (we can talk about future mitigations), can't support a physical drive larger than a petabyte, which will probably be fine for a while.

In memory, we store this tile mapping in an AVL tree, a nice, easy, reliable data structure, and it doesn't get that big, so we don't need anything more complicated than that. The AnyRAID vdev layer uses that AVL tree to convert the logical IOs that come in, figure out where the physical tiles that back that data are stored on disk, and then dispatch IOs to the existing mirror and RAID-Z code. So we don't have to reimplement any of the parity stuff. We don't have to do any of the healing logic, or reconstruction logic, or failing over to different devices; all of that we get just by calling into that code directly.
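And a quick numeric check of the tile-size rule and the resulting per-disk ceiling described above, using binary units for a clean example: with a tile of one sixty-fourth of the smallest disk (floored at 16 GiB) and at most 65,536 tiles per child, the largest fully usable disk comes out to 1,024 times the smallest disk at creation time, roughly a petabyte when the smallest disk is a 1 TB-class drive.

```python
# Check of the default tile size and the per-disk ceiling described in the talk.
GIB = 1 << 30
TIB = 1 << 40
MAX_TILES_PER_DRIVE = 65_536

def default_tile_size(smallest_disk_bytes: int) -> int:
    # Default: 1/64 of the smallest disk at creation time, but never below 16 GiB.
    return max(smallest_disk_bytes // 64, 16 * GIB)

smallest = 1 * TIB                              # e.g. a 1 TB-class drive
tile = default_tile_size(smallest)
ceiling = tile * MAX_TILES_PER_DRIVE            # most space usable on one child
print(tile // GIB, "GiB per tile")              # 16
print(ceiling // TIB, "TiB usable per disk")    # 1024, i.e. about a petabyte
print(ceiling // smallest, "x the smallest disk")  # 1024
```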
What can this feature do, aside from storing your data? Well, it can do all the normal stuff that you want vdevs in ZFS to be able to handle. It supports resilvers, just like you'd expect. We can do scrubs, we can do checkpoints, all the normal things you'd expect ZFS to be able to do. For mirror parity, we support linear resilvering the same way mirrors do. We don't for RAID-Z parity, because the way RAID-Z stores its data is not really amenable to linear resilvering, but that's true of actual RAID-Z as well. For checkpoints, you can do checkpoint rollbacks just like usual. We don't store a separate full copy of the tile map, and the tile map isn't in the MOS, so we have to have some special handling for checkpoints in the tile mapping. The way it works is that it rolls back the space up to the highest allocated logical tile, which we save when the checkpoint is taken. That works fine for the current design. There may be augmentations to that in the future, and there's plenty of space in the metadata to store more interesting information about that.
There are exactly two caveats that I'm aware of, which are trim and initialize. The way trim and initialize work is a little bit interesting: they take the logical space within the vdev and convert it to a physical offset for each of the children, which is fine and good, except that for the space that isn't mapped yet, that isn't backed by physical tiles, there is no corresponding physical location for that logical space. And it would be very silly if, any time you did a trim or initialize, we immediately allocated all the tiles, pinned all the space, and made it impossible for you to take advantage of new disks in the future. So trim and initialize will just stop at the point where you've stopped mapping from logical tiles to physical tiles; they won't operate on any physical space that hasn't been mapped yet. In practice that isn't that huge of a concern. Trim is mostly used for secure erase or other things like that, and there you mostly care about doing it to things you've already allocated as part of the pool. Any space you've already written as part of the pool, trim will happily go and trim. Anything you haven't written yet, trim won't deal with, but you can hopefully trim that before you add the disk to the AnyRAID vdev. Initialize was originally intended to work more with thin-provisioned devices and cloud devices and things like that. When you're creating those devices, you get to specify what size they are, and so the advantages of AnyRAID, where it can support these different-sized disks, are not as essential for that use case. So it felt like an acceptable trade-off to not have initialize work on the entire physical vdev. These things could be augmented in the future; they're not by any means permanent problems, but for the initial implementation, these are the caveats.
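As a small illustration of that behavior (purely a sketch; the real code operates on ZIOs, not byte ranges, and this assumes the mapped logical tiles are contiguous from the start of the vdev): the trim or initialize request is simply clamped to the portion of the logical space already backed by physical tiles.

```python
# Illustrative: trim/initialize only touch logical space that is already
# backed by physical tiles; nothing is allocated just to satisfy them.

def clamp_to_mapped(start: int, length: int,
                    mapped_logical_tiles: int, tile_size: int) -> tuple[int, int]:
    mapped_end = mapped_logical_tiles * tile_size
    end = min(start + length, mapped_end)
    return (start, max(0, end - start))

# With 3 logical tiles mapped at 16 GiB each, a trim of a whole 1 TiB of
# logical space only covers the first 48 GiB; the rest stays untouched.
print(clamp_to_mapped(0, 1 << 40, mapped_logical_tiles=3, tile_size=16 << 30))
```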
Let's talk about some of the other new things it can do that other vdevs can't do, or can't do in the same way. One of those things is expansion. With this architecture, it's very easy to add new drives to the vdev and have that space be available. You're not changing the parity or the width of the way we actually write out the data, the way RAID-Z expansion works; we simply add a bunch of new physical tiles that can be used to back more logical tiles, and that makes more space available in the vdev. As an example, we're looking at basically our same vdev from earlier, but it's now getting very full. We're running very low on space, and we want to add another physical device to back the pool so that we can have more storage. You can see we just slap a new drive onto the side. We split it into tiles the same size as all the existing tiles on the existing children, and now all of those tiles are available to allocate from. That space is immediately available as part of the AnyRAID vdev. But there is a caveat here, which is that because this pool got so full before you added the new device, even with all these new tiles, we can't actually store more than a couple of new logical tiles in this whole thing without doubling up and using the same drive multiple times, which would break our reliability guarantees, which we don't want.
So we need some way, or it would be really nice if we had some way, to move tiles from the existing disks onto this new disk that has all this free space. By doing that motion, we could free up space on all the different devices and make it easier to allocate stripes across all the different disks. That capability is rebalance. This lets us move tiles from very full disks to very empty disks. It's helpful, for example, in this case, when you're attaching a new large disk, but it can also be useful in other cases that we'll talk about, and it helps us always take advantage of all the space that all the disks have available. You can see we can only allocate about two more tiles from this: in this four-wide model, we allocate one from our biggest disk and one from each of the first three disks, then one from the big disk and one from each of the next three disks, and then we're done. But if we rebalance, we can pull tiles out of the old disks, move them to the new disk, and now there's space available on each of these different child disks, and we can start to allocate more stripes across all of them. Suddenly all of your space is available for use again. Because of the way these tiles work, it's easy to move whole tiles at a time; you're just moving the whole thing from one spot to another. And because this happens at the vdev layer, and we have the mapping that keeps track of where all these different tiles are stored, we don't have to do block pointer rewrite, and we don't have to keep the sort of complicated indirection tables that device removal has to keep to do something similar to this. We have some metadata to keep track of, but because we're operating on these big contiguous chunks, it always stays a pretty manageable size and is easy to track and store efficiently.
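A greedy sketch of what rebalance does at a high level: repeatedly take a tile from the fullest child and move it to the emptiest one, as long as the destination doesn't already hold a tile from the same stripe. This is a simplification of the description above, not the in-progress implementation.

```python
# Hypothetical greedy rebalance: move whole tiles from the fullest children to
# the emptiest ones, never co-locating two tiles of the same stripe.

def rebalance_step(tiles_by_child: dict[str, set[int]]) -> bool:
    """tiles_by_child maps a child disk to the set of stripe ids whose tiles it
    holds. Returns True if a tile was moved, False if already balanced."""
    src = max(tiles_by_child, key=lambda c: len(tiles_by_child[c]))
    dst = min(tiles_by_child, key=lambda c: len(tiles_by_child[c]))
    if len(tiles_by_child[src]) - len(tiles_by_child[dst]) <= 1:
        return False  # nothing useful to move
    movable = tiles_by_child[src] - tiles_by_child[dst]  # avoid same-stripe overlap
    if not movable:
        return False
    stripe = movable.pop()
    tiles_by_child[src].remove(stripe)
    tiles_by_child[dst].add(stripe)
    return True

# Three old, fairly full disks plus a freshly added empty one:
layout = {"d1": {1, 2, 3, 4}, "d2": {1, 2, 3, 4}, "d3": {1, 2, 3, 4}, "new": set()}
while rebalance_step(layout):
    pass
print(layout)   # tiles end up spread evenly, freeing room for new stripes
```

In real life each move is a large sequential copy of a whole tile, which is what makes the operation efficient.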
So, you have this vdev, it's doing great, it's doing good work for you, you love it, but it's time for one of your disks to go away. Maybe that disk is getting old and you're worried about it failing, and you don't really need all of the space that your vdev has available, so you want to get rid of it before it fails and you have to replace it with a new disk. Or maybe that smaller device is an SSD or something; you cobbled this thing together out of spare parts, and now you want to use that SSD for a cache or for a special device or some other purpose. You want to take that disk out and replace it with something else. You don't necessarily need to put a new disk in its place; you don't need that extra space. So this is another new feature in AnyRAID: we can do contraction, which lets you remove individual child disks from the AnyRAID vdev. Once that removal happens, there's no extra indirection metadata; it's all handled by the tile mapping that's part of the AnyRAID vdev. During the removal process, there will be some extra state stored to keep track of the progress of the tile motion and all that, but that's a relatively small amount and it's pretty ephemeral. As I mentioned before, we get to do nice, efficient, sequential copying of all this data, because it's all stored in these big contiguous chunks and we're moving it all to the same place together. So we get to do big read IOs and big write IOs, and everything proceeds nice and quickly. You can see we have the device we're removing there, with its four allocated tiles, and we take those and put them on different disks that don't already hold a tile from the same stripe. That way we ensure that we can remove this device, you keep the same parity you were working with, but you've shrunk your vdev. You can now use that disk for something else, or just get rid of it if it was getting old.
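And a similar sketch for contraction: every tile on the outgoing disk is copied to some other child that has a free tile and doesn't already hold a tile from the same stripe, after which the disk can be detached. Again, this is an illustration of the idea under those assumptions, not the implementation.

```python
# Hypothetical evacuation step for AnyRAID contraction: relocate each tile on
# the disk being removed to another child without breaking stripe placement.

def evacuate(child_to_remove: str,
             tiles_by_child: dict[str, set[int]],
             capacity: dict[str, int]) -> None:
    for stripe in sorted(tiles_by_child[child_to_remove]):
        # Destination must have room and must not already hold this stripe.
        candidates = [c for c in tiles_by_child
                      if c != child_to_remove
                      and stripe not in tiles_by_child[c]
                      and len(tiles_by_child[c]) < capacity[c]]
        if not candidates:
            raise RuntimeError("not enough free space to contract")
        dst = max(candidates, key=lambda c: capacity[c] - len(tiles_by_child[c]))
        tiles_by_child[dst].add(stripe)      # a big sequential copy in real life
    del tiles_by_child[child_to_remove]

layout = {"d1": {1, 2}, "d2": {1, 3}, "d3": {2, 3}, "d4": {1, 2, 3}}
capacity = {"d1": 4, "d2": 4, "d3": 4, "d4": 4}
evacuate("d4", layout, capacity)
print(layout)   # stripes 1, 2, 3 each still live on three distinct disks
```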
>> Yeah.
There are some other new things that are just part of adding any new feature to ZFS. There's new zdb printing functionality that lets you print out the AnyRAID tile map and display it. It does it in this nice tabular format; I spent a little while playing with Unicode table-drawing characters to make it look nice and pretty. There's ztest and ZFS test suite support, so ztest can now create AnyRAID vdevs and do all sorts of weird operations to them as part of the testing process, and there's a bunch of new stuff in the ZFS test suite to test it. And then there are some new vdev properties for the AnyRAID vdevs themselves: the tile capacity, which is the number of tiles you could allocate on the AnyRAID vdev or one of its children; the number of currently allocated tiles in the whole AnyRAID vdev, or again in each of its individual child devices; and the tile size is also accessible as a property. All these things can be used programmatically if you need them for whatever purpose, or if you're just curious to see what things look like without having to parse the big zdb output.
So where is the project now? The mirror parity implementation and all the core infrastructure is currently up for review on the OpenZFS GitHub. The RAID-Z parity implementation, adding the RAID-Z-style parity and the couple of extra features that go along with it, is done internally. I have it working; I can create pools with it and everything. But I'm waiting for the initial stuff to land before we open a review with the new stuff. Rebalance is what I'm working on right now. I have a bunch of the user-facing infrastructure in place, and now it's time to actually do the sync tasks and the progress tracking and the data motion and all that. Once rebalance is done, it'll be time to do device contraction, because that gets to leverage the data motion that rebalance uses. Expansion is actually part of the mirror parity implementation that already works, but you can't do the rebalance yet, so those are in different phases for implementation reasons.
And then, what's next? The first thing is obviously finishing the review for the mirror parity implementation. Some folks have already taken a look at that, and I am extremely grateful to those people, but I would love for more people to take a look and get eyes on it. I want to finish the ongoing tasks, the expansion and contraction and rebalance and all that, get those ready internally, and then post upstream reviews for those once the original review is done. And then there's a bunch of ideas for future development. This initial stuff is what we're working on at Klara; the future development stuff, maybe we'll work on it, but other people are also absolutely able to take it on. I have a lot of ideas.
We could remove some of the limitations in the current design: the 256-drive limit and the 1,024x size limit. There are a couple of different ways to go about this. We might need to increase the mapping size, or handle running into that limit. I think we could do something where, instead of having this sort of fixed rectangle of space that you're allowed to use, with this many drives and this much size per drive, there would instead be a limit on the asize of the vdev, where you could use any number of drives and any size of drives, just only so many tiles in total before we fill up the mapping. I haven't played with implementing it, but I think something like that would work, so if people want to play around with that idea, they absolutely can. Right now the tile map is stored uncompressed; we don't compress it to save space, but we could. And if it compressed well enough, then one, that might let us increase some of those limits, and two, we might be able to store it in different locations. Right now we have this dedicated chunk of space at the start of each disk, but taking advantage of the new large label format that Allan's going to be talking about later, maybe we could store it in some of that space, or in some of the reserved space that's part of it. That would let us do some more interesting things and be a little more efficient. Right now you can't do normal RAID-Z expansion if you're using AnyRAID with RAID-Z-style parity, because it was complicated and I didn't feel like dealing with it. But that is certainly an augmentation that could be made in the future, and it would let you increase the width of your stripes going forward.
Right now, when we do contraction, we can only move whole tiles around. Even if you've freed almost everything, or literally everything, from a tile, the tile never gets unpinned by the AnyRAID code. One thing we could do going forward is use something like device removal's indirection mappings to pull the little pieces out of the tiles that are pinning them and move those into other logical places in the vdev. That would let us do much better contraction. It would let us shrink devices way down. It would let us, instead of doing full rebalances where you move all the data to a new disk, just shrink the number of tiles that are actually effectively in use, and in that way take advantage of the space more efficiently.
There's a bunch of performance work that could be done on the initial version. We're very focused on using all the available space that's in the vdev, and that means you don't always use all of the child disks. If you think back to the diagram from before, when you only have one or two logical stripes allocated, you don't actually have something on all of the different disks, and so you aren't necessarily taking advantage of the bandwidth of all the disks you have in your vdev. There are ways to improve that even in the initial version, and I have some stuff that I'm playing with for that in the current pull request. But one way to improve it going forward: when we're doing allocations in ZFS, we have a rotor that picks between the different top-level vdevs to decide where to put the data. We could have a rotor within the vdev layer to try to pick between different parts of the vdev that are backed by these different physical tiles, and use that to do a better job of distributing our IOs across the tiles and take better advantage of all the disk bandwidth you have available. There are limits to this: at some point, if one disk is much bigger than the others, you have to do a lot more writes to fill up that disk, and so that disk will eventually be the bottleneck. But there are improvements that could be made there.
Right now we always write out the full copy of the tile map every TXG, just for simplicity, but we could skip writing out blocks that haven't changed, or even skip writing out the whole mapping if it hasn't changed. That would reduce sync times a little bit. It doesn't take that long for normal TXG syncs, but when you're doing something like an export, or forcing a bunch of transaction groups at the same time, this can lengthen your TXGs a little and make those things a little slower. So there are some changes that could be made there to skip some of that work; they just aren't in the initial implementation, for simplicity.
Another cool idea is doing vdev conversion from an existing vdev. If you think about it, you have some RAID-Z vdev right now with five disks in it; that's a RAID-Z vdev, but it's also kind of an AnyRAID vdev where all the tiles are just mapped in exactly the simplest possible way. So it would be possible, at least in theory, to convert that into an AnyRAID vdev with this very basic tile mapping, and then you could stick new devices onto it and take advantage of that space. We would have to detect the furthest point that's actually mapped in your RAID-Z vdev, so that we have free tiles at the end to do the balancing we want, but something like that would definitely be possible. And if we had our tile map able to fit in the new large label format or something like that, then we could do it without having to use that space at the start of every disk, or we could store a compact enough version that it would fit in this other reserved space, and we could take advantage of that and do that vdev conversion. So those are just some of the ideas that I had that I would be very happy for people to take a look at, think about, and play with. And if you have more ideas, come talk to me afterwards; I'd love to hear them.
That's the end of the content portion of the talk. I want to do some quick thanks and acknowledgements. The whole team at Klara: I've been at Klara for about a year now, and it's been a rad place to work. Allan Jude came up with most of the initial design of AnyRAID and then handed it off to me to do the detail work on. Igor did a bunch of the userland and test work as part of the initial patch, and he was a fantastic partner to work with on all of this. Rob and Mattesh, for doing the initial internal reviews, being great sounding boards, and just generally being great to work with. Eshtek is the sponsor of the AnyRAID work; they're the ones who are paying us to implement this cool new feature and make it available to everyone, so I really appreciate that they're giving us the opportunity to do this. The OpenZFS community: I've been in the community for over a decade now, and it's been an awesome place to work. I get to solve cool, interesting problems with cool, interesting people, and so I'm really grateful to all of you and everybody who's out there watching and is part of the community. And the people who've already taken a look at the code review, at least a little bit: Brian, Alexander, Tony, and others who've started to take a look at this. I appreciate your eyes on it, and I would like more eyes on it, so anybody who's interested, please take a look and let me know what you think. I'm sure there are many changes to be made.
Quick plug: Klara Systems. We do software development, solutions design, performance analysis, and data recovery as a service for ZFS and FreeBSD, if you need any of these things. There's contact information here; you can talk to Allan. I think we're cool people to work with, and it's been a fun place to work. So, quick plug. And then, does anybody have any questions for me? I saw Mark's hand first.
>> Yeah, that's very cool stuff. It strikes me that there are echoes of what we did with the dRAID work in this design. And it had a limitation which you didn't mention, I think, as one of your limitations here, which is that in order to do this you do have to start with enough devices in your configuration to be able to have the replication level you're specifying. So if you try to create one of these, you know, your raidz 1:3 or 3:1 or whatever it was, with two drives as your set of drives, that's going to fail, obviously. That said, it seems really cool that you could just sort of throw together a set of drives like this. I do have a couple of questions. One is, you know, what happens, and obviously you kind of touched on this, when you get to the point where you're running out of space. Obviously there's another limit there, which is that if you can't find enough tiles across these disparate disks to create a given instance of a stripe, you have to fail at that point. What does that look like? You can get to a point where you have space left across your drives, but you can't actually create a new tile stripe.
>> Right, so the way that works right now is that when you create, or actually every time we open the vdev, it calculates how many tiles it could possibly fit. It goes and runs the allocation algorithm on repeat and says, okay, this is how many tiles I could fit, and then it says, okay, so this is the asize that I have available. That's just how much space the vdev has available to present. And right now there's an assertion in the code that says we must be able to allocate a new tile if you're doing a write at this offset.
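In other words, the presentable size is precomputed by simulating the allocator until it can no longer place a full stripe. A rough sketch of that idea, reusing the greedy "most free tiles" policy from earlier; illustrative only, not the real open-time code:

```python
# Hypothetical capacity computation: run the tile-allocation policy to
# exhaustion at open time to learn how many full stripes (and thus how much
# asize) the vdev can present.

def count_allocatable_stripes(tile_capacity: list[int], stripe_width: int) -> int:
    free = list(tile_capacity)
    stripes = 0
    while True:
        # Pick the stripe_width children with the most free tiles.
        picks = sorted(range(len(free)), key=lambda i: free[i], reverse=True)[:stripe_width]
        if any(free[i] == 0 for i in picks):
            return stripes           # can't place a full stripe any more
        for i in picks:
            free[i] -= 1
        stripes += 1

# Seven heterogeneous disks, four-wide stripes:
print(count_allocatable_stripes([640, 640, 320, 320, 320, 64, 64], stripe_width=4))
```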
>> Okay. That makes sense.
>> Yeah. One of the ideas I had for how to get around the limits would be to do something about failure in that case, where you'd have to go back and suspend the pool if we run out of mapping space or something like that. But you're way down in the code when you're failing that IO, in a way that's really unpleasant.
>> Well, it also makes sense, I think, to pre-do the mapping and say this is the capacity, so you know what your capacity is in advance. I think that's a very useful feature for a customer anyway.
Yes, so one idea I thought of, and this is the last part of this: you were mentioning the limits imposed by the amount of space you're reserving for the tile mapping that has to go in after the label. One thought I had that you might think about is, could you potentially do something like create top-level tile groups, where you say, all right, this set of drives belongs to one tile group and another set of drives belongs to another tile group, and those would actually be two separate allocation spaces with their own individual tile maps? That way you could have an unlimited number of drives by splitting them into multiple groups. That is basically just creating multiple AnyRAID vdevs, right, at a conceptual level.
>> So AnyRAID becomes a top-level vdev. Yes.
>> Yeah. So you can create a pool with three AnyRAID vdevs.
>> Okay. That actually just falls out of your design, then.
>> Yeah, yeah. Because the vdevs are independent of each other. And there are tests in the test suite that mix mirror-parity ones and RAID-Z-parity ones, as long as the replication levels all match up, or you use -f.
>> Do you have a question?
>> Yeah. So from an import/export time frame, is it going to add 10%?
>> Imports are about the same speed. We read a little bit of this extra metadata, but in practice it doesn't take very long. The 64 megabytes reserved for the tile mapping are the theoretical maximum; in practice the space used is usually like 8 to 10 kilobytes, until you start to get a lot of drives or really big drives in it. Export times go up a little bit. That's one of the things I talked about in the performance work to do, but it adds like a second or something to an export. The time is relatively constant with the capacity of the pool; it just has to issue a bunch of IOs to all these different disks.
>> Right. It kind of goes back to what Garrett was talking about on previous days: hey, we're talking 100-, 200-, 400-drive JBODs. There's a break point where it might not make sense to do this. That's fantastic. But that quasi-failover mechanism might now be 20 minutes instead of two, you know, or whatever that number is.
>> Yeah, at large enough scales, there will definitely be some... the extra work that goes into syncing this metadata will take a little bit of time. There are a bunch of tasks that can be done to make it more efficient; the initial version is very naive.
>> But to be fair, anybody that's doing those big ones is going to be using consistent drives.
>> Yeah.
>> So, I see the feature sets that I want to use, but I'm not the target audience right now.
>> Yeah. But, you know, as you said, if you have 100 drives in a thing, it'll work. You may not be exactly the target use case, and so the initial implementation is not optimized for that, but there's definitely work to be done to make it better suited for that in the future.
So, the question was whether we take into account the health of the drive when deciding where to do our tile allocations. It doesn't in the current version. That's certainly something that could be done in the future, although to some extent I worry that if we're trying to make that decision when we're doing these tile allocations, and the drive is only temporarily having a hiccup, or it's going to get replaced with something else... where we do our allocations is sort of a long-term decision, short of doing rebalances that can move tiles around. So making long-term decisions based on potentially short-term health behaviors is not necessarily always going to be the right trade-off, but it's something we could look into.
>> [inaudible]
>> Sorry, I think I need you to repeat the question a little bit.
>> Okay, sounds good. How do you deal with inconsistent tile maps? Like, if you were updating all of the drives, half of them got updated, you lost power, and then only half of them have the new tile map on them.
>> So the way it works right now is similar to the way the labels work. If we find one copy where all the checksums match, everything verifies out for a given transaction group, and that's the highest-numbered transaction group we have among our copies, then that's the one we go forward with. As long as any of the drives managed to do a full sync, we'll use that newest updated version, and then when we do another transaction group sync, we'll write it back out to all the disks again. We read the tile mapping from the first child first, but we'll try a bunch of different ones if we don't find one for the transaction group we're trying to import, or if it fails its checksums. And if all of them fail, if none of them managed to fully sync it out, then, in the same way that you'd have to roll back to a previous uberblock, we'll roll back to the previous TXG as part of the import process.
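A sketch of that selection rule on import, assuming each copy records the TXG it was written in plus a checksum: take the highest-TXG copy that verifies, across all children and slots, and fall back to an older generation if none of the newest copies survived intact.

```python
# Hypothetical import-time selection among tile-map copies: prefer the highest
# TXG whose copy passes its checksum, much like uberblock selection.

import hashlib
from dataclasses import dataclass

@dataclass
class MapCopy:
    txg: int
    payload: bytes
    checksum: bytes

    def valid(self) -> bool:
        return hashlib.sha256(self.payload).digest() == self.checksum

def pick_tile_map(copies: list[MapCopy]) -> MapCopy:
    for copy in sorted(copies, key=lambda c: c.txg, reverse=True):
        if copy.valid():
            return copy       # newest fully synced, verifiable copy wins
    raise RuntimeError("no valid tile map copy found on any child")

good = MapCopy(99, b"old map", hashlib.sha256(b"old map").digest())
torn = MapCopy(100, b"half wr", hashlib.sha256(b"new map").digest())  # torn write
print(pick_tile_map([torn, good]).txg)   # 99: roll back to the previous TXG
```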
>> Okay. And a second question: labels are at the beginning and the end of the disk to protect, I guess, from some crazy bug where something overwrites the first part of a disk. Are tile maps similarly placed?
>> Right now they're only stored at the front, but it's not a bad idea. We could do two of the copies at the front and two of the copies at the back, or something, to keep those spatially separated and a little bit more protected. That's not a bad idea, and now would be the time to do it, before the initial version gets integrated.
>> Yeah. Mhm.
>> I really like this. Let's say you started off with RAID-Z1, actually, and then you decided that you wanted more protection. Could you add a big drive and start migrating to RAID-Z2?
>> So, this is an interesting question that I have thought about a little bit, and the answer is: it's very complicated.
>> Yeah, as I started thinking about it... yeah. The other use case is going backwards. So you've got a large array and you're running, say, RAID-Z2 or RAID-Z3. You're out of space, you have no budget. Could you start to downgrade to a lower RAID-Z level to free up space?
>> So again, part of the reason this is complicated is because of the way that RAID-Z works. As I had on the slide before, sort of demonstrating how it works, the way RAID-Z does its thing is that it actually allocates a single, slightly larger contiguous chunk for the IO. If you have, say, a one-megabyte write, it actually gets turned into a one-megabyte-and-change write to store the extra parity data. So if you want to reduce the parity, you actually have to go through and reflow, in the logical space, all of the IOs that you've done, because they're actually a different size now. This is one of the things that RAID-Z expansion had to deal with, where if you increase the width you have to move all the data around. And, as I think Matt explained when he did the talk on the RAID-Z expansion feature, parity changing is a whole other level of complicated.
>> Yeah.
>> Because I think you also have to rewrite the parities themselves so that they work with the math differently. Or maybe the first parity is the same, but...
>> Yeah, it's better to do a send and receive at that point into a new...
>> Yeah, it would be a lot simpler to send and receive the data. I won't say it's impossible, because, you know, the computer's Turing complete; it can do anything we can tell it to do. But it would be a serious undertaking.
>> Yeah. Okay.
>> This is excellent work. Very impressed. I wish we had designed this tiles layer of indirection into ZFS from day one; I think it would make a lot of the other features flow more naturally. One minor point: you talked about having each metaslab fit within a single tile for simplicity, which is great, but I think it also needs to be that way for correctness, because you can't have a single block span multiple tiles; that might break the guarantees that we expect about, for RAID-Z, each part of the block being on different disks, right?
>> That makes sense. Yeah, it is certainly the case that there are multiple reasons why it's important that metaslabs all fit within a given tile. When I was first designing it, I was thinking simplicity, and then I thought about it more and was like, oh no, that would be a real problem. So yeah, it's important that everything fit within a single tile.
Anything else?
All right, cool. Thank you everyone.
[applause]