LongCut logo

Insane Shadow Data Technique in C

By Tsoding

Summary

Topics Covered

  • C Hash Tables Hide Metadata Before Array
  • Store Dynamic Array Header Before Data
  • Macros Enable Generic Dynamic Arrays
  • Auto-Init Null Pointers in Array Push
  • Realloc Doubles Capacity on Overflow

Full Transcript

One of the main criticisms C programming language usually gets is that it doesn't have a built-in hashts, which is true, but there's plenty of third-party

hashts. One of the notable ones is

hashts. One of the notable ones is probably STBDS by Shan Barrett. Uh let's

take a look at it. Let's download it into our folder and create main.c file.

Let's include stdio as usual and create an entry point like so. This is a classical single header libraries which you have to just include and also define

STB implementation macro. So it doesn't act just like a header but also includes all of the necessary implementations.

The way you use the hash table, you have to create a structure that contains only two fields. First one is the key and the

two fields. First one is the key and the second one is the value. And a hash table is basically an array of these items. Then to add something to this

table, I think you have to use something like sh put. Yeah, there we go. So you

take this function, you put the table in here, you specify the key, something like fu, and some sort of a value. Let's

put 69 in here. And let's actually put a bunch of the values in here. Let's have

bar for 20 and buzz 1337. After that,

you can iterate this hash table. There

is a function if I remember correctly say chilen which actually gives you the size of the table. You can now iterate starting from zero up until the size of

the table plus i and then you can literally print the key and the value like so table I. So let's build the program c main c and let's instantly run

this program. As you can see we iterated

this program. As you can see we iterated the entire hash table. Then there is another operation that allows you to get an index within the hash table by key.

So you specify the table and you say give me the index to bar and then you can just use that index to access that

item and for example modify it like so.

This is a legit implementation of a hash table for C and it's very easy to use.

As far as I know, Shin Barrett himself doesn't really like this implementation because he had to make some compromises to make this interface usable. But for

simple cases, I think this is a passible hasht implementation. So, as far as I

hasht implementation. So, as far as I know, all of the operations are all one amortized like a hash table. If you look closer, you may notice something weird.

How does the hash table know its size?

How where is all of the metadata for the hash table? This is literally just a

hash table? This is literally just a pointer to an item. Where does it store all of that? What is this magic? Well,

let's take a look at how, for example, shenan is implemented. So, let's go ahead and actually feed the stbds into tags. So, it will create a tag file. So,

tags. So, it will create a tag file. So,

we can actually index and jump to the definitions and stuff like that without any lsp. Uh, you don't need an lsp to

any lsp. Uh, you don't need an lsp to jump to the definitions. Just saying.

So, and let's actually go there. SH is

just a alias for stbds shlen. So

basically it has some sort of like a prefix system because C doesn't have name spaces. Totally understandable.

name spaces. Totally understandable.

This thing is an alias for stbdshm.

Okay. And inside of it it takes the table and it's doing something really weird. Right. So if the table is not

weird. Right. So if the table is not null what it does it calls to stbds header. It casts the table to a type

header. It casts the table to a type stbds array header. And here is all of the meta information. What STBDS is

doing, it is using an ancient trick of actually storing all of the metadata for the array before the start of the array.

Since C is a lower level language, you can do tricks like that. Let's actually

try to implement something with that technique. Since I'm a dynamic array

technique. Since I'm a dynamic array guy, let's actually try to implement dynamic arrays. How do I usually

dynamic arrays. How do I usually implement dynamic arrays? Well, I

usually have something like this. For

each dynamic array, I create a separate structure. And within that separate

structure. And within that separate structure, I have items. I have count.

How many elements there are in the array and the capacity, how many elements already allocated. I may allocate more

already allocated. I may allocate more than I actually use. So I can reduce the amount of reallocations. This is totally normal. So let me go ahead and rename a

normal. So let me go ahead and rename a main to two and create a separate main in here. But all of this stuff is a

in here. But all of this stuff is a metadata. We have to actually change our

metadata. We have to actually change our entire approach. Let's actually remove

entire approach. Let's actually remove the items and keep only the metadata for the dynamic array. And let's call this header. Our actual dynamic array is

header. Our actual dynamic array is going to be stored somewhere here. So

what we need to do, we need to now pre-allocate some initial capacity for the array. I'm going to call maloc. And

the array. I'm going to call maloc. And

what I need to do, I need to allocate size of numbers the size of a single element. So I'm taking this variable

element. So I'm taking this variable dreferencing and taking the size of it, which means that I'm taking the size of a single element in here. and I need to multiply it by some initial capacity that I want to pre-allocate. So let's

maybe put some sort of a variable in here. Let's call it init capacity. So

here. Let's call it init capacity. So

this is our initial capacity and let's say it's going to be 256. And I multiply it by the initial capacity. So then I need to also reserve some room for the

header. So I'm adding size of the

header. So I'm adding size of the header. And now I've got a pointer in

header. And now I've got a pointer in the memory. I have the header and after

the memory. I have the header and after the header I have the numbers. And

currently the data is pointing at the header. So in fact since it's already

header. So in fact since it's already pointing at the header, maybe we should call this variable header and have it as a type as a pointer to the header. I

think it's going to be much more convenient. Now I need to assign numbers

convenient. Now I need to assign numbers to a pointer after the header. Well,

what I can do essentially I can maybe take a header and add size of the header. To do that, I'll also have to

header. To do that, I'll also have to cast pointer to the characters. But the

way pointers works in C is that I can actually do header + one. You see when you add one to a pointer, it actually

offsets it not by a single bite but by the size of the type it is pointing at.

This is why in C there's not that much difference between pointers and arrays.

Pointers can act like arrays. And by

doing header + one, I instantly find the base of my numbers array, which means I can now assign this expression to the numbers. And obviously, this is not

numbers. And obviously, this is not going to compile because we yeah, we also need to include some of the headers in here. First of all, we're using some

in here. First of all, we're using some functions without including headers. But

second of all, this pointer is not compatible with a pointer to an integer.

So we'll have to explicitly cast it to a pointer to an integer. And now it works.

Furthermore, I think we need to initialize the headers. So, as of right now, we're not using any elements in the array. So, let's assign count to zero.

array. So, let's assign count to zero.

And in terms of capacity, let's assign capacity to the initial capacity. Maybe

it would be nice to factor out this entire code into some sort of like a separate function. Let's actually call

separate function. Let's actually call it array init where we're going to also accept initial capacity. And we're going to return pointed to an integer. And

let's move all of that stuff to here. So

maybe we don't even need this intermediate variable anymore. We can

just probably return something like this. And now let's actually put this

this. And now let's actually put this comment somewhere somewhere here. I

think this is an appropriate place for it. Now I can do numbers array init 56

it. Now I can do numbers array init 56 and also return zero in here. Now if I want to push something into such array, what I have to do? Let's create a

separate operation array push and I'm taking a pointer to an array. Since

array is pointing at the elements, we have to do the opposite operation and find the beginning of the header. We can

cast the pointer to an array to a pointer to the header, subtract one, and we got the pointer to the header. So now

what we can do, we can take an array, we can take the count, increment it by one, and assign an element that we're trying to push into the array. Obviously, we

may actually exceed capacity. As of

right now, let's actually assert that assert that we're not exceeding the capacity of pre-allocated array.

Ideally, it would be nice to reallocate the array, but let's actually keep it simple for now. And now what I can do, I can just push numbers into the array. 69

420 1337 800 85. And now I want to iterate that array, but I don't know its size. So let's introduce some operation

size. So let's introduce some operation which is array length, which also takes the array and finds the header. So we

cast this entire thing to a header subtract one and the header already contains the count which we can quite easily return. So the only thing we need

easily return. So the only thing we need in here is just go ahead and iterate the entire thing less than array length array plus i. And now let's just print all of them. So this is an integer array

i. Let's go ahead and try to compile it.

i. Let's go ahead and try to compile it.

It didn't compile because I have a redefinition of a variable. I really

apologize for that. I'm now accepting this variable as a parameter. What's the

next error? uh I don't have this variable anymore. So now it has to be a

variable anymore. So now it has to be a size of an integer. Totally

understandable. So we using assert for the assert I have to include the assert.h header. Totally understandable.

assert.h header. Totally understandable.

The next error please array isn't declared. It's called numbers. So and

declared. It's called numbers. So and

this is also called numbers. There you

go. So here are all of the elements within the array. But it would be kind of nice to actually see how the structure is being allocated and how the elements are being pushed. So let's go ahead and maybe open this entire stuff

in the debugger. So I'm going to be using GF2 which is essentially front end for GDB. So this is essentially GDB but

for GDB. So this is essentially GDB but on top of that we have a GUI that communicates with GDB and gives you sort of like a conventional UI for the debugger. It's not perfect but it's

debugger. It's not perfect but it's better than you usually get on Linux.

Linux is famous for not having a decent debugger anyway. Anything slightly

debugger anyway. Anything slightly better than GDB will actually work for you. Let's go ahead and just break at

you. Let's go ahead and just break at main and let's actually run the whole thing. And I forgot to recompile my

thing. And I forgot to recompile my application with debug information. So

let's actually put GDB into the flags.

Let's try to run the debugger one more time. Break on main. Run it again. And

time. Break on main. Run it again. And

there you go. So if I take a look at the numbers in the watch window, we don't really have anything particularly interesting because we have some trash in the memory currently. I'm going to step one time and it just contains zero.

One thing we can do in here which is rather interesting, we can take numbers.

We can cast it to pointer to the header.

We can now subtract one and that should be enough to see the all of the meta information. We can actually expect that

information. We can actually expect that hidden meta information in a debugger.

So now I'm doing one step and as you can see count is increased and in the number we can see 69 but as far as I know we won't be able to see the next one because the debugger considers that not

as an array but as rather as a pointer to one element. So one thing we probably want to do, we want to dreference this entire thing and then GDB has very interesting syntax where you can put at

in here and then the size of the array.

But that size can be actually dynamic.

So essentially what you can do you can cast the numbers to a pointer subtract one to get the header and then you can say okay use count in the header as the size of the array. So it will

dynamically actually grow and as you can see we have 69 for 20 in here. So if I continue stepping, it actually adds another element in here. You can

actually see that element. You can add another element. For some reason, GF

another element. For some reason, GF actually collapses this entire thing.

But here are all of the elements. Here

is the metadata. And here are all of the elements. And what's interesting is that

elements. And what's interesting is that this has a pleasant interface. I just

create a pointer to an elements. I just

initialize the array and I just push there and I just iterate there. So

furthermore, this particular function doesn't even have to be a function. We

can turn it into a macro. Define it like that. It accepts an array. And then I

that. It accepts an array. And then I probably have to wrap array in parenthesis. And there we go. This is a

parenthesis. And there we go. This is a macro and it still works. The reason why I have to wrap it in parenthesis is because macros are functions that operate on tokens. When you pass a

parameter in here, it could be an expression. It could be an entire

expression. It could be an entire expression. And what the C processor

expression. And what the C processor will do, it will literally copy paste the sequence of tokens to here. And if

the sequence of tokens is for instance 1 + 2 what it will do it will literally copy paste 1 + 2 in here which means that this cast is not going to be applied to the entirety of the

expression but only to a sub expression which is one. This is not what we want.

What you have to do you have to wrap the parameters into parenthesis to make sure that this cast is not applied to only sub expression but to entirety of an expression. Reprocessor actually run

expression. Reprocessor actually run before the actual compilation of C.

Because of that, it is not aware of the syntax of C. What it's aware of is the tokens of the C programming language.

But another interesting consequence is that this function now can work with dynamic arrays of any type. If before

this function only accepted pointed to the integer array, now we eliminated that. And it could be any of them. It

that. And it could be any of them. It

could be array to floats. Obviously, the

rest of the functions don't work with arbitrary types. But can we make them

arbitrary types. But can we make them work with arbitrary types as well? We

can try to go ahead and turn our push into macro as well. Let's do define.

Let's accept R and integer. And

unfortunately, C doesn't support multi-line macros. So, what you have to

multi-line macros. So, what you have to do to have multi-line macros, you have to append backslash to each and individual line. The reason why I have

individual line. The reason why I have to do that is because prep-processor processes by lines and then it sees the backslash at the end of the line and it considers new line escaped character. So

it continues sparsing the rest of the lines as the same line. It is

inconvenient but at the same time C is old and stinky language. So what can you do? One thing I like to do for example

do? One thing I like to do for example is align by backslash to put backslashes at the end in here so it looks readable more or less. Not every text editor supports that. I understand that. But in

supports that. I understand that. But in

EMAC, you can actually quite easily do that. And obviously here we have to wrap

that. And obviously here we have to wrap ours in parenthesis like so. And

probably we also have to wrap access in parenthesis as well. So let's see if it still compiles. Surprisingly, it still

still compiles. Surprisingly, it still compiles. And as you can see, this

compiles. And as you can see, this particular macro doesn't care what type array or x is. The only thing it cares is that the array is the pointer to the type of x. So they are compatible with

each other. And if you have these

each other. And if you have these complex multi-line macros, it is usually a good idea to wrap them in curly braces so they have their own scope. So on that scope you can actually allocate some

variables. But this is not going to work

variables. But this is not going to work in all of the cases. For example, when you have some sort of a condition where you have array push numbers 111, this is

going to work fine. But then if you have a else branch where you also do a push, you're going to start to have problems because this particular call has

expanded into curly braces like this.

And you can't have semicolon after the closing curly brace. You just can't. You

have to remove it. So because of that, if you want to have a scope within a complex multi-line macro, you can't wrap it in curly braces. One trick C

developers came up with is to actually use do while with a false at the end. So

the entire iteration executed only once.

This particular trick actually allows you to have expressions like this and it's totally fine. So if you ever saw me using do while inside of Pacross, now you know why. I'm already tired

explaining that over and over again. So

there you go. That's why I put the while in here. But here's an interesting

in here. But here's an interesting thing. Doing array init with a certain

thing. Doing array init with a certain capacity is kind of tedious. Look in STB you don't have to initialize anything.

You can just allocate an array assign null and it's totally fine. Can we do something like that? So essentially we could have actually baked array initialization into array push if array

is equal to null. I think that would be actually very convenient. So let's go ahead and try to do that. Essentially if

array is equal to null just do that initialization. We can go ahead and

initialization. We can go ahead and literally copy paste this entire thing to here. I should also not forget to put

to here. I should also not forget to put back slashes at the end of these lines.

But while I'm writing code, I'm not going to do that. Here we have initial capacity. Maybe it would make sense to

capacity. Maybe it would make sense to transform the initial capacity into some sort of a predefined macro like this 256. So let's remove the array in it. So

256. So let's remove the array in it. So

the size of the element here is an interesting thing. We can derive the

interesting thing. We can derive the size of the element from the pointer to the array that the user provided. We can

essentially now dreference the array.

Take its size and then multiply it by initial capacity and also add additional space for the header. That way we completely abstract it away from the specific elements of the dynamic array

we're working with. After that instead of returning this entire thing, I think what we should do, we should assign the array. So since we're abstracting away

array. So since we're abstracting away from a specific type, I think it would make sense to actually cast it to void.

The nice thing about void pointers in C is that void pointers can be automatically casted to any pointers and vice versa. Any pointer can be casted

vice versa. Any pointer can be casted automatically to void star, which is not the case for C++, but who cares about C++. Let's go ahead and append

C++. Let's go ahead and append backslashes to each individual line.

It's very important. And let's align backlashes like this. So let's now try to compile the whole thing. It doesn't

compile because we don't have array init. Let's replace it with null. Let's

init. Let's replace it with null. Let's

try to recompile one more time. So, it

is complaining. I got rid of the init capacity because now I have a predefined macron. And it still works. And what's

macron. And it still works. And what's

interesting is that it is completely abstracted away. I can now use floats

abstracted away. I can now use floats instead of integers like this. And it

still works. And unlike my dynamic arrays, you don't have to create separate type for each individual kind of the dynamic array. But the downside of this approach is that just by looking

at the type, you don't know that this is a dynamic array. You don't know that there is an additional header hidden behind the base of an array. And if you

want to deallocate this array, you won't be able to just do free because to free that array, you have to actually point to the header because this is where actually allocated memory is located. So

proper deallocation of such array would be cast it to header minus one and deallocate it. Maybe you can actually

deallocate it. Maybe you can actually factor it out to separate macros something like array free numbers and define it like this array free. Here's

your array and just do this like so. It

seems to be working. So another thing that would be nice to have in here is to expand the array when we exceed the capacity. If the count is greater or

capacity. If the count is greater or equal than capacity, what we have to do is extend the capacity. I usually extend it by two. I know that the optimal one

is 1.5, but I prefer two because it's all just integer operations. After that,

since header points at the beginning of the allocated memory, we can just do real right on the header. We are

reallocating it by the size of a single element multiplied by the new capacity, extending the capacity. And let's not forget the room for the header. And then

we assign it back to the header. That

obviously invalidates the pointer to the array because as you reallocate the memory, the extended memory can actually move around to a better place where it

can fit. And if you have a pointer that

can fit. And if you have a pointer that points here and real actually reallocated that memory, that pointer becomes invalid. It's not guarantee that

becomes invalid. It's not guarantee that real lock is going to reallocate the memory on each call, but it may happen.

So you have to be always prepared for that. This call may invalidate the

that. This call may invalidate the array, which means we have to do this operation again. And believe it or not,

operation again. And believe it or not, that is basically it. That automatically

reallocates everything. So it doesn't compile. I forgot a extra backslash. And

compile. I forgot a extra backslash. And

everything works. We can do an interesting thing. Let's actually reduce

interesting thing. Let's actually reduce the initial capacity to one. It still

works. And let's run it in a debugger.

I'm going to break it main. I'm going to run the whole thing here. We're going to have numbers, but let's actually take a look at the header. So, we're going to take the numbers, cast it to a header, do minus one, and that should be enough.

We don't really have anything because nothing is allocated yet. After the

first push, we're going to have something allocated. As you can see, we

something allocated. As you can see, we have a count equal to one and capacity equal to one. Let's also do the following thing where we're going to dreference this entire thing at and let's cast the numbers to the header

minus one and let's grab the count so we can see the actual elements being pushed in here. As of right now we only have

in here. As of right now we only have 69. So since count and capacity are

69. So since count and capacity are equal to each other that means on the next push there will be a reallocation.

Let's actually see how it handles that and it handled it fine. Here is two element the capacity is two count is two. So that means on the next push

two. So that means on the next push we're going to have reallocation one more time and since we're doubling the capacity now it is four and everything is totally fine. The reallocations are

working fine and after that the capacity is eight and the count is five and here are all of the elements in here. And

again the entire thing looks just a regular array which is really nice. This

approach simplifies the interface but unfortunately at the cost of a little bit uncertainty. As already said, the

bit uncertainty. As already said, the type of this array doesn't tell you that it is dynamic array. You must know that if you are okay with this kind of limitation now you know the technique

that enables you to have this kind of interface for your generic containers in pure. person.

pure. person.

Loading...

Loading video analysis...