Introducing Lakebase - Databricks Co-founder & Chief Architect Reynold Xin
By Databricks
Summary
Topics Covered
- Databases are stuck in the 90s, need modernization.
- Branching databases like code unlocks new development workflows.
- Lakebase: PostgreSQL on open formats with decoupled storage/compute.
- Serverless databases launch in seconds, pay only for use.
- AI agents get isolated databases for cost-effective experimentation.
Full Transcript
To talk more about this and what we've been doing at Databricks for the last two years on Lakebase, I want to welcome my co-founder, Reynold Xin, to the stage.
[Music] [Applause] Thank you, Ali. We were actually supposed to shake hands, but given you walked all the way over there and I'm supposed to come here, I think it would be a little bit difficult. So, virtual handshake.
As Ali said, Databricks in the last decade has mostly been focused on the analytics side of data infrastructure. If you look at analytics systems today, whether what you use on Databricks or on other vendors, they look remarkably different from what they were in the '90s. A lot of foundational technology has been invented: columnar storage and vectorized processing have dramatically sped up analytical workloads; streaming, invented maybe over a decade ago, made data a lot fresher; and of course in 2020, five years ago, we published the lakehouse blog and pioneered a new architecture that decoupled storage and compute and, more importantly, was based on open formats. That has enabled a lot of new workloads and dramatically lowered the TCO of analytical systems.
OLTP databases, however, are kind of stuck in the past. What do we mean by that? If you look at the OLTP databases running today, whether commercial proprietary systems like Oracle or open source databases like MySQL and Postgres, they look more or less the same as they were in the '90s. Yes, we've added a lot of features, and they have gotten faster. But if you look at the components, the techniques, and the foundational ideas, they're more or less the same. Some trace all the way back to the systems papers of the '70s.
Databases are viewed as this heavyweight infrastructure that requires a lot of manual intervention and maintenance, and it's quite clunky. First, databases are very slow to provision and difficult to scale; if your workloads are fairly dynamic, it's often a nightmare to deal with that on the database side. Because of that, databases are fairly disconnected from modern-day developer workflows, which we'll zoom into in a bit. They're also very siloed from analytics and AI: it's actually not unusual these days to want to combine analytics and AI with your transactional database workloads, but it's very difficult to do.
So what did I mean by databases being disconnected from modern-day developer workflows?
Imagine you're a software engineer and you're trying to add a new feature to a codebase. The very first thing you'll likely do is run the following command: git checkout -b, or maybe click in the UI. What it does is create a new branch of your codebase. You'll make changes to this new code branch, adding a new feature, maybe fixing a bug, and you'll be testing against it. But all the changes you make are isolated to this specific branch. And creating a new branch is an instant operation. It's very, very fast. You don't have to think twice about it. You just do it.
What's the equivalent for databases? If you want to clone your production database, it might take days. Once you put one up, you almost never shut it down. Something like this simply doesn't exist. Wouldn't it be nice if you could branch off a database just like you would with code?
Now, let's say you get past all that development hassle and you manage to build a pretty successful app, which many of you have. The app has taken off, and you want to introduce some analytics or AI capabilities to it. So your app development team now starts talking to your data infrastructure team, who manage the lakehouse. You say, "Hey, how do I actually get the data from one side to the other?"
Now you have to figure out how to manage two disparate systems. You have to understand the IAM role differences. How do you set up secure networking? How do you create ETL pipelines and load data from one to the other? You learn fancy terms like change data capture, SCD type 1, and SCD type 2, which up until this stage I still don't understand. It all just seemed awfully complicated.
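(For the curious: a minimal SQL sketch of the difference, assuming a hypothetical customers table. SCD type 1 overwrites a changed value in place, while SCD type 2 preserves every historical value by closing out the old row and inserting a new one.)

```sql
-- SCD type 1: overwrite in place; the prior value is lost.
UPDATE customers
SET email = 'new@example.com'
WHERE customer_id = 42;

-- SCD type 2: close out the current row, then insert the new version,
-- so the full history stays queryable via the validity columns.
UPDATE customers
SET valid_to = now(), is_current = false
WHERE customer_id = 42 AND is_current;

INSERT INTO customers (customer_id, email, valid_from, valid_to, is_current)
VALUES (42, 'new@example.com', now(), NULL, true);
```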
So in the past couple of years at Databricks, we've been working on how to tackle this problem and actually eliminate all these challenges, and the result is Lakebase.
Lakebase has the following attributes. First and foremost, it's based on open source Postgres. Second, it's built on a novel architecture that decouples storage from compute, which actually enables the modern-day developer workflow. And by building on top of Databricks infrastructure, it comes with what you would expect: amazing lakehouse integration as well as all the enterprise readiness features. Now let's talk about each of them, one by one.
First and foremost, Lakebase is built on open standards, which is open source Postgres. In the last few years, open source Postgres has been steadily on the rise; if you look at the latest Stack Overflow survey of the most popular databases, Postgres actually leads by a wide margin. And this is because of its robust ecosystem of tools, libraries, and extensions, all of which just work out of the box on Lakebase. And Lakebase can guarantee you single-digit millisecond latency at scale.
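(A minimal sketch of what "works out of the box" means in practice: since Lakebase speaks standard Postgres, ordinary DDL, DML, and point reads apply unchanged. The table and values below are illustrative, not from the talk.)

```sql
-- Plain Postgres DDL works as-is against a Lakebase instance.
CREATE TABLE orders (
    order_id   bigserial PRIMARY KEY,
    product    text        NOT NULL,
    quantity   integer     NOT NULL CHECK (quantity > 0),
    created_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO orders (product, quantity) VALUES ('Cherry Burst', 90);

-- A point read like this is the kind of OLTP query the talk says
-- is served at single-digit-millisecond latency.
SELECT product, quantity FROM orders WHERE order_id = 1;
```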
The second most important part of Lakebase is that it's built on a novel architecture that separates storage from compute, and this architecture has three layers. At the very bottom, we're using data lakes, or object stores, to store the actual physical data. Object stores are the cheapest storage medium you can find, and they're extremely reliable at scale. Now, one of the challenges that Ali referred to is that object stores were not exactly designed for the type of workloads that OLTP databases need. A 100-millisecond query is plenty fast in a lakehouse for analytics, but 100 milliseconds is unacceptable for OLTP workloads; we need single-digit millisecond latency. The way we've solved that is by introducing a middle storage layer that holds only soft state and acts as a write-through cache in front of the object stores. For those of you who are database nerds, it also creates a new way to very quickly persist the write-ahead log, or WAL, of the database. And on top of the storage layers, we have the ephemeral compute nodes, which are Postgres instances that read and write to the underlying storage layer.
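(The WAL mentioned here is ordinary Postgres machinery rather than anything Lakebase-specific; as a small illustration, any standard Postgres 10+ client can inspect how far the log has advanced.)

```sql
-- pg_current_wal_lsn() reports the current write-ahead log position;
-- every committed transaction is durably recorded in the WAL first.
SELECT pg_current_wal_lsn();
```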
One thing that's very important for me to point out is that, very similar to lakehouses, the actual data stored in the object stores, the data lakes, is in open source formats. It's not some proprietary format we invented to improve performance or to lock you in; it's just vanilla Postgres pages. And this opens up a whole new paradigm of opportunities.
There have actually been a lot of past attempts at this problem. This is not the first time the industry has tried to create a separation-of-storage-from-compute architecture for OLTP databases; some commercial systems, especially from the hyperscalers, have done it. But they were typically built on yet another proprietary storage system. They're way more expensive, and they don't use open source formats, which means they can manage to lock you in even more.
So how did we crack the code here? Some of you might recognize the architectural diagram. Databricks actually acquired a company called Neon just last month, and you've already heard about it. Now, we did acquire Neon; we announced the acquisition last month and only closed it last week. But one of the interesting things that's not widely known is that we actually invested in Neon, the company, many years ago, and we've been working with the Neon team as a technology partner on this separation-of-storage-from-compute architecture. Everything in Lakebase is built on this collaboration.
Building on top of this novel architecture, we managed to accomplish a few fairly interesting things. The first is serverless. What do we mean by serverless? Well, earlier we said databases are this heavyweight infrastructure that requires a lot of manual maintenance and intervention. Serverless here means databases should become lightweight. What does it mean to be lightweight? Well, Lakebase comes in two flavors. The first is a provisioned-throughput flavor, where you specify exactly how big you want the database to be. If you know how to size your workload, that's the perfect solution for you. But for most people, the autoscaling flavor will be far more interesting. In the autoscaling flavor, you don't have to worry about how big a database you should be picking. And because the databases are just ephemeral instances, you can launch one only when you need it. It takes less than a second to launch a brand new database. If your load grows, you can either scale vertically, which the system will do for you, or choose to create replicas, which also come up in less than a second.
And if your load goes down and you no longer need the capacity (say, for a very America-centric company, there may be no load past 5:00 PM), the system can automatically shut it down very quickly. All of this happens in less than a second. And the best part is you only pay for the duration you actually need the compute.
The second thing we built was branching. We talked earlier about how difficult it is to branch off a database and to apply modern software development practices to databases: very easy to do with code, very difficult to do with databases. The separation-of-storage-from-compute architecture also has a copy-on-write capability built in, so we can instantly branch off a database. It takes less than a second to create a whole clone of a database, and that includes both the data and the schema. And because of the copy-on-write capability, you don't have to pay for extra storage unless you start making changes, and only the changes themselves incur extra charges, because under the hood the branches all share the same storage.
Something pretty magical happens when you combine the branching capability and the serverless capability into one. It completely changes the way you think about database development.
Every time you do git checkout -b, you could automatically branch off a database along with that new branch of code, keeping them perfectly in sync as you make schema changes. If you don't like your new code branch and whatever changes it made to the database, just kill both the code branch and the database. You pay next to nothing, just like how you pay for your code repository. And this is extra important in the age of agentic coding and AI. One way to think about AI agents is that you're getting, at very low cost, armies of thousands or in the extreme case maybe even millions of AI agents, each acting as its own individual engineer, running experiments on your codebase and adding new features. You might even have multiple AI agents adding the same feature, with judges to determine which implementation is best. Now every AI agent can get not only its own code branch but also its own database, at virtually no cost, for experimentation.
The separation of storage and compute, especially by leveraging open source formats in the underlying storage layer, also makes it super easy to synchronize data at very high throughput from one object store to another: from one data lake to another data lake, from the lakehouse to Lakebase. Many of you who are existing Databricks customers probably expect this out of the box, given what we're doing. You can publish any table in the lakehouse into Lakebase for real-time serving at millisecond-scale latency, and you can also do the reverse: you can very easily get data from Lakebase directly into the lakehouse, managed by Unity Catalog, with just a few clicks.
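(A hypothetical illustration of why this matters: once a Delta table, say a demand forecast, is synced into Lakebase, it is queryable as an ordinary Postgres table and can be joined against operational data in one statement. The table and column names here are invented for the example.)

```sql
-- Hypothetical: demand_forecast is a Delta table synced into Lakebase;
-- orders is the operational table the app writes to.
SELECT o.product,
       SUM(o.quantity)         AS ordered_today,
       MAX(f.predicted_demand) AS forecast
FROM orders o
JOIN demand_forecast f ON f.product = o.product
WHERE o.created_at > now() - interval '1 day'
GROUP BY o.product;
```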
And of course, by building on top of Databricks infrastructure, Lakebase is enterprise-ready. It comes with all the bells and whistles you'd expect, from security to compliance to governance.
So given Lakebase, what can you do differently? Well, first, if you're trying to build a new app and you need a relational database, give Lakebase a try. If you want to serve data you have today, whether it's machine learning feature stores or the output of simple data pipelines you've built, give Lakebase a try. And if you have complex ETL pipelines to ingest data from a relational database into Databricks, which I know almost every customer does, give Lakebase a try. It'll dramatically simplify your architecture.
So with that, I would like to invite Holly Smith onto the stage to give you a demo and show what Lakebase looks like. I think Holly's actually going to come up on that side, so we're slowly shifting across the stage.
[Applause] Hello. I've been given the job of managing inventory levels for a drinks company: making any last-minute adjustments to stock, but also sharing data between analytics teams. This job is tricky at crunch time, but fortunately I have some new tools from Databricks to help me. Today I'll be sharing how Lakebase powers Databricks Apps, works at scale, and can use data from Delta tables, all in real time, all in one platform.
So I'm going to switch to my demo. In front of me I have a Databricks app that brings together both operational and analytical data. Whenever I change these filters, these are live queries; nothing is cached in the app. I can see here orders that I need to review, but I'm not going to do that. Instead, I'm getting some urgent requests to place an order of 90 units of Cherry Burst. I'm going to select my store, click through the app, and go. The app will update Postgres on the back end, where the orders table sits.
Now I'm going to double-check that this has gone through correctly, so I'm switching to the brand new Postgres SQL editor, natively integrated into the Databricks platform. I can tell it's Postgres because I've got this icon in the top right here, and when I select my compute, it's got this Postgres tag next to it. I'm also not going to do this on production, so I'm going to use dev, which is an isolated test branch. If I look at the left, I can see some Neon details. I'm also going to check the version in the settings, and I can see that yes, this is Postgres version 16.6. To query the data, I just query it like I would any other table, and when I run this I can see that yes, my order has gone through. But that's just one record, and that's not very impressive.
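(The exact query isn't shown on screen, but a check like this one, with invented names matching the demo, is the shape of it.)

```sql
-- Hypothetical version of the demo's sanity check: did the 90-unit
-- Cherry Burst order land in the orders table?
SELECT *
FROM orders
WHERE product = 'Cherry Burst' AND quantity = 90
ORDER BY created_at DESC
LIMIT 1;
```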
Earlier today, I set up a little program running in the background to simulate lots of users. I'm going to switch to the native performance monitoring in Databricks and head to this metrics tab here. I can see how this has ramped up to thousands of transactions a second, and tens of thousands of rows per second, with all sorts of interesting operation types happening. One thing this doesn't show is our app latency as it scales. Fortunately, I built my little program so it can measure latency, so I'm just going to run a simulation and see how well we're doing. And oh, I can see my performance is almost 19,000 queries a second, with the median at 4.56 milliseconds and the 95th percentile at 5.6 milliseconds. So I'm pretty pleased with that.
Thank you.
Lakebase can also access data in Delta, and I can combine them for snappier responses in the UI. I head to the table in Unity Catalog and create a synced table. I go ahead and name the table, select which instance I want to use (I want production), choose the database, and then the primary key. Now I can either run this as a one-off snapshot, trigger it at a certain interval, or have it continuously updating. And I can either group it with other updating items in a pipeline or have it as a standalone.
Now that that's set up, when I go to it in Unity Catalog from the database view, I can see that I've got this little synced icon next to my table. So now when I head back to my app, here it is, and I head to my insights tab, which is including data from a Delta table for my forecast. And of course, I can sync all of my orders data back to the lakehouse for historical analysis, and the other way around too.
So in this demo, we've shown how Databricks is bringing operational and analytical data closer together. I hope I've shown you how powerful this is when combined with other Databricks features like Apps, Delta, and Unity Catalog, and that it's inspired you to give it a go on your next project. As for me, I don't think I like being an inventory manager, and I probably should have made this an agent. Back to you, Reynold.
Thank you, Holly. And again, only a virtual handshake.
So, at Data + AI Summit and other conferences, we announce products at various stages of maturity. Where is Lakebase? We just spent so much time talking to you about it. In the last year, we've actually been privately previewing Lakebase with hundreds of customers, and many of their logos are showing on the stage here, across a wide variety of industries; many of them are already running Lakebase in production. We're also very happy to have the following launch partners joining us to announce Lakebase, including catalog vendors, BI vendors, agentic coding platforms, and consulting services. But the best part is that Lakebase is available today. That's right, not something coming later: starting today, in all of your Databricks workspaces, depending on which region you are in, you can either explicitly opt in or it's already on out of the box for you. It includes the full-blown, fully managed Postgres instance, all the lakehouse integrations, multi-cloud support, and HA/DR. And there are a lot more new features coming in the coming months.
So just to summarize: Lakebase offers a fully managed Postgres instance. It comes with a novel separation-of-storage-from-compute architecture, which enables the modern-day developer experience, both for humans and for AI agents. And more important than the Lakebase product announcement itself, we actually feel this is how databases should be built in the future. Our prediction is that every other transactional OLTP database will evolve toward this architecture in the coming years.