Ali Ghodsi, Co-founder and CEO, Databricks kicks off Data + AI Summit 2025

By Databricks

Summary

## Key takeaways

- **Conference Growth Milestone**: Ten years ago, Databricks had 3,000 attendees focused on data engineering with Apache Spark 1.0; three years ago, it was 5,000 shifting to AI; now, over 22,000 are here, making it the largest data and AI conference with 65,000 worldwide viewers from 150 countries. [00:36], [01:47]
- **Open Source Impact Stats**: Open source projects launched at this conference have massive adoption: Apache Spark with over 2 billion downloads, Delta Lake over a billion, Iceberg 360 million, and MLflow over 300 million, underpinning Databricks' mission to democratize data and AI. [02:48], [02:56]
- **Enterprise Data Complexity Trap**: Most organizations have fragmented architectures with multiple data warehouses, lakes, ETL tools, BI solutions, and ML platforms, each with siloed data, metadata, and security models, leading to high costs, slow innovation, and vendor lock-in. [06:26], [08:25]
- **Lakehouse Revolution Accepted**: Five years ago, Databricks proposed the lakehouse to centralize data in open formats like Delta Lake and Iceberg on cheap storage such as S3 or ADLS, facing skepticism from vendors who now embrace it, though 99% of enterprise data remains proprietary. [09:25], [10:25]
- **Unified Governance Imperative**: Unified governance centralizes security, access control, discovery, lineage, business semantics, cost controls, and quality monitoring across all data assets, models, and dashboards, unlike narrow solutions focused only on structured data access. [11:28], [12:41]
- **Democratizing Data with AI**: Data intelligence infuses AI to let users query data in natural language via Genie, used by 81% of customers for data rooms, and the Assistant, used by 98%, which understands enterprise data to diagnose errors and generate code, boosting productivity without coding skills. [17:36], [18:34]

Topics Covered

  • Open formats unlock data from vendor lock-in.
  • Unified governance secures all data assets centrally.
  • Natural language democratizes data access instantly.
  • Enterprise AI thrives on proprietary data alone.
  • Free Databricks edition educates billions globally.

Full Transcript

Please welcome to the stage Databricks co-founder and CEO Ali Ghodsi.

[Music] Hey everyone, super excited to be here.

Awesome. I actually thought we start with a little bit of a look back.

Believe it or not, this picture is from 10 years ago here at the Moscone Center.

We were so excited. We had over 3,000 people in the audience. We were like blown away.

We're like, "This is impossible."

Uh, and back then we were doing AI, but really our focus was on data engineering, making big data simple.

That was the big focus.

And we had just released Apache Spark a few years earlier.

It was just a 1.0 release.

And it was a big deal for us.

And then three years later, this is a picture from three years ago.

Um that's again Moscone Center, but now we actually had over 5,000 people. Okay?

And we thought this is the maximum we're going to ever do. Uh but now we have started focusing more and more on artificial intelligence.

We actually changed the name of the conference from Spark Summit to Spark + AI Summit.

And um I'm sure none of you remember, but I actually went back and looked.

Uh, I was on stage and I actually live coded, and there was no vibe coding, no assistants, so it's actually real code, um, with TensorFlow, generative AI, and I made a picture of Boris Johnson back then.

And with that our focus started uh expanding from data to both data and AI.

And if you fast forward actually till today, it's amazing to see where this conference and this community has come.

We now have actually over 22,000 people here at the Moscone Center. We've Yes.

Super exciting.

It's actually a citywide conference right now.

We've taken over all of Moscone.

It's basically four blocks and it is the largest data and AI conference on the planet.

65,000 people worldwide are watching right now.

And the stat that always blows me away is that there's 150 countries represented.

Okay. Uh, so this is not like a just US or American or West Coast-centric event. It's 150 countries.

Okay. Uh we have super awesome program.

700 sessions and trainings happening. Over 200, or, you know, 350 customer teams are presenting.

So that's you guys uh are going to present the projects that you've been working on.

It's going to be super exciting.

And we also have 180 exhibitors.

And you know the backbone of this conference over the years is open source.

So the open source projects that were launched in this conference and that you all built now have over two billion downloads just on Apache Spark alone.

Over a billion downloads on Delta Lake.

Uh, Iceberg, which we adopted last year, has over 360 million downloads.

And then MLflow, which we use for all of our ML ops and all our experimentation for generative AI, has over 300 million downloads.

And together, this is why our mission at Databricks is to democratize data and AI. We want to bring data and AI to as many people on the planet as possible.

And I want to take a second to thank our sponsors who make this possible.

Um, especially our legend sponsors, the hyperscalers: AWS, Google, Microsoft.

Yes, thank you.

And also the GSIs, Accenture and Deloitte, and all the other ones that are on this slide.

And I also want to send a special thank you to all the other partners we have.

We have the system integrators that help us actually build all this with services in your companies.

Uh we have the technology partners that are super essential.

They augment the platform.

We integrate with them and they're the ones that actually make us do things that the platform alone cannot do.

The built-on partners that built their solutions on top of Databricks, uh, and they sort of innovate and push the boundaries of what you can do.

Uh, and then finally, the one I'm very excited about, and which is very important, are the data providers, right? Those are the ones that you can get on the marketplace on Databricks. You can do Delta Sharing; you can share the data set with other, uh, BUs and other organizations in the ecosystem. And together with all of these partners, the most amazing things are our 15,000 customers. That's all of you who are in the room. That's the real impact. That's where the interesting things are happening. So thank you all.

So, we have a super exciting program.

We have actually announcements both today and tomorrow.

So, you know, some people think, "Oh, the announcements are the first day."

No, there's going to be new announcements tomorrow, too.

So, you know, if you're booking flights to go home tonight, extend them, please stay another day.

Okay? We have very exciting things tomorrow.

Uh, I thought I'd just start off uh by sharing uh a video on one of our customers and what they're doing that I think is really cool.

So let's watch Joby here.

[Music] Joby has the mission of connecting people to places and the people that matter most to them. And we do that by developing a new class of all electric aircraft.

They can take off and land vertically like a helicopter, but they fly like an air taxi.

We're generating gigabytes a minute of data on our airplane.

We want to push this data to our developers, engineers so that they can gain insights from it very quickly.

So Databricks is the platform that we have chosen to make this data democratization happen.

Using Databricks has allowed us to quickly develop some of the sandbox. We leverage the endpoints.

We leverage the model serving.

We leverage the conditional models.

So all these technologies that Databricks has provided have proven to be very key for us. What we've tried to do is infuse the best of the way aviation does things with kind of the best of Silicon Valley.

Awesome. By the way, I see huge crowds in the back that are standing.

There are seats at the tail end and there are big videos back there as well. If you want to go have a seat, it's easier to watch.

Okay, so you heard Joby and what they're doing.

You're going to hear from hundreds and hundreds of customers here at this event what they're doing with data and AI. It's super impressive.

But the reality on the ground is that it's actually really hard still to succeed with data and AI. Uh, especially if you're in a company that has been around more than, say, 10 years or 15 years, chances are that your architectural diagram of your company's data and AI infrastructure, uh, looks something like what I have on this slide. In fact, this is probably a simplification.

Okay, you at least have one of each of those services that I'm putting on this slide.

You probably have it from some vendors, or maybe you're using some hyperscalers' services for this. You probably have several data warehouses, maybe some on-prem, some in the cloud. You have a bunch of unstructured data.

They're stored in the lakes. Um, you have, you know, real time solutions that you've custom built for your organization.

You have lots of ETL, data engineering, data movement that's happening that's typically bespoke.

It's been built over the years.

super hard to move.

You don't want to break anything with that.

Uh and then of course all the organizations we work with, they have one of each of the BI tools that exist. Like they have literally when I ask which BI tool do you use?

They always say well we have every one of them. Okay, dating back 20 years.

Um and then of course recent years there's been a lot of focus on data science, machine learning and now generative AI.

So you see all of this in your organization and what that means is that this is creating a complexity in most companies.

It's slowing down the organizations.

The costs are super high and it's creating a proprietary lock-in.

So at Databricks, since 5, 10 years back, we think this is the biggest problem.

Actually if I look back at our sort of history we came from academia.

We thought we were going to change the world overnight.

And the biggest thing that we learned in the last 10-15 years is what we see on this slide.

This is why everything moves slow in the organizations.

So that's why we, about 5, 10 years ago, set out to simplify this architecture and help organizations move faster.

That's what we really wanted to do.

We were sort of brainstorming: how can we simplify this?

What can we do?

And what we realized over time is that the reason this is a really bad architectural sort of situation is that each of these things has its own data sitting in it. And that's what's really locking people in. There's data and there's metadata.

You see that with that binary icon that I have. Uh and then not only that, and this one I think we learned a little bit later, each of them have their own security model, their own access control, uh their own governance because you know, everybody's getting attacked.

Privacy is super uh important to every organization out there.

So they each of these solutions have their own security model that you have to uh take into account and that's what's making it really really hard. So step number one uh was how do we get all this data out of all these systems and just centralize them uh and have you own it yourself.

So instead of you uh having to worry about the data in the different systems, it's sitting in cheap open lakes in an open format.

So that was step number one.

We actually wrote blogs about this and we called it the lakehouse. That was 5 years ago.

And you can see here this was sort of the reaction in the market when we put that out. Uh they didn't believe in it.

You can see, you know, these are different vendors, what they're saying. You know, uh, "choose open wisely" was one of the reactions, basically saying that it's a misguided application of open, and "really our priorities and our customer obsession dictate against it."

And I don't want to pick on a particular vendor.

It was just the general kind of, uh, reception of this was that way by pretty much everyone, all the hyperscalers and so on. Um, and then, you know, uh, as time went on, uh, you can see now, everybody loves open. Everybody's talking about it. You can see the quotes now from all the hyperscalers, from all the vendors that are out there.

Um and actually we can see now uh there's even books by the creators of data warehousing talking about the lakehouse uh and how that's the future.

So that's super exciting. So now you have the open formats.

It's kind of industry has accepted that this is the way to go forward.

Though I would urge you: in your organization, despite a lot of the marketing from the majority of these vendors, even if they say they like open, your organization probably has the data locked. 99% of it is still probably in a proprietary format inside of these data warehouses.

So I think we as a community have to push this.

Let's really push for the open formats and let's make sure that you own the data. Don't give it to Databricks.

Don't give it to other vendors.

Own it. Store it on open lakes like S3 on Amazon, or like Azure Data Lake Storage, ADLS, on Microsoft Azure, or like GCS, Google Cloud Storage.

Okay, store it there. Don't store it somewhere else.

Okay, because then that leads again back to lock-in. All right, so that was step number one.

But there was one more step that actually I think we didn't quite understand how important it was and it was that of governance. So, it's all these uh security models that each of these have.

And I think we actually created the category four or five years ago, but we never really wrote about it.

And I'm starting to believe that this is more and more important than ever.

Uh, so we'll be talking a lot about this, uh, next few days. And tomorrow, make sure you catch Matei's keynote. He's going to be talking about this. So, this is what we call unified governance. So, what is unified governance?

Unified governance isn't just "let's do security on our data estate."

It's two very key ideas.

The first important idea is that you have to do governance on all of your data assets in the organization. Okay?

And you should do that centrally in one place.

Okay? In your house, you wouldn't have different vendors do the security on different floors or different doors, right?

That wouldn't make any sense.

It would be very hard to coordinate and make it work. So, you need to do it for all of the different data assets.

Um, not just the structured data that you have, but also for all the unstructured data that you have. Uh, but not just for the data, but also for the AI models that you have. But not just for the AI models, also for all the dashboards that you have.

Dashboards is the easiest way today in an organization to basically exfiltrate data.

You get your hands on a dashboard it has data sets you can email them around and then data evaporates out.

Uh so you really have to do governance on all of these because the raw data becomes structured data goes into dashboards. We want to track the lineage of all that. Um so that's step number one. That's idea number one.

It has to be on all of the different things, and really no governance solution today out there, uh, outside of Unity Catalog, really does it for all of your data assets.

So that's number one.

Number two, uh you want to actually have unified capabilities on top of this data.

So what I mean by that is it's not just about security. It's not just about access control.

It's equally important uh to have a platform or a governance solution that really helps you with discovery with collaboration because again and again people are recomputing and coming up with the same data sets in the organization again and again and again and then people bicker about why their numbers don't match. Um you want to track the lineage of all these different things.

Uh you want to understand the business semantics.

Where do I find the authoritative data set for revenue for this product line?

You want to be able to easily find that and there should be no doubt which data set that is.

That is what business semantics is, and we'll talk a lot about that at this conference, on how we're taking the next step towards this. Cost controls.

If you're going to have all this data and AI infrastructure, all these different services, you have to be able to do cost control.

So there has to be a way where you can do that centrally. And then data quality monitoring.

Uh, so these are super important, and unfortunately today, actually, despite all the talk about governance in the market, most focus is just on access control on structured data. Like, if you look at efforts like Polaris, they're really just focused on this narrow slice, uh, of governance.

Uh, so we're really hoping that you can push your organizations, and we can together push towards unified governance. And at Databricks, uh, unified governance, um, that's really the lakehouse. Okay, we get, um, at the bottom, the open formats, and I'm super excited to share now that Databricks 100% supports as formats Delta Lake and Iceberg. You can use both formats. We have 100% support for both.

So a year ago, we announced that we acquired a company called Tabular.

Um, the founders of that company are the original creators of Apache Iceberg, and we've been working this whole year on bringing these formats closer and closer and closer.

So in Databricks, essentially the differences are largely, uh, now, uh, negligible, and you should listen to Matei's talk tomorrow, because if you're actually using Unity Catalog, uh, as your catalog of record for your open formats, whether it's Delta or Iceberg, you will be able to actually read and write to these data sets from any system out there. Okay. Uh, and we'll show in performance benchmarks that if you pick Unity Catalog as the catalog of record for your open data in Delta Lake and in Iceberg, you'll get significantly better performance in actually ingesting and processing that data. So this will be the most open approach you can take, uh, to the lakehouse. And then on top of that, we provide Unity Catalog, which we open sourced last year.

We're continuing that. We're going to continue open sourcing uh more and more. Uh and most importantly, it also implements two of the standard open source interfaces for cataloging.

The first one being Hive Metastore and the second one being Iceberg REST catalog.

So these two interfaces, any system that wants to talk to an open catalog, you can just pretend this is a Hive Metastore or this is an Iceberg REST catalog.

And it's actually the most complete implementation of Iceberg REST catalog in the industry today.

Um, so this is the lakehouse and we're super excited about this. This is what we've been focused on for the last five years.

But what we're really excited about at Databricks is when you take this lakehouse as your foundation and then on top of that you build intelligence into the platform.

Uh so what we mean by that is you take AI and you infuse it throughout the platform. That's what we call data intelligence.

Okay, data intelligence for us means really two things.

One is we want to democratize access to data in your organization.

In the past, you had to be someone who knows Python or SQL or maybe even Scala or Java to get value out of your data.

We want to challenge that and change that so that it should be enough if you just speak English to your data or any other mother tongue or natural language that you have and we should be able to get answers for you. So that's number one for data intelligence. Number two is we want to also democratize AI which means we want to help you build your own AI for your companies so you can innovate AI that can reason and answer questions on your proprietary enterprise data or organizational data uh that's out there.

Okay. So not just general intelligence and artificial general intelligence that can answer questions about anything out there but that really with high quality and uh adequate costs can answer questions on your proprietary data.

So on the left hand side, democratizing data, uh, the sort of, you know, "English is the new programming language," we have two products that we have had in the market now for over a year.

Uh, Genie, on the left hand side you can see, um, is a product that's now used by 81% of our customers. Um, how that works is that you just ask your question. You create these data rooms. So you can take any data sets in Unity Catalog and create a data room, and you can share it with anyone in the organization, invite them to the data room, and they can just speak English to, um, to Genie, and what it does is it uses, uh, an ensemble of agents, comes up with a plan, uh, and actually writes code for you and then executes that.

So you don't need to know any of that.

You're just speaking English and it's just interacting with you and giving you the results that you see on the left hand side. So I think this is going to be the future eventually.

this is going to be the way everyone does data science.

But today it does require preparation of the room. If you don't prepare the room, uh you know, generative AI still has ways to go for it to be completely automated, right?

So, um, you know, it still can't do the full thing that data scientists do unless you've prepared the room.

On the right hand side is the assistant.

So 98% of our customers are actually using the assistant.

And what that is is, you know, throughout Databricks, we've made the assistant understand every aspect of the platform.

So you're running into an error, you can click diagnose error. If you have a piece of code, it can explain it and it understands your data in the organization.

So that's the key thing about the assistant. Uh, unlike other assistants out there that, you know, are great at programming, this one understands the data that you have in Databricks.

So that's what we call data intelligence.

Um, I've been a power user of Databricks, no surprise, for the last 10, 15 years. And today I just use this all the time. Like this is my preferred way.

Even though I used to prefer just coding myself, I use the assistant.

I let it write the code for me.

Uh, I review it. Uh, but it's completely transformed what I'm doing and I'm much, much more productive, and I think we're seeing it with everyone that's using, um, Databricks.

Okay, so that's just democratizing data.

Let's look at AI. On the AI side, we announced last year, uh, Mosaic AI Agent Framework.

You could build your own agents.

You could do vector search. Uh you could do model fine-tuning, guardrails, and so on.

And in particular um we have now if you look at the platform classic ML is being used by 95% of our customers and then we have generative AI that's picked up and it's at 81% of our customer base.

Uh so super excited about this.

Um, listen to Hanlin's talk today, where we're going to talk about basically the next, uh, generation, uh, of how we want you to develop your own data intelligence on your data.

So one of the big announcements is going to be by Hanlin.

Okay. Um, so that's the data intelligence platform, that's what you're seeing in this picture. You have the lakehouse and we've infused AI throughout it.

We've simplified everything by removing a lot of the knobs using AI. Uh, and we think that this is going to be a big deal.

Data intelligence uh is going to uh be something that every organization on the planet wants.

Uh, and that's going to change and transform jobs throughout the planet.

And I think we need to educate, uh, billions and billions of people around the world on how to do data and AI themselves.

That's why I'm super excited to announce today that we're actually launching what's called the Free Edition of Databricks.

So, so what is the Free Edition of Databricks?

Free Edition of Databricks means you don't need to swipe any credit card.

We're not going to ask you for any credit cards.

We promise. Uh and we're not going to ask you for a business email.

Uh, you can use your Gmail or your Hotmail or, uh, you know, and just log in, and you can get a free slice of Databricks forever.

Okay, it's just there.

Um but the limitation is you get a small slice, right?

You can't come in and say I want a million machines because then we go bankrupt in an hour. Uh, right, but that's how it works. Uh, and together with this, we're also doing a $100 million investment in training and education and try to get Databricks Free Edition out to all the universities out there.

So please help us. And we've taken all of our self-paced learning that we've had over the years, where you can learn to do various things, like you want to build a recommender, or you want to do inventory analysis, or you want to do forecasting, or you just want to build data pipelines that are robust.

We're taking all that self-paced training and we're open sourcing it and making that free as well. So we really need your help. Please help us get the word out with this. If you have contacts at universities or if you want people to learn uh let's get this out there.

This is the best way for anyone uh to learn and get started with data and AI.
