LLMs & Data Warehouses: Reliable AI Agents for Business (Keynote) | Ryan Boyd, AI By the Bay 2025
By FunctionalTV
Summary
## Key takeaways
- **LLMs Fail at Counting and Aggregation**: LLMs are bad at counting and really bad at data aggregation, because if you're bad at facts, you're probably not going to be very good at aggregating facts; it's kind of an exponential badness. [03:21], [03:26]
- **RAG Doesn't Fix Aggregations**: RAG only helps with encoding facts; the LLM still can't scan, aggregate, and compute data. [03:45], [03:58]
- **LLM SQL Generation Only 78% Accurate**: The state of the art of SQL generation from LLMs is still pretty poor, with numbers in the high 70s for the best LLM (which actually uses multiple LLMs together to get to that score), versus 92% for a human. [09:37], [09:46]
- **Context Window Fixes Are Intermittent**: Even after adding context to always use a full seven days of data when calculating AWR, it worked in one run but not in another, getting the first six days wrong intermittently and sometimes generating really crappy SQL. [16:02], [16:44]
- **Columnar Storage Beats Row for Analytics**: If your underlying data is stored as rows, inserts and point lookups are easy, but computing an average age is inefficient because you end up reading all the data of every single row. On a columnar-storage (analytical) database you would just read the age column, which is much more efficient. [05:48], [06:07]
- **Custom Tools Over Pure LLM SQL**: You can build custom tools for specific functions, like getting sales by region, instead of letting the LLM generate SQL freely, since state-of-the-art accuracy is only around 78%. [09:15], [21:36]
Topics Covered
- LLMs Fail at Aggregation Like Humans
- Columnar Storage Beats Row for Analytics
- LLM SQL Generation Only 78% Accurate
- Context Alone Fails Intermittently
- DuckDB Scales Perfectly for Agents
Full Transcript
I ordinarily wouldn't wear a hat during a talk, but (a) this is an awesome hat that I love, and (b) there are very bright spotlights up here. But let's talk about agents. Agents can use LLMs along with analytics databases to analyze business data, find patterns, speed up reporting, and more. The title of the talk mentions business data, and I give some examples with business data. But customers and users of DuckDB and MotherDuck use the same technologies for all sorts of different types of data. So just replace some of the tables in my examples with game data, or package delivery data, logistics, etc. First of all, I want to give a shout out. Obviously, LLMs are magical.
They've advanced so much in such a short period of time. Amazing at improving our writing, amazing at understanding human language, and sometimes even at generating human language, as long as you're not trying to generate any sort of novel ideas. They're generally awesome at composing pieces of knowledge together. And of course at generating images of ducks. They're very good at generating images of ducks. But there are things that make the duck sad, things where the duck magician can't get the LLMs to do what it wants them to do, things that sometimes they aren't even designed to do. However, LLMs are kind of designed to understand facts, but they sometimes fail. And when they fail, they're often failing in the same ways that we humans fail. They make things up. They don't have current knowledge. And they don't have your personal information or your personal context in order to work. So how many of you as humans have these same problems at times? I certainly do. That's where LLMs are like humans. They're fallible. And we're working at making them less fallible as an industry. Hundreds of billions of dollars have flowed into this industry. So we've been able to augment LLMs with capabilities to make them even more magical, to make our ducks here happier.
We can feed them information on recent facts through RAG. We've started doing repeated calling of LLMs in order to emulate human chain-of-thought reasoning, and that is now built into most LLMs. But they're still bad at some things that humans are decent at. I might not be able to count exactly the number of stars here, because some are overlapping, but I'm not going to go from five, six, to eight and skip the number seven. I think it's an unlucky number for some people. Or maybe it's lucky, I don't know. But LLMs are bad at counting, and they're really bad at data aggregation. Of course, if you're bad at facts, you're probably not going to be very good at aggregating facts. It's kind of an exponential badness that happens here. So people originally were saying, well, RAG is a solution for everything. This was a while ago, right? RAG only helps with encoding facts, though. The LLM still can't scan, aggregate, and compute data.
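One way to see the gap: retrieval surfaces a handful of relevant facts to stuff into a prompt, while an aggregate like a revenue total has to touch every row. A toy sketch of that contrast (the records and helper names are made up for illustration):

```python
# Toy contrast: RAG-style retrieval returns a few matching facts,
# but an aggregation must scan the entire dataset.
orders = [
    {"customer": "acme", "amount": 1200},
    {"customer": "globex", "amount": 800},
    {"customer": "acme", "amount": 300},
    {"customer": "initech", "amount": 2500},
]

def retrieve(query_customer, k=2):
    """Retrieval: pick up to k facts that match the query."""
    matches = [o for o in orders if o["customer"] == query_customer]
    return matches[:k]

def total_revenue():
    """Aggregation: no shortcut -- every row must be read."""
    return sum(o["amount"] for o in orders)

print(retrieve("acme"))    # a few facts, cheap to feed to an LLM
print(total_revenue())     # 4800 -- requires a full scan
```

A real RAG pipeline would score chunks by embedding similarity rather than exact match, but the shape of the problem is the same: top-k lookup is not a scan.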
So what if you want your LLM to answer business questions, and you want those LLMs to be quasi-accurate? If you have questions like: how many of my customers are spending greater than $1,000 a month? Or which of my customers are most likely to churn, and why? Or what is my company's annualized revenue? You're going to want some degree of accuracy on those. And you're not going to want to just trust the LLM to pull the data off of its training data set, because you might get someone else's annualized revenue or someone else's most-likely-to-churn customers. For that, you need an analytics database. And analytics databases can be very happy, when you're using ducks at least as your analytics database. But you need something that can store large amounts of analytical information and efficiently compute aggregations to answer these questions.
Now, how many of you in the room would classify yourselves as data people, primarily focused on data? All right. How many would classify yourselves as software engineers? You're building things, but maybe with data at times. Okay, so pretty much every software engineer is probably thinking, well, what about Python? What about our elephant here? And it's funny, that elephant-in-the-room joke did not come up as I was preparing this, but the elephant in the room is: Postgres does everything. Why don't we just use Postgres? Well, there are different types of databases. I literally have one slide on this to catch folks up, but the different types of databases are basically transactional and analytics databases, and it's largely based off of how the underlying data is stored and retrieved.
If your underlying data is stored as rows, it's really easy to do inserts of a whole row at a time, and it's really easy to do point lookups. But when you try to do something like an average age over your data stored as rows, you end up reading all the data of every single row in order to do that, and it's very inefficient. On a columnar-storage database, an analytical database, you would just read the age column, and it's much more efficient. So that's why maybe you don't want to use Postgres as your analytical database. It will work; it just might not be the most efficient. We have tons of customers of MotherDuck who end up coming to MotherDuck after using Postgres as their analytical database and realizing it hit scaling limits.
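The layout difference can be sketched in plain Python (data invented for illustration): row storage keeps each record together, columnar storage keeps each column together, so a single-column aggregate only has to read one array.

```python
# Row-oriented layout: one record per entry. Computing the average age
# walks every record and drags along fields we don't need.
rows = [
    {"id": 1, "name": "Ada", "age": 36},
    {"id": 2, "name": "Grace", "age": 45},
    {"id": 3, "name": "Alan", "age": 41},
]
avg_age_rows = sum(r["age"] for r in rows) / len(rows)

# Column-oriented layout: one array per column. The same aggregate
# touches only the "age" array -- the access pattern analytical
# engines like DuckDB are built around.
columns = {
    "id": [1, 2, 3],
    "name": ["Ada", "Grace", "Alan"],
    "age": [36, 45, 41],
}
avg_age_columns = sum(columns["age"]) / len(columns["age"])

assert avg_age_rows == avg_age_columns
```

Same answer either way; the difference is how many bytes had to be read to get it, which is what dominates at analytical scale.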
So how do LLMs interact with data warehouses, databases, data analytics tools? RAG was sort of the v0 for LLMs interacting with external data, but like I said, you can basically just feed in facts, not aggregations. So now we have various different ways of interacting with our databases. First of all, there are built-in tools in LLMs nowadays. LLMs now actually do have calculators: you can feed basic math questions to an LLM and it will use the built-in tool, which is a calculator. They have web search. They have tools for generating files and diagrams, things like that. So the first thing that you can do if you want to interact with a database is build a custom tool that says what to do with the database, and you do that using the agent SDKs. In my examples here I'm using the OpenAI SDK, but there are Claude ones, and I'm sure ones for other LLMs. And then of course, many of these analytics databases do have MCP servers. MCP servers are essentially the tools, but exposed over a wire protocol instead of running directly locally on the server. And this gives you basically the ultimate in flexibility. You can ask LLMs anything; the MCP server just interprets SQL, and that's really all it's going to do. Whereas if you want more control, you build your own custom tools. Regardless, if you are working with an analytics database, nowadays and for the last four decades or something, it's likely to use SQL as its language for interacting with it. It's one of the biggest staying-power technologies out there, I think, one that has just remained for decades.
And your biggest question comes down to: who is going to generate the SQL for interacting with your database? Is it going to be generated by the LLM? This is a tool example here, just a function in a Python file showing: hey, I'm going to execute SQL, and I'm going to define how that executing of SQL happens. I can define it at that level and then let the LLM generate the SQL. Or, if I do want more control over it, I can define a function and have that function be a very specific thing, like getting the sales by region, for instance.
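A minimal sketch of those two styles of tool, independent of any particular agent SDK (the `sales` table and both function names are invented; a real version would register these as function-calling tools with the OpenAI SDK or similar, and point at a real warehouse instead of in-memory SQLite):

```python
import sqlite3

# Illustrative in-memory database standing in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 100.0), ("EMEA", 50.0), ("APAC", 75.0)])

def execute_sql(query: str):
    """Free-form tool: the LLM writes whatever SQL it likes.
    Maximum flexibility, but you inherit its text-to-SQL accuracy."""
    return conn.execute(query).fetchall()

def get_sales_by_region(region: str):
    """Constrained tool: a human encodes the SQL once; the LLM only
    supplies a parameter, so it can't get the query logic wrong."""
    return conn.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    ).fetchone()[0]

print(get_sales_by_region("EMEA"))  # 150.0
```

The trade-off the talk describes is exactly this: the first function is the MCP-style "ask anything" path, the second is the curated per-metric tool.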
Now, you might think that you really want to just let the LLM use SQL. Why are you manually encoding the SQL that the LLM is able to generate? And that would be a reasonable conclusion. But the state of the art of SQL generation from LLMs is still pretty poor. My vision's not amazing, but I think the numbers are in the high 70s percent here for the best LLM, which actually uses multiple LLMs together to get to that score. So 78%, I think, is the accuracy, versus 92% as a human. Now, how many of you would be comfortable producing your board reports, and maybe your public analyses for your company, your quarterly earnings reports, with 78% accuracy? Not me. Heck, I wouldn't even be comfortable with the 92% that a human is accurate. But that is the nature of humans; we are fallible. So we want to talk about different ways that we can improve the accuracy, to supplement what these things are going to do out of the box.
Now, we can add additional information to the context window. So as you're typing a question to an LLM, or sending a question programmatically, you can add additional information to that window. You can use a semantic tool to define the details about what your models look like, or you can basically build a very budget semantic tool yourself, and we're going to show each of these.
But first, I want to talk about the sample data sets that I'm using. First of all, you are all going to see revenue numbers. Those are not real revenue numbers; they have nothing to do with MotherDuck as a company. I did try to use a very artificial data set when I created this presentation. I tried to use Northwind data. Northwind is a canonical data set from Microsoft that SQL Server used back in the day, and it's amazing. It is very well laid out. It's basically customers and their orders, what items were in those orders, and what products the company sells. The problem is that data set is super well documented on the internet. I think Michael from ShadowTraffic emulated that data set in a few hours for us, and he was very kind to do it. So I do want to give a shout out to ShadowTraffic; awesome at generating data sets. However, our actual internal data warehouse at MotherDuck is, I think, 217 different tables. We're a three-year-old company; we have 217 tables in our data warehouse, and I think 41,000 columns in those 217 tables. It's a lot more complex, and it's not well understood because it's not documented on the internet. So that is the data set that I'm using, but I'm fudging all the numbers. So don't take any meaning from the numbers. But one of the biggest things that we do at MotherDuck is ask: what is our annualized weekly revenue?
This is similar to ARR, which you might have heard of. It's called AWR. A lot of us are engineering-background folks, so we don't really like to use a term that's not accurate, and annualized weekly revenue is a fairly accurate term for us. And if you ask OpenAI what it is, it basically says it would be one week of data repeated for all 52 weeks, and that's annualized weekly revenue. Pretty good. It's the right definition, of course. Then I went to the LLM and said, "What is our annualized weekly revenue for today?" And the LLM said, well, the formula is: annualized weekly revenue equals revenue on a day times 365 divided by 7. Anyone know what the problem with that is? What is this divided-by-seven thing? It's basically taking a day's revenue, figuring out what the whole year's revenue would be based off of that day, and then saying, "Oh, we'll just take a month and a half of that," or something like that. This is a completely ridiculous definition. But this is the definition that it used, and then it computed a result. Now, what we are happy about is that it explained what it did. So as a smart human you can then say, oh wait, this seems a little bit wrong. But we don't want it to get it wrong.
So we add additional information to the context window. In this case we basically just did markdown-format files with all of our go-to-market definitions, saying annualized weekly revenue is defined as this: basically the current date's revenue summed with the last 6 days of revenue, multiplied by 52. You'll notice that my definition isn't very precise; it could be more precise. But it's additional helper data for the LLM.
So then I say, what is our annualized weekly revenue for each day in November, and I get this. At the top there, 1.5 million; at the bottom, 11.962 million (the original numbers, before I fudged them). I looked at the bottom number and I was like, "This is right." All right. But why are the other numbers off? Why does it grow so rapidly at the beginning? What happened is the LLM understood what annualized weekly revenue was. It generated a window function in SQL, but it said, "Oh, I'm talking about November," so it excluded all data from prior to November in its calculation. So even as it was calculating, say, November 1st, which obviously needs to include the six prior days that are in a different month, in October, it didn't have access to that data. So it just assumed zeros, and we got our annualized weekly revenue.
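The bug is a window-boundary problem: a trailing 7-day window computed over only November's rows silently pads the first six days with zeros. A sketch of the intended calculation over made-up daily revenue (the dates and figures are invented; this version fails loudly instead of assuming zero):

```python
from datetime import date, timedelta

# Made-up daily revenue keyed by date; note the October days are present.
daily_revenue = {date(2024, 10, 26) + timedelta(days=i): 100.0 + i
                 for i in range(12)}  # Oct 26 .. Nov 6

def awr(day):
    """Annualized weekly revenue: today's revenue plus the prior
    6 days, times 52. Raises if any day of the window is missing,
    instead of silently assuming zero like the bad query did."""
    window = [day - timedelta(days=offset) for offset in range(7)]
    missing = [d for d in window if d not in daily_revenue]
    if missing:
        raise ValueError(f"missing revenue for {missing}")
    return sum(daily_revenue[d] for d in window) * 52

# Nov 1 correctly reaches back into October for 6 of its 7 days.
print(awr(date(2024, 11, 1)))
```

In SQL terms, the fix is making sure the scanned range starts six days before the first reported day, so the `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` style window never runs off the front of the data.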
A really bad calculation that someone may actually miss, depending on how much data they're looking at. So again we say, let's update the context window and say: always use a full seven days of data when calculating AWR. I did that. I ran it. It worked well. I patted myself on the back. Life was good. The next day, as I continued preparing for this talk, I ran it again. It again got the first six days wrong. Feeding it that additional context worked in one run but not in another run. Very intermittent. Then I ran it again and all days were wrong; it generated really crappy SQL. I have no idea, actually; I didn't even dive into what it did. But I did run it again, and it worked fine. So, a very frustrating experience, especially if you're expecting to have accurate business data.
How many of you have kids? I think working with an LLM is like trying to teach calculus to your 2-year-old. It can be very frustrating at times. You don't actually know what is sinking in and what is not sinking in. We do like our ducks to teach all the LLMs how to do accurate SQL; it just won't always happen if you don't have all of the right context. And the context that I provided obviously was not enough. So what can you do to provide additional context?
You can actually use semantic modeling tools like Cube, or the semantic modeling capabilities of BI and analysis tools like Omni and Malloy, and feed it additional data in a lot of detail. You can feed it basically: what are your measures, the types of things that you want to compute? What are the dimensions; how do you want to categorize those measures? What are the filters; do you want to be able to filter by date range or by geography, in these examples? And what are the calculated measures, the measures based off of other measures? You can go and spend a heck of a lot of time out there just defining all of these things, and life will be glorious. Maybe. But there is a lot of time and energy involved in doing that.
Another way you can do it is to take these 200-some-odd tables, as I said, and make a very simple view of the data, a view that looks more like that Northwind data set, so that the system can understand something that's well documented. And luckily, it's really easy to create views in most databases.
The other thing you can do is start commenting on the databases and basically describing your columns. This is sort of the poor person's semantic layer, and I actually generated this from the LLM to start with; then you can go in and modify it and provide additional details. So if you wanted to say, for instance, that this is the join ID, the foreign key basically, between this table and another table, you can describe how they join and what the name of the join key is. You can do all sorts of things like that to make the LLM slightly better.
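A budget version of that semantic layer can be as simple as a hand-maintained dictionary of column descriptions rendered into the prompt. The table names, columns, and descriptions below are invented for illustration; in practice you might store these as database comments and extract them instead:

```python
# Hand-maintained column descriptions -- the "poor person's semantic layer".
column_docs = {
    "orders.customer_id": "Foreign key joining orders to customers.id.",
    "orders.amount_usd": "Order total in US dollars, tax included.",
    "customers.signup_date": "Date the customer account was created.",
}

def schema_context(docs):
    """Render the descriptions as a markdown block for the context window."""
    lines = ["## Column definitions"]
    lines += [f"- `{col}`: {desc}" for col, desc in sorted(docs.items())]
    return "\n".join(lines)

prompt = schema_context(column_docs) + "\n\nQuestion: total revenue per customer?"
print(prompt)
```

The point is less the rendering than the workflow: generate a first draft of the descriptions with the LLM, then correct them by hand where it matters (join keys, units, edge cases).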
The LLMs are still generating SQL without necessarily knowing what SQL syntax they should generate, though. So even if you get it accurate, you might have a lot of queries. We run a database company based off of a consumption-based billing model. I personally love LLMs, because they'll write 20 queries where a human will only have to write one, so the LLM is able to rack up the bills faster. But here I'm going to give you some indications about how to prevent that, which is: teach the model about your SQL dialect, provide performance optimizations, of course, and tell the model what not to do. We actually do publish a full markdown file describing DuckDB SQL and the MotherDuck editions, which you can feed to your LLM in the context window to teach it how DuckDB works, and how to do some types of operations which may be a little bit more esoteric in a particular variant of SQL.
You can also feed it things about how to be efficient. So you can show an example window function here, and show how to use the QUALIFY clause in order to limit the computation, and other efficient patterns for when you're trying to figure out things beyond the ordinary SQL that it might recognize. Lastly, I mentioned you could tell it to avoid certain things. What I started to realize is this looks pretty beautiful: I know the LLM can generate SQL, so I could use the LLM to generate my get-sales-by-region SQL. But I, as the person that understands how our data works, can then say, okay, that is the SQL that represents get sales by region, and essentially create a set of tools for each of our major business metrics, and those tools can then define how to interact with the analytics database.
It still comes down to: what are you doing as an agent interacting with the analytics database? The capabilities of an analytics database are things like transformation: ingesting data from a huge variety of different sources, everything from CSV and Parquet files to JSON and random things off of S3. You can ingest all of that into the database and then transform that data. It can act as your memory if you're doing a lot of analytical functions. So if you're building an agent that needs memory, and you're dealing with a lot of facts, you'd use a vector database if you're trying to store a lot of facts. But if you're trying to actually store data that you're going to analyze as numbers, an analytical database is really great for that memory. And then obviously the computation: the ability to scan across a bunch of data and do analytics.
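A tiny stand-in for that ingest-then-aggregate loop, using only the standard library (the file contents and customer names are invented; a real warehouse would read CSV, Parquet, and JSON from S3 directly and run the aggregation in SQL):

```python
import csv
import io
import json

# Pretend these arrived from two different business systems.
csv_orders = "customer,amount\nacme,1200\nglobex,800\nacme,300\n"
json_orders = '[{"customer": "initech", "amount": 2500}]'

# Ingest both formats into one normalized list of records.
records = [{"customer": r["customer"], "amount": float(r["amount"])}
           for r in csv.DictReader(io.StringIO(csv_orders))]
records += json.loads(json_orders)

# Transform/aggregate: total spend per customer.
totals = {}
for r in records:
    totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]

print(totals)  # {'acme': 1500.0, 'globex': 800.0, 'initech': 2500.0}
```

The normalization step is the part an analytics database does for you at scale; the `totals` dict is standing in for a `GROUP BY` over the ingested data.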
So to give you an example: if we asked something like, which of my customers currently spending over $10,000 are likely to churn, you'd want an analytics database that can understand different data sources, because you'd bring in different data from your business: from HubSpot, from Salesforce, from all of your various business systems. And if you have a product, you'd bring in product data. You want to be able to store temporary state and derived data, because you want some consistency between different runs of the LLM. And scaling up and down as needed: you can easily have runaway costs, and you want to be able to run a bunch of things at once, of course, and keep costs under control. So what would a great analytical database look like for that? First of all, what would it not look like? This is what a lot of data warehouses are: a monolithic, very large engine that is scaled for peak traffic, peak performance. A lot of businesses overspend on idle resources for their data warehouses, and you don't really want to do that. There's autoscaling and things like that that you may have used; a bunch of people actually came up to our booth and talked about how autoscaling on some of the major databases didn't really do what they need. You really don't have your cost scaling with the amount of usage. And you're worried about, you know, these are users, but let's say there are agents spinning up and running against your database; they can execute tons of queries.
So this is where we talk about open-source DuckDB. How many of you have actually used DuckDB? Okay, maybe a third. It's super easy to install. It runs locally as a library on your computer. It does analytics much the same as SQLite might be used to do transactional data. And it is very easy to use; nowadays it occupies maybe 100 megabytes of RAM but has full SQL analytical capabilities. It is very popular, and overall DuckDB gets high scores from everyone in the community in terms of the experience it provides: an approved SQL dialect, a great CSV parser, an awesome library to use within your applications. It's in-process. It's pretty fast; go to ClickBench, which is the benchmark from the ClickHouse folks, and DuckDB oftentimes performs much better than ClickHouse in its own benchmark. And it's a tiny little library. It's easy to use, it scales, and it's free, and that's probably the best thing for a lot of people.
Of course, analytics often requires heavy resources, and it's kind of difficult to predict how much you need. You're on your own to scale it if you're using the DuckDB library, and do you really want to manage a fleet? That's really where MotherDuck comes in as a cloud data warehouse. I don't want to pitch you on all the stuff about MotherDuck, but what I do want to talk about is the differences in design versus what you might be used to for an analytical database. With MotherDuck, we give every user their own duckling. Every user that attaches to a data warehouse has their own compute; that's what the duckling is. It's essentially its own database instance. They're sharing data but have access to their own compute. They can scale vertically as needed.
And this extends to applications. A lot of usage of MotherDuck is customer-facing applications, where the end user is another company that's a customer of ours, or a customer of a customer. And each of those people also gets their own database, in many of the use cases. What really helps there is that this is highly partitioned data. These folks really want to make sure each of their customers' data stays completely separate, and this is great for that. But it also helps prevent collisions on resources: you get predictable performance when each of the users has their own data that spins up and spins down really quickly.
Now, another thing that MotherDuck does which is really awesome for these types of things is something we call dual execution. A user comes in and they send a SQL query to their local DuckDB; the local open-source DuckDB is the client for MotherDuck. They send a SQL query, and the SQL query planner basically decides which parts of this data are local and which parts are remote. So you write what looks like a normal SQL statement, but it knows that, hey, our super-top-secret data is only here on this local computer and the other data is in the cloud, and it sends that part of the SQL query for the join over to the cloud and pulls the data from there. So you can think of this architecture as being really useful for agents, especially if the agent wants to do a lot of low-cost, easy data analytics on low amounts of data: you could just have the agent use DuckDB locally, with zero cost associated with that. But then when the agent needs the power of the cloud and scaling up, it is literally just changing one line, connecting to `md:` instead of connecting to your local file system. DuckDB is really versatile that way. You can use your local file system, you can even use local in-memory, and then with one line you're out actually using the cloud with this dual execution.
There are also ways, if you have an agent that you really want to have its own playground, its own thing to work with: you can do that with zero-copy clones. So you can have this shared data set that everyone has access to, that all the different agents have access to, and you can very easily say, "Okay, give me a copy of that for my local agent to work with." And it's basically a free operation. Then your local agent can say, I'm going to add some additional metadata, I'm going to add some additional data, and then I'm going to eventually push it back so that other folks have access to it. That cloning capability allows you to provide full write access to your agent on your underlying data, if it needs that.
And of course you can scale out. If you have, for instance, an agent that needs access to a database, but that agent needs a lot of power, you can basically use these read-replica ducklings and say, hey, I'm going to have four other instances, eight other instances, twelve other instances (I think we're up to 16 in the UI right now), all of these other instances out there to do my analytical workload. And only this agent has access to do that; the other agents do not.
And you choose all of that stuff in the configuration. So, for those who can't see: basically I have three agents on the bottom, and then the core data warehouse service. The core data warehouse service would be writing all the data that the agents need to access, and then the three different agents we can configure differently. The core data warehouse has a jumbo instance. We configure one of the agents here that really needs a lot of power to have jumbos, and then our churn-predictor agent maybe needs to look at a lot of things at once, so we give it many ducklings, many instances, to be able to do that. So, all the cool kids are building agents nowadays. You want an analytical database that can help answer different types of questions. You want hyper-tenancy that allows the AI applications to have a sandboxed amount of compute and prevent runaway costs. And DuckDB and MotherDuck can really help there.
that's all I have for today but uh thank you very much [applause] Thank you, Ryan. And um we have about we can have five minutes of Q&A. Does
anybody have any question for Ryan?
>> I can even plug this back in.
>> No questions? Questions! You can all get free ducks at our...
>> Oh, sorry, a question. So, thank you, Ryan, for that. So basically, it's suitable: MotherDuck makes it easier for agents to ask questions. But can you give us some examples of, you know, actual stacks, how people use it? Because you need to write a bunch of stuff. Do you use LangChain? Do you use Pydantic? Like, what do people put on top of MotherDuck agents?
>> Yeah, I think the biggest use case that we've seen is really around business analytics. So we have customers who ingest data for all of their customers. It's a SaaS application, as an example, and it ingests all the business data for various of their customers, and then they allow business questions to be asked. And they've done a lot more work on the semantic side and such than I did, so hopefully they're producing more accurate answers. But so then they have a fleet of agents that are out there processing the new business data that's coming in and answering questions in sort of a natural-language form for their business users. But we have all sorts of other different types of customers, the logistics space I mentioned earlier, you know, package routing, things like that, where all this type of stuff would be useful to recognize patterns from the aggregated data.
>> Okay, great. One more question, a little bit late, but I don't know if you mentioned whether [clears throat] you support vector search, or is there like a vector-database option for it?
>> Yeah, so there is some work on vector search within DuckDB. I wasn't really covering that, but if you look it up, I think it's called vss, an extension for DuckDB to do some vector search. But yeah, I was focusing more on the aggregation than the looking up of facts, but that capability does exist.
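For reference, the extension Ryan is pointing to is DuckDB's `vss` (vector similarity search) extension. A minimal sketch with toy three-dimensional embeddings (real embeddings would come from a model and be much wider):

```sql
INSTALL vss;
LOAD vss;

-- Embeddings are stored as fixed-size FLOAT arrays.
CREATE TABLE docs (id INTEGER, vec FLOAT[3]);
INSERT INTO docs VALUES
    (1, [1.0, 2.0, 3.0]),
    (2, [5.0, 5.0, 5.0]);

-- Optional HNSW index to accelerate nearest-neighbor lookups.
CREATE INDEX docs_idx ON docs USING HNSW (vec);

-- Nearest neighbor by Euclidean distance.
SELECT id
FROM docs
ORDER BY array_distance(vec, [1.1, 2.0, 2.9]::FLOAT[3])
LIMIT 1;
```

This covers the "looking up of facts" side; the aggregation workloads discussed in the talk run through ordinary SQL on the same engine.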
>> Question?
>> No.
>> All right. Well,
>> no.
>> Thank you very much.
>> Oh, question.
>> Okay. Thank you.
>> Okay.
>> Um, how do you handle multimodal, you know, sort of images, data, etc.? Is that something that's comfortable in how you handle it? And then I didn't quite get the local versus cloud. Is this just the database engine, or is this actually caching underlying data and storage? I just didn't quite get it, and I may have just missed it.
>> Sure. I'm going to answer those as two separate questions: the one of local versus cloud, and then the one of multimodal types of data. On local versus cloud: basically, if you have an application server that is running your agents, or you have a Lambda function that is running your agents, as an example, that essentially becomes the client, and MotherDuck is the server. On the client, you're running a full copy of DuckDB. And because DuckDB is so lightweight, less than 100 megabytes of memory space that it's occupying, you can do everything locally that you want with data, as long as you're within those resource constraints. Does it create a local cache? A lot of people do do that type of work, so they pull data down and use a local cache, but right now it's not an automatic kind of functionality; you would have to manually decide to pull it down. It's really easy, though, because you can do it all in, like, a CTAS statement, create from select star from a table, bring it local, do a bunch of computation for free, and then push back whatever you want to the cloud. Um, the
other thing, on the multimodal side: DuckDB as an engine, and thus MotherDuck as well, does allow you to use data lakes and lakehouses, where in a data lake like S3 you're often storing these other types of formats alongside your tabular data. But we don't have any special capabilities built in to process those types of data. There is a huge ecosystem, a huge partner network. So there are folks like BEM, which essentially offers an API that allows you to take the unstructured data, like a PDF or an image, and convert it into a structured form for your analysis. And they've done workshops on how you can do that with MotherDuck, etc. So there are ways, but it's not a built-in capability.
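The local-cache workflow described a moment ago (pull data down with a CTAS statement, compute locally, push results back) can be sketched in DuckDB SQL. The `md:` attach string is MotherDuck's scheme for attaching a cloud database, but the database, table, and column names here are hypothetical:

```sql
-- Attach a MotherDuck database alongside a local DuckDB file
-- (names are illustrative, not from the talk).
ATTACH 'md:my_warehouse' AS cloud;
ATTACH 'local_cache.duckdb' AS local;

-- Pull a working set down once via CTAS (CREATE TABLE AS SELECT).
CREATE TABLE local.events AS
    SELECT * FROM cloud.events
    WHERE event_date >= DATE '2025-01-01';

-- Further computation runs locally, on the client's own compute.
SELECT region, count(*) AS n
FROM local.events
GROUP BY region;

-- Push back whatever results you want to keep in the cloud.
CREATE TABLE cloud.events_by_region AS
    SELECT region, count(*) AS n
    FROM local.events
    GROUP BY region;
```

As noted in the answer, none of this caching is automatic: the client decides what to pull down and what to push back.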
>> Okay, great. And we have a question here. Uh, how do you handle the security when the data can be local?
>> Sorry, what? I didn't see where the question was coming from to start with. Um, and, uh...
>> Ah, thank you.
>> Um, and your question is how to deal with security when the data is local. Can you expand on that a little bit more?
>> Well, so if you have the data locally, you probably can clone it or something, you know, bring it home or whatever. So how do you handle managing the local data? It's convenient, but from the perspective of security, what do you do?
>> Yeah. I mean, I think in larger organizations that may become more of a concern. I think most of the organizations that we're working with today consider that a benefit as opposed to a liability. But I see the liability there, in terms of how you prevent a local user from downloading a bunch of data; we have not built controls around that. Where we consider it sort of a value-add that you can run it locally is the idea that you can keep only some data local and not put it into the cloud, and a lot of folks do appreciate that as an ability. But yeah, it would allow a local user to select a bunch of data out of the database and dump it out. That is just part of the architecture, and I'm sure there will be constraints built in the future for organizations where that matters.
>> Great. Thank you. I think if there's no more questions, we actually have another speaker supposed to start in three minutes. So if you still have more questions [clears throat] for Ryan, please, you can meet him outside here in the hallway or near the interview room as well. So thank you all for joining Ryan's talk. Thank you, Ryan.