Building AI for Petabyte-Scale Enterprise Data (TextQL CEO)
By The MAD Podcast with Matt Turck
Summary
## Key takeaways

- **Bridging Consumer AI and Enterprise Data**: Consumer AI tools excel with spreadsheets and PDFs, while traditional enterprise systems require extensive configuration. TextQL aims to bridge this gap by enabling AI agents to query petabytes of enterprise data across numerous tables with zero setup. [00:23], [01:20]
- **Iterative Product Development: 7 Rebuilds in 2.5 Years**: Building effective AI for large-scale data involved significant iteration. TextQL rebuilt their product seven times over 2.5 years, exploring approaches from catalogs and fine-tuning to vector stores before arriving at their current architecture. [00:44], [09:48]
- **Beyond Vectors: Scaling Context with Git**: Instead of relying solely on vectors, TextQL scales its context repository using a Git-based approach. This allows agents to write PRs, creating a form of 'infinite memory' without the limitations of vector embeddings. [05:00], [05:19]
- **Non-Deterministic SQL for Complex Data**: Unlike systems that aim for a single, perfect SQL query, TextQL's agents explore the database, form hypotheses, and test them. This iterative, non-deterministic approach allows them to handle the messiness of enterprise data and recover from errors. [05:51], [06:15]
- **Custom MCP Servers for BI Tool Integration**: When vendor solutions failed, TextQL built custom MCP servers for tools like Tableau, Looker, and Snowflake. This was necessary because existing solutions were not user-friendly enough for seamless integration with their AI agents. [10:36], [10:40]
Topics Covered
- Lowering Data Analysis Costs Unlocks Hyper-Granular Opportunities.
- Why One-Shot SQL AI Fails: Agents Must Explore.
- From Reporting to Proactive, Self-Optimizing Business Intelligence.
- Building Enterprise Data AI: An Immense Engineering Challenge.
- Zero-Config AI Unlocks Petabyte-Scale Data Insights.
Full Transcript
Hi everyone, my name is Ethan. I'm the CEO and co-founder of TextQL, and we build agents for really, really big data. This presentation is called "Building AI that can query 100,000 tables with petabytes of data and zero configuration," because I was told that being very specific generally helps people understand what you actually do.

So what do we actually do? Historically, there are two types of AI for data that everyone has heard of. First, there's the classic infrastructure for really big data, meaning data that can't fit on your computer: at roughly 16 gigs, your machine would fall over if you tried to do any heavy manipulation on it. So you need very specialized systems: Oracle-style ERP, CRM, and HR systems, data warehouses, big BI tools. All of these have very high configuration costs. You have to migrate into them, and when they offer an AI product, it only works for the data inside their environment. Then there's the other class of data and AI products that everyone here has probably played with: Claude, Notion, Glean. These platforms do really well with PowerPoints, presentations, spreadsheets, PDFs, and Word docs that fit on your computer, but they don't really let you access data that runs to petabytes, because there's a lot of work that goes into that.

That's the gap we try to close. We try to make it possible to deploy AI where you can ask questions like: "Can you connect my CRM data to my Snowflake data and to all the emails I sent last week, and tell me which of my customers are most likely to churn?" Or: "Can you validate that all these numbers are correct, and then commit a PR to GitHub to make sure the SQL migrations we're doing are all correct?"
That is a very large amount of overhead. We work with very big companies and very compliance-heavy environments. We deploy into hospitals and financial services, which is a super on-prem, infra-heavy world, and we work with some of the largest AI labs, because that's how hard it is to build out all this infrastructure. In terms of use cases, we do financial services and healthcare predominantly, and we do a lot of transformation work: making sure people can build pipelines, move data around, and really work with big data, for lack of a better word.

You might be thinking: why do I care about this? Well, right now, if the unit cost of a data request, asking "hey, can you figure out which customer is most likely to churn?", is a month of time from a Stanford PhD in data science that you pay $300,000 a year, that sets a floor on how valuable the opportunity has to be before it's worth analyzing. Obviously, if you can bring down the cost of that analysis, if you can train an agent to do it for a tenth, a hundredth, a thousandth of the cost, it unlocks a lot more opportunities. It lets you take the analysis you're doing today, say figuring out which of your customers are going to churn per month, per state, and do it per city, per week, per zip code, per day. Really, per LinkedIn post from your champion at a particular account who is leaving and posts "I'm very sad," from which you read that they were fired, that all their initiatives are going to shut down, and you want to anticipate that in advance. That's a very low-latency thing, very hard and very expensive. That's kind of what we do.
So, to walk you through an example: let's say there's a finance guy on your team, Dylan, who is in the audience with us today, and we have to do some board reporting. It's five minutes until the board meeting, and he sends me this thing and says: "Can you look at this spreadsheet? There's an image on it. I have no idea where these numbers came from. Can you double-check that they're correct?" And I don't happen to have a Stanford PhD on hand who can compress a month of work into twenty minutes.

This is actually our real prod data. It's sitting on top of a data warehouse of, I think, about 10,000 tables. It worked the first time with no configuration; it's somewhat cached now, so it won't be as sexy. So I take this, I need to prep for the meeting, and I go: "Hey Ana, can you double-check whether this is true? Matt, the man on my team, told me this is correct, and I'm going to be really mad if it's not, because the board is going to fire me. Tell me if they will be happy or not, and if not, what I can do to drive the usage up."

I used to run a data team of about 10 people at a startup, and I was responsible for every single one of these requests. If I got a request like this, I would want to gouge my eyes out, because there's no context. There's an image. There's no lineage, no infrastructure, no metadata documentation whatsoever. I have no idea what table they pulled it from. I don't know if it was Segment logs or Snowflake or something else. But you'll see Ana start to reply and start telling me the work that she is doing.
So if I go into the environment, you'll see that the agent has received this. I can ask follow-up questions, and once it's done replying it will generate a report and send it back to Slack. I actually didn't think this far ahead, because if I had, I would have realized we're now going to stare at this run for a little bit. So let's use the time to walk through how we manage to do this and what's happening under the hood: how can it do this over hundreds of thousands of tables without knowing anything about your data warehouse?

The first thing it does is look at a context repository that we maintain. That's actually a GitHub context repo that we let the agent write PRs to, which gives it a kind of memory you can scale to effectively infinite size. It will just run, say, 50 greps (if anyone here is an engineer) against the repo to find all the keywords it's looking for. There are no vectors; it's entirely syntactic, so it can be very messy in its approach to finding its content. The repo holds a ton of arbitrarily complex calculations, example queries, all the cached stuff. And at the end of any session you can tell the agent "save this to your context, make sure it's always there," it will write a PR, you can merge that, and it will just keep getting smarter.
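To make that concrete, here is a minimal sketch of what a grep-style lookup over a Git context repo, plus saving a new note as a reviewable commit, could look like. The repo path, file layout, and helper names are illustrative assumptions, not TextQL's actual implementation.

```python
# Hypothetical sketch: syntactic (grep-based) search over a Git "context repo",
# plus saving new context on a branch that a human can later review as a PR.
import subprocess
from pathlib import Path

CONTEXT_REPO = Path("/srv/context-repo")  # assumed local clone of the context repo

def search_context(keywords: list[str], max_hits: int = 50) -> list[str]:
    """Run one `git grep` per keyword: no vectors, purely syntactic matching."""
    hits: list[str] = []
    for kw in keywords:
        proc = subprocess.run(
            ["git", "grep", "-n", "-i", kw],
            cwd=CONTEXT_REPO, capture_output=True, text=True,
        )
        hits.extend(proc.stdout.splitlines()[:max_hits])
    return hits

def save_context_note(title: str, body: str) -> None:
    """Write a new note and commit it on a branch; merging it is the 'memory' step."""
    note = CONTEXT_REPO / "notes" / f"{title}.md"
    note.parent.mkdir(exist_ok=True)
    note.write_text(body)
    subprocess.run(["git", "checkout", "-b", f"context/{title}"], cwd=CONTEXT_REPO)
    subprocess.run(["git", "add", str(note)], cwd=CONTEXT_REPO)
    subprocess.run(["git", "commit", "-m", f"Add context note: {title}"], cwd=CONTEXT_REPO)
    # Opening the actual pull request would go through the Git host's API.
```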
Now, when it gets into SQL mode, this is the thing we do very differently from every other company in the world that claims to have something like this: we don't try to one-shot the SQL. What that means is we don't try to write SQL, hit run, and hope the query is perfectly correct, because anybody in the room who has ever written SQL knows that if I give you 10,000 pages of documentation and a SQL terminal, ask you "hey, can you help me figure out which of our customers are going to churn?", and only let you hit the run button once, you're going to get the wrong answer. It is practically impossible. That is not how you go about generating SQL.

So instead we have the agent explore the database. It looks at the information schema; for anyone who isn't SQL-literate, that means it is literally saying "tell me every table you have," and then "tell me every column you have." It scans through the whole thing, and anything that's not relevant gets dumped out of context. So it can run for an incredibly long time as it steps through all of this stuff.
While it's running, it's pulling the data and forming hypotheses. It might look at a datetime field, because data warehouses are incredibly messed-up environments. Every enterprise has something like 50 of them, plus another 30 different BI tools from the past 35 years, three different ERPs, 17 different CRMs. So there's a lot of hypothesis testing the agent steps through as it explores. It has to go: plot the datetime columns; oh, there's a drop here, a data pipeline was probably broken. A lot of it is hypothesis testing so it can recover from all of these things. By now I think it has written some plots and visualizations. Eventually it references our CRM, and it can directly plug into any other data sources you have, which is very fast to configure, because it's making an API call the exact same way anything else makes an API call.
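One of those hypothesis checks, plotting a datetime column to spot a broken pipeline, could look roughly like this. The table and column names are invented for the example, and the threshold is an arbitrary assumption.

```python
# Hypothetical sketch: group a datetime column by day and flag a sudden drop in
# row counts, which often signals a broken data pipeline.

def find_pipeline_gaps(conn, table: str, ts_col: str, drop_ratio: float = 0.2):
    cur = conn.cursor()
    cur.execute(
        f"SELECT CAST({ts_col} AS DATE) AS day, COUNT(*) "
        f"FROM {table} GROUP BY 1 ORDER BY 1"
    )
    rows = cur.fetchall()
    suspicious = []
    for (prev_day, prev_n), (day, n) in zip(rows, rows[1:]):
        if prev_n > 0 and n < prev_n * drop_ratio:
            suspicious.append((day, n, prev_n))  # volume fell off a cliff on this day
    return suspicious
```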
And while it's doing all this, I can show you what happens after a run like this completes. I would take a completed playbook or report, for example this daily customer upsell analysis, which tells me which customer I should reach out to and ask for more money. Once I know that it steps through its 50 or so steps and gives me a good result, I can schedule it to run every day, every week, or every time a customer sends me an email, and give me a new report like this. That's how you go from "ask a question and it tells you what this number is," to "ask a question and it tells you why this number went up," to eventually "given that you know why it went up, tell me what I can do to make the number go up or down, depending on whether it's a good number or a bad number." And that's a very useful thing, because once you do that, you make way more money by having the thing run on a loop and just find you more money.
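As a sketch of what scheduling a vetted playbook could look like, here is a hypothetical configuration; the field names and trigger format are illustrative assumptions, not TextQL's actual schema.

```python
# Hypothetical playbook schedule: run daily, weekly, and whenever a customer emails.
daily_upsell_playbook = {
    "name": "daily-customer-upsell-analysis",
    "prompt": "Rank customers by upsell likelihood and explain the top drivers.",
    "triggers": [
        {"type": "cron", "expression": "0 7 * * *"},       # every morning
        {"type": "cron", "expression": "0 9 * * MON"},     # weekly roll-up
        {"type": "event", "source": "email", "filter": "from:customer"},  # on inbound mail
    ],
    "deliver_to": ["slack:#revenue"],
}
```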
OK, it's done. It's finished the final report and come back with a whole write-up. It says the expected numbers are essentially correct; there's about a 3% delta, I think because we changed the way we count something. It also flags that the August number looked way off, because this thing was presented to me in August. If you read the text, it says the chart displayed August-to-date, with August representing data through approximately August 13th, which is probably correct. I was going to assume it's correct because there's no reason for it not to be, but fine, I'll double-check when Matt actually sent me this: it was August 14th, so we're off by a day. It said approximately the 13th; it's actually the 14th. And the summary comes out to: revenue growth and ACU usage roughly correspond to the figures; Matt's data is partially correct but incomplete, likely working with mid-month data; the performance is exceptional and celebration-worthy. This is great, I'm not going to get fired, and now I'm really happy. That's approximately the demo.
For people who are curious about how much deeper we go: there's a lot of infrastructure we had to build to make this possible. We rebuilt this product seven times over the first two and a half years of the company. We tried catalogs, fine-tuning, SQL terminals, vector stores, catalogs with vector stores, sandboxes, and SQL stores. We added DuckDB, we added Polars. We started with the sandbox so we could execute code, and, as I've recently been reminded, we were the first company to publicly release a version of code interpreter that wasn't in beta. Then the data got really big, so it couldn't just be pandas in Python; it had to be Polars. Then the data got even bigger, so we needed DuckDB, then pushdown to the data warehouse; then it got bigger still and we needed joins on the fly on serverless ClickHouse infrastructure. Then we had to roll a semantic layer, because some people wanted deterministic queries to come out every time. Then some of those queries took a really long time, so we added caching and acceleration. And then some people had Tableau and Looker and Power BI and all those things, so we had to build our own MCP servers for all those platforms, because Databricks', Snowflake's, and Tableau's MCP servers are, let's say, maybe not the most usable things in the world.
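To give a flavor of what a custom MCP server wrapping a BI tool can look like, here is a minimal sketch assuming the open-source MCP Python SDK's FastMCP helper; the Tableau calls are placeholders, not TextQL's actual integration.

```python
# Hypothetical sketch of an MCP server exposing a BI tool to an agent.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("tableau-bridge")

@mcp.tool()
def list_workbooks(project: str) -> list[str]:
    """Return workbook names in a Tableau project (placeholder implementation)."""
    # A real server would call Tableau's REST API here and normalize the response
    # so the agent sees one consistent interface across Tableau, Looker, Power BI, etc.
    return ["Q3 Revenue", "Churn Dashboard"]

@mcp.tool()
def get_view_data(workbook: str, view: str) -> str:
    """Return the underlying data for a view as CSV text (placeholder implementation)."""
    return "customer,arr\nacme,120000\n"

if __name__ == "__main__":
    mcp.run()
```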
In summary, if you want to roll this yourself at your own company, a company like Ramp or anywhere with exceptional engineers, you should do that. It's totally doable with today's technology and GPT-5. All you have to do is: build your own table format, natively integrated with your own compute format, that can work across multiple different large data sources. Build your own MCP servers for Tableau's 17 different ways of ingesting data, four out of four of which break; we actually found a roundabout way to get this to work, so much so that the Tableau team currently tags us into accounts to help figure out their AI story for customers who want to use Tableau AI but can't get it to work. We rolled our own semantic layer: it's compilable and transpilable from all the existing semantic layers, like LookML, Cube, and MetricFlow, and it has a really cool GUI that looks kind of wild when you put over a thousand objects onto the same canvas at the same time.
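As a simplified illustration of what "transpilable from existing semantic layers" can mean, here is a hypothetical metric definition, the sort of thing LookML, Cube, or MetricFlow express, compiled down to SQL; the classes and the compile step are assumptions for the example.

```python
# Hypothetical sketch: a semantic-layer metric compiled to SQL at a chosen time grain.
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    table: str
    expression: str       # e.g. "SUM(amount)"
    time_dimension: str   # e.g. "closed_at"

def compile_to_sql(metric: Metric, grain: str = "month") -> str:
    return (
        f"SELECT DATE_TRUNC('{grain}', {metric.time_dimension}) AS period, "
        f"{metric.expression} AS {metric.name} "
        f"FROM {metric.table} GROUP BY 1 ORDER BY 1"
    )

revenue = Metric("revenue", "analytics.orders", "SUM(amount)", "closed_at")
print(compile_to_sql(revenue, grain="week"))
```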
And we built our own agent builder, which is very different, because we're running a more non-deterministic agent; modality-wise it looks a lot more like Claude Code. The right UI builder for something that is going to be less if/else statements and more "use best judgment and run for 2 hours or 10 hours," as test-time compute kicks in for the latest generation of models, generally wants more logic on the trigger, way more instructions in the prompt environment, and an easier iteration cycle for testing, because you're going to be running these models for hours and hours. This is not the best UI in the world, but I'm generally pretty bearish on canvas-based UIs for agent building.
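As a rough sketch of that shape, deterministic logic on the trigger, judgment in the prompt, and a cheap test loop, an agent definition could look like the hypothetical example below; none of these field names are TextQL's actual builder.

```python
# Hypothetical agent definition: logic lives on the trigger, judgment lives in the prompt.
churn_agent = {
    "trigger": {                      # the deterministic part
        "event": "crm.account_updated",
        "condition": "account.arr > 50000 and account.health_score < 0.4",
    },
    "prompt": (                       # the judgment, instead of if/else branches
        "Investigate why this account's health dropped. Explore the warehouse, "
        "check usage, support tickets, and recent emails, and propose next steps. "
        "Verify every number before reporting it."
    ),
    "max_runtime_hours": 10,
    "eval_suite": "tests/churn_agent_cases.yaml",   # easier iteration cycle for testing
}
```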
Because we have a semantic layer, we also need tools to query the semantic layer. So we rolled our own version: we took something like Looker's metric explorer and Palantir's object explorer and combined them into one thing that works on the same surface area. Sometimes business users want to ask "get me all the customers with all these attributes," and that's an object question, because every row is a noun. And sometimes people ask "show me how x, y, and z metrics change over time, broken out by 17 different objects that come from 35 different tables," which is also going to be horrible, but it's a totally different type of math to get correct.
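To show the two shapes side by side, here are hypothetical examples of each kind of query; the table and column names are invented for illustration.

```python
# Hypothetical sketch of the two question shapes the text distinguishes.

OBJECT_QUESTION = """
-- "Get me all the customers with these attributes": every row is a noun.
SELECT c.customer_id, c.name, c.segment, c.arr
FROM analytics.customers AS c
WHERE c.segment = 'enterprise' AND c.arr > 100000
"""

METRIC_QUESTION = """
-- "Show me how these metrics change over time, broken out by object": rows are aggregates.
SELECT DATE_TRUNC('week', o.closed_at) AS week,
       c.segment,
       SUM(o.amount)                 AS revenue,
       COUNT(DISTINCT o.customer_id) AS active_customers
FROM analytics.orders AS o
JOIN analytics.customers AS c USING (customer_id)
GROUP BY 1, 2
ORDER BY 1, 2
"""
```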
I know a lot of people in the world offer AI for SQL, AI for big data, AI for data warehouses. We work with a lot of customers who are two and a half years into building semantic models for Snowflake, and they can get it to work for exactly 20 tables. So, for anyone here who cares about asking questions of their data, getting decision intelligence, and building AI that can work with really big data: if our agent can't make sense of your data warehouse, even one with over 100,000 tables, in any data environment, with no configuration time, and it can't build its own semantic model, do all the data integration for itself in the same loop, do all the PRs, change your dbt models, and do everything else required to get it to a place where it can analyze all your data, I will buy out your entire data team with our VC dollars. We've made this offer, I think, about 20 times, and nobody's taken us up on it. Maybe one day we will come across something horrible; I'm guessing it's probably going to be a manufacturing company with IoT devices and infinite join complexity. But until then, thank you for watching.