
Building AI for Petabyte-Scale Enterprise Data (TextQL CEO)

By The MAD Podcast with Matt Turck

Summary

## Key takeaways

- **Bridging Consumer AI and Enterprise Data**: Consumer AI tools excel with spreadsheets and PDFs, while traditional enterprise systems require extensive configuration. TextQL aims to bridge this gap by enabling AI agents to query petabytes of enterprise data across numerous tables with zero setup. [00:23], [01:20]
- **Iterative Product Development: 7 Rebuilds in 2.5 Years**: Building effective AI for large-scale data involved significant iteration. TextQL rebuilt their product seven times over 2.5 years, exploring approaches from catalogs and fine-tuning to vector stores before arriving at their current architecture. [00:44], [09:48]
- **Beyond Vectors: Scaling Context with Git**: Instead of relying solely on vectors, TextQL scales its context repository using a Git-based approach. This allows agents to write PRs, creating a form of 'infinite memory' without the limitations of vector embeddings. [05:00], [05:19]
- **Non-Deterministic SQL for Complex Data**: Unlike systems that aim for a single, perfect SQL query, TextQL's agents explore the database, form hypotheses, and test them. This iterative, non-deterministic approach allows them to handle the messiness of enterprise data and recover from errors. [05:51], [06:15]
- **Custom MCP Servers for BI Tool Integration**: When vendor solutions failed, TextQL built custom MCP servers for tools like Tableau, Looker, and Snowflake. This was necessary because existing solutions were not user-friendly enough for seamless integration with their AI agents. [10:36], [10:40]

Topics Covered

  • Lowering Data Analysis Costs Unlocks Hyper-Granular Opportunities.
  • Why One-Shot SQL AI Fails: Agents Must Explore.
  • From Reporting to Proactive, Self-Optimizing Business Intelligence.
  • Building Enterprise Data AI: An Immense Engineering Challenge.
  • Zero-Config AI Unlocks Petabyte-Scale Data Insights.

Full Transcript


Hi everyone, my name is Ethan. I'm the CEO and co-founder of TextQL, and we build agents for really, really big data. This presentation is called "Building AI that can query 100,000 tables with petabytes of data and zero configuration," because I was told that being very specific generally helps people understand what you actually do.

So what do we actually do? Historically, there are two types of AI for data that everyone has heard of. First, there's the classic infrastructure for really big data, meaning data that can't fit on your computer. At roughly 16 gigabytes, your computer would shut down if you tried to do any heavy manipulation on it. So you need very specialized systems: Oracle-style ERP, CRM, and HR systems, data warehouses, big BI tools. All of these have very high configuration costs. You have to migrate into them, and when they offer an AI product, it only works for the data inside their own environment. Then there's the other class of data and AI products that everyone here has probably played with: Claude, Notion, Glean. These platforms do really well with PowerPoints, presentations, spreadsheets, PDFs, and Word docs that fit on your computer, but they don't let you access data at petabyte scale, because there's a lot of work that goes into that.

That's the gap we try to close. We make it possible to deploy AI that you can ask questions like: can you connect my CRM data to my Snowflake data and to all the emails I sent last week, and tell me which of my customers are most likely to churn? Or: can you validate that all these numbers are correct, and then even commit a PR to GitHub to make sure the SQL migrations we're doing are correct? That is a very hard amount of overhead. We work with very big companies and very compliance-heavy environments. We deploy into hospitals; financial services is a super on-prem, infra-heavy environment; and we work with some of the largest AI labs, because that's how hard it is to build out all this infrastructure. In terms of use cases, we do financial services and healthcare predominantly, and we do a lot of transformations: making sure people can build pipelines, move data around, and really work with big data, for lack of a better word.

You might be thinking: why do I care about this? Well, if the unit cost of a data request, asking "hey, can you figure out which customer is most likely to churn?", is a month of time from a Stanford PhD in data science whom you pay $300,000 a year, that sets a floor on how valuable an opportunity has to be before it's worth analyzing. Obviously, if you can bring down the cost of that analysis, if you can train an agent to do it for a tenth, a hundredth, a thousandth of the cost, it unlocks a lot more opportunities. It lets you take the analysis you're doing today, like figuring out which of your customers will churn per month, per state, and do it per city, per week, per zip code, per day. Or really, per LinkedIn post from your champion at a particular account who is leaving and says "I'm very sad," and you read from that that they were fired, and therefore all their initiatives are going to shut down, and you want to anticipate that in advance. That's a very low-latency thing that's very hard and very expensive. That's kind of what we do.

To walk you through an example: let's say there's a finance guy on your team, Dylan, who is in the audience with us today, and we have to do some board reporting. It's five minutes until the board meeting, and he sends me a spreadsheet with an image on it and says: can you look at this? I have no idea where these numbers came from. Can you double-check that they're correct? And I don't happen to have a Stanford PhD on hand who can compress a month of work into 20 minutes. So this is our real prod data, on top of a data warehouse of about 10,000 tables. It worked the first time with no configuration; it's somewhat cached now, so it won't be as sexy.

So I take this, say I need to prep for the meeting, and go: "Hey Anna, can you double-check if this is true? Matt, the man on my team, told me this is correct, and I'm going to be really mad if it's not, because the board is going to fire me. Tell me if they will be happy or not, and if not, what I can do about it to drive usage up." I used to run a data team at a startup with about 10 people, and I was responsible for every single one of these requests. If I got a request like this, I would want to gouge my eyes out, because there's no context. There is an image. I have no idea what table it came from. There's no lineage, no infrastructure, no metadata documentation whatsoever. I don't know if it came from Segment logs or Snowflake or something else. But you'll see Anna start to reply and start telling me the work that she is doing.

If I go into the environment, you'll see the agent has received this. I can ask follow-up questions, and once it's done replying it'll generate a report and send it back to Slack. I didn't actually think this far ahead, because if I had, I would have realized we're now going to stare at this run for a little bit, so maybe let's walk through how we managed to do this and the kinds of things happening under the hood. How can I do this over hundreds of thousands of tables without knowing anything about your data warehouse?

Well, the first thing it does is look at a context repository that we maintain. That's actually a GitHub context repo that we let the agent write PRs to. This gives it a memory concept you can scale to effectively infinite amounts. It will just run something like 50 greps, if anyone here is an engineer, against the repo to find all the keywords you're looking for, without using vectors. There are no vectors; it's entirely syntactically oriented, so it can be very messy in its approach to finding its content. The repo can hold arbitrarily complex calculations, example queries, all the cached stuff, and at the end of any session you can tell the agent, "save this to your context, make sure it's always there." It will write a PR, you can merge it, and the system will just keep getting smarter.
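The talk doesn't show the internals, but a minimal sketch of the idea, keyword search over a git-backed context repo instead of vector retrieval, might look like this (the repo path and keyword list are hypothetical):

```python
import subprocess
from pathlib import Path

CONTEXT_REPO = Path("context-repo")  # hypothetical local clone of the GitHub context repo

def grep_context(keywords: list[str], max_hits: int = 50) -> list[str]:
    """Purely syntactic retrieval: run one grep per keyword over the repo,
    no embeddings or vector store involved."""
    hits: list[str] = []
    for kw in keywords:
        # -r: recurse, -i: case-insensitive, -l: list matching files only
        proc = subprocess.run(
            ["grep", "-r", "-i", "-l", kw, str(CONTEXT_REPO)],
            capture_output=True, text=True,
        )
        hits.extend(proc.stdout.splitlines())
    # Deduplicate while preserving order, cap what goes into the agent's context
    seen, ranked = set(), []
    for path in hits:
        if path not in seen:
            seen.add(path)
            ranked.append(path)
    return ranked[:max_hits]

# e.g. keywords the agent extracted from "double-check this churn number"
print(grep_context(["churn", "monthly_active_accounts", "board_report"]))
```

In the same spirit, "save this to your context" would amount to the agent committing a new file on a branch and opening a PR, which is ordinary git plumbing rather than anything model-specific.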

Now, when it gets into SQL mode, this is the thing we really do differently from every other company in the world that claims to have something like this: we don't try to one-shot the SQL. That means we don't write SQL, hit run, and hope the query is perfectly correct, because anybody in the room who has ever written SQL knows that if I give you 10,000 pages of documentation and a SQL terminal, tell you "help me figure out which of our customers are going to churn," and only let you hit the run button once, you're going to get the wrong answer. It is literally impossible. That is not how you go about writing SQL. So instead, we have the agent explore the database. It looks at the information schema, and for anyone who isn't SQL-literate, that means it is literally saying: tell me every table you have, and then tell me every column you have. It scans through the whole thing, and anything that isn't relevant, it dumps out of context. That lets it run for an incredibly long time as it steps through all of this stuff.
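For anyone who wants to see what that exploration loop amounts to, here is a rough sketch: the first calls are just information_schema queries, and anything that doesn't match the question's keywords gets dropped before the agent goes deeper. The table names, keywords, and pruning heuristic below are made up for illustration.

```python
# Step 1: "tell me every table you have" / "tell me every column you have".
LIST_TABLES_SQL = """
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_type = 'BASE TABLE'
"""

LIST_COLUMNS_SQL = """
    SELECT table_schema, table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = %(schema)s
"""

def prune_tables(tables: list[tuple[str, str]], question_keywords: list[str]) -> list[tuple[str, str]]:
    """Step 2: dump anything obviously irrelevant out of context,
    so a 100,000-table schema shrinks to a handful of candidates."""
    keep = []
    for schema, name in tables:
        haystack = f"{schema}.{name}".lower()
        if any(kw.lower() in haystack for kw in question_keywords):
            keep.append((schema, name))
    return keep

# The agent then samples candidate tables, drafts a query, runs it, inspects
# the result, and revises; it never has to be right on the first run.
```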

Now, while it's running, it's pulling the data and forming hypotheses. It might look at a datetime field, because data warehouses are incredibly messed-up environments. Every enterprise has like 50 of them, plus another 30 different BI tools from the past 35 years, three different ERPs, and 17 different CRMs. So there's a lot of hypothesis testing the agent steps through as it explores. It has to go: okay, plot the datetime columns. Oh, there's a drop here; there's a data pipeline that was probably broken. It's a lot of hypothesis testing so it can recover from all of those things. By now I think it's written some plots and visualizations. Eventually it references our CRM, so it can directly plug into any other data sources you have, which is very fast to configure because it's just making an API call the exact same way anything else makes an API call.
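One concrete flavor of that hypothesis testing, again a sketch rather than TextQL's actual code: profile a datetime column and look for gaps, which usually mean a broken pipeline rather than a real drop in the business.

```python
from datetime import timedelta
import pandas as pd  # pandas only for the illustration; at scale this runs in the warehouse

def find_pipeline_gaps(df: pd.DataFrame, ts_col: str, max_gap_days: int = 3) -> list[tuple]:
    """Plot-free version of 'plot the datetime columns and look for a drop':
    count rows per day and flag any stretch of missing days."""
    daily = (
        df[ts_col]
        .dt.floor("D")
        .value_counts()
        .sort_index()
    )
    gaps = []
    prev = None
    for day in daily.index:
        if prev is not None and (day - prev) > timedelta(days=max_gap_days):
            gaps.append((prev, day))  # hypothesis: the pipeline was down in this window
        prev = day
    return gaps
```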

While it's doing all this, I can show you what happens after a run completes: I take a completed playbook or report, for example this daily customer upsell analysis that tells me which customer I should reach out to and ask for more money. Once I know it steps through its 50 steps and gives me a good result, I can schedule it to run every day, every week, or every time a customer sends me an email, and give me a new report like this. That's how you go from "ask a question and it tells you what this number is," to "ask a question and it tells you why this number went up," to "okay, given why it went up, tell me what I can do to make the number go up or down, depending on whether it's a good or a bad number." That's a very useful thing, because once you do that, you make way more money by having the thing run on a loop and just find you more money.

Okay, it's done. It's finished the final report and come back. There's a whole write-up. It says the expected figures are all correct; there's about a 3% delta, I think because we changed the way we count something. Also, the August number was way off, because this was presented to me in August. If you read the text, it says the chart displayed August-to-date, representing approximately through August 13th, which is probably correct. I could go back and check when he sent me this, or assume it's correct because there's no reason for it not to be. Okay, fine, I'll double-check that Matt actually sent me this at that time: it was August 14th, so we're off by a day. It said approximately the 13th; it's actually the 14th. And the conclusion comes out to: revenue growth and usage correspond to the chart; we should tell Matt his data is partially correct but incomplete, likely working with mid-month data; exceptional performance, celebration-worthy. This is great, I'm not going to get fired, and now I'm really happy. That's approximately the demo.

For people who are curious about how much deeper this goes: there's a lot of infrastructure we had to build to make this possible. We rebuilt this product for the first two and a half years of the company; we rebuilt it seven times over. We tried catalogs, fine-tuning, SQL terminals, vector stores, catalogs with vector stores, sandboxes, SQL stores, all these things. We added DuckDB, we added Polars. We started with a sandbox so we could execute code, and, as I've recently been reminded, we were the first company to publicly release a version of code interpreter that wasn't in beta. Then the data got really big, so we needed something that wasn't just pandas in Python; that was going to be Polars. Then the data got even bigger, so we needed DuckDB; bigger still, so we pushed down to the data warehouse; bigger still, and we needed joins on the fly on serverless ClickHouse infrastructure. Then we had to roll a semantic layer, because some people wanted deterministic queries to come out every time. Then some of those queries took a really long time, so we added caching and acceleration. And then some people had Tableau and Looker and Power BI and all those things, so we had to build our own MCP servers for all those platforms, because Databricks' and Snowflake's and Tableau's MCP servers are maybe not the most usable things in the world.
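To make the MCP point concrete: with the official MCP Python SDK the server shell itself is small; the hard part is the BI-tool plumbing behind each tool. A minimal sketch is below, where the tableau_client module and its methods are hypothetical stand-ins, not Tableau's real API.

```python
# Sketch of a custom MCP server wrapping a BI tool, using the MCP Python SDK's FastMCP helper.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("tableau-bridge")

@mcp.tool()
def list_workbooks(project: str) -> list[str]:
    """List Tableau workbooks in a project so the agent can find existing dashboards."""
    import tableau_client  # hypothetical wrapper around the vendor's REST API
    return tableau_client.list_workbooks(project=project)

@mcp.tool()
def run_view_query(view_id: str, filters: dict | None = None) -> str:
    """Return a view's underlying data as CSV, so the agent can cross-check
    a dashboard number against the warehouse."""
    import tableau_client
    return tableau_client.export_view_csv(view_id=view_id, filters=filters or {})

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio; the agent connects like any other MCP client
```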

In summary: if you want to roll this yourself at your own company, a company like Ramp, or one with exceptional engineers, you should do that. This is totally doable with today's technology and GPT-5. All you have to do is build your own table format, natively integrated with your own compute format, that can do this across multiple different large data sources. Build your own MCP servers for Tableau's 17 different ways of ingesting data, four out of four of which break; we actually found a roundabout way to get it to work, so much so that the Tableau team currently taps us into accounts to help figure out their AI story for customers who want to use Tableau AI but can't get it to work. We rolled our own semantic layer; it's compilable and transpilable from all the existing semantic layers, like LookML, Cube, and MetricFlow, with a really cool GUI that looks kind of wild when you put over a thousand objects onto it at the same time.
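What "rolling your own semantic layer" means in practice is some declarative schema along these lines: metric and object definitions that formats like LookML, Cube, or MetricFlow could be transpiled into, and that compile deterministically to SQL. The field names below are illustrative, not TextQL's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class Dimension:
    name: str          # e.g. "signup_date", "region"
    sql: str           # column or expression in the underlying table
    type: str = "string"

@dataclass
class Metric:
    name: str          # e.g. "monthly_recurring_revenue"
    sql: str           # aggregation expression, e.g. "SUM(amount)"
    table: str         # physical table or view it is computed from
    time_dimension: str | None = None
    filters: list[str] = field(default_factory=list)

# A deterministic query is then "compile this metric plus these dimensions into SQL",
# so the same question produces the same query every time, unlike free-form generation.
mrr = Metric(
    name="monthly_recurring_revenue",
    sql="SUM(amount)",
    table="analytics.subscriptions",
    time_dimension="billing_month",
    filters=["status = 'active'"],
)
```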

We also built our own agent builder, which is very different, because we're running a more non-deterministic agent, closer to Claude Code in modality. The right UI builder for something that's less if-else statements and more "use best judgment and run for 2 or 10 hours," as test-time compute kicks in for the latest generation of models, generally wants more logic on the trigger, way more instructions in the prompt environment, and an easier iteration cycle for testing, because you're going to be running these models for hours and hours. This is not the best UI in the world, but I'm generally pretty bearish on canvas-based UIs for agent building. And since we have a semantic layer, we need tools to query the semantic layer, so we took something like Looker's metric explorer and Palantir's object explorer and combined them into one surface. Sometimes business users want to ask "get me all the customers with these attributes," and that's an object question, because every row is a noun. And sometimes people ask "show me how x, y, and z metrics change over time, broken out by 17 different objects that come from 35 different tables," and that's also a horrible thing, but it's a totally different type of math to get correct.
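The difference between those two question shapes is easiest to see in the SQL they compile down to; both queries below are illustrative, with made-up table and column names.

```python
# "Object" question: every row is a noun (a customer), so it compiles to a filtered entity lookup.
OBJECT_QUESTION_SQL = """
    SELECT c.customer_id, c.name, c.plan, c.region, c.arr
    FROM analytics.customers AS c
    WHERE c.plan = 'enterprise' AND c.region = 'EMEA'
"""

# "Metric" question: metrics over time, broken out by attributes that may live in other
# tables, so it compiles to an aggregation over joins rather than a row lookup.
METRIC_QUESTION_SQL = """
    SELECT d.month, c.region, SUM(u.active_seats) AS active_seats
    FROM analytics.usage AS u
    JOIN analytics.customers AS c ON c.customer_id = u.customer_id
    JOIN analytics.date_dim AS d ON d.date_key = u.date_key
    GROUP BY d.month, c.region
    ORDER BY d.month
"""
```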

I know a lot of people in the world offer AI for SQL, AI for big data, AI for data warehouses. We work with a lot of customers who are two and a half years into building semantic models for Snowflake and can get it to work for exactly 20 tables. So for anyone here who cares about asking questions of their data, getting decision intelligence, and building AI that can work with really big data: if our agent can't make sense of your data warehouse, even one with over 100,000 tables, in any data environment, with no configuration time, and it can't build its own semantic model, do all the data integration for itself in the same loop, and do all the PRs, change your dbt models, and everything else required to get to a place where it can analyze all your data, I will buy out your entire data team with our VC dollars. We've made this offer about 20 times, and nobody has taken us up on it. Maybe one day we will come across a horrible, horrible case, I'm guessing probably a manufacturing company with IoT devices and infinite join complexity. But until then, thank you for watching.
