
The Future of Data Engineering: AI, LLMs, and Automation

By Tobias Macey

Summary

Topics Covered

  • Data Engineering Demands Data Context
  • Reverse AI: LLMs Boost Data Engineers
  • Data Diffs + LLMs Automate Reviews
  • AI Slashes Migration from Years to Weeks

Full Transcript

[Music] hello and welcome to the data engineering podcast the show about modern data management data migrations are brutal they drag on for months sometimes years burning through

resources and crushing team morale Datafold's AI-powered migration agent changes all that their unique combination of AI code translation and automated data validation has helped companies complete

migrations up to 10 times faster than manual approaches and they're so confident in their solution they'll actually guarantee your timeline in writing ready to turn your year-long migration into weeks visit

dataengineeringpodcast.com/datafold today for the details your host is Tobias Macey and today I'd like to welcome back Gleb Mezhanskiy where we're going to talk about

the work of data engineering to build AI to build better data engineering and all of the things that come out of that idea so Gleb for folks who haven't heard any of your past appearances if you could just

give a quick introduction yeah thanks for having me again Tobias it's always fun to be on the podcast I'm Gleb CEO and co-founder of Datafold we work on

automating data engineering workflows now also with AI prior to starting Datafold I was a data engineer data scientist data product manager and I got

a chance to build three data platforms pretty much from scratch at three very different companies including Autodesk and Lyft where I was one of the

first founding data engineers and got to build a lot of pipelines and infrastructure and also break a lot of pipelines and infrastructure and I've always been

fascinated by how important data engineering is to the business in that it unlocks the delivery of the actual applications that are data driven be

that dashboards or machine learning models or now increasingly also AI applications and at the same time as a data engineer I've always been very

frustrated with how manual error-prone tedious and toilsome my personal workflow was and pretty much started Datafold to solve that problem and remove all the

manual work from the data engineering workflow so that we can ship high quality data faster and help all the wonderful businesses that are trying to leverage data

actually do it so excited to chat in the context of data engineering AI obviously there's a lot of hype that's being thrown around about oh you just rub some AI on it and it'll be magical and your

problems are solved you don't need to work anymore it's going to replace all of your Junior Engineers or whatever the current marketing spin is for it and

it's undeniable that large language models generative AI the current ERA that we're in has a lot of potential there are a lot of useful applications

of it but the work to actually realize those capabilities is often a little bit opaque or misunderstood or confusing and

so there are definitely a lot of opportunities for being able to bring large language models or other generative AI Technologies into the

context of data engineering work or development environments but the work of actually getting it to the point where it is more help than hindrance is often

where things start to fall apart and I'm wondering if you can just start from the work that you're doing and the experience you've had of actually incorporating llms into some of your

products some of the lessons learned about what are some of those impedance mismatches what are some of those stumbling blocks that you're going to run into on the path of saying I've got

a model I've got a problem let's put them together yeah absolutely and I think that's spot on Tobias in terms of there's a lot of noise and hype

around AI everywhere but yet we don't have a really clear idea and consensus on how it actually impacts data engineering and maybe before we dive

into like okay what is actually working it's worth kind of disambiguating and cutting through the noise a little bit and I've been thinking about this recently but I think there is probably

two main things that everyone gets a bit confused about one is the confusion of software engineering and data engineering software engineering and data engineering are

very related and in many ways they are similar in data engineering we ultimately also write code that produces some outcome but unlike software

engineering typically we're not really building a deterministic application that performs a certain function we write code that processes large amounts of data and usually that data is

highly imperfect and so we're dealing not just with code we're dealing also with extremely complex extremely noisy

inputs and a lot of the times also unpredictable outputs and that makes the workflow quite different and I think one important distinction is when we see

lots of different tools and advancements in tools that are affecting software engineers and impacting their workflows for the better like one example is I

think over the past year we've seen amazing improvement of the kind of co-pilot type of support within the software engineering workflow through

various tools we at Datafold for example use the Cursor IDE a lot and we really like how it seamlessly plugs in and enables our engineers working on the

application code just be more productive spend less time on a lot of boilerplate toil tasks and it's really exciting how it

affects the software engineering workflow there's also a huge part in the software engineering space right now that is devoted to the agents so for

example with Cursor the idea is that you plug it into the IDE at a few touch points for the developer like code completion and then kind of an assistant that helps you

mock up and refactor the code and it's very seamless but it's still kind of part of the core workflow for the human and then there's a second school of thought where there's an agent that takes a

task that can be very loosely defined and then basically builds an app from scratch or takes a Linear ticket and does the work from scratch and it's also very exciting I would say in our

experience testing multiple tools the results there are far less impressive and actual impact on the business for us in terms of software engineering has

been far less impressive than with the more IDE-native enhancements but all of this is to say that while those tools are really impactful for software engineers

and there's a lot happening also in other parts of the workflow we've seen very limited impact of those particular tools on the data Engineers workflow and

the primary reason is that although we also write code as data engineers the tools that are built for software engineers lack very important

context about the data and that is kind of a simple idea and a simple statement but what's underneath is actually quite a bit of complexity because if you think

about what a data engineer needs to do in order to do their job they have to understand not just the codebase but they also have to have a really good

grasp on the underlying data that their codebase is processing which is actually a very hard task by itself starting from understanding what data you have in the

first place how the data is computed where it's coming from who is consuming it what are the relationships between all the data sets and absent that context the

tools that you may have supporting your workflow yes it can help you generate the code but the impact of that would be quite limited relative to um how complex

your workflow is and I think that means that for data Engineers we need to see a specialized class of tools that would be dedicated at improving data Engineers

workflow and would excel at doing that by having the context that is critical for a data engineer to do their job that's I think one aspect of the confusion sort of like all the

advancements in software engineering tools are exciting and inspiring it doesn't mean that the data engineer's workflow is now impacted as significantly as a software engineer's workflow I

think the other type of confusion that I'm seeing is there's a lot of talk about AI in the data space and all the vendors you see out there are I think

smartly positioning themselves as really relevant and essential to the fundamental tectonic shift we're now seeing in technology meaning they're trying to

position themselves as relevant in the world where LLMs are really providing a big opportunity for businesses to improve and grow and automate a

lot of business processes but if you double click into what exactly everyone is saying it's pretty much we're going to help you the data team the data engineer ship AI to your business

and to your stakeholders like we are the best you know workflow engine so that you can get data delivered for AI or we are the best data quality vendor that

will help you ensure the quality of the data that goes into AI or we have the most Integrations with all the vector

databases that are important for AI and kind of the message that you're getting from all of this and by no means is this unimportant this is definitely

important and relevant but what's interesting about this is we're saying essentially data engineer you have so many things to do and now you also have to ship AI we're going to help you ship

AI it's so important that you ship data for AI applications we are the best tool to help you ship AI but it almost sounds like this is data engineers in

the service of AI and I think what's really interesting to explore and to unpack and what I would personally love for myself as a data engineer is kind of reversing that question and asking the

question of okay so we have now this fundamental shift in technology amazing capabilities by llms how does it

actually help me in my workflow so what does the AI for data engineer look like and I think we need much more of that discussion because I think that if we

make people who are actually working on all these important problems more productive with the help of AI then they will for sure do amazing things with data and I think that's a really

exciting opportunity to explore one of the first and most vocal applications of AI in that context of helping the data

engineers by maybe taking some of the burden off them that I've seen is the idea of talk to your data warehouse in English or text-to-SQL or whatever formulation it ends up taking where

rather than saying oh now you need to build your complicated star or snowflake schema and then build all of the different dashboards and visualizations for your business intelligence you just

put an AI on top of it and then your data consumers just talk to the AI and say hey what was my net promoter score last quarter or what's my year-over-year

revenue growth or how much growth can I expect in the next quarter based on current sales and it's going to just automatically generate the relevant

queries it's going to generate the visualizations for them and you as a data engineer or as an analytics engineer don't need to worry about it anymore and from the description it

sounds amazing it's like great okay job done I don't need to worry about that toilsome work I do all of the interesting work of getting the data to where it needs to be and then the AI does the rest but then you still have to deal

with issues of making sure that you have the appropriate semantics mapped so that the AI understands what the question actually means in the context of the data that you have which is the hardest problem in data anyway no matter

what so the AI doesn't actually solve anything for you it just maybe exacerbates the problem because somebody asks the AI the question the AI gives an answer but it's answering it based on a

misunderstanding of the data that you have and so you still have those issues of hallucination incorrect data or variance in the way that the data is

being interpreted and I'm wondering what you have seen as far as the actual practical applications of the AI being that simplifying interface versus the

amount of effort that's needed to be able to actually make that useful yeah I think text-to-SQL is the Holy Grail of the data space I

would say for as long as I've worked in the space for over a decade you know people have really tried to solve this problem multiple times and obviously now

in hindsight it's obvious that pre-LLM all of those approaches using traditional NLP were doomed and now that we have LLMs it seems like okay

finally we can actually solve this problem and I'm very optimistic that it indeed will help make data way more accessible and I think it eventually

will have tremendous impact on how humans interact with data and how that is leveraged but I think that the how and how it happens and how it's applied is

also very important because I don't think that the fundamental problem is that people cannot write SQL SQL is actually not that hard to write and

to master I think the fundamental issue is that if we think about the life cycle of data in the organization it's very important to understand that the raw data that gets collected from you

know all the business systems and all the events and logs and everything we have in a data lake is pretty much unusable and it's unusable both by

machines and AI and people if we just try to you know throw a bunch of queries at it and try to answer really key business questions and

in order for the data to become usable we need what is currently the job of a data engineer structuring filtering merging aggregating this data

curating it and creating a really structured representation of what is our business and what are all the entities in the business that we care about like

customers products orders so that then this data can be fed into all the applications right business intelligence machine learning Ai and I don't think

that text-to-SQL replaces that because if we just do that on top of the raw data we basically get garbage in garbage out I do think that

in certain applications of that we can actually get very good results even today if we put that level of a system on top of Highly curated semantically

structured data sets right so if we have a number of tables that are well defined that describe how our business works having a text-to-SQL interface could be

actually extremely powerful because we know that the questions that are asked and translated into code will be answered with data which has already been prepared and structured and so

it's actually quite easy for the system to be able to make sense of it but I don't think we are at the point where you just don't need the data team and can just ask a question it's almost

guaranteed that the answer will be wrong so in that regard data engineering and data engineers are definitely not going to lose their jobs because now it's easy to generate SQL

from text and in the context even of that text-to-SQL use case what I've been hearing a lot is that it's not even very good at that one because LLMs are bad at

math and SQL is just a manifestation of relational algebra and thereby math but if you bring a knowledge graph into the system where the AI is using the knowledge graph to understand what are

the relations between all the different entities from which it then generates the queries it actually does a much better job but again you have to build the knowledge graph first and I think maybe that's one of the places where

bringing AI earlier in the cycle is actually potentially useful where you can use the AI to do some of that rote work of saying here are all the

different representations that I have of this entity or this concept across my different data sources give me a first pass of what a unified model looks like

to be able to represent that given all of the data that I have about it and all the ways that it's being represented and I'm wondering what you've seen in that context of bringing the AI into that

data modeling data curation workflow of it's not the end user interacting with it it's the data engineer using the AI as their co-pilot if you will or as

their assistant to be able to do some of that tedious work that would otherwise be okay well I've got 15 different spreadsheets I need to visually look across them and try and figure out the

similarities and differences etc yeah that's a great point I would say I have two thoughts

there on how the AI plugs in to actually make text-to-SQL work yes you absolutely need that kind of semantic graph of what data sets you have how they are

related what are all the metrics how those metrics are computed and in that regard what's really interesting is that the metrics layer that was at some point

a really hot idea in the modern data stack probably about three to five years ago and then everyone was really disappointed with how little impact

it actually made on a data team's productivity and just the overall data stack it's almost like now it's the metrics layer's time because if you take

the metrics layer which gives you a really structured representation of the core entities and the metrics putting text-to-SQL on top is almost like the most impactful thing that you can do

because then you have a structured representation of your data model which allows AI to be very effective at answering questions while operating on a

structured graph and so I think we'll see really exciting applications coming out of the hybrid of that kind of fundamental metrics layer semantic graph

and text-to-SQL you know we're already seeing the early impacts of that but I think over the next two years it probably will become a really

popular way to open up data for the ultimate stakeholders instead of classical BI with like drag-and-drop interfaces and kind of passively consumed dashboards
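
To make the metrics-layer idea above concrete, here is a minimal sketch in Python of how a curated semantic layer can constrain query generation. All table, metric, and dimension names are hypothetical, and a real text-to-SQL system would let an LLM pick the metric and dimension from a request in natural language rather than take them as arguments; the point is that generated queries can only reference curated definitions, never raw, unmodeled data.

```python
# Hypothetical metrics-layer definitions: each metric names its source
# table, its SQL expression, and the dimensions it may be sliced by.
METRICS = {
    "revenue": {
        "table": "analytics.orders",
        "expression": "SUM(order_total)",
        "dimensions": ["order_date", "region"],
    },
    "active_users": {
        "table": "analytics.daily_users",
        "expression": "COUNT(DISTINCT user_id)",
        "dimensions": ["activity_date", "plan"],
    },
}

def compile_metric_query(metric: str, group_by: str) -> str:
    """Compile a (metric, dimension) request into SQL using only the
    curated definitions, so a query can't be generated against
    anything outside the semantic layer."""
    spec = METRICS[metric]
    if group_by not in spec["dimensions"]:
        raise ValueError(f"{group_by!r} is not a defined dimension of {metric!r}")
    return (
        f"SELECT {group_by}, {spec['expression']} AS {metric}\n"
        f"FROM {spec['table']}\n"
        f"GROUP BY {group_by}"
    )

print(compile_metric_query("revenue", "region"))
```

Because every question is answered from prepared, structured definitions, the "garbage in, garbage out" failure mode of pointing text-to-SQL at a raw data lake is ruled out by construction.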

but then the second point which you made is basically can AI actually help us get to that structured representation and I think absolutely for the data engineer's workflow so not for I would

say a business stakeholder or someone who is a data consumer but for the data producer I think that leveraging LLMs to help you build data models and especially build

them faster in the sense of understanding all the semantic relationships not just writing code is a very promising area and that comes back

to my point about how software tools are limited in their help for data engineers right I can write SQL but if my tool does not understand

what the relationships between the data sets are then it can't even help me write joins properly and one of the interesting things we've done at Datafold was actually build a system that

essentially infers an entity relationship diagram from the raw data that you have combined with all the ad hoc SQL queries

that have been written by people so previously that would be a very hard problem to solve but with the help of LLMs we can actually have a really good shot at understanding

what are all the entities that your business has in your data lake how are they related and that's almost like a probabilistic graph because people can be writing joins correctly or incorrectly and you have noisy data and

sometimes keys that you think are primary keys and foreign keys are not perfect but if you have a large enough data set of queries that were run against your warehouse you can actually

have a really good shot at understanding what the semantic graph looks like and the context in which we actually did this was to help data teams build

testing environments for their data but the implications of having that knowledge are actually very powerful right so to your point we can use that

to also help write SQL so I'm very bullish on the ability to help data engineers build pipelines by creating a semantic

graph without the need for curation because previously that problem was almost pushed to people with all the kind of data governance tools the idea was let's have data stewards define all

the canonical data sets and all the relationships and obviously we just discovered this is completely non-scalable so now we're finally at the point where we can automate that kind of

semantic data mining with LLMs that brings us back around to another point that I wanted to dig into further in the context of how to actually integrate the

llms into these different use cases and workflows you brought up the example of cursor as an IDE that was built specifically with llm use cases in mind

as opposed to something like VS Code or Vim or Emacs where the LLM is a bolt-on and something that you're trying to retrofit into the experience and it

can be useful but it requires a lot more effort to be able to actually set it up configure it make it aware of the codebase that you're trying to operate

on Etc versus the prepackaged product and we're seeing that same type of thing in the context of data where you mentioned there are all these different vendors of oh hey we're going to make it

super easy for you to make your data ready for AI or use this AI on your data but most teams already have some sort of system in place and they just want to be able to retrofit the llm into it to be

able to start getting some of those gains with the eventual goal of having the llm maybe be a core portion of their data system their data product and I'm

wondering in that process of bringing an LLM retrofitting it onto an existing system whether that be your code editor your deployment environment your data warehouse what have you what are

some of those impedance mismatches or some of the issues in conceptual understanding about how to bring the appropriate I'm going to use the word knowledge even though it's a bit of a

misnomer into the operating memory of the llm so that it can actually do the thing that you're trying to tell it to do yeah that's a great question Tobias I think that to answer this we kind of

need to go back to what the jobs to be done for a data engineer are and what the data engineer workflow actually looks like and if we were to visualize it it

actually looks quite similar to the software engineering workflow in just the types of tasks that a data engineer does day-to-day to do their work and by

the way we're saying data engineer as sort of a blanket label but I don't necessarily mean just people who have data engineer in their title because all

roles that are working with data including data scientists analysts analytics engineers and in many cases software engineers a lot of them actually do data engineering in terms of

building pipelines and developing pipelines as part of their job it's just data engineers probably do this you know 100% of their time and if I'm a data analyst or data scientist I would be doing this maybe 30 to 40% of the time of

my week and so if we think about what do I need to do to let's say ship a new data model like a table or extend an existing data model you know refactor

definitions or add new types of information into an existing model it starts with planning right so I'm doing planning I'm trying to find the data

that I need for my work and a lot of the times a lot of information can be sourced from documentation from a data catalog I think right now the data

cataloging in the sense of like what data sets I have and what's the profile of those data sets has been largely solved there are great tools you know some are open source some are vendors

but overall understanding what data sets you have now is way easier than it was five years ago you also probably are consulting your tribal knowledge and you go to Slack and you do like search for

certain definitions and that's also now largely solved with a lot of the enterprise search tools and then you go into writing code and writing code I think this is also an important

misconception like if you are not really you know doing this for a living you think that people spend most of their time actually writing SQL and in terms

of writing SQL for production in my experience the actual writing of SQL or other types of code is maybe

like 10 to 15% of my time whereas all the operational tasks around testing it talking to people to get context doing

code reviews shipping it to production monitoring it remediating issues talking to more people is where the bulk of the work is happening and if that's

true then that means that probably as we talk about automation these operational workflows are where the bulk of the lift coming from LLMs can actually happen and

so for actually writing code as a data engineer I would still recommend probably using the best-in-class software tools these days like Cursor even though it's not aware of the data it will probably still help you

write a lot of boilerplate and will speed up your workflow somewhat or you can use other IDEs with copilots like VS Code plus Copilot I think those tools will just help you speed up the

writing of the code itself but back to the operational workflows that I think take the majority of the time within any kind of cycle of shipping

something when it comes to what happens after you wrote the code right typically if you have people who care about the quality of the data it means that you have

to do a fair amount of testing of your work and testing is both about making sure that my code is correct right does it conform to the expectations does it produce the data

that I expect but it's also about understanding potential breakages data systems are historically fragile in the sense that you have layers and layers of

dependencies that are often opaque because I can be changing some definition of what an active user is somewhere in the pipeline but then I can be completely

oblivious of the fact that 10 jobs down the road someone builds a machine learning model that consumes that definition and tries to automate certain decisions like for example spend and

manipulating that metric and so if I'm not aware of those Downstream dependencies I could be actually causing a massive business disruption just by the sheer fact of changing it and so the testing that involves not just

understanding how the data behaves but also how the data is consumed and what are the larger business implications for making any kind of modification to the code is where a ton of time is spent in

the data engineering and so what's interesting is that this is the use case where historically we at Datafold spent a lot of time thinking even pre-AI before LLMs were a thing what we

did there was come up with the concept of data diffing and the idea is everyone can see a code diff right my code looked like this before I made a change now

it's a different set of characters that the code looks like and diffing the code is something that is embedded in GitHub right you can see that but the very hard question is understanding how

does the data change based on the change in the code because that is not obvious it happens only once you actually run the code against the database and so data diff allows you to see the impact

of a code change on the data and that by itself was quite impactful and we've seen a lot of teams adopt that you know large enterprise teams fast-moving

software startup teams but we were not fully satisfied with the degree of automation that feature alone produced because people are still required to sift through all the

data diffs and explore them for multiple tables and see how the downstream impacts manifest themselves through lineage and it felt like okay now at

least we can give people all the information but they still have to sift through a lot of it and some of the important details can be missed and the big unlock that LLMs bring to this

particular workflow is once LLMs became pretty good at comprehending code and actually semantically understanding the code which pretty much happened over

2024 with the latest generation of foundational large language models we were able to do two things one take a lot of information and

condense it into like three bullet points kind of like an executive summary and those bullet points are essentially helping the data engineer understand at a high level what are the most important impacts that I need to worry

about for any given change and for a code reviewer to understand the same and that just helps people get on the same page very quickly and saves everyone a lot of time that otherwise could be spent in meetings or back and forth

you know putting comments on a code change and the second unlock that we've seen is the opportunity to drill down and explore all the impacts and do the

testing by essentially chatting with your P request chatting with your code and that comes in the form of a chat interface where you're basically speaking to an agent that has a full

context of your code full context of the data change the data diff and also full context of your lineage so that it can actually understand how every line of code that was modified is affecting the data

and what does that mean for the business and you can ask questions and it produces the answers way faster than you would by essentially looking at all the different code changes and

and data diffs and that ended up saving a lot of time for data teams and now that I'm describing this you kind of feel that it sounds almost like having a buddy that just

helps you think through the code almost like having a code reviewer except with the LLM this is a buddy that's always available to you 24/7 and

probably makes fewer mistakes because it has all the context and can sift through a lot of information really quickly so that's an example of how AI could be applied to an operational use case that historically has been really time-consuming and take a lot of manual work out of that context

and I really want to dig into that one word that you said probably at least a half dozen times if not maybe a couple of dozen was that

context which I think is the key piece that is so critical and also probably the most difficult portion of making AI useful is context what context

does it need how do you get that context to it how do you model that context how do you keep it up to date and so I think that really is where the difference comes in between the cursor example that

we touched on earlier versus the retrofitting onto Emacs or whatever your tool or workflow of choice is is how do you actually get the context to the place that it needs to be and so you

just discussed the use case that you have of being able to use the LLM to interpret the various data diffs and understand what the actual ramifications of this change are

and I'm wondering if you can just talk through some of the lessons learned about how you actually populate and maintain that context and how you're able to instruct the llm how to take

advantage of the context that you've given it that's a great question Tobias and I think what's interesting is that at face value it seems like you want to

throw all the information you have at the LLM right just tell it everything and then let it figure things out and in fact it is obviously not as easy as that

and in fact it's actually counterproductive to over-supply the LLM with context in part because the context window of large language models is

limited and the trade-off there is one you just can't physically fit everything and two even if you're dealing with a model that actually is designed to have a very large context

window if if you overuse it and Supply too much information LM just get gets lost it's also a starts being far far less effective in understanding what's

actually important versus not and the overall effectiveness of your system goes down so back to your question of like what is the actual information that is important to provide as context into

the llm it really depends on what the workflow is that we're talking about in the context of code review and testing we are trying to fundamentally

answer the question of if we changed the code was the change correct relative to what we tried to do what the

task was or did we not conform to the business requirement the second question is did we follow the best practices such as you know code guidelines and

performance guidelines or not and the third question is okay let's say we conformed to the business requirements we did a good job at following our

coding best practices but we may still cause a business disruption just by making a change that can be a surprise either for a human consumer of data

Downstream or could throw off a machine learning model that was trained based on a different distribution of data right and so these are fundamental three questions that we try to answer and by

the way even without AI that's what a good code review would ultimately accomplish done by humans so what is the context that is important for the llm

to have here first obviously it is the code difference right so we already know what the original code was what the new code is and feeding that in is really important so that it can understand okay

what are the actual changes in the code itself in the logic and I won't go into the details here because obviously the code base can be very large sometimes your PR can touch a lot of

code so you have to be quite strategic in terms of how you feed that in on a technical side but conceptually that's what we have to provide as input number one the second important input is

the data diff right it's understanding if I have a kind of main branch version of the code understanding what data it produces and what are the metrics

showing right and then if I have a new version of the code let's call it a developer Branch what data it produces and what is the difference in the output

let's say with my main branch code I see that I have 37 orders on Monday but with the new version of the code I see that I have 39 and so that already tells me

that okay so this is the impact on the output data and on the metric and that's important both on the value level understanding how the individual cells rows and columns are changing but it's also

important to do rollups and understand what is the impact on metrics and coupling that context with the code diff allows us to understand how changes in the code affect the actual data output
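The value-level diff plus metric rollup he describes can be sketched in miniature like this. This is a toy in-memory illustration with hypothetical table fields; a real diff like Datafold's runs against data inside the warehouse rather than in Python:

```python
def data_diff(main_rows, dev_rows, key="order_id"):
    """Compare two versions of a table keyed by a primary key.

    Returns value-level changes (added/removed/changed keys) plus a
    simple metric rollup (row counts per branch), mirroring the
    37-vs-39-orders example from the conversation.
    """
    main = {r[key]: r for r in main_rows}
    dev = {r[key]: r for r in dev_rows}
    added = [k for k in dev if k not in main]
    removed = [k for k in main if k not in dev]
    changed = [k for k in main if k in dev and main[k] != dev[k]]
    return {
        "added": added,
        "removed": removed,
        "changed": changed,
        "rollup": {"main_count": len(main), "dev_count": len(dev)},
    }

# Main branch produces 37 orders, the developer branch produces 39.
main_rows = [{"order_id": i, "amount": 10} for i in range(37)]
dev_rows = [{"order_id": i, "amount": 10} for i in range(39)]
diff = data_diff(main_rows, dev_rows)
```

Both levels of output matter for the review: the per-key lists show exactly which rows moved, while the rollup is what a reviewer (human or llm) compares against the intended business change.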

and the third really important aspect is the lineage so lineage is fundamentally understanding how the data flows throughout your system how it's computed

how it's aggregated and how it's consumed and the lineage is a graph and there are kind of two directions of exploration one of them is Upstream

which helps us understand how the data got to the point where you're looking at it right so for example if I'm looking at number of orders and I'm changing a formula where

does the information about orders come from in the first place and that is important because that can tell us a lot about how a given metric is computed and what is the source of truth are we getting it from Salesforce or are we

getting it from our internal system and then the downstream lineage is also important because it tells us how the data gets consumed and that is absolutely essential information that

can help us understand what Downstream systems and metrics will be affected and lineage graph in itself can be very complex and building it actually is a tough problem because you have to

essentially scrape all of your data platform information all the queries all the bi tools to understand how data flows how it's consumed and produced but let's say you have this lineage graph

it's actually also a lot of information by itself and so to properly supply that lineage information into an llm context you actually need your system

to be able to explore the lineage graph on its own to see okay if the developer made a change here what are the important downstream implications of

that so now we're talking about the system being able to traverse that and do analysis on its own for the context I would say these are the three most important types of context and then the fourth one is kind

of optional again if your team has any kind of best practices SQL linting rules documentation rules you can also provide them as context and

then your kind of AI code reviewer assistant can help you reason about well did you conform or not and if not make suggestions about what to correct eventually probably going in and

correcting your code itself I think that's ultimately where this is going but again it would pretty much be operating on the same set of input context another interesting element of

bringing llms into the context of the data engineering workflow and use case one is the Privacy aspect which is a whole other conversation I don't want to

get too deep into that Quagmire but also when you're working as a data engineer one of the things you need to be thinking about is what is my data platform what are the tools that I rely

on what are the ways they link together and if you're going to rely on an llm or generative AI as part of that tool chain how does that fit into that platform

what is some of the scaffolding what are some of the workflows what are some of the custom development that you need to do where a lot of the first pass and

naive use cases for generative AI and llms is oh well just go and open up the ChatGPT UI or just go run LM Studio or use Claude or what have you but if you want

to get into anything sophisticated where you're actually relying on this as a component of your workflow you want to make sure that it's customized that you own it in some fashion and so that is likely going to require doing some

custom development using something like LangChain or LangGraph or CrewAI or whatever where you're actually building additional scaffolding logic

around just that kernel of the llm and I'm curious how you're seeing some of the needs and use cases of incorporating

the llm more closely into the actual core capabilities of the data platform through that effort of customization and software engineering that's a great

point I think that the models themselves are getting rapidly commoditized in the sense that their capabilities

you know the foundational large language models their interfaces are very similar their capabilities are similar we're seeing a lot of a race between the

companies training those models in terms of beating each other in benchmarks it looks like the whole industry is converging on adding more reasoning and

then the ways that this is happening is also converging on the same experience and the difference is like who is doing this better right who is beating the metrics who provides the

best the cheapest inference the fastest inference more intelligence for the same price and to that end I don't think that the differentiation or the effectiveness of whatever is the

automation that you're trying to bring really depends on the choice of a model maybe for certain narrow applications choosing a more specialized model or fine-tuning a model would be more applicable but still

I don't think the model is where the magic really happens these days the model is important for the magic but it's not something that actually allows you to build a really effective application

by just you know choosing something better than what's available to everyone else the actual magic and the value add and the automation happens in how you

leverage that model in your workflow so all the orchestration in terms of how do you prompt the model what kind of context you provide how do you tune the prompt how do you tune the inputs how do

you evaluate the performance of the model in production how do you make various llm based actors that may be playing different roles interact with

each other that is where the hard work is happening and that is where I think the actual value and impact is created and that's where all the complexity is
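That orchestration layer of multiple llm actors playing different roles can be sketched as a toy pipeline. The `call_llm` function below is a stub standing in for any provider's API, and the reviewer/verifier roles and prompt format are hypothetical, not a real product's design:

```python
def call_llm(role, prompt):
    """Stub standing in for a real model API call; returns canned text."""
    if role == "reviewer":
        return "FINDING: revenue formula changed; 37 -> 39 orders on Monday"
    if role == "verifier":
        return "PASS" if "FINDING" in prompt else "FAIL"
    raise ValueError(f"unknown role: {role}")

def review_pipeline(code_diff, data_diff):
    """Orchestration: assemble the context into a prompt, run the
    reviewer actor, then have a second actor check the reviewer's
    output before anything is surfaced to the engineer."""
    prompt = f"Code diff:\n{code_diff}\n\nData diff:\n{data_diff}"
    finding = call_llm("reviewer", prompt)
    verdict = call_llm("verifier", finding)
    return {"finding": finding, "verdict": verdict}

result = review_pipeline("orders.sql: + new revenue formula", "Monday: 37 -> 39")
```

The value here is not in `call_llm` itself, which is interchangeable, but in the surrounding plumbing: what context goes into the prompt, which role runs when, and how one actor's output is evaluated before being trusted.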

so I think you don't have to be you know a PhD and really understand how the models are trained although I would say just like in computer science it's obviously very helpful to understand how these

models are trained and the architectures and their trade-offs but you don't have to be good at you know training those models in order to effectively leverage them but to leverage them you have to do

a lot of work to effectively plug them into the workflows and I think that the applications and companies and teams that are thinking about what is the workflow what is the ideal user interface what is all the information

that we can gather to make the llm do a better job and then are able to rapidly iterate will ultimately create the most impact with llms and so on that note in

your experience of working with the llms working with other data teams and keeping apprised of the evolution of the space What are some of the most interesting or Innovative or unexpected

ways that you've seen teams bring llms into that inner loop of building and maintaining and evolving their data systems I think the most in hindsight

obvious but not necessarily obvious when you're just starting realization is that no one really knows how to ship llm based AI

applications there are obviously you know guides and tutorials and still there's a lot you can learn from looking at what people are doing but the field

is evolving so fast that nothing replaces fast experimentation and just building things it's not that you can

just hire someone who worked on building an llm based application like six months ago a year ago and all of a sudden you you know gain a lot of advantage as you

would with many other technologies like you know if we were I guess working in the space of video streaming it would be very beneficial to have extensive

experience with working with video streaming and codecs but with llms no one really knows exactly how they work in terms of

like how they behave right even the companies that are shipping them are discovering more and more novel ways of leveraging them more effectively every

week and for the teams that are leveraging llms like Datafold the thing that we found matters the

most is the ability to just stay on top of the field and understand what's the most exciting thing that people are doing how it relates to our field how can we borrow

some of those ideas but most importantly it is rapid experimentation with some sort of methodology that allows you to try new things measure results quickly and then

being able to scrap an approach that you thought was great and just go with a different one because a lot of times when a new model is released you have to adjust a lot of things you have

to adjust the prompts you even have to rearchitect things and that is both difficult but also incredibly exciting because the pace of

innovation and what is possible to solve is evolving extremely fast I would say the fastest of any previous technological wave of disruption that

we've seen in your experience and in your work of investing in this space figuring out how best to apply llms to the problems

facing data engineers and how to incorporate that into your products what are some of the most interesting or unexpected or challenging lessons that you've learned personally yeah I think that the

interesting realization was that specifically for the data engineering domain again if you just take the problem at face value you think well let's just build a co-pilot

or an agent that would kind of try to automate the data engineer away and I don't think we have the tech ready for an agent to just really take a task and run with it yet I don't think it's

been solved in the software space I think it's in some ways even harder to solve in the data space we'll eventually get there I don't think we are there yet I don't think that the biggest

impact you can make in improving the workflow again is like having a co-pilot because that's not where the engineers spend most of their time in terms of

like writing production code it's all operational tasks and there are certain kinds of problems in the data engineering space

where it's not even a day-to-day you know you help save like an hour two hours three hours but there are certain types of workflows

where to complete a task a team needs to spend like 10,000 hours and a good example of such a project would be a data platform migration where for example you

have millions of lines of code on a legacy database you have to move them over to a new modern data warehouse you have to

refactor them optimize them repackage them into a new kind of framework right you may be moving from like stored

procedures on Oracle to dbt plus Databricks and doing that requires a certain number of hours for every object and because you're dealing with a large

database that at the enterprise level sums up to an enormous amount of work and historically these projects would last years and be done a lot of times by

outsourced talent from you know consultants or SIs and for a data engineer that's probably one of the most miserable projects to do I've

led such a project at Lyft and it's been an absolute grind where you're not shipping new things you're not shipping AI you're not shipping even data pipelines you're just solving

technical debt for years and what's interesting is that those types of projects and workflows are actually I

would say where AI and llms can make the most impact today because we can take a task we can reverse engineer it
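For a migration project like the one described above, one way to reverse engineer the task is to make matching output the acceptance test: an llm can propose translations freely, but a candidate is only accepted when the new system reproduces the legacy output. The stubs below are hypothetical stand-ins for the legacy engine, the new engine, and the translation call, not any vendor's API:

```python
def migrate_with_validation(legacy_sql, run_legacy, run_new, translate,
                            max_attempts=3):
    """Accept an llm translation only when the new system's output
    matches the legacy system's output -- the objective function
    for a code migration (toy version)."""
    expected = run_legacy(legacy_sql)
    for attempt in range(max_attempts):
        candidate = translate(legacy_sql, attempt)
        if run_new(candidate) == expected:  # data diff is empty
            return candidate
    raise RuntimeError("no candidate reproduced the legacy output")

# Toy stand-ins: the "llm" gets the translation right on the second try.
run_legacy = lambda sql: [("Monday", 37)]
run_new = lambda sql: [("Monday", 37)] if "fixed" in sql else [("Monday", 39)]
translate = lambda sql, attempt: (
    f"select -- attempt {attempt}" + (" fixed" if attempt == 1 else "")
)
result = migrate_with_validation("select ... from legacy_sp",
                                 run_legacy, run_new, translate)
```

The loop is what makes the llm's flexibility safe: however creative the translation, it only ships once the output diff comes back clean.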

we know exactly what the target is you know you move the code you do all these things with the code and ultimately the data has to be the same right you're going through multiple complex steps but what's

important for the business is once you move from let's say you know Teradata to Snowflake your output is the same because otherwise the business wouldn't accept

it and that allows us to leverage llms for a lot of the tasks that were historically manual but also have a really clear objective function for llms like diffing

the output on a legacy system against a modern system and using it as a constraint and if you put those two things together you have a very powerful system that is extremely flexible and

scalable thanks to llms but also can be constrained to a very objective definition of what's good you know unlike a lot of this text to SQL

generation that cannot be constrained to a definition of what's good because like how do you know with a migration you do know and that allows AI to make

tremendous impact on the productivity of a data team by essentially taking a project that would last for years cost millions of dollars and go over budget and

compressing that into weeks and you know just a fraction of the price I think that is where we can see real impact of AI that's like useful it's working and

we also see the parallels in the software space as well where a lot of the really impactful enterprise applications of AI are actually taking these legacy code bases and you know

helping teams maintain them and or migrate them and I think that there are more opportunities like that in the data engineering space where we'll see AI

make tremendous impacts and as you continue to keep in touch with the evolution in the space work with data teams evaluate what are

the cases where llms are beneficial versus you're better off going with good old human Ingenuity what are some of the things you're keeping a particularly

close eye on or any projects or context you're excited to explore in terms of where you think that llms would really make a huge impact on

the workflow uh just llms in general how to apply them to data engineering problems how to incorporate them more closely and with less leg work into the

actual problem solving apparatus of an organization yeah so I think that on multiple levels there's a lot of exciting things like for

example being able to prompt an llm from SQL as a function call that's available these days in modern data

platforms is incredibly impactful right because in many instances we're dealing with extremely massive data and instead of having to

write like complex case when statements and regexes and udfs to be able to clean the data to classify things and to

just untangle the mess we can now apply llms from within SQL from within the query to solve that problem and that is incredibly impactful for a whole variety

of different applications so I'm very excited about all these capabilities that are now you know brought by the major data platforms like you know Snowflake Databricks BigQuery I think

that if we go into the workflow itself like what does a data engineer do and how to make that work better I think there's a ton of opportunity to further

automate a lot of tasks I think a big one is data observability and monitoring I honestly think that data observability in its current state is a dead end in terms of like let's cover

all the data with alerts and monitors and then be the first to know about any anomalies it's useful but then it quickly leads to a lot of noise alert

fatigue and ultimately kind of could be even net negative on the workflow of a data engineer I think that this is a type of workflow

where putting an AI to investigate those alerts do the root cause analysis and potentially remediation is where I see a

lot of opportunity for saving a ton of time for a data team while also improving the slas and the overall quality of the

output of the data engineering team and that's something that we are really excited about something we're working on at Datafold and that we're excited about shipping later this

year are there any other aspects of this overall space of using llms to improve the lives of data engineers and the work that data Engineers can do to improve

the effectiveness of those llms that we didn't discuss yet that you'd like to cover before we close out the show I think that you know we talked a lot about kind of the workflow

improvement I think that overall my recommendation to data engineers today would be to learn how to ship llm

applications it's not that hard frameworks like LangChain make it very easy to compose multiple blocks together and ship something that works whether or not you end up using LangChain or

other frameworks in production and whether your you know team allows that doesn't really matter but it's really really useful to try and build

and learn all the components and it's just like software engineering you know learning how to code opens up so many opportunities for you to solve problems right you see a

problem and you're like I can write a script for that and I think that with llms it's almost like a new skill that both software engineers and data engineers need to learn where you see a

problem and you think okay I actually think I can split the problem into three tasks that I can give to an llm like one would be extraction one could be like reasoning and

classification and now it just solves the problem but really learning how to build and trying helps you build that intuition and so my recommendation for all data

engineers listening to this is try to build your own application that solves either a business problem or helps you in your own workflow because knowing how to build with llms just gives

you tremendous superpowers and will definitely be helpful in your career in the coming years I definitely would like to reinforce that statement because

despite the AI maximalists the AI Skeptics no matter what you think about it llms aren't going anywhere they're going to continue to grow in their usage and their capabilities so it's worth

understanding how to use them and investing in that skill because it is going to be one of those core Tools in your toolbox for many years to come and So for anybody who wants to get in touch

with you and follow along with the work that you are doing I'll have you add your preferred contact information to the show notes and as the final question I'd like to get what your current perspective is on the biggest gap on the

tooling or technology for data management today I think that there's a lot of kind of skepticism and some bitterness around how the modern data stack failed us in

the sense that we were so excited that the modern data stack would make things so great five years ago and we're kind of disappointed and I think that I'm an

optimist here I think that modern data stack in the sense of infrastructure and getting a lot of the fundamental challenges out of the way like running

queries and getting data in and out of different databases and visualizing the query outputs and having amazing notebooks all of that that we now take

for granted is actually so great relative to where we were you know five seven eight ten years ago I don't think it's enough so I think that I am with

the data practitioners who say well it's 2025 we have all these amazing models why is it still so hard to ship data I'm absolutely with you and I think

what I'm excited about is now that we have this really great foundation with modern data stack in the sense of infrastructure I'm excited about one

getting everyone on the modern data stack to the point of migrations right let's get everyone on modern infrastructure so that they can ship faster obviously a problem that I'm really passionate about

solving and working on second once you are on the modern data infrastructure how to keep modernizing your team's workflow so that the engineers are spending more and

more time on solving hard problems and thinking and planning on the valuable activities that are really worth their time and less and less on operational toil that is burnout inducing and

holds everyone back so I'm excited about the modern data stack renaissance thanks to the fundamental capabilities of large language

models absolutely well thank you very much for taking the time today to join me and sharing your thoughts and experiences around building with llms to improve the capabilities of data

Engineers it's definitely an area that we all need to be keeping track of and investing some time into so I appreciate the insights that you've been able to share and I hope you enjoy the rest of

your day thank you so much Tobias [Music] thank you for listening and don't forget to check out our other shows Podcast.__init__ covers the Python language its

community and the innovative ways it is being used and the AI Engineering Podcast is your guide to the fast moving world of building AI systems visit the site to subscribe to the show sign up for the mailing list and read the show

notes and if you've learned something or tried out a project from the show then tell us about it email hosts at dataengineering podcast.com with your story and to help other people find the show please leave a review on Apple

podcasts and to tell your friends and co-workers [Music]
