
LLMs for Devs: Model Selection, Hallucinations, Agents, AGI – Jodie Burchell | The Marco Show

By IntelliJ IDEA, a JetBrains IDE

Summary

Key Takeaways

  • LLM assessments are flawed: Assessing LLM quality is difficult, even for specific tasks. Many benchmarks are flawed due to issues like data leakage or questions that don't make sense, making it hard to judge a model's true performance. [12:25], [15:04]
  • You don't always need the newest LLM: Models that are a couple of years old can still be competitive, and older, cheaper models like GPT-3.5 can be sufficient for many tasks. [25:14], [25:53]
  • Hallucinations are an unsolved problem: Hallucinations in LLMs, whether factual errors or faithfulness issues, remain a major problem. Techniques like RAG can help, but the core issue of models generating incorrect information is not fully solved. [26:36], [28:21]
  • Vibe coding has limitations: While "vibe coding" can be useful for prototyping, it's unlikely to lead to a successful startup without developer involvement. Ensuring security, performance, and maintainability requires expertise beyond what current AI agents can reliably provide. [32:25], [34:14]
  • Self-hosting offers privacy: For companies concerned about data privacy, self-hosting LLMs is the only solution. This avoids sharing sensitive company data with third-party model providers, which can be crucial for compliance and security. [55:41], [58:17]
  • AI ethics and environmental costs: The development of LLMs raises significant ethical concerns regarding data sourcing, copyright, and labor exploitation. Furthermore, the computational resources required for training and running these models have substantial environmental impacts. [59:06], [01:04:46]

Topics Covered

  • Smaller Models Can Outperform Larger Ones
  • LLM Assessment is Broken
  • LLM Agents Struggle with Complex, Novel Tasks
  • AI Ethics: Data Sourcing and Labor Exploitation
  • AGI is Decades Away, Not Years

Full Transcript

This whole thing is a very stupid

debate. Anytime someone says it's around

the corner, AGI is around the corner

I'm like, tell me how how do you know?

It makes me angry. Here's the thing.

Assessment of LLMs in the context of

your application is exactly the same

pretty much as assessing of an

application works. You still need your

unit test. You still need to inspect

things like traces. You still need to

get humans in the loop to check things

out. You still need to do AB testing.

[Music]

So, welcome to the very first inaugural

episode of the Marco Show. I have a very

wonderful guest with me today. Who are

you and what do you do?

>> So, I'm your colleague.

>> Um, so I'm also a developer advocate.

>> You kind of look familiar.

>> Yeah, I would hope so, after three

years.

Um so yes I'm also a developer advocate

at JetBrains. My specialization is data

science. Uh been a data scientist for

about 10 years now which

>> Right. So you started out in data

science right after uni. That that's

what you did

>> Right out of uni, but in the sense that I

stayed at uni for a very long time. Um

so I did a PhD in psychology and I also

stayed on for a postdoc in biostatistics.

>> Right. And then I fled academia and went

into the warm embrace of

>> data science.

>> Right. Okay. How are you involved with

LLMs?

>> So I basically do a lot of work, kind of

building projects in LLMs, doing

education about LLMs. Um I've done a lot

of talks about LLMs in the last three

years.

>> My particular stance is because I worked

in natural language processing for quite

a number of years in my data science

career. So basically I was I like to say

I was working on LLMs before they were

cool.

>> Um, I was working in NLP in the pre-LLM

era. So pre-2018.

>> right

>> And, you know, there are a lot of methods

in natural language processing that are

not LLMs. We can also talk about them. Um,

we started experimenting with the very

first LLMs uh the ones coming out of

Google like BERT right at the end of

that job. So yeah, basically I've sort

of seen their applications, seen sort of

how they can be used and I've also seen

a lot of claims about how they can be

applied which I'm not entirely convinced

about.

>> I love it. And I mean, we talked before

the show and I was saying you're kind of

the antidote to the hypers out

there and just giving you a bit of more

uh neutral context where LLMs are, what

they can do and how to choose them. And

that's the topic I'd love to kind of get

into.

>> I think you have a lot of questions for

me. I do. So my issue is and um I'm not

I'm not making this up. So I go to any

of our IDEs, for example.

>> I look at the AI chat window. I have a

selection of 20 different models to

choose from. And it's not just about we

have Google with the Gemini models. We

have then OpenAI with the GPT models.

And by the way, sorry, Anthropic.

Antropic. I

>> Anthropic. Anthropic.

>> Anthropic.

>> You know, Anthropic.

>> Anthropic.

>> I know, it's the "th" sound. Yeah.

>> Yes. Anthropic.

>> Yeah. Um, uh, with Claude, and then I have,

you know, these sub-models, the families,

and I was thinking to me it feels like

uh, in Europe, or at least in Germany, as

kids we had these card games which

are called car quartet um where

essentially you play cards and you have

everyone has has different car models

and you have you know they have

benchmarks written on them with the top

speed engine uh horsepower whatever and

then you just compare these these

numbers and then get take the cards from

from the other players.

>> Mhm.

>> And to me, when looking at all these

models, I actually see, you know

context sizes. I actually see

throughput, random numbers, new versions

of models. How do I go about choosing or

even maybe just understanding what these

models can do and which one should I

choose essentially?

>> Yeah. So,

part of kind of the I think the

confusion about LLMs is they've sort of

been marketed as a one-sizefits-all for

all problems. And

even though these models are better at

generalizing with natural language tasks

than anything we've seen before, they

still have particular training sets

they still have particular um things

that they tend to excel at and things

that they tend to be weaker at. So

kind of maybe I can give like a brief

well it won't be that brief but a slight

a slightly brief history about how we

got here.

>> Sure. Yeah.

>> So basically we struggled for a long

time about how to actually process

natural language. So to define natural

language it's what we're speaking right

now. It's it's any language that is not

designed. And it's really complex to

process because things like word order

or context things like this they matter

a lot. And actually keeping track of all

of that, keeping track of that even

within a sentence, but over multiple

sentences is really tricky. So before

LLMs, we had a type of deep learning

network called long short-term memory

networks. Really catchy name. So,

things are named very strangely in

machine learning.

>> Long short-term memory networks.

>> Yes. Okay. Yes. And the idea of these is

that they could keep track of words in

sentences. So basically the the network

would go sequentially over say a

sentence. It would learn something about

the first word. It would store that in

the short-term memory, sorry, in a long-term

memory and then basically the short-term

memory would then go to the next word.

It would add extra information to the

long memory and it would accumulate

information.

But there were many limitations with

these models. First like training things

sequentially and running things

sequentially is slow. But also, you know,

these long-term memories couldn't

actually hold that much information. So

they start failing actually after about

20 words

>> right?

>> Yeah. So, but they were the best we

had.
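
Jodie's description of sequential processing with a saturating memory can be caricatured in a few lines of Python. This is a deliberate toy, not an LSTM (a real one uses learned gates over continuous vectors, not a word list); the 20-slot capacity just mirrors her roughly-20-words figure:

```python
def sequential_memory_pass(words, capacity=20):
    """Crude caricature of an LSTM pass: walk the input one word at a
    time, folding each word into a bounded 'long-term memory'."""
    memory = []
    for word in words:            # strictly sequential: no parallelism
        memory.append(word)
        if len(memory) > capacity:
            memory.pop(0)         # the oldest information is forgotten
    return memory

# A 30-word input overflows the 20-slot memory, so the first 10 words are lost.
state = sequential_memory_pass([f"w{i}" for i in range(30)])
```

The two pain points she names fall straight out of the loop: each step depends on the previous one (slow to train and run), and the memory has a hard ceiling.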

>> Mhm.

>> And then in 2017, a paper dropped called

"Attention Is All You Need" and it put

forth the transformer network. And

transformer networks at their core they

have a mechanism called self attention

which is far more efficient at helping

um the model to understand which words

are important for gauging meaning within

a sentence.

So the very first transformer model was

called BERT. It was by Google and it

actually wasn't a GPT model. What it was

actually trained to do was work out two

natural language tasks. It's what's

called an encoder model. It was designed

to try and predict what a missing word

was in a sentence and with two sentences

work out which order they're supposed to

go in. And by forcing the model to do

this, it basically meant that it had to

learn a lot about how the language

functioned. So you ended up with these

layers in the model which are now called

encoder vectors. They're used for a lot

of different things. And basically these

vectors um help you kind of

process text and then you can change the

downstream task. So you can change the

downstream task to say classification

sentiment analysis, things like that.

These models don't generate text.

And the big problem with building

encoder models is you can imagine this

data set you have to design by hand what

the output is. So then the idea of okay

what if we do something like next word

prediction as the task. So say we get a

sentence we just split it at a point and

we say here's the input all of the part

of the sentence and the output is you

have to correctly guess the next word

and these were the generative

pre-trained transformers the GPT models.

So over time building these models, they

were able to scale much more readily

because the size of the data set can

grow a lot bigger because you're

essentially not having to do anything to

it other than some cleaning.
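
The next-word objective described here is easy to sketch: every sentence yields its own (context, next word) training pairs with no hand labelling, which is exactly why these datasets could scale. A minimal illustration (the function name is mine):

```python
def next_word_pairs(sentence):
    """Turn raw text into (context, next-word) training pairs -- the
    self-supervised objective behind GPT-style models. The 'label'
    comes for free from the text itself."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

# One short sentence already produces five supervised examples,
# e.g. ("the cat sat", "on") -- no human annotation required.
pairs = next_word_pairs("the cat sat on the mat")
```

Contrast this with the encoder setup she describes, where the output side of the dataset has to be designed by hand.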

And also they realize that the bigger

you make them, the better they are at

doing a whole range of natural language

tasks. So, by the time we hit GPT3

which was OpenAI's model, um you're

looking at a model that can do a lot of

things we're familiar with LLMs being

able to do now. They can um encode

parametric knowledge. They can learn

things from the data. They can do a lot

of sort of um basic reasoning tasks and

stuff like that. So where we got to is

an arms race. We're trying to make the

models bigger and bigger and bigger like

scaling them. And you'll hear this this

term like uh parameters. We're getting

into billions of parameters. It's just

the size of the neural network. They got

very expensive to train. They're very

expensive to run. And at the beginning

of this year, a Chinese company released

a model called DeepSeek that rocked

everyone to their core because they were

able to achieve pretty good performance

comparable with these massive models

with a much much much smaller network

much much smaller to run a much much

cheaper to run. So

>> Which brings me back to maybe car

quartet. So from the outside and I have

no clue about most of the stuff. Could I

just then have you know by default said

well a new version with so many more

parameters by default then is just

better, or is essentially what DeepSeek did

show that is not the case or how does

that work out? I mean

>> Yeah, so in order to improve models,

there's sort of a few different things

you can do. So one is you can just make

it bigger. You feed in more data and you

make the model bigger. And this was the

most reliable way of achieving better

model performance. But neural networks

like you they're very you can sort of

customize them and design them in a lot

of clever ways that are better at

extracting meaning out of the data. And

so um basically

it's that DeepSeek used a few tricks,

not just with model architecture but

with the way that they trained the model

um that sort of manage to extract a lot

more meaning with fewer parameters.

Because if you think about the way that

these models work, you're basically you

have an input. An input is a bunch of

words and each of those words will kind

of be processed along a path as it goes

through the network and at the end it

will come to one conclusion. What's the

next word? But most of those paths

aren't used in a single prediction. So

being able to sort of trim the networks

down to what is actually like the most

useful paths either through like like at

the beginning of the training process

itself or afterwards a process called

distillation basically

allows you to build more compact models.

They're just better at sort of

extracting information efficiently.

>> And what's the state of DeepSeek now

today? I mean, do you do do you know if

it kind of can compete with all the

other, you know, fancy models out there?

>> It's interesting. So, in terms of

competing, it's also in the open-source

realm. So, we can kind of talk about

proprietary versus open source. Um, but

in terms of model performance, yes, it

does seem to be holding its own, not

with the big big like really huge

models, but at least um they have a

reasoning model which holds its own

against some of the state-of-the-art

reasoning models. And their generalist

model also performs quite well. So this

is why everyone was so shocked because

it's not like it was, you know, half as

good or a quarter as good. It was almost

getting into the realm of as good.

>> Mhm. With the state-of-the-art reasoning

models though, would I then just by

default assume

having no other information as an end

user that still context size and

parameters and new version automatically

means better?

>> Okay, so not necessarily. And defining

better, oh my god, I have a whole talk

about this. So um assessment is one of

the most neglected and most difficult

areas in natural language processing. So

assessing whether you have a good model

is really hard even when it's for a

specific task unless that task is

something discrete, like you're

trying to predict something like an

emotion

>> Right, and we're just talking assessing

the quality of the output essentially

right and this is my chance to to say

>> What we've forgotten with LLM hype is

we're still talking about machine

learning so machine learning is not just

models machine learning isn't decision

trees or boosting models or whatever.

Machine learning is an approach to

working with data that is structured and

gives you kind of ideas about how to

assess model quality. So the goal is to

create a model that works in production

as well as it does in a development

environment. So machine learning gives

us the tools to do that. And people have

just gone, oh, we have magic models now

that can do everything. So we don't need

to think about basic machine learning

practices. No, no, we do. And it's

coming back to bite us in the ass

because we haven't thought this stuff

through. So what kind of happened with

the beginning of all NLP assessment but

also LLM assessment is they were

designed to do natural language tasks.

So we have a bunch of natural language

benchmarks. Not going to say they're

perfect, but it's like okay can this

model do classification well? Can it do

um question answering? Can it do

summarization?

>> By the way, just classification, for

anyone who's I mean you just have a for

example some sort of a text and you say

hey um you have an example of of a uh

sorry, do you have a classification

example?

>> Yeah. So, um, I recently released, well, not

recently it was 6 months ago but I

released a tutorial where we were trying

to classify books into fiction or

non-fiction. This would be a

classification task. Um so based on the

description is the text describing this?
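
As a rough sketch of how that fiction/non-fiction task could be framed for an LLM: build a constrained prompt from the book description, then parse the label back out. The prompt wording and helper names are my illustrative assumptions, not the tutorial's actual code, and the model call itself is deliberately left out:

```python
def build_prompt(description: str) -> str:
    """Frame classification as a constrained generation task."""
    return (
        "Classify the book below as exactly one of: fiction, non-fiction.\n"
        "Answer with the label only.\n\n"
        f"Description: {description}"
    )

def parse_label(raw: str) -> str:
    """Map the model's free-text reply back onto the label set."""
    answer = raw.strip().lower()
    # Check "non-fiction" first, since it contains "fiction" as a substring.
    for label in ("non-fiction", "fiction"):
        if label in answer:
            return label
    raise ValueError(f"unexpected model output: {raw!r}")
```

The parsing step matters because, unlike a classic classifier, an LLM returns text rather than a class index, so the application has to normalize whatever comes back.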

Then people started noticing, oh these

models seem to be able to do other

things. They can reason and they can

with the parametric knowledge they can

answer questions on exams. So they seem

to know things about history and math

and science. And then uh we started

making claims about AGI.

>> I've gone there early. Actually, I wanted

to ask you later on. Yeah. We can come

back. Yeah. This is a biggie. But

basically people have started creating

or reusing assessment techniques to see

how well LLMs do this stuff.

>> Mh.

>> Now

so many problems with these assessments.

Firstly like some of them were just

created using things like Amazon's

mechanical turk by people who like they

have no investment in making this good.

May it may also be English is not their

native language

>> And some of these tools are filled with

gibberish. I am not joking. Like

questions that do not make grammatical

sense and they don't actually have a

correct answer if you as a human are

trying to assess them.

>> So this was sort of the first generation

of assessments. The second generation

they've tried to clean things up a bit.

Um we also have problems with things

like uh there's a phenomenon in machine

learning called data leakage. It's

basically where a model performs very

well because it's actually kind of seen

the answers. You've actually, you're

actually shown it the things that are in

the data set you want to test on when

you trained it. So it's kind of

memorized the answers like a student

cheating at a test
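
A crude way to picture a contamination check: count how many benchmark items appear verbatim in the training text. Real audits work with fuzzier n-gram overlap on corpora far too large to scan naively, but the shape of the question is the same. Everything here (names, normalization) is an illustrative assumption:

```python
def contamination_rate(benchmark_items, training_docs):
    """Fraction of benchmark items found verbatim (after whitespace and
    case normalization) inside the training corpus."""
    def normalize(text):
        return " ".join(text.lower().split())

    corpus = normalize(" ".join(training_docs))
    hits = sum(normalize(item) in corpus for item in benchmark_items)
    return hits / len(benchmark_items)

docs = ["The capital of France is Paris.", "Water boils at 100 C."]
items = ["the capital of France is Paris.", "Who wrote Hamlet?"]
rate = contamination_rate(items, docs)  # one of two items leaked
```

A model scoring well on the first item proves nothing, since the answer sat in its training data; that is the "student cheating at a test" problem in miniature.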

>> And um, there's pretty good evidence that

there's a lot of data leakage of

assessments into model training data

sets and this continues to be an ongoing

problem. So now they're trying to make

these assessment sets private. But

beyond all these problems

the fundamental thing is that you are

trying to sell a model as good. Okay.

But you, Marco, you come in and you want

to use a model for something specific.

You will want to do a task. You might

want to code.

>> You might want to

>> transcribe some audio, for example.

>> Some audio,

>> generate images. I don't know.

>> Yeah. Yeah. Yeah. Exactly. Exactly. Or,

um LLMs can't do that, by the way. But

yeah.

just as an aside but but related machine

learning models that have come out in

the last 5 years have also had advances

>> But um, basically, like maybe you want to,

I don't know you you want to research

something so you put in a paper as part

of the context and you want it to answer

questions you know knowing that a model

is good is not that helpful if it's

scoring at all these random benchmarks

that assess a whole bunch of things

>> and you want to know how good it is

at completing code.

>> So

that was a very long way to say

assessment is and it's just

>> So I was actually thinking about that.

So as an end user again who knows

nothing.

>> Mhm.

>> Would I just take a random task, a

coding task, send it to, I don't know,

three different models.

>> Mhm.

>> Then just, you know, do two or three

examples and then take whatever I think

gave me the best response and I'm just

going to stick with the model and that's

it. I mean, what can I do

essentially as as an end user?

>> I think basically a lot of it seems to

be based on vibes at the moment, right?

Like it's like my gut feeling is that

it's better at that. There are specific

coding benchmarks. So say you want to

look up um or so you want to use a model

for coding, you can go look up that

benchmark and see how well different

models have scored on it. Would it tell

me if I go to a benchmark and then it

says, "Well, that model got 42 points.

The other one just got 39 points." Does

that tell me, "Hey, great. Three, three

more points means so much more uh I

don't know that much better of a model."

>> And this is the problem with

assessments, right? You need to

understand like what population are they

assessing? Is your target language even

represented? Are the tasks you're doing

actually represented? And how are they

assessing good? Is it, does the

code compile? Is it the number of errors?

Is it the maintainability of that code?

So, this is the problem. I don't think

people are looking at stuff in this

depth. It's just you want one number to

boil it down to because everyone's

confused. But unfortunately, I'm not

here to necessarily tell

you there's a magic bullet. Like there

isn't. Um I would say that like

generally the models do seem

to do okay in coding tasks for languages

for which there's a lot of training

data.

Do I kind of can I even tell somehow

what a model was trained on or when I

now look at these big models I just

assume as you said a general purpose

model that which is somehow trained on

everything and then someone told me in

some I don't know Firebase YouTube video

oh by the way claude was more trained

specifically on code and that's why I'm

I'm using it for coding. Do I have any

idea what what these models were trained

on?

>> This is also a major problem. So with

the early like the early culture around

LLMs was they were all open source and

so the data sets that they were trained

on were released and anyone could go and

look at them but one of the competitive

advantages even for open source models

now has become the data because it's one

of the the few differentiating factors

that's left and so most of these model

model creators will not tell you what

the data the model was trained on. So

you might, from the paper that they

release or in the marketing material,

get a general overview. And this may be

where people are getting these numbers

from,

>> There's also just a lot of rumors that

fly around. So people might just be

making stuff up. Maybe they asked

ChatGPT and it told them and it

hallucinated. Who knows? But um the best

you're going to get is like what the

model creator will release to you and

they will not release the data generally

anymore. So you yourself can't go check.

There is also the fact that these data

sets are massive. So searching them

efficiently is really hard. But

>> You know, the interesting thing is, um,

with with the marketing material you

just mentioned when I go online and then

again I see massive update changes

everything and now the model much higher

throughput um I've asked you before I

think uh thinking models non-thinking

models what these are actually also and

then you see all these updates now you

have adaptive thinking and all these

words elaborate thinking stupid thinking

slow thinking fast thinking I'm just

reading through all that and thinking

what the hell is going on, what

should I do with all of this?

>> Yeah. Yeah. Yeah. And I

>> And, by, the, way,, could, you, just, briefly

explain what thinking versus

non-thinking is, if you don't mind?

>> Yeah. Yeah. So, we've kind of talked

about reasoning in the context of LLMs.

So, LLMs can't actually reason. They're,

they can kind of emulate and pattern

match and do something that looks a bit

like reasoning. I just want to clarify.

They don't actually reason in the way

that humans do. But what people

discovered is that models are

particularly bad if you just straight up

ask them a question like um let's say

like a a math question. This is always

the the classic like um Jenny has four

balls and Sam has eight balls. How many

balls do they have together? Right? And

usually it'll just randomly come up with

a number. But you can do particular ways

of prompting these models. uh one is

called chain of thought where you can

show it how to break down these

problems. So if I had this problem then

I would need to say first Jenny has four

balls then I would need to identify that

Sam I can't remember what name I came up

with has eight balls. I need to add them

together and that gives me the final

answer of 12. Now I give you another

problem. So
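
The worked-example prompting she describes is, mechanically, just string construction: a solved problem is prepended so the model sees how to decompose before it gets the real question. A minimal one-shot chain-of-thought prompt using her balls example (the exact wording is my illustrative assumption):

```python
# One solved problem, spelled out step by step, shown before the real question.
WORKED_EXAMPLE = (
    "Q: Jenny has 4 balls and Sam has 8 balls. "
    "How many balls do they have together?\n"
    "A: Jenny has 4 balls. Sam has 8 balls. 4 + 8 = 12. The answer is 12."
)

def chain_of_thought_prompt(question: str) -> str:
    """Build a one-shot chain-of-thought prompt for a new question."""
    return (
        f"{WORKED_EXAMPLE}\n\n"
        f"Q: {question}\n"
        "A: Let's think step by step."
    )

prompt = chain_of_thought_prompt(
    "Ana has 3 apples and buys 5 more. How many apples does she have?"
)
```

The model then continues from "Let's think step by step.", imitating the decomposition pattern rather than jumping straight to a number.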

>> This problem decomposition seems to work

a bit better. Uh a recent paper came

out, I think it was from Apple, Apple

doing amazing research by the way around

like the limitations of these models.

Like really interesting stuff. But this

Apple paper seemed to show that like

even with these sort of ways of

prompting models and uh thinking models

altogether, their ability to do problem

decomposition really just it it exhausts

at a point. You can't work around this

limitation.

So yeah, thinking models are either

prompted in such a way that they're told

to think more. They're given examples

through um chain of thought prompting.

They may be fine-tuned in such a way

that they are shown a lot of such

problems. But the thing is is that

thinking models usually are a lot slower

because internally they will be looking

for multiple possible solutions to a

problem. And this gets more complex when

you then start building sort of like

thinking agents. Um we can come back to

agents as well. But

generally like then people are like okay

it's good if they think sometimes we

don't need them to think all the time.

And so these adaptive thinking models

are basically okay we've trained the

model in such a way or built an

application agent application in such a

way that the model doesn't always

default to this deep thinking process.

It may sometimes think when it needs to

and when it doesn't it just is cheaper

and faster.

>> Mhm.

>> So

>> It doesn't make the whole topic any

easier. I feel

>> The topic's not easy,

>> but it's being stuffed down your throat

as being really easy. It literally is

like a couple numbers, choose the latest

model and then for example when I use

some of the um competitors for fun um

and then you basically go back for for

coding itself and you fall back to trial

mode. It basically brings you to a

cheaper model or to the to you know

version one to go and you're thinking oh

does it now mean I'm using a totally

dumb model it can't do anything anymore.

Do I need to spend the money on, you

know, premium tokens and otherwise I

will just get crap answers. Uh it's all

stuff which is very confusing to someone

who just wants to code and just want to

get something done.

>> Actually, that reminded me of, I did a

talk last year for GOTO and I

did a little demo at the end on um like

a really simple rag application. So it's

basically you have a PDF, you get it

sort of broken down into vectors using

these encoder models and then um you can

basically search through for the most

semantically similar vector to a query.

So you basically search through it and I

use GPT-3.5. I really like this model

because to be honest it's still a very

powerful model and it is cheap as chips

to use. And I got a comment on the video

saying, "It's very nice, but she's using

a very out-of-date model." And I'm like,

"It's fine for what I need it to do."

Like, you don't always need to use the

most state-of-the-art model.
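
The retrieval step of a demo like that can be sketched without any model at all: embed each chunk of the document, embed the query, and return the most similar chunk by cosine similarity. Here a bag-of-words Counter stands in for the encoder model she mentions (a real RAG app would use learned embeddings), but the search has the same shape:

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an encoder model: a bag-of-words vector."""
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=1):
    """Return the k chunks most semantically similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "LSTMs process text sequentially and forget after about 20 words.",
    "Transformers use self-attention to weigh which words matter.",
]
best = retrieve("what is self-attention in transformers?", chunks)
```

In the full pipeline, `best` would then be pasted into the prompt so the LLM answers from the retrieved passage rather than from its parametric knowledge alone.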

>> But that's because people really can't

judge what the what they're kind of

using.

>> Yes. And like, in all fairness, in

software normally using the latest is

best. It's got like it's more secure.

Like if you're using the latest

hardware, it's usually get more bang for

less money. Um, but it's not quite the

way things work anymore with LLMs

because training these models is so

freaking expensive, so slow and we have

plateaued honestly in terms of

advancements we're seeing. So models

that are 2 years old are still

competitive, I would say-ish, with models

that were released this year.

>> How do I, and I've had that issue

myself. How do I actually handle

hallucinations? And not just the

hallucinations but actually the the

confidence that comes with it. So you

get very confident answers. Um you have

to really really understand the topic to

to figure out, hey, the LLM is lying to me,

kind of.

>> Mhm.

>> And then also the topic of, I feel, maybe

with more and more AI generated content

and the training you just mentioned I

mean it's kind of a self-reinforcing

loop that the answers get worse and

worse.

>> What's your take on that? I mean, on

that whole topic.

>> Wow, big topic. So basically,

hallucinations come because models are

trained on a limited data set, right? So

you have two types of hallucinations.

You have factuality hallucinations where

the model has actually just learned

something completely incorrect. Maybe it

got trained on some conspiracy theory

blogs, which does happen by the way. So

they've internalized knowledge that is

false. And then you have uh faithfulness

hallucinations. And this is where say

you have some sort of reference

but it can't actually, say, summarize it

properly or answer questions from it

properly. So it can reference to

something but it can't actually

reproduce the information it sees in a

way that's correct. And both of them are

major problems. Um, the way that we've

tried to deal with say

factuality hallucinations

is trying to give the LLM access to

more up-to-date and correct information.

So, one of the first ways to do this was

RAG, retrieval-augmented generation. That

was the hotness last year. And

generally, like there's nothing fancy

about it. Generally the idea is you just

somehow give the LLM access to an

external knowledge base and you say

please use this preferentially over your

parametric knowledge. But again problem

is sometimes you don't want it to do

that. Sometimes the parametric knowledge

would be okay. So there's also clever

ways to do RAG in a way

that is also adaptive. Um the problem is

though if you have a model that has high

faithfulness um hallucination rates even

if it has access to the correct

information it won't be able to do it

properly. And so uh yeah

I don't really know what to tell you

about

treating hallucinations. It seems to be

an unsolved problem in many ways. Do you

find yourself googling more or also

asking, for example, ChatGPT

instead for the right answer to any

of your stuff you might have googled

like a year ago? It depends um if it's

something okay I will confess a way that

I use ChatGPT.

So ChatGPT, by the way, I should also

explain: ChatGPT, uh, Claude, all these,

these are not models anymore they are

actually agents so they're a whole

application that is built around an LLM

in which the LLM acts as the reasoning

engine, and there's "reasoning" coming up again,

and it can basically

access a bunch of tools say search

engines, image generators

um it can access documents that you

upload etc etc. So basically um I like

to take advantage of that and I won't

necessarily ask it stuff that I could

Google but say I need to work with a

bunch of say I want to work with a new

research paper and I've read through it

but I just want to ask a few more

questions to clarify things maybe get it

to cross link with other things I would

upload that to ChatGPT and then

chat with the PDF. This is what RAG was

always advertised as at the beginning.

Um, I think this is neat. I wish I had

this during my PhD. It is super.

Sometimes it still hallucinates, but you

can still then go to the source and I

always ask it for sources and then

sometimes it will hallucinate the

source, but then I'll go and check

>> Which means you're, in that way, kind of.

So, you get an answer and you basically

in your mind think it's 80% correct and

you always have the feeling of I need to

double check this. I mean

>> Yes. Yeah. Yeah. I never, ever, ever use

ChatGPT in a way, or any other LLM in a

way where I cannot verify the answer

because you just cannot trust it. And

it's exactly the same as you and I

remember early search engines. I'm sorry

to cast such an aspersion on you, but

it's true. It's true. Um the early

internet was wild, right? like the early

search engines didn't index things based

on quality or like even like cross

linking and things like that and so the

quality varied and I think a lot of

millennials have that experience of okay

like

you kind of get a gut feeling for when

something's a crap source like it's just

someone's personal blog or it's some

like real dodgy-looking newspaper

or something, um, and then if it's, say, CNN,

you're like, obviously they have their own bias. But

it's it's CNN, right? It's probably

going to be checked.

>> So, in that way, um, I still prefer to

Google because I feel like I can look at

the source. I can get the context. If

ChatGPT wants to give me something, I

want to know where it got that

information from.

>> I think it's very tricky. I think

combined with the confidence where you

always get like certainly here's your

answer, and then you notice, ooh, right. I

mean, but it's literally wrong

more often than I was actually thinking.

And when it comes to coding, it's simple

because when you're a senior coder, you

can tell where, you know, it went off the

rails and whatever. But I think

especially when people use it for

medical checkups and everything like

it's it's um

>> Please do not use it for medical

checkups. Please do not do this.

>> Yeah. But I think, that's... I have the

feeling that's a general trend where

it's going down. I mean what's going to

>> Yeah. And like, my niece is a Zoomer, and

she preferentially uses ChatGPT to

check things. Like she's clever. I think

she uses her common sense. Right. But

>> Double check. Are you on TikTok,

Instagram, any of these new

millennial, Gen Z platforms?

>> I am on... I am on Instagram. I've been a

longtime user. Yes.

>> Right. Yeah. Yeah. Let's see. Let's see

what the uh what the generation after us

uh, is going to do. What I've

been wondering about, and to be honest,

uh, I'm hearing mixed signals. So, uh,

I know that vibe coding was all the rage

like two months ago. Now, when

talking to people, I'm getting the

feeling of, yeah, we're almost past that stage,

kind of. And having talked about all

these issues with hallucinations and

confidence and assessments and whatever

do you really think you can vibe code

yourself to

become a startup billionaire?

Okay, so context on vibe coding as a

term. So Andrej Karpathy, absolute legend

in the LLM space. He talked about

this term in what like January or

something. Two months later, there were

two books released by major publishers

about how to do vibe coding. Like it was

like the pipeline from this term being

invented to mean this is a cool way to

prototype projects to this is literally

how you can become a developer without

knowing how to code was 86 days I

believe. So this is, I think, the poster

child of what's happening in

this space. Um,

look it's a little hard for me to say

because I'm not a developer. I want to

put the caveat in. I know there's things

like security and maintainability and

latency that are concerns that

developers need to worry about. Me as a

data scientist in my pure beautiful

research world, I don't care about such

things. But it comes back to the fact

that I have not convincingly seen an

agent being able to come up with an app

that you, as someone who does not

know how to code in that language,

can ensure is secure, can ensure is

performant. Now, for prototyping,

question marks I don't think you can

become a billionaire off the back of

that first application. You're going to

have to get a developer in at some point.

Maybe for just a proof of concept that

works with like small amounts of data or

like works as a cute form on a website

or whatever. Sure, maybe. But as a

developer, maybe you have better

>> I do share the same sentiment. I think

what I mean there's the Twitter bubble

which is insane by the way or the X

bubble or whatever, where people just vibe

code left and right, and you see what

whatever they're selling or not selling

or just faking. Uh

and I think that prototyping yes I think

if you know what you're doing and if

you're kind of seniorish it is superb

kind of because you you figure out when

it goes off the rails and you can you

know, put it back. But as

something for someone who doesn't know

how to code at all, I find

it a super tough sell. I mean, we're

we're we're basically not there yet. I

think

>> I think a good example of this is, so, as

a data scientist I don't really know any

programming languages confidently

outside of Python or SQL. SQL is a

language. I'm going to defend that.

>> But, um

>> You're right. Yeah. Yeah.

>> I love SQL. I'm a SQL girl. Um, so we

were putting together some demos for a

video. Um, it was basically for our code

assistant, Junie, and you helped me with

this with the Java ones. And yeah

Nicholas, another of our co-workers, he

helped me with some prompts for

TypeScript.

So he's like, "Caveat, I haven't tested

these. See how you go." And I tried one

of the suggestions, which was to create

a Pokédex application, keeping track of

all your Pokémon, right?

it didn't work after multiple tries and

when I tried to debug it even with the

refactoring capabilities within WebStorm,

I just didn't understand why it was

breaking. There was just too much code I

didn't understand because I don't know

TypeScript and like for me this was like

such a like it wasn't a surprise like I

knew there was going to be a point at

which the agent failed um because it was

beyond the complexity of what it could

handle but

>> It is exactly that spot where you

get something up and running quickly and

then you might have to just change one

line or a couple of lines, but you don't

know what these lines are, actually.

>> You don't know where the problem lies, and

then you spend I found one, two, three

four, five hours fixing this or going in

endless loops with the LLM, saying

please fix it for me, fix it for me, fix

it for me. It doesn't fix it for you and

then you just do it 10 times and you

feel like literally you feel stupid

because you just, you know, please now,

please really fix it, and the LLM goes

back to you and says, by the way, now I really

fixed it for you.

Thank you. That was a great question.

>> And, uh, yeah, so I've been stuck in

those kinds of loops.

>> Yeah. And like, a counterexample, where I

knew what I was doing, when I was first

playing with Junie. I got it to create

like I downloaded a classification data

set and I said create me a deep learning

model to classify this. And I

told it what the labels were, and blah

blah blah. And it actually did a pretty

credible job because that workflow is

pretty standardized. Um, but I could see

instantly problems and I knew what to

fix. It was a completely different

feeling from me in the middle of all

this TypeScript going, "Oh my god

okay, this one didn't work."

>> Just for people to understand, and also

for developers actually, I think

agentic can mean so many things, kind of.

What is your definition, if any,

of what an agent really is?

because um when I talk with different

people they all give me different

answers.

>> So basically, what an agent is, is kind

of what I described with, say, ChatGPT and

Claude, right? But they can be much

much more basic. These are obviously

super super sophisticated. So with an

agent what you do is you take an LLM and

you give it access to tools and those

tools can be anything. The very first

one actually I remember was when ChatGPT

introduced function calling. This was do

you remember that? It was like a while

ago. Um but it was generally the idea

that you could write Python functions to

do API calls and things like that. The

tools have obviously gotten a bit more

sophisticated since then. Um
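The tool-calling loop described here can be sketched in a few lines. This is a toy illustration only: the model's reasoning step is stubbed with a keyword match, and both tools are made-up stand-ins, not real APIs.

```python
# Minimal sketch of the function-calling / tool-use pattern. The "LLM
# decision" is stubbed with a keyword match; a real model would emit a
# structured tool call based on the tool descriptions it was given.

def get_weather(city: str) -> str:
    """Hypothetical tool: pretend weather API returning a canned answer."""
    return f"Sunny in {city}"

def search_web(query: str) -> str:
    """Hypothetical tool: pretend search engine."""
    return f"Top result for '{query}'"

TOOLS = {
    "get_weather": (get_weather, "Return the current weather for a city"),
    "search_web": (search_web, "Search the web for a query"),
}

def fake_llm_choose_tool(prompt: str) -> tuple[str, str]:
    """Stand-in for the LLM's reasoning step: pick a tool plus argument."""
    if "weather" in prompt.lower():
        return "get_weather", prompt.split()[-1]
    return "search_web", prompt

def run_agent(prompt: str) -> str:
    tool_name, arg = fake_llm_choose_tool(prompt)
    tool_fn, _description = TOOLS[tool_name]
    return tool_fn(arg)  # execute the chosen tool and return its result

print(run_agent("What is the weather in Berlin"))  # → Sunny in Berlin
```

The structure is the same in real agent frameworks: the model sees tool names and descriptions, picks one, and the application executes it and feeds the result back.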

>> Now kind of standardized with MCP stuff.

>> Exactly. Exactly. And MCP, uh, just to

kind of tell people in the audience who

don't know, it's basically a protocol to

standardize communication between LLMs

and like tool servers. So say you've got

an API that gives you weather updates

that's obviously going to have a

specific way that it wants to receive

messages. The LLM may send messages in a

different way. Then you have this

basically M*N complexity of the number

of ways you can connect tools and

models. And by the way, shameless plug.

I made a video about that on my channel

Marco Codes. Go watch it.

>> Marco does excellent videos, by the way.

They are very entertaining. Yes.

>> So, yeah, basically, as you said, MCP is a

way of, um, basically standardizing this

messaging and the connection system.
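To make the M*N point concrete, here is a small sketch. The message shape below is invented for illustration and is not the actual MCP schema; the idea is that with one shared format, M models and N tools each implement the protocol once (M+N implementations) instead of needing a bespoke bridge per pair (M*N).

```python
import json

# Without a shared protocol: every model-tool pair needs its own adapter.
M_MODELS, N_TOOLS = 3, 4
adapters_without_standard = M_MODELS * N_TOOLS   # one bespoke bridge per pair
adapters_with_standard = M_MODELS + N_TOOLS      # one protocol impl per side

# Hypothetical standardized tool-call message (NOT the real MCP schema):
def make_tool_request(tool: str, arguments: dict) -> str:
    """Serialize a tool call in one agreed-upon shape every server accepts."""
    return json.dumps({"type": "tool_call", "tool": tool, "arguments": arguments})

request = make_tool_request("weather", {"city": "Berlin"})
print(adapters_without_standard, adapters_with_standard)  # → 12 7
print(request)
```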

>> So,

this sort of evolved like the need for

MCP evolved because it's obvious how

beneficial it is to use tools with LLMs.

So LLMs, like I said, they can do basic

reasoning. If you put into an LLM

something like, "Hey, I want a picture."

Hey, this is me and ChatGPT hanging

out. Hey, hey girl, how you doing? I

need a picture of a cat.

So if you tell, um, ChatGPT, hey, you've got

access to DALL·E, an image generation

model or you've got access to an image

search, it'll be like, okay, I'm going

to decide which of those tools I think

is most appropriate based on the

description of the tools you've given

me. And I'm then going to go and use it.

I'm going to connect in some way, maybe

through MCP, maybe through something

else. And then I'm going to get the

result and I'm going to serve it to you.

How would I, just again for someone to

understand, if I gave it three relatively

similar descriptions of tools,

how could I be confident in its

decision of what tool it chose, kind of?

>> Yeah. So this, if you just give it

the tools and the descriptions, you can't

necessarily, like, it's a bit more free

form but there are frameworks where you

can be a bit more directive like you can

be like hey or you know from the start

you'll probably design your application

so that you would choose tools for

specific reasons. So you don't just give

it access to a smorgasbord of tools, you

give it access to the specific tools you

choose. But yes, there are also ways uh

through frameworks like LlamaIndex or

LangGraph, which give you a lot more

control over like how the LLM selects

tools under which circumstances it would

um for which sort of context whereas

something like, say, smolagents, which is

the Hugging Face, um, way of accessing

tooling,

it's a lot more free. So the LLM will

basically be like make up my own mind

based on what you give me.

>> Mhm. So

>> And then agents, essentially, also, uh, when

you said it's an umbrella, um, I know

we're building stuff where we have agents, sub-agents,

handing them specific subtasks,

also again branching out, so a task is

split up maybe into even different tasks,

and not just different tools. But I mean

>> So the, yeah, these are called multi-agent

applications so the reason you might

want this is for any reason that you

might want to break a program down like

as it increases in complexity. So, it

might be there's multiple subprocesses

but an additional reason you want to do

this with LLMs is because the way that

they're doing all of these tasks is they

have these chat templates. That's how

they do all of these like interactions.

But the chat templates are really just

getting appended to each other and

passed in each time you generate a new

token. So if you're trying to do the

whole thing, it can get very expensive, but

it can also start

depending on the model and how big its

context window is, it can start failing

because you've overflowed the

context window, or it can actually get a

little bit confused, because it's like,

well, there's a lot of

information in here and I can't parse it

correctly anymore.
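A toy sketch of this appending behavior: every turn gets added to the transcript, and the whole transcript is passed back in, so the effective prompt keeps growing until it no longer fits a (deliberately tiny) context window. The whitespace token counter is a crude stand-in for a real tokenizer.

```python
# Toy model of chat-history growth and context-window overflow.

CONTEXT_WINDOW = 50  # tokens; tiny on purpose for illustration

def count_tokens(text: str) -> int:
    return len(text.split())  # crude whitespace "tokenizer"

history: list[str] = []

def send(user_msg: str, fake_reply: str) -> bool:
    """Append one turn; return False once the transcript no longer fits."""
    history.append(f"user: {user_msg}")
    history.append(f"assistant: {fake_reply}")
    prompt = "\n".join(history)  # the WHOLE history is passed in each time
    return count_tokens(prompt) <= CONTEXT_WINDOW

print(send("hi", "hello there"))  # → True: a short chat fits easily
ok = True
for _ in range(10):
    ok = send("tell me more please", "here is a long answer " * 2)
print(ok)  # → False: the appended history has exceeded the window
```

Splitting work across sub-agents, each with its own shorter history, is one way around exactly this growth.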

>> I don't know what that leaves me with

now.

AGI

>> AGI. AGI.

We're not quite there yet at AGI. But so,

what I'm, so I'm just trying to

figure out how to put it all together, because at

the end of the day, still,

um, what it looks like to me is, I will

still kind of, I probably have some sort of...

So you just told me everyone has

issues assessing the output, essentially,

I mean, apart from the marketing

material, which gives you different

responses,

which leads me to think Yeah, I will

just prompt my couple of favorite LLMs

to give me some answers and I just

choose the one which I think is the best

and then uh that I live happily ever

after.

>> Okay, so there are a few rules of thumb.

Basically, models need to be over a

certain number of billion parameters

generally to handle certain tasks

>> Right? Which I guess means the

reasonable ones are, I mean, essentially,

the big ones. Okay.

>> Yeah. And also, if you want to do more

complex tasks, so say you're building a

multi-agent application, you may want a

reasoning agent as the main controller

of everything, because they can parse more

complicated instructions.

Um, we also haven't talked about uh

self-hosting versus

>> It's actually a topic I wanted to get

into just right now. Yeah.

>> Yeah. I think this is a consideration

when choosing models because okay I

talked about the fact these models are

incredibly expensive to run and that's

because, let's say you're talking about a

model of like a few hundred billion

parameters,

that model object might be like 40 GB, it

might be more. You then need to upload

that to a GPU server, so it needs to be

held in GPU memory,

and basically

Then you need to run inference through

it. And remember we're talking about

autoregressive models. So what that

means is that you pass in your initial

prompt, it generates a token, it appends

that token to the end, then it passes it

through again. So it's doing many many

runs, sometimes hundreds of runs

depending on how long your output is per

um per time that you prompt it. So yeah

this is expensive and to do that in real

time requires really beefy machines and

they're expensive.
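The autoregressive loop described above can be sketched like this: one "forward pass" per generated token, with each new token appended to the input for the next pass. The model itself is stubbed with a canned reply, since the point here is the loop, not the model.

```python
# Toy autoregressive generation loop: the model runs once per token, and
# each generated token is appended to the input for the next run. This is
# why long outputs mean many expensive forward passes.

REPLY = ["LLMs", "generate", "one", "token", "at", "a", "time", "<end>"]

def fake_forward_pass(tokens: list[str]) -> str:
    """Stand-in for one full (expensive) run through the whole model."""
    generated_so_far = len(tokens) - 1  # everything after the prompt
    return REPLY[generated_so_far]

def generate(prompt: str, max_tokens: int = 20) -> tuple[str, int]:
    tokens = [prompt]
    passes = 0
    while passes < max_tokens:
        next_token = fake_forward_pass(tokens)
        passes += 1
        if next_token == "<end>":  # the end token a tuned model emits to stop
            break
        tokens.append(next_token)  # append and go around again
    return " ".join(tokens[1:]), passes

text, passes = generate("Explain LLMs:")
print(text)    # → LLMs generate one token at a time
print(passes)  # → 8 forward passes for a 7-token reply (plus the end token)
```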

>> The thing is, how would I even, as a

company, how would I go about it? I mean,

not just in terms of okay I need a

hardware setup which is probably going

to be expensive.

>> Then I need some machine learning

engineers. I mean, would I just

take some random opensource model shove

some data into it? Which type of data?

How would I even train? How would I go

about training it?

>> How would I then maintain it? I mean,

could I just train it once or do I have

to train it like every four weeks or

every week?

>> How would I do all of that? I mean, how

feasible is it as any sort of company

except the really big ones to actually

do that on a consistent basis?

>> So, at the moment, no one is really

training their own models, and there's

kind of no need for it. So, a while ago,

the term foundational models was coined.

I'm not sure how much I like this term

but the idea is that there are huge um

LLMs, many of them open source, and the

open source ones, you can just use them

how you like. So, it might be that the

LLM out of the box is good for you. We

should talk about tuning. Actually

let's talk about tuning here. So,

also kind of to give a bit of context

about what an LLM is, because you see

all these like instruct, coder, like all

these subtypes, and you're like, what

what's going on here? Um, basically, we

talked about how the GPT models were

trained: next-word prediction. And a

raw GPT model, that's what it will do.

It will just keep predicting tokens.

Yeah.

um, until it runs out of, like, max tokens.

But um in order to make them useful, you

need to do something called fine-tuning.

So I think I talked a bit about what

fine tuning is. You basically knock the

bottom layers off the model and you

train it to do something else. So you

can do a few different types of

training. You can do instruction

training, um, or instruction tuning,

sorry, uh, where basically you

design these data sets that are here's a

prompt and here's an ideal output. So

basically if I ask you a question I want

you to answer it and here's a lot of

examples of that and you retrain the

model to do that. You also and you can

do this at the same time you can tune

models so that they're basically chat

models. So you train them so that they

understand chat templates and that's

really important because then it sort of

understands roles and it also

understands like end tokens so it stops
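A sketch of what instruction-tuning data and a chat template might look like, as described here: prompt/ideal-answer pairs, plus role markers and an end-of-turn token so a tuned model knows when to stop. The tag names below are made up for illustration; real templates vary by model family.

```python
# Instruction-tuning data: pairs of a prompt and an ideal output that the
# model is retrained to imitate.
instruction_pairs = [
    {"prompt": "What is the capital of Belgium?", "ideal_output": "Brussels."},
    {"prompt": "Summarize: LLMs predict tokens.", "ideal_output": "LLMs predict the next token."},
]

END_TOKEN = "<|end|>"  # hypothetical end-of-turn marker so the model stops

def apply_chat_template(messages: list[dict]) -> str:
    """Flatten role-tagged messages into the single string the model sees."""
    parts = [f"<|{m['role']}|>{m['content']}{END_TOKEN}" for m in messages]
    return "".join(parts) + "<|assistant|>"  # cue the model to answer next

prompt = apply_chat_template([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": instruction_pairs[0]["prompt"]},
])
print(prompt)
```

A chat-tuned model has learned to generate until it emits the end token; a raw next-word predictor has no such cue, which is why it just keeps going.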

>> Right, that was, by the way, when I

played with, um, I don't know, I

worked myself through an LLM book like a

year ago, and it, you know,

went through the history of the

models, and that, then, with the end tokens,

it just goes endlessly if you don't make

it stop. And I was confused, because I got

into it also with ChatGPT, didn't know

anything about it beforehand, and then

you just get question answer question

answer with different roles but then

when you see what these models do, and

they just, you know, endlessly

spew out text, essentially, yeah, it's

interesting to see. Yeah.

>> You can actually still, on Hugging Face,

they have endpoints where you can play

with demos, like they're built into

Spaces.

>> Would you mind explaining what Hugging

Face is, to someone who... So Hugging Face

is a French company. They have two

branches: they have a for-profit branch

and they have an open source branch, and

their open source branch has

basically become the place to access

these open source foundational models

open source data sets and they also have

a lot of tooling that they have created

themselves around how to work with LLMs

in Python. Plug for Python, you should

come over and join us. Um, but basically

the um the company just like does a lot

of work with LLM education as well. So

I have sung the praises to many people

before of their LLM course. I've

actually just recently done their agents

course, also amazing. They're both free.

You should

>> Why not?

>> You should try them out.

>> Um, but yes, back to Hugging Face. So,

Hugging Face, um, because GPT-1 and 2 are

open source models.

Well, GPT-2 is not actually open source.

Sorry, I tell a lie. But it is at least

freely available to use.

Basically, the, um, like, they set up these

endpoints where you can play with them,

and you can actually see what these

models were like at first, before they got

much parametric knowledge. Uh, some of the

outputs are hysterical. I always ask

them about Belgium. It always gives me

like really funny answers.

>> Um

>> Do you have an example?

>> Um, "Belgium is a small village." No,

"Belgium is an empty place that's not

much more than a

small village." Like, something like that.

So like it grammatically makes sense but

it's just it's just word salad. like

it's like someone

>> you know, got hit in the head, and

>> Yeah.

>> Um, but yeah, you can play with them, and

you can also just see how it will just

keep generating if you, um, use it

through the transformers package. It

will just keep going.

So we're talking about fine-tuning. Yep.

So instruction tuning, chat tuning, and

by the way, instruction tuning can

increase hallucination rates because it

makes the model very eager

>> To answer the question. Yes, because

it's what it's been trained to do.

>> Uh, then we were talking about... why did we

go to this topic?

>> Just the practicality of, like, a company

trying to say, hey, kind of, I want to

host my own model, for whatever reason,

might be privacy. Yeah.

>> Cool, cool, cool. So you will have all

sorts of versions of raw GPT models, but

you'll also have instruction tuned and

chat tuned models. But it could be that

as a company what you want to do is to

train the LLM to actually

learn more about your problem domain. So

maybe you will have an in-house data set

where you will, you know, teach it more

about the specific conventions of how

your customers work, something like

that. So in that case, you might

fine-tune, but you're not going to train

a model from scratch anymore. It's a

waste of time. It's a waste of money.

>> But fine-tuning, it's a lot of work,

>> But that would probably be as far as you

go as a company. But that means as you

said I need some sort of a foundational

model and someone who understands the

fine-tuning process and then it's not

just about fine-tuning process I guess

but getting the data in the first place.

>> Yes. Yeah.

>> Then we are back to the assessment

problem, where I say, well, how do we

actually know that, you know, the stuff

kind of worked?

>> Yep. And now you're talking about a

proper data science project. This is not

quick or easy. But let's say you don't

need a fine tune. Let's say you are

happy to just use a chat model or an

instruction-tuned model out of the box.

Well, in that case, um, you still need

an MLOps person who can host this. And

you're talking about all of the regular

problems that come with hosting some

sort of large and complex app. And in

addition, you are going to need to be

able to assess if it works for your

problem domain as well. This will be the

same problem, though, that you have if you

decide to use one of the proprietary

models, right?

>> Where they take care of all the hosting.

Because here's the thing. Assessment of

LLMs

in the context of your application is

pretty much exactly the same as

assessing whether an application works. You

still need your unit tests. You still

need to inspect things like traces. You

still need to get humans in the loop to

check things out. You still need to do

AB testing. And so, um, there's a really

nice blog post, I will send it to you so

you can share it as part of the episode

notes, where basically an AI engineer

talks about this whole process of how

his team did this for a real estate, um,

it's like a real estate chatbot to

retrieve customer information and answer

other questions. And they were like hey

we were just doing like vibe assessment

at the beginning and then we're like

this is not working. Like as all the

edge cases came up they couldn't work

out if it was working anymore. So it's

just, like, again, machine learning

fundamentals if you're talking about

stuff specific to models, but the model

in the context of an application is just

classic software engineering. You need

just a few tweaks, but you just need

to understand, like, is this working for

my customers? Not magic.
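The "classic software engineering" point can be sketched as an ordinary test suite over the application: a table of cases with expected properties, run against the app and scored. The chatbot here is a stub; in practice it would call the real LLM-backed pipeline, and the suite would run like any other test.

```python
# Sketch of LLM-app assessment as ordinary testing: property checks over
# application outputs instead of "vibe assessment". The app is stubbed.

def chatbot(question: str) -> str:
    """Stub standing in for the real LLM-backed application."""
    canned = {
        "How many bedrooms?": "The listing has 3 bedrooms.",
        "Is parking included?": "Yes, one parking spot is included.",
    }
    return canned.get(question, "I don't know.")

# Hypothetical edge cases gathered from real usage, each with a property
# the answer must satisfy (here: a required substring).
test_cases = [
    {"question": "How many bedrooms?", "must_contain": "3"},
    {"question": "Is parking included?", "must_contain": "parking"},
    {"question": "What is the meaning of life?", "must_contain": "don't know"},
]

def run_suite() -> float:
    """Return the fraction of cases whose output passes its check."""
    passed = sum(1 for c in test_cases
                 if c["must_contain"] in chatbot(c["question"]))
    return passed / len(test_cases)

print(run_suite())  # → 1.0 when every case passes
```

The pass rate becomes a regression metric: when you swap models or prompts, you rerun the suite instead of eyeballing outputs.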

I was just wondering, if I try to

fine-tune a model, or try to offer some

random powerful online model all the

data I have, through MCP for example. Oh.

>> Um, and then just hoping for the best.

Would there be kind of a difference kind

of?

>> So, for the model to train itself, or to give it

access to it, like...

>> Just give it access to it, or maybe

>> Like, in terms of the quality.

Okay. A few complications here.

>> Yeah, sure.

>> Yeah. So the first is that, um, again,

we're probably talking about like a RAG

pipeline, right? So unless all of your

stuff is built into a good search engine

where retrieval is taken care of. So

like before I'm sure a lot of you have

worked with search before, but before we

had semantic search, you just had

search, right? Where you'd index terms

from text and things like that.

Search is not magic. Uh, semantic search

is not magic. It's not infallible. So,

being able to, like, set up, like, tuning

RAG pipelines is really tricky. Like,

being able to actually retrieve the

correct information like you know do you

have like maybe a long chunk where like

because you could break the text down

into chunks. Do you like break it down

into small chunks or do you break it

into big chunks? How long should the

prompt be to actually give enough

information? So yeah, it could be better

if I had access to a lot of information

but it needs to be like searchable in a

way that's meaningful. Again, not a like

magic problem. We come back to search.

And again, the model still needs to be

able to use that information. So, it

needs to have like a goodish rate of

faithfulness hallucinations.

So, yeah, I'm sorry. I'm sorry. I got no

magic bullet for you.
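The chunking and retrieval trade-off mentioned above can be sketched with plain Python: split a document into fixed-size word chunks, then retrieve by naive term overlap. Real RAG pipelines use embeddings and semantic search, but the chunk-size question is the same: small chunks are precise, big chunks carry more context.

```python
# Toy RAG retrieval: fixed-size word chunking plus naive term-overlap
# scoring. The document and query are made up for illustration.

def chunk(text: str, size: int) -> list[str]:
    """Split text into chunks of `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str]) -> str:
    """Return the chunk sharing the most words with the query."""
    q = set(query.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

doc = ("The apartment has three bedrooms and two bathrooms. "
       "Rent includes one parking spot. Pets are not allowed.")

small = chunk(doc, 5)          # small chunks: precise but little context
print(len(small))              # → 4
print(retrieve("how many bedrooms", small))
```

The retrieved chunk would then be pasted into the prompt; if retrieval picks the wrong chunk, even a faithful model answers from the wrong context.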

>> No magic bullet.

It kind of leads us however a bit to the

topic of let's see privacy concerns that

people might have about sharing their

company data with LLMs.

>> Yeah.

>> Um, what's your take on that, by the way?

Do you think that, uh, yeah, just, um, are

you essentially scared, when typing

anything into one of these

models, that they will take your data? Do you

have that in the back of your mind?

>> I'm always careful about what I, um, type

into them. And we should also talk about

the different sort of contracts you can

have with models, right? So we work at

JetBrains. We have worked out an

agreement with our providers which

doesn't allow them to use the data for

training. But if I'm just using ChatGPT

through my personal account, we have no

such agreement. That's part of the $20 a

month I'm paying them. And I suspect

part of the reason for this look I don't

work at these companies. I don't know.

But basically the amount of data of

sufficient quality to train these models

on we may actually be running out of it.

And like you said, a large part of the

problem is this slop that's coming

through, that's just flooding the

public internet. So potentially that

data that you're inputting into an LLM

probably becomes more and more valuable

to these companies because it's probably

real and it's chat data as well which is

what they want for training.

So just be very very careful just be

careful about what you put into these

models. Um

and I think

>> As a company then, however, that

almost sounds like, to

go back to the hosting point earlier,

you would actually try to host

everything locally as much as you can.

>> Yeah, I remember having these

conversations. So, our AI assistant was

in beta

from like March 2023, I think, something

like that. Like, it was early-to-mid 2023.

So I remember I was at EuroPython and I

was doing AI assistant demos and big

companies like people from big companies

the question they would always ask is, is

it going to be safe for me to put my

data into

this model and I was like at that stage

I was like I think so but I'm not

entirely sure like because I wasn't

entirely clear on the legal agreements

we had. And they're like, yeah, even if you

say you have these agreements, I would

still feel much more comfortable

self-hosting.

>> Understandably. Yeah.

>> And for a lot of them, yes. Like, this is

also, you know, it's just how they've

always been with data. Like we have

another product, Datalore, and one of

the biggest like customers we have is

people who want to host it on prem

because they don't want their data to

leave their ecosystem. So if you do not

want that, your only solution is

self-hosting.

>> Especially in the context, which I find

interesting, um, of

let's call them ethical concerns because

when we think about how these models

were trained in the first place and then

all the lawsuits that came after it

where, for example, Reddit went up against

Anthropic, just, hey, you just

took all our data, and the OpenAI

copyright lawsuits and whatever, and

the book authors everyone basically

saying hey you're ripping us off.

>> Yeah.

>> Um, how ethical has all of that been? I

mean, to get where we are today.

>> And there have been a lot of ethical

concerns. So, apart from the sourcing of

the data, um, by the way, the latest, uh,

lawsuit is Midjourney is going to be

sued by Disney and Universal together.

That is not good. Um

but I think it's interesting, because

we went very, very quickly

from an open source research utopia. This

is where we were. Everyone was like hey

we're doing this for the benefit of like

machine learning research we are trying

to do something new and exciting here

and actually let me talk about where

kind of the foundational data set came

from for NLP. It's called Common Crawl. So

Common Crawl was a project that came in,

let's say, 2007, 2012, I can't remember the

exact year it's it's quite an old

project at this point. But generally the

idea was they saw how Google's crawler

kind of went over the public internet

and they wanted a comparable open data

set to train new search engines or um do

natural language processing uh research

so information retrieval, NLP. So Common

Crawl is like the most frequently linked

pages on the open internet and obviously

quite a lot of those are going to be

copyrighted material but it was all just

for research so it didn't matter at that

point right um I imagine under the

copyright usage this was fair use um the

problem is while there have been

additional data sets that have been

pulled in they're all still from the

public internet. The cleaning processes

applied to them, because there is

cleaning done to them, cannot be done

manually, um, because the data sets are

too big and so you can't guarantee that

certain things will be excluded um

that's both for quality and for ethical

use and now all of a sudden we switched

very quickly to this now becoming a

trillion dollar industry um although how

much of that is going to pay out in

action

>> Trillion dollar loss industry, maybe.

>> Trillion dollar loss industry,

potentially there are still some use

cases for these models. I don't want to

write it off, but um

>> Yeah, like, I really don't want to write

it off. There are still some really

interesting things in natural language

processing that we can do with these

models, but all this stuff about like

them doing advanced reasoning tasks and

stuff, it's it's just not going to work.

Actually, I think I read that something

like 75% of AI apps fail, not

because of lack of interest, but because they just

can't be assessed properly. They just

don't work. Uh, anyway, going back to

ethical use of data. Um that's not the

only problem. The early, um,

the early kind of transition from GPT-3

to ChatGPT involved the use of

additional models.

They had uh this process like a training

process where they were trying to get

data to build an additional

reinforcement learning model. So they

would get this fine-tuned GPT model.

They'd get it to output four different

outputs for the same prompt and then

they would get people to rate them from

1 to 7. Yeah, I'm sure you remember

this.

>> So it was later discovered this was done

through an outsourcing firm. These

people were in developing countries.

They were being paid very little

>> And they were exposed to, because this

was the unfiltered model without

guardrails,

>> right?

>> Traumatic material. So there's been a

lot of really serious ethical questions

about

the data sourcing for these models. And

this is also a fundamental truth of

machine learning. There are no shortcuts

to good quality data. It's

either bad and plentiful,

or good and expensive, or somewhere

along the line you've taken a shortcut

and you've gotten someone to do it for

cheap. So, yeah. And do you see any

of the companies I mean taking

responsibility? I mean not of the

companies maybe but anyone

sort of

>> I don't want to say that no one has

because it's not necessarily like I'm

out looking for it. Um

I haven't seen it but that's not to say

people aren't and I've just missed it.

But I think the general consensus is

there's been much more discussion about

it in the last two years. So there's um

prominent AI ethicists like Tim McGru

Margaret Mitchell um they are speaking

out about these issues and they really

sort of brought this stuff to the four

in like 2000 2023. Um

but it's still an unfortunate truth: you can't have these models without the data. It's, I would say, a bit of an analogy for how the world works, right? Like, I can't have this cheap clothing without it being off the back of someone else. Um, so I'm not saying it's right, but I am also saying that, like many issues we have in society, it's something we don't want to examine more closely.

>> Mhm.

>> How about examining environmental concerns?

>> Yes. Yes.

>> Do you think... I know a couple of colleagues who were very strong on, hey, we're blasting through so many resources just to train them. And, um, from your understanding, is it environmentally concerning, what we're doing?

>> Yeah. So with the new data centers that are being built in the US, there are serious question marks about the sustainability of it. We've already heard of problems with places with huge data centers running low on water and things like that. Um, the amount of electricity needed to power some of these things is as much as a small town's. What I would say the silver lining is, is that, like I said, DeepSeek rocked everything to its core. And I think what it's kind of forced the conversation to turn to is: well, we can have smaller models and performance, and that's where we need to go, and that is actually where things have been going. Part of the problem, though, is there's a lot of sunk cost in these huge models that were trained earlier, and they're still being used. I don't know exactly what the state of the industry around this is right now. Um, I would just say my gut feeling, and my hope, is that economic competition will just force smaller models, because who wants to pay $17 for 1,000 tokens when you can pay 0.17 of a cent, or 17 cents?

>> By the way, on the whole price and the cost of it: when I'm using one of the new coding agents, you just see your credits go...

>> Oh, yeah, yeah.

>> And it's a simple task, and then suddenly you're out of credit. Um, interesting.

>> Yeah, because of the number of steps, and the fact that everything just gets added to the context window, depending on how the agent is built. So yeah, the more steps it does, the more just gets appended. And, yeah.

>> Which brings us to... I'm not going to say hot takes, but, um, we're almost at the AGI stage.

Um, we're going to hold off on it. Just one more question. Do you think, just in pure terms of development, do you think that the market is really going to stop hiring juniors, because now all the tasks can be done through LLMs and agents and whatnot, and then we're going to run into a problem with having no seniors in five or ten years' time? What's your take on that?

>> Companies are already doing it. Um, I think there was a well-known, well-publicized case, uh, about Klarna. Did you see that?

>> Nope.

>> So, I can't remember the particulars, but I think Klarna actually fired a bunch of people and then had to rehire them.

>> Yes. Mhm.

>> Um, but I don't remember if they were junior developers or not. Um, I would sort of say that I can't attest to what a company is going to do.

But in the end, the development team

is just going to keep getting more

senior. If you don't promote them

they're going to leave because you're

like, "Well, you should stay as a mid

because we have no one under you."

Sure, it's a hard economic time, so maybe it's not so easy right now, but the

market will recover. I'm sure of it. And

um in that case, people will have the

ability to leave again. Again, the only

thing, and I hate saying this as someone

who comes from a research background and

used to be very idealistic about the

world, but the only thing that really

makes change is economics. And the only

penalty that will matter is economic.

And so, basically, all your developers leaving because they have no freaking juniors to take over: that is probably going to be the incentive. Because the thing is, otherwise the consequences will not be felt by management, while we're still in a market where they can undercut mids and hire them at junior wages.

Um, now, as to whether these tools actually increase productivity that much: I do actually think they increase productivity quite a lot. Do you still need to be a skilled developer to use these tools correctly? Absolutely. If you are a junior developer: my friend Laya Bugliari did an excellent keynote at NDC Oslo last month, and she talks a lot about this issue. Please do not neglect developing fundamental skills the hard way. You need to fail. You need to learn. Don't get lazy and let LLMs outsource your critical thinking. And I find myself doing it also. It's tempting, but you need to struggle a bit to learn this stuff.

>> It is super tempting. And to be honest, I mean, now you get stuck... I mean, back in the day, I had an older brother who had, for whatever reason, a 56k modem, but there was nothing out there online. So when you wanted to do some programming, you had to borrow a book from the library or buy a book, and that was it. And you couldn't ask, like, 20,000 people on Stack Overflow what the problem was. And now that has changed so much. You just, you know, you just get stuck and...

>> Yeah. And I can even say, like, I'm newer to programming. So I probably learned to code about 12 years ago, in Python.

>> I fell in love immediately, cuz it was Python.

>> Yeah. But, um, even then, even with all the ability to Google and stuff, it's so important. Like, even the step of "I need to know how to Google an error message"... well, letting AI explain error messages to you I think can be useful, as long as, again, you have a bit of a gut feeling as to whether it's telling you rubbish or not. It's just really important that you also learn your fundamentals, and learn them in a structured way. So, like, do courses. If you didn't go to uni for computer science, take the time to at least learn things in a structured way, because otherwise you just have no foundation to really rely on.

>> I think maybe people need to hear: you need to struggle. So it needs to feel painful, kind of.

>> Not entirely. Like, it needs to feel a little frustrating, and you need the intellectual satisfaction of that payoff. You need to push yourself.

>> I'm not saying you need to suffer to be a developer. Like, it's a great career. It's really nice. Um, but yeah, you need to be sitting there with that. Okay, actually, I'll tell you a story. Like, it's amazing I'm still here. So I learned basic Python, then I put it down for a while because I had no use for it. And in the last days of my PhD, I picked up R.

>> And I was like, I'm gonna learn R to do the last of my stats. And, um, I was learning from this textbook, and the first exercise was to read in the data file, right?

Took me 2 hours. I kept getting an

error. I did not understand this error.

I had a Windows machine and I was

learning from a book that was based on

Unix. The slashes were the wrong way. I

was crying. I was like, I'm so dumb. I

can't do this. But I persisted

and now I, you know, know how to code, I

guess.
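(For readers following along: the wrong-way-slashes problem in the story above is the classic Windows-vs-Unix path separator pitfall. A minimal sketch in Python, with a made-up file name, assuming the same kind of situation: a textbook written on Unix shows one slash style, your machine expects another.)

```python
# Windows-vs-Unix slash pitfall (hypothetical file name). Typing a
# Windows-style path with single backslashes into source code silently
# creates escape sequences instead of separators.
naive = "C:\data\results.csv"      # "\r" here is a carriage return, not a slash
escaped = "C:\\data\\results.csv"  # what you actually have to type on Windows
assert naive != escaped            # the naive string is already corrupted

# Portable fix: build paths with pathlib; forward slashes work on Windows too.
from pathlib import Path
p = Path("C:/data") / "results.csv"
print(p.name)  # results.csv
```

The point of the anecdote stands: without some framework for reading the error, a beginner has no way to guess that an invisible escape character is the culprit.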

>> Yeah. People need to hear the story, because especially these types of problems, with the wrong slashes or a different format, you run into them over and over again, and you blast through hours and hours and hours of, um, you know...

>> And here's the thing: I would have instinctively understood what was wrong later, but at that point I had no framework, because I'd just done some basic Python and then put it down for a year. Um, but you don't encounter these problems and solve them, by examining the error message and understanding the context of the code, if you're always getting the LLM to generate the code or solve the problem.

>> Yep. Including hallucinations. So, I was coding something a couple of weeks ago, and I was trying to be sneaky and asked ChatGPT, hey, does MySQL, the database, have that functionality? And it said, yes, sure it does, since version 7-something. Kind of tried it out. Didn't work, didn't work, didn't work. Then I read the official documentation, and it said, "No, never worked. The feature doesn't exist." So, I mean, it was just very confident in telling me the feature existed. It, unfortunately, never existed. Blasted through one or two hours figuring that out.

>> But if you didn't know to go to the documentation, you never would have solved that.

>> Yep.

>> Yep. Yeah. And here's something that I think needs to be said again: these models work well when they have a lot of training data. If you're dealing with newer frameworks, if you are dealing with languages that are newer, like Rust, they are not as good. They cannot be as good. So you need to be careful.

>> Yep.

Which brings us finally to AGI.

>> Oh, yeah.

>> Yeah.

Um, just briefly, by the way, because I also have the feeling that a ton of people understand different things by AGI, for whatever reason: what is artificial general intelligence to you? What does it mean?

>> Okay. Um...

>> And where are we? And are we going to be there in five years, as people online are going to tell us?

>> No. No. Um, this is the one five-year prediction I feel confident making. The others, no. But this one, I'm going to bet €20 on. €20, with one person. I'm not going to give out, like, huge amounts of €20. Not that rich. Um, if we have AGI in five years, I'm going to give you €20.

>> Cool.

>> All right.

>> Okay.

>> Yeah. Cuz I won't have a job, I guess.

>> Yeah. Yeah. Then we're... Yeah. Right.

>> You'll need it, because you won't have a job either. Yeah.

>> I'll save it.

>> We can meet back here and do podcasts on random topics.

>> Yeah. That's right. Right.

>> Um,

>> yeah.

>> So, let's go back to AGI. So, AGI is a very, very poorly defined term. Um, I can't really tell you what I think it is, but I can tell you what François Chollet thinks it is. So, François Chollet is a very well-known computer science researcher, an expert in AI. He was at Google for a long time; I think he was the head of their AI department. So, um, he wrote this beautiful, very dense, but very nice paper, actually in 2019, called "On the Measure of Intelligence." And basically what he points out is: hey, we have all these people running around talking about how we're going to achieve general intelligence in artificial systems, but basically none of them really have a background in psychology. So maybe we should think about psychology.

Um, obviously, having trained in psychology, I always want to put this caveat: it's a problematic field. Real problematic, in many ways. But there's at least a framework for thinking about what we're trying to get to. And kind of the core idea of intelligence in humans is this thing called g, general intelligence. And then what happens from general intelligence is it forms what's called crystallized intelligence. So you have this general ability to learn, to reason, to synthesize facts, things like that. This is what g allows you to do. And then you crystallize that into being able to do specific tasks, like taking exams, or driving cars, or cooking, or doing a podcast. Um, so...

What we're trying to get at, and this is what Chollet puts out in his paper, is we need some way of assessing the true generalizability of the problem-solving capabilities of models. Now, this assessment in and of itself is a very non-trivial task, because think about the core of what you're trying to do: trying to create one standardized measure that will assess, for all models, how different a task is from things the model has seen before. This is what Chollet calls the generalization difficulty. You don't know what these models have seen before. Like, this is a fundamental problem, because we don't know what they were trained on.

Um, and also, if you're going to make it a humanlike intelligence, because let's narrow it down, it needs to be representative of the whole scope of tasks you'd expect humans to be able to do. And how can you do this? Like, you can, but it's very

very difficult. So Chollet has a measure, which is called the Abstraction and Reasoning Corpus, ARC for short. They're working on the second version at the moment. It's a private test data set that reminds me, actually, of this intelligence test called Raven's Progressive Matrices. What it is, is you show a model these sort of three examples of patterns, like sort of problems, and you say, what should be the next one in the sequence? And you basically have to work out what the rule in the pattern is. So his argument is sort of that this does force a degree of generalizability.
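(To make the format concrete, here is a toy ARC-style task sketched in Python. The grids and the "mirror" rule are invented for illustration and are not from the actual ARC dataset: a solver sees a few input/output example pairs, must induce the hidden rule from them alone, and then apply it to an unseen input.)

```python
# Toy ARC-style task (made up): each training pair demonstrates the same
# hidden rule, here "mirror the grid left-to-right".
train = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 0, 0], [0, 4, 0]], [[0, 0, 3], [0, 4, 0]]),
]

def mirror(grid):
    """Candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]

# Check that the candidate rule explains every training pair...
assert all(mirror(inp) == out for inp, out in train)

# ...then generalize it to the unseen test input.
test_input = [[5, 6, 0]]
print(mirror(test_input))  # [[0, 6, 5]]
```

The hard part, of course, is the induction step: in real ARC tasks the rule is not given, and each task uses a different one, which is what forces some generalization rather than pattern-matching on memorized training data.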

I'm not entirely convinced by it, but I like the effort. Like, I think it's really cool that he's thinking about this, and he's not just relying on text-based problems. It's a real kind of attempt to force models through these multiple stages of symbolic reasoning. Um...

But, yeah, no one's really close to even defining it. And it gets really stupid when we start talking about things like ASI. I hate the term ASI, artificial superintelligence, because now what we're talking about is an intelligence that's beyond human. Okay, so what does it do? Like, what are the tasks that are not capable of being done by humans but are relevant to humans? Or are they maybe not relevant? Like, what are these tasks that this thing can do?

>> Are there any examples of such tasks? I mean...

>> Like, the best example I can think of is not even, like, something... Okay, so Chollet gives an example of something that's, like, beyond humanlike intelligence, and it's the behavior of an octopus when it needs to camouflage. It's very intelligent behavior,

>> right?

>> It's not something humans can do, and it's not really relevant to us and the tasks we do. So people are not even clear. They're like, does the task need to be relevant to us but just beyond our capabilities? But what are our capabilities? What's the limit? Anyway, this whole thing is a very stupid debate. Anytime someone says AGI is around the corner, I'm like, tell me: how do you know? How do you know? It makes me angry.

>> I love it. I love it.

Which brings us to something which might

not make you angry anymore.

>> Good.

>> Um, which is, I have a couple of questions prepared for you. Uh, I mean, on top of what I already asked you. No idea.

>> A couple of rapid-fire questions. Okay. Maybe, as a bit of background, in case anyone couldn't have told: you're from Australia originally.

>> Yes.

>> You live in Berlin. We are in Berlin. I'm German, right?

>> And, um, I thought...

>> Now I'm nervous.

>> No, no, no.

Um, I'm going to ask you a couple of

questions which have a reference to our

nationalities and LLMs.

>> Oh,

>> And let's see what's going to happen.

>> Okay.

>> Okay. Which is harder for an LLM to dec... which is harder for an LLM: decoding Aussie slang, or handling German compound nouns?

>> Right? And by the way, remember, we practiced beforehand.

Yeah, that's... Yeah. Mhm.

>> I don't know any Aussie slang terms, but maybe you have a...

>> Okay, so a drongo... a drongo is, like, an idiot.

>> A dag is someone who's, like, kind of nice, but, like, not very fashionable. Like, dads are dags, and something can be daggy as well.

>> All right.

>> Yeah. Um, let me think of some more.

>> Okay. Which one's more difficult, you think, for an LLM?

>> I'm going to say, based on the amount of training data, I think it's going to be better at German compound nouns.

>> If a model says, "No worries, mate," in response to a GDPR violation, would you laugh or panic?

>> Panic.

>> Panic.

>> Yeah. I do live here. I would be responsible for the GDPR violation. So that's... I've dealt with German bureaucracy.

>> Yeah. Right. It's... It's great fun.

>> It's a little scary.

Uh, if you wanted to spark great AI product ideas, would you go to a beer garden or to a beach barbecue?

>> Beer garden.

>> Beer garden.

>> Are you a fan of beach barbecues, though? I mean...

>> No. But let me tell you a fun fact about Australia: we have coin-operated barbecues in our parks. So, you go down... Yeah, you go to a park, and there are just these big, kind of solid barbecues, and everyone somehow cleans them, in an honor system, which is amazing, cuz we're not a very communal society. And you just put, like, a dollar into the barbecue, and it just works until it times out, and then you put another dollar in.

>> Really? So, like, a dollar for, whatever, 30 minutes of barbecue, and it just...

>> Yeah.

>> Amazing. Didn't know that.

>> Yeah.

>> Um, if GPT had a favorite national dish, would it be Schnitzel, or would it be meat pie?

I think it would be...

>> Käsespätzle would be... Yeah, that's actually my next question. If you would name a model after food, would it be Käsespätzle 2...

>> Or would it be Lamington GPT?

>> Lamington GPT.

>> You have to... because I didn't know what a Lamington was. You have to explain to people what a Lamington is.

>> Okay. So, a Lamington: you basically take white cake, what Americans call white cake. It's just like a sponge cake.

>> Like lemon cake, I mean?

>> Not a sponge cake. Like, it's just... it's just a normal...

>> Vanilla cake.

>> And then you cut it... you bake it in a sheet and you cut it into, like, rectangles, and you dip it in, like... it's like a chocolate sauce, but it's not made from chocolate. It's made from cocoa and water, or cocoa and milk. And then you dip it immediately in coconut. And then, for the best ones, you cut it in half and you put in jam and cream.

>> They're quite messy to eat, but they're delicious.

>> Mhm.

>> Yeah. But I think Käsespätzle 2... as much as I love Käsespätzle, it's got two umlauts in it. And it was very difficult for me to learn to say that word.

>> But you say it perfectly.

>> Because... because I eat it all the time. It's my favorite dish to order at a German restaurant.

>> It is really good. Do you just have them by themselves? I mean, just Käsespätzle, or Käsespätzle with something?

>> It always comes with the salad. With, like, the salad.

>> Yeah. But no meat or anything like that? No.

>> No. No. I'm vegetarian. Remember?

>> Oh, I forget. Yeah. Yeah. Yeah.

>> I'm stupid.

>> Yeah. But also, like...

>> So, by the way, asking you about Schnitzel... such a fun question. Okay.

>> No, no, no. But it's also because Schnitzel is, uh, Austrian, is it not?

>> True, it is. It is. But there's also Munich Schnitzel. I mean... I mean, I guess we Germanized it also. I mean, there are...

>> German Schnitzels, and...

>> I just did my citizenship test. That's how enculturated I am.

>> What was the hardest question on the citizenship test?

>> M, uh, I can't remember, but there were all these ones that were about, like, the particularities of who votes for whom... there's, like, the Bundesversammlung or something. There's, like, some board that gets together, this group that gets together to vote for who the president is.

>> right

>> It's Bundesversammlung?

>> Bundesversammlung, something.

>> Maybe. Bundes-something, maybe.

>> I don't know. But I do remember that Konrad Adenauer was the first chancellor of the Bundesrepublik Deutschland, which was founded in 1949.

>> Right. That's great. Do you know his nickname?

>> No.

>> I think he was called "the old one." I think... I think maybe I'm just misremembering. Sounds, like, Lovecraftian. Like, what the hell?

>> I'm just now hallucinating. I'm going to give you a confident answer: I think it was "the old one." Maybe I'm totally wrong. People can tell me in the comments.

>> Um, to finish off: your favorite model. Your favorite model, if you had any. If you could choose any of these models out there, do you have a favorite one, and why?

>> That's a good question. I think... okay, it's not a current model in use, but I have a real soft spot for GPT-3, because it was the first model I used where I was like, holy moly. Like, this was in the ancient days. Um, but it's just... it's kind of my favorite.

>> Mhm.

Thank you very much. I need to mention something, by the way. We do have a tiny giveaway, which means: write in the comments down below this video the craziest, funkiest model names you can think of. So you heard Käsespätzle 2, you heard Lamington GPT. It doesn't have to be about meat pies and Schnitzels, so it can be vegetarian, vegan, anything you like. Uh, let us know in the comments. We're going to raffle some coupons for our merch store, the JetBrains merch store, and licenses. Uh, and we'll get in touch with the most creative comments out there.

Thank you very much, Jodie. I learned a lot. Uh, it was a pleasure. And, um, I don't know... let's go eat some Käsespätzle.

>> Yeah. Oh no, we're in Berlin. Let's go eat some vegan Currywurst.

>> Vegan Currywurst it is. Yeah. Thank you.
