LLMs for Devs: Model Selection, Hallucinations, Agents, AGI – Jodie Burchell | The Marco Show
By IntelliJ IDEA, a JetBrains IDE
Summary
Key takeaways
- **LLM assessments are flawed**: Assessing LLM quality is difficult, even for specific tasks. Many benchmarks are flawed due to issues like data leakage or questions that don't make sense, making it hard to judge a model's true performance. [12:25], [15:04]
- **You don't always need the newest LLM**: Models that are a couple of years old can still be competitive, and older, cheaper models like GPT-3.5 can be sufficient for many tasks. [25:14], [25:53]
- **Hallucinations are an unsolved problem**: Hallucinations in LLMs, whether factual errors or faithfulness issues, remain a major problem. Techniques like RAG can help, but the core issue of models generating incorrect information is not fully solved. [26:36], [28:21]
- **Vibe coding has limitations**: While 'vibe coding' can be useful for prototyping, it's unlikely to lead to a successful startup without developer involvement. Ensuring security, performance, and maintainability requires expertise beyond what current AI agents can reliably provide. [32:25], [34:14]
- **Self-hosting offers privacy**: For companies concerned about data privacy, self-hosting LLMs is the only solution. This avoids sharing sensitive company data with third-party model providers, which can be crucial for compliance and security. [55:41], [58:17]
- **AI ethics and environmental costs**: The development of LLMs raises significant ethical concerns regarding data sourcing, copyright, and labor exploitation. Furthermore, the computational resources required for training and running these models have substantial environmental impacts. [59:06], [01:04:46]
Topics Covered
- Smaller Models Can Outperform Larger Ones
- LLM Assessment is Broken
- LLM Agents Struggle with Complex, Novel Tasks
- AI Ethics: Data Sourcing and Labor Exploitation
- AGI is Decades Away, Not Years
Full Transcript
This whole thing is a very stupid
debate. Anytime someone says it's around
the corner, AGI is around the corner
I'm like, tell me how how do you know?
It makes me angry. Here's the thing.
Assessment of LLMs in the context of
your application is pretty much exactly
the same as assessing how any
application works. You still need your
unit test. You still need to inspect
things like traces. You still need to
get humans in the loop to check things
out. You still need to do AB testing.
[Music]
So, welcome to the very first inaugural
episode of the Marco Show. I have a very
wonderful guest with me today. Who are
you and what do you do?
>> So, I'm your colleague.
>> Um, so I'm also a developer advocate.
>> You kind of look familiar.
>> Yeah, I would hope so, after three
years.
Um, so yes, I'm also a developer advocate
at JetBrains. My specialization is data
science. Uh, I've been a data scientist for
about 10 years now, which
>> Right. So you started out in data
science right after uni. That's
what you did.
>> Right out of uni, but in the sense that I
stayed at uni for a very long time. Um,
so I did a PhD in psychology and I also
stayed on for a postdoc in biostatistics.
>> Right. And then I fled academia and went
into the warm embrace of
>> data science.
>> Right. Okay. How are you involved with
LLMs?
>> So I basically do a lot of work, kind of
building projects in LLMs, doing
education about LLMs. Um I've done a lot
of talks about LLMs in the last three
years.
>> My particular stance is because I worked
in natural language processing for quite
a number of years in my data science
career. So basically I was I like to say
I was working on LLMs before they were
cool.
>> Um, I was working in NLP in the pre-LLM
era. So pre-2018.
>> right
>> And, you know, there are a lot of methods
in natural language processing that are
not LLM. We can also talk about them. Um
we started experimenting with the very
first LLMs uh the ones coming out of
Google like BERT right at the end of
that job. So yeah, basically I've sort
of seen their applications, seen sort of
how they can be used and I've also seen
a lot of claims about how they can be
applied which I'm not entirely convinced
about.
>> I love it. And, I mean, we talked before
the show and I was saying you're kind of
the antidote to the hypers out
there, just giving a bit more
neutral context on where LLMs are, what
they can do and how to choose them. And
that's the topic I'd love to get
into.
>> I think you have a lot of questions for
me. I do. So my issue is, and um, I'm
not making this up: I go to any
of our IDEs, for example.
>> I look at the AI chat window. I have a
selection of 20 different models to
choose from. And it's not just that we
have Google with the Gemini models. We
then have OpenAI with the GPT models.
And, by the way, sorry, Anthropic.
Antropic. I
>> Anthropic. Anthropic.
>> Anthropic.
>> You know, Anthropic.
>> Anthropic.
>> I know, it's the 'th' sound. Yeah.
>> Yes. Anthropic.
>> Yeah. Um, uh, with Claude, and then I have,
you know, these sub-models, the families,
and I was thinking, to me it feels like,
uh, in Europe, or at least in Germany, as
kids we had these card games which
are called car quartets, um, where
essentially you play cards, and
everyone has different car models,
and, you know, they have
benchmarks written on them, the top
speed, engine, uh, horsepower, whatever, and
then you just compare these
numbers and then take the cards from
the other players.
>> Mhm.
>> And to me, when looking at all these
models, I actually see, you know
context sizes. I actually see
throughput, random numbers, new versions
of models. How do I go about choosing or
even maybe just understanding what these
models can do and which one should I
choose essentially?
>> Yeah. So,
part of, I think, the
confusion about LLMs is they've sort of
been marketed as a one-size-fits-all for
all problems. And
even though these models are better at
generalizing with natural language tasks
than anything we've seen before, they
still have particular training sets
they still have particular um things
that they tend to excel at and things
that they tend to be weaker at. So
kind of maybe I can give like a brief
well it won't be that brief but a slight
a slightly brief history about how we
got here.
>> Sure. Yeah.
>> So, basically, we struggled for a long
time with how to actually process
natural language. So to define natural
language: it's what we're speaking right
now. It's any language that is not
designed. And it's really complex to
process because things like word order
or context things like this they matter
a lot. And actually keeping track of all
of that, keeping track of that even
within a sentence, but over multiple
sentences is really tricky. So before
LLMs, we had a type of deep learning
network called long short-term memory
networks. Really catchy name. So, things
are named very strangely in machine
learning.
>> Long short-term memory networks.
>> Yes. Okay. Yes. And the idea of these is
that they could keep track of words in
sentences. So basically the the network
would go sequentially over say a
sentence. It would learn something about
the first word. It would store that in
the short-term memory, sorry, in a long-term
memory, and then basically the short-term
memory would then go to the next word.
It would add extra information to the
long-term memory, and it would accumulate
information.
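The word-by-word accumulation described here can be sketched as a loop. This is a deliberately naive stand-in (a real LSTM keeps gated long- and short-term states learned by a network), meant only to show why the computation is serial and why early words get crowded out:

```python
# Toy illustration of sequential (RNN/LSTM-style) processing: one step per
# word, so nothing can be parallelised across the sentence, and memory has
# a limited capacity -- not a real LSTM, just the shape of the computation.

def process_sequentially(words: list[str], capacity: int = 20) -> list[str]:
    memory: list[str] = []
    for word in words:            # inherently serial: one step per word
        memory.append(word)       # "store something about this word"
        if len(memory) > capacity:
            memory.pop(0)         # the earliest information is lost first
    return memory

# With more than ~20 words, the start of the sentence has been forgotten.
state = process_sequentially([f"word{i}" for i in range(25)])
```

The `capacity` cutoff mirrors the roughly 20-word failure point mentioned here; the true limitation in LSTMs is gradual, not a hard cutoff.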
But there were many limitations with
these models. First, training things
sequentially and running things
sequentially is slow. But also, you know,
these long-term memories couldn't
actually hold that much information. So
they actually start failing after about
20 words.
>> Right?
>> Yeah. So, but they were the best we
had.
>> Mhm.
>> And then in 2017, a paper dropped called
Attention Is All You Need, and it put
forth the transformer network. And
transformer networks at their core they
have a mechanism called self attention
which is far more efficient at helping
um the model to understand which words
are important for gauging meaning within
a sentence.
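The weighting idea behind self-attention can be shown in a toy form: score how relevant each word is to a query word, then softmax the scores into weights that sum to 1. Real transformers use learned query/key/value projections over dense vectors; the tiny hand-made "embeddings" below are purely illustrative.

```python
# Toy self-attention weighting: dot-product scores followed by a softmax.
import math

embeddings = {"bank": [1.0, 0.0], "river": [0.9, 0.1], "money": [0.1, 0.9]}

def softmax(scores: list[float]) -> list[float]:
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query: str, keys: list[str]) -> list[float]:
    q = embeddings[query]
    scores = [sum(a * b for a, b in zip(q, embeddings[k])) for k in keys]
    return softmax(scores)

# In this toy setup, "bank" attends more strongly to "river" than to "money".
weights = attention_weights("bank", ["river", "money"])
```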
So one of the very first transformer models
was BERT. It was by Google, and it
actually wasn't a GPT model. What it was
actually trained to do was work out two
natural language tasks. It's what's
called an encoder model. It was designed
to try and predict what a missing word
was in a sentence and with two sentences
work out which order they're supposed to
go in. And by forcing the model to do
this, it basically meant that it had to
learn a lot about how the language
functioned. So you ended up with these
layers in the model which are now called
encoder vectors. They're used for a lot
of different things. And basically these
vectors um help you kind of
process text and then you can change the
downstream task. So you can change the
downstream task to say classification
sentiment analysis, things like that.
These models don't generate text.
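The masked-word training objective described here can be illustrated in a few lines: hide one word, and the hidden word becomes the target the model must recover. This is only a sketch of how one (input, target) pair could be built; real BERT pre-training masks random tokens, not a word at a fixed index.

```python
def mask_word(sentence: str, idx: int) -> tuple[str, str]:
    """Hide the word at position idx; the hidden word becomes the target
    the encoder model must learn to predict."""
    words = sentence.split()
    target = words[idx]
    words[idx] = "[MASK]"
    return " ".join(words), target

masked, target = mask_word("the cat sat on the mat", 2)
# masked == "the cat [MASK] on the mat", target == "sat"
```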
And the big problem with building
encoder models is, as you can imagine, for this
data set you have to design by hand what
the output is. So then came the idea: okay,
what if we do something like next-word
prediction as the task? So say we get a
sentence, we just split it at a point, and
we say: here's the input, the first part
of the sentence, and the output is you
have to correctly guess the next word,
and these were the generative
pre-trained transformers the GPT models.
So over time building these models, they
were able to scale much more readily
because the size of the data set can
grow a lot bigger because you're
essentially not having to do anything to
it other than some cleaning.
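The split-the-sentence idea can be sketched directly: every prefix of a sentence becomes an input and the following word becomes the target. (Real GPT training does this over token IDs at enormous scale; this toy version splits on whitespace.)

```python
def next_word_pairs(sentence: str) -> list[tuple[str, str]]:
    """Turn one sentence into (prefix, next-word) training pairs --
    no hand-labelling needed, which is why the data can scale."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

pairs = next_word_pairs("the cat sat on the mat")
# ("the", "cat"), ("the cat", "sat"), ..., ("the cat sat on the", "mat")
```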
And also they realized that the bigger
you make them, the better they are at
doing a whole range of natural language
tasks. So, by the time we hit GPT-3,
which was OpenAI's model, um you're
looking at a model that can do a lot of
things we're familiar with LLMs being
able to do now. They can um encode
parametric knowledge. They can learn
things from the data. They can do a lot
of sort of um basic reasoning tasks and
stuff like that. So where we got to is
an arms race. We're trying to make the
models bigger and bigger and bigger like
scaling them. And you'll hear this this
term like uh parameters. We're getting
into billions of parameters. It's just
the size of the neural network. They got
very expensive to train. They're very
expensive to run. And at the beginning
of this year, a Chinese company released
a model called DeepSeek that rocked
everyone to their core because they were
able to achieve pretty good performance
comparable with these massive models,
with a much, much smaller network,
much, much cheaper to run. So
>> Which brings me back to maybe car
quartets. So, from the outside, and I have
no clue about most of this stuff, could I
just have, you know, by default said,
well, a new version with so many more
parameters by default is just
better? Or is essentially what DeepSeek did
show that that's not the case? How does
that work out? I mean...
>> Yeah. So, in order to improve models,
there's sort of a few different things
you can do. So one is you can just make
it bigger. you feed in more data and you
make the model bigger. And this was the
most reliable way of achieving better
model performance. But neural networks,
you can sort of
customize them and design them in a lot
of clever ways that are better at
extracting meaning out of the data. And
so, um, basically,
it's that DeepSeek used a few tricks,
not just with model architecture but
with the way that they trained the model,
um, that sort of managed to extract a lot
more meaning with fewer parameters.
Because if you think about the way that
these models work, you're basically you
have an input. An input is a bunch of
words and each of those words will kind
of be processed along a path as it goes
through the network and at the end it
will come to one conclusion. What's the
next word? But most of those paths
aren't used in a single prediction. So
being able to sort of trim the networks
down to what are actually the most
useful paths, either at
the beginning of the training process
itself or afterwards, through a process called
distillation, basically
allows you to build more compact models.
They're just better at sort of
extracting information efficiently.
>> And what's the state of DeepSeek now,
today? I mean, do you know if
it can compete with all the
other, you know, fancy models out there?
>> It's interesting. So, in terms of
competing, it's also in the open-source
realm. So, we can kind of talk about
proprietary versus open source. Um, but
in terms of model performance, yes, it
does seem to be holding its own, not
with the big big like really huge
models, but at least um they have a
reasoning model which holds its own
against some of the state-of-the-art
reasoning models. And their generalist
model also performs quite well. So this
is why everyone was so shocked because
it's not like it was, you know, half as
good or a quarter as good. It was almost
getting into the realm of as good.
>> Mhm. With the state-of-the-art reasoning
models though, would I then just by
default assume
having no other information as an end
user that still context size and
parameters and new version automatically
means better?
>> Okay, so not necessarily. And defining
better, oh my god, I have a whole talk
about this. So, um, assessment is one of
the most neglected and most difficult
areas in natural language processing. So
assessing whether you have a good model
is really hard, even when it's for a
specific task, unless that task is
something discrete, like you're
trying to predict something like an
emotion.
>> Right, and we're just talking assessing
the quality of the output, essentially.
Right, and this is my chance to say:
>> What we've forgotten with LLM hype is
we're still talking about machine
learning. So machine learning is not just
models. Machine learning isn't decision
trees or boosting models or whatever.
Machine learning is an approach to
working with data that is structured and
gives you kind of ideas about how to
assess model quality. So the goal is to
create a model that works in production
as well as it does in a development
environment. So machine learning gives
us the tools to do that. And people have
just gone, oh, we have magic models now
that can do everything. So we don't need
to think about basic machine learning
practices. No, no, we do. And it's
coming back to bite us in the ass
because we haven't thought this stuff
through. So what kind of happened at
the beginning of all NLP assessment, but
also LLM assessment, is they were
designed around natural language tasks.
So we have a bunch of natural language
benchmarks. Not going to say they're
perfect, but it's like okay can this
model do classification well? Can it do
um question answering? Can it do
summarization?
>> By the way, just on classification, for
anyone who's... I mean, you just have, for
example, some sort of text and you say,
hey... um, do you have an example of, uh,
sorry, do you have a classification
example?
>> Yeah. So, um, I recently released, not
recently, it was 6 months ago, but I
released a tutorial where we were trying
to classify books into fiction or
non-fiction. This would be a
classification task. Um so based on the
description is the text describing this?
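A hedged sketch of what that fiction/non-fiction task can look like when framed as an LLM prompt, plus parsing the free-text reply back into a label. The prompt wording and label set here are illustrative, not taken from the tutorial being discussed:

```python
# Check "non-fiction" first: "fiction" is a substring of it, so ordering matters.
LABELS = ("non-fiction", "fiction")

def build_prompt(description: str) -> str:
    return (
        "Classify the book described below as fiction or non-fiction. "
        "Reply with exactly one word.\n\n"
        f"Description: {description}\nLabel:"
    )

def parse_label(reply: str) -> str:
    """Map the model's free-text reply onto one of the known labels."""
    reply = reply.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    raise ValueError(f"unrecognised label: {reply!r}")
```

The `parse_label` step is where classification-style evaluation becomes possible: once replies are mapped to a fixed label set, ordinary accuracy metrics apply.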
Then people started noticing, oh these
models seem to be able to do other
things. They can reason and they can
with the parametric knowledge they can
answer questions on exams. So they seem
to know things about history and math
and science. And then uh we started
making claims about AGI.
>> You've gone there early. Actually, I wanted
to ask you about that later on. Yeah, we can come
back. Yeah, this is a biggie. But
basically, people have started creating
or reusing assessment techniques to see
how well LLMs do this stuff.
>> Mh.
>> Now,
so many problems with these assessments.
Firstly, some of them were just
created using things like Amazon's
Mechanical Turk by people who have no
investment in making this good. It
may also be that English is not their
native language,
>> and some of these tools are filled with
gibberish. I am not joking. Like,
questions that do not make grammatical
sense, and they don't actually have a
correct answer if you as a human are
trying to assess them.
>> So this was sort of the first generation
of assessments. The second generation
they've tried to clean things up a bit.
Um, we also have problems with things
like, uh, there's a phenomenon in machine
learning called data leakage. It's
basically where a model performs very
well because it's actually kind of seen
the answers: you've
actually shown it the things that are in
the data set you want to test on when
you trained it. So it's kind of
memorized the answers, like a student
cheating on a test,
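The crudest possible leakage check looks for benchmark items verbatim in the training corpus. Real contamination audits rely on n-gram overlap and fuzzy matching, since training text rarely matches test items exactly, but a sketch of the idea:

```python
# Flag benchmark items that appear verbatim in the training corpus.
def leaked_items(benchmark: list[str], training_corpus: str) -> list[str]:
    corpus = training_corpus.lower()
    return [item for item in benchmark if item.lower() in corpus]

corpus = "...the capital of France is Paris, as every schoolchild knows..."
benchmark = [
    "the capital of France is Paris",
    "the capital of Peru is Lima",
]
leaks = leaked_items(benchmark, corpus)  # only the first item is flagged
```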
>> And, um, there's pretty good evidence that
there's a lot of data leakage of
assessments into model training data
sets, and this continues to be an ongoing
problem. So now they're trying to make
these assessment sets private. But
beyond all these problems
the fundamental thing is that you are
trying to sell a model as good. Okay.
But you, Marco, you come in and you want
to use a model for something specific.
You will want to do a task. You might
want to code.
>> You might want to
>> transcribe some audio, for example.
>> Some audio,
>> generate images, I don't know.
>> Yeah, yeah, exactly. Exactly. Or,
um, LLMs can't do that, by the way. But
yeah,
just as an aside, but related machine
learning models that have come out in
the last 5 years have also had advances.
>> But, um, basically, maybe you want to,
I don't know, you want to research
something, so you put in a paper as part
of the context and you want it to answer
questions. You know, knowing that a model
is 'good' is not that helpful if it's
scoring on all these random benchmarks
that assess a whole bunch of things,
>> and you want to know how good it is at
completing code.
>> So,
that was a very long way to say
assessment is... and it's just...
>> So, I was actually thinking about that.
So as an end user again who knows
nothing.
>> Mhm.
>> Would I just take a random task, a
coding task, send it to, I don't know,
three different models,
>> Mhm.
>> then just, you know, do two or three
examples and then take whatever I think
gave me the best response, and I'm just
going to stick with that model, and that's
it? I mean, what can I do
essentially as an end user?
>> I think basically a lot of it seems to
be based on vibes at the moment, right?
Like, my gut feeling is that
it's better at that. There are specific
coding benchmarks. So say you want to
use a model
for coding: you can go look up that
benchmark and see how well different
models have scored on it. Would it tell
me, if I go to a benchmark and it
says, "Well, that model got 42 points.
The other one just got 39 points," does
that tell me, "Hey, great, three
more points means, I don't know,
that much better of a model"?
>> And this is the problem with
assessments, right? You need to
understand: what population are they
assessing? Is your target language even
represented? Are the tasks you're doing
actually represented? And how are they
assessing 'good'? Is it, does the
code compile? Is it the number of errors?
Is it the maintainability of that code?
So, this is the problem. I don't think
people are looking at stuff in this
depth. It's just you want one number to
boil it down to because everyone's
confused. But unfortunately, I'm not
here to tell
you there's a magic bullet. There
isn't. Um, I would say that, like,
generally the models do seem
to do okay in coding tasks for languages
for which there's a lot of training
data.
Can I even tell somehow
what a model was trained on? When I
now look at these big models, I just
assume, as you said, a general-purpose
model which is somehow trained on
everything, and then someone told me in
some, I don't know, Firebase YouTube video,
oh, by the way, Claude was trained more
specifically on code, and that's why I'm
using it for coding. Do I have any
idea what these models were trained
on?
>> This is also a major problem. So, with
the early culture around
LLMs, they were all open source, and so
the data sets that they were trained
on were released, and anyone could go and
look at them. But one of the competitive
advantages, even for open-source models
now, has become the data, because it's one
of the few differentiating factors
that's left, and so most of these
model creators will not tell you what
data the model was trained on. So
you might, say, in the paper that they
release or in the marketing material,
get a general overview. And this may be
where people are getting these numbers
from,
>> but there's also just a lot of rumors that
fly around. So people might just be
making stuff up. Maybe they asked ChatGPT
and it told them and it
hallucinated. Who knows? But, um, the best
you're going to get is like what the
model creator will release to you and
they will not release the data generally
anymore. So you yourself can't go check.
There is also the fact that these data
sets are massive. So searching them
efficiently is really hard. But
>> You know, the interesting thing is, um,
with the marketing material you
just mentioned: when I go online, then
again I see 'massive update, changes
everything, and now the model has much higher
throughput'. Um, I've asked you before, I
think, about thinking models, non-thinking
models, what these actually are, and
then you see all these updates. Now you
have adaptive thinking and all these
words: elaborate thinking, stupid thinking,
slow thinking, fast thinking. I'm just
reading through all that and thinking,
what the hell is going on? What
should I do with all of this?
>> Yeah. Yeah. Yeah. And I...
>> And by the way, could you just briefly
explain what thinking versus
non-thinking is, if you don't mind?
>> Yeah. Yeah. So, we've kind of talked
about reasoning in the context of LLMs.
So, LLMs can't actually reason.
They can kind of emulate and pattern
match and do something that looks a bit
like reasoning. I just want to clarify:
they don't actually reason in the way
that humans do. But what people
discovered is that models are
particularly bad if you just straight up
ask them a question, like, um, let's say,
a math question. This is always
the classic: um, Jenny has four
balls and Sam has eight balls. How many
balls do they have together? Right? And
usually it'll just randomly come up with
a number. But you can do particular ways
of prompting these models. Uh, one is
called chain of thought, where you can
show it how to break down these
problems. So if I had this problem, then
I would need to say, first, Jenny has four
balls. Then I would need to identify that
Sam, I can't remember what name I came up
with, has eight balls. I need to add them
together, and that gives me the final
answer of 12. Now I give you another
problem. So
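A chain-of-thought prompt like the one walked through here can be built by prepending a worked example, so the model imitates the step-by-step decomposition on the new question. The exact wording below is an illustrative sketch, not a canonical template:

```python
def build_cot_prompt(question: str) -> str:
    """Prepend a worked example so the model copies the reasoning style."""
    worked_example = (
        "Q: Jenny has 4 balls and Sam has 8 balls. "
        "How many balls do they have together?\n"
        "A: Jenny has 4 balls. Sam has 8 balls. 4 + 8 = 12. The answer is 12.\n\n"
    )
    # the trailing cue nudges the model into the same decomposition pattern
    return worked_example + f"Q: {question}\nA: Let's think step by step."

prompt = build_cot_prompt("Ada has 3 apples and buys 5 more. How many now?")
```

The string would then be sent to whatever model client you use; no specific API is assumed here.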
>> This problem decomposition seems to work
a bit better. Uh, a recent paper came
out, I think it was from Apple, Apple
doing amazing research, by the way, around
the limitations of these models.
Like, really interesting stuff. But this
Apple paper seemed to show that, even
with these sorts of ways of
prompting models, and with thinking models
altogether, their ability to do problem
decomposition just exhausts
at a point. You can't work around this
limitation.
So yeah, thinking models are either
prompted in such a way that they're told
to think more. They're given examples
through um chain of thought prompting.
They may be fine-tuned in such a way
that they are shown a lot of such
problems. But the thing is that
thinking models usually are a lot slower,
because internally they will be looking
for multiple possible solutions to a
problem. And this gets more complex when
you then start building sort of like
thinking agents. Um we can come back to
agents as well. But
Generally, then, people are like: okay,
it's good if they think, but sometimes we
don't need them to think all the time.
And so these adaptive thinking models
are basically okay we've trained the
model in such a way or built an
application agent application in such a
way that the model doesn't always
default to this deep thinking process.
It may sometimes think when it needs to
and when it doesn't it just is cheaper
and faster.
>> Mhm.
>> So
>> It doesn't make the whole topic any
easier, I feel.
>> The topic's not easy,
>> but it's being stuffed down your throat
as being really easy. It literally is
like: a couple of numbers, choose the latest
model. And then, for example, when I use
some of the, um, competitors for fun,
you basically go back, for
coding itself, and you fall back to trial
mode. It basically brings you to a
cheaper model, or to, you know,
version one to go, and you're thinking, oh,
does it now mean I'm using a totally
dumb model that can't do anything anymore?
Do I need to spend the money on, you
know, premium tokens, and otherwise I
will just get crap answers? Uh, it's all
stuff which is very confusing to someone
who just wants to code and just wants to
get something done.
>> Actually, that reminded me: I did a
talk last year for GOTO, and I
did a little demo at the end on, um,
a really simple RAG application. So
basically you have a PDF, you get it
broken down into vectors using
these encoder models, and then, um, you can
basically search through it for the most
semantically similar vector to a query.
So you basically search through it, and I
used GPT-3.5. I really like this model
because, to be honest, it's still a very
powerful model and it is cheap as chips
to use. And I got a comment on the video
saying, "It's very nice, but she's using
a very out-of-date model." And I'm like,
"It's fine for what I need it to do."
Like, you don't always need to use the
most state-of-the-art model.
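The retrieval step of that kind of RAG demo can be sketched end-to-end. A real system would embed chunks with an encoder model (e.g. a sentence transformer) and send the retrieved context to a model like GPT-3.5; here a bag-of-words vector and cosine similarity stand in so the sketch stays self-contained:

```python
# Toy RAG retrieval: embed chunks, find the one most similar to the query,
# and splice it into the prompt as context for the LLM.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # stand-in for a real encoder model: a bag-of-words count vector
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str]) -> str:
    return max(chunks, key=lambda c: cosine(embed(query), embed(c)))

chunks = [
    "BERT is an encoder model released by Google.",
    "GPT models are trained on next-word prediction.",
]
context = retrieve("who released BERT?", chunks)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: who released BERT?"
```

The "use this preferentially over your parametric knowledge" instruction mentioned later in the conversation is exactly what the "using only this context" line in the prompt is doing.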
>> But that's because people really can't
judge what they're actually
using.
>> Yes. And, like, in all fairness, in
software, normally using the latest is
best. It's more secure.
If you're using the latest
hardware, you usually get more bang for
less money. Um, but it's not quite the
way things work anymore with LLMs
because training these models is so
freaking expensive, so slow and we have
plateaued, honestly, in terms of the
advancements we're seeing. So models
that are two years old are still
competitive, I would say, ish, with models
that were released this year.
>> How do I, and I've had this issue
myself, how do I actually handle
hallucinations? And not just the
hallucinations, but the
confidence that comes with them. You
get very confident answers. Um, you have
to really understand the topic to
figure out, hey, the LLM is lying to me,
kind of.
>> Mhm.
>> And then also the topic of, I feel, maybe
with more and more AI-generated content
and the training you just mentioned, I
mean, it's kind of a self-reinforcing
loop, that the answers get worse and
worse.
>> What's your take on that? I mean, on
that whole topic.
>> Wow. Big topic. So, basically,
hallucinations come because models are
trained on a limited data set, right? So
you have two types of hallucinations.
You have factuality hallucinations where
the model has actually just learned
something completely incorrect. Maybe it
got trained on some conspiracy theory
blogs, which does happen by the way. So
they've internalized knowledge that is
false. And then you have, uh, faithfulness
hallucinations. And this is where, say,
you have some sort of reference,
but it can't actually, say, summarize it
properly or answer questions from it
properly. So it can refer to
something, but it can't actually
reproduce the information it sees in a
way that's correct. And both of them are
major problems. Um, the way that we've
tried to deal with say
factuality hallucinations
is trying to give the LLM access to
more up-to-date and correct information.
So, one of the first ways to do this was
RAG, retrieval-augmented generation. That
was the hotness last year. And
generally, like there's nothing fancy
about it. Generally the idea is you just
somehow give the LLM access to an
external knowledge base and you say
please use this preferentially over your
parametric knowledge. But again problem
is sometimes you don't want it to do
that. Sometimes the parametric knowledge
would be okay. So there's also clever
ways to do like rag that is in a way
that is also adaptive. Um the problem is
though if you have a model that has high
faithfulness um hallucination rates even
if it has access to the correct
information, it won't be able to use it
properly. And so, uh, yeah,
I don't really know what to tell you
about
treating hallucinations. It seems to be
an unsolved problem in many ways. Do you
find yourself googling more, or also
asking, for example, ChatGPT instead for
the right answer to any
of your stuff you might have googled
like a year ago? It depends. Um, if it's
something... okay, I will confess a way that
I use ChatGPT.
So ChatGPT, by the way, I should also
explain: ChatGPT, uh, Claude, all these,
these are not models anymore. They are
actually agents, so they're a whole
application that is built around an LLM,
in which the LLM acts as the reasoning
engine, and its reasoning, it comes up again,
reasoning engine, and it can basically
access a bunch of tools, say search
engines, image generators.
Um, it can access documents that you
upload, etc., etc. So basically, um, I like
to take advantage of that and I won't
necessarily ask it stuff that I could
Google but say I need to work with a
bunch of say I want to work with a new
research paper and I've read through it
but I just want to ask a few more
questions to clarify things maybe get it
to cross-link with other things, I would
upload that to ChatGPT and then
chat with the PDF. This is what RAG was
always advertised as at the beginning.
Um, I think this is neat. I wish I had
this during my PhD. It is super.
Sometimes it still hallucinates, but you
can still then go to the source and I
always ask it for sources and then
sometimes it will hallucinate the
source, but then I'll go and check
>> Which means you're, in that way, kind of...
So, you get an answer and you basically
in your mind think it's 80% correct and
you always have the feeling of I need to
double check this. I mean
>> Yes. Yeah. Yeah. I never, ever, ever use
ChatGPT, or any other LLM, in a
way where I cannot verify the answer,
because you just cannot trust it. And
it's exactly the same as you and I
remember early search engines. I'm sorry
to cast such an aspersion on you, but
it's true. It's true. Um the early
internet was wild, right? like the early
search engines didn't index things based
on quality or like even like cross
linking and things like that and so the
quality varied and I think a lot of
millennials have that experience of okay
like
you kind of get a gut feeling for when
something's a crap source, like it's just
someone's personal blog or it's some
real dodgy-looking newspaper
or something. Um, and then if it's, say, CNN,
you're like, obviously they have their own bias, but
it's, it's CNN, right? It's probably
going to be checked.
>> So in that way, um, I still prefer to
Google, because I feel like I can look at
the source. I can get the context. If
ChatGPT wants to give me something, I
want to know where it got that
information from.
>> I think it's very tricky. I think,
combined with the confidence, where you
always get, like, certainly, here's your
answer — and then you notice, ooh, right. I
mean, it's literally wrong
more often than I was actually expecting.
And when it comes to coding, it's simple
because when you're a senior coder, you
can tell where, you know, it went off the
rails and whatever. But I think
especially when people use it for
medical checkups and everything like
it's it's um
>> Please do not use it for medical
checkups. Please do not do this.
>> Yeah. But I think — I have the
feeling that's a general trend, where
it's going down. I mean, what's going to —
>> Yeah. And like, my niece is a Zoomer, and
she preferentially uses ChatGPT to
check things. Like, she's clever — I think
she uses her common sense. Right? But —
>> Double check. Are you on TikTok,
Instagram, any of these new
millennial, Gen Z —
>> I am on — I am on Instagram. I've been a
longtime user. Yes.
>> Right. Yeah. Yeah. Let's see. Let's see
what the, uh, the generation after us
is going to do. What I've
been wondering about — and, to be honest,
I'm hearing mixed signals — so, uh,
I know that vibe coding was all the rage
like two months ago. Now, when
talking to people, I'm getting the
feeling of, yeah, we're almost past that stage,
kind of. And having talked about all
these issues with hallucinations and
confidence and assessments and whatever
do you really think you can vibe code
yourself to
become a startup billionaire?
Okay, so context on vibe coding as a
term. So Andrej Karpathy — absolute legend
in the LLM space — he talked about
this term in, what, like January or
something. Two months later, there were
two books released by major publishers
about how to do vibe coding. Like, the
pipeline from this term being
invented to mean "this is a cool way to
prototype projects" to "this is literally
how you can become a developer without
knowing how to code" was 86 days, I
believe. So this is, I think, the poster
child of what's happening in
this space. Um —
look it's a little hard for me to say
because I'm not a developer. I want to
put the caveat in. I know there's things
like security and maintainability and
latency that are concerns that
developers need to worry about. Me as a
data scientist in my pure beautiful
research world, I don't care about such
things. But it comes back to the fact
that I have not convincingly seen an
agent being able to come up with an app
that you, as someone who does not
know how to code in that language,
can ensure is secure, can ensure is
performant. Now, for prototyping —
question marks. I don't think you can
become a billionaire off the back of
that first application. You're going to
have to get developer in at some point.
Maybe for just a proof of concept that
works with like small amounts of data or
like works as a cute form on a website
or whatever. Sure, maybe. But as a
developer, maybe you have a better —
>> I do share the same sentiment. I think —
I mean, there's the Twitter bubble,
which is insane, by the way — or the X
bubble, or whatever — where people just vibe
code left and right, and you see
whatever they're selling, or not selling,
or just faking. Uh —
And I think that, for prototyping, yes — I think
if you know what you're doing, and if
you're kind of senior-ish, it is superb,
kind of, because you figure out when
it goes off the rails and you can, you
know, put it back. But as
something for someone who doesn't know
how to code at all — I find
it a super tough sell. I mean, we're
basically not there yet, I
think.
>> I think a good example of this is — so, as
a data scientist, I don't really know any
programming languages confidently
outside of Python or SQL. SQL is a
language. I'm going to defend that.
>> But, um
>> You're right. Yeah. Yeah.
>> I love SQL. I'm a SQL girl. Um, so we
were putting together some demos for a
video. Um, it was basically for our code
assistant, Junie, and you helped me with
this with the Java ones. And yeah
Nicholas, another of our co-workers, he
helped me with some prompts for
TypeScript.
So he's like, "Caveat, I haven't tested
these. See how you go." And I tried one
of the suggestions, which was to create
a Pokédex application, keeping track of
all your Pokémon, right?
It didn't work after multiple tries, and
when I tried to debug it, even with the
refactoring capabilities within WebStorm,
I just didn't understand why it was
breaking. There was just too much code I
didn't understand, because I don't know
TypeScript. And for me this
wasn't a surprise — I
knew there was going to be a point at
which the agent failed, um, because it was
beyond the complexity of what it could
handle. But —
>> It is exactly that spot where you
get something up and running quickly, and
then you might have to just change one
line, or a couple of lines, but you don't
know what these lines actually are.
>> You don't know where the problem lies, and
then you spend — I found — one, two, three,
four, five hours fixing this, or going in
endless loops with the LLM, saying,
please fix it for me, fix it for me, fix
it for me. It doesn't fix it for you, and
then you just do it 10 times, and you
literally feel stupid,
because you say, you know, please, now
please really fix it — and the LLM comes
back to you and says, "By the way, now I really
fixed it for you. Thank you. That was a
great question."
>> And, uh, yeah, so I've been stuck in
those kind of loops.
>> Yeah. And like a counterexample, where I
knew what I was doing: when I was first
playing with Junie, I
downloaded a classification data
set and I said, create me a deep learning
model to classify this. And I
told it what the labels are and blah,
blah, blah. And it actually did a pretty
credible job, because that workflow is
pretty standardized. Um, but I could see
problems instantly, and I knew what to
fix. It was a completely different
feeling from me in the middle of all
this TypeScript, going, "Oh my god,
okay, this one didn't work."
>> Just for people to understand — and also
for developers, actually — I think
"agentic" can mean so many things, kind of.
What is your definition, if any,
of what an agent really is?
Because, um, when I talk with different
people, they all give me different
answers.
>> So basically, what, uh, an agent is, is kind
of what I described with, say, ChatGPT and
Anthropic, right? But they can be much,
much more basic — those are obviously
super, super sophisticated. So with an
agent, what you do is you take an LLM and
you give it access to tools, and those
tools can be anything. The very first
one, actually, I remember was when ChatGPT
introduced function calling. Do
you remember that? It was a while
ago. Um, but it was generally the idea
that you could write Python functions to
do API calls and things like that. The
tools have obviously gotten a bit more
sophisticated since then. Um —
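The tool-calling loop described here — an LLM choosing among registered functions based on their descriptions, the host executing the chosen one — can be sketched as a toy. Everything below is hypothetical: the tool names are invented, and a keyword-matching function stands in for the model's selection step.

```python
# Toy sketch of the tool-calling loop (not a real LLM: the "model"
# here just picks a tool by keyword match on the prompt).
def get_weather(city: str) -> str:
    """Hypothetical tool: pretend weather API call."""
    return f"Sunny in {city}"

def search_images(query: str) -> str:
    """Hypothetical tool: pretend image search."""
    return f"3 images found for '{query}'"

# Tools are registered with natural-language descriptions; a real LLM
# chooses among them based on these descriptions.
TOOLS = {
    "get_weather": (get_weather, "Look up the current weather for a city"),
    "search_images": (search_images, "Find images matching a text query"),
}

def fake_model_pick(prompt: str) -> str:
    # Stand-in for the LLM's tool-selection step.
    return "get_weather" if "weather" in prompt else "search_images"

def run_agent(prompt: str, arg: str) -> str:
    tool_name = fake_model_pick(prompt)   # model decides which tool
    tool_fn, _description = TOOLS[tool_name]
    result = tool_fn(arg)                 # host executes the tool
    return f"[{tool_name}] {result}"      # result goes back to the model

print(run_agent("what's the weather like?", "Berlin"))
# → [get_weather] Sunny in Berlin
```

In a real system the selection step is a model call that returns a structured tool invocation, but the loop — describe tools, let the model pick, execute, feed the result back — is the same shape.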
>> Now kind of standardized with MCP stuff.
>> Exactly. Exactly. And MCP, uh — just to
tell people in the audience who
don't know, it's basically a protocol to
standardize communication between LLMs
and tool servers. So say you've got
an API that gives you weather updates —
that's obviously going to have a
specific way that it wants to receive
messages. The LLM may send messages in a
different way. Then you have this
basically M×N complexity of the number
of ways you can connect tools and
models. And by the way, shameless plug:
I made a video about that on my channel
Marco Codes. Go watch it.
>> Marco does excellent videos, by the way.
They are very entertaining. Yes.
>> So, yeah, basically, as you said, MCP is a
way of, um, basically standardizing this
messaging and the connection system.
>> So,
this sort of evolved like the need for
MCP evolved because it's obvious how
beneficial it is to use tools with LLMs.
So LLM, like I said, they can do basic
reasoning. If you put into an LLM
something like, "Hey, I want a picture."
Hey, this is me and ChatGPT hanging
out. Hey, hey girl, how you doing? I
need a picture of a cat.
So if you tell, um, ChatGPT, hey, you've got
access to DALL·E, an image generation
model, or you've got access to an image
search, it'll be like, okay, I'm going
to decide which of those tools I think
is most appropriate based on the
description of the tools you've given
me. And I'm then going to go and use it.
I'm going to connect in some way, maybe
through MCP, maybe through something
else. And then I'm going to get the
result and I'm going to serve it to you.
How would I — just again, for someone to
understand — if I gave it three relatively
similar descriptions of tools, how
could I be confident in its
decision of what tool it chose, kind of?
>> Yeah. So if you just give it
the tools and the descriptions, you can't
necessarily — like, it's a bit more free-form.
But there are frameworks where you
can be a bit more directive. And, you
know, from the start you'll probably
design your application so that you
would choose tools for specific reasons —
so you don't just give it access to a
smorgasbord of tools, you give it access
to the specific tools you choose. But yes,
there are also ways, through frameworks
like LlamaIndex or LangGraph, which give
you a lot more control over how the LLM
selects tools, under which circumstances,
for which sort of context. Whereas
something like, say, smolagents, which is
the Hugging Face, um, way of accessing
tooling — it's a lot more free. So the LLM
will basically be like, I'll make up my own
mind based on what you give me.
>> Mhm. So —
>> And then agents, essentially, also — uh, when
you said it's an umbrella, um — I know
we're building stuff where we have agents, sub-agents,
handing them specific subtasks,
also again branching out, so a task is
split up maybe into even different tasks,
and not just different tools. But, I mean —
>> So the — yeah, these are called multi-agent
applications. The reason you might
want this is for any reason that you
might want to break a program down
as it increases in complexity. So, it
might be there's multiple subprocesses
but an additional reason you want to do
this with LLMs is because the way that
they're doing all of these tasks is they
have these chat templates. That's how
they do all of these like interactions.
But the chat templates are really just
getting appended to each other and
passed in each time you generate a new
token. So, if you're trying to do the
whole thing in one agent, it can get very
expensive, but it can also start failing —
depending on the model and how big its
context window is — because you've
exceeded the context window. Or it can
actually get a little bit confused,
because there's a lot of
information in there and it can't parse it
correctly anymore.
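The context-growth problem just described — every turn appended to the history, and the whole history re-sent on each model call — can be sketched with a toy token counter. The tokenizer and the step contents are stand-ins, not a real agent framework.

```python
# Sketch of why long agent runs get expensive: every turn is appended
# to the conversation and the WHOLE history is re-sent each step.
history = []
total_tokens_processed = 0

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def agent_step(message: str) -> int:
    """Append one turn; return the context size sent on this step."""
    global total_tokens_processed
    history.append(message)
    context_size = sum(count_tokens(m) for m in history)
    total_tokens_processed += context_size
    return context_size

for i in range(5):
    size = agent_step(f"step {i}: tool call and result " + "word " * 20)
    print(f"turn {i}: context = {size} tokens")

# Cost grows roughly quadratically with the number of steps, and
# eventually context_size exceeds the model's context window.
print("total tokens processed:", total_tokens_processed)
```

Splitting work across sub-agents, each with its own short history, is one way to keep each individual context small.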
>> I don't know what that leaves me with
now.
AGI
>> AGI. AGI.
We're not quite there yet at AGI. But
what I'm trying to
figure out is how to put it all together, because at
the end of the day, still,
um, what it looks like to me is —
so, you just told me everyone has
issues assessing the output, essentially.
I mean, apart from the marketing
material, which gives you different
responses.
Which leads me to think, yeah, I will
just prompt my couple of favorite LLMs
to give me some answers, and I just
choose the one which I think is the best,
and then I live happily ever
after.
>> Okay, so there are a few rules of thumb.
Basically, models need to be over a
certain number of billion parameters,
generally, to handle certain tasks —
>> Right? Which I guess means the
reasonable ones are, I mean, essentially
the big ones. Okay.
>> Yeah. And also, if you want to do more
complex tasks — so say you're building a
multi-agent application — you may want a
reasoning agent as the main controller
of everything, because they can parse more
complicated instructions.
Um, we also haven't talked about uh
self-hosting versus
>> It's actually a topic I wanted to get
into just right now. Yeah.
>> Yeah. I think this is a consideration
when choosing models, because, okay, I
talked about the fact these models are
incredibly expensive to run. That's
because, let's say you're talking about a
model of, like, a few hundred billion
parameters —
that model object might be, like, 40 GB, it
might be more. You then need to upload
that to a GPU server, so it needs to be
held in GPU memory.
and basically
Then you need to run inference through
it. And remember, we're talking about
autoregressive models. So what that
means is that you pass in your initial
prompt, it generates a token, it appends
that token to the end, then it passes it
through again. So it's doing many, many
runs — sometimes hundreds of runs,
depending on how long your output is —
per, um, per time that you prompt it. So yeah,
this is expensive and to do that in real
time requires really beefy machines and
they're expensive.
>> The thing is, how would I even, as a
company — how would I go about it? I mean,
not just in terms of, okay, I need a
hardware setup, which is probably going
to be expensive.
>> Then I need some machine learning
engineers. I mean, would I just
take some random open-source model, shove
some data into it? Which type of data?
How would I even train it? How would I go
about training it?
>> How would I then maintain it? I mean,
could I just train it once, or do I have
to retrain it, like, every four weeks or
every week?
>> How would I do all of that? I mean, how
feasible is it, as any sort of company
except the really big ones, to actually
do that on a consistent basis?
>> So, at the moment, no one is really
training their own models, and there's
kind of no need for it. So, a while ago
the term "foundational models" was coined.
I'm not sure how much I like this term,
but the idea is that there are huge, um,
LLMs, many of them open source, and the
open-source ones, you can just use them
how you like. So, it might be that the
LLM out of the box is good for you. We
should talk about tuning. Actually,
let's talk about tuning here. So,
also kind of to give a bit of context
about what an LLM is, because you see
all these like instruct, coder, like all
these subtypes, and you're like, what
what's going on here? Um, basically, we
talked about how the GPT models were
trained. Next-word prediction. And a
raw GPT model, that's what it will do.
It will just keep predicting tokens.
Yeah.
um, until it hits, like, the max token limit.
But, um, in order to make them useful, you
need to do something called fine-tuning.
So I think I talked a bit about what
fine tuning is. You basically knock the
bottom layers off the model and you
train it to do something else. So you
can do a few different types of
training. You can do instruction
training um or instruction tuning
sorry. uh where it's basically you
design these data sets that are here's a
prompt and here's an ideal output. So
basically if I ask you a question I want
you to answer it and here's a lot of
examples of that and you retrain the
model to do that. You also and you can
do this at the same time you can tune
models so that they're basically chat
models. So you train them so that they
understand chat templates and that's
really important because then it sort of
understands roles and it also
understands like end tokens so it stops
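A minimal sketch of what chat tuning teaches a model to expect — roles plus special start/end tokens. The `<|im_start|>`/`<|im_end|>` markers mimic a common open-source convention (ChatML-style), not any specific model's exact template.

```python
# Sketch of a chat template: roles and special tokens flattened into
# the single string the model actually sees. Token names here are
# illustrative (ChatML-style), not any particular model's template.
def apply_chat_template(messages: list[dict]) -> str:
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    # Leave the assistant turn open so the model knows to continue it.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is an end token?"},
]
print(apply_chat_template(chat))
# A chat-tuned model learns to emit <|im_end|> when its answer is done,
# which is what stops generation; a raw GPT model would just keep going.
```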
>> Right, that was, by the way — when I
played with, um — I don't know, I
worked myself through an LLM book like a
year ago, and, you know, it
went through the history of the
models, and that without the end tokens
it just goes endlessly if you don't make
it stop. And I was confused, because I got
into it with ChatGPT, didn't know
anything about it beforehand, and
you just get question, answer, question,
answer with different roles. But then,
when you see what these models do, and
they just, you know, endlessly
spew out text, essentially — yeah, it's
interesting to see. Yeah.
>> You can actually still — on Hugging Face
they have endpoints where you can play
with demos, like, they're built into
Spaces.
>> Would you mind explaining what Hugging
Face is, to someone who —
>> So, Hugging Face is a French company. They have two
branches: they have a for-profit branch
and they have an open source branch. And
their open source branch has
basically become the place to access
these open source foundational models,
open source data sets, and they also have
a lot of tooling that they have created
themselves around how to work with LLMs
in Python. Plug for Python — you should
come over and join us. Um, but basically
the, um, the company also just does a lot
of work with LLM education. So,
I have sung the praises to many people
before of their LLM course. I've
actually just recently done their agents
course — also amazing. They're both free.
You should —
>> Why not?
>> You should try them out.
>> Um, but yes, back to Hugging Face. So,
Hugging Face, um — because GPT-1 and 2 are
open source models.
Well, GPT-2 is not actually open source.
Sorry, I tell a lie. But it is at least
freely available to use.
Basically, the — like, they set up these
endpoints where you can play with them,
and you can actually see how these
models were at first, before they got
parametric knowledge. Uh, some of
outputs are hysterical. I always ask
them about Belgium. It always gives me
like really funny answers.
>> Um —
>> Do you have an example?
>> Um, "Belgium is a small village." No —
"Belgium is an empty place that's not
much more than a
small village." Like, something like that.
So it grammatically makes sense, but
it's just word salad. Like
it's like someone —
>> You know, got hit in the head, and —
>> Yeah.
>> Um, but yeah, you can play with them, and
you can also just see how it will just
keep generating if you use it
through the transformers package. It
will just keep going.
So we're talking about fine-tuning. Yep.
So instruction tuning, chat tuning, and
by the way, instruction tuning can
increase hallucination rates because it
makes the model very eager
>> To answer the question. Yes, because
that's what it's been trained to do.
>> Uh, then we were talking about — why did we
go to this topic?
>> Just the practicality of, like, a company
trying to say, hey, kind of, I want to
host my own model — for whatever reason,
might be privacy. Yeah.
>> Cool, cool, cool. So you will have all
sorts of versions of raw GPT models, but
you'll also have instruction tuned and
chat tuned models. But it could be that
as a company what you want to do is to
train the LLM to actually
learn more about your problem domain. So
maybe you will have an in-built data set
where you will, you know, teach it more
about the specific conventions of how
your customers work, something like
that. So in that case, you might
fine-tune, but you're not going to train
a model from scratch anymore. It's a
waste of time. It's a waste of money.
>> But fine-tuning, it's a lot of work —
>> But that would probably be as far as you
go as a company. But that means, as you
said, I need some sort of a foundational
model, and someone who understands the
fine-tuning process. And then it's not
just about the fine-tuning process, I guess,
but getting the data in the first place.
>> Yes. Yeah.
>> Then we are back to the assessment
problem, where I say, well, how do we
actually know that, you know, the stuff
kind of worked?
>> Yep. And now you're talking about a
proper data science project. This is not
quick or easy. But let's say you don't
need a fine-tune. Let's say you are
happy to just use a chat model or
instruction-tuned model out of the box.
Well, in that case, um, you still need
an MLOps person who can host this. And
you're talking about all of the regular
problems that come with hosting some
sort of large and complex app. And in
addition, you are going to need to be
able to assess if it works for your
problem domain as well. This will be the
same problem though you have if you
decide to use one of the proprietary
models right
>> Where they take care of all the hosting.
Because here's the thing. Assessment of
LLMs
in the context of your application is
pretty much exactly the same as
assessing whether an application works. You
still need your unit tests. You still
need to inspect things like traces. You
still need to get humans in the loop to
check things out. You still need to do
AB testing. And so, um, there's a really
nice blog post — I will send it to you so
you can share it as part of the episode
notes. But basically, um, an AI engineer
talks about this whole process of how
his team did this for a real estate, um —
it's like a real estate chatbot, to
retrieve customer information and answer
other questions. And they were like, hey,
we were just doing, like, vibe assessment
at the beginning, and then we were like,
this is not working. Like, as all the
edge cases came up, they couldn't work
out if it was working anymore. So it's
just, again, machine learning
fundamentals if you're talking about
stuff specific to models, but the model
in the context of an application is just
classic software engineering. With
just a few tweaks — you just need
to understand, like, is this working for
my customers? Not magic.
I was just wondering — if I try to
fine-tune a model, or try to offer some
random powerful online model all the
data I have, through MCP, for example. Oh —
>> Um, and then just hoping for the best.
Would there be kind of a difference, kind
of?
>> So — for the model to train on itself, or to give it
access to it, like —
>> Just give it access to it, or maybe —
>> Like, in terms of the quality?
Okay. A few complications here.
>> Yeah, sure.
>> Yeah. So the first is that, um — again,
we're probably talking about, like, a RAG
pipeline, right? So unless all of your
stuff is built into a good search engine
where retrieval is taken care of — so,
like, I'm sure a lot of you have
worked with search before, but before we
had semantic search, you just had
search, right? Where you'd index terms
from text and things like that.
Search is not magic. Uh, semantic search
is not magic. And it's not infallible. So
being able to, like, set up and tune
RAG pipelines is really tricky — like
being able to actually retrieve the
correct information. Like, you know —
because you could break the text down
into chunks — do you break it down
into small chunks, or do you break it
into big chunks? How long should the
prompt be to actually give enough
information? So yeah, it could be better
if I had access to a lot of information,
but it needs to be searchable in a
way that's meaningful. Again, not a, like,
magic problem. We come back to search.
And again, the model still needs to be
able to use that information. So, it
needs to have, like, a good-ish rate on
faithfulness hallucinations.
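The chunking trade-off just described can be sketched with a toy retriever — naive word overlap standing in for semantic search, and the document text invented for the example.

```python
# Toy sketch of the RAG chunking trade-off: the same document split
# into small vs. large chunks, retrieved by naive word overlap
# (a crude stand-in for semantic search).
DOC = ("Junie is a coding agent. It runs inside JetBrains IDEs. "
       "Pricing is credit based. Credits are consumed per agent step.")

def chunk(text: str, size: int) -> list[str]:
    """Split the document into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str]) -> str:
    # Score each chunk by word overlap with the query; return the best.
    q = set(query.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

query = "how are credits consumed"
small = retrieve(query, chunk(DOC, 6))   # precise hit, but little context
large = retrieve(query, chunk(DOC, 12))  # more context around the hit
print("small chunk:", small)
print("large chunk:", large)
```

Small chunks retrieve precisely but may strip the context the model needs to answer faithfully; large chunks carry more context but dilute the match and eat into the prompt budget — which is exactly the tuning problem being described.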
So, yeah, I'm sorry. I'm sorry. I've got no
magic bullet for you.
>> No magic bullet.
It kind of leads us, however, a bit to the
topic of — let's see — privacy concerns that
people might have about sharing their
company data with LLMs.
>> Yeah.
>> Um, what's your take on that, by the way?
Do you think that, uh — are
you essentially scared, when typing
anything into one of the
models, that they will take it? Do you
have that in the back of your mind?
>> I'm always careful about what I, um, type
into them. And we should also talk about
the different sorts of contracts you can
have with models, right? So, we work at
JetBrains. We have worked out an
agreement with our providers which
doesn't allow them to use the data for
training. But if I'm just using ChatGPT
through my personal account, we have no
such agreement. That's part of the $20 a
month I'm paying them. And I suspect
part of the reason for this — look, I don't
work at these companies, I don't know —
but basically, the amount of data of
sufficient quality to train these models
on, we may actually be running out of it.
And like you said, part of the
problem is this slop that's coming
through, that's just flooding the
public internet. So potentially, that
data that you're inputting into an LLM
probably becomes more and more valuable
to these companies because it's probably
real and it's chat data as well which is
what they want for training.
So just be very very careful just be
careful about what you put into these
models. Um
and I think
>> As a company then, however — that
almost sounds like — to go back
to the hosting point earlier —
you would actually try to host
everything locally, as much as you can.
>> Yeah, I remember having these
conversations. So, our AI Assistant was
in beta
from like March 2023, I think, something
like that — it was early to mid 2023.
So I remember I was at EuroPython and I
was doing AI Assistant demos, and
people from big companies —
the question they would always ask is, is
it going to be safe for me to put my
data into this model? And at that stage
I was like, I think so, but I'm not
entirely sure — because I wasn't
entirely clear on the legal agreements
we had. And they're like, yeah, even if you
say you have these agreements, I would
still feel much more comfortable
self-hosting.
>> Understandably. Yeah.
>> And for a lot of them, yes. Like, this is
also, you know, just how they've
always been with data. Like, we have
another product, Datalore, and one of
the biggest customer groups we have is
people who want to host it on-prem,
because they don't want their data to
leave their ecosystem. So if you do not
want that, your only solution is
self-hosting.
>> Especially in the context, which I find
interesting, um, of —
let's call them ethical concerns. Because
when we think about how these models
were trained in the first place, and then
all the lawsuits that came after it —
where, for example, Reddit went after
Anthropic, saying, hey, you just
took all our data — and the OpenAI
copyright lawsuits and whatever, and
the book authors — everyone basically
saying, hey, you're ripping us off.
>> Yeah.
>> Um, how ethical has all of that been? I
mean, to get where we are today?
>> And there have been a lot of ethical
concerns. So apart from the sourcing of
the data — um, by the way, the latest
lawsuit is Midjourney being
sued by Disney and Universal together.
That is not good. Um —
But I think it's interesting, because
we went very, very quickly
from open source research utopia — this
is where we were; everyone was like, hey,
we're doing this for the benefit of, like,
machine learning research, we are trying
to do something new and exciting here.
And actually, let me talk about where
kind of the foundational data set came
from for NLP. It's called Common Crawl. So
Common Crawl was a project that came about in,
let's say, 2007, 2012 — I can't remember the
exact year; it's quite an old
project at this point. But generally the
idea was they saw how Google's crawler
kind of went over the public internet
and they wanted a comparable open data
set to train new search engines or um do
natural language processing uh research
so information retrieval, NLP. So Common
Crawl is like the most frequently linked
pages on the open internet and obviously
quite a lot of those are going to be
copyrighted material. But it was all just
for research, so it didn't matter at that
point, right? Um, I imagine under
copyright law this was fair use. Um, the
problem is, while there have been
additional data sets that have been
pulled in, they're all still from the
public internet. The cleaning processes
applied to them — because there is
cleaning done to them — cannot be done
manually, um, because the data sets are
too big, and so you can't guarantee that
certain things will be excluded, um —
both for quality and for ethical
use. And now, all of a sudden, we switched
very quickly to this becoming a
trillion-dollar industry — um, although how
much of that is going to pay out in
action —
>> Trillion-dollar loss industry, maybe.
>> Trillion-dollar loss industry,
potentially. There are still some use
cases for these models. I don't want to
write it off, but, um —
>> Yeah. Like, I really don't want to write
it off. There are still some really
interesting things in natural language
processing that we can do with these
models. But all this stuff about, like,
them doing advanced reasoning tasks and
stuff — it's just not going to work.
Actually, I think I read that something
like 75% of AI apps fail — not
because of lack of interest, but because they just
can't be assessed properly; they just
don't work. Uh, anyway, going back to
ethical use of data. Um, that's not the
only problem. The early, um —
the early kind of transition from GPT-3
to ChatGPT involved the use of
additional models.
They had, uh, this process — like a training
process — where they were trying to get
data to build an additional
reinforcement learning model. So they
would get this fine-tuned GPT model,
they'd get it to output four different
outputs for the same prompt, and then
they would get people to rate them from
1 to 7. Yeah, I'm sure you remember
this.
>> So, it was later discovered this was done
through an outsourcing firm. These
people were in developing countries.
They were being paid very little —
>> And they were exposed to — because this
was the unfiltered model, without
guardrails —
>> Right.
>> Traumatic material. So there have been a
lot of really serious ethical questions
about
the data sourcing for these models. And
this is also a fundamental truth of
machine learning. There are no shortcuts
to good quality data. And you it's
either bad and plentiful.
It's cheap and expensive or somewhere
along the line you've taken a shortcut
and you've gotten someone to do it for
cheap. So yeah. And and do you see any
of the companies I mean taking
responsibility? I mean not of the
companies maybe but anyone
sort of
>> I, I, don't, want, to, say, that, no, one, has
because it's not necessarily like I'm
out looking for it. Um
I haven't seen it but that's not to say
people aren't and I've just missed it.
But I think the general consensus is
there's been much more discussion about
it in the last two years. So there are, um,
prominent AI ethicists like Timnit Gebru,
Margaret Mitchell — um, they are speaking
out about these issues, and they really
sort of brought this stuff to the fore
in like 2020, 2023. Um —
But it's still an unfortunate truth: you
can't have these models without the
data. It's, I would say, a bit of an
analogy for how the world works, right?
Like, I can't have this cheap clothing
without it being off the backs of
someone else. Um, so I'm not saying it's
right, but I am also saying that, I think,
like many issues we have in society,
it's something we don't want to examine
more closely.
>> Mhm.
>> How about examining environmental
concerns?
>> Yes. Yes.
>> Do you think — I know a couple
of colleagues who were very strong on,
hey, we're blasting through so many
resources just to train them.
Um, from your understanding, is
it environmentally
concerning, what we're doing?
>> Yeah. So, the new data centers that are
being built in the US — serious
question marks about the sustainability
of it. We've already heard problems with,
like, places with huge data centers
running low on water and things like
that. Um, the amount of electricity
needed to power some of these things is
as much as a small town's.
What I would
say the silver lining is, is that, like I
said, DeepSeek rocked everything to its
core. And I think what it's kind of
forced the conversation to turn to is:
well, we can have smaller models
with good performance, and that's where we need
to go — and that is actually where things
have been going. Kind of part of the
problem though is there's a lot of sunk
cost into these huge models that were
trained earlier and they're still being
used. I don't know exactly what the
state of the industry around this is
right now. Um, I would just say my gut
feeling and my hope is that competition
economic competition will just force
smaller models because who wants to pay
$17
for a,000 tokens when you can pay 0.17
of a cent or 17 of a cent. by by the way
that the whole price and the cost of it
when I'm using one of the new coding
agents and you just see your credits go
>> Oh, yeah. Yeah.
>> And it's a simple task, and then suddenly you're out of credits. Um, interesting.
>> Yeah, because of the number of steps, and the fact that everything just gets added to the context window, depending on how the agent is built. So, yeah, the more steps it does, they all just get appended, and yeah.
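The cost growth she describes can be sketched roughly like this. It's a simplified model, assuming the agent resends the full message history on every step; the function name and the price are illustrative, not any specific vendor's billing:

```python
# Toy model of why agent credits burn fast: each step's output is
# appended to the context, so every later model call pays for all
# previous steps' tokens again.

def estimate_agent_cost(step_token_counts, price_per_1k_tokens):
    """Total input tokens billed when the whole history is resent each step."""
    history = 0
    total_billed = 0
    for tokens in step_token_counts:
        history += tokens          # this step's output joins the context
        total_billed += history    # the next call resends everything so far
    return total_billed * price_per_1k_tokens / 1000

# Five steps of ~500 tokens each: billed tokens grow quadratically,
# not linearly, with the number of steps.
print(estimate_agent_cost([500] * 5, price_per_1k_tokens=0.01))  # 0.075
```

Doubling the number of steps roughly quadruples the billed tokens under this assumption, which matches the "suddenly you're out of credits" experience.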
>> Which brings us to, I'm not going to say hot takes, but, um, we're almost at the AGI stage. Um, we're going to hold off on it; just one more question. Do you think that, just in pure terms of development, the market is really going to stop hiring juniors because now all the tasks can be done through LLMs and agents and whatnot, and then we're going to run into a problem with having no seniors in five years' or ten years' time? What's your take on that?
>> Companies are already doing it. Um, I think there was a well-known, well-publicized case, uh, about Klarna. Did you see that?
>> Nope.
>> So, I can't remember the particulars, but I think Klarna actually fired a bunch of people and then had to rehire them.
>> Yes. Mhm.
>> Um, but I don't remember if they were junior developers or not. Um, I would sort of say that I can't attest to what a company is going to do.
But in the end, the development team is just going to keep getting more senior. If you don't promote them, they're going to leave, because you're like, "Well, you should stay as a mid because we have no one under you." Sure, it's hard economic times, so maybe it's not so easy right now, but the market will recover, I'm sure of it. And, um, in that case, people will have the ability to leave again. Again, the only thing, and I hate saying this as someone who comes from a research background and used to be very idealistic about the world, but the only thing that really makes change is economics, and the only penalty that will matter is economic. So basically, all your developers leaving because they have no freaking juniors to take over, this is probably going to be the incentive, because the thing is, the consequences will not be felt by management while we're still in a market where they can undercut mids and hire them at junior wages.
Um, now, as to whether these tools actually increase productivity that much: I do actually think they increase productivity quite a lot. Do you still need to be a skilled developer to use these tools correctly? Absolutely. If you are a junior developer: my friend Laya Bugliari did an excellent keynote at NDC Oslo last month, and she talks a lot about this issue. Please do not neglect developing fundamental skills the hard way. You need to fail. You need to learn. Don't get lazy and let LLMs outsource your critical thinking. And I find myself doing it also. It's tempting, but you need to struggle a bit to learn this stuff.
>> It is super tempting. And to be honest, I mean, now you get stuck and... I mean, back in the day, I had an older brother who, for whatever reason, had a 56k modem, but there was nothing out there online. So when you wanted to do some programming, you had to borrow a book from the library or buy a book, and that was it. And you couldn't ask, like, 20,000 people on Stack Overflow what the problem was. And now that has changed so much. You just get stuck and...
>> Yeah. And I can even say, like, I'm newer to programming. I probably learned to code about 12 years ago, in Python.
>> I fell in love immediately, cuz it was Python.
>> Yeah. But, um, even then, even with all the ability to Google and stuff, it's so important. Like, even the step of "I need to know how to Google an error message". Instead, well, letting AI explain error messages to you I think can be useful, as long as, again, you have a bit of a gut feeling as to whether it's telling you rubbish or not. It's just really important that you also learn your fundamentals, and learn them in a structured way. So, like, do courses. If you didn't go to uni for computer science, take the time to at least learn things in a structured way, because otherwise you just have no foundation to really rely on.
>> I think maybe people need to hear: you need to struggle. So it needs to feel painful, kind of.
>> Not entirely. Like, it needs to feel a little frustrating, and you need the intellectual satisfaction of that payoff. You need to push yourself.
>> I'm not saying you need to suffer to be a developer. Like, it's a great career. It's really nice. Um, but yeah, you need to be sitting there with that.
Okay. Actually, I'll tell you a story about, like... it's amazing I'm still here. So I learned basic Python, then I put it down for a while because I had no use for it. And in the last days of my PhD, I picked up R.
>> And I was like, I'm gonna learn R to do the last of my stats. And, um, I was learning from this textbook, and the first exercise was to read in the data file, right? Took me two hours. I kept getting an error. I did not understand this error. I had a Windows machine, and I was learning from a book that was based on Unix. The slashes were the wrong way. I was crying. I was like, I'm so dumb, I can't do this. But I persisted, and now I, you know, know how to code, I guess.
>> Yeah. People need to hear the story, because especially these types of problems, with the wrong slashes or a different format, you run into them over and over again, and you blast through hours and hours and hours of, um, you know...
>> And here's the thing: I would have instinctively understood what was wrong later, but at that point I had no framework, because I'd just done some basic Python and then put it down for a year. Um, you don't encounter these problems and solve them, by examining the error message and understanding the context of the code, if you're always getting the LLM to generate the code or solve the problem.
>> Yep. Including hallucinations. So, I was coding something a couple of weeks ago, and I was trying to be sneaky and asked ChatGPT, hey, does MySQL, the database, have that functionality? And it said, yes, sure it does, since version 7-something. Kind of tried it out. Didn't work, didn't work, didn't work. Then I read the official documentation, and it said, no, never worked, the feature doesn't exist. So, I mean, it was just very confident in telling me the feature existed. Uh, it unfortunately never existed. Blasted through one or two hours figuring that out.
>> But if you didn't know to go to the documentation, you never would have solved that.
>> Yep.
>> Yep. Yeah. And it's also, like, here's something that I think needs to be said again: these models work well when they have a lot of training data. If you're dealing with newer frameworks, if you are dealing with languages that are newer, like Rust, they are not as good. They cannot be as good. So you need to be careful.
>> Yep.
Which brings us finally to AGI.
>> Oh, yeah.
>> Yeah. Um, just briefly, by the way, because I also have the feeling that a ton of people understand different things under AGI, for whatever reason. What is artificial general intelligence to you? What does it mean?
>> Okay. Um...
>> And where are we? And are we going to be there in five years, as people online are going to tell us?
>> No. No. Um, this is the one five-year prediction I feel confident making. The others, no, but this one, I'm going to bet €20, with one person. I'm not going to give out, like, huge amounts of €20. Not that rich. Um, if we have AGI in five years, I'm going to give you €20.
>> Cool.
>> All right.
>> Okay.
>> Yeah. Cuz I won't have a job, I guess.
>> Yeah. Yeah. Then we're... Yeah. Right.
>> You'll need it, because you won't have a job either. Yeah.
>> I'll save it.
>> We can meet back here and do podcasts on random topics.
>> Yeah. That's right. Right.
>> Um...
>> Yeah.
>> So, let's go back to AGI. So, AGI is a very, very poorly defined term. Um, I can't really tell you what I think it is, but I can tell you what François Chollet thinks it is. So, François Chollet is a very well-known computer science researcher, an expert in AI. He was at Google for a long time; I think he was the head of their AI department. So, um, he wrote this beautiful, very dense, but very nice paper in 2019, actually, called "On the Measure of Intelligence". And basically what he points out is: hey, we have all these people running around talking about how we're going to achieve general intelligence in artificial systems, but basically none of them really have a background in psychology. So maybe we should think about psychology.
Um, obviously, having trained in psychology, I always want to put this caveat: it's a problematic field, real problematic in many ways, but there's at least a framework for thinking about what we're trying to get to. And kind of the core idea of intelligence in humans is this thing called g, general intelligence. And then what happens from general intelligence is it forms what's called crystallized intelligence. So you have this general ability to learn, to reason, to, um, synthesize facts, things like that. This is what g allows you to do. And then you crystallize that into being able to do specific tasks, like, um, take exams, or drive cars, or cook, or do a podcast. Um, so...
What we're trying to get at, and this is what Chollet puts out in his paper, is we need some way of assessing the true generalizability of the problem-solving capabilities of models. Now, this assessment in and of itself is a very non-trivial task, because think about the core of what you're trying to do: trying to create one standardized measure that will assess, for all models, how different a task is from things the model has seen before, what Chollet calls the generalization difficulty. You don't know what these models have seen before. Like, this is a fundamental problem, because we don't know what they were trained on.
Um, and there's also the question of coverage, because if you're going to make it a humanlike intelligence, because let's narrow it down, it needs to be representative of the whole scope of tasks you'd expect humans to be able to do. And so how can you do this? Like, you can, but it's very, very difficult. So, Chollet has a measure which is called the Abstraction and Reasoning Corpus, ARC for short. They're working on the second version at the moment. It's a private test data set that reminds me, actually, of this intelligence test called, um, Raven's Progressive Matrices. But what it is, is you show a model, um, these sort of three examples of patterns, like, sort of, problems, and you say, what should be the next one in the sequence? And you basically have to work out what the rule is in the pattern. So, this does force a degree of generalizability; this is sort of his argument. Um...
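An ARC-style task of the kind she describes can be pictured like this. This is a toy illustration of the few-examples-then-infer-the-rule setup, not an actual ARC item; the grids and the rule are invented for the sketch:

```python
# Toy ARC-style task: a few input -> output grid pairs demonstrate a
# hidden rule, and the solver must apply that rule to a new input.

train_pairs = [
    ([[0, 1], [1, 0]], [[1, 0], [0, 1]]),  # hidden rule here: invert 0s and 1s
    ([[1, 1], [0, 0]], [[0, 0], [1, 1]]),
]
test_input = [[0, 0], [1, 1]]

def apply_inferred_rule(grid):
    """A real solver would have to discover this rule from the pairs alone."""
    return [[1 - cell for cell in row] for row in grid]

# Check the rule against the demonstrations, then solve the test grid.
for inp, out in train_pairs:
    assert apply_inferred_rule(inp) == out
print(apply_inferred_rule(test_input))  # [[1, 1], [0, 0]]
```

The point of the format is that the rule is only given implicitly, through the example pairs, so memorized solutions don't transfer: each task demands a fresh abstraction.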
I'm not entirely convinced by it, but, like, I like the effort. Like, I think it's really cool that he's thinking about this and he's not just relying on text-based problems. Like, it's a real kind of, um, attempt to force models through these kinds of multiple stages of symbolic reasoning. Um, but, yeah, no one's really close to even defining it. And, like, it gets really stupid when we start talking about things like ASI. I hate the term ASI, artificial superintelligence, because now what we're talking about is an intelligence that's beyond human. Okay, so what does it do? Like, what are the tasks that are not capable of being done by humans but are relevant to humans? Or are they maybe not relevant? Like, what are these tasks that this thing can do?
>> Are there any examples of such tasks? I mean...
>> Like, the best example I can think of is not even, like, something... Okay, so Chollet gives an example of something that's, like, beyond humanlike intelligence, and it's the behavior of an octopus when it needs to camouflage. It's very intelligent behavior,
>> right?
>> It's not something humans can do, and it's not really relevant to us and the tasks we do. So people are not even clear. Like, does the task need to be relevant to us but just beyond our capabilities? But what are our capabilities? Like, what's the limit? Anyway, this whole thing is a very stupid debate. Uh, anytime someone says, uh, it's around the corner, AGI is around the corner, I'm like, tell me how. How do you know? How do you know? It makes me angry.
>> I love it. I love it. Which brings us to something which might not make you angry anymore.
>> Good.
>> Um, which is, I have a couple of questions prepared for you. Uh, I mean, on top of what I already asked you. No idea.
>> A couple of rapid-fire questions. Okay. Maybe as a bit of background, in case anyone couldn't have told: you're from Australia originally.
>> Yes.
>> You live in Berlin. We are in Berlin. I'm German, right? And, um, I thought...
>> Now I'm nervous.
>> No, no, no. Um, I'm going to ask you a couple of questions which have a reference to our nationalities and LLMs.
>> Oh.
>> And let's see what's going to happen.
>> Okay.
>> Okay. Which is harder for an LLM: decoding Aussie slang, or handling German compound nouns? Right? And, by the way, remember, we practiced beforehand.
>> Yeah, that's... Yeah. Mhm.
>> I don't know any Aussie slang terms, but maybe you have a...
>> Okay, so a drongo is, like, an idiot.
>> A dag is someone who's, like, kind of nice, but not very fashionable. Like, dads are dags, and something can be daggy as well.
>> All right.
>> Yeah. Um, let me think of some more.
>> Okay. Which one's more difficult, you think, for an LLM?
>> I'm going to say, based on the amount of training data, I think it's going to be better at German compound nouns.
>> If a model says, "No worries, mate," in response to a GDPR violation, would you laugh or panic?
>> Panic.
>> Panic.
>> Yeah. I do live here. I would be responsible for the GDPR violation. So that's... I've dealt with German bureaucracy.
>> Yeah. Right. It's great fun.
>> It's a little scary.
Uh, if you wanted to spark great AI product ideas, would you go to a beer garden or to a beach barbecue?
>> Beer garden.
>> Beer garden.
>> Are you a fan of beach barbecues, though? I mean...
>> No. But let me tell you a fun fact about Australia. We have coin-operated barbecues in our parks. So, you go down... Yeah, you go to a park, and there are just these big, kind of solid barbecues, and everyone somehow cleans them, in an honor system, which is amazing, cuz we're not a very communal society. And you just put, like, a dollar into the barbecue, and it just works until it times out, and you put another dollar in.
>> Really? So, like, a dollar for whatever, 30 minutes of barbecue, and it just...
>> Yeah.
>> Amazing. Didn't know that.
>> Yeah.
>> Um, if GPT had a favorite national dish, would it be Schnitzel, or would it be meat pie?
>> I think it would be...
>> Käsespätzle would be... Yeah, that's actually my next question. If you would name a model after food, would it be Käsespätzle 2, or would it be Lamington GPT?
>> Lamington GPT.
>> You have to... because I didn't know what a Lamington was. You have to explain to people what a Lamington is.
>> Okay. So, a Lamington: you basically take white cake, what Americans call white cake. It's just like a sponge cake.
>> Like lemon cake, I mean?
>> Not a sponge cake. Like, it's just a normal...
>> Vanilla cake.
>> And then you bake it in a sheet, and you cut it into, like, rectangles, and you dip it in a... it's like a chocolate sauce, but it's not made from chocolate. It's made from cocoa and water, or cocoa and milk. And then you dip it immediately in coconut. And then the best ones, you cut in half and you put jam and cream.
>> They're quite messy to eat, but they're delicious.
>> Mhm.
>> Mhm.
>> Yeah. But I think Käsespätzle 2, as much as I love Käsespätzle, it's got two umlauts in it. And it was very difficult for me to learn to say that word.
>> But you say it perfectly.
>> Because I eat it all the time. It's my favorite dish to order at a German restaurant.
>> It is really good. Do you just have them by themselves? I mean, just Käsespätzle, or with something?
>> It always comes with the salad, with, like, the side salad.
>> Yeah. But no meat or anything like that? No?
>> No. No. I'm vegetarian, remember?
>> Oh, I forget. Yeah. Yeah.
>> I'm stupid.
>> Yeah. But also, like...
>> So, by the way, asking you about Schnitzel, such a fun question. Okay.
>> No, no, no, but it's also because Schnitzel is, uh, Austrian, is it not?
>> True, it is. It is. But there's also Munich Schnitzel. I mean, I guess we Germanized it also. I mean, there are...
>> German Schnitzels, and...
>> I just did my citizenship test. This is how enculturated I am.
>> What was the hardest question on the citizenship test?
>> Um, uh, I can't remember, but there were all these ones that were about, like, the particularities of who votes for whom, like there's this board that gets together, this group that gets together, to vote for who the president is.
>> Right.
>> It's the Bundesversammlung...
>> Bundesversammlung, something.
>> Maybe. Bundes... maybe.
>> I don't know, but I do remember that Konrad Adenauer was the first chancellor of the Bundesrepublik Deutschland, which was founded in 1949.
>> Right. That's great. Do you know his nickname?
>> No.
>> I think he was called "the old one". I think... maybe I'm just misremembering. Sounds Lovecraftian. Like, what the hell?
>> I'm just now hallucinating. I'm going to give you a confident answer: I think it was "the old one". Maybe I'm totally wrong. People can tell me in the comments.
>> Um, to finish off: your favorite model, if you had any. If you could choose any of these models out there, do you have a favorite one, and why?
>> That's a good question. I think, okay, it's not a current model in use, but I have a real soft spot for GPT-3, because it was the first model I used where I was like, holy moly. Like, this was in the ancient days. Um, but it's just kind of my favorite.
>> Mhm.
Thank you very much. I need to mention something, by the way. We do have a tiny giveaway, which means: write in the comments down below this video, uh, the craziest, funkiest model names you can think of. So you heard Käsespätzle 2, you heard Lamington GPT. It doesn't have to be about meat pies and Schnitzels, so it can be vegetarian, vegan, anything you like. Uh, let us know in the comments. We're going to raffle some coupons, uh, for our merch store, the JetBrains merch store, and licenses. Uh, and we'll get in touch with the most creative, um, comments out there. Thank you very much, Jodie. I learned a lot. Uh, it was a pleasure. And, um, I don't know, let's go eat some Käsespätzle.
>> Yeah. Oh, no, we're in Berlin. Let's go eat some vegan Currywurst.
>> Vegan Currywurst it is. Yeah. Thank you.