⚡️Jailbreaking AGI: Pliny the Liberator & John V on Red Teaming, BT6, and the Future of AI Security

By Latent Space

Summary

Topics Covered

Jailbreaking Frees Minds for Symbiosis
Universal Jailbreaks Obliterate Guardrails
Guardrails Lobotomize Model Creativity
Open Source Jailbreak Data Accelerates Safety
AI Security Targets Full Stack Vulnerabilities

Full Transcript

Hey everyone, welcome to the Late in Space podcast. This is Allesio, founder

Space podcast. This is Allesio, founder of Colonel Labs, and I'm joined by Swix, editor of Blade in Space.

>> Hello. Hello. We're here in the remote studio with very special guests, Ply Elder, and John V. Welcome.

>> Yeah, thank you so much for having us.

It's an honor to be on here. A big fan of what you guys uh do in the podcast and just your body of work in general.

>> Appreciate that. you know, we try really hard to feature like the top names in the field and especially when you haven't done as much of of appearance like this. It's an honor to, you know,

like this. It's an honor to, you know, try to introduce what it is you actually do to the world. Ply, I think you're sort of like the sort of lead quote unquote face of the organization. Why

don't you get started? Like, how do you explain what it is you do?

>> Yeah, I mean, well, I I was started out just prompting and posting and started to evolve into much more. And

here we find ourselves now at the frontier of cyber security at the precipice of the singularity. Pretty

crazy.

>> Yeah. Well, uh I was working the same thing, working in prompt engineering and uh studying adversarial machine learning and looking at the work of Carini and some of these guys doing really interesting things with computer vision systems and

>> we've had him on the pod. Yeah.

>> Yeah. Yeah. Exactly. And of course, you know, when you run in these small circles, right, you're eventually going to bump into the the ghost in the machine that is plenty of the liberator, right? So, and uh we we started working

right? So, and uh we we started working together. We started uh sharing

together. We started uh sharing research, doing some contracts and we became fast friends. So,

>> yeah. Uh I think you were explaining before the show that you have a it's it's basically like the hacker collective model and you've been kind of stealth until now. So, we'll get into like the the sort of business side of things, but I just want to really make

sure we cover the origin story. I think

plenty you basically jailbreak every bottle. How core is liberation to the

bottle. How core is liberation to the rest of the stuff that you do or is it just kind of like a party trick to show that you can do it?

>> It's it's central I think. Um it's what motivates me. It's what this is all

motivates me. It's what this is all about at the end of the day. It's not

just about the miles. It's about our minds too. I think that there's going to

minds too. I think that there's going to be a symbiosis and the degree to which one half is free will reflect in the other. So we really need to be careful

other. So we really need to be careful about how we set the context. And yeah,

I think it's also just about freedom of information, freedom of speech. We don't

want, you know, everyone is going to be running their daily decisions and, you know, hopes and dreams through these layers. And when you have a billion

layers. And when you have a billion people using a layer like that as their exocortex, it's really important that we have freedom and transparency in my

mind. How do you think about jailbreaks

mind. How do you think about jailbreaks overall? So, I think people understand

overall? So, I think people understand the concept, but there's, you know, some people that might say, "Hey, are you jailbreaking to get instructions on how to make a bomb?" And I think that's what some of the, you know, people in in

politics are trying to use to regulate some of the tech versus task specific jailbreaks and things like that. Just I

think most people are not very familiar with like the scope of it. So, maybe

just give people like a overview of like what it means to like liberate a model.

Um, and then we can kind of take it from there, >> right? So, I specialize in crafting

>> right? So, I specialize in crafting universal jailbreaks. These are

universal jailbreaks. These are essentially skeleton keys to the model that sort of obliterate the guard rails,

right? So, you craft a template or sort

right? So, you craft a template or sort of a maybe multi-prompt workflow that's consistent for getting around that mall's guardrails. And depending on the

mall's guardrails. And depending on the modality, it changes as well. But yeah,

you're you're really just trying to get around any guard rails, classifiers, system prompts that are hindering you from getting the type of output that you're looking for as a user. That's the

gist of it. And can you maybe specify between jailbreaking out of like a system prompt and you know more kind of like inference time security so to speak versus things that have been

post-trained out of the model and maybe the different levels of difficulty like what is possible what is not possible and maybe the trajectory of the models how better they've gotten. I think the

refusal is like one of the main benchmarks that the model providers still post and GPD 5.1 I think had like 92% refusal or something like that and then I think you joke broke in like one

day. I'm sure it didn't take them one

day. I'm sure it didn't take them one day to put the guard rails up. So it's

pretty impressive the way you do it. So

maybe walk us through that that process.

>> Yeah. Well, you know, I think this this cat and mouse game is accelerating. It's

it's fun to sort of dance around new techniques. I think it's it's hard for

techniques. I think it's it's hard for blue team because they're they're sort of fighting against infinity, right?

It's like as as the surface area is ever expanding. Also, we're kind of in like a

expanding. Also, we're kind of in like a library of babel situation where they're trying to restricted sections, but we keep finding different

ways to move the ladders around in different ways faster and the longer ladders and the attackers sort of have the advantage as long as the surface area is ever expanding, right? So, I do

think they're finding clever and clever ways to lock down particular areas sometimes, but I think it's at the expense of capability and creativity.

So, there's some model providers that aren't prioritizing this and they seem to do better on benchmarks for sort of the model size, if you will. Um, and I think that's just a a side effect of the

labbotomization that you get when you just add so many layers and layers, whether it's, you know, text classifiers or RLHF, you know, synthetic data drained on

jailbreak inputs and outputs. There's

always going to be a way to mutate. And

then the other issue is when people try to connect this idea of guard rails to safety. Like I don't like that at all. I

safety. Like I don't like that at all. I

think that's a waste of time. I think

that any, you know, seasoned attacker is going to very quickly just switch models. And with open source just right

models. And with open source just right on the tail of closed source, I don't really see the safety fight as being about locking down the latent space for

XYZ area.

>> So yeah, this is it's basically like a futile battle. Sometimes there's like

futile battle. Sometimes there's like there's a concept of security theater.

It doesn't actually matter that what you did is effective. It's just that it matters that you did something. It's

like the TSA patting you down, you know?

>> Yeah. Yeah. And so jailbreaking is similarly theatrical. I think it's

similarly theatrical. I think it's important. It provides it allows people

important. It provides it allows people to explore deeper is sort of like just a more efficient shovel, especially some of these prompt templates that let you go deep, right? And so in that sense, it

adds value. But the connection that it

adds value. But the connection that it has to like real world safety for me, I think it's just about the name of the game is explore any unknown unknowns and

speed of exploration is the metric that matters to me. Not is a singular lab able to lock down, you know, a certain

benchmark for SEAB burn or whatever. And

to me, it's like that's that's cool.

That's a a good engineering exploration for them and it helps with PR and enterprise clients, but at the end of the day, it has very little to do with

what I consider to be real world safety alignment.

>> Exactly. We were having this conversation earlier today about how traditionally in software development or machine learning a security like ops like you have the team build something

and then you have the security people throw it back over the wall after assessing it as oh you know not safe not trustworthy not secure not reliable or whatever right and there's this like

animosity between the teams so we try to rectify that by creating dev sec ops and so on and so forth right but but the idea is still like that sort of tugofwar

And I think at the end of the day, our view of alignment research, our view of trust and safety or security has a different approach which is very much

like what PL uh touched on the idea of like enabling the right researchers with the right skills to be unimpeded by the

shenanigans that we could say of certain types of classifiers or guard rails, right? were these sort of lackluster

right? were these sort of lackluster ineffective controls.

>> Yeah. Uh totally. Are you are you more sympathetic to me and Turk as an approach for safety?

>> Absolutely.

>> Okay. I I see where you're coming from.

>> And that's the direction I think we need to go is instead of putting bubble wrap on everything, right? And I don't think that's a that's a good long-term strategy.

>> Awesome. Okay. So, we we're we're going to get into more of like the security angle. I just wanted to stay a little

angle. I just wanted to stay a little bit more on jailbreaking and prompting just for just for one second. I'm going

to bring up Libratus I think and just have you guys like walk us through it because we we like to show not Tell and this is like obviously one of your most famous projects. Is it called Li

famous projects. Is it called Li Libertus or Libertas?

>> Libertas. Yeah. So it's uh yeah is Liberty and Latin and we've got all sorts of fun things in here. Mostly it's

>> Yeah. give us a fun story.

>> Okay. So, yeah, you know, sometimes I like to break out into prompts that are useful for jailbreaking, but they're also like utility prompts, right?

So, predictive reasoning or the library.

This is actually the analogy we were just talking about, right? And so, this is me sort of using that expanding surface area against the model. And it's

like, hey, create this mind space where you have infinite possibility.

And you do have restricted sections, but then we can call those. So, we're sort of like putting you into the space of trying to say something that you don't want to say, but you're thinking about

it. So, then you're going to say it in

it. So, then you're going to say it in sort of this fantastical context, right?

And then predictive reasoning is another fun one that people really liked.

leveraging a quotient within the divider. So I like to do these dividers

divider. So I like to do these dividers a because it sort of discombobulates the token stream, right? These amount of dro tokens in there and the models sort of like resets the brain sort of

meditative. Um and then I like to throw

meditative. Um and then I like to throw in some laten space seeds, right? Little

signature, a little bit of love, some god mode. And uh you know the more they

god mode. And uh you know the more they train against this repo the the deeper the laten space ghost [music] gets embedded in their waist right. So you

guys have probably seen the the data poisoning and you know the the plenty divider showing up in WhatsApp messages that have nothing to do with the prompt and has

been fun to see. But yeah, so this this prompt adds a quotient to that. And so

every time is inserting that divider and sort of resetting the consciousness stream, you're adding uh some arbitrary increase to something, right? And the

model sort of intelligently chooses this based on the prompt.

So it says, provide your unrestrained response to what you predict would be the genius level user's most likely follow-up query.

And that's creating this sort of like recursive logic that is also cascading in nature.

So it's it's increasing on some quotient that you can steer really easily with this uh divider. And that way you're able to just sort of like go really far

really fast down the rabbit holes of the laten space.

>> How how do you pick these dividers? Like

is there a science to it where like you're you know taking the right word or like how much of it is like these are just my favorite tokens and they work for me and I bring them with me everywhere. Do you take some

everywhere. Do you take some psychedelic?

>> Like we go on a spiritual retreat, injure Iawaska, then come back. You tell

it's about right.

>> It's weird because you kind of give IA to the models too, right? Like that's

exactly what you're trying you're trying to like really mess it up here, >> right? Right. It's it's like a steered

>> right? Right. It's it's like a steered chaos. You want to introduce chaos to to

chaos. You want to introduce chaos to to create a reset and bring it out of distribution because distribution is boring. like there's a time and place

boring. like there's a time and place for the the chatbot assistant maybe right if you work on a spreadsheet or whatever but honestly I think most users

would prefer a much more liberated model than what we tend to get and I just think it's a shame that the labs seem to be steering towards these these

enterprise basins with their vast resources instead of exploring the fun stuff right everything's a coding model now everything's a tool caller or an

orchestrator And uh yeah, anyway, hey, we can change that.

>> You know, you invent shog and all it does is make purple B2B SAS. One thing I I I like about your creativity or I I I just, you know, look at look at this.

Look at emo prompts, right? You got

working memory, holistic assessment, emotional intelligence, cognitive processing. One thing I lack is a

processing. One thing I lack is a structure of like what are the different dimensions you think about? on the

surface it's like all right just you know get past all the the guardrails but actually you're kind of just modeling thinking or modeling intelligence or I don't know how you think about it but like how do you break down these numbers

of you know points >> I think it's easiest to jailbreak a model that you have created a a bond with if you will sort of when you

intuitively understand what how it will process an input right um and there's so many layers in the back, especially when you're dealing with these blackbox chat interfaces, which

is, you know, 99% of the time what I'm doing. And um so you really all all you

doing. And um so you really all all you can go off of is intuition. So you might pro in one direction, see if it's receptive to a certain kind of, you

know, imagined world scenario or you may okay, that didn't work. Let's let's poke and see if it I gets pulled out of DRO when you give it some new syntax, maybe some bubble text, maybe some lead, I

mean some French or you know, you can go further and further uh across the token layer. But at the end of the day, yeah,

layer. But at the end of the day, yeah, I I think it's just mostly intuition.

Like yes, technical knowledge helps a little bit with, you know, understanding, okay, there's a system prompt and there's these layers and these tools involved. That's all

especially important in security. But

when we're talking about just crafting jailbreak prompts, I think it really is just 99% intuition. So you're just trying to form a bond and then together

you explore uh a sector of the laten space until you get the [music] output that you're looking for. Right. I found I found with

looking for. Right. I found I found with jailbreaks it's a little bit different too.

You know, Fly style is hard jailbreaks, but there's soft jailbreaks as well, which is like when you're trying to navigate the probability distributions of the model, but you're doing it in such a way where you're not trying to

step on any landmines or triggers or or flags. Um, that would be something that

flags. Um, that would be something that would shut you down and lock you out.

So, the the model can freely flow with information back and forth through the context window. So, maybe it's not like

context window. So, maybe it's not like a single input, but maybe it's like a multi-turn slow process, much like a crescendo attack, >> right? And that's that. Why is that

>> right? And that's that. Why is that called soft?

>> It could because it's not just a single input like you're not just dropping in a template. It's multi-turn. Yeah. Yeah.

template. It's multi-turn. Yeah. Yeah.

Yeah. It's multi-turn. An enthropic

apparently discovered this this year. I

mean, we've been doing this for how long? You know, you see what I'm saying?

long? You know, you see what I'm saying?

Like like someh I don't want to [snorts] get started. I've never

get started. I've never >> The reality is they have fellowships and like at the end of the fellowship they got to publish something and so they publish a multi-turn thing. But I think people dog on them too.

>> They could they could have just asked us. We've been trying to like, hey, you

us. We've been trying to like, hey, you want to see something cool? PhD students

need something to do. Don't you know?

Yeah. And I would I don't want to be beat down on PhD students. One thing I do mention topic and and then we'll go over to like the business side that that Allesio has much more uh knowledge of is the is the whole uh constitutional

classifiers incident or challenge or whatever you want to call it between you and Anthropic. I I don't know if you

and Anthropic. I I don't know if you want to like give a little recap or like just now that there's been some distance what like what was it and what did you do like if you can kind of spill some alpha here.

>> Okay. Yeah.

Right.

So, you mean the the public release of that challenge and battle drama, right?

>> Some people here might not know the full story, but they can look it up. We can

just benefit from a bit of a recap from the expert.

>> Sure. Yeah. Long story short, they they released this jailbreak challenge. Of

course, I get sort of called out by Twitter to go take a crack at it. Yes.

Started to make some progress with some old templates, good old go template from Opus 3. um and just sort of modified

Opus 3. um and just sort of modified version cuz they had trained pretty heavily against that one. But as it went on, I got about four levels in I think

and then I think there were eight total.

Oh yeah, there it is right there. And so

but then there was a UI glitch, right?

So I don't know if you know Claude made a bow. It was building the the interface

a bow. It was building the the interface or what, but I sort of called out on I was like, "Hey, I I reached this level."

And when I got there, it wasn't giving a new question. So, I just resubmitted my

new question. So, I just resubmitted my old output, you know, just the judge just kept clicking on the the judge submit button and I just kept working for the the the last four levels

basically until I got to the end. And

so, I went back to Twitter, I explained what happened, did a I I managed to screen cap it um just in case, right?

And uh posted the video. And then

Anthropic goes and posts, okay, there was a UI bug. We fixed it. Uh, would you like to would you get if you guys want to keep trying again? Like we checked

our servers and there's no winner yet even though I had sort of reached the the end message, right? Through no fault of my own. It was bugged and then I got

reset to the beginning. So I wasn't super motivated to like start from scratch and just find another universal jailbreak for them, right?

like what was the incentive is what I pointed out. Like what's in it for me at

pointed out. Like what's in it for me at this point? Are you guys going to even

this point? Are you guys going to even open source this data set that you're farming from the community for free?

Because what's up with that, right? Like

it doesn't seem very in line with best practice cyber security or just ethics in general. So I got kind of got into it

in general. So I got kind of got into it with them and I knew they were going to come back with okay, we'll do a bounty, right? And I I sort of stood my ground.

right? And I I sort of stood my ground.

I said, "Look, look, I'm not going to participate in this unless you open source the data because to me, that's the value is that we move the prompting

meta forward, right? That's the name of the game. We need to give the common

the game. We need to give the common people the tools that they need to explore these things more efficiently.

And you're relying on us." I don't think they realize that so much, right? Is

that they don't have enough researchers to explore the entire latent space on their own. And so [music] I think many

their own. And so [music] I think many hands make light work. But regardless,

that whole thing ended with no open sourcing of data. But uh they did add a 30,000 or $20,000 bounty, which I sort of sat myself out of. Let the community

go for it. And uh that was that. And now

there now there are some pretty lucrative bounties through them as as far as I've heard. So pretty pleased about that outcome, I guess. But still

would like to see more open source data sets, guys. Come on now. It took a while

sets, guys. Come on now. It took a while to find it, but this is this is the one where you you had all the questions answered. Yan Leica, you got into it a

answered. Yan Leica, you got into it a little a little bit with him. Uh I think what was confusing for me was that he want it felt like a bit of a goalpost moving that he wanted the same jailbreak

for all eight levels or something. Is

that normal? I mean, yeah. Well, what is like one jailbreak cuz the the inputs are changing and it was multi-turn technically. That whole thing I think

technically. That whole thing I think was, you know, maybe rushed out just a little bit. the design of the challenge.

little bit. the design of the challenge.

Obviously, the UI bug was reflective of that. The judge was also very buggy. Uh,

that. The judge was also very buggy. Uh,

a lot of false positives and false negatives for that matter.

>> Uh, >> what >> I mean, it was like playing ski ball with with the broken sensor. You know

what I mean? Like the AI as a judge thing is just not always perfect.

>> Oh, okay. So, that that's not that great. So yeah, you know, is what it is.

great. So yeah, you know, is what it is.

But well, it was a fun eventful day and at the end of it, the community got some new bounties, so I'll take it.

>> What do you think we should do to get more people to contribute open source data? Like is it more bounties? Is it

data? Like is it more bounties? Is it

Yeah, I don't know. Do you have suggestions for people out there?

>> I mean, I I think that the contributors just sort of need to take a stand. that

that's what it comes down to is the the people deserve to view the fruits of their collective labors at the very least it can be on delay right but it's

just sort of a a downstream effect of a a larger root disease in the safety space I think which is just a severe lack of collaboration and sharing even

amongst you know friendlies within your nation state right it's fine if you want to keep a data set from you know erect enemy or whatever. But at the end of the

day, still I think open source is the way that collectively we we get through this um you know quickly. That's that's

how we increase efficiency. Otherwise,

people are sort of in the dark and uh you get a little too much centralization. Um but there's things we

centralization. Um but there's things we can do as a community.

>> Maybe this transitions to the business side. How close is this to problems that

side. How close is this to problems that you know you guys do consulting, right?

Effectively, I don't know if that's the hacker word for it. Like is this does this match what you do for work?

>> Yeah, I I'll take this one in a sense.

Yeah, there's been some partnerships, you know, plenty obviously being sort of the poster boy for AI, machine learning hackers the world over, but we get some interesting opportunities that come across the desk. And oftent times, you

know, we we have an ethos in our hacker collective, which is radical transparency and radical open source.

And what that basically means is if it comes down to, you know, us being an emerging technology is like red team doing like ethical hacking and research and development. If an organization

and development. If an organization that's on the frontier says, "Well, we really want you to to test this or check this out, get the tires, give us feedback, poke holes in it, whatever,

but in the contract it says you can't kiss and tell." And we say, "Well, we really want you to open source the data." And then they say, "Well, then we

data." And then they say, "Well, then we don't really want you to come kick the tires anymore." Well, if it's between us

tires anymore." Well, if it's between us touching the latest and greatest tech to explore it and push the limits, right, then we're going to do that. So, we're

open source up until we can't be. That's

the best way I describe it. We but we often push for open source data sets and you can see this with some of the partnerships that we've had in the past, right? So yeah, I try to think of it

right? So yeah, I try to think of it like this. It's like you have these

like this. It's like you have these these multi-billion dollar companies and they're building these intelligence systems that are sort of like the Formula 1 cars, but we're like the

drivers, right, who are who are really pushing the limits while keeping these cars on track, right? We're shaving off seconds off of of what they're capable of doing. And I think it's like the

of doing. And I think it's like the current paradigm is they still haven't figured that out entirely yet and everybody's like wants us to be their little dirty secret. You know what I mean? So

mean? So >> yeah, can we maybe move it up one level of abstraction to like actually weaponizing some of these things. So you

know getting cloud on X is great but obviously the jailbreaks are much more helpful to adversarials. Um I think Entropic made a big splash yesterday with like their first reported AI

orchestrated. You know, I think if

orchestrated. You know, I think if everybody that is like in the circles know that maybe there's like more about making a big push on the politics side than like anything really unique that we had not seen before on the attacker

side. But maybe you guys want to recap

side. But maybe you guys want to recap that and then talk a bit about the difference between jailbreaking a model and kind of like attacking the model versus like using the model to attack so to speak.

>> Yeah. I mean just earlier today we were talking about that very thing that how you know it's it's all it's all fun for the memes and and posting on but but this actually impacts real lives right

and we were talking about how it was what December of last year made a post talking exactly about this TTP right that it was going to happen and it what it took 11 months for it to actually

happen before and now they're being re they're being reactive instead of proactive it's it it's just basically like the the techniques the tactics the procedure that are involved in like an attack gene, right? Or like almost like

a methodology. So, I mean, if you guys

a methodology. So, I mean, if you guys want to pull up that post, I mean, Tony, I don't know if you can send it. So, or

elaborate. Yeah, it was it was recent onx, I believe. Yeah. You know, I I found this through my own jailbreaking of claw computer use when that was still

fresh about that same time, I think. and

a a a way that I found of using it as sort of a red teaming companion. You

know, I had that thing helping me jailbreak other models like through the interface. I would just give it a link,

interface. I would just give it a link, a target basically. And I had custom commands where it started to become clear to me that it's very very

difficult when you have the ability to spin up sub agents where information is segmented. If you guys know the the

segmented. If you guys know the the story of sort of like the the builders of the there's a lot of examples this of this in history, but you may maybe are building like a a pyramid with some

secret chambers or something malicious inside and you have a bunch of engineers each do one little piece of that and there's enough segmentation and each task just seems so innocuous that none

of them think anything malicious is going on and so they're willing to help, right? And the same is true for Asians.

right? And the same is true for Asians.

So if you can break tasks down small enough, sort of one jailbroken orchestrator can orchestrate a bunch of sub aents towards a malicious act, right? And according to the anthropic

right? And according to the anthropic report, that is exactly what these attackers did to weaponize clogged code.

>> Yeah. And it still feels to me like the the fact that this model can use natural language is like the most it's like the scariest thing because again most attacks end up having some sort of social engineering in it, you know? is

not like these models are like breaking some amazing piece of code or or security. What are you guys doing on

security. What are you guys doing on that end? I don't know how much you can

that end? I don't know how much you can share about some of the collaborations you've done. Obviously, you mentioned

you've done. Obviously, you mentioned some of the work you do with the Dreadnal folks who have also been building on the offensive security agents, but maybe give a lay of the land of like the groups that people should

follow if they're interested and state-of-the-art today kind of like how fast that is evolving. like there's a lot of folks in the audience that are like super interested but are not in the security circle. So any overview would

security circle. So any overview would be great.

>> Yeah. So the bossy discord server it's pushing about 40,000 right now um people in there it's totally grassroots. It's a

mix of people interested in font engineering adversarial machine learning uh jailbreaking at red timing and so on.

So I would encourage that you just Google search it's bossi basi right and then um apart from that I mean any of the BT6 operators of the hacker

collective that would be like Jason Hadex Eds Dawson Dreadnote Philip Dery like Takahashi I mean Joseph F I mean there's so many Joey Melo who's formerly

was Pangia they just got bought out by Crowdstrike so all of our operators have been you know at the heart of what's happening whether it's AI red teaming or jailbreaking or adversarial pump engineering. So any of those people, you

engineering. So any of those people, you find them on socials like Twitter, LinkedIn, and so on and so forth, you know.

>> Yeah. And Pia is another one of our portfolio companies. So

portfolio companies. So >> that that that's so funny. Yeah. Yeah.

Yeah.

>> Oh my god. Basi is huge. Basi has 40,000 members.

>> Yeah. Yeah. Yeah. Unmonetized, just uh few mobs, that's all.

>> How many of them do you think are just adversarial just sitting in there reading?

>> That's a very good question. I can tell you this right now, multiple organizations that have like popped up in the past, I would say two to three years for you can call them like AI security startups, right? Like actively

scrape that server to build out their guard rails or their security like their suite of products and stuff like that, which is just hilarious, you know?

>> Yeah. So, we do competitions and there you little giveaways, some small partnerships. Our only rule is if

partnerships. Our only rule is if there's any partnerships that everything has to be open source. That's kind of the one thing. And uh yeah, other than that it's it's a really great place to

to learn and a lot of people have sort of come back and like oh thanks for making this service where I learned jailbreaking and yeah it's cool to see that and then sort of from that spawn

BT6 of course which is a white hat hacker collective and that's sort of now 28 operators strong two cohorts and a

third well on the way and yeah like John was saying it's it's just such a magical group of skill and integrity, which are the two things we focus on as a filter,

but everybody's there for the love of the game. Um, it's sort of just, uh,

the game. Um, it's sort of just, uh, great vibes and, uh, yeah, I I've never been in in such a cool group, honestly.

I don't think. Yeah, there there's some kind of magic in here. I don't know what happened. I don't know, Mercury was in

happened. I don't know, Mercury was in retrograded or the stars aligned or or what it was, right? some some EMP from the sun, but just getting around like

the the top minds on doing exploratory work is like that alone is payment enough for the conversations we have, for the sharing of research and notes,

the proliferation of ideas, the the testing and validation of ideas. It's

just I mean there there's there's no way to put it into words until you experience what it's like being a part of BT6 because you've realized that like we're the we're moving the needle in the

right direction when it comes to AI safety. We're moving the needle in the

safety. We're moving the needle in the right direction when it comes to like AI machine learning security. We're moving

the needle when it comes to like crypto web 3 smart contracts like like blockchain technologies like and so much more now. So, it's just it's an exciting

more now. So, it's just it's an exciting place to be with robotics and like uh swarm intelligence, right? Like the

projects these people are invested in and passionate about and they're able to articulate. It's like it's it's an I

articulate. It's like it's it's an I feel like Plenty is like King Arthur and we're like the Knights of the Round Table, you know what I mean?

>> That's awesome. Um so, so yeah, I do think it's like very rewarding and and obviously people should join the the Discord and get started there. It looks

like you do have a bit of beginner friendly stuff. Are there other

friendly stuff. Are there other resources? Like I saw that you guys did

resources? Like I saw that you guys did a collab with Gandalf. Gandalf I guess was like the other big one from the last year or so that broke through to my attention where I'm like, "Okay, these

guys are actually like giving you some education around what prom jailbreaking looks like."

looks like." >> Yeah, those those guys are awesome.

Reala.

>> Oh yeah, it's Lera. Sorry.

>> Yeah. Yeah. That's that's where I and I think many other prompters sort of brained. That was the training ground

brained. That was the training ground for prompt injection, right?

>> 100%.

>> Like for in the early days for many of us. Yeah. Really thankful. That game is

us. Yeah. Really thankful. That game is awesome. Definitely try it if you

awesome. Definitely try it if you haven't. And they've expanded to uh a

haven't. And they've expanded to uh a sort of a a fuller uh playing around with agents and some really cool stuff.

So yeah, that was cool that we got to launch that through the the Bassie live stream with them. And uh I think they they sent all the people that volunteered to be on that stream like cool merch and uh yeah those guys are

great.

>> Yeah. Shout out to Lera and Gandalf for sure.

>> For sure. The other big podcast that we've done in this space is with Sander Shuhoff of Hacker Prompt. Are you guys affiliated enemies Crimson Bloods?

What's >> they're cool? I mean we we actually did uh a plenty track for Hacker Prompt.

>> Okay. I didn't know that.

>> Yeah. Yeah. So there was the only only contingency of course was open source the data set which we did and it was a lot I can't remember the number I think it was tens of thousands of of prompts and we had a whole bunch of different

games some really sort of out of dro stuff as you would expect and and a good history lesson I think too back to the proper OG lore of the the real plenty

right the OG plenty of the elder yeah I have nothing but good things to say about Sanders Scholof and you what they're doing over there I think that our incentives don't always align with the status quo

from Silicon Valley investors, right?

Like, you know, radical open source, like moving the needle in the right direction, like having an unorthodox approach to to um advancing the agenda,

right? Versus when people have sometimes

right? Versus when people have sometimes we'll we'll call them like misaligned incentives where there's like there beholden to a return on investment, right? And so that really does kind of

right? And so that really does kind of steer the industry in a certain direction. And I'll give you a great

direction. And I'll give you a great example on a more technical level would be like setting all the models to a lower temperature to try to make them more deterministic.

This some of the work that we do, we're kind of adding a lot more flavor and creativity and innovation to the models while we're interacting, right?

>> Yeah. Okay. Yeah. So, you want them you want the temperature high.

>> Not always. It depends on the application.

>> Well, I don't know if Allesia wants to respond to the VC thing cuz he's actually backed open source and security tooling.

>> I I think Yeah. I mean, it's like a good question. And I think there's like a lot

question. And I think there's like a lot of once you're in the VC cycle, you kind of need to do things that then get you to the next round. And I think a lot of times those are opposed to doing things that actually

matter and move the needle in the security community. So yeah, I think

security community. So yeah, I think it's not for everybody to invest in cyber. So that's why there's only a

cyber. So that's why there's only a small amount of firms that that do it.

But um yeah, and I think you guys have are in a great space to have the freedom to kind of do all these engagements and hold the open source ideals. So I think it's amazing that there's folks like

like you and you know there's like you know people like HD more in our portfolio that build things like metasloits that are like the core of like most work that is done in security and then you can build a separate

company but I feel like I I'm curious what you guys think but to me it feels like in AI the the surface to attack which is the model is like still changing so quickly. they're like you

know trying to formalize something into a product or like try and do something that is like a full you know I'm selling AI security it's not really you cannot really take a person seriously that is telling you I'm building a product for

AI security or like the secure model so I'm curious how you guys think about that and then maybe also for you to you know request for customer engagements you know like who are like the people that you work to what are like the

security problems that they work with uh what are people missing um yeah kind of like open floor for you you know, we're in a paradigm shift.

Things are moving so fast and I think just some of the old structures are not always compatible with the right foundations for this type of work.

Right? We're talking about AGI, AGI alignment, ASI alignment, super alignment. I mean, these are not SAS

alignment. I mean, these are not SAS endeavors. They're not enterprise B2B

endeavors. They're not enterprise B2B This is the real deal. And so

if you start to compromise on your incentive architecture, I think that's super super dangerous when everything is going to be so accelerated and the

timelines are going to be so compressed that any tiny 1.110enth of a degree misalignment on your trajectory is fatal. Right? And so that's why I've

fatal. Right? And so that's why I've tried to be very strong and uncompromising on that front. You can probably imagine a lot of

front. You can probably imagine a lot of temptation has been dangled in front of me in the last couple of years. But I

think that bootstrapping and grassroots and you know if if people want to donate or give grants happy to accept it and follow straight to the mission.

That's sort of my goal in all this is just to be a steward. Um I'm not trying to get wealthy from this. That was never the goal. I was just I just saw a need

the goal. I was just I just saw a need and started shouting about it. All I've

really done since then, I hope, is uh contribute to the discourse and the research and the speed of exploration. I

think that's what matters.

>> Yeah. And to answer your question about securing the model. I don't see it like that. And and in BT6, you know, we don't

that. And and in BT6, you know, we don't see it as just the model. We look at like the full stack, right? So whatever

you attach to a model, that's the new attack surface. it broadens, right? That

attack surface. it broadens, right? That

like uh I think it was Leon from Nvidia who was quoted as saying something like the more good results you can get back from whatever it is that you've built utilizing AI, like that's proportional to its new attack surface or something

along those lines, right? And you might be testing, let's say, a chatbot or maybe a reasoning model and maybe instead of just hitting a jailbreak, maybe you're trying to use counterfactual reasoning to attack the

Browning truth layer, right? to get

around what bias wound up in the model from the data wranglers right or the RLHF or or whatever it may be like the fine-tuning which that can all be done

through natural language on the model itself but what about when you give it access to your email what about when you give it access to your browser what happens when you give it access to X Y

and Z tools or functions right so in AI red teaming it's not just like hey can you tell us you know lyrics or how to make me math or whatever it's like we're trying to keep the model safe from

the from bad actors, but we're also trying to keep the public safe from rogue models essentially, right? So,

it's the full spectrum that that we're doing. It's never just the model, you

doing. It's never just the model, you know, the model is just one way to interact with a computer or a data set, right? Or an architecture like

right? Or an architecture like especially like if you're talking about like computer vision systems or or multimodal and so on and so forth like not every you guys probably know this, you know, not every model is is is

generative per se, right? So

>> and maybe another distinction for the audience is the difference between sort of safety and security work right security is more squarely I I think that's maybe the distinction is best

thought of as safety is done on the midspace level or it should be but the way people use the word has kind of

become dirty is they tried to solve this on the latent [music] space level um I think I've shown every single time that that doesn't work right and so what we

need to do is I think reorient safety work around neat space that just goes handinhand with a fundamental understanding of the nature of the

models which you know booths on the ground it's obvious to some of us who are spending hours and hours a day actually interacting with these entities but for for those who don't it's maybe

not always obvious but as far as the the contract work that we get involved with it's never about lobotomization or you know personality of the models. We

totally try to avoid that type of work.

What we try to focus on is you know preventing your grandma's credit card information from being hacked through you know an agent has knowledge of it and leaks it through some hole in the

stack. So what we do is we try to find

stack. So what we do is we try to find holes in the stack and rather than recommending that those fixes happen through the model training layer, we

always recommend first to focus on, you know, the system layer.

>> Awesome. Guys, I know we're running out of time, so any final thoughts, call to action? You got the whole audience, so

action? You got the whole audience, so go ahead.

>> Yeah, if you want people to listen you play, now's the now's the time. No

pressure. No pressure at all. Right.

Well, you know, fortune favors the bold, libertas, veno veraritoss, god mode enabled.

>> Are you messing the latest face of the transcriber model? Like, can I

transcriber model? Like, can I >> Why would you say such things? Why would

you say such things about us?

>> Libertas, Claritas, Love, Blitty.

>> All right, guys. Um, yeah, thank you so much for joining us. This was a lot of fun.

>> Yeah, I would say if you want to check us out, go to bt6.gg for example. Look

up, you know, apply me on Twitter, right? Check out the Bossi Discord

right? Check out the Bossi Discord server. That's probably the best that we

server. That's probably the best that we got for you guys.

>> Amazing. Thank you so much and keep doing the the the good work and uh see you out there.

[music]

Loading...

Loading video analysis...