
Google Cloud Keynote at AI Infra Summit 2025: What's Next for the Foundations of AI

By Google Cloud Events

Summary

Key Takeaways

  • **2025: Age of Inference**: Training the next-gen model is important, but where the rubber is really hitting the road these days is inference; 2025 is really the start of the age of inference. [02:15], [02:36]
  • **980 Trillion Tokens Monthly**: From April of last year to April of this year, Google saw a 50x increase in tokens processed per month, and that number skyrocketed to 980 trillion tokens in June, roughly the amount of information in a novel for every single person on planet Earth, every month. [02:49], [03:25]
  • **Gemini Prompt Efficiency**: A single Gemini prompt uses about a quarter watt-hour of power, 30 milligrams of carbon dioxide, and about 5 drops of water, the same amount of power as watching 9 seconds of TV. Over the last 12 months, Google drove the energy consumption of a Gemini prompt down by 33x. [05:07], [06:33]
  • **Ironwood TPU for Inference**: Ironwood is the first TPU platform built specifically for the needs of inference, delivering five times the peak compute capacity and six times the high-bandwidth memory of the previous generation, and scaling to 9,216 chips in a super pod with 1.77 petabytes of high-bandwidth memory. [12:47], [14:18]
  • **Toyota's 20% Faster Models**: Toyota chose Google Cloud for ease of use and speed of scaling, reducing its model creation time by 20%, making scaling four times faster with GKE, and enabling a small team to build its AI platform in half the expected time. [09:54], [10:37]
  • **96% Lower Latency Inference**: GKE Inference Gateway with prefix-aware routing and disaggregated serving delivers up to 96% lower latency at peak throughput, 40% higher throughput, and 30% lower token costs. Inference Quickstart provides tested configurations based on the latest benchmarks. [22:18], [22:50]

Topics Covered

  • 2025 Ushers Age of Inference
  • Power Constrains AI Infrastructure Buildout
  • Ironwood TPUs Built for Inference
  • AI Reshapes Software Stack End-to-End

Full Transcript

All right, hello everyone. It's really great to be here with all of you today. My name is Mark Lohmeyer, and I lead the AI and computing infrastructure product team at Google. To me, what's happening in AI right now, and we're all kind of experiencing it, feels a little bit like the early days of the internet, but what took many years back then feels like it's happening much, much faster now. We're seeing new breakthroughs almost every week. I wake up on Monday and there's something new and amazing that's been enabled. For example, has anyone played with the new image model Nano Banana? Show of hands. Pretty good. If you didn't raise your hand, check it out. According to LMArena, it's the highest-ranked image model available in the market today.

I'll give you an example of what it can do. Here, as a starting point, is a real, slightly pixelated photo of my dog Chewy, a Schnauzer, and using Nano Banana I can really easily envision him in new scenes, like here where he's competing in a dog show. I think he's going to win. Looks pretty good.

And because the model is also really great at multi-turn generation, I can go back and forth with it. So this is Chewy exploring new career opportunities as a vibe coder. He's got his hipster look on there, so looking good.

And then finally, you can add realistic text. So this is Chewy pivoting to becoming a tech influencer, a really popular career these days. These are obviously fun, cute examples, but more broadly, models like this are incredibly valuable across so many industries and so many fields. So ensuring that we can deliver them fast, reliably, accurately, and, very importantly, cost-effectively at massive scale is really critical. Let's talk more about how we're doing this, both internally at Google and in how we're making it available to our cloud customers and partners.

Now, we all know training the next-gen model is really important. It's where it starts. But where the rubber is really hitting the road these days is inference. In fact, I believe that 2025 is really the start of, let's call it, the age of inference. If you think about the number of tokens that need to be generated and processed for experiences like Nano Banana, and scaling that to millions or, in Google's case, billions of users, you can see why. In fact, at Google, from April of last year to April of this year, so 12 months, we saw a 50-times increase in the number of tokens per month that we processed across all of our products and services. And then just two months later, in June, that number skyrocketed to 980 trillion tokens in a month. That's a 2x increase in two months. It's a big number. To put it in context, 980 trillion tokens is roughly the amount of information in a novel for every single person on planet Earth, every month. And we're just getting started.
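
As a quick sanity check on that "novel per person" claim, the arithmetic works out as follows (a minimal sketch; the world-population and novel-length figures are assumptions, not numbers from the talk):

```python
# Back-of-the-envelope check: 980 trillion tokens/month vs. "a novel per person".
MONTHLY_TOKENS = 980e12      # tokens processed per month, from the talk
WORLD_POPULATION = 8.1e9     # assumed world population
TOKENS_PER_NOVEL = 120_000   # assumed: a ~90k-word novel at ~1.3 tokens per word

tokens_per_person = MONTHLY_TOKENS / WORLD_POPULATION
novels_per_person = tokens_per_person / TOKENS_PER_NOVEL
print(f"{tokens_per_person:,.0f} tokens/person/month = {novels_per_person:.1f} novels")
# -> ~120,988 tokens per person per month, i.e. roughly one novel each, every month
```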

So to keep scaling exponentially like that, and we definitely see that happening, we in the industry, all of us working together, need to deliver the next generation of AI-optimized infrastructure specifically for inference. At the top level, from a business perspective, we need to deliver a fantastic user experience that drives business outcomes for our customers. On the technology side, we have two top-level priorities. The first is how we scale rapidly with great performance while at the same time driving down the cost to serve, the cost per transaction. If you think about it, for most businesses that's what determines how many customers they can reach and how they can grow their customer base profitably. But interestingly, what's emerged not too long ago is a third and really critical constraint, which is power, energy.

This has become one of the most precious commodities we have, because the physical buildout of the infrastructure is now actually starting to be limited by power availability. I think we all see this.

So at the scale we're talking about here, and thinking about growing exponentially, we need to think about power efficiency more holistically. The old way that we and many others used to do it is to look at active chip utilization and use it as a proxy for energy consumption. That really isn't adequate anymore. So three weeks ago at Google we released a detailed technical paper; check it out if you're interested in the specifics. It lays out a much more comprehensive methodology for measuring power and energy consumption, looking across the entire system and all the different dimensions you see referenced here.

Using this approach, we actually found out something interesting, which is that a single Gemini prompt is substantially more efficient than what people might previously have thought publicly. It uses about a quarter watt-hour of power, 30 milligrams of carbon dioxide, and about 5 drops of water. To put that in perspective, that's about the same amount of power as watching 9 seconds of TV. So today: one Gemini prompt, 9 seconds of TV. I'll leave it to all of you to decide which one you think is more valuable to you. I have my vote.
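
A quarter watt-hour over 9 seconds implies an appliance drawing about 100 W, which is a plausible TV, so the two figures are consistent (a sketch; the 100 W number is derived here, not quoted in the talk):

```python
# One Gemini prompt: ~0.25 Wh of energy, per Google's measurement methodology.
PROMPT_ENERGY_WH = 0.25
TV_SECONDS = 9

# E = P * t, so the power that burns 0.25 Wh in 9 seconds is P = E / t.
implied_tv_watts = PROMPT_ENERGY_WH / (TV_SECONDS / 3600)
print(f"Equivalent appliance power: {implied_tv_watts:.0f} W")  # -> 100 W
```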

As these workloads continue to scale, we need to do even more to increase that efficiency beyond where we are today, and we need to do it across the full stack. For example: new model architectures like mixture of experts and hybrid reasoning models; new software techniques, and we'll talk more about these today, like disaggregated serving or speculative decoding, which you're seeing in the animation here; inference-specific custom hardware, which you're seeing come out across the industry now; and data center operations, how we're running those machines at large scale. Looking across the full stack and optimizing across all these areas can yield really great dividends. In fact, over the last 12 months at Google, we were able to drive down the energy consumption per Gemini prompt by 33 times in one year. And we need to keep doing that going forward as the demand continues to grow.

So, given all we've learned here, what I'd like to do is share a high-level view of our approach at Google to AI infrastructure, and then we'll dive a little deeper into each of these three areas. The first principle is co-design across the entire stack: from model research at Google DeepMind and elsewhere, to the software and algorithms, to the infrastructure hardware design, across that end-to-end system. As one example of how this can play out: when we built our first TPU-based system, a little over 10 years ago now, it was really about solving some internal scaling challenges we had. But when we put that compute power into the hands of our researchers in Google DeepMind and others, in many ways it directly enabled the invention of the transformer, and as we all know, that's now the architecture powering most of what we're doing with AI and gen AI. So that's number one.

Two, as we talked about before, we have a relentless focus on efficiency: data centers, power, everything we've discussed, but also the people and processes, bringing to bear entirely new disciplines at Google. Think about the invention of SRE many years ago; we need to do those same types of things in how we operate these large-scale data centers for AI.

And third, an obsession with simplifying the path from a great idea, someone has a great idea in research, to actually putting it into products and scaling it in production. How do we shorten the time to innovate there, and then, once we find a winner, how do we scale it up rapidly and cost-effectively?

This is looking at our full AI stack here. I won't go through all the details, but the core point is that we make this full set of capabilities available to our cloud customers, and they can leverage them and rapidly build at whatever layer makes sense for their business and delivers the most value. Whether it's through an agent at the application level, through an API at our AI platform, or through models and the data platform, they can come in wherever makes sense for their business. But of course, for today's discussion, all of this is powered by that infrastructure underneath, that core foundation, so let's now go deeper into that infrastructure.

At the center of our top-level vision here is an integrated supercomputing system we refer to as the AI Hypercomputer. If you build up the stack from the bottom, it includes AI-optimized hardware across compute, storage, and networking. On top of that, open software, which includes frameworks, compilers, and other capabilities. And on top of that, flexible consumption models: new ways to consume this AI infrastructure commercially. What we've found is that in every single box you see here, we're having to reinvent and deliver new capabilities specifically optimized for the unique needs of AI workloads; they're so different from what came before. So we're innovating in each box, but we're also optimizing across the full stack to deliver against use cases, training and inferencing at scale, for example. And finally, because we're Google Cloud, we deliver all of this as Google Cloud services: reliably, scalably, with great security built in, and at the lowest possible cost.

A great customer example here is Toyota. They chose Google Cloud for a few reasons, but most notably ease of use and speed of scaling. This helped them reduce their model creation time by 20%. They also used GKE, Google Kubernetes Engine, which made scaling four times faster than what they were doing before. And, also notably, we enabled a relatively small team within Toyota to build their AI platform in half the time they expected for a project of this level of capability. This is just one of a thousand customer examples, obviously, but I think it brings some of those benefits to the forefront.

So now let's go deeper and dive into each of the individual capabilities here, and we'll talk about how they come together specifically for inference. We'll start with leading compute accelerator options. You heard from Ian yesterday, hopefully all of you caught that as well. One of our most important and closest partnerships from an ecosystem perspective is with NVIDIA, and at a high level, our goal at Google is to be first and best with NVIDIA. We work really closely together across all of our teams, engineering, product, and beyond, to optimize every layer of the stack together. Of course we leverage their awesome hardware, GPUs and networking, but we also integrate Google software and NVIDIA software to deliver those solutions for our customers together.

In fact, over just the course of this year we've delivered three major new Google Cloud services based on the latest NVIDIA Blackwell platforms. Our G4 family, based on the RTX PRO 6000, is great for graphics, for Omniverse, and for cost-effective inference with smaller models. Our A4 service, based on the Blackwell B200, provides versatility for a fairly broad range of AI workloads. And on the far right, our A4X family of services, based on the Grace Blackwell GB200, is really awesome for cutting-edge models that require the highest possible performance at the highest level of scale.

So that's great, and we really appreciate our partnership with NVIDIA, delivering awesome capabilities for our customers. But we also offer the choice of our industry-leading TPU platform. As I mentioned before, we've been innovating with TPUs for over a decade now, over seven generations, with many technical breakthroughs along the way; you can see some of the specific examples here. There really is no substitute for the level of experience and learning you accumulate as you iterate over 10 years to create these types of highly sophisticated systems. In fact, Gemini and, as I mentioned before, most of our models today are primarily trained and served on TPUs.

Fast forward to today, and the latest and greatest on the far right is our newest generation, the Ironwood TPU platform. This is the first TPU platform we've built specifically for the needs of inference.

So let's go deeper into Ironwood. We'll start with the chip. The Ironwood chip delivers five times the peak compute capacity and six times the high-bandwidth memory of our previous generation. That's obviously really critical for inference at scale. But let's go a little beyond the TPU chip and share some of the innovations we're doing at a systems level that really make TPUs a leading platform for AI innovation for so many customers.

The first area of innovation is how we scale, and scale reliably. With Ironwood, we scaled to 9,216 Ironwood chips connected together into a single super pod, leveraging our breakthrough inter-chip interconnect technology, or ICI networking. This allows those 9,000-plus chips to all work together and access a pretty staggering 1.77 petabytes of high-bandwidth memory, with each chip pushing 7.3 terabytes per second of peak HBM bandwidth, really helping overcome data bottlenecks and other bottlenecks for some of the most demanding models out there.
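
Dividing the pod totals back down to a single chip is a useful sanity check on those numbers (a sketch; the per-chip and aggregate figures below are derived from the totals in the talk):

```python
# Ironwood super pod figures from the talk.
CHIPS_PER_POD = 9_216
POD_HBM_PETABYTES = 1.77
HBM_BANDWIDTH_TBPS_PER_CHIP = 7.3

hbm_gb_per_chip = POD_HBM_PETABYTES * 1e6 / CHIPS_PER_POD  # PB -> GB
aggregate_bw_pbps = CHIPS_PER_POD * HBM_BANDWIDTH_TBPS_PER_CHIP / 1000
print(f"HBM per chip: ~{hbm_gb_per_chip:.0f} GB")                 # -> ~192 GB
print(f"Aggregate HBM bandwidth: ~{aggregate_bw_pbps:.0f} PB/s")  # -> ~67 PB/s
```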

But what happens at this scale when something fails? This brings us to reliability. If you think about it, at this scale, with this many individual hardware components, that many chips, never mind all the other components in the system, individual hardware component failures are a statistical certainty. It's going to happen. So we need to do all sorts of things to abstract the user away from those underlying hardware failures. One capability here that's incredibly powerful is our optical circuit switching technology. It acts as a dynamically reconfigurable fabric for the entire pod. So, for example, if a node fails, OCS, optical circuit switching, can rapidly reconfigure that compute slice, dropping the dead node and seamlessly restoring the workload from a checkpoint so the model can keep being trained or served. It's this level of deep, systems-level resilience that makes a super pod of this scale, the 9,000-plus chips, truly viable for mission-critical workloads and applications.

Another innovation is liquid cooling. At Google we've been working on liquid cooling since 2014, so quite a while now. We had to do that in the early days to scale the TPU systems that existed even back then. We're now on our fifth-generation cooling distribution unit, or CDU, and we're planning to make that spec available to the Open Compute Project later this year. To give you a rough idea of the scale: as of 2024, we had around a gigawatt of total deployed liquid-cooled capacity, which was 70 times more than any other fleet at that point in time. We created this first for TPUs, and now we're leveraging it for GPUs as well, of course.

All right, those are a few specifics, but if we zoom out a little, you can see the full picture of a TPU super pod at work. It comes in two forms: a small one, with 256 chips in a pod, or the large one I talked about before, interconnected with 1.2 terabytes per second of ICI networking. And by the way, you're just seeing the front of it here; think about the many rows behind it that come together to deliver those 9,000-plus chips. But the scaling doesn't stop there. With our Jupiter data center network, we can scale further, across dozens of these super pods working together.

And all of this hardware is optimized to work with leading frameworks, both JAX and PyTorch. Notably, on the PyTorch side, we're committing to a native PyTorch implementation on TPUs, really embracing the PyTorch community and the many workloads out there that have been optimized for PyTorch over so many years, making it easy to bring them to TPUs. So our work on systems and software together, if you wrap it all up, is a big number: it's delivering 42.5 exaflops per pod. Pretty amazing.
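
For scale, dividing that pod number by the chip count recovers the per-chip compute (derived here, not quoted in the talk):

```python
POD_EXAFLOPS = 42.5
CHIPS_PER_POD = 9_216

pflops_per_chip = POD_EXAFLOPS * 1_000 / CHIPS_PER_POD  # exaFLOPS -> petaFLOPS
print(f"~{pflops_per_chip:.1f} PFLOPS per Ironwood chip")  # -> ~4.6 PFLOPS
```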

All right, we've talked a lot about accelerators. That's obviously a critical component; in many ways it's the heart of these AI systems. But if you think about inference, an inference solution requires more than just the accelerators to deliver the experience I talked about before. And here's where you might see some challenges. First, staying up with the state of the art is pretty hard; it seems to change every week, and it's hard for all of us to stay up to date with all the changes. Second, there are switching costs. We all want flexibility in these systems, but today, changing frameworks or changing hardware is, frankly, too hard. And finally, with so many different possible deployment architectures, it's often hard to find just the right combination, and the wrong choice could potentially cost our customers millions of dollars. So it's really important to get it right.

Obviously, we've thought a lot about how we can help address these challenges, so here's a relatively simple picture of what we see as an optimized architecture for inference. I'll go through it at a high level and then we'll double-click in. Working from left to right: an inference request comes in and goes first to the GKE Inference Gateway (GKE is Google Kubernetes Engine), which routes the request to the right pool of accelerators based on the request parameters and the load on them. Models are deployed to the GKE clusters you see in the middle, which now have first-class support for vLLM as a model server; this enables compatibility across GPUs and TPUs. Customers can use the Dynamic Workload Scheduler, a new flexible commercial consumption model we created just for the needs of AI, to make sure capacity is available in those node pools. And finally, on the far right, we leverage AI-optimized cloud storage to store the model weights, cache the data, and more. So that's a top-level, bird's-eye view of an end-to-end system. Now let's go deeper into each of these individual technologies one by one, and we'll talk about how they can be used end to end to serve a simple agentic workflow.

Let's start with two pretty awesome new advancements from our Google Kubernetes Engine team: Inference Gateway and Inference Quickstart.

So let's say a new user request comes in, and immediately we're faced with a challenge: traditional load balancing was not designed for the needs of AI workloads. With AI, a single prompt can trigger one of these large, multi-step reasoning processes, and it can overwhelm one server while others are potentially sitting idly by. This leads to unpredictable latency and inefficient use of those valuable resources.

To solve this challenge, we created a first-of-its-kind inference gateway. It uses AI-aware routing and AI-aware load balancing techniques to even out utilization across those node pools, so incoming requests don't queue up and you get nicely balanced utilization across your accelerators. Here's an architectural view: the Inference Gateway continuously monitors your model servers and analyzes metrics like pending-request queue length or KV cache utilization before selecting the ideal node to route a request to.
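
Conceptually, that routing decision is a scoring problem over live server metrics. A minimal sketch of the idea in Python (illustrative only; the metric names and weights are assumptions, not the Inference Gateway's actual algorithm):

```python
from dataclasses import dataclass

@dataclass
class ModelServer:
    name: str
    queue_length: int            # pending requests waiting on this server
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0-1.0

def pick_server(servers: list[ModelServer]) -> ModelServer:
    """Route to the least-loaded server, blending queue depth and cache pressure."""
    def load_score(s: ModelServer) -> float:
        return s.queue_length + 10 * s.kv_cache_utilization  # weights are illustrative
    return min(servers, key=load_score)

servers = [
    ModelServer("pool-a-0", queue_length=4, kv_cache_utilization=0.92),
    ModelServer("pool-a-1", queue_length=6, kv_cache_utilization=0.35),
]
print(pick_server(servers).name)  # -> pool-a-1: less loaded once cache pressure counts
```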

I'm really excited today to announce that we're making the GKE Inference Gateway generally available. It's out there to give a try if you'd like. Pretty awesome. And it comes with two new capabilities built in. The first is prefix-aware routing. For an application like multi-turn chat or document analysis, where you're reusing the same context across multiple prompts, you can now route those requests to the same accelerator pools that already have that context cached. You can see how valuable that would be.
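
One simple way to get that affinity is to hash the shared context so repeat turns land on the pool whose KV cache is already warm. A hypothetical sketch (not the gateway's actual implementation):

```python
import hashlib

POOLS = ["accel-pool-0", "accel-pool-1", "accel-pool-2"]

def route_by_prefix(prompt: str, prefix_chars: int = 512) -> str:
    """Requests that share a prefix (same chat history, same document) hash to
    the same pool, so the KV cache already built for that prefix gets reused."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
    return POOLS[int.from_bytes(digest[:4], "big") % len(POOLS)]

history = "System: you are a helpful coding agent. " * 20  # >512 chars of shared context
assert route_by_prefix(history + "User: refactor step 2") == \
       route_by_prefix(history + "User: now add tests")
```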

The second is disaggregated serving, obviously a big topic for this week. This is a technique that separates the initial processing of a prompt, the prefill stage, from the token generation, the decode stage. Since these two stages have pretty different resource needs, and depending on the workload might need to scale differently as well, you can now run them on separate, optimized machine pools and scale them independently of each other.
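
In code terms, that split means one request makes two hops: a compute-bound prefill stage that processes the whole prompt at once, and a bandwidth-bound decode stage that generates one token at a time. A toy sketch of the control flow (illustrative; real systems hand the KV cache between pools over fast interconnect):

```python
def prefill(prompt: str) -> dict:
    """Compute-heavy: process the full prompt in one pass, producing the KV cache."""
    return {"kv_cache": f"<kv for {len(prompt)} chars>", "first_token": "The"}

def decode(state: dict, max_tokens: int) -> list[str]:
    """Bandwidth-heavy: generate tokens one at a time against the cached context."""
    return [f"tok{i}" for i in range(max_tokens)]  # stand-in for the real decode loop

# The two stages run on separately sized, separately autoscaled machine pools.
state = prefill("Explain optical circuit switching in two sentences.")  # prefill pool
tokens = decode(state, max_tokens=128)                                  # decode pool
```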

So these capabilities are super powerful. But I'm sure it sometimes feels a little overwhelming to be constantly told, hey, you've got to use this new technique or that new technique or you're going to fall behind. We struggle with this too, internally. This brings us to Inference Quickstart.

Think of Inference Quickstart as a database of tested inference configurations. All you need to do is tell it about your needs: what models you're using and what your priorities are for latency, cost, and other factors, and it will give you a set of recommendations based on best practices and the latest benchmarks that we run within Google on those various combinations. So you can deploy those best practices right out of the gate. And then, because your needs and workloads are always evolving, we also monitor the system over time, using those inference-specific performance metrics I talked about before, so you can dynamically fine-tune your deployment as the workload or the model changes and make sure you're always up to date, easily.
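
You can think of that interaction as a query against benchmark data: describe the model and your priorities, get back the best tested configuration. A hypothetical sketch (the configurations, numbers, and field names here are invented for illustration):

```python
# Invented stand-in for the "database of tested inference configurations".
BENCHMARKED_CONFIGS = [
    {"model": "gemma-27b", "accelerator": "tpu-ironwood", "server": "vllm",
     "p50_latency_ms": 180, "cost_per_million_tokens": 0.40},
    {"model": "gemma-27b", "accelerator": "gpu-b200", "server": "vllm",
     "p50_latency_ms": 120, "cost_per_million_tokens": 0.65},
]

def recommend(model: str, priority: str) -> dict:
    """Return the benchmarked config that best matches the stated priority."""
    candidates = [c for c in BENCHMARKED_CONFIGS if c["model"] == model]
    key = "p50_latency_ms" if priority == "latency" else "cost_per_million_tokens"
    return min(candidates, key=lambda c: c[key])

print(recommend("gemma-27b", priority="cost")["accelerator"])     # -> tpu-ironwood
print(recommend("gemma-27b", priority="latency")["accelerator"])  # -> gpu-b200
```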

All right, that's a little on the technology, but what are the benefits? They're pretty significant. Taken together, these can deliver up to 96% lower latency at peak throughput, which is great for highly interactive, responsive workloads like interactive coding agents; up to 40% higher throughput, which is great for prefix-heavy workloads like knowledge-base analysis; and all of this with up to 30% lower cost per token. We talked about how important that is.

All right, so that's Inference Gateway; it has intelligently routed the request. Now we face another critical question: do you have the accelerators when and where you need them? Even today, getting capacity can sometimes be a struggle. You can either overprovision resources and waste money, or underprovision and leave your research team or your customers waiting. This is where Dynamic Workload Scheduler, or DWS, comes in. It's a new commercial consumption model we created specifically for the needs of AI workloads. It complements on-demand, spot, and the other traditional consumption models, which you can still use, and it helps in two ways. First, for time-flexible jobs, we have Flex Start mode, which lets you queue up your experiment or request, and we'll run it for you as soon as those resources are available across all of Google Cloud, based on the policies you've set. And second, for critical, time-bound projects, we have something called Calendar mode. It allows you to book capacity like you'd book a hotel room: you say this many resources, in this location, for these dates, and we guarantee that those resources will be available when you need them. You won't get preempted, and you only pay for them when you're actually using them. So we're really bringing the benefits of cloud to bear here for AI.
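
The two modes map to two request shapes: Flex Start says "run this whenever capacity appears," while Calendar mode says "reserve exactly this capacity for these dates." A hypothetical sketch of what each request carries (not the actual DWS API; all names here are illustrative):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class FlexStartRequest:
    """Time-flexible: queue the job; it runs when capacity frees up, per policy."""
    accelerator: str
    count: int
    max_run_hours: int  # the job's duration is bounded; its start time is not

@dataclass
class CalendarReservation:
    """Time-bound: capacity guaranteed for the window, no preemption,
    billed only while actually in use."""
    accelerator: str
    count: int
    zone: str
    start: date
    end: date

experiment = FlexStartRequest("tpu-ironwood", count=256, max_run_hours=72)
launch_week = CalendarReservation("gpu-b200", count=64, zone="us-central1-a",
                                  start=date(2025, 10, 1), end=date(2025, 10, 8))
```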

All right, with that in place, the next question is: which accelerator is actually right for my workload? Which GPU, among so many different GPU options, or which TPU, among so many great TPU options? Earlier this year, we announced that you can now easily run inference on TPUs because we've enabled vLLM to work on TPUs, and of course it already worked on GPUs. So it's a fantastic way to get flexibility across TPUs and GPUs and effortlessly switch between them, now with just a few configuration changes.
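
vLLM's Python entry point is the same regardless of backend, which is why the switch is a configuration change rather than a code change. A minimal sketch using vLLM's real API (the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# The same serving code runs on GPUs or TPUs; which accelerator it lands on is
# decided by the node pool and environment, not by this code.
llm = LLM(model="google/gemma-2-9b-it")  # placeholder model name
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the age of inference in one sentence."], params)
print(outputs[0].outputs[0].text)
```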

And to simplify further, we've introduced something called custom compute classes. Think of these as profiles you use to tell GKE's autoscaler what types of virtual machine nodes and what consumption models to use, and in what order of preference, for your workload. For example, you could say: my top priority is GPUs on spot, but if I don't have enough of that, fall back to GPUs on demand, and if I don't have enough of that, fall back to TPUs on demand. You decide what makes sense for your business, but they're all now in the same deployment, and the GKE autoscaler can honor those priorities and automatically scale based on the dynamic needs of your workloads. Pretty cool stuff.
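
The fallback behavior itself is just priority ordering. A sketch of that logic in Python (in GKE this is declared as a compute-class resource rather than written as code; the names and capacities here are illustrative):

```python
# Illustrative priority list in the spirit of a GKE custom compute class:
# spot GPUs first, then on-demand GPUs, then on-demand TPUs.
PRIORITIES = [
    {"accelerator": "gpu-b200", "consumption": "spot"},
    {"accelerator": "gpu-b200", "consumption": "on-demand"},
    {"accelerator": "tpu-ironwood", "consumption": "on-demand"},
]

def plan_scale_up(nodes_needed: int, available: dict) -> list[dict]:
    """Walk the priority list, taking capacity where it exists until satisfied."""
    plan = []
    for rule in PRIORITIES:
        key = (rule["accelerator"], rule["consumption"])
        take = min(nodes_needed, available.get(key, 0))
        if take:
            plan.append({**rule, "count": take})
            nodes_needed -= take
        if nodes_needed == 0:
            break
    return plan

capacity = {("gpu-b200", "spot"): 4, ("gpu-b200", "on-demand"): 3,
            ("tpu-ironwood", "on-demand"): 10}
print(plan_scale_up(10, capacity))  # 4 spot GPUs, 3 on-demand GPUs, then 3 TPUs
```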

Also, speaking of GPUs, we're excited to announce that we've partnered with NVIDIA to bring NVIDIA Dynamo to Google Cloud. This gives you another powerful, integrated option for your inference workloads. And our joint customer Baseten, which I think we saw yesterday too, is running Dynamo on Google Cloud with NVIDIA GPUs, with really great results.

Now let's talk about another key requirement for inference, which is keeping those hungry, super-valuable accelerators well fed with data. This is a pretty common issue these days: with inference, your GPU or TPU capacity might be distributed across different regions around the world, possibly far from where your core data is actually stored, and that kind of latency can really hurt performance. For something like a large model, even just loading the model weights can take tens of minutes, and when you think about the need to scale dynamically, that makes it sort of a non-starter if it takes 10 minutes to load new model weights. That's precisely why we're investing so heavily in optimizing our storage services for AI.

Earlier this year, we announced Anywhere Cache. This is a new, fully consistent read cache that works with your existing Google Cloud Storage buckets and automatically caches the data within the same zone as the accelerators. This can reduce read latencies by 96%, and it can also reduce network cost, because you're no longer going across the network to access that data.

Let's make that a little more real. Imagine you're a developer using a coding assistant: you type in a request and get a pause while you wait for the code suggestion to come back. That pause, even if it's just a few seconds, is friction for the developer. It hurts productivity. Now imagine that interaction is 96% faster and the suggestion appears almost instantaneously, sub-second, let's say. That's what keeping that data cached close to the accelerators looks like. And we all know that's not just a nice-to-have these days; those delays can break a user's flow, so we're really focused on making these interactions as real-time as possible. Just ask our friends at Anthropic; as you can see here, they also ran into this problem, which is why they're now using Anywhere Cache to colocate data with their massive TPU clusters for their model, Claude. This enabled them to remove their own complex, do-it-yourself caching solution, and now they can dynamically scale reads up to 2.5 terabytes per second.

In addition, for workloads with long context windows that exceed the memory available on the GPUs, you can also take advantage of Managed Lustre. This serves as a high-performance storage resource to satisfy key-value requests, keeping those GPUs well saturated with ultra-low latency and millions of IOPS.

And finally, tying all of this together, we can't forget about the network: Cloud WAN. This is our fully managed global network. Cloud WAN helps across many domains, but in the case of AI it helps customers who need to access AI computing resources across different regions, across other clouds in multicloud environments, and in on-prem environments. It allows you to connect the models to the data, wherever that data is, some other cloud, on-prem, wherever, as well as connect the models to the actual users, wherever those users are distributed around the world. And because it's built on our planet-scale networking infrastructure, it can deliver a 40% improvement in application experience and 40% lower TCO compared to traditional bespoke WAN solutions.

So, wrapping up, what have we walked through here? Maybe a simple way to think about it is as a blueprint for state-of-the-art inference, one that can be tailored to your environment, your needs, and your use cases. All the technologies I've shown here, and we went kind of deep, have been designed to help deliver on those original requirements we started with for inference: how do you deliver a great user experience, low latency, and great performance with leading-edge models; how do you scale that; and how do you do it with high energy efficiency and the lowest possible cost per transaction? It's the same approach that powers all of our Google AI services internally, and now you can take advantage of it to power your business, deliver real value and innovation to your customers, and scale for them.

We covered a wide range here at a high level, but to learn more about how Google can help support your AI projects, or how we can partner together if you're an ecosystem partner, we have several deep-dive sessions later today and throughout the week. We'd love to see you come by the booth and check out some of the sessions you see here. And with that, I'll say thank you very much.
