Google Cloud Keynote at AI Infra Summit 2025: What's Next for the Foundations of AI
By Google Cloud Events
Summary
## Key takeaways
- **2025: Age of Inference**: Training the next-gen model is important, but where the rubber is really hitting the road is inference. 2025 is the start of the age of inference. [02:15], [02:36]
- **980 Trillion Tokens Monthly**: From April of last year to April of this year, Google saw a 50x increase in tokens processed per month, skyrocketing to 980 trillion tokens in June. That's roughly the amount of information in a novel for every single person on planet Earth, every month. [02:49], [03:25]
- **Gemini Prompt Efficiency**: A single Gemini prompt uses about a quarter watt-hour of power, 30 milligrams of carbon dioxide, and about 5 drops of water, the same amount of power as watching 9 seconds of TV. Over the last 12 months, Google drove the energy consumption of a Gemini prompt down by 33x. [05:07], [06:33]
- **Ironwood TPU for Inference**: Ironwood is the first TPU platform built specifically for the needs of inference, delivering five times the peak compute capacity and six times the high-bandwidth memory of the previous generation. It scales to 9,216 Ironwood chips in a super pod with 1.77 petabytes of high-bandwidth memory. [12:47], [14:18]
- **Toyota's 20% Faster Models**: Toyota chose Google Cloud for ease of use and speed of scaling, reducing their model creation time by 20%, making scaling four times faster with GKE, and enabling a small team to build their AI platform in half the expected time. [09:54], [10:37]
- **96% Lower Latency Inference**: GKE Inference Gateway with prefix-aware routing and disaggregated serving delivers up to 96% lower latency at peak throughput, 40% higher throughput, and 30% lower token costs. Inference Quick Start provides tested configurations based on the latest benchmarks. [22:18], [22:50]
Topics Covered
- 2025 Ushers Age of Inference
- Power Constrains AI Infrastructure Buildout
- Ironwood TPUs Built for Inference
- AI Reshapes Software Stack End-to-End
Full Transcript
All right. Hello everyone. It's really great to be here with all of you today. My name is Mark Lohmeyer and I lead the AI and computing infrastructure product team at Google. To me, what's happening in AI right now, and we're all experiencing it, feels a little bit like the early days of the internet, but what took many years back then feels like it's happening much, much faster now. We're seeing new breakthroughs almost every week. I wake up on Monday and there's something new and amazing that's been enabled. For example, has anyone played with the new image model, Nano Banana? Show of hands. If you didn't raise your hand, check it out; it's pretty awesome. According to LMArena, it's the highest-ranked image model available in the market today.
And I'll give you an example of what it can do. Here, as a starting point, is a real, slightly pixelated photo of my dog Chewy, a Schnauzer. Using Nano Banana, I can really easily envision him in new scenes, like here, where he's competing in a dog show. I think he's going to win. Looks pretty good. And because the model is also really great at multi-turn generation, I can go back and forth with it. So this is Chewy exploring new career opportunities as a vibe coder. He's got his hipster look on there. Looking good. And then finally, you can add realistic text. So this is Chewy pivoting to becoming a tech influencer, a really popular career these days. These are obviously fun, cute examples, but more broadly, models like this are incredibly valuable across so many industries and fields. So ensuring that we can deliver them fast, reliably, accurately, and, very importantly, cost-effectively at massive scale is really critical. Let's talk more about how we're doing this, both internally at Google and in how we're making it available to our cloud customers and partners.
Now, we all know training the next-gen model is really important. It's where it starts. But where the rubber is really hitting the road these days is inference. In fact, I believe that 2025 is really the start of, let's call it, the age of inference. If you think about the number of tokens that need to be generated and processed for experiences like Nano Banana, and about scaling that to millions or, in Google's case, billions of users, you can see why. In fact, at Google, from April of last year to April of this year, so 12 months, we saw a 50x increase in the number of tokens per month that we processed across all of our products and services. And then just two months later, in June, that number skyrocketed to 980 trillion tokens in a month. That's a 2x increase in two months. It's a big number. To put it in context: 980 trillion tokens is roughly the amount of information in a novel for every single person on planet Earth, every month. And we're just getting started.
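The figures above can be sanity-checked with a little arithmetic. This sketch assumes roughly 8 billion people and about 120,000 tokens per novel; neither number is from the talk.

```python
# Back-of-the-envelope check of the token figures quoted above.
# Assumptions (mine, not the speaker's): ~8 billion people on Earth,
# ~120,000 tokens in a typical novel.
monthly_tokens = 980e12            # 980 trillion tokens in June
world_population = 8e9
tokens_per_person = monthly_tokens / world_population
print(tokens_per_person)           # 122500.0 -- roughly novel-length per person

# A 50x increase over 12 months implies this average month-over-month growth:
growth = 50 ** (1 / 12)
print(round(growth, 3))            # ~1.385, i.e. roughly 38% growth per month
```

So "a novel per person per month" holds up as an order-of-magnitude claim under these assumptions.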
So to keep scaling exponentially like that, and we definitely see that happening, we in the industry, all of us working together, need to deliver the next generation of AI-optimized infrastructure specifically for inference. At the top level, from a business perspective, we need to deliver a fantastic user experience that drives business outcomes for our customers. On the technology side, we have two top-level priorities. The first is scaling rapidly with great performance, while at the same time driving down the cost to serve, the cost per transaction. For most businesses, that's what determines how many customers they can reach and how they can grow their customer base profitably. But interestingly, what's emerged not too long ago is a third and really critical constraint: power. Energy has become one of the most precious commodities we have, because the physical buildout of the infrastructure is now actually starting to be limited by power availability. I think we all see this.
At the scale we're talking about here, and thinking about growing exponentially, we need to think about power efficiency more holistically. The old way that we and many others used to do it was to look at active chip utilization and use that as a proxy for energy consumption. That really isn't adequate anymore. So three weeks ago, Google released a detailed technical paper; check it out if you're interested in more specifics. It lays out a much more comprehensive methodology for measuring power and energy consumption, looking across the entire system and all the different dimensions you see referenced here.
Using this approach, we actually found out something interesting, which is that a single Gemini prompt is substantially more efficient than what people might previously have thought. It uses about a quarter watt-hour of power, 30 milligrams of carbon dioxide, and about 5 drops of water. To put that in perspective, that's about the same amount of power as watching 9 seconds of TV. So today, one Gemini prompt equals 9 seconds of TV. I'll leave it to all of you to decide which one you think is more valuable to you. I have my vote.
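The "9 seconds of TV" comparison is easy to verify. This quick check assumes a TV drawing about 100 W, which is my assumption; the talk only gives the per-prompt energy.

```python
# Sanity-check the TV comparison above.
prompt_energy_wh = 0.25                    # watt-hours per Gemini prompt (from the talk)
tv_power_w = 100.0                         # assumed typical TV power draw
tv_seconds = prompt_energy_wh / tv_power_w * 3600
print(tv_seconds)                          # 9.0 -- matches the quoted figure
```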
As these workloads continue to scale, we need to do even more to increase that efficiency beyond where we are today, and we need to do this across the full stack. For example: new model architectures like mixture of experts and hybrid reasoning models; new software techniques, which we'll talk more about today, like disaggregated serving and speculative decoding (you're seeing that in the animation here); inference-specific custom hardware, which you're seeing come out across the industry now; and data center operations, how we're running those machines at large scale. Looking across the full stack and optimizing across all these areas can yield really great dividends. In fact, over the last 12 months at Google, we were able to drive down the energy consumption per Gemini prompt by 33x in one year. And we need to keep doing that going forward as demand continues to grow.
Given all we've learned here, what I'd like to do is share a high-level view of our approach at Google to AI infrastructure, and then we'll dive a little deeper into each of these three areas. The first principle is co-design across the entire stack: from model research in Google DeepMind and elsewhere, to the software and algorithms, to the infrastructure hardware design, across that end-to-end system. As one example of how this can play out: when we built our first TPU-based system, a little over 10 years ago now, it was really about solving some internal scaling challenges we had. But when we put that compute power into the hands of our researchers in Google DeepMind and others, it in many ways directly enabled the invention of the transformer, which, as we all know, is now the architecture powering most of what we're doing with AI and generative AI. So that's number one. Two, as we talked about before, we have a relentless focus on efficiency: data centers, power, everything we've discussed, but also the people and processes, bringing entirely new disciplines to bear, the way Google invented SRE many years ago. We need to do those same types of things in how we operate these large-scale data centers for AI. And third, an obsession with simplifying the path from a great idea, where someone has a great idea in research, to actually putting it in products and scaling it in production. How do we shorten the time to innovate, and then, once we find a winner, how do we scale it up rapidly and cost-effectively?
This is our full AI stack. I won't go through all the details, but the core point is that we make this full set of capabilities available to our cloud customers, and they can leverage them and rapidly build at whatever layer makes sense for their business and delivers the most value, whether that's through an agent at the application level, through an API at our AI platform, or through the models and data platform. They can come in wherever makes sense for their business and leverage those capabilities. But for today's discussion, all of this is powered by the infrastructure underneath, that core foundation, so let's go deeper now into that infrastructure.
At the center of our top-level vision here is an integrated supercomputing system we refer to as the AI Hypercomputer. Building up the stack from the bottom, it includes AI-optimized hardware across compute, storage, and networking. On top of that, open software, which includes frameworks, compilers, and other capabilities. And on top of that, flexible consumption models: new ways to consume this AI infrastructure commercially. What we've found is that in every single box you see here, we're having to reinvent and deliver new capabilities specifically optimized for the unique needs of AI workloads; they're so different from what came before. So we're innovating in each box, but we're also optimizing across that full stack to deliver against use cases, such as training and inferencing at scale. And finally, because we're Google Cloud, we deliver all this as Google Cloud services: reliably, scalably, with great security built in, and at the lowest possible cost.
A great customer example here is Toyota. They chose Google Cloud for a few reasons, but probably most notably ease of use and speed of scaling. This helped them reduce their model creation time by 20%. They also used GKE, or Google Kubernetes Engine, which made scaling four times faster than what they were doing before. And notably, we enabled a relatively small team within Toyota to build their AI platform in half the time they expected for a project of this level of capability. This is just one of a thousand customer examples, obviously, but I think it brings some of those benefits to the forefront. So now let's go deeper, diving into each of the individual capabilities and how they come together specifically for inference, starting with leading compute accelerator options. You heard from Ian yesterday, hopefully all of you were there for that as well, and one of our most important and closest partnerships from an ecosystem perspective is with Nvidia. At a high level, our goal at Google is to be first and best with Nvidia. We work really closely together across all of our teams, engineering, product, and beyond, to optimize every layer of the stack together. Of course, we leverage their awesome hardware, GPUs and networking, but we also integrate Google software and Nvidia software to deliver those solutions together for our customers.
In fact, over just the course of this year, we've delivered three major new Google Cloud services based on the latest Nvidia Blackwell platforms. Our G4 family is based on the RTX PRO 6000; it's great for graphics, for Omniverse, and for cost-effective inference on smaller models. Our A4 service is based on the Blackwell B200, which provides versatility for a fairly broad range of AI workloads. And on the far right, our A4X family of services is based on the Grace Blackwell GB200; this is really awesome for cutting-edge models that require the highest possible performance at the highest level of scale.
This is great, and we really appreciate our partnership with Nvidia, delivering awesome capabilities for our customers. But we also offer the choice of our industry-leading TPU platform. As I mentioned before, we've been innovating with TPUs for over a decade now, across seven generations, with many technical breakthroughs along the way; you can see some specific examples here. There really is no substitute for this level of experience and the learning you have to do as you iterate over 10 years to create these types of highly sophisticated systems. In fact, Gemini and, as I mentioned before, most of our models today are primarily trained and served on TPUs. Fast forward to today: the latest and greatest, on the far right there, is our latest generation, the Ironwood TPU platform. This is the first TPU platform that we've built specifically for the needs of inference.
So let's go deeper into Ironwood. We'll start with the chip. The Ironwood chip delivers five times the peak compute capacity and six times the high-bandwidth memory of our previous generation. That's obviously really critical for inference at scale. But let's go a little beyond the TPU chip and share some of the innovations we're doing at a systems level that really make TPUs a leading platform for AI innovation for so many customers. The first area of innovation is how we scale, and scale reliably. With Ironwood, we scaled to 9,216 Ironwood chips connected together into a single "super pod," leveraging our breakthrough inter-chip interconnect, or ICI, networking. This allows those 9,000-plus chips to all work together and access a pretty staggering 1.77 petabytes of high-bandwidth memory, with each HBM stack pushing 7.3 terabytes per second of peak bandwidth, really helping overcome data bottlenecks and other bottlenecks for some of the most demanding models out there.
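The pod-level numbers imply a per-chip HBM capacity, which is worth checking. This assumes decimal petabytes; the talk doesn't specify PB versus PiB.

```python
# Implied per-chip HBM from the super pod figures above.
pod_chips = 9216
pod_hbm_pb = 1.77                                # petabytes across the pod
hbm_per_chip_gb = pod_hbm_pb * 1e6 / pod_chips   # PB -> GB, divided per chip
print(round(hbm_per_chip_gb))                    # 192 -- ~192 GB of HBM per Ironwood chip
```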
But what happens at this scale when something fails? This brings us to reliability. At this scale, with this many individual hardware components, that many chips, never mind all the other components in the system, individual hardware component failures are a statistical certainty. It's going to happen. So we need to do all sorts of things to abstract the user away from those underlying hardware failures. One incredibly powerful capability here is our optical circuit switching, or OCS, technology. It acts as a dynamically reconfigurable fabric for the entire pod. For example, if a node fails, OCS can rapidly reconfigure that compute slice, dropping the dead node and then seamlessly restoring the workload from a checkpoint so the model can keep being trained or served. It's this level of deep, systems-level resilience that makes a super pod of this scale, the 9,000-plus chips, truly viable for mission-critical workloads and applications.
Another innovation is liquid cooling. At Google, we've been working on liquid cooling since 2014, so quite a while now. We had to do that in the early days to scale with the TPU systems that existed even back then. We're now on our fifth-generation cooling distribution unit, or CDU, and we're planning to make that spec available to the Open Compute Project later this year. To give you a rough idea of the scale: as of 2024, we had around a gigawatt of total deployed liquid-cooled capacity, which was 70 times more than any other fleet at that point in time. We created this first for TPUs, and now we're leveraging it for GPUs as well, of course.
All right, so those are a few specifics, but if we zoom out a little, you can see the full picture of a TPU super pod at work. It comes in two forms: a small one with 256 chips in a pod, or the large one I talked about before. This is interconnected with 1.2 terabytes per second of ICI networking. And by the way, you're just seeing the front of it here; there are many rows behind that coming together to deliver those 9,000-plus chips. But the scaling doesn't stop there. With our Jupiter data center network, we can scale further, across dozens of these super pods working together. And all of this hardware is optimized to work with leading frameworks, both JAX and PyTorch. Notably, on the PyTorch side, we're committing to a native PyTorch implementation on TPUs, really embracing the PyTorch community and the many workloads that have been optimized over so many years for PyTorch, making it easy to bring them to TPUs. So our work on systems and software together, if you wrap it all up, delivers a big number: 42.5 exaflops per pod. Pretty amazing.
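Dividing the pod figure by the chip count gives the implied per-chip compute. The talk doesn't state the numeric precision behind the 42.5 exaflops figure, so this is just the division.

```python
# Implied per-chip compute from the pod-level figure above.
pod_exaflops = 42.5
pod_chips = 9216
per_chip_pflops = pod_exaflops * 1000 / pod_chips   # exa -> peta, per chip
print(round(per_chip_pflops, 2))                    # 4.61 -- ~4.6 PFLOPs per chip
```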
All right. So we talked a lot about accelerators. That's obviously a critical component; in many ways, it's the heart of these AI systems. But if you think about inference, an inference solution requires more than just the accelerators to deliver the experience I talked about before. And here's where you might see some challenges. First, staying up with the state of the art is pretty hard; it seems to change every week, and it's hard for all of us to stay up to date with all of the changes. Second, there are switching costs. We all want flexibility in these systems, but today, changing frameworks or changing hardware is frankly too hard. And finally, with so many different possible deployment architectures, it's often hard to find just the right combination, and the wrong choice could cost our customers millions of dollars. So it's really important to get it right.
Obviously, we've thought a lot about how we can help address these challenges, so here's a relatively simple picture of what we see as an optimized architecture for inference. I'll go through it at a high level and then we'll double-click in. Working from left to right: an inference request comes in and goes first to GKE Inference Gateway (GKE being Google Kubernetes Engine), which can route the request, based on the request parameters and on the load, to the right pool of accelerators. Models are deployed to the GKE clusters you see in the middle. Those now have first-class support for vLLM as a model server, which enables compatibility across GPUs and TPUs. Customers can use Dynamic Workload Scheduler, a new flexible commercial consumption model we created just for the needs of AI, to make sure capacity is available in those node pools. And finally, on the far right, we leverage AI-optimized cloud storage to store the model weights, cache the data, and much more. So that's a bird's-eye view of an end-to-end system, but let's go deeper now into each of these individual technologies one by one, and we'll talk about how they can be used end to end to serve a simple agentic workflow.
Let's start with two pretty awesome new advancements from our Google Kubernetes Engine team: Inference Gateway and Inference Quick Start. So let's say a new user request comes in, and immediately we're faced with a challenge, which is that traditional load balancing was not designed for the needs of AI workloads. With AI, a single prompt can trigger one of these large, multi-step reasoning processes, and it can overwhelm one server while others potentially sit idle. This leads to unpredictable latency and inefficient use of those valuable resources. To solve this challenge, we created a first-of-its-kind inference gateway. It uses AI-aware routing and load-balancing techniques to even out utilization across those node pools, so incoming requests don't queue up and you get nicely balanced utilization across your accelerators. Here's an architectural view of that: Inference Gateway continuously monitors your model servers and analyzes metrics like pending request queue length and KV cache utilization before selecting the ideal node to route a request to.
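The metric-aware selection described above can be sketched in a few lines: score each model-server replica on its pending queue length and KV-cache utilization, then route to the least loaded. The field names and weights here are illustrative assumptions, not the actual Gateway implementation.

```python
# Minimal sketch of metric-aware routing in the spirit of GKE Inference
# Gateway. Replica fields and scoring weights are invented for illustration.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    pending_requests: int    # queue length reported by the model server
    kv_cache_util: float     # fraction of KV cache in use, 0.0 .. 1.0

def pick_replica(replicas, queue_weight=1.0, cache_weight=10.0):
    # Lower score means less loaded; weights trade off the two signals.
    def score(r):
        return queue_weight * r.pending_requests + cache_weight * r.kv_cache_util
    return min(replicas, key=score)

pool = [
    Replica("tpu-a", pending_requests=12, kv_cache_util=0.9),
    Replica("tpu-b", pending_requests=3,  kv_cache_util=0.4),
    Replica("tpu-c", pending_requests=4,  kv_cache_util=0.2),
]
print(pick_replica(pool).name)   # tpu-c: short queue and the coolest KV cache
```

A plain round-robin balancer would happily send the next long reasoning request to `tpu-a`; scoring on live inference metrics is what avoids that.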
I'm really excited today to announce that we're making the GKE Inference Gateway generally available, so it's out there to give a try if you'd like. Pretty awesome. And it comes with two new capabilities built in. The first is prefix-aware routing. For an application like multi-turn chat or document analysis, where you're reusing the same context across multiple prompts, you can now route those requests to the same accelerator pools that already have that context cached. You can see how valuable that would be.
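The core idea of prefix-aware routing can be sketched as consistent hashing on the shared prefix: requests that begin with the same context land on the same replica, so its prefill cache is reused. The helper names and the character-based prefix are my simplifications; the real Gateway works on tokens and live cache state.

```python
# Illustrative sketch of prefix-aware routing, not the Gateway's actual algorithm.
import hashlib

REPLICAS = ["pool-0", "pool-1", "pool-2"]

def route_by_prefix(prompt: str, prefix_chars: int = 32) -> str:
    # Hash the shared leading context (a crude proxy for a token prefix)
    # and map it consistently onto a replica.
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

chat = "System: you are a helpful assistant.\nUser: summarize this doc..."
# Two turns of the same conversation share a prefix, so they land on the
# same replica and can reuse its cached prefill.
print(route_by_prefix(chat) == route_by_prefix(chat + "\nUser: shorter?"))  # True
```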
The second is disaggregated serving, obviously a big topic for this week. This is a technique that separates the initial processing of a prompt, the prefill stage, from the token generation, or decode, stage. Since these two stages have pretty different resource needs, and depending on the workload might need to scale differently as well, you can now run them on separate machine pools optimized for each, and scale them independently of each other.
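The payoff of that separation is independent sizing. This toy sketch shows the idea; the loads and per-replica capacities are made-up numbers, not benchmark data.

```python
# Toy sketch of disaggregated serving: prefill and decode pools are sized
# from different signals, so they scale independently.
def required_replicas(load, per_replica_capacity):
    return -(-load // per_replica_capacity)   # ceiling division

# Prefill is compute-bound (prompt tokens/s to ingest); decode is
# memory-bandwidth bound (concurrent sequences to keep generating).
prompt_tokens_per_s = 1_200_000   # illustrative aggregate prefill load
concurrent_sequences = 9_000      # illustrative decode load

prefill_replicas = required_replicas(prompt_tokens_per_s, 150_000)
decode_replicas = required_replicas(concurrent_sequences, 500)
print(prefill_replicas, decode_replicas)   # 8 18 -- very different pool sizes
```

With a single mixed pool you'd have to provision for the worse of the two bottlenecks on every node; splitting them lets each pool track its own signal.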
These capabilities are super powerful, but I'm sure it sometimes feels a little overwhelming to be constantly told, "hey, you've got to use this new technique, that new technique, or you're going to fall behind." We struggle with this too, internally. So this brings us to Inference Quick Start. Think of Inference Quick Start as a database of tested inference configurations. All you need to do is tell it about your needs: what models you're using, what your priorities are for latency, cost, and other factors, and it will give you a set of recommendations based on best practices and the latest benchmarks that we run within Google on those various combinations. So you can deploy those best practices right out of the gate. And then, because your needs and workloads are always evolving, we also monitor the system over time, using those inference-specific performance metrics I talked about before, so you can dynamically fine-tune your deployment as the workload or the model changes and make sure you're always up to date, easily.
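The "database of tested configurations" idea reduces to a filtered, ranked lookup. Everything in this sketch (model names, accelerator names, metric values) is a made-up placeholder, not real benchmark data or the product's schema.

```python
# Sketch of the Inference Quick Start idea: pick the best benchmarked
# configuration for a model, ranked by the caller's stated priority.
CONFIGS = [
    {"model": "llama-3-70b", "accelerator": "tpu-v6e", "server": "vllm",
     "p50_latency_ms": 220, "cost_per_m_tokens": 0.90},
    {"model": "llama-3-70b", "accelerator": "gpu-b200", "server": "vllm",
     "p50_latency_ms": 180, "cost_per_m_tokens": 1.40},
]

def recommend(model: str, priority: str):
    key = "p50_latency_ms" if priority == "latency" else "cost_per_m_tokens"
    candidates = [c for c in CONFIGS if c["model"] == model]
    return min(candidates, key=lambda c: c[key])

print(recommend("llama-3-70b", "latency")["accelerator"])  # gpu-b200
print(recommend("llama-3-70b", "cost")["accelerator"])     # tpu-v6e
```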
All right, that was a little bit on the technology, but what are the benefits? They're pretty significant. Taken together, these can deliver up to 96% lower latency at peak throughput, which is great for highly interactive, responsive workloads like interactive coding agents; up to 40% higher throughput, which is great for prefix-heavy workloads like knowledge-base analysis; and all of this with up to 30% lower cost per token. We talked about how important that was before.
All right, so that's Inference Gateway; it has intelligently routed the request. Now we face another critical question: do you have the accelerators when and where you need them? Even today, getting capacity can sometimes be a struggle. You can either overprovision those resources and waste money, or underprovision and leave your research team or your customers waiting. This is where Dynamic Workload Scheduler, or DWS, comes in. It's a new commercial consumption model we created specifically for the needs of AI workloads. It complements on-demand, spot, and the other traditional consumption models, which you can still use, and it helps in two ways. First, for time-flexible jobs, we have Flex Start mode, which lets you queue up your experiment or request, and we'll run it for you as soon as the resources are available across all of Google Cloud, based on the policies you've set. And second, for critical, time-bound projects, we have something called Calendar mode. It allows you to book capacity like you'd book a hotel room: you say this many resources, in this location, for these dates, and we guarantee that those resources will be available when you need them. You won't get preempted, and you only pay for them when you're actually using them. So we're really bringing the benefits of cloud to bear here for AI.
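The "book capacity like a hotel room" shape of a Calendar mode request can be sketched as a simple reservation record. The field names here are invented for illustration; the real API and its parameters differ.

```python
# Sketch of the reservation shape behind the Calendar mode idea:
# resources + location + a guaranteed, non-preemptible date window.
from dataclasses import dataclass
from datetime import date

@dataclass
class CapacityReservation:
    accelerator: str     # hypothetical accelerator type label
    chip_count: int
    zone: str
    start: date
    end: date

    def days(self) -> int:
        return (self.end - self.start).days

res = CapacityReservation("ironwood-tpu", 256, "us-central1-a",
                          date(2025, 10, 1), date(2025, 10, 8))
print(res.days())   # 7 -- a week of guaranteed capacity, billed only when used
```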
All right, so with that in place, the next question is: which accelerator is actually right for my workload? Which GPU, among so many different GPU options, or which TPU, among so many great TPU options? Earlier this year, we announced that you can now easily run inference on TPUs because we've enabled vLLM to work on TPUs, and of course it already worked on GPUs. So it's a fantastic way to get flexibility across TPUs and GPUs and effortlessly switch between them with just a few configuration changes. And to further simplify, we've introduced something called custom compute classes. Think of these as profiles you use to tell GKE's autoscaler what types of virtual machine nodes and what consumption models to use, and in what order of preference, for your workload. For example, you could say: my top priority is GPUs on spot, but if I don't have enough of that, fall back to GPUs on demand, and if I still don't have enough, fall back to TPUs on demand. You decide what makes sense for your business, but they're all now in the same deployment, and the GKE autoscaler can honor those priorities and automatically scale based on the dynamic needs of your workloads. Pretty cool stuff.
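The fallback ordering a custom compute class expresses can be sketched as a priority list that capacity requests spill through. The availability map here is a stand-in for what the autoscaler actually observes; none of the names are the real API.

```python
# Sketch of priority-ordered fallback, in the spirit of custom compute classes.
PRIORITIES = [
    ("gpu", "spot"),         # cheapest first
    ("gpu", "on-demand"),
    ("tpu", "on-demand"),
]

def schedule(needed: int, available: dict) -> list:
    # Walk the declared preference order, taking what each tier can supply.
    placed, plan = 0, []
    for tier in PRIORITIES:
        take = min(needed - placed, available.get(tier, 0))
        if take:
            plan.append((tier, take))
            placed += take
        if placed == needed:
            break
    return plan

# Say spot GPUs are scarce: the request spills over in the declared order.
print(schedule(10, {("gpu", "spot"): 3, ("gpu", "on-demand"): 4,
                    ("tpu", "on-demand"): 8}))
# [(('gpu', 'spot'), 3), (('gpu', 'on-demand'), 4), (('tpu', 'on-demand'), 3)]
```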
Also, speaking of GPUs, we're excited to announce that we've partnered with Nvidia to bring Nvidia Dynamo to Google Cloud. This gives you another powerful, integrated option for your inference workloads. And our joint customer Baseten, which I think we saw yesterday too, is running Dynamo on Google Cloud with Nvidia GPUs with really great results.
So now let's talk about another key requirement for inference, which is keeping those hungry, super valuable accelerators well fed with data. This is a pretty common issue these days: your GPU or TPU capacity might be distributed across different regions around the world, possibly far from where your core data is actually stored, and the latency you hit there can really hurt performance. For a large model, even just loading the model weights can sometimes take tens of minutes, and when you think about the need to scale dynamically, that makes it a non-starter; you can't wait ten minutes to load new model weights. That's precisely why we're investing so heavily in optimizing our storage services for AI.
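To make that model-loading math concrete, here is a small back-of-the-envelope sketch. The model size and the slow-path throughput are illustrative assumptions; the 2.5 TB/s figure is the aggregate cached-read rate the talk cites later for Anthropic.

```python
def load_time_seconds(param_count: float, bytes_per_param: int,
                      throughput_bytes_per_s: float) -> float:
    """Estimate time to stream model weights from storage to accelerators."""
    return (param_count * bytes_per_param) / throughput_bytes_per_s

GB = 1e9
TB = 1e12

# A 70B-parameter model in bf16 (2 bytes/param) is ~140 GB of weights.
# Pulling that across regions at ~200 MB/s takes on the order of minutes:
slow = load_time_seconds(70e9, 2, 0.2 * GB)   # 700 s, i.e. >10 minutes

# Reading the same weights from a same-zone cache at an aggregate 2.5 TB/s
# finishes in well under a second:
fast = load_time_seconds(70e9, 2, 2.5 * TB)

print(f"cross-region: {slow:.0f} s, cached: {fast:.3f} s")
```

The three-orders-of-magnitude gap is what makes dynamic scaling of large models practical only when the weights are close to the accelerators.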
Earlier this year, we announced Anywhere Cache. This is a new, fully consistent read cache that works with your existing Google Cloud Storage buckets and automatically caches the data within the same zone as the accelerators. This can reduce read latencies by 96%. It can also reduce network cost, because you're no longer going across the network to access that data.
So let's make that a little more real. Imagine you're a developer using a coding assistant: you type in a request and then pause, waiting for the code suggestion to come back. That pause, even if it's just a few seconds, is friction for the developer. It hurts productivity. Now imagine that interaction is 96% faster and the suggestion appears almost instantaneously, subsecond. That's what keeping that data cached close to those accelerators looks like. And we all know that's not just a nice-to-have these days; those delays can break a user's flow, so we're really focused on making these interactions as close to real time as possible. Just ask our friends at Anthropic. As you can see here, they ran into exactly this problem, which is why they are now using Anywhere Cache to collocate data with their massive TPU clusters for their Claude models. This enabled them to retire their own complex do-it-yourself caching solution, and now they can dynamically scale reads up to 2.5 terabytes per second.
In addition, for workloads with long context windows that exceed the memory available in the GPUs, you can also take advantage of Managed Lustre. This serves as a high-performance storage resource to satisfy key-value cache requests, keeping those GPUs well saturated with ultra-low latency and millions of IOPS.
And finally, tying all of this together (we can't forget about the network) is Cloud WAN. This is our fully managed global network. Cloud WAN helps across many domains, but for AI it helps customers that need to access AI computing resources across different regions, across other clouds in multi-cloud environments, and in on-prem environments. It allows you to connect the models to the data, wherever that data is (another cloud, on-prem, anywhere) as well as connect the models to the actual users, wherever those users are distributed around the world. And because it's built on our planet-scale networking infrastructure, it can deliver a 40% improvement in application experience and 40% lower TCO compared to traditional bespoke WAN solutions.
So, wrapping up: what have we walked through here? A simple way to think about it is as a blueprint for state-of-the-art inference, one that can be tailored to your environment, your needs, and your use cases. All the technologies I've shown here (and we went pretty deep) were designed to deliver on the original requirements we started with for inference: How do you deliver a great user experience, with low latency and great performance, using leading-edge models? How do you scale that? And how do you do it with high energy efficiency and the lowest possible cost per transaction? It's the same approach that powers all of our Google AI services internally, and now you can take advantage of it to power your business, deliver real value and innovation to your customers, and scale for them.
We covered a wide range here at a high level. To learn more about how Google can help support your AI projects, or how we might partner together if you're an ecosystem partner, we have several deep-dive sessions later today and throughout the week. We'd love to see you come by the booth and catch some of the sessions you see here. And with that, I'll say thank you very much.