
Transformers.js: Building Next-Generation WebAI Applications

By Chrome for Developers

Summary

Key Takeaways

  • **Transformers.js Runs AI 100% Locally**: Transformers.js is a JavaScript library that allows you to run AI models 100% locally in the browser. Since everything runs locally, all your data is kept safe and secure, with extremely low latency and effortless scalability. [02:10], [02:17]
  • **Quantization Shrinks Models 8x**: Quantization reduces computational and memory costs by using lower-precision data types like 8-bit integers or 16-bit floating points. Models can be reduced in size up to 8x without major quality degradation. [04:34], [04:45]
  • **1.7M Monthly Users, 11M CDN Requests**: In just the last month: around 1.68 million npm downloads, 1.7 million unique monthly users of Transformers.js models, and nearly 11 million CDN requests, all up from the previous month, and over a 2x increase since last year's version 3. [08:22], [09:33]
  • **3 Lines for Sentiment Analysis**: Getting started takes only three lines of code: import the pipeline function, create an instance for sentiment analysis, and run your input text to return a result like positive with high likelihood. [10:10], [10:27]
  • **1.7B Model at 160 Tokens/Second**: A 1.7B model runs on an M4 Mac at over 160 tokens a second. Blink and you'll miss it; really amazing for real-time applications. [15:58], [16:08]
  • **Version 4 Developer Preview Announced**: Transformers.js version 4 is currently in developer preview, with even faster execution and a wider range of supported models. A release candidate will land on npm in a couple of weeks. [24:17], [24:35]

Topics Covered

  • Browser AI Secures Privacy
  • Quantization Shrinks Models 8x
  • Three Lines Launch AI Pipeline
  • 1.7B Chatbot Hits 160 Tokens/Second
  • Version 4 Preview Accelerates Everything

Full Transcript

[music] Hi everyone, my name is Joshua, and I'm a web machine learning engineer at Hugging Face. Today I'm excited to talk about how you can build next-generation web AI applications using Transformers.js.

So first, a quick introduction: what is Transformers.js, and maybe, what is Hugging Face? Well, Hugging Face is the platform where the machine learning community collaborates on models, datasets, and applications, otherwise known as Spaces.

For models, we host over 2.1 million AI models built by the community on the Hugging Face Hub. You're able to search for them, filter by various library tags or tasks, and find the model which best suits your needs.

We also host a large collection of datasets, just over half a million. Some of these are megabytes in size; some of them are petabytes in size. And you can even query them in your browser, visualizing certain data and maybe understanding the data a bit better to see how you will be training a model.

Next, we have Spaces, also known as our AI app store. Currently, to date, we have over 1 million AI apps built by the community, and we're really excited for web AI apps, as you maybe saw a couple of moments ago, to be deployed on Hugging Face. We also support semantic search across these Spaces. So if you are a user of web AI applications, or just AI applications in general, you can search for them using our semantic search: for example, 'change the lighting', and it'll show you a Space created by the community that can do your task.

We also maintain a large collection of open-source libraries. Some of these you may be familiar with, like the Transformers library, Diffusers, and Safetensors, just to name a few. But the one we'll be talking about today is Transformers.js.

So, what is Transformers.js? Well, Transformers.js is a JavaScript library that allows you to run AI models 100% locally in the browser. Since everything runs 100% locally, all your data is kept safe and secure, and we're also able to achieve extremely low latency and effortless scalability. We also take advantage of many browser APIs that are now at our disposal, like WebNN, as you just heard of, WebGPU, and WebAssembly.

Some benefits of in-browser inference include, number one, security and privacy. Let's say you're recording video of your face, or microphone input; you maybe don't want that data to be sent over to a cloud. Or maybe you're processing very sensitive documents related to your company; you want those to be parsed locally.

Next up is real-time applications. Thanks to the lack of a server, you don't need round trips between client and server when making requests. This is especially important in areas where internet connectivity is not as strong. And you also don't need to send extremely large files over the internet; everything happens 100% locally.

Then this serves to highlight benefits both for the developer as well as the user of the application. As a developer, when you're creating or showcasing a model of yours and you want to show users how to play around with it, you don't have to host it on a dedicated GPU; you can rather distribute that compute to the users. And for the user, there are no API keys being exchanged and you're not paying per token; everything runs on your device, so you pay for the compute simply by using your device. And then, thanks to the web, distribution is as simple as posting a link: someone goes to the website, and there you go, there's your AI application. There are no worries about fiddling with PyTorch and Python dependencies, or worrying about whether this model will run well on Mac or Linux or Windows. Everything is just bundled into the website; you view the website and everything is good to go.

Some steps to think about when optimizing for in-browser inference. Number one is quantization. This basically involves reducing the computational and memory costs by simply using lower-precision data types, like 8-bit integers or 16-bit floating points. We're able to reduce the size of models up to 8x without seeing major quality degradation. Of course, this changes depending on the models you're trying to run; smaller models especially may be more sensitive to quantization. So it's very model specific, but we try to provide a wide range of quantizations that may be useful, as well as choose some good defaults for you when you're using Transformers.js.
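As a rough sketch of where that 8x figure comes from (the parameter count below is illustrative, and real checkpoints add some overhead for metadata and any layers kept at higher precision), moving weights from 32-bit floats to 4-bit integers divides the weight storage by eight:

```javascript
// Approximate size of a model's weights at a given precision.
// Ignores metadata and layers kept at higher precision.
function weightsSizeMB(numParams, bitsPerWeight) {
  const bytes = numParams * (bitsPerWeight / 8);
  return bytes / (1024 * 1024);
}

const params = 1.7e9; // e.g. a 1.7B-parameter model
console.log(`fp32: ~${weightsSizeMB(params, 32).toFixed(0)} MB`); // ~6485 MB
console.log(`q4:   ~${weightsSizeMB(params, 4).toFixed(0)} MB`);  // ~811 MB
```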

The next thing is to take advantage of browser APIs like WebGPU and WebNN, to really use the native hardware that a user has, but in a highly efficient, optimized manner. And then finally, when taking a model from a Python-based ecosystem to the web, taking into account how to export the model in a way that ensures extremely high optimization levels. This may include fused kernels, custom operations, and so on. In this case, for a very simple BERT embedding model, we were able to achieve a 4x performance boost just by changing the way the model is exported.

We also benefit greatly from the versatility of JavaScript. So yes, you're able to run these models and the library in the browser, but that's not where JavaScript stops. You may know that there are various JavaScript runtimes: Node.js, Bun, Deno. WebGPU support is currently being worked on in a few of them, and in some it's already working quite well. Therefore, Transformers.js is able to take advantage of these browser APIs that are now bundled into native runtimes. We also work well with various libraries and frameworks like React, Svelte, Angular, and Vue, and with various environments where you're able to deploy your applications. So yes, you're able to create a website that maybe would use a web worker, but you're also able to use Transformers.js and these models in browser extensions, serverless with maybe Supabase Edge Functions, as well as desktop applications like Electron. And then finally, we also integrate quite well with various build tools, Vite, Webpack, etc., so you're able to bundle your application in a way that can be shipped to users. We're also working on mobile support via React Native, and we will hopefully be giving you a few more updates in the near future.

Speaking of browser support: Google Chrome and other Chromium-based browsers have extremely good WebGPU support specifically, so all the demos I'll be showing are recorded in Chromium-based browsers; they work extremely well and are able to take advantage of your hardware capabilities. Then Firefox: Transformers.js actually helps power Firefox's AI runtime, running various tasks like image classification, translation, and a wide variety of others. WebGPU and WebNN support is still experimental, but we hope to get this shipped really soon. And then finally, Safari: WebGPU actually just shipped in Safari 26, meaning you'll be able to use and run these web AI applications in your browser on macOS, iOS, iPadOS, and even visionOS, which is really exciting to see.

So, let's talk about usage and how Transformers.js has grown over time. Starting off with npm downloads: in just the last month, we hit around 1.68 million npm downloads, up 7% from the month before. Unique monthly users of Transformers.js models are also at around 1.7 million, up 12% from the previous month. And then CDN requests: for those who don't want to npm install and use various build tools, you can access the library directly with a CDN link. This is at nearly 11 million requests, which is up 13% from last month.
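For context, using the library via CDN is a single module import with no build step. The jsDelivr URL below is one common option (an assumption here; any CDN mirroring the npm package works):

```javascript
// Inside a <script type="module"> tag, import Transformers.js straight from a CDN.
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers';
```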

And maybe to see this in context: Transformers.js version 1, when we released it in March 2023, was just a little side project, nothing major; a few interested people maybe playing around with some things, very low usage in the beginning. But as we kept iterating over time, version 2 hit around 5,000 unique monthly users. Then in the past year, when version 3 released (it was actually announced at the Web AI Summit last year), we hit around 750,000 unique monthly users. And now, today, we just hit over 1.7 million unique monthly users, so it's over a 2x increase since last year. And this is all thanks to our amazing community from all over the world, building really amazing web AI applications with Transformers.js. So we just want to say a massive thank you to the community: Transformers.js would be nothing without you all. So thank you so much for building and creating and showcasing what you've built.

So how can you take and use Transformers.js in your web AI applications? Well, I hope to show that it's only as simple as three lines of code to get started. The first line is to import the pipeline function from the Transformers.js library. The second line is to create an instance of the pipeline; in this case, we'll be performing a task known as sentiment analysis. And the third line is running your input, in this case the text 'I love transformers', through the pipeline that you just created to return a result, in this case positive with high likelihood.
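Those three lines, as described, look roughly like this (the default model is downloaded and cached on first use; the exact output shape may vary by library version):

```javascript
// 1. Import the pipeline function from the library.
import { pipeline } from '@huggingface/transformers';

// 2. Create an instance of the pipeline for sentiment analysis.
const classifier = await pipeline('sentiment-analysis');

// 3. Run your input text through it.
const result = await classifier('I love transformers!');
// e.g. [{ label: 'POSITIVE', score: 0.99 }]
```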

We also support being able to specify your own custom models. So, a very similar thing to what you just saw, but now the task has changed: we're doing background removal, and we're also choosing a different model that has been created by the community for background removal. And then, in the same way, you run the model, providing an input image, in this case a link to the image, and you'll be able to remove the background 100% locally in your browser.
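In code, the custom-model variant might look like the sketch below; the task string and model id are illustrative, so substitute any compatible community background-removal model from the Hub:

```javascript
import { pipeline } from '@huggingface/transformers';

// Same pattern as before, but a different task and an explicit community model.
// NOTE: the model id here is illustrative; pick any compatible model from the Hub.
const remover = await pipeline('background-removal', 'briaai/RMBG-1.4');

// Provide a link to the image; the background is removed 100% locally.
const output = await remover('https://example.com/photo.png');
```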

We also support various loading and runtime parameters. First, loading parameters: when you create the pipeline instance in the beginning, you're able to specify the device, whether you want to run on GPU (maybe WebGPU or WebNN) or on CPU with WebAssembly, as well as the data type, or quantization; in this case, 4-bit quantization with 16-bit activations. And then, at runtime, you're able to specify various parameters, like how many tokens you want to generate, whether you want to do sampling, what's the temperature; those kinds of parameters you may be familiar with.
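Put together, the loading options (device, data type) and the runtime generation options might look like this; the model id is an illustrative assumption:

```javascript
import { pipeline } from '@huggingface/transformers';

// Loading parameters: choose the execution device and the quantization (dtype).
const generator = await pipeline('text-generation', 'onnx-community/Qwen2.5-0.5B-Instruct', {
  device: 'webgpu', // or 'wasm' to run on CPU via WebAssembly
  dtype: 'q4f16',   // 4-bit weights with 16-bit activations
});

// Runtime parameters: control the generation itself.
const output = await generator('Write a haiku about the web.', {
  max_new_tokens: 64,
  do_sample: true,
  temperature: 0.7,
});
```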

We also support a bit more advanced usage if you want to take advantage of lower-level features, maybe integrating this into your application logic, mouse movements specifically. So in this case, you're able to do something like image segmentation with Segment Anything, in a very similar way to how the Python Transformers library works. So if you're familiar with that library, we hope that the translation over to the JavaScript runtime will not be too challenging.

So, maybe taking a step back and asking yourself how it works. First, we provide a large collection of preconverted models, nearing around two and a half thousand, on the Hugging Face Hub. And if you have your own custom model, we provide various scripts and libraries that allow you to convert your PyTorch, JAX, or TensorFlow model to a unifying standard known as ONNX; ONNX stands for Open Neural Network Exchange. And then what you do is write your Transformers.js code; in this case, we're performing speech recognition using Whisper tiny, and we're running on WebGPU. Behind the scenes, we take advantage of ONNX Runtime Web, which allows you to run these models on WebAssembly, WebGPU, and even WebNN, allowing you to choose the device and then run on CPU, GPU, or NPU, depending on what the user has at their disposal.
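The Whisper example described here might be written as below; the model id is an assumption (the onnx-community organization hosts preconverted Whisper checkpoints):

```javascript
import { pipeline } from '@huggingface/transformers';

// Speech recognition with Whisper tiny, executing on the GPU via WebGPU;
// ONNX Runtime Web provides the WASM/WebGPU/WebNN backends behind the scenes.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'onnx-community/whisper-tiny.en',
  { device: 'webgpu' },
);

// Accepts a URL to an audio file, or raw audio samples.
const { text } = await transcriber('https://example.com/sample.wav');
```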

So let's see what it takes to actually build these AI-powered web applications. It all starts with the idea, and it involves asking yourself the question: what is the problem you're trying to solve, or what experience are you trying to create? Then maybe ask yourself: why do I want to run this model in the browser? What advantages can I take advantage of by running in the browser? Maybe the low latency is something that's important to you. Maybe the distribution is something that's important to you. Maybe security and privacy are important to you. Those are all benefits of running on device. And then you ask yourself: is there a task that has already been created for me that I can use to solve this problem, whether it's sentiment analysis, computing embeddings, or depth estimation? And then, once you've identified the task, find the model which best suits your use case. So if you're maybe translating between various languages, you might want to find a model that is good for French translation, for example. Or maybe the model size is important to you, and you'd rather prefer a 10-to-20-megabyte background removal model versus a couple of hundred megabytes, depending on the real-time aspect you're looking for. So if all those boxes are ticked, let's build it with Transformers.js.

We also encourage you to learn from the community and see the example applications we've put out. I've put a few links here; if you want to go visit and see what applications we've built, I highly recommend it. It's kind of showing you what's possible and then giving the power back to you, to integrate into your own workflows and take advantage of your own knowledge in your very specific domain. And we really encourage and like to see what you build with it.

Some factors to consider when building web AI applications. Of course, bandwidth: the user is going to need to download the model at least once; once downloaded, the model is cached, so you won't have to be redownloading things. So model sizes need to be taken into account. Accuracy versus speed: whether ensuring high accuracy is very important to you, or you're trying to run in real time, there are trade-offs to make. Device features: what capabilities does the user have? Whether they have access to browser APIs, WebGPU, microphone input, etc. And then target devices: what are you trying to run on? Are you trying to run on mobile? Are you trying to only run on desktop? This can all help you make decisions as to what models and what creations you're going to be working on.

So, now that we've got that out of the way, let's see what developers have been able to build so far. Starting off with the traditional chatbot experience. What I want you to take note of is specifically the speed: this is a 1.7B model running on my M4 Mac at over 160 tokens a second. Blink and you'll miss it. This is really amazing for real-time applications, as I'll show a bit later. And as models get smaller, you can pump those numbers up even higher; it's very great for low-latency experiences.

We also support various reasoning models. This is DeepSeek R1's distilled version, a 1.5B model, for reasoning in the browser. This model, when released, actually outperformed GPT-4o and Claude 3.5 Sonnet on various math benchmarks, which is really amazing to see for a model that can actually run in your browser.

Next, we have a vision language model which is able to take video input, either a video stream from camera input or maybe from your screen recording, and is able to do live captioning depending on what it sees. So in this case, it's running on a video and live captioning, as fast as it can, every frame that it captures, performing description and even able to recognize text. I'll actually be showcasing this demo outside for those who would like to see it in real time.

Then we also have this Gemma 3 270M model, and I think this is where the power of small on-device models comes into play: for your very specific task, in this case a fun little bedtime story generator, finding a model which works well for you and can run in the browser is really great to see. So in this case, we select a few options and then generate a story based on those inputs. And actually, what you'll see here (it's muted for now, but I will show it a little bit later) is that while the story is being generated, it's being spoken out to you. What this means is that you have very low latency from the time you click start to the time you actually hear something back, which is really great.

Then this is a really fun one: a tool-calling model running in the browser. Language models are notoriously bad at, maybe, mathematics. So instead of asking the model to hallucinate, we give that control to a function we've defined in JavaScript, in this case math evaluation for some input, and even random number generation: it'll call the tool and then return a random number between 1 and 1,000. And where this gets interesting is being able to hook this up to browser APIs like location and time APIs. So in this case, we request the user's location (of course, they will have to accept) and then return that back to the LLM, where it can formulate a more informed response.
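One way to sketch that tool-dispatch step in plain JavaScript (the tool names and the shape of the model's tool call are illustrative; a real integration would parse them out of the model's output):

```javascript
// A tiny registry of tools the language model is allowed to call.
const tools = {
  // Random integer between min and max, inclusive.
  random_number: ({ min, max }) =>
    Math.floor(Math.random() * (max - min + 1)) + min,

  // Very small math evaluator: supports "a op b" with + - * /.
  evaluate_math: ({ expression }) => {
    const m = expression.match(/^\s*(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*$/);
    if (!m) throw new Error(`Unsupported expression: ${expression}`);
    const [, a, op, b] = m;
    const x = parseFloat(a), y = parseFloat(b);
    return { '+': x + y, '-': x - y, '*': x * y, '/': x / y }[op];
  },
};

// Dispatch a tool call emitted by the model, and hand the result back to it.
function runToolCall({ name, args }) {
  if (!(name in tools)) throw new Error(`Unknown tool: ${name}`);
  return tools[name](args);
}

console.log(runToolCall({ name: 'evaluate_math', args: { expression: '21 * 2' } })); // 42
```

The same dispatch function could just as easily wrap browser APIs, for example a `get_location` tool calling `navigator.geolocation`.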

Next, this is something called the Semantic Galaxy, using EmbeddingGemma. What happens here is that you select a bunch of documents and click 'generate galaxy', and the model embeds these in a high-dimensional space, which we project back to 3D, allowing you to search through these documents, number one, in real time, and number two, in a more visual and interactive way. So in this case, we search for 'weather', and all the weather-related documents are displayed. You can also hop around the galaxy, and the numbers attached there indicate the similarity score. And as you can see, the semantics behind the documents are taken into account for the clusters, which is quite fun for visualizing various documents of yours.

Then Kokoro, which is a text-to-speech model. You maybe saw the demo earlier, and we'll show a live demo in a few moments. This model is kind of groundbreaking in a way, because at only 82 million parameters, you're able to produce high-quality, realistic text to speech. This one is Whisper Web, a WebGPU demo allowing you to perform real-time speech recognition in the browser using OpenAI's set of Whisper models.

And then this is a game we developed almost two years ago, maybe over two years ago, called Doodle Dash. It's based on Google's Quick, Draw! game, but in this case, what's happening is that as you're drawing live, an image classification model is running behind the scenes, detecting whether you are correctly drawing the label. It also runs on mobile, so if you want to try it out, just search for Doodle Dash and you'll be able to run it on your phone.

And then the next one, I think, is pretty interesting for educational purposes. The idea here is to show people how vision transformers work: patching up the image, then identifying and creating semantically relevant chunks for analysis. So in this case, we upload an image of a tiger, and the model progressively begins to understand what's going on. In the first layer (this is the first attention head), it doesn't really know what's going on. As you progress down the neural network, as you increase the number of layers, you start seeing a bit more structure being developed; that's the sixth layer. As you keep going, until the last layer, the model seems to understand what makes up a tiger: maybe you see the eyes are an interesting part, maybe the fur, and it correctly predicts tiger.

Then this is a new model released by Meta called DINOv3. It actually enables video tracking in your browser, which is really exciting. The tool on the right is more for visualization and educational purposes, trying to show you, when highlighting over a query position, what other features are highlighted. And as you can see in the video tracking example, you're able to select a few important key regions and then click play to see where those points move over time, which is really, really great to see. Funny enough, this model wasn't even trained for this specific task, but it is able to generalize using the rich features it produces. And then this is our real-time demo.

>> Hey there, my name is Hart. How can I help you today?

>> Uh, can you tell me a joke, please?

Sure.

Here's one for you. Why don't scientists trust atoms? Because they make up everything.

>> Now, can you tell me what the capital of France is?

>> Of course. The capital of France is Paris.

>> Now, I want you to role play as Santa, and I'll be a little boy asking you for gifts.

>> Ho ho ho. Merry Christmas, little buddy.

What can I bring you this year?

>> I was hoping uh maybe a lump of coal.

>> A lump of coal? That's a bit harsh, isn't it? How about a nice warm, cozy blanket instead?

>> Uh, that'll work, too.

>> Perfect. I'll make sure to wrap it up nicely for you.

>> There we go. So what's amazing about that demo is it's actually using and incorporating a bunch of different models: a voice activity detection model to know when you start speaking, then a speech recognition model, then a language model backbone for the quote-unquote brain, and then a text-to-speech model at the end. What we're actually working on right now is unifying that into a single model. This has been an open topic for many years now, but it'll actually lower the latency and increase performance, which we'll be excited to show in the near future when we get it working.

So, what is the latest news, what are the current plans, and what are maybe the next steps? Taking a step back: where have we gone over time? The idea in early 2023 was just to create spam detection; it was really, really simple. Version 1 released a couple of weeks later on npm, with just a few supported architectures. Version 2 involved a complete rewrite to ES modules, with 19 supported architectures. A year and a bit later, version 3 was released, introducing WebGPU and WebNN support, now with 119 supported architectures. And a year from that, today, we support around 170 architectures.

But you might be asking yourself: what's next? Well, I'm excited to announce that we are currently in developer preview for Transformers.js version 4. So if you're looking forward to even faster execution and an even wider range of supported models, we hope you try it out. We'll be releasing a release candidate in a couple of weeks on npm for you to try out, and I'm excited to see what you build with that. So with that, hopefully this talk has inspired you to maybe consider new technologies for your applications, and more specifically your web AI applications. And I'm really excited to see what you build next. Thanks so much. [applause]


[music]
