New DeepSeek just did something crazy...
By Matthew Berman
Summary
## Key takeaways
- **DeepSeek OCR: Compressing Text in Images**: DeepSeek OCR introduces a novel method to represent text within images, achieving a 10x compression ratio while maintaining 97% accuracy. This technique could significantly enhance the capabilities of large language models. [00:49], [00:59]
- **Context Window Bottleneck in LLMs**: A major limitation for large language models like Gemini and ChatGPT is their context window size, which restricts the amount of information they can process. Scaling this window quadratically increases compute costs, making it inefficient. [01:07], [01:43]
- **Vision Language Models for Text Compression**: DeepSeek OCR utilizes a Vision Language Model (VLM) that processes images of text by breaking them into patches, analyzing local details with SAM, compressing them, and then reconstructing the text using DeepSeek 3B. [02:43], [03:33]
- **Potential for Image-Based LLM Inputs**: Andrej Karpathy suggests that all LLM inputs, even pure text, could potentially be rendered as images. This approach could lead to greater information compression, shorter context windows, and more efficient processing of diverse inputs. [06:14], [06:34]
- **Training Data for DeepSeek OCR**: The DeepSeek OCR model was trained on approximately 30 million pages of diverse PDF data across 100 languages, with Chinese and English comprising the majority of the dataset. [08:04], [08:14]
Topics Covered
- How DeepSeek OCR radically compresses text for LLMs.
- Can image-based input solve LLM context limits?
- Unleashing 10x LLM Context Windows Today.
- Should LLMs abandon text tokens for image inputs?
- How DeepSeek trained its 100-language OCR model.
Full Transcript
DeepSeek just did it again. They just dropped a new paper and model, DeepSeek OCR. OCR is basically recognizing text in images.
But why is that a big deal? Image
recognition has been around forever,
right? Well, they discovered something
completely novel that has the potential
to make language models, text-based
models so much more powerful. Let me
show you. This is the new paper from
DeepSeek. Now, like I said, image
recognition has been around for a long
time. It's nothing special. We've seen
it. It's been done a million times. But
here is what makes DeepSeek OCR very special. There's this saying: a picture is worth a thousand words. And that is the key to DeepSeek OCR. DeepSeek has
figured out a way to represent text in
an image. And this has allowed them to
compress text by 10 times while still
maintaining 97% accuracy. If this
doesn't make sense yet, don't worry. I'm
going to break it all down for you. So,
right now, a big bottleneck in large
language models, the text models you're
used to, Gemini, ChatGPT, the big
bottleneck is the context window. That
is how many words or how many tokens you
can actually fit in your prompt. And
it's a little bit more complicated than
that, but that's the gist of it. And the
context window is where you give the
model all of the information it needs to
produce the best possible output. And
the bottleneck occurs because as you
scale up the context window, the compute
cost associated with scaling up the
context window increases quadratically.
And that means that it increases very
quickly. And so adding one more token to
the context window really means
significantly more compute, especially
as you get further and further in
increasing that context window. But what
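To make that concrete, here is a tiny back-of-the-envelope sketch in Python. It is purely illustrative (real attention implementations are far more involved); the only point is the n-squared shape of the cost.

```python
# Illustrative only: self-attention compares every token with every other token,
# so the raw number of pairwise comparisons grows with the square of the context length.
for n in [1_000, 10_000, 100_000, 1_000_000]:
    pairs = n * n  # the attention score matrix has n x n entries
    print(f"{n:>9,} tokens -> {pairs:>22,} attention pairs")
# Going from 100k to 1M tokens is 10x more tokens but ~100x more attention work.
```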
But what if it didn't have to be that way? And
what if we were able to get 10 times as
much context into the context window
without actually changing anything? That
would be huge, right? Well, that is what
DeepSeek is proposing with this paper.
What they figured out is that with an
image, you can represent 10 times as
much text as it takes to represent the
image on a per token basis. Listen to
this. A single image containing document
text can represent rich information
using substantially fewer tokens than
the equivalent digital text, suggesting
that optical compression through vision
tokens could achieve much higher
compression ratios. And so they present
DeepSeek OCR, a VLM, or vision language
model designed as a preliminary proof of
concept for efficient vision text
compression. Okay, so here's how it
works. On the left, we see the input and
this is actually an image of text. And
in this example, it looks like a PDF,
but you could basically take an image of
text. And yes, you can actually stuff a
bunch of text into the image and you can
actually make the text in the image very
small. Now, there is an upper limit on
how small that text can be before you
start to get noise and the visual
resolution essentially becomes
impossible to read. Then that image is
taken and split into 16x16 patches, little squares from the actual image. Then it uses a few different techniques in the main engine. First is SAM, an 80 million parameter model that looks for local details like the shapes of letters and other fine details in the image. Then it downsamples, basically just continues to compress, taking all of those patches and smooshing them down into something much smaller. Then we have CLIP, a 300 million parameter model that basically stores all the information about how to put these different pieces together, what page it is, basically putting it all together for us. Then in the output we have DeepSeek 3B, a 3 billion parameter mixture-of-experts model with 570 million active parameters, and it decodes it. It takes the image and converts it back into text. And with that we have a very efficient way to compress text down to an image, and then we get 10 times the amount of text, actual text, that we can fit in the same token budget.
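To make that flow a bit more concrete, here is a shape-only sketch in Python. It is not DeepSeek's code; the 1024x1024 page size, the 16-pixel patches, and the 16x downsampling factor are assumptions used purely to show the token arithmetic.

```python
# A shape-only sketch of the encode step described above (not the real DeepSeek OCR code).
# Page resolution, patch size, and the downsampling factor are assumed values.
def count_vision_tokens(image_hw=(1024, 1024), patch=16, downsample_per_side=4):
    h, w = image_hw
    # Step 1: the SAM-style local encoder sees the page as patch x patch squares.
    local_patches = (h // patch) * (w // patch)        # 64 * 64 = 4096 patches
    # Step 2: a compressor shrinks the patch grid (~16x fewer tokens here)
    # before the CLIP-style global encoder and the decoder ever see it.
    stride = patch * downsample_per_side
    vision_tokens = (h // stride) * (w // stride)      # 16 * 16 = 256 vision tokens
    return local_patches, vision_tokens

local_patches, vision_tokens = count_vision_tokens()
print(f"{local_patches} local patches -> {vision_tokens} compressed vision tokens")
# If that page held roughly 2,500 text tokens' worth of words, decoding it back
# from 256 vision tokens is about the 10x compression being described.
```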
Imagine we have a Gemini model, which has a million or two
million tokens which by the way is kind
of the largest we've seen out there and
really insane to think about. Then all
of a sudden we can give it 10x. So all
of a sudden it has 10 million or 20 million tokens that it's able to work with, with just a slight increase in latency because of
the conversion from text to image and
image to text. So according to the
paper, our method achieves 96% plus OCR
decoding precision at 9 to 10x text
compression, 90% at 10 to 12x
compression, and 60% at 20x. So yes, as the compression increases, the accuracy definitely decreases.
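As a quick sanity check on what those numbers buy you, here is a small back-of-the-envelope using the compression/precision pairs just quoted. The 1M vision-token budget is an arbitrary stand-in for a large context window, not a number from the paper.

```python
# Rough arithmetic only: how much text a fixed vision-token budget could stand in for
# at the compression ratios and decoding precisions quoted above.
reported = {10: 0.96, 12: 0.90, 20: 0.60}  # compression ratio -> approx. decoding precision
vision_token_budget = 1_000_000            # assumed budget, e.g. a 1M-token context window

for ratio, precision in reported.items():
    effective_text_tokens = vision_token_budget * ratio
    print(f"{ratio:>2}x compression: ~{effective_text_tokens:,} text tokens "
          f"of content at ~{precision:.0%} decoding precision")
```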
And a lot of people have been reacting to this paper.
But before I show you that, you can
actually run this model locally on hardware from the sponsor of today's video. And a quick
thanks to Dell Technologies for
sponsoring this portion of the video.
Dell Technologies has a family of
incredible laptops called the Dell Pro
Max featuring Nvidia RTX Pro Blackwell
chips, which are portable AI workhorses.
They come in 14 and 16 inch screen sizes with up to 32 GB of GPU memory. Perfect for on-the-go AI workloads. Check them out.
Link in the description below. Let me
show you a few of those reactions. Now,
so first, Andrej Karpathy. Of course, I quite like the new DeepSeek OCR paper.
It's a good OCR model. So, again, it's
just kind of a good model to begin with.
It's good at recognizing what an image
is about. And the more interesting part
to me is whether pixels are better
inputs to LLMs than text, whether text
tokens are wasteful and just terrible at
the input. So, imagine if we started
switching all input, even for text-based
models, to being images. He continues,
"Maybe it makes more sense that all
inputs to LLMs should only ever be
images. Even if you happen to have pure
text input, maybe you'd prefer to render
it and then feed that in."
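As a toy version of that render-it-and-feed-it-in idea, here are a few lines using Pillow (my choice of library, not anything from the paper or Karpathy) that rasterize a plain string into an image a vision encoder could take as input.

```python
# Rasterize a text string into pixels with Pillow (pip install Pillow).
# A real pipeline would hand this image to a vision encoder instead of a tokenizer.
from PIL import Image, ImageDraw

text = "Even plain text input could be rendered as pixels before the model sees it."
img = Image.new("RGB", (900, 40), "white")               # small blank canvas
ImageDraw.Draw(img).text((10, 12), text, fill="black")   # draw the text with the default font
img.save("rendered_input.png")
print(img.size, "->", "rendered_input.png")
```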
And here's what you get for doing that: more
information compression, shorter context
windows, more efficiency, significantly
more general information stream, not
just text, but bold text, colored text,
arbitrary images. Input can now be
processed with bidirectional attention easily as default, not autoregressive
attention. A lot more powerful. Delete
the tokenizer at the input. I already
ranted about how much I dislike the
tokenizer. The tokenizer is the thing that takes words and converts them into tokens. And remember, a token is really just about 3/4 of a word. It's a little bit more sophisticated than that, but you can think of it that way.
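If you want to see that in action, here is a small example using OpenAI's tiktoken library as a stand-in (an assumption on my part; it is just a convenient, widely used tokenizer, not the specific one being discussed).

```python
# Show how a tokenizer splits text into token ids, and why "a token is roughly
# 3/4 of a word" is a reasonable rule of thumb for English.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "DeepSeek OCR compresses text by rendering it as an image."
token_ids = enc.encode(text)
print(token_ids)                                  # integer ids
print([enc.decode([t]) for t in token_ids])       # the text piece behind each id
print(f"{len(text.split())} words -> {len(token_ids)} tokens")
```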
And of course, he finishes with, "Now I also
have to fight the urge to side quest an
image-input-only version of nanochat." nanochat is his small language model that
he created just a couple weeks ago. And
yeah, I think it would be really cool to
essentially 10x the context window
simply by having an image version of it.
And I think Brian Roemmele really
highlighted why this is so impressive.
Listen to what he says you can do with
it: an entire encyclopedia compressed into a single high-resolution image. So
the efficiency gains are incredible.
All right, so how did they actually
train this model? Well, in very DeepSeek style, they revealed everything in
the research paper. We collect 30
million pages of diverse PDF data
covering about 100 languages from the
internet with Chinese and English
accounting for approximately 25 million
and other languages accounting for 5
million. So here's an example. Here's a
ground truth image and then on the right
they provide the fine annotations with
layouts. So this really seems like a
critical breakthrough in compressing the
amount of information that can fit into
a context window. This really opens up a
whole new set of use cases and I am so
excited to see this technology
proliferate through models that we know
and love today. If you enjoyed this
video, please consider giving a like and
subscribe.