New DeepSeek just did something crazy...
By Matthew Berman
Summary
## Key takeaways
- **DeepSeek OCR: Compressing Text in Images**: DeepSeek OCR introduces a novel method to represent text within images, achieving a 10x compression ratio while maintaining 97% accuracy. This technique could significantly enhance the capabilities of large language models. [00:49], [00:59]
- **Context Window Bottleneck in LLMs**: A major limitation for large language models like Gemini and ChatGPT is their context window size, which restricts the amount of information they can process. Scaling this window quadratically increases compute costs, making it inefficient. [01:07], [01:43]
- **Vision Language Models for Text Compression**: DeepSeek OCR utilizes a Vision Language Model (VLM) that processes images of text by breaking them into patches, analyzing local details with SAM, compressing them, and then reconstructing the text using DeepSeek 3B. [02:43], [03:33]
- **Potential for Image-Based LLM Inputs**: Andrej Karpathy suggests that all LLM inputs, even pure text, could potentially be rendered as images. This approach could lead to greater information compression, shorter context windows, and more efficient processing of diverse inputs. [06:14], [06:34]
- **Training Data for DeepSeek OCR**: The DeepSeek OCR model was trained on approximately 30 million pages of diverse PDF data across 100 languages, with Chinese and English comprising the majority of the dataset. [08:04], [08:14]
Topics Covered
- How DeepSeek OCR radically compresses text for LLMs.
- Can image-based input solve LLM context limits?
- Unleashing 10x LLM Context Windows Today.
- Should LLMs abandon text tokens for image inputs?
- How DeepSeek trained its 100-language OCR model.
Full Transcript
DeepSeek just did it again. They just dropped a new paper and model, DeepSeek OCR. OCR is basically recognizing text in images.
But why is that a big deal? Image
recognition has been around forever,
right? Well, they discovered something
completely novel that has the potential
to make language models, text-based
models so much more powerful. Let me
show you. This is the new paper from
DeepSeek. Now, like I said, image
recognition has been around for a long
time. It's nothing special. We've seen
it. It's been done a million times. But
here is what makes DeepSeek OCR very special. There's this saying: a picture is worth a thousand words. And that is the key to DeepSeek OCR. DeepSeek has
figured out a way to represent text in
an image. And this has allowed them to
compress text by 10 times while still
maintaining 97% accuracy. If this
doesn't make sense yet, don't worry. I'm
going to break it all down for you. So,
right now, a big bottleneck in large
language models, the text models you're
used to, Gemini, ChatGPT, the big
bottleneck is the context window. That
is how many words or how many tokens you
can actually fit in your prompt. And
it's a little bit more complicated than
that, but that's the gist of it. And the
context window is where you give the
model all of the information it needs to
produce the best possible output. And
the bottleneck occurs because as you
scale up the context window, the compute
cost associated with scaling up the
context window increases quadratically.
And that means that it increases very
quickly. And so adding one more token to
the context window really means
significantly more compute, especially
as you get further and further in
increasing that context window. But what
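To make that concrete, here is a tiny back-of-the-envelope sketch in Python. It is purely illustrative (real attention implementations are far more involved); the only point is the n-squared shape of the cost.

```python
# Illustrative only: self-attention compares every token with every other token,
# so the raw number of pairwise comparisons grows with the square of the context length.
for n in [1_000, 10_000, 100_000, 1_000_000]:
    pairs = n * n  # the attention score matrix has n x n entries
    print(f"{n:>9,} tokens -> {pairs:>22,} attention pairs")
# Going from 100k to 1M tokens is 10x more tokens but ~100x more attention work.
```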
But what if it didn't have to be that way? And
what if we were able to get 10 times as
much context into the context window
without actually changing anything? That
would be huge, right? Well, that is what
DeepSeek is proposing with this paper.
What they figured out is that with an
image, you can represent 10 times as
much text as it takes to represent the
image on a per token basis. Listen to
this. A single image containing document
text can represent rich information
using substantially fewer tokens than
the equivalent digital text, suggesting
that optical compression through vision
tokens could achieve much higher
compression ratios. And so they present
DeepSeek OCR, a VLM, or vision language
model designed as a preliminary proof of
concept for efficient vision text
compression. Okay, so here's how it
works. On the left, we see the input and
this is actually an image of text. And
in this example, it looks like a PDF,
but you could basically take an image of
text. And yes, you can actually stuff a
bunch of text into the image and you can
actually make the text in the image very
small. Now, there is an upper limit on
how small that text can be before you
start to get noise and the visual
resolution essentially becomes
impossible to read. Then that image is
taken and split into 16x16 patches, little squares from the actual image. Then it uses a few different techniques in the main engine. First is SAM, an 80 million parameter model that looks for local details like the shapes of letters and other fine details in the image. Then it downsamples, basically just continues to compress, taking all of those patches and smooshing them down into something much smaller. Then we have CLIP, a 300 million parameter model that basically stores all the information about how to put these different pieces together, what page it is, basically putting it all together for us. Then in the output we have DeepSeek 3B, a 3 billion parameter mixture-of-experts model with 570 million active parameters, and it decodes it. It takes the image and converts it back into text. And with that we have a very efficient way to compress text down to an image, and then we get 10 times the amount of text, actual text, that we can fit in the same token budget.
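To make that flow a bit more concrete, here is a shape-only sketch in Python. It is not DeepSeek's code; the 1024x1024 page size, the 16-pixel patches, and the 16x downsampling factor are assumptions used purely to show the token arithmetic.

```python
# A shape-only sketch of the encode step described above (not the real DeepSeek OCR code).
# Page resolution, patch size, and the downsampling factor are assumed values.
def count_vision_tokens(image_hw=(1024, 1024), patch=16, downsample_per_side=4):
    h, w = image_hw
    # Step 1: the SAM-style local encoder sees the page as patch x patch squares.
    local_patches = (h // patch) * (w // patch)        # 64 * 64 = 4096 patches
    # Step 2: a compressor shrinks the patch grid (~16x fewer tokens here)
    # before the CLIP-style global encoder and the decoder ever see it.
    stride = patch * downsample_per_side
    vision_tokens = (h // stride) * (w // stride)      # 16 * 16 = 256 vision tokens
    return local_patches, vision_tokens

local_patches, vision_tokens = count_vision_tokens()
print(f"{local_patches} local patches -> {vision_tokens} compressed vision tokens")
# If that page held roughly 2,500 text tokens' worth of words, decoding it back
# from 256 vision tokens is about the 10x compression being described.
```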
Imagine we have a Gemini model, which has a million or two
million tokens which by the way is kind
of the largest we've seen out there and
really insane to think about. Then all
of a sudden we can give it 10x. So all
of a sudden it has 10 million or 20 million tokens that it's able to work with, with just a slight increase in latency because of
the conversion from text to image and
image to text. So according to the
paper, our method achieves 96% plus OCR
decoding precision at 9 to 10x text
compression, 90% at 10 to 12x
compression, and 60% at 20x. So yes, as the compression increases, the accuracy definitely decreases.
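As a quick sanity check on what those numbers buy you, here is a small back-of-the-envelope using the compression/precision pairs just quoted. The 1M vision-token budget is an arbitrary stand-in for a large context window, not a number from the paper.

```python
# Rough arithmetic only: how much text a fixed vision-token budget could stand in for
# at the compression ratios and decoding precisions quoted above.
reported = {10: 0.96, 12: 0.90, 20: 0.60}  # compression ratio -> approx. decoding precision
vision_token_budget = 1_000_000            # assumed budget, e.g. a 1M-token context window

for ratio, precision in reported.items():
    effective_text_tokens = vision_token_budget * ratio
    print(f"{ratio:>2}x compression: ~{effective_text_tokens:,} text tokens "
          f"of content at ~{precision:.0%} decoding precision")
```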
And a lot of people have been reacting to this paper.
But before I show you that, you can
actually run this model locally on hardware from the sponsor of today's video. And a quick
thanks to Dell Technologies for
sponsoring this portion of the video.
Dell Technologies has a family of
incredible laptops called the Dell Pro
Max featuring Nvidia RTX Pro Blackwell
chips, which are portable AI workhorses.
They come in 14 and 16 inch screen sizes with up to 32 GB of GPU memory. Perfect for on-the-go AI workloads. Check them out.
Link in the description below. Let me
show you a few of those reactions. Now,
so first, Andrej Karpathy. Of course, I quite like the new DeepSeek OCR paper.
It's a good OCR model. So, again, it's
just kind of a good model to begin with.
It's good at recognizing what an image
is about. And the more interesting part
to me is whether pixels are better
inputs to LLMs than text, whether text
tokens are wasteful and just terrible at
the input. So, imagine if we started
switching all input, even for text-based
models, to being images. He continues,
"Maybe it makes more sense that all
inputs to LLMs should only ever be
images. Even if you happen to have pure
text input, maybe you'd prefer to render
it and then feed that in."
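As a toy version of that render-it-and-feed-it-in idea, here are a few lines using Pillow (my choice of library, not anything from the paper or Karpathy) that rasterize a plain string into an image a vision encoder could take as input.

```python
# Rasterize a text string into pixels with Pillow (pip install Pillow).
# A real pipeline would hand this image to a vision encoder instead of a tokenizer.
from PIL import Image, ImageDraw

text = "Even plain text input could be rendered as pixels before the model sees it."
img = Image.new("RGB", (900, 40), "white")               # small blank canvas
ImageDraw.Draw(img).text((10, 12), text, fill="black")   # draw the text with the default font
img.save("rendered_input.png")
print(img.size, "->", "rendered_input.png")
```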
And here's what you get for doing that: more
information compression, shorter context
windows, more efficiency, significantly
more general information stream, not
just text, but bold text, colored text,
arbitrary images. Input can now be
processed with bidirectional attention easily as default, not autoregressive
attention. A lot more powerful. Delete
the tokenizer at the input. I already
ranted about how much I dislike the
tokenizer. The tokenizer is the thing that takes words and converts them into tokens. And remember, a token is really just about 3/4 of a word. It's a little bit more sophisticated than that, but you can think of it that way.
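If you want to see that in action, here is a small example using OpenAI's tiktoken library as a stand-in (an assumption on my part; it is just a convenient, widely used tokenizer, not the specific one being discussed).

```python
# Show how a tokenizer splits text into token ids, and why "a token is roughly
# 3/4 of a word" is a reasonable rule of thumb for English.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "DeepSeek OCR compresses text by rendering it as an image."
token_ids = enc.encode(text)
print(token_ids)                                  # integer ids
print([enc.decode([t]) for t in token_ids])       # the text piece behind each id
print(f"{len(text.split())} words -> {len(token_ids)} tokens")
```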
And of course, he finishes with, "Now I also
have to fight the urge to side quest an
image-input-only version of nanochat." nanochat is his small language model that
he created just a couple weeks ago. And
yeah, I think it would be really cool to
essentially 10x the context window
simply by having an image version of it.
And I think Brian Roemmele really
highlighted why this is so impressive.
Listen to what he says you can do with
it: an entire encyclopedia compressed into a single high-resolution image. So
the efficiency gains are incredible.
All right, so how did they actually
train this model? Well, in very DeepSeek style, they revealed everything in
the research paper. We collect 30
million pages of diverse PDF data
covering about 100 languages from the
internet with Chinese and English
accounting for approximately 25 million
and other languages accounting for 5
million. So here's an example. Here's a
ground truth image and then on the right
they provide the fine annotations with
layouts. So this really seems like a
critical breakthrough in compressing the
amount of information that can fit into
a context window. This really opens up a
whole new set of use cases and I am so
excited to see this technology
proliferate through models that we know
and love today. If you enjoyed this
video, please consider giving a like and
subscribe.