
My Claude Code Can INSTANTLY Watch Any Video (Here's How)

By Brad | AI & Automation

Summary

Topics Covered

  • Half of Video Content Lives on Screen
  • You're Not Watching—You're Downloading
  • Frames + Transcript = Free Video Intelligence
  • Build Your Second Brain from Competitor Videos

Full Transcript

When you give Claude Code the ability to instantly watch any video on the internet for free, it becomes genuinely unstoppable. With this Claude skill, Claude can understand video as well as it reads PDFs: hours-long YouTube videos, Instagram Reels, Looms, local files, anything. Before, Claude was just guessing. Now it can watch the whole thing frame by frame, instantly. It's like Neo plugging into the Matrix. By the time you've hit play, Claude has already watched the whole thing and become an expert.

I've tried a bunch of transcript tools before developing this one, and they all let me down. They either cost way too much, or they only ever read the transcript and missed half the video. This skill gives Claude the frames and the audio together, so it actually sees what's happening on screen. Right now, I'll walk you through exactly how it all works, the use case that completely changed how I consume content, and how to set this up in your own Claude Code in under 5 minutes.

Here's what it actually looks like: a 45-minute video done in less than a few minutes. On the left, I have a YC lecture from Sam Altman about how to start a startup. I'm going to press play on that now and then grab the URL. All I have to do is go over to Claude, type /watch, and paste the URL. Then Claude gets to work: grabbing the subtitles from YouTube for free, extracting the frames, and analyzing them all together.

The reason this is better than just pulling the transcript is that Claude can actually grab the frames from the video. In this lecture, Sam shows a bunch of really great graphs, and that's important context for Claude, because if you're only getting the transcript, you're only getting half of the information. Here's where most of the existing video tools fall short: they base everything around the transcript. When something happens on screen and it isn't explicitly referenced in the transcript, Claude doesn't know about it and you miss out on key context. That matters, because half of the interesting stuff in a video isn't said out loud. It happens on screen.

So this skill actually watches. It pulls frame-by-frame screenshots and puts them together with a per-second timestamped transcript to give Claude the full picture and full context. And just like that, we're only 2 minutes into the lecture, Sam is still introducing what he's going to talk about today, and Claude has already ingested the entire thing. I have a structured summary of all of the speakers, I can see exactly what they talk about, and now I can query Claude on anything in this context and start to put it to work instantly, right here in the terminal. That's a 45-minute video done in less than 2 minutes: watched, analyzed, and applied. That's the Matrix moment. You're not watching content anymore. You're downloading context automatically and putting it to work straight away.

You're probably thinking there's some expensive API doing the heavy lifting here. There isn't. But before we get into that, let's get into the setup. By the way, I'm giving this whole skill away for free on GitHub. The link is in the description below. Just run the install commands and the setup takes care of the rest. Once the skill is installed, Claude runs the setup script and installs any dependencies that you don't have already. It authenticates with the transcription API. Don't worry, this one is pretty much free, and we'll get to it in a second. Under the hood, the pipeline is actually surprisingly simple. Now, here's the part that nobody really talks about.

Claude can't actually watch video, because Anthropic doesn't have a video model yet. There are some other providers that can, like Google's Gemini models, but they're pretty expensive and they don't integrate nicely with Claude. So if you're watching a lot of content, that bill stacks up pretty fast.

Luckily, there's a smarter way to do this, because if you really break it down, a video is just two things: a bunch of frames and a transcript. That's it. So instead of paying for another expensive model, I can split the video into those two pieces and hand it to Claude in a format it already knows how to read: pictures and text.

Now, this is the part I love, because the skill does this with two of the oldest, most battle-tested command-line tools on the internet: yt-dlp and FFmpeg. These aren't MCPs. They're not some new wrapper. There's no third-party service involved in the middle here.

They install once right on your machine.
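As a rough sketch of what that install-once check could look like, a setup script (this is my illustration, not the skill's actual code) only needs to verify that the two binaries are on the PATH:

```python
import shutil

def missing_dependencies(tools=("yt-dlp", "ffmpeg")):
    """Return the subset of required CLI tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# An empty list means both tools are installed and the skill is ready to run.
missing = missing_dependencies()
```

Anything that comes back in `missing` is what the setup step would go install.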

Millions of developers have used them for over a decade now. They're rock solid and completely free, and they're what every video tool you've ever touched is probably using under the hood. yt-dlp is the downloader. You can think of it as right-click, save video, but it works on basically the whole internet. FFmpeg is the video engine. It takes the video and turns it into the two things that Claude actually wants.

First, screenshots, which are taken every few seconds all the way through the video. And second, the audio, which is pulled out as a clean little file ready to be transcribed into text using Whisper. Now Claude has the full picture. When we put these two together, it's flipping through the screenshots like a flip book and reading the transcript like a script, and the timestamps line up exactly, so it knows what's on screen when something is being said.

So that's the whole pipeline: yt-dlp and FFmpeg doing all the heavy lifting locally on your laptop, for free. The only things we actually have to pay for here are the transcription and Claude usage, and transcription is pretty much free. The skill just pulls the captions, and if it can't, it transcribes the audio using Whisper hosted on Groq or OpenAI. I prefer Groq because it's extremely fast and their free tier covers basically anything you throw at it. So most videos cost you literally nothing to transcribe. I even used this exact skill to grow a universal context layer for content research, and I'll show you exactly how it works in a minute.

I can literally hear the keyboards clattering right now: "Brad, this is going to torch your token budget." But this actually surprised me, so let's do the math. The skill scales frame count to video length, and it caps anything over 30 minutes at 100 frames.
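To make that math concrete, here's a minimal sketch of the scaling logic and the yt-dlp/FFmpeg commands it drives. The function names, the one-frame-per-18-seconds default (chosen so a 30-minute video lands exactly on 100 frames), and the output paths are my assumptions, not the skill's actual internals:

```python
MAX_FRAMES = 100          # cap from the video: anything over 30 minutes tops out here
DEFAULT_INTERVAL = 18     # assumed default: one frame every 18 s (30 min -> 100 frames)

def frame_interval(duration_s: float) -> float:
    """Seconds between screenshots, capping long videos at MAX_FRAMES frames."""
    if duration_s / DEFAULT_INTERVAL <= MAX_FRAMES:
        return DEFAULT_INTERVAL
    return duration_s / MAX_FRAMES

def pipeline_commands(url: str, duration_s: float):
    """Build the commands: yt-dlp downloads; FFmpeg extracts frames and audio."""
    interval = frame_interval(duration_s)
    download = ["yt-dlp", "-o", "video.mp4", url]
    frames = ["ffmpeg", "-i", "video.mp4",
              "-vf", f"fps=1/{interval:g}",       # one screenshot every `interval` s
              "frames/%04d.jpg"]
    audio = ["ffmpeg", "-i", "video.mp4",
             "-vn", "-ac", "1",                   # drop video, mono audio for Whisper
             "audio.m4a"]
    return download, frames, audio
```

Each list can then be run with `subprocess.run(cmd, check=True)`. Note how a 1-hour video just stretches the interval to 36 seconds instead of adding frames, which is why the cost stays flat.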

So a 30-minute video and a 1-hour video cost pretty much the same in dollar terms, about $1 per run. I ran every test in this video three times in parallel and burned less than 10% of my session, and that's over 5 hours of video watched live by Claude, with transcription.

The transcription part is where it gets ridiculous. Every YouTube video comes with a free transcript, and the skill just pulls it. There's no Whisper, no API call. It's totally free. And that goes for a bunch of other sites, too. Whisper only kicks in for the stuff without captions, like a raw MP4, a Loom, or an Instagram Reel. Groq's free tier actually gives you 2 hours of transcription per hour, which covers more than you'll realistically throw at it. I've used this skill every day for 2 weeks, and I'm still on the free tier.
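For that fallback case, the Whisper call is a single HTTPS request to Groq's OpenAI-compatible audio endpoint. This sketch only assembles the request parameters; the endpoint path and model name match Groq's documentation as I understand it, but treat the details as assumptions and check their docs:

```python
import os

GROQ_TRANSCRIBE_URL = "https://api.groq.com/openai/v1/audio/transcriptions"

def transcription_request(audio_path: str, model: str = "whisper-large-v3"):
    """Describe the multipart POST the skill would send to Groq's Whisper endpoint."""
    return {
        "url": GROQ_TRANSCRIBE_URL,
        "headers": {"Authorization": f"Bearer {os.environ.get('GROQ_API_KEY', '')}"},
        "files": {"file": audio_path},                # the audio FFmpeg pulled out
        "data": {"model": model,
                 "response_format": "verbose_json"},  # includes segment timestamps
    }
```

These keys mirror what you'd pass to `requests.post`; Groq also works with the official OpenAI SDK by pointing `base_url` at `https://api.groq.com/openai/v1`. The timestamped segments in the response are what get lined up against the frames.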

It's crazy. Look, I'm not saying this is perfect, and there are probably optimizations I haven't thought of, but for most people watching, this is essentially free. If you've got ideas to make it cheaper or quicker, drop them in the comments below. Once I realized this was basically free, I started running it on everything, which is how I ended up building the system I'll show you at the end, and it's one that's genuinely changed how I consume content.

Here's the part that actually makes this skill a must-have. It works on any URL yt-dlp supports, which is over a thousand sites. So this isn't just limited to YouTube or the big social media companies, and it even works if you have the files downloaded locally. That opens up a bunch of use cases you probably wouldn't expect.

Here's what I'm doing for content research. I take a winning video from the internet and ask Claude to break down the hook. Claude tells me the visual setup, the exact words, where the pattern interrupt lands, and what's on screen at the moment of the hook. Stuff that used to take me 10 minutes per video of pausing and scrubbing is now just a paste.

And for developers, there's another use case: debugging screen recordings. If a UI bug shows up, you record a 30-second screen recording, drop it into Claude, and ask what happens right before the crash.

Claude reads the frames around that moment, finds the state change, and tells you the exact frame the issue starts on. That alone has saved me hours. The skill also has a zoom flag with a start time and an end time, so you can drop those in and focus the frame-by-frame extraction on a specific window of the video. You can ask about a 10-second segment of a 2-hour video without burning your entire context window.
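A zoom window like that maps naturally onto FFmpeg's seek options. Here's a hypothetical sketch; the skill-side flag names and frame count are my guesses, though the FFmpeg options themselves are real:

```python
def zoom_frames_cmd(video: str, start: float, end: float, n_frames: int = 10):
    """Extract n_frames evenly spaced screenshots from just the [start, end] window."""
    window = end - start
    interval = window / n_frames
    return ["ffmpeg",
            "-ss", str(start),                 # seek to the start of the window
            "-i", video,
            "-t", f"{window:g}",               # read only the window's duration
            "-vf", f"fps=1/{interval:g}",      # dense sampling inside the window
            "zoom/%03d.jpg"]
```

Because only the window gets sampled, a 10-second slice of a 2-hour video produces a handful of frames instead of a hundred.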

Whatever you're using video for, you can probably stop watching it manually because of this skill. Earlier, I told you that once you start using this thing, it seriously starts to change how you consume content. Now I want to show you my personal favorite use case for it: feeding my second brain. I keep a knowledge base in Obsidian with notes, snippets, and ideas for content. The bottleneck for me has always been throughput, because there's just so much good content out there from creators at the moment. There's not enough time to watch it all and write it all down.

So I let Claude do both. I give it every single competitor that I think makes great content, and from there, Claude uses the watch skill to automatically watch their videos and feed them straight into my second brain. Claude watches each video, frames, audio, everything, comes back with clean, structured notes about what made the video work, and builds that straight into the second brain. And this is where things start to compound, because the skill and your second brain are watching more and more videos, picking up more and more context, and getting better and better over time, getting smarter automatically.

The second brain side of this whole thing is a video on its own, and I walk through exactly how I run mine, content research, competitor intel, every podcast and video I've ever listened to, all in one searchable layer in Obsidian. If that's where you want to take this, that's the next video to watch. It's linked up here. If this was useful, hit subscribe. Thanks for watching, and I'll see you in the next one.
