
How I Test With Claude Code (AI TDD)

By Owain Lewis


Topics Covered

  • Tests Are an AI Agent Superpower
  • TDD Prevents AI Agents from Over-Engineering
  • AI Agents Still Need Human Test Review
  • Test Against Production Infrastructure, Not Mocks
  • Browser Testing Closes the Verification Gap

Full Transcript

If you're using AI agents to write code for anything important, there is one thing that matters more than anything else: testing. Tests used to take a long time to write, but with AI agents, that's no longer true. Without tests, agents have no way of knowing if the code they wrote actually works. More importantly, they have no way to verify whether the code they wrote broke anything else in your codebase. So in this video, I'll show you exactly how I test code when working with AI agents. We'll look at what to test and what to skip, and we'll also build a feature live together using test-driven development and Claude Code, so that you can see the entire workflow end to end. I'll give away all of the code, the resources, and the prompts free in the description below. So let's get into it.

Testing is one of those topics that's really fundamental to writing good software. If you don't have any tests, you don't really have a way of verifying that your software actually works correctly. Think about any kind of software: there are so many things that could go wrong, so many different bits of logic and different rules. You can't manually test all of those things; it would be unreasonable, and there are just too many edge cases. So we need some kind of automated testing system to guarantee that our code is working correctly. When we're working with AI agents, this is particularly important, because agents can write code that looks correct at first glance but then breaks in ways you don't expect. If you have a set of tests, the agent can verify that everything is still working correctly.
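As a sketch, that verify-and-fix loop looks something like the following. This is a hypothetical harness, not Claude Code's actual internals, and `pytest_runner` assumes `pytest` is on your PATH:

```python
import subprocess

def pytest_runner() -> tuple[bool, str]:
    """Run the real test suite and report (passed, output)."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout

def agent_loop(write_code, fix_code, run_tests, max_iterations: int = 5) -> bool:
    """Write code, then run the tests and fix failures until green."""
    write_code()
    for _ in range(max_iterations):
        passed, output = run_tests()
        if passed:
            return True       # green: the agent has verified its work
        fix_code(output)      # red: feed the failure output back to the agent
    return False              # still failing: needs human attention
```

Without a `run_tests` step, the loop has no signal, and the agent is back to eyeballing the code.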

It gives the agent some feedback to work off. Here's a diagram you can take a look at: the agent writes code, runs the tests, sees that something is broken, and then fixes it. That's the feedback loop for the agent. But if you don't have any tests, the agent writes code and doesn't know if it has broken anything.

It has no way to check other than just looking at the code and asking, does it look correct? It has no way to verify that the code actually works. There are two testing approaches that you commonly see. Test-after is where we write the code first and then write tests to verify that everything works. The other style is test-first, or test-driven development: we write the test first, the test describes what the code should do, and then we write the code to make the test pass.
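As a minimal illustration of that cycle (using a made-up `slugify` helper, not code from the video):

```python
# RED: write the test first. Run it before the implementation
# exists and it fails, which proves the test can actually fail.
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Senior Python Developer") == "senior-python-developer"

# GREEN: the minimal implementation that makes the test pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

# REFACTOR: with the test green, the code can be cleaned up or
# extended safely, because the test now guards against regressions.
```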

This is quite useful with AI agents in particular. We can write a test and run it to make sure it fails. Then we can ask the agent to implement the minimal amount of code to make that test pass. And then we can tidy up the code, refactor it, and improve it. This works really well with AI agents because you give them something very specific to focus on, and it can prevent over-engineering or writing unnecessary code. You might be wondering how you go about doing testing with agents, and there are a couple of ways. The first, simple way is just to say "use red-green-refactor" or "use test-driven development to implement the code." That will work; most agents already know how to do this. But you can also find a library like this one.

This is a great library, which I'll link in the description below. It's a test-driven development skill, and you can see that it basically tells an agent to write a test before writing the code. A codebase with good tests is an AI agent superpower; a codebase without tests is a liability. The skill I'm going to be using in the demo is very similar, essentially the same thing: build a feature by writing the tests first. The first step is to write a failing test. Then we implement the logic, and then we simplify or improve the code. And finally, we run the full test suite to verify that everything is still working. Okay, so let's see how this works in action.

We're going to do a quick demo: adding a new feature to an existing application. The application we're going to be working on is this recruitment application, which I'm currently building. It's a lead-generation application for recruitment agencies: it scans the local area, searches hundreds of job boards, and finds all of the companies currently hiring nearby, which essentially allows recruiters to find more business. But right now, if I publish this to the internet, it will be visible to anyone, so we need to add some kind of login system. The application is made up of two parts: a back end written in Python, and a front end built with Next.js and TypeScript. A pretty standard web application setup.

I already have a specification here. We're going to add session-based authentication to this application. A user should be able to log in with an email and a password, and log out at any time. Sessions will be stored in the database, and they will expire after 2 hours. We also have a set of protected routes: the API endpoints should be protected, and any unauthenticated API request should get a 401 response. There's lots of stuff here that we could potentially test. Notably, the application must protect against cross-site request forgery; that's exactly the kind of security requirement you do want to make sure you've got a test for.

So there's plenty to work with, and I've broken it down into four specific phases so we can demonstrate how this would work in a real application. In phase one, we'll write unit tests while implementing the session service: straightforward, simple Python code that's easy to unit-test. In phase two, we'll add the API endpoints and do some integration tests. And finally, right at the end in phase four, we'll do full browser-based testing using a tool called Playwright.

I already have some prompts prepared, so let's move into Claude Code and start working on this. We'll start with phase one, using the TDD skill. There's no magic here: it's just a prompt that says to use red-green-refactor and test-driven development, with a couple of bits of guidance about what to test and what not to test. So we read the spec, we read the plan, and we implement phase one, which is the auth service. We also give a bit of guidance about what to focus on in the tests: expiry logic, token generation, and password verification. And we encourage the agent to think about what could go wrong, which is really useful for making sure we actually test everything that could fail in this implementation. So let's go ahead and give this to Claude Code.

All right, we're going to paste this into Claude Code and run it. Just before we do, let's quickly check what tests we already have. I'm currently in the back-end folder, and I'll run the Python tests. We currently have 11 tests, so not too many; a couple of passing tests is our starting point. Now let's kick off phase one: we invoke the TDD skill, build the tests first, wait for them to fail, and then implement the code needed to make them pass. What I like about this approach is that you can already see Claude has thought about the potential things that could go wrong: session expiry, token uniqueness. Claude is thinking about the edge cases in our code that might otherwise go unconsidered, which is why this is such a great approach. You can see that Claude is now writing the initial tests for the auth service. These are plain unit tests using the standard Pytest library.

We're testing password hashing, session token generation, and session expiry. What Claude should be doing now is running those tests and seeing that they fail, and indeed they all fail at import: the tests are currently failing, as expected.
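Tests along these lines might look like the following. This is a self-contained sketch with stand-in implementations, since the video doesn't show the real test file in full; the actual service's names and hashing scheme will differ:

```python
import hashlib
import os
import secrets

# Stand-ins for the auth service under test, so this sketch runs
# on its own. PBKDF2 is one common password-hashing choice.
def hash_password(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def generate_session_token() -> str:
    return secrets.token_urlsafe(32)

# Pytest-style tests for the edge cases mentioned above.
def test_correct_password_verifies():
    salt = os.urandom(16)
    stored = hash_password("hunter2", salt)
    assert hash_password("hunter2", salt) == stored

def test_wrong_password_rejected():
    salt = os.urandom(16)
    stored = hash_password("hunter2", salt)
    assert hash_password("wrong-password", salt) != stored

def test_session_tokens_are_unique():
    tokens = {generate_session_token() for _ in range(100)}
    assert len(tokens) == 100  # no collisions in a small sample
```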

Then we implement the minimal code and run the tests again to make sure everything passes. You can see we've already caught a library compatibility issue, which Claude is now fixing. And now we're on to step four, which is simplifying the code. Interestingly, this turns out to be quite a straightforward implementation, 29 lines, so we don't really need to do anything in terms of simplification. Running the full test suite, we now have 25 passing tests. Everything looks good.

Now we can go and check out the implementation. We have the auth service implemented, and we also have our tests, so phase one is complete. Claude followed the test-driven development cycle, and we now have not only working code but also a set of tests we can rely on going forward to ensure our application code is correct.

Just one more thing before we move on, and it's a really important point worth thinking about. One thing I've found when doing test-driven development with any of these libraries from GitHub is that some of the tests Claude writes don't make sense. What I mean by that is we might get additional tests that are testing library functionality, or things like that. So what I've found useful is to review the tests before moving on to the next stage. I'm going to say, "Review the current tests. Do they all make sense? Is there anything to fix?"

This is just making sure that the tests we wrote are necessary and cover all of the edge cases. I've often found that the first time through, we have working tests, but some of them are redundant, or we simply have too many. So let's take a look: you can see we've already identified some problems. "Test token contains no user data" tests the wrong thing. That's kind of interesting. Even if you're following one of these frameworks off GitHub, even if you're downloading the Superpowers framework or whatever it is you're using, it's important to remember that these agents make mistakes, so it's always worth reviewing the tests as well as the code. Okay, this is a lot better; we've caught some issues, and I'm happy with that. And again, I think it goes to demonstrate that it doesn't matter what framework you use or how good your prompt is: if you pull a test-driven development skill down from GitHub, whether that's Superpowers or some other flavor-of-the-month project, these agents still make mistakes.

You'll often find mistakes when you go and review the tests, so this is always a useful step. Okay, now that we've done that, let's move on to phase two, where we add integration tests and build out the API endpoints. I'll clear the session and run the second prompt to implement phase two. We should go through the same cycle as before: implement the tests, watch them fail, and then implement the code. As usual, Claude reads up on the project and gathers the context it needs, then starts the TDD cycle. I can already see a potential issue with the plan: it looks like we're using SQLite as a dependency, while this application currently uses PostgreSQL. Even though the to-dos here make sense, we'll probably want to correct this afterwards, because we want our tests to use the same database as our application; we're more likely to catch real issues that way. So that's something we'll need to fix afterwards.

You can see we're adding these new additional tests; now we're testing the endpoints. Unlike the unit tests before, these are true integration tests that call the actual API endpoints: we make real HTTP requests and verify that the status codes we get back are correct. Now we run the tests to make sure they all pass.

What I really like about this methodology is how repeatable it is, and how consistent the quality is when you use this test-driven development flow and make sure you have tests as you go. It's a little more effort to follow this process, but the result is much higher-quality code that is much easier to maintain over time. This approach is more work up front, but it's well worth it.

All right, this all looks good. We have a couple of warnings here that we might want to fix, but otherwise everything passes. Now, the current code is using SQLite, but our production database is PostgreSQL, so: can you switch the tests to use PostgreSQL, please? And can you also review the tests we wrote and see if there are any improvements we need to make? Again, we're just making some corrections, because on the first cycle through we made some choices that need fixing, and we're also verifying that the tests we wrote are correct.

All right, you can see Claude going through and making these changes: fixing the issue where the tests use the wrong database, and it has also found a bunch of issues with the tests themselves. Again, it just goes to show that the second review is always worth doing; even with the perfect prompt, you'll often find problems on a second pass through the tests. Kind of interesting: when we switched from SQLite back to PostgreSQL, we immediately caught a real bug. This is exactly why it makes sense to test against the same database you're using in production. A really useful idea.

All right, this looks good: we now have 50 passing tests. We'll do the same cycle again and quickly review the tests. We made some changes, and what's really nice is that even just asking the simple prompt "review the tests and see how you could improve them" reduced the test run time from 4 seconds to 2.6 seconds. So we've improved both the speed and the quality of our tests with that one step.

Here are the tests we wrote in phase three. You can see we have a bunch of tests ensuring that when a user tries to access the endpoints without being authorized, they get a 401, which is exactly what we want. Again, I'll clear out the session, and then we'll start phase four: implement the login page and auth redirect, using Playwright end-to-end tests.

This is an interesting phase, because we're now adding browser-based testing: end-to-end tests using a real browser. For reference, the library we're going to use for this is called Playwright. There are other options available, but this is a solid one.

Playwright lets you do browser-based testing, and it works well with AI agents too, which makes it a great library for end-to-end tests. In phase four we're looking at the end-to-end scenarios: for example, what happens after you log in, and where you get redirected to. Imagine having to test these kinds of things manually, or leaving the AI agent to check them ad hoc; in practice you just wouldn't, because there are too many things to verify. That's why automated tests are so important: they give you a programmatic way to ensure that all the behavior you expect actually works. Without them, the agent would have no easy way to verify any of this, especially once you've exited the current session. Okay, it looks like phase four is finished: the browser-based testing is done, and the login screen is working correctly.

Let's just double-check that this works. These are the credentials for our test user from the database, and it all looks correct. We now have a logout button, and none of this data is accessible unless we're logged in: if we log out and try to access any of those pages, we get redirected. So it looks like everything is done. We have our browser-based tests, our unit tests, and our integration tests, and we used test-driven development to build this entire feature. Now that we've finished, let's do one more quick check to make sure all of the tests are passing. All right, perfect: every test passes. From now on, anytime we change any other part of the codebase, we have a good test suite that will catch any regressions or issues we introduce. The next time we break anything, our agents will be able to self-correct and fix it. This is all looking good, and everything is done.

A final reflection: testing is a critical part of developing high-quality software. Now that we have AI agents, there's no reason not to write tests, because it's so quick and easy to add them to your codebase. Tests give your agents something concrete to validate against, and a feedback loop that is really important for ensuring they don't break things as they make changes to your code.

If you have any questions about the video, let me know in the comments below. I'd love to hear them.

Thank you for watching. If you enjoyed the video, remember to like and subscribe.

If you're interested in these topics, I also run an AI engineering community; I'll link it in the comments below.

Thanks again for watching. I hope you enjoyed the video, and I'll see you in the next one.
