How to Detect Bad Patterns in Logs and Traces

By Dynatrace

Summary

Topics Covered

  • Group Database Queries by Text Instantly
  • Classify SQL Types Over Time
  • Pinpoint CPU-Hotspot Internal Methods
  • Uncover Logs per Request per Endpoint
  • Metricize Frequent Queries via OpenPipeline

Full Transcript

Hi. I spent quite some time over the last weeks with our Dynatrace users, showing them some tips and tricks around analyzing traces and logs in context to answer questions like: do we have inefficient database queries, like the N+1 query problem?

Or do we have excessive logging, meaning: are we creating too many logs in the context of a request?

Or identifying problematic exceptions, or identifying CPU-intensive methods.

So what I have created here is a Notebook on our Playground tenant where I'm sharing some of these best practices and tips and tricks that I've shared with the community recently when I was on the road.

There's also an introduction to what Distributed Traces are.

Credit goes to Christoph Neumueller who did a great video with me.

You can also see the full video called Unleashing the Power of Distributed Traces.

Now, in that notebook, if you scroll a little further down, I always explain how you can actually analyze certain things.

And first I try to show it to you in our out-of-the-box apps: the Distributed Traces App or the Logs App.

And then, a little bit further down, you always get to see a couple of examples of the DQL queries behind the scenes.

Right. So actually let me quickly do this.

I want to start with the Distributed Traces App.

Let's just quickly go over a couple of tips and tricks.

There's also the distributed traces video that I did. The App Spotlight! But very quickly.

And this is super helpful, at least for me.

When you're opening up the app, you always get to see the analysis based on either requests or spans.

So you can actually switch between those.

Remember, requests are all the spans that happen within a service.

Spans are the individual spans, kind of the individual nodes that represent the method call, the database call, and so on and so forth.

So we can switch between these two.

Now the first use case I want to show you is the one on database.

So if you are collapsing all the facets here, you can see the groupings: Core, HTTP, Kubernetes, Databases.

And if I go to Database I get to see all of the metadata that Dynatrace has found on the spans that have come in.

And the query text is an interesting one.

And if I expand query text, I immediately get an overview of the query text metadata values we have seen on these spans.

Now I could select one.

That's easy.

That means I can just say: Show me where this database statement is called!

Or, what I really like: the three dots here, with "Group by".

And then you can immediately group by and see an analysis of all the database calls that have been executed in all the spans that came in over the last couple of minutes.

You can obviously sort here by execution time.

You can sort it by count.

So you actually get to see how often they've been executed.

And you can do multiple groupings!

So, for instance, you might also want to see the service name, if you want to see which service actually called this database statement.

Now you get the overview of the database query and the service where it came in.

So this means it seems here we have some SELECT statements where the same SELECT statement was called by multiple services.

I can also do it the other way around, by changing that grouping order.

So if I unselect DB Query and then select it again it will end up in the second position, which means now I get the analysis of show me all of the services and then show me which database queries these services were executing.

Right. It is as simple as this.

It crunches over all of the data and gives me the overview: service and then the call. Now, some of them are empty.

Now what does that mean?

This means it actually also went through all of the spans that didn't have a database query.

So in this case, I can also do one more thing.

I can do a filter.

I can click on one of those and see where that value exists.

So that means you can now see it added this to the filter on the top.

And I now really only look at spans that have a query text.

And now I can see the journey service is making all of these database calls.

This is the duration.

How often it's executed and so on and so forth.

Right.

So pretty cool. You can do this with databases.
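As a rough DQL sketch of what this grouping does behind the scenes (the attribute names `db.statement`, `service.name`, and `duration` are assumptions based on common OpenTelemetry conventions; check the facet names in your own environment):

```dql
// Group all database spans by query text and calling service.
// Only keep spans that actually carry a query text.
fetch spans
| filter isNotNull(db.statement)
| summarize count = count(), avgDuration = avg(duration),
    by: {db.statement, service.name}
| sort count desc
```

Sorting by `count` surfaces the most frequently executed statements; sorting by `avgDuration` instead surfaces the slowest ones.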

You can also do the same thing with exceptions, with messaging, with remote procedure calls.

So check this out.

Okay.

Now let's go quickly back to the notebook that I had open earlier.

It's the tips and tricks one.

If you scroll further down, as I mentioned here, I also show you a couple of examples that I found very useful.

For instance: give me the number of database calls over time, but grouped by the type.

So INSERT, SELECT, UPDATE, DELETE and so on and so forth.

So what I did here is a fetch spans.

So all the spans where we have a database query, meaning where the database query is not null.

Then parse the database query, because we want to get the statement type.

That is the first part of the SQL statement.

It always starts with INSERT, SELECT, UPDATE ...

And then I want to make a time series based on the count.

And then I just do a group by basically, and because some SELECT statements might be written in lowercase, some in uppercase, I just use the upper() statement.

So now I get all of these statements here, right?
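A sketch of this query in DQL might look as follows (the `db.statement` field name and the DPL parse pattern are assumptions; adjust them to the attributes actually on your spans):

```dql
// Number of database calls over time, grouped by statement type.
fetch spans
| filter isNotNull(db.statement)
// Take the first word of the SQL text (everything up to the first space)
| parse db.statement, "LD:sqlType ' '"
// Normalize casing so 'select' and 'SELECT' land in the same group
| fieldsAdd sqlType = upper(sqlType)
| makeTimeseries count(), by: {sqlType}
```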

There are also queries on database calls over time with the total affected item count.

So if you have database queries where we know the number of rows selected, updated, or deleted, and if this information is on your span, then we can get this as well.

Now, a little further down. It's Exceptions.

Same thing as before.

Just focusing on how many Exceptions are coming in.

Also services with the most Exceptions.

So give me all the spans where we have an Exception on it.

And then summarize by the count and group by the service name.
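Sketched in DQL, under the assumption that spans carry an exception indicator (the filter field below is hypothetical; use whatever exception attribute your spans actually expose):

```dql
// Services with the most exceptions.
fetch spans
| filter isNotNull(span.events)   // hypothetical: spans carrying exception events
| summarize exceptionCount = count(), by: {service.name}
| sort exceptionCount desc
```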

So I immediately see that the front end has the most Exceptions. Next: hotspot internal methods.

Similar to what I mentioned earlier.

Here, what I've basically done is in the Distributed Traces App, you can filter on span kind.

So I just say only focus on the internal calls.

And then again group by. It's that easy!

And then you get to see which spans are taking most of the time.
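The same idea as a hedged DQL sketch (the span.kind value and field names are assumptions; verify them against your span data):

```dql
// Hotspot internal methods: restrict to internal spans,
// then rank span names by total time spent.
fetch spans
| filter span.kind == "internal"
| summarize callCount = count(), totalDuration = sum(duration),
    by: {span.name}
| sort totalDuration desc
```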

Also same again: DQL examples.

Now we're coming to the last piece.

This is around inefficient logs per request.

And the logs are very cool.

Because I'm assuming most of you have logs.

And now the question is how can we leverage these logs?

We can obviously query them, very easily.

We can also, for instance, easily find out how many logs are actually created per service in the context of a trace.

So we can just say: give me all the logs where a Trace.Id field exists.

And then in this case I summarize it by the service name, so I can see which service has how many logs in the context of a trace.

Now I could do the reverse: show me all the logs where a Trace.Id does not exist.

Super simple and easy.
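Both directions could be sketched in DQL like this (the trace.id and service.name field names are assumptions; check what your log records actually carry):

```dql
// Logs created in the context of a trace, counted per service.
fetch logs
| filter isNotNull(trace.id)
| summarize logCount = count(), by: {service.name}
| sort logCount desc
```

For the reverse, flip the filter to `isNull(trace.id)` to see logs written outside any trace context.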

But now the real cool thing is joining spans and logs.

And here I want to also quickly zoom in a little bit because I think that's interesting.

So I'm basically fetching spans.

So give me all the spans and then join them with the logs, joining logs and spans on that Trace.Id. Right.

So this is the way you can join two different data sets.

Basically saying give me the spans, join it with the logs and join it on the Trace.Id which is, available in both data sets.

And then in my case, in my use case, I want to only analyze HTTP endpoints.

So I then basically say only filter where HTTP URL is not null.

And then I'm just doing the same thing as before.

I'm doing a summarize. So give me the request count.

So how many requests are coming in.

And then: how many logs.

So I'm doing a sum of the log count that comes from the right data set.

And if you're doing a join you have a left and right.

Then I'm also adding another field.

I want to calculate the ratio: how many log lines per request.

So I can do a logs per request, which is log count divided by request count.

And then I'm doing a descending sort.

So what I'm getting at the bottom is a table.

It shows me at the top: what is the topmost URL, and how many logs are being created per request.

So it's like 27.92k logs on average for every time this URL gets executed.

So I have a thousand requests in total creating 29,800 log records; that is the math.

So this is where you can find your heavy hitters, right?
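Putting the whole join together, a hedged DQL sketch might look like this (field names such as trace.id and http.url are assumptions; in a DQL join, fields from the right-hand data set are addressed with the `right.` prefix):

```dql
// Logs per request per HTTP endpoint.
fetch spans
| filter isNotNull(http.url)
// Join with per-trace log counts on the shared trace id
| join [
    fetch logs
    | filter isNotNull(trace.id)
    | summarize logCount = count(), by: {trace.id}
  ], on: {trace.id}
| summarize requestCount = count(), totalLogs = sum(right.logCount),
    by: {http.url}
// Ratio: how many log lines per request
| fieldsAdd logsPerRequest = totalLogs / requestCount
| sort logsPerRequest desc
```

Endpoints at the top of this table are the heavy hitters.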

And then you could obviously go further.

You can select "show me the logs", go in, and figure out what these are.

All right folks, these were some tips and tricks.

I hope this was useful. If I scroll all the way up again: there are many different things we can do.

Check out the Distributed Traces App. Check out the Logs App.

Check out the videos, too, that we have.

But hopefully this gave you some additional tips and tricks on how you can become more efficient in analyzing logs and traces.

And you can obviously put this also on the dashboard.

Just maybe one last thing: if you have certain metrics that you want to, let's say, extract from a database call or database queries, like in my case here, this would actually be a good candidate, if this is a question I want to answer.

Give me the count of database types.

Then this might be a good candidate for a metric that you can extract from the logs as they come into Dynatrace, using OpenPipeline.

Or the number of Exceptions could be another good one, right?

So always think about it.

These ad-hoc queries work great, super great.

But if you really constantly need them, always think about where it makes sense to create a metric during ingest through OpenPipeline.

Good. With this, I want to say see you next time.

Bye
