Chaitanya Bandikatla & Tomas Kvasnicka - Improving Video Delivery and Streaming Performance with...
By Demuxed
Summary
Topics Covered
- TikTok Live: Massive Low-Latency Platform
- FLV Delivers Top Performance, CMAF Adds Features
- Audio/Video First Cuts Startup RTTs
- Media First Merges Everything, Edge Generates
- 30% Faster Startup Matches FLV Performance
Full Transcript
[music]
Hello everyone. My name is Tom. I work at CDN77.
And sharing the stage with me, I've got Chaitanya from TikTok's live streaming team.
Our teams have been working together on improving the QoS and QoE metrics for TikTok Live users since 2021. And today we're going to talk about a challenge that we solved together.
So we're going to focus on how we combined the features of CMAF and the performance of FLV to get the best out of both worlds.
Chances are you might be looking into CMAF startup time as well. In that case, we'll provide you with an idea to follow.
So, let's get to it, right?
Well, first of all, I know that when I say TikTok, most people immediately translate that to the user-generated VOD social media application. That's a fair point. However, TikTok runs a huge live video platform as well, and that's something we're going to focus on today. So in other words, no VOD for now, only low-latency live.
Very briefly, I'm going to give an overview of what we have here.
So TikTok Live has historically been powered by FLV, from the ingest side, where it's wrapped into RTMP, to the delivery side, where it uses HTTP. You know, for streams that do not require transcoding, this setup needs very little processing resources, and by using QUIC instead of TCP at the network level, it also provides second-to-none performance when it comes to the key metrics that matter the most to TikTok users. However, the delivery side behaves kind of like a never-ending HTTP request, pretty much like what we've seen with HbbTV and its impacts in the early days, for example.
So, this doesn't really follow the file-based request-response nature of the HTTP protocol like CMAF does, right?
And at the same time, CMAF allows you to do a variety of things that FLV doesn't, or at least not in an easy and native way. You know, whether it's client-driven ABR, DRM with support for all the major platforms, out-of-the-box scaling using commercial CDNs, or easy and seamless failover. The important thing is that while CMAF provides all these nice-to-have features, it at the same time doesn't match the performance of FLV in some of the key metrics where FLV excels. So I'm going to hand over to Chaitanya now, and he's going to talk about what these metrics are, why they matter to TikTok users, and what we tried to do about this.
>> Thank you, Tomas. That was a really good introduction to FLV and CMAF. So now let's look at some of the indicators that we focus on at TikTok Live to measure the performance of a protocol. The first metric is startup time. This is basically how fast users are able to see content on their devices; in other words, the time to first frame. And for a user-generated content platform like TikTok, this is critical because we need to keep the users engaged. And then, once they see the content, the smoothness of the playback is measured using stalling indicators, or stall metrics.
Now, given this metric background, let's see how FLV and CMAF perform.
So as Tomas mentioned previously, FLV from the client perspective takes just one request, and the response that it receives has everything it needs to get the playback started; it keeps receiving the data, so it keeps the playback going. But with CMAF or traditional DASH, there are multiple requests that the client sends, for example the playlist, the initialization segments, and the media segments, and it needs to get all of those responses before it can start the playback. Now, part of these requests can be done in parallel, but it needs at least two RTTs to start the playback, and obviously this has an impact on the startup time.
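As a toy illustration of the round-trip math described here (the RTT value is an arbitrary assumption, not a measurement):

```python
# Toy model of the two startup sequences: FLV needs one round trip,
# while a CMAF/DASH client must fetch the playlist before it can even
# request the init and media segments, even if those go in parallel.

RTT_MS = 80  # assumed network round-trip time

def flv_startup_ms() -> int:
    # One HTTP request; the never-ending response already carries
    # everything needed to start and continue playback.
    return 1 * RTT_MS

def cmaf_startup_ms() -> int:
    # Playlist first (1 RTT), then init + media segments in parallel
    # (1 more RTT): at least two round trips before first frame.
    return 1 * RTT_MS + 1 * RTT_MS

print(f"FLV:  {flv_startup_ms()} ms to first data")   # 80 ms
print(f"CMAF: {cmaf_startup_ms()} ms to first data")  # 160 ms
```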
Now, our idea on how to lower this was: well, can we cut down the RTTs, right?
So that is where we introduced a concept called audio-first and video-first. So what are these? These are segments that are expected to have everything the player needs to start the playback. Again, why did we do this? To reduce the RTTs. And how does it work? Now the client, instead of requesting a playlist in the first step, sends two parallel requests to the CDN edge, and these are the audio-first and video-first segments. The segments that it receives have a JSON playlist embedded in the MP4, and the client uses the audio and video data to get the playback started, then uses the playlist to make the subsequent requests. Well, everything works fine, and since these requests are parallel, it's more or less like having a single RTT.
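A minimal sketch of what this client flow could look like. The URLs, the custom box name ('jsnp'), and the JSON shape are illustrative assumptions, not TikTok's actual wire format:

```python
# Two parallel requests instead of playlist -> segments, so startup
# costs roughly a single RTT instead of at least two.
import concurrent.futures
import json
import urllib.request

EDGE = "https://edge.example.com/live/stream123"  # hypothetical

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def extract_playlist(mp4: bytes) -> dict:
    # Walk top-level MP4 boxes looking for the custom box that
    # carries the embedded JSON playlist.
    pos = 0
    while pos + 8 <= len(mp4):
        size = int.from_bytes(mp4[pos:pos + 4], "big")
        box_type = mp4[pos + 4:pos + 8]
        if box_type == b"jsnp":
            return json.loads(mp4[pos + 8:pos + size])
        if size < 8:
            break  # 64-bit / degenerate sizes not handled in this sketch
        pos += size
    raise ValueError("no embedded playlist found")

with concurrent.futures.ThreadPoolExecutor() as pool:
    audio_fut = pool.submit(fetch, f"{EDGE}/audio_first.mp4")
    video_fut = pool.submit(fetch, f"{EDGE}/video_first.mp4")
    audio, video = audio_fut.result(), video_fut.result()

playlist = extract_playlist(video)  # playback starts from audio+video;
                                    # the playlist drives later requests
```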
Well, nothing's, you know, ever ideal.
So, what are some of the problems that we faced with this approach? As I mentioned previously, the client sends these requests in parallel, so they will be processed completely independently on the CDN edge server, right? This has a problem in the sense that if the requests are processed by completely different servers, the segments can be misaligned. So then the player might need to make additional requests and get more data to make sure it has everything to get the playback started. Now, we've seen this happen in about 10% of the requests that the client sends.
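A rough sketch of how a client might detect and repair that misalignment; the playlist fields ('seq', 'segment_url_template') are illustrative assumptions, not the actual schema:

```python
# Client-side handling for misaligned audio-first/video-first
# responses produced by independent edge servers.

def align(audio_pl: dict, video_pl: dict) -> list[str]:
    """Return extra segment URLs to fetch when the two parallel
    responses were generated at different live positions."""
    a_seq, v_seq = audio_pl["seq"], video_pl["seq"]
    if a_seq == v_seq:
        return []  # aligned: playback can start right away
    # One track is behind the other: fetch the gap for the lagging
    # track before starting (seen for roughly 10% of startups).
    lagging, ahead = (audio_pl, v_seq) if a_seq < v_seq else (video_pl, a_seq)
    return [lagging["segment_url_template"].format(seq=s)
            for s in range(lagging["seq"] + 1, ahead + 1)]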
And what is the second problem? Before we go into it, I want to emphasize that for the low latency we are trying to achieve, these segments, the audio-first and video-first, need to be the latest, in the sense that the content is the latest. So that brings us to the question: how long does the CDN cache these for? If it's too short, then we run into problems like the hit ratio being low, the edge server might need to fetch these segments all the way from the origin, all the bad stuff. And if it's too long, then the content that the client receives is too old, and we are no longer doing low-latency live. So how did we solve these problems?
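To see the tension numerically, here's a back-of-the-envelope model; the Poisson hit-ratio formula and all numbers are my own simplification for illustration, not TikTok's math:

```python
# Cache-TTL tradeoff: longer TTL raises the hit ratio but makes the
# served content staler, breaking the low-latency goal.

def hit_ratio(arrivals_per_sec: float, ttl_sec: float) -> float:
    # Non-refreshing TTL cache under Poisson arrivals: each miss opens
    # a TTL window, and the ~lambda*T requests landing inside it are
    # hits, giving a long-run hit ratio of lT / (1 + lT).
    x = arrivals_per_sec * ttl_sec
    return x / (1 + x)

for ttl in (0.5, 1.0, 2.0, 5.0):
    # Assume 2 viewer joins per second on this stream.
    print(f"TTL {ttl:>3}s  hit ratio {hit_ratio(2.0, ttl):.0%}  "
          f"worst-case extra staleness ~{ttl}s")
```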
Well, we like merging things. So we thought, why don't we merge audio-first and video-first?
Right? So our primary goal with the new approach was to reduce these segment misalignment issues. What we did is we merged the audio, the video, and the playlist into a single MP4. And how does that work? Now the client, instead of making two parallel requests, makes a single request for a media-first MP4 segment, and the response it receives is expected to have a JSON playlist and the audio and video data to get the playback started. Now, how does a media-first look, right? So this is the typical structure of media-first: we introduced a custom box within the MP4 where we embed the JSON playlist, and the audio and video data are basically interleaved within the MP4 container.
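A sketch of assembling such a media-first MP4. The box name 'jsnp' and the playlist shape are assumptions carried over from the earlier sketch; the real format is TikTok/CDN77-specific:

```python
# Build a media-first MP4: init segments, then a custom top-level box
# carrying the JSON playlist, then interleaved audio/video fragments.
import json

def make_box(box_type: bytes, payload: bytes) -> bytes:
    # Standard MP4 box layout: 4-byte big-endian size (including the
    # 8-byte header), 4-byte type, then the payload.
    return (8 + len(payload)).to_bytes(4, "big") + box_type + payload

def build_media_first(playlist: dict,
                      init_segments: list[bytes],
                      media_fragments: list[bytes]) -> bytes:
    playlist_box = make_box(b"jsnp", json.dumps(playlist).encode())
    # Init segments (ftyp/moov) first, then the playlist box, then the
    # interleaved moof/mdat fragments from the live edge.
    return b"".join(init_segments) + playlist_box + b"".join(media_fragments)
```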
So, it all looks good, and ideally it should work well, but the problem is: how are these media-first MP4s generated, right? Well, the origin could always generate them, and they would just be cached on the CDN edge server. But this brings us to the second problem we talked about earlier, which is how long the CDN edge server caches them. So that is where we thought of using edge computing to generate these on the CDN edge server, and I'm going to hand it over to Tomas, who's going to talk about how CDN77 does it.
>> Thanks, that was a very good explanation of what we have here. So now let's come back to the CDN side and take a look at what the
edge is doing. Right?
The first important thing to realize here is that while the edge might have all the data that it needs to create the media-first in its cache, that doesn't really mean that it has the media-first itself in its cache. You know, the media-first response is obviously going to be very different for each and every single user. For one user, it will have very different content than for a user who joins the stream later in time. So the URL stays the same, but the content changes as the live stream moves on. This obviously leads to a very poor hit rate for such a URL, and a poor hit rate leads to all the bad stuff in our lives, right? At the same time, as long as at least one user is watching the stream, the edge must already have the data it needs to create the media-first blobs for other users, right? The same segments that the first user is watching and downloading can be used to create the media-first for someone who's trying to join the stream later in time. So this brings us to a bit of a paradox: we have a very poor hit rate for a super important URL while we have the actual content in cache. That doesn't sound right at all, right? So this is me after we realized the problem.
[laughter]
What do we do about it? Right?
So one way to solve this is a typical use case for edge computing. We take the task of merging the audio, the video, the init segments, and the playlist, and we move it from the origin to the edge. This way, we can now use segments of user A, who's already watching the stream, to create the media-first for user B, who wants to join later in time. Naturally, this optimization works, and it improves the hit rate significantly, which in turn leads to all the good stuff in our lives, like decreased origin load, better startup latency, and so on.
This is again me when the team came up with the solution.
So how do we do this? Well, for the edge to be able to handle this situation, it needs to provide two major functions. First, it needs to behave a little bit like a player and understand the playlist, and second, it needs to understand the MP4 container itself and behave a little bit like an origin. The playlist knowledge and the player behavior allow the edge to prefetch segments even before users are actually asking for them. At the same time, it allows it to know which segments are at the current live edge, which of them should be used to create the media-first blob on the fly at any given point in time. On the other hand, understanding the MP4 container allows the edge to mix the audio and the video with the init segments, add a bit of playlist, and create the outgoing MP4 that's going to be used as the response for the media-first request.
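A sketch of that edge-side logic, reusing the hypothetical build_media_first helper from the earlier sketch. All names and playlist fields are assumptions for illustration; this is not CDN77's actual edge code:

```python
# The edge tracks the live playlist like a player, keeps segments in a
# shared cache, and muxes a per-viewer media-first MP4 like an origin.

def fetch_from_origin(url: str) -> bytes:
    # Stub for the origin fetch; a real edge pulls this over HTTP.
    raise NotImplementedError(url)

class SegmentCache:
    """Segments fetched once (for any viewer) serve all later joins."""
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def get(self, url: str) -> bytes:
        if url not in self._store:          # miss: go to origin once
            self._store[url] = fetch_from_origin(url)
        return self._store[url]

def handle_media_first(cache: SegmentCache, playlist: dict) -> bytes:
    # Player-like: the edge already tracks the playlist, so it knows
    # which segments form the current live edge and can prefetch them.
    newest = playlist["segments"][-1]
    inits = [cache.get(u) for u in playlist["init_segments"]]
    frags = [cache.get(newest["audio"]), cache.get(newest["video"])]
    # Origin-like: mux init + JSON playlist + freshest fragments into
    # one outgoing MP4 (build_media_first as sketched earlier).
    return build_media_first(playlist, inits, frags)
```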
So, we put these ideas together. We tried to create a situation in which the origin never sees more than one media-first request per stream, no matter how many users are watching that stream and no matter how long that stream is. Okay, so that's the ideal-world scenario, and this combination leads to a setup where we can benefit from the features that CMAF is offering without sacrificing the performance that FLV delivered from the very beginning.
Best of both worlds at the same time.
Right. So I'm going to hand over to Chaitanya, and he's going to talk about the results that we've seen here.
>> Thank you, Tomas. That was a good explanation of how we generate media-first using edge computing. Now let's see some things that we have achieved. The first thing is that with the media-first approach, we were able to lower the startup time for TikTok Live by about 30% compared to the traditional CMAF-for-DASH approach. Now, this might look small, but for a UGC platform like TikTok Live, this is a significant win, because we've seen user engagement metrics jump pretty significantly, I'd say. And while we did that, we also managed to lower the origin load, or basically improve the edge hit ratio, by about 15%. And lastly, we were able to match FLV's performance in terms of startup time and stall-related metrics. So that is a significant win coming from a largely FLV background.
So what's next in store for us? Right now, we are working on adapting this media-first approach to other CMAF use cases like ABR and DRM for TikTok Live. I hope to talk about this at upcoming Demuxed talks, but let's see how that goes. Until then, if there are a couple of things to take away from this talk: one is how we achieved a CMAF- or DASH-based live streaming protocol with just one RTT, and how we used edge computing while doing it; the second is the metrics and results so far; and lastly, the things that we're working on. Yes, that brings us to the end of our talk, and yeah, thank you all for joining.
[applause]
[music]