Chaitanya Bandikatla & Tomas Kvasnicka - Improving Video Delivery and Streaming Performance with...
By Demuxed
Summary
Topics Covered
- TikTok Live: Massive Low-Latency Platform
- FLV Delivers Top Performance, CMAF Adds Features
- Audio/Video First Cuts Startup RTTs
- Media First Merges Everything, Edge Generates
- 30% Faster Startup Matches FLV Performance
Full Transcript
[music]
Hello everyone. My name is Tom. I work at CDN77.
And sharing the stage with me, I've got Chaitanya from TikTok's live streaming team.
Our teams have been working together on improving the QoS and QoE metrics for TikTok Live users since 2021. And today we're going to talk about a challenge that we solved together.
So we're going to focus on how we combined the features of CMAF and the performance of FLV to get the best out of both worlds.
Chances are you might be looking into CMAF startup time as well. In that case, we'll provide you with an idea to follow.
So, let's get to it, right?
Well, first of all, I know that when I say TikTok, most people immediately translate that to the user-generated VOD social media application. That's a fair point. However, TikTok runs a huge live video platform as well, and that's something we're going to focus on today. So in other words, no VOD for now, only low-latency live.
Very briefly, I'm going to give an overview of what we have here.
So TikTok Live has historically been powered by FLV, from the ingest side, where it's wrapped into RTMP, to the delivery side, where it uses HTTP. You know, for streams that do not require transcoding, this setup needs very little processing resources, and by using QUIC instead of TCP at the network level, it also provides second-to-none performance when it comes to the key metrics that matter the most to TikTok users. However, the delivery side behaves kind of like a never-ending HTTP request, pretty much like what we've seen with HbbTV and its impacts in the early days, for example.
So, this doesn't really follow the file-based request-response nature of the HTTP protocol like CMAF does, right?
And at the same time, CMAF allows you to do a variety of things that FLV doesn't, or at least not in an easy and native way. You know, whether it's client-driven ABR, DRM with support for all the major platforms, out-of-the-box scaling using commercial CDNs, or easy and seamless failover. The important thing is that while CMAF provides all these nice-to-have features, it at the same time doesn't match the performance of FLV in some of the key metrics where FLV excels. So I'm going to hand over to Chaitanya now, and he's going to talk about what these metrics are, why they matter to TikTok users, and what we tried to do about this.
>> Thank you, Tomas. That was a really good introduction to FLV and CMAF. So now let's look at some of the indicators that we focus on at TikTok Live to measure the performance of a protocol. The first metric is startup time. This is basically how fast users are able to see content on their devices; in other words, the time to first frame. And for a user-generated content platform like TikTok, this is critical because we need to keep the users engaged. And then, once they see the content, the smoothness of the playback is measured using stalling indicators, or stall metrics.
Now, given this metric background, let's see how FLV and CMAF perform.
So as Tomas mentioned previously, FLV from the client perspective takes just one request, and the response that it receives has everything it needs to get the playback started; it keeps receiving the data, so it keeps the playback going. But with CMAF or traditional DASH, there are multiple requests that the client sends, for example the playlist, the initialization segments, and the media segments, and it needs to get all of those responses before it can start the playback. Now, part of these requests can be done in parallel, but it needs at least two RTTs to start the playback, and obviously this has an impact on the startup time.
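As a toy illustration of the round-trip math described here (the RTT value is an arbitrary assumption, not a measurement):

```python
# Toy model of the two startup sequences: FLV needs one round trip,
# while a CMAF/DASH client must fetch the playlist before it can even
# request the init and media segments, even if those go in parallel.

RTT_MS = 80  # assumed network round-trip time

def flv_startup_ms() -> int:
    # One HTTP request; the never-ending response already carries
    # everything needed to start and continue playback.
    return 1 * RTT_MS

def cmaf_startup_ms() -> int:
    # Playlist first (1 RTT), then init + media segments in parallel
    # (1 more RTT): at least two round trips before first frame.
    return 1 * RTT_MS + 1 * RTT_MS

print(f"FLV:  {flv_startup_ms()} ms to first data")   # 80 ms
print(f"CMAF: {cmaf_startup_ms()} ms to first data")  # 160 ms
```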
Now, our idea on how to lower this was: well, can we cut down the RTTs, right?
So that is where we introduced a concept called audio-first and video-first. So what are these? These are segments that are expected to have everything the player needs to start the playback. Again, why did we do this? To reduce the RTTs. And how does it work? Now the client, instead of requesting a playlist in the first step, sends two parallel requests to the CDN edge, and these are the audio-first and video-first segments. The segments that it receives have a JSON playlist embedded in the MP4, and the client uses the audio and video data to get the playback started, then uses the playlist to make the subsequent requests. Well, everything works fine, and since these requests are parallel, it's more or less like having a single RTT.
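A minimal sketch of what this client flow could look like. The URLs, the custom box name ('jsnp'), and the JSON shape are illustrative assumptions, not TikTok's actual wire format:

```python
# Two parallel requests instead of playlist -> segments, so startup
# costs roughly a single RTT instead of at least two.
import concurrent.futures
import json
import urllib.request

EDGE = "https://edge.example.com/live/stream123"  # hypothetical

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def extract_playlist(mp4: bytes) -> dict:
    # Walk top-level MP4 boxes looking for the custom box that
    # carries the embedded JSON playlist.
    pos = 0
    while pos + 8 <= len(mp4):
        size = int.from_bytes(mp4[pos:pos + 4], "big")
        box_type = mp4[pos + 4:pos + 8]
        if box_type == b"jsnp":
            return json.loads(mp4[pos + 8:pos + size])
        if size < 8:
            break  # 64-bit / degenerate sizes not handled in this sketch
        pos += size
    raise ValueError("no embedded playlist found")

with concurrent.futures.ThreadPoolExecutor() as pool:
    audio_fut = pool.submit(fetch, f"{EDGE}/audio_first.mp4")
    video_fut = pool.submit(fetch, f"{EDGE}/video_first.mp4")
    audio, video = audio_fut.result(), video_fut.result()

playlist = extract_playlist(video)  # playback starts from audio+video;
                                    # the playlist drives later requests
```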
Well, nothing's, you know, ever ideal.
So, what are some of the problems that we faced with this approach? As I mentioned previously, the client sends these requests in parallel, so they will be processed completely independently on the CDN edge server, right? This has a problem in the sense that if the requests are processed by completely different servers, the segments can be misaligned. So then the player might need to make additional requests and get more data to make sure it has everything to get the playback started. Now, we've seen this happen in about 10% of the requests that the client sends.
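A rough sketch of how a client might detect and repair that misalignment; the playlist fields ('seq', 'segment_url_template') are illustrative assumptions, not the actual schema:

```python
# Client-side handling for misaligned audio-first/video-first
# responses produced by independent edge servers.

def align(audio_pl: dict, video_pl: dict) -> list[str]:
    """Return extra segment URLs to fetch when the two parallel
    responses were generated at different live positions."""
    a_seq, v_seq = audio_pl["seq"], video_pl["seq"]
    if a_seq == v_seq:
        return []  # aligned: playback can start right away
    # One track is behind the other: fetch the gap for the lagging
    # track before starting (seen for roughly 10% of startups).
    lagging, ahead = (audio_pl, v_seq) if a_seq < v_seq else (video_pl, a_seq)
    return [lagging["segment_url_template"].format(seq=s)
            for s in range(lagging["seq"] + 1, ahead + 1)]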
And what is the second problem? Before we go into it, I want to emphasize that for the low latency we are trying to achieve, these segments, the audio-first and video-first, need to be the latest, in the sense that the content is the latest. So that brings us to the question: how long does the CDN cache these for? If it's too short, then we run into problems like the hit ratio being low, the edge server might need to fetch these segments all the way from the origin, all the bad stuff. And if it's too long, then the content that the client receives is too old, and we are no longer doing low-latency live. So how did we solve these problems?
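To see the tension numerically, here's a back-of-the-envelope model; the Poisson hit-ratio formula and all numbers are my own simplification for illustration, not TikTok's math:

```python
# Cache-TTL tradeoff: longer TTL raises the hit ratio but makes the
# served content staler, breaking the low-latency goal.

def hit_ratio(arrivals_per_sec: float, ttl_sec: float) -> float:
    # Non-refreshing TTL cache under Poisson arrivals: each miss opens
    # a TTL window, and the ~lambda*T requests landing inside it are
    # hits, giving a long-run hit ratio of lT / (1 + lT).
    x = arrivals_per_sec * ttl_sec
    return x / (1 + x)

for ttl in (0.5, 1.0, 2.0, 5.0):
    # Assume 2 viewer joins per second on this stream.
    print(f"TTL {ttl:>3}s  hit ratio {hit_ratio(2.0, ttl):.0%}  "
          f"worst-case extra staleness ~{ttl}s")
```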
Well, we like merging things. So we thought, why don't we merge audio-first and video-first?
Right? So our primary goal with the new approach was to reduce these segment misalignment issues. What we did is we merged the audio, the video, and the playlist into a single MP4. And how does that work? Now the client, instead of making two parallel requests, makes a single request for a media-first MP4 segment, and the response it receives is expected to have a JSON playlist and the audio and video data to get the playback started. Now, how does a media-first look, right? So this is the typical structure of media-first: we introduced a custom box within the MP4 where we embed the JSON playlist, and the audio and video data are basically interleaved within the MP4 container.
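A sketch of assembling such a media-first MP4. The box name 'jsnp' and the playlist shape are assumptions carried over from the earlier sketch; the real format is TikTok/CDN77-specific:

```python
# Build a media-first MP4: init segments, then a custom top-level box
# carrying the JSON playlist, then interleaved audio/video fragments.
import json

def make_box(box_type: bytes, payload: bytes) -> bytes:
    # Standard MP4 box layout: 4-byte big-endian size (including the
    # 8-byte header), 4-byte type, then the payload.
    return (8 + len(payload)).to_bytes(4, "big") + box_type + payload

def build_media_first(playlist: dict,
                      init_segments: list[bytes],
                      media_fragments: list[bytes]) -> bytes:
    playlist_box = make_box(b"jsnp", json.dumps(playlist).encode())
    # Init segments (ftyp/moov) first, then the playlist box, then the
    # interleaved moof/mdat fragments from the live edge.
    return b"".join(init_segments) + playlist_box + b"".join(media_fragments)
```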
So, it all looks good, and ideally it should work well, but the problem is: how are these media-first MP4s generated, right? Well, the origin could always generate them, and they would just be cached on the CDN edge server. But this brings us to the second problem we talked about earlier, which is how long the CDN edge server caches them. So that is where we thought of using edge computing to generate these on the CDN edge server, and I'm going to hand it over to Tomas, who's going to talk about how CDN77 does it.
>> Thanks, that was a very good explanation of what we have here. So now let's come back to the CDN side and take a look at what the
edge is doing. Right?
The first important thing to realize here is that while the edge might have all the data that it needs to create the media-first in its cache, that doesn't really mean that it has the media-first itself in its cache. You know, the media-first response is obviously going to be very different for each and every single user. For one user, it will have very different content than for a user who joins the stream later in time. So the URL stays the same, but the content changes as the live stream moves on. This obviously leads to a very poor hit rate for such a URL, and a poor hit rate leads to all the bad stuff in our lives, right? At the same time, as long as at least one user is watching the stream, the edge must already have the data it needs to create the media-first blobs for other users, right? The same segments that the first user is watching and downloading can be used to create the media-first for someone who's trying to join the stream later in time. So this brings us to a bit of a paradox: we have a very poor hit rate for a super important URL while we have the actual content in cache. That doesn't sound right at all, right? So this is me after we realized the problem.
[laughter]
What do we do about it? Right?
So one way to solve this is a typical use case for edge computing. We take the task of merging the audio, the video, the init segments, and the playlist, and we move it from the origin to the edge. This way, we can now use segments of user A, who's already watching the stream, to create the media-first for user B, who wants to join later in time. Naturally, this optimization works, and it improves the hit rate significantly, which in turn leads to all the good stuff in our lives, like decreased origin load, better startup latency, and so on.
This is again me when the team came up with the solution.
So how do we do this? Well, for the edge to be able to handle this situation, it needs to provide two major functions. First, it needs to behave a little bit like a player and understand the playlist, and second, it needs to understand the MP4 container itself and behave a little bit like an origin. The playlist knowledge and the player behavior allow the edge to prefetch segments even before users are actually asking for them. At the same time, it allows it to know which segments are at the current live edge, which of them should be used to create the media-first blob on the fly at any given point in time. On the other hand, understanding the MP4 container allows the edge to mix the audio and the video with the init segments, add a bit of playlist, and create the outgoing MP4 that's going to be used as the response for the media-first request.
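A sketch of that edge-side logic, reusing the hypothetical build_media_first helper from the earlier sketch. All names and playlist fields are assumptions for illustration; this is not CDN77's actual edge code:

```python
# The edge tracks the live playlist like a player, keeps segments in a
# shared cache, and muxes a per-viewer media-first MP4 like an origin.

def fetch_from_origin(url: str) -> bytes:
    # Stub for the origin fetch; a real edge pulls this over HTTP.
    raise NotImplementedError(url)

class SegmentCache:
    """Segments fetched once (for any viewer) serve all later joins."""
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def get(self, url: str) -> bytes:
        if url not in self._store:          # miss: go to origin once
            self._store[url] = fetch_from_origin(url)
        return self._store[url]

def handle_media_first(cache: SegmentCache, playlist: dict) -> bytes:
    # Player-like: the edge already tracks the playlist, so it knows
    # which segments form the current live edge and can prefetch them.
    newest = playlist["segments"][-1]
    inits = [cache.get(u) for u in playlist["init_segments"]]
    frags = [cache.get(newest["audio"]), cache.get(newest["video"])]
    # Origin-like: mux init + JSON playlist + freshest fragments into
    # one outgoing MP4 (build_media_first as sketched earlier).
    return build_media_first(playlist, inits, frags)
```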
So, we put these ideas together. We tried to create a situation in which the origin never sees more than one media-first request per stream, no matter how many users are watching that stream and no matter how long that stream is. Okay, so that's the ideal-world scenario, and this combination leads to a setup where we can benefit from the features that CMAF is offering without sacrificing the performance that FLV delivered from the very beginning.
Best of both worlds at the same time.
Right. So I'm going to hand over to Chaitanya, and he's going to talk about the results that we've seen here.
>> Thank you, Tomas. That was a good explanation of how we generate media-first using edge computing. Now let's see some things that we have achieved. The first thing is that with the media-first approach, we were able to lower the startup time for TikTok Live by about 30% compared to the traditional CMAF-for-DASH approach. Now, this might look small, but for a UGC platform like TikTok Live, this is a significant win, because we've seen user engagement metrics jump pretty significantly, I'd say. And while we did that, we also managed to lower the origin load, or basically improve the edge hit ratio, by about 15%. And lastly, we were able to match FLV's performance in terms of startup time and stall-related metrics. So that is a significant win coming from a largely FLV background.
So what's next in store for us? Right now, we are working on adapting this media-first approach to other CMAF use cases like ABR and DRM for TikTok Live. I hope to talk about this at upcoming Demuxed talks, but let's see how that goes. Until then, if there are a couple of things to take away from this talk: one is how we achieved a CMAF- or DASH-based live streaming protocol with just one RTT, and how we used edge computing while doing it; the second is the metrics and results so far; and lastly, the things that we're working on. Yes, that brings us to the end of our talk, and yeah, thank you all for joining.
[applause]
[music]