
[Translation] WebRTC SFU Load Testing (Alex Gouaillard)

growww 2018. 12. 13. 21:37

Source: https://webrtchacks.com/sfu-load-testing/

 


Summary

 

Media streaming goes from a single source to thousands or tens of thousands of viewers, so it has to move to a multi-server hierarchy

 

- The test scenario is Video conferencing – many to many, all equals, one participant speaking at a time (hopefully)

: The Media streaming scenario we actually need – one-to-many, unidirectional – was apparently only presented at Streaming Media West...

 

- Tested against SFU media servers

: Janus, Jitsi, Kurento, mediasoup, Medooze

 

- Tested by inviting 500 users (7 per room, inviting one user at a time until 500 were reached)

 

- Version info for the tested media servers (each deployed on its own VM with identical configuration)

Jitsi Meet (JVB version 0.1.1077),

Janus Gateway (version 0.4.3) with its video room plugin,

Medooze (version 0.32.0) SFU app,

Kurento (from OpenVidu Docker container, Kurento Media Server version  6.7.0),

mediasoup (version 2.2.3)

 

- The figure at the very bottom has an "all video checks" item verifying whether participants actually receive video; Janus only scores in the 50% range

: But would that still pass by media-streaming criteria?

 

- Just setting up the test environment looks like it would take forever...

 

 

Q. They tested with SFUs as if that were the obvious choice. If you use WebRTC for broadcasting/video conferencing, does that automatically mean an SFU? (Any MCU cases? How do they perform?)

Q. What exactly are mediasoup and Medooze?

Q. Are different plugins/libraries normally used for video conferencing vs. media streaming, or is the same plugin used with only the test criteria split in two?

 

=> The latest Janus version is 0.6.0; let's go ahead and test it first.

 


 

 

If you plan to have multiple participants in your WebRTC calls then you will probably end up using a Selective Forwarding Unit (SFU)

 

-> Why an SFU? (Is it better than an MCU performance-wise? Mesh is obviously out of the question.)

 

Capacity planning for SFU’s can be difficult – there are estimates to be made for where they should be placed, how much bandwidth they will consume, and what kind of servers you need.

-> Placement of what, exactly? The placement of the SFUs themselves?

To help network architects and WebRTC engineers make some of these decisions, webrtcHacks contributor Dr. Alex Gouaillard and his team at CoSMo Software put together a load test suite to measure load vs. video quality. 


They published their results for all of the major open source WebRTC SFUs. This suite is based on the Karoshi Interoperability Testing Engine (KITE), which Google funded and uses on webrtc.org to show interoperability status. 

The CoSMo team also developed a machine learning based video quality assessment framework optimized for real time communications scenarios.


 

First an important word of caution – asking what kind of SFU is best is kind of like asking what car is best. If you only want speed then you should get a Formula 1 car, but that won't help you take the kids to school. Vendors never get excited about these kinds of tests because they boil their functionality down to just a few performance metrics. These metrics may not have been a major part of their design criteria and a lot of times they just aren't that important. 

 

For WebRTC SFU’s in particular, just because you can load a lot of streams on an SFU, there may be many resiliency, user behavior, and cost optimization reasons for not doing that.

-> So apparently there is more than one way to design an SFU deployment.

 

Load tests also don't take a deep look at the end-to-end user experience, ease of development, or all the other functional elements that go into a successful service. Lastly, a published report like this represents a single point in time – these systems are always improving, so results might be better today.

That being said, I personally have had many cases where I wish I had this kind of data when building out cost models. Alex and his team have done a lot of thorough work here and this is a great sign of maturity in the WebRTC open source ecosystem. I personally reached out to each of the SFU development teams mentioned here to ensure they were each represented fairly. 

This test setup is certainly not perfect, but I do think it will be a useful reference for the community.


Please read on for Alex’s test setup and analysis summary.

 


Introduction

One recurring question on the discuss-webrtc mailing list is “What is the best SFU”. This invariably produces a response of “Mine obviously” from the various SFU vendors and teams. Obviously, they cannot all be right at the same time!

You can check the full thread here. Chad Hart, then with Dialogic, answered kindly, recognizing the problem and expressing a need:

In any case, I think we need a global (same applied to all) reproducible and unbiased (source code available, and every vendor can tune their installation if they want) benchmark, for several scalability metrics.

Three years later my team and I have built such a benchmark system. I will explain how this system works and show some of our initial results below.


The Problem

Several SFU vendors provide load testing tools. 


Janus has Jattack. Jitsi has jitsi-hammer and even published some of their results. Jitsi in particular has done a great job with transparency and provides reliable data and enough information to reproduce the results. However, not all vendors have these tools and fewer still make them fully publicly available.  

In addition, each tool is designed to answer slightly different questions for their own environments such as:


  • How many streams can a single server instance of chosen type and given bandwidth limit handle?
  • How many users can I support on the same instance?
  • How many users can I support in a single conference?
  • Etc.…

There was just no way to make a real comparative study – one that is independent, reproducible, and unbiased. The inherent ambiguity also opened the door for some unsavory behavior from some who realized they could get away with any claim because no one could actually check them. We wanted to produce some results that one does not have to take on faith and that could be peer-reviewed.


What use cases?

To have a good answer to “What is the best SFU?” you need to explain what you are planning to use it for.


We chose to work on the two use cases that seemed to gather the most attention, or at least those which were generating the most traffic on discuss-webrtc:


  1. Video conferencing – many to many, all equals, one participant speaking at a time (hopefully),
  2. Media streaming – one-to-many, unidirectional

(We are mainly interested in #2.)

Most video conferencing questions are focused on a single server instance. Having 20+ people in a given conference is usually plenty for most. Studies like this one show that in most social cases most of the calls are 1-1, and the average is around 3. This configuration fits a single small instance in any public cloud provider very well (as long as you get a 1Gbps NIC). You can then use very simple load balancing and horizontal scalability techniques since the ratio of senders to viewers is rarely high. 

Media streaming, on the other hand, typically involves streaming from a single source to thousands or tens of thousands of viewers. This requires a multi-server hierarchy.

-> So in the end we have to think about an origin-server plus distribution-server architecture?

 

We wanted to accommodate different testing scenarios and implement them in the same fashion across several WebRTC Servers so that the only difference is the system being tested, and the results are not biased.


 

For purposes of this post I will focus on the video conferencing scenario. For those that are interested, we are finalizing our media streaming test results and plan to present them  at Streaming Media West on November 14th.

 

-> So this post covers tests against scenario #1 (video conferencing).

-> And the media streaming tests only get revealed at Streaming Media West on 11/14... :(


VES203. WebRTC: The Future Champion Of Low Latency Video

Wednesday, November 14: 1:45 p.m. - 2:45 p.m.

HLS and MPEG-DASH are the current standards for HTTP-based live streaming, but these designs are inherently slow and add delays to live feeds. Sub-second latency is critical for scenarios such as gambling, auctions, interactive communications, VR, sports, and gaming. WebRTC is touted for its sub-second latency but couldn’t scale to the volume needed by CDNs and couldn’t reach Apple devices. In this session, learn about a novel WebRTC-based solution that would match current CDNs in terms of reach (all devices), quality, cost, and scale, while providing sub-500 milliseconds latency.

Speakers:

Alexandre Gouaillard, Founder and CEO, CoSMo Software Consulting; IETF, W3C

Richard Blakely, CEO, Millicast



The test suite

In collaboration with Google and many others, we developed KITE, a testing engine that would allow us to support all kinds of clients – browsers and native across mobile or desktop – and all kinds of test scenarios easily. It is used to test WebRTC implementations every day across browsers, as seen on webrtc.org



Selecting a test client

Load testing is typically done with a single client to control for client impacts. Ideally you can run many instances of the test client in parallel in a single virtual machine (VM). Since this is WebRTC, it makes sense to use one of the browsers. Edge and Safari are limited to a single process, which does not make them very suitable. Additionally, Safari only runs on macOS or iOS, which only run on Apple hardware. It is relatively easy to spawn a million VMs on AWS if you're running Windows or Linux. It's quite a bit more difficult, and costly, to set up one million Macs, iPhones, or iPads for testing (Note, I am still dreaming about this though).

That leaves you with Chrome or Firefox which allow multiple instances just fine. It is our opinion that the implementation of webdriver for Chrome is easier to manage with fewer flags and plugins (i.e. H264) to handle, so we chose to use Chrome.

Bottom line: the tests were run with Chrome.
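For illustration only, here is a minimal sketch of how one such Chrome test client could be driven through webdriver, assuming Selenium's Python bindings and chromedriver are available; the room URL is a placeholder, and the fake-media switches are Chrome's standard flags for synthetic camera/microphone input. This is not the actual KITE client code.

```python
# Sketch: launch one Chrome test client with synthetic media via Selenium.
# Assumes chromedriver is on PATH; the room URL below is a placeholder.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def launch_client(room_url: str) -> webdriver.Chrome:
    opts = Options()
    # Replace the camera/microphone with Chrome's built-in synthetic sources.
    opts.add_argument("--use-fake-device-for-media-stream")
    # Auto-accept the getUserMedia permission prompt so no human is needed.
    opts.add_argument("--use-fake-ui-for-media-stream")
    driver = webdriver.Chrome(options=opts)
    driver.get(room_url)
    return driver

client = launch_client("https://sfu.example.com/room/1")  # hypothetical URL
```

Because each client is just another Chrome process, many of them can be packed into one VM and scaled out horizontally, which is what makes Chrome attractive here.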


Systems Under Test

We tested the following SFUs:

  • Jitsi (Jitsi Videobridge / Jitsi Meet)
  • Janus Gateway with its video room plugin
  • Medooze
  • Kurento (via OpenVidu)
  • mediasoup

To help make sure each SFU showed its best results, we contacted the teams behind each of these projects. 

We offered to let them set up the server themselves or connect to the servers and check their settings. 

We also shared the results so they could comment. That made sure we had properly configured each system to perform optimally in our test.

Interestingly enough, during the life of this study we found quite a few bugs and worked with the teams to improve their solutions. This is discussed more in detail in the last section.



Test Setup

We used the following methodology to increase traffic to a high load. 


First we populated each video conference room with one user at a time until it reached 7 total users. 


We repeated this process until the total target number of users was reached: close to 500 simultaneous users.


-> So that means roughly 71 rooms were created?
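As a rough illustration of that ramp-up, a sketch that reuses the hypothetical launch_client helper from the previous snippet; the room URLs and the pacing are placeholders, not the study's actual orchestration.

```python
# Sketch of the ramp-up described above: fill each room with 7 users,
# adding one client at a time, until roughly 500 simultaneous users run.
import time

ROOM_SIZE = 7       # users per video conference room
TARGET_USERS = 500  # total simultaneous users

drivers = []
room_index = 0
while len(drivers) < TARGET_USERS:
    room_index += 1
    room_url = f"https://sfu.example.com/room/{room_index}"  # placeholder URL
    for _ in range(ROOM_SIZE):
        if len(drivers) >= TARGET_USERS:
            break
        drivers.append(launch_client(room_url))  # helper from the sketch above
        time.sleep(1)  # pace the joins; the real test also ran checks here
```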

The diagram below shows the elements in the testbed:

[Diagram: testbed data flow]


Metrics

Most people interested in scalability questions will measure the CPU, RAM, and bandwidth footprints of the server as the “load” (streams, users, rooms…) ramps up. That is a traditional way of doing things that supposes that the quality of the streams, their bitrate… all stay equal.

WebRTC’s encoding engine makes this much more complex. 


Because WebRTC includes bandwidth estimation, bitrate adaptation, and an overall congestion control mechanism, one cannot assume streams will remain unmodified across the experiment. 


-> So the video actually transmitted can change from run to run... which makes testing that much harder.

In addition to the usual metrics, the tester also needs to record client-side metrics like sent bitrate, bandwidth estimation results and latency. It is also important to keep an eye on the video quality, as it can degrade way before you saturate the CPU, RAM and/or bandwidth of the server.

 

On the client side, we ended up measuring the following:


  • Rate of success and failures (frozen video, or no video)
  • Sender and receiver bitrates
  • Latency
  • Video quality (more on that in the next section)
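As a hedged illustration, these client-side numbers can be pulled straight out of the browser by polling getStats through webdriver. The sketch below assumes the page under test exposes its RTCPeerConnection as window.pc; that global is an assumption about a hypothetical test page, not something KITE or the SFUs provide.

```python
# Sketch: poll WebRTC getStats through Selenium and keep bitrate/latency fields.
# Assumes the page under test exposes its RTCPeerConnection as window.pc.
GET_STATS_JS = """
const done = arguments[arguments.length - 1];  // Selenium async callback
window.pc.getStats().then(report => {
  const out = [];
  report.forEach(s => {
    if (s.type === 'inbound-rtp' || s.type === 'outbound-rtp' ||
        s.type === 'candidate-pair') {
      out.push({
        type: s.type,
        bytesSent: s.bytesSent,
        bytesReceived: s.bytesReceived,
        roundTripTime: s.currentRoundTripTime,                // latency proxy
        availableOutgoingBitrate: s.availableOutgoingBitrate  // bandwidth estimate
      });
    }
  });
  done(out);
});
"""

def poll_stats(driver):
    # execute_async_script hands the script a callback as its last argument.
    return driver.execute_async_script(GET_STATS_JS)
```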

Measuring different metrics on the server side can be as easy as polling the getStats API yourself or integrating a solution like callstats.io. In our case, we measured:


  • CPU footprint,
  • RAM footprint,
  • Ingress and egress bandwidth,
  • number of streams,
  • along with a few other less relevant metrics.
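For the server side, a simple poller along the following lines can log CPU, RAM and network throughput while the load ramps up. This is a sketch using the psutil library, an assumed tool choice, not the tooling actually used in the study.

```python
# Sketch: sample server-side CPU, RAM and network throughput once per second.
import csv, time
import psutil

def monitor(path: str, interval: float = 1.0, samples: int = 60) -> None:
    prev = psutil.net_io_counters()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["t", "cpu_pct", "ram_pct", "tx_bps", "rx_bps"])
        for i in range(samples):
            time.sleep(interval)
            cur = psutil.net_io_counters()
            writer.writerow([
                i,
                psutil.cpu_percent(),                               # CPU footprint
                psutil.virtual_memory().percent,                    # RAM footprint
                (cur.bytes_sent - prev.bytes_sent) * 8 / interval,  # egress bps
                (cur.bytes_recv - prev.bytes_recv) * 8 / interval,  # ingress bps
            ])
            prev = cur
```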

The metrics above were not published in the Scientific article because of space limitation, but should be released in a subsequent Research Report.

All of these metrics are simple to produce and measure with the exception of video quality. 

What is an objective measure of video quality? 


Several proxies for video quality exist such as Google rendering time, received frames, bandwidth usage, but none of these gave an accurate measure.



Video quality metric

Ideally a video quality metric would be visually obvious when impairments are present. This would allow one to measure the relative benefits of resilient techniques, such as Scalable Video Coding (SVC), where conceptually the output video has a looser correlation with jitter, packet loss, etc. than other encoding methods. See the below video from Agora for a good example of a visual comparison:

https://www.youtube.com/watch?v=M71uov3OMfk

After doing some quick research on a way to automate this kind of visual quality measurement, we realized that nobody had developed a method to assess video quality as well as a human would, in real time and in the absence of reference media. So, we went on to develop our own metric leveraging Machine Learning with neural networks. This allowed for real-time, on-the-fly video quality assessment. As an added benefit, it can be used without recording customer media, which is sometimes a legal or privacy issue.

The specifics of this mechanism are beyond the scope of this article but you can read more about the video quality algorithm here. The specifics of this AI-based algorithm have been submitted for publication and will be made public as soon as it is accepted.


Show me the results

We set up the following five open-source WebRTC SFUs, using the latest source code downloaded from their respective public GitHub repositories (except for Kurento/OpenVidu, for which the Docker container was used):

  • Jitsi Meet (JVB version 0.1.1077)
  • Janus Gateway (version 0.4.3) with its video room plugin
  • Medooze (version 0.32.0) SFU app
  • Kurento (from the OpenVidu Docker container, Kurento Media Server version 6.7.0)
  • mediasoup (version 2.2.3)

Each was set up in a separate but identical Virtual Machine with default configuration.


Disclaimers

First a few disclaimers. All teams have seen and commented on the results for their SFUs.

The Kurento Media Server team is aware that their server is currently crashing early and we are working with them to address this. On Kurento/OpenVidu, we tested max 140 streams (since it crashes so early).

Kurento basically dies before even reaching 140 streams.

In addition, there is a known bug in libnice, which affected both Kurento/OpenVidu and Janus during our initial tests.  


After a libnice patch was applied as advised by the Janus team, their results significantly improved. However, the re-test with the patch on Kurento/OpenVidu actually proved even worse. Our conclusion was that there are other issues with Kurento. We are in contact with them and working on fixes, so the Kurento/OpenVidu results might improve soon.


-> What exactly did they have patched... and what if the patch was requested in a way that favors Janus..? (just being suspicious)

The latest version of Jitsi Videobridge (up to the point of this publication) always became unstable at exactly 240 users. 


The Jitsi team is aware of that and working on the problem. 


They have, however, pointed out that their general advice is to rely on horizontal scaling with a larger number of smaller instances, as described here.


-> From a quick look at the YouTube video above, it seems they bundle several videobridges behind a Meet server that handles the XMPP signaling.

Note that a previous version (as of two months ago) did not have these stability issues but did not perform as well (see more on this in the next section). 


We chose to keep version 0.1.1077 as it made simulcast much better and improved the results significantly (up to 240 participants, that is).


Also note nearly all of these products have had version releases since testing. Some have made improvements since the test results shown here.

 


Measurements

As a reference point, we chose one of the usual video test sequences and computed its video quality score using several video quality assessment metrics:


  • SSIM – a common metric that compares the difference between a distorted image and its original
  • VMAF – an aggregate measure of a few metrics used and developed by Netflix
  • NARVAL – our algorithm which does not require a reference

Image 1: benchmarking various video quality metrics over different bitrates
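As a side note, reference-based metrics like SSIM are straightforward to compute when the original frame is available. Here is a small sketch using scikit-image, which is an assumed tool choice and not the study's actual pipeline; the frame file names are hypothetical.

```python
# Sketch: score one received frame against the reference frame with SSIM.
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity

reference = rgb2gray(imread("reference_frame.png"))  # hypothetical file
received = rgb2gray(imread("received_frame.png"))    # hypothetical file

# 1.0 means identical; the score drops as compression or loss degrades the frame.
score = structural_similarity(reference, received, data_range=1.0)
print(f"SSIM: {score:.3f}")
```

The catch, as the article points out, is that in a live call you usually do not have the reference frame at the receiver, which is exactly the gap NARVAL is meant to fill.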

Note the relationship between quality score and bitrate is not linear. If you slowly decrease the bandwidth from the reference value (1.7Mbps) the quality score only decreases slightly until it hits a low bitrate threshold and then decreases more drastically. 


To lose 10% of the perceived video quality, you need to reduce the bandwidth to 250Kbps according to VMAF, or even 150Kbps according to SSIM, and 100Kbps according to NARVAL.


Tests on the SFUs showed the same pattern. Image 2 gives the bitrate as a function of the number of participants for each SFU. 

One can see here that WebRTC’s congestion control algorithms kick in early (at around 250 participants) to maintain bitrate. 

-> So the WebRTC congestion control algorithm only lets about 250 users in, to maintain the bitrate?

-> Did WebRTC itself refuse to accept more? Or is the way the video is sent just different?
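To make the bitrate curves concrete: a per-client send bitrate can be derived from two successive getStats samples. A sketch building on the hypothetical poll_stats helper from earlier, not the study's measurement code:

```python
# Sketch: derive an average outbound bitrate (bits per second) from two
# getStats polls taken interval_s seconds apart, via poll_stats() above.
import time

def outbound_bitrate(driver, interval_s: float = 2.0) -> float:
    def sent_bytes() -> int:
        return sum(s.get("bytesSent") or 0
                   for s in poll_stats(driver)
                   if s["type"] == "outbound-rtp")
    first = sent_bytes()
    time.sleep(interval_s)
    second = sent_bytes()
    return (second - first) * 8 / interval_s
```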

However, Image 3 shows that the latency increases more linearly. 

Despite decreasing bandwidth and increasing latency, the video quality metric shown in Image 4 only reports quality degradation much later around when the bandwidth goes below 200k. 

That shows again that bit rate and latency are not good proxies for Video Quality.

 


Image 2: Jitsi fails at 240 participants. Kurento/OpenVidu had issues early. Janus and mediasoup seem to fare better than Medooze. It seems to be related to better CPU optimizations, as the inflection points correlate with the saturation of respective CPUs (not shown in this post).

 


Image 3: Jitsi fails at 240 participants and Kurento/OpenVidu had issues around 50. Otherwise the SFUs exhibit comparable behavior.

 


Image 4: The video quality only drops toward the end of the experiment, showing that the congestion control mechanism is doing its job well and manages to make the right compromises to keep the perceived quality high while adjusting other parameters.


SFU improvements during testing

Beyond the results themselves presented above, what is interesting is to see the progress in the results triggered by this study. Just getting visibility has allowed the respective teams to address the most egregious problems.

Then you can also observe that Janus was very quickly limited. 

They had identified this bottleneck in an external library, and a possible solution, but had never really assessed the true impact. 

One can clearly see the difference between the graphs in this section (first runs) and the graphs in the previous section (latest results), where Janus seems to perform the best.

Bitrate as a function of the load. Before (left) and after (right) application of patches to Janus and Jitsi. We also added mediasoup results (in green). Medooze and Kurento/OpenVidu results are the same in both plots as no better results could be generated the second time around.

RTT, or latency, as a function of the load (logarithmic scale). Before (left) and after (right) application of patches to Janus and Jitsi. We also added mediasoup results (in green). Medooze and Kurento/OpenVidu results come from the same dataset.

Finally, one reviewer of our original article pointed to the fact that Medooze, by Sergio Garcia Murillo, a CoSMo employee, ended up on top of our study, hinting at a possible bias caused by a conflict of interest. We went to great efforts to conduct all of our tests transparently without bias. I think it is refreshing to see that in the latest results several SFUs end up being on par with or better than Medooze, removing the final worry some might have. It was good news for the Medooze team too – now they know what they have to work on (like improvements made in Medooze 0.46) and they have a tool to measure their progress.


https://www.cosmosoftware.io/publications/andre2018_Comparative_Study_of_SFUs.pdf

https://www.cosmosoftware.io/publications/andre2018_slides_Comparative_Study_of_SFUs.pdf

 
