At 2,000 and climbing! [SOLVED] Standalone all-in-one Jitsi server max capacity 600 users, no matter how much hardware (at least in AWS)
UPDATE 20210605: Now at 2,000, and still climbing.
First of all, I am aware that an all-in-one default install is not recommended for larger-scale setups. I am taking this approach as part of a rigorously methodical documentation process: establish a baseline first, then expand into increasingly complex use-cases and infrastructure designs, including separating the various Jitsi components (jicofo, videobridge, prosody, etc.) into clusters, trying a variety of vertical and horizontal scaling combinations, and later adding and testing various autoscaling options.
First, the good news: the current default Jitsi install on Ubuntu 20.04 (as of this writing, May 24th, 2021) shows some nice capacity improvements over previous releases.
These initial baseline tests are MINIMAL, but will be expanded in load and complexity down the road.
The current testing environment is in AWS, using Terraform, Docker, Selenium, Malleus Jitsificus, Maven, and Java on AWS ECS with AWS Fargate. I am currently capped at 1,000 simulated users by spot-instance capacity limits, though I have put in a request to raise that to 5,000, pending AWS approval.
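For anyone wanting to reproduce a small slice of this locally before going to ECS/Fargate, a rough sketch of the idea is below. These are the standard Selenium 3 grid images, not my actual Fargate task definitions; the image tags and sizing are illustrative only.

```sh
# Minimal local stand-in for the ECS/Fargate grid: one Selenium hub plus one
# Chrome node on a shared Docker network. Malleus Jitsificus is then pointed
# at the hub (http://localhost:4444/wd/hub). Tags and sizing are illustrative.
docker network create grid
docker run -d --net grid --name selenium-hub -p 4444:4444 selenium/hub:3.141.59
docker run -d --net grid -e HUB_HOST=selenium-hub --shm-size=2g selenium/node-chrome:3.141.59
```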
- ChromeDriver version: 90.0.4430.85
- org.jitsi:jitsi-meet-torture 1.0-SNAPSHOT
- maven-resources-plugin:2.6
- maven-compiler-plugin:3.7.0
- maven-surefire-plugin:2.20:test
The real-world testing laptops are both Lenovo ThinkPad X1 Carbon Gen 8, Model 20U9-005LUS: Intel Core i7-10610U @ 2304 MHz, 4 cores, 8 logical processors, 16 GB RAM.
- 1 with Windows 10 Pro 10.0.19042 Build 19042
- 1 with Ubuntu Linux 20.04.2 LTS - kernel 5.6.0-1056-oem #60 Ubuntu SMP x86_64
The simulated ChromeDriver users were each allocated 512 / 1024....
Current jitsi core components versions are:
- jitsi-meet/stable,now 2.0.5870-1 - WebRTC JavaScript video conferences
- jitsi-meet-prosody/stable,now 1.0.4985-1 - Prosody configuration for Jitsi Meet
- jitsi-meet-turnserver/stable,now 1.0.4985-1 - Configures coturn to be used with Jitsi Meet
- jitsi-meet-web/stable,now 1.0.4985-1 - WebRTC JavaScript video conferences
- jitsi-videobridge2/stable,now 2.1-492-g5edaf7dd-1 - WebRTC compatible Selective Forwarding Unit (SFU)
- jigasi/stable,now 1.1-178-g3c53cf6-1 - Jitsi Gateway for SIP
I will be adding additional Jitsi-related components in later test scenarios.
The tests are VERY basic at this point and not quite real-world equivalent (again, starting with basic baselines with as few confounds as possible).
For the initial baseline I am running the same testing scenario many times with different configurations before modifying the test scenarios. The baseline scenario is that each room has 10 participants/users, with 1 of the 10 automated simulated users sending a video loop simulating the primary speaker as the sole video sender; all other participants are in the room but not broadcasting any audio, video, or text chat.
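For reference, a baseline run is kicked off with roughly the invocation sketched below (flag names as in jitsi-meet-torture's malleus.sh; the hub and server URLs are placeholders, and exact flags may differ by version):

```sh
# Roughly how a 10-room baseline run is launched: 10 participants per room,
# 1 video sender, no audio senders, 5-minute duration, rooms named loadtest0..N.
# URLs are placeholders; check malleus.sh in jitsi-meet-torture for the
# authoritative flag list for your version.
./scripts/malleus.sh \
  --conferences=10 \
  --participants=10 \
  --senders=1 \
  --audio-senders=0 \
  --duration=300 \
  --room-name-prefix=loadtest \
  --hub-url=http://selenium-hub.example.internal:4444/wd/hub \
  --instance-url=https://jitsi.example.com
```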
Room 0 (/loadtest0) also has 1-2 additional real-world users: my Windows laptop and my Linux laptop, in the room to observe its behavior and the quality of the video. I have started with extreme simplicity by testing video-only with no audio (other than my 2 laptops sending/receiving video/audio). This is far from a full real-world scenario, but it is a starting baseline. Once this very basic test case is working, with established repeatable data for up to 5,000 users, I will repeat subsets of the tests with increasingly complex feature sets and use-cases, incrementally adding complexity (and load), such as:
- Increase the number of rooms until reaching 500 rooms, for a total of 5,000 participants (or whatever AWS sets as the final limit, since I have run into a 1,000-instance limit and am awaiting their response to my request to increase it to 5,000)
- Increase the number of participants per room in 10-user increments as far as possible
- Incrementally increase the number of other video senders from 1 to 10 simultaneously per room
- Audio, starting with 1 user and then adding more (building to what will become a cacophony)
- Closed captions (cc) in an increasing number of rooms, starting at just 1
- Recording audio/video in an increasing number of rooms, starting at just 1
- Increase the duration of the tests from 5 minutes, incrementally up to 24 hours sustained
- And any other scenarios as needed, and as time, budget, employer/client directive, etc., allow.
So far testing has gone smoothly with the most basic scenario (default all-in-one server install with no optimizations, 10 users per room, 1 video sender per room, no audio, no cc, no recording); these are just highlights from far more detailed data:
- t3a.medium (2 cpu, 4 GB ram), 22 participants: smooth, text in the video is clear, no issues.
- t3a.medium, 32 participants: some initial blurriness in the sender video for about 30 seconds, then it clears up.
- t3a.medium, 102 participants: sender video blurry at first and turned on and off by Jitsi "to save bandwidth" until about 112 seconds in, then it finally stabilized and stayed clear; the rest of the session went smoothly.
- t3a.medium, 202 participants: sender video cleared up within the first 28 seconds and went smoothly for about 3 minutes; after that, large screen changes would occasionally be slightly blurry for 1-2 seconds before clearing up. Adding more users beyond this immediately started causing users to drop.
This is decent for an un-tuned all-in-one install on a very inexpensive instance!
The limit started to become apparent around 600 participants.
This was first noticed with an understandably CPU-heavy load on an AWS:
- c5a.xl (4 cpu, 8 GB ram) instance, with a 56% peak cpu load during 5 minute tests.
- Increasing to c5a.2xl (8 cpu, 16 GB ram) cleared up the video completely; it still had a cpu load of around 50%, and unfortunately participants kept dropping every time I tried to go above ~600 participants total.
- Increased to c5.4xl (16 cpu, 32 GB ram): only 21% peak cpu load, but still a total of 25-50 drops (a participant disconnects and tries to reconnect) during a 5-minute test.
Out of curiosity, and to gather the data, I incrementally increased to:
- 9xl
- 12xl
- 16xl
- and finally 24xl (96 cpu and 192 GB ram, 6.5% peak cpu load), which made for crystal clear, smooth video and faster rejoins after dropping, but the drop counts actually became worse (probably because users could rejoin faster).
I have posted a summary of this in the Jitsi Community forum and hope that together we can figure out how to resolve this soon. I will update this page when a solution is found.
See the discussion thread here:
Happy Jitsi-ing!
-Hawke
Thank you very kindly for the helpful response. I might need a little guidance on gathering some of this information the first time around if you wouldn't mind.
This is a default install with no special modifications on Ubuntu 20.04; all version numbers are listed at this link:
https://www2.techtalkhawke.com/news/standalone-all-in-one-jitsi-server-max-capacity-600-users-no-matter-how-much-hardware-at-least-in-aws
But here they are copy/pasted:
* jitsi-meet/stable,now 2.0.5870-1 - WebRTC JavaScript video conferences
* jitsi-meet-prosody/stable,now 1.0.4985-1 - Prosody configuration for Jitsi Meet
* prosody/focal,now 0.11.4-1 amd64
* jitsi-meet-turnserver/stable,now 1.0.4985-1 - Configures coturn to be used with Jitsi Meet
* jitsi-meet-web/stable,now 1.0.4985-1 - WebRTC JavaScript video conferences
* jitsi-videobridge2/stable,now 2.1-492-g5edaf7dd-1 - WebRTC compatible Selective Forwarding Unit (SFU)
* jigasi/stable,now 1.1-178-g3c53cf6-1 - Jitsi Gateway for SIP
Here is the information I have so far in relation to your response:
RE: Prosody:
I was aware that Prosody being single-threaded is a concern, but I didn't know it now kicks in below 1,000 users.
It looks like the default install doesn't specify one, so I assume it is using the default network_backend = "select". I do see a suggestion in prosody.cfg.lua to enable use_libevent.
To clarify, you are recommending the newer
```
network_backend = "epoll"
```
be put into that config file, correct?
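Concretely, my plan would be something like the sketch below (please correct me if the line belongs somewhere other than the top-level/global section):

```sh
# Add network_backend = "epoll" to the global section (top) of the default
# config and restart Prosody. Path and service name as on a stock Ubuntu
# jitsi-meet install; back up the file first.
sudo cp /etc/prosody/prosody.cfg.lua /etc/prosody/prosody.cfg.lua.bak
sudo sed -i '1i network_backend = "epoll"' /etc/prosody/prosody.cfg.lua
sudo systemctl restart prosody
```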
Regarding breaking out the graphing details by task: right now I'm only using the default AWS CloudWatch graphs (since this was primarily to figure out AWS costs), but I have a Grafana server I could attach, unless there is an alternative open-source solution you would recommend?
There are a lot of 1-5 year-old threads on monitoring, but they are all for much older versions (so many great improvements in a year!).
Do you have an up-to-date link/resource you would particularly recommend for gathering the Prosody-specific information and/or for performance tuning?
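In the meantime, about the only per-component numbers I can easily pull on the box are the videobridge's own stats, along the lines of the sketch below (this assumes the JVB private REST API has been enabled, which it is not in a default install):

```sh
# Dump the videobridge's statistics (participants, bitrates, packet loss, etc.)
# as JSON; assumes the private REST API is enabled and listening on its default
# local port.
curl -s http://localhost:8080/colibri/stats | python3 -m json.tool
```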
I was in the recent Jitsi hackathon and am excited about what is in the pipeline. When you say the latest version, which version do you consider recent enough for this issue?
Is what is listed above new enough, or do I need to pull something less stable? I am supposed to perform my analysis on stable for the capacity planning if at all possible. Configuration modifications are okay, but I am expected to avoid non-stable branches where possible, and absolutely not to use custom-compiled builds for this baseline data. For production down the road we can consider such options, but for this baseline testing I may not.
Some details from the testing Jitsi server...
File limits configs:
```
$ cat /proc/sys/fs/file-max
9223372036854775807

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127000
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127000
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

$ cat /proc/$(cat /var/run/jitsi-videobridge/jitsi-videobridge.pid)/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             65000                65000                processes
Max open files            65000                65000                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       127000               127000               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

$ prlimit
RESOURCE   DESCRIPTION                             SOFT       HARD UNITS
AS         address space limit                unlimited  unlimited bytes
CORE       max core file size                         0  unlimited bytes
CPU        CPU time                           unlimited  unlimited seconds
DATA       max data size                      unlimited  unlimited bytes
FSIZE      max file size                      unlimited  unlimited bytes
LOCKS      max number of file locks held      unlimited  unlimited locks
MEMLOCK    max locked-in-memory address space  67108864   67108864 bytes
MSGQUEUE   max bytes in POSIX mqueues            819200     819200 bytes
NICE       max nice prio allowed to raise             0          0
NOFILE     max number of open files                1024    1048576 files
NPROC      max number of processes               127000     127000 processes
RSS        max resident set size              unlimited  unlimited bytes
RTPRIO     max real-time priority                     0          0
RTTIME     timeout for real-time tasks        unlimited  unlimited microsecs
SIGPENDING max number of pending signals         127000     127000 signals
STACK      max stack size                       8388608  unlimited bytes
```
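If that 1024 soft open-files limit shown by prlimit turns out to matter for Prosody, my inclination (please sanity-check) would be a systemd drop-in rather than editing the unit file directly; the value below is just illustrative:

```sh
# Illustrative only: raise the open-files limit for the prosody service via a
# systemd drop-in, then check what the running process actually received.
sudo mkdir -p /etc/systemd/system/prosody.service.d
printf '[Service]\nLimitNOFILE=65000\n' | sudo tee /etc/systemd/system/prosody.service.d/limits.conf
sudo systemctl daemon-reload
sudo systemctl restart prosody
grep 'open files' /proc/"$(pgrep -o -f /usr/bin/prosody)"/limits
```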
I have watched that video before; thank you for the refresher. While it is helpful for the high-level design and planning of the clustered high-capacity setups we're planning soon, unfortunately it does not give the specific details needed to make the tweaks that increase the nginx and Prosody capacities; it only mentions that they made the tweaks and that things improved, without the specifics.
It appears to be using the default BOSH; is this the best resource you would recommend for switching to WebSockets?
Given the stats posted above, with a goal of 1,000 users (currently stuck at 600), do you have suggestions for ballpark numbers to try as a starting point?
Thanks kindly!