Forum is down (edit: discussion about downtime and update design)

By Christian Scheuer @chrscheuer

2021-02-25 20:59:45.759Z

Our site is down!
https://forum.soundflow.org

Solved in post #3

15 replies

Christian Scheuer @chrscheuer
2021-02-25 21:06:08.250Z
It appears to be working again. @KajMagnus are there any plans to make updates in a way that doesn't cause the entire site to go down?
1. KajMagnus @KajMagnus
  core-dev
  support-team
  2021-02-25 21:08:52.622Z
  Hmm it was a regular upgrade to the next version, about 1 – 2 minutes downtime.
  
  This makes me start thinking that I should announce these things beforehand — so people won't get surprised.
  
  Sorry about that.
  
  Ok so I'll need to announce both upcoming versions, in the Announcements category
  — and also when the upgrades will actually happen.
  1 Like, Solution
2. In reply to chrscheuer:
  KajMagnus @KajMagnus
  core-dev
  support-team
  2021-02-25 21:10:13.778Z
  
  make updates in a way that doesn't cause the entire site to go down?
  
  In the near term:
  
  Instead, I'll have to do the upgrades at times when most people won't notice; after all, it's just a minute or two.
  Combined with announcements, before the upgrades happen.
  But I understand if this is disconcerting (is that the right English word?) if one is currently working in the forum, and then it suddenly disappears, ... one refreshes the page ... and it's still not there.
  1 Like
  Christian Scheuer @chrscheuer
  2021-02-25 21:12:49.616Z
  Yea, announcements are only a temporary band-aid - they're not a sustainable fix.
  
  As the number of users on the forum(s) grows, having a zero-downtime solution is a must-have. All of our services are designed for zero downtime so that we have rolling updates of all backend services with load balancers ensuring there's never any downtime.
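A minimal sketch of the rolling-update idea mentioned above, not SoundFlow's or Talkyard's actual setup. The `Replica` class and `rolling_update` function are hypothetical names, and the "load balancer" is just a flag on each replica. The point is only that draining and upgrading one replica at a time keeps at least one replica serving throughout.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """A hypothetical application server behind a load balancer."""
    name: str
    version: str
    in_rotation: bool = True  # True while the load balancer sends it traffic


def rolling_update(replicas, new_version):
    """Upgrade replicas one at a time.

    Returns the lowest number of replicas that were serving at any step.
    """
    if len(replicas) < 2:
        raise ValueError("a rolling update needs at least two replicas")
    min_serving = len(replicas)
    for replica in replicas:
        replica.in_rotation = False      # drain: stop routing traffic here
        serving = sum(r.in_rotation for r in replicas)
        min_serving = min(min_serving, serving)
        replica.version = new_version    # upgrade while drained
        replica.in_rotation = True       # healthy again, back in rotation
    return min_serving


replicas = [Replica("app-1", "v1"), Replica("app-2", "v1"), Replica("app-3", "v1")]
lowest = rolling_update(replicas, "v2")
```

With three replicas, `lowest` ends up as 2, so the service never drops to zero serving nodes. A real setup would also need health checks and connection draining, which this sketch leaves out.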
  
  As you said, the users won't know that the downtime is just a couple of minutes - and, since we have a global audience, there are almost always people logged on. So there's not really a good time. Evening time in Europe right now (22:00) is prime time for all of our American customers. It makes us look really bad that our forum goes down in the middle of people's work day.
  
  1 Like
  In reply to KajMagnus:
  KajMagnus @KajMagnus
  core-dev
  support-team
  2021-02-25 21:15:06.822Z
  @chrscheuer and long term:
  
  Yes definitely. However, having two application servers (so they can be upgraded one at a time) can make the tech stack "a lot" more complicated, which can instead increase the overall risk of downtime.
  
  But now, one thing that comes to my mind, is that, with two application servers — one could be always read-only. Then any data write races couldn't happen.
  
  And when upgrading, Nginx would redirect all traffic to the read-only app server, which would then say "Please wait a minute and try again. The server is being upgraded".
  
  1 Like
  Christian Scheuer @chrscheuer
  2021-02-25 21:18:51.649Z
  Yea I know.. We've steered away from Kubernetes so far - which would be the solution - mainly because it's more complex, so I know the feeling.
  We are using Cloud Run for our front-end APIs because they are essentially Docker containers that Google automatically scales on top of GKE (the Kubernetes engine) - this allows us to not have to worry about the complexity. We pay more to Google, but they handle scaling and uptime.
  Providing always-on cloud services is always gonna have some bit of complexity, but there are a lot of products available on Google, AWS and Azure to help with exactly these types of issues :)
  
  Yea, agreed: having a "stand-by read-only app server while updating" would be one way to solve the issue.
  Ideally, IMO, you should factor scaling in as well. Imagine the server gets 100x the traffic. How could you scale it up so that more application nodes can work simultaneously?
  This is sort of what Kubernetes (or something like the aging Docker Swarm) would allow, keeping X replicas of each node type. Then, obviously, the Postgres layer may be a bottleneck, but I imagine that wouldn't need to be updated too often.
  While not necessarily something you'd wanna do here and now, I think the design should take it into account long term.
  
  1 Like
  Christian Scheuer @chrscheuer
  2021-02-25 21:25:10.418Z
  Perhaps another band-aid would be to set up a load balancer or reverse proxy in front of the current app server with a nicer looking error message so that you'll never get unanswered connections, but instead just send a 500 response with some pretty HTML that says - "sorry, we're updating the server. This will just take a minute, please come back"
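As a rough illustration of that band-aid, here is a small stand-in server built on Python's standard `http.server` module. It is a sketch under assumptions, not how Talkyard or Nginx actually does it: the handler name, the HTML text, the 120-second retry hint and the `make_server` helper are all made up for the example. It also answers with 503 Service Unavailable plus a `Retry-After` header rather than the 500 mentioned above, since 503 is the HTTP status meant for temporary unavailability.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical maintenance page; the wording is only an example.
MAINTENANCE_HTML = (
    b"<!doctype html><html><body>"
    b"<h1>Sorry, we're updating the server</h1>"
    b"<p>This will just take a minute, please come back.</p>"
    b"</body></html>"
)


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answers every request with a friendly 503 page."""

    def _respond(self):
        self.send_response(503)
        self.send_header("Retry-After", "120")  # assumed upgrade length, in seconds
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(MAINTENANCE_HTML)))
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(MAINTENANCE_HTML)

    do_GET = do_POST = do_HEAD = _respond

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), MaintenanceHandler)


# To run it by hand: make_server(8080).serve_forever()
```

A reverse proxy could point at something like this while the real app server restarts, so visitors get an answer instead of a refused connection.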
  
  1 Like
  KajMagnus @KajMagnus
  core-dev
  support-team
  2021-02-25 21:28:20.482Z
  Yes, that's a good idea
  
  In reply to chrscheuer:
  Christian Scheuer @chrscheuer
  2021-02-25 21:29:50.013Z
  By the way - previously, I think this was also less visible when it was down. I think I've experienced it before where I didn't notice as big a UI issue as I did right now.
  The reason was that the "will save draft XXX" text became red and said something like "You're offline". That made me refresh the page (because I thought my Wifi connection was bad), at which point I saw that the server was fully down.
  If the "You're offline" text had been less intrusive or alarming, or for example said "The server is busy for a bit, please wait", then I wouldn't have refreshed the page but just waited.
  So there are some changes that could be made client side too :)
  
  Anyway just ideas. I understand it may feel stupid to put that much effort into something that is, as you said, just 1-2 minutes - but, increasingly, as our user base grows, this is something that would worry me.
  
  In reply to chrscheuer:
  KajMagnus @KajMagnus
  core-dev
  support-team
  2021-02-25 21:27:43.719Z
  Related to Kubernetes: @ChrisEke wrote Kubernetes installation instructions for Talkyard a while ago:
  
  K8s, Swarm, Traefik and Talkyard #post-30
  I've now written a more detailed guide on how Talkyard can be deployed with k8s for anyone who might be interested: https://www.ekervhen.xyz/posts/2021-02/talkyard-on-k8s/
  1 Like
  Christian Scheuer @chrscheuer
  2021-02-25 21:35:01.848Z
  Wow, thank you! Lots of interesting stuff in there!
  Maybe I have to dive in and fully understand K8s some day soon haha. It certainly would be cheaper, but the last thing I want to spend time on is devops.
  Saying this because I considered for a long while to host more of our own services on K8s instead of using 100% managed services.
  The thing is just, 100% managed works really well for a startup that should be focusing its energy on product, not on devops.
  
  KajMagnus @KajMagnus
  core-dev
  support-team
  2021-02-25 21:47:37.406Z
  Ok yes definitely sounds like a good choice in SoundFlow's case
  
  And could have been for Talkyard too, I imagine, ...
  ... had Ty not been open source, which means it cannot integrate so much with one hosting provider's managed services (except for managed K8s).
  
  1 Like
  Christian Scheuer @chrscheuer
  2021-02-26 11:45:42.932Z
  Here's a view of our load on an average 24 hour span. Times are in PST (UTC -8).
  
  As you can see, it's hard to find any time that's good to have downtime, but if we had to pick, it would be around 10 PM (22:00) PST, which is 06:00 UTC or 07:00 CET. That is late evening in the Americas and before the workday starts in Europe.
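The conversion can be double-checked with a few lines of Python. This sketch assumes fixed offsets (PST = UTC-8, CET = UTC+1, no daylight saving time) and an arbitrary example date. The `window_in_zones` helper is a hypothetical name, not part of any tool used in this thread.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed for illustration (no daylight saving time).
PST = timezone(timedelta(hours=-8), "PST")
CET = timezone(timedelta(hours=1), "CET")


def window_in_zones(local_start):
    """Return the same instant expressed in UTC and in CET."""
    return local_start.astimezone(timezone.utc), local_start.astimezone(CET)


# Example: a maintenance window starting Friday 22:00 PST.
start = datetime(2021, 2, 26, 22, 0, tzinfo=PST)
utc_start, cet_start = window_in_zones(start)
print(utc_start.isoformat(), cet_start.isoformat())
```

Note that the window lands on the following calendar day in Europe, which matters when announcing the upgrade time to users in several regions.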
  
  KajMagnus @KajMagnus
  core-dev
  support-team
  2021-02-26 13:29:35.112Z
  What about the weekends? Saturday or Sunday morning CET for example.
  
  Thanks for the graph, interesting to see. I'm surprised the activity & load is so evenly distributed, I previously thought the Pacific Ocean would have been more visible (just the small dip you mention at 22:00 PST)
  
  1 Like
  Christian Scheuer @chrscheuer
  2021-02-26 18:03:28.995Z
  Yea – it truly is a connected world we live in :)
  Oh yea great point about weekends. It is slightly better for us there:
  
  This is another graph where you can see the two small peaks towards the left being Saturday and Sunday on the blue line. So the peaks are smaller on the weekend, but the valleys around 07 UTC are not lower on weekends.
  So that would mean - for us, at least - European mornings around 07 UTC are the best place to do it, and preferably Saturday or Sunday mornings in case of any extended downtime. But really it could be any morning.
  
Progress
with handling this problem
@KajMagnus marked this topic as Done 2021-03-18 13:33:23.743Z.
@KajMagnus marked this topic as New 2021-03-18 13:33:31.522Z.
