
Forum is down (edit: discussion about downtime and update design)

By Christian Scheuer @chrscheuer
    2021-02-25 20:59:45.759Z

    Our site is down!

    Solved in post #3
    • 15 replies

    1. Christian Scheuer @chrscheuer
        2021-02-25 21:06:08.250Z

        It appears to be working again. @KajMagnus are there any plans to make updates in a way that doesn't cause the entire site to go down?

        1. Hmm it was a regular upgrade to the next version, about 1 – 2 minutes downtime.

          This makes me start thinking that I should announce these things beforehand — so people won't get surprised.

          Sorry about that.

          Ok so I'll need to announce both upcoming versions, in the Announcements category
          — and also when the upgrades will actually happen.

          1. In reply to chrscheuer:

            make updates in a way that doesn't cause the entire site to go down?

            In the near term:

            Instead, I'll have to do the upgrades when most people won't notice; after all, it's only a minute or two.
            That would be combined with announcements before the upgrades happen.
            But I understand if this is disconcerting (is that the right English word?) when one is currently working in the forum and it suddenly disappears, one refreshes the page, and it's still not there.

            1. Christian Scheuer @chrscheuer
                2021-02-25 21:12:49.616Z

                Yea, announcements are only a temporary band-aid; they're not a sustainable fix.

                As the number of users on the forum(s) grows, having a zero-downtime solution is a must-have. All of our services are designed for zero downtime: we do rolling updates of all backend services, with load balancers ensuring there's never any downtime.

                As you said, the users won't know that the downtime is just a couple of minutes, and since we have a global audience, there are almost always people logged on. So there's not really a good time. Evening time in Europe right now (22:00) is prime time for all of our American customers. It makes us look really bad that our forum goes down in the middle of people's work day.
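To make the rolling-update idea above concrete, here is a minimal sketch in Python. It is a toy simulation, not anyone's actual deployment code: the `Node` class, the node names, and the `rolling_update` function are all hypothetical. It only illustrates the invariant a load balancer relies on, namely that one node at a time is taken out of rotation, upgraded, and put back, so some nodes are always serving.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str                  # hypothetical node name
    version: str
    in_rotation: bool = True   # whether the load balancer sends traffic here


def rolling_update(nodes: list[Node], new_version: str) -> list[int]:
    """Upgrade nodes one at a time; return how many nodes were serving at each step."""
    if len(nodes) < 2:
        raise ValueError("a rolling update needs at least two nodes")
    serving_counts = []
    for node in nodes:
        node.in_rotation = False     # drain: the balancer stops routing to this node
        serving_counts.append(sum(n.in_rotation for n in nodes))
        node.version = new_version   # upgrade while the node receives no traffic
        node.in_rotation = True      # assume the health check passes; back in rotation
    return serving_counts


nodes = [Node("app-1", "v1"), Node("app-2", "v1"), Node("app-3", "v1")]
counts = rolling_update(nodes, "v2")
print(counts)                        # number of serving nodes during each upgrade step
print([n.version for n in nodes])    # versions after the update
```

With three nodes, two keep serving during every step. With a single node the function refuses to run, which matches the situation discussed in this thread: one app server cannot be upgraded without a gap.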

                1. In reply to KajMagnus:

                  @chrscheuer and long term:

                  Yes definitely. However, having two application servers (so they can be upgraded one at a time) can make the tech stack "a lot" more complicated, which could instead increase the overall risk of downtime.

                  But one thing that now comes to mind is that, with two application servers, one could always be read-only. Then data write races couldn't happen.

                  And when upgrading, Nginx would redirect all traffic to the read-only app server, which would then say "Please wait a minute and try again. The server is being upgraded."
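One possible reading of the read-only stand-by idea, sketched in Python. This is not Talkyard's or Nginx's actual behavior; `Backend`, `route`, and the choice to keep serving reads while rejecting writes with a 503 are assumptions made for illustration. The post only says the stand-by would show a "please wait" message.

```python
from enum import Enum

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
UPGRADE_MESSAGE = "Please wait a minute and try again. The server is being upgraded."


class Backend(Enum):
    PRIMARY = "primary"       # the normal read-write app server
    READ_ONLY = "read-only"   # stand-by that never writes to the database


def route(method: str, upgrading: bool) -> tuple[Backend, int, str]:
    """Return (backend, HTTP status, message) for a single request."""
    if not upgrading:
        return Backend.PRIMARY, 200, ""
    if method.upper() in WRITE_METHODS:
        # The stand-by cannot write, so the client is asked to retry shortly.
        return Backend.READ_ONLY, 503, UPGRADE_MESSAGE
    # Assumption: reads can still be answered from the stand-by.
    return Backend.READ_ONLY, 200, ""


print(route("GET", upgrading=False))
print(route("GET", upgrading=True))
print(route("POST", upgrading=True))
```

Because only one server ever writes, there is no window in which two app versions race on the same data, which is the point made above.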

                  1. Christian Scheuer @chrscheuer
                      2021-02-25 21:18:51.649Z

                      Yea I know. We've steered away from Kubernetes so far (which would be the solution), mainly because it's more complex, so I know the feeling.
                      We are using Cloud Run for our front-end APIs because they are essentially Docker containers that Google automatically scales on top of GKE (the Kubernetes engine). This means we don't have to worry about the complexity. We pay more to Google, but they handle scaling and uptime.
                      Providing always-on cloud services is always gonna involve some complexity, but there are a lot of products available on Google, AWS and Azure to help with exactly these types of issues :)

                      Yea, agreed, having a "stand-by while updating" read-only app server would be one way to solve the issue.
                      Ideally, IMO, you should factor scaling in as well. Imagine the server gets 100x the traffic. How could you scale it up so that more application nodes can work simultaneously?
                      This is sort of what Kubernetes (or something like the aging Docker Swarm) would allow, keeping X replicas of each node type. Then, obviously, the Postgres layer may be a bottleneck, but I imagine that wouldn't need to be updated too often.
                      While not necessarily something you'd wanna do here and now, I think the design should take it into account long term.

                      1. Christian Scheuer @chrscheuer
                          2021-02-25 21:25:10.418Z

                          Perhaps another band-aid would be to set up a load balancer or reverse proxy in front of the current app server that serves a nicer-looking error message. That way you'd never get unanswered connections; instead it would send a 500 response with some pretty HTML that says "Sorry, we're updating the server. This will just take a minute, please come back."
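A rough sketch of that fallback logic in Python, to show the moving parts. The function names and HTML are made up for illustration, and a real setup would more likely configure this in the proxy itself than in application code. Note one substitution: the sketch returns 503 Service Unavailable with a `Retry-After` header rather than the 500 mentioned above, since 503 is the status HTTP defines for temporary unavailability.

```python
import socket

MAINTENANCE_HTML = (
    "<!doctype html><title>Updating</title>"
    "<h1>Sorry, we're updating the server.</h1>"
    "<p>This will just take a minute, please come back.</p>"
)


def upstream_is_up(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to the app server can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def fallback_response(retry_after_seconds: int = 60) -> tuple[int, dict[str, str], str]:
    """The response the proxy sends itself when the app server is unreachable."""
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Retry-After": str(retry_after_seconds),  # hint for clients and crawlers
    }
    return 503, headers, MAINTENANCE_HTML


status, headers, body = fallback_response()
print(status, headers["Retry-After"])
```

The proxy would call something like `upstream_is_up` (or simply catch the failed upstream connection) and answer with `fallback_response()` instead of leaving the browser with an unanswered connection.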

                          1. Yes, that's a good idea

                            1. In reply to chrscheuer:
                              Christian Scheuer @chrscheuer
                                2021-02-25 21:29:50.013Z

                                By the way, I think the downtime was less visible on previous occasions. I believe I've experienced it before without noticing as big a UI issue as I did just now.
                                The reason this time was that the "will save draft XXX" text turned red and said something like "You're offline". That made me refresh the page (because I thought my Wi-Fi connection was bad), which showed me that the server was fully down.
                                If the "You're offline" text had been less intrusive or alarming, or had said, for example, "The server is busy for a bit, please wait", then I wouldn't have refreshed the page but just waited.
                                So there are some changes that could be made client side too :)

                                Anyway, just ideas. I understand it may feel stupid to put that much effort into something that is, as you said, just 1-2 minutes, but as our user base grows, this is increasingly something that would worry me.

                              • In reply to chrscheuer:

                                Related to Kubernetes: @ChrisEke wrote Kubernetes installation instructions for Talkyard a while ago:

                                1. Christian Scheuer @chrscheuer
                                    2021-02-25 21:35:01.848Z

                                    Wow, thank you! Lots of interesting stuff in there!
                                    Maybe I have to dive in and fully understand K8s some day soon haha. It certainly would be cheaper, but the last thing I want to spend time on is devops.
                                    I'm saying this because, for a long while, I considered hosting more of our own services on K8s instead of using 100% managed services.
                                    The thing is just, 100% managed works really well for a startup that should be focusing its energy on product, not on devops.

                                    1. Ok yes definitely sounds like a good choice in SoundFlow's case

                                      And it could have been for Talkyard too, I imagine, ...
                                      ... had Ty not been open source, which means it cannot integrate so tightly with one hosting provider's managed services (except for managed K8s).

                                      1. Christian Scheuer @chrscheuer
                                          2021-02-26 11:45:42.932Z

                                          Here's a view of our load over an average 24-hour span. Times are in PST (UTC-8).

                                          As you can see, it's hard to find any time that's good for downtime, but if we had to pick, it would be around 10 PM (22:00) PST, which is 06:00 UTC or 07:00 CET. That is late evening in the Americas and before the workday starts in Europe.
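The time-zone arithmetic above can be checked with a few lines of Python. This is only a worked example: the date is picked arbitrarily for illustration, and PST and CET are modeled as fixed offsets (UTC-8 as stated in the post, UTC+1 for CET), so daylight saving time is ignored.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")  # fixed offset, as stated in the post
CET = timezone(timedelta(hours=1), "CET")   # fixed offset; ignores summer time

# 22:00 PST on an arbitrary example date.
window_pst = datetime(2021, 2, 26, 22, 0, tzinfo=PST)
window_utc = window_pst.astimezone(timezone.utc)
window_cet = window_pst.astimezone(CET)

for moment in (window_pst, window_utc, window_cet):
    print(moment.strftime("%Y-%m-%d %H:%M %Z"))
```

The conversion crosses midnight: 22:00 PST on one day is 06:00 UTC and 07:00 CET on the following calendar day, which matters when announcing the upgrade date to a global audience.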

                                          1. What about the weekends? Saturday or Sunday morning CET for example.

                                            Thanks for the graph, interesting to see. I'm surprised the activity & load is so evenly distributed, I previously thought the Pacific Ocean would have been more visible (just the small dip you mention at 22:00 PST)

                                            1. Christian Scheuer @chrscheuer
                                                2021-02-26 18:03:28.995Z

                                                Yea, it truly is a connected world we live in :)
                                                Oh yea, great point about weekends. It is slightly better for us there:

                                                This is another graph, where the two small peaks towards the left on the blue line are Saturday and Sunday. So the peaks are smaller on the weekend, but the valleys around 07 UTC are not any lower on the weekend.
                                                So that would mean - for us, at least - European mornings around 07 UTC are the best time to do it, and preferably Saturday or Sunday mornings in case of any extended downtime. But really it could be any morning.

                              • Progress
                                with handling this problem
                              • @KajMagnus marked this topic as Done 2021-03-18 13:33:23.743Z.
                              • @KajMagnus marked this topic as New 2021-03-18 13:33:31.522Z.