Real-time notification stops working
Hello!
I've deployed Talkyard for my community, but after a few seconds the real-time notifications stop working and I need to refresh the page.
I've checked the logs and I see a "client prematurely closed connection while sending to client" message right around the time the browser stops receiving notifications.
I tried to work out where it might be coming from, but because I don't yet fully understand Talkyard or where to look, I couldn't.
Could you help me?
Thanks!
- KajMagnus @KajMagnus2018-06-29 12:15:58.053Z
Hello Tiago! Thanks for including the error log message, i.e. "client prematurely ...". Apparently it's from Nchan, the Nginx module that's used for real-time notifications. I'll look into this now & during the weekend ... I think this problem happens to me too sometimes. (Interesting that you tried to check where the message comes from. What's your background, if I may ask? Do you do software development sometimes?)
- TTiago @Tiago
Actually yes, sometimes. I am doing my Master in Machine Learning and Deep Learning.
I thought about Nchan, but I'm still trying to understand how everything works together ahah.
- KajMagnus @KajMagnus2018-06-30 06:22:59.541Z
Ok :- ) that sounds interesting. I did just a little bit of neural networks long ago, by the way ... before Deep Learning happened.
I've fixed the bug now (or so I think), and it works when I test on localhost. I'll release a new version in a few days and then live notifications should work again.
(The reason for the bug is that long ago I changed from jQuery.ajax to Bliss.fetch, and didn't notice that after that, theRequest.abort() no longer invoked [an error callback in which the next long polling request got sent]. ... So, after the first long polling request, no more long polling requests got sent :- P )
- TTiago @Tiago
Oh nice! It's amazing if we compare the uses it had before and how the architectures evolved and got ultra complex. You get a bit "how the hell does this work?" ahah
Oh, so it was on the client side. I would have spent a few days on that one (I was still reading and understanding Nchan ahaha).
Thank you very much :D
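For readers following along, the client-side failure described above (an aborted long-polling request that never triggers the callback which sends the next request) can be sketched roughly as follows. This is not Talkyard's code: `startLongPolling`, `scriptedTransport`, and the `PollOutcome` type are hypothetical names made up for illustration, and the fake transport stands in for a real HTTP library.

```typescript
type PollOutcome = "message" | "aborted" | "error";
type OnDone = (outcome: PollOutcome) => void;
// Hypothetical transport: performs one long-polling request,
// then reports how that request ended.
type Transport = (onDone: OnDone) => void;

// Sends long-polling requests back to back, up to `limit` requests.
// Returns a counter object so callers can see how many requests went out.
function startLongPolling(transport: Transport, limit: number): { sent: number } {
  const stats = { sent: 0 };
  const sendNext = (): void => {
    if (stats.sent >= limit) return;
    stats.sent += 1;
    // Every outcome, including "aborted", must lead to the next request.
    transport(() => sendNext());
  };
  sendNext();
  return stats;
}

// Fake transport that reports the scripted outcomes in order,
// and "message" once the script runs out.
function scriptedTransport(outcomes: PollOutcome[], reportAborts: boolean): Transport {
  let i = 0;
  return (onDone) => {
    const outcome = outcomes[i++] ?? "message";
    // With reportAborts = false, an aborted request never calls back,
    // which mimics the failure mode described in the thread.
    if (outcome === "aborted" && !reportAborts) return;
    onDone(outcome);
  };
}

const script: PollOutcome[] = ["message", "aborted", "message"];
const silentAborts = startLongPolling(scriptedTransport(script, false), 5);
const reportedAborts = startLongPolling(scriptedTransport(script, true), 5);
console.log(silentAborts.sent, reportedAborts.sent); // prints: 2 5
```

When the abort is swallowed, polling stops right after the aborted request; when every outcome is reported, the loop keeps going until the limit.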
- KajMagnus @KajMagnus2018-07-20 15:25:31.319Z
(Sorry for the late reply.) It seems the above-mentioned fix wasn't the only problem. Live notifications now work when I test on localhost, but when the server has been up and running for a while, they apparently stop working. My best guess right now is that there's a bug in Nchan. The Nchan author has been coding a lot lately and fixing bugs, and says he'll release a new version soon, in about a week. So I'll upgrade to that new version and see if live notifications start working properly then.
- In reply to Tiago⬆: KajMagnus @KajMagnus2018-08-02 13:34:47.606Z
@Tiago Turns out there's another problem too: there's a segfault (C code crash) in an Nginx worker process, from inside a Lua module. When the worker process suddenly exits, Nchan's internal state gets messed up, and notifications stop working. I posted about this yesterday over at GitHub, in the Lua module repo: https://github.com/openresty/lua-nginx-module/issues/1361
- Progress with handling this problem
- KajMagnus @KajMagnus2018-10-15 14:31:37.813Z
This is still a problem — i.e. the Nginx worker segfault mentioned above. Recently the Nchan author fixed a worker crash; hopefully it's the same crash. I'll upgrade to the new version of Nchan:
https://github.com/slact/nchan/blob/master/changelog.txt
1.2.2 (Oct. 9 2018) ... fix (security): subscriber may erroneously receive a 400 Bad Request or crash a worker based on data from a previous subscriber
- KajMagnus @KajMagnus2019-02-07 18:40:57.006Z
I think this has been fixed now ... after 7 months :- P. I changed Nginx to use only 1 worker, and that reportedly avoids the problem. If a worker crashes, Nginx will somehow reset its state, if there's just one single worker. I haven't noticed any problems since changing to 1 worker, almost a month ago.
1 worker is more than fast enough. Nevertheless, the long-term plan is to remove Nchan altogether. In Talkyard's case, I think it's not really needed. Instead I have in mind to use Server-Sent Events and HTTP/2 directly from Play Framework.
From a GitHub issue: https://github.com/slact/nchan/issues/477#issuecomment-452848234
I wrote:
@neben I have this problem too, that Nchan in effect stops working after a worker crash, and can stay broken until the next restart, which might not be until weeks later (no live notifications until then). How do you detect a crash and send a SIGHUP? You don't happen to have a reusable script or something?
( @slact I suppose it'd be impossibly much work to do this, but anyway, there's Rust for Nginx: https://github.com/nginxinc/ngx-rust (hmm only a proof of concept though) — maybe Rust could be a way to fix all crashes once and for all ... except that ... impossibly much work to port to Rust I suppose.)
@ivanovv replied:
@kajmagnus the easiest fix for this is to have only one worker, then the nginx master process will auto restart it and everything works again. I guess that needs to go into the README, as it is a non-obvious thing and many have had prod servers stuck
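As a rough illustration of the single-worker setup discussed in this post: in a stock Nginx configuration, the number of worker processes is set with the worker_processes directive in the main (top-level) context. This thread does not show where Talkyard's own Nginx config keeps this setting, so treat the fragment below as an assumption about a generic nginx.conf rather than Talkyard's actual file.

```nginx
# Top level of nginx.conf (main context), outside the http { ... } block.
# One worker process only; per the thread, the master process restarts it
# after a crash and notifications recover.
worker_processes 1;
```

After changing the value, Nginx needs a reload or restart for it to take effect.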