MSO fix terry hardware server

Posted: Wed Apr 07, 2021 5:14 pm
by Tlaltecuhtli
It crashes almost every round. If they can't provide a stable machine, you aren't getting your money's worth, MSO. You should find a new server provider; these guys suck ass, they probably use the server coolant to deep fry spring rolls and also get paid.

Re: MSO fix terry hardware server

Posted: Wed Apr 07, 2021 5:44 pm
by bobbahbrown
good afternoon tlal,

terry will hopefully be moved to new hardware soon as part of our transition to our new infrastructure.

best,
bobbah 'bee' brown

Re: MSO fix terry hardware server

Posted: Thu Apr 08, 2021 11:32 am
by iprice
Hello,

I had suspected this was something in the software myself, though I don't really know what happens behind the scenes, so I could very much be wrong here. However, I felt that the stability took a dive some time ago. The other night I noticed that server logs are not uploaded to Statbus for rounds that crashed out. I had an autism moment and realised I could scrape the index page (with a sleep 5 between gets :P) and probably produce some actual statistical evidence of this, if it really was the case.
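For anyone wanting to repeat the scrape, here is a minimal Python sketch of fetching a list of pages with a fixed pause between requests. It is not the script used for this post (that was Perl), and the function names and the injectable `fetch`/`sleep` parameters are assumptions made for illustration; the actual Statbus URLs are not given here, so they are left to the caller.

```python
import time
from typing import Callable, Iterable, List
from urllib.request import urlopen


def fetch_url(url: str) -> str:
    """Fetch one page and return its body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_url,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in order, pausing between requests (not before the first)."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # the "sleep 5 between gets"
        pages.append(fetch(url))
    return pages
```

Passing `fetch` and `sleep` in as parameters keeps the loop testable without touching the network or actually waiting.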

I wrote a shitty Perl script to process the index pages I scraped, and maybe it's worth someone writing a proper server-side version of this that interrogates ... the database (?) to produce something similar. In the end I plotted a graph of "server log hours missing per week per server" for the 4 main servers (plotting per day was just an ugly mess), and this does sort of confirm that something changed about 2 months ago that caused Terry's stability to deviate from the other servers.

Bagil also had an instability spike, but Terry has become a consistent winner where it used to follow the same trend as the other servers.

Image

Edit: it's not a perfect measure of stability. Some rounds don't have start or end times, just a duration; I think these are ones where admins restarted before the round began, or something similar. The data also doesn't include restart or pre/post-round time, and I'm not really sure where the web page gets its start/end dates from. Since this correlated very strongly with what I'd already suspected, there's probably /some/ value in the data, but it shouldn't be taken as a firm measure of stability, merely a relative value: most other things that cause missing data should 'average out' across the servers to create comparable lines, as Sybil and Manuel tend to have. I suspect Sybil's higher count over Manuel is probably related to round duration; I expected Manuel to be lowest on this graph, probably due to longer average rounds.
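As a rough illustration of the "server log hours missing per week per server" figure, here is a minimal Python sketch. It is not the Perl script mentioned above. It assumes each logged round is available as a `(server, start, end)` tuple, treats any gap between the latest logged end and the next logged start on the same server as missing time, and files each gap under the ISO week in which it begins. All of those choices, and the function name, are assumptions for the example; note that normal restart time between rounds also counts as a gap here.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

Round = Tuple[str, datetime, datetime]  # (server, start, end) of a round that has logs


def missing_hours_per_week(rounds: Iterable[Round]) -> Dict[Tuple[str, int, int], float]:
    """Sum un-logged hours per (server, ISO year, ISO week)."""
    by_server = defaultdict(list)
    for server, start, end in rounds:
        by_server[server].append((start, end))

    totals: Dict[Tuple[str, int, int], float] = defaultdict(float)
    for server, spans in by_server.items():
        spans.sort()
        latest_end = spans[0][1]
        for start, end in spans[1:]:
            if start > latest_end:
                gap_hours = (start - latest_end).total_seconds() / 3600
                iso = latest_end.isocalendar()  # (year, week, weekday)
                totals[(server, iso[0], iso[1])] += gap_hours
            # Track the furthest end seen so overlapping spans don't create fake gaps.
            latest_end = max(latest_end, end)
    return dict(totals)
```

Plotting one line per server from the resulting totals would give a relative comparison only, for the reasons given in the edit above.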

Thanks,
Iain

Re: MSO fix terry hardware server

Posted: Thu Apr 08, 2021 11:29 pm
by oranges
Cool analysis but you could have just looked at
https://status.tgstation13.org/

I'd say 99% of the stability problems are due to the database being in the US and the server being in the EU

Re: MSO fix terry hardware server

Posted: Fri Apr 09, 2021 12:15 am
by iprice
oranges wrote:Cool analysis but you could have just looked at
https://status.tgstation13.org/

I'd say 99% of the stability problems are due to the database being in the US and the server being in the EU


Sure, but uptime isn't the same as completing a round. The server's up for all but the 5 minutes it (usually) takes to reboot normally, but the hours of lost logs show the rounds that failed to end properly :)

Re: MSO fix terry hardware server

Posted: Fri Apr 09, 2021 7:34 am
by oranges
interesting, I believe the keyholders really only feel/notice the visible outages, so any crashes that immediately recover into a new round we're probably not aware of.

Re: MSO fix terry hardware server

Posted: Fri Apr 09, 2021 12:37 pm
by iprice
Right, a lot of the time it just bounces back on its own (I assume), quite quickly. There are rounds where, e.g., the BYOND client connects to the server, renders some graphically corrupt display and immediately goes into "server not responding" mode; these usually take longer to recover from and need someone to ping the key holders. These sorts of crashes have "always" been a thing, but rarer (during my time anyway). Overall, though, the stability felt like it had gotten worse recently versus, say, the start of the year or last year.

What I wanted to avoid was just being another person stating opinion as fact, or being lost amongst hyperbole such as "every round dies all the time" sort of comments. Plus, I hoped that finding the period where it started to decline may help some diagnostics: maybe something changed around this point, maybe merges could be reviewed (though I'd have thought those merges would have affected other servers too by this point; while Terry does get the test code, test code is only test code for so long).

It does sort of correlate with my opinion that it's been worse for a couple of months. The round failure rate is somewhere between 1 in 4 and 1 in 2, depending on some unknown factors, and the average lost hours per day versus active hours (ignoring the dead shifts when most people are asleep) does show a lot of lost rounds during prime time.

Maybe there's a better way to present this data, such as a timeline of the gaps between rounds, which may illustrate a higher concentration at various times rather than averaging everything across a whole week, but mostly I just wanted to evidence things a bit better to try to draw some more attention to the issue.

Re: MSO fix terry hardware server

Posted: Fri Apr 09, 2021 4:27 pm
by iprice
Okay, here's a visualisation of the data in a different and more useful way.

I made each "row" 5 pixels tall as otherwise the image is a bit squashed and hard to visually parse.

The top left of each image is November 1st 2020, and each row represents 24 hours, so one day per row. The last 60 days (2 months) have a green background, while all the rest have a white background, and then I drew a black bar for each period that contains server logs. Thus a 'good' graph should generally be mostly black bars with the odd background-coloured blob in between rounds. Turns out Bagil's plot is a bit messy too (as also seen in the original graph), but both Sybil and Manuel have pretty good plots. I'll link Sybil first as the example of a "normal" server (and you can commit URL fudgery to get to Manuel and Bagil if you want). The solid block at the bottom is simply where my scraping ended.
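The one-day-per-row layout described above can be sketched in Python roughly as follows. This is an illustrative text version, not the code that produced the images: it draws `#` where logs exist and `.` for background, uses one column per hour by default, marks a cell as covered if any logged span touches it, and skips the 5-pixel row height and the green 60-day background. The function names and the `cols_per_day` parameter are assumptions.

```python
import math
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Span = Tuple[datetime, datetime]  # (start, end) of a period that has server logs


def coverage_grid(
    spans: Iterable[Span], origin: datetime, days: int, cols_per_day: int = 24
) -> List[List[bool]]:
    """Build one row per day from `origin`; True where any logged span touches the cell."""
    cell = timedelta(days=1) / cols_per_day
    total_cells = days * cols_per_day
    grid = [[False] * cols_per_day for _ in range(days)]
    for start, end in spans:
        first = max(0, math.floor((start - origin) / cell))
        last = min(total_cells, math.ceil((end - origin) / cell))
        for index in range(first, last):
            grid[index // cols_per_day][index % cols_per_day] = True
    return grid


def render(grid: List[List[bool]]) -> str:
    """Draw '#' for periods with logs and '.' for the background."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)
```

Calling `render(coverage_grid(spans, datetime(2020, 11, 1), days))` with the scraped spans for one server gives a picture where long runs of `.` at the same columns across many rows point to outages clustered at a particular time of day.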

Image

Pretty decent: there's an "all servers"-wide blob of outages which corresponds to the spike at the turn of the year ((D)DoS attacks, if I remember correctly), and the odd crash here and there, but not too bad.

In contrast, in Terry's plot there is quite clearly a concentration of dead rounds, particularly towards the end of the day recently, and perhaps even more intensely in the last 30 days rather than 60. It also shows that it's quite common for 2 or 3 rounds to crash out in the same afternoon/evening of a row (day).

Image

Re: MSO fix terry hardware server

Posted: Fri Apr 09, 2021 5:44 pm
by bobbahbrown
now that's what i call creative graphing.

best,
bobbah 'bee' brown