Issues for the last week/weeks - explanation
August 29th, 2008
There has been a lot of discussion about why the servers have been so slow recently. The answer is actually fairly complex, so let me explain.
We used to have one server for Apache and one server for MySQL (replicated for backups). Last October we started moving to a load-balanced architecture for greater scalability: two Apache servers, one NAS (a NetApp provided by Noris.net, our main provider, which also hosts all our servers), and one replicated MySQL. The French locoteam was the first to move to the new platform; the other teams stayed on the old Apache server.
At the beginning of summer we got kicked off the NAS because our usage was too heavy. We managed to get a new server, Nun, at short notice, and shipped it from France to Nuremberg. Nun is a dual Xeon 3 GHz with 4 GB RAM and 5×70 GB SCSI 10k disks - a reasonable server for NFS. Sadly the performance turned out to be very bad, which resulted in very slow response times. It even became catastrophic once we tried to move our wiki to the NFS; DokuWiki was actually the problem here. So we left DokuWiki running on one server only (without using the NAS) while we investigated, and we started to find and report some issues. Further improvements to DokuWiki’s code have been made since then.
Even without the wiki, the performance of our NFS was still a lot worse than expected - and it would not have supported other locoteams moving to it. After a long investigation by smurf, it turned out that a bug in Hardy’s kernel was flooding the RAID card (the bug has been reported). We then moved the ubuntu-de loco website to only one of the new Apache servers (the old ubuntu-de portal wasn’t made to be load balanced) and used the old Apache server, Asa, as the NFS server instead. Asa is a dual Xeon 3 GHz with 4 GB RAM and 3×70 GB SCSI 10k disks - quite comparable to Nun - though it was running Dapper and not Hardy. We found its performance to be nearly ten times better. So Nun wasn’t used anymore, and we were running NFS on the previous Apache server, which was still serving some smaller locos.
At this time we also had some big issues with our load balancer, HAProxy. For some reason, some files were systematically returned with a 504 error, although they were served fine when accessing the backend server directly (without going through the reverse proxy). It took us a long, very long time to track this down. It turned out to be a bug in HAProxy: when one specific option was turned on, it parsed the headers of those files incorrectly and failed to pass them back. The issue has been reported and will hopefully be corrected later; for the time being we have disabled that option. The bug was also present in wget.
A couple of weeks ago Ubuntu-de completely replaced their portal system with a home-made one. This brought several issues. The move to a database-based wiki added load to an already loaded SQL server. The portal was also using mod_wsgi, which apparently had big problems: Apache processes kept forking and the number of workers kept reaching its limit for no apparent reason, resulting in a massive slowdown for web requests.
One of the biggest bottlenecks was still the SQL server though, so we bought 2×1 GB of memory on eBay, using the data from “lshw” to choose the correct modules. They took three days to arrive and, sadly, they weren’t the right ones :/ We may still use them in some other server, though. The SQL server was also running a software RAID 5, which wasn’t optimal, so we wanted to switch to RAID 1. Following that failed memory upgrade and the RAID change, the server started acting weirdly; things that should have worked just didn’t. Even after a full reinstall it kept acting weirdly. Smurf declared it faulty and moved the SQL to the NFS server - which was already doing NFS, plus Apache for a couple of smaller websites - and which also had slower disks.
So the SQL has been terribly slow for the last week or so. This morning, smurf bought another 2 GB of RAM on eBay, the correct kind this time. He installed it late this afternoon and reinstalled the server with Hardy. It is apparently working fine, and we hope to have the SQL completely back tomorrow.
At the same time we have made huge improvements to the SQL configuration and to some of the queries. We noticed a missing key in one of punbb’s tables, and we completely removed every reference to punbb’s search (which was a *huge* bottleneck and was locking some tables for a long time, resulting in a “frozen” forum).
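To illustrate why a single missing key matters so much, here is a minimal sketch. It is not the actual punbb schema or our MySQL setup: it uses Python's built-in SQLite as a stand-in for MySQL, and the `posts` table, the `topic_id` column and the index name are all hypothetical. It only shows the general effect: without a key the database scans the whole table for each lookup, with a key it can search the index.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for `sql` as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)


def demo():
    """Compare the plan of the same lookup before and after adding a key."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical forum table; the real punbb tables are not shown in this post.
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, topic_id INTEGER, message TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts (topic_id, message) VALUES (?, ?)",
        [(i % 100, "post %d" % i) for i in range(10000)],
    )

    sql = "SELECT id, message FROM posts WHERE topic_id = ?"

    # Without a key on topic_id, every lookup reads the whole table.
    before = query_plan(conn, sql, (42,))

    # Add the (hypothetical) missing key and check the plan again.
    conn.execute("CREATE INDEX idx_posts_topic_id ON posts (topic_id)")
    after = query_plan(conn, sql, (42,))

    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("without key:", before)
    print("with key:   ", after)
```

On MySQL the equivalent check would be running `EXPLAIN` on the slow query before and after adding the key; the exact wording of the plan differs between databases and versions.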
Meanwhile the German locoteam has been looking at why their portal was making Apache run so many workers. They eventually found an issue which may have been causing it and corrected it, so this problem should be solved as well.
For the curious, this is the platform we will then be using.
Comments
1. Probleme bei ubuntuusers.… | August 30th, 2008 at 3:42 pm
[…] however, the server team has things under control and has published a statement that explains the background better and sheds a little light on the matter. Particularly interesting […]