Thoughts on the mirroring system etc - Mailing list pgsql-www
From | Magnus Hagander
Subject | Thoughts on the mirroring system etc
Date |
Msg-id | 6BCB9D8A16AC4241919521715F4D8BCE476685@algol.sollentuna.se
List | pgsql-www
Hello!

In light of yesterday's release, which was probably the largest hit so far on the current website's "way of things", I had a couple of thoughts. The site more or less went down, which is not good. What's in there now is a temporary fix; a permanent one is needed, and one that does not need manual intervention (as this one did).

So here are some thoughts on what I think needs to be done. I know some of these things have been discussed before - some exactly the same way, some slightly different - and I know steps are in motion to do some of them. I'm just lining up everything here. And yes, I'm actually offering to help out if wanted; just say the words. If I'm stepping on someone's toes here, let me apologize in advance, and just point me in the right direction. It's not my intention to be someone who just complains about what is now; I'd rather be someone who helps with ideas on how to move forward.

Number of mirrors
-----------------

* There are currently almost 60 mirrors for the static web content.
* During the very largest load while being slashdotted, the three servers serving the static content totalled no more than a little over 6Mbit of traffic, at a bit under 500 requests/second.
* During this time, wwwmaster pushed around 1.5Mbit.
* As long as www.postgresql.org is fast, people will *not* pick their local mirror for the web (ftp is a different thing, as it's more bandwidth intensive).

This leads me to the conclusion that we do *not* in fact need the large mirror network to handle the bandwidth load. In fact, most of those sites probably use up more bandwidth syncing than they save. It *is much needed* for redundancy, however, and we need better automation for that. (A lot of man-hours were thrown in to fix this problem. For next time, it's better if that work is done beforehand.)

My suggestion is to limit the number of mirrors to around 5, give or take a few, but instead put higher demands on these mirrors than we do now.
Demand that they sync every 30 minutes (or 60, but you get the point). Demand that they have a fast machine and a fast network connection - there have been enough offers of servers and networks that this should not be a major problem. Demand that they respond to www.postgresql.org; if each can have a dedicated IP, even better. Distributed across the world, of course.

The other mirrors can stay if they want, but don't let them sync against the master, to keep the load down - just against another mirror (as it is now, with only svr4, borg and eastside syncing to wwwmaster, and all others syncing to svr4).

For wwwmaster, have two machines at different locations. Use Slony to replicate the database. Some coding is probably needed to manually handle some updates (like the logs), since Slony isn't multimaster yet. wwwmaster held up fine this time, but if something happens to the box or the network it's on, we're dead in the water.

Then do some "DNS magic" to do the load balancing:

* Create a new zone, let's call it "mirrors.postgresql.org", with a TTL of no more than 10-15 minutes. Distribute this zone to more DNS servers than the current zone, since the load on the nameservers will be much higher, but require that all these machines respond to update notifications so they pick up changes *right away*. By creating a new zone we can both separate the handling of it (so a bug only affects this and not, say, the mailing lists), and keep the TTL on the main zone fairly large.
* Add a CNAME for www.postgresql.org pointing to www-static.mirrors.postgresql.org.
* Have a script running on a dedicated machine somewhere *very* well connected that is *not* one of the webservers. This script will poll the website every 5 minutes. If a site does not respond, it's dropped from the zone right away.
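As a rough illustration, the new zone could look something like this (the nameservers, serial and addresses are all made up here; the health-check script would rewrite the A records as mirrors come and go):

```
; mirrors.postgresql.org -- illustrative only
$TTL 600        ; 10 minutes, so a dropped mirror disappears quickly
@       IN SOA  ns1.postgresql.org. hostmaster.postgresql.org. (
                2005020901 ; serial
                3600       ; refresh
                600        ; retry
                604800     ; expire
                600 )      ; negative-caching TTL
        IN NS   ns1.postgresql.org.
        IN NS   ns2.postgresql.org.

; one A record per healthy mirror, maintained by the check script
www-static      IN A    192.0.2.10
www-static      IN A    192.0.2.20
www-static      IN A    192.0.2.30
```

And in the main postgresql.org zone, just the one delegating record, which never needs to change:

```
www     IN CNAME        www-static.mirrors.postgresql.org.
```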
If a site responds but is not up to date, it is dropped from the zone once its content is more than <n> minutes old (depending on how often sync is demanded).

This also provides a way to gracefully take one machine out of the cluster without any manual hacking of DNS zones, etc. Simply stop syncing, wait an hour or so, and all requests should be going elsewhere. Then, once the machine is upgraded/reinstalled/moved/whatever, just start syncing again and it will be picked up again.

A similar solution for wwwmaster, of course.

I am willing to invest some time in writing these scripts if wanted. I don't think it's a huge amount of work, and parts of it have already been done by Dave in the current mirror checking script.

A similar solution could be made for the ftp servers, but I think there is less need there. If we want to do it, let's start with www and take it from there if necessary.

Sync speed
----------

After setting up eastside to help handle the load of www.postgresql.org, I noticed the sync was horribly slow even when nothing had changed. This was because it synced the attributes on all files every time - the update date, I believe. Dave has now committed a couple of patches I made for this, and sync time has dropped from >5 minutes down to <5 seconds.

A mirror pull when *nothing* has changed is right now around 400Kb. With 60 servers syncing, that's a full 24Mb every time nothing has changed. With just 5 servers, well, do the math ;-)

Bittorrent/Ftp
--------------

As Dave has already referred to, I think it'd be good to have bittorrent links on every file in the ftp browser. Slashdot linked directly to the bittorrent downloads, and that showed; but once the story fell off the slashdot page, the number of people using bittorrent dropped very fast. During peak, my two seeders sent about 4Mbit/sec over bittorrent. Also, the load hit bt.postgresql.org instead of www.postgresql.org, so it was not distributed.
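Going back to the health-check script described in the DNS section, the core of it could be a sketch like this (the hostnames, the timestamp file and the thresholds are all invented, and the part that actually rewrites and reloads the zone is left out):

```python
"""Sketch of the mirror health check: drop mirrors that don't respond,
or that respond with content older than the sync interval allows."""
import time
import urllib.request

MIRRORS = ["mirror1.example.org", "mirror2.example.org"]  # hypothetical hosts
MAX_AGE = 45 * 60   # seconds; tied to how often mirrors must sync
TIMEOUT = 10        # seconds before a mirror counts as down

def fetch_timestamp(host):
    """Fetch a tiny file that the sync job touches on every run.

    Returns the mirror's last-sync time as a Unix timestamp, or None
    if the mirror did not answer at all.
    """
    try:
        url = "http://%s/mirror-timestamp.txt" % host
        with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
            return float(resp.read().strip())
    except Exception:
        return None

def healthy_mirrors(mirrors, fetch=fetch_timestamp, now=None):
    """Return the subset of mirrors that respond and are fresh enough.

    The caller would run this every 5 minutes and regenerate the
    mirrors.postgresql.org zone from the result.
    """
    now = time.time() if now is None else now
    alive = []
    for host in mirrors:
        stamp = fetch(host)
        if stamp is None:
            continue                 # no response: drop right away
        if now - stamp > MAX_AGE:
            continue                 # responded, but content is stale
        alive.append(host)
    return alive
```

The fetch function is passed in as a parameter so the drop logic can be tested without touching the network; the same hook is where an "is the content actually correct" check could be plugged in later.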
Since this means more bittorrent seeders, they should perhaps run on a separate box from the web stuff. There could be several boxes that just rsync the .torrents between each other, so the project always has a couple of seeders up. This would also be a very easy point for people to just "plug in more bandwidth" when required, since bittorrent automatically makes sure that nobody can serve a non-up-to-date file. With some tweaks to the scripts it ought to be possible to run this with just one process serving a whole lot of torrents - they just need to be in the same directory.

As for ftp mirrors, the bandwidth demand there is no doubt much higher than it is on the web servers, so keeping more mirrors there makes a lot of sense. Also, some of the ftp sites that mirror us now have *huge* amounts of bandwidth (in the range of many gigabits/sec).

wwwmaster
---------

If you hit the ftp browser (or a download link) and then click anything in the menu, you get the whole site served from wwwmaster. If the above is fixed, so all mirrors respond as www.postgresql.org, it should be as simple as sticking a <base href> in there or something. But until then, perhaps some creative coding in the framework can fix it so that links hit on wwwmaster point back to www, whereas the static site uses relative links only?

Wow. That was a lot longer than initially intended. Hope someone has the patience to read it all ;-)

//Magnus