Re: UPDATE on two large datasets is very slow - Mailing list pgsql-general
From: Listmail
Subject: Re: UPDATE on two large datasets is very slow
Date:
Msg-id: op.tp7xiooezcizji@apollo13
In response to: Re: UPDATE on two large datasets is very slow (Scott Marlowe <smarlowe@g2switchworks.com>)
Responses: Re: UPDATE on two large datasets is very slow
List: pgsql-general
> I can't help but think that the way this application writes data is
> optimized for MySQL's transactionless table type, where lots of
> simultaneous input streams writing at the same time to the same table
> would be death.
>
> Can you step back and work on how the app writes out data, so that it
> opens a persistent connection, and then sends in the updates one at a
> time, committing every couple of seconds while doing so?

No, he can't. I also coded a bittorrent tracker of the same kind, and the problem is:

- each user has N torrents active
- the torrent client makes N tracker requests (one per torrent)
- the tracker only needs 1 UPDATE to update the user's stats

So if you do instantaneous updates you multiply your query load by N (on average between 6 and 8).

Besides, these kinds of trackers face several problems:

- they are accessed by clients whose connections are nearly saturated, since they're leeching illegal prOn like crazy
- therefore these HTTP connections are very slow
- therefore you have a hell of a lot of concurrent connections

Therefore, using a threaded server for this kind of load is asking for trouble. All decent torrent trackers are designed like lighttpd: select()/poll() or other variants, and no threads. No threads means database queries are long operations to be avoided. Hosting providers will delete your account if they see a PHP torrent tracker on it, and for good reason: 600 hits/s = 600 connections = 600 Apache and PG processes = you cry.

Anyway, my tracker was written in Python with a select/poll asynchronous HTTP model. It handled 200 HTTP requests per second using 10% CPU on a Core 2, which I think is pretty decent. (I do NOT work on it anymore, DO NOT ASK for the sources; it is now illegal in my country to code trackers, so I have completely dropped the project, but I guess helping a fellow living in a free country is OK.)

Back to databases. You complain that Postgres is slow for your application. Yes, it is a lot slower than MyISAM *on this application* (but try InnoDB and cry). But PG is a real database, and it is simply not the right tool for this part of your application. You have to choose between in-place updates and transactions. (Besides, your website is locked while MySQL runs your big UPDATE.)

Here is how you can do it. Your problem is that you put the peers in the database. Ask yourself why.

You need seeder/leecher counts for each torrent?
-> Two INTEGER columns in your torrents table, updated in batch by the tracker every hour.

You need all the peers saved somewhere so that you can exit and restart your tracker?
-> code your tracker in Python and make it able to reload running code
-> or just save the peers when you exit
-> or don't save them at all; it's not like it's your bank accounting data, who cares
-> besides, the FBI will be happy to have all that data when they seize your server (see: The Pirate Bay laughs, since all its data was in RAM and the police had to unplug the server to seize it)

So, DO NOT put the peers in the database. If you put the peer info in the database, you get one UPDATE per user per torrent. If you only update the user stats, you get one UPDATE per user. And the tracker never inserts users or torrents (hopefully), so you only ever get UPDATEs to the users and torrents tables, never INSERTs (a sketch of this batching follows below).

Now you need to display realtime info on the user and torrent pages. This is easily done: your tracker is an HTTP server, so it can serve that data over HTTP (PHP-serialized, JSON, whatever) and you insert it into your webpages via AJAX or PHP.
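To make the batching concrete, here is a minimal sketch of that idea in Python. Everything in it is illustrative rather than taken from the original tracker: the psycopg2 driver, the users.uploaded/users.downloaded and torrents.seeders/torrents.leechers columns, and the helper names are all assumptions. The point is that announces only touch RAM, and the database sees one UPDATE per user at flush time.

```python
# Sketch: accumulate per-user deltas in RAM, flush them in one batch
# transaction every few minutes -- one UPDATE per user, not one per
# announce. Table and column names are assumptions for illustration.
from collections import defaultdict

import psycopg2

# user_id -> [uploaded_delta, downloaded_delta], accumulated in memory
user_deltas = defaultdict(lambda: [0, 0])
# torrent_id -> (seeders, leechers), maintained elsewhere by the tracker
torrent_counts = {}

def on_announce(user_id, up_delta, down_delta):
    """Called for every tracker announce: touches RAM only, no SQL."""
    d = user_deltas[user_id]
    d[0] += up_delta
    d[1] += down_delta

def flush_user_stats(conn):
    """One UPDATE per user, all inside a single transaction."""
    global user_deltas
    pending, user_deltas = user_deltas, defaultdict(lambda: [0, 0])
    with conn, conn.cursor() as cur:  # commits on successful exit
        for uid, (up, down) in pending.items():
            cur.execute(
                "UPDATE users SET uploaded = uploaded + %s,"
                " downloaded = downloaded + %s WHERE id = %s",
                (up, down, uid))

def flush_torrent_counts(conn):
    """Hourly batch: the two INTEGER columns suggested above."""
    with conn, conn.cursor() as cur:
        for tid, (seeders, leechers) in torrent_counts.items():
            cur.execute(
                "UPDATE torrents SET seeders = %s, leechers = %s"
                " WHERE id = %s",
                (seeders, leechers, tid))
```

With this shape, a user announcing for 8 torrents costs 8 in-memory increments and still only one row UPDATE per flush interval, which is exactly the N-fold reduction described above.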
From my stats, my tracker needed about 100 microseconds to serve an HTTP page with the peer counts for a torrent. So you don't need Postgres for your tracker! Use it for your website instead...
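For the serving side, here is a rough sketch of the single-threaded select()/poll() model described above, using Python's standard selectors module (the original tracker predates it; the port, response format, and in-memory layout are made up for illustration). Peer counts are answered straight from RAM as JSON, so the hot path never touches the database.

```python
# Sketch of a select()/poll()-style server: one process, no threads,
# peer counts served straight from RAM as JSON.
import json
import selectors
import socket

torrent_counts = {42: {"seeders": 10, "leechers": 37}}  # lives in RAM

sel = selectors.DefaultSelector()
srv = socket.socket()
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("", 8080))
srv.listen(128)
srv.setblocking(False)
sel.register(srv, selectors.EVENT_READ)

def respond(conn):
    # Read the request (we only care that it arrived) and answer with
    # the in-memory counts; no database round trip on the hot path.
    try:
        conn.recv(4096)
        body = json.dumps(torrent_counts).encode()
        conn.send(b"HTTP/1.0 200 OK\r\n"
                  b"Content-Type: application/json\r\n"
                  b"Content-Length: " + str(len(body)).encode() +
                  b"\r\n\r\n" + body)
    finally:
        sel.unregister(conn)
        conn.close()

while True:  # the entire server is this one event loop
    for key, _events in sel.select():
        if key.fileobj is srv:
            conn, _addr = srv.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            respond(key.fileobj)
```

A real server would buffer partial reads and writes and actually parse the request line; the point here is only that one process with one event loop can hold a large number of slow client connections without paying for a thread (or an Apache and PG process) per connection.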