Re: UPDATE on two large datasets is very slow - Mailing list pgsql-general

From Steve Gerhardt
Subject Re: UPDATE on two large datasets is very slow
Date
Msg-id 4612E1B0.8030102@ocean.fraknet.org
In response to Re: UPDATE on two large datasets is very slow  (Listmail <lists@peufeu.com>)
List pgsql-general
The problem with this approach is really that the tracker + website
combination is designed to show really detailed statistics for every
user on every torrent, which (as you mentioned) complicates things a
great deal. It also stores the complete history information for every
client; old entries are periodically combined into a single "Old
Torrents" row per user, which serves as a placeholder for the old statistics.
The tracker was designed around the concept of the website, so it's not
something that can easily be changed. In addition, the data inside the
tracker is very transient; it sort of "garbage collects" its internal
structures to save memory, which means a database-like system is needed
for any kind of persistence.
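
(For illustration, that consolidation step works conceptually like the
sketch below. The table and column names are invented for the example,
the real schema differs, and the placeholder row is assumed to already
exist for each user.)

    import psycopg2

    def consolidate_old_torrents(conn, cutoff):
        # Fold per-torrent history rows older than `cutoff` into each
        # user's "Old Torrents" placeholder row (torrent_id IS NULL).
        # Both statements run in one transaction.
        cur = conn.cursor()
        cur.execute("""
            UPDATE history AS h
               SET uploaded   = h.uploaded   + agg.up,
                   downloaded = h.downloaded + agg.down
              FROM (SELECT user_id,
                           SUM(uploaded)   AS up,
                           SUM(downloaded) AS down
                      FROM history
                     WHERE last_seen < %s AND torrent_id IS NOT NULL
                     GROUP BY user_id) AS agg
             WHERE h.user_id = agg.user_id
               AND h.torrent_id IS NULL   -- the placeholder row
        """, (cutoff,))
        # Remove the rows that were just folded in.
        cur.execute("""
            DELETE FROM history
             WHERE last_seen < %s AND torrent_id IS NOT NULL
        """, (cutoff,))
        conn.commit()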

As an aside, the tracker itself is designed around the asynchronous
approach you mentioned, but it has a worker thread which handles the
database updates so the main thread can keep running as fast as
possible. The current production version, still using MySQL (sadly),
handles ~500-800 connections/sec on average, which equates to roughly
2.7 MiB of raw query data per second needing to be written to the
database. Benchmarking indicates it can handle up to 6,800
connections/sec with around 1,000 parallel connections on a fairly
high-end system. At that rate the volume of SQL being generated would
be utterly enormous (I believe around 22.5 MiB per second), probably so
much so that keeping statistics would be impossible no matter what
RDBMS was used; that is a problem that will need solving eventually.
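
(Roughly, the decoupling looks like the sketch below; psycopg2 and the
table/column names are my own stand-ins for the example, not the
tracker's actual code.)

    import queue
    import threading
    import psycopg2

    stats_queue = queue.Queue()

    def on_announce(user_id, torrent_id, up_delta, down_delta):
        # Main thread: never touches the database, just queues the deltas.
        stats_queue.put((user_id, torrent_id, up_delta, down_delta))

    def db_worker(dsn, max_batch=1000):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        while True:
            batch = [stats_queue.get()]          # block for the first item
            while len(batch) < max_batch:
                try:
                    batch.append(stats_queue.get_nowait())
                except queue.Empty:
                    break                        # queue drained, flush now
            cur.executemany("""
                UPDATE stats
                   SET uploaded   = uploaded   + %s,
                       downloaded = downloaded + %s
                 WHERE user_id = %s AND torrent_id = %s
            """, [(up, down, uid, tid) for (uid, tid, up, down) in batch])
            conn.commit()

    threading.Thread(target=db_worker, args=("dbname=tracker",),
                     daemon=True).start()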

All that said, I'm really just looking for a way to make Postgres work
as well as I can get it to, without having to re-engineer the
fundamental way the tracker operates, but I do appreciate the advice
dearly.

Thanks for the reply.

Steve Gerhardt

Listmail wrote:
>     No, he can't, I also coded a BitTorrent tracker of the same kind,
> and the problem is:
>
>     - each user has N torrents active
>     - torrent client does N tracker requests
>     - tracker only does 1 UPDATE to update user's stats
>
>     So if you do instantaneous updates you multiply your query load by N
> (on average between 6 and 8).
>
>     Besides, these kinds of trackers face several problems:
>
>     - they are accessed by clients which have near-saturated connections
> since they're leeching illegal prOn like crazy
>     - therefore these HTTP connections are very slow
>     - therefore you have a hell of a lot of concurrent connections.
>
>     Therefore, using a threaded server for this kind of load is asking
> for trouble.
>     All decent torrent trackers are designed like lighttpd: select() /
> poll() or other variants, and no threads.
>     No threads means database queries are long operations to be
> avoided.
>     Hosting providers will delete your account if they see a PHP torrent
> tracker on it, and for good reason.
>
>     600 hits/s = 600 connections = 600 Apache and PG processes = you cry.
>
>     Anyway, my tracker was in Python with a select/poll asynchronous
> HTTP model. It handled 200 HTTP requests per second using 10% CPU on a
> Core 2. I guess that's pretty decent.
>
>     (I do NOT work on it anymore, DO NOT ASK for sources, it is illegal
> now in my country to code trackers so I have completely dropped the
> project, but I guess helping a fellow living in a free country is OK)
>
>     Back to databases.
>
>     You complain that Postgres is slow for your application.
>     Yes, it is a lot slower than MyISAM *on this application* (but try
> InnoDB and cry).
>     But PG is a real database. It is simply not the right tool for your
> application.
>     You have to choose between in-place updates and transactions.
>
>     (besides, your website is locked while MySQL does your big UPDATE).
>
>     Here is how you can do it:
>
>     Your problem is that you put the peers in the database.
>     Ask yourself why?
>
>     You need seeder/leecher counts for each torrent?
>     -> Two INTEGERs in your torrents table, updated in batch by the
> tracker every hour.
>
>     You need to have all peers saved somewhere so that you may exit and
> restart your tracker?
>     -> code your tracker in python and be able to reload running code
>     -> or just save it when you exit
>     -> or don't save it, it's not like it's your bank accounting data,
> who cares
>     -> the FBI will be happy to have all that data when they seize your
> server (see: The Pirate Bay laughed because all its data was in RAM and
> the police had to unplug the server to seize it.)
>
>     So, DO NOT put the peers in the database.
>     If you put the peer info in the database you get one UPDATE per
> user per torrent.
>     If you only update the user stats you get only one UPDATE per user.
>     And the tracker never inserts users or torrents (hopefully), so you
> only get UPDATEs to the users and torrents tables, never INSERTs.
>
>     Now you need to display realtime info on the user and torrent pages.
>     This is easily done: your tracker is an HTTP server, so it can serve
> data over HTTP (PHP-serialized, JSON, whatever) that is inserted via
> AJAX or PHP into your webpages.
>
>     From my stats, my tracker needs about 100 microseconds to serve an
> HTTP page with the peer counts for a torrent.
>
>     So, you don't need Postgres for your tracker! Use it for your
> website instead...
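
(For concreteness, the hourly count flush suggested above might look
like the sketch below. The torrents table with its seeders/leechers
columns and the in-memory counts dict are assumptions for the example.)

    import psycopg2

    def flush_peer_counts(conn, counts):
        # counts: {torrent_id: (n_seeders, n_leechers)}, maintained in
        # RAM by the tracker; only these totals ever touch the database.
        cur = conn.cursor()
        cur.executemany("""
            UPDATE torrents
               SET seeders  = %s,
                   leechers = %s
             WHERE id = %s
        """, [(s, l, tid) for tid, (s, l) in counts.items()])
        conn.commit()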
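
(And the realtime-stats endpoint can be very small, as in this sketch.
A real tracker would serve it from its own select()/poll() loop rather
than Python's http.server, and counts is the same assumed in-memory
dict as above.)

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    counts = {}   # {torrent_id: (seeders, leechers)}, filled by the tracker

    class StatsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # e.g. GET /stats/1234 -> {"seeders": ..., "leechers": ...}
            torrent_id = int(self.path.rsplit("/", 1)[-1])
            seeders, leechers = counts.get(torrent_id, (0, 0))
            body = json.dumps({"seeders": seeders,
                               "leechers": leechers}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # HTTPServer(("", 8080), StatsHandler).serve_forever()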
