Thread: UPDATE on two large datasets is very slow

UPDATE on two large datasets is very slow

From
Steve Gerhardt
Date:
I've been working for the past few weeks on porting a closed source
BitTorrent tracker to use PostgreSQL instead of MySQL for storing
statistical data, but I've run into a rather large snag. The tracker in
question buffers its updates to the database, then applies them all at
once, sending anywhere from 1-3 MiB of query data. With MySQL, this is
accomplished using the INSERT INTO...ON DUPLICATE KEY UPDATE query,
which seems to handle the insert/update very quickly; generally it only
takes about a second for the entire set of new data to be merged.
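
(For reference, the MySQL statement is roughly of the following shape;
the table and column names mirror the Postgres schema shown below, the
column list is abbreviated, and the values are invented, so treat this
as a sketch of the idea rather than the literal query the tracker sends.)

INSERT INTO peers2 (tid, uid, uploaded, downloaded, timespent, ip, port)
VALUES (1, 42, 1048576, 2097152, 60, '10.0.0.1', 6881)
ON DUPLICATE KEY UPDATE
    uploaded   = uploaded + VALUES(uploaded),     -- counters accumulate
    downloaded = downloaded + VALUES(downloaded),
    timespent  = timespent + VALUES(timespent),
    ip         = VALUES(ip),                      -- the rest just overwrite
    port       = VALUES(port);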

The problem I am encountering is that when I attempt to duplicate this
functionality in Postgres, it is terrifically slow, to the point of
being utterly unusable. The tracker currently handles around
10,000-40,000 client updates per minute, which translates to roughly the
same number of rows in the database. Part of the issue is that some of
those rows cannot be updated because they do not yet exist in the
database, but the ratio of updates to inserts is likely around 100:1.

After consulting with some of the folks on the PostgreSQL IRC channel on
freenode.net, I was left with this idea to try:

---------------------------------------------------------------------
BEGIN

CREATE TEMP TABLE temp_p2 ON COMMIT DROP AS (SELECT tid, uid, uploaded,
downloaded, remaining, avg_up, avg_down, active, timespent, ip, port,
peer_id, blocked FROM peers2 WHERE FALSE)

COPY temp_p2 FROM STDIN WITH CSV QUOTE AS ''''
< the data is sent by the tracker using PQputCopyData >

UPDATE peers2 AS p SET uploaded = p.uploaded + t.uploaded, downloaded =
p.downloaded + t.downloaded, remaining = t.remaining, avg_up = t.avg_up,
avg_down = t.avg_down, active = t.active, timespent = p.timespent +
t.timespent, ip = t.ip, port = t.port, blocked = t.blocked, timestamp =
CURRENT_TIMESTAMP FROM temp_p2 AS t WHERE (p.uid = t.uid AND p.tid = t.tid)

INSERT INTO peers2 (tid, uid, uploaded, downloaded, remaining, avg_up,
avg_down, active, timespent, ip, port, peer_id, blocked) SELECT t.* FROM
temp_p2 AS t LEFT JOIN peers2 USING (uid, tid) WHERE peers2.uid IS NULL
AND peers2.tid IS NULL

COMMIT
---------------------------------------------------------------------

Initial attempts showed the UPDATE query was incredibly slow. After
sitting down at the psql command line, I managed to get the query plan
for it after much waiting.

# EXPLAIN ANALYZE UPDATE peers2...etc etc
QUERY PLAN
-------------------------------------------------------------------------
  Merge Join  (cost=262518.76..271950.65 rows=14933 width=153) (actual
time=8477.422..9216.893 rows=26917 loops=1)
    Merge Cond: ((p.tid = t.tid) AND (p.uid = t.uid))
    ->  Sort  (cost=177898.12..180004.09 rows=842387 width=65) (actual
time=7803.248..8073.817 rows=109732 loops=1)
          Sort Key: p.tid, p.uid
          ->  Seq Scan on peers2 p  (cost=0.00..25885.87 rows=842387
width=65) (actual time=0.043..4510.771 rows=647686 loops=1)
    ->  Sort  (cost=84620.64..85546.64 rows=370400 width=96) (actual
time=641.438..761.893 rows=55393 loops=1)
          Sort Key: t.tid, t.uid
          ->  Seq Scan on temp_p2 t  (cost=0.00..11112.00 rows=370400
width=96) (actual time=0.093..275.110 rows=55393 loops=1)
  Total runtime: 192569.492 ms
(9 rows)

(Apologies if the formatting got ruined by my e-mail client.)

Essentially, it looks like what it's doing is sorting both tables on the
columns in the WHERE clause, then finding which rows correspond between
the two. The problem is that, as can be seen, peers2 has 600,000+ rows,
so sequentially scanning and sorting it is a rather non-trivial
operation. As a side note, there is a unique index on peers2 (uid, tid),
so any lookups should be fully indexed.
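
For reference, the index is essentially the following (the name here is
from memory and may not be exact), and a quick EXPLAIN on a single-row
probe is an easy way to confirm it gets used:

CREATE UNIQUE INDEX peers2_uid_tid_key ON peers2 (uid, tid);

EXPLAIN SELECT 1 FROM peers2 WHERE uid = 12345 AND tid = 678;
-- expect an Index Scan using peers2_uid_tid_key rather than a Seq Scan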

After this method seemed to fail miserably, I took another approach and
wrote a stored procedure, which should in theory accomplish much the
same thing. I assumed this would be faster because it would iterate over
the temp_p2 table sequentially, and do a simple index lookup + update to
the peers2 table on each step, without any sorting or craziness
required. For this to work, the tracker needs to automatically
categorize client updates into "needs UPDATE" or "needs INSERT" buffers,
which would be handled separately. The inserts are lightning quick and
are not an issue, but the updates, as illustrated below, are not very good.

Here is the first version of the stored procedure:

CREATE OR REPLACE FUNCTION tracker_update() RETURNS integer AS $PROC$
DECLARE
   rec temp_p2%ROWTYPE;
BEGIN
   FOR rec IN SELECT * FROM temp_p2 LOOP
     UPDATE peers2 SET uploaded = uploaded + rec.uploaded,
        downloaded = downloaded + rec.downloaded,
        remaining = rec.remaining,
        avg_up = rec.avg_up,
        avg_down = rec.avg_down,
        active = rec.active,
        timespent = timespent + rec.timespent,
        ip = rec.ip,
        port = rec.port,
        peer_id = rec.peer_id,
        blocked = rec.blocked,
        timestamp = CURRENT_TIMESTAMP
        WHERE uid = rec.uid AND tid = rec.tid;
   END LOOP;
   RETURN 1;
END;
$PROC$ LANGUAGE plpgsql;

Here are the results. I ran it through EXPLAIN ANALYZE just to get the
timing information, since one cannot view query plans for stored
procedures that way. This is on the same dataset as the UPDATE above.

# EXPLAIN ANALYZE SELECT tracker_update();
QUERY PLAN
-------------------------------------------------------------------------
  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=496676.127..496676.128 rows=1 loops=1)
  Total runtime: 496684.305 ms
(2 rows)

Realizing this took about 2.57 times as long as the original UPDATE
query, I ventured to write a second version that used EXECUTE instead,
so there was no pre-planning and it would run the UPDATEs individually
with constant values. Here is that second version:

CREATE OR REPLACE FUNCTION tracker_update2() RETURNS integer AS $PROC$
DECLARE
   rec temp_p2%ROWTYPE;
BEGIN
   FOR rec IN SELECT * FROM temp_p2 LOOP
     EXECUTE 'UPDATE peers2 SET uploaded = uploaded + ' || rec.uploaded
        || ', downloaded = downloaded + ' || rec.downloaded
        || ', remaining = ' || rec.remaining
        || ', avg_up = ' || rec.avg_up
        || ', avg_down = ' || rec.avg_down
        || ', active = ' || rec.active
        || ', timespent = timespent + ' || rec.timespent
        || ', ip = ' || quote_literal(host(rec.ip))
        || ', port = ' || rec.port
        || ', peer_id = ' || quote_literal(encode(rec.peer_id, 'escape'))
        || ', blocked = ' || rec.blocked
        || ', timestamp = CURRENT_TIMESTAMP WHERE'
        || ' uid = ' || rec.uid
        || ' AND tid = ' || rec.tid;
   END LOOP;
   RETURN 1;
END;
$PROC$ LANGUAGE plpgsql;

As an aside, the quote_literal calls are needed for the ip and peer_id
fields, which are inet and bytea types, respectively.

Here is the execution time of this stored procedure on the same dataset
as the UPDATE above:
# EXPLAIN ANALYZE SELECT tracker_update2();
QUERY PLAN
------------------------------------------------------------------------
  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=432153.101..432153.103 rows=1 loops=1)
  Total runtime: 432153.127 ms
(2 rows)

Hardly an improvement. Unfortunately, at this point I'm at a bit of a
loss. I cannot think of any solution that would be faster than the
original UPDATE method, but that method is far, far too slow anyway. The
tracker attempts to send a new buffer to Postgres every minute, so if
each buffer takes ~3 minutes to process, it will simply backlog itself
into oblivion, which means I might as well not bother porting the code
at all. I don't mean to come down on Postgres, but it surprises me that
for all the advanced features it has, MySQL utterly blows it out of the
water for this one aspect. I'm absolutely willing to admit it's probably
my fault for approaching the problem the wrong way, which is why I'm
seeking help :)

Anyway, I admit I haven't done a great deal of configuration file tuning
for the Postgres setup, so if there is something obvious behind the poor
performance, I'll be happy to test out any proposed changes and report
back with the results. If anyone has *ANY* ideas as to why this is so
slow, or different approaches that may be faster than what I have tried,
please e-mail me back; I've struggled with this for weeks with
absolutely no results. If you need to see configuration files, sample
data, or anything else, don't hesitate to ask; I'll be happy to provide
as much as I can if it means there's a solution on the horizon.

For those of you that have actually read all of this, thank you for your
patience, I know it's a huge message.

Thanks much,

Steve Gerhardt

Re: UPDATE on two large datasets is very slow

From
Scott Marlowe
Date:
On Mon, 2007-04-02 at 22:24, Steve Gerhardt wrote:
> I've been working for the past few weeks on porting a closed source
> BitTorrent tracker to use PostgreSQL instead of MySQL for storing
> statistical data, but I've run in to a rather large snag. The tracker in
> question buffers its updates to the database, then makes them all at
> once, sending anywhere from 1-3 MiB of query data. With MySQL, this is
> accomplished using the INSERT INTO...ON DUPLICATE KEY UPDATE query,
> which seems to handle the insert/update very quickly; generally it only
> takes about a second for the entire set of new data to be merged.
>
> The problem I am encountering is that when I attempt to duplicate this
> functionality in Postgres, it is terrifically slow to a point of utter
> unusability. The tracker currently handles around 10,000-40,000 client
> updates per minute, which translates roughly to the same number of rows
> in the database. Part of the issue is that some of those rows cannot be
> updated because they do not yet exist in the database, but there is
> likely around a 100:1 ratio on updates to inserts.
>
> After consulting with some of the folks on the PostgreSQL IRC channel on
> freenode.net, I was left with this idea to try:

I can't help but think that the way this application writes data is
optimized for MySQL's transactionless table type, where lots of
simultaneous input streams writing at the same time to the same table
would be death.

Can you step back and work on how the app writes out data, so that it
opens a persistent connection, and then sends in the updates one at a
time, committing every couple of seconds while doing so?
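
Something like this, roughly (column names borrowed from your UPDATE,
values invented just to show the shape):

BEGIN;
UPDATE peers2 SET uploaded = uploaded + 52428, timespent = timespent + 30,
       timestamp = CURRENT_TIMESTAMP
  WHERE uid = 12345 AND tid = 678;
-- ... a few hundred more single-row updates as they arrive ...
COMMIT;  -- every couple of seconds, then BEGIN again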

Re: UPDATE on two large datasets is very slow

From
Jonathan Vanasco
Date:
On Apr 3, 2007, at 11:44 AM, Scott Marlowe wrote:

> I can't help but think that the way this application writes data is
> optimized for MySQL's transactionless table type, where lots of
> simultaneous input streams writing at the same time to the same table
> would be death.
>
> Can you step back and work on how the app writes out data, so that it
> opens a persistent connection, and then sends in the updates one at a
> time, committing every couple of seconds while doing so?

I'd look into indexing the tables your update requires in such a way
that you're not doing so many sequential scans.

I have a system that does many updates on a quickly growing db - 5M
rows last week, 25M this week.

Even simple updates could take forever because of poor indexing
relative to the fields addressed in the 'where' on the update and in
foreign keys.
With some proper indexing, the system is super fast again.

So I'd look into creating new indexes and trying to shift the seq
scans into more time-efficient index scans.



Re: UPDATE on two large datasets is very slow

From
Listmail
Date:
> I can't help but think that the way this application writes data is
> optimized for MySQL's transactionless table type, where lots of
> simultaneous input streams writing at the same time to the same table
> would be death.
>
> Can you step back and work on how the app writes out data, so that it
> opens a persistent connection, and then sends in the updates one at a
> time, committing every couple of seconds while doing so?

    No, he can't. I also coded a BitTorrent tracker of the same kind, and
the problem is:

    - each user has N torrents active
    - the torrent client does N tracker requests
    - the tracker only does 1 UPDATE to update the user's stats

    So if you do instantaneous updates you multiply your query load by N (on
average between 6 and 8).

    Besides, these kinds of trackers face several problems :

    - they are accessed by clients which have near saturated connections
since they're leeching illegal prOn like crazy
    - therefore these HTTP connections are very slow
    - therefore you have a hell of a lot of concurrent connections.

    Therefore using a threaded server for this kind of load is asking for
trouble.
    All decent torrent trackers are designed like lighttpd : select() /
poll() or other variants, and no threads.
    No threads means database queries are slow operations, to be avoided.
    Hosting providers will delete your account if they see a PHP torrent
tracker on it, and for good reason.

    600 hits/s = 600 connections = 600 Apache and PG processes = you cry.

    Anyway, my tracker was in Python with a select/poll asynchronous HTTP
model. It handled 200 HTTP requests per second using 10% CPU on a Core 2. I
guess that's pretty decent.

    (I do NOT work on it anymore, DO NOT ASK for sources, it is illegal now
in my country to code trackers so I have completely dropped the project,
but I guess helping a fellow living in a free country is OK)

    Back to databases.

    You complain that postgres is slow for your application.
    Yes, it is a lot slower than MyISAM *on this application* (but try InnoDB
and cry).
    But PG is a real database. It is simply not the right tool for your
application.
    You have to choose between in-place updates and transactions.

    (besides, your website is locked while MySQL does your big UPDATE).

    Here is how you can do it :

    Your problem is that you put the peers in the database.
    Ask yourself why ?

    You need seeders / leechers counts for each torrent?
    -> Two INTEGERs in your torrents table, updated in batch by the tracker
every hour (see the sketch below).

    You need to have all peers saved somewhere so that you may exit and
restart your tracker ?
    -> code your tracker in python and be able to reload running code
    -> or just save it when you exit
    -> or don't save it, it's not like it's your bank accounting data, who
cares
    -> the FBI will be happy to have all that data when they seize your
server (see: piratebay laughs as all data was in RAM and police had to
unplug the server to seize it.)

    So, DO NOT put the peers in the database.
    IF you put the peers info in the database you get one UPDATE per user per
torrent.
    If you only update the user stats you only get one UPDATE per user.
    And the tracker never inserts users and torrents (hopefully) so you only
get UPDATES to users and to torrents tables, never inserts.
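
    A minimal sketch of the two-counter idea above (your column names will
differ, this is just to show the shape):

    ALTER TABLE torrents ADD COLUMN seeders INTEGER NOT NULL DEFAULT 0;
    ALTER TABLE torrents ADD COLUMN leechers INTEGER NOT NULL DEFAULT 0;

    -- batch update from the tracker, for example once per hour:
    UPDATE torrents SET seeders = 123, leechers = 456 WHERE id = 789;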

    Now you need to display realtime info on the user and torrent pages.
    This is easily done: your tracker is an HTTP server, so it can serve data
via HTTP (PHP serialized, JSON, whatever) that is inserted via AJAX or PHP
into your webpages.

    From my stats my tracker needs about 100 microseconds to serve a HTTP web
page with the peer counts for a torrent.

    So, you don't need Postgres for your tracker ! Use it for your website
instead...





Re: UPDATE on two large datasets is very slow

From
Tom Lane
Date:
Steve Gerhardt <ocean@ocean.fraknet.org> writes:
> # EXPLAIN ANALYZE UPDATE peers2...etc etc
> QUERY PLAN
> -------------------------------------------------------------------------
>   Merge Join  (cost=262518.76..271950.65 rows=14933 width=153) (actual
> time=8477.422..9216.893 rows=26917 loops=1)
>     Merge Cond: ((p.tid = t.tid) AND (p.uid = t.uid))
>     ->  Sort  (cost=177898.12..180004.09 rows=842387 width=65) (actual
> time=7803.248..8073.817 rows=109732 loops=1)
>           Sort Key: p.tid, p.uid
>           ->  Seq Scan on peers2 p  (cost=0.00..25885.87 rows=842387
> width=65) (actual time=0.043..4510.771 rows=647686 loops=1)
>     ->  Sort  (cost=84620.64..85546.64 rows=370400 width=96) (actual
> time=641.438..761.893 rows=55393 loops=1)
>           Sort Key: t.tid, t.uid
>           ->  Seq Scan on temp_p2 t  (cost=0.00..11112.00 rows=370400
> width=96) (actual time=0.093..275.110 rows=55393 loops=1)
>   Total runtime: 192569.492 ms
> (9 rows)

> Essentially, it looks like what it's doing is sorting both tables on the
> WHERE clause, then finding which positions correspond between the two.

You're focusing on the wrong thing --- there's nothing wrong with the plan.
It's only taking 9 seconds to perform the merge join.  The other 183
seconds are going somewhere else; you need to find out where.

One thing that came to mind was triggers, which would be shown in the
EXPLAIN results if you are using a sufficiently recent version of PG
(but you didn't say what you're using) ... however if this is a straight
port of MySQL code it's pretty unlikely to have either custom triggers
or foreign keys, so that is most likely the wrong guess.  It may just be
that it takes that long to update 26917 rows, which would suggest a
configuration problem to me.

> Anyway, I admit I haven't done a great deal of configuration file tuning
> for the Postgres setup,

shared_buffers, wal_buffers, and checkpoint_segments seem like things
you might need to increase.
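
For example (ballpark starting points only; the right numbers depend on
your RAM and write volume):

# postgresql.conf
shared_buffers = 256MB          # the default is likely far too small here
wal_buffers = 1MB
checkpoint_segments = 32        # spread out checkpoint I/O during the bulk update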

Another problem with this approach is that it's not going to take long
before the table is bloated beyond belief, if it's not vacuumed
regularly.  Do you have autovacuum turned on?

Does the tracker tend to send a lot of null updates (no real change to
the rows)?  If so it'd be worth complicating the query to check for
no-change and avoid the update for unchanged rows.
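
Something along these lines, using the columns from your query (adjust
the change test to whatever actually varies in your data):

UPDATE peers2 AS p
   SET uploaded = p.uploaded + t.uploaded,
       downloaded = p.downloaded + t.downloaded,
       remaining = t.remaining,
       avg_up = t.avg_up,
       avg_down = t.avg_down,
       active = t.active,
       timespent = p.timespent + t.timespent,
       ip = t.ip,
       port = t.port,
       blocked = t.blocked,
       timestamp = CURRENT_TIMESTAMP
  FROM temp_p2 AS t
 WHERE p.uid = t.uid AND p.tid = t.tid
   AND (t.uploaded <> 0 OR t.downloaded <> 0 OR t.timespent <> 0
        OR p.remaining IS DISTINCT FROM t.remaining
        OR p.active    IS DISTINCT FROM t.active
        OR p.ip        IS DISTINCT FROM t.ip
        OR p.port      IS DISTINCT FROM t.port);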

            regards, tom lane

Re: UPDATE on two large datasets is very slow

From
Martijn van Oosterhout
Date:
On Mon, Apr 02, 2007 at 08:24:46PM -0700, Steve Gerhardt wrote:
> I've been working for the past few weeks on porting a closed source
> BitTorrent tracker to use PostgreSQL instead of MySQL for storing
> statistical data, but I've run in to a rather large snag. The tracker in
> question buffers its updates to the database, then makes them all at
> once, sending anywhere from 1-3 MiB of query data. With MySQL, this is
> accomplished using the INSERT INTO...ON DUPLICATE KEY UPDATE query,
> which seems to handle the insert/update very quickly; generally it only
> takes about a second for the entire set of new data to be merged.

For the record, this is what the SQL MERGE command is for... I don't
think anyone is working on implementing that though...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


Re: UPDATE on two large datasets is very slow

From
Tommy Gildseth
Date:
Martijn van Oosterhout wrote:
> On Mon, Apr 02, 2007 at 08:24:46PM -0700, Steve Gerhardt wrote:
>
>> I've been working for the past few weeks on porting a closed source
>> BitTorrent tracker to use PostgreSQL instead of MySQL for storing
>> statistical data, but I've run in to a rather large snag. The tracker in
>> question buffers its updates to the database, then makes them all at
>> once, sending anywhere from 1-3 MiB of query data. With MySQL, this is
>> accomplished using the INSERT INTO...ON DUPLICATE KEY UPDATE query,
>> which seems to handle the insert/update very quickly; generally it only
>> takes about a second for the entire set of new data to be merged.
>>
>
> For the record, this is what the SQL MERGE command is for... I don't
> think anyone is working on implementing that though...
>
This will possibly provide a solution to this question:
http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE
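
The example there boils down to an update-then-insert retry loop.
Adapted very roughly to the peers2 table from this thread (the function
name, column subset and types are guesses on my part), it would look
something like:

CREATE OR REPLACE FUNCTION merge_peer(k_uid integer, k_tid integer,
                                      d_up bigint, d_down bigint)
RETURNS void AS $$
BEGIN
    LOOP
        -- first try to update an existing row
        UPDATE peers2
           SET uploaded = uploaded + d_up,
               downloaded = downloaded + d_down,
               timestamp = CURRENT_TIMESTAMP
         WHERE uid = k_uid AND tid = k_tid;
        IF found THEN
            RETURN;
        END IF;
        -- not there, so try to insert it; if another session inserts the
        -- same key first we get a unique_violation and retry the UPDATE
        BEGIN
            INSERT INTO peers2 (uid, tid, uploaded, downloaded)
            VALUES (k_uid, k_tid, d_up, d_down);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- loop back and try the UPDATE again
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;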

--
Tommy

Re: UPDATE on two large datasets is very slow

From
Martijn van Oosterhout
Date:
On Wed, Apr 04, 2007 at 11:34:22AM +0200, Tommy Gildseth wrote:
> >For the record, this is what the SQL MERGE command is for... I don't
> >think anyone is working on implementing that though...
> >
> This will possibly provide a solution to this question:
> http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE

Almost. MERGE will allow you to combine the INSERT and UPDATE into a
single command. It doesn't avoid the race condition though, so you
still need the exception block and the loop.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


Re: UPDATE on two large datasets is very slow

From
Steve Gerhardt
Date:
Tom Lane wrote:

> You're focusing on the wrong thing --- there's nothing wrong with the plan.
> It's only taking 9 seconds to perform the merge join.  The other 183
> seconds are going somewhere else; you need to find out where.
>
> One thing that came to mind was triggers, which would be shown in the
> EXPLAIN results if you are using a sufficiently recent version of PG
> (but you didn't say what you're using) ... however if this is a straight
> port of MySQL code it's pretty unlikely to have either custom triggers
> or foreign keys, so that is most likely the wrong guess.  It may just be
> that it takes that long to update 26917 rows, which would suggest a
> configuration problem to me.

Any suggestions for finding out where all the time is being spent? I'm
running 8.2.0 by the way; a bit boneheaded of me not to mention that in
the original message, but I'm planning on upgrading to 8.2.3 soon. I
don't have any triggers or other procedures set up that would interrupt
this, which is why I'm really confused by the enormous runtime. I agree
that the merge join only takes 9 seconds, but it looks to me like the
183 seconds are spent sequentially scanning both tables and then sorting
the results, which I imagine would be necessary for the merge join to
take place. Sadly, I'm not familiar enough with the internals to know if
this is the case or not.

> shared_buffers, wal_buffers, and checkpoint_segments seem like things
> you might need to increase.

I'll try modifying those and report back with what kind of performance
increases I can get.

> Another problem with this approach is that it's not going to take long
> before the table is bloated beyond belief, if it's not vacuumed
> regularly.  Do you have autovacuum turned on?
>
> Does the tracker tend to send a lot of null updates (no real change to
> the rows)?  If so it'd be worth complicating the query to check for
> no-change and avoid the update for unchanged rows.

The tracker is set up to run a VACUUM ANALYZE after each commit; I
neglected to mention that. From the testing I've done, it seems like
performance is more or less the same whether or not the table has been
vacuumed recently. Also, the tracker specifically ignores null updates,
where no data is changed, to cut down on the size of the data being sent.

Also, if you don't mind answering, I've been pretty puzzled as to why
the two stored procedures are substantially slower than the original
method, since the "concept" in my head seems like it should be a lot
simpler. Am I missing something huge about the way Postgres works?

>             regards, tom lane

Thanks for the help, Tom.

Steve Gerhardt

Re: UPDATE on two large datasets is very slow

From
Steve Gerhardt
Date:
The problem with this approach is really that the tracker + website
combination is designed to show really detailed statistics for every
user on every torrent, which (as you mentioned) complicates things a
great deal. It's also able to store all the history information for all
the clients, which is periodically combined into a type of "Old
Torrents" row, one per user, used as a placeholder for old statistics.
The tracker was designed around the concept of the website, so it's not
something that can easily be changed. In addition, the data inside the
tracker is very transient; it sort of "garbage collects" its internal
structures to save memory, which means a database-like system is needed
for any kind of persistence.

As an aside, the tracker itself is designed with the asynchronous
approach you mentioned, but it has a worker thread which handles the
database updates so the main thread can continue running as fast as
possible. The current production version, still using MySQL (sadly),
handles ~500-800 connections/sec on average, which equates to about 2.7
MiB of raw query data needing to be updated in the database. However,
benchmarking indicates it can handle up to 6,800 connections/sec with
around 1,000 parallel connections on a fairly high-end system. At that
rate the volume of SQL data being generated would be utterly enormous (I
believe it'd be around 22.5 MiB), probably so much so that keeping
statistics would be impossible no matter what RDBMS was used, which is a
problem that will need solving eventually.

All that said, I'm really just looking for a way to make Postgres work
for me as best as I can get it to, without having to re-engineer the
fundamental ways in which the tracker operates, but I do appreciate the
advice dearly.

Thanks for the reply.

Steve Gerhardt

Listmail wrote:
>     No, he can't, I also coded a bittorrent tracker of the same kind,
> and the problem is :
>
>     - each user has N torrents active
>     - torrent client does N tracker requests
>     - tracker only does 1 UPDATE to update user's stats
>
>     So if you do instantaneous updates you multiply your query load by N
> (on average between 6 and 8).
>
>     Besides, these kinds of trackers face several problems :
>
>     - they are accessed by clients which have near saturated connections
> since they're leeching illegal prOn like crazy
>     - therefore these HTTP connections are very slow
>     - therefore you have a hell of a lot of concurrent connections.
>
>     Therefore using a threaded server for this kind of load is asking
> for trouble.
>     All decent torrent trackers are designed like lighttpd : select() /
> poll() or other variants, and no threads.
>     No threads means, database queries are something long and to be
> avoided.
>     Hosting providers will delete your account if they see a php torrent
> tracker on it, and for good reason.
>
>     600 hits/s = 600 connections = 600 apache and PG process = you cry.
>
>     Anyway my tracker was in Python with select/poll asynchronous HTTP
> model. It handled 200 HTTP requests per second using 10% CPU on a Core
> 2. I guess that's pretty decent.
>
>     (I do NOT work on it anymore, DO NOT ASK for sources, it is illegal
> now in my country to code trackers so I have completely dropped the
> project, but I guess helping a fellow living in a free country is OK)
>
>     Back to databases.
>
>     You complain that postgres is slow for your application.
>     Yes, it is a lot slower than MyISAM *on this application* (but try
> InnoDB and cry).
>     But PG is a real database. It is simply not the right tool to your
> application.
>     You have to choose between in-place updates and transactions.
>
>     (besides, your website is locked while MySQL does your big UPDATE).
>
>     Here is how you can do it :
>
>     Your problem is that you put the peers in the database.
>     Ask yourself why ?
>
>     You need seeders / leechers count for each torrent ?
>     -> Two INTEGERs in your torrents table, updated in batch by the
> tracker every hour.
>
>     You need to have all peers saved somewhere so that you may exit and
> restart your tracker ?
>     -> code your tracker in python and be able to reload running code
>     -> or just save it when you exit
>     -> or don't save it, it's not like it's your bank accounting data,
> who cares
>     -> the FBI will be happy to have all that data when they seize your
> server (see: piratebay laughs as all data was in RAM and police had to
> unplug the server to seize it.)
>
>     So, DO NOT put the peers in the database.
>     IF you put the peers info in the database you get one UPDATE per
> user per torrent.
>     If you only update the user stats you only get one UPDATE per user.
>     And the tracker never inserts users and torrents (hopefully) so you
> only get UPDATES to users and to torrents tables, never inserts.
>
>     Now you need to display realtime info on the user's and torrents pages.
>     This is easily done : your tracker is a HTTP server, it can serve
> data via HTTP (php serialized, JSON, whatever) that is inserted via AJAX
> of PHP in your webpages.
>
>     From my stats my tracker needs about 100 microseconds to serve a
> HTTP web page with the peer counts for a torrent.
>
>     So, you don't need Postgres for your tracker ! Use it for your
> website instead...

Re: UPDATE on two large datasets is very slow

From
Listmail
Date:
> Any suggestions for finding out where all the time is being spent? I'm

    - time spent updating indexes? Do you have a lot of them?
    - try with fsync off (not in production, but it will give you an idea)
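
    For example (just a sketch):

    -- how many indexes does each UPDATE have to maintain?
    SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'peers2';

    -- fsync is a server-wide setting: put fsync = off in postgresql.conf
    -- on a test box and reload; never do this in production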