Re: Cursor + upsert (astronomical data) - Mailing list pgsql-performance

From Jeff Janes
Subject Re: Cursor + upsert (astronomical data)
Date
Msg-id CAMkU=1ws0YbQy-a9FTrb3e7xkhJ=jmSvZQRfbPyTSFYMamTYeg@mail.gmail.com
In response to Cursor + upsert (astronomical data)  (Jiří Nádvorník <nadvornik.ji@gmail.com>)
List pgsql-performance
On Sat, Jul 26, 2014 at 3:46 AM, Jiří Nádvorník <nadvornik.ji@gmail.com> wrote:

The reason I am addressing the performance issues here is that the table of observations currently has about 3e8 rows after 1.5 years of gathering data. The growth is linear.


So about 500,000 new records a day.



    (UPDATE \schema.objcat
     SET ipix_cat = q3c_ang2ipix(
             (raj2000 * weight + curr_raj2000) / (weight + 1),
             (dej2000 * weight + curr_dej2000) / (weight + 1)),
         raj2000 = (raj2000 * weight + curr_raj2000) / (weight + 1),
         dej2000 = (dej2000 * weight + curr_dej2000) / (weight + 1),
         weight = weight + 1
     WHERE q3c_join(curr_raj2000, curr_dej2000,
                    raj2000, dej2000,
                    radius)
     RETURNING *),



Doing all of this (above, plus the other parts I snipped) as a single query seems far too clever.  How can you identify the slow component when you have them all munged up like that?

Turn the above into a select query, and run it on a random smattering of records with 'explain (analyze, buffers)', periodically while the load process is going on.
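A minimal sketch of that suggestion: run just the matching part of the UPDATE as a plain SELECT over a random sample, under EXPLAIN. The table name "observations", its "id" column, and the 0.001-degree radius here are assumptions for illustration; substitute the real names and radius.

```sql
-- Profile the spatial match on a random sample of 100 observations.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.*
FROM   observations o              -- assumed name of the raw-observations table
JOIN   objcat c
  ON   q3c_join(o.raj2000, o.dej2000,
                c.raj2000, c.dej2000,
                0.001)             -- placeholder match radius, in degrees
WHERE  o.id IN (SELECT id FROM observations ORDER BY random() LIMIT 100);
```

Comparing the plans and buffer counts from successive runs shows whether the spatial lookup itself degrades as the catalog grows.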



Results: When I run the query in only one client thread, it processes about 1 mil rows per hour.


Is that 1 million, or 1 thousand?  I'm assuming million, but...

Which means days for the whole dataset. When I run it in parallel with that lock to ensure pessimistic synchronization, it effectively runs sequentially too :) with the other threads just waiting. When I delete that lock and hope to resolve the resulting conflicts later, the SSD serves up to 4 threads relatively effectively – which can divide my days of runtime by 4 – still unacceptable.


It is processing new records 192 times faster than you are generating them.  Why is that not acceptable?
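The 192x figure follows directly from the numbers quoted above (about 1 mil rows per hour per thread, 4 threads, roughly 500,000 new records arriving per day). As a quick sanity check:

```sql
-- 4 threads * 1,000,000 rows/hour * 24 hours = 96,000,000 rows/day processed,
-- versus ~500,000 rows/day arriving:
SELECT (4 * 1000000 * 24) / 500000;   -- 192
```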
The reason is quite clear here – in one cycle of the script I write something to a table, and then in the following cycle I need to read that information back.


That is the reason for the concurrency issues, but it is not clear that it is the reason the performance is not what you desire.  If you first partition your data into stripes that are a few arc minutes wide, each stripe should not interact with anything other than itself and its two neighbors.  That should parallelize nicely.
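One way to express that striping idea, as a sketch: bucket each observation by declination into stripes of a hypothetical 0.1-degree (6 arc minute) height, so each worker processes one stripe and only adjacent stripes can share matches. The "observations" table name and stripe height are assumptions.

```sql
-- Assign each observation a declination stripe id; dej2000 is assumed to be
-- in degrees, in [-90, +90]. Workers then take disjoint sets of stripes.
SELECT floor((dej2000 + 90) / 0.1)::int AS stripe_id,
       count(*)                         AS n_obs
FROM   observations                      -- assumed table name
GROUP  BY stripe_id
ORDER  BY stripe_id;
```

Only sources within the match radius of a stripe boundary need any cross-stripe reconciliation, which can be done in a cheap second pass.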

Questions for you:

1. The first question: can you think of a better way to do this, and is SQL even capable of such a thing – or do I have to do it in C? Would rewriting the SQL function in C help?


Skillfully hand-crafted C will always be faster than SQL, if you don't count the time needed to write and debug it.


2. Could I somehow adjust the committing during the algorithm for my use case? Ensure that within one cycle, the whole relevant part of the identifiers table is kept in memory for faster lookups?

Is committing a bottleneck?  It looks like you are doing everything in large transactional chunks already, so it probably isn't.  If the identifier table fits in memory, it should stay in memory on its own just through usage.  If it doesn't fit, there isn't much you can do other than pre-cluster the data in a coarse-grained way, so that only a few parts of the table (and its index) are active at any one time and those active parts stay in memory.
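Coarse pre-clustering could be as simple as physically rewriting the identifier table in spatial-index order once (a sketch; the index name is an assumption, and CLUSTER takes an exclusive lock while it runs):

```sql
-- Rewrite objcat so spatially nearby rows share heap pages; a load that then
-- proceeds in sky order touches only a small, hot subset of table and index.
CLUSTER objcat USING objcat_ipix_cat_idx;  -- assumed index name on ipix_cat
ANALYZE objcat;                            -- refresh planner statistics afterwards
```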


3. Why is this so slow? :) It is comparable to the quadratic algorithm in terms of speed – it just does not use any memory.


Use 'explain (analyze, buffers)', preferably with track_io_timing on.  Use top, strace, gprof, or perf.
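track_io_timing can be switched on per session before profiling, so the EXPLAIN output includes actual I/O read/write times; for example (the query below is a placeholder to substitute with the slow one):

```sql
SET track_io_timing = on;        -- adds I/O timing lines to EXPLAIN ANALYZE output
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM objcat;     -- placeholder; run the actual slow query here
```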
Cheers,

Jeff
