Re: Cursor + upsert (astronomical data) - Mailing list pgsql-performance

From Jiří Nádvorník
Subject Re: Cursor + upsert (astronomical data)
Date
Msg-id 004e01cfab15$765c7f30$63157d90$@gmail.com
In response to Cursor + upsert (astronomical data)  (Jiří Nádvorník <nadvornik.ji@gmail.com>)
List pgsql-performance
Hi Oleg, Sergey,

The problem would be crossmatch if I had a catalog to crossmatch with. But I am actually trying to build this catalog.

Crossmatching can actually be used to solve that problem, by crossmatching the observations with themselves via
q3c_join with a 1 arcsec radius. But as I said, that crashes on the fact that we have ~a thousand overlapping images.
This is the factor of quadratic complexity of self-crossmatching the table (self-joining is the same thing, if I
understand the crossmatching term correctly).
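For concreteness, the self-crossmatch I mean is a query of roughly this shape (a sketch only - the table and column names are taken from my build_catalog function quoted below, and q3c_join expects its radius in degrees):

```sql
-- Self-crossmatch of the observation table within 1 arcsec.
-- In fields with ~1000 overlapping images, each row pairs with
-- ~1000 neighbours, so the intermediate result grows quadratically.
SELECT o1.oid AS oid1, o2.oid AS oid2
FROM \schema.observation AS o1
JOIN \schema.observation AS o2
  ON q3c_join(o1.raj2000, o1.dej2000,
              o2.raj2000, o2.dej2000,
              1.0 / 3600.0)        -- 1 arcsec expressed in degrees
WHERE o1.oid < o2.oid;             -- count each pair only once
```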

I actually managed to write a silver-bullet query, which did the self-joining in the innermost subquery and then
worked with the results via analytic (window) functions like count and rank. These found the best candidate among the
self-joined tuples, I computed the average coordinates from it, grouped on those, and got my identifier. I can send
you the code if you are interested.
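Very roughly, the shape was something like this (a simplified sketch, not the real query - "pairs" stands for the self-crossmatch result, and the oid1/dist columns are hypothetical):

```sql
-- Innermost subquery self-joins; window functions then rank the
-- candidates per observation, and the best ones are averaged into
-- the identifier's coordinates.
SELECT avg(raj2000) AS cat_raj2000,
       avg(dej2000) AS cat_dej2000,
       count(*)     AS weight
FROM (
    SELECT p.oid1, p.raj2000, p.dej2000,
           rank() OVER (PARTITION BY p.oid1 ORDER BY p.dist) AS rnk
    FROM pairs AS p              -- result of the self-crossmatch
) AS ranked
WHERE rnk = 1                    -- best candidate of the tuples
GROUP BY oid1;
```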

It worked like a charm for smaller data - fast, with a small error (<1%). But when self-joining 3e8 rows, where in the
highest-density areas each observation is multiplied by a factor of 1000^2, the temporary results run out of disk
space (I didn't have more than 1.5 TB). So I managed to solve this by dividing the dataset into smaller parts (ca.
25000 rows per part). When run in parallel on 8 cores, it used them quite well (ca. 6 cores fully loaded at a time)
and the 24 GB of memory was at ~75% load.
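The chunked runs then amounted to invoking the same processing (or later the build_catalog function quoted below, which takes an oid range for exactly this purpose) over disjoint ranges from several sessions, e.g.:

```sql
-- Illustrative only: one worker session per oid range,
-- 8 such sessions running in parallel.
SELECT build_catalog(     0,  25000, 1.0 / 3600.0);
SELECT build_catalog( 25000,  50000, 1.0 / 3600.0);
-- ... further ranges up to the full 3e8 rows
```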

If the time were linear in the number of processed rows, the time to process the 3e8 rows would be ~9 days. However, I
tried this only once (for obvious reasons) and the RDBMS crashed after ca. 4 days of this heavy load. I don't know
whether I was stretching Postgres over its limits here, so at that point I tried to find an algorithm with linear
complexity. That's how I got to the point where I am right now.

P.S.: I consulted this with people from Catalina and they do this with a Friends-of-Friends algorithm - but they don't
have most of their stars in high-density areas like the SMC, as we do. That's why I decided not to use it: I think it
would crash horribly when the distances are comparable with the threshold of your astrometry.

Thank you very much for your invested effort.

Kind Regards,

Jiri Nadvornik

-----Original Message-----
From: Oleg Bartunov [mailto:obartunov@gmail.com]
Sent: Sunday, July 27, 2014 6:57 PM
To: Jiří Nádvorník
Cc: pgsql-performance@postgresql.org; Sergey Karpov
Subject: Re: [PERFORM] Cursor + upsert (astronomical data)

Jiri,

as I understand it, your problem is called crossmatch? I attach a pdf of our work in progress, where we compared
several spatial indexing techniques, including postgis, q3c and the new pgsphere. Sergey Karpov made these benchmarks.

New pgsphere you may find here
https://github.com/akorotkov/pgsphere

Oleg

On Sat, Jul 26, 2014 at 1:46 PM, Jiří Nádvorník <nadvornik.ji@gmail.com> wrote:
> Hello guys.
>
>
>
> My issue kind of hits multiple topics, but the main question is about
> performance. I think you need to understand the background a little
> bit to be able to help me. So I will firstly define the problem and my
> solutions to it and place the questions for you to the end of this message.
>
>
>
> Problem:
>
> I have a table of observations of objects on the sky. The most
> important columns are the coordinates (x,y). All other columns in
> there are just additional information about the observation. The
> problem is that when I take an image of the same object on the sky
> twice, the coordinates x,y won’t be the same, they will be only close
> to each other. My task is to generate a common identifier to all of
> the observations of the same object and assign the observations to it
> (N:1 relation). The criterion is that all of the observations which
> are within 1 arcsec of this identifier are considered as the same
> object. I keep the identifiers in a separate table (objcat) and have a foreign key in the observations table.
>
> The reason why I am solving performance issues here is that the table
> of observations currently has ca. 3e8 rows after 1.5 years of
> gathering the data, and the number grows linearly.
>
>
>
> Technical background:
>
> I’m trying to keep the whole algorithm in the DB if possible, because
> I have a good PostgreSQL plugin, Q3C, for indexing the coordinates of
> the objects on the sky
> (https://code.google.com/p/q3c/source/browse/README.q3c). It also has
> quite a few stored procedures to look for near neighbours in that
> table, which is what I’m doing. The function q3c_join(x1, x2, y1, y2,
> radius) returns true if the object y is within radius of the object x.
> It simply generates bitmap index scans or queries with the operators
> <=, >= which define the position on the sky. Queries for the nearest
> neighbours are then only index scans.
>
>
>
> Solution:
>
> After a lot of experimentation with clustering the objects and trying
> to process them all together in one “silver-bullet” SQL query, I
> decided to use a simpler approach. The main problem with the “process
> all at once” approach is that finding the neighbours for each
> observation is by definition quadratic, and for 3e8 rows it just runs
> out of disk space (~TBs of space for the temporary results).
>
> The simplest approach I could think of is to process each of the 3e8
> rows sequentially and ask:
>
> Do I have an identifier within a radius of 1 arcsec?
>
> No: Generate one and assign me to it.
>
> Yes: Update it and assign me to it. The update is done as a weighted
> average – I keep a count of how many observations the identifier has
> been computed from. The result is that the identifier will have the
> average coordinates of all the observations it identifies – it will be
> their center.
>
>
>
> So here I come with my procedure. It has 3 params. The first two are
> the range of oids to process in the table, used for scaling and
> parallelizing the algorithm. The third is the radius in which to
> search for the neighbours.
>
>
>
> DROP TYPE IF EXISTS coords;
> CREATE TYPE coords AS (
>     raj2000 double precision,
>     dej2000 double precision
> );
>
> DROP FUNCTION IF EXISTS build_catalog(int, int, double precision);
> CREATE OR REPLACE FUNCTION build_catalog (fromOID int, toOID int,
>                                           radius double precision)
>     RETURNS VOID AS $$
> DECLARE
>     cur1 CURSOR FOR
>         SELECT raj2000, dej2000
>         FROM \schema.observation AS obs
>         WHERE obs.oid >= fromOID
>           AND obs.oid < toOID;
>     curr_raj2000 double precision;
>     curr_dej2000 double precision;
>     curr_coords_cat coords;
>     cnt int;
> BEGIN
>     /*SELECT current_setting('transaction_isolation') into tmp;
>     raise notice 'Isolation level %', tmp;*/
>     OPEN cur1;
>     cnt := 0;
>     LOCK TABLE \schema.objcat IN SHARE ROW EXCLUSIVE MODE;
>     LOOP
>         FETCH cur1 INTO curr_raj2000, curr_dej2000;
>         EXIT WHEN NOT found;
>
>         WITH upsert AS (
>             UPDATE \schema.objcat
>             SET ipix_cat = q3c_ang2ipix(
>                     (raj2000 * weight + curr_raj2000) / (weight + 1),
>                     (dej2000 * weight + curr_dej2000) / (weight + 1)
>                 ),
>                 raj2000 = (raj2000 * weight + curr_raj2000) / (weight + 1),
>                 dej2000 = (dej2000 * weight + curr_dej2000) / (weight + 1),
>                 weight = weight + 1
>             WHERE q3c_join(curr_raj2000, curr_dej2000,
>                            raj2000, dej2000,
>                            radius)
>             RETURNING *
>         ),
>         ins AS (
>             INSERT INTO \schema.objcat
>                 (ipix_cat, raj2000, dej2000, weight)
>             SELECT q3c_ang2ipix(curr_raj2000, curr_dej2000),
>                    curr_raj2000,
>                    curr_dej2000,
>                    1
>             WHERE NOT EXISTS (SELECT * FROM upsert)
>             RETURNING *
>         )
>         UPDATE \schema.observation
>         SET id_cat = (SELECT DISTINCT id_cat FROM upsert
>                       UNION
>                       SELECT id_cat FROM ins
>                       WHERE id_cat IS NOT NULL
>                       LIMIT 1)
>         WHERE CURRENT OF cur1;
>
>         cnt := cnt + 1;
>         IF ((cnt % 100000) = 0) THEN
>             RAISE NOTICE 'Processed % entries', cnt;
>         END IF;
>     END LOOP;
>     CLOSE cur1;
> END;
> $$ LANGUAGE plpgsql;
>
>
>
> Results: When I run the query in only one client thread, it processes
> ca. 1 million rows per hour, which means days for the whole dataset.
> When I run it in parallel with that lock to ensure pessimistic
> synchronization, it effectively runs sequentially too :) – the other
> threads just wait. When I drop the lock and hope to resolve the
> resulting conflicts later, the SSD serves up to 4 threads relatively
> effectively – which can divide my days of runtime by 4 – still
> unacceptable.
>
>
>
> The reason is quite clear here – in one cycle of the script I write
> something to a table, and in the following cycle I need to read that
> information back.
>
>
>
> Questions for you:
>
> 1.       The first question is whether you can think of a better way to
> do this, and whether SQL is even capable of such a thing – or do I
> have to do it in C? Would rewriting the SQL function in C help?
>
> 2.       Could I somehow bend the committing during the algorithm to my
> needs? Ensure that within one cycle the relevant part of the
> identifiers table is kept in memory for faster lookups?
>
> 3.       Why is this so slow? :) It is comparable to the quadratic
> algorithm in terms of speed – it just does not use any memory.
>
>
>
> I tried to sum this up as best I could – for more information, please
> don’t hesitate to contact me.
>
>
>
> Thank you very much for even reading this far.
>
>
>
> Best Regards,
>
>
>
> Jiri Nadvornik
>
>
>
>


