Thread: Cursor + upsert (astronomical data)

Cursor + upsert (astronomical data)

From
Jiří Nádvorník
Date:

Hello guys.

 

My issue touches multiple topics, but the main question is about performance. I think you need to understand the background a little to be able to help me, so I will first define the problem and my solutions to it, and leave the questions for you at the end of this message.

 

Problem:

I have a table of observations of objects on the sky. The most important columns are the coordinates (x, y); all other columns are just additional information about the observation. The problem is that when I take an image of the same object on the sky twice, the coordinates x, y won’t be identical, only close to each other. My task is to generate a common identifier for all observations of the same object and assign the observations to it (an N:1 relation). The criterion is that all observations within 1 arcsec of this identifier are considered the same object. I keep the identifiers in a separate table (objcat) and have a foreign key in the observations table.

The reason I am tackling performance here is that the observations table currently has cca 3e8 rows after 1.5 years of gathering data, and the growth is linear.

 

Technical background:

I’m trying to keep the whole algorithm in the DB if possible, because I have a good PostgreSQL plugin, Q3C, for indexing the coordinates of objects on the sky (https://code.google.com/p/q3c/source/browse/README.q3c). It also has quite a few stored procedures for finding near neighbors in such a table, which is what I’m doing. The function q3c_join(x1, y1, x2, y2, radius) returns true if object 2 is within radius of object 1. Internally it expands into a set of range conditions (<=, >=) on the index that bound the position on the sky, so nearest-neighbor queries are just index scans.
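For illustration, a typical neighbor lookup with Q3C might look like the sketch below. The table and column names follow the procedure later in this message, the coordinates are made up, Q3C expects the radius in degrees (1 arcsec = 1/3600 deg), and an index on q3c_ang2ipix(raj2000, dej2000) is assumed:

-- all observations within 1 arcsec of a given (made-up) position
SELECT *
FROM \schema.observation
WHERE q3c_join(10.0, -72.0, raj2000, dej2000, 1.0 / 3600);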

 

Solution:

After a lot of experimentation with clustering the objects and trying to process them all together in one “silver-bullet” SQL query, I decided to use a simpler approach. The main problem with the “process all at once” approach is that finding neighbours for each observation is by definition quadratic, and for 3e8 rows it simply runs out of disk space (~TBs of storage for the temporary results).

The simplest approach I could think of is to process each of the 3e8 rows sequentially and ask:

Do I have an identifier within a radius of 1 arcsec?

No: generate one and assign the observation to it.

Yes: update it and assign the observation to it. The update is a weighted average – I keep a count of how many observations the identifier has been computed from. The result is that the identifier carries the average coordinates of all the observations it identifies – it is their center (the update rule is written out below).
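In symbols, this is just an incremental (running) mean: with weight $n$ and current identifier position $\bar{x}_n$, a new observation at $x$ gives

$$\bar{x}_{n+1} = \frac{n\,\bar{x}_n + x}{n+1}, \qquad n \leftarrow n + 1$$

applied to each coordinate (raj2000, dej2000) separately.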

 

So here is my procedure. It has 3 parameters: the first two are a range of oids to scan in the table, used for scaling and parallelizing the algorithm; the third is the radius in which to search for neighbours.

 

DROP TYPE IF EXISTS coords;
CREATE TYPE coords AS (
    raj2000 double precision,
    dej2000 double precision
);

DROP FUNCTION IF EXISTS build_catalog(int, int, double precision);
CREATE OR REPLACE FUNCTION build_catalog(fromOID int, toOID int, radius double precision)
    RETURNS VOID AS $$
DECLARE
    -- Cursor over the slice of observations this worker is responsible for.
    cur1 CURSOR FOR
        SELECT raj2000, dej2000
        FROM \schema.observation AS obs
        WHERE obs.oid >= fromOID
          AND obs.oid < toOID;
    curr_raj2000 double precision;
    curr_dej2000 double precision;
    curr_coords_cat coords;
    cnt int;
BEGIN
    /*SELECT current_setting('transaction_isolation') INTO tmp;
    RAISE NOTICE 'Isolation level %', tmp;*/
    OPEN cur1;
    cnt := 0;
    LOCK TABLE \schema.objcat IN SHARE ROW EXCLUSIVE MODE;
    LOOP
        FETCH cur1 INTO curr_raj2000, curr_dej2000;
        EXIT WHEN NOT found;

        WITH upsert AS (
            -- If an identifier lies within "radius" of the observation,
            -- move it to the new weighted average and bump its weight.
            UPDATE \schema.objcat
            SET ipix_cat = q3c_ang2ipix(
                    (raj2000 * weight + curr_raj2000) / (weight + 1),
                    (dej2000 * weight + curr_dej2000) / (weight + 1)
                ),
                raj2000 = (raj2000 * weight + curr_raj2000) / (weight + 1),
                dej2000 = (dej2000 * weight + curr_dej2000) / (weight + 1),
                weight = weight + 1
            WHERE q3c_join(curr_raj2000, curr_dej2000, raj2000, dej2000, radius)
            RETURNING *
        ),
        ins AS (
            -- Otherwise create a fresh identifier at the observation itself.
            INSERT INTO \schema.objcat (ipix_cat, raj2000, dej2000, weight)
            SELECT q3c_ang2ipix(curr_raj2000, curr_dej2000),
                   curr_raj2000,
                   curr_dej2000,
                   1
            WHERE NOT EXISTS (SELECT * FROM upsert)
            RETURNING *
        )
        -- Point the current observation at whichever identifier won.
        UPDATE \schema.observation
        SET id_cat = (SELECT DISTINCT id_cat FROM upsert
                      UNION
                      SELECT id_cat FROM ins WHERE id_cat IS NOT NULL
                      LIMIT 1)
        WHERE CURRENT OF cur1;

        cnt := cnt + 1;
        IF ((cnt % 100000) = 0) THEN
            RAISE NOTICE 'Processed % entries', cnt;
        END IF;
    END LOOP;
    CLOSE cur1;
END;
$$ LANGUAGE plpgsql;
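For reference, a hypothetical invocation: each client session takes a disjoint oid range, and the radius is 1 arcsec expressed in degrees, which is what Q3C expects (the oid boundaries here are made up):

-- session 1
SELECT build_catalog(0, 150000000, 1.0 / 3600);
-- session 2, running concurrently on the other half
SELECT build_catalog(150000000, 300000000, 1.0 / 3600);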

 

Results: When I run the query only once (1 client thread), it processes cca 1 mil rows per hour, which means days for the whole dataset. When I run it in parallel with that lock to ensure pessimistic synchronization, it effectively runs sequentially too :) – the other threads just wait. When I drop the lock and hope to resolve the resulting conflicts later, the SSD serves up to 4 threads relatively effectively, which can divide my days of runtime by 4 – still unacceptable.

 

The reason is quite clear: I’m writing something to a table in one cycle of the script, and in the following cycle I need to read that information back.

 

Questions for you:

1. The first question: can you think of a better way to do this, and is SQL even capable of such a thing – or do I have to do it in C? Would rewriting the SQL function in C help?

2. Could I somehow bend the committing during the algorithm to my needs? Ensure that within one cycle, the relevant part of the identifiers table is kept in memory for faster lookups?

3. Why is this so slow? :) It is comparable to the quadratic algorithm in terms of speed – it just does not use any memory.

 

I tried to sum this up the best I could – for more information please don’t hesitate to contact me.

 

Thank you very much for even reading this far.

 

Best Regards,

 

Jiri Nadvornik

 

 

Re: Cursor + upsert (astronomical data)

From
Vitalii Tymchyshyn
Date:

I am not sure I understand the problem fully – e.g. what to do if there are observations A, B and C with A–B and B–C less than the threshold but A–C over the threshold – but anyway.

Could you first apply a kind of grid to your observations? What I mean is to round your coords to, say, 1/2 arcsec on each axis and group the results. I think you will have most observations grouped this way, and you can then use your regular algorithm to combine the results.
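A minimal sketch of that first pass, reusing the table and column names from the original post (1/2 arcsec = 1/7200 of a degree, so multiplying by 7200 and rounding snaps each coordinate to the grid; note that a fixed step in RA shrinks on the sky toward the poles):

SELECT round(raj2000 * 7200) AS ra_cell,
       round(dej2000 * 7200) AS de_cell,
       avg(raj2000) AS ra_mean,
       avg(dej2000) AS de_mean,
       count(*)     AS n_obs
FROM \schema.observation
GROUP BY 1, 2;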

Best regards, Vitalii Tymchyshyn

Re: Cursor + upsert (astronomical data)

From
Jiří Nádvorník
Date:

Hi Vitalii, thank you for your reply.

 

The most pathological case of the problem you mention is when these observations lie on one line, with B in the middle as you suggested. Then A and C are not within 1 arcsec of each other, but they must be within 1 arcsec of their common average coordinates. If the A–B and B–C distances are each 1 arcsec, the right solution is to pick B as the reference identifier and assign A and C to it.

 

We already tried the approach you suggest, applying a grid based on the Q3C indexing of the database. We were not simply rounding the coordinates, but using the center of the Q3C “square” in which the observation falls. The result was poor, however: 22% of the identifiers were closer to each other than 1 arcsec. That means that when you crossmatch the original observations against them, you don’t know which one to use and you get duplicates. The reason is that nearly all of the observations are from the SMC (a very high density of observations), so you get more than 2 “rounded” positions in a row and don’t know which ones to join together (i.e., compute average coordinates from). If this is not clear enough I can draw it in an image for you.

Maybe a simple round-up would give better results, because the Q3C squares are not all the same size and can only be scaled by factors of 2 (a 2-times smaller or larger square). We used a square with a side of cca 0.76 arcsec, which approximately covers the 1 arcsec radius circle.

 

Oh, and one more important thing. The difficulty of our data is not that it is 3e8 rows, but that in the highest-density areas cca 1000 images overlap. That kills you when you try to self-join the observations to find neighbours for each of them – the quadratic complexity scales with the overlap on the image (e.g. 10000 observations on one image with another 999 images overlapping it means 10000 * 1000^2).

 

Best regards,

 

Jiri Nadvornik

 


Re: Cursor + upsert (astronomical data)

From
Craig James
Date:
Jiri,

If you haven't looked at clustering algorithms yet, you might want to do so. Your problem is a special case of clustering, where you have a large number of small clusters.  A good place to start is the overview on Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis

A lot of people have worked extensively on this problem, and you might find a good solution, or at least some ideas to guide your own algorithm. In my field (chemistry), researchers often need to cluster 10^6 to 10^7 chemical compounds, and a great deal of research has gone into efficient ways to do so.

Craig



--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

Re: Cursor + upsert (astronomical data)

From
Marc Mamin
Date:
[Craig]
>>If you haven't looked at clustering algorithms yet, you might want to do so.
>>Your problem is a special case of clustering, where you have a large number
>>of small clusters.  A good place to start is the overview on Wikipedia:
>>http://en.wikipedia.org/wiki/Cluster_analysis

According to this list, your method is similar to http://en.wikipedia.org/wiki/Basic_sequential_algorithmic_scheme,
but with what seems to be some logical errors.

>The simplest approach I could think of is to process each of the 3e8 rows
>sequentially and ask:
>Do I have an identifier within a radius of 1 arcsec?
>No: generate one and assign the observation to it.
>Yes: update it and assign the observation to it. The update is a weighted average – I keep a
>count of how many observations the identifier has been computed from. The result is that the
>identifier carries the average coordinates of all the observations it identifies – it is
>their center.
 

Let's say you have 2 single measures on a line at arcsec 1 and 2.1, which hence correspond to 2 ipix_cat entries.
Now add a new measure at 1.9: since you pick one of the possible adjacent ipix_cat entries without considering the least distance, you may end up with an ipix_cat at 1.45, which is at less than 1 arcsec from the next one.
   Moreover you have raised the weight of both ipix_cat entries,
     which increases the lock probability when attempting a multithreaded update.

The max distance between 2 observations belonging to the same ipix_cat tends to 2 arcsec with your method. If this is OK, you should probably modify your method so that the first 2 points of my example would have merged into a single ipix_cat. You could use your weight for this: increase your search radius to 2 arcsec and then reject candidates located between 1 and 2 arcsec depending on their weight. The additional workload might be compensated by the smaller number of ipix_cat entries that you will end up with.
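A sketch of that candidate selection, reusing the objcat columns from the original post and Q3C's q3c_dist(); curr_raj2000/curr_dej2000 stand for the observation currently being processed:

SELECT id_cat, weight,
       q3c_dist(curr_raj2000, curr_dej2000, raj2000, dej2000) AS dist
FROM \schema.objcat
WHERE q3c_join(curr_raj2000, curr_dej2000, raj2000, dej2000, 2.0 / 3600)
ORDER BY dist   -- always prefer the least-distant candidate
LIMIT 1;
-- candidates with dist between 1/3600 and 2/3600 could then be
-- accepted or rejected based on their weight, as suggested above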
     




Re: Cursor + upsert (astronomical data)

From
Vitalii Tymchyshyn
Date:

Well, that's why I said to apply a regular algorithm to deduplicate after this step. Basically, what I expect is a first pass with GROUP BY that does not require any joins and produces a "dirty" set of identifiers.

It should do the following things:
1) Provide a working set of dirty identifiers with far lower cardinality than the original observation set.
2) Leave most of the identifiers usable as-is; only a small fraction needs an additional merge. 22% is actually a very good number – it means only about 1/5 of the identifiers have to be analyzed for merging (a sketch of such a merge pass follows below).
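A sketch of that merge pass on the (small) identifier set, reusing the objcat table from the original post:

-- pairs of identifiers closer than 1 arcsec; each pair is a merge candidate
SELECT a.id_cat AS keep_id, b.id_cat AS merge_id
FROM \schema.objcat AS a
JOIN \schema.objcat AS b
  ON a.id_cat < b.id_cat
 AND q3c_join(a.raj2000, a.dej2000, b.raj2000, b.dej2000, 1.0 / 3600);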

Best regards, Vitalii Tymchyshyn


Re: Cursor + upsert (astronomical data)

From
Jiří Nádvorník
Date:

Hi Craig,

 

I’m really interested in those algorithms and want to study them. But I would need somebody to point me directly at a specific algorithm to look at. The main problem with choosing the right one (which even my university teacher couldn’t get past) is that you don’t know the number of clusters (a classification problem) and you don’t know the number of objects in one cluster (that isn’t such a big deal) – but each cluster has a different count of objects, which is a big issue because it actually disqualifies all k-means-based algorithms (correct me if I’m wrong).

 

We were thinking of some kind of Bayesian method or a likelihood-ratio computation, but dismissed that when we found an implementation that ran for ~hours on ~5e5 records – I just don’t think such an approach would deliver better results, faster, than the simple linear one (checking each row and updating/inserting it into the catalog).

 

I actually left the algorithm running on a smaller dataset (~80 million rows), and on 8 cores (cca 4 of them in use at a time) it ran for 48 hours. That’s not that bad considering it must run only once for the dataset (on DB population) – additions will then be processed quickly. The result is cca 1.2 million identifiers, of which 50 thousand are closer to each other than 1 arcsec (collisions) – but that is an error of <5%. I think if we worked on this algorithm a bit we could make it faster and reduce that error to maybe 1%, which would be acceptable? What do you think?

 

Thank you very much for your effort.

 

Kind Regards,

 

Jiri Nadvornik

 


Re: Cursor + upsert (astronomical data)

From
Jiří Nádvorník
Date:
Hi Oleg, Sergey,

The problem would be crossmatch if I had a catalog to crossmatch with. But I am actually trying to build this catalog.

Crossmatching can actually be used to solve the problem, if I crossmatch the observations with themselves via q3c_join with 1 arcsec. But as I said, that crashes on the fact that we have ~a thousand images overlapping. That is the factor behind the quadratic complexity of self-crossmatching the table (self-joining is the same thing, if I understand the crossmatching term correctly).

I actually managed to write a silver-bullet query which did the self-joining in the most nested subquery and then worked with the results via analytic methods like count and rank. These found the best candidate of the self-joined tuples to compute the average coordinates on which I grouped them and got my identifier. I can send you the code if you are interested.

It worked like a charm for smaller data – fast and with a small error (<1%). But when self-joining 3e8 rows, where in the highest-density areas each observation is multiplied by a factor of 1000^2, the temporary results run out of disk space (I didn’t have more than 1.5 TB). So I managed to work around this by dividing the dataset into smaller parts (cca 25000 rows per part). When run in parallel on 8 cores it used them quite well (cca 6 cores fully loaded at a time) and the 24 GB of memory was at ~75% load.

If the time were linear in the processed results, the time to process the 3e8 rows would be ~9 days. However, I tried this only once (for obvious reasons) and the RDBMS crashed after cca 4 days of this heavy load. I don't know whether I was stretching Postgres over its limits there, so at that point I tried to find an algorithm with linear complexity. That's how I got to where I am now.

P.S.: I consulted this with people from Catalina and they do this with a Friends-of-Friends algorithm – but they don't have most of their stars in high-density areas like the SMC, as we do. That's why I decided not to use it: I think it would fail horribly when the distances are comparable with the threshold of your astrometry.

Thank you very much for your invested effort.

Kind Regards,

Jiri Nadvornik

-----Original Message-----
From: Oleg Bartunov [mailto:obartunov@gmail.com]
Sent: Sunday, July 27, 2014 6:57 PM
To: Jiří Nádvorník
Cc: pgsql-performance@postgresql.org; Sergey Karpov
Subject: Re: [PERFORM] Cursor + upsert (astronomical data)

Jiri,

as I understand it, your problem is called crossmatch? I attach a pdf of our work in progress, where we compared several spatial indexing techniques, including postgis, q3c and the new pgsphere. Sergey Karpov made these benchmarks.

New pgsphere you may find here
https://github.com/akorotkov/pgsphere

Oleg




Re: Cursor + upsert (astronomical data)

From
Craig James
Date:
Hi Jiri,

I’m really interested in those [clustering] algorithms and want to study them. But I would need somebody to point me directly at a specific algorithm to look at. The main problem with choosing the right one (which even my university teacher couldn’t get past) is that you don’t know the number of clusters (a classification problem) and you don’t know the number of objects in one cluster (that isn’t such a big deal) – but each cluster has a different count of objects, which is a big issue because it actually disqualifies all k-means-based algorithms (correct me if I’m wrong).


I'm not an expert in clustering, I just worked with several people who developed clustering packages.

One algorithm that's widely used in chemistry and genetics is the Jarvis-Patrick algorithm. Google for "jarvis patrick clustering" and you'll find several good descriptions. It has the advantages that it's deterministic, it works with any distance metric, and it chooses the number of clusters itself (you don't have to specify it ahead of time). The drawback of Jarvis-Patrick clustering is that you have to start with a nearest-neighbors list (i.e. for each item in your data set, you must find the nearest N items, where N is something like 10-20). In a brute-force approach, finding nearest neighbors is quadratic in the number of items, which is bad, but if you have any method of partitioning observations to narrow down the range of neighbors that have to be examined, the running time can be dramatically reduced.
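Since the observations are already Q3C-indexed, the nearest-neighbors list could be built in the database itself. A sketch, assuming PostgreSQL 9.3+ for LATERAL; the 5 arcsec search cap and N = 15 are arbitrary choices:

SELECT o.oid, nn.oid AS neighbor_oid
FROM \schema.observation AS o
CROSS JOIN LATERAL (
    SELECT n.oid
    FROM \schema.observation AS n
    WHERE q3c_join(o.raj2000, o.dej2000, n.raj2000, n.dej2000, 5.0 / 3600)
      AND n.oid <> o.oid
    ORDER BY q3c_dist(o.raj2000, o.dej2000, n.raj2000, n.dej2000)
    LIMIT 15    -- N nearest neighbors within the cap
) AS nn;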

One advantage of JP clustering is that once you have the nearest-neighbors list (which is the time-consuming part), the actual JP clustering is fast. You can spend some time up front computing the nearest-neighbors list, and then run JP clustering a number of time with different parameters until you get the clusters you want.

A problem with J-P clustering in your situation is that it will tend to merge clusters. If you have two astronomical objects with overlapping observations, they'll probably merge into one. You might be able to address this with a post-processing step that identifies "bimodal" or "multimodal" clusters – say, by finding a cluster with a distinctly asymmetrical or elliptical outline and splitting it. But I think any good clustering package would have trouble with these cases.

There are many other clustering algorithms, but many of them suffer from being non-deterministic, or from requiring a number-of-clusters parameter at the start.

That pretty much exhausts my knowledge of clustering. I hope this gives you a little insight.

Craig


Re: Cursor + upsert (astronomical data)

From
Jiří Nádvorník
Date:
Hello Sergey,

Oh dear, I should have written to you several months ago :). I actually did exactly what you suggest – it was actually Markus Demleitner's (GAVO org.) idea. The GROUP BY on the ipix center indeed runs in less than an hour on the whole table and is linear by definition, so no problem with memory. I used, I think, binning depth 11, which means that if I treat the ipix as a square and take its area in steradians (from one of your functions, I don't remember which), it has a side of cca 0.7 arcsec and a "half-diameter" of cca 1.1 arcsec – so it approximately covers the circle of 1 arcsec radius centered on the square's center, if you understand what I mean.
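For the record, that first pass was essentially the sketch below, assuming the q3c_ipixcenter(ra, dec, depth) signature implied in Sergey's message quoted further down (depth 11 is the binning depth mentioned above):

SELECT q3c_ipixcenter(raj2000, dej2000, 11) AS ipix_bin,
       avg(raj2000) AS ra_mean,
       avg(dej2000) AS de_mean,
       count(*)     AS n_obs
FROM \schema.observation
GROUP BY 1;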

I actually presented this approach and its quite good results at the IVOA Interop in Madrid this year (don't be afraid, I pointed out that I use your Q3C plugin for PostgreSQL).

The problem actually comes when you have more than 2 of these ipix "squares" in a line – when joining them, you don't know which ones to use. It causes about 22% of these joined ipix centers to have collisions: when I take the resulting ipix (joined from the ipix centers) and look back for all the original observations from which these ipix centers were constructed, they don't belong to the same object. This is because when you compute with the ipix centers instead of the original observations, in the most pathological case you shift the observations from the bottom left corner to the upper corner when joining with a square to your upper left. That means a shift of up to 2.2 arcsec which you can't backtrack.

But. It definitely brought us closer to our goal – you don't have 3e8 observations, you have cca 7e7 observations. It's still too much to process with a quadratic algorithm, but it's something.

Maybe I could use it along with my *naïve* sequential algorithm. Could you look at the script attached and see whether there are any obviously inefficient statements?

Thank you very much! You know, even talking with someone about the problem brings you closer to the solution.

P.S.: I am sorry if I got a little confused with your name when addressing you – are you the co-author of the Q3C paper where the name is S. Koposov (+ O. Bartunov)?

Best Regards,

Jiri Nadvornik

-----Original Message-----
From: Sergey Karpov [mailto:karpov.sv@gmail.com]
Sent: Monday, July 28, 2014 5:03 PM
To: Oleg Bartunov
Cc: Jiří Nádvorník; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Cursor + upsert (astronomical data)

Hi Jiri,

I understand your problem (and I actually have exactly the same one in my sky monitoring experiment). Unfortunately, I have no complete solution for it as of now.

I just may suggest you look at the q3c_ipixcenter() function from q3c, which returns the Q3C bin number for given coordinates at a given binning length. Probably, you may first group your observations using this function, selecting the binning depth to roughly match your desired angular resolution, and then merge the (much shorter) list of resulting bin numbers (the sort of common identifier you are looking for) for the corner cases (when the same cluster of observational points lies between two bins). I did not try this approach myself yet, but probably it will be significantly faster.

Good luck!

Sergey


Re: Cursor + upsert (astronomical data)

From
Jeff Janes
Date:
On Sat, Jul 26, 2014 at 3:46 AM, Jiří Nádvorník <nadvornik.ji@gmail.com> wrote:

 

The reason why I solve the performance issues here is that the table of observations currently has cca 3e8 rows after 1.5 years of gathering the data. The growth is linear.


So about 500,000 new records a day.



(UPDATE \schema.objcat
 SET ipix_cat = q3c_ang2ipix(
         (raj2000 * weight + curr_raj2000) / (weight + 1),
         (dej2000 * weight + curr_dej2000) / (weight + 1)
     ),
     raj2000 = (raj2000 * weight + curr_raj2000) / (weight + 1),
     dej2000 = (dej2000 * weight + curr_dej2000) / (weight + 1),
     weight = weight + 1
 WHERE q3c_join(curr_raj2000, curr_dej2000, raj2000, dej2000, radius)
 RETURNING *),



Doing all of this (above, plus the other parts I snipped) as a single query seems far too clever.  How can you identify the slow component when you have them all munged up like that?

Turn the above into a SELECT query and run it on a random smattering of records with 'explain (analyze, buffers)', periodically, while the load process is going on.
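For example, something along these lines – the position is arbitrary; the point is to watch the plan and buffer counts as objcat grows:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM \schema.objcat
WHERE q3c_join(10.0, -72.0, raj2000, dej2000, 1.0 / 3600);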


 

Results: When I run the query only once (1 client thread), it runs cca 1 mil rows per hour.


Is that 1 million, or 1 thousand?  I'm assuming million, but...
 

Which is days for the whole dataset. When I run it in parallel with that lock to ensure pessimistic synchronization, it runs sequentially too :) – the other threads just wait. When I delete that lock and hope to solve the resulting conflicts later, the ssd disk serves up to 4 threads relatively effectively – which can divide my days of time by 4 – still unacceptable.


It is processing new records 192 times faster than you are generating them.  Why is that not acceptable?
 

 

The reason is quite clear here – I’m trying to write something in one cycle of the script to a table – then in the following cycle I need to read that information.


That is the reason for the concurrency issues, but it is not clear that it is why the performance is not what you desire.  If you first partition your data into stripes that are a few arc minutes wide, each stripe should not interact with anything other than itself and its two neighbors.  That should parallelize nicely (see the sketch below).
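A sketch of such striping – 5-arcminute declination stripes (5 arcmin = 1/12 degree); the stripe width is an assumption, and each worker would take stripes far enough apart that their borders don't interact:

SELECT floor(dej2000 * 12)::int AS stripe_id,
       count(*) AS n_obs
FROM \schema.observation
GROUP BY 1
ORDER BY 1;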
 

 

Questions for you:

1. The first question: can you think of a better way to do this, and is SQL even capable of such a thing – or do I have to do it in C? Would rewriting the SQL function in C help?


Skillfully hand-crafted C will always be faster than SQL, if you don't count the time needed to write and debug it.

 

2. Could I somehow bend the committing during the algorithm to my needs? Ensure that within one cycle, the relevant part of the identifiers table is kept in memory for faster lookups?

Is committing a bottleneck?  It looks like you are doing everything in large transactional chunks already, so it probably isn't.  If the identifier table fits in memory, it should probably stay in memory there on its own just through usage.  If it doesn't fit, there isn't much you can do other than pre-cluster the data in a coarse-grained way such that only a few parts of the table (and its index) are active at any one time, such that those active parts stay in memory.
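One coarse-grained way to get that locality with Q3C – a sketch; the index name is hypothetical, and the point is to physically order the table by spatial bin so that the active working set stays small:

-- assumes an index like:
--   CREATE INDEX observation_q3c_idx
--       ON \schema.observation (q3c_ang2ipix(raj2000, dej2000));
CLUSTER \schema.observation USING observation_q3c_idx;
ANALYZE \schema.observation;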

 

3. Why is this so slow? :) It is comparable to the quadratic algorithm in terms of speed – it just does not use any memory.


Use 'explain (analyze, buffers)', preferably with track_io_timing on.  Use top, strace, gprof, or perf.
 
Cheers,

Jeff