Re: Cursor + upsert (astronomical data) - Mailing list pgsql-performance

From Jiří Nádvorník
Subject Re: Cursor + upsert (astronomical data)
Date
Msg-id 004401cfab11$bd4a4a70$37dedf50$@gmail.com
In response to Re: Cursor + upsert (astronomical data)  (Craig James <cjames@emolecules.com>)
Responses Re: Cursor + upsert (astronomical data)  (Craig James <cjames@emolecules.com>)
List pgsql-performance

Hi Craig,

 

I’m really interested in those algorithms and will study them. But I would need somebody to point me directly at a specific algorithm to look at. The main problem with choosing the right one (which even my university teacher couldn’t get past) is that you don’t know the number of clusters (a classification problem), and you don’t know the number of objects in one cluster (that isn’t such a big deal), but each cluster has a different number of objects – which is a big issue, because it effectively disqualifies all k-means-based algorithms (correct me if I’m wrong).
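
One family that would at least not need the cluster count up front is single-linkage (“friends-of-friends”) grouping. A minimal sketch of the idea in SQL – the observations(obs_id, ra, dec) table is hypothetical, q3c_join comes from the Q3C extension (radius in degrees, so 1 arcsec = 1/3600), and as written it would of course be far too slow for 3e8 rows:

  -- single-linkage (friends-of-friends) grouping via a recursive CTE:
  -- label each observation with the smallest obs_id reachable through
  -- chains of <1 arcsec neighbours
  WITH RECURSIVE comp(obs_id, root) AS (
      SELECT obs_id, obs_id FROM observations   -- every point starts as its own root
    UNION                                       -- UNION (not ALL) guarantees termination
      SELECT o2.obs_id, comp.root
      FROM comp
      JOIN observations o1 ON o1.obs_id = comp.obs_id
      JOIN observations o2
        ON q3c_join(o1.ra, o1.dec, o2.ra, o2.dec, 1.0/3600)
  )
  SELECT obs_id, min(root) AS cluster_id        -- smallest reachable id labels the cluster
  FROM comp
  GROUP BY obs_id;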

 

We were thinking of some kind of Bayesian method or a likelihood-ratio computation, but dismissed that after finding an implementation that ran for hours on ~5e5 records – I just don’t think such an approach would give results enough better than the simple linear algorithm (checking each row and updating/inserting it into the catalog) to justify being that much slower.
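
For reference, that simple linear pass is the cursor + upsert from the subject line. A sketch of it in PL/pgSQL – the names (observations, catalog, n_obs, cat_ref) are made up for illustration, and since there is no built-in upsert it is spelled out by hand:

  -- one pass over the observations: attach each one to the nearest
  -- catalog identifier within 1 arcsec, or create a new identifier
  CREATE OR REPLACE FUNCTION build_catalog() RETURNS void AS $$
  DECLARE
    obs      RECORD;
    v_cat_id bigint;
  BEGIN
    FOR obs IN SELECT obs_id, ra, dec FROM observations LOOP
      SELECT c.cat_id INTO v_cat_id
      FROM catalog c
      WHERE q3c_join(obs.ra, obs.dec, c.ra, c.dec, 1.0/3600)
      ORDER BY q3c_dist(obs.ra, obs.dec, c.ra, c.dec)
      LIMIT 1;

      IF v_cat_id IS NULL THEN
        INSERT INTO catalog (ra, dec, n_obs)
        VALUES (obs.ra, obs.dec, 1)
        RETURNING cat_id INTO v_cat_id;
      ELSE
        UPDATE catalog
        SET ra    = (ra  * n_obs + obs.ra)  / (n_obs + 1),  -- running mean
            dec   = (dec * n_obs + obs.dec) / (n_obs + 1),
            n_obs = n_obs + 1
        WHERE cat_id = v_cat_id;
      END IF;

      UPDATE observations SET cat_ref = v_cat_id
      WHERE obs_id = obs.obs_id;
    END LOOP;
  END;
  $$ LANGUAGE plpgsql;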

 

I actually left the algorithm running on a smaller set of data (~80 million rows), and on 8 cores (roughly 4 of them used at a time) it ran for 48 hours. That’s not that bad, assuming it must run only once for the dataset (on DB population) – after that, additions will be processed fast. The result is about 1.2 million identifiers, of which 50 thousand are closer to each other than 1 arcsec (collisions) – but that is less than a 5% error. I think if we worked a bit on this algorithm we could make it faster and reduce that error percentage to maybe 1% – would that be acceptable? What do you think?
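
(The collision figure comes from a single self-join on the resulting catalog – again only a sketch, with the same hypothetical names:)

  -- pairs of catalog identifiers closer than 1 arcsec to each other
  SELECT count(*)
  FROM catalog c1
  JOIN catalog c2
    ON  c1.cat_id < c2.cat_id                             -- count each pair once
    AND q3c_join(c1.ra, c1.dec, c2.ra, c2.dec, 1.0/3600);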

 

Thank you very much for your effort.

 

Kind Regards,

 

Jiri Nadvornik

 

From: Craig James [mailto:cjames@emolecules.com]
Sent: Sunday, July 27, 2014 5:35 PM
To: Jiří Nádvorník
Cc: Vitalii Tymchyshyn; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Cursor + upsert (astronomical data)

 

Jiri,

 

If you haven't looked at clustering algorithms yet, you might want to do so. Your problem is a special case of clustering, where you have a large number of small clusters.  A good place to start is the overview on Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis

 

A lot of people have worked extensively on this problem, and you might find a good solution, or at least some ideas to guide your own algorithm. In my field (chemistry), researchers often need to cluster 10^6 to 10^7 chemical compounds, and a great deal of research has gone into efficient ways to do so.

 

Craig

 

 

On Sun, Jul 27, 2014 at 7:35 AM, Jiří Nádvorník <nadvornik.ji@gmail.com> wrote:

Hi Vitalii, thank you for your reply.

 

The most pathological form of the problem you suggest is when these observations lie on one line. As you said, B would be in the middle, so A and C are not within 1 arcsec of each other, but each of them is within 1 arcsec of their common average coordinates. If the distances A–B and B–C are each 1 arcsec, the right solution is to pick B as the reference identifier and assign A and C to it.

 

We already tried the approach you suggest by applying a grid based on the Q3C indexing of the database. We were not just rounding the coordinates, but using the center of the Q3C “square” in which the observation took place. The result was poor, however – 22% of the identifiers were closer to each other than 1 arcsec. That means that when you crossmatch the original observations against them, you don’t know which one to use, and you get duplicates. The reason is that nearly all of the observations are from the SMC (a very high density of observations), which means you get more than 2 “rounded” positions in a row and don’t know which of them to join together (i.e., compute the average coordinates from). If that is not clear enough I can draw it in an image for you.

Maybe simple rounding would give better results, because the Q3C squares are not all the same size and can only be scaled by factors of 2 (a square 2× smaller or larger). We used a square with a side of about 0.76 arcsec, which approximately covers the 1-arcsec-radius circle.
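
The plain-rounding variant would look roughly like this (a sketch with the same hypothetical observations table, coordinates in degrees; note that a fixed step in RA shrinks on the sky by cos(dec), which matters at the SMC’s declination):

  -- group observations into fixed 0.76-arcsec grid cells and
  -- take the cell means as candidate identifier positions
  SELECT round(ra  * 3600 / 0.76) AS cell_ra,
         round(dec * 3600 / 0.76) AS cell_dec,
         avg(ra)   AS mean_ra,
         avg(dec)  AS mean_dec,
         count(*)  AS n_obs
  FROM observations
  GROUP BY 1, 2;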

 

Oh, and one more important thing. The difficulty of our data is not that it is 3e8 rows, but that in the highest-density regions about 1000 images overlap. That kills you when you try to self-join the observations to find neighbours for each of them – the quadratic complexity comes from the overlap on each image (e.g. 10000 observations on one image with another 999 images overlapping it means 10000 * 1000^2).
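
(The join in question is the obvious one – same hypothetical names as above. In a dense field almost every row matches hundreds of others, which is exactly the quadratic blow-up:)

  -- naive neighbour search: every observation against every other
  -- observation within 1 arcsec
  SELECT o1.obs_id, o2.obs_id AS neighbour_id
  FROM observations o1
  JOIN observations o2
    ON  o1.obs_id <> o2.obs_id
    AND q3c_join(o1.ra, o1.dec, o2.ra, o2.dec, 1.0/3600);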

 

Best regards,

 

Jiri Nadvornik

 

From: tivv00@gmail.com [mailto:tivv00@gmail.com] On Behalf Of Vitalii Tymchyshyn
Sent: Sunday, July 27, 2014 8:06 AM
To: Jiří Nádvorník
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Cursor + upsert (astronomical data)

 

I am not sure I understand the problem fully – e.g. what to do if there are observations A, B and C with A-to-B and B-to-C less than the threshold and A-to-C over the threshold – but anyway.

Could you first apply a kind of grid to your observations? What I mean is to round your coords to, say, 1/2 arcsec on each axis and group the results. I think you will have most observations grouped this way, and you can then use your regular algorithm to combine the results.
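
Something like this, just as a sketch (assuming a hypothetical observations(obs_id, ra, dec) table with coordinates in degrees):

  -- snap each observation to a 1/2-arcsec grid and collect the groups
  SELECT round(ra  * 7200) AS grid_ra,    -- 7200 half-arcsec steps per degree
         round(dec * 7200) AS grid_dec,
         array_agg(obs_id) AS members
  FROM observations
  GROUP BY 1, 2;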

Best regards, Vitalii Tymchyshyn



 

--

---------------------------------
Craig A. James

Chief Technology Officer

eMolecules, Inc.

---------------------------------
