Re: Cursor + upsert (astronomical data) - Mailing list pgsql-performance

From Vitalii Tymchyshyn
Subject Re: Cursor + upsert (astronomical data)
Date
Msg-id CABWW-d2Y4GVKXLfap2h+bDn3Y0YA_Dj=+OD8xN=NQCem+hQxDA@mail.gmail.com
In response to Re: Cursor + upsert (astronomical data)  (Jiří Nádvorník <nadvornik.ji@gmail.com>)
List pgsql-performance

Well, that's why I said to apply the regular algorithm to deduplicate after this step. Basically, what I expect is a first pass with GROUP BY that does not require any joins and produces a "dirty" set of identifiers.

It should do the following:
1) Provide a working set of dirty identifiers whose cardinality is lower than that of the original observation set by a huge factor.
2) Ensure most of the identifiers can be used as is; only for a small fraction do you need to perform an additional merge. 22% is actually a very good number – it means only about 1/5 of the identifiers need to be analyzed for merging (see the sketch below).
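
A minimal sketch of that first pass, assuming a table obs(ra, dec) with coordinates in degrees and a 1/2-arcsec grid (the table and column names here are illustrative, not the actual schema):

  -- Pass 1 (sketch): grid GROUP BY, no joins. Snap each observation to a
  -- 1/2-arcsec cell and average the coordinates per cell to get the
  -- "dirty" identifiers. Flat rounding like this ignores the cos(dec)
  -- compression of RA, so it is only an approximation.
  CREATE TABLE dirty_ident AS
  SELECT round(ra  * 7200) AS cell_x,   -- 1/2 arcsec = 1/7200 degree
         round(dec * 7200) AS cell_y,
         avg(ra)           AS ra,
         avg(dec)          AS dec,
         count(*)          AS n_obs
  FROM   obs
  GROUP  BY 1, 2;

Only the rows of dirty_ident that land within 1 arcsec of another identifier then need the regular merge algorithm.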

Best regards, Vitalii Tymchyshyn

On Jul 27, 2014, 10:35, "Jiří Nádvorník" <nadvornik.ji@gmail.com> wrote:

Hi Vitalii, thank you for your reply.


In the most pathological case, the problem you describe is that these observations lie on one line. As you suggested, B would be in the middle, so A and C are not within 1 arcsec of each other, but they must each be within 1 arcsec of their common average coordinates. If the distances between A, B and C are 1 arcsec each, the right solution is to pick B as the reference identifier and assign A and C to it.


We already tried the approach you suggest, applying a grid based on the Q3C indexing of the database. We were not just rounding the coordinates, but using the center of the Q3C “square” in which the observation took place. The result was poor, however: 22% of the identifiers were closer to each other than 1 arcsec. That means that when you crossmatch the original observations back to them, you don’t know which one to use and you get duplicates. The reason is that nearly all of the observations are from the SMC (a high density of observations), so you get more than 2 “rounded” positions in a row and don’t know which ones to join together (i.e., which to average into common coordinates). If it is not clear enough, I can draw it on an image for you.
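
To put a number like that 22% on the output, one option is a self-join over the (much smaller) identifier set; a sketch, assuming the identifiers sit in a table ident(id, ra, dec) with a Q3C index on q3c_ang2ipix(ra, dec) (these names are assumptions):

  -- Count identifiers that have another identifier closer than 1 arcsec;
  -- these are the ones a crossmatch of the original observations cannot
  -- assign uniquely.
  SELECT count(DISTINCT a.id) AS ambiguous_idents
  FROM   ident a
  JOIN   ident b
    ON   q3c_join(a.ra, a.dec, b.ra, b.dec, 1.0/3600)  -- radius in degrees
  WHERE  a.id <> b.id;

Dividing that count by the total number of identifiers gives the fraction quoted above.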

Maybe simple rounding would give better results, because the Q3C squares are not all the same size and can only be scaled by factors of 2 (a square twice as small or twice as large). We used a square with a side of about 0.76 arcsec, which approximately covers the 1-arcsec-radius circle.


Oh, and one more important thing. The difficulty of our data is not that it is 3e8 rows, but that in the highest-density regions about 1000 images overlap. That kills you when you try to self-join the observations to find neighbours for each of them – the quadratic complexity comes from the overlap on the image (e.g., 10000 observations on one image with another 999 images overlapping it means 10000 * 1000^2).
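
For reference, the naive neighbour search that hits this wall is just a distance self-join over the raw observations; a sketch, again with assumed names (obs(id, ra, dec), q3c installed):

  -- In a region where ~1000 images overlap, each observation matches
  -- detections from ~1000 images, so the output of this join grows
  -- quadratically with the overlap depth.
  SELECT a.id AS obs_a, b.id AS obs_b
  FROM   obs a
  JOIN   obs b
    ON   q3c_join(a.ra, a.dec, b.ra, b.dec, 1.0/3600)  -- 1 arcsec
  WHERE  a.id < b.id;                                  -- each pair once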


Best regards,


Jiri Nadvornik


From: tivv00@gmail.com [mailto:tivv00@gmail.com] On Behalf Of Vitalii Tymchyshyn
Sent: Sunday, July 27, 2014 8:06 AM
To: Jiří Nádvorník
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Cursor + upsert (astronomical data)


I am not sure I understand the problem fully, e.g. what to do if there are observations A, B and C with A-to-B and B-to-C less than the threshold but A-to-C over the threshold; but anyway.

Could you first apply a kind of grid to your observations? What I mean is to round your coordinates to, say, 1/2 arcsec on each axis and group the results. I think most observations will be grouped this way, and you can then use your regular algorithm to combine the results.

Best regards, Vitalii Tymchyshyn
