Re: Cursor + upsert (astronomical data) - Mailing list pgsql-performance

From Marc Mamin
Subject Re: Cursor + upsert (astronomical data)
Date
Msg-id B6F6FD62F2624C4C9916AC0175D56D8828AC3042@jenmbs01.ad.intershop.net
Whole thread Raw
In response to Re: Cursor + upsert (astronomical data)  (Craig James <cjames@emolecules.com>)
List pgsql-performance
[Craig]
>>If you haven't looked at clustering algorithms yet, you might want to do so.
>>Your problem is a special case of clustering, where you have a large number
>>of small clusters.  A good place to start is the overview on Wikipedia:
>>http://en.wikipedia.org/wiki/Cluster_analysis

According to this list, your method is similar to http://en.wikipedia.org/wiki/Basic_sequential_algorithmic_scheme,
but with what seems to be some logical errors.

>The simplest approach I could think of is that I process each row of the 3e8 rows
>sequentially and ask:
>Do I have an identifier in the radius of 1 arcsec?
>No: Generate one and assign me to it.
>Yes: Update it and assigne me to it. The update is done as weighted average – I keep the
>number of how many observations the identifier has been computed. The result is that the
>identifier will have average coordinates of all the observations it identifies – it will
>be the center.
 

Let say you have 2 single measures on a line at arcsec 1 and 2.1. which hence correspond to 2 ipix_cat.
Now add a new measure at 1.9: as you choose any of the possible adjacent ipix_cat whitout considering the least distance, you may end with an ipix_at at 1.45 which is at less than 1 arcsec then the next one.
   Moreover you have raised the weight of both ipix_cat.
     Which increases the lock probability when trying a nutlithreaded upgrade.
     
The max distance between 2 observations belonging to a same ipix_cat tends to 2 arcsec with your method. If this is ok, you should probbaly modify your method so that the 2 first points of my example would have megred to a single ipix_cat. You could use your weigth for this: increase your search radius to 2arcsec and then reject the candidates located between 1 and 2 arsec depending on their weight. The additional work load might be compensated by the smaller number of ipix_cat that woul will have.     
     




pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Very slow planning performance on partition table
Next
From: Vitalii Tymchyshyn
Date:
Subject: Re: Cursor + upsert (astronomical data)