Re: Import large data set into a table and resolve duplicates? - Mailing list pgsql-general

From Francisco Olarte
Subject Re: Import large data set into a table and resolve duplicates?
Date
Msg-id CA+bJJbytzU2qerqmibSj4jTGcGJtQUvyg-Stw+8NC6QYSqEP1w@mail.gmail.com
In response to Re: Import large data set into a table and resolve duplicates?  (Eugene Dzhurinsky <jdevelop@gmail.com>)
List pgsql-general
Hi Eugene:

On Sun, Feb 15, 2015 at 6:36 PM, Eugene Dzhurinsky <jdevelop@gmail.com> wrote:
...
Since the "dictionary" already has an index on the "series", it seems that
patch_data doesn't need to have any index here.
....
At this point "patch_data" needs to get an index on "already_exists = false",
which seems to be cheap.

As I told you before, do not focus on the indexes too much. When you do bulk updates like this, they tend to be much slower than a proper sort.

The reason is locality of reference. When you do things with sorts, you make two or three nicely ordered passes over the data, using full pages. When you use indexes, you spend a lot of time traversing index structures and switching between reading index pages and data pages, index, data, ... ( they are cached, but you still have to switch between them ). Also, with your kind of data, the indexes on series are going to be big, so there is less cache available for the data.
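
To make that concrete, here is a minimal sketch of the idea ( in Python, and only for illustration: it assumes the dictionary and the patch have each been dumped to a plain text file, one series per line, already sorted, e.g. with sort -u ). A single ordered merge pass finds the new entries without touching any index:

def missing_series(dictionary_path, patch_path, output_path):
    # Both inputs are assumed sorted ascending on the series value.
    with open(dictionary_path) as dic, \
         open(patch_path) as patch, \
         open(output_path, "w") as out:
        dic_line = dic.readline()
        for series in patch:
            # Advance the dictionary until it catches up with the patch key.
            while dic_line and dic_line < series:
                dic_line = dic.readline()
            if dic_line != series:
                # Not present in the dictionary, so it needs to be inserted.
                out.write(series)

Everything is read and written strictly in order, which is exactly why the sort-based approach plays so nicely with the caches and read-ahead.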


As I said before, it depends on your data anyway. With current machines, what I would do for this problem is just write a program ( Perl seems adequate for this ), copy the dictionary into client memory, and then read the patch data, spitting out the result file and inserting the needed lines along the way. It should all fit in 1 GB without problems, which is not much by today's standards.
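
I am not going to write the Perl for you, but a rough sketch of that program in Python would look something like this ( the table layout, dictionary(id, series), the patch file format and the psycopg2 driver are my assumptions, not something from your mails ):

import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# Pull the whole dictionary into client memory once.
cur.execute("SELECT series, id FROM dictionary")
known = dict(cur.fetchall())

with open("patch_data.txt") as patch, open("result.txt", "w") as result:
    for line in patch:
        series = line.rstrip("\n")
        if series not in known:
            # New series: insert it and remember the generated id.
            cur.execute(
                "INSERT INTO dictionary (series) VALUES (%s) RETURNING id",
                (series,))
            known[series] = cur.fetchone()[0]
        # Emit id + series for every patch line, old or new.
        result.write("%s\t%s\n" % (known[series], series))

conn.commit()
cur.close()
conn.close()

A hash lookup per line replaces the per-row index probes, and only the genuinely new series ever hit the database.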

Regards.
Francisco Olarte.


