Thread: Import large data set into a table and resolve duplicates?

Import large data set into a table and resolve duplicates?

From
Eugene Dzhurinsky
Date:
Hello!

I have a huge dictionary table with series data generated by a third-party
service. The table consists of 2 columns

- id : serial, primary key
- series : varchar, not null, indexed
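
For reference, a minimal sketch of the table as described ( exact names are assumed ):

CREATE TABLE dictionary (
    id     SERIAL PRIMARY KEY,
    series VARCHAR NOT NULL
);
CREATE INDEX dictionary_series_idx ON dictionary (series);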

From time to time I need to apply a "patch" to the dictionary; the patch file
consists of "series" data, one per line.

Now I need to import the patch into the database and produce another file as
follows: for each row in the source file,
- if the given "series" value already exists in the database, return ID:series
- otherwise insert a new row into the table, generate a new ID and return ID:series

So the new file will contain both ID and series data, separated by tab or
something.
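
For illustration ( values made up ), a patch line is just the series text, and the
corresponding output line is the ID, a tab, and the series:

patch file line:    SOME-SERIES-DATA
output file line:   42<TAB>SOME-SERIES-DATA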

While reading and writing the data is not a question (I simply described the
whole task just in case), I wonder what is the most efficient way of importing
such data into a table, keeping in mind that

- the dictionary table already contains ~200K records
- the patch could be ~1-50K records long
- records cannot be removed from the dictionary, only added if they do not exist

Thanks!

--
Eugene Dzhurinsky


Re: Import large data set into a table and resolve duplicates?

From
John McKown
Date:
On Sat, Feb 14, 2015 at 3:54 PM, Eugene Dzhurinsky <jdevelop@gmail.com> wrote:
Hello!

I have a huge dictionary table with series data generated by a third-party
service. The table consists of 2 columns

- id : serial, primary key
- series : varchar, not null, indexed

From time to time I need to apply a "patch" to the dictionary; the patch file
consists of "series" data, one per line.

Now I need to import the patch into the database and produce another file as
follows: for each row in the source file,
- if the given "series" value already exists in the database, return ID:series
- otherwise insert a new row into the table, generate a new ID and return ID:series

So the new file will contain both ID and series data, separated by tab or
something.

While reading and writing the data is not a question (I simply described the
whole task just in case), I wonder what is the most efficient way of importing
such data into a table, keeping in mind that

- the dictionary table already contains ~200K records
- the patch could be ~1-50K records long
- records cannot be removed from the dictionary, only added if they do not exist

Thanks!

--
Eugene Dzhurinsky

​I was hoping that you'd get a good reply from someone else. But it is the weekend. So I will _try_ to explain what I would try. You can see if it is of any use to you. Sometimes my ideas are really good. And other times they are, well let's be nice and say "not as good as other times" [grin/]. I can't test the below because I don't have any data. But hopefully it will be close and even of some help to you.

The first thing that I would do is put the input patch data into its own table. Perhaps a temporary table, or even a permanent one.

DROP TABLE IF EXISTS patch_data;
CREATE TABLE patch_data (
   input_record_number SERIAL,
   already_exists BOOLEAN DEFAULT FALSE,
   id INTEGER,
   series TEXT NOT NULL );

I don't know if the order of the records in the output file needs to match the order of the records in the input file. If not, then you don't need the "input_record_number" field. Otherwise, it is used to maintain the order of the input. At this point, you can load the input file with an SQL command similar to:

COPY patch_data (series) FROM 'input-file.txt';
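
Note that COPY ... FROM 'file' reads the file on the database server and needs the corresponding privileges there; if the patch file sits on the client, psql's \copy variant ( same assumed file name ) reads it locally instead:

\copy patch_data (series) from 'input-file.txt'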

Now update the patch_data from the existing dictionary table to see which "series" data already exists.

UPDATE patch_data SET already_exists = EXISTS (SELECT 1 FROM dictionary WHERE dictionary.series = patch_data.series);

At this point, the table patch_data has been updated such that if the series data in it already exists, the "already_exists" column is now TRUE instead of the initial FALSE. This means that we need to insert all the series data in "patch_data" which does not exist in "dictionary" ( i.e. "already_exists" is FALSE in "patch_data") into "dictionary".

INSERT INTO dictionary(series) SELECT series FROM patch_data WHERE already_exists = FALSE;

The above should insert the "series" records which don't exist in "dictionary" into it, generating an "id" for each from your SERIAL definition. Now we need to find out the "id" values for each of the "series" values and return them.

UPDATE patch_data SET id=((SELECT id FROM dictionary WHERE dictionary.series = patch_data.series));

At this point, if I have not messed up, every "series" value in "patch_data" should have the proper "id" value from "dictionary". So the table "patch_data" should now have all that we need in it. So we just use it to make our output file. I don't know your output requirements, but I'd just do something like:

SELECT id || e'\t' || series FROM patch_data ORDER BY input_record_number;
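
Or, to write the tab-separated file straight from psql, something along these lines ( output file name made up ); COPY's default text format separates the columns with a tab:

\copy (SELECT id, series FROM patch_data ORDER BY input_record_number) TO 'patched-output.txt'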

Again, if the output order does not need to be the same as the input order, then all the stuff with the "input_record_number" column can be eliminated. Just to be a good citizen:

DROP TABLE patch_data;

at this point. Unless you think you might need it for some reason. Who knows, you might even want to archive it as some sort of audit information.


--
He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown

Re: Import large data set into a table and resolve duplicates?

From
John McKown
Date:
OOPS, I forgot to mention in the SELECT generating the output file that the e'\t' generates a "tab" character. You likely already know this, but I like to make my posts as self-contained as possible.
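
For instance ( made-up values ):

SELECT 42 || e'\t' || 'some-series-value';

prints 42, a tab, and the series value as a single text column.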

--
He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown

Re: Import large data set into a table and resolve duplicates?

From
Eugene Dzhurinsky
Date:
On Sun, Feb 15, 2015 at 01:06:02PM +0100, Francisco Olarte wrote:
> You state below 200k rows, 50k lines per patch. That is not huge unless
> "series" is really big, is it?

series data is between 100-4096 chars

> 1.- Get the patches into a ( temp ) table, using something like \copy, call
> this patches_in.
> 2.- create (temp) table existing_out as select series, id from dictionary
> join patches_in on (series);
> 3.- delete from patches_in where series in (select series from
> existing_out);
> 4.- create (temp) table new_out as insert into dictionary (series) select
> patches_in.series from patches_in returning series, id
> 5.- Copy existing out and patches out.
> 6.- Cleanup temps.
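
For concreteness, a rough, untested SQL sketch of those steps ( table and file names assumed ):

CREATE TEMP TABLE patches_in (series TEXT NOT NULL);
\copy patches_in (series) from 'patch.txt'
-- 2: series that are already in the dictionary
CREATE TEMP TABLE existing_out AS
    SELECT d.series, d.id FROM dictionary d JOIN patches_in USING (series);
-- 3: keep only the new series in patches_in
DELETE FROM patches_in WHERE series IN (SELECT series FROM existing_out);
-- 4: insert the new series, capturing the generated ids through a CTE
CREATE TEMP TABLE new_out (series TEXT, id INTEGER);
WITH ins AS (
    INSERT INTO dictionary (series)
    SELECT series FROM patches_in
    RETURNING series, id
)
INSERT INTO new_out SELECT series, id FROM ins;
-- 5: write out both sets (text format separates id and series with a tab)
\copy (SELECT id, series FROM existing_out UNION ALL SELECT id, series FROM new_out) to 'result.txt'
-- 6: temp tables vanish at session end, or drop them explicitly
DROP TABLE patches_in, existing_out, new_out;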

That sounds cool, but I'm a bit worried about the performance of a lookup over
the series column and the time to create an index on the "series" column for the
"temp" table. But perhaps it's better to try this, and if performance gets really
bad, then do some optimizations, like partitioning etc.

Thank you!

--
Eugene Dzhurinsky


Re: Import large data set into a table and resolve duplicates?

From
Eugene Dzhurinsky
Date:
On Sun, Feb 15, 2015 at 10:00:50AM -0600, John McKown wrote:
> UPDATE patch_data SET already_exists = EXISTS (SELECT 1 FROM dictionary
> WHERE dictionary.series = patch_data.series);

Since the "dictionary" already has an index on the "series", it seems that
patch_data doesn't need to have any index here.

> At this point, the table patch_data has been updated such that if the
> series data in it already exists, the "already_exists" column is now TRUE
> instead of the initial FALSE. This means that we need to insert all the
> series data in "patch_data" which does not exist in "dictionary" ( i.e.
> "already_exists" is FALSE in "patch_data") into "dictionary".
>
> INSERT INTO dictionary(series) SELECT series FROM patch_data WHERE
> already_exists = FALSE;

At this point "patch_data" needs to get an index on "already_exists = false",
which seems to be cheap.
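
Something like this partial index would cover that ( index name assumed ):

CREATE INDEX patch_data_missing_idx ON patch_data (series) WHERE NOT already_exists;

Only the rows still to be inserted are indexed, so it stays small.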

> UPDATE patch_data SET id=((SELECT id FROM dictionary WHERE
> dictionary.series = patch_data.series));

No index needed here except the existing one on "dictionary".

That looks really promising, thank you John! I need only one index on the
"patch_data" table, and I will re-use the existing index on the "dictionary".

Thanks again!

--
Eugene Dzhurinsky


Fwd: Import large data set into a table and resolve duplicates?

From
Francisco Olarte
Date:
Forwarding to the list, as I forgot ( as usual, I think I did the same with my original message ). Sorry for the duplicates.
---------- Forwarded message ----------
From: Francisco Olarte <folarte@peoplecall.com>
Date: Sun, Feb 15, 2015 at 6:48 PM
Subject: Re: [GENERAL] Import large data set into a table and resolve duplicates?
To: Eugene Dzhurinsky <jdevelop@gmail.com>


Hi Eugene:

On Sun, Feb 15, 2015 at 6:29 PM, Eugene Dzhurinsky <jdevelop@gmail.com> wrote:
On Sun, Feb 15, 2015 at 01:06:02PM +0100, Francisco Olarte wrote:
> You state below 200k rows, 50k lines per patch. That is not huge unless
> "series" is really big, is it?

series data is between 100-4096 chars

Well, then it depends on the average length. With a 1k average you'll have 200Mb for the series. Not having stats on your setup, that is easily manageable by today's standard machines.

One thing that strikes me is you are either at the beginning of your usage of this or you have A LOT of already present lines in the patch ( I mean, the patch has one fourth of the lines of the dictionary ). Which is the case?

That sounds cool, but I'm a bit worried about the performance of a lookup over
the series column and the time to create an index on the "series" column for the
"temp" table. But perhaps it's better to try this, and if performance gets really
bad, then do some optimizations, like partitioning etc.

Who said to create an index on the temp tables? They are small. And, anyway, if I were doing this with files I'd NEVER do it with indexing; I'd sort, merge, and go on ( this is why I suggested temp tables, postgres does sorts and merges really well if it needs to, and the optimizer has other tricks available ).

When doing bulk operations ( like this, with a record count of 25% of the whole file ) indexing is nearly always slower than sorting and merging. As I said, I do not know the distribution of your data size, but assuming the worst case ( a 200k x 4k entries dict, which is 800Mb of data, and 50k x 4k patches, which is 200Mb of data ), you can do it with text files in 5 minutes by sorting and merging in 8Mb of RAM ( I've done similar things in that time, and the disks were much slower; I've even done this kind of thing with half-inch tapes and it wasn't that slow ). You just sort the dictionary by series, sort the patch by series, read both comparing keys and write a result file and a new dictionary, and, as the new dictionary is already sorted, you do not need to read it the next time. It was the usual thing to do for updating accounts in the tape days.

And, as they always say: measure, then optimize. I would try it ( use some little tables, say a 20-entry dict and 5-line patches, to debug the queries ), then measure ( if you don't mind hogging the server, or have a test one, launch the whole thing and measure times; you may be surprised, sometimes simple algorithms are fast and get it done in a reasonable time ).

The only other thing I would maybe try is doing the whole thing without temp tables, something like:

SELECT series, id FROM dictionary JOIN patches_in USING (series)
UNION ALL   -- faster than UNION, as you know there will be no duplicates, so do not bother checking for them
INSERT INTO dictionary (series)
    SELECT p.series FROM patches_in p
    WHERE p.series NOT IN (SELECT series FROM dictionary)
    RETURNING series, id

( you may need to experiment a little with the queries and use some WITH expressions, but you see the logic ).
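
Spelled out with a WITH expression, an untested sketch could look roughly like this:

WITH new_rows AS (
    INSERT INTO dictionary (series)
    SELECT p.series FROM patches_in p
    WHERE NOT EXISTS (SELECT 1 FROM dictionary d WHERE d.series = p.series)
    RETURNING id, series
)
SELECT d.id, d.series
FROM dictionary d JOIN patches_in USING (series)   -- rows that already existed
UNION ALL
SELECT id, series FROM new_rows;                   -- rows inserted just now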

Francisco Olarte.

Re: Import large data set into a table and resolve duplicates?

From
Francisco Olarte
Date:
Hi Eugene:

On Sun, Feb 15, 2015 at 6:36 PM, Eugene Dzhurinsky <jdevelop@gmail.com> wrote:
​...​
 
Since the "dictionary" already has an index on the "series", it seems that
patch_data doesn't need to have any index here.
​....
At this point "patch_data" needs to get an index on "already_exists = false",
which seems to be cheap.

As I told you before, do not focus on the indexes too much. When you do bulk updates like this they tend to be much slower than a proper sort.

The reason is locality of reference. When you do things with sorts you do two or three nicely ordered passes on the data, using full pages. When you use indexes you spend a lot of time parsing index structures and switching read-index, read-data, index, data, .... ( They are cached, but you have to switch to them anyway ). Also, with your kind of data, indexes on series are going to be big, so less cache is available for data.


As I said before, it depends on your data anyway. With current machines these days, what I'd do with this problem would be to just make a program ( in perl, which seems adequate for this ), copy the dictionary to client memory and just read the patch, spitting out the result file and inserting the needed lines along the way. It should fit in 1Gb without problems, which is not much by today's standards.

Regards.
Francisco Olarte.



Re: Import large data set into a table and resolve duplicates?

From
Eugene Dzhurinsky
Date:
On Sun, Feb 15, 2015 at 06:48:36PM +0100, Francisco Olarte wrote:
> One thing that strikes me is you are either at the beginning of your usage
> of this or you have A LOT of already present lines in the patch ( I mean,
> the patch has one fourth of the lines of the dictionary ). Which is the case?

Right now I'm at the very beginning stage, yes, and I expect to have a "cache
miss" ratio for the dictionary of at least 70%, which will eventually drop
down to 5-10%.

> When doing bulk operations ( like this, with a record count of 25% of the
> whole file ) indexing is nearly always slower than sorting and merging. As
> I said, I do not know the distribution of your data size, but assuming the
> worst case ( a 200k x 4k entries dict, which is 800Mb of data, and 50k x 4k
> patches, which is 200Mb of data ), you can do it with text files in 5 minutes
> by sorting and merging in 8Mb of RAM ( I've done similar things in that time,
> and the disks were much slower; I've even done this kind of thing with
> half-inch tapes and it wasn't that slow ). You just sort the dictionary by
> series, sort the patch by series, read both comparing keys and write a result
> file and a new dictionary, and, as the new dictionary is already sorted, you
> do not need to read it the next time. It was the usual thing to do for
> updating accounts in the tape days.

So you suggest taking this out of Postgres? That's interesting. Simply put,
I'll do a dump of the dictionary, sorted by series, to some file. Then sort
the patch file by series. Then merge the dictionary (left) and the patch
(right) files. And during the merge, if a (right) line doesn't have a
corresponding (left) line, then put a nextval expression for the sequence as the ID
parameter. Then truncate the existing dictionary table and COPY the data from the
merged file into it.

Is that what you meant?

Thank you!

--
Eugene Dzhurinsky


Re: Import large data set into a table and resolve duplicates?

From
Paul A Jungwirth
Date:
Hi Eugene,

> Now I need to import the patch into the database, and produce another file as
> - if the passed "series" field exists in the database, then return ID:series
> - otherwise insert a new row to the table and generate new ID and return ID:series
> for each row in the source file.

I think Francisco's approach is good, and I agree that ~200k rows is
hardly anything. My approach is similar but uses CTEs to combine a lot
of Francisco's queries into one. I still have a separate COPY command
though. (It'd be great if you could COPY into a CTE, but I guess
COPYing into a temporary table is pretty close.) Anyway, when I run
this on my machine, the import finishes in a few seconds:

# Makefile
database=dictionary
port=5432
words=/usr/share/dict/american-english
SHELL=/bin/bash

initial.txt:
  for i in {1..3}; do \
    cat "${words}" | while read line; do \
      echo $$i "$$line"; \
    done; \
  done > initial.txt

tables: initial.txt
  sudo su postgres -c 'psql -p ${port} ${database} -f tables.sql < initial.txt'

a.txt:
  for i in {1,4}; do \
    cat "${words}" | while read line; do \
      echo $$i "$$line"; \
    done; \
  done > a.txt

b.txt:
  for i in {4,5}; do \
    cat "${words}" | while read line; do \
      echo $$i "$$line"; \
    done; \
  done > b.txt

a: a.txt
  sudo su postgres -c 'psql -p ${port} ${database} -f import.sql < a.txt'

b: b.txt
  sudo su postgres -c 'psql -p ${port} ${database} -f import.sql < b.txt'

clean:
  rm -f initial.txt a.txt b.txt

.PHONY: tables a b clean


# tables.sql
DROP TABLE IF EXISTS dictionary;
CREATE TABLE dictionary (id SERIAL PRIMARY KEY, series VARCHAR NOT NULL);
\copy dictionary (series) from pstdin
CREATE UNIQUE INDEX idx_series ON dictionary (series);


# import.sql
CREATE TEMPORARY TABLE staging (
  series VARCHAR NOT NULL
);
\copy staging (series) from pstdin
CREATE INDEX idx_staging_series ON staging (series);

WITH already AS (
  SELECT  id, staging.series
  FROM    staging
  LEFT OUTER JOIN dictionary
  ON      dictionary.series = staging.series
),
adding as (
  INSERT INTO dictionary
  (series)
  SELECT  series::text
  FROM    already
  WHERE   id IS NULL
  RETURNING id, series
)
SELECT  id, series
FROM    adding
UNION
SELECT  id, series
FROM    already WHERE id IS NOT NULL
;


Good luck!
Paul


Re: Import large data set into a table and resolve duplicates?

From
Francisco Olarte
Date:


On Sun, Feb 15, 2015 at 7:11 PM, Eugene Dzhurinsky <jdevelop@gmail.com> wrote:
On Sun, Feb 15, 2015 at 06:48:36PM +0100, Francisco Olarte wrote:
> One thing that strikes me is you are either at the beginning of your usage
> of this or you have A LOT of already present lines in the patch ( I mean,
> the patch has one fourth of the lines of the dictionary ). Which is the case?

Right now I'm at the very beginning stage, yes, and I expect to have a "cache
miss" ratio for the dictionary of at least 70%, which will eventually drop
down to 5-10%.

Well, that gives one important data point ( the other being the expected median/average size of your series entries ).

And this means the size of the problem is completely different, so the solutions should be too. I mean, at a 70% miss rate with a 50k-line patch you are expecting 35k insertions, and you even expect 2.5k / 5k insertions when it drops. Not knowing any other number, I'll urge you to DISCARD any of my proposed solutions ( i.e., if you have 20 runs at 35k followed by 100 runs at 5k you will end up with a 200+20*35+100*5 = 200+700+500 = 1400k row table receiving 5k updates from a 50k file, which is a completely different problem than the original ). Not having any expected problem size, I will not propose any more.


> When doing bulk operations ( like this, with a record count of 25% of the
> whole file ) indexing is nearly always slower than sorting and merging. As
.....

So you suggest taking this out of Postgres?

As stated above, I do not presently suggest anything, as I'm unable to do so without further knowledge of the problem ( which, apart from the great unknown of the problem size, would also require knowing how the dictionary is going to be used, among other things ).

That's interesting. Simply put,
I'll do a dump of the dictionary, sorted by series, to some file. Then sort
the patch file by series. Then merge the dictionary (left) and the patch
(right) files. And during the merge, if a (right) line doesn't have a
corresponding (left) line, then put a nextval expression for the sequence as the ID
parameter. Then truncate the existing dictionary table and COPY the data from the
merged file into it.

I would not truncate the dictionary, just put insert ... returning expressions in the data file. Anyway, I would not presently suggest doing that; with the data I have, its performance is a great unknown. With the original problem size it can be made highly performant, and I know batch processes ( sort, merge, etc... ) easily beat ''interactive'' ( index lookup etc. ) throughput-wise ( and so total-time-wise ) on some sizes, but you need to know the problem.

Also, when doing this, it depends on who updates the table and how. If only the patches do it, you can keep the table on the server and a sorted copy out of it, and use the sorted copy plus the patch to prepare a set of insert-returnings, which you then process and use ( their return value ) to generate a new sorted copy. This is normally very fast, it can be made robust, and, most important, it does not hit the server too much: you use the sorted copy plus the patch to generate a 'to_insert' file, then go to the server, copy that into a to_insert temp table, insert it into dict with returning, inserting the result ( with IDs ) into an 'inserted' temp table, then copy out that table, and then with the copy out ( which you made already sorted, or just sort again ) you patch the dictionary copy. If you ever think you've lost sync, you just copy out the dictionary again and sort it. This is batch-processing 101, but you need knowledge of the problem; with the current data, I cannot recommend anything.
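
A rough, untested sketch of that round trip ( all names assumed ):

-- the client has already diffed its sorted dictionary copy against the sorted
-- patch and produced to_insert.txt holding only the series that are new
CREATE TEMP TABLE to_insert (series TEXT NOT NULL);
\copy to_insert (series) from 'to_insert.txt'
CREATE TEMP TABLE inserted (id INTEGER, series TEXT);
WITH ins AS (
    INSERT INTO dictionary (series)
    SELECT series FROM to_insert
    RETURNING id, series
)
INSERT INTO inserted SELECT id, series FROM ins;
\copy (SELECT id, series FROM inserted ORDER BY series) to 'inserted.txt'

inserted.txt is then merged back into the client's sorted copy of the dictionary.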

I've been doing that sort of thing for decades, and it works quite well, but I normally only bother doing it for ''big'' tasks, meaning multihour at least. The thing is, indexes seem quite fast as they give you a row very fast, while the sort spends hours preparing, but after that the sorted rows go in really fast, much faster than with the index, and it ends up finishing first. As I say, it depends on the problem. If you have a 5 min runtime doing straight updates/inserts ( not unreasonable for 200k/50k ) to be done daily, it does not make sense to make it more complex just to get the runtime down to 15 secs. OTOH, if you have a weekly 20h process which you can cut to 6h to fit it into a nightly window, it does ( and I'm not exaggerating: in the early 90s, changing indexes to sorting and batching allowed me to cut a 4.5 DAY process to 8 h. It did not seem to do anything for the first 7 hours, but then it went really fast ).

Anyway, design it according to your knowledge of the problem.

Regards.
    Francisco Olarte.