Thread: Key/Value reference table generation: INSERT/UPDATE performance
I found several posts about INSERT/UPDATE performance in this group,
but actually it was not really what I am searching an answer for...

I have a simple reference table WORD_COUNTS that contains the count of
words that appear in a word array storage in another table.

CREATE TABLE WORD_COUNTS
(
  word text NOT NULL,
  count integer,
  CONSTRAINT PK_WORD_COUNTS PRIMARY KEY (word)
)
WITHOUT OIDS;

I have some PL/pgSQL code in a stored procedure like

FOR r IN select id, array_of_words
           from word_storage
LOOP
  begin
    -- insert the missing words
    insert into WORD_COUNTS
           ( word, count )
           ( select d_word, 0
               from ( select distinct (r.array_of_words)[s.index] as d_word
                        from generate_series(1, array_upper( r.array_of_words, 1 )) as s(index)
                    ) as distinct_words
              where d_word not in ( select word from WORD_COUNTS ) );
    -- update the counts
    update WORD_COUNTS
       set count = COALESCE( count, 0 ) + 1
     where word in ( select distinct (r.array_of_words)[s.index] as word
                       from generate_series(1, array_upper( r.array_of_words, 1 )) as s(index) );
  exception when others then
    error_count := error_count + 1;
  end;
  record_count := record_count + 1;
END LOOP;

This code runs extremely slowly. It takes about 10 minutes to process
10000 records, and the word storage has more than 2 million records to
be processed.

Does anybody have any know-how about populating such reference tables,
and what can be optimized in this situation? Maybe the generate_series()
call used to unnest the array is the place where I lose the performance?
Are set-based updates/inserts more efficient than single inserts/updates
run in smaller loops?

Thanks for your help,

Valentine Gogichashvili
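A quick way to check the generate_series() suspicion is to run the unnest step on its own and EXPLAIN ANALYZE it. A minimal sketch, with a hypothetical literal array standing in for one word_storage row; the series gets a fixed upper bound plus a WHERE filter so that generate_series() never has to reference the sibling FROM item, which servers of this era do not allow:

  -- Standalone version of the loop body's unnest step, runnable
  -- (and timeable) without the surrounding PL/pgSQL.
  EXPLAIN ANALYZE
  SELECT DISTINCT (t.arr)[s.i] AS d_word
  FROM ( SELECT ARRAY['the','quick','brown','the','fox'] AS arr ) AS t
  CROSS JOIN generate_series(1, 10) AS s(i)   -- fixed upper bound...
  WHERE s.i <= array_upper(t.arr, 1);         -- ...filtered to the real length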
On 22 May 2007 01:23:03 -0700, valgog <valgog@gmail.com> wrote:
Is there any reason why count is not NOT NULL? (That should simplify your code by removing the COALESCE.)
insert is more efficient than update, because under MVCC an update is effectively a delete followed by an insert.
Oh, and GROUP BY is nearly always quicker than DISTINCT, and can (always?) be rewritten as such. I'm not 100% sure why it's different, but it is.
Peter.
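For what it's worth, the rewrite Peter describes, applied to the inner query of the original loop, would look roughly like this (a sketch; r.array_of_words is the PL/pgSQL variable from the original post):

  -- DISTINCT form, as in the original loop body:
  SELECT DISTINCT (r.array_of_words)[s.index] AS d_word
  FROM generate_series(1, array_upper(r.array_of_words, 1)) AS s(index);

  -- Equivalent GROUP BY form; older planners could pick a hash
  -- aggregate here, while DISTINCT always implied a sort:
  SELECT (r.array_of_words)[s.index] AS d_word
  FROM generate_series(1, array_upper(r.array_of_words, 1)) AS s(index)
  GROUP BY 1;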
valgog wrote:
> I found several post about INSERT/UPDATE performance in this group,
> but actually it was not really what I am searching an answer for...
>
> I have a simple reference table WORD_COUNTS that contains the count of
> words that appear in a word array storage in another table.

I think this is the root of your problem, I'm afraid. You're trying to
count individual words when you're storing an array of words. I don't
think any of the GiST/GIN indexes will help you with this either.

However, since "you don't want to start from here" isn't very useful
here and now:

1. See what the performance (EXPLAIN ANALYSE) of the "select
distinct...generate_series()" statement is. I think you're right and
it's going to be slow.
2. You're looping through each row of word_storage and counting
separately. Write it as one query if possible.
3. As Peter says, don't insert then update; start with an empty table
and just insert totals for the lot (see #2).

I'd probably write the query in plperl/python or something else that
supports hash/dictionary structures. Then just process the whole of
word_storage into the hash - assuming you only have a few thousand
distinct words, that shouldn't take up too much memory.

--
  Richard Huxton
  Archonet Ltd
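Points 2 and 3 together might come out as a single statement along these lines. This is a sketch, not tested against the poster's schema: the scalar subquery gives the series a constant bound so nothing in the FROM list has to reference a sibling item, and on 8.4+ the whole construction could be replaced with unnest(array_of_words):

  -- One-pass population of an empty counts table: flatten every
  -- array, aggregate, insert the totals.  No per-row UPDATEs at all.
  INSERT INTO word_counts (word, count)
  SELECT (ws.array_of_words)[s.i] AS word,
         count(*)                 AS count
  FROM word_storage AS ws
  JOIN generate_series(1, (SELECT max(array_upper(array_of_words, 1))
                             FROM word_storage)) AS s(i)
    ON s.i <= array_upper(ws.array_of_words, 1)
  GROUP BY 1;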
I have rewritten the code like

  existing_words_array := ARRAY( select word
                                   from WORD_COUNTS
                                  where word = ANY ( array_of_words ) );
  not_existing_words_array :=
      ARRAY( select distinct_word
               from ( select distinct (array_of_words)[s.index] as distinct_word
                        from generate_series(1, array_upper( array_of_words, 1 )) as s(index)
                    ) as distinct_words
              where distinct_word <> ALL ( existing_words_array ) );
  -- insert the missing words
  if not_existing_words_array is not null then
    insert into WORD_COUNTS
           ( word, count )
           ( select word, 1
               from ( select not_existing_words_array[s.index] as word
                        from generate_series( 1, array_upper( not_existing_words_array, 1 )) as s(index)
                    ) as distinct_words
           );
  end if;
  -- update the counts
  if existing_words_array is not null then
    update WORD_COUNTS
       set count = COALESCE( count, 0 ) + 1
     where word = ANY ( existing_words_array );
  end if;

Now it processes a million records in 14 seconds... so it was probably
that the "word NOT IN ( select ... from WORD_COUNTS )" lookup was way too
expensive.
On Tue, 22 May 2007 10:23:03 +0200, valgog <valgog@gmail.com> wrote:
> I found several post about INSERT/UPDATE performance in this group,
> but actually it was not really what I am searching an answer for...
>
> I have a simple reference table WORD_COUNTS that contains the count of
> words that appear in a word array storage in another table.

Mmm. If I were you, I would:

- Create a procedure that flattens all the arrays and returns all the
words:

  PROCEDURE flatten_arrays RETURNS SETOF TEXT
  FOR word_array IN SELECT word_array FROM your_table LOOP
    FOR i IN 1...array_upper( word_array ) LOOP
      RETURN NEXT tolower( word_array[ i ] )

So, SELECT * FROM flatten_arrays() returns all the words in all the
arrays. To get the counts quickly I'd do this:

  SELECT word, count(*) FROM flatten_arrays() AS word GROUP BY word

You can then populate your counts table very easily and quickly, since
it's just a seq scan and hash aggregate. One second for 10,000 rows would
be slow.
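Filled out as real PL/pgSQL, PFC's pseudocode might look like the following sketch, written against the word_storage table from the original post, with tolower corrected to PostgreSQL's lower() and a guard added for NULL or empty arrays:

  CREATE OR REPLACE FUNCTION flatten_arrays() RETURNS SETOF text AS $$
  DECLARE
      arr text[];
      i   integer;
  BEGIN
      FOR arr IN SELECT array_of_words FROM word_storage LOOP
          -- array_upper() is NULL for both NULL and empty arrays,
          -- so coalesce to 0 and the inner loop simply does not run.
          FOR i IN 1 .. coalesce(array_upper(arr, 1), 0) LOOP
              RETURN NEXT lower(arr[i]);
          END LOOP;
      END LOOP;
      RETURN;
  END;
  $$ LANGUAGE plpgsql;

  -- Populate the counts table in one pass, as PFC suggests:
  INSERT INTO word_counts (word, count)
  SELECT word, count(*) FROM flatten_arrays() AS f(word) GROUP BY word;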
On May 22, 12:14 pm, l...@peufeu.com (PFC) wrote:
> On Tue, 22 May 2007 10:23:03 +0200, valgog <val...@gmail.com> wrote:
> > I found several post about INSERT/UPDATE performance in this group,
> > but actually it was not really what I am searching an answer for...
> >
> > I have a simple reference table WORD_COUNTS that contains the count of
> > words that appear in a word array storage in another table.
>
> Mmm. If I were you, I would:
>
> - Create a procedure that flattens all the arrays and returns all the
> words:
>
>   PROCEDURE flatten_arrays RETURNS SETOF TEXT
>   FOR word_array IN SELECT word_array FROM your_table LOOP
>     FOR i IN 1...array_upper( word_array ) LOOP
>       RETURN NEXT tolower( word_array[ i ] )
>
> So, SELECT * FROM flatten_arrays() returns all the words in all the
> arrays. To get the counts quickly I'd do this:
>
>   SELECT word, count(*) FROM flatten_arrays() AS word GROUP BY word
>
> You can then populate your counts table very easily and quickly, since
> it's just a seq scan and hash aggregate. One second for 10,000 rows
> would be slow.

good idea indeed! will try this approach.
On May 22, 12:00 pm, valgog <val...@gmail.com> wrote:
> I have rewritten the code like
>
>   existing_words_array := ARRAY( select word
>                                    from WORD_COUNTS
>                                   where word = ANY ( array_of_words ) );
>   not_existing_words_array :=
>       ARRAY( select distinct_word
>                from ( select distinct (array_of_words)[s.index] as distinct_word
>                         from generate_series(1, array_upper( array_of_words, 1 )) as s(index)
>                     ) as distinct_words
>               where distinct_word <> ALL ( existing_words_array ) );
>   -- insert the missing words
>   if not_existing_words_array is not null then
>     insert into WORD_COUNTS
>            ( word, count )
>            ( select word, 1
>                from ( select not_existing_words_array[s.index] as word
>                         from generate_series( 1, array_upper( not_existing_words_array, 1 )) as s(index)
>                     ) as distinct_words
>            );
>   end if;
>   -- update the counts
>   if existing_words_array is not null then
>     update WORD_COUNTS
>        set count = COALESCE( count, 0 ) + 1
>      where word = ANY ( existing_words_array );
>   end if;
>
> Now it processes a million records in 14 seconds... so it was probably
> that the "word NOT IN ( select ... from WORD_COUNTS )" lookup was way
> too expensive.

Sorry... this code did not update anything at all, as I forgot about the
NULL values... I had to COALESCE practically everything and use
array_upper()... I do not have the performance numbers for the inserts
and updates yet...
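The post does not spell out exactly which expression bit him, but the NULL and empty-array edge cases behind this kind of silent no-op are easy to demonstrate. A sketch of the relevant behaviour:

  -- ARRAY(subquery) over an empty result is an empty array, not NULL:
  SELECT ARRAY(SELECT 'x'::text WHERE false);                 -- {}

  -- ...but array_upper() on that empty array IS NULL:
  SELECT array_upper(ARRAY(SELECT 'x'::text WHERE false), 1); -- NULL

  -- and generate_series() with a NULL bound yields no rows at all,
  -- which silently skips the "loop" instead of raising an error:
  SELECT count(*) FROM generate_series(1, NULL::int);         -- 0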
On Tuesday 22 May 2007, Richard Huxton wrote:
> valgog wrote:
> > I found several post about INSERT/UPDATE performance in this group,
> > but actually it was not really what I am searching an answer for...
> >
> > I have a simple reference table WORD_COUNTS that contains the count of
> > words that appear in a word array storage in another table.
>
> I think this is the root of your problem, I'm afraid. You're trying to
> count individual words when you're storing an array of words. I don't
> think any of the GiST/GIN indexes will help you with this either.
>
> However, since "you don't want to start from here" isn't very useful
> here and now:
>
> 1. See what the performance (EXPLAIN ANALYSE) of the "select
> distinct...generate_series()" statement is. I think you're right and
> it's going to be slow.
> 2. You're looping through each row of word_storage and counting
> separately. Write it as one query if possible.
> 3. As Peter says, don't insert then update; start with an empty table
> and just insert totals for the lot (see #2).
>
> I'd probably write the query in plperl/python or something else that
> supports hash/dictionary structures. Then just process the whole of
> word_storage into the hash - assuming you only have a few thousand
> distinct words, that shouldn't take up too much memory.

+1

I made something very similar, and using PL/pgSQL is very slow, whereas
using Perl is very quick.

I have also used partitioning because of the cost of updates (copy the
last partition to a new one, adding the new counts, so there are only
inserts, and drop the old table if you want).
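The insert-only trick cedric describes is not spelled out in his post; one reading of it, as a rough sketch with hypothetical table names (new_words standing for the freshly flattened words of the latest batch), is:

  -- Merge old totals with the new batch using only INSERTs, then
  -- swap the tables; no UPDATE of existing rows ever happens.
  CREATE TABLE word_counts_new AS
  SELECT word, sum(count)::integer AS count
  FROM ( SELECT word, count FROM word_counts   -- previous totals
         UNION ALL
         SELECT word, 1 FROM new_words         -- one row per new occurrence
       ) AS merged
  GROUP BY word;

  DROP TABLE word_counts;
  ALTER TABLE word_counts_new RENAME TO word_counts;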
On 5/22/07, cedric <cedric.villemain@dalibo.com> wrote:
> I made something very similar, and using PL/pgSQL is very slow, whereas
> using Perl is very quick.

Another solution is to use tsearch2 for that:

  CREATE TABLE word_counts AS
    SELECT *
    FROM stat('SELECT to_tsvector(''simple'',
                      lower(coalesce(field containing words, '''')))
               FROM your table');

I don't know if the fact that you have an array of words is a must-have
or just a design choice. If you have to keep that, you can transform the
array easily into a string with array_to_string and use the same sort of
query.

I don't know what your speed requirements are exactly, but it's quite
fast here. If you drop your table and recreate it inside a transaction,
it should work like a charm (or you can use TRUNCATE and INSERT INTO).

--
Guillaume
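For the array layout from the original post, the array_to_string variant Guillaume mentions could look roughly like this. A sketch only: stat() and to_tsvector() here are the tsearch2 contrib versions of that era, and nentry is stat()'s total-occurrences column:

  -- Hypothetical adaptation to word_storage: glue each array into one
  -- space-separated string and let tsearch2's stat() count the lexemes.
  CREATE TABLE word_counts AS
  SELECT word, nentry AS count
  FROM stat('SELECT to_tsvector(''simple'',
                    lower(array_to_string(array_of_words, '' '')))
             FROM word_storage');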