Key/Value reference table generation: INSERT/UPDATE performance

From
valgog
Date:
I found several posts about INSERT/UPDATE performance in this group,
but none of them really answers my question...

I have a simple reference table WORD_COUNTS that contains the counts
of the words that appear in word arrays stored in another table.

CREATE TABLE WORD_COUNTS
(
  word text NOT NULL,
  count integer,
  CONSTRAINT PK_WORD_COUNTS PRIMARY KEY (word)
)
WITHOUT OIDS;

I have some PL/pgSQL code in a stored procedure like this:

  FOR r
   IN select id, array_of_words
        from word_storage
  LOOP
    begin
      -- insert the missing words
      insert into WORD_COUNTS
                  ( word, count )
                  ( select d_word, 0
                      from ( select distinct (r.array_of_words)[s.index] as d_word
                               from generate_series(1, array_upper( r.array_of_words, 1 ) ) as s(index)
                           ) as distinct_words
                     where d_word not in ( select word from WORD_COUNTS ) );
      -- update the counts
      update WORD_COUNTS
         set count = COALESCE( count, 0 ) + 1
       where word in ( select distinct (r.array_of_words)[s.index] as word
                         from generate_series(1, array_upper( r.array_of_words, 1 ) ) as s(index) );
    exception when others then
      error_count := error_count + 1;
    end;
    record_count := record_count + 1;
  END LOOP;

This code runs extremely slowly. It takes about 10 minutes to process
10,000 records, and the word storage has more than 2 million records to
be processed.

Does anybody have know-how about populating such reference tables?
What can be optimized in this situation?

Maybe the generate_series() trick used to unnest the array is where I
lose the performance?

Are set-based updates/inserts more efficient than single inserts/updates
run in smaller loops?

Thanks for your help,

Valentine Gogichashvili


Re: Key/Value reference table generation: INSERT/UPDATE performance

From
"Peter Childs"
Date:


On 22 May 2007 01:23:03 -0700, valgog <valgog@gmail.com> wrote:
I found several posts about INSERT/UPDATE performance in this group,
but none of them really answers my question...

I have a simple reference table WORD_COUNTS that contains the counts
of the words that appear in word arrays stored in another table.

CREATE TABLE WORD_COUNTS
(
  word text NOT NULL,
  count integer,
  CONSTRAINT PK_WORD_COUNTS PRIMARY KEY (word)
)
WITHOUT OIDS;


Is there any reason why count is not NOT NULL? (That would simplify your code by removing the COALESCE.)

Insert is more efficient than update, because in PostgreSQL an update is effectively a delete followed by an insert.

Oh, and GROUP BY is nearly always quicker than DISTINCT, and a DISTINCT can (always?) be rewritten as a GROUP BY. I'm not 100% sure why it's different, but it is.
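
For example, the two forms below should return the same words, but the
GROUP BY one is often planned as a HashAggregate (illustrative only,
reusing the names from your loop):

  -- DISTINCT form
  select distinct (r.array_of_words)[s.index] as word
    from generate_series(1, array_upper( r.array_of_words, 1 )) as s(index);

  -- equivalent GROUP BY form
  select (r.array_of_words)[s.index] as word
    from generate_series(1, array_upper( r.array_of_words, 1 )) as s(index)
   group by 1;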

Peter.

Re: Key/Value reference table generation: INSERT/UPDATE performance

From
Richard Huxton
Date:
valgog wrote:
> I found several posts about INSERT/UPDATE performance in this group,
> but none of them really answers my question...
>
> I have a simple reference table WORD_COUNTS that contains the counts
> of the words that appear in word arrays stored in another table.

I think this is the root of your problem, I'm afraid. You're trying to
count individual words when you're storing an array of words. I don't
think any of the GiST/GIN indexes will help you with this either.

However, since "you don't want to start from here" isn't very useful
here and now:

1. See what the performance (explain analyse) of the "select
distinct...generate_series()" statement is. I think you're right and
it's going to be slow.
2. You're looping through each row of word_storage and counting
separately. Write it as one query if possible (see the sketch below).
3. As Peter says, don't insert then update; start with an empty table
and just insert totals for the lot (see #2).
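
A single-query version of #2/#3 might look something like this (a sketch
only, assuming the table and column names from the original post; the
"s.i <= array_upper(...)" condition avoids referencing the array inside
generate_series(), and the inner DISTINCT keeps the original
one-count-per-row semantics):

  INSERT INTO word_counts (word, count)
  SELECT w.word, count(*)
    FROM ( SELECT DISTINCT ws.id, (ws.array_of_words)[s.i] AS word
             FROM word_storage ws,
                  generate_series(1, (SELECT max(array_upper(array_of_words, 1))
                                        FROM word_storage)) AS s(i)
            WHERE s.i <= array_upper(ws.array_of_words, 1) ) AS w
   GROUP BY w.word;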

I'd probably write the query in plperl/python or something else that
supports hash/dictionary structures, then just process the whole of
word_storage into the hash. Assuming you only have a few thousand
distinct words, that shouldn't take up too much memory.
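
A rough PL/Perl sketch of that idea (illustrative only: the function name
is made up, it assumes the plperl language is installed, and it counts
every occurrence rather than one per row):

  CREATE OR REPLACE FUNCTION count_words_perl() RETURNS void AS $$
      # accumulate counts in a Perl hash, streaming one row at a time
      my %counts;
      my $sth = spi_query(
          q{SELECT coalesce(array_to_string(array_of_words, E'\n'), '') AS words
              FROM word_storage});
      while (my $row = spi_fetchrow($sth)) {
          $counts{lc $_}++ for split /\n/, $row->{words};
      }
      # write the totals out with a prepared insert
      my $plan = spi_prepare(
          'INSERT INTO word_counts (word, count) VALUES ($1, $2)',
          'text', 'integer');
      spi_exec_prepared($plan, $_, $counts{$_}) for keys %counts;
  $$ LANGUAGE plperl;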

--
   Richard Huxton
   Archonet Ltd

Re: Key/Value reference table generation: INSERT/UPDATE performance

From
valgog
Date:
I have rewritten the code like this:

      existing_words_array := ARRAY( select word
                                       from WORD_COUNTS
                                      where word = ANY ( array_of_words ) );
      not_existing_words_array :=
              ARRAY( select distinct_word
                       from ( select distinct (array_of_words)[s.index] as distinct_word
                                from generate_series(1, array_upper( array_of_words, 1 ) ) as s(index)
                            ) as distinct_words
                      where distinct_word <> ALL ( existing_words_array ) );
      -- insert the missing words
      if not_existing_words_array is not null then
        insert into WORD_COUNTS
                    ( word, count )
                    ( select word, 1
                        from ( select not_existing_words_array[s.index] as word
                                 from generate_series( 1, array_upper( not_existing_words_array, 1 ) ) as s(index)
                             ) as distinct_words );
      end if;
      -- update the counts
      if existing_words_array is not null then
        update WORD_COUNTS
           set count = COALESCE( count, 0 ) + 1
         where word = ANY ( existing_words_array );
      end if;

Now it processes a million records in 14 seconds... so the problem was
probably that the NOT IN WORD_COUNTS lookup was way too expensive.


Re: Key/Value reference table generation: INSERT/UPDATE performance

From
PFC
Date:
On Tue, 22 May 2007 10:23:03 +0200, valgog <valgog@gmail.com> wrote:

> I found several posts about INSERT/UPDATE performance in this group,
> but none of them really answers my question...
>
> I have a simple reference table WORD_COUNTS that contains the counts
> of the words that appear in word arrays stored in another table.

    Mmm.

    If I were you, I would:

    - Create a procedure that flattens all the arrays and returns all the
words:

CREATE FUNCTION flatten_arrays() RETURNS SETOF text AS $$
DECLARE word_array text[];
BEGIN
  FOR word_array IN SELECT array_of_words FROM word_storage LOOP
    FOR i IN 1 .. COALESCE( array_upper( word_array, 1 ), 0 ) LOOP
      RETURN NEXT lower( word_array[i] );
    END LOOP;
  END LOOP;
END; $$ LANGUAGE plpgsql;

So, SELECT * FROM flatten_arrays() returns all the words in all the arrays.
To get the counts quickly I'd do this:

SELECT word, count(*) FROM flatten_arrays() AS word GROUP BY word

You can then populate your counts table very easily and quickly, since
it's just a seq scan and hash aggregate. One second for 10,000 rows would
be slow.
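
The populate step is then a single statement along these lines (a sketch,
reusing the names above):

INSERT INTO word_counts (word, count)
SELECT word, count(*) FROM flatten_arrays() AS word GROUP BY word;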

Re: Key/Value reference table generation: INSERT/UPDATE performance

From
valgog
Date:
On May 22, 12:14 pm, l...@peufeu.com (PFC) wrote:
> You can then populate your counts table very easily and quickly, since
> it's just a seq scan and hash aggregate.

Good idea indeed! I will try this approach.


Re: Key/Value reference table generation: INSERT/UPDATE performance

From
valgog
Date:
On May 22, 12:00 pm, valgog <val...@gmail.com> wrote:
> I have rewritten the code like this:
> [...]
> Now it processes a million records in 14 seconds... so the problem was
> probably that the NOT IN WORD_COUNTS lookup was way too expensive.

Sorry... this code did not update anything at all, as I forgot about
the NULL values... I had to COALESCE practically everything and use
array_upper()... I do not have the performance numbers for the inserts
and updates yet...
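
For illustration, the kind of guard that helps here might look like this
(a guess at the fix, since the corrected code was not posted; ARRAY(...)
yields an empty array rather than NULL, so testing array_upper() is the
reliable check):

      -- guard on the array length instead of on NULL
      if COALESCE( array_upper( existing_words_array, 1 ), 0 ) > 0 then
        update WORD_COUNTS
           set count = COALESCE( count, 0 ) + 1
         where word = ANY ( existing_words_array );
      end if;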


Re: Key/Value reference table generation: INSERT/UPDATE performance

From
cedric
Date:
On Tuesday 22 May 2007, Richard Huxton wrote:
> I'd probably write the query in plperl/python or something else that
> supports hash/dictionary structures, then just process the whole of
> word_storage into the hash. Assuming you only have a few thousand
> distinct words, that shouldn't take up too much memory.
+1
I did something very similar, and using PL/pgSQL it was very slow, while
using Perl it was very quick.

I also used partitioning because of the cost of updates (copy the last
partition into a new one while adding the new counts, so there are only
inserts, and drop the old table if you want); see the sketch below.
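
A rough sketch of that copy-instead-of-update idea (the table names here
are made up for illustration; new_words stands for whatever produces the
words of the current batch, one row per occurrence):

CREATE TABLE word_counts_new AS
  SELECT word, sum(count)::integer AS count
    FROM ( SELECT word, count FROM word_counts_old
           UNION ALL
           SELECT word, 1 FROM new_words ) AS merged
   GROUP BY word;
DROP TABLE word_counts_old;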

Re: Key/Value reference table generation: INSERT/UPDATE performance

From
"Guillaume Smet"
Date:
On 5/22/07, cedric <cedric.villemain@dalibo.com> wrote:
> I made something very similar, and using PL/pgsql is very slow, when using
> perl is very quick.

Another solution is to use tsearch2 for that:

CREATE TABLE word_counts AS
  SELECT * FROM stat('SELECT to_tsvector(''simple'',
                             lower(coalesce(field containing words, '''')))
                        FROM your table');

I don't know if the fact that you have an array of words is a must-have
or just a design choice. If you have to keep it, you can easily transform
the array into a string with array_to_string and use the same sort of
query.
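
For instance (a sketch only, combining stat() with the array column from
the original post):

CREATE TABLE word_counts AS
  SELECT word, ndoc, nentry
    FROM stat('SELECT to_tsvector(''simple'',
                      lower(coalesce(array_to_string(array_of_words, '' ''), '''')))
                 FROM word_storage');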

I don't know exactly what your speed requirements are, but it's quite
fast here. If you drop and recreate your table inside a transaction, it
should work like a charm (or you can use TRUNCATE and INSERT INTO).

--
Guillaume