Performance Concern

From
"John Pagakis"
Date:
Greetings.

I have a table that will require 100,000 rows initially.

Assume the following (some of the field names have been changed for
confidentiality reasons):

CREATE TABLE baz (
    baz_number CHAR(15) NOT NULL,
    customer_id CHAR(39),
    foobar_id INTEGER,
    is_cancelled BOOL DEFAULT false NOT NULL,
    create_user VARCHAR(60) NOT NULL,
    create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
    last_update_user VARCHAR(60) NOT NULL,
    last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
    CONSTRAINT PK_baz PRIMARY KEY (baz_number)
);

ALTER TABLE baz
    ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);

ALTER TABLE baz
    ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);


Using JDBC, it took approximately one hour to insert 100,000 records.  I
have an algorithm to generate a unique baz_number - it is a mixture of alpha
and numerics.

There is a purchase table; one purchase can have many associated baz
records, but the baz records will be pre-allocated - baz.customer_id allows
null.  The act of purchasing a baz will cause baz.customer_id to be
populated from the customer_id (key) field in the purchase table.

If it took an hour to insert 100,000 records, I can only imagine how much
time it will take if one customer were to attempt to purchase all 100,000
baz.  Certainly too long for a web page.

I've not had to deal with this kind of volume in Postgres before; I have my
suspicions on what is wrong here (could it be using a CHAR( 15 ) as a key?)
but I'd *LOVE* any thoughts.

Would I be better off making the key an identity field and not indexing on
baz_number?

Thanks in advance for any help.

__________________________________________________________________
John Pagakis
Email: ih8spam_thebfh@toolsmythe.com


"The best way to make your dreams come true is to wake up."
        -- Paul Valery

This signature generated by
     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
                                              www.spazmodicfrog.com


Re: Performance Concern

From
Rod Taylor
Date:
On Thu, 2003-10-23 at 08:21, John Pagakis wrote:
> Greetings.
>
> I have a table that will require 100,000 rows initially.
>
> Assume the following (some of the field names have been changed for
> confidentiality reasons):
>
> CREATE TABLE baz (
>     baz_number CHAR(15) NOT NULL,
>     customer_id CHAR(39),
>     foobar_id INTEGER,
>     is_cancelled BOOL DEFAULT false NOT NULL,
>     create_user VARCHAR(60) NOT NULL,
>     create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>     last_update_user VARCHAR(60) NOT NULL,
>     last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>     CONSTRAINT PK_baz PRIMARY KEY (baz_number)
> );
>
> ALTER TABLE baz
>     ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);
>
> ALTER TABLE baz
>     ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);
>
>
> Using JDBC, it took approximately one hour to insert 100,000 records.  I
> have an algorithm to generate a unique baz_number - it is a mixture of alpha
> and numerics.

Using an int for identification is certainly suggested; however, it
sounds like you may be short a few indexes on the foreign-keyed fields.

EXPLAIN ANALYZE output is always nice...


Re: Performance Concern

From
Sean Shanny
Date:
John,

Are you treating each insertion as a separate transaction?  If so, the
performance will suffer.  I am doing the same thing in building a data
warehouse using PG.  I have to load millions of records each night.  I
do two different things:

1) If I need to keep the insertions inside the java process I turn off
auto-commit and every n insertions (5000 seems to give me the best
performance for my setup) issue a commit.  Make sure you do a final
commit in a finally block so you don't miss anything.

2) Dump all the data to a file and then use a psql COPY <table>
(columns) FROM 'file path' call to load it.  Very fast.
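A minimal sketch of approach 1 (batched commits), using Python's stdlib
sqlite3 as a self-contained stand-in for Postgres/JDBC - the table, column
names, and batch size here are illustrative only:

```python
import sqlite3

def load_rows(conn, rows, batch_size=5000):
    """Insert rows, committing once per batch_size rows instead of per row."""
    cur = conn.cursor()
    try:
        for i, (baz_number, user) in enumerate(rows, start=1):
            cur.execute(
                "INSERT INTO baz (baz_number, create_user) VALUES (?, ?)",
                (baz_number, user))
            if i % batch_size == 0:
                conn.commit()          # flush a full batch
    finally:
        conn.commit()                  # final commit so no tail rows are lost

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baz (baz_number TEXT PRIMARY KEY, create_user TEXT)")
load_rows(conn, [("BAZ%012d" % n, "loader") for n in range(12345)])
```

In JDBC the same shape is setAutoCommit(false), executeUpdate in a loop, a
commit() every n rows, and one more commit() in the finally block.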

--sean

John Pagakis wrote:

>Greetings.
>
>I have a table that will require 100,000 rows initially.
>
>Assume the following (some of the field names have been changed for
>confidentiality reasons):
>
>CREATE TABLE baz (
>    baz_number CHAR(15) NOT NULL,
>    customer_id CHAR(39),
>    foobar_id INTEGER,
>    is_cancelled BOOL DEFAULT false NOT NULL,
>    create_user VARCHAR(60) NOT NULL,
>    create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>    last_update_user VARCHAR(60) NOT NULL,
>    last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>    CONSTRAINT PK_baz PRIMARY KEY (baz_number)
>);
>
>ALTER TABLE baz
>    ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);
>
>ALTER TABLE baz
>    ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);
>
>
>Using JDBC, it took approximately one hour to insert 100,000 records.  I
>have an algorithm to generate a unique baz_number - it is a mixture of alpha
>and numerics.
>
>There is a purchase table; one purchase can have many associated baz
>records, but the baz records will be pre-allocated - baz.customer_id allows
>null.  The act of purchasing a baz will cause baz.customer_id to be
>populated from the customer_id (key) field in the purchase table.
>
>If it took an hour to insert 100,000 records, I can only imagine how much
>time it will take if one customer were to attempt to purchase all 100,000
>baz.  Certainly too long for a web page.
>
>I've not had to deal with this kind of volume in Postgres before; I have my
>suspicions on what is wrong here (could it be using a CHAR( 15 ) as a key?)
>but I'd *LOVE* any thoughts.
>
>Would I be better off making the key an identity field and not indexing on
>baz_number?
>
>Thanks in advance for any help.
>
>__________________________________________________________________
>John Pagakis
>Email: ih8spam_thebfh@toolsmythe.com
>
>
>"The best way to make your dreams come true is to wake up."
>        -- Paul Valery
>
>This signature generated by
>     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
>                                              www.spazmodicfrog.com
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>
>
>


Re: Performance Concern

From
Christopher Browne
Date:
john@pagakis.com ("John Pagakis") writes:
> Greetings.
>
> I have a table that will require 100,000 rows initially.
>
> Assume the following (some of the field names have been changed for
> confidentiality reasons):
>
> CREATE TABLE baz (
>     baz_number CHAR(15) NOT NULL,
>     customer_id CHAR(39),
>     foobar_id INTEGER,
>     is_cancelled BOOL DEFAULT false NOT NULL,
>     create_user VARCHAR(60) NOT NULL,
>     create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>     last_update_user VARCHAR(60) NOT NULL,
>     last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>     CONSTRAINT PK_baz PRIMARY KEY (baz_number)
> );
>
> ALTER TABLE baz
>     ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);
>
> ALTER TABLE baz
>     ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);
>
> Using JDBC, it took approximately one hour to insert 100,000 records.  I
> have an algorithm to generate a unique baz_number - it is a mixture of alpha
> and numerics.

Question #1:  How did you do the inserts?

If AUTO-COMMIT was turned on, then that would indicate that you
invoked 100,000 transactions, and that would contribute considerably
to the process being slow.  Put them all in as one transaction and
you'd probably see it run in a fraction of the time.
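The transaction-count effect can be made concrete with a small sketch
(Python's stdlib sqlite3 on a throwaway file database standing in for
Postgres; the row counts are illustrative): each per-row commit pays a
synchronous flush to disk, while one enclosing transaction pays it once.

```python
import os
import sqlite3
import tempfile
import time

def timed_load(n, single_txn):
    """Insert n rows either as n tiny transactions or as one big one."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    conn = sqlite3.connect(path, isolation_level=None)  # true autocommit mode
    conn.execute("CREATE TABLE t (v INTEGER)")
    start = time.perf_counter()
    if single_txn:
        conn.execute("BEGIN")
    for i in range(n):
        conn.execute("INSERT INTO t (v) VALUES (?)", (i,))
    if single_txn:
        conn.execute("COMMIT")
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT count(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, count

print("per-row commits: %.3fs" % timed_load(2000, single_txn=False)[0])
print("one transaction: %.3fs" % timed_load(2000, single_txn=True)[0])
```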

Question #2.  Do you have indices on purchase(customer_id) and on
foobar(foobar_id)?

If not, then the foreign key check would be rather inefficient.

> There is a purchase table; one purchase can have many associated baz
> records, but the baz records will be pre-allocated - baz.customer_id
> allows null.  The act of purchasing a baz will cause baz.customer_id
> to be populated from the customer_id (key) field in the purchase
> table.
>
> If it took an hour to insert 100,000 records, I can only imagine how
> much time it will take if one customer were to attempt to purchase
> all 100,000 baz.  Certainly too long for a web page.

I take it that each "baz" is a uniquely identifiable product, akin to
(say) an RSA certificate or the like?

By the way, if you set up a stored procedure in PostgreSQL that can
generate the "baz_number" identifiers, you could probably do the
inserts Right Well Fast...

Consider the following.  I have a stored procedure, genauth(), which
generates quasi-random values.  (They're passwords, sort of...)

cctld=# explain analyze insert into baz (baz_number, create_user, last_update_user)
cctld-#   select substr(genauth(), 1, 15), 'cbbrowne', 'cbbrowne' from big_table;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on big_table  (cost=0.00..789.88 rows=28988 width=0) (actual time=0.20..1713.60 rows=28988 loops=1)
 Total runtime: 3197.40 msec
(2 rows)

It took about 3 seconds to insert 28988 rows into baz.  (big_table,
also renamed, to protect the innocent, has 28988 rows. I didn't care
about its contents, just that it had a bunch of rows.)

And the above is on a cheap desktop PC with IDE disk.
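genauth() itself isn't shown, but a client-side equivalent for unique
15-character mixed alphanumeric identifiers might look like the following
(an illustrative sketch; the alphabet and length are assumptions). The
resulting list could then be fed to a single set-based INSERT or a COPY:

```python
import random
import string

def gen_baz_numbers(n, length=15, seed=None):
    """Generate n unique quasi-random alphanumeric identifiers."""
    rng = random.Random(seed)
    alphabet = string.ascii_uppercase + string.digits
    keys = set()
    while len(keys) < n:               # retry on the (rare) collision
        keys.add("".join(rng.choice(alphabet) for _ in range(length)))
    return sorted(keys)

keys = gen_baz_numbers(1000, seed=42)
```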

> I've not had to deal with this kind of volume in Postgres before; I
> have my suspicions on what is wrong here (could it be using a CHAR(
> 15 ) as a key?)  but I'd *LOVE* any thoughts.

> Would I be better off making the key an identity field and not
> indexing on baz_number?

That might be something of an improvement, but it oughtn't be
cripplingly different to use a text field rather than an integer.

What's crippling is submitting 100,000 queries in 100,000
transactions.  Cut THAT down to size and you'll see performance return
to being reasonable.
--
"cbbrowne","@","libertyrms.info"
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)

Re: Performance Concern

From
Hannu Krosing
Date:
Christopher Browne wrote on Fri, 24.10.2003 at 22:10:

> That might be something of an improvement, but it oughtn't be
> cripplingly different to use a text field rather than an integer.

I suspect his slowness comes from not running ANALYZE when it would be
time to start using indexes for FK checks - if you run ANALYZE on an
empty table and then do 10000 inserts, all those inserts will run their
checks using a seqscan, as this is the fastest way to do it on an empty
table ;)

> What's crippling is submitting 100,000 queries in 100,000
> transactions.  Cut THAT down to size and you'll see performance return
> to being reasonable.

Even this should not be too crippling.

I once did some testing for insert performance and got about 9000
inserts/sec on a 4-CPU Xeon with 2 GB RAM and RAID-5 (likely with
battery-backed cache).

This 9000 dropped to ~250 when I added a primary key index (to a
60,000,000-record table, so that the pk index fit only partly in
memory), all this with separate transactions, but with many clients
running concurrently. (btw., the clients were not java/JDBC but
Python/psycopg)


With just one client you are usually stuck to 1 trx/disk revolution, at
least with no battery-backed write cache.

Even 250/sec should insert 10,000 rows in 40 seconds.

--------------
Hannu


Re: Performance Concern

From
"John Pagakis"
Date:
Thanks Rod.

While I was waiting for my post to make it I went ahead and made the key an
int.  It improved it a lot, but was still pretty slow.

This is weird:

I was testing in a query window thus:

UPDATE baz SET customer_id = '1234' WHERE baz_number IN ( SELECT baz_number
FROM baz WHERE customer_id IS NULL LIMIT 1000 );

In the version of the table I posted this took 3 1/2 minutes.  By making
baz_number not part of the key, adding a baz_key of int4, and adjusting the
above query accordingly, it dropped to 1 1/2 minutes.

But, I realized that was not how my app was going to be updating, so I wrote
a little simulation in JAVA that gets a list of baz_keys where the
customer_id is null and then iterates through the list one at a time
attempting to UPDATE baz SET customer_id = '1234' WHERE baz_key =
<bazKeyFromList> AND customer_id IS NULL.  One thousand iterations took only
37 seconds.

It would appear PostgreSQL is tuned towards single updates as opposed to
handing a big bunch off to the query engine.  Does that seem right?  Seems
odd to me.

Anyway thanks for your response.  I'll add some indexes and see if I can't
shave that time down even further.

__________________________________________________________________
John Pagakis
Email: john@pagakis.com


"If you can't beat them, arrange
 to have them beaten."
        -- George Carlin

This signature generated by
     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
                                              www.spazmodicfrog.com


-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Rod Taylor
Sent: Friday, October 24, 2003 11:23 AM
To: john@pagakis.com
Cc: Postgresql Performance
Subject: Re: [PERFORM] Performance Concern


On Thu, 2003-10-23 at 08:21, John Pagakis wrote:
> Greetings.
>
> I have a table that will require 100,000 rows initially.
>
> Assume the following (some of the field names have been changed for
> confidentiality reasons):
>
> CREATE TABLE baz (
>     baz_number CHAR(15) NOT NULL,
>     customer_id CHAR(39),
>     foobar_id INTEGER,
>     is_cancelled BOOL DEFAULT false NOT NULL,
>     create_user VARCHAR(60) NOT NULL,
>     create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>     last_update_user VARCHAR(60) NOT NULL,
>     last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>     CONSTRAINT PK_baz PRIMARY KEY (baz_number)
> );
>
> ALTER TABLE baz
>     ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);
>
> ALTER TABLE baz
>     ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);
>
>
> Using JDBC, it took approximately one hour to insert 100,000 records.  I
> have an algorithm to generate a unique baz_number - it is a mixture of
> alpha and numerics.

Using an int for identification is certainly suggested; however, it
sounds like you may be short a few indexes on the foreign-keyed fields.

EXPLAIN ANALYZE output is always nice...


Re: Performance Concern

From
"John Pagakis"
Date:
Sean -
I believe auto-commit was off (not at the box right now).  I'll play with
the commit interval; I know commits are expensive operations.

Thanks for item 2.  I was toying with the notion of pre-creating 100,000
bazes, off-loading them to a file, and then seeing if the COPY would be any
faster; you saved me the effort of experimenting.  Thanks for the benefit of
your experience.

__________________________________________________________________
John Pagakis
Email: ih8spam_thebfh@toolsmythe.com


"Oh, you hate your job?  Why didn't you say so?
 There's a support group for that. It's called
 EVERYBODY, and they meet at the bar."
        -- Drew Carey

This signature generated by
     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
                                              www.spazmodicfrog.com


-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Sean Shanny
Sent: Friday, October 24, 2003 11:31 AM
To: john@pagakis.com
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance Concern


John,

Are you treating each insertion as a separate transaction?  If so the
performance will suffer.  I am doing the same thing in building a data
warehouse using PG.  I have to load millions of records each night.  I
do two different things:

1) If I need to keep the insertions inside the java process I turn off
auto-commit and every n insertions (5000 seems to give me the best
performance for my setup) issue a commit.  Make sure you do a final
commit in a finally block so you don't miss anything.

2) Dump all the data to a file and then use a psql COPY <table>
(columns) FROM 'file path' call to load it.  Very fast.

--sean

John Pagakis wrote:

>Greetings.
>
>I have a table that will require 100,000 rows initially.
>
>Assume the following (some of the field names have been changed for
>confidentiality reasons):
>
>CREATE TABLE baz (
>    baz_number CHAR(15) NOT NULL,
>    customer_id CHAR(39),
>    foobar_id INTEGER,
>    is_cancelled BOOL DEFAULT false NOT NULL,
>    create_user VARCHAR(60) NOT NULL,
>    create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>    last_update_user VARCHAR(60) NOT NULL,
>    last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>    CONSTRAINT PK_baz PRIMARY KEY (baz_number)
>);
>
>ALTER TABLE baz
>    ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);
>
>ALTER TABLE baz
>    ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);
>
>
>Using JDBC, it took approximately one hour to insert 100,000 records.  I
>have an algorithm to generate a unique baz_number - it is a mixture of
>alpha and numerics.
>
>There is a purchase table; one purchase can have many associated baz
>records, but the baz records will be pre-allocated - baz.customer_id allows
>null.  The act of purchasing a baz will cause baz.customer_id to be
>populated from the customer_id (key) field in the purchase table.
>
>If it took an hour to insert 100,000 records, I can only imagine how much
>time it will take if one customer were to attempt to purchase all 100,000
>baz.  Certainly too long for a web page.
>
>I've not had to deal with this kind of volume in Postgres before; I have my
>suspicions on what is wrong here (could it be using a CHAR( 15 ) as a key?)
>but I'd *LOVE* any thoughts.
>
>Would I be better off making the key an identity field and not indexing on
>baz_number?
>
>Thanks in advance for any help.
>
>__________________________________________________________________
>John Pagakis
>Email: ih8spam_thebfh@toolsmythe.com
>
>
>"The best way to make your dreams come true is to wake up."
>        -- Paul Valery
>
>This signature generated by
>     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
>                                              www.spazmodicfrog.com
>
>
>
>
>


---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match


Re: Performance Concern

From
Sean Shanny
Date:
John,

One other thing I forgot to mention with solution 2.  If you are going
to be adding a fair number of records to the table on an ongoing basis
you will want to drop indexes first and re-create them after the load is
complete.  I have tried it both ways and dropping is faster overall.
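The drop-then-rebuild pattern Sean describes can be sketched like this
(again with Python's stdlib sqlite3 as a stand-in; the index name is made
up). Note that the primary key stays in place - only secondary indexes are
dropped before the load and recreated once at the end:

```python
import sqlite3

def bulk_load(conn, rows):
    """Drop secondary indexes, load, then rebuild them once at the end."""
    conn.execute("DROP INDEX IF EXISTS idx_baz_customer")
    conn.executemany(
        "INSERT INTO baz (baz_number, customer_id) VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_baz_customer ON baz (customer_id)")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baz (baz_number TEXT PRIMARY KEY, customer_id TEXT)")
conn.execute("CREATE INDEX idx_baz_customer ON baz (customer_id)")
bulk_load(conn, [("B%06d" % i, None) for i in range(10000)])
```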

--sean

John Pagakis wrote:

>Sean -
>I believe auto-commit was off (not at the box right now).  I'll play with
>the commit interval; I know commits are expensive operations.
>
>Thanks for item 2.  I was toying with the notion of pre-creating 100000
>bazes off-loading them and then seeing if the COPY would be any faster; you
>saved me the effort of experimenting.  Thanks for the benefit of your
>experience.
>
>__________________________________________________________________
>John Pagakis
>Email: ih8spam_thebfh@toolsmythe.com
>
>
>"Oh, you hate your job?  Why didn't you say so?
> There's a support group for that. It's called
> EVERYBODY, and they meet at the bar."
>        -- Drew Carey
>
>This signature generated by
>     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
>                                              www.spazmodicfrog.com
>
>
>-----Original Message-----
>From: Sean Shanny [mailto:shannyconsulting@earthlink.net]
>Sent: Friday, October 24, 2003 11:31 AM
>To: john@pagakis.com
>Cc: pgsql-performance@postgresql.org
>Subject: Re: [PERFORM] Performance Concern
>
>
>John,
>
>Are you treating each insertion as a separate transaction?  If so the
>performance will suffer.  I am doing the same thing in building a data
>warehouse using PG.  I have to load millions of records each night.  I
>do two different things:
>
>1) If I need to keep the insertions inside the java process I turn off
>auto-commit and every n insertions (5000 seems to give me the best
>performance for my setup) issue a commit.  Make sure you do a final
>commit in a finally block so you don't miss anything.
>
>2) Dump all the data to a file and then use a psql COPY <table>
>(columns) FROM 'file path' call to load it.  Very fast.
>
>--sean
>
>John Pagakis wrote:
>
>
>
>>Greetings.
>>
>>I have a table that will require 100,000 rows initially.
>>
>>Assume the following (some of the field names have been changed for
>>confidentiality reasons):
>>
>>CREATE TABLE baz (
>>   baz_number CHAR(15) NOT NULL,
>>   customer_id CHAR(39),
>>   foobar_id INTEGER,
>>   is_cancelled BOOL DEFAULT false NOT NULL,
>>   create_user VARCHAR(60) NOT NULL,
>>   create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>>   last_update_user VARCHAR(60) NOT NULL,
>>   last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>>   CONSTRAINT PK_baz PRIMARY KEY (baz_number)
>>);
>>
>>ALTER TABLE baz
>>   ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);
>>
>>ALTER TABLE baz
>>   ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);
>>
>>
>>Using JDBC, it took approximately one hour to insert 100,000 records.  I
>>have an algorithm to generate a unique baz_number - it is a mixture of
>>alpha and numerics.
>>
>>There is a purchase table; one purchase can have many associated baz
>>records, but the baz records will be pre-allocated - baz.customer_id allows
>>null.  The act of purchasing a baz will cause baz.customer_id to be
>>populated from the customer_id (key) field in the purchase table.
>>
>>If it took an hour to insert 100,000 records, I can only imagine how much
>>time it will take if one customer were to attempt to purchase all 100,000
>>baz.  Certainly too long for a web page.
>>
>>I've not had to deal with this kind of volume in Postgres before; I have my
>>suspicions on what is wrong here (could it be using a CHAR( 15 ) as a key?)
>>but I'd *LOVE* any thoughts.
>>
>>Would I be better off making the key an identity field and not indexing on
>>baz_number?
>>
>>Thanks in advance for any help.
>>
>>__________________________________________________________________
>>John Pagakis
>>Email: ih8spam_thebfh@toolsmythe.com
>>
>>
>>"The best way to make your dreams come true is to wake up."
>>       -- Paul Valery
>>
>>This signature generated by
>>    ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
>>                                             www.spazmodicfrog.com
>>
>>
>>
>>
>>
>>
>>
>
>
>
>


Re: Performance Concern

From
"John Pagakis"
Date:
Christopher -
Thanks.

Answer 1:
I believe auto commit was off (but I'm not at my dev box right now).  I'll
double-check that and the commit interval.

Answer 2:
Ah ha!!  No indexes on FKs.  I'll try that.

Yes, each baz is uniquely identifiable.  I had started a stored procedure to
generate the key but scrapped it when I saw no rand() function in plpgsql.
Did I miss something?

Turns out switching to ints made no improvement on the inserts but a rather
large one on the updates.  Also, I saw evidence in my testing that Postgres
seemed to like doing single updates as opposed to being handed a group of
updates; see my response to Rod Taylor's post here (and Rod, if you're
reading this: you were *GREAT* in "The Time Machine" <g>!!)

Answer 3:
Oh, there was no question three .... <g>!!


Thanks again Christopher!!

__________________________________________________________________
John Pagakis
Email: ih8spam_thebfh@toolsmythe.com


"I am *SINCERE* about life, but I'm not *SERIOUS* about it."
        -- Alan Watts

This signature generated by
     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
                                              www.spazmodicfrog.com


-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Christopher
Browne
Sent: Friday, October 24, 2003 12:11 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance Concern


john@pagakis.com ("John Pagakis") writes:
> Greetings.
>
> I have a table that will require 100,000 rows initially.
>
> Assume the following (some of the field names have been changed for
> confidentiality reasons):
>
> CREATE TABLE baz (
>     baz_number CHAR(15) NOT NULL,
>     customer_id CHAR(39),
>     foobar_id INTEGER,
>     is_cancelled BOOL DEFAULT false NOT NULL,
>     create_user VARCHAR(60) NOT NULL,
>     create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>     last_update_user VARCHAR(60) NOT NULL,
>     last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>     CONSTRAINT PK_baz PRIMARY KEY (baz_number)
> );
>
> ALTER TABLE baz
>     ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);
>
> ALTER TABLE baz
>     ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);
>
> Using JDBC, it took approximately one hour to insert 100,000 records.  I
> have an algorithm to generate a unique baz_number - it is a mixture of
> alpha and numerics.

Question #1:  How did you do the inserts?

If AUTO-COMMIT was turned on, then that would indicate that you
invoked 100,000 transactions, and that would contribute considerably
to the process being slow.  Put them all in as one transaction and
you'd probably see it run in a fraction of the time.

Question #2.  Do you have indices on purchase(customer_id) and on
foobar(foobar_id)?

If not, then the foreign key check would be rather inefficient.

> There is a purchase table; one purchase can have many associated baz
> records, but the baz records will be pre-allocated - baz.customer_id
> allows null.  The act of purchasing a baz will cause baz.customer_id
> to be populated from the customer_id (key) field in the purchase
> table.
>
> If it took an hour to insert 100,000 records, I can only imagine how
> much time it will take if one customer were to attempt to purchase
> all 100,000 baz.  Certainly too long for a web page.

I take it that each "baz" is a uniquely identifiable product, akin to
(say) an RSA certificate or the like?

By the way, if you set up a stored procedure in PostgreSQL that can
generate the "baz_number" identifiers, you could probably do the
inserts Right Well Fast...

Consider the following.  I have a stored procedure, genauth(), which
generates quasi-random values.  (They're passwords, sort of...)

cctld=# explain analyze insert into baz (baz_number, create_user, last_update_user)
cctld-#   select substr(genauth(), 1, 15), 'cbbrowne', 'cbbrowne' from big_table;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on big_table  (cost=0.00..789.88 rows=28988 width=0) (actual time=0.20..1713.60 rows=28988 loops=1)
 Total runtime: 3197.40 msec
(2 rows)

It took about 3 seconds to insert 28988 rows into baz.  (big_table,
also renamed, to protect the innocent, has 28988 rows. I didn't care
about its contents, just that it had a bunch of rows.)

And the above is on a cheap desktop PC with IDE disk.

> I've not had to deal with this kind of volume in Postgres before; I
> have my suspicions on what is wrong here (could it be using a CHAR(
> 15 ) as a key?)  but I'd *LOVE* any thoughts.

> Would I be better off making the key an identity field and not
> indexing on baz_number?

That might be something of an improvement, but it oughtn't be
cripplingly different to use a text field rather than an integer.

What's crippling is submitting 100,000 queries in 100,000
transactions.  Cut THAT down to size and you'll see performance return
to being reasonable.
--
"cbbrowne","@","libertyrms.info"
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster


Re: Performance Concern

From
Hannu Krosing
Date:
John Pagakis wrote on Sat, 25.10.2003 at 10:16:
> Christopher -
> Thanks.
>
> Answer 1:
> I believe auto commit was off (but I'm not at my dev box right now).  I'll
> double-check that and the commit interval.
>
> Answer 2:
> Ah ha!!  No indexes on FKs.  I'll try that.
>
> Yes, each baz is a uniquely identifiable.  I had started a SP to create gen
> the key but scrapped it when I saw no rand() function in pgpsql.  Did I miss
> something?

hannu=# select random();
      random
------------------
 0.59924242859671
(1 row)


\df lists all available functions in psql

to generate string keys you could use something like:

hannu=# select 'key' || to_hex(cast(random()*1000000000 as int));
  ?column?
-------------
 key1e22d8ea
(1 row)

-----------------
Hannu


Re: Performance Concern

From
"John Pagakis"
Date:
Bear with me all - working my way through this.

First of all, thanks for all the terrific advice.  I think I focused you on
the inserts, when my *REAL* concern is the updates.  Initially, I was
surmising that if the insert of 100,000 baz took an hour, an update to
customer_id of, say, 1000 baz would simply be outrageous.  I now have a
better feel for how bad it is.

I have already learned that making an integer the key of baz as opposed to
baz_number - a CHAR( 15 ) - cuts my update cost almost in half, so my
reiteration of the example uses this schema change.

Please let me start again and perhaps do a little better job of explaining:

Assume the following (some of the field names have been changed for
confidentiality reasons):

CREATE TABLE baz (
    baz_key int4 NOT NULL,
    baz_number CHAR(15) NOT NULL,
    customer_id CHAR(39),
    foobar_id INTEGER,
    is_cancelled BOOL DEFAULT false NOT NULL,
    create_user VARCHAR(60) NOT NULL,
    create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
    last_update_user VARCHAR(60) NOT NULL,
    last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
    CONSTRAINT PK_baz PRIMARY KEY (baz_key)
);

ALTER TABLE baz
    ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);

ALTER TABLE baz
    ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);


There is a purchase table; one purchase can have many associated baz
records, but the 100,000 baz records will be pre-allocated -
baz.customer_id allows null.  The act of purchasing a baz will cause
baz.customer_id to be populated from the customer_id (key) field in the
purchase table.  The column customer_id is actually the key to the purchase
table despite the name.

The foobar table is inconsequential as it will not be populated until the
baz table is sold out.  So for the inserts and updates, foobar will be
empty.  I could probably not even gen it until I needed it.


As I said earlier I'm less concerned about the inserts than I am about the
updates.  The reason is the 100,000 inserts will happen before the site is
live.  The updates will happen as part of the purchase process, so updates
need to be as fast as possible.

I needed to do this because I absolutely positively cannot over-allocate
baz.  I cannot allocate more than 100,000 period, and any number of users
can attempt to purchase one or more baz simultaneously.  I am attempting to
avoid a race condition and avoid using database locks as I feared this table
would turn into a bottleneck.

Note, as this question came up more than once from my previous post: Auto
Commit was off for the inserts.


This will be for a public website and multiple users will be "competing" for
baz resources.  My thought was for each user wishing to purchase one or more
bazes:

- Create a list of potentially available baz: SELECT baz_key FROM baz WHERE
customer_id IS NULL LIMIT 100;
  - If there are no more records in baz with customer_id of NULL, it's a
sell-out.
- Iterate through the list attempting to reserve a baz, until you have
reserved the number of baz requested or until the list is exhausted:
UPDATE baz SET customer_id = <someCustId> WHERE baz_key = <currentKeyInList>
AND customer_id IS NULL;
  - For a given update, if no record was updated, someone else set the
customer_id before you could - go to the next baz_key in the list and try
again.
  - If the list is exhausted go get the next block of 100 potential
available baz keys and go again.
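For concreteness, here is a minimal runnable sketch of that loop.  Python
with an in-memory sqlite3 database stands in for the JDBC/Postgres code
(the engine doesn't matter here, only the conditional-UPDATE-and-check-
rowcount idea); the `reserve` function and the customer ids are invented
for illustration:

```python
import sqlite3

# In-memory sqlite3 stands in for Postgres so the sketch runs anywhere;
# the conditional-UPDATE logic is what matters, not the engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baz (baz_key INTEGER PRIMARY KEY, customer_id TEXT)")
conn.executemany("INSERT INTO baz (baz_key) VALUES (?)",
                 [(i,) for i in range(1, 1001)])
conn.commit()

def reserve(conn, cust_id, wanted):
    """Reserve up to `wanted` baz rows for cust_id; returns the keys won."""
    reserved = []
    while len(reserved) < wanted:
        # Step 1: fetch a block of candidate keys.
        rows = conn.execute(
            "SELECT baz_key FROM baz WHERE customer_id IS NULL LIMIT 100"
        ).fetchall()
        if not rows:
            break  # sell-out
        for (key,) in rows:
            if len(reserved) == wanted:
                break
            # Step 2: conditional update -- it only succeeds if nobody else
            # grabbed this key first; rowcount tells us whether we won.
            cur = conn.execute(
                "UPDATE baz SET customer_id = ? "
                "WHERE baz_key = ? AND customer_id IS NULL",
                (cust_id, key))
            if cur.rowcount == 1:
                reserved.append(key)
        conn.commit()  # the real design commits each UPDATE (auto-commit on)
    return reserved

mine = reserve(conn, "cust-1", 5)
```

In the real system each UPDATE would be its own transaction (auto-commit
on), so a successful reservation is immediately visible to competing
clients.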


Anyway, given this scenario, I *HAVE* to have auto commit on for updates so
that everyone is aware of everyone else immediately.


I wrote a JAVA simulation of the above that did 1000 updates in 37 seconds.
That left me scratching my head because in psql when I did the
semi-equivalent:

UPDATE baz SET customer_id = '1234' WHERE baz_key IN( SELECT baz_key FROM
baz WHERE customer_id IS NULL LIMIT 1000 );

it took 1:27 (one minute 27 seconds) to execute.  This led me (erroneously)
to the conclusion that Postgres was somehow happier doing single updates
than "grouping" them.  I realized today that I missed something in my
simulation (pulling an all-nighter will do that to you): my JAVA simulation
had Auto Commit off and I was doing a commit at the end.  Obviously that
won't work given what I'm trying to do.  Any updates must *IMMEDIATELY* be
visible to all other processes, or I could get hit with a race condition.  I
re-ran with Auto Commit on and the timing fell more in line with what I saw
in psql - 1:13.

This seems slow to me.

Is there any way to optimize the update?  Or, perhaps my design is the issue
and I just need to do something else.  Perhaps a lock on the table and an
insert would be quicker.  I'm just worried about locking in a multi-user
environment.  On the other hand, it looks to me like this table will be a
bottleneck no matter what I do.

Your thoughts, as always, are much appreciated.


__________________________________________________________________
John Pagakis
Email: ih8spam_thebfh@toolsmythe.com


"If I had a formula for bypassing trouble, I would not pass it round.
 Trouble creates a capacity to handle it.  I don't embrace trouble; that's
 as bad as treating it as an enemy.  But I do say meet it as a friend, for
 you'll see a lot of it and had better be on speaking terms with it."
        -- Oliver Wendell Holmes

This signature generated by
     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
                                              www.spazmodicfrog.com


-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org]On Behalf Of John Pagakis
Sent: Thursday, October 23, 2003 5:21 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Performance Concern


Greetings.

I have a table that will require 100,000 rows initially.

Assume the following (some of the field names have been changed for
confidentiality reasons):

CREATE TABLE baz (
    baz_number CHAR(15) NOT NULL,
    customer_id CHAR(39),
    foobar_id INTEGER,
    is_cancelled BOOL DEFAULT false NOT NULL,
    create_user VARCHAR(60) NOT NULL,
    create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
    last_update_user VARCHAR(60) NOT NULL,
    last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
    CONSTRAINT PK_baz PRIMARY KEY (baz_number)
);

ALTER TABLE baz
    ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);

ALTER TABLE baz
    ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);


Using JDBC, it took approximately one hour to insert 100,000 records.  I
have an algorithm to generate a unique baz_number - it is a mixture of alpha
and numerics.

There is a purchase table; one purchase can have many associated baz
records, but the baz records will be pre-allocated - baz.customer_id allows
null.  The act of purchasing a baz will cause baz.customer_id to be
populated from the customer_id (key) field in the purchase table.

If it took an hour to insert 100,000 records, I can only imagine how much
time it will take if one customer were to attempt to purchase all 100,000
baz.  Certainly too long for a web page.

I've not had to deal with this kind of volume in Postgres before; I have my
suspicions on what is wrong here (could it be using a CHAR( 15 ) as a key?)
but I'd *LOVE* any thoughts.

Would I be better off making the key an identity field and not indexing on
baz_number?

Thanks in advance for any help.

__________________________________________________________________
John Pagakis
Email: ih8spam_thebfh@toolsmythe.com


"The best way to make your dreams come true is to wake up."
        -- Paul Valery

This signature generated by
     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
                                              www.spazmodicfrog.com


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org


Re: Performance Concern

From
Allen Landsidel
Date:
At 05:56 10/25/2003, John Pagakis wrote:

Snipping most of this, I only have one suggestion/comment to make.

[snip]

>CREATE TABLE baz (
>     baz_key int4 NOT NULL,
>     baz_number CHAR(15) NOT NULL,
>     customer_id CHAR(39),
>     foobar_id INTEGER,
>     is_cancelled BOOL DEFAULT false NOT NULL,
>     create_user VARCHAR(60) NOT NULL,
>     create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>     last_update_user VARCHAR(60) NOT NULL,
>     last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>     CONSTRAINT PK_baz PRIMARY KEY (baz_key)
>);
>
>ALTER TABLE baz
>     ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);
>
>ALTER TABLE baz
>     ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);

[snip]

>I needed to do this because I absolutely positively cannot over-allocate
>baz.  I cannot allocate more than 100,000 period, and any number of users
>can attempt to purchase one or more baz simultaneously.  I am attempting to
>avoid a race condition and avoid using database locks as I feared this table
>would turn into a bottleneck.

[snip]

I have a similar situation in the database here, using the following
example schema:

CREATE TABLE foo
(
   nID serial UNIQUE NOT NULL,
   bAvailable boolean NOT NULL DEFAULT true,
   nSomeField int4 NOT NULL,
   sSomeField text NOT NULL
);

CREATE TABLE bar
(
   nfoo_id int4 UNIQUE NOT NULL
);

Assume foo is the table with the 100k pre-populated records that you want
to assign to visitors on your site.  bar is a table whose only purpose is to
eliminate race conditions, working off the following business rules:

1. -- someone attempts to get a 'foo'
   SELECT nID from foo WHERE bAvailable;

2. -- we first try to assign this 'foo' to ourselves
    -- the ? is bound to the foo.nID we selected in step 1.
   INSERT INTO bar (nfoo_ID) VALUES (?)

3. -- Only if step 2 is successful, do we continue, otherwise someone beat
us to it.
   UPDATE foo SET ... WHERE nID=?

The key here is step 2.

Since there is a UNIQUE constraint defined on the bar.nfoo_id (could even
be an FK), only one INSERT will ever succeed.  All others will fail.  In
step 3, you can set the bAvailable flag to false, along with whatever other
values you need to set for your 'baz'.
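The three steps can be sketched end-to-end like this.  Python/sqlite3 is
used only so the example is self-contained (sqlite enforces UNIQUE
constraints the same way), and the `claim` helper is an invented name,
but the mechanics are exactly as described:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE foo (
    nID INTEGER PRIMARY KEY,
    bAvailable BOOLEAN NOT NULL DEFAULT 1)""")
conn.execute("CREATE TABLE bar (nfoo_id INTEGER UNIQUE NOT NULL)")
conn.executemany("INSERT INTO foo (nID) VALUES (?)",
                 [(i,) for i in range(1, 11)])
conn.commit()

def claim(conn):
    """Try to claim one available foo; return its id, or None on sell-out/race."""
    row = conn.execute(
        "SELECT nID FROM foo WHERE bAvailable LIMIT 1").fetchone()
    if row is None:
        return None          # step 1 found nothing: sold out
    nid = row[0]
    try:
        # Step 2: the UNIQUE constraint guarantees only one INSERT per
        # nID can ever succeed; every racing insert raises instead.
        conn.execute("INSERT INTO bar (nfoo_id) VALUES (?)", (nid,))
    except sqlite3.IntegrityError:
        return None          # someone beat us to this foo
    # Step 3: we own it now; flag it and set whatever else is needed.
    conn.execute("UPDATE foo SET bAvailable = 0 WHERE nID = ?", (nid,))
    conn.commit()
    return nid
```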

This will get much easier once 7.4 is production-ready, as the WHERE IN ..
or WHERE NOT IN.. subselects are (according to the HISTORY file) going to
be as efficient as joins, instead of the O(n) operation they apparently are
right now.

Until then however, I've found this simple trick works remarkably well.

-Allen


Re: Performance Concern

From
Greg Stark
Date:
"John Pagakis" <thebfh@toolsmythe.com> writes:

> UPDATE baz SET customer_id = '1234' WHERE baz_key IN( SELECT baz_key FROM
> baz WHERE customer_id IS NULL LIMIT 1000 );

Do an "explain analyze" on this query. I bet it's doing two sequential scans.
Unfortunately in 7.3 the WHERE IN type of clause is poorly handled. If you're
still in development perhaps you should move to the 7.4 beta as it should
handle this much better:

test74=> explain UPDATE test SET customer_id = 1 WHERE a IN (SELECT a FROM test WHERE customer_id IS NULL LIMIT 1000 );
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Nested Loop  (cost=1447.26..2069.43 rows=201 width=10)
   ->  HashAggregate  (cost=1447.26..1447.26 rows=200 width=4)
         ->  Subquery Scan "IN_subquery"  (cost=0.00..1446.01 rows=501 width=4)
               ->  Limit  (cost=0.00..1441.00 rows=501 width=4)
                     ->  Seq Scan on test  (cost=0.00..1441.00 rows=501 width=4)
                           Filter: (customer_id IS NULL)
   ->  Index Scan using ii on test  (cost=0.00..3.10 rows=1 width=10)
         Index Cond: (test.a = "outer".a)
(8 rows)


However, notice you still get one sequential scan. One way to help the
situation would be to create a partial index WHERE customer_id IS NULL. This
would especially help when things are almost completely sold out and available
slots are sparse.

slo=> explain UPDATE test SET customer_id = 1 WHERE a IN (SELECT a FROM test WHERE customer_id IS NULL LIMIT 1000 );
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Nested Loop  (cost=181.01..803.18 rows=201 width=10)
   ->  HashAggregate  (cost=181.01..181.01 rows=200 width=4)
         ->  Subquery Scan "IN_subquery"  (cost=0.00..179.76 rows=501 width=4)
               ->  Limit  (cost=0.00..174.75 rows=501 width=4)
                     ->  Index Scan using i on test  (cost=0.00..174.75 rows=501 width=4)
                           Filter: (customer_id IS NULL)
   ->  Index Scan using ii on test  (cost=0.00..3.10 rows=1 width=10)
         Index Cond: (test.a = "outer".a)
(8 rows)

Notice the both sequential scans are gone and replaced by index scans.


I kind of worry you might still have a race condition with the above query.
Two clients could do the subquery and pick up the same records, then both run
and update them. The database would lock the records until the first one
commits but I don't think that would stop the second one from updating them a
second time.

Perhaps moving to serializable transactions would help this, I'm not sure.

It's too bad the LIMIT clause doesn't work on UPDATEs.
Then you could simply do:

UPDATE baz SET customer_id = '1234' where customer_id IS NULL LIMIT 1000

Which shouldn't have to scan the table twice at all and I don't think suffer
from any race conditions.

--
greg

Re: Performance Concern

From
"John Pagakis"
Date:
Sean -
I believe auto-commit was off (not at the box right now).  I'll play with
the commit interval; I know commits are expensive operations.

Thanks for item 2.  I was toying with the notion of pre-creating 100,000
bazes, off-loading them, and then seeing if the COPY would be any faster; you
saved me the effort of experimenting.  Thanks for the benefit of your
experience.

__________________________________________________________________
John Pagakis
Email: ih8spam_thebfh@toolsmythe.com


"Oh, you hate your job?  Why didn't you say so?
 There's a support group for that. It's called
 EVERYBODY, and they meet at the bar."
        -- Drew Carey

This signature generated by
     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
                                              www.spazmodicfrog.com


-----Original Message-----
From: Sean Shanny [mailto:shannyconsulting@earthlink.net]
Sent: Friday, October 24, 2003 11:31 AM
To: john@pagakis.com
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Performance Concern


John,

Are you treating each insertion as a separate transaction?  If so the
performance will suffer.  I am doing the same thing in building a data
warehouse using PG.  I have to load millions of records each night.  I
do two different things:

1) If I need to keep the insertions inside the java process I turn off
auto-commit and every n insertions (5000 seems to give me the best
performance for my setup) issue a commit.  Make sure you do a final
commit in a finally block so you don't miss anything.

2) Dump all the data to a file and then use a psql COPY <table>
(columns) FROM 'file path' call to load it.  Very fast.
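The batched-commit pattern in item 1 can be sketched as follows.  This is
a Python/sqlite3 stand-in for the JDBC code (the batch size of 5000 is
just the value that worked in my setup; the table and key format are
illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None   # manage transactions by hand, like auto-commit-off JDBC
conn.execute("CREATE TABLE baz (baz_number TEXT PRIMARY KEY)")

BATCH = 5000                  # commit interval; tune it for your own setup
cur = conn.cursor()
cur.execute("BEGIN")
try:
    for i in range(12000):
        cur.execute("INSERT INTO baz (baz_number) VALUES (?)",
                    ("BAZ%011d" % i,))
        if (i + 1) % BATCH == 0:
            cur.execute("COMMIT")   # flush a full batch
            cur.execute("BEGIN")
finally:
    cur.execute("COMMIT")           # final commit so the tail batch isn't lost
```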

--sean

John Pagakis wrote:

>Greetings.
>
>I have a table that will require 100,000 rows initially.
>
>Assume the following (some of the field names have been changed for
>confidentiality reasons):
>
>CREATE TABLE baz (
>    baz_number CHAR(15) NOT NULL,
>    customer_id CHAR(39),
>    foobar_id INTEGER,
>    is_cancelled BOOL DEFAULT false NOT NULL,
>    create_user VARCHAR(60) NOT NULL,
>    create_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>    last_update_user VARCHAR(60) NOT NULL,
>    last_update_datetime TIMESTAMP DEFAULT 'now()' NOT NULL,
>    CONSTRAINT PK_baz PRIMARY KEY (baz_number)
>);
>
>ALTER TABLE baz
>    ADD FOREIGN KEY (customer_id) REFERENCES purchase (customer_id);
>
>ALTER TABLE baz
>    ADD FOREIGN KEY (foobar_id) REFERENCES foobar (foobar_id);
>
>
>Using JDBC, it took approximately one hour to insert 100,000 records.  I
>have an algorithm to generate a unique baz_number - it is a mixture of
alpha
>and numerics.
>
>There is a purchase table; one purchase can have many associated baz
>records, but the baz records will be pre-allocated - baz.customer_id allows
>null.  The act of purchasing a baz will cause baz.customer_id to be
>populated from the customer_id (key) field in the purchase table.
>
>If it took an hour to insert 100,000 records, I can only imagine how much
>time it will take if one customer were to attempt to purchase all 100,000
>baz.  Certainly too long for a web page.
>
>I've not had to deal with this kind of volume in Postgres before; I have my
>suspicions on what is wrong here (could it be using a CHAR( 15 ) as a key?)
>but I'd *LOVE* any thoughts.
>
>Would I be better off making the key an identity field and not indexing on
>baz_number?
>
>Thanks in advance for any help.
>
>__________________________________________________________________
>John Pagakis
>Email: ih8spam_thebfh@toolsmythe.com
>
>
>"The best way to make your dreams come true is to wake up."
>        -- Paul Valery
>
>This signature generated by
>     ... and I Quote!!(tm) Copyright (c) 1999 SpaZmodic Frog Software, Inc.
>                                              www.spazmodicfrog.com
>
>
>
>
>


Re: Performance Concern

From
Hannu Krosing
Date:
John Pagakis wrote on Sat, 25.10.2003 at 12:56:

> I wrote a JAVA simulation of the above that did 1000 updates in 37 seconds.
> That left me scratching my head because in psql when I did the
> semi-equivalent:
>
> UPDATE baz SET customer_id = '1234' WHERE baz_key IN( SELECT baz_key FROM
> baz WHERE customer_id IS NULL LIMIT 1000 );

try it this way, maybe it will start using an index :

UPDATE baz
   SET customer_id = '1234'
 WHERE baz_key IN (
    SELECT baz_key
      FROM baz innerbaz
     WHERE customer_id IS NULL
       and innerbaz.baz_key = baz.baz_key
     LIMIT 1000 );

you may also try to add a conditional index to baz:

CREATE INDEX baz_key_with_null_custid_nxd
    ON baz (baz_key)
 WHERE customer_id IS NULL;

to make the index access more efficient.

----------------
Hannu


Re: Performance Concern

From
Manfred Koizar
Date:
On Sun, 26 Oct 2003 00:13:36 +0300, Hannu Krosing <hannu@tm.ee> wrote:
>UPDATE baz
>   SET customer_id = '1234'
> WHERE baz_key IN (
>    SELECT baz_key
>      FROM baz innerbaz
>     WHERE customer_id IS NULL
>       and innerbaz.baz_key = baz.baz_key
>     LIMIT 1000 );

AFAICS this is not what the OP intended.  It is equivalent to

    UPDATE baz
       SET customer_id = '1234'
     WHERE customer_id IS NULL;

because the subselect is now correlated to the outer query and is
evaluated for each row of the outer query which makes the LIMIT clause
ineffective.

Servus
 Manfred