Thread: Performance on inserts

From: Jules Bean

[Meta: Please advise if this is the wrong list.  I think, since this
observation relates to Pg internals, this might be the right one, but
feel free to move it back to -general if I'm wrong; I subscribe to
both in any case]

As some of you will have inferred if you've read my last couple of
posts, I'm working on a database with a structure like this (I'm
abstracting; unfortunately I'm not allowed to show you what my client
is really doing here):

A table called 'things' with two columns, 'name' and 'category'. The
pair ('name', 'category') is the primary key.
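
Abstracted into SQL, that's roughly (the real names differ):

    CREATE TABLE things (
        name     text NOT NULL,
        category text NOT NULL,
        PRIMARY KEY (name, category)
    );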

There are ~10 000 000 rows in the table, and category takes values
from a more-or-less fixed set of ~1000 possibilities.  As previously
described, the most popular category holds around half the rows, the
next most popular holds nearly half of those left, and most categories
occur very rarely indeed.  The median category size is probably around
1000 rows (less than the 10 000 000 / 1000 = 10 000 you'd expect from
a uniform split).

Anyhow, this question isn't about speeding up queries --- that's
already covered in the histogram thread.  This question is about
speeding up inserts.

My standard configuration is to have a unique index on (name,category)
and a non-unique index on (category).  The main table is ~2G on disk;
the index on (name,cat) is about the same size; the index on (cat) is
around 0.6G.
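
That is, roughly (the index name is my own shorthand):

    -- the unique index on (name, category) comes for free with the
    -- primary key; the extra non-unique index is:
    CREATE INDEX things_cat_idx ON things (category);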

In this set-up inserts have dropped to the terrifyingly slow rate of
several hours per 10 000 insertions.  This is not adequate to my
needs; I occasionally have to process 1 000 000 insertions or more!

I have several ideas for speeding this up at the SQL level (including
inserting into a temp table and then using INSERT ... SELECT to remove
the overhead of separate inserts --- sketched below) but that's not
what I want to talk about either...
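
For reference, the temp-table idea is something like this (names are
illustrative only):

    CREATE TEMP TABLE things_staging (name text, category text);
    -- many cheap single-row inserts (or a COPY) land in the unindexed
    -- staging table, then one statement moves them all at once:
    INSERT INTO things SELECT name, category FROM things_staging;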

What I did today, which made a staggering difference, was to drop the
non-unique index on (category).  Suddenly I can insert at approx
40 000 insertions per minute, which is fine for my needs!
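
In other words (using my shorthand index name from above):

    DROP INDEX things_cat_idx;
    -- ... run the bulk insert (now ~40 000 rows/minute) ...
    CREATE INDEX things_cat_idx ON things (category);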

So why is updating the huge (2G) unique index on (name,cat) not too
much of a problem, while updating the small (600M) non-unique index on
(cat) is such a bottleneck that dropping it increases speed by around
two orders of magnitude?

A possible reason has occurred to me:

The real slow-down is noticeable when I'm doing a bulk insert where
all the new rows belong to the most popular category.  I know that
some btree implementations don't behave particularly sanely with
several million rows under a single key... is the btree implementation
PostgreSQL uses too slow in this case?

I haven't collected all the performance statistics I'd like to have,
due to external time pressures, but I can say that under the new
faster configuration, the insertion process is CPU bound, with disk
access far below the levels the machine is capable of.  If I have a chance
I'll collect these stats for the old method too.

Any ideas as to what's going on here would be appreciated (if not,
perhaps this will point you towards an optimisation you ought to make
for 7.1).

Jules


Re: Performance on inserts

From: Bruce Momjian

> > > * Prevent index lookups (or index entries using partial index) on most
> > >   common values; instead use sequential scan 
> > 
> > This behavior already exists for the most common value, and would
> > exist for any additional values that we had stats for.  Don't see
> > why you think a separate TODO item is needed.
> 
> You mean the optimizer already skips an index lookup for the most common
> value, and instead does a sequential scan?  Seems you are way ahead of
> me.

If the answer is yes, how do you prevent a join that hits the most
common value?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Performance on inserts

From: Bruce Momjian

> Could you add to the TODO:
> 
>  support of binary data (e.g. varbinary type)
> 
> I think the above is not trivial, as I think the parser chokes on
> \00 bytes at several levels...

The bytea type works for inserting null bytes:  'a\\000b'.
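
For example (table and column names are mine):

    CREATE TABLE blobs (data bytea);
    -- \\000 is the escaped octal form of a null byte (the backslash
    -- is doubled inside the string literal)
    INSERT INTO blobs VALUES ('a\\000b');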

> 
> I checked the TODO and it seems that TOAST is no longer planned for
> 7.1. Is that true?

It is in 7.1.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


RE: Performance on inserts

From: Franck Martin

Could you add to the TODO:
support of binary data (e.g. varbinary type)

I think the above is not trivial, as I think the parser chokes on \00
bytes at several levels...

I checked the TODO and it seems that TOAST is no longer planned for
7.1. Is that true?

Franck Martin
Database Development Officer
SOPAC South Pacific Applied Geoscience Commission
Fiji
E-mail: franck@sopac.org
Web site: http://www.sopac.org/

This e-mail is intended for its recipients only. Do not forward this
e-mail without approval. The views expressed in this e-mail may not
necessarily be the views of SOPAC.



-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: Monday, October 16, 2000 9:45 AM
To: Jules Bean
Cc: Tom Lane; Alfred Perlstein; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Performance on inserts



Added to TODO:

* Prevent index lookups (or index entries using partial index) on most common values; instead use sequential scan 
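
As a sketch only, the partial-index variant of that item would look
something like this (hypothetical index name and excluded value; this
is what the TODO envisions, not current behavior):

    -- keep the most common value out of the index entirely, so the
    -- planner uses a sequential scan for it
    CREATE INDEX things_cat_idx ON things (category)
        WHERE category <> 'most-common-value';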

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026