Thread: Newbie Question: FAQ for database optimization?

Newbie Question: FAQ for database optimization?

From
Alexander Scholz
Date:
Hi,

is there a newbie's FAQ / book / link for "howto optimize databases with
PostgreSQL"?

Background: Customer has the Windows* (sorry <g>) Postgres 8.1.0
standard installation "out of the box". A table has 2.5 mio records. No
indizes defined, primary key (sequence) does exist. In pgAdmin "select
count(*)" takes over 30 seconds, an "update" affecting 70'000 records
takes minutes... I am sure PostgreSQL could do better, we "just" need to
tune the database. (I hope so at least!)

What action and/or reading can you recommend? (We quickly need some
'wow' effects to keep the customer happy <sigh>).

Thanx,

Alexander.

*) sorry, I don't have server's hardware spec. available right now, but
the MSSQL2005 instance on it does the same things in a few seconds... ;-)

Re: Newbie Question: FAQ for database optimization?

From
"A. Kretschmer"
Date:
am  20.12.2005, um 22:21:54 +0100 mailte Alexander Scholz folgendes:
> Hi,
>
> is there a newbie's FAQ / book / link for "howto optimize databases with
> PostgreSQL"?

07:12 < rtfm_please> For information about tuning
07:12 < rtfm_please> see http://www.powerpostgresql.com
07:12 < rtfm_please> or http://www.powerpostgresql.com/PerfList
07:12 < rtfm_please> or http://www.varlena.com/varlena/GeneralBits/116.php



>
> Background: Customer has the Windows* (sorry <g>) Postgres 8.1.0 standard
> installation "out of the box". A table has 2.5 mio records. No indizes
> defined, primary key (sequence) does exist. In pgAdmin "select count(*)"

bad & ugly


> What action and/or reading can you recommend? (We quickly need some 'wow'
> effects to keep the customer happy <sigh>).

Create suitable indexes.

07:14 < akretschmer> ??index
07:14 < rtfm_please> For information about index
07:14 < rtfm_please> see http://www.postgresql.org/docs/current/static/indexes-expressional.html
07:14 < rtfm_please> or http://www.postgresql.org/docs/current/static/indexes-partial.html
07:14 < rtfm_please> or http://www.postgresql.org/docs/current/static/indexes.html


Andreas
--
Andreas Kretschmer    (Kontakt: siehe Header)
Heynitz:  035242/47212,      D1: 0160/7141639
GnuPG-ID 0x3FFF606C http://wwwkeys.de.pgp.net
 ===    Schollglas Unternehmensgruppe    ===

Indices for select count(*)?

From
Alexander Scholz
Date:
Hi, thank you for your answer.

Regarding the performance flow when trying to find out how many records
are currently being stored in the table, I don't see how an index should
help... Nevertheless we've created an unique index on "ID" but SELECT
count("ID") from "XYZ" still takes 35 seconds*. (ID is the primary key
basing on a sequence, select count(*) isn't faster.)

So - what kind of indexing would speed this up then?

Thanx in advance!

Alexander.

*) MSSQL 2005 on the same server takes 4 seconds for this query for the
analogue table, and there hasn't any special tuning been applied, too.

Re: Indices for select count(*)?

From
Peter Eisentraut
Date:
Am Mittwoch, 21. Dezember 2005 12:01 schrieb Alexander Scholz:
> So - what kind of indexing would speed this up then?

You can't speed up a full-table count using an index.

Re: Indices for select count(*)?

From
Greg Stark
Date:
Alexander Scholz <alexander.scholz1@freenet.de> writes:

> Hi, thank you for your answer.
>
> Regarding the performance flow when trying to find out how many records are
> currently being stored in the table, I don't see how an index should help...
> Nevertheless we've created an unique index on "ID" but SELECT count("ID") from
> "XYZ" still takes 35 seconds*. (ID is the primary key basing on a sequence,
> select count(*) isn't faster.)
>
> So - what kind of indexing would speed this up then?

No form of indexing can speed this up. To answer the server has to look at
every record and count up how many of them should be included in your result.

If you only need an approximate value there's one available in the stats
tables (I don't remember exactly how to get it) or you can keep a recent value
in a table and update it periodically and just query that.

> *) MSSQL 2005 on the same server takes 4 seconds for this query for the
> analogue table, and there hasn't any special tuning been applied, too.

MSSQL presumably has the entire table cached in RAM and postgres doesn't. Even
if MSSQL can scan just the index (which postgres can't do) I would only expect
a factor of 2-4x. Hm. Unless perhaps this table is extremely wide? How large
are these records?

--
greg

Re: Indices for select count(*)?

From
Nicolas Barbier
Date:
On 12/21/05, Alexander Scholz <alexander.scholz1@freenet.de> wrote:

> Regarding the performance flow when trying to find out how many records
> are currently being stored in the table, I don't see how an index should
> help... Nevertheless we've created an unique index on "ID" but SELECT
> count("ID") from "XYZ" still takes 35 seconds*. (ID is the primary key
> basing on a sequence, select count(*) isn't faster.)

I would like to redirect you to the zillions of mailing list posts
about this subject :-).

> So - what kind of indexing would speed this up then?

None.

greetings,
Nicolas

--
Nicolas Barbier
http://www.gnu.org/philosophy/no-word-attachments.html

Re: Indices for select count(*)?

From
Marcus Engene
Date:
Greg Stark wrote:
> Alexander Scholz <alexander.scholz1@freenet.de> writes:
>
>>Hi, thank you for your answer.
>>
>>Regarding the performance flow when trying to find out how many records are
>>currently being stored in the table, I don't see how an index should help...
>>Nevertheless we've created an unique index on "ID" but SELECT count("ID") from
>>"XYZ" still takes 35 seconds*. (ID is the primary key basing on a sequence,
>>select count(*) isn't faster.)
>>
>>So - what kind of indexing would speed this up then?
>
>
> No form of indexing can speed this up. To answer the server has to look at
> every record and count up how many of them should be included in your result.

Why couldn't it be possible to count # of items in an index?
The density of the information (items/inode|block|whatever it's called
in btrees) is likely to be much higher giving less disk i/o.

I'm sorry if this has been discussed recently.

Best regards,
Marcus

Re: Indices for select count(*)?

From
Jaime Casanova
Date:
On 12/21/05, Marcus Engene <mengpg@engene.se> wrote:
> Greg Stark wrote:
> > Alexander Scholz <alexander.scholz1@freenet.de> writes:
> >
> >>Hi, thank you for your answer.
> >>
> >>Regarding the performance flow when trying to find out how many records are
> >>currently being stored in the table, I don't see how an index should help...
> >>Nevertheless we've created an unique index on "ID" but SELECT count("ID") from
> >>"XYZ" still takes 35 seconds*. (ID is the primary key basing on a sequence,
> >>select count(*) isn't faster.)
> >>
> >>So - what kind of indexing would speed this up then?
> >
> >
> > No form of indexing can speed this up. To answer the server has to look at
> > every record and count up how many of them should be included in your result.
>
> Why couldn't it be possible to count # of items in an index?
> The density of the information (items/inode|block|whatever it's called
> in btrees) is likely to be much higher giving less disk i/o.
>

because in the MVCC model an index contains tuples (records) that are
dead to you (doesn't exist, becuase were deleted, updated) but that
are live to other transactions... so you still have to visit the table
to see if that tuple is live to to you and have to count it or not...

> I'm sorry if this has been discussed recently.
>
> Best regards,
> Marcus
>



--
Atentamente,
Jaime Casanova
(DBA: DataBase Aniquilator ;)

Re: Indices for select count(*)?

From
Chris Browne
Date:
mengpg@engene.se (Marcus Engene) writes:
> Greg Stark wrote:
>> Alexander Scholz <alexander.scholz1@freenet.de> writes:
>>
>>>Hi, thank you for your answer.
>>>
>>>Regarding the performance flow when trying to find out how many records are
>>>currently being stored in the table, I don't see how an index should help...
>>>Nevertheless we've created an unique index on "ID" but SELECT count("ID") from
>>>"XYZ" still takes 35 seconds*. (ID is the primary key basing on a sequence,
>>>select count(*) isn't faster.)
>>>
>>>So - what kind of indexing would speed this up then?
>> No form of indexing can speed this up. To answer the server has to
>> look at
>> every record and count up how many of them should be included in your result.
>
> Why couldn't it be possible to count # of items in an index?
> The density of the information (items/inode|block|whatever it's called
> in btrees) is likely to be much higher giving less disk i/o.
>
> I'm sorry if this has been discussed recently.

The index does not contain tuple visibility information, and so is
*useless* for the purpose.  It does not contain the useful information
you evidently imagine it does.

This question is asked steadily, frequently.
--
output = ("cbbrowne" "@" "ntlug.org")
http://www3.sympatico.ca/cbbrowne/
Rules of the Evil Overlord #32. "I will not fly into a rage and kill a
messenger who brings me bad news  just to illustrate how evil I really
am. Good messengers are hard to come by."
<http://www.eviloverlord.com/>

Re: Indices for select count(*)?

From
"Jim C. Nasby"
Date:
On Wed, Dec 21, 2005 at 04:54:08PM -0500, Greg Stark wrote:
> MSSQL presumably has the entire table cached in RAM and postgres doesn't. Even
> if MSSQL can scan just the index (which postgres can't do) I would only expect
> a factor of 2-4x. Hm. Unless perhaps this table is extremely wide? How large
> are these records?

Back when I was using other databases more often, it wasn't uncommon to
see a 10x speed improvement on count(*) from using an index. This is an
area where PostgreSQL is seriously behind other databases. Of course
having vastly superior concurrency goes a long way towards offsetting
that in the real world, but it would be a Good Thing if we could get
some form of tuple visibility into indexes, as has been discussed in the
past.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Indices for select count(*)?

From
Martijn van Oosterhout
Date:
On Thu, Dec 22, 2005 at 08:52:08AM -0600, Jim C. Nasby wrote:
> Back when I was using other databases more often, it wasn't uncommon to
> see a 10x speed improvement on count(*) from using an index. This is an
> area where PostgreSQL is seriously behind other databases. Of course
> having vastly superior concurrency goes a long way towards offsetting
> that in the real world, but it would be a Good Thing if we could get
> some form of tuple visibility into indexes, as has been discussed in the
> past.

Actually, ISTM the trend is going the other way. MySQL has instant
select count(*), as long as you're only using ISAM. Recent versions of
MSSQL use an MVCC type system and it also scans the whole table. Oracle
is the only one I've found that has any optimisation on this front.

The thing is, it *is* possible to change PostgreSQL to do counts via
the index. The problem is, the cost is high enough that we're
reasonably sure most people don't want to pay it. I've neverneeded an
exact row count of a large table (estimates are good enough) so I'm not
sure I'd be willing to pay a price to have it.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: Indices for select count(*)?

From
"Jim C. Nasby"
Date:
On Thu, Dec 22, 2005 at 04:10:50PM +0100, Martijn van Oosterhout wrote:
> Actually, ISTM the trend is going the other way. MySQL has instant
> select count(*), as long as you're only using ISAM. Recent versions of

No comment.

> MSSQL use an MVCC type system and it also scans the whole table. Oracle
> is the only one I've found that has any optimisation on this front.

I think this is more an indication of the power of MVCC over traditional
locking rather than the importance of indexes covering (reading just an
index to satisfy a query). Index covering can be a huge benefit, and I'd
be surprised if MS didn't come out with some way to do it in a future
version. I'm actually a bit surprised they don't do it in SQL2005.

> The thing is, it *is* possible to change PostgreSQL to do counts via
> the index. The problem is, the cost is high enough that we're
> reasonably sure most people don't want to pay it. I've neverneeded an
> exact row count of a large table (estimates are good enough) so I'm not
> sure I'd be willing to pay a price to have it.

I didn't think the method of adding the imperfect known_visible bit to
the indexes had that much overhead, but it's been a while since those
discussions took place. I do recall some issue being raised that will be
very difficult to solve (though again I don't remember the details now).

I agree that SELECT count(*) FROM table; is a pretty bogus use case.
SELECT count(*) FROM table WHERE field = blah; isn't though, and people
often depend on that being extremely fast. When you can do index
covering, that case usually is very fast, and PostgreSQL can be much
slower. Of course, there are ways around that, but it's more work (and
something that I'd bet most developers wouldn't think of).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Indices for select count(*)?

From
Greg Stark
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:

> I didn't think the method of adding the imperfect known_visible bit to
> the indexes had that much overhead, but it's been a while since those
> discussions took place. I do recall some issue being raised that will be
> very difficult to solve (though again I don't remember the details now).

I doubt very much any visibility information will ever make it into the
indexes. The cost to update it in all the indexes terrible, and when would
that update even happen?

The proposal that had the most going for it was to maintain a bit in the FSM
or something like it that was your "known visible" bit. That would speed up
index scans and vacuums too. It would largely solve the problem with vacuuming
large tables that have mostly untouched pages.

The reason Oracle gets away with this is because they use optimistic MVCC
where the new record replaces the old one entirely. They keep the old records
in a separate space entirely. You pay the costs elsewhere instead. In Oracle
every update requires updating the rollback segment too, and if you have a
very busy table each record can cause you a second (or even third or fourth)
read in the rollback segment. And you pay these costs on *all* scans.

--
greg

Re: Indices for select count(*)?

From
Scott Marlowe
Date:
On Thu, 2005-12-22 at 09:33, Jim C. Nasby wrote:
> On Thu, Dec 22, 2005 at 04:10:50PM +0100, Martijn van Oosterhout wrote:
> > Actually, ISTM the trend is going the other way. MySQL has instant
> > select count(*), as long as you're only using ISAM. Recent versions of
>
> No comment.
>
> > MSSQL use an MVCC type system and it also scans the whole table. Oracle
> > is the only one I've found that has any optimisation on this front.
>
> I think this is more an indication of the power of MVCC over traditional
> locking rather than the importance of indexes covering (reading just an
> index to satisfy a query). Index covering can be a huge benefit, and I'd
> be surprised if MS didn't come out with some way to do it in a future
> version. I'm actually a bit surprised they don't do it in SQL2005.

I wouldn't mind a "with visibility" switch for indexes that you could
throw when creating them for this purpose.  But burdening all indexes
with this overhead when most wouldn't need it is not, IMHO, a good idea.

I seem to remember Tom saying that there was a race condition issue
though with updating the table AND the index at the same time, that they
could be out of sync for a fraction of a second or something like that.
So, if we had this kind of thing, the indexes and / or tables would have
to be locked for updates.

Again, for a reporting database, no big deal.  For a transactional
database, very big deal.

Re: Indices for select count(*)?

From
Jaime Casanova
Date:
>
> I wouldn't mind a "with visibility" switch for indexes that you could
> throw when creating them for this purpose.  But burdening all indexes
> with this overhead when most wouldn't need it is not, IMHO, a good idea.
>

that would add complexity to the index code for... just one case?

what about a set of functions instead...

one function to create all necesary triggers  to maintain a different
table with a count for the table, and one function that retrieves that
info

select start_counter_on_table('table_name');
select get_counter_on_table('table_name');

of course, this could be usefull just for the case of "select * from
table"... but that case is the whole problem...

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)

Re: Newbie Question: FAQ for database optimization?

From
David Fetter
Date:
On Tue, Dec 20, 2005 at 10:21:54PM +0100, Alexander Scholz wrote:
> Hi,
>
> is there a newbie's FAQ / book / link for "howto optimize databases with
> PostgreSQL"?
>
> Background: Customer has the Windows* (sorry <g>) Postgres 8.1.0
> standard installation "out of the box". A table has 2.5 mio records.
> No indizes defined, primary key (sequence) does exist. In pgAdmin
> "select count(*)" takes over 30 seconds,

That sounds about right.  If you want to cache this result, there are
ways to do that, and there are approximations to the result if you're
interested in such things.

> an "update" affecting 70'000 records takes minutes...

An index on the (set of) column(s) the WHERE clause refers to would
very likely help.  For example, if your update looks like:

UPDATE foo
SET bar = 555
WHERE baz = 'blurf';

You could get some mileage out of indexing the baz column.  See the
docs on CREATE INDEX for the syntax.

> I am sure PostgreSQL could do better, we "just" need to tune the
> database. (I hope so at least!)

>
> What action and/or reading can you recommend? (We quickly need some
> 'wow' effects to keep the customer happy <sigh>).

There are archives of the pgsql-performance mailing list at
<http://archves.postresql.org/> for a lot of this.  For things you
don't find there, you can either post here or go to
<irc://irc.freenode.net/postgresql>, where there are friendly, helpful
people, and occasionally Yours Truly.

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 415 235 3778

Remember to vote!

Re: Indices for select count(*)?

From
Peter Eisentraut
Date:
One way to conceptually tackle this count(*) issue would be to create a new
index type for it.  The index type would (logically) just need to implement
insert and delete operations and keep a running count with a big lock around
it.  Users could then choose to trade off concurrent performance against the
speed of count() by creating or dropping that index.  Implementing that type
of index might not even be that hard but convincing the planer and executor
to use it without too many hardcoded cases seems more challenging.

Re: Indices for select count(*)?

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> One way to conceptually tackle this count(*) issue would be to create a new
> index type for it.  The index type would (logically) just need to implement
> insert and delete operations and keep a running count with a big lock around
> it.  Users could then choose to trade off concurrent performance against the
> speed of count() by creating or dropping that index.  Implementing that type
> of index might not even be that hard but convincing the planer and executor
> to use it without too many hardcoded cases seems more challenging.

It's not that easy --- in the MVCC world there simply isn't a unique
count that is the right answer for every observer.  But the idea of
packaging a count(*) mechanism as an index type seems like it might be
a good one.  I don't think the planner objection need be taken too
seriously: we already have a good big wart in there for recognizing
MIN/MAX indexability, and this sort of transformation would fit pretty
naturally with what's already done in planagg.c.

            regards, tom lane

Re: Indices for select count(*)?

From
Martijn van Oosterhout
Date:
On Fri, Dec 23, 2005 at 11:04:50AM -0500, Tom Lane wrote:
> It's not that easy --- in the MVCC world there simply isn't a unique
> count that is the right answer for every observer.  But the idea of
> packaging a count(*) mechanism as an index type seems like it might be
> a good one.  I don't think the planner objection need be taken too
> seriously: we already have a good big wart in there for recognizing
> MIN/MAX indexability, and this sort of transformation would fit pretty
> naturally with what's already done in planagg.c.

AFAICS two big problems with using an index type:

1. The index isn't told when the tuple is deleted.
2. The server expects to be able to lookup an index.

Other than that...

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: Indices for select count(*)?

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> AFAICS two big problems with using an index type:

> 1. The index isn't told when the tuple is deleted.

Hm, good point ... we could make it do so but for ordinary deletes it'd
be a waste of cycles to open indexes at all.

> 2. The server expects to be able to lookup an index.

Only if there is a WHERE operator that matches the index's opclass.
This hypothetical index type would probably have one dummy opclass
containing no operators.

            regards, tom lane

Re: Indices for select count(*)?

From
Bruce Momjian
Date:
Martijn van Oosterhout wrote:
-- Start of PGP signed section.
> On Fri, Dec 23, 2005 at 11:04:50AM -0500, Tom Lane wrote:
> > It's not that easy --- in the MVCC world there simply isn't a unique
> > count that is the right answer for every observer.  But the idea of
> > packaging a count(*) mechanism as an index type seems like it might be
> > a good one.  I don't think the planner objection need be taken too
> > seriously: we already have a good big wart in there for recognizing
> > MIN/MAX indexability, and this sort of transformation would fit pretty
> > naturally with what's already done in planagg.c.
>
> AFAICS two big problems with using an index type:
>
> 1. The index isn't told when the tuple is deleted.
> 2. The server expects to be able to lookup an index.
>
> Other than that...

I think our TODO has a good summary of the issues:

---------------------------------------------------------------------------

* Speed up COUNT(*)

  We could use a fixed row count and a +/- count to follow MVCC
  visibility rules, or a single cached value could be used and
  invalidated if anyone modifies the table.  Another idea is to
  get a count directly from a unique index, but for this to be
  faster than a sequential scan it must avoid access to the heap
  to obtain tuple visibility information.

* Add estimated_count(*) to return an estimate of COUNT(*)

  This would use the planner ANALYZE statistics to return an estimated
  count.

* Allow data to be pulled directly from indexes

  Currently indexes do not have enough tuple visibility information
  to allow data to be pulled from the index without also accessing
  the heap.  One way to allow this is to set a bit on index tuples
  to indicate if a tuple is currently visible to all transactions
  when the first valid heap lookup happens.  This bit would have to
  be cleared when a heap tuple is expired.

  Another idea is to maintain a bitmap of heap pages where all rows
  are visible to all backends, and allow index lookups to reference
  that bitmap to avoid heap lookups, perhaps the same bitmap we might
  add someday to determine which heap pages need vacuuming.  Frequently
  accessed bitmaps would have to be stored in shared memory.  One 8k
  page of bitmaps could track 512MB of heap pages.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Indices for select count(*)?

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> On Fri, Dec 23, 2005 at 11:04:50AM -0500, Tom Lane wrote:
>>> It's not that easy --- in the MVCC world there simply isn't a unique
>>> count that is the right answer for every observer.  But the idea of
>>> packaging a count(*) mechanism as an index type seems like it might be
>>> a good one.

> I think our TODO has a good summary of the issues:

The point here was the idea that we might implement something like the
delta-counts approach, but package it to look like a specialized index
type --- as opposed to making the user create triggers and so on,
which'd surely be a lot more error-prone to set up.  Also, if it were
an index type then it would be relatively straighforward to get the
planner to recognize the availability of a substitute way of doing
COUNT(*).  We could do all this in other ways but it'd require more
new infrastructure.

The DELETE problem might kill the idea though.

            regards, tom lane