Thread: Optimizer not using index on 120M row table

Optimizer not using index on 120M row table

From
"Jim C. Nasby"
Date:
As background: this is the main table for http://stats.distributed.net

      Table "public.email_contrib"
   Column   |     Type      | Modifiers
------------+---------------+-----------
 id         | integer       | not null
 team_id    | integer       | not null
 date       | date          | not null
 project_id | smallint      | not null
 work_units | numeric(20,0) | not null
Indexes: email_contrib_pkey primary key btree (project_id, id, date)

id is the id of a participant, team_id is the team they were on for that
day, date is the day the work was done, project_id is the project, and
work_units is the amount of work done.

 explain select * from email_contrib where project_id=8 and id=39622 and
date='3/1/03';
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on email_contrib  (cost=0.00..2942185.40 rows=1 width=25)
   Filter: ((project_id = 8) AND (id = 39622) AND (date =
'2003-03-01'::date))
(2 rows)

Is there any reason why this shouldn't be using the index? The
selectivity on project_id is very low (only 5 values for all 120M rows).

select attname, avg_width, n_distinct, correlation from pg_stats where
tablename='email_contrib';
  attname   | avg_width | n_distinct | correlation
------------+-----------+------------+-------------
 id         |         4 |      95184 |    0.496598
 team_id    |         4 |       1361 |    0.219478
 date       |         4 |       1769 |    0.329469
 project_id |         2 |          5 |           1
 work_units |        11 |       2100 |   0.0900541

It seems that not only should the query I EXPLAINed above be using the
index, but something like select sum(work_units) ... where project_id=8
and id=39622 should also use it.

I've run vacuum analyze to no effect.

On a related note, will pgsql do 'index covering', reading only the
index if it contains all the information a query needs? IE: in Sybase,
this query will only hit the index on Email_Contrib:

select date from email_contrib where project_id=8 and id=39622;

because email_contrib_pkey contains all required values.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: Optimizer not using index on 120M row table

From
Neil Conway
Date:
On Mon, 2003-04-07 at 23:29, Jim C. Nasby wrote:
>       Table "public.email_contrib"
>    Column   |     Type      | Modifiers
> ------------+---------------+-----------
>  id         | integer       | not null
>  team_id    | integer       | not null
>  date       | date          | not null
>  project_id | smallint      | not null
>  work_units | numeric(20,0) | not null
> Indexes: email_contrib_pkey primary key btree (project_id, id, date)

>  explain select * from email_contrib where project_id=8 and id=39622 and
> date='3/1/03';

Since project_id is an int2, you need to add single quotes to the
integer literal or cast it to int2 explicitly:

select * from email_contrib where project_id='8' ...

or

select * from email_contrib where project_id=8::int2 ...

BTW, this is a (well) known bug -- search the archives for more
information.
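
For instance, either of these forms of your query should let the planner
consider email_contrib_pkey (just a sketch of the same query with the
literal fixed up):

    explain select * from email_contrib
     where project_id = '8' and id = 39622 and date = '2003-03-01';

    explain select * from email_contrib
     where project_id = 8::int2 and id = 39622 and date = '2003-03-01';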

> On a related note, will pgsql do 'index covering', reading only the
> index if it contains all the information a query needs?

No -- in PostgreSQL, tuple visibility information is only stored in the
heap, not in the index. So the heap tuple always needs to be examined,
to determine if the specified tuple has been updated/deleted by some
transaction.

Cheers,

Neil


Re: Optimizer not using index on 120M row table

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
>  project_id | smallint      | not null

>  explain select * from email_contrib where project_id=8 and id=39622 and
> date='3/1/03';

Cast the constants to smallint, eg

    project_id = 8::smallint and ...

or quote them if that feels cleaner to you:

    project_id = '8' and ...

or just declare project_id to be integer (hint: the space savings are
completely illusory in this example, anyway, because of alignment
requirements).

See the archives for much prior discussion ;-)

            regards, tom lane


Re: Optimizer not using index on 120M row table

From
"Jim C. Nasby"
Date:
On Mon, Apr 07, 2003 at 11:51:17PM -0400, Tom Lane wrote:
> or just declare project_id to be integer (hint: the space savings are
> completely illusory in this example, anyway, because of alignment
> requirements).

Hrm... where can I find more info about that?


Re: Optimizer not using index on 120M row table

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> On Mon, Apr 07, 2003 at 11:51:17PM -0400, Tom Lane wrote:
>> or just declare project_id to be integer (hint: the space savings are
>> completely illusory in this example, anyway, because of alignment
>> requirements).

> Hrm... where can I find more info about that?

The authoritative reference is pg_type.typalign ... see
http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/catalog-pg-type.html
In this example, your next field requires integer alignment, so the
two bytes "saved" by using smallint disappear into the padding.

            regards, tom lane


Re: Optimizer not using index on 120M row table

From
"Jim C. Nasby"
Date:
On Mon, Apr 07, 2003 at 11:49:27PM -0400, Neil Conway wrote:
> > On a related note, will pgsql do 'index covering', reading only the
> > index if it contains all the information a query needs?
>
> No -- in PostgreSQL, tuple visibility information is only stored in the
> heap, not in the index. So the heap tuple always needs to be examined,
> to determine if the specified tuple has been updated/deleted by some
> transaction.

'visibility information'? Is this related to the versioning that pgsql
uses instead of read locks? I know in other RDBMS's it's very useful to
have index covering because it eliminates needing to load the base pages
into memory at all, but I don't know pgsql's layout well enough to know
how much this helps. Chapter 7 of the developers guide has info on page
files, but I haven't seen info on how index tuples are stored.


Re: Optimizer not using index on 120M row table

From
Neil Conway
Date:
On Tue, 2003-04-08 at 00:18, Jim C. Nasby wrote:
> 'visibility information'? Is this related to the versioning that pgsql
> uses instead of read locks?

Yes -- it's the information used by PostgreSQL to implement MVCC.

> I know in other RDBMS's it's very useful to
> have index covering because it eliminates needing to load the base pages
> into memory at all

Storing visibility information in index entries (in addition to heap
tuples) doesn't strike me as a good idea: for one thing, an
UPDATE/DELETE would require touching both heap tuples and any index
entries that point to them. It would also bloat the size of indexes.

Cheers,

Neil


Re: Optimizer not using index on 120M row table

From
"Jim C. Nasby"
Date:
On Tue, Apr 08, 2003 at 12:02:55AM -0400, Tom Lane wrote:
> The authoritative reference is pg_type.typalign ... see
> http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/catalog-pg-type.html
> In this example, your next field requires integer alignment, so the
> two bytes "saved" by using smallint disappear into the padding.

Thanks, makes sense to me now. Next question... will pgsql intelligently
order things in an item record so that they're stored most efficiently?
i.e., group all the byte-aligned stuff together, all the 2-byte aligned
stuff, etc.? Or will it just go off of the order defined in the create
table statement? If it's the latter, is there any way to re-organize
things without rebuilding the table from scratch? What other things
should be considered for column ordering?

Thanks!


Re: Optimizer not using index on 120M row table

From
"Jim C. Nasby"
Date:
On Tue, Apr 08, 2003 at 12:29:00AM -0400, Neil Conway wrote:
> > I know in other RDBMS's it's very useful to
> > have index covering because it eliminates needing to load the base pages
> > into memory at all
>
> Storing visibility information in index entries (in addition to heap
> tuples) doesn't strike me as a good idea: for one thing, an
> UPDATE/DELETE would require touching both heap tuples and any index
> entries that point to them. It would also bloat the size of indexes.

True, though these could require touching the indexes anyway due to the
data changes (or at least I assume so, given how other RDBMS's work). It
might be a useful option to allow; in certain situations this could
provide fairly substantial gains, depending on how fat the index is
compared to the raw data rows.

It might also be worth allowing for a dirty read option if you don't
care about transactions that are in progress.


Re: Optimizer not using index on 120M row table

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> ... Next question... will pgsql intelligently
> order things in an item record so that they're stored most efficiently?

No, it stores 'em in the order they're declared in.  There has been some
speculation in the past about trying to be smarter, but I think that
line of thought is pretty much dead in the face of ALTER TABLE ADD/DROP
COLUMN.

> What other things should be considered for column ordering?

There is some execution-speed advantage to putting fixed-width
never-null columns first, then fixed-width nullable columns (in order of
increasing probability of being null, if you want to get really tense),
finally the variable-width columns (nullability no longer matters).
Note that CHAR(n) is considered variable-width, likewise BIT(n) and
NUMERIC(n), even though one might think they could be considered fixed
width.
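
A minimal sketch of that ordering, using a purely hypothetical table (the
names mean nothing, it's just to make the grouping concrete):

    create table example_orders (
        order_id   integer  not null,  -- fixed-width, never null: first
        placed_on  date     not null,
        priority   smallint,           -- fixed-width but nullable: next
        shipped_on date,               -- more often null, so after priority
        total      numeric(12,2),      -- variable-width types (numeric, char(n), ...) last
        notes      text
    );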

            regards, tom lane


Re: Optimizer not using index on 120M row table

From
Neil Conway
Date:
On Tue, 2003-04-08 at 00:35, Jim C. Nasby wrote:
> On Tue, Apr 08, 2003 at 12:29:00AM -0400, Neil Conway wrote:
> > Storing visibility information in index entries (in addition to heap
> > tuples) doesn't strike me as a good idea: for one thing, an
> > UPDATE/DELETE would require touching both heap tuples and any index
> > entries that point to them. It would also bloat the size of indexes.
>
> True, though these could require touching the indexes anyway due to the
> data changes (or at least I assume so, given how other RDBMS's work).

Not in the case of DELETE (since you don't need to add a new index
entry, and the logic to determine whether a tuple is deleted is based
upon the visibility information stored in the heap). In the case of
UPDATE, chances are that the index entry you need to add for the new
version of the tuple isn't on the same page as the old one -- so the
fact that the two pages happen to belong to the same index doesn't
really help.

> It might be a useful option to allow

Indeed, I can see how some might find it useful. Perhaps you'd like to
implement it? :-)

Cheers,

Neil


Re: Optimizer not using index on 120M row table

From
"Jim C. Nasby"
Date:
On Tue, Apr 08, 2003 at 02:04:30AM -0400, Neil Conway wrote:
> On Tue, 2003-04-08 at 00:35, Jim C. Nasby wrote:
> > On Tue, Apr 08, 2003 at 12:29:00AM -0400, Neil Conway wrote:
> > > Storing visibility information in index entries (in addition to heap
> > > tuples) doesn't strike me as a good idea: for one thing, an
> > > UPDATE/DELETE would require touching both heap tuples and any index
> > > entries that point to them. It would also bloat the size of indexes.
> >
> > True, though these could require touching the indexes anyway due to the
> > data changes (or at least I assume so, given how other RDBMS's work).
>
> Not in the case of DELETE (since you don't need to add a new index
> entry, and the logic to determine whether a tuple is deleted is based
> upon the visibility information stored in the heap). In the case of
> UPDATE, chances are that the index entry you need to add for the new
> version of the tuple isn't on the same page as the old one -- so the
> fact that the two pages happen to belong to the same index doesn't
> really help.
>
> > It might be a useful option to allow
>
> Indeed, I can see how some might find it useful. Perhaps you'd like to
> implement it? :-)

Sure, if it can be done strictly in SQL. :)


Re: Optimizer not using index on 120M row table

From
"scott.marlowe"
Date:
There are a few settings that you pretty much have to change on bigger
hardware to get good performance.

shared_buffers (1000 to 10000 is a good place to start.) measured in 8k
blocks.
effective_cache_size (size of OS file system and disk caches measured in
8k blocks)

The CPU cost settings:

These two measure the cost of each tuple operation (index or table,
respectively) as a fraction of a sequential scan.

cpu_index_tuple_cost
cpu_tuple_cost

This one measures the cost of each operator in a where clause, again as a
fraction of a sequential scan.

cpu_operator_cost

Lastly, the big one:

random_page_cost tells the planner how much a random page fetch costs
compared to a sequential page fetch.  A setting of one would mean the
planner treats random and sequential page reads as equally expensive.

Here's the settings off of my box, which runs a dual 10krpm UWSCSI disk
set on a dual PIII-750 with 1.5 gig ram:

shared_buffers = 32768 # default 64
effective_cache_size = 100000  # default 1000
random_page_cost = 1 # default 4
cpu_tuple_cost = 0.01 # default 0.01
cpu_index_tuple_cost = 0.0001 # default 0.001
cpu_operator_cost = 0.0025 # default 0.0025

32768*8192=256M
100000*8192=780M
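
As a side note (illustrative only; the values are just the ones from
above), the current settings can be checked from psql with SHOW, and the
planner cost parameters can be tried out per session with SET before
touching postgresql.conf; shared_buffers itself only changes with a
postmaster restart:

    show shared_buffers;
    show effective_cache_size;
    show random_page_cost;
    set effective_cache_size = 100000;  -- per-session experiment
    set random_page_cost = 1.5;         -- per-session experiment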

Note that we originally set random_page_cost to 1 long before I'd realized
we had effective_cache_size set way too low (i.e. at the default), so the
planner was picking seq scans because it was sure the data weren't in
memory.  With effective_cache_size set correctly, we could probably go back
to a random_page_cost of 1.5 to 2.

It seems that when postgresql picks an index scan where it should have
picked a seq scan, the cost is that you're up to twice as slow as a seq
scan, but picking a seq scan when it should have picked an index can be
10 times slower, so we err on the side of caution and favor index scans
over seq scans.


Re: Optimizer not using index on 120M row table

From
"Jim C. Nasby"
Date:
On Tue, Apr 29, 2003 at 09:46:20AM -0600, scott.marlowe wrote:
> There are a few settings that you pretty much have to change on bigger
> hardware to get good performance.
>
> shared_buffers (1000 to 10000 is a good place to start.) measured in 8k
> blocks.
> effective_cache_size (size of OS file system and disk caches measured in
> 8k blocks)

Should effective_cache_size include the size of shared_buffers? Also,
FreeBSD doesn't seem to want to use more than about 300M for disk
caching, so I currently have shared_buffers set to 90000 or about 700M
(the box has 2G, but pgsql currently has to share with Sybase). Are
there any issues with setting shared_buffers so high? Anyone know how to
make FBSD do more caching on its own, or should I just let pgsql handle
it?


Re: Optimizer not using index on 120M row table

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> Should effective_cache_size include the size of shared_buffers?

Yes ... although IMHO, if shared_buffers is large enough to materially
affect that number, it's too large ;-)

> FreeBSD doesn't seem to want to use more than about 300M for disk
> caching, so I currently have shared_buffers set to 90000 or about 700M
> (the box has 2G, but pgsql currently has to share with Sybase). Are
> there any issues with setting shared_buffers so high?

Plenty, see many past threads in pgsql-performance and other lists.
There are strong reasons to think that you should let the kernel do the
bulk of the caching work.

            regards, tom lane


Re: Optimizer not using index on 120M row table

From
"Jim C. Nasby"
Date:
On Wed, Apr 30, 2003 at 12:14:34AM -0400, Tom Lane wrote:
> Plenty, see many past threads in pgsql-performance and other lists.
> There are strong reasons to think that you should let the kernel do the
> bulk of the caching work.

OK, anyone have any ideas on how to get FreeBSD to cache more than 300M?
:)

BTW, all the searches I did on this topic and FreeBSD turned up advice
saying I should set shared_buffers way high, so it seems FBSD users are
having pgsql do the caching instead of the OS.


Re: Optimizer not using index on 120M row table

From
Shridhar Daithankar
Date:
On Wednesday 30 April 2003 11:26, Jim C. Nasby wrote:
> On Wed, Apr 30, 2003 at 12:14:34AM -0400, Tom Lane wrote:
> > Plenty, see many past threads in pgsql-performance and other lists.
> > There are strong reasons to think that you should let the kernel do the
> > bulk of the caching work.
>
> OK, anyone have any ideas on how to get FreeBSD to cache more than 300M?

Some sysctl tweaks? Haven't booted into BSD for months, so I'm not exactly
helpful here, but still...

> BTW, all the searches I did on this topic and FreeBSD resulted in
> results saying I should set shared_buffers way high, so it seems FBSD
> users are having pgsql do the caching instead of the OS.

Well, that is the postgresql philosophy on any platform. Further, regarding
setting shared_buffers: ideally you could watch the ipcs output for shared
memory usage and set shared_buffers to between 120% and 150% of average
shared memory usage.

For example, one of the databases I am playing with now has 80M rows and is
6GB in size, but shared memory usage for random index-scan queries does not
seem to grow beyond 10MB. Of course, not many queries are fired at it, but
this should give you an idea.

HTH

 Shridhar


Re: Optimizer not using index on 120M row table

From
"Jim C. Nasby"
Date:
On Wed, Apr 30, 2003 at 12:56:11AM -0500, Jim C. Nasby wrote:
> On Wed, Apr 30, 2003 at 12:14:34AM -0400, Tom Lane wrote:
> > Plenty, see many past threads in pgsql-performance and other lists.
> > There are strong reasons to think that you should let the kernel do the
> > bulk of the caching work.
>
> OK, anyone have any ideas on how to get FreeBSD to cache more than 300M?
> :)

Ok, looks like I was just mis-reading top/not understanding the FBSD vm
subsystem well enough.

I'll try and dig up stuff in the archives about shared_buffer settings,
but it seems like it would be really useful to have a whitepaper
describing why the OS should be left to do caching, what pgsql actually
tries to use its shared buffers for (beyond simple read caching), and
how best to size shared_buffers.


Re: Optimizer not using index on 120M row table

From
"scott.marlowe"
Date:
On Tue, 29 Apr 2003, Jim C. Nasby wrote:

> On Tue, Apr 29, 2003 at 09:46:20AM -0600, scott.marlowe wrote:
> > There are a few settings that you pretty much have to change on bigger
> > hardware to get good performance.
> >
> > shared_buffers (1000 to 10000 is a good place to start.) measured in 8k
> > blocks.
> > effective_cache_size (size of OS file system and disk caches measured in
> > 8k blocks)
>
> Should effective_cache_size include the size of shared_buffers? Also,
> FreeBSD doesn't seem to want to use more than about 300M for disk
> caching, so I currently have shared_buffers set to 90000 or about 700M
> (the box has 2G, but pgsql currently has to share with Sybase). Are
> there any issues with setting shared_buffers so high? Anyone know how to
> make FBSD do more caching on its own, or should I just let pgsql handle
> it?

No, it doesn't.  From reading the docs, it would appear that it's an
estimate of how much memory the kernel is using to cache disk access for
postgresql alone.  I.e. if you're running an LDAP server that uses 32
Megs, an apache server using another 32 Megs or so, and postgresql set for
shared buffers of 32 Megs, then on a 256 Meg machine that shows 128 Megs
used for cache, it's likely that only a portion of that is postgresql's
cache: around 96 Megs if all three apps do the same amount of disk access.

There may be some tuning parameters for BSD that will let it use more disk
cache, or it could be that there just isn't any more to cache.

There's no problem with cranking up shared_buffers, but you should always
test it and compare it to a few smaller settings and find the "knee" where
Postgresql stops getting faster.  On my box that was around 128 Meg of
ram, with a slight further gain at 256 Meg, so, since I had 900 Megs of
disk cache at the time, I set it to 256 Meg.  For my machine and my load, that's lots
and lots of shared_buffer cache.  There are questions about performance
loss as the shared_buffer setting goes up due to the nature of
postgresql's buffering method, which is based on shared memory.  It would
appear that most kernels are great at buffering huge amounts of data in a
dirt simple way, while the shared memory that postgresql uses may start
off faster (say at a few thousand buffers) but seems to reach a point of
diminishing returns in real tests.  For me that was somewhere around 512
Meg of shared buffer memory.  Note that it's not like Postgresql ground to
a halt and stopped working, it just stopped going faster, and started
getting a little slower.

It might still be a win for certain types of loads, just not mine.  The
load my database runs is about 90% small lookups (select * from view where
id=123) and 5% medium reports (select * from view where date>1 month ago)
and about 5% gargantuan reports (select * from sales_history where year
like '199%').  The problem for me is that only the small lookup type apps
are changing a lot of data, and refetching all the time.  The medium
reports may get run a dozen times in one sitting, but most of the time
it's a one shot.  The gargantuan reports are run by batch files mostly, or
by single users who know it's ok that it takes a minute or two to get
their report and they won't hit reload six or seven times.  :-)

The thing that slowed down the most at high buffer settings for me was the small
access applications, i.e. work flow, order processing type stuff.  It
didn't benefit from a larger cache because it's working over a small
dataset, so any performance loss due to overhead was seen here first,
since there was no corresponding advantage to a large cache on a small
dataset.

The big reports were actually almost the same speed, sometimes still
faster at 512Meg than 256Meg, but that makes sense, since select * from
logs order by random() is gonna take up some space :-) especially if
you're joining it to get a username -> realname lookup.  But all those big
things are faster if you can run them on a cron and have them ready in the
morning.  And the machine doesn't care if it's 20% faster at 2:00am.

But shared_buffers isn't the dangerous setting; the dangerous one is
sort_mem cranked up too high.  Since sort_mem comes from plain old memory,
not the shared_buffers, a bunch of order by statements running in parallel
can make the kernel flush its cache, then start swapping out programs if
things get really bad.  Causing the kernel to flush its cache means that
you can get into a round-robin cache problem where everything the database
asks for was just here a minute ago, but we just flushed it to make room
for this last block here.  And that happens for each block you're looking for.
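
To put rough, illustrative numbers on how fast sort_mem multiplies (the
64MB and 50-backend figures below are made up, just to show the
arithmetic): sort_mem is allocated per sort, per backend, out of ordinary
memory.

    show sort_mem;       -- value is in kilobytes in 7.x
    -- e.g. sort_mem = 65536 (64MB) with 50 backends each running an order by
    -- can ask for roughly 50 * 64MB = 3.2GB of non-shared memory, which is
    -- exactly the kind of pressure that makes the kernel dump its cache.
    set sort_mem = 8192; -- it can also be adjusted per session for one-off jobs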

Bad as that is, it's nothing compared to making your machine start
swapping, especially if what it's swapping are shared buffer blocks...
which many OSes (Linux is one; not sure about the BSDs) will happily do.

After that, the danger in Linux is that you'll use up all the swap and all
the memory, and the kernel will start kill -9ing "rogue" processes.
Unfortunately, the "rogue process recognition algorithm" in Linux amounts
to killing whatever is big and hungry, i.e. Postgresql backends.

So, I think the process of tuning these settings for a machine is by its
nature an interactive thing: you have to build it, put it under load,
measure its performance, and tweak a little at a time.  The real danger is
in overshooting the optimum.  The first sign is that things aren't getting
faster, and the next sign might be that the machine is crawling in a swap
storm.  Mediocre but reliable is preferable to meteoric (both in terms of
speed and reliability).

What I shoot for now is to have half of the memory in a box be kernel
cache, 1/4th or so be postgresql programs and cache, and the other 1/4th
to run anything else on the box.  On my current machine that's 1.5 Gig
mem, 700-800M used for kernel disk cache, 256 Meg used for postgresql, and
about 400-500M used for everything else.  Postgresql performance is really
not an issue unless we write a poorly designed query.  And if we do, we
can't usually use up enough memory to cause problems.