Thread: index usage (and foreign keys/triggers)

index usage (and foreign keys/triggers)

From: Patrik Kudo

Hi gurus et al ;)

I have a database with a couple of hundred tables. In this database one
table, "person", represents a user. person.userid is a primary key. To
this key there are about a hundred foreign keys spread out over
approximately equally many tables. When deleting a user I noticed quite
a big difference in time depending on how much data there is in the
foreign key tables. After some investigation I concluded that for some
users and some tables the indices weren't used when deleting, resulting
in longer run-times.
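
A minimal sketch of the pattern, in case it helps (the logintime column
is made up just to have something in the table; the rest matches our
schema):

create table person (
    userid  text primary key
);

-- one of the roughly one hundred tables referencing person.userid
create table login (
    userid     text references person (userid),
    logintime  timestamp
);
create index login_userid_idx on login (userid);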

Here's an example:

select count(*) from login;
  count
-------
  96824

select count(*) from login where userid = 'patrik';
  count
-------
    608

select count(*) from login where userid = 'jennie';
  count
-------
   4211

explain delete from login where userid = 'patrik';
                                   QUERY PLAN
---------------------------------------------------------------------------------
  Index Scan using login_userid_idx on login  (cost=0.00..237.06 rows=61 width=6)
    Index Cond: (userid = 'patrik'::text)

explain delete from login where userid = 'jennie';
                         QUERY PLAN
-----------------------------------------------------------
  Seq Scan on login  (cost=0.00..2045.30 rows=3421 width=6)
    Filter: (userid = 'jennie'::text)


What makes the planner choose a seq scan for 'jennie', but not for
'patrik'? I also tested the following:

delete from login where userid = 'jennie';
DELETE 4211
Time: 508.94 ms

set enable_seqscan = false;

delete from login where userid = 'jennie';
DELETE 4211
Time: 116.92 ms

As you can see the index scan is almost 5 times faster, but still
postgres chooses to seq scan... Am I doing something wrong or is
postgres being stupid on me?

select version();
                               version
-------------------------------------------------------------------
  PostgreSQL 7.3 on i386-portbld-freebsd4.7, compiled by GCC 2.95.4

I tried the same thing on a similar database running on 7.3.2 with the
same results.

Regards,
Patrik Kudo


Re: index usage (and foreign keys/triggers)

From: Stephan Szabo

On Wed, 26 Feb 2003, Patrik Kudo wrote:

> Hi gurus et al ;)
>
> I have a database with a couple of hundred tables. In this database one
> table, "person", represents a user. person.userid is a primary key. To
> this key there are about a hundred foreign keys spread out over
> approximately equally many tables. When deleting a user I noticed quite
> a big difference in time depending on how much data there is in the
> foreign key tables. After some investigation I concluded that for some
> users and some tables the indices weren't used when deleting, resulting
> in longer run-times.

>
> Here's an example:
>
> select count(*) from login;
>   count
> -------
>   96824
>
> select count(*) from login where userid = 'patrik';
>   count
> -------
>     608
>
> select count(*) from login where userid = 'jennie';
>   count
> -------
>    4211
>
> explain delete from login where userid = 'patrik';
>                                     QUERY PLAN
>
> ---------------------------------------------------------------------------------
>   Index Scan using login_userid_idx on login  (cost=0.00..237.06 rows=61
> width=6)
>     Index Cond: (userid = 'patrik'::text)
>
> explain delete from login where userid = 'jennie';
>                          QUERY PLAN
> -----------------------------------------------------------
>   Seq Scan on login  (cost=0.00..2045.30 rows=3421 width=6)
>     Filter: (userid = 'jennie'::text)
>
>
> What makes the planner choose a seq scan for 'jennie', but not for
> 'patrik'? I also tested the following:

Well, at 3421 of 96824 rows it's estimating that the cost is lower. What
does the explain look like with seqscan turned off? (My guess would be
that it's a slightly higher cost.)  It's possible that random_page_cost should
be lower, or that perhaps there's some level of clustering in the data
that's not being picked up.  You might want to try raising the
number of statistics buckets and re-analyzing just to see if that helps.
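
Something along these lines, for example (the target of 100 is just a
guess to try):

alter table login alter column userid set statistics 100;
analyze login;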



Re: index usage (and foreign keys/triggers)

From: Patrik Kudo

Stephan Szabo wrote:
>>explain delete from login where userid = 'jennie';
>>                         QUERY PLAN
>>-----------------------------------------------------------
>>  Seq Scan on login  (cost=0.00..2045.30 rows=3421 width=6)
>>    Filter: (userid = 'jennie'::text)
>>
>
> Well, at 3421 of 96824 rows it's estimating that the cost is lower. What
> does the explain look like with seqscan turned off? (My guess would be
> that it's a slightly higher cost.)  It's possible that random_page_cost should

Yepp! You're right. The cost is higher:

set enable_seqscan to off;
explain delete from login where userid = 'jennie';
                                     QUERY PLAN
------------------------------------------------------------------------------------
  Index Scan using login_userid_idx on login  (cost=0.00..3363.71 rows=4131 width=6)
    Index Cond: (userid = 'jennie'::text)

If I lower the random_page_cost to about 2 the index is being used
instead of seq scan. Is it reasonable to have such a setting on a
production server? random_page_cost = 2 is good for this particular
query, but could it have negative effect on other queries?

> be lower, or that perhaps there's some level of clustering in the data
> that's not being picked up.  You might want to try raising the
> number of statistics buckets and re-analyzing just to see if that helps.

I'm afraid I'm a bit too new at this kind of tweaking... do you mean the
"default_statistics_target"? In that case I tried to raise it from the
default 10 to as high as 45, but without any other result than vacuum
analyze being slower. Did I understand your suggestion right?

Regards,
Patrik Kudo


Re: index usage (and foreign keys/triggers)

From: Stephan Szabo

On Wed, 26 Feb 2003, Patrik Kudo wrote:

> Stephan Szabo wrote:
> >>explain delete from login where userid = 'jennie';
> >>                         QUERY PLAN
> >>-----------------------------------------------------------
> >>  Seq Scan on login  (cost=0.00..2045.30 rows=3421 width=6)
> >>    Filter: (userid = 'jennie'::text)
> >>
> >
> > Well, at 3421 of 96824 rows it's estimating that the cost is lower. What
> > does the explain look like with seqscan turned off? (My guess would be
> > that it's a slightly higher cost.)  It's possible that random_page_cost should
>
> Yepp! You're right. The cost is higher:
>
> set enable_seqscan to off;
> explain delete from login where userid = 'jennie';
>                                       QUERY PLAN
>
> ------------------------------------------------------------------------------------
>   Index Scan using login_userid_idx on login  (cost=0.00..3363.71
> rows=4131 width=6)
>     Index Cond: (userid = 'jennie'::text)
>
> If I lower the random_page_cost to about 2 the index is being used
> instead of seq scan. Is it reasonable to have such a setting on a
> production server? random_page_cost = 2 is good for this particular
> query, but could it have negative effect on other queries?

It's possible since it might make other queries use an index when the
sequence scan is better.  It's probably worth doing some testing with the
setting.

>
> > be lower, or that perhaps there's some level of clustering in the data
> > that's not being picked up.  You might want to try raising the
> > number of statistics buckets and re-analyzing just to see if that helps.
>
> I'm afraid I'm a bit too new at this kind of tweaking... do you mean the
> "default_statistics_target"? In that case I tried to raise it from the
> default 10 to as high as 45, but without any other result than vacuum
> analyze being slower. Did I understand your suggestion right?

I'd thought about doing it with ALTER TABLE ALTER COLUMN SET STATISTICS,
but I would think that it would probably have worked with default as well.

Is it possible that the data has local clustering on the field (many
rows with the same value stuck together) while not being terribly ordered
overall?  That's a case that the statistics don't really cover right now
(there have been some discussions of this in the past)



Re: index usage (and foreign keys/triggers)

From: "scott.marlowe"

On Wed, 26 Feb 2003, Patrik Kudo wrote:

> Hi gurus et al ;)
>
> I have a database with a couple of hundred tables. In this database one
> table, "person", represents a user. person.userid is a primary key. To
> this key there are about a hundred foreign keys spread out over
> approximately equally many tables. When deleting a user I noticed quite
> a big difference in time depending on how much data there is in the
> foreign key tables. After some investigation I concluded that for some
> users and some tables the indices weren't used when deleting, resulting
> in longer run-times.
>
> Here's an example:
>
> select count(*) from login;
>   count
> -------
>   96824
>
> select count(*) from login where userid = 'patrik';
>   count
> -------
>     608
>
> select count(*) from login where userid = 'jennie';
>   count
> -------
>    4211
>
> explain delete from login where userid = 'patrik';
>                                     QUERY PLAN
>
> ---------------------------------------------------------------------------------
>   Index Scan using login_userid_idx on login  (cost=0.00..237.06 rows=61
> width=6)
>     Index Cond: (userid = 'patrik'::text)
>
> explain delete from login where userid = 'jennie';
>                          QUERY PLAN
> -----------------------------------------------------------
>   Seq Scan on login  (cost=0.00..2045.30 rows=3421 width=6)
>     Filter: (userid = 'jennie'::text)
>
>
> What makes the planner choose a seq scan for 'jennie', but not for
> 'patrik'? I also tested the following:
>
> delete from login where userid = 'jennie';
> DELETE 4211
> Time: 508.94 ms
>
> set enable_seqscan = false;
>
> delete from login where userid = 'jennie';
> DELETE 4211
> Time: 116.92 ms
>
> As you can see the index scan is almost 5 times faster, but still
> postgres chooses to seq scan... Am I doing something wrong or is
> postgres being stupid on me?

Postgresql is being smart, just not smart enough.

Imagine that one of your queries was to delete 99.9% of all the tuples.
Would an index scan help then?  Of course not, since you're going to visit
nearly every row in the database.

The planner uses several settings to try and figure out the cost of
sequentially scanning a table versus index access, and it doesn't always
get things right.

Take a look at random_page_cost.  It defaults to 4, which means that
postgresql will make its decisions on index versus seq scan assuming that
random individual pages cost 4 times as much to get as a sequential scan
that just happens to include them.

On most modern machines the difference in cost is very low, what with disk
caching and all.  This is especially true for smaller tables that can fit
in memory.  Once a table's in buffer memory, along with its index, random
page cost will be about 1.  I.e. a seq scan in memory or an index op are
both quite fast.  In fact, it is possible that at this point, a random
page access for certain percentages of your table will cost you LESS than
1 in practice because linear access in memory yields little if any gain
over random access.  The only overhead is reading the index blocks.

So, try tuning your random page cost down to somewhere between 1.0 and 2.0
for best performance on these kinds of things.  Our database at work runs
on a machine with 1.5 gig ram, of which 800 megs is cache, and postgresql
has 256 meg shared buffer.  It generally only hits the drives about 5% of
the reads, so random page cost for us is set to 1.2 and works well.
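
For example, to try it for a session before putting anything in
postgresql.conf (1.5 is just a starting point):

set random_page_cost = 1.5;
explain delete from login where userid = 'jennie';

If the plans look right across your common queries, make it permanent by
setting random_page_cost in postgresql.conf and reloading.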

Welcome to the wonderful world of performance tuning...


Re: index usage (and foreign keys/triggers)

From: Patrik Kudo

Stephan Szabo wrote:

>>If I lower the random_page_cost to about 2 the index is being used
>>instead of seq scan. Is it reasonable to have such a setting on a
>>production server? random_page_cost = 2 is good for this particular
>>query, but could it have negative effect on other queries?
>
>
> It's possible since it might make other queries use an index when the
> sequence scan is better.  It's probably worth doing some testing with the
> setting.

Ok. I'll do some testing and see what seems to work best for us.

>>>be lower, or that perhaps there's some level of clustering in the data
>>>that's not being picked up.  You might want to try raising the
>>>number of statistics buckets and re-analyzing just to see if that helps.
>>
>>I'm afraid I'm a bit too new at this kind of tweaking... do you mean the
>>"default_statistics_target"? In that case I tried to raise it from the
>>default 10 to as high as 45, but without any other result than vacuum
>>analyze being slower. Did I understand your suggestion right?
>
>
> I'd thought about doing it with ALTER TABLE ALTER COLUMN SET STATISTICS,
> but I would think that it would probably have worked with default as well.

What exactly does this setting do and how does it affect the
planner/optimizer? I couldn't find much about this in the docs.

> Is it possible that the data has local clustering on the field (many
> rows with the same value stuck together) while not being terribly ordered
> overall?  That's a case that the statistics don't really cover right now
> (there have been some discussions of this in the past)

How can I find this out? A simple "select * from login" and just browse
the result, or is there any automated way to analyze this?

Thanks,
Patrik Kudo


Re: index usage (and foreign keys/triggers)

From: Patrik Kudo

scott.marlowe wrote:
>
>
> Postgresql is being smart, just not smart enough.
>
> Imagine that one of your queries was to delete 99.9% of all the tuples.
> Would an index scan help then?  Of course not, since you're going to visit
> nearly every row in the database.
>
> The planner uses several settings to try and figure out the cost of
> sequentially scanning a table versus index access, and it doesn't always
> get things right.
>
> Take a look at random_page_cost.  It defaults to 4, which means that
> postgresql will make its decisions on index versus seq scan assuming that
> random individual pages cost 4 times as much to get as a sequential scan
> that just happens to include them.
>
> On most modern machines the difference in cost is very low, what with disk
> caching and all.  This is especially true for smaller tables that can fit
> in memory.  Once a table's in buffer memory, along with its index, random
> page cost will be about 1.  I.e. a seq scan in memory or an index op are
> both quite fast.  In fact, it is possible that at this point, a random
> page access for certain percentages of your table will cost you LESS than
> 1 in practice because linear access in memory yields little if any gain
> over random access.  The only overhead is reading the index blocks.
>
> So, try tuning your random page cost down to somewhere between 1.0 and 2.0
> for best performance on these kinds of things.  Our database at work runs
> on a machine with 1.5 gig ram, of which 800 megs is cache, and postgresql
> has 256 meg shared buffer.  It generally only hits the drives about 5% of
> the reads, so random page cost for us is set to 1.2 and works well.

Thanks for a good explanation! However, a setting lower than 2 seems a
bit scary to me. Our databases are quite large due to many large
objects, in some cases around 4GB, so all the data couldn't possibly be
cached all the time. The most frequently accessed tables, however, are
fewer and smaller and would probably fit easily into the 1-2GB of RAM
(that's the range we usually have on the servers).
  Any top-of-mind suggestions or reflections on tuning strategies? ;)

> Welcome to the wonderful world of performance tuning...

It's a jungle, but I love it! =)

Regards,
Patrik Kudo


Re: index usage (and foreign keys/triggers)

From: Stephan Szabo

On Thu, 27 Feb 2003, Patrik Kudo wrote:

> Stephan Szabo wrote:
> >>>be lower, or that perhaps there's some level of clustering in the data
> >>>that's not being picked up.  You might want to try raising the
> >>>number of statistics buckets and re-analyzing just to see if that helps.
> >>
> >>I'm afraid I'm a bit too new at this kind of tweaking... do you mean the
> >>"default_statistics_target"? In that case I tried to raise it from the
> >>default 10 to as high as 45, but without any other result than vacuum
> >>analyze being slower. Did I understand your suggestion right?
> >
> >
> > I'd thought about doing it with ALTER TABLE ALTER COLUMN SET STATISTICS,
> > but I would think that it would probably have worked with default as well.
>
> What exactly does this setting do and how does it affect the
> planner/optimizer? I couldn't find much about this in the docs.

It changes the number of entries that get stored in the statistics table,
but IIRC in this case (potentially more importantly) also raises the
number of rows that it grabs as its sample.
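
If you want to see what actually got gathered, something like this should
show the per-column stats (pg_stats is the standard view for it):

select n_distinct, most_common_vals, most_common_freqs, correlation
from pg_stats
where tablename = 'login' and attname = 'userid';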

> > Is it possible that the data has local clustering on the field (many
> > rows with the same value stuck together) while not being terribly ordered
> > overall?  That's a case that the statistics don't really cover right now
> > (there have been some discussions of this in the past)
>
> How can I find this out? A simple "select * from login" and just browse
> the result, or is there any automated way to analyze this?

There's probably a reasonable way to automate it, although I don't know if
anyone's done it.  Technically I think what you really want to know is, for
a given value of the search key, how many pages of the table's heap file
you can find such rows on versus the total number of pages in the table.
You might be able to estimate the number of distinct pages that live rows
are taking by something like:
select ctid, col from table order by col;
and parsing it in perl (AFAIK the first part of the ctid column is the
page).  You can get the total number from vacuum verbose's output or
an estimate from last vacuum (I think) from pg_class.
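
On newer versions you might be able to skip the perl and do the counting
in SQL; a rough sketch, assuming ctid can be cast through text and using
userid as the column:

-- distinct heap pages holding rows for each userid
select userid,
       count(distinct split_part(ltrim(ctid::text, '('), ',', 1)) as pages
from login
group by userid
order by pages desc;

-- total pages in the table (estimate from last vacuum/analyze)
select relpages from pg_class where relname = 'login';

Comparing pages for a given userid against relpages gives a feel for how
spread out that user's rows are.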


Re: index usage (and foreign keys/triggers)

From: "scott.marlowe"

On Thu, 27 Feb 2003, Patrik Kudo wrote:


> Thanks for a good explanation! However, a setting lower than 2 seems a
> bit scary to me. Our databases are quite large due to many large
> objects, in some cases around 4GB, so all the data couldn't possibly be
> cached all the time. The most frequently accessed tables, however, are
> fewer and smaller and would probably fit easily into the 1-2GB of RAM
> (that's the range we usually have on the servers).
>   Any top-of-mind suggestions or reflections on tuning strategies? ;)

Well, my experience has been that accidentally picking an index lookup
when a sequential scan would be better may cost up to twice as much as the
seq scan, due to the slower random access.  And usually these are queries
you expect to be slow anyway, like something that selects 90% of a 10M row
table.

But, making the mistake the other way can be much more costly.  I've
watched queries that took 30 or more seconds with a seq scan drop to
sub-second with indexes.  So, don't be afraid of dropping below 2.  Just
make sure it makes sense for your database.  Note that you can change
random_page_cost on the fly as well, so you could do something like:

begin;
select * from table_a;
set random_page_cost = 1.2;
select * from table_b where c = 'd';
set random_page_cost = 3.0;
...
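
One refinement, if your version supports it: SET LOCAL keeps the change
scoped to the transaction, so it can't leak out if you forget to set the
old value back:

begin;
set local random_page_cost = 1.2;
select * from table_b where c = 'd';
commit;   -- random_page_cost reverts to its previous value here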



Re: index usage (and foreign keys/triggers)

From: Greg Stark

"scott.marlowe" <scott.marlowe@ihs.com> writes:

> Well, my experience has been that accidentally picking an index lookup
> when a sequential scan would be better may cost up to twice as much as the
> seq scan, due to the slower random access.  And usually these are queries
> you expect to be slow anyway, like something that selects 90% of a 10M row
> table.

My experience is similar. Usually if it picks an index scan when a sequential
scan would be better, further experimentation shows the two are roughly the
same anyway. Whereas when it fails the other way it can be very, very bad for
an OLTP system like a web server.

> But, making the mistake the other way can be much more costly.  I've
> watched queries that took 30 or more seconds with a seq scan drop to sub
> second with indexes.  So, don't be afraid about dropping below 2.

However, I've just tried something new and it seems to be helping. I tried
raising cpu_tuple_cost instead. It now chooses nested loops with index lookups
far more aggressively, even when random_page_cost isn't insanely low.

I've actually raised cpu_tuple_cost to 0.1. That's a factor of 10 higher. That
seems to be what is required to get the ratio of costs between nested loops
and merge joins to line up with the ratio of actual times.
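
If anyone wants to see how the estimates shift, it's easy to flip back
and forth per session (0.01 is the default), e.g. with the query from
earlier in the thread:

set cpu_tuple_cost = 0.1;
explain delete from login where userid = 'jennie';
set cpu_tuple_cost = 0.01;
explain delete from login where userid = 'jennie';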

I wonder if other people see similar behaviour?

--
greg