Thread: Column correlation drifts, index ignored again

Column correlation drifts, index ignored again

From: John Siracusa

This is a follow-up to an old thread of mine, but I can't find it now
so I'll just re-summarize.

I have a ~1 million row table that I mostly want to query by date
range.  The rows are pretty uniformly spread over a 3 year date range.
I have an index on the date column, but the planner didn't always use it.
As a first fix, I disabled seqscans (enable_seqscan = off) before running
my query, but it bothered me that I had to do that.

Next, thanks to my earlier thread, I clustered the table on the date
column and then "SET STATISTICS" on the date column to be 100.  That
did the trick, and I stopped explicitly disabling seqscan.
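
For reference, the commands for that (table, index, and column names as in
the EXPLAIN output below; exact invocation from memory) were roughly:

    CLUSTER mytable_date_idx ON mytable;                        -- 7.4 syntax: CLUSTER index ON table
    ALTER TABLE mytable ALTER COLUMN date SET STATISTICS 100;   -- raise the stats target for "date"
    ANALYZE mytable;                                            -- refresh stats, including correlation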

Today, I noticed that Postgres (still 7.4) stopped using the date index
again.  I checked the correlation for the date column and it was down
to 0.4.  So I guess that stat does drift away from 1.0 after
clustering.  That's a bummer, because clustering locks up the table
while it works, which I can't really afford to do often.  Even at a
correlation of 0.4 on the date column, using the date index was still
much faster than the seqscan plan that Postgres was choosing.  Anyway,
it's reclustering now.
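
For the record, the correlation stat comes straight from the pg_stats view;
something like this shows it:

    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'mytable' AND attname = 'date';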

A common query looks like this:

SELECT
     SUM(amount),
     SUM(quantity),
     date_trunc('day', date) AS date
FROM
     mytable
WHERE
     col1 IS NOT NULL AND
     col2 = 'foo'     AND
     col3 = 'bar'     AND
     date BETWEEN '2004-02-01 00:00:00' AND '2004-02-28 23:59:59'
GROUP BY
     date_trunc('day', date)
ORDER BY
     date;

The EXPLAIN ANALYZE output now looks like this:

  Sort  (cost=4781.75..4824.15 rows=16963 width=23) (actual time=2243.595..2243.619 rows=21 loops=1)
    Sort Key: date_trunc('day'::text, date)
    ->  HashAggregate  (cost=3462.87..3590.09 rows=16963 width=23) (actual time=2241.773..2243.454 rows=21 loops=1)
          ->  Index Scan using mytable_date_idx on mytable  (cost=0.00..3071.70 rows=52155 width=23) (actual time=2.610..1688.111 rows=49679 loops=1)
                Index Cond: ((date >= '2004-02-01 00:00:00'::timestamp without time zone) AND (date <= '2004-02-28 23:59:59'::timestamp without time zone))
                Filter: ((col1 IS NOT NULL) AND ((col2)::text = 'foo'::text) AND ((col3)::text = 'bar'::text))
  Total runtime: 2244.391 ms

Unfortunately, since I just re-clustered, I can't get the old EXPLAIN
output, but just imagine "Seq Scan" in place of "Index Scan using
mytable_date_idx" to get the idea.

My question is: what other options do I have?  Should I "SET
STATISTICS" on the date column to 200?  500?  The maximum value of 1000?
I want to do something that will convince Postgres that using the date
index is, by far, the best plan when running my queries, even when the
date column correlation stat drops well below 1.0.

-John


Re: Column correlation drifts, index ignored again

From: Josh Berkus

On Saturday 21 February 2004 16:18, John Siracusa wrote:
John,

> Next, thanks to my earlier thread, I clustered the table on the date
> column and then "SET STATISTICS" on the date column to be 100.  That
> did the trick, and I stopped explicitly disabling seqscan.

100?  Are you sure you don't mean some other number?   100 is not very high
for problem analyze issues.  You might try 500.   Generally when I have a
problem query I raise stats to something like 1000 and drop it down until the
problem behaviour starts re-appearing.
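
With your table that would look something like this, re-running the ANALYZE
and an EXPLAIN of the problem query after each change, then stepping the
target back down:

    ALTER TABLE mytable ALTER COLUMN date SET STATISTICS 1000;
    ANALYZE mytable;
    EXPLAIN SELECT ... ;   -- your date-range query; check that the index is chosen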

>      date_trunc('day', date) AS date

Have you tried putting an index on date_trunc('day', date) and querying on
that instead of using this:

>      date BETWEEN '2004-02-01 00:00:00' AND '2004-02-28 23:59:59'
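
i.e. something along these lines (the index name here is just made up):

    CREATE INDEX mytable_date_day_idx ON mytable (date_trunc('day', date));

    SELECT ...
    WHERE date_trunc('day', date) BETWEEN '2004-02-01' AND '2004-02-28';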

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Column correlation drifts, index ignored again

From: John Siracusa

On 2/22/04 2:05 PM, Josh Berkus wrote:
> On Saturday 21 February 2004 16:18, John Siracusa wrote:
>> Next, thanks to my earlier thread, I clustered the table on the date
>> column and then "SET STATISTICS" on the date column to be 100.  That
>> did the trick, and I stopped explicitly disabling seqscan.
>
> 100?  Are you sure you don't mean some other number?   100 is not very high
> for problem analyze issues.  You might try 500.

IIRC, 100 was the number suggested in the earlier thread.  I did set it to
500 yesterday, I believe.  We'll see how that goes.

> Generally when I have a problem query I raise stats to something like 1000 and
> drop it down until the problem behaviour starts re-appearing.

Since this problem takes a long time to appear (months), that cycle could
take a long time... :)

>>      date_trunc('day', date) AS date
>
> Have you tried putting an index on date_trunc('day', date) and querying on
> that instead of using this:
>
>>      date BETWEEN '2004-02-01 00:00:00' AND '2004-02-28 23:59:59'

No, but then I'd just have a different index to persuade the planner to use
:)  Not every query does date_trunc() stuff, but they all do date ranges,
often at a granularity of seconds.

-John


Re: Column correlation drifts, index ignored again

From: Tom Lane

John Siracusa <siracusa@mindspring.com> writes:
>   I want to do something that will convince Postgres that using the date
> index is, by far, the best plan when running my queries, even when the
> date column correlation stat drops well below 1.0.

Have you tried experimenting with random_page_cost?  Seems like your
results suggest that you need to lower it.
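
You can experiment per-session first, so nothing else is affected; the value
below is only an example:

    SHOW random_page_cost;           -- 4 is the default
    SET random_page_cost = 2;
    EXPLAIN ANALYZE SELECT ... ;     -- the problem query
    RESET random_page_cost;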

            regards, tom lane

Re: Column correlation drifts, index ignored again

From: John Siracusa

On 2/22/04 5:06 PM, Tom Lane wrote:
> John Siracusa <siracusa@mindspring.com> writes:
>> I want to do something that will convince Postgres that using the date
>> index is, by far, the best plan when running my queries, even when the
>> date column correlation stat drops well below 1.0.
>
> Have you tried experimenting with random_page_cost?  Seems like your
> results suggest that you need to lower it.

I don't want to do anything that "universal" if I can help it, because I
don't want to adversely affect any other queries that the planner currently
aces.
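
If I do end up touching it, I suppose I could at least scope it to just
these report queries, e.g. something like:

    BEGIN;
    SET LOCAL random_page_cost = 2;   -- example value; applies only to this transaction
    SELECT ... ;                      -- the date-range query
    COMMIT;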

I'm guessing that the reason using the date index is always so much faster
is that doing so only reads the rows in the date range (say, 1,000 of them)
instead of reading every single row in the table (1,000,000) as in a seqscan
plan.

I think the key is to get the planner to correctly ballpark the number of
rows in the date range.  If it does, I can't imagine it ever deciding to
read 1,000,000 rows instead of 1,000 with any sane "cost" setting.  I'm
assuming the defaults are sane :)

-John


Re: Column correlation drifts, index ignored again

From: Tom Lane

John Siracusa <siracusa@mindspring.com> writes:
> I think the key is to get the planner to correctly ballpark the number of
> rows in the date range.

I thought it was.  What you showed was

-> Index Scan using mytable_date_idx on mytable (cost=0.00..3071.70 rows=52155 width=23) (actual time=2.610..1688.111 rows=49679 loops=1)

which seemed plenty close enough to me.

            regards, tom lane

Re: Column correlation drifts, index ignored again

From: John Siracusa

On 2/22/04 6:40 PM, Tom Lane wrote:
> John Siracusa <siracusa@mindspring.com> writes:
>> I think the key is to get the planner to correctly ballpark the number of
>> rows in the date range.
>
> I thought it was.  What you showed was
>
> -> Index Scan using mytable_date_idx on mytable (cost=0.00..3071.70 rows=52155
> width=23) (actual time=2.610..1688.111 rows=49679 loops=1)
>
> which seemed plenty close enough to me.

That's after the planner correctly chooses the date index.  Unfortunately, I
forgot to save the EXPLAIN output from when it was choosing seqscan instead.
Does the planner get estimates from both plans before deciding whether or
not to use the one that references the date index?

-John


Re: Column correlation drifts, index ignored again

From: Josh Berkus

John,

> I think the key is to get the planner to correctly ballpark the number of
> rows in the date range.  If it does, I can't imagine it ever deciding to
> read 1,000,000 rows instead of 1,000 with any sane "cost" setting.  I'm
> assuming the defaults are sane :)

The default for random_page_cost is sane, but very conservative; it's pretty
much assuming tables that are bigger than RAM and a single IDE disk.   If
your setup is better than that, you can lower it.

For example, in the ideal case (database fits in RAM, fast RAM, CPU, and
random seek on the disk), you can lower it to 1.5.    For less ideal
situations, 1.8 to 2.5 is reasonable on high-end hardware.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Column correlation drifts, index ignored again

From: Tom Lane

John Siracusa <siracusa@mindspring.com> writes:
> Does the planner get estimates from both plans before deciding whether or
> not to use the one that references the date index?

The rowcount estimate is made prior to the plan cost estimate, much less
the plan selection.  So you'd see the same number either way.

            regards, tom lane

Re: Column correlation drifts, index ignored again

From: Kevin Brown

Josh Berkus wrote:
> John,
>
> > I think the key is to get the planner to correctly ballpark the number of
> > rows in the date range.  If it does, I can't imagine it ever deciding to
> > read 1,000,000 rows instead of 1,000 with any sane "cost" setting.  I'm
> > assuming the defaults are sane :)
>
> The default for random_page_cost is sane, but very conservative; it's pretty
> much assuming tables that are bigger than RAM and a single IDE disk.   If
> your setup is better than that, you can lower it.
>
> For example, in the ideal case (database fits in RAM, fast RAM, CPU, and
> random seek on the disk), you can lower it to 1.5.    For less ideal
> situations, 1.8 to 2.5 is reasonable on high-end hardware.

I suspect this ultimately depends on the types of queries you do, the
size of the tables involved, disk cache, etc.

For instance, if you don't have sort_mem set high enough, then things
like large hash joins will spill to disk and almost certainly cause a
lot of contention (random access patterns) even if a sequential scan is
being used to read the table data.  The fix there is, of course, to
increase sort_mem if possible (as long as you don't cause paging during
the operation, which will also slow things down), but you might not
really have that option -- in which case you might see some improvement
by tweaking random_page_cost.
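
In 7.4 that's the sort_mem setting (in kilobytes), and it can be raised for
just the session running the big query rather than globally, e.g.:

    SET sort_mem = 65536;   -- 64MB for this session only; keep it low enough to avoid paging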

On a system where the PG data is stored on a disk that does other things,
you'll actually want random_page_cost to be *closer* to 1 rather than
further away.  The reason is that the average access time of a sequential
page in that case is much closer to that of a random page than it would
be if the disk in question were dedicated to PG duty.  This also goes for
large RAID setups where multiple types of data (e.g., home directories,
log files, etc.) are stored along with the PG data -- such disk setups
will have more random activity happening on the disk while PG activity
is happening, thus making the PG sequential access pattern appear more
like random access.


The best way I can think of to tune random_page_cost is to run EXPLAIN
ANALYZE on the queries you want to optimize the most, under the
circumstances those queries are most likely to run in, and then do the
same with enable_seqscan off.  For each run, note the ratio of estimated
to actual time for the scan node itself.  Then tweak random_page_cost up
or down and do further EXPLAINs (with enable_seqscan off, and without
ANALYZE) until the ratio of the estimated index scan time to the actual
index scan time (recorded earlier via EXPLAIN ANALYZE) matches the ratio
of the estimated sequential scan time (which won't change based on
random_page_cost) to the actual sequential scan time.  A psql sketch of
the procedure follows the step list below.

So:

1.  set enable_seqscan = on
2.  set random_page_cost = <some really high value to force seqscans>
3.  EXPLAIN ANALYZE query
4.  record the ratio of estimated to actual scan times.
5.  set enable_seqscan = off
6.  set random_page_cost = <rough estimate of what it should be>
7.  EXPLAIN ANALYZE query
8.  record the actual index scan time(s)
9.  tweak random_page_cost
10.  EXPLAIN query
11.  If ratio of estimate to actual (recorded in step 8) is much
     different than that recorded in step 4, then go back to step 9.
     Reduce random_page_cost if the random ratio is larger than the
     sequential ratio, increase if it's smaller.
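
In psql, with a placeholder query, that looks roughly like:

    SET enable_seqscan = on;
    SET random_page_cost = 100;     -- absurdly high, to force the seqscan plan
    EXPLAIN ANALYZE SELECT ... ;    -- record estimated vs. actual seqscan time

    SET enable_seqscan = off;
    SET random_page_cost = 2;       -- rough starting guess
    EXPLAIN ANALYZE SELECT ... ;    -- record the actual index scan time

    SET random_page_cost = 1.8;     -- tweak and repeat
    EXPLAIN SELECT ... ;            -- compare the new estimate against the actual time above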


As a result, I ended up setting my random_page_cost to 1.5 on my system.
I suspect that the amount of pain you'll suffer when the planner
incorrectly chooses a sequential scan is much greater on average than
the amount of pain if it incorrectly chooses an index scan, so I'd tend
to favor erring on the low side for random_page_cost.


I'll know tomorrow whether or not my tweaking worked properly, as I have
a job that kicks off every night that scans the entire filesystem and
stores all the inode information about every file in a newly-created table,
then "merges" it into the existing file information table.  Each table
is about 2.5 million rows...


--
Kevin Brown                          kevin@sysexperts.com

Re: Column correlation drifts, index ignored again

From: Josh Berkus

Kevin,

> 1.  set enable_seqscan = on
> 2.  set random_page_cost = <some really high value to force seqscans>
> 3.  EXPLAIN ANALYZE query
> 4.  record the ratio of estimated to actual scan times.
> 5.  set enable_seqscan = off
> 6.  set random_page_cost = <rough estimate of what it should be>
> 7.  EXPLAIN ANALYZE query
> 8.  record the actual index scan time(s)
> 9.  tweak random_page_cost
> 10.  EXPLAIN query
> 11.  If ratio of estimate to actual (recorded in step 8) is much
>      different than that recorded in step 4, then go back to step 9.
>      Reduce random_page_cost if the random ratio is larger than the
>      sequential ratio, increase if it's smaller.

Nice, we ought to post that somewhere people can find it in the future.

I'm also glad that your new job allows you to continue doing PostgreSQL stuff.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Column correlation drifts, index ignored again

From: Tom Lane

Josh Berkus <josh@agliodbs.com> writes:
> Kevin,
>> 1.  set enable_seqscan = on
>> 2.  set random_page_cost = <some really high value to force seqscans>
>> 3.  EXPLAIN ANALYZE query
>> 4.  record the ratio of estimated to actual scan times.
>> 5.  set enable_seqscan = off
>> 6.  set random_page_cost = <rough estimate of what it should be>
>> 7.  EXPLAIN ANALYZE query
>> 8.  record the actual index scan time(s)
>> 9.  tweak random_page_cost
>> 10.  EXPLAIN query
>> 11.  If ratio of estimate to actual (recorded in step 8) is much
>> different than that recorded in step 4, then go back to step 9.
>> Reduce random_page_cost if the random ratio is larger than the
>> sequential ratio, increase if it's smaller.

> Nice, we ought to post that somewhere people can find it in the future.

If we post it as recommended procedure we had better put big caveat
notices on it.  The pitfalls with doing this are:

1. If you repeat the sequence exactly as given, you will be homing in on
a RANDOM_PAGE_COST that describes your system's behavior with a fully
cached query.  It is to be expected that you will end up with 1.0 or
something very close to it.  The only way to avoid that is to use a
query that is large enough to blow out your kernel's RAM cache; which of
course will take long enough that iterating step 10 will be no fun,
and people will be mighty tempted to take shortcuts.

2. Of course, you are computing a RANDOM_PAGE_COST that is relevant to
just this single query.  Prudence would suggest repeating the process
with several different queries and taking some sort of average.

When I did the experiments that led up to choosing 4.0 as the default,
some years ago, it took several days of thrashing the disks on a couple
of different machines before I had numbers that I didn't think were
mostly noise :-(.  I am *real* suspicious of any replacement numbers
that have been derived in just a few minutes.

            regards, tom lane

Re: Column correlation drifts, index ignored again

From: Kevin Brown

Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > Kevin,
> >> 1.  set enable_seqscan = on
> >> 2.  set random_page_cost = <some really high value to force seqscans>
> >> 3.  EXPLAIN ANALYZE query
> >> 4.  record the ratio of estimated to actual scan times.
> >> 5.  set enable_seqscan = off
> >> 6.  set random_page_cost = <rough estimate of what it should be>
> >> 7.  EXPLAIN ANALYZE query
> >> 8.  record the actual index scan time(s)
> >> 9.  tweak random_page_cost
> >> 10.  EXPLAIN query
> >> 11.  If ratio of estimate to actual (recorded in step 8) is much
> >> different than that recorded in step 4, then go back to step 9.
> >> Reduce random_page_cost if the random ratio is larger than the
> >> sequential ratio, increase if it's smaller.
>
> > Nice, we ought to post that somewhere people can find it in the future.
>
> If we post it as recommended procedure we had better put big caveat
> notices on it.  The pitfalls with doing this are:
>
> 1. If you repeat the sequence exactly as given, you will be homing in on
> a RANDOM_PAGE_COST that describes your system's behavior with a fully
> cached query.  It is to be expected that you will end up with 1.0 or
> something very close to it.  The only way to avoid that is to use a
> query that is large enough to blow out your kernel's RAM cache; which of
> course will take long enough that iterating step 10 will be no fun,
> and people will be mighty tempted to take shortcuts.

Oops.  You're right.  I did this on my system, but forgot to put it in
the list of things to do:

0.  Fill the page cache with something other than PG data, e.g. by
    repeatedly catting several large files and redirecting the output to
    /dev/null.  The sum total size of the files should exceed the amount
    of memory on the system.

The reason you might not have to do this between EXPLAIN ANALYZE queries
is that the first query will scan the table itself while the second one
will scan the index.  But that was probably more specific to the query I
was doing.  If the one you're doing is complex enough the system may have
to read data pages from the table itself after fetching the index page,
in which case you'll want to fill the page cache between the queries.

> 2. Of course, you are computing a RANDOM_PAGE_COST that is relevant to
> just this single query.  Prudence would suggest repeating the process
> with several different queries and taking some sort of average.

Right.  And the average should probably be weighted based on the
relative frequency that the query in question will be executed.

In my case, the query I was experimenting with was by far the biggest
query that occurs on my system (though it turns out that there are
others in that same process that I should look at as well).

> When I did the experiments that led up to choosing 4.0 as the default,
> some years ago, it took several days of thrashing the disks on a couple
> of different machines before I had numbers that I didn't think were
> mostly noise :-(.  I am *real* suspicious of any replacement numbers
> that have been derived in just a few minutes.

One problem I've been running into is the merge join spilling to disk
because sort_mem isn't big enough.  The problem isn't that this is
happening, it's that I think the planner is underestimating the impact
that doing this will have on the time the merge join takes.  Does the
planner even account for the possibility that a sort or join will spill
to disk?  Spilling to disk like that will suddenly cause sequential
reads to perform much more like random reads, unless the sequential
scans are performed in their entirety between sorts/merges.


In any case, one thing that none of this really accounts for is that
it's better to set random_page_cost too low than too high.  The reason is
that the penalty for wrongly choosing a sequential scan grows with the
table: a sequential scan reads the entire table every time, whereas an
index scan reads only the index pages (and the table pages they point to)
for the rows that match the query.  My experience is that when the planner
miscomputes the selectivity of the query (e.g., because the statistics
aren't good enough or sufficiently up to date), it generally overestimates
the number of matching rows, and thus selects a sequential scan when an
index scan would be more efficient.

The auto vacuum daemon helps in this regard, by keeping the statistics
more up-to-date.

Certainly you shouldn't go overboard by setting random_page_cost too low
"just in case", but it does mean that if you go through the process of
running tests to determine the proper value for random_page_cost, you
should probably select a random_page_cost that's in the lower part of
the range of values you got.


--
Kevin Brown                          kevin@sysexperts.com

Re: Column correlation drifts, index ignored again

From: "Ed L."

On Tuesday February 24 2004 1:14, Kevin Brown wrote:
>
> One problem I've been running into is the merge join spilling to disk
> because sort_mem isn't big enough.  The problem isn't that this is
> happening, it's that I think the planner is underestimating the impact
> that doing this will have on the time the merge join takes.  Does the
> planner even account for the possibility that a sort or join will spill
> to disk?  Spilling to disk like that will suddenly cause sequential
> reads to perform much more like random reads, unless the sequential
> scans are performed in their entirety between sorts/merges.

How do you know the merge join is spilling to disk?  How are you identifying
that?  Just assuming from vmstat?  iostat?

Re: Column correlation drifts, index ignored again

From: Kevin Brown

Ed L. wrote:
> How do you know the merge join is spilling to disk?  How are you identifying
> that?  Just assuming from vmstat?  iostat?

The existence of files in $PGDATA/base/<db-oid>/pgsql_tmp while the
query is running, combined with the EXPLAIN output (which shows what
sorts and joins are being performed).


--
Kevin Brown                          kevin@sysexperts.com

Re: Column correlation drifts, index ignored again

From: Tom Lane

Kevin Brown <kevin@sysexperts.com> writes:
> One problem I've been running into is the merge join spilling to disk
> because sort_mem isn't big enough.  The problem isn't that this is
> happening, it's that I think the planner is underestimating the impact
> that doing this will have on the time the merge join takes.  Does the
> planner even account for the possibility that a sort or join will spill
> to disk?

Yes it does.  I thought it was making a pretty good estimate, actually.
The only obvious hole in the assumptions is

 * The disk traffic is assumed to be half sequential and half random
 * accesses (XXX can't we refine that guess?)

Because of the way that tuplesort.c works, the first merge pass should
be pretty well sequential, but I think the passes after that might be
mostly random from the kernel's viewpoint :-(.  Possibly the I/O cost
should be adjusted depending on how many merge passes we expect.


> In any case, one thing that none of this really accounts for is that
> it's better to set random_page_cost too low than too high.

That depends on what you are doing, although I will concede that a lot
of people are doing things where indexscans should be favored.

            regards, tom lane