Thread: Bad query optimizer misestimation because of TOAST tables

From: Markus Schaber

[This mail is cross-posted to both pgsql-performance and postgis-users
because PostGIS users may suffer from this problem, but I would prefer
to keep the discussion on pgsql-performance, as it is a general TOAST
problem and not specific to PostGIS alone.]

Hello,

Running PostGIS 0.8.1 under PostgreSQL 7.4.6-7 (Debian), I stumbled
over the following problem:

logigis=# explain analyze SELECT geom FROM adminbndy1 WHERE geom && setsrid('BOX3D(9.4835390946502 47.39365740740741,9.5164609053498 47.40634259259259)'::box3d,4326);
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on adminbndy1  (cost=0.00..4.04 rows=1 width=121) (actual time=133.591..7947.546 rows=5 loops=1)
   Filter: (geom && 'SRID=4326;BOX3D(9.4835390946502 47.3936574074074 0,9.5164609053498 47.4063425925926 0)'::geometry)
 Total runtime: 7947.669 ms
(3 rows)

logigis=# set enable_seqscan to off;
SET
logigis=# explain analyze SELECT geom FROM adminbndy1 WHERE geom && setsrid('BOX3D(9.4835390946502 47.39365740740741,9.5164609053498 47.40634259259259)'::box3d,4326);
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using adminbndy1_geom_idx on adminbndy1  (cost=0.00..4.44 rows=1 width=121) (actual time=26.902..27.066 rows=5 loops=1)
   Index Cond: (geom && 'SRID=4326;BOX3D(9.4835390946502 47.3936574074074 0,9.5164609053498 47.4063425925926 0)'::geometry)
 Total runtime: 27.265 ms
(3 rows)

So the query planner chooses to ignore the index, although it is
appropriate. My first idea was that the statistics were off, but that
turned out not to be the problem. As the output above shows, the
optimizer already estimates a row count of 1, which is even smaller
than the 5 rows actually fetched, so the row estimate alone should make
the planner use the index.

Some amount of increasingly random tries later, I did the following:

logigis=# vacuum full freeze analyze verbose adminbndy1;
INFO:  vacuuming "public.adminbndy1"
INFO:  "adminbndy1": found 0 removable, 83 nonremovable row versions in
3 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 128 to 1968 bytes long.
There were 1 unused item pointers.
Total free space (including removable row versions) is 5024 bytes.
0 pages are or will become empty, including 0 at the end of the table.
3 pages containing 5024 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO:  index "adminbndy1_geom_idx" now contains 83 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "adminbndy1": moved 0 row versions, truncated 3 to 3 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  vacuuming "pg_toast.pg_toast_19369"
INFO:  "pg_toast_19369": found 0 removable, 32910 nonremovable row
versions in 8225 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 37 to 2034 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 167492 bytes.
0 pages are or will become empty, including 0 at the end of the table.
66 pages containing 67404 free bytes are potential move destinations.
CPU 0.67s/0.04u sec elapsed 2.76 sec.
INFO:  index "pg_toast_19369_index" now contains 32910 row versions in
127 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.01s/0.00u sec elapsed 0.14 sec.
INFO:  "pg_toast_19369": moved 0 row versions, truncated 8225 to 8225 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  analyzing "public.adminbndy1"
INFO:  "adminbndy1": 3 pages, 83 rows sampled, 83 estimated total rows
VACUUM
logigis=#

IMHO, this shows the reason. The query planner sees a table size of 3
pages, which is clearly a case for a seqscan. But during the seqscan,
the database additionally has to fetch 8225 toast pages and 127 toast
index pages, and rebuild the geometries contained therein.

And the total of 8355 pages (about 68 MB at 8 kB per page) is a rather
large amount of data to fetch.

I think this problem bites every user who has rather large columns that
get stored in the TOAST table and then queries on those columns.
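
A quick way to see how much data hides in TOAST behind a seemingly tiny
table is to join pg_class to itself via reltoastrelid (a sketch;
relpages is only refreshed by VACUUM and ANALYZE):

  SELECT c.relname, c.relpages AS main_pages,
         t.relname AS toast_rel, t.relpages AS toast_pages
    FROM pg_class c
    LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
   WHERE c.relname = 'adminbndy1';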

As a small workaround, I could imagine adding a small additional column
to the table that contains only the geometry's bbox, and running the &&
operator against that. This should avoid touching the TOAST table for
the skipped rows.

But the real fix should be to include the toast pages in the query
planner's cost estimate for the sequential scan. What do you think
about it?

Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: Bad query optimizer misestimation because of TOAST tables

From: Tom Lane

Markus Schaber <schabios@logi-track.com> writes:
> IMHO, this shows the reason. The query planner sees a table size of 3
> pages, which is clearly a case for a seqscan. But during the seqscan,
> the database additionally has to fetch 8225 toast pages and 127 toast
> index pages, and rebuild the geometries contained therein.

I don't buy this analysis at all.  The toasted columns are not those in
the index (because we don't support out-of-line-toasted index entries),
so a WHERE clause that only touches indexed columns isn't going to need
to fetch anything from the toast table.  The only stuff it would fetch
is in rows that passed the WHERE and need to be returned to the client
--- and those costs are going to be the same either way.

I'm not entirely sure where the time is going, but I do not think you
have proven your theory about it.  I'd suggest building the backend
with -pg and getting some gprof evidence.

            regards, tom lane

Re: Bad query optimizer misestimation because of TOAST

From: Markus Schaber

Hi, Tom,

Tom Lane schrieb:

>> IMHO, this shows the reason. The query planner sees a table size of 3
>> pages, which is clearly a case for a seqscan. But during the seqscan,
>> the database additionally has to fetch 8225 toast pages and 127 toast
>> index pages, and rebuild the geometries contained therein.
>
> I don't buy this analysis at all.  The toasted columns are not those in
> the index (because we don't support out-of-line-toasted index entries),
> so a WHERE clause that only touches indexed columns isn't going to need
> to fetch anything from the toast table.  The only stuff it would fetch
> is in rows that passed the WHERE and need to be returned to the client
> --- and those costs are going to be the same either way.
>
> I'm not entirely sure where the time is going, but I do not think you
> have proven your theory about it.  I'd suggest building the backend
> with -pg and getting some gprof evidence.

The column is a PostGIS column, and the index was created using GiST.
Those are lossy indexes that do not store the whole geometry, but only
the bounding box corners of the geometry (2 points).

Without the index, the && operator (which tests for bbox overlap) has
to load the whole geometry from disk and extract the bbox from it (it
cannot make use of a partial fetch).

Some quick statistics:

logigis=# select max(mem_size(geom)), avg(mem_size(geom))::int, max(npoints(geom)) from adminbndy1;
   max    |   avg   |  max
----------+---------+--------
 20998856 | 1384127 | 873657
(1 row)

So the geometries average about 1.3 MB in size, with a maximum of about
20 MB. I'm pretty sure values of that size cannot be stored without
TOASTing, as they are far above the roughly 2 kB per-tuple threshold at
which PostgreSQL starts compressing and moving values out of line.

Additionally, my suggested workaround using a separate bbox column
really works:

logigis=# alter table adminbndy1 ADD column bbox geometry;
ALTER TABLE
logigis=# update adminbndy1 set bbox = setsrid(box3d(geom)::geometry, 4326);
UPDATE 83
logigis=# explain analyze SELECT geom FROM adminbndy1 WHERE bbox && setsrid('BOX3D(9.4835390946502 47.39365740740741,9.5164609053498 47.40634259259259)'::box3d,4326);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on adminbndy1  (cost=100000000.00..100000022.50 rows=1 width=32) (actual time=0.554..0.885 rows=5 loops=1)
   Filter: (bbox && 'SRID=4326;BOX3D(9.4835390946502 47.3936574074074 0,9.5164609053498 47.4063425925926 0)'::geometry)
 Total runtime: 0.960 ms
(3 rows)

Here, the sequential scan matches exactly the same 5 rows, but needs
only about 1/8000th of the time, because it does not have to touch the
TOAST pages at all. (The huge cost offset in the plan merely comes from
enable_seqscan still being off.)

logigis=# \o /dev/null
logigis=# \timing
Timing is on.
logigis=# SELECT geom FROM adminbndy1 WHERE geom && setsrid('BOX3D(9.4835390946502 47.39365740740741,9.5164609053498 47.40634259259259)'::box3d,4326);
Time: 11224.185 ms
logigis=# SELECT geom FROM adminbndy1 WHERE bbox && setsrid('BOX3D(9.4835390946502 47.39365740740741,9.5164609053498 47.40634259259259)'::box3d,4326);
Time: 7689.720 ms

So when the 5 matching rows actually are detoasted and their geometries
converted to WKT (their canonical text representation) for output, the
ratio gets better, but there is still a noticeable difference.

Markus
--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: Bad query optimizer misestimation because of TOAST

From: Tom Lane

Markus Schaber <schabios@logi-track.com> writes:
> Tom Lane schrieb:
>> I don't buy this analysis at all.  The toasted columns are not those in
>> the index (because we don't support out-of-line-toasted index entries),
>> so a WHERE clause that only touches indexed columns isn't going to need
>> to fetch anything from the toast table.

> The column is a PostGIS column, and the index was created using GiST.
> Those are lossy indexes that do not store the whole geometry, but only
> the bounding box corners of the geometry (2 points).
> Without the index, the && operator (which tests for bbox overlap) has
> to load the whole geometry from disk and extract the bbox from it (it
> cannot make use of a partial fetch).

Ah, I see; I forgot to consider the GIST "storage" option, which allows
the index contents to be something different from the represented column.
Hmm ...

What I would be inclined to do is to extend ANALYZE to make an estimate
of the extent of toasting of every toastable column, and then modify
cost_qual_eval to charge a nonzero cost for evaluation of Vars that are
potentially toasted.

This implies an initdb-forcing change in pg_statistic, which might or
might not be allowed for 8.1 ... we are still a bit up in the air on
what our release policy will be for 8.1.

My first thought about what stat ANALYZE ought to collect is "average
number of out-of-line TOAST chunks per value".  Armed with that number
and size information about the TOAST table, it'd be relatively simple
for costsize.c to estimate the average cost of fetching such values.
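
For a table with a single toastable column, that statistic can even be
approximated by hand from the TOAST table's chunk_id column (a sketch,
using the toast relation named in the VACUUM output above):

  SELECT count(*)::float8 / count(DISTINCT chunk_id) AS avg_chunks_per_value
    FROM pg_toast.pg_toast_19369;

Multiplying that by the number of values expected to be detoasted, at
some fetch cost per chunk, gives roughly the charge being discussed.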

I'm not sure if it's worth trying to model the cost of decompression of
compressed values.  Surely that's a lot cheaper than fetching
out-of-line values, so maybe we can just ignore it.  If we did want to
model it then we'd also need to make ANALYZE note the fraction of values
that require decompression, and maybe something about their sizes.

This approach would overcharge for operations that are able to work with
partially fetched values, but it's probably not reasonable to expect the
planner to account for that with any accuracy.

Given this we'd have a pretty accurate computation of the true cost of
the seqscan alternative, but what of indexscans?  The current
implementation charges one evaluation of the index qual(s) per
indexscan, which is not really right because actually the index
component is never evaluated at all.  This didn't matter when the index
component was a Var with zero eval cost, but if we're charging some eval
cost it might.  But ... since it's charging only one eval per scan
... the error is probably down in the noise in practice, and it may not
be worth trying to get it exactly right.

A bigger concern is "what about lossy indexes"?  We currently ignore the
costs of rechecking qual expressions for fetched rows, but this might be
too inaccurate for situations like yours.  I'm hesitant to mess with it
though.  For one thing, to get it right we'd need to understand how many
rows will be returned by the raw index search (which is the number of
times we'd need to recheck).  At the moment the only info we have is the
number that will pass the recheck, which could be a lot less ... and of
course, even that is probably a really crude estimate when we are
dealing with this sort of operator.

Seems like a bit of a can of worms ...

            regards, tom lane

Re: Bad query optimizer misestimation because of TOAST

From: Markus Schaber

Hi, Tom,

Tom Lane schrieb:

> What I would be inclined to do is to extend ANALYZE to make an estimate
> of the extent of toasting of every toastable column, and then modify
> cost_qual_eval to charge a nonzero cost for evaluation of Vars that are
> potentially toasted.

I currently do not have any knowledge of the query planner internals,
but that sounds good to me :-)

My (simpler) alternative would have been to simply add the number of
toast pages to the table size when estimating sequential scan costs.
This would clearly help in my case, but I now realize that it would give
rather bad misestimations when the TOASTed columns are never touched.
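
With this thread's numbers, the naive approach would at least have been
in the right ballpark for this query (a back-of-the-envelope sketch,
assuming a unit cost per sequentially read page):

  -- main pages + toast pages + toast index pages, at ~1 cost unit each:
  SELECT (3 + 8225 + 127) * 1.0 AS naive_seqscan_cost;
  -- ~8355, versus the 4.04 the planner charged above

For queries that never touch the toasted column, though, that same
8355-page charge would be pure overestimation.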

> This implies an initdb-forcing change in pg_statistic, which might or
> might not be allowed for 8.1 ... we are still a bit up in the air on
> what our release policy will be for 8.1.

Is it possible to add columns to a system catalog in an existing
database, at least while the database is offline (no postmaster running
on it)?

You could make the query optimizer work with both the old and the new
statistics schema (at least during the 8.x series). Users could then
upgrade as usual (without dump/restore, and without benefiting from
this change), and later change the schema manually to get the benefit,
maybe using some offline tool or a special command. Of course, this
should be clearly documented, and ANALYZE could emit a warning about
the missing columns.

The most convenient method might be to have ANALYZE add those columns
automatically, but I'm somewhat reluctant to accept such unexpected
side effects (metadata schema changes).

> My first thought about what stat ANALYZE ought to collect is "average
> number of out-of-line TOAST chunks per value".  Armed with that number
> and size information about the TOAST table, it'd be relatively simple
> for costsize.c to estimate the average cost of fetching such values.

This sounds good.

> I'm not sure if it's worth trying to model the cost of decompression of
> compressed values.  Surely that's a lot cheaper than fetching
> out-of-line values, so maybe we can just ignore it.  If we did want to
> model it then we'd also need to make ANALYZE note the fraction of values
> that require decompression, and maybe something about their sizes.

Well, the first step is to collect those statistics (they may be of
interest to administrators and developers, too), and since we are
already changing the metadata schema anyway, I would vote for adding
those columns even if the query optimizer does not exploit them yet.

> This approach would overcharge for operations that are able to work with
> partially fetched values, but it's probably not reasonable to expect the
> planner to account for that with any accuracy.

I think it is impossible to give accurate statistics for this. We
could add a hint to "CREATE OPERATOR" telling the query planner whether
the operator can make use of partial fetches, but even that could never
be really accurate, as the amount of fetched data may vary wildly
depending on the value itself.
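
Such a hint might look roughly like this (purely hypothetical syntax:
no PARTIALFETCH option exists, and the function name is only
illustrative):

  CREATE OPERATOR && (
      LEFTARG   = geometry,
      RIGHTARG  = geometry,
      PROCEDURE = geometry_overlaps
      -- , PARTIALFETCH = true  -- hypothetical: a value prefix suffices
  );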

> A bigger concern is "what about lossy indexes"?  We currently ignore the
> costs of rechecking qual expressions for fetched rows, but this might be
> too inaccurate for situations like yours. I'm hesitant to mess with it
> though.  For one thing, to get it right we'd need to understand how many
> rows will be returned by the raw index search (which is the number of
> times we'd need to recheck).  At the moment the only info we have is the
> number that will pass the recheck, which could be a lot less ... and of
> course, even that is probably a really crude estimate when we are
> dealing with this sort of operator.

I do not know whether PostGIS actually rechecks against the real
geometry. If the application only needs the bbox check (&& operator),
the lossy index already contains all the needed information. If a real
intersection is needed, PostGIS users usually write "(column &&
bbox_of_reference) AND intersects(column, reference)". This uses the
bbox-based index for cheap candidate selection, and then the rather
expensive geometric intersection algorithm for the final decision.
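
Spelled out with the bbox from this thread, that idiom looks like this
(a sketch; intersects() is the GEOS-based PostGIS predicate):

  SELECT geom
    FROM adminbndy1
   WHERE geom && setsrid('BOX3D(9.4835390946502 47.39365740740741,9.5164609053498 47.40634259259259)'::box3d, 4326)
     AND intersects(geom, setsrid('BOX3D(9.4835390946502 47.39365740740741,9.5164609053498 47.40634259259259)'::box3d, 4326));

The && clause lets the planner use the GiST index; intersects() then
runs only on the few candidate rows that pass the bbox test.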

But maybe there will be a real intersection operator in the future
that uses the bbox index for its first stage.

> Seems like a bit of a can of worms ...

Sorry :-)

I did not really expect the problem to be this complicated when I
posted it; I merely expected a 10-line patch to the cost estimator that
I could backport and apply to my 7.4.6. It seems my personal
problem-size estimator has some serious bugs, too...

Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: [postgis-users] Bad query optimizer misestimation because of TOAST tables

From: Markus Schaber

Hi, all,

Markus Schaber schrieb:

> As a small workaround, I could imagine adding a small additional column
> to the table that contains only the geometry's bbox, and running the &&
> operator against that. This should avoid touching the TOAST table for
> the skipped rows.

For your personal amusement: I just noticed that, after adding the
additional column containing the bbox and running VACUUM FULL, the
table now has a size of 4 pages, and that is already enough for the
query optimizer to choose an index scan. At least that saves us from
modifying and redeploying a bunch of applications to use the && bbox
query :-)
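
(The new page count can be checked directly with

  SELECT relpages FROM pg_class WHERE relname = 'adminbndy1';

which, as before, is only current as of the last VACUUM or ANALYZE.)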

Markus
--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: Bad query optimizer misestimation because of TOAST

From: Markus Schaber

Hi, Tom,

Tom Lane schrieb:
> Markus Schaber <schabios@logi-track.com> writes:
>> [Query optimizer misestimation using lossy GIST on TOASTed columns]
>
> What I would be inclined to do is to extend ANALYZE to make an estimate
> of the extent of toasting of every toastable column, and then modify
> cost_qual_eval to charge a nonzero cost for evaluation of Vars that are
> potentially toasted.

What to do now? Fixing this issue properly seems to be a rather
long-term job.

Is it enough to document workarounds (such as the extra bbox column for
PostGIS), provided such workarounds exist for other GiST users as well?

Is there a bug tracking system where we could file this problem, so it
does not get lost?

Markus
--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com
