[HACKERS] Highly Variable Planning Times

From: Michael Malis
Hi,

I've been encountering highly variable planning time on PG 9.5.4.
Running `EXPLAIN SELECT * FROM events_1171738` will take anywhere from
200ms to 4s. This likely has to do with the table having 1300 partial
indexes on it (for reasons elaborated on in
https://blog.heapanalytics.com/running-10-million-postgresql-indexes-in-production/).
I'm trying to figure out if there is something I can do to prevent
the slow planning from happening.

When I used `strace -c` on the backend process, I found that the
number of `semop` system calls is much higher for the slower queries
than the faster ones. After digging in a bit more I found that when
the query was slow, there would be a single semaphore which would make
up most of the `semop` calls. Here is a snippet of the output when I
ran `strace -r` on a query with high planning time:

0.000225 open("base/16384/13236647", O_RDWR) = 219
0.000110 lseek(219, 0, SEEK_END)   = 16384
0.000122 semop(1736711, [{9, -1, 0}], 1) = 0
0.004110 semop(1769480, [{10, 1, 0}], 1) = 0
0.000197 semop(1736711, [{9, -1, 0}], 1) = 0
0.010096 semop(1540097, [{1, 1, 0}], 1) = 0
0.000176 semop(1736711, [{9, -1, 0}], 1) = 0
0.004069 semop(1638404, [{15, 1, 0}], 1) = 0
0.000244 open("base/16384/13236648", O_RDWR) = 220
0.000131 lseek(220, 0, SEEK_END)   = 16384
0.000212 semop(1736711, [{9, -1, 0}], 1) = 0
0.007851 semop(1736711, [{8, 1, 0}], 1) = 0
0.000130 semop(1736711, [{9, -1, 0}], 1) = 0
0.000101 semop(1769480, [{5, 1, 0}], 1) = 0
0.000450 open("base/16384/13236653", O_RDWR) = 221
0.000151 lseek(221, 0, SEEK_END)   = 180224
0.000141 semop(1736711, [{9, -1, 0}], 1) = 0
0.000239 semop(1736711, [{9, -1, 0}], 1) = 0
0.000419 semop(1736711, [{10, 1, 0}], 1) = 0
0.000127 semop(1736711, [{9, -1, 0}], 1) = 0
0.001493 semop(1703942, [{3, 1, 0}], 1) = 0

You can see that most of the semop calls are waits on semaphore #9 in
semaphore set 1736711. I looked up the relation for the files being
opened and they all seem to be indexes on the table. Based on some
flame graphs I generated, I believe most of the time is spent inside
of the function get_relation_info. I think the open/lseek calls are
coming from the estimate_rel_size call in get_relation_info, but I
haven't been able to figure out where the semop calls are coming from.
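(For reference, the opened files can be mapped back to relations with a query along these lines; the filenode is the last component of the path in the strace output:)

```sql
-- Map the file nodes strace showed being opened (e.g. base/16384/13236647)
-- back to relations; in this case they all turned out to be indexes.
SELECT relname, relkind
FROM pg_class
WHERE relfilenode IN (13236647, 13236648, 13236653);
```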

It seems that the semop calls are at least a symptom of the high
planning time. I'm looking for help in finding the root cause of the
high planning time and in figuring out a way to eliminate it.

Thanks,
Michael



Re: [HACKERS] Highly Variable Planning Times

From: Andres Freund
Hi,

On 2017-04-19 13:39:40 -0700, Michael Malis wrote:
> I've been encountering highly variable planning time on PG 9.5.4.
> Running `EXPLAIN SELECT * FROM events_1171738` will take anywhere from
> 200ms to 4s. This likely has to do with the table having 1300 partial
> indexes on it (for reasons elaborated on in
> https://blog.heapanalytics.com/running-10-million-postgresql-indexes-in-production/).
> I'm trying to figure out if there is something I can do to prevent
> the slow planning from happening.

I'd suspect that that's triggered by cache rebuilds.  If there's any
concurrent relcache invalidation (e.g. by a concurrent vacuum, create
index, alter table, relation extension, ...), a lot of metadata for all
those indexes will have to be rebuilt.

TBH, I don't think we're particularly likely to optimize hugely for
workloads with 1300 indexes on individual tables - such an effort seems
not unlikely to hurt more common cases.


> When I used `strace -c` on the backend process, I found that the
> number of `semop` system calls is much higher for the slower queries
> than the faster ones. After digging in a bit more I found that when
> the query was slow, there would be a single semaphore which would make
> up most of the `semop` calls. Here is a snippet of the output when I
> ran `strace -r` on a query with high planning time:

Could you also get a profile using perf record -g?  The strace isn't
telling us all that much.


Greetings,

Andres Freund



Re: [HACKERS] Highly Variable Planning Times

From: Alvaro Herrera
Michael Malis wrote:
> Hi,
> 
> I've been encountering highly variable planning time on PG 9.5.4.
> Running `EXPLAIN SELECT * FROM events_1171738` will take anywhere from
> 200ms to 4s. This likely has to do with the table having 1300 partial
> indexes on it (for reasons elaborated on in
> https://blog.heapanalytics.com/running-10-million-postgresql-indexes-in-production/).
> I'm trying to figure out if there is something I can do to prevent
> the slow planning from happening.

I wonder if some BRIN index could help this use case, reducing the
number of indexes.
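(In rough form, and with hypothetical table/column names, the suggestion is something like replacing many small indexes with a single BRIN index on a well-correlated column:)

```sql
-- Hypothetical sketch: one BRIN index on a column correlated with
-- physical row order (e.g. an ingestion timestamp), instead of many
-- small per-query partial indexes.
CREATE INDEX events_1171738_time_brin
    ON events_1171738 USING brin (time);
```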

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Highly Variable Planning Times

From: Tom Lane
Michael Malis <michaelmalis2@gmail.com> writes:
> I've been encountering highly variable planning time on PG 9.5.4.
> Running `EXPLAIN SELECT * FROM events_1171738` will take anywhere from
> 200ms to 4s. This likely has to do with the table having 1300 partial
> indexes on it (for reasons elaborated on in
> https://blog.heapanalytics.com/running-10-million-postgresql-indexes-in-production/).

TBH, I see no convincing explanation in that article of why 1300 partial
indexes are a good idea.  *At best*, you're doing substantial work in the
planner to avoid the first tree descent step or two in a single
non-partial index.  At worst, all your savings go out the window during a
relcache flush, which is likely to happen whenever a vacuum or analyze
gets done on the underlying table.
        regards, tom lane



Re: [HACKERS] Highly Variable Planning Times

From: Michael Malis
> Could you also get a profile using perf record -g?  The strace isn't
> telling us all that much.

I've attached the perf.data file from `perf record -g -p
<backend-pid>`. I'm not that familiar with perf, so if there is more
information you need, let me know. This was on Ubuntu 16.04.


Re: [HACKERS] Highly Variable Planning Times

From: Michael Malis
> TBH, I see no convincing explanation in that article of why 1300 partial
> indexes are a good idea.

I don't like it much either. I've been trying to eliminate most of the
need for the partial indexes, but this is the current state of things.

> *At best*, you're doing substantial work in the
> planner to avoid the first tree descent step or two in a single
> non-partial index.

While the specific example I gave in the post could be replaced with a
non-partial index, most of the partial indexes contain predicates that
are not straightforward to index with non-partial indexes. About 85%
of the partial indexes contain a regular expression in them for CSS
selector matching. I've tried using a trigram GIN index, but it wound
up not working too well due to the data being highly redundant (almost
every CSS hierarchy contains 'div' in it). Additionally, most of the
predicates for partial indexes are extremely specific making the
partial indexes very small. The sum total size of all of the partial
indexes is still much smaller than if we were to index every necessary
field with regular indexes.
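(To illustrate, the kind of partial index described above looks roughly like this; the column names and the regex are hypothetical:)

```sql
-- Hypothetical sketch of one of the ~1300 partial indexes: the WHERE
-- predicate uses a regex for CSS selector matching, so the index only
-- contains the very few rows matching that specific selector.
CREATE INDEX events_1171738_signup_click
    ON events_1171738 (time)
    WHERE type = 'click' AND hierarchy ~ 'div.*#signup_button';
```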

Thanks,
Michael



Re: [HACKERS] Highly Variable Planning Times

From: Jeff Janes
On Wed, Apr 19, 2017 at 2:39 PM, Michael Malis <michaelmalis2@gmail.com> wrote:
>> TBH, I see no convincing explanation in that article of why 1300 partial
>> indexes are a good idea.
>
> I don't like it much either. I've been trying to eliminate most of the
> need for the partial indexes, but this is the current state of things.
>
>> *At best*, you're doing substantial work in the
>> planner to avoid the first tree descent step or two in a single
>> non-partial index.
>
> While the specific example I gave in the post could be replaced with a
> non-partial index, most of the partial indexes contain predicates that
> are not straightforward to index with non-partial indexes. About 85%
> of the partial indexes contain a regular expression in them for CSS
> selector matching. I've tried using a trigram GIN index, but it wound
> up not working too well due to the data being highly redundant (almost
> every CSS hierarchy contains 'div' in it).

pg_trgm 1.2 or higher has code to help with very common trigrams.  But to try it, you would need to upgrade PostgreSQL to 9.6.
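(A sketch of the trigram approach being discussed, assuming a hypothetical `hierarchy` column; one trigram index can then serve regex queries that today each need their own partial index:)

```sql
-- Requires PostgreSQL 9.6+ to get pg_trgm 1.2, which handles very
-- common trigrams (like the ubiquitous 'div') much better.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX events_1171738_hierarchy_trgm
    ON events_1171738 USING gin (hierarchy gin_trgm_ops);
-- Regex matches can then use the single index:
--   SELECT * FROM events_1171738 WHERE hierarchy ~ 'div.*#signup_button';
```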

Cheers,

Jeff

Re: [HACKERS] Highly Variable Planning Times

From: Andres Freund
On 2017-04-19 14:14:54 -0700, Michael Malis wrote:
> > Could you also get a profile using perf record -g?  The strace isn't
> > telling us all that much.
> 
> I've attached the perf.data file from `perf record -g -p
> <backend-pid>`. I'm not that familiar with perf so if there is more
> information you need let me know. This was on Ubuntu 16.04

You'd have to do a `perf report > outfile.txt` too; without access to
your system, we can't analyze the perf.data.

https://wiki.postgresql.org/wiki/Profiling_with_perf

- Andres



Re: [HACKERS] Highly Variable Planning Times

From: Andres Freund
On 2017-04-19 14:14:54 -0700, Michael Malis wrote:
> > Could you also get a profile using perf record -g?  The strace isn't
> > telling us all that much.
> 
> I've attached the perf.data file from `perf record -g -p
> <backend-pid>`. I'm not that familiar with perf so if there is more
> information you need let me know. This was on Ubuntu 16.04

The profile seems to confirm that this is largely due to cache misses.

Greetings,

Andres Freund



Re: [HACKERS] Highly Variable Planning Times

From: Michael Malis
> The profile seems to confirm that this is largely due to cache misses.

Can you elaborate? Cache misses of what? Why is the planning time so variable?



Re: [HACKERS] Highly Variable Planning Times

From: Andres Freund
On 2017-04-19 16:43:07 -0700, Michael Malis wrote:
> > The profile seems to confirm that this is largely due to cache misses.
> 
> Can you elaborate? Cache misses of what? Why is the planning time so
> variable?

Postgres uses caches for a lot of metadata about tables, indexes, and
so on. Some actions (vacuum, DDL, table size changes, ...) trigger
cache invalidations in other backends so the caches stay coherent.
What you're likely seeing is that the cached case is quick enough, and
that the slow planning cycles happen when the cache has been
invalidated.
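(Concretely, one way to see this effect; the commands are a sketch, and `\timing` is psql's client-side timer:)

```sql
-- Session B: plan while caches are warm.
\timing on
EXPLAIN SELECT * FROM events_1171738;   -- fast: metadata cached

-- Session A: trigger a relcache invalidation.
VACUUM events_1171738;

-- Session B again: the first plan after the invalidation rebuilds the
-- metadata for all ~1300 indexes, hence the multi-second planning time.
EXPLAIN SELECT * FROM events_1171738;
```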

- Andres



Re: [HACKERS] Highly Variable Planning Times

From: Michael Malis
Thanks for the help Andres.



Re: [HACKERS] Highly Variable Planning Times

From: Greg Stark
On 19 April 2017 at 22:39, Michael Malis <michaelmalis2@gmail.com> wrote:

>> *At best*, you're doing substantial work in the
>> planner to avoid the first tree descent step or two in a single
>> non-partial index.

Fwiw, in addition to replacing the first few levels of the descent
with planning-time work, there's also an advantage due to the smaller
keys. Effectively these partial indexes are emulating
prefix-compression in the btree.

> While the specific example I gave in the post could be replaced with a
> non-partial index, most of the partial indexes contain predicates that
> are not straightforward to index with non-partial indexes. About 85%
> of the partial indexes contain a regular expression in them for CSS
> selector matching. I've tried using a trigram GIN index, but it wound
> up not working too well due to the data being highly redundant (almost
> every CSS hierarchy contains 'div' in it). Additionally, most of the
> predicates for partial indexes are extremely specific making the
> partial indexes very small. The sum total size of all of the partial
> indexes is still much smaller than if we were to index every necessary
> field with regular indexes.

I wonder if you could implement an FTS parser that tokenized HTML
into just the tokens representing the matching criteria. A GIN index
using such a parser would actually be very similar to what you have,
as GIN indexes are basically a collection of btrees...

The operational problem with that is that I think it would be even
harder to update a parser than to add a new partial index. I don't
think you can effectively upgrade a parser to include new tokens
without rebuilding any indexes using it. If you wanted to add new
selector criteria live, you would probably end up deploying the new
parser, building a new index with CREATE INDEX CONCURRENTLY using the
new parser, and then dropping the old index.
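(In rough form, with hypothetical index and text-search-configuration names, that swap would look like this; neither step blocks writes to the table:)

```sql
-- Build the replacement index using a configuration based on the new
-- parser, without blocking concurrent writes.
CREATE INDEX CONCURRENTLY events_css_fts_v2
    ON events_1171738
    USING gin (to_tsvector('css_selector_config_v2', hierarchy));

-- Once it is valid, retire the old index.
DROP INDEX CONCURRENTLY events_css_fts_v1;
```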

I'm not sure if it's possible to write an FTS parser that handles
arbitrary CSS selectors, but if you managed that, it would be a very
valuable addition to Postgres, IMHO.

-- 
greg