Thread: pgsql: Add support for multivariate MCV lists

pgsql: Add support for multivariate MCV lists

From
Tomas Vondra
Date:
Add support for multivariate MCV lists

Introduce a third extended statistic type, supported by the CREATE
STATISTICS command - MCV lists, a generalization of the statistic
already built and used for individual columns.

Compared to the already supported types (n-distinct coefficients and
functional dependencies), MCV lists are more complex, include column
values and allow estimation of much wider range of common clauses
(equality and inequality conditions, IS NULL, IS NOT NULL etc.).
Similarly to the other types, a new pseudo-type (pg_mcv_list) is used.

Author: Tomas Vondra
Reviewed-by: Dean Rasheed, David Rowley, Mark Dilger, Alvaro Herrera
Discussion: https://postgr.es/m/dfdac334-9cf2-2597-fb27-f0fb3753f435@2ndquadrant.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/7300a699502fe5432b05fbc75baca534b080bebb

Modified Files
--------------
doc/src/sgml/catalogs.sgml                       |   13 +-
doc/src/sgml/func.sgml                           |   82 ++
doc/src/sgml/perform.sgml                        |   66 +
doc/src/sgml/planstats.sgml                      |  116 +-
doc/src/sgml/ref/create_statistics.sgml          |   35 +-
src/backend/commands/statscmds.c                 |   82 +-
src/backend/nodes/bitmapset.c                    |   44 +
src/backend/optimizer/path/clausesel.c           |  109 +-
src/backend/optimizer/util/plancat.c             |   12 +
src/backend/parser/parse_utilcmd.c               |    2 +
src/backend/statistics/Makefile                  |    2 +-
src/backend/statistics/README                    |    4 +
src/backend/statistics/README.mcv                |  100 ++
src/backend/statistics/dependencies.c            |  100 +-
src/backend/statistics/extended_stats.c          |  627 +++++++-
src/backend/statistics/mcv.c                     | 1695 ++++++++++++++++++++++
src/backend/utils/adt/ruleutils.c                |   24 +-
src/bin/psql/describe.c                          |    9 +-
src/bin/psql/tab-complete.c                      |    2 +-
src/include/catalog/pg_cast.dat                  |    6 +
src/include/catalog/pg_proc.dat                  |   24 +
src/include/catalog/pg_statistic_ext.h           |    2 +
src/include/catalog/pg_type.dat                  |    7 +
src/include/nodes/bitmapset.h                    |    1 +
src/include/optimizer/optimizer.h                |    6 +
src/include/statistics/extended_stats_internal.h |   40 +
src/include/statistics/statistics.h              |   40 +
src/test/regress/expected/create_table_like.out  |    2 +-
src/test/regress/expected/opr_sanity.out         |    3 +-
src/test/regress/expected/stats_ext.out          |  287 +++-
src/test/regress/expected/type_sanity.out        |    3 +-
src/test/regress/sql/stats_ext.sql               |  186 +++
32 files changed, 3597 insertions(+), 134 deletions(-)


Re: pgsql: Add support for multivariate MCV lists

From
Andres Freund
Date:
On 2019-03-27 19:02:01 +0000, Tomas Vondra wrote:
> Add support for multivariate MCV lists
> 
> Introduce a third extended statistic type, supported by the CREATE
> STATISTICS command - MCV lists, a generalization of the statistic
> already built and used for individual columns.
> 
> Compared to the already supported types (n-distinct coefficients and
> functional dependencies), MCV lists are more complex, include column
> values and allow estimation of much wider range of common clauses
> (equality and inequality conditions, IS NULL, IS NOT NULL etc.).
> Similarly to the other types, a new pseudo-type (pg_mcv_list) is used.
> 
> Author: Tomas Vondra
> Reviewed-by: Dean Rasheed, David Rowley, Mark Dilger, Alvaro Herrera
> Discussion: https://postgr.es/m/dfdac334-9cf2-2597-fb27-f0fb3753f435@2ndquadrant.com

Congrats, this was long in the making!

- Andres



Re: pgsql: Add support for multivariate MCV lists

From
Peter Geoghegan
Date:
On Wed, Mar 27, 2019 at 12:04 PM Andres Freund <andres@anarazel.de> wrote:
> Congrats, this was long in the making!

+1

Buildfarm member aye-aye has a problem with this patch, though:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=aye-aye&dt=2019-03-27%2019%3A07%3A53

-- 
Peter Geoghegan



Re: pgsql: Add support for multivariate MCV lists

From
Tomas Vondra
Date:
On Wed, Mar 27, 2019 at 12:32:34PM -0700, Peter Geoghegan wrote:
>On Wed, Mar 27, 2019 at 12:04 PM Andres Freund <andres@anarazel.de> wrote:
>> Congrats, this was long in the making!
>
>+1
>

Thanks! I guess it's a Gordie Howe of our patches.

>Buildfarm member aye-aye has a problem with this patch, though:
>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=aye-aye&dt=2019-03-27%2019%3A07%3A53
>

Damn. And I see locust, prion and dory also seem to have an issue.
locust is also a power machine, I wonder if that's somehow related.

I'll take a look. 


cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Add support for multivariate MCV lists

From
Tomas Vondra
Date:
On Wed, Mar 27, 2019 at 09:13:46PM +0100, Tomas Vondra wrote:
>On Wed, Mar 27, 2019 at 12:32:34PM -0700, Peter Geoghegan wrote:
>>On Wed, Mar 27, 2019 at 12:04 PM Andres Freund <andres@anarazel.de> wrote:
>>>Congrats, this was long in the making!
>>
>>+1
>>
>
>Thanks! I guess it's a Gordie Howe of our patches.
>
>>Buildfarm member aye-aye has a problem with this patch, though:
>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=aye-aye&dt=2019-03-27%2019%3A07%3A53
>>
>
>Damn. And I see locust, prion and dory also seem to have an issue.
>locust is also a power machine, I wonder if that's somehow related.
>
>I'll take a look.
>

There seems to be a bug in statext_mcv_load - the deserialize happens
after releasing syscache. That's probably why prion is failing, because
it's running with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE. Not
sure about aye-aye.


cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Add support for multivariate MCV lists

From
Tomas Vondra
Date:
On Wed, Mar 27, 2019 at 09:33:53PM +0100, Tomas Vondra wrote:
>On Wed, Mar 27, 2019 at 09:13:46PM +0100, Tomas Vondra wrote:
>>On Wed, Mar 27, 2019 at 12:32:34PM -0700, Peter Geoghegan wrote:
>>>On Wed, Mar 27, 2019 at 12:04 PM Andres Freund <andres@anarazel.de> wrote:
>>>>Congrats, this was long in the making!
>>>
>>>+1
>>>
>>
>>Thanks! I guess it's a Gordie Howe of our patches.
>>
>>>Buildfarm member aye-aye has a problem with this patch, though:
>>>
>>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=aye-aye&dt=2019-03-27%2019%3A07%3A53
>>>
>>
>>Damn. And I see locust, prion and dory also seem to have an issue.
>>locust is also a power machine, I wonder if that's somehow related.
>>
>>I'll take a look.
>>
>
>There seems to be a bug in statext_mcv_load - the deserialize happens
>after releasing syscache. That's probably why prion is failing, because
>it's running with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE. Not
>sure about aye-aye.
>

FWIW I can reproduce the issue by simply using the same CFLAGS, and it
turns out the MCV deserialization is a bit borked (the value may easily
reference memory that already went away). I have a patch that resolves
that for me, and I assume it'll address the other failures related to
this commit too - for example the segfault on gharial).

It's a bit too late for pushing emergency fixes over here, so I'll do
more testing tomorrow and then push.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pgsql: Add support for multivariate MCV lists

From
David Fetter
Date:
On Wed, Mar 27, 2019 at 07:02:01PM +0000, Tomas Vondra wrote:
> Add support for multivariate MCV lists

Kudos!

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: pgsql: Add support for multivariate MCV lists

From
Andres Freund
Date:
Hi,

On 2019-03-27 21:13:46 +0100, Tomas Vondra wrote:
> On Wed, Mar 27, 2019 at 12:32:34PM -0700, Peter Geoghegan wrote:
> > On Wed, Mar 27, 2019 at 12:04 PM Andres Freund <andres@anarazel.de> wrote:
> > > Congrats, this was long in the making!
> > 
> > +1
> > 
> 
> Thanks! I guess it's a Gordie Howe of our patches.

Btw, do you have plans to work on using more of this information during
join planning? Or at least an estimate of how much work that'd be?

Greetings,

Andres Freund



Re: pgsql: Add support for multivariate MCV lists

From
Tomas Vondra
Date:
On Wed, Mar 27, 2019 at 08:15:35PM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-03-27 21:13:46 +0100, Tomas Vondra wrote:
>> On Wed, Mar 27, 2019 at 12:32:34PM -0700, Peter Geoghegan wrote:
>> > On Wed, Mar 27, 2019 at 12:04 PM Andres Freund <andres@anarazel.de> wrote:
>> > > Congrats, this was long in the making!
>> >
>> > +1
>> >
>>
>> Thanks! I guess it's a Gordie Howe of our patches.
>
>Btw, do you have plans to work on using more of this information during
>join planning? Or at least an estimate of how much work that'd be?
>

Yes, I'd like to use more of this info in join estimation. I don't have
an exact plan how to do that - so if anyone has ideas regarding this,
I'd love to hear them.

An obvious option is to add statistics covering multiple tables (the
syntax is already designed with that in mind). But I have a feeling there
are things we can do without that, e.g. using multivariate MCV lists in
eqjoinsel roughly the same way we use per-column MCV lists, etc.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pgsql: Add support for multivariate MCV lists

From
Peter Geoghegan
Date:
On Wed, Mar 27, 2019 at 6:27 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> It's a bit too late for pushing emergency fixes over here, so I'll do
> more testing tomorrow and then push.

The buildfarm is still almost all-red now. Can you estimate how long
it will take to push a fix?

-- 
Peter Geoghegan



Re: pgsql: Add support for multivariate MCV lists

From
Tomas Vondra
Date:
On Thu, Mar 28, 2019 at 11:29:12AM -0700, Peter Geoghegan wrote:
>On Wed, Mar 27, 2019 at 6:27 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> It's a bit too late for pushing emergency fixes over here, so I'll do
>> more testing tomorrow and then push.
>
>The buildfarm is still almost all-red now. Can you estimate how long
>it will take to push a fix?
>

Half an hour, at most. I have a fix and I'm running tests on it to make
sure it does break something else.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Add support for multivariate MCV lists

From
Tomas Vondra
Date:
On Thu, Mar 28, 2019 at 07:33:36PM +0100, Tomas Vondra wrote:
>On Thu, Mar 28, 2019 at 11:29:12AM -0700, Peter Geoghegan wrote:
>>On Wed, Mar 27, 2019 at 6:27 PM Tomas Vondra
>><tomas.vondra@2ndquadrant.com> wrote:
>>>It's a bit too late for pushing emergency fixes over here, so I'll do
>>>more testing tomorrow and then push.
>>
>>The buildfarm is still almost all-red now. Can you estimate how long
>>it will take to push a fix?
>>
>
>Half an hour, at most. I have a fix and I'm running tests on it to make
>sure it does break something else.
>

OK, I've pushed the fix. As explained in the commit message, the
deserialization was borked in two ways. Firstly, it was vulnerable to
use-after-free. Secondly, the serialization/deserialization of data for
by-value types did not work for bigendian systems.

I believe this should fix prion (which was tripping on the first issue,
due to using -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE) and at
least some of the bigendian boxes (I've tested it on s390x).


I do think there's one remaining issue - the deserialized value is
allocated as a single chunk, and is then "sliced" into smaller buffers.
But the code ignores alignment, which I think may trigger SIGBUS on some
platforms - for example grison, skate or gull fail like this, and those
are ARMv7 and sparc machines.

I do have a fix for that too, but I decided not to push it yet before
testing it a bit more.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Add support for multivariate MCV lists

From
Tomas Vondra
Date:
On Thu, Mar 28, 2019 at 08:37:11PM +0100, Tomas Vondra wrote:
>On Thu, Mar 28, 2019 at 07:33:36PM +0100, Tomas Vondra wrote:
>>On Thu, Mar 28, 2019 at 11:29:12AM -0700, Peter Geoghegan wrote:
>>>On Wed, Mar 27, 2019 at 6:27 PM Tomas Vondra
>>><tomas.vondra@2ndquadrant.com> wrote:
>>>>It's a bit too late for pushing emergency fixes over here, so I'll do
>>>>more testing tomorrow and then push.
>>>
>>>The buildfarm is still almost all-red now. Can you estimate how long
>>>it will take to push a fix?
>>>
>>
>>Half an hour, at most. I have a fix and I'm running tests on it to make
>>sure it does break something else.
>>
>
>OK, I've pushed the fix. As explained in the commit message, the
>deserialization was borked in two ways. Firstly, it was vulnerable to
>use-after-free. Secondly, the serialization/deserialization of data for
>by-value types did not work for bigendian systems.
>
>I believe this should fix prion (which was tripping on the first issue,
>due to using -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE) and at
>least some of the bigendian boxes (I've tested it on s390x).
>

OK, this apparently worked fine for all x86 and s390 machines.

>
>I do think there's one remaining issue - the deserialized value is
>allocated as a single chunk, and is then "sliced" into smaller buffers.
>But the code ignores alignment, which I think may trigger SIGBUS on some
>platforms - for example grison, skate or gull fail like this, and those
>are ARMv7 and sparc machines.
>
>I do have a fix for that too, but I decided not to push it yet before
>testing it a bit more.
>

I've pushed a fix for this. The short version is that the serialized
representation was not respecting memory alignment requirements, which was
causing issues in machines sensitive to this (ia64, sparc, hppa). It's a
blind attempt, as I currently don't have access to any such machine.

FWIW I've pushed this now so that the machines can test the other stuff,
like jsonpath. I think the serialization/deserialization would deserve a
second look, and I'm wondering if it should work more like ArrayType (i.e.
relying on att_align_* instead of explicit MAXALIGN).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Add support for multivariate MCV lists

From
Peter Geoghegan
Date:
On Fri, Mar 29, 2019 at 11:20 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I've pushed a fix for this. The short version is that the serialized
> representation was not respecting memory alignment requirements, which was
> causing issues in machines sensitive to this (ia64, sparc, hppa). It's a
> blind attempt, as I currently don't have access to any such machine.

Looks like gharial still has problems:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-03-29%2018%3A30%3A47

-- 
Peter Geoghegan



Re: pgsql: Add support for multivariate MCV lists

From
Tomas Vondra
Date:
On Fri, Mar 29, 2019 at 12:06:26PM -0700, Peter Geoghegan wrote:
>On Fri, Mar 29, 2019 at 11:20 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> I've pushed a fix for this. The short version is that the serialized
>> representation was not respecting memory alignment requirements, which was
>> causing issues in machines sensitive to this (ia64, sparc, hppa). It's a
>> blind attempt, as I currently don't have access to any such machine.
>
>Looks like gharial still has problems:
>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-03-29%2018%3A30%3A47
>

Yeah, I saw that :-(

Unfortunately fixing this without access to any ia64/sparc/hppa machine
will be hard. The only thing I can do is wait for gaur/snapper to report
the failure with a backtrace.


cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: pgsql: Add support for multivariate MCV lists

From
Tomas Vondra
Date:
On Fri, Mar 29, 2019 at 08:25:10PM +0100, Tomas Vondra wrote:
>On Fri, Mar 29, 2019 at 12:06:26PM -0700, Peter Geoghegan wrote:
>>On Fri, Mar 29, 2019 at 11:20 AM Tomas Vondra
>><tomas.vondra@2ndquadrant.com> wrote:
>>>I've pushed a fix for this. The short version is that the serialized
>>>representation was not respecting memory alignment requirements, which was
>>>causing issues in machines sensitive to this (ia64, sparc, hppa). It's a
>>>blind attempt, as I currently don't have access to any such machine.
>>
>>Looks like gharial still has problems:
>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-03-29%2018%3A30%3A47
>>
>
>Yeah, I saw that :-(
>
>Unfortunately fixing this without access to any ia64/sparc/hppa machine
>will be hard. The only thing I can do is wait for gaur/snapper to report
>the failure with a backtrace.
>

OK, so the second memory-alignment fix I pushed on Saturday seems to
have worked. There's one aspect of it that I don't like though - the
MCV is first serialized into a buffer, which is then copied into the
varlena buffer. Deserialization works in reverse, i.e. it copies the
data from the varlena, then accessed the copy.

For the emergency fix that seemed acceptable, as it somewhat reduced the
amount of modified code, but I don't want to leave it like that because
of efficiency and memory consumption. So barring objections I'll push
the attached patch, which ensures proper memory alignment within the raw
varlena buffer. Thus the copies are not needed.

(The second part of the patch merely adds some alignment-enforcing
asserts - I don't plan to commit that, but if someone plans decides to
test this, it might be useful.)


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment