Re: Treating work_mem as a shared resource (Was: Parallel Hash take II) - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)
Date:
Msg-id: CA+TgmoZaYvskB7PFkWzq-M6nxBqzyHUrxBPas2tcC75a9TfDtg@mail.gmail.com
In response to: Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)  (Thomas Munro <thomas.munro@enterprisedb.com>)
List: pgsql-hackers
On Tue, Nov 21, 2017 at 5:38 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> That having been said, I think the place where our plans most commonly
>> go wrong is where we incorrectly estimate the number of tuples by
>> multiple orders of magnitude - 100x is common, 1000x is common, a
>> million x is not uncommon, even a billion x is not unheard-of.  And I
>> don't think there's any way to make a hash join happy if it thinks
>> it's going to need 1 batch and it ends up needing a million batches.
>
> What about dynamic role reversal? That could make a big difference.

In the best case it's great, but it looks to me like there are a lot
of thorny problems.  For example, imagine giant_table INNER JOIN
bigger_than_we_thought.  The latter table will be chosen as the inner
table and that won't work out very well, but there's no way to know
whether switching the sides will be any better except to try reading a
bunch of rows from giant_table and seeing whether it turns out to be a
lot smaller than we thought.  To do that, we'll need to dump the hash
table we started to build on the original inner side out to disk so
that we can free up enough work_mem to try building a hash table on
the other side.  When the giant table turns out to actually be giant,
we'll need to go back to the original plan, which means dumping out
the tuples from the second hash table and reloading the tuples from
the first one.  So we end up just doing a bunch of extra work for
nothing.  I think that this scenario - wasting effort trying to switch
the sides only to give up - will happen frequently.
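
To put rough numbers on that wasted effort, here's a throwaway model of
the "gamble and lose" path -- not executor code; the budget, helper name,
and row counts are all invented for illustration, and work_mem is treated
as a flat tuple budget:

/* Toy model of trying to swap the hash join sides and losing. */
#include <stdio.h>

#define WORK_MEM_TUPLES 100000L     /* pretend budget, in tuples */

static long
tuples_wasted_by_trying_swap(long outer_tuples, long inner_tuples)
{
    long    wasted = 0;

    if (inner_tuples <= WORK_MEM_TUPLES)
        return 0;                   /* inner fit after all: nothing to do */

    /* Inner overflowed: dump the partial hash table so we can try the outer. */
    wasted += WORK_MEM_TUPLES;

    if (outer_tuples <= WORK_MEM_TUPLES)
        return wasted;              /* outer fits: the swap pays for itself */

    /* Outer is giant too: throw away what we read of it and reload inner. */
    wasted += WORK_MEM_TUPLES;      /* outer tuples read for nothing */
    wasted += WORK_MEM_TUPLES;      /* inner tuples reloaded from disk */

    return wasted;
}

int
main(void)
{
    /* giant_table as outer, bigger_than_we_thought as inner */
    printf("tuples moved for nothing: %ld\n",
           tuples_wasted_by_trying_swap(50000000L, 2000000L));
    return 0;
}

When both sides turn out to be big, every tuple we spilled, peeked at, and
reloaded was pure overhead on top of the plan we end up executing anyway.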

In the multi-batch case, there seems to be a little more hope of doing
something clever.  We're writing most of both inputs out to tapes
anyway.  If we were willing to write ALL of both inputs out to tapes,
then we could decide - perhaps even separately for each batch - which
side to load into the hash table.  Of course, that adds a lot of
incremental I/O unless the number of batches is large (e.g. if we had
only 4 batches, writing 4/4 of the data instead of 3/4 is a 33%
increase, but if we had 64 batches, writing 64/64 of the data instead
of 63/64 doesn't matter a lot, probably).  And it leaves out a few
important details, like the fact that what fits in the hash table is
used to choose the number of batches in the first place, and that we
write the whole of one side to tapes before starting on the other
side.  I don't know how to handle those problems but it seems like it
might be possible to come up with something clever, at least for
certain cases.
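
For what it's worth, the 33% and 64-batch figures fall out of a one-liner:
with nbatches batches we normally write (nbatches - 1)/nbatches of the
data, so writing all of it is a relative increase of 1/(nbatches - 1).
A quick standalone check (plain C, nothing from the tree):

#include <stdio.h>

int
main(void)
{
    int     batch_counts[] = {2, 4, 16, 64, 1024};

    for (int i = 0; i < 5; i++)
    {
        int     nbatches = batch_counts[i];
        double  extra = 1.0 / (nbatches - 1);   /* extra write volume */

        printf("%5d batches: %5.1f%% more data written\n",
               nbatches, extra * 100.0);
    }
    return 0;
}

That prints 33.3% for 4 batches and about 1.6% for 64, so the extra I/O
only becomes negligible once the batch count is fairly large.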

> I agree that it would be enormously valuable if we could make
> estimates much better, so I think that I understand why you emphasize
> it. But, I don't think that there are any good ideas for improving
> join selectivity that don't involve expert DBA knowledge, or
> novel/risky techniques for feedback to the system about column
> redundancy/correlation, etc. These do not seem like scalable
> approaches, and so they don't particularly appeal to me as projects.
> I'd be happy to be shown to be wrong about this.

Yeah, I agree that it's a hard problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

