Re: Potential Join Performance Issue - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Potential Join Performance Issue
Date
Msg-id 200901080142.n081gkk19548@momjian.us
Whole thread Raw
In response to Re: Potential Join Performance Issue  ("Lawrence, Ramon" <ramon.lawrence@ubc.ca>)
List pgsql-hackers
Has this been completed?  TODO item?

---------------------------------------------------------------------------

Lawrence, Ramon wrote:
> > From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> > I was intending to do it the other way, actually.  An extra field in
> > HashPath hardly costs anything.  The other reason for it is that there
> > are other possible uses for knowing whether a hash will be
> multi-batch.
> > (For example, if we were prepared to tell the executor that it *must*
> > keep the hash to one batch, we could assume that the sort order of the
> > left input is preserved.  I haven't looked into the risks/benefits of
> > that too much, but it's been in the back of the mind for a long time.)
> 
> Having the number of batches in HashPath could be potentially useful for
> a variety of reasons.  For our research, we have added an nbatch
> variable in both HashPath and HashJoin.  Having it in HashJoin is useful
> as we modified EXPLAIN to output the number of batches.  There are costs
> in putting an nbatch variable in HashPath as the system may set this
> variable potentially hundreds/thousands of times during costing and does
> not (currently) use it until you convert the chosen HashPath to a plan.
> 
> > I'd be more inclined to deal with the issue by trying to establish a
> > "safety margin" in the estimate of whether the hash will go
> multi-batch.
> > IOW we should disuse_physical_tlist if the hash is estimated to be
> close
> > to but still within one batch.
> 
> Our experiments with large TPC-H 1GB joins show that it is almost always
> better to not use physical_tlists if the number of batches is > 1.
> There is a noticeable (approximately 5-15%) improvement when using
> physical_tlists for in-memory joins.  For batches of size 2, it
> sometimes can go either way depending how many attributes are projected
> out of the outer relation.  Using physical_tlists may be better even for
> batches of size 2 if most of the attributes of the outer relation are
> kept.  For a larger number of batches, the extra I/O cost significantly
> dominates over the physical_tlist optimization.  Performance of
> multi-batch joins may improve 50% or more by disabling the optimization.
> 
> It is possible to create a "safety margin" by having
> ExecChooseHashTableSize() return the value
> inner_rel_bytes/hash_table_bytes which represents the fraction of the
> memory available that the inner relation is expected to consume.  You
> can then make decisions based on that.   However, this is only as good
> as the inner relation size estimate and especially for large queries,
> the estimate may be quite inaccurate.  A more robust solution could
> examine the "width" of the path and the "width" of the relation combined
> with the number of batches to see if projecting early would be worth it.
> It may be best to keep it simple and just use number of batches > 1 as a
> criteria and instead focus on examining issues with inaccurate join size
> estimates.  
> 
> --
> Dr. Ramon Lawrence
> Assistant Professor, Department of Computer Science, University of
> British Columbia Okanagan
> E-mail: ramon.lawrence@ubc.ca
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


pgsql-hackers by date:

Previous
From: KaiGai Kohei
Date:
Subject: Re: Updates of SE-PostgreSQL 8.4devel patches (r1389)
Next
From: Bruce Momjian
Date:
Subject: Re: text search patch status update?