
From: Tomas Vondra
Subject: Re: Abbreviated keys for text cost model fix
Msg-id: 54EA47E5.8010606@2ndquadrant.com
In response to: Abbreviated keys for text cost model fix (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
Hi,

On 22.2.2015 02:59, Peter Geoghegan wrote:
> Over in the abbreviated keys for numeric thread, Tomas Vondra reports
> a case where the ad-hoc cost model of abbreviated keys/sortsupport for
> text is far too conservative, and aborts a case where abbreviated keys
> can still greatly help.
>
> Previously, I proposed additions to the cost model that dealt with all
> this, by passing down a reltuples hint from the optimizer or a
> relcache entry to tuplesort. Robert seemed to think that this was a
> little bit much, and that it could be discussed at a later point. It's
> clear that we need to do something about cases like the one reported
> by Tomas, though.
>
> Attached patch adds a decay to the threshold that (estimated)
> abbreviated key cardinality must cross as a proportion of the
> (estimated) cardinality of the original set of strings to be sorted,
> while also quadrupling the initial required proportional cardinality
> to 20% of full key cardinality (but for just the first 10,000 strings,
> before this threshold is decayed aggressively). This seems
> significantly less invasive than what I previously proposed, while
> still being effective. I suggest that this be treated as a bugfix,
> since Tomas reports a 5%-7% regression with his case. OTOH, with this
> patch, I saw the same case have a ~300% improvement in performance.
>
> The adjusted cost model strongly prefers to decide to abort early,
> which seems desirable. In one way this makes it a little more
> vulnerable to skewed sets, since early data matters more, but in
> another more important way it's less vulnerable, since perhaps that
> skew indicates a general skew in the data, a skew that makes the idea
> of measuring the cardinality of abbreviated keys as a proxy for full
> key cardinality less useful.
>
> Thoughts?
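
To make sure I'm reading the proposal right, here's a rough sketch of
the heuristic as I understand it. Only the 20% initial requirement and
the 10,000-string cutoff come from your description; the function name,
the 1,000-string minimum, and the exact decay schedule are made up for
illustration:

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative sketch only, not the actual patch: decide whether to
 * abandon abbreviated keys, given (HyperLogLog-style) cardinality
 * estimates for the abbreviated keys and for the full keys, after
 * nstrings strings have been abbreviated so far.
 */
static bool
abbrev_should_abort(double abbrev_card, double full_card, int nstrings)
{
    double      required_prop;

    /* hypothetical: too few strings seen to judge anything yet */
    if (nstrings < 1000)
        return false;

    if (nstrings <= 10000)
    {
        /*
         * For the first 10,000 strings, require abbreviated keys to
         * retain at least 20% of full key cardinality (the quadrupled
         * initial threshold).
         */
        required_prop = 0.20;
    }
    else
    {
        /*
         * Beyond 10,000 strings, decay the requirement aggressively:
         * here the initial 20% is divided by one more for each
         * doubling of the input (a made-up schedule, just to show
         * the shape).
         */
        required_prop = 0.20 / (1.0 + log2((double) nstrings / 10000.0));
    }

    return (abbrev_card / full_card) < required_prop;
}

int
main(void)
{
    /* 500 distinct abbreviated keys among 4,500 distinct full keys */
    printf("%d\n", abbrev_should_abort(500.0, 4500.0, 5000));    /* 1 = abort */
    printf("%d\n", abbrev_should_abort(4000.0, 45000.0, 50000)); /* 0 = keep  */
    return 0;
}

A shape like that also explains the "early data matters more" trade-off
you mention: before the decay kicks in, a skewed prefix can easily push
the estimate below the 20% bar.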

I've done some more testing, and this helps a lot. I did the tests on
two different machines (one with a Xeon E5450, one with an i5-2500k) to
see how the patches (this one and the datum/numeric ones) behave on
different CPUs. I also extended the tests a bit, to cover both random
data (as before) and data already sorted in ASC/DESC order. And of
course, I did that with different dataset sizes.

The aggregated results are available as a spreadsheet here:

   Xeon E5450: http://bit.ly/1AyRN7M
   i5-2500k:   http://bit.ly/1z9dLdP

You probably only want to look at the 'summary' sheet, which shows the
speedup vs. master. Detailed results, along with the benchmarking
scripts and logs, are attached.

In short, this fixes all the cases except for the ASC-sorted data. I
haven't done any code review, but I think we want this.

I'll use data from the i5-2500k, but the same applies to the Xeon,
except that the Xeon results are noisier and the speedups are not as
significant.

For the 'text' data type and the 'random' dataset, the results are these:

      scale    datum    cost-model
    -------------------------------
     100000     328%          323%
    1000000     392%          391%
    2000000      96%          565%
    3000000      97%          572%
    4000000      97%          571%
    5000000      98%          570%

The numbers are speedups vs. master, i.e. 100 * (master runtime /
patched runtime), so 100% means exactly the same speed and 200% means
twice as fast.

So while the 'datum' patch alone gave a very nice speedup for small
datasets (about 3-4x up to 1M rows), for larger datasets we saw a small
regression (~3% slower). With the cost model fix, we instead see a
significant speedup (about 5.7x) for those cases.

I haven't verified whether this produces exactly the same output as
master, but if it does, this is very nice.
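
FWIW, I'd expect the output to be identical by construction: the
abbreviated comparison is just a fast path, and whenever two abbreviated
keys compare equal, the sort falls back to the authoritative full
comparison. A toy illustration of that shape, with made-up names,
assuming plain C-locale byte ordering (the real code abbreviates
strxfrm() blobs):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* pack the first 8 bytes of a NUL-terminated string, big-endian */
static uint64_t
abbrev_key(const char *s)
{
    uint64_t    key = 0;

    for (int i = 0; i < 8 && s[i] != '\0'; i++)
        key |= (uint64_t) (unsigned char) s[i] << (56 - 8 * i);
    return key;
}

static int
compare_with_abbrev(const char *a, const char *b)
{
    uint64_t    ka = abbrev_key(a);
    uint64_t    kb = abbrev_key(b);

    if (ka != kb)
        return (ka < kb) ? -1 : 1;
    /* prefixes tie: fall back to the authoritative comparison */
    return strcmp(a, b);
}

int
main(void)
{
    /* < 0, decided by the strcmp fallback (shared 8-byte prefix) */
    printf("%d\n", compare_with_abbrev("abbreviate", "abbreviation"));
    /* < 0, decided by the abbreviated keys alone */
    printf("%d\n", compare_with_abbrev("abc", "abd"));
    return 0;
}

So the cardinality question is purely about performance, not
correctness.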

For the 'DESC' dataset (i.e. data sorted in reverse order), we get even
better numbers, with up to a 6.5x speedup on large datasets.

But for the 'ASC' dataset (i.e. already-sorted data), we get this:

      scale    datum    cost-model
    -------------------------------
     100000      85%           84%
    1000000      87%           87%
    2000000      76%           96%
    3000000      82%           90%
    4000000      91%           83%
    5000000      93%           81%

Ummm, not that great, I guess :-(


--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
