Re: Abbreviated keys for text cost model fix - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Abbreviated keys for text cost model fix
Date
Msg-id 54EB580C.2000904@2ndquadrant.com
Whole thread Raw
In response to Re: Abbreviated keys for text cost model fix  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Abbreviated keys for text cost model fix  (Peter Geoghegan <pg@heroku.com>)
Re: Abbreviated keys for text cost model fix  (Jeremy Harris <jgh@wizmail.org>)
List pgsql-hackers
Hi,

On 22.2.2015 22:30, Peter Geoghegan wrote:
> On Sun, Feb 22, 2015 at 1:19 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> In short, this fixes all the cases except for the ASC sorted data. I
>> haven't done any code review, but I think we want this.
>>
>> I'll use data from the i5-2500k, but it applies to the Xeon too, except
>> that the Xeon results are more noisy and the speedups are not that
>> significant.
>>
>> For the 'text' data type, and 'random' dataset, the results are these:
>>
>>       scale    datum    cost-model
>>     -------------------------------
>>      100000     328%          323%
>>     1000000     392%          391%
>>     2000000      96%          565%
>>     3000000      97%          572%
>>     4000000      97%          571%
>>     5000000      98%          570%
>>
>> The numbers are speedup vs. master, so 100% means exactly the same
>> speed, 200% means twice as fast.
>>
>> So while with 'datum' patch this actually caused very nice speedup for
>> small datasets - about 3-4x speedup up to 1M rows, for larger datasets
>> we've seen small regression (~3% slower). With the cost model fix, we
>> actually see a significant speedup (about 5.7x) for these cases.
>
> Cool.
>
>> I haven't verified whether this produces the same results, but if it
>> does this is very nice.
>>
>> For 'DESC' dataset (i.e. data sorted in reverse order), we do get even
>> better numbers, with up to 6.5x speedup on large datasets.
>>
>> But for 'ASC' dataset (i.e. already sorted data), we do get this:
>>
>>       scale    datum    cost-model
>>     -------------------------------
>>      100000      85%           84%
>>     1000000      87%           87%
>>     2000000      76%           96%
>>     3000000      82%           90%
>>     4000000      91%           83%
>>     5000000      93%           81%
>>
>> Ummm, not that great, I guess :-(
>
> You should try it with the data fully sorted like this, but with one
> tiny difference: The very last tuple is out of order. How does that
> look?

So here are the results for ASC-ordered dataset, with one 'unsorted' row
added to the end of the dataset. As before the complete scripts are
attached, and the raw results are available in a spreadsheet:

    http://bit.ly/18g1nTU

The durations are much higher than without the single unsorted row added
at the end. Queries often take 20x longer to finish (on the same code),
depending on the scale.

The speedup results (compared to master) look like this:

    scale     query#     datum     numeric   cost model
    100000         1      859%        861%         856%
    100000         2      811%        814%         805%
    100000         3      100%        100%          97%
    1000000        1      805%        804%         807%
    1000000        2      769%        773%         770%
    1000000        3      100%        100%          98%
    2000000        1       97%         97%         673%
    2000000        2       96%         97%         646%
    2000000        3       99%        101%         678%
    3000000        1       98%         98%         578%
    3000000        2       96%         97%         557%
    3000000        3       99%        101%         579%
    4000000        1       99%         99%         513%
    4000000        2       97%         98%         497%
    4000000        3       99%        101%         510%
    5000000        1       99%         99%         469%
    5000000        2       97%         98%         456%
    5000000        3       99%        101%         466%

What's interesting here is that some queries are much faster, but query
#3 is slow until we hit 2M rows:

    select * from (select * from stuff_int_desc order by randint
                   offset 100000000000) foo

Looking at the previous tests, I see this is exactly what's happening to
this query with 'random' dataset - it's slightly slower than master up
until 2M rows, when it suddenly jumps to the same speedup as the other
queries. Can we do something about that?

Anyway, I'm wondering what conclusion we can do from this? I believe
vast majority of datasets in production won't be perfectly sorted,
because when the table is CLUSTERed by index we tend to use index scan
to do the sort (so no problem), or the data are not actually perfectly
sorted (and here we get significant speedup).


--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Eric Grinstein
Date:
Subject: Re: Query Rewrite with Postgres' materialized views
Next
From: Andres Freund
Date:
Subject: Re: Primary not sending to synchronous standby