Re: Abbreviated keys for Numeric - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Abbreviated keys for Numeric
Date
Msg-id CAM3SWZRR+LdpadytaPE9qeNrw34UwUaMhiuCUGH0Fc5e07K_uA@mail.gmail.com
Whole thread Raw
In response to Re: Abbreviated keys for Numeric  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Abbreviated keys for Numeric  (Peter Geoghegan <pg@heroku.com>)
Re: Abbreviated keys for Numeric  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Sat, Feb 21, 2015 at 12:15 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> Although I cannot easily explain the disparity in performance between
>> 1M and 5M sized sets for this query:
>>
>> select count(distinct randtxt) from stuff_text
>>
>> You did make sure that the queries didn't spill to disk, right? Or
>> that they did so consistently, at least.
>
> All the queries were running with work_mem=1GB, and I don't think they
> were spilling to disk. Actually, I don't have the 'merge' patch applied,
> so that would probably crash because of SIGSEGV.

Actually, that isn't so -- Andrew's datum sort patch incidentally
fixes tuplesort_begin_datum() in the same way as the patch I posted.
You shouldn't actually need my patch, which was just an immediate fix.

I can recreate the problem you see with text sort regressions.
Abbreviation is aborted for the case in question, unsurprisingly, and
fairly far in. With that many tuples, the idea of taking abbreviated
cardinality as a proxy for full cardinality becomes less important,
because either way you have to do at least 10 comparisons per item on
average. Originally, I had as a heuristic that once you get to a
million items without aborting, stop considering the possibility of
aborting - that much should probably be added back, at a minimum.
Andrew is already doing that in his numeric patch (at 100,000 tuples),
and that may be the only reason why numeric is not regressed too.

For a query like this, if I "#define DEBUG_ABBREV_KEYS", so that we
don't actually abort, and run this query of yours until it stabilizes:
select * from (select * from stuff_text order by randtxt offset
100000000000) foo;

It takes about 5.3 seconds. If I don't, however - if it is allowed to
run its course - it takes a whopping 35 seconds. So the ad-hoc cost
model of the text opclass definitely needs more work, as I suspected.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Expanding the use of FLEXIBLE_ARRAY_MEMBER for declarations like foo[1]
Next
From: Peter Geoghegan
Date:
Subject: Re: Abbreviated keys for Numeric