Re: O(samplesize) tuple sampling, proof of concept - Mailing list pgsql-patches

From Tom Lane
Subject Re: O(samplesize) tuple sampling, proof of concept
Date
Msg-id 16177.1081204229@sss.pgh.pa.us
Whole thread Raw
In response to O(samplesize) tuple sampling, proof of concept  (Manfred Koizar <mkoi-pg@aon.at>)
Responses Re: O(samplesize) tuple sampling, proof of concept
List pgsql-patches
Manfred Koizar <mkoi-pg@aon.at> writes:
>>> Because I was afraid that method 1 might be too expensive in terms of
>>> CPU cycles, I implemented a small variation that skips tuples without
>>> checking them for visibility; this is sampling_method 2.
>>
>> And?  Does it matter?

> There's a clearly visible difference for mid-size relations.  I'm not
> sure whether this can be attributed to visibility bit updating or other
> noise-contributing factors.

I think it would have to be visibility-bit updates.  Can you try it with
a pre-vacuumed relation, so that there is no slowdown for updates?

The update cost will have to be paid sooner or later by some xact, and
on the whole it's better that it be done by a maintenance xact than by
a foreground client; so even if there is a noticeable slowdown here,
that doesn't seem like a reason not to inspect all the tuples.

>>> I find -u diffs close to unreadable for reviewing purposes.  Please
>>> submit diffs in -c format in future.

> De gustibus non est disputandum :-)
> Fortunately this patch wouldn't look much different.  There is just a
> bunch of "+" lines.

Yeah, so I managed to read it anyway ;-).  It's the ones with
intricately intermixed "-" and "+" that I find difficult to follow.

>> AFAICS the rows will *always* be sorted already, and so this qsort is an
>> utter waste of cycles.  No?

> I don't think so.  We get the *blocks* in the correct order.  But tuples
> are still sampled by the Vitter method which starts to replace random
> tuples after the pool is filled.

Duh, you're right --- I was thinking that the old code doesn't need a
qsort, but it does.  This seems a tad annoying considering that we know
the tuples were inserted into the pool in increasing order.  I wonder if
it's worth using a more complex data structure to keep track of both
orders at once?  I suppose that could easily wind up costing more than
the qsort though ...

            regards, tom lane

pgsql-patches by date:

Previous
From: Tom Lane
Date:
Subject: Re: O(samplesize) tuple sampling, proof of concept
Next
From: Manfred Koizar
Date:
Subject: Re: O(samplesize) tuple sampling, proof of concept