O(samplesize) tuple sampling, proof of concept

From: Manfred Koizar
Here is the first preview of the two-stage tuple sampling as discussed
on -hackers last week.  It is for review only; please do not apply.

As I didn't have a current CVS checkout when I worked on the patch, it
is based on 7.4.2 sources.  I hope this is ok for the purpose of
reviewing.

The old sampling method is still there.  I have added a GUC variable
sampling_method to make testing and benchmarking easier.

Sampling_method 0 is the old method; it now has an additional debug
message telling us how many pages and tuples have been accessed.

Sampling_method 1 is the new method (sample rows out of a sample of
blocks).  Once a block is selected for inspection, all tuples of this
block are accessed to get a good estimation of the live : dead row
ratio.
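
For stage one I use Knuth's Algorithm S (TAOCP Vol. 2, 3.4.2): the
block numbers are scanned once, and each block is selected with
probability (blocks still needed) / (blocks remaining).  Here is a
minimal standalone demo of that selection rule (a hypothetical test
harness, not part of the patch; the patch derives whole skip lengths
from a single random draw per selected block, which is statistically
equivalent but uses less entropy):

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        unsigned    N = 20;         /* blocks in the table */
        unsigned    n = 5;          /* sample size */
        unsigned    t;              /* blocks scanned so far */
        unsigned    m = 0;          /* blocks selected so far */

        srand(42);
        for (t = 0; t < N && m < n; t++)
        {
            /* select block t with probability (n - m) / (N - t) */
            double      U = (double) rand() / ((double) RAND_MAX + 1);

            if (U * (double) (N - t) < (double) (n - m))
            {
                printf("selected block %u\n", t);
                m++;
            }
        }
        return 0;
    }

Every run selects exactly n blocks (for N >= n), and each block has
the same chance n/N of ending up in the sample.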

Because I was afraid that method 1 might be too expensive in terms of
CPU cycles, I implemented a small variation that skips tuples without
checking them for visibility; this is sampling_method 2.

Tests:

    \timing
    SET client_min_messages = debug2;
    SET sampling_method = 1; ANALYSE tablename;
    ...

tenk1 in the regression database is too small to give reliable results.
I made a new table

    CREATE TABLE x AS SELECT * FROM tenk1;

and repeatedly

    INSERT INTO x SELECT * FROM x;

There were also some relatively small UPDATEs.  A second set of tests
was performed with a table with much smaller tuples.

Results:  Up to a relation size of a few thousand pages it is hard to
get consistent timings.  What you get is dominated by the effects of the
scan having to pay the write-back costs of a previous update, or
benefiting from finding its pages in shared buffers or the OS cache.
For 15000 pages and above (tests were done with the default sample size
of 3000) things start to look reasonable and repeatable.

For small relations with fewer than samplesize pages the new method (both
variants) accesses all pages and therefore *must* be slower than the old
method.  At 1000 pages there is only a difference of approximately 1%,
at 3000 pages the difference has its maximum of ca. 15%.  We can sell
this by pointing to the better quality of the statistics.

Above 3000 pages the new method looks increasingly better, because it
never reads more than samplesize pages.  Depending on tuple size the
point where the old method accesses 3000 pages is between 3500 and 4000.

Between 3000 and 8000 pages the old method still seems to be faster, but
differences between runs of the same method are not much smaller than
between runs of different methods.

At 15000 pages all methods are approximately equally fast.  Above this
relation size, running times for the new methods stay constant, while
those for the old method continue to grow with the number of pages.

The full boring list of test results is at
http://www.pivot.at/pg/SamplePerf.sxc.

Comments are welcome.  If I see any indication that a patch of this kind
will be accepted, I'll prepare a new version based on current sources.
This will contain only one sampling method and it will clean up the
Vitter function a bit.  Are there any spots in the documentation that
should be updated?

Servus
 Manfred
diff -ruN ../base/src/backend/commands/analyze.c src/backend/commands/analyze.c
--- ../base/src/backend/commands/analyze.c    2003-10-18 17:38:06.000000000 +0200
+++ src/backend/commands/analyze.c    2004-04-05 10:16:30.000000000 +0200
@@ -105,6 +105,21 @@
     int            first;            /* values[] index of first occurrence */
 } ScalarMCVItem;

+/*
+** data structure for Algorithm S from Knuth 3.4.2
+*/
+typedef struct
+{
+    BlockNumber    N;                /* number of blocks, known in advance */
+    int            n;                /* sample size */
+    BlockNumber    t;                /* current block number */
+    int            m;                /* blocks selected so far */
+} BlockSamplerData;
+typedef BlockSamplerData *BlockSampler;
+
+static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize);
+static bool BlockSampler_HasMore(BlockSampler bs);
+static BlockNumber BlockSampler_Next(BlockSampler bs);

 #define swapInt(a,b)    do {int _tmp; _tmp=a; a=b; b=_tmp;} while(0)
 #define swapDatum(a,b)    do {Datum _tmp; _tmp=a; a=b; b=_tmp;} while(0)
@@ -113,6 +128,7 @@
 /* Default statistics target (GUC parameter) */
 int            default_statistics_target = 10;

+extern int sampling_method;   /* debugging hack */

 static int    elevel = -1;

@@ -310,7 +326,10 @@
      * Acquire the sample rows
      */
     rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
-    numrows = acquire_sample_rows(onerel, rows, targrows, &totalrows);
+    if (sampling_method == 0)
+        numrows = old_acquire_sample_rows(onerel, rows, targrows, &totalrows);
+    else
+        numrows = acquire_sample_rows(onerel, rows, targrows, &totalrows);

     /*
      * If we are running a standalone ANALYZE, update pages/tuples stats
@@ -488,6 +507,281 @@
 }

 /*
+** BlockSampler_Init -- prepare for random sampling of blocknumbers
+**
+** BlockSampler is used for stage one of our new two-stage tuple
+** sampling mechanism as discussed on -hackers 2004-04-02 (subject
+** "Large DB").  It selects a random sample of samplesize blocks out of
+** the nblocks blocks in the table.  If the table has fewer than
+** samplesize blocks, all blocks are selected.
+**
+*/
+static void
+BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
+{
+    bs->N = nblocks;            /* table size */
+    /*
+    ** If we decide to reduce samplesize for tables that have fewer or
+    ** not many more than samplesize blocks, here is the place to do
+    ** it.
+    */
+    bs->n = samplesize;
+    bs->t = 0;                    /* blocks scanned so far */
+    bs->m = 0;                    /* blocks selected so far */
+}
+
+static bool
+BlockSampler_HasMore(BlockSampler bs)
+{
+    return (bs->t < bs->N) && (bs->m < bs->n);
+}
+
+static BlockNumber
+BlockSampler_Next(BlockSampler bs)
+{
+    BlockNumber    K = bs->N - bs->t;        /* remaining blocks */
+    int            k = bs->n - bs->m;        /* blocks still to sample */
+    double        p;                /* probability to skip the next block */
+    double        V;                /* random */
+
+    Assert(BlockSampler_HasMore(bs));
+
+    if (k >= K)
+    {
+        /* need all the rest */
+        bs->t += 1;
+        bs->m += 1;
+        return bs->t - 1;
+    }
+
+#ifdef NOT_USED
+    /* simple, robust, but wastes a lot of entropy */
+    for (;;) {
+        double U = random_fract();
+
+        if (K * U < k) {
+            /* select */
+            bs->t += 1;
+            bs->m += 1;
+            return bs->t - 1;
+        }
+        else {
+            /* skip */
+            bs->t += 1;
+            K -= 1;
+        }
+    }
+#endif
+
+    p = 1.0 - (double) k / (double) K;
+    V = random_fract();
+    while (V < p)
+    {
+        bs->t += 1;
+        K -= 1;
+        /*
+        ** Assert(K > 0)
+        ** because we started with K > k > 0,
+        ** and when K == k, the loop terminates
+        */
+        p *= 1.0 - (double) k / (double) K;
+    }
+
+    /* select */
+    bs->t += 1;
+    bs->m += 1;
+    return bs->t - 1;
+}
+
+/*
+ * acquire_sample_rows -- acquire a random sample of rows from the table
+ *
+ * As of April 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or fewer, if there are fewer in the
+ * sample of blocks).  The two stages are executed simultaneously:  Each
+ * block is processed as soon as stage one returns its number and while the
+ * rows are read stage two controls which ones are to be kept in the sample.
+ * A truly random sample is collected: every row has an equal chance of
+ * ending up in the final sample.
+ *
+ * We also estimate the total number of rows in the table, and return that
+ * into *totalrows.
+ *
+ * The returned list of tuples is in order by physical position in the table.
+ * (We will rely on this later to derive correlation estimates.)
+ */
+static int
+acquire_sample_rows(Relation onerel, HeapTuple *rows, int targrows,
+                    double *totalrows)
+{
+    int            numrows = 0;
+    int            liverows = 0;
+    int            deadrows = 0;
+    int            skippedrows = 0;
+    int            rowstoskip = -1;
+    HeapScanDesc scan;
+    BlockSamplerData bs;
+    double        rstate;
+
+    Assert(targrows > 1);
+
+    /*
+    ** Hack for setting onerel->rd_nblocks:
+    */
+    scan = heap_beginscan(onerel, SnapshotNow, 0, NULL);
+    heap_endscan(scan);
+
+    /*
+    ** Prepare for sampling block numbers
+    */
+    BlockSampler_Init(&bs, onerel->rd_nblocks, targrows);
+    /*
+    ** Prepare for sampling rows
+    */
+    rstate = init_selection_state(targrows);
+    while (BlockSampler_HasMore(&bs)) {
+        BlockNumber currblock;
+        Buffer        currbuffer;
+        Page        currpage;
+        OffsetNumber targoffset,
+                    maxoffset;
+
+        CHECK_FOR_INTERRUPTS();
+
+        currblock = BlockSampler_Next(&bs);
+
+        /*
+         * We must maintain a pin on the target page's buffer to ensure
+         * that the maxoffset value stays good (else concurrent VACUUM
+         * might delete tuples out from under us).    Hence, pin the page
+         * until we are done looking at it.  We don't maintain a lock on
+         * the page, so tuples could get added to it, but we ignore such
+         * tuples.
+         */
+        currbuffer = ReadBuffer(onerel, currblock);
+        if (!BufferIsValid(currbuffer))
+            elog(ERROR, "ReadBuffer failed");
+        LockBuffer(currbuffer, BUFFER_LOCK_SHARE);
+        currpage = BufferGetPage(currbuffer);
+        maxoffset = PageGetMaxOffsetNumber(currpage);
+        LockBuffer(currbuffer, BUFFER_LOCK_UNLOCK);
+
+        for (targoffset = FirstOffsetNumber; targoffset <= maxoffset; ++targoffset)
+        {
+            HeapTupleData tuple;
+            Buffer        tupbuffer;
+
+            if (sampling_method == 2) {
+                if (rowstoskip > 0) {
+                    --rowstoskip;
+                    ++skippedrows;
+                    continue;
+                }/*if*/
+            }/*if*/
+
+            ItemPointerSet(&tuple.t_self, currblock, targoffset);
+            if (heap_fetch(onerel, SnapshotNow, &tuple, &tupbuffer,
+                           false, NULL))
+            {
+                ++liverows;
+                if (numrows < targrows)
+                    rows[numrows++] = heap_copytuple(&tuple);
+                else
+                {
+                    if (rowstoskip < 0)
+                    {
+                        double t_o;
+                        double t_n;
+
+                        /*
+                        ** Use the old function for now.
+                        ** This should be refactored into a function
+                        ** that natively returns the number of records
+                        ** to skip, as soon as we trust the new
+                        ** sampling method.
+                        */
+                        t_o = liverows +
+                              skippedrows * (double) liverows
+                                          / ((double) liverows + deadrows);
+                        t_n = select_next_random_record(t_o,
+                                                      targrows,
+                                                      &rstate);
+                        rowstoskip = (int) (t_n - t_o);
+                    }
+                    if (rowstoskip == 0)
+                    {
+                        /*
+                         * Found a suitable tuple, so save it,
+                         * replacing one old tuple at random
+                         */
+                        int    k = (int) (targrows * random_fract());
+
+                        Assert(k >= 0 && k < targrows);
+                        heap_freetuple(rows[k]);
+                        rows[k] = heap_copytuple(&tuple);
+                    }
+
+                    rowstoskip -= 1;
+                }
+
+                /* this releases the second pin acquired by heap_fetch: */
+                ReleaseBuffer(tupbuffer);
+            }
+            else
+            {
+                /*
+                ** This information is currently not used.
+                */
+                ++deadrows;
+            }
+        }
+
+        /* this releases the initial pin: */
+        ReleaseBuffer(currbuffer);
+    }
+
+    ereport(DEBUG2,
+            (errmsg("new: %d pages sampled, %d live rows and %d dead rows scanned",
+                    bs.m, liverows, deadrows)));
+    if (sampling_method == 2)
+        ereport(DEBUG2,
+                (errmsg("new: %d rows skipped", skippedrows)));
+    /*
+     * If we ran out of tuples then we're done.  No sort is needed,
+     * since they're already in order.
+     *
+     * Otherwise we need to sort the collected tuples by position
+     * (itempointer).  I don't care for corner cases where the tuples
+     * are already sorted.
+     */
+    if (numrows == targrows)
+        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+
+    /*
+     * Estimate total number of valid rows in relation.
+     */
+    if (bs.m > 0)
+    {
+        double estlive = liverows + skippedrows * (double) liverows / ((double) liverows + deadrows);
+        *totalrows = floor((double) onerel->rd_nblocks * estlive / bs.m + 0.5);
+    }
+    else
+        *totalrows = 0.0;
+
+    /*
+     * Emit some interesting relation info
+     */
+    ereport(elevel,
+            (errmsg("\"%s\": %u pages, %d rows sampled, %.0f estimated total rows",
+                    RelationGetRelationName(onerel),
+                    onerel->rd_nblocks, numrows, *totalrows)));
+
+    return numrows;
+}
+
+/*
  * acquire_sample_rows -- acquire a random sample of rows from the table
  *
  * Up to targrows rows are collected (if there are fewer than that many
@@ -502,7 +796,7 @@
  * (We will rely on this later to derive correlation estimates.)
  */
 static int
-acquire_sample_rows(Relation onerel, HeapTuple *rows, int targrows,
+old_acquire_sample_rows(Relation onerel, HeapTuple *rows, int targrows,
                     double *totalrows)
 {
     int            numrows = 0;
@@ -516,6 +810,10 @@
     double        tuplesperpage;
     double        t;
     double        rstate;
+    /* for debugging: */
+    int blocksaccessed = 0;
+    int tuplesaccessed = 0;
+    int deadtuples = 0;

     Assert(targrows > 1);

@@ -592,6 +890,8 @@
         estblock = lastblock;
     tuplesperpage = (double) numest / (double) estblock;

+    blocksaccessed = lastblock + 1;
+    tuplesaccessed = numrows;
     t = (double) numrows;        /* t is the # of records processed so far */
     rstate = init_selection_state(targrows);
     for (;;)
@@ -644,6 +944,9 @@
         maxoffset = PageGetMaxOffsetNumber(targpage);
         LockBuffer(targbuffer, BUFFER_LOCK_UNLOCK);

+        if (targblock != lastblock)
+            ++blocksaccessed;
+
         for (;;)
         {
             HeapTupleData targtuple;
@@ -657,6 +960,7 @@
                 targoffset = FirstOffsetNumber;
                 goto pageloop;
             }
+            ++tuplesaccessed;
             ItemPointerSet(&targtuple.t_self, targblock, targoffset);
             if (heap_fetch(onerel, SnapshotNow, &targtuple, &tupbuffer,
                            false, NULL))
@@ -680,9 +984,14 @@
             }
             /* this tuple is dead, so advance to next one on same page */
             targoffset++;
+            ++deadtuples;
         }
     }

+    ereport(DEBUG2,
+            (errmsg("old: %d pages, %d rows accessed (%d dead rows)",
+                    blocksaccessed, tuplesaccessed, deadtuples)));
+
     /*
      * Now we need to sort the collected tuples by position (itempointer).
      */
diff -ruN ../base/src/backend/utils/misc/guc.c src/backend/utils/misc/guc.c
--- ../base/src/backend/utils/misc/guc.c    2004-03-13 07:26:11.000000000 +0100
+++ src/backend/utils/misc/guc.c    2004-04-04 10:53:26.000000000 +0200
@@ -131,6 +131,9 @@

 int            log_min_duration_statement = -1;

+/* quick hack for tuple sampling tests */
+int            sampling_method = 1;
+

 /*
  * These variables are all dummies that don't do anything, except in some
@@ -859,6 +862,15 @@
 static struct config_int ConfigureNamesInt[] =
 {
     {
+        {"sampling_method", PGC_USERSET, DEVELOPER_OPTIONS,
+            gettext_noop("Controls how to collect a tuple sample."),
+            gettext_noop("0 = old method; 1 = new, scan all tuples; "
+             "2 = new, skip tuples.")
+        },
+        &sampling_method,
+        1, 0, 2, NULL, NULL
+    },
+    {
         {"default_statistics_target", PGC_USERSET, QUERY_TUNING_OTHER,
             gettext_noop("Sets the default statistics target."),
             gettext_noop("This applies to table columns that have not had a "

Re: O(samplesize) tuple sampling, proof of concept

From: Tom Lane
Date: Mon, 05 Apr 2004 15:37:07 -0400
Manfred Koizar <mkoi-pg@aon.at> writes:
> The old sampling method is still there.  I have added a GUC variable
> sampling_method to make testing and benchmarking easier.

I wouldn't bother with a GUC variable for the production patch.

> Once a block is selected for inspection, all tuples of this
> block are accessed to get a good estimation of the live : dead row
> ratio.

Why are you bothering to compute the live:dead ratio?  The statistics
model doesn't have any place for a dead-tuples count, so there's no need
to think about anything but live tuples.  (This is a historical
artifact, perhaps, arising from the fact that originally analysis
happened after VACUUM FULL and so the number of dead tuples was
guaranteed to be zero at the time.  But still, I'm not sure how we'd
factor a dead-tuples count into the estimates if we had it.)

> Because I was afraid that method 1 might be too expensive in terms of
> CPU cycles, I implemented a small variation that skips tuples without
> checking them for visibility; this is sampling_method 2.

And?  Does it matter?  (I'd guess not in practice, as checking a tuple
for visibility is cheap if someone's already marked it good.)

> At 1000 pages there is only a difference of approximately 1%,
> at 3000 pages the difference has its maximum of ca. 15%.  We can sell
> this by pointing to the better quality of the statistics.

If that's as bad as it gets I think we are OK.  You should redo the test
with larger sample size though (try stats target = 100) to see if the
answer changes.

> Are there any spots in the documentation that
> should be updated?

AFAIR there is noplace that specifically needs to be changed.

> diff -ruN ../base/src/backend/commands/analyze.c src/backend/commands/analyze.c

I find -u diffs close to unreadable for reviewing purposes.  Please
submit diffs in -c format in future.

> +    /*
> +     * If we ran out of tuples then we're done.  No sort is needed,
> +     * since they're already in order.
> +     *
> +     * Otherwise we need to sort the collected tuples by position
> +     * (itempointer).  I don't care for corner cases where the tuples
> +     * are already sorted.
> +     */
> +    if (numrows == targrows)
> +        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

AFAICS the rows will *always* be sorted already, and so this qsort is an
utter waste of cycles.  No?

            regards, tom lane

Re: O(samplesize) tuple sampling, proof of concept

From: Tom Lane
Date: Mon, 05 Apr 2004 18:30:29 -0400
Manfred Koizar <mkoi-pg@aon.at> writes:
>>> Because I was afraid that method 1 might be too expensive in terms of
>>> CPU cycles, I implemented a small variation that skips tuples without
>>> checking them for visibility; this is sampling_method 2.
>>
>> And?  Does it matter?

> There's a clearly visible difference for mid-size relations.  I'm not
> sure whether this can be attributed to visibility bit updating or other
> noise-contributing factors.

I think it would have to be visibility-bit updates.  Can you try it with
a pre-vacuumed relation, so that there is no slowdown for updates?

The update cost will have to be paid sooner or later by some xact, and
on the whole it's better that it be done by a maintenance xact than by
a foreground client; so even if there is a noticeable slowdown here,
that doesn't seem like a reason not to inspect all the tuples.

>>> I find -u diffs close to unreadable for reviewing purposes.  Please
>>> submit diffs in -c format in future.

> De gustibus non est disputandum :-)
> Fortunately this patch wouldn't look much different.  There is just a
> bunch of "+" lines.

Yeah, so I managed to read it anyway ;-).  It's the ones with
intricately intermixed "-" and "+" that I find difficult to follow.

>> AFAICS the rows will *always* be sorted already, and so this qsort is an
>> utter waste of cycles.  No?

> I don't think so.  We get the *blocks* in the correct order.  But tuples
> are still sampled by the Vitter method which starts to replace random
> tuples after the pool is filled.

Duh, you're right --- I was thinking that the old code doesn't need a
qsort, but it does.  This seems a tad annoying considering that we know
the tuples were inserted into the pool in increasing order.  I wonder if
it's worth using a more complex data structure to keep track of both
orders at once?  I suppose that could easily wind up costing more than
the qsort though ...

            regards, tom lane

Re: O(samplesize) tuple sampling, proof of concept

From: Manfred Koizar
On Mon, 05 Apr 2004 15:37:07 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>I wouldn't bother with a GUC variable for the production patch.

Among other things the GUC variable will be thrown out for the final
version.

>> Once a block is selected for inspection, all tuples of this
>> block are accessed to get a good estimation of the live : dead row
>> ratio.
>
>Why are you bothering to compute the live:dead ratio?

That was basically bad wording.  It should have been "... to get a good
estimation of live rows per page."  Counting dead rows turned out to be
trivial, so I just did it and included the number in the debug messages.
Then it happened to be useful for method 2.
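
For reference, the patch turns these counts into the row count
estimate as

    est_live  = liverows + skippedrows * liverows / (liverows + deadrows)
    totalrows = floor(nblocks * est_live / pages_sampled + 0.5)

i.e. method 2 extrapolates the rows it skipped via the live : dead
ratio of the rows it did check; with method 1, skippedrows is zero and
the second term vanishes.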

>> Because I was afraid that method 1 might be too expensive in terms of
>> CPU cycles, I implemented a small variation that skips tuples without
>> checking them for visibility; this is sampling_method 2.
>
>And?  Does it matter?

There's a clearly visible difference for mid-size relations.  I'm not
sure whether this can be attributed to visibility bit updating or other
noise-contributing factors.

Method 2 gives a row count estimation error between 10 and 17%, whereas
method 1 is less than 1% off.  (My updates generated dead tuples at very
evenly distributed locations by using conditions like WHERE mod(twenty,
7) = 0.)

>If that's as bad as it gets I think we are OK.  You should redo the test
>with larger sample size though (try stats target = 100) to see if the
>answer changes.

Will do.

>I find -u diffs close to unreadable for reviewing purposes.  Please
>submit diffs in -c format in future.

De gustibus non est disputandum :-)
Fortunately this patch wouldn't look much different.  There is just a
bunch of "+" lines.

>AFAICS the rows will *always* be sorted already, and so this qsort is an
>utter waste of cycles.  No?

I don't think so.  We get the *blocks* in the correct order.  But tuples
are still sampled by the Vitter method which starts to replace random
tuples after the pool is filled.
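
To illustrate, here is a toy reservoir (plain Algorithm R; Vitter's
Algorithm Z in the patch computes how many rows to skip instead of
drawing a random number per row, but the replacement step is the
same).  This is a hypothetical demo, not backend code:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        int     pool[5];
        int     targrows = 5;
        int     t;

        srand(42);
        for (t = 0; t < 100; t++)
        {
            if (t < targrows)
                pool[t] = t;        /* fill the pool in physical order */
            else
            {
                /* keep row t with probability targrows / (t + 1) */
                int     k = rand() % (t + 1);

                if (k < targrows)
                    pool[k] = t;    /* overwrite a *random* slot */
            }
        }
        for (t = 0; t < targrows; t++)
            printf("%d ", pool[t]); /* typically not in ascending order */
        printf("\n");
        return 0;
    }

Once the pool is full, every accepted row lands in a random slot, so
the final array is unordered whenever any replacement has happened.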

BTW, thanks for the paper!

Servus
 Manfred

Re: O(samplesize) tuple sampling, proof of concept

From: Manfred Koizar
On Mon, 05 Apr 2004 18:30:29 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> noise-contributing factors.
>
>I think it would have to be visibility-bit updates.  Can you try it with
>a pre-vacuumed relation, so that there is no slowdown for updates?

I'd like to avoid VACUUM to keep the dead tuples.  Otherwise we'd have
nothing to judge the quality of the row count estimation.  SELECT
count(*) ... should do as well.  And I'll also issue a CHECKPOINT to
make sure that the following ANALYSE doesn't have to write out dirty
pages.

>Yeah, so I managed to read it anyway ;-).  It's the ones with
>intricately intermixed "-" and "+" that I find difficult to follow.

<ot>
Vim nicely marks "+" and "-" lines in different colours.  That makes you
read -u diffs almost like a newspaper, your eyes automatically ignore
lines that have the wrong colour.  I can't get myself used to reading -c
diffs.  Jumping up and down to find corresponding lines makes me
nervous.  Anyway, just a matter of taste ...
</ot>

>Duh, you're right --- I was thinking that the old code doesn't need a
>qsort, but it does.  This seems a tad annoying considering that we know
>the tuples were inserted into the pool in increasing order.  I wonder if
>it's worth using a more complex data structure to keep track of both
>orders at once?  I suppose that could easily wind up costing more than
>the qsort though ...

The least complex structure I can think of is a doubly linked list
combined with an array.  This can be done later, if we find it's worth
it (which I doubt).
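
Something like this (a hypothetical sketch, names invented; each pool
slot carries list links so that the itempointer order survives
replacements):

    typedef struct PoolSlot
    {
        HeapTuple       tuple;      /* the sampled row */
        struct PoolSlot *prev;      /* neighbours in itempointer order */
        struct PoolSlot *next;
    } PoolSlot;

    typedef struct
    {
        PoolSlot   *slots;          /* array of targrows slots */
        PoolSlot   *head;           /* lowest itempointer */
        PoolSlot   *tail;           /* highest itempointer */
    } OrderedPool;

    /*
     * Replace slot k; the caller guarantees that newtup has the
     * highest itempointer seen so far, so the slot moves to the tail.
     */
    static void
    pool_replace(OrderedPool *pool, int k, HeapTuple newtup)
    {
        PoolSlot   *s = &pool->slots[k];

        heap_freetuple(s->tuple);
        s->tuple = newtup;
        if (s != pool->tail)
        {
            if (s->prev)
                s->prev->next = s->next;
            else
                pool->head = s->next;
            s->next->prev = s->prev;

            s->prev = pool->tail;
            s->next = NULL;
            pool->tail->next = s;
            pool->tail = s;
        }
    }

That keeps the pool sorted in O(1) per replacement, but two extra
pointers and some cache-unfriendly relinking per slot may well cost
more than one final qsort of a few thousand entries.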

Servus
 Manfred