Home > mailing lists

Re: TABLESAMPLE patch - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: TABLESAMPLE patch
Date	December 21, 2014 20:39:01
Msg-id	549705AA.7000301@fuzzy.cz Whole thread Raw
In response to	Re: TABLESAMPLE patch (Petr Jelinek <petr@2ndquadrant.com>)
Responses	Re: TABLESAMPLE patch (Michael Paquier <michael.paquier@gmail.com>) Re: TABLESAMPLE patch (Petr Jelinek <petr@2ndquadrant.com>)
List	pgsql-hackers

Tree view

Hi,

On 18.12.2014 13:14, Petr Jelinek wrote:
> Hi,
> 
> v2 version of this patch is attached.

I did a review of this v2 patch today. I plan to do a bit more testing,
but these are my comments/questions so far:

(0) There's a TABLESAMPLE page at the wiki, not updated since 2012:
   https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation
   We should either update it or mark it as obsolete I guess. Also,   I'd like to know what's the status regarding the
TODOitems   mentioned there. Are those still valid with this patch?

(1) The patch adds a new catalog, but does not bump CATVERSION.

(2) The catalog naming (pg_tablesamplemethod) seems a bit awkward,   as it squishes everything into a single chunk.
That'sinconsistent   with naming of the other catalogs. I think pg_table_sample_method   would be better.

(3) There are a few more strange naming decisions, but that's mostly   because of the SQL standard requires that
naming.I mean SYSTEM and   BERNOULLI method names, and also the fact that the probability is   specified as 0-100
value,which is inconsistent with other places   (e.g. percentile_cont uses the usual 0-1 probability notion). But   I
don'tthink this can be fixed, that's what the standard says.

(4) I noticed there's an interesting extension in SQL Server, which   allows specifying PERCENT or ROWS, so you can
say
     SELECT * FROM table TABLESAMPLE SYSTEM (25 PERCENT);
   or
     SELECT * FROM table TABLESAMPLE SYSTEM (2500 ROWS);
   That seems handy, and it'd make migration from SQL Server easier.   What do you think?

(5) I envision a lot of confusion because of the REPEATABLE clause.   With READ COMMITTED, it's not really repeatable
becauseof changes   done by the other users (and maybe things like autovacuum). Shall   we mention this in the
documentation?

(6) This seems slightly wrong, because of long/uint32 mismatch:
     long seed = PG_GETARG_UINT32(1);
   I think uint32 would be more appropriate, no?

(7) NITPICKING: I think a 'sample_rate' would be a better name here:
     double percent = sampler->percent;

(8) NITPICKING: InitSamplingMethod contains a command with ';;'
     fcinfo.arg[i] = (Datum) 0;;

(9) The current regression tests only use the REPEATABLE cases. I   understand queries without this clause are RANDOM,
butmaybe we   could do something like this:

      SELECT COUNT(*) BETWEEN 5000 AND 7000 FROM (        SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50)
)foo;

    Granted, there's still a small probability of false positive, but    maybe that's sufficiently small? Or is the
amountof code this    tests negligible?

(10) In the initial patch you mentioned it's possible to write custom    sampling methods. Do you think a CREATE
TABLESAMPLEMETHOD,    allowing custom methods implemented as extensions would be useful?

regards
Tomas

pgsql-hackers by date:

From: Fabrízio de Royes Mello
Date: 21 December 2014, 19:48:29
Subject: Proposal "VACUUM SCHEMA"

From: Tom Lane
Date: 21 December 2014, 21:01:02
Subject: Re: PATCH: decreasing memory needlessly consumed by array_agg

Re: TABLESAMPLE patch - Mailing list pgsql-hackers

Previous

Next