Thread: Stats target increase vs compute_tsvector_stats()
I started making the changes to increase the default and maximum stats
targets 10X, as I believe was agreed to in this thread:
http://archives.postgresql.org/pgsql-hackers/2008-12/msg00386.php

I came across this bit in ts_typanalyze.c:

    /* We want statistic_target * 100 lexemes in the MCELEM array */
    num_mcelem = stats->attr->attstattarget * 100;

I wonder whether the multiplier here should be changed? This code is
new for 8.4, so we have zero field experience about what desirable
lexeme counts are; but the prospect of up to a million lexemes in
a pg_statistic entry doesn't seem quite right. I'm tempted to cut the
multiplier to 10 so that the effective range of MCELEM sizes remains
the same as what Jan had in mind when he wrote the code.

regards, tom lane
I don't quite know how this data is used, but any constant factor seems
like it would be arbitrary. It sounds like a more principled algorithm
would be to use stats_target^2. But that has the same problem. Even
stats_target^1.5 would be too big for stats_target 10,000.

I think just using 10 is probably the right thing.

-- 
Greg

On 13 Dec 2008, at 13:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I started making the changes to increase the default and maximum stats
> targets 10X, as I believe was agreed to in this thread:
> http://archives.postgresql.org/pgsql-hackers/2008-12/msg00386.php
>
> I came across this bit in ts_typanalyze.c:
>
> /* We want statistic_target * 100 lexemes in the MCELEM array */
> num_mcelem = stats->attr->attstattarget * 100;
>
> I wonder whether the multiplier here should be changed? This code is
> new for 8.4, so we have zero field experience about what desirable
> lexeme counts are; but the prospect of up to a million lexemes in
> a pg_statistic entry doesn't seem quite right. I'm tempted to cut the
> multiplier to 10 so that the effective range of MCELEM sizes remains
> the same as what Jan had in mind when he wrote the code.
>
> regards, tom lane
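
To put numbers on the scaling rules discussed so far, here is a small C sketch. It is not code from ts_typanalyze.c, and the function names are made up for illustration. It assumes the new maximum stats target of 10,000 mentioned above (so an old maximum of 1,000), and it approximates the 1.5 power with an integer square root rounded down.

```c
#include <assert.h>

/* Linear rule, as in the quoted line: num_mcelem = attstattarget * multiplier. */
static long long
mcelem_linear(int stats_target, int multiplier)
{
    assert(stats_target >= 0 && multiplier > 0);
    return (long long) stats_target * multiplier;
}

/* Quadratic rule: stats_target^2. */
static long long
mcelem_squared(int stats_target)
{
    assert(stats_target >= 0);
    return (long long) stats_target * stats_target;
}

/* floor(sqrt(n)) by linear search; adequate for the small inputs used here. */
static long long
isqrt_floor(long long n)
{
    long long r = 0;

    while ((r + 1) * (r + 1) <= n)
        r++;
    return r;
}

/*
 * Approximation of stats_target^1.5, computed as
 * stats_target * floor(sqrt(stats_target)).  Exact only when
 * stats_target is a perfect square, such as 10000.
 */
static long long
mcelem_pow15(int stats_target)
{
    assert(stats_target >= 0);
    return (long long) stats_target * isqrt_floor(stats_target);
}
```

At the new maximum, mcelem_linear(10000, 100) gives 1,000,000 and mcelem_linear(10000, 10) gives 100,000, which matches mcelem_linear(1000, 100) at the old maximum. By comparison, mcelem_squared(10000) gives 100,000,000 and mcelem_pow15(10000) gives 1,000,000, so neither power rule keeps the array smaller than the multiplier of 100 does.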
Tom Lane wrote:
> I started making the changes to increase the default and maximum stats
> targets 10X, as I believe was agreed to in this thread:
> http://archives.postgresql.org/pgsql-hackers/2008-12/msg00386.php
>
> I came across this bit in ts_typanalyze.c:
>
> /* We want statistic_target * 100 lexemes in the MCELEM array */
> num_mcelem = stats->attr->attstattarget * 100;
>
> I wonder whether the multiplier here should be changed? This code is
> new for 8.4, so we have zero field experience about what desirable
> lexeme counts are; but the prospect of up to a million lexemes in
> a pg_statistic entry doesn't seem quite right. I'm tempted to cut the
> multiplier to 10 so that the effective range of MCELEM sizes remains
> the same as what Jan had in mind when he wrote the code.

The origin of that bit is this post:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00556.php
and the following few downthread ones.
If we bump the default statistics target 10 times, then changing the
multiplier to 10 seems the right thing to do. Only thing that needs
caution is the frequency of pruning we do in the Lossy Counting
algorithm, that IIRC is correlated with the desired target length of the
MCELEM array.

BTW: I've been occupied with other things and might have missed some
discussions, but at some point it has been considered to use Lossy
Counting to gather statistics from regular columns, not only tsvectors.
Wouldn't this help the performance hit ANALYZE takes from upping
default_stats_target?

Cheers,
Jan

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> Tom Lane wrote:
>> I came across this bit in ts_typanalyze.c:
>>
>> /* We want statistic_target * 100 lexemes in the MCELEM array */
>> num_mcelem = stats->attr->attstattarget * 100;
>>
>> I wonder whether the multiplier here should be changed?

> The origin of that bit is this post:
> http://archives.postgresql.org/pgsql-hackers/2008-07/msg00556.php
> and the following few downthread ones.
> If we bump the default statistics target 10 times, then changing the
> multiplier to 10 seems the right thing to do.

OK, will do.

> Only thing that needs
> caution is the frequency of pruning we do in the Lossy Counting
> algorithm, that IIRC is correlated with the desired target length of the
> MCELEM array.

Right below that we have

    /*
     * We set bucket width equal to the target number of result lexemes.
     * This is probably about right but perhaps might need to be scaled
     * up or down a bit?
     */
    bucket_width = num_mcelem;

so it should track automatically. AFAICS the argument in the above
thread that this is an appropriate pruning distance holds good
regardless of just how we obtain the target mcelem count.

> BTW: I've been occupied with other things and might have missed some
> discussions, but at some point it has been considered to use Lossy
> Counting to gather statistics from regular columns, not only tsvectors.
> Wouldn't this help the performance hit ANALYZE takes from upping
> default_stats_target?

Perhaps, but it's not likely to get done for 8.4 ...

regards, tom lane
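
For readers unfamiliar with why the pruning distance "tracks automatically", the following is a minimal sketch of textbook Lossy Counting in C, with the bucket width derived from the target element count as in the quoted comment. It is an illustration under stated assumptions, not the code in ts_typanalyze.c: items are plain ints standing in for lexemes, the table is a fixed-size array with linear search, and every type and function name (LossyCounter, lc_init, lc_add, lc_prune, MAX_TRACKED) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED 1024        /* hypothetical capacity for this sketch */

typedef struct
{
    int         item;           /* stand-in for a lexeme */
    long        freq;           /* occurrences counted since insertion */
    long        delta;          /* maximum possible undercount */
} TrackedItem;

typedef struct
{
    TrackedItem entries[MAX_TRACKED];
    size_t      n_entries;
    long        bucket_width;   /* pruning distance */
    long        n_seen;         /* elements processed so far */
} LossyCounter;

/* The bucket width follows the target count, whatever multiplier produced it. */
static void
lc_init(LossyCounter *lc, int stats_target, int multiplier)
{
    long        num_mcelem = (long) stats_target * multiplier;

    assert(num_mcelem > 0);
    lc->n_entries = 0;
    lc->n_seen = 0;
    lc->bucket_width = num_mcelem;
}

/* Drop every entry whose freq + delta does not exceed the current bucket id. */
static void
lc_prune(LossyCounter *lc, long b_current)
{
    size_t      keep = 0;

    for (size_t i = 0; i < lc->n_entries; i++)
        if (lc->entries[i].freq + lc->entries[i].delta > b_current)
            lc->entries[keep++] = lc->entries[i];
    lc->n_entries = keep;
}

/* Process one element of the stream. */
static void
lc_add(LossyCounter *lc, int item)
{
    /* Bucket id of this element: ceil(N / bucket_width) for the N-th element. */
    long        b_current = lc->n_seen / lc->bucket_width + 1;
    size_t      i;

    lc->n_seen++;
    for (i = 0; i < lc->n_entries; i++)
        if (lc->entries[i].item == item)
            break;

    if (i < lc->n_entries)
        lc->entries[i].freq++;
    else
    {
        /* This sketch simply refuses to overflow its fixed table. */
        assert(lc->n_entries < MAX_TRACKED);
        lc->entries[lc->n_entries++] = (TrackedItem) {item, 1, b_current - 1};
    }

    /* Prune once per bucket, i.e. every bucket_width elements. */
    if (lc->n_seen % lc->bucket_width == 0)
        lc_prune(lc, b_current);
}
```

Because lc_init derives bucket_width from the same product that sets the target count, changing the multiplier from 100 to 10 changes the pruning interval by the same factor without any other code needing to know. Calling lc_add for each element of a stream and then reading lc.entries gives the surviving candidates with their approximate counts.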