Re: [HACKERS] [WIP] Zipfian distribution in pgbench - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: [HACKERS] [WIP] Zipfian distribution in pgbench
Date
Msg-id alpine.DEB.2.20.1708240433560.24526@lancre
Whole thread Raw
In response to [HACKERS] [WIP] Zipfian distribution in pgbench  (Alik Khilazhev <a.khilazhev@postgrespro.ru>)
Responses Re: [HACKERS] [WIP] Zipfian distribution in pgbench
List pgsql-hackers
Hello Alik,

> I am attaching patch v7.

Patch generates multiple warnings with "git apply", apparently because of 
end-of-line spaces, and fails:
  pgbench-zipf-07v.patch:52: trailing whitespace.        {  pgbench-zipf-07v.patch:53: trailing whitespace.
  "random_zipfian", 3, PGBENCH_RANDOM_ZIPFIAN  pgbench-zipf-07v.patch:54: trailing whitespace.        },
pgbench-zipf-07v.patch:66:trailing whitespace.  #define ZIPF_CACHE_SIZE 15              /* cache cells number */
pgbench-zipf-07v.patch:67:trailing whitespace.
 
  error: patch failed: doc/src/sgml/ref/pgbench.sgml:1013  error: doc/src/sgml/ref/pgbench.sgml: patch does not apply
error:patch failed: src/bin/pgbench/exprparse.y:191  error: src/bin/pgbench/exprparse.y: patch does not apply  error:
patchfailed: src/bin/pgbench/pgbench.c:93  error: src/bin/pgbench/pgbench.c: patch does not apply  error: patch failed:
src/bin/pgbench/pgbench.h:75 error: src/bin/pgbench/pgbench.h: patch does not apply
 

It also complains with "patch", although it seems to work:
 (Stripping trailing CRs from patch; use --binary to disable.) patching file doc/src/sgml/ref/pgbench.sgml (Stripping
trailingCRs from patch; use --binary to disable.) patching file src/bin/pgbench/exprparse.y (Stripping trailing CRs
frompatch; use --binary to disable.) patching file src/bin/pgbench/pgbench.c (Stripping trailing CRs from patch; use
--binaryto disable.) patching file src/bin/pgbench/pgbench.h patch unexpectedly ends in middle of line patch
unexpectedlyends in middle of line
 

Please do not put spaces at the end of lines, eg there is a lonely tab on 
the second line of "computeHarmonicZipfian".

It seems that "CR" characters are used:
  sh> sha1sum ~/pgbench-zipf-07v.patch  439ad1de0a960b3b415ff6de9412b3cc4d6922e7
  sh> file ~/pgbench-zipf-07v.patch  pgbench-zipf-07v.patch: unified diff output, ASCII text, with CRLF line
terminators

Compilation issues one warning:
 pgbench.c: In function ‘computeHarmonicZipfian’: pgbench.c:894:2: warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]   double  uniform = pg_erand48(thread->random_state);
 

Please conform to pg standard for portability, even if it is a very old one:-)


About the documentation:

I would suggest to replace 0.5 in the first example by 1.5, as a zipfian 
distribution with a lower-than-one parameter is not that well defined.

I'm fine with using references to papers or books for the algorithm.

The documentation lines in the sgml file are too long. At least
limit text paragraphs to maybe 80 columns as done elsewhere.

typo: "<entry> Zipfian" (remove space)

typo: "used(described" (missing space)

typo: "numbers(algorithm" (again)

The sentence "It's recommended to pick <replaceable>parameter</> in range 
(0, 1) for better accuracy." does not make sense with the updated 
generation algorithm.

The text would benefit from a review by a native English speaker, who I am 
not. Anyway here is an attempt at improving/simplifying the text (I have 
removed sgml formatting). If some native English speaker could review it, 
that would be great.

"""  "random_zipfian" generates an approximated bounded zipfian distribution.  For parameter in (0,1), an approximated
algorithmis taken from  "Quickly Generating Billion-Record Synthetic Databases", Jim Gray et al, SIGMOD 1994.  For
parameterin (1, 1000), a rejection method is used, based on  "Non-Uniform Random Variate Generation", Luc Devroye, p.
550-551,Springer 1986.  The distribution is not defined when the parameter's value is 1.0.  The drawing performance is
poorfor parameter values close and above 1.0  and on a small range.
 
  "parameter" defines how skewed the distribution is.  The larger the "parameter", the more frequently values close to
thebeginning  of the interval are drawn.  The closer to 0 "parameter" is, the flatter (more uniform) the access
distribution.
"""


English in comments and code:

"inited" is not English, I think you want "initialized". Maybe "nb_cells" 
would do.

"it is not exist" (multiple instances) -> "it does not exist".

I think that the references to the paper/book should appear as comments in 
the code, for instance in front of the functions which implement the 
algorithm.

"current_time", "last_touch_time" are counter based, they are not "time". 
I suggest to update the name to "current" and "last_touch" or "last_used".

"time when cell was used last time" -> "last used logical time"


About the code:

I would prefer that you use "1.0" instead of "1." for double constants.

I think that getZipfianRand should be symmetric. Maybe a direct formula 
could do:
  return min - 1 + (s > 1) ? xxx() : yyy();

or at least use an explicit "else" for symmetry.

Function "zipfn" asks for s and n as arguments, but ISTM that they are 
already include as fields in cell, so this seems redundant. Moreover I do 
not think that the zipfn function is that useful. I would suggest to 
inline it in its caller.

"zipfGeneralizedHarmonic" is not really related to zipf, it is just used 
there. I suggest to call it "generalizedHarmonicNumber" to reflect what it 
does.

I suggest to put all LRU management in the getCell* function, that is to 
move the last_use update from computeHarmonicZipfian to getCell.

computeHarminicZipfian and computeIterativeZipfian should have the same 
signature, not one with a random state and the other with the thread. I 
suggest to do the same as other random functions, whatever it is.


Otherwise, no action required:

The patch probably conflicst with other patches in the CF, which means 
that there will be rebase needed. That is life. The queue is long and 
slow.

Also, this function deserve some tap tests, that should be added if the 
"pgbench tap test" patch get through. Otherwise it is as well tested as 
everything else around, which is basically no test.

-- 
Fabien.

pgsql-hackers by date:

Previous
From: Fabien COELHO
Date:
Subject: Re: [HACKERS] pgbench tap tests & minor fixes
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] [PATCH] Push limit to sort through a subquery