Re: reducing the footprint of ScanKeyword (was Re: Large writable variables) - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
Date
Msg-id CAASwCXdc1Yh10=gS1JjtF=Yn-zRsTG01k0Yw3M2xVrs7K5zq6w@mail.gmail.com
Whole thread Raw
In response to Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
List pgsql-hackers
Many thanks for working on this, amazing work, really nice you made it a separate reusable Perl-module.

The generated hash functions reads one character at a time.
I've seen a performance trick in other hash functions [1]
to instead read multiple bytes in each iteration,
and then handle the remaining bytes after the loop.


I've done some testing and it looks like a ~30% speed-up of the generated
ScanKeywords_hash_func() function would be possible.

If you think this approach is promising, I would be happy to prepare a patch for it,
but I wanted to check with the project this idea has not already been considered and ruled out
for some technical reasons I've failed to see, is there any?

For this to work you would need to use larger constants for $hash_mult1 and $hash_mult2 though.
I've successfully used these values:
$hash_mult1 0x2c1b3c6d
$hash_mult2 (0x297a2d39, 0x85ebca6b, 0xc2b2ae35, 0x7feb352d, 0x846ca68b)

Here is the idea:

Generated C-code:

  for (; keylen >= 4; keylen -= 4, k += 4)
  {
    uint32_t v;
    memcpy(&v, k, 4);
    v |= 0x20202020;
    a = a * 739982445 + v;
    b = b * 2246822507 + v;
  }
  uint32_t v = 0;
  switch (keylen)
  {
  case 3:
    memcpy(&v, k, 3);
    v |= 0x202020;
    break;
  case 2:
    memcpy(&v, k, 2);
    v |= 0x2020;
    break;
  case 1:
    memcpy(&v, k, 1);
    v |= 0x20;
    break;
  }
  a = a * 739982445 + v;
  b = b * 2246822507 + v;
  return h[a % 883] + h[b % 883];

(Reding 8 bytes a time instead would perhaps be a win since some keywords are quite long.)

Perl-code:

sub _calc_hash
{
my ($key, $mult, $seed) = @_;

my $result = $seed;
my $i=0;
my $keylen = length($key);

for (; $keylen>=4; $keylen-=4, $i+=4) {
my $cn = (ord(substr($key,$i+0,1)) << 0)
       | (ord(substr($key,$i+1,1)) << 8)
| (ord(substr($key,$i+2,1)) << 16)
| (ord(substr($key,$i+3,1)) << 24);
$cn |= 0x20202020 if $case_fold;
$result = ($result * $mult + $cn) % 4294967296;
}

my $cn = 0;
if ($keylen == 3) {
$cn = (ord(substr($key,$i+0,1)) << 0)
       | (ord(substr($key,$i+1,1)) << 8)
| (ord(substr($key,$i+2,1)) << 16);
$cn |= 0x202020 if $case_fold;
} elsif ($keylen == 2) {
$cn = (ord(substr($key,$i+0,1)) << 0)
       | (ord(substr($key,$i+1,1)) << 8);
$cn |= 0x2020 if $case_fold;
} elsif ($keylen == 1) {
$cn = (ord(substr($key,$i+0,1)) << 0);
$cn |= 0x20 if $case_fold;
}
$result = ($result * $mult + $cn) % 4294967296;

return $result;
}



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: PostgreSQL pollutes the file system
Next
From: legrand legrand
Date:
Subject: Re: [survey] New "Stable" QueryId based on normalized query text