Thread: Improved scanner performance

Improved scanner performance

From: Peter Eisentraut
I've been poking at the scanner a bit using the large literal test case
from the other day

http://archives.postgresql.org/pgsql-hackers/2002-04/msg00811.php

I've been able to reduce the wall-clock run time of that test from 3:37
min to 2:13 min and the base_yylex() per-call time from 137ms to 24ms.

I've used 'flex -8 -CFa' and restructured the code to avoid looping over
and copying the input string half a dozen times.  For instance, instead of
scanstr(), the escape sequences are resolved as the input is scanned, and
instead of the myinput() routine I use the function yy_scan_buffer()
provided by flex for scanning in-memory strings.  (This would make the
code flex-dependent, but in reality it already is anyway.)

The "before" profile was:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 23.51      6.65     6.65      110    60.45   137.27  base_yylex
 23.19     13.21     6.56       11   596.36  1089.69  pq_getstring
 19.16     18.63     5.42 74882482     0.00     0.00  pq_getbyte
 14.99     22.87     4.24       11   385.45   385.46  scanstr
  9.61     25.59     2.72       23   118.26   118.26  yy_get_previous_state
  3.78     26.66     1.07       34    31.47    31.47  myinput
  3.64     27.69     1.03       22    46.82    46.82  textin
  1.48     28.11     0.42       34    12.35    43.82  yy_get_next_buffer
 

The "after" profile is:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 40.30      5.65     5.65       11   513.64   943.64  pq_getstring
 33.74     10.38     4.73 74882482     0.00     0.00  pq_getbyte
 18.90     13.03     2.65      110    24.09    24.09  base_yylex
  6.85     13.99     0.96       22    43.64    43.64  textin
  0.07     14.00     0.01       86     0.12     0.12  heap_fetch
 

-- 
Peter Eisentraut   peter_e@gmx.net



Re: Improved scanner performance

From: Tom Lane
Peter Eisentraut <peter_e@gmx.net> writes:
> I've used 'flex -8 -CFa' and restructured the code to avoid looping over
> and copying the input string half a dozen times.  For instance, instead of
> scanstr(), the escape sequences are resolved as the input is scanned, and
> instead of the myinput() routine I use the function yy_scan_buffer()
> provided by flex for scanning in-memory strings.  (This would make the
> code flex-dependent, but in reality it already is anyway.)

Yes, we've been requiring flex-only features for years, so that aspect
of it doesn't bother me.  Any downsides to the changes?  (For instance,
I had the idea that -CF would enlarge the lexer tables quite a bit ---
what's the change in executable size?)

> The "after" profile is:

>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  40.30      5.65     5.65       11   513.64   943.64  pq_getstring
>  33.74     10.38     4.73 74882482     0.00     0.00  pq_getbyte
>  18.90     13.03     2.65      110    24.09    24.09  base_yylex
>   6.85     13.99     0.96       22    43.64    43.64  textin
>   0.07     14.00     0.01       86     0.12     0.12  heap_fetch

Looks like inlining pq_getbyte into pq_getstring would be worth doing
too.
        regards, tom lane


Re: Improved scanner performance

From: Peter Eisentraut
Tom Lane writes:

> I had the idea that -CF would enlarge the lexer tables quite a bit ---
> what's the change in executable size?)

+150 kB

I've also looked at -CFe, which is supposedly the next slowest level, but
it doesn't do nearly as well.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: Improved scanner performance

From: Tom Lane
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> I had the idea that -CF would enlarge the lexer tables quite a bit ---
>> what's the change in executable size?)

> +150 kB

> I've also looked at -CFe, which is supposedly the next slowest level, but
> it doesn't do nearly as well.

Ouch; that sounds like about a ten percent increase in the size of
the backend executable.  That's enough to reach my threshold of pain;
is the long-literal issue worth that much?

How much of your reported improvement is due to -CFa, and how much to
the coding improvements you made?
        regards, tom lane


Re: Improved scanner performance

From: Peter Eisentraut
Tom Lane writes:

> Peter Eisentraut <peter_e@gmx.net> writes:
> > Tom Lane writes:
> >> I had the idea that -CF would enlarge the lexer tables quite a bit ---
> >> what's the change in executable size?)
>
> > +150 kB
>
> > I've also looked at -CFe, which is supposedly the next slowest level, but
> > it doesn't do nearly as well.
>
> Ouch; that sounds like about a ten percent increase in the size of
> the backend executable.  That's enough to reach my threshold of pain;
> is the long-literal issue worth that much?

Here's a breakdown of the postmaster file sizes and the wall-clock run
time of the long-literal test:

no options    1749912    1m58.688s
-CFe          1754315    1m49.223s
-CF           1817621    1m43.780s
-CFa          1890197    1m45.600s

(These numbers differ from yesterday's because they don't include
profiling and debugging overhead.)

Seeing this, I think -CF should be OK space and time-wise.

> How much of your reported improvement is due to -CFa, and how much to
> the coding improvements you made?

As I recall it, probably a third of the overall improvement came from
using -CF[a].

-- 
Peter Eisentraut   peter_e@gmx.net



Re: Improved scanner performance

From: Tom Lane
Peter Eisentraut <peter_e@gmx.net> writes:
> Here's a breakdown of the postmaster file sizes and the wall-clock run
> time of the long-literal test:

> no options    1749912    1m58.688s
> -CFe          1754315    1m49.223s
> -CF           1817621    1m43.780s
> -CFa          1890197    1m45.600s

> Seeing this, I think -CF should be OK space and time-wise.

Looks like a reasonable compromise to me too.
        regards, tom lane


Re: Improved scanner performance

From: Tom Lane
BTW, here is what I get on an HP box for the same test you described
(a dozen trivial SELECTs using string literals between 5MB and 10MB
long), using latest sources:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 47.51      8.19     8.19                             chunks
 26.16     12.70     4.51      129    34.96    35.97  base_yylex
 12.30     14.82     2.12     1521     1.39     1.39  strlen
  6.79     15.99     1.17       13    90.00    90.00  pq_getstring
  4.18     16.71     0.72                             chunk2
  2.55     17.15     0.44                             _recv_sys
  0.29     17.20     0.05                             _mcount
 

"chunks" is the inner loop of memcpy() --- evidently all the time is
being spent just copying those big literals around.

We could probably avoid some of that copying if we had a cleaner
approach to parsetree handling, ie, no scribbling on one's input.
Then operations like eval_const_expressions wouldn't feel compelled
to copy parsetree nodes that they weren't modifying.  But I think
we've gotten all the low-hanging fruit for now.

At least on the backend side.  Did you notice that psql was chewing
up three times more CPU than the backend in this test??
        regards, tom lane