Thread: Storing pg_stat_statements query texts externally, pg_stat_statements in core

Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
Attached patch allows pg_stat_statements to store arbitrarily long
query texts, using an external, transparently managed lookaside file.

This is of great practical benefit to certain types of users who need
to understand the execution costs of queries with excessively long
texts, often because the query was generated by an ORM. This is a
common complaint of Heroku customers, though I'm sure they're not
alone there.

As I mentioned on a previous occasion, since query fingerprinting was
added, the query text is nothing more than a way of displaying entries
to interested users. As such, it seems perfectly reasonable to store
those texts externally, out of shared memory - the function
pg_stat_statements() itself (which the view is defined as a call to)
isn't particularly performance sensitive. Creating a brand new
pg_stat_statements entry is already costly, since it necessitates an
exclusive lock that blocks everything, so the additional cost of
storing the query text at that point is assumed to be of marginal
concern. Better to have more entries available, so that eviction need
not happen very frequently - using far less memory per entry helps the
user to increase pg_stat_statements.max, making excessive exclusive
locking less likely. Having said all that, the code bends over
backwards to make sure that physical I/O is as unlikely as possible
during exclusive locking. There are numerous tricks employed where
cache pressure is a concern, which I won't go into, since they're all
well commented in the patch.
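
For a rough sense of the shape this takes, the shared memory entry
ends up holding just an offset and length into the external file,
rather than the text itself (the struct as it appears in the patch):

typedef struct pgssEntry
{
	pgssHashKey key;			/* hash key of entry - MUST BE FIRST */
	Counters	counters;		/* the statistics for this query */
	Size		query_offset;	/* query text offset in external file */
	int			query_len;		/* # of valid bytes in query string */
	int			encoding;		/* query text encoding */
	slock_t		mutex;			/* protects the counters only */
} pgssEntry;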

Maybe we should have a warning when eviction occurs too frequently? I
also think we might want to take another look at making the
normalization logic less strict about differentiating queries that a
reasonable person would consider equivalent. This is obviously rather
fuzzy, but I've had complaints about things like pg_stat_statements
being overly fussy in seeing row and array comparisons as distinct
from each other. That's most probably a discussion for another thread,
but it is informed by one of the same concerns - making expensive
cache pressure less likely.

This incidentally furthers the case for pg_stat_statements in core
(making it similar to pg_stat_user_functions - not turned on by
default, but easily available, even on busy production systems that
cannot afford a restart and didn't know they needed it until
performance tanked 5 minutes ago). The amount of shared memory used
(and therefore arguably wasted by reserving a little shared memory
just in case pg_stat_statements becomes useful) is greatly reduced.
That's really a separate discussion, though, which is why I haven't
done the additional work of integrating pg_stat_statements into core
here. Doing a restart to enable pg_stat_statements is, in my
experience, only acceptable infrequently - restarts are generally to
be avoided at all costs.

I can see a day when the query hash is actually a general query cache
identifier, at which time the query text will also be stored outside
of the pgss hash table proper.

Refactoring
===========

I've decided to remove the encoding id from the pgss entry key (it's
now just in the main entry struct) - I don't think it makes sense
there anymore. It is clearly no longer true that it's a "notational
convenience" (and hasn't been since 9.1). Like the query texts
themselves, it exists for the sole purpose of displaying statistics to
users, and as such cannot possibly result in incorrectly failing to
differentiate, or spuriously differentiating, two entries. Now, I
suppose that since we sometimes hash query texts directly (e.g.
utility statements), encoding might matter, assuming texts could vary
in encoding. But since the encoding invariably comes from the database
encoding anyway, and we always differentiate on databaseid to begin
with, I fail to see how that could possibly matter.
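
For reference, the key is now simply (per the patch):

typedef struct pgssHashKey
{
	Oid			userid;			/* user OID */
	Oid			dbid;			/* database OID */
	uint32		queryid;		/* query identifier */
} pgssHashKey;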

I've bumped PGSS_FILE_HEADER, just in case a pg_stat_statements temp
file survives a pg_upgrade. I think that some minor tweaks made by
Noah to pg_stat_statements (as part of commit b560ec1b) ought to have
necessitated doing so anyway, but no matter.

The role of time-series aggregation
===================================

Over on the min/max pg_stat_statements storage thread, I've stated
that I think the way forward is for third party tools to aggregate
deltas fairly aggressively - maybe even as aggressively as every 3
seconds. So I'm slightly concerned that I may be hampering a future
tool that needs to aggregate statistics very aggressively. For that
reason, we should probably provide an alternative function/view that
identifies pg_stat_statements entries by hashid + databaseid + userid
only, so that any overhead from reading big query strings from disk
with a shared lock held is eliminated.

Thoughts?
--
Peter Geoghegan

Attachment

Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Eisentraut
Date:
On 11/14/13, 3:17 AM, Peter Geoghegan wrote:
> Attached patch allows pg_stat_statements to store arbitrarily long
> query texts, using an external, transparently managed lookaside file.

Compiler warnings with fortify settings:

gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -fpic -I. -I.
-I../../src/include -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -c -o pg_stat_statements.o
pg_stat_statements.c -MMD -MP -MF .deps/pg_stat_statements.Po
 
pg_stat_statements.c: In function ‘entry_reset’:
pg_stat_statements.c:1827:11: warning: ignoring return value of ‘ftruncate’, declared with attribute warn_unused_result
[-Wunused-result]
pg_stat_statements.c: In function ‘garbage_collect_query_texts’:
pg_stat_statements.c:1779:11: warning: ignoring return value of ‘ftruncate’, declared with attribute warn_unused_result
[-Wunused-result]




Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Fri, Nov 15, 2013 at 8:51 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> Compiler warnings with fortify settings:

I'll post a revision with fixes for those. Another concern is that I
see some race conditions that only occur when the pg_stat_statements
cache is under a lot of pressure with a lot of concurrency, which
weren't revealed by my original testing. I'm working on a fix for that.
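
For those warnings, testing ftruncate()'s return value should do it -
a sketch only (the actual fix in the revision may differ):

/* check ftruncate()'s result instead of ignoring it */
if (ftruncate(fileno(qfile), 0) != 0)
	ereport(LOG,
			(errcode_for_file_access(),
			 errmsg("could not truncate pg_stat_statement file \"%s\": %m",
					PGSS_TEXT_FILE)));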


-- 
Peter Geoghegan



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Sat, Nov 16, 2013 at 4:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'll post a revision with fixes for those. Another concern is that I
> see some race conditions that only occur when the pg_stat_statements
> cache is under a lot of pressure with a lot of concurrency, which
> weren't revealed by my original testing. I'm working on a fix for that.

Revision attached.


--
Peter Geoghegan

Attachment

Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
With a pg_stat_statements.max of 1,000, on my system with the patch
applied the additional amount of shared memory required is only
192KiB. That compares with about 1.17MiB for the same setting in
master's version of pg_stat_statements (each entry previously embedded
up to track_activity_query_size bytes of query text in shared memory;
now it stores only an offset and a length). Surely with this huge
reduction we should revisit that default - I think that a default
setting of 5,000 for pg_stat_statements.max makes more sense.

entry_dealloc() requires an exclusive lock, which locks all queries
out of even recording their costs. With a lot of cache pressure this
could be really expensive. In my opinion that risk alone justifies the
change.

Without adding another GUC, we might even go so far as to have a
checkpoint_warning-style mechanism warn when entry_dealloc() calls
occur too frequently...
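
A minimal sketch of what that might look like (entirely hypothetical -
none of these names exist in the patch):

/* hypothetical: rate-limited complaint when evictions are frequent */
static TimestampTz last_dealloc_warning = 0;

static void
warn_frequent_dealloc(void)
{
	TimestampTz now = GetCurrentTimestamp();

	/* at most one warning per minute; the threshold is illustrative */
	if (TimestampDifferenceExceeds(last_dealloc_warning, now, 60 * 1000))
	{
		ereport(LOG,
				(errmsg("pg_stat_statements eviction is occurring too frequently"),
				 errhint("Consider increasing pg_stat_statements.max.")));
		last_dealloc_warning = now;
	}
}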

-- 
Peter Geoghegan


Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Fujii Masao
Date:
On Sun, Nov 17, 2013 at 1:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Nov 16, 2013 at 4:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I'll post a revision with fixes for those. Another concern is that I
>> see some race conditions that only occur when the pg_stat_statements
>> cache is under a lot of pressure with a lot of concurrency, which
>> weren't revealed by my original testing. I'm working on a fix for that.
>
> Revision attached.

The patch doesn't apply cleanly against master due to a recent change
to pg_stat_statements. Could you update the patch?

Regards,

-- 
Fujii Masao



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Sat, Dec 7, 2013 at 9:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> The patch doesn't apply cleanly against master due to a recent change
> to pg_stat_statements. Could you update the patch?

Revision is attached, including changes recently discussed on the
other thread [1] to allow avoiding the sending of query texts where
that's not desirable, and an increase in the default
pg_stat_statements.max to 5000.

I've found myself tweaking things a bit more here and there. I've
added additional clarification around why we do an in-place garbage
collection (i.e. why we don't just write a new file and atomically
rename -- grep "over-engineer" to find what I mean). I also fully
documented the pg_stat_statements function. If we want external tool
authors to use it to avoid sending query texts, they have to know
about it in the first place.

I now use the pg_stat_tmp directory for my temp file (so
pg_stat_statements behaves more like the statistics collector). The
stats_temp_directory GUC is not used, for reasons that a patch comment
explains.

I've fuzz-tested this throughout with the "make installcheck-parallel"
regression tests. I found it useful to run that in one terminal, while
running pgbench with multiple clients in another. The pgbench script
looks something like:

"""
select * from pg_stat_statements;
select pg_stat_statements_reset();
"""

pg_stat_statements_reset() was temporarily rigged to perform a garbage
collection (and not a reset) for the benefit of this stress-test. The
function pg_stat_statements(), the garbage collection function and
pgss_store() will complain loudly if they notice anything out of the
ordinary. I was not able to demonstrate any problems through this
technique (though I think it focused my thinking on the right areas
during development). Clearly, missed subtleties around managing the
external file while locking, so as to ensure correct behavior (that no
session can observe a state it shouldn't), will be something that a
committer will want to pay particular attention to. I go to some
lengths to avoid doing this with only a shared lock.

FYI, without hacking things up, it's quite hard to make external query
text garbage collection occur - need_gc_qtexts() will almost always
say no. With pg_stat_statements.max set to 1000, it happens
automatically only once when the standard regression tests are run.
That is a really extreme workload in terms of causing
pg_stat_statements churn, which shows just how unlikely garbage
collection is in the real world. Still, it ought to be totally robust
when it happens.
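
For the record, need_gc_qtexts() gates collection along these lines
(paraphrased sketch, not the verbatim code):

static bool
need_gc_qtexts(void)
{
	Size		extent;
	volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;

	SpinLockAcquire(&s->mutex);
	extent = s->extent;
	SpinLockRelease(&s->mutex);

	/* don't bother unless the file is large in absolute terms ... */
	if (extent < 512 * pgss_max)
		return false;
	/* ... and roughly 50% bloat relative to what live entries need */
	if (extent < pgss->mean_query_len * pgss_max * 2)
		return false;
	return true;
}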

In passing, I did this:

      /*
!      * Rename file into place, to atomically commit to serializing.
       */

Because at this juncture, there ought to be no file to replace (not
since Magnus had pg_stat_statements not keep the main global file
around for as long as the server is running, in the style of the
statistics collector).

Thanks

[1] http://www.postgresql.org/message-id/CAM3SWZSVFf-vBNC8sc-0CpWZxVQH49REYBcHb95t95fcrGSa6A@mail.gmail.com
--
Peter Geoghegan

Attachment

Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Mon, Dec 9, 2013 at 7:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I go to some lengths to avoid doing this with only a shared lock.

I should have said: I go to great lengths to do this with only a
shared lock, and not an exclusive (see gc_count stuff).


-- 
Peter Geoghegan


Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Sat, Dec 7, 2013 at 9:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> The patch doesn't apply cleanly against master due to a recent change
>> to pg_stat_statements. Could you update the patch?

> Revision is attached, including changes recently discussed on the
> other thread [1] to allow avoiding the sending of query texts where
> that's not desirable, and an increase in the default
> pg_stat_statements.max to 5000.

I see this is marked as ready for committer.  Where does it stand in
relation to the other long-running thread about "calls under-estimation
propagation"?  I was surprised to find that there isn't any CommitFest
entry linked to that thread, so I'm wondering if that proposal is
abandoned or what.  If it's not, is committing this going to blow up
that patch?

BTW, I'm also thinking that the "detected_version" kluge is about ready
to collapse of its own weight, or at least is clearly going to break in
future.  What we need to do is embed the API version in the C name of the
pg_stat_statements function instead.
        regards, tom lane



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Mon, Jan 20, 2014 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I see this is marked as ready for committer.  Where does it stand in
> relation to the other long-running thread about "calls under-estimation
> propagation"?  I was surprised to find that there isn't any CommitFest
> entry linked to that thread, so I'm wondering if that proposal is
> abandoned or what.  If it's not, is committing this going to blow up
> that patch?

I believe that proposal was withdrawn. I think the conclusion there
was that we should just expose queryid and be done with it. In a way,
exposing the queryid enabled this work, because it provides an
identifier that can be used instead of sending large query texts each
call.

> BTW, I'm also thinking that the "detected_version" kluge is about ready
> to collapse of its own weight, or at least is clearly going to break in
> future.  What we need to do is embed the API version in the C name of the
> pg_stat_statements function instead.

I agree that it isn't scalable.

-- 
Peter Geoghegan


Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Jan 20, 2014 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, I'm also thinking that the "detected_version" kluge is about ready
>> to collapse of its own weight, or at least is clearly going to break in
>> future.  What we need to do is embed the API version in the C name of the
>> pg_stat_statements function instead.

> I agree that it isn't scalable.

Yeah.  It was more or less tolerable as long as we were only using it in
connection with identifying the set of output columns, but using it to
identify the expected input arguments too seems darn rickety.  I'll
refactor so there are separate C call points for each supported API
version.  (Well, I guess it's too late to separate 1.0 and 1.1, but
we can make 1.2 and future versions separate.)
        regards, tom lane


Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Tom Lane
Date:
BTW, what is the value of using posix_fadvise() in warm_fcache?

ISTM that if you're worried about waiting for I/O while holding the
LWLock (as I concur you should be), this coding is entirely inadequate
for preventing that, as the whole point of posix_fadvise is that it
won't wait around for the I/O to happen.  It strains credulity to
the breaking point to imagine that the kernel will have gotten
around to reading the file in the small number of nanoseconds that
will elapse before you start needing the data.  Unless of course the file
is already in buffers, but then the whole thing was a waste of code.

I think we should flush the posix_fadvise() code path as being neither
useful nor portable, and just do the actual read() as in the other
path (which btw is lacking an essential fseek).

Actually, I think the whole thing is rather badly designed, as warm
cache or no you're still proposing to do thousands of kernel calls
while holding a potentially contended LWLock.  I suggest what we
do is (1) read in the whole file, (2) acquire the lock, (3)
re-read the whole file in the same buffer, (4) work from the buffer.
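
Roughly, with a hypothetical whole-file loader:

char	   *qbuffer;
Size		qbuffer_size;

qbuffer = qtext_load_file(&qbuffer_size);	/* (1) read the whole file */
LWLockAcquire(pgss->lock, LW_SHARED);		/* (2) acquire the lock */
free(qbuffer);
qbuffer = qtext_load_file(&qbuffer_size);	/* (3) re-read: no garbage
											 * collection can move texts
											 * while we hold the lock */
/* (4) resolve each entry's (offset, len) against qbuffer */
LWLockRelease(pgss->lock);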

It might be possible to optimize the re-read away in the common case
that the file didn't actually change since we started to read it;
we'd need to put a bit more reporting into qtext_store probably, so
that it's possible to tell if any space is reserved but still unwritten.
BTW shouldn't there be an fflush() in qtext_store?

We could also eliminate some palloc/pfree cycles by using the string
directly in the buffer (at the small cost of needing to include the
strings' trailing nulls in the file).
        regards, tom lane



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Tue, Jan 21, 2014 at 6:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, what is the value of using posix_fadvise() in warm_fcache?
>
> ISTM that if you're worried about waiting for I/O while holding the
> LWLock (as I concur you should be), this coding is entirely inadequate
> for preventing that, as the whole point of posix_fadvise is that it
> won't wait around for the I/O to happen.  It strains credulity to
> the breaking point to imagine that the kernel will have gotten
> around to reading the file in the small number of nanoseconds that
> will elapse before you start needing the data.  Unless of course the file
> is already in buffers, but then the whole thing was a waste of code.

Not necessarily. Under Linux for example, POSIX_FADV_SEQUENTIAL "sets
the readahead window to the default size for the backing device". That
seems like a useful thing, though I will admit that it's hard to
quantify how useful it is here. If it is useful, and furthermore if
it's more useful than just reading the file into a temp buffer (which,
I admit, is not clear), then why not do it where we can? Adopting
support for the non-portable posix_fadvise is a sunk cost.
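
The hint in question amounts to just this sort of call (which sizes up
readahead but does not itself wait for any I/O):

/* advise the kernel that we'll read the file sequentially */
(void) posix_fadvise(fileno(file), 0, 0, POSIX_FADV_SEQUENTIAL);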

> I think we should flush the posix_fadvise() code path as being neither
> useful nor portable, and just do the actual read() as in the other
> path

Perhaps.

> Actually, I think the whole thing is rather badly designed, as warm
> cache or no you're still proposing to do thousands of kernel calls
> while holding a potentially contended LWLock.  I suggest what we
> do is (1) read in the whole file, (2) acquire the lock, (3)
> re-read the whole file in the same buffer, (4) work from the buffer.

fseeko() and fread() are part of the standard library, not any kernel.
I don't believe that what I've done in qtext_fetch() implies thousands
of kernel calls, or thousands of context switches.

Now, using a buffer not managed by libc might be better, but I should
note what holding that LWLock actually represents. As you know, it's a
shared lock that does not block the updating of existing entries. It
will only block the creation of entirely new ones. That ought to be
very rare. You may recall that when I worked with you on
pg_stat_statements in the past, I characterized the rate of growth as
being logarithmic. Once the system has been running for half an hour,
new entries become very infrequent indeed. Admittedly, I may be
failing to consider a worst case there, but in the average case this
works fine. Encouraging automated tools to lazily retrieve query texts
is a big part of why I thought this might be the least worst thing.

It's not obvious to me that always avoiding blocking new entries while
performing physical I/O is important enough to warrant the additional
complexity of copying to a buffer first, and mostly working off that.
Also, you must consider that there may have been a garbage collection
in the interim, so you have to take the shared lock twice (once to
check the garbage collection count, and again later to guard against
your buffer having been invalidated by a garbage collection) instead
of just taking the shared lock once. I discussed this with Pavel,
actually. Taking the shared lock twice on every pg_stat_statements()
call doesn't seem very good either. Especially since you'll have to
open the file twice, *with* the shared lock held the second time
around, when your scheme misses new entries. Your scheme won't work
out under exactly the same circumstances that mine doesn't do too well
in, which is when there is a new entry to consider. When that doesn't
happen, holding a shared lock is not that important.
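
The gc_count dance amounts to roughly this (paraphrased sketch):

int			gc_before,
			gc_after;
volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;

SpinLockAcquire(&s->mutex);
gc_before = s->gc_count;		/* garbage collection cycle count */
SpinLockRelease(&s->mutex);

/* ... read the query text file into a local buffer, share locked ... */

SpinLockAcquire(&s->mutex);
gc_after = s->gc_count;
SpinLockRelease(&s->mutex);

if (gc_after != gc_before)
{
	/* a garbage collection moved the texts; our offsets are suspect */
}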

> It might be possible to optimize the re-read away in the common case
> that the file didn't actually change since we started to read it;
> we'd need to put a bit more reporting into qtext_store probably, so
> that it's possible to tell if any space is reserved but still unwritten.

So, to be clear, you'd still be reading from the file if that proved
necessary? I think you'll have to, because it seems useful that
pg_stat_statements offers a limited form of consistency, in that you
only see entries at a point in time, even if the figures themselves
are not exactly consistent. Maybe I've missed something, but it's not
obvious to me that you're any better off as compared to just using a
FILE/stream.

> BTW shouldn't there be an fflush() in qtext_store?

I don't think so, no. Whenever qtext_store() callers don't fflush()
before releasing their exclusive lock, they FreeFile() before doing so
instead. In the case of pgss_shmem_startup() there is no lock held
anyway, because it doesn't matter, for the same reason it doesn't
matter that we're modifying shmem structures in a way that ordinarily
necessitates an exclusive lock. Why fflush() every qtext_store() call
if there is expected to be more than one, as is the case for several
callers?

FreeFile() calls fclose(). fclose() is described by the standard as
flushing its stream. It is my understanding that calling fflush() on a
stream immediately prior to calling fclose() on the same stream is
always superfluous.
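
A trivial illustration of the point:

#include <stdio.h>

/* fclose() flushes any buffered data before closing the stream, so an
 * fflush() immediately beforehand buys nothing */
static int
write_and_close(FILE *fp, const char *buf, size_t len)
{
	size_t		written = fwrite(buf, 1, len, fp);

	/* fflush(fp);	-- superfluous: fclose() implies a flush */
	if (fclose(fp) != 0 || written != len)
		return -1;
	return 0;
}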

> We could also eliminate some palloc/pfree cycles by using the string
> directly in the buffer (at the small cost of needing to include the
> strings' trailing nulls in the file).

It might be useful to do that anyway, for sanity checking purposes. So
+1 from me.

-- 
Peter Geoghegan



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Not necessarily. Under Linux for example, POSIX_FADV_SEQUENTIAL "sets
> the readahead window to the default size for the backing device"

Excuse me; I meant to put "POSIX_FADV_SEQUENTIAL doubles this size
[default size for the backing device]".

-- 
Peter Geoghegan



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Actually, I think the whole thing is rather badly designed, as warm
>> cache or no you're still proposing to do thousands of kernel calls
>> while holding a potentially contended LWLock.  I suggest what we
>> do is (1) read in the whole file, (2) acquire the lock, (3)
>> re-read the whole file in the same buffer, (4) work from the buffer.
>
> fseeko() and fread() are part of the standard library, not any kernel.
> I don't believe that what I've done in qtext_fetch() implies thousands
> of kernel calls, or thousands of context switches.

I ran an strace on a pg_stat_statements backend with a full ~5,000
entries. It seems that glibc is actually content to always make
lseek() and read() syscalls in the event of an fseek(), even though it
does maintain its own buffer and could in principle have a lot more
gumption about that [1]. I agree that we should copy everything into a
buffer in one go.

I think that making the LWLocking duration as brief as possible isn't
critically important. Automated tools will only be interested in new
query texts (implying possible I/O while share locking) as they
observe new entries. But new entries are the only thing worth
considering that may potentially block on that shared lock (though not
as they write out their query texts, which almost always requires just
a shared lock). New entries ought to be a very rare occurrence
compared to the execution of queries - certainly, on the busy
production systems I use pg_stat_statements on, it may be weeks before
a new query is observed (uh, with the possible exception of *my*
ad-hoc pg_stat_statements queries). On my laptop, "explain analyze
select * from pg_stat_statements" indicates a total runtime of ~9ms
with 5,000 regression test entries, even with the (still modest)
overhead of doing all those read()/lseek() calls. So we're talking
about a small impact on a new entry, which will only be affected once,
and which was always going to have to write out its query text to file
anyway. On top of all this, in general it's very improbable that any
given new entry will be affected at all, because two unlikely things
need to happen at the same time for that to be possible.

The most important aspect of keeping the overhead of
pg_stat_statements low is that there not be too much cache pressure.
Obviously the huge reduction in its shared memory footprint that this
patch allows helps with that.

[1] http://www.pixelbeat.org/programming/stdio_buffering/
-- 
Peter Geoghegan



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> BTW shouldn't there be an fflush() in qtext_store?
>
> I don't think so, no. Whenever qtext_store() callers don't fflush()
> before releasing their exclusive lock, they FreeFile() before doing so
> instead. In the case of pgss_shmem_startup() there is no lock held
> anyway, because it doesn't matter, for the same reason it doesn't
> matter that we're modifying shmem structures in a way that ordinarily
> necessitates an exclusive lock. Why fflush() every qtext_store() call
> if there is expected to be more than one, as is the case for several
> callers?

Having said all that, it occurs to me now that I could have been
smarter about where I fflush(). I'm pretty sure pgss_store() should
have two fflush() calls. The first can occur with just a shared LWLock
held, before the lock strength promotion (you probably noticed that I
no longer generate a normalized query text in that interim, for
reasons that are obvious). The second fflush() need occur only in the
very unlikely event of a garbage collection necessitating a new
qtext_store() call with an exclusive lock held, a few lines after the
first fflush(). There's no need to fflush() with the exclusive lock
held the vast majority of the time.


-- 
Peter Geoghegan


Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Tue, Jan 21, 2014 at 8:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> Actually, I think the whole thing is rather badly designed, as warm
>>> cache or no you're still proposing to do thousands of kernel calls
>>> while holding a potentially contended LWLock.  I suggest what we
>>> do is (1) read in the whole file, (2) acquire the lock, (3)
>>> re-read the whole file in the same buffer, (4) work from the buffer.

>> fseeko() and fread() are part of the standard library, not any kernel.
>> I don't believe that what I've done in qtext_fetch() implies thousands
>> of kernel calls, or thousands of context switches.

> I ran an strace on a pg_stat_statements backend with a full ~5,000
> entries. It seems that glibc is actually content to always make
> lseek() and read() syscalls in the event of an fseek(), even though it
> does maintain its own buffer and could in principle have a lot more
> gumption about that [1]. I agree that we should copy everything into a
> buffer in one go.

Yeah, that was what I thought might happen.  Even if glibc were smarter,
a lot of other libc implementations aren't.  Also, ISTM that traversal
of the hash table will generally lead to more-or-less-random access into
the text file.  So unless you assume that libc has mmap'd the whole
file, it'd have to do a lot of kernel calls anyway to get different pages
of the file into its buffer at different times.

I think that "read the whole file" is probably going to net out not any
more complex as far as our code goes, and it'll be making a lot fewer
assumptions about how smart the libc level is and what's happening at
the kernel level.  I'll have a go at coding it, anyway.

> I think that making the LWLocking duration as brief as possible isn't
> critically important.

Maybe, but I don't like the idea that calling pg_stat_statements()
frequently could result in random glitches in response time for other
queries.

> Automated tools will only be interested in new
> query texts (implying possible I/O while share locking) as they
> observe new entries.

This argument would be more plausible if there were a way to fetch
the text for a previously-observed entry without invoking
pg_stat_statements().  I'm surprised you've not submitted a patch
to add such a function.
        regards, tom lane


Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> I ran an strace on a pg_stat_statements backend with a full ~5,000
> entries.

BTW, have you got any sort of test scenario you could share for this
purpose?  I'm sure I could build something, but if you've already
done it ...
        regards, tom lane



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Wed, Jan 22, 2014 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, have you got any sort of test scenario you could share for this
> purpose?  I'm sure I could build something, but if you've already
> done it ...

I simply ran the standard regression tests, and then straced a backend
as it executed the pgss-calling query. I'm not sure that I've
understood your question.

As previously noted, my approach to testing this patch involved variations of:

running "make installcheck-parallel"

Concurrently, running pgbench with multiple clients each calling
pg_stat_statements() and pg_stat_statements_reset(). The latter was
sometimes rigged to do a direct garbage collection to stress test
things. My expectation was that the sanity checking done everywhere
would complain if there were any race conditions or other bugs. This
was pretty effective as a smoke test during development.

-- 
Peter Geoghegan


Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Wed, Jan 22, 2014 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, have you got any sort of test scenario you could share for this
>> purpose?  I'm sure I could build something, but if you've already
>> done it ...

> I simply ran the standard regression tests, and then straced a backend
> as it executed the pgss-calling query. I'm not sure that I've
> understood your question.

Hm, are there 5K distinct queries in the regression tests?  I'd have
thought they weren't enough to fill a large pgss hash.
        regards, tom lane



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Wed, Jan 22, 2014 at 1:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm, are there 5K distinct queries in the regression tests?  I'd have
> thought they weren't enough to fill a large pgss hash.

Well, yes. They're a really bad case for pg_stat_statements, in that
almost all distinct queries are executed only once or twice, and yet
there are enough of them to fill the hash table. I think this is
generally useful for testing.


-- 
Peter Geoghegan


Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Wed, Jan 22, 2014 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, have you got any sort of test scenario you could share for this
>> purpose?  I'm sure I could build something, but if you've already
>> done it ...

> I simply ran the standard regression tests, and then straced a backend
> as it executed the pgss-calling query. I'm not sure that I've
> understood your question.

> As previously noted, my approach to testing this patch involved variations of:
> running "make installcheck-parallel"

Do the regression tests fail for you when doing this?

What I see is a duplicate occurrence of an escape_string_warning bleat:

*** /home/postgres/pgsql/src/test/regress/expected/plpgsql.out  Fri Jan  3 17:07:46 2014
--- /home/postgres/pgsql/src/test/regress/results/plpgsql.out   Sat Jan 25 13:37:20 2014
***************
*** 4568,4573 ****
--- 4568,4579 ----
  HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
  QUERY:  SELECT 'foo\\bar\041baz'
  CONTEXT:  PL/pgSQL function strtest() line 4 at RETURN
 
+ WARNING:  nonstandard use of \\ in a string literal
+ LINE 1: SELECT 'foo\\bar\041baz'
+                ^
+ HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
+ QUERY:  SELECT 'foo\\bar\041baz'
+ CONTEXT:  PL/pgSQL function strtest() line 4 at RETURN
    strtest   
  -------------
   foo\bar!baz

======================================================================

which seems to happen because generate_normalized_query() reruns the
core lexer on the given statement, and if you got a warning the first
time, you'll get another one.

It seems this only happens with rather small values of
pg_stat_statements.max, else there's already a "RETURN text-constant"
query in the hashtable so we don't need to do reparsing right here.
(I'm testing with max = 100 to exercise the garbage collection code.)

I'm not sure this is a big deal for real-world use, because surely by now
everyone has either fixed their escape-string issues or disabled the
warning.  But it's annoying for testing.

The big picture of course is that having this warning issued this way
is broken anyhow, and has been since day one, so it's not entirely
pg_stat_statements' fault.  I'm just wondering if it's worth taking
the trouble to turn off the warning when reparsing.
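
If we did, perhaps something as blunt as this around the re-lex would
suffice (hypothetical; escape_string_warning is the underlying GUC
variable):

/* hypothetical: suppress duplicate warnings while re-running the lexer */
bool		save_esw = escape_string_warning;

escape_string_warning = false;
/* ... rerun the core lexer to build the normalized query text ... */
escape_string_warning = save_esw;
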
        regards, tom lane


Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Tom Lane
Date:
I wrote:
> I think that "read the whole file" is probably going to net out not any
> more complex as far as our code goes, and it'll be making a lot fewer
> assumptions about how smart the libc level is and what's happening at
> the kernel level.  I'll have a go at coding it, anyway.

Attached is a revised patch that does the I/O that way, and also makes
substantial cleanups in the error handling.  (The previous patch did
questionable things like resetting the shared hash table while callers
were actively iterating through it, and the error logging left a lot
to be desired as well.)  I've chosen to handle failures to load query
text data by just returning NULL for that query text, which seems
reasonable given the new mindset that the query text is auxiliary data
less important than the actual counts.

I've not done much more than smoke-testing on this; I'm thinking about
hot-wiring the code to sometimes force it through the error paths,
just to make sure they act as expected.  I've also not done any
real performance testing.

            regards, tom lane

diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
index 74aa561..74ae438 100644
*** a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
--- b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
*************** DROP VIEW pg_stat_statements;
*** 12,18 ****
  DROP FUNCTION pg_stat_statements();

  /* Now redefine */
! CREATE FUNCTION pg_stat_statements(
      OUT userid oid,
      OUT dbid oid,
      OUT queryid bigint,
--- 12,18 ----
  DROP FUNCTION pg_stat_statements();

  /* Now redefine */
! CREATE FUNCTION pg_stat_statements(IN showtext boolean,
      OUT userid oid,
      OUT dbid oid,
      OUT queryid bigint,
*************** CREATE FUNCTION pg_stat_statements(
*** 34,43 ****
      OUT blk_write_time float8
  )
  RETURNS SETOF record
! AS 'MODULE_PATHNAME'
! LANGUAGE C;

  CREATE VIEW pg_stat_statements AS
!   SELECT * FROM pg_stat_statements();

  GRANT SELECT ON pg_stat_statements TO PUBLIC;
--- 34,43 ----
      OUT blk_write_time float8
  )
  RETURNS SETOF record
! AS 'MODULE_PATHNAME', 'pg_stat_statements_1_2'
! LANGUAGE C STRICT VOLATILE;

  CREATE VIEW pg_stat_statements AS
!   SELECT * FROM pg_stat_statements(true);

  GRANT SELECT ON pg_stat_statements TO PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql
index 80b74a1..5bfa9a5 100644
*** a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql
--- b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql
*************** RETURNS void
*** 9,15 ****
  AS 'MODULE_PATHNAME'
  LANGUAGE C;

! CREATE FUNCTION pg_stat_statements(
      OUT userid oid,
      OUT dbid oid,
      OUT queryid bigint,
--- 9,15 ----
  AS 'MODULE_PATHNAME'
  LANGUAGE C;

! CREATE FUNCTION pg_stat_statements(IN showtext boolean,
      OUT userid oid,
      OUT dbid oid,
      OUT queryid bigint,
*************** CREATE FUNCTION pg_stat_statements(
*** 31,42 ****
      OUT blk_write_time float8
  )
  RETURNS SETOF record
! AS 'MODULE_PATHNAME'
! LANGUAGE C;

  -- Register a view on the function for ease of use.
  CREATE VIEW pg_stat_statements AS
!   SELECT * FROM pg_stat_statements();

  GRANT SELECT ON pg_stat_statements TO PUBLIC;

--- 31,42 ----
      OUT blk_write_time float8
  )
  RETURNS SETOF record
! AS 'MODULE_PATHNAME', 'pg_stat_statements_1_2'
! LANGUAGE C STRICT VOLATILE;

  -- Register a view on the function for ease of use.
  CREATE VIEW pg_stat_statements AS
!   SELECT * FROM pg_stat_statements(true);

  GRANT SELECT ON pg_stat_statements TO PUBLIC;

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 2f069b7..538f6e9 100644
*** a/contrib/pg_stat_statements/pg_stat_statements.c
--- b/contrib/pg_stat_statements/pg_stat_statements.c
***************
*** 26,37 ****
--- 26,51 ----
   * tree(s) generated from the query.  The executor can then use this value
   * to blame query costs on the proper queryId.
   *
+  * To facilitate presenting entries to users, we create "representative" query
+  * strings in which constants are replaced with '?' characters, to make it
+  * clearer what a normalized entry can represent.  To save on shared memory,
+  * and to avoid having to truncate oversized query strings, we store these
+  * strings in a temporary external query-texts file.  Offsets into this
+  * file are kept in shared memory.
+  *
   * Note about locking issues: to create or delete an entry in the shared
   * hashtable, one must hold pgss->lock exclusively.  Modifying any field
   * in an entry except the counters requires the same.  To look up an entry,
   * one must hold the lock shared.  To read or update the counters within
   * an entry, one must hold the lock shared or exclusive (so the entry doesn't
   * disappear!) and also take the entry's mutex spinlock.
+  * The shared state variable pgss->extent (the next free spot in the external
+  * query-text file) should be accessed only while holding either the
+  * pgss->mutex spinlock, or exclusive lock on pgss->lock.  We use the mutex to
+  * allow reserving file space while holding only shared lock on pgss->lock.
+  * Rewriting the entire external query-text file, eg for garbage collection,
+  * requires holding pgss->lock exclusively; this allows individual entries
+  * in the file to be read or written while holding only shared lock.
   *
   *
   * Copyright (c) 2008-2014, PostgreSQL Global Development Group
***************
*** 43,48 ****
--- 57,63 ----
   */
  #include "postgres.h"

+ #include <sys/stat.h>
  #include <unistd.h>

  #include "access/hash.h"
***************
*** 53,59 ****
  #include "parser/analyze.h"
  #include "parser/parsetree.h"
  #include "parser/scanner.h"
- #include "pgstat.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
  #include "storage/spin.h"
--- 68,73 ----
***************
*** 63,73 ****

  PG_MODULE_MAGIC;

! /* Location of stats file */
  #define PGSS_DUMP_FILE    "global/pg_stat_statements.stat"

! /* This constant defines the magic number in the stats file header */
! static const uint32 PGSS_FILE_HEADER = 0x20131115;
  /* PostgreSQL major version number, changes in which invalidate all entries */
  static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;

--- 77,98 ----

  PG_MODULE_MAGIC;

! /* Location of permanent stats file (valid when database is shut down) */
  #define PGSS_DUMP_FILE    "global/pg_stat_statements.stat"

! /*
!  * Location of external query text file.  We don't keep it in the core
!  * system's stats_temp_directory.  The core system can safely use that GUC
!  * setting, because the statistics collector temp file paths are set only once
!  * as part of changing the GUC, but pg_stat_statements has no way of avoiding
!  * race conditions.  Besides, we only expect modest, infrequent I/O for query
!  * strings, so placing the file on a faster filesystem is not compelling.
!  */
! #define PGSS_TEXT_FILE    "pg_stat_tmp/pgss_query_texts.stat"
!
! /* Magic number identifying the stats file format */
! static const uint32 PGSS_FILE_HEADER = 0x20140125;
!
  /* PostgreSQL major version number, changes in which invalidate all entries */
  static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;

*************** static const uint32 PGSS_PG_MAJOR_VERSIO
*** 75,80 ****
--- 100,106 ----
  #define USAGE_EXEC(duration)    (1.0)
  #define USAGE_INIT                (1.0)    /* including initial planning */
  #define ASSUMED_MEDIAN_INIT        (10.0)    /* initial assumed median usage */
+ #define ASSUMED_LENGTH_INIT        1024    /* initial assumed mean query length */
  #define USAGE_DECREASE_FACTOR    (0.99)    /* decreased every entry_dealloc */
  #define STICKY_DECREASE_FACTOR    (0.50)    /* factor for sticky entries */
  #define USAGE_DEALLOC_PERCENT    5        /* free this % of entries at once */
*************** typedef enum pgssVersion
*** 94,109 ****
  /*
   * Hashtable key that defines the identity of a hashtable entry.  We separate
   * queries by user and by database even if they are otherwise identical.
-  *
-  * Presently, the query encoding is fully determined by the source database
-  * and so we don't really need it to be in the key.  But that might not always
-  * be true. Anyway it's notationally convenient to pass it as part of the key.
   */
  typedef struct pgssHashKey
  {
      Oid            userid;            /* user OID */
      Oid            dbid;            /* database OID */
-     int            encoding;        /* query encoding */
      uint32        queryid;        /* query identifier */
  } pgssHashKey;

--- 120,130 ----
*************** typedef struct Counters
*** 133,148 ****
  /*
   * Statistics per statement
   *
!  * NB: see the file read/write code before changing field order here.
   */
  typedef struct pgssEntry
  {
      pgssHashKey key;            /* hash key of entry - MUST BE FIRST */
      Counters    counters;        /* the statistics for this query */
      int            query_len;        /* # of valid bytes in query string */
      slock_t        mutex;            /* protects the counters only */
-     char        query[1];        /* VARIABLE LENGTH ARRAY - MUST BE LAST */
-     /* Note: the allocated length of query[] is actually pgss->query_size */
  } pgssEntry;

  /*
--- 154,171 ----
  /*
   * Statistics per statement
   *
!  * Note: in event of a failure in garbage collection of the query text file,
!  * we reset query_offset to zero and query_len to -1.  This will be seen as
!  * an invalid state by qtext_fetch().
   */
  typedef struct pgssEntry
  {
      pgssHashKey key;            /* hash key of entry - MUST BE FIRST */
      Counters    counters;        /* the statistics for this query */
+     Size        query_offset;    /* query text offset in external file */
      int            query_len;        /* # of valid bytes in query string */
+     int            encoding;        /* query text encoding */
      slock_t        mutex;            /* protects the counters only */
  } pgssEntry;

  /*
*************** typedef struct pgssEntry
*** 151,158 ****
  typedef struct pgssSharedState
  {
      LWLockId    lock;            /* protects hashtable search/modification */
-     int            query_size;        /* max query length in bytes */
      double        cur_median_usage;        /* current median usage in hashtable */
  } pgssSharedState;

  /*
--- 174,185 ----
  typedef struct pgssSharedState
  {
      LWLockId    lock;            /* protects hashtable search/modification */
      double        cur_median_usage;        /* current median usage in hashtable */
+     Size        mean_query_len; /* current mean entry text length */
+     slock_t        mutex;            /* protects following fields only: */
+     Size        extent;            /* current extent of query file */
+     int            n_writers;        /* number of active writers to query file */
+     int            gc_count;        /* query file garbage collection cycle count */
  } pgssSharedState;

  /*
*************** static bool pgss_save;            /* whether to s
*** 231,245 ****
--- 258,282 ----
      (pgss_track == PGSS_TRACK_ALL || \
      (pgss_track == PGSS_TRACK_TOP && nested_level == 0))

+ #define record_gc_qtexts() \
+     do { \
+         volatile pgssSharedState *s = (volatile pgssSharedState *) pgss; \
+         SpinLockAcquire(&s->mutex); \
+         s->gc_count++; \
+         SpinLockRelease(&s->mutex); \
+     } while(0)
+
  /*---- Function declarations ----*/

  void        _PG_init(void);
  void        _PG_fini(void);

  Datum        pg_stat_statements_reset(PG_FUNCTION_ARGS);
+ Datum        pg_stat_statements_1_2(PG_FUNCTION_ARGS);
  Datum        pg_stat_statements(PG_FUNCTION_ARGS);

  PG_FUNCTION_INFO_V1(pg_stat_statements_reset);
+ PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
  PG_FUNCTION_INFO_V1(pg_stat_statements);

  static void pgss_shmem_startup(void);
*************** static void pgss_store(const char *query
*** 261,270 ****
             double total_time, uint64 rows,
             const BufferUsage *bufusage,
             pgssJumbleState *jstate);
  static Size pgss_memsize(void);
! static pgssEntry *entry_alloc(pgssHashKey *key, const char *query,
!             int query_len, bool sticky);
  static void entry_dealloc(void);
  static void entry_reset(void);
  static void AppendJumble(pgssJumbleState *jstate,
               const unsigned char *item, Size size);
--- 298,317 ----
             double total_time, uint64 rows,
             const BufferUsage *bufusage,
             pgssJumbleState *jstate);
+ static void pg_stat_statements_internal(FunctionCallInfo fcinfo,
+                             pgssVersion api_version,
+                             bool showtext);
  static Size pgss_memsize(void);
! static pgssEntry *entry_alloc(pgssHashKey *key, Size query_offset, int query_len,
!             int encoding, bool sticky);
  static void entry_dealloc(void);
+ static bool qtext_store(const char *query, int query_len,
+             Size *query_offset, int *gc_count);
+ static char *qtext_load_file(Size *buffer_size);
+ static char *qtext_fetch(Size query_offset, int query_len,
+             char *buffer, Size buffer_size);
+ static bool need_gc_qtexts(void);
+ static void gc_qtexts(void);
  static void entry_reset(void);
  static void AppendJumble(pgssJumbleState *jstate,
               const unsigned char *item, Size size);
*************** _PG_init(void)
*** 302,308 ****
        "Sets the maximum number of statements tracked by pg_stat_statements.",
                              NULL,
                              &pgss_max,
!                             1000,
                              100,
                              INT_MAX,
                              PGC_POSTMASTER,
--- 349,355 ----
        "Sets the maximum number of statements tracked by pg_stat_statements.",
                              NULL,
                              &pgss_max,
!                             5000,
                              100,
                              INT_MAX,
                              PGC_POSTMASTER,
*************** _PG_fini(void)
*** 393,410 ****
  /*
   * shmem_startup hook: allocate or attach to shared memory,
   * then load any pre-existing statistics from file.
   */
  static void
  pgss_shmem_startup(void)
  {
      bool        found;
      HASHCTL        info;
!     FILE       *file;
      uint32        header;
      int32        num;
      int32        pgver;
      int32        i;
-     int            query_size;
      int            buffer_size;
      char       *buffer = NULL;

--- 440,459 ----
  /*
   * shmem_startup hook: allocate or attach to shared memory,
   * then load any pre-existing statistics from file.
+  * Also create and load the query-texts file, which is expected to exist
+  * (even if empty) while the module is enabled.
   */
  static void
  pgss_shmem_startup(void)
  {
      bool        found;
      HASHCTL        info;
!     FILE       *file = NULL;
!     FILE       *qfile = NULL;
      uint32        header;
      int32        num;
      int32        pgver;
      int32        i;
      int            buffer_size;
      char       *buffer = NULL;

*************** pgss_shmem_startup(void)
*** 428,443 ****
      {
          /* First time through ... */
          pgss->lock = LWLockAssign();
-         pgss->query_size = pgstat_track_activity_query_size;
          pgss->cur_median_usage = ASSUMED_MEDIAN_INIT;
      }

-     /* Be sure everyone agrees on the hash table entry size */
-     query_size = pgss->query_size;
-
      memset(&info, 0, sizeof(info));
      info.keysize = sizeof(pgssHashKey);
!     info.entrysize = offsetof(pgssEntry, query) +query_size;
      info.hash = pgss_hash_fn;
      info.match = pgss_match_fn;
      pgss_hash = ShmemInitHash("pg_stat_statements hash",
--- 477,493 ----
      {
          /* First time through ... */
          pgss->lock = LWLockAssign();
          pgss->cur_median_usage = ASSUMED_MEDIAN_INIT;
+         pgss->mean_query_len = ASSUMED_LENGTH_INIT;
+         SpinLockInit(&pgss->mutex);
+         pgss->extent = 0;
+         pgss->n_writers = 0;
+         pgss->gc_count = 0;
      }

      memset(&info, 0, sizeof(info));
      info.keysize = sizeof(pgssHashKey);
!     info.entrysize = sizeof(pgssEntry);
      info.hash = pgss_hash_fn;
      info.match = pgss_match_fn;
      pgss_hash = ShmemInitHash("pg_stat_statements hash",
*************** pgss_shmem_startup(void)
*** 455,479 ****
          on_shmem_exit(pgss_shmem_shutdown, (Datum) 0);

      /*
!      * Attempt to load old statistics from the dump file, if this is the first
!      * time through and we weren't told not to.
       */
!     if (found || !pgss_save)
          return;

      /*
       * Note: we don't bother with locks here, because there should be no other
       * processes running when this code is reached.
       */
      file = AllocateFile(PGSS_DUMP_FILE, PG_BINARY_R);
      if (file == NULL)
      {
!         if (errno == ENOENT)
!             return;                /* ignore not-found error */
          goto error;
      }

!     buffer_size = query_size;
      buffer = (char *) palloc(buffer_size);

      if (fread(&header, sizeof(uint32), 1, file) != 1 ||
--- 505,561 ----
          on_shmem_exit(pgss_shmem_shutdown, (Datum) 0);

      /*
!      * Done if some other process already completed our initialization.
       */
!     if (found)
          return;

      /*
       * Note: we don't bother with locks here, because there should be no other
       * processes running when this code is reached.
       */
+
+     /* Unlink query text file possibly left over from crash */
+     unlink(PGSS_TEXT_FILE);
+
+     /* Allocate new query text temp file */
+     qfile = AllocateFile(PGSS_TEXT_FILE, PG_BINARY_W);
+     if (qfile == NULL)
+     {
+         ereport(LOG,
+                 (errcode_for_file_access(),
+                  errmsg("could not create pg_stat_statement file \"%s\": %m",
+                         PGSS_TEXT_FILE)));
+         goto fail;
+     }
+
+     /*
+      * If we were told not to load old statistics, we're done.  (Note we do
+      * not try to unlink any old dump file in this case.  This seems a bit
+      * questionable but it's the historical behavior.)
+      */
+     if (!pgss_save)
+     {
+         FreeFile(qfile);
+         return;
+     }
+
+     /*
+      * Attempt to load old statistics from the dump file.
+      */
      file = AllocateFile(PGSS_DUMP_FILE, PG_BINARY_R);
      if (file == NULL)
      {
!         if (errno == ENOENT)    /* ignore not-found error */
!         {
!             /* No existing persisted stats file, so we're done */
!             FreeFile(qfile);
!             return;
!         }
          goto error;
      }

!     buffer_size = 2048;
      buffer = (char *) palloc(buffer_size);

      if (fread(&header, sizeof(uint32), 1, file) != 1 ||
*************** pgss_shmem_startup(void)
*** 487,524 ****
      {
          pgssEntry    temp;
          pgssEntry  *entry;

!         if (fread(&temp, offsetof(pgssEntry, mutex), 1, file) != 1)
              goto error;

          /* Encoding is the only field we can easily sanity-check */
!         if (!PG_VALID_BE_ENCODING(temp.key.encoding))
              goto error;

!         /* Previous incarnation might have had a larger query_size */
          if (temp.query_len >= buffer_size)
          {
!             buffer = (char *) repalloc(buffer, temp.query_len + 1);
!             buffer_size = temp.query_len + 1;
          }

!         if (fread(buffer, 1, temp.query_len, file) != temp.query_len)
              goto error;
          buffer[temp.query_len] = '\0';

          /* Skip loading "sticky" entries */
          if (temp.counters.calls == 0)
              continue;

!         /* Clip to available length if needed */
!         if (temp.query_len >= query_size)
!             temp.query_len = pg_encoding_mbcliplen(temp.key.encoding,
!                                                    buffer,
!                                                    temp.query_len,
!                                                    query_size - 1);

          /* make the hashtable entry (discards old entries if too many) */
!         entry = entry_alloc(&temp.key, buffer, temp.query_len, false);

          /* copy in the actual stats */
          entry->counters = temp.counters;
--- 569,616 ----
      {
          pgssEntry    temp;
          pgssEntry  *entry;
+         Size        query_offset;

!         if (fread(&temp, sizeof(pgssEntry), 1, file) != 1)
              goto error;

          /* Encoding is the only field we can easily sanity-check */
!         if (!PG_VALID_BE_ENCODING(temp.encoding))
              goto error;

!         /* Resize buffer as needed */
          if (temp.query_len >= buffer_size)
          {
!             buffer_size = Max(buffer_size * 2, temp.query_len + 1);
!             buffer = repalloc(buffer, buffer_size);
          }

!         if (fread(buffer, 1, temp.query_len + 1, file) != temp.query_len + 1)
              goto error;
+
+         /* Should have a trailing null, but let's make sure */
          buffer[temp.query_len] = '\0';

          /* Skip loading "sticky" entries */
          if (temp.counters.calls == 0)
              continue;

!         /* Store the query text */
!         query_offset = pgss->extent;
!         if (fwrite(buffer, 1, temp.query_len + 1, qfile) != temp.query_len + 1)
!         {
!             ereport(LOG,
!                     (errcode_for_file_access(),
!                   errmsg("could not write pg_stat_statement file \"%s\": %m",
!                          PGSS_TEXT_FILE)));
!             goto fail;
!         }
!         pgss->extent += temp.query_len + 1;

          /* make the hashtable entry (discards old entries if too many) */
!         entry = entry_alloc(&temp.key, query_offset, temp.query_len,
!                             temp.encoding,
!                             false);

          /* copy in the actual stats */
          entry->counters = temp.counters;
*************** pgss_shmem_startup(void)
*** 526,535 ****

      pfree(buffer);
      FreeFile(file);

      /*
!      * Remove the file so it's not included in backups/replication slaves,
!      * etc. A new file will be written on next shutdown.
       */
      unlink(PGSS_DUMP_FILE);

--- 618,637 ----

      pfree(buffer);
      FreeFile(file);
+     FreeFile(qfile);

      /*
!      * Remove the persisted stats file so it's not included in
!      * backups/replication slaves, etc.  A new file will be written on next
!      * shutdown.
!      *
!      * Note: it's okay if the PGSS_TEXT_FILE is included in a basebackup,
!      * because we remove that file on startup; it acts inversely to
!      * PGSS_DUMP_FILE, in that it is only supposed to be around when the
!      * server is running, whereas PGSS_DUMP_FILE is only supposed to be around
!      * when the server is not running.    Leaving the file creates no danger of
!      * a newly restored database having a spurious record of execution costs,
!      * which is what we're really concerned about here.
       */
      unlink(PGSS_DUMP_FILE);
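
To keep the two files straight, the lifecycle that the comment above
describes can be summarized as follows (a restatement for the reader,
not text from the patch):

    /*
     * PGSS_DUMP_FILE: exists only while the server is stopped.  Written
     * by pgss_shmem_shutdown(), then consumed and unlinked by
     * pgss_shmem_startup().
     *
     * PGSS_TEXT_FILE: exists only while the server is running.  Created
     * during startup, appended to by qtext_store(), compacted in place
     * by gc_qtexts(), and unlinked again at shutdown.
     */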

*************** error:
*** 540,551 ****
--- 642,661 ----
              (errcode_for_file_access(),
               errmsg("could not read pg_stat_statement file \"%s\": %m",
                      PGSS_DUMP_FILE)));
+ fail:
      if (buffer)
          pfree(buffer);
      if (file)
          FreeFile(file);
+     if (qfile)
+         FreeFile(qfile);
      /* If possible, throw away the bogus file; ignore any error */
      unlink(PGSS_DUMP_FILE);
+
+     /*
+      * Don't unlink PGSS_TEXT_FILE here; it should always be around while the
+      * server is running with pg_stat_statements enabled
+      */
  }

  /*
*************** static void
*** 558,563 ****
--- 668,675 ----
  pgss_shmem_shutdown(int code, Datum arg)
  {
      FILE       *file;
+     char       *qbuffer = NULL;
+     Size        qbuffer_size = 0;
      HASH_SEQ_STATUS hash_seq;
      int32        num_entries;
      pgssEntry  *entry;
*************** pgss_shmem_shutdown(int code, Datum arg)
*** 586,601 ****
      if (fwrite(&num_entries, sizeof(int32), 1, file) != 1)
          goto error;

      hash_seq_init(&hash_seq, pgss_hash);
      while ((entry = hash_seq_search(&hash_seq)) != NULL)
      {
          int            len = entry->query_len;

!         if (fwrite(entry, offsetof(pgssEntry, mutex), 1, file) != 1 ||
!             fwrite(entry->query, 1, len, file) != len)
              goto error;
      }

      if (FreeFile(file))
      {
          file = NULL;
--- 698,733 ----
      if (fwrite(&num_entries, sizeof(int32), 1, file) != 1)
          goto error;

+     qbuffer = qtext_load_file(&qbuffer_size);
+     if (qbuffer == NULL)
+         goto error;
+
+     /*
+      * When serializing to disk, we store query texts immediately after their
+      * entry data.    Any orphaned query texts are thereby excluded.
+      */
      hash_seq_init(&hash_seq, pgss_hash);
      while ((entry = hash_seq_search(&hash_seq)) != NULL)
      {
          int            len = entry->query_len;
+         char       *qstr = qtext_fetch(entry->query_offset, len,
+                                        qbuffer, qbuffer_size);

!         if (qstr == NULL)
!             continue;            /* Ignore any entries with bogus texts */
!
!         if (fwrite(entry, sizeof(pgssEntry), 1, file) != 1 ||
!             fwrite(qstr, 1, len + 1, file) != len + 1)
!         {
!             /* note: we assume hash_seq_term won't change errno */
!             hash_seq_term(&hash_seq);
              goto error;
+         }
      }

+     free(qbuffer);
+     qbuffer = NULL;
+
      if (FreeFile(file))
      {
          file = NULL;
*************** pgss_shmem_shutdown(int code, Datum arg)
*** 603,609 ****
      }

      /*
!      * Rename file into place, so we atomically replace the old one.
       */
      if (rename(PGSS_DUMP_FILE ".tmp", PGSS_DUMP_FILE) != 0)
          ereport(LOG,
--- 735,741 ----
      }

      /*
!      * Rename file into place, so we atomically replace any old one.
       */
      if (rename(PGSS_DUMP_FILE ".tmp", PGSS_DUMP_FILE) != 0)
          ereport(LOG,
*************** pgss_shmem_shutdown(int code, Datum arg)
*** 611,616 ****
--- 743,751 ----
                   errmsg("could not rename pg_stat_statement file \"%s\": %m",
                          PGSS_DUMP_FILE ".tmp")));

+     /* Unlink query-texts file; it's not needed while the server is down */
+     unlink(PGSS_TEXT_FILE);
+
      return;

  error:
*************** error:
*** 618,626 ****
--- 753,764 ----
              (errcode_for_file_access(),
               errmsg("could not write pg_stat_statement file \"%s\": %m",
                      PGSS_DUMP_FILE ".tmp")));
+     if (qbuffer)
+         free(qbuffer);
      if (file)
          FreeFile(file);
      unlink(PGSS_DUMP_FILE ".tmp");
+     unlink(PGSS_TEXT_FILE);
  }
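
The serialization code above implies a simple layout for PGSS_DUMP_FILE:
a uint32 magic header, an int32 entry count, and then each pgssEntry
struct image followed immediately by its null-terminated query text.  A
minimal standalone reader might look like this (a sketch only -- the
file name is hypothetical, error handling is omitted, and it assumes
the pgssEntry definition is in scope and layout-compatible):

    FILE       *f = fopen("pg_stat_statements.stat", "rb");
    uint32      magic;
    int32       nentries;
    int32       i;

    if (f &&
        fread(&magic, sizeof(uint32), 1, f) == 1 &&
        fread(&nentries, sizeof(int32), 1, f) == 1)
    {
        for (i = 0; i < nentries; i++)
        {
            pgssEntry   temp;
            int         c;

            if (fread(&temp, sizeof(pgssEntry), 1, f) != 1)
                break;
            /* query text follows at once, trailing '\0' included */
            while ((c = fgetc(f)) != EOF && c != '\0')
                putchar(c);
            putchar('\n');
        }
    }
    if (f)
        fclose(f);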

  /*
*************** pgss_hash_fn(const void *key, Size keysi
*** 916,922 ****
  {
      const pgssHashKey *k = (const pgssHashKey *) key;

-     /* we don't bother to include encoding in the hash */
      return hash_uint32((uint32) k->userid) ^
          hash_uint32((uint32) k->dbid) ^
          hash_uint32((uint32) k->queryid);
--- 1054,1059 ----
*************** pgss_match_fn(const void *key1, const vo
*** 933,939 ****

      if (k1->userid == k2->userid &&
          k1->dbid == k2->dbid &&
-         k1->encoding == k2->encoding &&
          k1->queryid == k2->queryid)
          return 0;
      else
--- 1070,1075 ----
*************** pgss_store(const char *query, uint32 que
*** 967,972 ****
--- 1103,1110 ----
      pgssHashKey key;
      pgssEntry  *entry;
      char       *norm_query = NULL;
+     int            encoding = GetDatabaseEncoding();
+     int            query_len;

      Assert(query != NULL);

*************** pgss_store(const char *query, uint32 que
*** 974,983 ****
      if (!pgss || !pgss_hash)
          return;

      /* Set up key for hashtable search */
      key.userid = GetUserId();
      key.dbid = MyDatabaseId;
-     key.encoding = GetDatabaseEncoding();
      key.queryid = queryId;

      /* Lookup the hash table entry with shared lock. */
--- 1112,1122 ----
      if (!pgss || !pgss_hash)
          return;

+     query_len = strlen(query);
+
      /* Set up key for hashtable search */
      key.userid = GetUserId();
      key.dbid = MyDatabaseId;
      key.queryid = queryId;

      /* Lookup the hash table entry with shared lock. */
*************** pgss_store(const char *query, uint32 que
*** 988,1032 ****
      /* Create new entry, if not present */
      if (!entry)
      {
!         int            query_len;

          /*
!          * We'll need exclusive lock to make a new entry.  There is no point
!          * in holding shared lock while we normalize the string, though.
           */
-         LWLockRelease(pgss->lock);
-
-         query_len = strlen(query);
-
          if (jstate)
          {
!             /* Normalize the string if enabled */
              norm_query = generate_normalized_query(jstate, query,
                                                     &query_len,
!                                                    key.encoding);

!             /* Acquire exclusive lock as required by entry_alloc() */
!             LWLockAcquire(pgss->lock, LW_EXCLUSIVE);

!             entry = entry_alloc(&key, norm_query, query_len, true);
!         }
!         else
!         {
!             /*
!              * We're just going to store the query string as-is; but we have
!              * to truncate it if over-length.
!              */
!             if (query_len >= pgss->query_size)
!                 query_len = pg_encoding_mbcliplen(key.encoding,
!                                                   query,
!                                                   query_len,
!                                                   pgss->query_size - 1);

!             /* Acquire exclusive lock as required by entry_alloc() */
!             LWLockAcquire(pgss->lock, LW_EXCLUSIVE);

!             entry = entry_alloc(&key, query, query_len, false);
!         }
      }

      /* Increment the counts, except when jstate is not NULL */
--- 1127,1190 ----
      /* Create new entry, if not present */
      if (!entry)
      {
!         Size        query_offset;
!         int            gc_count;
!         bool        stored;
!         bool        do_gc;

          /*
!          * Create a new, normalized query string if caller asked.  We don't
!          * need to hold the lock while doing this work.  (Note: in any case,
!          * it's possible that someone else creates a duplicate hashtable entry
!          * in the interval where we don't hold the lock below.  That case is
!          * handled by entry_alloc.)
           */
          if (jstate)
          {
!             LWLockRelease(pgss->lock);
              norm_query = generate_normalized_query(jstate, query,
                                                     &query_len,
!                                                    encoding);
!             LWLockAcquire(pgss->lock, LW_SHARED);
!         }

!         /* Append new query text to file with only shared lock held */
!         stored = qtext_store(norm_query ? norm_query : query, query_len,
!                              &query_offset, &gc_count);

!         /*
!          * Determine whether we need to garbage collect external query texts
!          * while the shared lock is still held.  This micro-optimization
!          * avoids taking the time to decide this while holding exclusive lock.
!          */
!         do_gc = need_gc_qtexts();

!         /* Need exclusive lock to make a new hashtable entry - promote */
!         LWLockRelease(pgss->lock);
!         LWLockAcquire(pgss->lock, LW_EXCLUSIVE);

!         /*
!          * A garbage collection may have occurred while we weren't holding the
!          * lock.  In the unlikely event that this happens, the query text we
!          * stored above will have been garbage collected, so write it again.
!          * This should be infrequent enough that doing it while holding
!          * exclusive lock isn't a performance problem.
!          */
!         if (!stored || pgss->gc_count != gc_count)
!             stored = qtext_store(norm_query ? norm_query : query, query_len,
!                                  &query_offset, NULL);
!
!         /* If we failed to write to the text file, give up */
!         if (!stored)
!             goto done;
!
!         /* OK to create a new hashtable entry */
!         entry = entry_alloc(&key, query_offset, query_len, encoding,
!                             jstate != NULL);
!
!         /* If needed, perform garbage collection while exclusive lock held */
!         if (do_gc)
!             gc_qtexts();
      }

      /* Increment the counts, except when jstate is not NULL */
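
The locking dance in the hunk above deserves a restatement in isolation,
since it is the heart of the patch's concurrency story.  In outline (a
condensed paraphrase of the code, not new logic):

    /* slow work happens under shared lock only */
    stored = qtext_store(text, text_len, &offset, &gc_count_seen);
    do_gc = need_gc_qtexts();

    /* LWLocks cannot be upgraded in place: drop and retake */
    LWLockRelease(pgss->lock);
    LWLockAcquire(pgss->lock, LW_EXCLUSIVE);

    /*
     * Anything may have happened while no lock was held.  If a garbage
     * collection ran, the text written above is gone, so store it again,
     * this time safely under exclusive lock.
     */
    if (!stored || pgss->gc_count != gc_count_seen)
        stored = qtext_store(text, text_len, &offset, NULL);
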
*************** pgss_store(const char *query, uint32 que
*** 1064,1072 ****
          SpinLockRelease(&e->mutex);
      }

      LWLockRelease(pgss->lock);

!     /* We postpone this pfree until we're out of the lock */
      if (norm_query)
          pfree(norm_query);
  }
--- 1222,1231 ----
          SpinLockRelease(&e->mutex);
      }

+ done:
      LWLockRelease(pgss->lock);

!     /* We postpone this clean-up until we're out of the lock */
      if (norm_query)
          pfree(norm_query);
  }
*************** pg_stat_statements_reset(PG_FUNCTION_ARG
*** 1085,1100 ****
      PG_RETURN_VOID();
  }

  #define PG_STAT_STATEMENTS_COLS_V1_0    14
  #define PG_STAT_STATEMENTS_COLS_V1_1    18
! #define PG_STAT_STATEMENTS_COLS            19

  /*
   * Retrieve statement statistics.
   */
  Datum
  pg_stat_statements(PG_FUNCTION_ARGS)
  {
      ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
      TupleDesc    tupdesc;
      Tuplestorestate *tupstore;
--- 1244,1294 ----
      PG_RETURN_VOID();
  }

+ /* Number of output arguments (columns) for various API versions */
  #define PG_STAT_STATEMENTS_COLS_V1_0    14
  #define PG_STAT_STATEMENTS_COLS_V1_1    18
! #define PG_STAT_STATEMENTS_COLS_V1_2    19
! #define PG_STAT_STATEMENTS_COLS            19        /* maximum of above */

  /*
   * Retrieve statement statistics.
+  *
+  * The SQL API of this function has changed multiple times, and will likely
+  * do so again in future.  To support the case where a newer version of this
+  * loadable module is being used with an old SQL declaration of the function,
+  * we continue to support the older API versions.  For 1.2 and later, the
+  * expected API version is identified by embedding it in the C name of the
+  * function.  Unfortunately we weren't bright enough to do that for 1.1.
+  */
+ Datum
+ pg_stat_statements_1_2(PG_FUNCTION_ARGS)
+ {
+     bool        showtext = PG_GETARG_BOOL(0);
+
+     pg_stat_statements_internal(fcinfo, PGSS_V1_2, showtext);
+
+     return (Datum) 0;
+ }
+
+ /*
+  * Legacy entry point for pg_stat_statements() API versions 1.0 and 1.1.
+  * This can be removed someday, perhaps.
   */
  Datum
  pg_stat_statements(PG_FUNCTION_ARGS)
  {
+     /* If it's really API 1.1, we'll figure that out below */
+     pg_stat_statements_internal(fcinfo, PGSS_V1_0, true);
+
+     return (Datum) 0;
+ }
+
+ /* Common code for all versions of pg_stat_statements() */
+ static void
+ pg_stat_statements_internal(FunctionCallInfo fcinfo,
+                             pgssVersion api_version,
+                             bool showtext)
+ {
      ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
      TupleDesc    tupdesc;
      Tuplestorestate *tupstore;
*************** pg_stat_statements(PG_FUNCTION_ARGS)
*** 1102,1111 ****
      MemoryContext oldcontext;
      Oid            userid = GetUserId();
      bool        is_superuser = superuser();
      HASH_SEQ_STATUS hash_seq;
      pgssEntry  *entry;
-     pgssVersion detected_version;

      if (!pgss || !pgss_hash)
          ereport(ERROR,
                  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
--- 1296,1309 ----
      MemoryContext oldcontext;
      Oid            userid = GetUserId();
      bool        is_superuser = superuser();
+     char       *qbuffer = NULL;
+     Size        qbuffer_size = 0;
+     Size        extent = 0;
+     int            gc_count = 0;
      HASH_SEQ_STATUS hash_seq;
      pgssEntry  *entry;

+     /* hash table must exist already */
      if (!pgss || !pgss_hash)
          ereport(ERROR,
                  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
*************** pg_stat_statements(PG_FUNCTION_ARGS)
*** 1122,1149 ****
                   errmsg("materialize mode required, but it is not " \
                          "allowed in this context")));

      /* Build a tuple descriptor for our result type */
      if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
          elog(ERROR, "return type must be a row type");

      switch (tupdesc->natts)
      {
          case PG_STAT_STATEMENTS_COLS_V1_0:
!             detected_version = PGSS_V1_0;
              break;
          case PG_STAT_STATEMENTS_COLS_V1_1:
!             detected_version = PGSS_V1_1;
              break;
!         case PG_STAT_STATEMENTS_COLS:
!             detected_version = PGSS_V1_2;
              break;
          default:
!             elog(ERROR, "pgss version unrecognized from tuple descriptor");
      }

-     per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
-     oldcontext = MemoryContextSwitchTo(per_query_ctx);
-
      tupstore = tuplestore_begin_heap(true, false, work_mem);
      rsinfo->returnMode = SFRM_Materialize;
      rsinfo->setResult = tupstore;
--- 1320,1358 ----
                   errmsg("materialize mode required, but it is not " \
                          "allowed in this context")));

+     /* Switch into long-lived context to construct returned data structures */
+     per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+     oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
      /* Build a tuple descriptor for our result type */
      if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
          elog(ERROR, "return type must be a row type");

+     /*
+      * Check we have the expected number of output arguments.  Aside from
+      * being a good safety check, we need a kluge here to detect API version
+      * 1.1, which was wedged into the code in an ill-considered way.
+      */
      switch (tupdesc->natts)
      {
          case PG_STAT_STATEMENTS_COLS_V1_0:
!             if (api_version != PGSS_V1_0)
!                 elog(ERROR, "incorrect number of output arguments");
              break;
          case PG_STAT_STATEMENTS_COLS_V1_1:
!             /* pg_stat_statements() should have told us 1.0 */
!             if (api_version != PGSS_V1_0)
!                 elog(ERROR, "incorrect number of output arguments");
!             api_version = PGSS_V1_1;
              break;
!         case PG_STAT_STATEMENTS_COLS_V1_2:
!             if (api_version != PGSS_V1_2)
!                 elog(ERROR, "incorrect number of output arguments");
              break;
          default:
!             elog(ERROR, "incorrect number of output arguments");
      }

      tupstore = tuplestore_begin_heap(true, false, work_mem);
      rsinfo->returnMode = SFRM_Materialize;
      rsinfo->setResult = tupstore;
*************** pg_stat_statements(PG_FUNCTION_ARGS)
*** 1151,1158 ****
--- 1360,1430 ----

      MemoryContextSwitchTo(oldcontext);

+     /*
+      * We'd like to load the query text file (if needed) while not holding any
+      * lock on pgss->lock.    In the worst case we'll have to do this again
+      * after we have the lock, but it's unlikely enough to make this a win
+      * despite occasional duplicated work.    We need to reload if anybody
+      * writes to the file (either a retail qtext_store(), or a garbage
+      * collection) between this point and where we've gotten shared lock.  If
+      * a qtext_store is actually in progress when we look, we might as well
+      * skip the speculative load entirely.
+      */
+     if (showtext)
+     {
+         int            n_writers;
+
+         /* Take the mutex so we can examine variables */
+         {
+             volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;
+
+             SpinLockAcquire(&s->mutex);
+             extent = s->extent;
+             n_writers = s->n_writers;
+             gc_count = s->gc_count;
+             SpinLockRelease(&s->mutex);
+         }
+
+         /* No point in loading file now if there are active writers */
+         if (n_writers == 0)
+             qbuffer = qtext_load_file(&qbuffer_size);
+     }
+
+     /*
+      * Get shared lock, load or reload the query text file if we must, and
+      * iterate over the hashtable entries.
+      *
+      * With a large hash table, we might be holding the lock rather longer
+      * than one could wish.  However, this only blocks creation of new hash
+      * table entries, and the larger the hash table the less likely that is to
+      * be needed.  So we can hope this is okay.  Perhaps someday we'll decide
+      * we need to partition the hash table to limit the time spent holding any
+      * one lock.
+      */
      LWLockAcquire(pgss->lock, LW_SHARED);

+     if (showtext)
+     {
+         /*
+          * Here it is safe to examine extent and gc_count without taking the
+          * mutex.  Note that although other processes might change
+          * pgss->extent just after we look at it, the strings they then write
+          * into the file cannot yet be referenced in the hashtable, so we
+          * don't care whether we see them or not.
+          *
+          * If qtext_load_file fails, we just press on; we'll return NULL for
+          * every query text.
+          */
+         if (qbuffer == NULL ||
+             pgss->extent != extent ||
+             pgss->gc_count != gc_count)
+         {
+             if (qbuffer)
+                 free(qbuffer);
+             qbuffer = qtext_load_file(&qbuffer_size);
+         }
+     }
+
      hash_seq_init(&hash_seq, pgss_hash);
      while ((entry = hash_seq_search(&hash_seq)) != NULL)
      {
*************** pg_stat_statements(PG_FUNCTION_ARGS)
*** 1170,1195 ****

          if (is_superuser || entry->key.userid == userid)
          {
!             char       *qstr;
!
!             if (detected_version >= PGSS_V1_2)
                  values[i++] = Int64GetDatumFast(queryid);

!             qstr = (char *)
!                 pg_do_encoding_conversion((unsigned char *) entry->query,
!                                           entry->query_len,
!                                           entry->key.encoding,
!                                           GetDatabaseEncoding());
!             values[i++] = CStringGetTextDatum(qstr);
!             if (qstr != entry->query)
!                 pfree(qstr);
          }
          else
          {
!             if (detected_version >= PGSS_V1_2)
                  nulls[i++] = true;

!             values[i++] = CStringGetTextDatum("<insufficient privilege>");
          }

          /* copy counters to a local variable to keep locking time short */
--- 1442,1498 ----

          if (is_superuser || entry->key.userid == userid)
          {
!             if (api_version >= PGSS_V1_2)
                  values[i++] = Int64GetDatumFast(queryid);

!             if (showtext)
!             {
!                 char       *qstr = qtext_fetch(entry->query_offset,
!                                                entry->query_len,
!                                                qbuffer,
!                                                qbuffer_size);
!
!                 if (qstr)
!                 {
!                     char       *enc;
!
!                     enc = (char *)
!                         pg_do_encoding_conversion((unsigned char *) qstr,
!                                                   entry->query_len,
!                                                   entry->encoding,
!                                                   GetDatabaseEncoding());
!
!                     values[i++] = CStringGetTextDatum(enc);
!
!                     if (enc != qstr)
!                         pfree(enc);
!                 }
!                 else
!                 {
!                     /* Just return a null if we fail to find the text */
!                     nulls[i++] = true;
!                 }
!             }
!             else
!             {
!                 /* Query text not requested */
!                 nulls[i++] = true;
!             }
          }
          else
          {
!             /* Don't show queryid */
!             if (api_version >= PGSS_V1_2)
                  nulls[i++] = true;

!             /*
!              * Don't show query text, but hint as to the reason for not doing
!              * so if it was requested
!              */
!             if (showtext)
!                 values[i++] = CStringGetTextDatum("<insufficient privilege>");
!             else
!                 nulls[i++] = true;
          }

          /* copy counters to a local variable to keep locking time short */
*************** pg_stat_statements(PG_FUNCTION_ARGS)
*** 1210,1246 ****
          values[i++] = Int64GetDatumFast(tmp.rows);
          values[i++] = Int64GetDatumFast(tmp.shared_blks_hit);
          values[i++] = Int64GetDatumFast(tmp.shared_blks_read);
!         if (detected_version >= PGSS_V1_1)
              values[i++] = Int64GetDatumFast(tmp.shared_blks_dirtied);
          values[i++] = Int64GetDatumFast(tmp.shared_blks_written);
          values[i++] = Int64GetDatumFast(tmp.local_blks_hit);
          values[i++] = Int64GetDatumFast(tmp.local_blks_read);
!         if (detected_version >= PGSS_V1_1)
              values[i++] = Int64GetDatumFast(tmp.local_blks_dirtied);
          values[i++] = Int64GetDatumFast(tmp.local_blks_written);
          values[i++] = Int64GetDatumFast(tmp.temp_blks_read);
          values[i++] = Int64GetDatumFast(tmp.temp_blks_written);
!         if (detected_version >= PGSS_V1_1)
          {
              values[i++] = Float8GetDatumFast(tmp.blk_read_time);
              values[i++] = Float8GetDatumFast(tmp.blk_write_time);
          }

!         Assert(i == (detected_version == PGSS_V1_0?
!                          PG_STAT_STATEMENTS_COLS_V1_0:
!                      detected_version == PGSS_V1_1?
!                          PG_STAT_STATEMENTS_COLS_V1_1:
!                      PG_STAT_STATEMENTS_COLS));

          tuplestore_putvalues(tupstore, tupdesc, values, nulls);
      }

      LWLockRelease(pgss->lock);

!     /* clean up and return the tuplestore */
!     tuplestore_donestoring(tupstore);

!     return (Datum) 0;
  }

  /*
--- 1513,1549 ----
          values[i++] = Int64GetDatumFast(tmp.rows);
          values[i++] = Int64GetDatumFast(tmp.shared_blks_hit);
          values[i++] = Int64GetDatumFast(tmp.shared_blks_read);
!         if (api_version >= PGSS_V1_1)
              values[i++] = Int64GetDatumFast(tmp.shared_blks_dirtied);
          values[i++] = Int64GetDatumFast(tmp.shared_blks_written);
          values[i++] = Int64GetDatumFast(tmp.local_blks_hit);
          values[i++] = Int64GetDatumFast(tmp.local_blks_read);
!         if (api_version >= PGSS_V1_1)
              values[i++] = Int64GetDatumFast(tmp.local_blks_dirtied);
          values[i++] = Int64GetDatumFast(tmp.local_blks_written);
          values[i++] = Int64GetDatumFast(tmp.temp_blks_read);
          values[i++] = Int64GetDatumFast(tmp.temp_blks_written);
!         if (api_version >= PGSS_V1_1)
          {
              values[i++] = Float8GetDatumFast(tmp.blk_read_time);
              values[i++] = Float8GetDatumFast(tmp.blk_write_time);
          }

!         Assert(i == (api_version == PGSS_V1_0 ? PG_STAT_STATEMENTS_COLS_V1_0 :
!                      api_version == PGSS_V1_1 ? PG_STAT_STATEMENTS_COLS_V1_1 :
!                      api_version == PGSS_V1_2 ? PG_STAT_STATEMENTS_COLS_V1_2 :
!                      -1 /* fail if you forget to update this assert */ ));

          tuplestore_putvalues(tupstore, tupdesc, values, nulls);
      }

+     /* clean up and return the tuplestore */
      LWLockRelease(pgss->lock);

!     if (qbuffer)
!         free(qbuffer);

!     tuplestore_donestoring(tupstore);
  }

  /*
*************** static Size
*** 1250,1260 ****
  pgss_memsize(void)
  {
      Size        size;
-     Size        entrysize;

      size = MAXALIGN(sizeof(pgssSharedState));
!     entrysize = offsetof(pgssEntry, query) +pgstat_track_activity_query_size;
!     size = add_size(size, hash_estimate_size(pgss_max, entrysize));

      return size;
  }
--- 1553,1561 ----
  pgss_memsize(void)
  {
      Size        size;

      size = MAXALIGN(sizeof(pgssSharedState));
!     size = add_size(size, hash_estimate_size(pgss_max, sizeof(pgssEntry)));

      return size;
  }
*************** pgss_memsize(void)
*** 1277,1283 ****
   * have made the entry while we waited to get exclusive lock.
   */
  static pgssEntry *
! entry_alloc(pgssHashKey *key, const char *query, int query_len, bool sticky)
  {
      pgssEntry  *entry;
      bool        found;
--- 1578,1585 ----
   * have made the entry while we waited to get exclusive lock.
   */
  static pgssEntry *
! entry_alloc(pgssHashKey *key, Size query_offset, int query_len, int encoding,
!             bool sticky)
  {
      pgssEntry  *entry;
      bool        found;
*************** entry_alloc(pgssHashKey *key, const char
*** 1299,1309 ****
          entry->counters.usage = sticky ? pgss->cur_median_usage : USAGE_INIT;
          /* re-initialize the mutex each time ... we assume no one using it */
          SpinLockInit(&entry->mutex);
!         /* ... and don't forget the query text */
!         Assert(query_len >= 0 && query_len < pgss->query_size);
          entry->query_len = query_len;
!         memcpy(entry->query, query, query_len);
!         entry->query[query_len] = '\0';
      }

      return entry;
--- 1601,1611 ----
          entry->counters.usage = sticky ? pgss->cur_median_usage : USAGE_INIT;
          /* re-initialize the mutex each time ... we assume no one using it */
          SpinLockInit(&entry->mutex);
!         /* ... and don't forget the query text metadata */
!         Assert(query_len >= 0);
!         entry->query_offset = query_offset;
          entry->query_len = query_len;
!         entry->encoding = encoding;
      }

      return entry;
*************** entry_dealloc(void)
*** 1338,1343 ****
--- 1640,1646 ----
      pgssEntry  *entry;
      int            nvictims;
      int            i;
+     Size        totlen = 0;

      /*
       * Sort entries by usage and deallocate USAGE_DEALLOC_PERCENT of them.
*************** entry_dealloc(void)
*** 1357,1369 ****
              entry->counters.usage *= STICKY_DECREASE_FACTOR;
          else
              entry->counters.usage *= USAGE_DECREASE_FACTOR;
      }

      qsort(entries, i, sizeof(pgssEntry *), entry_cmp);

-     /* Also, record the (approximate) median usage */
      if (i > 0)
          pgss->cur_median_usage = entries[i / 2]->counters.usage;

      nvictims = Max(10, i * USAGE_DEALLOC_PERCENT / 100);
      nvictims = Min(nvictims, i);
--- 1660,1678 ----
              entry->counters.usage *= STICKY_DECREASE_FACTOR;
          else
              entry->counters.usage *= USAGE_DECREASE_FACTOR;
+         /* Accumulate total size, too. */
+         totlen += entry->query_len + 1;
      }

      qsort(entries, i, sizeof(pgssEntry *), entry_cmp);

      if (i > 0)
+     {
+         /* Record the (approximate) median usage */
          pgss->cur_median_usage = entries[i / 2]->counters.usage;
+         /* Record the mean query length */
+         pgss->mean_query_len = totlen / i;
+     }

      nvictims = Max(10, i * USAGE_DEALLOC_PERCENT / 100);
      nvictims = Min(nvictims, i);
*************** entry_dealloc(void)
*** 1377,1382 ****
--- 1686,2078 ----
  }

  /*
+  * Given a null-terminated string, allocate a new entry in the external query
+  * text file and store the string there.
+  *
+  * Although we could compute the string length via strlen(), callers already
+  * have it handy, so we require them to pass it too.
+  *
+  * If successful, returns true, and stores the new entry's offset in the file
+  * into *query_offset.    Also, if gc_count isn't NULL, *gc_count is set to the
+  * number of garbage collections that have occurred so far.
+  *
+  * On failure, returns false.
+  *
+  * At least a shared lock on pgss->lock must be held by the caller, so as
+  * to prevent a concurrent garbage collection.    Share-lock-holding callers
+  * should pass a gc_count pointer to obtain the number of garbage collections,
+  * so that they can recheck the count after obtaining exclusive lock to
+  * detect whether a garbage collection occurred (and removed this entry).
+  */
+ static bool
+ qtext_store(const char *query, int query_len,
+             Size *query_offset, int *gc_count)
+ {
+     Size        off;
+     int            fd;
+
+     /*
+      * We use a spinlock to protect extent/n_writers/gc_count, so that
+      * multiple processes may execute this function concurrently.
+      */
+     {
+         volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;
+
+         SpinLockAcquire(&s->mutex);
+         off = s->extent;
+         s->extent += query_len + 1;
+         s->n_writers++;
+         if (gc_count)
+             *gc_count = s->gc_count;
+         SpinLockRelease(&s->mutex);
+     }
+
+     *query_offset = off;
+
+     /* Now write the data into the successfully-reserved part of the file */
+     fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDWR | O_CREAT | PG_BINARY,
+                            S_IRUSR | S_IWUSR);
+     if (fd < 0)
+         goto error;
+
+     if (lseek(fd, off, SEEK_SET) != off)
+         goto error;
+
+     if (write(fd, query, query_len + 1) != query_len + 1)
+         goto error;
+
+     CloseTransientFile(fd);
+
+     /* Mark our write complete */
+     {
+         volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;
+
+         SpinLockAcquire(&s->mutex);
+         s->n_writers--;
+         SpinLockRelease(&s->mutex);
+     }
+
+     return true;
+
+ error:
+     ereport(LOG,
+             (errcode_for_file_access(),
+              errmsg("could not write pg_stat_statement file \"%s\": %m",
+                     PGSS_TEXT_FILE)));
+
+     if (fd >= 0)
+         CloseTransientFile(fd);
+
+     /* Mark our write complete */
+     {
+         volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;
+
+         SpinLockAcquire(&s->mutex);
+         s->n_writers--;
+         SpinLockRelease(&s->mutex);
+     }
+
+     return false;
+ }
+
+ /*
+  * Read the external query text file into a malloc'd buffer.
+  *
+  * Returns NULL (without throwing an error) if unable to read, eg
+  * file not there or insufficient memory.
+  *
+  * On success, the buffer size is also returned into *buffer_size.
+  *
+  * This can be called without any lock on pgss->lock, but in that case
+  * the caller is responsible for verifying that the result is sane.
+  */
+ static char *
+ qtext_load_file(Size *buffer_size)
+ {
+     char       *buf;
+     int            fd;
+     struct stat stat;
+
+     fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDONLY | PG_BINARY, 0);
+     if (fd < 0)
+     {
+         if (errno != ENOENT)
+             ereport(LOG,
+                     (errcode_for_file_access(),
+                    errmsg("could not read pg_stat_statement file \"%s\": %m",
+                           PGSS_TEXT_FILE)));
+         return NULL;
+     }
+
+     /* Get file length */
+     if (fstat(fd, &stat))
+     {
+         ereport(LOG,
+                 (errcode_for_file_access(),
+                  errmsg("could not stat pg_stat_statement file \"%s\": %m",
+                         PGSS_TEXT_FILE)));
+         CloseTransientFile(fd);
+         return NULL;
+     }
+
+     /* Allocate buffer */
+     buf = (char *) malloc(stat.st_size);
+     if (buf == NULL)
+     {
+         ereport(LOG,
+                 (errcode(ERRCODE_OUT_OF_MEMORY),
+                  errmsg("out of memory")));
+         CloseTransientFile(fd);
+         return NULL;
+     }
+
+     /*
+      * OK, slurp in the file.  If we get a short read and errno doesn't get
+      * set, the reason is probably that garbage collection truncated the file
+      * since we did the fstat(), so we don't log a complaint --- but we don't
+      * return the data, either, since it's most likely corrupt due to
+      * concurrent writes from garbage collection.
+      */
+     errno = 0;
+     if (read(fd, buf, stat.st_size) != stat.st_size)
+     {
+         if (errno)
+             ereport(LOG,
+                     (errcode_for_file_access(),
+                    errmsg("could not read pg_stat_statement file \"%s\": %m",
+                           PGSS_TEXT_FILE)));
+         free(buf);
+         CloseTransientFile(fd);
+         return NULL;
+     }
+
+     CloseTransientFile(fd);
+
+     *buffer_size = stat.st_size;
+     return buf;
+ }
+
+ /*
+  * Locate a query text in the file image previously read by qtext_load_file().
+  *
+  * We validate the given offset/length, and return NULL if bogus.  Otherwise,
+  * the result points to a null-terminated string within the buffer.
+  */
+ static char *
+ qtext_fetch(Size query_offset, int query_len,
+             char *buffer, Size buffer_size)
+ {
+     /* File read failed? */
+     if (buffer == NULL)
+         return NULL;
+     /* Bogus offset/length? */
+     if (query_len < 0 ||
+         query_offset + query_len >= buffer_size)
+         return NULL;
+     /* As a further sanity check, make sure there's a trailing null */
+     if (buffer[query_offset + query_len] != '\0')
+         return NULL;
+     /* Looks OK */
+     return buffer + query_offset;
+ }
+
+ /*
+  * Do we need to garbage-collect the external query text file?
+  *
+  * Caller should hold at least a shared lock on pgss->lock.
+  */
+ static bool
+ need_gc_qtexts(void)
+ {
+     Size        extent;
+
+     /* Read shared extent pointer */
+     {
+         volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;
+
+         SpinLockAcquire(&s->mutex);
+         extent = s->extent;
+         SpinLockRelease(&s->mutex);
+     }
+
+     /* Don't proceed if file does not exceed 512 bytes per possible entry */
+     if (extent < 512 * pgss_max)
+         return false;
+
+     /*
+      * Don't proceed if file is less than about 50% bloat.  Nothing can or
+      * should be done in the event of unusually large query texts accounting
+      * for file's large size.  We go to the trouble of maintaining the mean
+      * query length in order to prevent garbage collection from thrashing
+      * uselessly.
+      */
+     if (extent < pgss->mean_query_len * pgss_max * 2)
+         return false;
+
+     return true;
+ }
+
+ /*
+  * Garbage-collect orphaned query texts in external file.
+  *
+  * This won't be called often in the typical case, since it's likely that
+  * there won't be too much churn, and besides, a similar compaction process
+  * occurs when serializing to disk at shutdown or as part of resetting.
+  * Despite this, it seems prudent to plan for the edge case where the file
+  * becomes unreasonably large, with no other method of compaction likely to
+  * occur in the foreseeable future.
+  *
+  * The caller must hold an exclusive lock on pgss->lock.
+  */
+ static void
+ gc_qtexts(void)
+ {
+     char       *qbuffer;
+     Size        qbuffer_size;
+     FILE       *qfile;
+     HASH_SEQ_STATUS hash_seq;
+     pgssEntry  *entry;
+     Size        extent;
+     int            nentries;
+
+     /*
+      * When called from pgss_store, some other session might have proceeded
+      * with garbage collection in the no-lock-held interim of lock strength
+      * escalation.    Check once more that this is actually necessary.
+      */
+     if (!need_gc_qtexts())
+         return;
+
+     /*
+      * Load the old texts file.  If we fail (out of memory, for instance) just
+      * skip the garbage collection.
+      */
+     qbuffer = qtext_load_file(&qbuffer_size);
+     if (qbuffer == NULL)
+         return;
+
+     /*
+      * We overwrite the query texts file in place, so as to reduce the risk of
+      * an out-of-disk-space failure.  Since the file is guaranteed not to get
+      * larger, this should always work on traditional filesystems; though we
+      * could still lose on copy-on-write filesystems.
+      */
+     qfile = AllocateFile(PGSS_TEXT_FILE, PG_BINARY_W);
+     if (qfile == NULL)
+     {
+         ereport(LOG,
+                 (errcode_for_file_access(),
+                  errmsg("could not write pg_stat_statement file \"%s\": %m",
+                         PGSS_TEXT_FILE)));
+         goto gc_fail;
+     }
+
+     extent = 0;
+     nentries = 0;
+
+     hash_seq_init(&hash_seq, pgss_hash);
+     while ((entry = hash_seq_search(&hash_seq)) != NULL)
+     {
+         int            query_len = entry->query_len;
+         char       *qry = qtext_fetch(entry->query_offset,
+                                       query_len,
+                                       qbuffer,
+                                       qbuffer_size);
+
+         if (qry == NULL)
+         {
+             /* Trouble ... drop the text */
+             entry->query_offset = 0;
+             entry->query_len = -1;
+             continue;
+         }
+
+         if (fwrite(qry, 1, query_len + 1, qfile) != query_len + 1)
+         {
+             ereport(LOG,
+                     (errcode_for_file_access(),
+                   errmsg("could not write pg_stat_statement file \"%s\": %m",
+                          PGSS_TEXT_FILE)));
+             hash_seq_term(&hash_seq);
+             goto gc_fail;
+         }
+
+         entry->query_offset = extent;
+         extent += query_len + 1;
+         nentries++;
+     }
+
+     /*
+      * Truncate away any now-unused space.    If this fails for some odd reason,
+      * we log it, but there's no need to fail.
+      */
+     if (ftruncate(fileno(qfile), extent) != 0)
+         ereport(LOG,
+                 (errcode_for_file_access(),
+                errmsg("could not truncate pg_stat_statement file \"%s\": %m",
+                       PGSS_TEXT_FILE)));
+
+     if (FreeFile(qfile))
+     {
+         ereport(LOG,
+                 (errcode_for_file_access(),
+                  errmsg("could not write pg_stat_statement file \"%s\": %m",
+                         PGSS_TEXT_FILE)));
+         qfile = NULL;
+         goto gc_fail;
+     }
+
+     elog(DEBUG1, "pgss gc of queries file shrunk size from %zu to %zu",
+          pgss->extent, extent);
+
+     /* Reset the shared extent pointer */
+     pgss->extent = extent;
+
+     /*
+      * Also update the mean query length, to be sure that need_gc_qtexts()
+      * won't still think we have a problem.
+      */
+     if (nentries > 0)
+         pgss->mean_query_len = extent / nentries;
+     else
+         pgss->mean_query_len = ASSUMED_LENGTH_INIT;
+
+     free(qbuffer);
+
+     /*
+      * OK, count a garbage collection cycle.  (Note: even though we have
+      * exclusive lock on pgss->lock, we must take pgss->mutex for this, since
+      * other processes may examine gc_count while holding only the mutex.
+      * Also, we have to advance the count *after* we've rewritten the file,
+      * else other processes might not realize they read a stale file.)
+      */
+     record_gc_qtexts();
+
+     return;
+
+ gc_fail:
+     /* clean up resources */
+     if (qfile)
+         FreeFile(qfile);
+     if (qbuffer)
+         free(qbuffer);
+
+     /*
+      * Since the contents of the external file are now uncertain, mark all
+      * hashtable entries as having invalid texts.
+      */
+     hash_seq_init(&hash_seq, pgss_hash);
+     while ((entry = hash_seq_search(&hash_seq)) != NULL)
+     {
+         entry->query_offset = 0;
+         entry->query_len = -1;
+     }
+
+     /* Seems like a good idea to bump the GC count even though we failed */
+     record_gc_qtexts();
+ }
+
+ /*
   * Release all entries.
   */
  static void
*************** entry_reset(void)
*** 1384,1389 ****
--- 2080,2086 ----
  {
      HASH_SEQ_STATUS hash_seq;
      pgssEntry  *entry;
+     FILE       *qfile;

      LWLockAcquire(pgss->lock, LW_EXCLUSIVE);

*************** entry_reset(void)
*** 1393,1398 ****
--- 2090,2123 ----
          hash_search(pgss_hash, &entry->key, HASH_REMOVE, NULL);
      }

+     /*
+      * Write out a new, empty query text file; this also recreates the
+      * file in case it had gone missing.
+      */
+     qfile = AllocateFile(PGSS_TEXT_FILE, PG_BINARY_W);
+     if (qfile == NULL)
+     {
+         ereport(LOG,
+                 (errcode_for_file_access(),
+                  errmsg("could not create pg_stat_statement file \"%s\": %m",
+                         PGSS_TEXT_FILE)));
+         goto done;
+     }
+
+     /* If ftruncate fails, log it, but it's not a fatal problem */
+     if (ftruncate(fileno(qfile), 0) != 0)
+         ereport(LOG,
+                 (errcode_for_file_access(),
+                errmsg("could not truncate pg_stat_statement file \"%s\": %m",
+                       PGSS_TEXT_FILE)));
+
+     FreeFile(qfile);
+
+ done:
+     pgss->extent = 0;
+     /* This counts as a query text garbage collection for our purposes */
+     record_gc_qtexts();
+
      LWLockRelease(pgss->lock);
  }

*************** RecordConstLocation(pgssJumbleState *jst
*** 1962,1968 ****
   * *query_len_p contains the input string length, and is updated with
   * the result string length (which cannot be longer) on exit.
   *
!  * Returns a palloc'd string, which is not necessarily null-terminated.
   */
  static char *
  generate_normalized_query(pgssJumbleState *jstate, const char *query,
--- 2687,2693 ----
   * *query_len_p contains the input string length, and is updated with
   * the result string length (which cannot be longer) on exit.
   *
!  * Returns a palloc'd string.
   */
  static char *
  generate_normalized_query(pgssJumbleState *jstate, const char *query,
*************** generate_normalized_query(pgssJumbleStat
*** 1970,1976 ****
  {
      char       *norm_query;
      int            query_len = *query_len_p;
-     int            max_output_len;
      int            i,
                  len_to_wrt,        /* Length (in bytes) to write */
                  quer_loc = 0,    /* Source query byte location */
--- 2695,2700 ----
*************** generate_normalized_query(pgssJumbleStat
*** 1984,1992 ****
       */
      fill_in_constant_lengths(jstate, query);

!     /* Allocate result buffer, ensuring we limit result to allowed size */
!     max_output_len = Min(query_len, pgss->query_size - 1);
!     norm_query = palloc(max_output_len);

      for (i = 0; i < jstate->clocations_count; i++)
      {
--- 2708,2715 ----
       */
      fill_in_constant_lengths(jstate, query);

!     /* Allocate result buffer */
!     norm_query = palloc(query_len + 1);

      for (i = 0; i < jstate->clocations_count; i++)
      {
*************** generate_normalized_query(pgssJumbleStat
*** 2002,2050 ****
          /* Copy next chunk, or as much as will fit */
          len_to_wrt = off - last_off;
          len_to_wrt -= last_tok_len;
-         len_to_wrt = Min(len_to_wrt, max_output_len - n_quer_loc);

          Assert(len_to_wrt >= 0);
          memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
          n_quer_loc += len_to_wrt;

!         if (n_quer_loc < max_output_len)
!             norm_query[n_quer_loc++] = '?';

          quer_loc = off + tok_len;
          last_off = off;
          last_tok_len = tok_len;
-
-         /* If we run out of space, might as well stop iterating */
-         if (n_quer_loc >= max_output_len)
-             break;
      }

      /*
       * We've copied up until the last ignorable constant.  Copy over the
!      * remaining bytes of the original query string, or at least as much as
!      * will fit.
       */
      len_to_wrt = query_len - quer_loc;
-     len_to_wrt = Min(len_to_wrt, max_output_len - n_quer_loc);

      Assert(len_to_wrt >= 0);
      memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
      n_quer_loc += len_to_wrt;

!     /*
!      * If we ran out of space, we need to do an encoding-aware truncation,
!      * just to make sure we don't have an incomplete character at the end.
!      */
!     if (n_quer_loc >= max_output_len)
!         query_len = pg_encoding_mbcliplen(encoding,
!                                           norm_query,
!                                           n_quer_loc,
!                                           pgss->query_size - 1);
!     else
!         query_len = n_quer_loc;

!     *query_len_p = query_len;
      return norm_query;
  }

--- 2725,2756 ----
          /* Copy next chunk, or as much as will fit */
          len_to_wrt = off - last_off;
          len_to_wrt -= last_tok_len;

          Assert(len_to_wrt >= 0);
          memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
          n_quer_loc += len_to_wrt;

!         norm_query[n_quer_loc++] = '?';

          quer_loc = off + tok_len;
          last_off = off;
          last_tok_len = tok_len;
      }

      /*
       * We've copied up until the last ignorable constant.  Copy over the
!      * remaining bytes of the original query string.
       */
      len_to_wrt = query_len - quer_loc;

      Assert(len_to_wrt >= 0);
      memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
      n_quer_loc += len_to_wrt;

!     Assert(n_quer_loc <= query_len);
!     norm_query[n_quer_loc] = '\0';

!     *query_len_p = n_quer_loc;
      return norm_query;
  }
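
One point that is easy to miss in the diff: qtext_store() performs its
file I/O without holding pgss->lock exclusively, because the
spinlock-protected bookkeeping reserves a private byte range before any
write happens.  In outline (again a condensed paraphrase):

    /* reserve [off, off + query_len + 1) for this writer alone */
    SpinLockAcquire(&s->mutex);
    off = s->extent;
    s->extent += query_len + 1;
    s->n_writers++;             /* readers skip speculative file loads */
    SpinLockRelease(&s->mutex);

    /* the I/O itself happens outside the spinlock; ranges never overlap */
    lseek(fd, off, SEEK_SET);
    write(fd, query, query_len + 1);

As for when the file is compacted: need_gc_qtexts() requires the extent
to exceed both 512 * pgss_max bytes and 2 * mean_query_len * pgss_max
bytes.  With the new default of pg_stat_statements.max = 5000 and a mean
query length of, say, 1 kB (an assumed figure, purely for illustration),
garbage collection would not fire until the file passed roughly 10 MB.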

diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 6ea0415..5074b52 100644
*** a/doc/src/sgml/pgstatstatements.sgml
--- b/doc/src/sgml/pgstatstatements.sgml
***************
*** 62,75 ****
        <entry><structfield>queryid</structfield></entry>
        <entry><type>bigint</type></entry>
        <entry></entry>
!       <entry>Internal hash identifier, computed from the entry's post-parse-analysis tree</entry>
       </row>

       <row>
        <entry><structfield>query</structfield></entry>
        <entry><type>text</type></entry>
        <entry></entry>
!       <entry>Text of a representative statement (up to <xref linkend="guc-track-activity-query-size"> bytes)</entry>
       </row>

       <row>
--- 62,75 ----
        <entry><structfield>queryid</structfield></entry>
        <entry><type>bigint</type></entry>
        <entry></entry>
!       <entry>Internal hash code, computed from the statement's parse tree</entry>
       </row>

       <row>
        <entry><structfield>query</structfield></entry>
        <entry><type>text</type></entry>
        <entry></entry>
!       <entry>Text of a representative statement</entry>
       </row>

       <row>
***************
*** 188,196 ****
    </table>

    <para>
!    This view, and the function <function>pg_stat_statements_reset</>,
!    are available only in databases they have been specifically installed into
!    by installing the <literal>pg_stat_statements</> extension.
     However, statistics are tracked across all databases of the server
     whenever the <filename>pg_stat_statements</filename> module is loaded
     into the server, regardless of presence of the view.
--- 188,197 ----
    </table>

    <para>
!    This view, and the functions <function>pg_stat_statements_reset</>
!    and <function>pg_stat_statements</>, are available only in
!    databases they have been specifically installed into by installing
!    the <literal>pg_stat_statements</> extension.
     However, statistics are tracked across all databases of the server
     whenever the <filename>pg_stat_statements</filename> module is loaded
     into the server, regardless of presence of the view.
***************
*** 242,277 ****

    <para>
     Consumers of <literal>pg_stat_statements</> may wish to use
!    <structfield>queryid</> (perhaps in composite with
     <structfield>dbid</> and <structfield>userid</>) as a more stable
     and reliable identifier for each entry than its query text.
     However, it is important to understand that there are only limited
     guarantees around the stability of the <structfield>queryid</> hash
     value.  Since the identifier is derived from the
     post-parse-analysis tree, its value is a function of, among other
!    things, the internal identifiers that comprise this representation.
!    This has some counterintuitive implications.  For example, a query
!    against a table that is fingerprinted by
!    <literal>pg_stat_statements</> will appear distinct to a
!    subsequently executed query that a reasonable observer might judge
!    to be a non-distinct, if in the interim the table was dropped and
!    re-created.  The hashing process is sensitive to difference in
     machine architecture and other facets of the platform.
     Furthermore, it is not safe to assume that <structfield>queryid</>
     will be stable across major versions of <productname>PostgreSQL</>.
    </para>

    <para>
!    As a rule of thumb, an assumption of the stability or comparability
!    of <structfield>queryid</> values should be predicated on the
!    underlying catalog metadata and hash function implementation
!    details exactly matching.  Any two servers participating in any
!    variety of replication based on physical WAL-replay can be expected
     to have identical <structfield>queryid</> values for the same query.
!    Logical replication schemes do not have replicas comparable in all
!    relevant regards, and so <structfield>queryid</> will not be a
!    useful identifier for accumulating costs for the entire replica
!    set.  If in doubt, direct testing is recommended.
    </para>
   </sect2>

--- 243,276 ----

    <para>
     Consumers of <literal>pg_stat_statements</> may wish to use
!    <structfield>queryid</> (perhaps in combination with
     <structfield>dbid</> and <structfield>userid</>) as a more stable
     and reliable identifier for each entry than its query text.
     However, it is important to understand that there are only limited
     guarantees around the stability of the <structfield>queryid</> hash
     value.  Since the identifier is derived from the
     post-parse-analysis tree, its value is a function of, among other
!    things, the internal object identifiers appearing in this representation.
!    This has some counterintuitive implications.  For example,
!    <literal>pg_stat_statements</> will consider two apparently-identical
!    queries to be distinct, if they reference a table that was dropped
!    and recreated between the executions of the two queries.
!    The hashing process is also sensitive to differences in
     machine architecture and other facets of the platform.
     Furthermore, it is not safe to assume that <structfield>queryid</>
     will be stable across major versions of <productname>PostgreSQL</>.
    </para>

    <para>
!    As a rule of thumb, <structfield>queryid</> values can be assumed to be
!    stable and comparable only so long as the underlying server version and
!    catalog metadata details stay exactly the same.  Two servers
!    participating in replication based on physical WAL replay can be expected
     to have identical <structfield>queryid</> values for the same query.
!    However, logical replication schemes do not promise to keep replicas
!    identical in all relevant details, so <structfield>queryid</> will
!    not be a useful identifier for accumulating costs across a set of logical
!    replicas.  If in doubt, direct testing is recommended.
    </para>
   </sect2>

***************
*** 297,302 ****
--- 296,331 ----
      </listitem>
     </varlistentry>

+    <varlistentry>
+    <indexterm>
+     <primary>pg_stat_statements</primary>
+     <secondary>function</secondary>
+    </indexterm>
+
+     <term>
+      <function>pg_stat_statements(showtext boolean) returns setof record</function>
+     </term>
+
+     <listitem>
+      <para>
+       The <structname>pg_stat_statements</structname> view is defined in
+       terms of a function, itself called <function>pg_stat_statements</>.
+       It is possible for clients to call
+       the <function>pg_stat_statements</function> function directly, and by
+       specifying <literal>showtext := false</literal> have query text be
+       omitted (that is, the <literal>OUT</literal> argument that corresponds
+       to the view's <structfield>query</> column will return nulls).  This
+       feature is intended to support external tools that might wish to avoid
+       the overhead of repeatedly retrieving query texts of indeterminate
+       length.  Such tools can instead cache the first query text observed
+       for each entry themselves, since that is
+       all <filename>pg_stat_statements</> itself does, and then retrieve
+       query texts only as needed.  Since the server stores query texts in a
+       file, this approach may reduce physical I/O for repeated examination
+       of the <structname>pg_stat_statements</structname> data.
+      </para>
+     </listitem>
+    </varlistentry>
    </variablelist>
   </sect2>
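
To make the intended use of showtext concrete, a monitoring client could
do something like the following (a sketch: the connection string is
hypothetical and error handling is mostly omitted):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("dbname=postgres");
        PGresult   *res;
        int         i;

        if (PQstatus(conn) != CONNECTION_OK)
            return 1;

        /* fetch counters only; the query column comes back all nulls */
        res = PQexec(conn,
                     "SELECT queryid, calls, total_time "
                     "FROM pg_stat_statements(showtext := false)");

        for (i = 0; i < PQntuples(res); i++)
            printf("queryid %s: %s calls\n",
                   PQgetvalue(res, i, 0), PQgetvalue(res, i, 1));

        PQclear(res);
        PQfinish(conn);
        return 0;
    }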

***************
*** 316,322 ****
        in the <structname>pg_stat_statements</> view).  If more distinct
        statements than that are observed, information about the least-executed
        statements is discarded.
!       The default value is 1000.
        This parameter can only be set at server start.
       </para>
      </listitem>
--- 345,351 ----
        in the <structname>pg_stat_statements</> view).  If more distinct
        statements than that are observed, information about the least-executed
        statements is discarded.
!       The default value is 5000.
        This parameter can only be set at server start.
       </para>
      </listitem>
***************
*** 378,386 ****
    </variablelist>

    <para>
!    The module requires additional shared memory amounting to about
!    <varname>pg_stat_statements.max</varname> <literal>*</>
!    <xref linkend="guc-track-activity-query-size"> bytes.  Note that this
     memory is consumed whenever the module is loaded, even if
     <varname>pg_stat_statements.track</> is set to <literal>none</>.
    </para>
--- 407,414 ----
    </variablelist>

    <para>
!    The module requires additional shared memory proportional to
!    <varname>pg_stat_statements.max</varname>.  Note that this
     memory is consumed whenever the module is loaded, even if
     <varname>pg_stat_statements.track</> is set to <literal>none</>.
    </para>

Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Sat, Jan 25, 2014 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do the regression tests fail for you when doing this?
>
> What I see is a duplicate occurrence of an escape_string_warning bleat:
>
> *** /home/postgres/pgsql/src/test/regress/expected/plpgsql.out  Fri Jan  3 17:07:46 2014
> --- /home/postgres/pgsql/src/test/regress/results/plpgsql.out   Sat Jan 25 13:37:20 2014
> ***************
> *** 4568,4573 ****
> --- 4568,4579 ----
>   HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
>   QUERY:  SELECT 'foo\\bar\041baz'
>   CONTEXT:  PL/pgSQL function strtest() line 4 at RETURN
> + WARNING:  nonstandard use of \\ in a string literal
> + LINE 1: SELECT 'foo\\bar\041baz'
> +                ^
> + HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
> + QUERY:  SELECT 'foo\\bar\041baz'
> + CONTEXT:  PL/pgSQL function strtest() line 4 at RETURN
>      strtest
>   -------------
>    foo\bar!baz
>
> ======================================================================
>
> which seems to happen because generate_normalized_query() reruns the
> core lexer on the given statement, and if you got a warning the first
> time, you'll get another one.

Oh, yes, I noticed that and reached the same conclusion. Sorry, I
probably should have mentioned this proactively.
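
For anyone who wants to see the effect in isolation, something like the
following should reproduce the doubled warning when pg_stat_statements is
loaded and the statement creates a new entry (a sketch, assuming
non-default string-literal settings):

    SET standard_conforming_strings = off;
    SET escape_string_warning = on;
    -- The core lexer warns once when the statement is parsed;
    -- generate_normalized_query() re-lexes the text and warns again.
    SELECT 'foo\\bar\041baz';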


-- 
Peter Geoghegan



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Sat, Jan 25, 2014 at 1:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've chosen to handle failures to load query
> text data by just returning NULL for that query text, which seems
> reasonable given the new mindset that the query text is auxiliary data
> less important than the actual counts.

I guess that's okay.
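
Entries whose texts failed to load would then still be visible, just with
a null query text; for example (a sketch, not a query from the patch):

    -- Counts are preserved even when the external text file could not
    -- be read back; such entries simply report a NULL query
    SELECT queryid, calls, rows
      FROM pg_stat_statements
     WHERE query IS NULL;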

> I've not done much more than smoke-testing on this; I'm thinking about
> hot-wiring the code to sometimes force it through the error paths,
> just to make sure they act as expected.  I've also not done any
> real performance testing.

I was never too worried about the impact on performance, because any
price that must be paid is paid only when new entries are created,
which is typically a very rare event that was already expensive due to
requiring an exclusive lock. This patch will make it much rarer in
practice by increasing the pg_stat_statements.max default, and thus
reducing the impact of such exclusive locks on *all* sessions, not
just those executing less popular queries. I don't know about you, but
the reason that I didn't performance test this is that it's really
hard to think of a representative scenario where it could possibly
matter, even though it seems like there might be one.

-- 
Peter Geoghegan



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
Why do you think it's better to release the shared lock while
generating a normalized query text, only to acquire it once more? I'm
not suggesting that it's the wrong thing to do. I'm curious about the
reasoning around assessing the costs.

-- 
Peter Geoghegan



Peter Geoghegan <pg@heroku.com> writes:
> Why do you think it's better to release the shared lock while
> generating a normalized query text, only to acquire it once more? I'm
> not suggesting that it's the wrong thing to do. I'm curious about the
> reasoning around assessing the costs.

Well, it's fairly expensive to generate that text, in the case of a
large/complex statement.  It's possible that continuing to hold the lock
is nonetheless the right thing to do because release+reacquire is also
expensive; but there is no proof of that AFAIK, and I believe that
release+reacquire is not likely to be expensive unless the lock is heavily
contended, which would be exactly the conditions under which keeping it
wouldn't be such a good idea anyway.  So I'd prefer to leave it doing what
it did before, until there's some concrete evidence that keeping the lock
is a better idea.
        regards, tom lane



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Sat, Jan 25, 2014 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, it's fairly expensive to generate that text, in the case of a
> large/complex statement.  It's possible that continuing to hold the lock
> is nonetheless the right thing to do because release+reacquire is also
> expensive; but there is no proof of that AFAIK, and I believe that
> release+reacquire is not likely to be expensive unless the lock is heavily
> contended, which would be exactly the conditions under which keeping it
> wouldn't be such a good idea anyway.

My reason for only acquiring the shared lock once, and performing text
normalization with it held was that in practice most query texts are
not all that expensive to lex/normalize, and the cost of holding the
lock for the vast majority of sessions (that won't experience
contention) is nil. Acquiring the shared lock twice for something like
that struck me as unjustified. I thought it was closer to the original
coding to lex with the lock, because of course the original coding
will only ever acquire the shared lock once. The original
lex-with-no-lock coding is only justified by "well, might as well".

-- 
Peter Geoghegan



Peter Geoghegan <pg@heroku.com> writes:
> On Sat, Jan 25, 2014 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, it's fairly expensive to generate that text, in the case of a
>> large/complex statement.  It's possible that continuing to hold the lock
>> is nonetheless the right thing to do because release+reacquire is also
>> expensive; but there is no proof of that AFAIK, and I believe that
>> release+reacquire is not likely to be expensive unless the lock is heavily
>> contended, which would be exactly the conditions under which keeping it
>> wouldn't be such a good idea anyway.

> My reason for only acquiring the shared lock once, and performing text
> normalization with it held was that in practice most query texts are
> not all that expensive to lex/normalize, and the cost of holding the
> lock for the vast majority of sessions (that won't experience
> contention) is nil.

Meh.  This line of argument seems to reduce to "we don't need to worry
about performance of this code path because it won't be reached often".
I don't particularly buy that; it may be true in some usage patterns but
probably not all.  And if you *do* buy it, it can be turned around to say
that we needn't worry about the costs of an extra locking cycle, anyway.

However, the only thing that is really going to settle this argument is
data, so here's a simple test case.  I made a script that just consisted
of many repetitions of
    \dd
    SELECT pg_stat_statements_reset();
and ran it in psql while tracing the system with "perf".  (\dd produces
a query that is middlingly complex, but certainly not a patch on those
I've seen produced by many real applications.  The point of the reset
is to force us through the stats-entry-creation code path each time.)

The hits above 1%, in a non-assert-enabled backend:
   13.92%        34576       postmaster  [kernel.kallsyms]   [k] 0xffffffff8103ed21
    8.64%        21645       postmaster  postgres            [.] AllocSetAlloc
    5.09%        12502       postmaster  postgres            [.] SearchCatCache
    2.37%         6025       postmaster  postgres            [.] MemoryContextStrdup
    2.22%         5624       postmaster  libc-2.12.so        [.] __strcmp_sse42
    1.93%         4860       postmaster  postgres            [.] core_yylex
    1.84%         4501       postmaster  postgres            [.] base_yyparse
    1.83%         4601       postmaster  postgres            [.] ScanKeywordLookup
    1.54%         3893       postmaster  postgres            [.] copyObject
    1.54%         3849       postmaster  postgres            [.] MemoryContextAllocZeroAligned
    1.30%         3237       postmaster  postgres            [.] expression_tree_walker
    1.23%         3069       postmaster  postgres            [.] palloc
    1.15%         2903       postmaster  libc-2.12.so        [.] memcpy
    1.04%         2566       postmaster  postgres            [.] hash_uint32

So 3.76% of the entire runtime is going into core_yylex+ScanKeywordLookup,
and presumably half of that can be blamed on generate_normalized_query.
On the other hand, the earliest mention of lock-related functions in the
profile is
    0.17%          411       postmaster  postgres            [.] LWLockAcquire

and LWLockRelease is down here:
    0.11%          264       postmaster  postgres            [.] LWLockRelease

So on this example, the *total* cost of LWLocks, across the entire
backend, is something like 15% of the cost of generate_normalized_query.
This seems to me to be reasonably strong evidence that we should worry
about that cost, and that releasing/reacquiring the pgss LWLock is
trivial by comparison.
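
Spelling that estimate out (attributing half of the lexing cost to
generate_normalized_query, per the profile above):

    \frac{0.17\% + 0.11\%}{(1.93\% + 1.83\%)/2}
      = \frac{0.28\%}{1.88\%} \approx 0.15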

Of course, if the lock were contended then the locking cost would go
up substantially --- but that would also be a scenario in which it'd
be important not to hold the lock unnecessarily.
        regards, tom lane



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Sun, Jan 26, 2014 at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Meh.  This line of argument seems to reduce to "we don't need to worry
> about performance of this code path because it won't be reached often".

I think I may have over-elaborated, giving you the false impression
that this was something I felt strongly about. I'm glad that the
overhead has been shown to be quite low, and I think that lexing
without the lock held will be fine.


-- 
Peter Geoghegan



Peter Geoghegan <pg@heroku.com> writes:
> On Sun, Jan 26, 2014 at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Meh.  This line of argument seems to reduce to "we don't need to worry
>> about performance of this code path because it won't be reached often".

> I think I may have over-elaborated, giving you the false impression
> that this was something I felt strongly about. I'm glad that the
> overhead has been shown to be quite low, and I think that lexing
> without the lock held will be fine.

OK.  Committed after a couple of small further revisions.
        regards, tom lane



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
On Mon, Jan 27, 2014 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> OK.  Committed after a couple of small further revisions.

Great. Thank you.


-- 
Peter Geoghegan



Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core

From
Peter Geoghegan
Date:
I noticed a minor omission in the patch as committed. Attached patch
corrects this.


--
Peter Geoghegan

Peter Geoghegan <pg@heroku.com> writes:
> I noticed a minor omission in the patch as committed. Attached patch
> corrects this.

Duh.  Thanks.
        regards, tom lane