Thread: pg_stat_*_columns?
Would others find it useful to see per-column statistics about accesses to specific columns?
Two possible use-cases (maybe there are more?):
1. I think it would be helpful for DBAs to better understand their own system.
Finding unused *tables* is easy today thanks to pg_stat_*_tables, but knowing if something is accessing a *column* or not is not easy.
In my case all our database access is via sprocs, so I can just grep the source code for the column name to see if something is using it, but most DBAs probably don't have that luxury. (A query sketch follows right after this list.)
2. It could also be useful for audit trailing, if you want to know what a query did, i.e. what tables/columns were accessed in the txn.
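To make use-case 1 concrete, here is a minimal sketch of the kind of query this would enable; the first statement works today against pg_stat_user_tables, while the second is purely hypothetical (pg_stat_user_columns and its column names are invented for illustration):

-- Works today: tables never scanned since the stats were last reset.
SELECT schemaname, relname
FROM pg_stat_user_tables
WHERE coalesce(seq_scan, 0) = 0
  AND coalesce(idx_scan, 0) = 0;

-- Hypothetical column-level analogue (view and column names invented):
-- SELECT schemaname, relname, attname
-- FROM pg_stat_user_columns
-- WHERE coalesce(n_accesses, 0) = 0;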
Here is an idea of a very simple audit trailing system that would probably fulfill my own needs:
Imagine if we had pg_stat_xact_user_columns and, for each committed txn, did an insert into an unlogged table with the same structure as pg_stat_xact_user_columns, plus the session_user and a timestamp for the txn.
I would much rather have audit trailing in a nice table than in a text file. Maybe a foreign data wrapper could be used to ship the audit trail data to some other external append-only pg-database, if the purpose of the audit trailing is to prevent an evil DBA from doing evil things. But for others it might be sufficient to do audit trailing to the same database, for convenience purposes.
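A rough SQL sketch of what that could look like; every name below is invented for illustration, pg_stat_xact_user_columns does not exist, and the mechanism that would actually run the INSERT at commit time is exactly the missing piece:

-- Unlogged audit-trail table mirroring the hypothetical
-- pg_stat_xact_user_columns view, plus who and when:
CREATE UNLOGGED TABLE audit_column_access (
    logged_at   timestamptz NOT NULL DEFAULT now(),
    logged_by   name        NOT NULL DEFAULT session_user,
    relid       oid         NOT NULL,
    attname     name        NOT NULL,
    n_accesses  bigint      NOT NULL
);

-- The statement that would have to run once per committed txn
-- (pg_stat_xact_user_columns is the proposed, not yet existing, view):
-- INSERT INTO audit_column_access (relid, attname, n_accesses)
-- SELECT relid, attname, n_accesses
-- FROM pg_stat_xact_user_columns;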
In summary:
Some users might only be interested in the statistics and mostly use pg_stat_user_columns.
Other users might also be interested in what happened in a specific txn and use pg_stat_xact_user_columns.
Yet other users might be interested in audit trailing and want to log pg_stat_xact_user_columns for each txn. That would probably be very expensive performance-wise, but it might make sense if you have extremely sensitive data and audit trailing is more important than performance.
Thoughts?
On 6/5/15 6:51 AM, Joel Jacobson wrote:
>
> 1. I think it would be helpful for DBAs to better understand their own
> system.
> Finding unused *tables* is today easy thanks to pg_stat_*_tables, but
> knowing if something is accessing a *column* or not is not easy.
> In my case all our database access is via sprocs, so I can just grep the
> source code for the column name to see if something is using it, but
> most DBAs probably don't have that luxury.

I have wanted this exact thing when considering vertical partitioning.
It's easy to tell from a size standpoint what columns are good candidates
for putting in a 'side table', but it's very hard to know how often
columns are actually used.

BTW, I think the right way to measure this would be how many rows were
returned for queries referencing a column. Simply knowing how many queries
reference a column doesn't tell you much; you want to know how much column
data was actually pulled out.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
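To illustrate the point: the size half of that decision is already answerable today from pg_stats, while the usage half is the hypothetical part; a sketch, with the invented per-column view left commented out:

-- Real today: average stored width per column, a proxy for "size" candidates.
SELECT s.tablename, s.attname, s.avg_width
FROM pg_stats s
WHERE s.schemaname = 'public'
ORDER BY s.avg_width DESC;

-- Hypothetical: join in per-column usage (rows actually returned) to find
-- wide-but-rarely-fetched columns, i.e. good 'side table' candidates.
-- pg_stat_user_columns and n_rows_returned are invented names:
-- SELECT s.tablename, s.attname, s.avg_width, c.n_rows_returned
-- FROM pg_stats s
-- JOIN pg_stat_user_columns c USING (schemaname, tablename, attname)
-- ORDER BY s.avg_width DESC, c.n_rows_returned ASC;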
On Fri, Jun 5, 2015 at 7:51 AM, Joel Jacobson <joel@trustly.com> wrote:
> Would others find it useful to see per column statistics about accesses to
> specific columns?

A probably serious and maybe not entirely obvious problem with this is
that it would increase the amount of statistical information we keep by a
pretty large multiple. How many columns do you have in a typical table?
20-30? That's a lot of data for a statistics collector that, regrettably,
is already overloaded.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
So I've heard from Magnus Hagander today IRL at our Stockholm PostgreSQL User Group meeting where we discussed this idea. He told me the overhead in the statistics collector is mainly when reading from it, not that much when writing to it.
Magnus's idea was to first optimize the collector to make it less of a problem to collect more data. That sounds like a good thing to do, but maybe more data in it wouldn't be a problem as long as you don't read from it too often?
On Mon 8 Jun 2015 at 18:48 Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 5, 2015 at 7:51 AM, Joel Jacobson <joel@trustly.com> wrote:
> Would others find it useful to see per column statistics about accesses to
> specific columns?
A probably serious and maybe not entirely obvious problem with this is
that it would increase the amount of statistical information we keep
by a pretty large multiple. How many columns do you have in a typical
table? 20-30? That's a lot of data for a statistics collector that,
regrettably, is already overloaded.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/8/15 3:26 PM, Joel Jacobson wrote:
> So I've heard from Magnus Hagander today IRL at our Stockholm PostgreSQL
> User Group meeting where we discussed this idea. He told me the overhead
> in the statistics collector is mainly when reading from it, not that
> much when writing to it.

I've heard enough stories of people moving the stats files to faster
storage that I'm not sure how true that really is...

> Magnus idea was to first optimize the collector to make it less of a
> problem to collect more data. Sounds like a good thing to do, but maybe
> more data in it wouldn't be a problem as long as you don't read too
> often from it?

The stats collector is a known problem under certain circumstances, so
improving it would probably be a good thing. The first thing that comes to
mind is splitting it into more files.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Jun 16, 2015 at 4:47 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
Magnus idea was to first optimize the collector to make it less of a
problem to collect more data. Sounds like a good thing to do, but maybe
more data in it wouldn't be a problem as long as you don't read too
often from it?
The stats collector is a known problem under certain circumstances, so improving it would probably be a good thing. The first thing that comes to mind is splitting it into more files.
Is there any chance the project would accept a patch which adds the pg_stat_*_columns feature without first optimizing the collector? I guess it primarily depends on how much of the new code would need to be rewritten if the collector is optimized/rewritten in the future?
(I would be interested in sponsoring the work, if anyone is interested.)
On Sat, Jun 20, 2015 at 10:12 AM, Joel Jacobson <joel@trustly.com> wrote:
> Is there any chance the project would accept a patch which adds the
> pg_stat_*_columns-feature without first optimizing the collector?

I doubt it. It's such a pain point already that massively increasing the
amount of data we need to store does not seem like a good plan.

> I guess it
> primarily depends on how much of the new code that would need to be
> rewritten, if the collector is optimized/rewritten in the future?

I don't think that's really the issue. It's more that I think the extra
data would be likely to cause real pain for users.

FWIW, I tend to think that the solution here has less to do with splitting
the data up into more files and more to do with rethinking the format.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jun 20, 2015 at 10:12 AM, Joel Jacobson <joel@trustly.com> wrote:
>> I guess it
>> primarily depends on how much of the new code that would need to be
>> rewritten, if the collector is optimized/rewritten in the future?

> I don't think that's really the issue. It's more that I think it
> would be the extra data would be likely to cause real pain for users.

Yes. The stats data already causes real pain.

> FWIW, I tend to think that the solution here has less to do with
> splitting the data up into more files and more to do with rethinking
> the format.

I dunno that tweaking the format would accomplish much. Where I'd love
to get to is to not have to write the data to disk at all (except at
shutdown). But that seems to require an adjustable-size shared memory
block, and I'm not sure how to do that. One idea, if the DSM stuff
could be used, is to allow the stats collector to allocate multiple
DSM blocks as needed --- but how well would that work on 32-bit
machines? I'd be worried about running out of address space.

regards, tom lane
On Sat, Jun 20, 2015 at 10:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jun 20, 2015 at 10:12 AM, Joel Jacobson <joel@trustly.com> wrote:
>> I guess it
>> primarily depends on how much of the new code that would need to be
>> rewritten, if the collector is optimized/rewritten in the future?
> I don't think that's really the issue. It's more that I think it
> would be the extra data would be likely to cause real pain for users.
Yes. The stats data already causes real pain.
> FWIW, I tend to think that the solution here has less to do with
> splitting the data up into more files and more to do with rethinking
> the format.
I dunno that tweaking the format would accomplish much. Where I'd love
to get to is to not have to write the data to disk at all (except at
shutdown). But that seems to require an adjustable-size shared memory
block, and I'm not sure how to do that. One idea, if the DSM stuff
could be used, is to allow the stats collector to allocate multiple
DSM blocks as needed --- but how well would that work on 32-bit
machines? I'd be worried about running out of address space.
I've considered both that and to perhaps use a shared memory message queue to communicate. Basically, have a backend send a request when it needs a snapshot of the stats data and get a copy back through that method instead of disk. It would be much easier if we didn't actually take a snapshot of the data per transaction, but we really don't want to give that up (if we didn't care about that, we could just have a protocol asking for individual values).
We'd need a way to actually transfer the whole hashtables over, without rebuilding them on the other end I think. Just the cost of looping over it to dump and then rehashing everything on the other end seems quite wasteful and unnecessary.
Magnus Hagander <magnus@hagander.net> writes:
> On Sat, Jun 20, 2015 at 10:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I dunno that tweaking the format would accomplish much. Where I'd love
>> to get to is to not have to write the data to disk at all (except at
>> shutdown). But that seems to require an adjustable-size shared memory
>> block, and I'm not sure how to do that. One idea, if the DSM stuff
>> could be used, is to allow the stats collector to allocate multiple
>> DSM blocks as needed --- but how well would that work on 32-bit
>> machines? I'd be worried about running out of address space.

> I've considered both that and to perhaps use a shared memory message queue
> to communicate. Basically, have a backend send a request when it needs a
> snapshot of the stats data and get a copy back through that method instead
> of disk. It would be much easier if we didn't actually take a snapshot of
> the data per transaction, but we really don't want to give that up (if we
> didn't care about that, we could just have a protocol asking for individual
> values).

Yeah, that might work quite nicely, and it would not require nearly as
much surgery on the existing code as mapping the stuff into
constrained-size shmem blocks would do. The point about needing a data
snapshot is a good one as well; I'm not sure how we'd preserve that
behavior if backends are accessing the collector's data structures
directly through shmem.

I wonder if we should think about replacing the IP-socket-based data
transmission protocol with a shared memory queue, as well.

> We'd need a way to actually transfer the whole hashtables over, without
> rebuilding them on the other end I think. Just the cost of looping over it
> to dump and then rehashing everything on the other end seems quite wasteful
> and unnecessary.

Meh. All of a sudden you've made it complicated and invasive again, to
get rid of a bottleneck that's not been shown to be a problem. Let's do
the simple thing first, else maybe nothing will happen at all.

regards, tom lane
On 06/20/2015 11:32 AM, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Sat, Jun 20, 2015 at 10:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I dunno that tweaking the format would accomplish much. Where I'd love
>>> to get to is to not have to write the data to disk at all (except at
>>> shutdown). But that seems to require an adjustable-size shared memory
>>> block, and I'm not sure how to do that. One idea, if the DSM stuff
>>> could be used, is to allow the stats collector to allocate multiple
>>> DSM blocks as needed --- but how well would that work on 32-bit
>>> machines? I'd be worried about running out of address space.

Hmm. A backend already reads all the stats it needs to backend-private
memory in one go. That consumes about as much address space as the DSM
would, no? I guess it'd double the need, because you'd have mapped both
the DSM and the backend-private memory at the same time.

>> I've considered both that and to perhaps use a shared memory message queue
>> to communicate. Basically, have a backend send a request when it needs a
>> snapshot of the stats data and get a copy back through that method instead
>> of disk. It would be much easier if we didn't actually take a snapshot of
>> the data per transaction, but we really don't want to give that up (if we
>> didn't care about that, we could just have a protocol asking for individual
>> values).
>
> Yeah, that might work quite nicely, and it would not require nearly as
> much surgery on the existing code as mapping the stuff into
> constrained-size shmem blocks would do. The point about needing a data
> snapshot is a good one as well; I'm not sure how we'd preserve that
> behavior if backends are accessing the collector's data structures
> directly through shmem.

Usually you use a lock for such things ;-). Acquire lock, copy to private
memory, unlock. If that's too slow, have two copies of the structure in
shared memory, one that's being updated, and one that's being read, and
swap them periodically. Or something like that - this doesn't seem
particularly hard.

> I wonder if we should think about replacing the IP-socket-based data
> transmission protocol with a shared memory queue, as well.

One problem is that the archiver process is not connected to shared
memory, but calls pgstat_send_archiver() to update the stats whenever it
has archived a file.

If we nevertheless replace the files with dynamic shared memory, and
switch to using shared memory queues for communication, ISTM we might as
well have all the backends update the shared memory directly, and get rid
of the stats collector process altogether. If we didn't already have a
stats collector, that certainly seems like a more straightforward design.

> Let's do the simple thing first, else maybe nothing will happen at all.

Yeah, there's that..

- Heikki
On Sat, Jun 20, 2015 at 11:15 AM, Magnus Hagander <magnus@hagander.net> wrote:
> I've considered both that and to perhaps use a shared memory message queue
> to communicate. Basically, have a backend send a request when it needs a
> snapshot of the stats data and get a copy back through that method instead
> of disk. It would be much easier if we didn't actually take a snapshot of
> the data per transaction, but we really don't want to give that up (if we
> didn't care about that, we could just have a protocol asking for individual
> values).
>
> We'd need a way to actually transfer the whole hashtables over, without
> rebuilding them on the other end I think. Just the cost of looping over it
> to dump and then rehashing everything on the other end seems quite wasteful
> and unnecessary.

One idea would be to advertise a DSM ID in the main shared memory
segment, and have the individual backends read that value and attach to
it. When new stats are generated, the stats collector creates a new DSM
(which might be bigger or smaller than the old one), writes the new stats
in there, and then advertises the new DSM ID in the main shared memory
segment. Backends that still have the old segment attached can still use
it, and it will go away automatically once they all drop off.

But I'm not sure how this would work with the new per-database split of
the stats file. I don't think it'll work to have one DSM per database; we
don't support enough DSMs for that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> One idea would be to advertise a DSM ID in the main shared memory
> segment, and have the individual backends read that value and attach
> to it. When new stats are generated, the stats collector creates a
> new DSM (which might be bigger or smaller than the old one), writes
> the new stats in there, and then advertises the new DSM ID in the main
> shared memory segment. Backends that still have the old segment
> attached can still use it, and it will go away automatically once they
> all drop off.

Hmmm. This sounds attractive, but what happens if we fail to create a
new DSM when needed?

> But I'm not sure how this would work with the new per-database split
> of the stats file. I don't think it'll work to have one DSM per
> database; we don't support enough DSMs for that.

AFAIR, that per-database split exists only to try to reduce the amount of
traffic written to disk. We could lose it cheerfully if the communication
all happens in shared memory.

regards, tom lane
On Sat, Jun 20, 2015 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> One idea would be to advertise a DSM ID in the main shared memory
>> segment, and have the individual backends read that value and attach
>> to it. When new stats are generated, the stats collector creates a
>> new DSM (which might be bigger or smaller than the old one), writes
>> the new stats in there, and then advertises the new DSM ID in the main
>> shared memory segment. Backends that still have the old segment
>> attached can still use it, and it will go away automatically once they
>> all drop off.
>
> Hmmm. This sounds attractive, but what happens if we fail to create
> a new DSM when needed?

Presumably you keep using the old one and retry later. I mean, out of
memory is out of memory; you can't move a variable-size data structure
into shared memory without the possibility of running into OOM failures
at some point.

>> But I'm not sure how this would work with the new per-database split
>> of the stats file. I don't think it'll work to have one DSM per
>> database; we don't support enough DSMs for that.
>
> AFAIR, that per-database split exists only to try to reduce the amount of
> traffic written to disk. We could lose it cheerfully if the communication
> all happens in shared memory.

If we arranged things so that the processes could use the data in the DSM
directly rather than having to copy it out, we'd presumably save quite a
bit of memory, since the whole structure would be shared rather than each
backend having its own copy. But if the structure got too big to map (on
a 32-bit system), then you'd be sort of hosed, because there's no way to
attach just part of it. That might not be worth worrying about, but it
depends on how big it's likely to get - a 32-bit system is very likely to
choke on a 1GB mapping, and maybe even on a much smaller one.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 06/21/2015 12:15 AM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> One idea would be to advertise a DSM ID in the main shared memory
>> segment, and have the individual backends read that value and attach
>> to it. When new stats are generated, the stats collector creates a
>> new DSM (which might be bigger or smaller than the old one), writes
>> the new stats in there, and then advertises the new DSM ID in the main
>> shared memory segment. Backends that still have the old segment
>> attached can still use it, and it will go away automatically once they
>> all drop off.
>
> Hmmm. This sounds attractive, but what happens if we fail to create
> a new DSM when needed?

Also, isn't this a potential problem with long-running backends? We might
ultimately end with each backend using a different DSM segment.

>> But I'm not sure how this would work with the new per-database
>> split of the stats file. I don't think it'll work to have one DSM
>> per database; we don't support enough DSMs for that.
>
> AFAIR, that per-database split exists only to try to reduce the
> amount of traffic written to disk. We could lose it cheerfully if the
> communication all happens in shared memory.

Yes, reducing the amount of writes is one of the benefits of the
per-database split, but there are other benefits too. Obviously, it also
reduces the amount of reads (because each backend can read just the right
portion of stats), and reduction of CPU time (because you're parsing much
less data).

But I don't see why we couldn't partition the stats in DSM in a similar
manner.

kind regards
--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jun 20, 2015 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmmm. This sounds attractive, but what happens if we fail to create
>> a new DSM when needed?

> Presumably you keep using the old one and retry later. I mean, out of
> memory is out of memory; you can't move a variable-size data structure
> into shared memory without the possibility of running into OOM
> failures at some point.

Right, which is one reason that Magnus' thought of replacing the file
transmission mechanism with a message queue might be a good alternative.

> If we arranged things so that the processes could use the data in the
> DSM directly rather than having to copy it out, we'd presumably save
> quite a bit of memory, since the whole structure would be shared
> rather than each backend having its own copy.

That is not going to happen, because it would imply locking out the stats
collector from doing any more updates for the entire time that any backend
is looking at the results. We *do* need to copy.

> But if the structure
> got too big to map (on a 32-bit system), then you'd be sort of hosed,
> because there's no way to attach just part of it. That might not be
> worth worrying about, but it depends on how big it's likely to get - a
> 32-bit system is very likely to choke on a 1GB mapping, and maybe even
> on a much smaller one.

Yeah, I'm quite worried about assuming that we can map a data structure
that might be of very significant size into shared memory on 32-bit
machines. The address space just isn't there.

regards, tom lane
Robert Haas wrote:
> If we arranged things so that the processes could use the data in the
> DSM directly rather than having to copy it out, we'd presumably save
> quite a bit of memory, since the whole structure would be shared
> rather than each backend having its own copy. But if the structure
> got too big to map (on a 32-bit system), then you'd be sort of hosed,
> because there's no way to attach just part of it. That might not be
> worth worrying about, but it depends on how big it's likely to get - a
> 32-bit system is very likely to choke on a 1GB mapping, and maybe even
> on a much smaller one.

How realistic is it that you would get a 1 GB mapping on a 32-bit system?
Each table entry is 106 bytes at the moment if my count is right, so you
need about one million tables to get that large a table. It doesn't sound
really realistic to have such a database on a smallish machine.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 20, 2015 at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If we arranged things so that the processes could use the data in the
>> DSM directly rather than having to copy it out, we'd presumably save
>> quite a bit of memory, since the whole structure would be shared
>> rather than each backend having its own copy.
>
> That is not going to happen, because it would imply locking out the stats
> collector from doing any more updates for the entire time that any backend
> is looking at the results. We *do* need to copy.

No, it doesn't mean that at all. As often as needed, the stats collector
would build and publish a new copy of the data in a new DSM segment. The
old one would stick around until the last reader drops off. So it would
be almost totally lock-free. That's a very appealing design in my book.

Stuffing all of the data through a message queue will be very inefficient
by comparison. If the queue is big enough to fit all the info in a single
chunk, it will be nice and fast. If not, the process receiving the data
will have to read a chunk and then go to sleep while it waits for the
next chunk to be sent. Even if you make the message queue pretty big,
like 1MB, a large stats file, say 80MB, will require ~160 context
switches to transfer. That's probably going to suck, especially if the
stats collector does anything other than trying to service writes to the
queues.

>> But if the structure
>> got too big to map (on a 32-bit system), then you'd be sort of hosed,
>> because there's no way to attach just part of it. That might not be
>> worth worrying about, but it depends on how big it's likely to get - a
>> 32-bit system is very likely to choke on a 1GB mapping, and maybe even
>> on a much smaller one.
>
> Yeah, I'm quite worried about assuming that we can map a data structure
> that might be of very significant size into shared memory on 32-bit
> machines. The address space just isn't there.

Considering the advantages of avoiding message queues, I think we should
think a little bit harder about whether we can't find some way to skin
this cat. As I think about this a little more, I'm not sure there's
really a problem with one stats DSM per database. Sure, the system might
have 100,000 databases in some crazy pathological case, but the maximum
number of those that can be in use is bounded by max_connections, which
means the maximum number of stats file DSMs we could ever need at one
time is also bounded by max_connections. There are a few corner cases to
think about, like if the user writes a client that connects to all
100,000 databases in very quick succession, we've got to jettison the old
DSMs fast enough to make room for the new DSMs before we run out of
slots, but that doesn't seem like a particularly tough nut to crack. If
the stats collector ensures that it never attaches to more than
MaxBackends stats DSMs at a time, and each backend ensures that it never
attaches to more than one stats DSM at a time, then 2 * MaxBackends stats
DSMs is always enough. And that's just a matter of bumping
PG_DYNSHMEM_SLOTS_PER_BACKEND from 2 to 4.

In more realistic cases, it will probably be normal for many or all
backends to be connected to the same database, and the number of stats
DSMs required will be far smaller.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jun 20, 2015 at 11:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jun 20, 2015 at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But if the structure
>> got too big to map (on a 32-bit system), then you'd be sort of hosed,
>> because there's no way to attach just part of it. That might not be
>> worth worrying about, but it depends on how big it's likely to get - a
>> 32-bit system is very likely to choke on a 1GB mapping, and maybe even
>> on a much smaller one.
>
> Yeah, I'm quite worried about assuming that we can map a data structure
> that might be of very significant size into shared memory on 32-bit
> machines. The address space just isn't there.
Considering the advantages of avoiding message queues, I think we
should think a little bit harder about whether we can't find some way
to skin this cat. As I think about this a little more, I'm not sure
there's really a problem with one stats DSM per database. Sure, the
system might have 100,000 databases in some crazy pathological case,
but the maximum number of those that can be in use is bounded by
max_connections, which means the maximum number of stats file DSMs we
could ever need at one time is also bounded by max_connections. There
are a few corner cases to think about, like if the user writes a
client that connects to all 100,000 databases in very quick
succession, we've got to jettison the old DSMs fast enough to make
room for the new DSMs before we run out of slots, but that doesn't
seem like a particularly tough nut to crack. If the stats collector
ensures that it never attaches to more than MaxBackends stats DSMs at
a time, and each backend ensures that it never attaches to more than
one stats DSM at a time, then 2 * MaxBackends stats DSMs is always
enough. And that's just a matter of bumping
PG_DYNSHMEM_SLOTS_PER_BACKEND from 2 to 4.
In more realistic cases, it will probably be normal for many or all
backends to be connected to the same database, and the number of stats
DSMs required will be far smaller.
What about a combination along the lines of something like this: the stats collector keeps the statistics in local memory as before. But when a backend needs to get a snapshot of its data, it uses a shared memory queue to request it. What the stats collector does in this case is allocate a new DSM, copy the data into that DSM, and hand the DSM over to the backend. At this point the stats collector can forget about it, and it's up to the backend to get rid of it when it's done with it.
That means the address space thing should not be any worse than today, because each backend will still only see "its own data". And we only need to copy the data for databases that are actually used.
On 2015-06-20 10:55:03 -0400, Tom Lane wrote:
> I dunno that tweaking the format would accomplish much. Where I'd love
> to get to is to not have to write the data to disk at all (except at
> shutdown). But that seems to require an adjustable-size shared memory
> block, and I'm not sure how to do that. One idea, if the DSM stuff
> could be used, is to allow the stats collector to allocate multiple
> DSM blocks as needed --- but how well would that work on 32-bit
> machines? I'd be worried about running out of address space.

We could also just mmap() the stats file into memory in various
processes. With a bit of care it should be quite possible to only mmap a
subset of the file at once, taking care of the address space issues.

There'll be some interesting problems to solve for both DSM and mmap
based solutions to make the locking work nicely. I guess most of it
should be doable quite sensibly using atomics; but the emulation on older
platforms might mean that we'd need to serialize at restarts.
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-20 10:55:03 -0400, Tom Lane wrote:
>> I dunno that tweaking the format would accomplish much. Where I'd love
>> to get to is to not have to write the data to disk at all (except at
>> shutdown). But that seems to require an adjustable-size shared memory
>> block, and I'm not sure how to do that. One idea, if the DSM stuff
>> could be used, is to allow the stats collector to allocate multiple
>> DSM blocks as needed --- but how well would that work on 32-bit
>> machines? I'd be worried about running out of address space.

> We could also just mmap() the stats file into memory in various
> processes. With a bit care it should be quite possible to only mmap a
> subsets of the file at once, taking care of the address space issues.

I think we should go into this with the mindset of re-using the DSM
infrastructure, rather than inventing a new mechanism of uncertain
portability.

regards, tom lane
On 2015-06-21 12:40:50 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > We could also just mmap() the stats file into memory in various
> > processes. With a bit care it should be quite possible to only mmap a
> > subsets of the file at once, taking care of the address space issues.
>
> I think we should go into this with the mindset of re-using the DSM
> infrastructure, rather than inventing a new mechanism of uncertain
> portability.

Maybe. I'm rather doubtful that it's a good idea to make a choice that'll
basically force all the stats to always be in memory though.

mmap()ing a file is one of the mechanisms for dsm, so it'd not be totally
unproven.

In a totally different crazy way we could just use the existing buffer
manager we have and simply put the stats file in shared_buffers.
Inventing a per-database relfilenode that doesn't conflict doesn't seem
impossible. With some care it shouldn't be hard to make that stats file
readable from all sessions, even if they're not connected to the database
(e.g. autovacuum launcher).
On Sun, Jun 21, 2015 at 11:43 AM, Magnus Hagander <magnus@hagander.net> wrote:
> On Sat, Jun 20, 2015 at 11:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Jun 20, 2015 at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> But if the structure
>> >> got too big to map (on a 32-bit system), then you'd be sort of hosed,
>> >> because there's no way to attach just part of it. That might not be
>> >> worth worrying about, but it depends on how big it's likely to get - a
>> >> 32-bit system is very likely to choke on a 1GB mapping, and maybe even
>> >> on a much smaller one.
>> >
>> > Yeah, I'm quite worried about assuming that we can map a data structure
>> > that might be of very significant size into shared memory on 32-bit
>> > machines. The address space just isn't there.
>>
>> Considering the advantages of avoiding message queues, I think we
>> should think a little bit harder about whether we can't find some way
>> to skin this cat. As I think about this a little more, I'm not sure
>> there's really a problem with one stats DSM per database. Sure, the
>> system might have 100,000 databases in some crazy pathological case,
>> but the maximum number of those that can be in use is bounded by
>> max_connections, which means the maximum number of stats file DSMs we
>> could ever need at one time is also bounded by max_connections. There
>> are a few corner cases to think about, like if the user writes a
>> client that connects to all 100,000 databases in very quick
>> succession, we've got to jettison the old DSMs fast enough to make
>> room for the new DSMs before we run out of slots, but that doesn't
>> seem like a particularly tough nut to crack. If the stats collector
>> ensures that it never attaches to more than MaxBackends stats DSMs at
>> a time, and each backend ensures that it never attaches to more than
>> one stats DSM at a time, then 2 * MaxBackends stats DSMs is always
>> enough. And that's just a matter of bumping
>> PG_DYNSHMEM_SLOTS_PER_BACKEND from 2 to 4.
>>
>> In more realistic cases, it will probably be normal for many or all
>> backends to be connected to the same database, and the number of stats
>> DSMs required will be far smaller.
>
> What about a combination in the line of something like this: stats collector
> keeps the statistics in local memory as before. But when a backend needs to
> get a snapshot of it's data, it uses a shared memory queue to request it.
> What the stats collector does in this case is allocate a new DSM, copy the
> data into that DSM, and hands the DSM over to the backend. At this point the
> stats collector can forget about it, and it's up to the backend to get rid
> of it when it's done with it.

Well, there seems to be little point in having the stats collector forget
about a DSM that it could equally well have shared with the next guy who
wants a stats snapshot for the same database. That case is surely
*plenty* common enough to be worth optimizing for.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jun 21, 2015 at 12:52 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-21 12:40:50 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>> > We could also just mmap() the stats file into memory in various
>> > processes. With a bit care it should be quite possible to only mmap a
>> > subsets of the file at once, taking care of the address space issues.
>>
>> I think we should go into this with the mindset of re-using the DSM
>> infrastructure, rather than inventing a new mechanism of uncertain
>> portability.
>
> Maybe. I'm rather doubtful that it's a good idea to make a choice
> that'll basically force all the stats to always be in memory though.
>
> mmap()ing a file is one of the mechanisms for dsm, so it'd not be
> totally unproven.

But it hasn't been made to work on Windows, and is probably pretty
lightly tested elsewhere. Besides, memory mapping a disk file has no real
advantages over a DSM that doesn't get written to disk. I/O is a serious
problem where the stats file is concerned, and more to the point, the
process that reads the file goes around and puts it back into memory
anyway.

> In totally different crazy way we could just use the existing buffer
> manager we have and simply put the stats file in
> shared_buffers. Inventing a per-database relfilenode that doesn't
> conflict doesn't seem impossible. With some care it shouldn't be hard to
> make that stats file readable from all sessions, even if they're not
> connected to the database (e.g. autovacuum launcher).

Interesting idea. We could consider the set of stats files a database
unto itself and reserve a low-numbered OID for it. The obvious thing to
do is use the database's OID as the relfilenode, but then how do you
replace the stats file?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 23, 2015 at 3:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Jun 21, 2015 at 11:43 AM, Magnus Hagander <magnus@hagander.net> wrote:
> On Sat, Jun 20, 2015 at 11:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Jun 20, 2015 at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> But if the structure
>> >> got too big to map (on a 32-bit system), then you'd be sort of hosed,
>> >> because there's no way to attach just part of it. That might not be
>> >> worth worrying about, but it depends on how big it's likely to get - a
>> >> 32-bit system is very likely to choke on a 1GB mapping, and maybe even
>> >> on a much smaller one.
>> >
>> > Yeah, I'm quite worried about assuming that we can map a data structure
>> > that might be of very significant size into shared memory on 32-bit
>> > machines. The address space just isn't there.
>>
>> Considering the advantages of avoiding message queues, I think we
>> should think a little bit harder about whether we can't find some way
>> to skin this cat. As I think about this a little more, I'm not sure
>> there's really a problem with one stats DSM per database. Sure, the
>> system might have 100,000 databases in some crazy pathological case,
>> but the maximum number of those that can be in use is bounded by
>> max_connections, which means the maximum number of stats file DSMs we
>> could ever need at one time is also bounded by max_connections. There
>> are a few corner cases to think about, like if the user writes a
>> client that connects to all 100,000 databases in very quick
>> succession, we've got to jettison the old DSMs fast enough to make
>> room for the new DSMs before we run out of slots, but that doesn't
>> seem like a particularly tough nut to crack. If the stats collector
>> ensures that it never attaches to more than MaxBackends stats DSMs at
>> a time, and each backend ensures that it never attaches to more than
>> one stats DSM at a time, then 2 * MaxBackends stats DSMs is always
>> enough. And that's just a matter of bumping
>> PG_DYNSHMEM_SLOTS_PER_BACKEND from 2 to 4.
>>
>> In more realistic cases, it will probably be normal for many or all
>> backends to be connected to the same database, and the number of stats
>> DSMs required will be far smaller.
>
> What about a combination in the line of something like this: stats collector
> keeps the statistics in local memory as before. But when a backend needs to
> get a snapshot of it's data, it uses a shared memory queue to request it.
> What the stats collector does in this case is allocate a new DSM, copy the
> data into that DSM, and hands the DSM over to the backend. At this point the
> stats collector can forget about it, and it's up to the backend to get rid
> of it when it's done with it.
Well, there seems to be little point in having the stats collector
forget about a DSM that it could equally well have shared with the
next guy who wants a stats snapshot for the same database. That case
is surely *plenty* common enough to be worth optimizing for.
Right, we only need to drop it once we have received a stats message for it, i.e. once something has changed. And possibly combine that with a minimum time as well, as we have now, if we want to limit the potential churn.
On 6/22/15 8:05 PM, Robert Haas wrote:
>> In totally different crazy way we could just use the existing buffer
>> manager we have and simply put the stats file in
>> shared_buffers. Inventing a per-database relfilenode that doesn't
>> conflict doesn't seem impossible. With some care it shouldn't be hard to
>> make that stats file readable from all sessions, even if they're not
>> connected to the database (e.g. autovacuum launcher).
>
> Interesting idea. We could consider the set of stats files a database
> unto itself and reserve a low-numbered OID for it. The obvious thing
> to do is use the database's OID as the relfilenode, but then how do
> you replace the stats file?

I think one of the biggest use cases for the stats is to collect them
over time and put them in a database. It's rather tempting to just say
that's what we should be doing, and provide a knob for how often to
record the values and how long to keep the data for. That would eliminate
a lot of these problems.

The one part I don't see how to solve is the case where bad stuff is
happening *right now* and you don't/can't wait for the next reporting
interval. Presumably handling that requires all the stuff that's
discussed already and you might as well just handle the recording in user
space. But maybe someone has some clever ideas there...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
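For what it's worth, the table-level version of this can already be approximated in user space today; a minimal sketch, with a made-up table name and retention period, and with the scheduling itself left to cron or similar:

-- History table for periodic snapshots of the existing table-level stats.
CREATE TABLE stats_history (
    sampled_at  timestamptz NOT NULL DEFAULT now(),
    relid       oid         NOT NULL,
    seq_scan    bigint,
    idx_scan    bigint,
    n_tup_ins   bigint,
    n_tup_upd   bigint,
    n_tup_del   bigint
);

-- Run at each recording interval (the "how often" knob):
INSERT INTO stats_history (relid, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del)
SELECT relid, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables;

-- Prune according to the "how long to keep it" knob:
DELETE FROM stats_history
WHERE sampled_at < now() - interval '30 days';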
On 2015-06-22 21:05:52 -0400, Robert Haas wrote:
> Interesting idea. We could consider the set of stats files a database
> unto itself and reserve a low-numbered OID for it. The obvious thing
> to do is use the database's OID as the relfilenode, but then how do
> you replace the stats file?

The relmapper infrastructure should be able to take care of that.

Regards,

Andres
On Tue, Jun 23, 2015 at 3:47 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-22 21:05:52 -0400, Robert Haas wrote:
>> Interesting idea. We could consider the set of stats files a database
>> unto itself and reserve a low-numbered OID for it. The obvious thing
>> to do is use the database's OID as the relfilenode, but then how do
>> you replace the stats file?
>
> The relmapper infrastructure should be able to take care of that.

How? I think it assumes that the number of mapped entries is pretty small.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-06-23 16:32:54 -0400, Robert Haas wrote:
> On Tue, Jun 23, 2015 at 3:47 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-06-22 21:05:52 -0400, Robert Haas wrote:
> >> Interesting idea. We could consider the set of stats files a database
> >> unto itself and reserve a low-numbered OID for it. The obvious thing
> >> to do is use the database's OID as the relfilenode, but then how do
> >> you replace the stats file?
> >
> > The relmapper infrastructure should be able to take care of that.
>
> How? I think it assumes that the number of mapped entries is pretty small.

Well, we could just use one stats file for all of the stats - that'd not
necessarily be bad. The relmapper would just be there to be able, if we
need it, to rewrite them in the background.

The other alternative is to place the stats file in the target database
instead of pg_global, in which case we can just use the per database
relmapper.

Andres
On Mon, Jun 15, 2015 at 7:47 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 6/8/15 3:26 PM, Joel Jacobson wrote:
So I've heard from Magnus Hagander today IRL at our Stockholm PostgreSQL
User Group meeting where we discussed this idea. He told me the overhead
in the statistics collector is mainly when reading from it, not that
much when writing to it.
I've heard enough stories of people moving the stats files to faster storage that I'm not sure how true that really is...
Were the stories (or the experience which led to the stories) on 9.3 or later? Do they have a good way to reproduce it for testing purposes?
When I was testing the stat file split patch, one thing I noticed is that on the ext4 file system, when you atomically replace a file by renaming a file over the top of an existing file name, it will automatically fsync the file being renamed. This is exactly what we don't want in the case of the temp stats files. But there seems to be no way to circumvent it, other than unlinking the target file before the rename (which of course defeats the point of atomic replacement), or moving to a different filesystem for the stats files.
Perhaps the problem was related to that.
Cheers,
Jeff
On Sat, Jun 20, 2015 at 7:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jun 20, 2015 at 10:12 AM, Joel Jacobson <joel@trustly.com> wrote:
> Is there any chance the project would accept a patch which adds the
> pg_stat_*_columns-feature without first optimizing the collector?
I doubt it. It's such a pain point already that massively increasing
the amount of data we need to store does not seem like a good plan.
What if the increase only happened for people who turned on track_column_usage?
Cheers,
Jeff
On Wed, Jun 24, 2015 at 2:03 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sat, Jun 20, 2015 at 7:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Sat, Jun 20, 2015 at 10:12 AM, Joel Jacobson <joel@trustly.com> wrote:
>> > Is there any chance the project would accept a patch which adds the
>> > pg_stat_*_columns-feature without first optimizing the collector?
>>
>> I doubt it. It's such a pain point already that massively increasing
>> the amount of data we need to store does not seem like a good plan.
>
> What if the increase only happened for people who turned on
> track_column_usage?

Hmm. You know, what I think would really help is if the column statistics
were stored in different files from the table statistics, and requested
separately. Since these are just there for the DBA, and wouldn't be used
by anything automated, the enormous explosion wouldn't hurt much.

I'm skeptical of the proposal overall in any event: even if good work is
done to keep the performance impact small, how many DBAs will want to
increase the amount of statistical data by 1-2 orders of magnitude? It's
a lot of numbers to keep around for something that you won't look at very
often and which may not be informative when you do look at it (e.g. if
your queries tend to use SELECT *). But I won't stand in the way if
there's a general consensus that users will find this useful.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 06/24/2015 07:54 PM, Jeff Janes wrote:
>
> Were the stories (or the experience which lead to the stories) on 9.3 or
> later? Do they have a good way to reproduce it for testing purposes?

The per-db split can only improve things if there actually are multiple
databases, and if the objects are somehow evenly distributed among them.
If there's a single database, or if most of the objects are in a single
database (but stored in multiple schemas, for example), it does not help
much. So it's trivial to reproduce the previous issues.

> When I was testing the stat file split patch, one thing I noticed is
> that on the ext4 file system, when you atomically replace a file by
> renaming a file over the top of an existing file name, it will
> automatically fsync the file being renamed. This is exactly what we
> don't want in the case of the temp stats files. But there seems to be
> no way to circumvent it, other than unlinking the target file before the
> rename (which of course defeats the point of atomic replacement), or
> moving to a different filesystem for the stats files.

Interesting. I don't think unlinking is a good idea - my understanding is
that we do it this way because rename is supposed to be atomic or
something like that. I don't know whether addressing this filesystem
feature is worth it, or if pgstat.c is the right place to do that.
--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 6/24/15 6:41 PM, Tomas Vondra wrote:
>> Were the stories (or the experience which lead to the stories) on 9.3 or
>> later? Do they have a good way to reproduce it for testing purposes?
>
> The per-db split can only improve things if there actually are multiple
> databases, and if the objects are somehow evenly distributed among them.
> If there's a single database, or if most of the objects are in a single
> database (but stored in multiple schemas, for example), it does not help
> much. So it's trivially to reproduce the previous issues.

Right, and a single large database is a pretty common scenario.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 06/27/2015 12:30 AM, Jim Nasby wrote:
> On 6/24/15 6:41 PM, Tomas Vondra wrote:
>>> Were the stories (or the experience which lead to the stories) on
>>> 9.3 or later? Do they have a good way to reproduce it for testing
>>> purposes?
>>
>> The per-db split can only improve things if there actually are
>> multiple databases, and if the objects are somehow evenly
>> distributed among them. If there's a single database, or if most of
>> the objects are in a single database (but stored in multiple
>> schemas, for example), it does not help much. So it's trivially to
>> reproduce the previous issues.
>
> Right, and a single large database is a pretty common scenario.

FWIW I doubt this scenario is so common. Most of the cases of issues with
stats file I've heard of were about a database server shared by many
applications, separated into databases. That's basically the scenario
we've been facing with our PostgreSQL machines, and the motivation for
the split.

Maybe there are many clusters with a single database, containing many
objects, but I don't remember any recent reports of problems with stats
files from them. Those pretty much disappeared after 9.3, which is when
the split happened.

That is not to say we can just design it in a way that will cause OOM
issues or crashes on such machines ...

regards
--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 6/26/15 6:09 PM, Joel Jacobson wrote:
> Can't we just use the infrastructure of PostgreSQL to handle the few
> megabytes of data we are talking about here? Why not just store the data
> in a regular table? Why bother with special files and special data
> structures? If it's just a table we want to produce as output, why can't
> we just store it in a regular table, in the pg_catalog schema?

The problem is the update rate. I've never tried measuring it, but I'd
bet that the stats collector can end up with 10s of thousands of updates
per second. MVCC would collapse under that kind of load.

What might be interesting is setting things up so the collector simply
inserted into history tables every X seconds and then had a separate
process to prune that data. The big problem with that is I see no way for
that to easily allow access to real-time data (which is certainly
necessary sometimes).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Jun 29, 2015 at 11:14 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
What might be interesting is setting things up so the collector simply inserted into history tables every X seconds and then had a separate process to prune that data. The big problem with that is I see no way for that to easily allow access to real-time data (which is certainly necessary sometimes)
I think the idea sounds promising. If near real-time data is required, we could just update once every second, which should be often enough for everybody.
Each backend process could then simply INSERT the stats for each txn that committed or rolled back into an UNLOGGED table, and the collector would then do a single UPDATE of the collector stats based on the aggregate of the rows inserted since the previous update a second ago, and then delete the processed rows (naturally in one operation, using DELETE FROM .. RETURNING *).
That way we could get rid of the legacy communication protocol between the backends and the collector and instead rely on unlogged tables for the submission of data from the backends to the collector.
INSERTing 100 000 rows into an unlogged table takes 70 ms on my laptop, so it should be fast enough to handle the tens of thousands of updates per second we need to handle.
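A minimal sketch of that aggregation step, assuming invented table names (stats_inbox for the unlogged table the backends insert into, column_stats for the aggregated stats) and ignoring keys not yet present in column_stats:

-- Backends append raw per-txn counters here ...
CREATE UNLOGGED TABLE stats_inbox (
    relid      oid    NOT NULL,
    attnum     int2   NOT NULL,
    n_accesses bigint NOT NULL
);

-- ... and the collector folds them into the aggregated stats.
CREATE TABLE column_stats (
    relid      oid    NOT NULL,
    attnum     int2   NOT NULL,
    n_accesses bigint NOT NULL,
    PRIMARY KEY (relid, attnum)
);

-- Drain the inbox and update the aggregate in one statement,
-- using DELETE ... RETURNING as described above:
WITH consumed AS (
    DELETE FROM stats_inbox
    RETURNING relid, attnum, n_accesses
), agg AS (
    SELECT relid, attnum, sum(n_accesses) AS n_accesses
    FROM consumed
    GROUP BY relid, attnum
)
UPDATE column_stats c
SET n_accesses = c.n_accesses + a.n_accesses
FROM agg a
WHERE c.relid = a.relid AND c.attnum = a.attnum;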
On 2015-06-29 16:14:34 -0500, Jim Nasby wrote:
> What might be interesting is setting things up so the collector simply
> inserted into history tables every X seconds and then had a separate process
> to prune that data. The big problem with that is I see no way for that to
> easily allow access to real-time data (which is certainly necessary
> sometimes).

Uh. Hot-Standby?
Joel Jacobson <joel@trustly.com> writes:
> On Mon, Jun 29, 2015 at 11:14 PM, Jim Nasby <Jim.Nasby@bluetreble.com>
> wrote:
>> What might be interesting is setting things up so the collector simply
>> inserted into history tables every X seconds and then had a separate
>> process to prune that data. The big problem with that is I see no way for
>> that to easily allow access to real-time data (which is certainly necessary
>> sometimes)

> I think the idea sounds promising. If near real-time data is required, we
> could just update once every second, which should be often enough for
> everybody.

I'd bet a good lunch that performance will be absolutely disastrous.
Even with unlogged tables, the vacuuming cost would be intolerable.
Why would we insist on pushing the envelope in what's known to be
Postgres' weakest area performance-wise?

> Each backend process could then simply INSERT the stats for each txn that
> committed/rollbacked into an UNLOGGED table,

... and if its transaction failed, how would it do that?

Regular tables are *not* what we want here, either from a semantics or
a performance standpoint.

regards, tom lane