Thread: [GENERAL] EXPLAIN <query> command just hangs...
On Thu, Nov 02, 2017 at 08:51:23PM +0000, Rhhh Lin wrote:
> However, this query will run for days without completing. I suspect it has to do with the timestamp predicate and lack of using an appropriate index access path. This is what I need to verify/establish.

Perhaps the timestamp index is badly fragmented, and perhaps it would help to reindex/cluster/pg_repack..

> So I try and perform a simple 'EXPLAIN <query>' in order to check what the planner has for the execution of this query.
> And after approx. six hours waiting, nothing has returned. It is still executing, but has not given me back my prompt (I can see the session is still active). My understanding is that the simple EXPLAIN version does not actually execute the query, so I do not understand why this is also performing poorly/hanging/stuck? Any ideas?

Is explain "wait"ing ? If you do "ps -fu postgres |grep EXPLAIN" does it say "EXPLAIN waiting" ?
Or, if you "ps uww ThePID" does it show lots of CPU (or RAM) ?
If you do "SELECT * FROM pg_stat_activity WHERE pid=??" (from "ps" or from SELECT pg_backend_pid() before starting "explain") does it show "active" state or waiting ?
If it's waiting, you can see what it's waiting ON by looking at pg_locks..
Maybe like: SELECT c.query, * FROM pg_locks a JOIN pg_locks b USING(relation) JOIN pg_stat_activity c ON b.pid=c.pid WHERE a.pid=?? (from ps)

Justin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Thanks for replying Justin.
Yes, it may be an issue with the index, but I'd like to have some evidence towards that before dropping and recreating it (it does not appear that bloat or dead tuples are a problem here).
The EXPLAIN <query> statement is currently 'f' for waiting and 'active' via pg_stat_activity, so it is doing something. The ps command does not show me anything more verbose.
I had already checked the locks and blocking locks to see if there was an issue there, and there was none. I can see there are AccessShareLock modes granted.
The reason I am very suspicious of the timestamp column makeup is that if I remove that predicate from both the EXPLAIN command and the actual query, both complete within seconds without issue. So I do know where the issue is (I just don't know what the issue is!).
Regards,
Ruan
Sent: 02 November 2017 21:02
To: Rhhh Lin
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] EXPLAIN <query> command just hangs...
> However, this query will run for days without completing. I suspect it has to do with the timestamp predicate and lack of using an appropriate index access path. This is what I need to verify/establish.
Perhaps the timestamp index is badly fragmented, and perhaps it would help to
reindex/cluster/pg_repack..
> So I try and perform a simple 'EXPLAIN <query>' in order to check what the planner has for the execution of this query.
> And after approx. six hours waiting, nothing has returned. It is still executing, but has not given me back my prompt (I can see the session is still active).My understanding is that the simple EXPLAIN version does not actually execute the query, so I do not understand why this is also performing poorly/hanging/stuck? Any ideas?
Is explain "wait"ing ? If you do "ps -fu postgres |grep EXPLAIN" does it say
"EXPLAIN waiting" ?
Or, if you "ps uww ThePID" does it show lots of CPU(or RAM) ?
If you do "SELECT * FROM pg_stat_activity WHERE pid=??" (from "ps" or from
SELECT pg_backend_pid() before starting "explain") does it show "active" state
or waiting ?
If it's waiting, you can see what it's waiting ON by looking at pg_locks..
Maybe like: SELECT c.query, * FROM pg_locks a JOIN pg_locks b USING(relation)
JOIN pg_stat_activity c ON b.pid=c.pid WHERE a.pid=?? (from ps)
Justin
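For readers following the thread, Justin's pg_locks suggestion can be made a little more explicit. A sketch (the pid 12345 is a hypothetical value taken from "ps"), restricted to lock requests that are actually blocked:

```sql
-- For a hypothetical stuck backend pid 12345: show each ungranted lock it is
-- waiting on, together with the holder of the conflicting granted lock and
-- that holder's current query.
SELECT w.locktype, w.relation::regclass, w.mode,
       g.pid   AS blocking_pid,
       a.query AS blocking_query
FROM pg_locks w
JOIN pg_locks g
  ON g.relation = w.relation
 AND g.granted
 AND g.pid <> w.pid
JOIN pg_stat_activity a ON a.pid = g.pid
WHERE w.pid = 12345
  AND NOT w.granted;
```

If this returns no rows, the backend is not blocked on a relation lock, which points at it actually burning CPU instead of waiting.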
Rhhh Lin <ruanlinehan@hotmail.com> writes:
> The EXPLAIN <query> statement is currently 'f' for waiting and 'active' via pg_stat_activity, so it is doing something. The ps command does not show me anything more verbose.

ps would confirm for sure whether it was eating CPU time, whereas I do not particularly trust pg_stat_activity to tell you that.

> The reason I am very suspect of the timestamp column makeup is that if I
> remove that predicate from the EXPLAIN command and the actual query,
> both complete within seconds without issue.

We've seen issues with the planner having trouble trying to determine the extreme values of an indexed column, in cases where there are a lot of uncommitted or recently-dead entries at the end of the index --- it does a lot of work trying to verify the commit status of each entry in turn. So I wonder if that might apply.

regards, tom lane
On Thu, Nov 02, 2017 at 09:13:05PM +0000, Rhhh Lin wrote:
> Yes, it may be an issue with the index, but I'd like to have some evidence
> towards that before dropping and recreating (It does not appear that bloat is
> a problem here or dead tuples either).

Why do you say those aren't an issue? Just curious.
Have you vacuumed or reindexed (or pg_repack)?
How large are the table and index? \dt+ and \di+

> The reason I am very suspect of the timestamp column makeup is that if I
> remove that predicate from the EXPLAIN command and the actual query, both
> complete within seconds without issue. So I do know where the issue is (I
> just dont know what the issue is!).

It could be that you're hitting selfuncs.c:get_actual_variable_range() and the extremes of the index point to many dead tuples (as I see Tom suggests).
You could strace the backend and see if it's reading (or writing??) consecutively (hopefully with ample OS readahead) or randomly (without).

Justin
On 2017-11-02 20:51:23 +0000, Rhhh Lin wrote:
[...]
> where timestamp BETWEEN 1506676560000 AND 1508750220000
[...]
> *Also, as a sidenote - can someone please expand on why one (I was not involved
> in the creation of this DB/schema definition) would choose to have the
> definition of the timestamp column as a bigint in this case?

The numbers look like Java timestamps (milliseconds since the epoch). So probably the programs accessing the database are written in Java and the programmer decided that it's simpler to do all timestamp computations in Java than in SQL. Or maybe the programmer just felt more comfortable with Java-like timestamps than with calendar dates. (I have done the same (with Unix timestamps, i.e. seconds since the epoch). Although in the cases where I've done it I had the additional excuse that the database didn't support timestamps with timezones, which isn't the case for PostgreSQL.)

hp
Thanks for replying Tom.
Ah, I misunderstood the purpose of what you were looking for from the "ps" command. Most definitely, the postgres process for this session is eating up CPU cycles. That fact is what drew us to investigate the session activity in the first place. The CPUs (8 cores) on this host are currently being pegged by a couple of sessions, all experiencing issues with the same or similar queries against this very table.
The actual index on the table is a composite index on (timestamp, measurement_id), but it is also the primary key constraint, so I don't particularly want to go messing with it (this is also a production environment).
My initial attempt to 'explain' the plan was to check whether the query planner was utilising this composite index based on the predicates. If not, I could reasonably add an explicit index on "timestamp", but right now I cannot check and verify because it just hangs...
Regards,
Ruan
Sent: 02 November 2017 21:40
To: Rhhh Lin
Cc: Justin Pryzby; pgsql-general@postgresql.org
Subject: Re: [GENERAL] EXPLAIN <query> command just hangs...
> The EXPLAIN <query> statement is currently 'f' for waiting and 'active' via pg_stat_activity, so it is doing something. The ps command does not show me anything more verbose.
ps would confirm for sure whether it was eating CPU time, whereas I do not
particularly trust pg_stat_activity to tell you that.
> The reason I am very suspect of the timestamp column makeup is that if I
> remove that predicate from the EXPLAIN command and the actual query,
> both complete within seconds without issue.
We've seen issues with the planner having trouble trying to determine the
extreme values of an indexed column, in cases where there are a lot of
uncommitted or recently-dead entries at the end of the index --- it does
a lot of work trying to verify the commit status of each entry in turn.
So I wonder if that might apply.
regards, tom lane
Hi Justin,
I checked for dead tuples against that particular table initially, as I have seen performance problems before in a related 'busy' environment which needed its vacuuming frequency to be increased. So I have a query to check for tables with dead tuples, and this table is not showing any.
I also came across the suggestion that bloat might be an issue on the database, along with how to identify and address it, and it does not appear to be evident here either, so that's my thinking as to why these are not factors.
That said, if you have any pointers as to how I could verify (bloat or dead tuples) are not a factor, please let me know?
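One way to verify both, as a sketch: the table name 'measurements' below is a placeholder, since the thread never names the actual table.

```sql
-- Dead-tuple and vacuum statistics for one table:
SELECT relname, n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'measurements';

-- Compare the planner's estimates with the physical size; a relpages count
-- far out of proportion to reltuples is a rough sign of bloat:
SELECT reltuples::bigint, relpages,
       pg_size_pretty(pg_relation_size(oid)) AS table_size
FROM pg_class
WHERE relname = 'measurements';
```

Note that n_dead_tup is itself an estimate maintained by the statistics collector, so a zero there is suggestive rather than conclusive.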
I have vacuumed. I have not reindexed, as it is a prod environment and I see that "REINDEX locks out writes but not reads of the index's parent table.", so I may have to arrange this to avoid any interruptions (although currently, accessing this table seems completely problematic anyway!).
The table is 691 MB and the composite index (PK) is 723 MB.
My thinking now is I may need to export this data out to a staging area whereby I can attempt to "play" with it without any repercussions...
Regards,
Ruan
Sent: 02 November 2017 21:49
To: Rhhh Lin
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] EXPLAIN <query> command just hangs...
> Yes, it may be an issue with the index, but I'd like to have some evidence
> towards that before dropping and recreating (It does not appear that bloat is
> a problem here or dead tuples either).
Why do you say those aren't an issue? Just curious.
Have you vacuumed or reindexed (or pg_repack)?
How large are the table and index? \dt+ and \di+
> The reason I am very suspect of the timestamp column makeup is that if I
> remove that predicate from the EXPLAIN command and the actual query, both
> complete within seconds without issue. So I do know where the issue is (I
> just dont know what the issue is!).
It could be that you're hitting selfuncs.c:get_actual_variable_range() and the
extremes of the index point to many dead tuples (as I see Tom suggests).
You could strace the backend and see if it's reading (or writing??)
consecutively (hopefully with ample OS readahead) or randomly (without).
Justin
On Fri, Nov 03, 2017 at 09:12:02PM +0000, Rhhh Lin wrote:
> I checked for dead tuples against that particular table initially as I have seen performance problems before in a related 'busy' environment which needed its frequency of vacuuming to be increased. So I have a query to check for tables with dead tuples and this table is not showing any.
>
> I also came across the suggestion that bloat might be an issue on the database and how to identify and address it and it does not appear to be evident here also, so thats my thinking as to why these are not factors.

It'd be helpful if you'd paste the commands+output as you run them "\dt+, \di+, ps, vacuum, dead tuples, etc"

> I have vacuumed. I have not reindexed as it is a prod environment and I see that... "REINDEX locks out writes but not reads of the index's parent table.", so I may have to arrange this to avoid any interruptions (Although currently, accessing this table seems completely problematic anyway!).

Perhaps you could look into pg_repack? Note that by default it will kill long-running transactions if it needs to in order to (briefly) obtain a super-exclusive lock.

> The table is 691MB and the composite index(PK) is 723 MB.

It'd be useful to see the pg_stat_user_tables.* and pg_class.reltuples and relpages for that table. Also output from VACUUM VERBOSE or autovacuum logs, if you have them (but note that vacuum does different work every time it's re-run).

> My thinking now is I may need to export this data out to a staging area whereby I can attempt to "play" with it without any repercussions...

I imagine that maybe this is related to the pattern of activity on that table (specifically around the extreme values of its indexed columns). So it'll be hard to reproduce, and dumping and reloading the table (or just reindexing it without reloading it at all) will probably temporarily improve or resolve the issue.
You could try *adding* a new index on the timestamp column alone (CREATE INDEX CONCURRENTLY). Re-creating the index might conceivably be the solution in the end, and it's what pg_repack does behind the scenes.

Justin
Thanks Justin,
Yeah, I'd agree it would be much easier to show you direct command output, but I'm prohibited due to the security and sensitive nature of my work environment.
A few observations I had earlier this evening.
I created a CTAS copy of the table, mirroring the original in terms of constraints, indexes, etc. I was able to query this copy table (using my "problematic" query) in under one second (actually, even before I had applied indexes). So I was also able to check the explain plan, and based on the index makeup, the access path should be using the appropriate index correctly. So I was somewhat pleased that the issue was not a query optimisation difficulty as such.
I also dug deeper into vacuum, and discovered that even though a vacuum appeared to be completing, a verbose vacuum was actually erroring out at the truncation step at the end and not shrinking the table to reclaim space. It could not obtain an exclusive lock to do this...
So I decided to intervene and terminate some active sessions (which were active and not waiting), all querying this table with a similar query, by using pg_terminate_backend(pid). Some of these sessions had been in an active state for a few days executing this query. However, the termination of the PIDs [the command succeeded and returned 't'] has not closed down the sessions. They are still present, still active and still executing. Those sessions are for whatever reason completely "hung", locking out exclusive access to the table (from vacuum), and cannot be killed. I have not yet resorted to trying to kill the process PIDs at the OS level. Instead, I will try to schedule an outage over the next few days, restart the server, run a vacuum and see where I am at then.
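For reference, the 't' return value only confirms that the signal was sent, not that the backend actually exited; a backend stuck in a tight loop that never checks for interrupts can survive it. A sketch with a hypothetical pid:

```sql
-- Gentler first step: ask the backend to cancel its current query (SIGINT):
SELECT pg_cancel_backend(12345);

-- If that has no effect, ask it to exit (SIGTERM). Returns true as soon as
-- the signal is delivered, even if the backend never services it:
SELECT pg_terminate_backend(12345);
```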
Regards,
Ruan
Sent: 03 November 2017 21:33
To: Rhhh Lin
Cc: Tom Lane; pgsql-general@postgresql.org
Subject: Re: [GENERAL] EXPLAIN <query> command just hangs...
> I checked for dead tuples against that particular table initially as I have seen performance problems before in a related 'busy' environment which needed its frequency of vacuuming to be increased. So I have a query to check for table with dead tuples and this table is not showing any.
>
> I also came across the suggestion that bloat might be an issue on the database and how to identify and address it and it does not appear to be evident here also, so thats my thinking as to why these are not factors.
It'd be helpful if you'd paste the commands+output as you run them "\dt+, \di+,
ps, vacuum, dead tuples, etc"
> I have vacuumed. I have not reindexed as it is a prod environment and I see that... "REINDEX locks out writes but not reads of the index's parent table.", so I may have to arrange this to avoid any interruptions (Although currently, accessing this table seems completely problematic anyway!).
Perhaps you could look into pg_repack? Note that by default it will kill
long-running transactions if it needs to in order to (briefly) obtain a
super-exclusive lock.
> The table is 691MB and the composite index(PK) is 723 MB.
It'd be useful to see the pg_stat_user_tables.* and pg_class.reltuples and
relpages for that table. Also output from VACUUM VERBOSE or autovacuum logs,
if you have them (but note that vacuum does different work every time it's
re-run).
> My thinking now is I may need to export this data out to a staging area whereby I can attempt to "play" with it without any repercussions...
I imagine that maybe this is related to the pattern of activity on that table
(specifically around the extreme values of its indexed columns). So it'll be
hard to reproduce, and dumping and reloading the table (or just reindexing it
without reloading it at all) will probably temporarily improve or resolve the
issue.
You could try *adding* a new index on the timestamp column alone (CREATE INDEX
CONCURRENTLY). Re-creating the index might conceivably be the solution in
the end, and it's what pg_repack does behind the scenes.
Justin
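Justin's suggestion might look like the following (index and table names are hypothetical; note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block):

```sql
-- Build a single-column index on the bigint timestamp column without
-- blocking writes to the table:
CREATE INDEX CONCURRENTLY measurements_ts_idx
    ON measurements ("timestamp");

-- If the build is interrupted it leaves an INVALID index behind;
-- drop it and retry:
-- DROP INDEX CONCURRENTLY measurements_ts_idx;
```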
That helps, thanks Peter.
Regards,
Ruan
Sent: 03 November 2017 08:49
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] EXPLAIN <query> command just hangs...
[...]
> where timestamp BETWEEN 1506676560000 AND 1508750220000
[...]
> *Also, as a sidenote - can someone please expand on why one (I was not involved
> in the creation of this DB/schema definition) would choose to have the
> definition of the timestamp column as a bigint in this case?
The numbers look like Java timestamps (Milliseconds since the epoch). So
probably the programs accessing the database are written in Java and the
programmer decided that it's simpler to do all timestamp computations in
Java than in SQL. Or maybe the programmer just felt more comfortable
with Java-like timestamps than with calendar dates. (I have done the
same (with Unix timestamps, i.e. seconds since the epoch). Although in
the cases where I've done it I had the additional excuse that the
database didn't support timestamps with timezones, which isn't the case
for PostgreSQL.)
hp
--
_ | Peter J. Holzer | we build much bigger, better disasters now
|_|_) | | because we have much more sophisticated
| | | hjp@hjp.at | management tools.
__/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
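Peter's reading can be checked directly: treating the predicate bounds as milliseconds since the Unix epoch yields plausible 2017 dates. A sketch using Postgres's to_timestamp(), which expects seconds:

```sql
-- Divide by 1000.0 to convert milliseconds to (fractional) seconds:
SELECT to_timestamp(1506676560000 / 1000.0) AT TIME ZONE 'UTC';
-- 2017-09-29 09:16:00
SELECT to_timestamp(1508750220000 / 1000.0) AT TIME ZONE 'UTC';
-- 2017-10-23 09:17:00
```

So the BETWEEN predicate covers roughly a 24-day window in the autumn of 2017, consistent with a recent-data query.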
Rhhh Lin <ruanlinehan@hotmail.com> writes:
> So I decided to intervene and terminate some active sessions (which were active and not waiting), all querying this table with a similar query, by using pg_terminate_backend(pid). Some of these sessions had been in an active state for a few days executing this query. However, the termination of the PIDs [the command succeeded and returned 't'] has not closed down the sessions. They are still present, still active and still executing. Those sessions are for whatever reason completely "hung", locking out exclusive access to the table (from vacuum), and cannot be killed. I have not yet resorted to trying to kill the process PIDs at the OS level. Instead, I will try to schedule an outage over the next few days, restart the server, run a vacuum and see where I am at then.

Well, that's pretty interesting in itself. Any chance of attaching to one of those unkillable backends with gdb and getting a stack trace?
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

regards, tom lane