Thread: query slow; strace output worrisome

query slow; strace output worrisome

From

Brian Cox

Date:

06 April 2010, 01:02:30

The SELECT show below has been running for 30+mins and the strace output
is alarming:

[root@dione ~]# strace -p 10083
Process 10083 attached - interrupt to quit
creat(umovestr: Input/output error
0x2, 037777777777)                = 1025220608
creat(umovestr: Input/output error
0x2, 037777777777)                = 1025220608
creat(umovestr: Input/output error
0x2, 037777777777)                = 1025220608
creat(NULL, 037777777777)               = 216203264
restart_syscall(<... resuming interrupted call ...>) = 8192
creat(umovestr: Input/output error
0x2, 037777777777)                = 1025220608
creat(umovestr: Input/output error
0x2, 037777777777)                = 1025220608
creat(umovestr: Input/output error
0x2, 037777777777)                = 1025220608
creat(umovestr: Input/output error
0x2, 037777777777)                = 1025220608

however, I can find no indication of I/O errors in the postgres log.
Anyone have any idea what's going on here?

postgres 8.3.7

[root@dione ~]# uname -a
Linux dione.ca.com 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009
x86_64 x86_64 x86_64 GNU/Linux

Thanks,
Brian


top:
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10083 postgres  25   0 2288m 822m 817m R 100.0  1.7  28:08.79 postgres


cemdb=# select procpid,query_start,current_query from pg_stat_activity;
    10083 | 2010-04-05 17:18:34.989022-07 | select b.ts_id from
ts_stats_transetgroup_usergroup_daily b , ts_stats_transet_user_interval
c, ts_transetgroup_transets_map tm, ts_usergroup_users_map um where
b.ts_transet_group_id =
tm.ts_transet_group_id and tm.ts_transet_incarnation_id =
c.ts_transet_incarnation_id and b.ts_usergroup_id = um.ts_usergroup_id
and um.ts_user_incarnation_id = c.ts_user_incarnation_id and
c.ts_interval_start_time >= '2010-03-29 21:00' and
c.ts_interval_start_time < '2010-03-29 22:00' and
b.ts_interval_start_time >= '2010-03-29 00:00' and
b.ts_interval_start_time < '2010-03-30 00:00'

cemdb=# explain select b.ts_id from
ts_stats_transetgroup_usergroup_daily b , ts_stats_transet_user_interval
c, ts_transetgroup_transets_map tm, ts_usergroup_users_map um where
b.ts_transet_group_id =
tm.ts_transet_group_id and tm.ts_transet_incarnation_id =
c.ts_transet_incarnation_id and b.ts_usergroup_id = um.ts_usergroup_id
and um.ts_user_incarnation_id = c.ts_user_incarnation_id and
c.ts_interval_start_time >= '2010-03-29 21:00' and
c.ts_interval_start_time < '2010-03-29 22:00' and
b.ts_interval_start_time >= '2010-03-29 00:00' and
b.ts_interval_start_time < '2010-03-30 00:00';
          QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Hash Join  (cost=1169.95..7046.23 rows=1 width=8)
    Hash Cond: ((b.ts_transet_group_id = tm.ts_transet_group_id) AND
(c.ts_transet_incarnation_id = tm.ts_transet_incarnation_id))
    ->  Nested Loop  (cost=1159.95..7036.15 rows=10 width=24)
          ->  Nested Loop  (cost=0.00..28.16 rows=6 width=24)
                ->  Index Scan using
ts_stats_transetgroup_usergroup_daily_starttimeindex on
ts_stats_transetgroup_usergroup_daily b  (cost=0.00..8.86 rows=1 width=24)
                      Index Cond: ((ts_interval_start_time >=
'2010-03-29 00:00:00-07'::timestamp with time zone) AND
(ts_interval_start_time < '2010-03-30 00:00:00-07'::timestamp with time
zone))
                ->  Index Scan using
ts_usergroup_users_map_usergroupindex on ts_usergroup_users_map um
(cost=0.00..19.23 rows=6 width=16)
                      Index Cond: (um.ts_usergroup_id = b.ts_usergroup_id)
          ->  Bitmap Heap Scan on ts_stats_transet_user_interval c
(cost=1159.95..1167.97 rows=2 width=16)
                Recheck Cond: ((c.ts_user_incarnation_id =
um.ts_user_incarnation_id) AND (c.ts_interval_start_time >= '2010-03-29
21:00:00-07'::timestamp with time zone) AND (c.ts_interval_start_time <
'2010-03-29 22:00:00-07'::timestamp with time zone))
                ->  BitmapAnd  (cost=1159.95..1159.95 rows=2 width=0)
                      ->  Bitmap Index Scan on
ts_stats_transet_user_interval_userincarnationidindex  (cost=0.00..14.40
rows=438 width=0)
                            Index Cond: (c.ts_user_incarnation_id =
um.ts_user_incarnation_id)
                      ->  Bitmap Index Scan on
ts_stats_transet_user_interval_starttime  (cost=0.00..1134.09 rows=44856
width=0)
                            Index Cond: ((c.ts_interval_start_time >=
'2010-03-29 21:00:00-07'::timestamp with time zone) AND
(c.ts_interval_start_time < '2010-03-29 22:00:00-07'::timestamp with
time zone))
    ->  Hash  (cost=7.00..7.00 rows=200 width=16)
          ->  Seq Scan on ts_transetgroup_transets_map tm
(cost=0.00..7.00 rows=200 width=16)
(17 rows)

Re: query slow; strace output worrisome

From

Craig Ringer

Date:

06 April 2010, 01:54:19

On 6/04/2010 9:01 AM, Brian Cox wrote:
> The SELECT show below has been running for 30+mins and the strace output
> is alarming:
>
> [root@dione ~]# strace -p 10083
> Process 10083 attached - interrupt to quit
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(NULL, 037777777777) = 216203264
> restart_syscall(<... resuming interrupted call ...>) = 8192
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
> creat(umovestr: Input/output error
> 0x2, 037777777777) = 1025220608
>
> however, I can find no indication of I/O errors in the postgres log.
> Anyone have any idea what's going on here?

Anything in `dmesg' (command) or /var/log/syslog ?

--
Craig Ringer

Re: query slow; strace output worrisome

From

Brian Cox

Date:

06 April 2010, 04:06:18

On 04/05/2010 09:53 PM, Craig Ringer [craig@postnewspapers.com.au] wrote:
> Anything in `dmesg' (command) or /var/log/syslog ?
nothing out of the ordinary. Brian

Re: query slow; strace output worrisome

From

Craig Ringer

Date:

06 April 2010, 05:21:08

Brian Cox wrote:
> On 04/05/2010 09:53 PM, Craig Ringer [craig@postnewspapers.com.au] wrote:
>> Anything in `dmesg' (command) or /var/log/syslog ?
> nothing out of the ordinary. Brian

I'm wondering if the issue is with strace rather than Pg. That is to
say, that strace is trying to print:

creat("/path/to/file", 0x2, 037777777777)                = 1025220608

... but to get "/path/to/file" it's calling umovestr() which is
returning an I/O error, perhaps due to some kind of security framework
like AppArmor or SELinux in place on your system.

Yep, umovestr() is in util.c:857 in the strace 4.5.18 sources. It looks
like it might read from the target's memory via /proc or using ptrace
depending on build configuration. Either are obvious targets for
security framework limitations.

So what you're probably seeing is actually strace being prevented from
getting some information out of the postgres backend's memory by system
security policy.

As for what Pg is doing: creat() returns -1 on error and a file
descriptor otherwise, so the return value appears to indicate success.
I'm kind of wondering what the Pg backend is actually _doing_ though -
if it was using sort temp files you'd expect
open()/write()/read()/close() not just creat() calls.

--
Craig Ringer

Re: query slow; strace output worrisome

From

Brian Cox

Date:

06 April 2010, 17:33:25

On 04/06/2010 01:18 AM, Craig Ringer [craig@postnewspapers.com.au] wrote:
> I'm wondering if the issue is with strace rather than Pg. That is to
> say, that strace is trying to print:
Thanks, Craig: I do think that this is a strace issue.

> As for what Pg is doing: creat() returns -1 on error and a file
> descriptor otherwise, so the return value appears to indicate success.
> I'm kind of wondering what the Pg backend is actually _doing_ though -
> if it was using sort temp files you'd expect
> open()/write()/read()/close() not just creat() calls.
My quesiton exactly: what is the backend doing calling creat() over and
over again? Note that this query does complete -- in 30-40 mins.

Brian

Re: query slow; strace output worrisome

From

Craig Ringer

Date:

07 April 2010, 02:33:22

On 7/04/2010 12:24 AM, Brian Cox wrote:
> On 04/06/2010 01:18 AM, Craig Ringer [craig@postnewspapers.com.au] wrote:
>> I'm wondering if the issue is with strace rather than Pg. That is to
>> say, that strace is trying to print:
> Thanks, Craig: I do think that this is a strace issue.
>
>> As for what Pg is doing: creat() returns -1 on error and a file
>> descriptor otherwise, so the return value appears to indicate success.
>> I'm kind of wondering what the Pg backend is actually _doing_ though -
>> if it was using sort temp files you'd expect
>> open()/write()/read()/close() not just creat() calls.
> My quesiton exactly: what is the backend doing calling creat() over and
> over again? Note that this query does complete -- in 30-40 mins.

I'd assume it was tempfile creation, but for the fact that there's
nothing but creat() calls.

However, we can't trust strace. There may be more going on that is being
hidden from strace via limitations on the ptrace syscall imposed by
SELinux / AppArmor / whatever.

I suggest turning on the logging options in Pg that report use of
tempfiles and disk-spilled sorts, then have a look and see if Pg is in
fact using on-disk temp files for sorts or materialized data sets.

If it is, you might find that increasing work_mem helps your query out.

--
Craig Ringer

Re: query slow; strace output worrisome

From

Robert Haas

Date:

07 April 2010, 18:39:39

On Tue, Apr 6, 2010 at 10:32 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> On 7/04/2010 12:24 AM, Brian Cox wrote:
>>
>> On 04/06/2010 01:18 AM, Craig Ringer [craig@postnewspapers.com.au] wrote:
>>>
>>> I'm wondering if the issue is with strace rather than Pg. That is to
>>> say, that strace is trying to print:
>>
>> Thanks, Craig: I do think that this is a strace issue.
>>
>>> As for what Pg is doing: creat() returns -1 on error and a file
>>> descriptor otherwise, so the return value appears to indicate success.
>>> I'm kind of wondering what the Pg backend is actually _doing_ though -
>>> if it was using sort temp files you'd expect
>>> open()/write()/read()/close() not just creat() calls.
>>
>> My quesiton exactly: what is the backend doing calling creat() over and
>> over again? Note that this query does complete -- in 30-40 mins.
>
> I'd assume it was tempfile creation, but for the fact that there's nothing
> but creat() calls.
>
> However, we can't trust strace. There may be more going on that is being
> hidden from strace via limitations on the ptrace syscall imposed by SELinux
> / AppArmor / whatever.
>
> I suggest turning on the logging options in Pg that report use of tempfiles
> and disk-spilled sorts, then have a look and see if Pg is in fact using
> on-disk temp files for sorts or materialized data sets.
>
> If it is, you might find that increasing work_mem helps your query out.

Yeah, I'd start with EXPLAIN and then, if you can wait long enough,
EXPLAIN ANALYZE.

You'll probably find it's doing a big sort or a big hash join.

...Robert