Thread: BUG #15460: Error while creating index or constraint
The following bug has been logged on the website:

Bug reference:      15460
Logged by:          doskabouter
Email address:      paul.vanderlinden@mapcreator.eu
PostgreSQL version: 11.0
Operating system:   Windows Server 2008 R2
Description:

On large tables I get an error on:

CREATE INDEX nx_tablename ON tablename USING btree (col1);

(col1 being an integer) and

ALTER TABLE ONLY tablename ADD CONSTRAINT pk_tablename PRIMARY KEY (col1, col2);

(col1 and col2 both integer)

First noticed that with pg_restore (from a pg_dump from PG10):

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 14005; 1259 546886907 INDEX nx_tablename postgres
pg_restore: [archiver (db)] could not execute query: ERROR: could not determine size of temporary file "0"
    Command was: CREATE INDEX nx_tablename ON tablename USING btree (col1);

After the restore finished (with that error) I tried running the same query in pgAdmin. There I got:

********** Error **********
ERROR: could not determine size of temporary file "0"
SQL state: 58P01

Table definition:

CREATE TABLE tablename (
    col1 bigint NOT NULL,
    col2 bigint NOT NULL,
    col3 bigint NOT NULL,
    col4 bigint NOT NULL,
    col5 bigint,
    col6 integer NOT NULL
);

I have several comparable tables (same columns). Errors occur for the tables with 386726424 and 556071229 records; for the smaller tables (the largest has 163183465 records) there is no error.

PostgreSQL/PostGIS versions:
"POSTGIS="2.5.0 r16836" [EXTENSION] PGSQL="110" GEOS="3.7.0-CAPI-1.11.0 3.7.0" PROJ="Rel. 4.9.3, 15 August 2016" GDAL="GDAL 2.2.4, released 2018/03/19" LIBXML="2.7.8" LIBJSON="0.12" LIBPROTOBUF="1.2.1" RASTER"
"PostgreSQL 11.0, compiled by Visual C++ build 1914, 64-bit"

Running on a 28-core, 75GB Windows Server 2008 R2 64-bit machine.
postgresql.conf: all parallel-related options are at their defaults.
On Fri, Oct 26, 2018 at 10:32 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> On large tables I get an error on:
>
> CREATE INDEX nx_tablename ON tablename USING btree (col1); col1 being an integer
> and
> ALTER TABLE ONLY tablename ADD CONSTRAINT pk_tablename PRIMARY KEY (col1, col2); col1 and col2 both integer
>
> First noticed that with pg_restore (from a pg_dump from PG10)
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 14005; 1259 546886907 INDEX nx_tablename postgres
> pg_restore: [archiver (db)] could not execute query: ERROR: could not determine size of temporary file "0"
> Command was: CREATE INDEX nx_tablename ON tablename USING btree (col1);

I initially suspected that this was caused by commit 445e31bdc74. This commit made the parallel CREATE INDEX leader call BufFileSize(), rather than having each worker call it themselves; it is where Paul's error message actually originates. The commit also changed BufFileSize() so that it would lseek(), rather than just passing filesize metadata through shared memory.

I'm not sure what the exact problem might be, because I can't see why the BufFileSize() thing would fail, even on Windows. I think that using lseek() + SEEK_END to get the size of a file may only work on POSIX, and yet I can see the same pattern in much older code (e.g., _mdnblocks()). I don't have a Windows machine, and the spec seems rather hard to read here.

I have a strong suspicion that going back to passing the size through shared memory (i.e. partially reverting 445e31bdc74) would make the problem go away, but I won't do that until I actually understand what's going on.

--
Peter Geoghegan
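For reference, the lseek() + SEEK_END idiom under discussion looks roughly like this. This is a POSIX-flavoured illustrative sketch, not PostgreSQL source, and the path is hypothetical; the shared-memory alternative Peter mentions is sketched further down the thread.

/* Illustrative sketch only -- not PostgreSQL code. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static off_t
file_size_by_seek(const char *path)
{
    int   fd = open(path, O_RDONLY);
    off_t size;

    if (fd < 0)
        return -1;                      /* open failed, e.g. ENOENT */

    size = lseek(fd, 0, SEEK_END);      /* offset of end-of-file == size */
    close(fd);
    return size;
}

int
main(void)
{
    printf("size: %lld\n", (long long) file_size_by_seek("pgsql_tmp0.0"));
    return 0;
}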
On Mon, Oct 29, 2018 at 11:41 AM Peter Geoghegan <pg@bowt.ie> wrote: > I have a strong suspicion that going back to passing the size through > shared memory (i.e. partially reverting 445e31bdc74) would make the > problem go away, but I won't do that until I actually understand > what's going on. BufFileOpenShared()/SharedFileSetOpen() is allowed to fail in the event of no file (an ENOENT) within PathNameOpenTemporaryFile(). OTOH, BufFileSize() can throw an error due to the same condition following 445e31bdc74, which is what we see here. If it's okay in the first function, why shouldn't it be okay in the second function, which is called shortly afterwards? That's where Paul's ERRCODE_UNDEFINED_FILE actually comes from. ISTM that only one of those assumptions can be correct. I thought that parallel CREATE INDEX didn't need to rely on this ENOENT-is-okay business, and that that was just there for parallel hash join, which is why I didn't complain when 445e31bdc74 went in. It now seems like logtape.c could correctly rely on ENOENT-is-okay, though. I think that temporary file "0" is almost always the leader-as-worker fileset, which may be relevant. Paul: Does varying whether an index is unique or not change the outcome of any of your failing test cases? Does anyone see a way in which the logtape.c + shared buffile stuff could come to rely on assuming that an ENOENT case relates to an empty worker BufFile? I suppose it's also possible that there is a bug that partially reverting 445e31bdc74 would only mask, though that seems less likely. -- Peter Geoghegan
It happens with both unique and non-unique indexes. When I disable the virus scanner I still get the error, but setting max_parallel_workers to 0 does let me create the index.
On Mon, Oct 29, 2018 at 12:45 PM Paul van der Linden <paul.vanderlinden@mapcreator.eu> wrote:
> It happens with both unique and non-unique indexes.
> When I disable the virus scanner I still get the error, but setting max_parallel_workers to 0 does let me create the index.

Thanks for trying that out. This confirms that parallel CREATE INDEX is involved.

Is it possible for you to provide me with a dump of the data, or a minimal test case? You can provide it to me off-list if that's easier, possibly using Google Drive or something. I realize that the amount of data involved here isn't trivial, so this may be difficult.

I'm a little suspicious of the fact that your table is probably aligned to certain boundaries, since it consists only of integers/bigints. Maybe that's just a coincidence, though.

Thanks
--
Peter Geoghegan
Well, I also saw it on uuids and possibly an enum. One of the tables is a materialized view with 550161114 records consisting of 2 uuid columns and 2 enum columns. Total size is 34 GB...
On Mon, Oct 29, 2018 at 12:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Is it possible for you to provide me with a dump of the data, or a
> minimal test case? You can provide it to me off-list if that's
> easier, possibly using Google Drive or something. I realize that the
> amount of data involved here isn't trivial, so this may be difficult.

It might also be helpful to see if amcheck shows any problems with the indexes you can create without error. You could run:

CREATE EXTENSION amcheck;

SELECT bt_index_check(index => c.oid, heapallindexed => true),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

Hopefully this won't find any problems/raise any errors, but if it does then they'll likely be relevant. It might take several minutes to run the query, or maybe even several hours if there is lots of data. You may have to use your own judgement when it comes to determining what's worth checking, and what you don't have time to check (e.g., add something extra to the WHERE clause, so that only some indexes get checked).

I'm suggesting this because it isn't very much work, and may shed some light on the problem. Though the likelihood is that it won't find anything wrong.

--
Peter Geoghegan
On Mon, Oct 29, 2018 at 1:01 PM Paul van der Linden <paul.vanderlinden@mapcreator.eu> wrote: > > Well I also saw it on uuids and possibly enum. > One of the tables is a materialized view with 550161114 records consisting of 2 uuid-colums and 2 enum columns. > Total size is 34 GB... Another thing that could be helpful is if you "set trace_sort = on" within your session/pgAdmin, and showed us log output from the start of the failing CREATE INDEX to the point that it fails. You may or may not want to "set client_min_messages = LOG" while you're at it, since that will send LOG output to the client, which may be more convenient. Thanks -- Peter Geoghegan
Amcheck is still running. Trace_sort is attached.
On Mon, Oct 29, 2018 at 1:44 PM Paul van der Linden <paul.vanderlinden@mapcreator.eu> wrote:
>
> Amcheck is still running.
> Trace_sort is attached.

This first line looks like it might be interesting:

LOG: could not rmdir directory "base/pgsql_tmp/pgsql_tmp5088.0.sharedfileset": Directory not empty
ERROR: could not determine size of temporary file "0"

I suppose that this could actually just be a result of the ERROR; the exact order isn't a reliable indicator of the sequence of events across processes (a useful log_line_prefix setting might clear this up if you collect the trace_sort output again).

Can you show me the contents of the directory that we fail to rmdir here? Their names and their sizes. The directory should remain around until you restart the server, since the usual error cleanup path doesn't successfully finish here.

Thanks
--
Peter Geoghegan
On Mon, Oct 29, 2018 at 2:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > I suppose that this could actually just be a result of the ERROR; the > exact order isn't a reliable indicator of the sequence of events > across processes (A useful log_line_prefix setting might clear this up > if you collect the trace_sort output again). For example, this log_line_prefix will show process PID, and a timestamp: log_line_prefix='%p/%t ' -- Peter Geoghegan
The directory is empty when I check after the error occurred; no idea yet (will test) whether there's something there during index creation. And for some reason I'm not able to set log_line_prefix in my pgAdmin session... ERROR: parameter "log_line_prefix" cannot be changed now
On Mon, Oct 29, 2018 at 2:19 PM Paul van der Linden <paul.vanderlinden@mapcreator.eu> wrote: > Directory is empty (when I check after the error occurred, no idea yet (will test) if there's something there during indexcreation. There'll definitely be something during index creation, though the exact directory name will vary per operation. Do you use temp_tablespaces (easy way to find out is to "show temp_tablespaces")? There's one directory per temp tablespace, so maybe you looked in the wrong one? > And for some reason I'm not able to set the log_line_prefix in my pgAdmin session... ERROR: parameter "log_line_prefix"cannot be changed now That needs to be set in postgresql.conf. -- Peter Geoghegan
I do see some files now (0.0, 1.0 and 2.0) in that directory. Not able to restart postgres at the moment because it's also running some other stuff...
Oh, and the bt_index_check just finished without any errors.
On Mon, Oct 29, 2018 at 2:28 PM Paul van der Linden <paul.vanderlinden@mapcreator.eu> wrote: > > I do see some files now (0.0, 1.0 and 2.0) in that directory. > Not able to restart postgres at the moment because it's also running some other stuff... No need to restart. Just call pg_reload_conf() as a superuser, once postgresql.conf has the change. That won't be disruptive. -- Peter Geoghegan
On Mon, Oct 29, 2018 at 2:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > This first line looks like it might be interesting: > > LOG: could not rmdir directory > "base/pgsql_tmp/pgsql_tmp5088.0.sharedfileset": Directory not empty > ERROR: could not determine size of temporary file "0" (Thomas Munro is CC'd here.) > I suppose that this could actually just be a result of the ERROR; the > exact order isn't a reliable indicator of the sequence of events > across processes (A useful log_line_prefix setting might clear this up > if you collect the trace_sort output again). Hmm. So apparently Windows has a habit of setting an ENOTEMPTY errcode when rmdir'ing a directory that somebody merely has a handle to. It could just be that somebody has a Windows Explorer window open -- you still get ENOTEMPTY [1]! Not sure if this is truly relevant to the problem at hand, but it seems worth being aware of. [1] https://github.com/gruntjs/grunt-contrib-clean/issues/66 -- Peter Geoghegan
Hm, the log_line_prefix didn't change a lot in the logs... And what's worse (but completely unrelated): the performance benefits of upgrading to pg11 don't work for me (a big number of heavy geo calculations unioned all into one select still doesn't use more than one worker/core...)
On Mon, Oct 29, 2018 at 3:18 PM Paul van der Linden <paul.vanderlinden@mapcreator.eu> wrote: > > Hm, the log_line_prefix didn't change a lot in the logs... That's because log_line_prefix only appears in the server log. It'll be necessary to go into the server log to get those extra details. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Fri, Oct 26, 2018 at 10:32 AM PG Bug reporting form > <noreply@postgresql.org> wrote: >> pg_restore: [archiver (db)] could not execute query: ERROR: could not >> determine size of temporary file "0" >> Command was: CREATE INDEX nx_tablename ON tablename USING btree (col1); > I'm not sure what the exact problem might be, because I can't see why > the BufFileSize() thing would fail, even on Windows. So there are a couple of things to complain about here with respect to the error message, regardless of the underlying bug: * "0" is not a file name. Reporting it is just about useless, and identifying it as a "file" is a lie. Why aren't we reporting the actual name of the underlying file? (I realize that we might not have easy access to that right here. But now that this has been shown to be reachable in the field, it seems worth doing some extra work to make the error message useful.) * We aren't reporting the errno (no %m in the message). By luck, the errcode_for_file_access() call is enough to let us infer that it was ENOENT in this case, but in other cases that might not be enough. > I think that > using lseek() + SEEK_END to get the size of a file may only work on > POSIX, and yet I can see the same pattern in much older code (e.g., > _mdnblocks()). Yeah, that works fine on Windows AFAIK. I also note that ENOENT isn't an error code that lseek() can deliver, anyway, since it works on an already-open FD. The failure here must be coming from opening the file. I'm a little inclined to suspect that the true cause here is workers not correctly computing the name of this temp file, which is what led me to complain about the error message. Although a weak spot in this theory is that it's not clear why they'd not fail later anyway, unless maybe this particular file never got touched by workers before. > I have a strong suspicion that going back to passing the size through > shared memory (i.e. partially reverting 445e31bdc74) would make the > problem go away, but I won't do that until I actually understand > what's going on. Sounds like papering over the bug ... regards, tom lane
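To make those two complaints concrete, a report that names the underlying file and includes the errno would look something like the sketch below. This is a hypothetical backend-context fragment, not the fix that was eventually committed; "path" stands in for the real temporary file name.

#include "postgres.h"

static void
report_temp_file_size_failure(const char *path)
{
	ereport(ERROR,
			(errcode_for_file_access(),
			 /* %m appends the strerror() text for the saved errno */
			 errmsg("could not determine size of temporary file \"%s\": %m",
					path)));
}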
Paul van der Linden <paul.vanderlinden@mapcreator.eu> writes:
> Trace_sort is attached
> ...
> LOG: performsort of 0 done: CPU: user: 250.03 s, system: 42.66 s, elapsed: 294.35 s
> LOG: parallel external sort of 0 ended, 462886 disk blocks used: CPU: user: 250.03 s, system: 42.66 s, elapsed: 298.95 s
> LOG: begin index sort: unique = f, workMem = 2096128, randomAccess = f
> LOG: performsort of -1 starting: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> LOG: could not rmdir directory "base/pgsql_tmp/pgsql_tmp5088.0.sharedfileset": Directory not empty
> ERROR: could not determine size of temporary file "0"

"performsort of -1"? Seems a bit suspicious.

(And my complaint about inadequate messaging holds here as well. It's really unclear in this trace which digit strings are to be read as numbers and which ones are names of files/filesets. At the very least, these messages violate our message style guidelines: "When citing the name of an object, state what kind of object it is.")

			regards, tom lane
Ok, here's the correct one.
On Mon, Oct 29, 2018 at 3:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > So there are a couple of things to complain about here with respect > to the error message, regardless of the underlying bug: I agree with you fully. > Yeah, that works fine on Windows AFAIK. I also note that ENOENT > isn't an error code that lseek() can deliver, anyway, since it works > on an already-open FD. The failure here must be coming from opening > the file. Good point. > I'm a little inclined to suspect that the true cause here is workers > not correctly computing the name of this temp file, which is what > led me to complain about the error message. Although a weak spot in > this theory is that it's not clear why they'd not fail later anyway, > unless maybe this particular file never got touched by workers before. There just isn't that much to get right there, though. Another weak spot in that theory is that it seems unlikely that the first complaint we'd hear would happen to be from a Windows user. I think that they're very much in the minority, especially among early adopters. > > I have a strong suspicion that going back to passing the size through > > shared memory (i.e. partially reverting 445e31bdc74) would make the > > problem go away, but I won't do that until I actually understand > > what's going on. > > Sounds like papering over the bug ... I may have been unclear. It would be papering over the bug if I went ahead and did that now. The advantage of getting the file size from shared memory is that it doesn't leave it up to code like BufFileOpenShared() to find everything through readdir() iteration, an approach that might not be totally portable. We'll reliably fail if all BufFile segments cannot be accounted for with the size-in-shared-memory approach, which seems more robust. I wouldn't be surprised if that actually was the correct fix in the end. -- Peter Geoghegan
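For illustration, the "pass the size through shared memory" approach looks roughly like this. It is a self-contained sketch with made-up names, not the actual tuplesort/logtape structures, and the locking, barriers and DSM plumbing a real implementation needs are omitted.

#include <stdint.h>
#include <stdio.h>

#define MAX_WORKERS 8

/* Hypothetical per-sort state living in dynamic shared memory. */
typedef struct SharedSortSizes
{
    int     nworkers;
    int64_t worker_bytes[MAX_WORKERS];  /* written by each worker */
} SharedSortSizes;

/* Worker side: after flushing its tapes, publish the byte count. */
static void
worker_publish_size(SharedSortSizes *shared, int worker, int64_t bytes)
{
    shared->worker_bytes[worker] = bytes;
}

/*
 * Leader side: no lseek()/readdir() on the temporary files is needed,
 * so a missing or short segment shows up as a hard inconsistency when
 * the data is read, rather than as a guess about what ENOENT means.
 */
static int64_t
leader_total_bytes(const SharedSortSizes *shared)
{
    int64_t total = 0;

    for (int i = 0; i < shared->nworkers; i++)
        total += shared->worker_bytes[i];
    return total;
}

int
main(void)
{
    SharedSortSizes s = { .nworkers = 2 };

    worker_publish_size(&s, 0, (int64_t) 3 * 1024 * 1024 * 1024);
    worker_publish_size(&s, 1, (int64_t) 2 * 1024 * 1024 * 1024);
    printf("total: %lld bytes\n", (long long) leader_total_bytes(&s));
    return 0;
}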
Yes, I adopted early, but I more or less regret it because the performance (more parallel queries) that I was hoping for didn't work out for me. But on the plus side: I'm more than happy with finding bugs 😊
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Oct 29, 2018 at 3:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Sounds like papering over the bug ... > I may have been unclear. It would be papering over the bug if I went > ahead and did that now. > The advantage of getting the file size from shared memory is that it > doesn't leave it up to code like BufFileOpenShared() to find > everything through readdir() iteration, an approach that might not be > totally portable. We'll reliably fail if all BufFile segments cannot > be accounted for with the size-in-shared-memory approach, which seems > more robust. I wouldn't be surprised if that actually was the correct > fix in the end. Perhaps. Anyway, I think we agree that we should identify why it's failing before we decide on a fix. regards, tom lane
On Mon, Oct 29, 2018 at 3:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > LOG: performsort of 0 done: CPU: user: 250.03 s, system: 42.66 s, elapsed: 294.35 s > > LOG: parallel external sort of 0 ended, 462886 disk blocks used: CPU: user: 250.03 s, system: 42.66 s, elapsed: 298.95s > > LOG: begin index sort: unique = f, workMem = 2096128, randomAccess = f > > LOG: performsort of -1 starting: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s > > LOG: could not rmdir directory "base/pgsql_tmp/pgsql_tmp5088.0.sharedfileset": Directory not empty > > ERROR: could not determine size of temporary file "0" > > "performsort of -1"? Seems a bit suspicious. This just refers to the leader Tuplesortstate. It isn't suspicious. We follow the convention that worker -1 is the leader within tuplesort.c. trace_sort is a developer option, so this seems fine to me. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Oct 29, 2018 at 3:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> LOG: performsort of 0 done: CPU: user: 250.03 s, system: 42.66 s, elapsed: 294.35 s
>>> LOG: parallel external sort of 0 ended, 462886 disk blocks used: CPU: user: 250.03 s, system: 42.66 s, elapsed: 298.95 s
>>> LOG: begin index sort: unique = f, workMem = 2096128, randomAccess = f
>>> LOG: performsort of -1 starting: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
>>> LOG: could not rmdir directory "base/pgsql_tmp/pgsql_tmp5088.0.sharedfileset": Directory not empty
>>> ERROR: could not determine size of temporary file "0"

>> "performsort of -1"? Seems a bit suspicious.

> This just refers to the leader Tuplesortstate. It isn't suspicious.
> We follow the convention that worker -1 is the leader within
> tuplesort.c.

Hmm. But the sort of "0" has already completed, according to the first couple of log lines I quoted. Why is something still trying to access it? Why is a worker trying to do anything at all with the leader's Tuplesortstate?

> trace_sort is a developer option, so this seems fine to me.

That's a poor excuse for ignoring the message style guidelines. There are many reasons why even experts could get confused trying to follow badly-written messages.

			regards, tom lane
On Mon, Oct 29, 2018 at 4:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> "performsort of -1"? Seems a bit suspicious. > > > This just refers to the leader Tuplesortstate. It isn't suspicious. > > We follow the convention that worker -1 is the leader within > > tuplesort.c. > > Hmm. But the sort of "0" has already completed, according to the first > couple of log lines I quoted. Why is something still trying to access > it? Why is a worker trying to do anything at all with the leader's > Tuplesortstate? This is almost certainly because the parallel infrastructure generally doesn't guarantee that log output will be in order. Even if something bizarre took place with the temporary files, there is no way that the worker number in trace_sort would change within a worker. It's an immutable field within Tuplesortstate, initialized once. > > trace_sort is a developer option, so this seems fine to me. > > That's a poor excuse for ignoring the message style guidelines. > There are many reasons why even experts could get confused trying > to follow badly-written messages. I'll change the trace_sort messages to refer to "worker n" -- that should bring the trace_sort output in line with our style guidelines. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Oct 29, 2018 at 4:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> That's a poor excuse for ignoring the message style guidelines. >> There are many reasons why even experts could get confused trying >> to follow badly-written messages. > I'll change the trace_sort messages to refer to "worker n" -- that > should bring the trace_sort output in line with our style guidelines. WFM. regards, tom lane
On Tue, Oct 30, 2018 at 4:15 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Oct 29, 2018 at 2:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > This first line looks like it might be interesting:
> >
> > LOG: could not rmdir directory
> > "base/pgsql_tmp/pgsql_tmp5088.0.sharedfileset": Directory not empty
> > ERROR: could not determine size of temporary file "0"
>
> (Thomas Munro is CC'd here.)
>
> > I suppose that this could actually just be a result of the ERROR; the
> > exact order isn't a reliable indicator of the sequence of events
> > across processes (a useful log_line_prefix setting might clear this up
> > if you collect the trace_sort output again).
>
> Hmm. So apparently Windows has a habit of setting an ENOTEMPTY errcode
> when rmdir'ing a directory that somebody merely has a handle to. It
> could just be that somebody has a Windows Explorer window open -- you
> still get ENOTEMPTY [1]! Not sure if this is truly relevant to the
> problem at hand, but it seems worth being aware of.

We only try to remove the directory after removing everything in it, so that does seem to require a pretty strange explanation like that. I also wondered if a worker having a handle to an unlinked file (as permitted by the FILE_SHARE_DELETE flag we use) inside that directory could produce that effect. Perhaps RemoveDirectoryA[1] would be better than rmdir, since it "... marks a directory for deletion on close. Therefore, the directory is not removed until the last handle to the directory is closed." The documentation for rmdir[2] just says deprecated, and _rmdir[3] mentions ENOTEMPTY, but it says you get EACCES if someone has an open handle to the directory.

[1] https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-removedirectorya
[2] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/rmdir?view=vs-2017
[3] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/rmdir-wrmdir?view=vs-2017

--
Thomas Munro
http://www.enterprisedb.com
On Wed, Oct 31, 2018 at 2:37 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> We only try to remove the directory after removing everything in it,
> so that does seem to require a pretty strange explanation like that.
> I also wondered if a worker having a handle to an unlinked file (as
> permitted by the FILE_SHARE_DELETE flag we use) inside that directory
> could produce that effect. Perhaps RemoveDirectoryA[1] would be
> better than rmdir, since it "... marks a directory for deletion on
> close. Therefore, the directory is not removed until the last handle
> to the directory is closed." The documentation for rmdir[2] just says
> deprecated, and _rmdir[3] mentions ENOTEMPTY, but it says you get
> EACCES if someone has an open handle to the directory.

Is there a Windows developer who would like to experiment with this stuff -- perhaps with a small standalone program that tries to unlink a directory while it has another handle to the directory open -- and make a recommendation? Otherwise I'll try to do this with AppVeyor.

While having the directory open in Explorer.exe might seem a bit unusual for a production database, I bet there are other things that periodically go around opening and reading our directories: backup tools and SearchIndexer.exe presumably traverse the whole filesystem from time to time. By my reading, RemoveDirectoryA() is probably the right thing to defend against this, here and any other place we remove directories.

--
Thomas Munro
http://www.enterprisedb.com
Well, I can test. If you'll provide me with the call (incl. flags) that is made at that moment to remove the directory, I can see what is needed to make that fail, and possibly how to circumvent that.
On Fri, Nov 2, 2018 at 9:02 PM Paul van der Linden <paul.vanderlinden@mapcreator.eu> wrote:
> Well, I can test.
> If you'll provide me with the call (incl. flags) that is made at that moment to remove the directory, I can see what is needed to make that fail, and possibly how to circumvent that.

Thanks. Actually I think I now see what's going on with the "Directory not empty" error. I experimented a bit on AppVeyor's wonderful free Windows build bot, and finished up with a simple program[1] that spits out:

Creating directory my_dir...
Creating file my_dir\foo.txt...
Unlinking file my_dir\foo.txt...
Opening directory my_dir...
Removing directory my_dir with rmdir()...
Failed, error = 145, errno = 41 (Directory not empty).

Although you can unlink files and directories for which there are open handles floating around as long as all handles were opened with FILE_SHARE_DELETE, this shows that you can't unlink directories if they contain[ed] files that have now been unlinked, if someone still has a handle open. In other words, the file isn't really unlinked at all, it's just hidden until all referencing handles are closed, or something like that.

Back to PostgreSQL: in the non-error happy path there is no problem, because well-behaved clients close handles before detaching from the reference-counted SharedFileSet, so when the last detacher unlinks there can be no handles left. In the error path, however, we rely on resowner.c to close handles, and it does that *after* releasing the DSM segment that triggers unlinking. We discussed this exact topic while working on this stuff[2] and I concluded incorrectly that you get sufficiently Unix-like behaviour if all openers use FILE_SHARE_DELETE. I was apparently right about unlinking the files themselves, but not for unlinking directories that hold them.

So, it looks like we may need to reorder the cleanup code in resowner.c, after all. I already had a patch for that, but I'm certainly not comfortable making such a change for the minor release that's about to wrap, since it could have unforeseen consequences and needs more study.

Of course none of this explains the root problem that led to the error ('could not determine size of temporary file "0"'), it's just a noisy, temporary directory leaking error path that we should fix.

[1] https://github.com/macdice/hello-windows/blob/remove-directory/test.c
[2] https://www.postgresql.org/message-id/CAEepm=1Ugp7mNFX-YpSfWr0F_6QA4jzjtavauBcoAAZGd7_tPA@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
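The mechanism can be demonstrated with a few Win32 calls. The sketch below is along the same lines as the program described above but is not its exact code (see [1] for that); it relies on the documented semantics of FILE_SHARE_DELETE and _rmdir() rather than having been verified here.

/*
 * Windows-only sketch: keep a FILE_SHARE_DELETE handle open, unlink the
 * file, and the parent directory cannot be removed until the handle is
 * closed.
 */
#include <windows.h>
#include <direct.h>
#include <errno.h>
#include <stdio.h>

int
main(void)
{
    HANDLE h;

    CreateDirectoryA("my_dir", NULL);

    h = CreateFileA("my_dir\\foo.txt", GENERIC_WRITE,
                    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                    NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* Unlink the file while the handle above is still open. */
    DeleteFileA("my_dir\\foo.txt");

    /* The file is only hidden, not gone, so this should fail. */
    if (_rmdir("my_dir") != 0)
        printf("rmdir failed, errno = %d (ENOTEMPTY = %d)\n", errno, ENOTEMPTY);

    CloseHandle(h);             /* the pending delete completes here ... */

    if (_rmdir("my_dir") == 0)  /* ... so now the directory can go away */
        printf("rmdir succeeded after closing the handle\n");
    return 0;
}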
On Tue, Oct 30, 2018 at 4:15 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > ERROR: could not determine size of temporary file "0"

    filesize = BufFileSize(file);
    if (filesize < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not determine size of temporary file \"%s\"",
                        filename)));

Ahh. I think we may be overflowing a 32 bit off_t on this platform.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Nov 13, 2018 at 10:13 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Tue, Oct 30, 2018 at 4:15 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > ERROR: could not determine size of temporary file "0" > > filesize = BufFileSize(file); > if (filesize < 0) > ereport(ERROR, > (errcode_for_file_access(), > errmsg("could not determine > size of temporary file \"%s\"", filename))); > > Ahh. I think we may be overflowing a 32 bit off_t on this platform. I think using pgoff_t would fix the problem on Windows, but on other systems where it maps directly to off_t it might also be 32 bits, so I'm not entirely sure what pgoff_t is for. Perhaps we should just use int64 directly for this? Like in the attached draft patch. I don't have a Windows system to test it on. A separate bug report that came in today[1] has repro steps that could be used to validate it. [1] https://www.postgresql.org/message-id/flat/CAHDGBJP_GsESbTt4P3FZA8kMUKuYxjg57XHF7NRBoKnR%3DCAR-g%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
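The arithmetic failure mode is easy to see in isolation. The sketch below is simplified and is not the buffile.c code; the 1GB per-segment size is used only for illustration. A 32-bit total wraps negative for a multi-gigabyte temporary file, which is exactly what the filesize < 0 check quoted above would catch.

#include <stdint.h>
#include <stdio.h>

#define SEGMENT_SIZE (1024 * 1024 * 1024)       /* illustrative 1GB segments */

int
main(void)
{
    int     nsegments = 3;              /* a roughly 2-3GB temporary file */
    int32_t last_segment = 123456789;   /* bytes used in the final segment */

    /* The proposed computation, done in 64 bits throughout. */
    int64_t wide = (int64_t) (nsegments - 1) * SEGMENT_SIZE + last_segment;

    /* What survives in a 32-bit off_t: it wraps negative. */
    int32_t narrow = (int32_t) wide;

    printf("64-bit total: %lld\n", (long long) wide);
    printf("32-bit total: %d  (filesize < 0, hence the error)\n", narrow);
    return 0;
}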
On Tue, Nov 13, 2018 at 1:50 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I think using pgoff_t would fix the problem on Windows, but on other > systems where it maps directly to off_t it might also be 32 bits, so > I'm not entirely sure what pgoff_t is for. Perhaps we should just use > int64 directly for this? Like in the attached draft patch. I don't > have a Windows system to test it on. A separate bug report that came > in today[1] has repro steps that could be used to validate it. I think that int64 is the way to go. Should I fix up the diagnostic message to address Tom's complaints about that in a separate patch? -- Peter Geoghegan
On Wed, Nov 14, 2018 at 6:50 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Nov 13, 2018 at 1:50 AM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > I think using pgoff_t would fix the problem on Windows, but on other > > systems where it maps directly to off_t it might also be 32 bits, so > > I'm not entirely sure what pgoff_t is for. Perhaps we should just use > > int64 directly for this? Like in the attached draft patch. I don't > > have a Windows system to test it on. A separate bug report that came > > in today[1] has repro steps that could be used to validate it. > > I think that int64 is the way to go. Thanks. So, a practical matter: The change of return type as proposed is an ABI break for 32 bit off_t systems. BufFileSize() is brand new and seems pretty unlikely to be used by extensions so soon. Could we change that? A conservative alternative would be to define a second function BufFileSize64() and use that for parallel CREATE INDEX in REL_11_STABLE, leaving the off_t version as it is, but then fold them back into a single version in master. > Should I fix up the diagnostic message to address Tom's complaints > about that in a separate patch? +1 -- Thomas Munro http://www.enterprisedb.com
On Wed, Nov 14, 2018 at 9:12 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Nov 14, 2018 at 6:50 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I think that int64 is the way to go. Thanks. Pushed to master and 11. If anyone with a Windows system would like to verify that the problem is fixed, I found the following test case by changing BufFileSize() and ltsConcatWorkerTapes() to use int32 instead of off_t so I see the error too: create table t as select generate_series(1, 100000000) i, 'hello world, this is a test' v; create index on t (i, v); I expect that test case to fail on Windows before today's commit. -- Thomas Munro http://www.enterprisedb.com
On Mon, Oct 29, 2018 at 8:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So there are a couple of things to complain about here with respect
> to the error message, regardless of the underlying bug:

Attached patch fixes the problems with BufFileSize(), while also making similar adjustments to a closely related "ereport() + errcode_for_file_access()" call in buffile.c in passing -- the ereport() from BufFileOpenShared(). Note that both BufFileOpenShared() and BufFileSize() are used in a closely coordinated way already -- we literally call one immediately after the other from logtape.c. I don't like the current inconsistency in error handling.

My approach implies a minor API break in BufFileSize(), but I think that that's okay -- see the draft commit message for an explanation. Beyond what I say there, I think that any third party client code ought to be following the example of tuplestore_rescan(), and seeking to the end of the BufFile directly. BufFileSize() should just be a shared BufFile thing, in case it needs to be changed in the future.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Oct 29, 2018 at 8:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So there are a couple of things to complain about here with respect >> to the error message, regardless of the underlying bug: > Attached patch fixes the problems with BufFileSize(), while also > making similar adjustments to a closely related buffile.c "ereport() + > errcode_for_file_access()" in passing -- the ereport() from > BufFileOpenShared(). Note that both BufFileOpenShared() and > BufFileSize() are used in a closely coordinated way already -- we > literally call one immediately after the other from logtape.c. I don't > like the current inconsistency in error handling. LGTM. regards, tom lane
On Wed, Nov 28, 2018 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > LGTM. Pushed. Thanks. -- Peter Geoghegan
On Thu, Nov 29, 2018 at 11:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Nov 28, 2018 at 1:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > LGTM.
>
> Pushed. Thanks.

Thanks. So, of the three issues that came up in this thread, we have:

1. off_t overflow: fixed in aa551830.
2. Error reporting could be better: fixed in 1a990b20.
3. Windows can't unlink directories containing files that have been unlinked but are still open. This generates a WARNING and a left-behind empty directory when an error is raised.

Here's a draft patch for #3. I can't test it on Windows myself; we'd need to confirm it by hacking code somewhere to raise errors while either running a parallel CREATE INDEX or a batched parallel hash join, perhaps something involving a low probability divide-by-zero, or maybe someone could try reverting commit aa551830 and using the short repro I gave in my previous message upthread.

Assuming this can be confirmed to solve the problem, the real question is: is it a bad idea to back-patch code that changes the order of resowner release in this way? Specifically, could someone be upset if we start calling their DSM detach hook after we've released everything else, where previously it was before most other stuff? (Do any extensions use DSM segments, and specifically dsm_detach() hooks?)

--
Thomas Munro
http://www.enterprisedb.com
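For readers unfamiliar with the term in that last question, a DSM detach hook is a callback an extension registers against a dynamic shared memory segment; whether it fires before or after other resource cleanup is exactly the ordering being asked about. The sketch below is a minimal backend-context fragment with hypothetical names, not code from any particular extension.

#include "postgres.h"
#include "storage/dsm.h"

/* Hypothetical callback: runs when the segment is detached. */
static void
my_on_detach(dsm_segment *seg, Datum arg)
{
	elog(LOG, "my segment was detached");
}

static void
register_my_detach_hook(dsm_segment *seg)
{
	on_dsm_detach(seg, my_on_detach, (Datum) 0);
}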