Thread: Help with large delete
Currently I have one table that mimics a file system. Each entry has a parent_id and a base name, where parent_id is an id in the table that must exist in the table or be null, with cascade on delete.

I’ve started a delete of a root entry with about 300,000 descendants. The table currently has about 22M entries and I’m still adding about 1600 entries per minute. Eventually there will not be massive amounts of entries being added and the table will be mostly static.

I started the delete before from a terminal that got detached. So I killed that process and started it up again from a terminal less likely to get detached.

My question is basically: how can I make life easier for Postgres? I believe (hope) the deletes will be few and far between, but they will happen from time to time. In this case, Dropbox — it’s a long story that isn’t really pertinent. The point is that @#$% happens.

“What can I do” includes starting completely over if necessary. I’ve only got about a week invested in this and it’s just machine time at zero cost. I could stop the other processes that are adding entries and let the delete finish if that would help, etc.

Thank you for your time,
Perry
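A sketch of how the size of the subtree described above could be gauged before starting the delete (table and column names are the ones that appear later in the thread; the basename '/mnt/pedz/Visual_Media' is the root entry mentioned later):

```sql
-- Count the descendants of one root entry with a recursive CTE,
-- so the scope of a big delete is known before it starts.
WITH RECURSIVE subtree AS (
    SELECT id
      FROM dateien
     WHERE basename = '/mnt/pedz/Visual_Media'
    UNION ALL
    SELECT d.id
      FROM dateien d
      JOIN subtree s ON d.parent_id = s.id
)
SELECT count(*) FROM subtree;
```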
On 4/16/22 07:25, Perry Smith wrote:
I would try:

1) Find any nodes with disproportionately more descendants and deal with them separately. You may have a gut feel for where these nodes are?

2) Start at least one step down: run a transaction for each entry in the root node. Maybe go two levels down.
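A sketch of that "start one level down" idea, using psql's \gexec so each generated DELETE runs as its own statement (and, under psql's default autocommit, its own transaction); the ON DELETE CASCADE constraint then takes out each child's subtree, so a disconnect only loses the subtree in flight:

```sql
-- Generate one DELETE per direct child of the root entry, then
-- execute each result row as a separate statement with \gexec.
SELECT format('DELETE FROM dateien WHERE id = %s', id)
  FROM dateien
 WHERE parent_id = (SELECT id FROM dateien
                     WHERE basename = '/mnt/pedz/Visual_Media')
\gexec
```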
On 2022-04-16 08:25:56 -0500, Perry Smith wrote:
> Currently I have one table that mimics a file system. Each entry has
> a parent_id and a base name where parent_id is an id in the table that
> must exist in the table or be null with cascade on delete.
>
> I’ve started a delete of a root entry with about 300,000 descendants.
> The table currently has about 22M entries and I’m adding about 1600
> entries per minute still. Eventually there will not be massive
> amounts of entries being added and the table will be mostly static.
>
> I started the delete before from a terminal that got detached. So I
> killed that process and started it up again from a terminal less
> likely to get detached.
>
> My question is basically how can I make life easier for Postgres?

Deleting 300k rows doesn't sound that bad. Neither does recursively finding those 300k rows, although if you have a very biased distribution (many nodes with only a few children, but some with hundreds of thousands or even millions of children), PostgreSQL may not find a good plan.

So, as almost always when performance is an issue:

* What exactly are you doing?
* What is the execution plan?
* How long does it take?

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
Perry Smith <pedz@easesoftware.com> writes:
> Currently I have one table that mimics a file system. Each entry has a
> parent_id and a base name where parent_id is an id in the table that
> must exist in the table or be null with cascade on delete.
> I’ve started a delete of a root entry with about 300,000 descendants.
> The table currently has about 22M entries and I’m adding about 1600
> entries per minute still. Eventually there will not be massive amounts
> of entries being added and the table will be mostly static.

The most obvious question is do you have an index on the referencing
column. PG doesn't require one to exist to create an FK; but if you
don't, deletes of referenced rows had better be uninteresting to you
performance-wise, because each one will cause a seqscan.

        regards, tom lane
On Apr 16, 2022, at 10:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Perry Smith <pedz@easesoftware.com> writes:
>> Currently I have one table that mimics a file system. Each entry has a parent_id and a base name where parent_id is an id in the table that must exist in the table or be null with cascade on delete.
>> I’ve started a delete of a root entry with about 300,000 descendants. The table currently has about 22M entries and I’m adding about 1600 entries per minute still. Eventually there will not be massive amounts of entries being added and the table will be mostly static.
>
> The most obvious question is do you have an index on the referencing
> column. PG doesn't require one to exist to create an FK; but if you
> don't, deletes of referenced rows had better be uninteresting to you
> performance-wise, because each one will cause a seqscan.
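The point about the referencing column can be addressed with a plain index on it; a minimal sketch (the index name matches the one created later in the thread, and CONCURRENTLY is optional but avoids blocking the processes that are still inserting rows):

```sql
-- Plain btree index on the FK column, so cascaded deletes can find
-- child rows by parent_id instead of seq-scanning the 22M-row table.
-- CONCURRENTLY builds the index without holding a write-blocking lock.
CREATE INDEX CONCURRENTLY parent_id_index ON dateien (parent_id);
```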
To try to reply to Peter’s question, I just now started:
psql -c "explain analyze delete from dateien where basename = '/mnt/pedz/Visual_Media'" find_dups
And it hasn’t replied yet. I hope you are not slapping your head muttering “this guy is an idiot!!” — in that this would not give you the plan you are asking for...
This is inside a BSD “jail” on a NAS. I’m wondering if the jail has a limited time and the other processes have consumed it all. In any case, if / when it replies, I will post the results.
For Tom’s question, here is the description of the table:
psql -c '\d dateien' find_dups
Table "public.dateien"
Column | Type | Collation | Nullable | Default
------------+--------------------------------+-----------+----------+-------------------------------------
id | bigint | | not null | nextval('dateien_id_seq'::regclass)
basename | character varying | | not null |
parent_id | bigint | | |
dev | bigint | | not null |
ftype | character varying | | not null |
uid | bigint | | not null |
gid | bigint | | not null |
ino | bigint | | not null |
mode | bigint | | not null |
mtime | timestamp without time zone | | not null |
nlink | bigint | | not null |
size | bigint | | not null |
sha1 | character varying | | |
created_at | timestamp(6) without time zone | | not null |
updated_at | timestamp(6) without time zone | | not null |
Indexes:
"dateien_pkey" PRIMARY KEY, btree (id)
"unique_dev_ino_for_dirs" UNIQUE, btree (dev, ino) WHERE ftype::text = 'directory'::text
"unique_parent_basename" UNIQUE, btree (COALESCE(parent_id, '-1'::integer::bigint), basename)
Foreign-key constraints:
"fk_rails_c01ebbd0bf" FOREIGN KEY (parent_id) REFERENCES dateien(id) ON DELETE CASCADE
Referenced by:
TABLE "dateien" CONSTRAINT "fk_rails_c01ebbd0bf" FOREIGN KEY (parent_id) REFERENCES dateien(id) ON DELETE CASCADE
Perry Smith <pedz@easesoftware.com> writes:
> On Apr 16, 2022, at 10:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The most obvious question is do you have an index on the referencing
>> column. PG doesn't require one to exist to create an FK; but if you
>> don't, deletes of referenced rows had better be uninteresting to you
>> performance-wise, because each one will cause a seqscan.
> For Tom’s question, here is the description of the table:
> Indexes:
>     "dateien_pkey" PRIMARY KEY, btree (id)
>     "unique_dev_ino_for_dirs" UNIQUE, btree (dev, ino) WHERE ftype::text = 'directory'::text
>     "unique_parent_basename" UNIQUE, btree (COALESCE(parent_id, '-1'::integer::bigint), basename)
> Foreign-key constraints:
>     "fk_rails_c01ebbd0bf" FOREIGN KEY (parent_id) REFERENCES dateien(id) ON DELETE CASCADE

Yeah. So if you want to make deletes on this table not be unpleasantly slow, you need an index on the parent_id column, and you don't have one. (unique_parent_basename doesn't help, because with that definition it's useless for looking up rows by parent_id.)

        regards, tom lane
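A sketch of why the expression index can't help here: the planner only matches an expression index when a query uses the same expression, and the FK cascade compares the bare column.

```sql
-- This predicate matches the existing expression index and could use it:
SELECT id FROM dateien
 WHERE COALESCE(parent_id, '-1'::integer::bigint) = 12345;

-- The FK cascade, however, looks up children with a plain comparison,
-- which the COALESCE(...) index cannot serve, so each cascaded delete
-- falls back to a seqscan:
SELECT id FROM dateien WHERE parent_id = 12345;
```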
Make your connection immune to disconnects by using something like the screen utility.
Regards, Jan
On Apr 16, 2022, at 12:57, Jan Wieck <jan@wi3ck.info> wrote:
> Make your connection immune to disconnects by using something like the screen utility.

Exactly… I’m using emacs in a server (daemon) mode so it stays alive. Then I do “shell” within it.
On Apr 16, 2022, at 12:24 PM, Perry Smith <pedz@easesoftware.com> wrote:
> On Apr 16, 2022, at 12:57, Jan Wieck <jan@wi3ck.info> wrote:
>> Make your connection immune to disconnects by using something like the screen utility.
> Exactly… I’m using emacs in a server (daemon) mode so it stays alive. Then I do “shell” within it.

I use emacs a lot. It doesn’t keep the terminal alive in my experience. Perhaps nohup?
On Apr 16, 2022, at 13:56, Rob Sargent <robjsargent@gmail.com> wrote:
> I use emacs a lot. It doesn’t keep the terminal alive in my experience. Perhaps nohup?
Doing: emacs --daemon
You will see a couple of messages about loading up the customized file and then it detaches and you get back to the prompt.
I’m sending this again. I don’t see that it made it to the list but there is also new info here.
On Apr 16, 2022, at 10:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Perry Smith <pedz@easesoftware.com> writes:
>> Currently I have one table that mimics a file system. Each entry has a parent_id and a base name where parent_id is an id in the table that must exist in the table or be null with cascade on delete.
>> I’ve started a delete of a root entry with about 300,000 descendants. The table currently has about 22M entries and I’m adding about 1600 entries per minute still. Eventually there will not be massive amounts of entries being added and the table will be mostly static.
>
> The most obvious question is do you have an index on the referencing
> column. PG doesn't require one to exist to create an FK; but if you
> don't, deletes of referenced rows had better be uninteresting to you
> performance-wise, because each one will cause a seqscan.
To try to reply to Peter’s question, I started:

psql -c "explain analyze delete from dateien where basename = '/mnt/pedz/Visual_Media'" find_dups
I did this last night at 10 p.m. and killed it just now at 6:30 without any response.
This is inside a BSD “jail” on a NAS. I don’t know how much CPU the jail is given.
For Tom’s question, here is the description of the table (same as shown earlier in the thread):

psql -c '\d dateien' find_dups
To do a simple delete of a node that has no children takes about 11 seconds:
time psql -c "delete from dateien where id = 13498939;" find_dups
DELETE 1
psql -c "delete from dateien where id = 13498939;" find_dups 0.00s user 0.01s system 0% cpu 11.282 total
I’m implementing the suggestion that I do the recursion myself, but at this rate it will take about 38 days to delete 300K entries. I must be doing something horribly wrong. I hope you guys can enlighten me.
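A way to see where those 11 seconds per delete go (a sketch): EXPLAIN ANALYZE on a delete reports the time spent in foreign-key constraint triggers separately, and with no index on parent_id that trigger line accounts for nearly all of the runtime, since each cascade check has to scan the whole 22M-row table.

```sql
-- The "Trigger for constraint fk_rails_c01ebbd0bf" line in the output
-- shows the cost of the cascaded child lookup for this one row.
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM dateien WHERE id = 13498939;
```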
Thank you for your time,
Perry
Thank you TOM!!!
So… I did:
create index parent_id_index on dateien(parent_id);
And now things are going much faster. As you can see, I had an index kinda sorta on the parent id but I guess the way I did it prevented Postgres from using it.
On Apr 17, 2022, at 06:58, Perry Smith <pedz@easesoftware.com> wrote:
> I’m sending this again. I don’t see that it made it to the list but there is also new info here.