Thread: Chunk Delete
The problem is that the table does not have a primary key; it would be too expensive.
On Thu, Nov 15, 2007 at 03:09:10PM +0200, Abraham, Danny wrote:
> The problem is that the table does not have a primary key; it would be
> too expensive.

If the table doesn't have a primary key, you've designed it wrong. But I'd
like to see any evidence you have at all that having a primary key is "too
expensive".

A
--
Andrew Sullivan
Old sigs will return after re-constitution of blue smoke
Say we have a FIFO of 800,000,000 records.
No primary key is required; this is only audit information. We do not use one on Oracle either.
Based on a condition, 2,000,000 records should be deleted daily.
We have a background process that wakes up every X minutes and deletes Y records.
With Oracle we do it with: delete from <tname> where <cond> and rownum < Y;
Can we have the same goody on Postgres?
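Filled in with illustrative names (the poster's <tname> and <cond> are placeholders, so the table and condition below are purely hypothetical), the Oracle statement per wakeup would read something like:

    -- Oracle sketch: deletes at most Y-1 matching rows per run (here Y = 50000).
    -- audit_log and the date condition are illustrative stand-ins.
    DELETE FROM audit_log
     WHERE logged_at < SYSDATE - 30
       AND ROWNUM < 50000;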
On 11/15/07, Abraham, Danny <danny_abraham@bmc.com> wrote:
> With Oracle we do it with: delete from <tname> where <cond> and rownum < Y;

You could create a temporary sequence:

    create temporary sequence foo_seq;
    delete from foos where nextval('foo_seq') < 50000;

I'm not sure how fast nextval() is, even on temporary sequences; but it
should be reasonably fast.

If it's a FIFO queue, the table surely has some explicit order through a
column which you will need as part of the query?

Alexander.
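Spelled out as a repeatable chunk (with illustrative table and sequence names, and dropping the sequence between runs as Sam does in his test below), the trick looks like:

    -- A sketch only; audit_log stands in for the real table.
    CREATE TEMPORARY SEQUENCE chunk_seq;

    -- Deletes the first 49,999 rows the scan visits.  Note that nextval()
    -- is still evaluated for every remaining row, so the statement scans
    -- the whole table even after the quota is used up.
    DELETE FROM audit_log WHERE nextval('chunk_seq') < 50000;

    DROP SEQUENCE chunk_seq;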
On Thu, Nov 15, 2007 at 02:46:09PM +0100, Alexander Staubo wrote:
> On 11/15/07, Abraham, Danny <danny_abraham@bmc.com> wrote:
> > With Oracle we do it with: delete from <tname> where <cond> and rownum < Y;
>
> You could create a temporary sequence:
>
>     create temporary sequence foo_seq;
>     delete from foos where nextval('foo_seq') < 50000;
>
> I'm not sure how fast nextval() is, even on temporary sequences; but
> it should be reasonably fast.

That's not going to do anything very useful after VACUUM has been run, will
it? VACUUM will leave lots of empty slots within a page, which subsequent
INSERTs will populate.

I suppose that you could cluster the table first on some data column, and
then do your delete trick. But if the OP really has got a billion rows,
they're not going to want to cluster it very regularly.

This basically goes back to the fundamental issue that a relation has no
implicit order (it sounds as though Oracle had an implicit one, but that
assumption doesn't hold with PG or in general). Maybe partitioning could
help here.

Sam
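A minimal sketch of the partitioning idea, using the inheritance mechanism available in 8.1/8.2 (table and column names are illustrative, and a real setup would also need rules or triggers to route inserts to the right child):

    -- Parent table; each child holds one day of audit data.
    CREATE TABLE audit_log (logged_at timestamptz NOT NULL, detail text);

    CREATE TABLE audit_log_2007_11_15 (
        CHECK (logged_at >= '2007-11-15' AND logged_at < '2007-11-16')
    ) INHERITS (audit_log);

    -- Retiring a day's worth of rows is then a cheap DROP rather than
    -- a 2,000,000-row DELETE:
    DROP TABLE audit_log_2007_11_15;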
The temporary sequence works perfectly for me. Thanks.
On Thu, Nov 15, 2007 at 04:18:27PM +0200, Abraham, Danny wrote:
> The temporary sequence works perfectly for me.

You may want to read my other message and test again. You are well into
implementation details here, and the trick that was suggested will (with
the current version of PG) delete the first x rows that appear in the
table. I've got no idea what 8.3 will do with its new seqscan piggybacking
code; it may be clever enough to start deleting out of the middle of the
table if it can.

When you VACUUM a table, these rows will be marked as unused and your new
data will be inserted into them. I.e. your FIFO will only act like a FIFO
until you VACUUM it; then all hell will break loose.

Sam

p.s. the test I did was to run this once:

    CREATE TABLE foo (
        id    SERIAL,
        value INTEGER
    );

and then run several iterations of:

    INSERT INTO foo (value) SELECT v FROM generate_series(1,10000) x(v);
    CREATE TEMPORARY SEQUENCE foo_seq;
    DELETE FROM foo WHERE nextval('foo_seq') < 5000;
    DROP SEQUENCE foo_seq;
    VACUUM foo;

checking how the FIFO was working at various points with:

    SELECT MIN(id), MAX(id) FROM foo;
On Thu, Nov 15, 2007 at 03:33:04PM +0200, Abraham, Danny wrote:
> Based on a condition, 2,000,000 records should be deleted daily.

Why not use that condition (which I presume is indexed) as the criterion
for your delete? I mean, how else are you getting those row numbers in
Oracle? They have to be sorted somehow, unless you're just deleting
2 million random records.

A
--
Andrew Sullivan
Old sigs will return after re-constitution of blue smoke
On Thu, Nov 15, 2007 at 09:43:52AM -0500, Andrew Sullivan wrote:
> On Thu, Nov 15, 2007 at 03:33:04PM +0200, Abraham, Danny wrote:
> > Based on a condition, 2,000,000 records should be deleted daily.
>
> Why not use that condition (which I presume is indexed) as the criterion
> for your delete? I mean, how else are you getting those row numbers in
> Oracle?

Yup, that's what I'm confused about as well.

> They have to be sorted somehow, unless you're just deleting 2 million
> random records.

To be fair, they're not random, just rather arbitrarily chosen.

Sam
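In concrete terms, Andrew's suggestion amounts to something like the following (the table, column, and retention window are hypothetical, since the poster never shows the actual condition):

    -- Assuming the daily condition is age-based and logged_at is indexed;
    -- all names here are illustrative stand-ins.
    DELETE FROM audit_log
     WHERE logged_at < now() - interval '30 days';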
[snip]
> With Oracle we do it with: delete from <tname> where <cond> and rownum < Y;
> Can we have the same goody on Postgres?

The attached message is Tom's response to a similar question; in any case
it would work fine in your case too (assuming you have Postgres 8.2).

HTH,

Csaba.

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> Unfortunately the stuff that makes a ctid=<value> nice doesn't seem to be
> used when you're doing an in.

Yeah, see the header comments in tidpath.c:

    * There is currently no special support for joins involving CTID; in
    * particular nothing corresponding to best_inner_indexscan().  Since it's
    * not very useful to store TIDs of one table in another table, there
    * doesn't seem to be enough use-case to justify adding a lot of code
    * for that.

Of course, that argument is wrong for a self-join, which is what this would
essentially be. So maybe it would be worth doing sometime. Still, the
issue doesn't come up very often.

[ thinks for a bit ... ] Actually, you can do it as of 8.2 or so, by
abusing the ScalarArrayOp stuff: turn the subquery into an array. An
example in the regression database:

    regression=# explain update tenk1 set ten=ten+1
    regression-# where ctid = any (array(select ctid from tenk1 limit 10));
                                   QUERY PLAN
    -------------------------------------------------------------------------
     Tid Scan on tenk1  (cost=0.46..40.71 rows=10 width=250)
       TID Cond: (ctid = ANY ($0))
       InitPlan
         ->  Limit  (cost=0.00..0.46 rows=10 width=6)
               ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=6)
    (5 rows)

It even seems to get the cost estimate right...

        regards, tom lane
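Applied to the chunk-delete problem, Tom's ctid trick would look something like the sketch below (the table name and condition are hypothetical stand-ins for the poster's <tname> and <cond>; requires 8.2 or later):

    -- Delete up to 50,000 rows that actually match the condition,
    -- addressing them by physical row id (ctid).
    DELETE FROM audit_log
     WHERE ctid = ANY (ARRAY(
               SELECT ctid
                 FROM audit_log
                WHERE logged_at < now() - interval '30 days'
                LIMIT 50000));

Unlike the temporary-sequence trick, this restricts the delete to rows matching the condition, and the LIMIT stops the inner scan early rather than visiting the whole table.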