Thread: Chunk Delete

Chunk Delete

From: Abraham, Danny
Date: Thu, 15 Nov 2007 15:09:10 +0200

The problem is that the table does not have a primary key; having one
would be too expensive.

Re: Chunk Delete

From: Andrew Sullivan
On Thu, Nov 15, 2007 at 03:09:10PM +0200, Abraham, Danny wrote:
> The problem is that the table does not have a primary key; having one
> would be too expensive.

If the table doesn't have a primary key, you've designed it wrong.

But I'd like to see any evidence you have at all that having a primary key
is "too expensive".

A

--
Andrew Sullivan
Old sigs will return after re-constitution of blue smoke

Re: Chunk Delete

From: Abraham, Danny
Date: Thu, 15 Nov 2007 15:33:04 +0200

Say we have a FIFO of 800,000,000 records.

No primary key is required – this is only audit information. We do not
use one on Oracle either.

Based on a condition, 2,000,000 records should be deleted daily.

We have a background process that wakes up every X minutes and deletes
Y records.

With Oracle we do it with: delete from <tname> where <cond> and rownum < Y;

Can we have the same goody on Postgres?

Re: Chunk Delete

From: Alexander Staubo
Date: Thu, 15 Nov 2007 14:46:09 +0100
On 11/15/07, Abraham, Danny <danny_abraham@bmc.com> wrote:
> With Oracle we do it with: delete from <tname> where <cond> and rownum < Y;

You could create a temporary sequence:

  create temporary sequence foo_seq;
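  -- nextval() advances once per row the scan visits, so this deletes
  -- the first 49,999 rows encountered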
  delete from foos where nextval('foo_seq') < 50000;

I'm not sure how fast nextval() is, even on temporary sequences, but
it should be reasonably fast.

If it's a FIFO queue, the table surely has some explicit order through
a column which you will need as part of the query?
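
(If so, a rough sketch, assuming a hypothetical indexed timestamp
column called created_at, might be:

  delete from foos
   where created_at <= (select created_at from foos
                        order by created_at limit 1 offset 49999);

i.e. delete the 50,000 oldest rows, assuming distinct timestamps.)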

Alexander.

Re: Chunk Delete

From: Sam Mason
On Thu, Nov 15, 2007 at 02:46:09PM +0100, Alexander Staubo wrote:
> On 11/15/07, Abraham, Danny <danny_abraham@bmc.com> wrote:
> > With Oracle we do it with: delete from <tname> where <cond> and rownum < Y;
>
> You could create a temporary sequence:
>
>   create temporary sequence foo_seq;
>   delete from foos where nextval('foo_seq') < 50000;
>
> I'm not sure how fast nextval() is, even on temporary sequences, but
> it should be reasonably fast.

That's not going to do anything very useful after VACUUM has been run,
will it?  VACUUM will leave lots of empty slots within a page, which
subsequent INSERTs will populate.  I suppose that you could cluster the
table first on some data column, and then do your delete trick.  But
if the OP really has got a billion rows they're not going to want to
cluster it very regularly.

This basically goes back to the fundamental issue that a relation has no
implicit order (it sounds as though Oracle had an implicit one, but that
assumption doesn't hold with PG or in general).

Maybe partitioning could help here.
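
(Roughly, with hypothetical table and column names, using the
inheritance-based partitioning available in 8.x:

  create table audit_2007_11 (
      check (logged >= date '2007-11-01' and logged < date '2007-12-01')
  ) inherits (audit);

expiring old data then becomes a cheap DROP TABLE rather than a huge
DELETE.)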


  Sam

Re: Chunk Delete

From: Abraham, Danny
Date: Thu, 15 Nov 2007 16:18:27 +0200
The temporary sequence works perfectly for me.

Thanks


Re: Chunk Delete

From: Sam Mason
On Thu, Nov 15, 2007 at 04:18:27PM +0200, Abraham, Danny wrote:
> The temporary sequence works perfectly for me.

You may want to read my other message and test again.  You are well
into implementation details here, and the trick that was suggested will
(with the current version of PG) delete the first x rows that appear
in the table.  I've got no idea what 8.3 will do with its new seqscan
piggybacking code; it may be clever enough to start deleting out of the
middle of the table if it can.

When you VACUUM a table, these rows will be marked as unused and your
new data will be inserted into them; i.e. your FIFO will only act like a
FIFO until you VACUUM it, then all hell will break loose.


  Sam

P.S. The test I did was to run this once:

  CREATE TABLE foo (
    id SERIAL,
    value INTEGER
  );

and then run several iterations of:

  INSERT INTO foo (value)
    SELECT v FROM generate_series(1,10000) x(v);

  CREATE TEMPORARY SEQUENCE foo_seq;
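  -- the counter runs 1, 2, 3, ... as rows are scanned, so the first
  -- 4,999 rows visited match the condition and are deleted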
  DELETE FROM foo WHERE nextval('foo_seq') < 5000;
  DROP SEQUENCE foo_seq;
  VACUUM foo;

Checking how the FIFO was working at various points with:

  SELECT MIN(id), MAX(id) FROM foo;

Re: Chunk Delete

From: Andrew Sullivan
Date: Thu, 15 Nov 2007 09:43:52 -0500
On Thu, Nov 15, 2007 at 03:33:04PM +0200, Abraham, Danny wrote:
>
> Based on a condition, 2,000,000 records should be deleted daily.

Why not use that condition (which I presume is indexed) as the criterion for
your delete?  I mean, how else are you getting those row numbers in Oracle?
They have to be sorted somehow, unless you're just deleting 2 million random
records.
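
(For instance, with a hypothetical indexed timestamp column, something
like:

  delete from audit where logged < now() - interval '30 days';

would do the whole day's expiry in one statement, no row counting
needed.)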

A

--
Andrew Sullivan
Old sigs will return after re-constitution of blue smoke

Re: Chunk Delete

From: Sam Mason
On Thu, Nov 15, 2007 at 09:43:52AM -0500, Andrew Sullivan wrote:
> On Thu, Nov 15, 2007 at 03:33:04PM +0200, Abraham, Danny wrote:
> >
> > Based on a condition, 2,000,000 records should be deleted daily.
>
> Why not use that condition (which I presume is indexed) as the criterion for
> your delete?  I mean, how else are you getting those row numbers in Oracle?

Yup, that's what I'm confused about as well.

> They have to be sorted somehow, unless you're just deleting 2 million random
> records.

To be fair, they're not random, just rather arbitrarily chosen.


  Sam

Re: Chunk Delete

From: Csaba Nagy
[snip]
> With Oracle we do it with: delete from <tname> where <cond> and rownum < Y;
> Can we have the same goody on Postgres?

The attached message is Tom's response to a similar question; in any
case it would work fine in your case too (assuming you have Postgres
8.2).
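
(In your case that would be something like:

  delete from <tname>
   where ctid = any (array(select ctid from <tname>
                           where <cond> limit <Y>));

with your table, condition, and chunk size filled in.)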

HTH,
Csaba.

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> Unfortunately the stuff that makes a ctid=<value> nice doesn't seem to be
> used when you're doing an IN.

Yeah, see the header comments in tidpath.c:

 * There is currently no special support for joins involving CTID; in
 * particular nothing corresponding to best_inner_indexscan().    Since it's
 * not very useful to store TIDs of one table in another table, there
 * doesn't seem to be enough use-case to justify adding a lot of code
 * for that.

Of course, that argument is wrong for a self-join, which is what this
would essentially be.  So maybe it would be worth doing sometime.
Still, the issue doesn't come up very often.

[ thinks for a bit ... ] Actually, you can do it as of 8.2 or so,
by abusing the ScalarArrayOp stuff: turn the subquery into an array.
An example in the regression database:

regression=# explain update tenk1 set ten=ten+1
regression-#   where ctid = any (array(select ctid from tenk1 limit 10));
                               QUERY PLAN
-------------------------------------------------------------------------
 Tid Scan on tenk1  (cost=0.46..40.71 rows=10 width=250)
   TID Cond: (ctid = ANY ($0))
   InitPlan
     ->  Limit  (cost=0.00..0.46 rows=10 width=6)
           ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=6)
(5 rows)

It even seems to get the cost estimate right...

            regards, tom lane