Re: Deletion Challenge - Mailing list pgsql-general

From Steve Crawford
Subject Re: Deletion Challenge
Date
Msg-id CAEfWYywEBm_zSBVtbZ48BLH96-W4_VJASNL9B4LR+YvCZyBPGg@mail.gmail.com
Whole thread Raw
In response to Re: Deletion Challenge  ("David G. Johnston" <david.g.johnston@gmail.com>)
Responses Re: Deletion Challenge  (Berend Tober <btober@computer.org>)
List pgsql-general
The two general solutions are the "keep the last one" proposed by Adrian "keep the last N" that I sent.

But it might be worth stepping back a bit. You said you are having performance problems that you feel would be improved by removing only a million rows which doesn't sound like that much to me. It's less than half of what I *add* to just one of my tables every week and my database is dwarfed by those of many of the participants on this list.

This suggests that there may be other issues such as tuning, indexing or query optimization at play. Depending on your requirements, partitioning might be useful. It wouldn't be last N but could easily be done to partition by date-ranges which makes archiving and purging a low-cost operation.

You might want to expand a bit on the core issue you are trying to solve.

Cheers,
Steve


On Wed, Dec 9, 2015 at 12:43 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Wed, Dec 9, 2015 at 1:31 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 12/09/2015 12:24 AM, Berend Tober wrote:
Adrian Klaver wrote:
On 12/05/2015 08:08 AM, Berend Tober wrote:
/*

Deletion Challenge

I want to delete all but the most recent transaction, per person, from a
table that records a transaction history because at some point the
transaction history grows large enough to adversely effect performance,
and also becomes less relevant for retention.

...


test=> delete from cash_journal where ARRAY[click, cash_journal_id]
NOT in (select max(ARRAY[click,cash_journal_id]) from cash_journal
group by fairian_id);
DELETE 7

test=> SELECT * FROM cash_journal order by fairian_id, click,
cash_journal_id;
  click | cash_journal_id | fairian_id | debit | credit | balance
|           description
-------+-----------------+------------+-------+--------+---------+----------------------------------

    412 |               1 |          7 |     5 |        |      14 |
Sold food quantity 7 units.
     37 |               7 |          8 |     8 |        |       8 |
Ratified contract f1abd670358e03
     37 |               9 |          9 |     7 |        |       7 |
Ratified contract 1574bddb75c78a
     36 |              14 |         18 |     0 |      0 |       0 |
initial cash balance
    413 |               1 |         25 |       |    995 |       0 |
Redeemed bond 7719a1c782a1ba
(5 rows)


Nice.

The idea of a NOT IN query had occurred to me briefly, but I failed to
pursue it because at some point in the distant past I had gained the
impression that NOT IN queries were not computationally efficient.
During one round of testing I had like a million rows. I'll have to run
some EXPLAIN query testing with a larger data sample for comparison.
Thanks!

Plan B:

WITH d AS
    (SELECT * FROM
        cash_journal
    LEFT JOIN
        (SELECT
            MAX(ARRAY[click,cash_journal_id]) AS mx
        FROM
            cash_journal
        GROUP BY
            fairian_id)
        AS
            mxa
    ON
        mxa.mx=ARRAY[click, cash_journal_id]
    WHERE
        mx IS NULL)
DELETE FROM
    cash_journal
USING
    d
WHERE
    d.click = cash_journal.click
AND
    d.cash_journal_id = cash_journal.cash_journal_id;


​Couldn't the LEFT JOIN relation in the CTE be better written using "SELECT DISTINCT ON (click, cash_journal_id) click, cash_journal_id [...] ORDER BY click DESC, cash_journal_id" or something similar?  It doesn't seem like you should need to introduce an array and an aggregate here.

​It does have the negative property of only providing a single row; which excludes using it for the "last 5" part but I suspect it will be considerably faster for the single version.

David J.

pgsql-general by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Deletion Challenge
Next
From: Christopher Molnar
Date:
Subject: Regexp_replace question / help needed