Thread: Deletion Challenge

Deletion Challenge

From: Berend Tober
/*

Deletion Challenge

I want to delete all but the most recent transaction, per person, from a
table that records a transaction history because at some point the
transaction history grows large enough to adversely affect performance,
and also becomes less relevant for retention.

I have devised a way to accomplish this, but it is a 'two-stage'
approach: that is, it requires two delete statements. I would like to
know if there is a way to do it in a single statement.

Bonus challenge: Same question, except preserving the most recent N, for
N > 1, rows for each person so that a short history is retained after
the deletion.

I have included below an annotated test case and my current solution for
the N = 1 case.

*/

DROP TABLE IF EXISTS cash_journal;


CREATE TABLE cash_journal (
     click bigint NOT NULL,
     cash_journal_id bigint NOT NULL,
     fairian_id bigint NOT NULL,
     debit double precision,
     credit double precision,
     balance real DEFAULT 0,
     description text
);

COMMENT ON COLUMN cash_journal.click            IS 'Time of transaction.';
COMMENT ON COLUMN cash_journal.cash_journal_id  IS 'Sequence of transaction within current click.';
COMMENT ON COLUMN cash_journal.fairian_id       IS 'Fairian account affected.';
COMMENT ON COLUMN cash_journal.debit            IS 'Account balance increase amount.';
COMMENT ON COLUMN cash_journal.credit           IS 'Account balance decrease amount.';
COMMENT ON COLUMN cash_journal.balance          IS 'Account balance, per Fairian running total.';
COMMENT ON COLUMN cash_journal.description      IS 'Transaction description.';

/*

Below is some sample data, listed in the click/sequence order that the
data would actually be entered. That is, the 'click' column represents
advancing time, and within each click, transactions are sequenced by the
'cash_journal_id' column. Note there are some missing cash_journal_id
sequence numbers. This is an artifact of having presented here only
an illustrative sample. Generally, within each click, the sequence
would start at one and increment uniformly by one for each new row
in the same click, and then reset to one for the next click. The
missing increments in the sample data should not make any difference
in the solution.

The 'balance' column is a per-player running total, which is a
deliberate denormalization. It is calculated in a before insert trigger
by starting with the per-player previous balance, and then adding
the new row debit, if any, and subtracting the new row credit, if any.

Note, not all Fairians will have a transaction in every click, but any
number of Fairians may have multiple transactions in any click.

*/
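
/*

For illustration only (this trigger is not part of the original test case,
which supplies the balance values directly in the sample data below): a
minimal sketch of the kind of before-insert trigger described above. The
function and trigger names are assumptions.

*/

CREATE OR REPLACE FUNCTION cash_journal_set_balance() RETURNS trigger AS $$
DECLARE
    prior_balance double precision;
BEGIN
    -- Start with the per-player previous balance, if any.
    SELECT balance INTO prior_balance
      FROM cash_journal
     WHERE fairian_id = NEW.fairian_id
     ORDER BY click DESC, cash_journal_id DESC
     LIMIT 1;

    -- Add the new row debit, if any, and subtract the new row credit, if any.
    NEW.balance := coalesce(prior_balance, 0)
                   + coalesce(NEW.debit, 0)
                   - coalesce(NEW.credit, 0);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cash_journal_set_balance
    BEFORE INSERT ON cash_journal
    FOR EACH ROW EXECUTE PROCEDURE cash_journal_set_balance();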

copy cash_journal (click,cash_journal_id,fairian_id,debit,credit,balance,description) from stdin;
36    3    7    0    0    0    Initial cash balance
36    4    8    0    0    0    Initial cash balance
36    5    9    0    0    0    Initial cash balance
36    14    18    0    0    0    initial cash balance
37    5    7    9    \N    9    Ratified contract fa35e192121eab
37    7    8    8    \N    8    Ratified contract f1abd670358e03
37    9    9    7    \N    7    Ratified contract 1574bddb75c78a
411    1    25    0    0    0    Initial cash balance
411    2    25    1000    \N    1000    Issued bond 7719a1c782a1ba
412    1    7    5    \N    14    Sold food quantity 7 units.
412    2    25    \N    5    995    Bought food quantity 7 units.
413    1    25    \N    995    0    Redeemed bond 7719a1c782a1ba
\.


SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;

/*

The sample starting data is shown here in order by Fairian so that it is
perhaps easier to see what is happening for each player. Note that the
result of the deletion should be the last row for each player.

  click | cash_journal_id | fairian_id | debit | credit | balance |           description
-------+-----------------+------------+-------+--------+---------+----------------------------------
     36 |               3 |          7 |     0 |      0 |       0 | Initial cash balance
     37 |               5 |          7 |     9 |        |       9 | Ratified contract fa35e192121eab
    412 |               1 |          7 |     5 |        |      14 | Sold food quantity 7 units.
     36 |               4 |          8 |     0 |      0 |       0 | Initial cash balance
     37 |               7 |          8 |     8 |        |       8 | Ratified contract f1abd670358e03
     36 |               5 |          9 |     0 |      0 |       0 | Initial cash balance
     37 |               9 |          9 |     7 |        |       7 | Ratified contract 1574bddb75c78a
     36 |              14 |         18 |     0 |      0 |       0 | initial cash balance
    411 |               1 |         25 |     0 |      0 |       0 | Initial cash balance
    411 |               2 |         25 |  1000 |        |    1000 | Issued bond 7719a1c782a1ba
    412 |               2 |         25 |       |      5 |     995 | Bought food quantity 7 units.
    413 |               1 |         25 |       |    995 |       0 | Redeemed bond 7719a1c782a1ba
(12 rows)

*/


/*

Here is the current, two-stage solution in use. Is there a way to do it
with a single statement?

Can you create a solution that retains an arbitrarily specified number
of rows per player?

*/
BEGIN;

WITH max_click AS (
   SELECT
     cash_journal.fairian_id,
     max(cash_journal.click) AS click
     FROM cash_journal
     GROUP BY cash_journal.fairian_id
     )
   delete from cash_journal j
     using max_click b
     where j.fairian_id = b.fairian_id
     and j.click        < b.click;

WITH max_journal_id AS (
   SELECT
     cash_journal.fairian_id,
     cash_journal.click,
     max(cash_journal.cash_journal_id) AS cash_journal_id
     FROM cash_journal
     GROUP BY cash_journal.fairian_id, cash_journal.click
     )
   delete from cash_journal j
      using max_journal_id b
      where j.fairian_id    = b.fairian_id
      and j.click           = b.click
      and j.cash_journal_id < b.cash_journal_id;

COMMIT;

SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;

/*

  click | cash_journal_id | fairian_id | debit | credit | balance |           description
-------+-----------------+------------+-------+--------+---------+----------------------------------
    412 |               1 |          7 |     5 |        |      14 | Sold food quantity 7 units.
     37 |               7 |          8 |     8 |        |       8 | Ratified contract f1abd670358e03
     37 |               9 |          9 |     7 |        |       7 | Ratified contract 1574bddb75c78a
     36 |              14 |         18 |     0 |      0 |       0 | initial cash balance
    413 |               1 |         25 |       |    995 |       0 | Redeemed bond 7719a1c782a1ba
(5 rows)


*/


Re: Deletion Challenge

From: Steve Crawford
If I understand correctly, the value of "click" always advances and, within a "click", the "cash_journal_id" always advances - not necessarily by single steps - so within a fairian_id, ordering by "click" plus "cash_journal_id" returns the records in order, from which you want the most recent 5 for each fairian_id.

Typing without testing and ignoring performance optimizations, something along the lines of the following should work and covers the "last 5" issue as well.

with stuff_to_delete as (
    select fairian_id, click, cash_journal_id,
        rank() over (partition by fairian_id
                     order by click desc, cash_journal_id desc) as howold
    from cash_journal)
delete from cash_journal
using stuff_to_delete
where
    cash_journal.fairian_id = stuff_to_delete.fairian_id
    and cash_journal.click = stuff_to_delete.click
    and cash_journal.cash_journal_id = stuff_to_delete.cash_journal_id
    and stuff_to_delete.howold > 5;
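
As a quick sanity check (again untested, just a sketch), the same ranking can be run as a plain select first to see which rows would be removed:

select fairian_id, click, cash_journal_id,
    rank() over (partition by fairian_id
                 order by click desc, cash_journal_id desc) as howold
from cash_journal
order by fairian_id, howold;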

Cheers,
Steve


On Sat, Dec 5, 2015 at 8:08 AM, Berend Tober <btober@computer.org> wrote:

...

Re: Deletion Challenge

From: Adrian Klaver
On 12/05/2015 08:08 AM, Berend Tober wrote:
> /*
>
> Deletion Challenge
>
> I want to delete all but the most recent transaction, per person, from a
> table that records a transaction history ...
>
> */

test=> delete from cash_journal where ARRAY[click, cash_journal_id] NOT in
       (select max(ARRAY[click,cash_journal_id]) from cash_journal group by fairian_id);
DELETE 7

test=> SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;
 click | cash_journal_id | fairian_id | debit | credit | balance |           description
-------+-----------------+------------+-------+--------+---------+----------------------------------
   412 |               1 |          7 |     5 |        |      14 | Sold food quantity 7 units.
    37 |               7 |          8 |     8 |        |       8 | Ratified contract f1abd670358e03
    37 |               9 |          9 |     7 |        |       7 | Ratified contract 1574bddb75c78a
    36 |              14 |         18 |     0 |      0 |       0 | initial cash balance
   413 |               1 |         25 |       |    995 |       0 | Redeemed bond 7719a1c782a1ba
(5 rows)


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Deletion Challenge

From: Berend Tober
Steve Crawford wrote:
> If I understand correctly, the value of "click" always advances and, within a "click", the
> "cash_journal_id" always advances - not necessarily by single steps - so within a fairian_id,
> ordering by "click" plus "cash_journal_id" returns the records in order, from which you want
> the most recent 5 for each fairian_id.
>
> Typing without testing and ignoring performance optimizations, something along the lines of the
> following should work and covers the "last 5" issue as well.
>
> with stuff_to_delete as (
>     select fairian_id, click, cash_journal_id,
>         rank() over (partition by fairian_id
>                      order by click desc, cash_journal_id desc) as howold
>     from cash_journal)
> delete from cash_journal
> using stuff_to_delete
> where
>     cash_journal.fairian_id = stuff_to_delete.fairian_id
>     and cash_journal.click = stuff_to_delete.click
>     and cash_journal.cash_journal_id = stuff_to_delete.cash_journal_id
>     and stuff_to_delete.howold > 5;
>

Assessing without testing, I like that. Thanks!

Although the above is not exactly the form I was using, an earlier iteration of a related
problem employed window functions. But as the data set grew, performance suffered: unless
deletes were done on a regular, continuing basis to keep the historical data set
approximately "small", the execution time of the windowing approach eventually exceeded the
extent of my patience.

That "non-scalable" situation is actually what motivated the deliberate de-normalization
(retaining the "running balance" in a separate column) and the desire to delete old data. The
original implementation calculated the running balance on the fly, using a window per
fairian_id, and those tallies of the net balance took increasingly long as the number of rows
grew. Hence I was motivated to retain only a relatively constant-sized per-Fairian history,
and I dismissed windowing for the delete problem since it had been so problematic for the
running-balance-without-delete problem.

Thanks for knocking some sense into me!




Re: Deletion Challenge

From: Berend Tober
Adrian Klaver wrote:
> On 12/05/2015 08:08 AM, Berend Tober wrote:
>> /*
>>
>> Deletion Challenge
>>
>> I want to delete all but the most recent transaction, per person, from a
>> table that records a transaction history because at some point the
>> transaction history grows large enough to adversely affect performance,
>> and also becomes less relevant for retention.
>>
>> ...
>>
>
> test=> delete from cash_journal where ARRAY[click, cash_journal_id] NOT in
> (select max(ARRAY[click,cash_journal_id]) from cash_journal group by fairian_id);
> DELETE 7
>
> test=> SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;
>   click | cash_journal_id | fairian_id | debit | credit | balance |           description
> -------+-----------------+------------+-------+--------+---------+----------------------------------
>     412 |               1 |          7 |     5 |        |      14 | Sold food quantity 7 units.
>      37 |               7 |          8 |     8 |        |       8 | Ratified contract f1abd670358e03
>      37 |               9 |          9 |     7 |        |       7 | Ratified contract 1574bddb75c78a
>      36 |              14 |         18 |     0 |      0 |       0 | initial cash balance
>     413 |               1 |         25 |       |    995 |       0 | Redeemed bond 7719a1c782a1ba
> (5 rows)
>

Nice.

The idea of a NOT IN query had occurred to me briefly, but I failed to pursue it because at some
point in the distant past I had gained the impression that NOT IN queries were not computationally
efficient. During one round of testing I had around a million rows, so I'll have to run some
EXPLAIN testing with a larger data sample for comparison. Thanks!
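
The kind of comparison I have in mind (sketch only) is to run each candidate delete under EXPLAIN
ANALYZE inside a transaction and then roll it back, so the data is left untouched, e.g.:

BEGIN;
EXPLAIN ANALYZE
DELETE FROM cash_journal
 WHERE ARRAY[click, cash_journal_id] NOT IN
       (SELECT max(ARRAY[click, cash_journal_id])
          FROM cash_journal
         GROUP BY fairian_id);
ROLLBACK;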



Re: Deletion Challenge

From: Adrian Klaver
On 12/09/2015 12:24 AM, Berend Tober wrote:
> Adrian Klaver wrote:
>> On 12/05/2015 08:08 AM, Berend Tober wrote:
>>> /*
>>>
>>> Deletion Challenge
>>>
>>> I want to delete all but the most recent transaction, per person, from a
>>> table that records a transaction history because at some point the
>> transaction history grows large enough to adversely affect performance,
>>> and also becomes less relevant for retention.
>>>
>>> ...
>>>
>>
>> test=> delete from cash_journal where ARRAY[click, cash_journal_id]
>> NOT in (select max(ARRAY[click,cash_journal_id]) from cash_journal
>> group by fairian_id);
>> DELETE 7
>>
>> ...
>>
>
> Nice.
>
> The idea of a NOT IN query had occurred to me briefly, but I failed to
> pursue it because at some point in the distant past I had gained the
> impression that NOT IN queries were not computationally efficient.
> During one round of testing I had like a million rows. I'll have to run
> some EXPLAIN query testing with a larger data sample for comparison.
> Thanks!

Plan B:

WITH d AS
     (SELECT * FROM
         cash_journal
     LEFT JOIN
         (SELECT
             MAX(ARRAY[click,cash_journal_id]) AS mx
         FROM
             cash_journal
         GROUP BY
             fairian_id)
         AS
             mxa
     ON
         mxa.mx=ARRAY[click, cash_journal_id]
     WHERE
         mx IS NULL)
DELETE FROM
     cash_journal
USING
     d
WHERE
     d.click = cash_journal.click
AND
     d.cash_journal_id = cash_journal.cash_journal_id;


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Deletion Challenge

From: "David G. Johnston"
On Wed, Dec 9, 2015 at 1:31 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
...

Couldn't the LEFT JOIN relation in the CTE be better written using "SELECT DISTINCT ON (click, cash_journal_id) click, cash_journal_id [...] ORDER BY click DESC, cash_journal_id" or something similar? It doesn't seem like you should need to introduce an array and an aggregate here.

It does have the negative property of only providing a single row, which excludes using it for the "last 5" part, but I suspect it will be considerably faster for the single-row version.
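
One possible shape for that, sketched without testing (and assuming the intent is one surviving row per fairian_id):

WITH keep AS (
    SELECT DISTINCT ON (fairian_id)
           fairian_id, click, cash_journal_id
      FROM cash_journal
     ORDER BY fairian_id, click DESC, cash_journal_id DESC)
DELETE FROM cash_journal
 USING keep
 WHERE cash_journal.fairian_id = keep.fairian_id
   AND (cash_journal.click, cash_journal.cash_journal_id)
       <> (keep.click, keep.cash_journal_id);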

David J.

Re: Deletion Challenge

From: Steve Crawford
The two general solutions are the "keep the last one" proposed by Adrian and the "keep the last N" that I sent.

But it might be worth stepping back a bit. You said you are having performance problems that you feel would be improved by removing only a million rows, which doesn't sound like that much to me. It's less than half of what I *add* to just one of my tables every week, and my database is dwarfed by those of many of the participants on this list.

This suggests that there may be other issues such as tuning, indexing or query optimization at play. Depending on your requirements, partitioning might be useful. It wouldn't be "last N", but partitioning by date ranges is easy to do and makes archiving and purging a low-cost operation.
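
Purely as an illustration (the table names and click ranges here are placeholders): with inheritance-style partitioning the journal is split into child tables by click range, and purging an old range becomes a cheap DROP TABLE instead of a big DELETE.

CREATE TABLE cash_journal_clicks_0 (
    CHECK (click >= 0 AND click < 1000)
) INHERITS (cash_journal);

CREATE TABLE cash_journal_clicks_1000 (
    CHECK (click >= 1000 AND click < 2000)
) INHERITS (cash_journal);

-- New rows are routed to the appropriate child (typically via a BEFORE INSERT
-- trigger on cash_journal, not shown here), and purging later is just:
-- DROP TABLE cash_journal_clicks_0;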

You might want to expand a bit on the core issue you are trying to solve.

Cheers,
Steve


On Wed, Dec 9, 2015 at 12:43 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
...

Re: Deletion Challenge

From: Berend Tober
Steve Crawford wrote:
> The two general solutions are the "keep the last one" proposed by Adrian
> and the "keep the last N" that I sent.
>
> But it might be worth stepping back a bit. You said you are having
> performance problems that you feel would be improved by removing only a
> million rows which doesn't sound like that much to me. It's less than
> half of what I *add* to just one of my tables every week and my database
> is dwarfed by those of many of the participants on this list.
>
> This suggests that there may be other issues such as tuning, indexing or
> query optimization at play. Depending on your requirements, partitioning
> might be useful. It wouldn't be last N but could easily be done to
> partition by date-ranges which makes archiving and purging a low-cost
> operation.
>
> You might want to expand a bit on the core issue you are trying to solve.
>


I really appreciate the deep-dive.

I'm quite sure the performance issue is mostly a hardware limitation at
this point. The application is in a developmental phase and running for
test purposes on grossly inadequate hardware for production purposes ...
think home laptop computer in one case.

The issue is that I'd like the application (that is, the data base and
its stored procedures) to be robust enough to be a "long-running"
application, i.e. one that doesn't suffer gradual performance
degradation as time and the accumulated data increase. For the cash and
similar journals, I'd like to retain some amount of history so that
players can review "recent" transactions so as to understand and verify
how the current balance was arrived at, but at some point let old
transactions age-off and be deleted.

This question was aimed mainly at the "query tuning" aspect, and
I'm confident that partitioning would help. But since this is just a
game, retention of a full and auditable history is not really required:
I have a lot of flexibility to determine what to keep and in fact am not
exactly sure how much to keep. I know I need at least one row in order
to retain the current balance, but I'm thinking that retaining something
on the order of scores or hundreds, maybe a thousand, transactions per
player in each of several similar journals at any point in time would be
sufficient.
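
What I have in mind for that is roughly a periodic prune of each journal,
along the lines of this sketch (untested; the function name and the default
for N are placeholders):

CREATE OR REPLACE FUNCTION prune_cash_journal(keep integer DEFAULT 1000)
RETURNS bigint AS $$
    WITH ranked AS (
        SELECT fairian_id, click, cash_journal_id,
               rank() OVER (PARTITION BY fairian_id
                            ORDER BY click DESC, cash_journal_id DESC) AS howold
          FROM cash_journal),
    deleted AS (
        DELETE FROM cash_journal j
         USING ranked r
         WHERE j.fairian_id      = r.fairian_id
           AND j.click           = r.click
           AND j.cash_journal_id = r.cash_journal_id
           AND r.howold          > keep
        RETURNING 1)
    SELECT count(*) FROM deleted;
$$ LANGUAGE sql;

-- e.g., retain the most recent 100 rows per player:
-- SELECT prune_cash_journal(100);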

This project is a game, btw, described at

https://github.com/bmtober/fairwinds

for those interested in the backstory.

I am eager to get some real-world experience with multiple players
actually using the application and providing feedback, which is
gradually happening by means of presentations I have made, and am
scheduled to make, at local user groups. Eventually I want to raise some
money to rent virtual private server space and host it on a publicly
available site once obvious scalability issues like this are mitigated.




Re: Deletion Challenge

From: Jim Nasby
On 12/9/15 7:59 PM, Berend Tober wrote:
> The issue is that I'd like the application (that is, the data base and
> its stored procedures) to be robust enough to be a "long-running"
> application, i.e. one that doesn't suffer gradual performance
> degradation as time and the accumulated data increase.

In my experience, no such thing exists. This is one of the things that
makes database development very different than other forms of
programming. Once you write a program, it's basically always going to
perform the same. Database performance slowly changes over time, not
only due to different amounts of data, but also different *composition*
of data stored.

Of course, it is wise to avoid things that will obviously hurt future
performance, so it's good that you're thinking about this. But don't
expect something you can "set and forget". :)

> This question was sort of addressed at the "query tuning" aspect, and
> I'm confident that partitioning would help.

Keep in mind that partitioning isn't a magic bullet either, though in
this case I agree it would help. Sometimes something as simple as having
"active" and "history" partitions is enough.

> This project is a game, btw, described at

You might be interested in https://schemaverse.com/
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: Deletion Challenge

From
Benjamin Smith
Date:
On Saturday, December 05, 2015 11:08:05 AM Berend Tober wrote:
> WITH max_click AS (
>    SELECT
>      cash_journal.fairian_id,
>      max(cash_journal.click) AS click
>      FROM cash_journal
>      GROUP BY cash_journal.fairian_id
>      )
>    delete from cash_journal j
>      using max_click b
>      where j.fairian_id = b.fairian_id
>      and j.click        < b.click;
>
> WITH max_journal_id AS (
>    SELECT
>      cash_journal.fairian_id,
>      cash_journal.click,
>      max(cash_journal.cash_journal_id) AS cash_journal_id
>      FROM cash_journal
>      GROUP BY cash_journal.fairian_id, cash_journal.click
>      )
>    delete from cash_journal j
>       using max_journal_id b
>       where j.fairian_id    = b.fairian_id
>       and j.click           = b.click
>       and j.cash_journal_id < b.cash_journal_id;

Although I couldn't be sure if this would provide atomicity, I'd merge these
into one query like:

WITH max_click AS (
   SELECT
     cash_journal.fairian_id,
     max(cash_journal.click) AS click
     FROM cash_journal
     GROUP BY cash_journal.fairian_id
     ),
max_journal_id AS (
   SELECT
     cash_journal.fairian_id,
     cash_journal.click,
     max(cash_journal.cash_journal_id) AS cash_journal_id
     FROM cash_journal
     GROUP BY cash_journal.fairian_id, cash_journal.click
     ),
delete_journal1 AS
     (
   delete from cash_journal j
     using max_click b
     where j.fairian_id = b.fairian_id
     and j.click        < b.click
    returning j.*, 'journal1'::varchar AS source
    ),
delete_journal2 AS
   (
   delete from cash_journal j
      using max_journal_id b
      where j.fairian_id    = b.fairian_id
      and j.click           = b.click
      and j.cash_journal_id < b.cash_journal_id
    returning j.*, 'journal2'::varchar AS source
   )
-- AND THEN TO FIND OUT WHAT HAPPENED
SELECT * FROM delete_journal1
UNION ALL
SELECT * FROM delete_journal2;




Re: Deletion Challenge

From: Benjamin Smith
> test=> delete from cash_journal where ARRAY[click, cash_journal_id] NOT in
> (select max(ARRAY[click,cash_journal_id]) from cash_journal group by
> fairian_id);
> DELETE 7


For what it's worth, we've run into *severe* performance issues using IN () in conjunction
with a complex query when there are a large number of values (e.g., more than 10,000).
Using a WITH (CTE) prefix table and joining against that doesn't seem to carry anything
like a similar performance penalty.
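
For the query in this thread, that rewrite might look something like the following (untested sketch):

WITH latest AS (
    SELECT fairian_id, max(ARRAY[click, cash_journal_id]) AS mx
      FROM cash_journal
     GROUP BY fairian_id)
DELETE FROM cash_journal j
 USING latest l
 WHERE j.fairian_id = l.fairian_id
   AND ARRAY[j.click, j.cash_journal_id] <> l.mx;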


Re: Deletion Challenge

From: Berend Tober
Jim Nasby wrote:
> On 12/9/15 7:59 PM, Berend Tober wrote:
>> This project is a game, btw, described at
>
> You might be interested in https://schemaverse.com/


Schemaverse looks somewhat interesting. It seems that it and Fairwinds
share PostgreSQL as a foundation, but they are very different games:
while they both have an aspect of conquest, one by battle, one by
economic dominance, Fairwinds involves a different kind of strategy that
entails a balance of cooperation and competition. Also, whereas in
Schemaverse you get "free money" to build your space ship just by
joining the game, a fundamental feature of Fairwinds is wealth creation
... you have to make your own success out of nothing.