Thread: Deletion Challenge
If I understand correctly, the value of "click" always advances, and within a "click" the "cash_journal_id" always advances, though not necessarily by single steps. So within a fairian_id, ordering by "click" plus "cash_journal_id" returns the records in order, from which you want the most recent 5 for each fairian_id.
Typing without testing and ignoring performance optimizations, something along the lines of the following should work and covers the "last 5" issue as well.
with stuff_to_delete as (
    select fairian_id, click, cash_journal_id,
           rank() over (partition by fairian_id
                        order by click desc, cash_journal_id desc) as howold
    from cash_journal)
delete from cash_journal
using stuff_to_delete
where
    cash_journal.fairian_id = stuff_to_delete.fairian_id
    and cash_journal.click = stuff_to_delete.click
    and cash_journal.cash_journal_id = stuff_to_delete.cash_journal_id
    and stuff_to_delete.howold > 5;
Cheers,
Steve
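Since the query above was typed without testing, here is one way to sanity-check the same rank()-based "keep the last N" idea without a Postgres instance, using Python's bundled sqlite3 module (window functions need SQLite 3.25 or later). The schema is trimmed to the three key columns, and the DELETE is rewritten with a row-value IN subquery because SQLite has no DELETE ... USING; both changes are assumptions of this sketch, not part of the original query.

```python
# Hedged sketch: replays a rank()-based "keep the last N rows per player"
# delete in SQLite; the data mirrors the thread's cash_journal sample.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cash_journal (click INT, cash_journal_id INT, fairian_id INT)")
rows = [(36, 3, 7), (36, 4, 8), (36, 5, 9), (36, 14, 18),
        (37, 5, 7), (37, 7, 8), (37, 9, 9),
        (411, 1, 25), (411, 2, 25), (412, 1, 7), (412, 2, 25), (413, 1, 25)]
con.executemany("INSERT INTO cash_journal VALUES (?, ?, ?)", rows)

N = 1  # rows to keep per fairian_id; any N works the same way
con.execute("""
    DELETE FROM cash_journal
    WHERE (fairian_id, click, cash_journal_id) IN (
        SELECT fairian_id, click, cash_journal_id
        FROM (SELECT fairian_id, click, cash_journal_id,
                     rank() OVER (PARTITION BY fairian_id
                                  ORDER BY click DESC, cash_journal_id DESC)
                         AS howold
              FROM cash_journal)
        WHERE howold > ?)""", (N,))

survivors = sorted(con.execute(
    "SELECT fairian_id, click, cash_journal_id FROM cash_journal"))
print(survivors)
```

With N = 1 the survivors are exactly the five "last row per player" entries from the sample data.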
On Sat, Dec 5, 2015 at 8:08 AM, Berend Tober <btober@computer.org> wrote:
/*
Deletion Challenge
I want to delete all but the most recent transaction, per person, from a
table that records a transaction history because at some point the
transaction history grows large enough to adversely affect performance,
and also becomes less relevant for retention.
I have devised a way to accomplish this, but it is a 'two-stage'
approach: that is, it requires two delete statements. I would like to
know if there is a way to do it in a single statement.
Bonus challenge: Same question, except preserving the most recent N, for
N > 1, rows for each person so that a short history is retained after
the deletion.
I have included below an annotated test case and my current solution for
the N = 1 case.
*/
DROP TABLE IF EXISTS cash_journal;
CREATE TABLE cash_journal (
click bigint NOT NULL,
cash_journal_id bigint NOT NULL,
fairian_id bigint NOT NULL,
debit double precision,
credit double precision,
balance real DEFAULT 0,
description text
);
COMMENT ON COLUMN cash_journal.click IS 'Time of transaction.';
COMMENT ON COLUMN cash_journal.cash_journal_id IS 'Sequence of transaction within current click.';
COMMENT ON COLUMN cash_journal.fairian_id IS 'Fairian account affected.';
COMMENT ON COLUMN cash_journal.debit IS 'Account balance increase amount.';
COMMENT ON COLUMN cash_journal.credit IS 'Account balance decrease amount.';
COMMENT ON COLUMN cash_journal.balance IS 'Account balance, per Fairian running total.';
COMMENT ON COLUMN cash_journal.description IS 'Transaction description.';
/*
Below is some sample data, listed in the click/sequence order that the
data would actually be entered. That is, the 'click' column represents
advancing time, and within each click, transactions are sequenced by the
'cash_journal_id' column. Note there are some missing cash_journal_id
sequence numbers. This is an artifact of having presented here only
an illustrative sample. Generally, within each click, the sequence
would start at one and increment uniformly by one for each new row
in the same click, and then reset to one for the next click. The
missing increments in the sample data should not make any difference
in the solution.
The 'balance' column is a per-player running total, which is a
deliberate denormalization. It is calculated in a before insert trigger
by starting with the per-player previous balance, and then adding
the new row debit, if any, and subtracting the new row credit, if any.
Note, not all Fairians will have a transaction in every click, but any
number of Fairians may have multiple transactions in any click.
*/
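The running-balance rule described above can be sketched outside the database as well. This is an illustrative stand-in, not the actual trigger code, which lives in a Postgres before-insert trigger:

```python
# Sketch of the balance rule the before-insert trigger is described as
# implementing: previous per-player balance, plus debit, minus credit.
balances = {}  # fairian_id -> running balance

def next_balance(fairian_id, debit, credit):
    new = balances.get(fairian_id, 0.0) + (debit or 0.0) - (credit or 0.0)
    balances[fairian_id] = new
    return new

# Replaying fairian 25's sample rows reproduces its 'balance' column:
history = [(0, 0), (1000, None), (None, 5), (None, 995)]
replayed = [next_balance(25, d, c) for d, c in history]
print(replayed)  # matches 0, 1000, 995, 0 in the sample data
```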
copy cash_journal (click,cash_journal_id,fairian_id,debit,credit,balance,description) from stdin;
36 3 7 0 0 0 Initial cash balance
36 4 8 0 0 0 Initial cash balance
36 5 9 0 0 0 Initial cash balance
36 14 18 0 0 0 initial cash balance
37 5 7 9 \N 9 Ratified contract fa35e192121eab
37 7 8 8 \N 8 Ratified contract f1abd670358e03
37 9 9 7 \N 7 Ratified contract 1574bddb75c78a
411 1 25 0 0 0 Initial cash balance
411 2 25 1000 \N 1000 Issued bond 7719a1c782a1ba
412 1 7 5 \N 14 Sold food quantity 7 units.
412 2 25 \N 5 995 Bought food quantity 7 units.
413 1 25 \N 995 0 Redeemed bond 7719a1c782a1ba
\.
SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;
/*
The sample starting data is shown here in order by Fairian so that it is
perhaps easier to see what is happening for each player. Note that the
result of the deletion should be the last row for each player.
click | cash_journal_id | fairian_id | debit | credit | balance | description
-------+-----------------+------------+-------+--------+---------+----------------------------------
36 | 3 | 7 | 0 | 0 | 0 | Initial cash balance
37 | 5 | 7 | 9 | | 9 | Ratified contract fa35e192121eab
412 | 1 | 7 | 5 | | 14 | Sold food quantity 7 units.
36 | 4 | 8 | 0 | 0 | 0 | Initial cash balance
37 | 7 | 8 | 8 | | 8 | Ratified contract f1abd670358e03
36 | 5 | 9 | 0 | 0 | 0 | Initial cash balance
37 | 9 | 9 | 7 | | 7 | Ratified contract 1574bddb75c78a
36 | 14 | 18 | 0 | 0 | 0 | initial cash balance
411 | 1 | 25 | 0 | 0 | 0 | Initial cash balance
411 | 2 | 25 | 1000 | | 1000 | Issued bond 7719a1c782a1ba
412 | 2 | 25 | | 5 | 995 | Bought food quantity 7 units.
413 | 1 | 25 | | 995 | 0 | Redeemed bond 7719a1c782a1ba
(12 rows)
*/
/*
Here is the current, two-stage solution in use. Is there a way to do it
with a single statement?
Can you create a solution that retains an arbitrarily specified number
of rows per player?
*/
BEGIN;
WITH max_click AS (
SELECT
cash_journal.fairian_id,
max(cash_journal.click) AS click
FROM cash_journal
GROUP BY cash_journal.fairian_id
)
delete from cash_journal j
using max_click b
where j.fairian_id = b.fairian_id
and j.click < b.click;
WITH max_journal_id AS (
SELECT
cash_journal.fairian_id,
cash_journal.click,
max(cash_journal.cash_journal_id) AS cash_journal_id
FROM cash_journal
GROUP BY cash_journal.fairian_id, cash_journal.click
)
delete from cash_journal j
using max_journal_id b
where j.fairian_id = b.fairian_id
and j.click = b.click
and j.cash_journal_id < b.cash_journal_id;
COMMIT;
SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;
/*
click | cash_journal_id | fairian_id | debit | credit | balance | description
-------+-----------------+------------+-------+--------+---------+----------------------------------
412 | 1 | 7 | 5 | | 14 | Sold food quantity 7 units.
37 | 7 | 8 | 8 | | 8 | Ratified contract f1abd670358e03
37 | 9 | 9 | 7 | | 7 | Ratified contract 1574bddb75c78a
36 | 14 | 18 | 0 | 0 | 0 | initial cash balance
413 | 1 | 25 | | 995 | 0 | Redeemed bond 7719a1c782a1ba
(5 rows)
*/
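The two-stage logic can also be replayed without a Postgres instance through Python's sqlite3 module. Since SQLite has no DELETE ... USING, the grouped CTEs are rewritten here as correlated subqueries, and the column list is trimmed to the three key columns; both are assumptions of this sketch, not the original statements.

```python
# Replays the two-stage delete in SQLite: stage 1 keeps only each player's
# newest click; stage 2 keeps only the highest cash_journal_id within it.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cash_journal (click INT, cash_journal_id INT, fairian_id INT)")
rows = [(36, 3, 7), (36, 4, 8), (36, 5, 9), (36, 14, 18),
        (37, 5, 7), (37, 7, 8), (37, 9, 9),
        (411, 1, 25), (411, 2, 25), (412, 1, 7), (412, 2, 25), (413, 1, 25)]
con.executemany("INSERT INTO cash_journal VALUES (?, ?, ?)", rows)

# Stage 1: delete rows older than the per-player maximum click.
con.execute("""DELETE FROM cash_journal
               WHERE click < (SELECT max(b.click) FROM cash_journal b
                              WHERE b.fairian_id = cash_journal.fairian_id)""")

# Stage 2: within the surviving click, delete all but the highest sequence.
con.execute("""DELETE FROM cash_journal
               WHERE cash_journal_id < (SELECT max(b.cash_journal_id)
                                        FROM cash_journal b
                                        WHERE b.fairian_id = cash_journal.fairian_id
                                          AND b.click = cash_journal.click)""")

remaining = con.execute("SELECT count(*) FROM cash_journal").fetchone()[0]
print(remaining)  # 5: one row left per player
```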
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 12/05/2015 08:08 AM, Berend Tober wrote:
> /*
> Deletion Challenge
>
> I want to delete all but the most recent transaction, per person, from a
> table that records a transaction history because at some point the
> transaction history grows large enough to adversely affect performance,
> and also becomes less relevant for retention.
> ...
> Can you create a solution that retains an arbitrarily specified number
> of rows per player?
> */

test=> delete from cash_journal where ARRAY[click, cash_journal_id] NOT in (select max(ARRAY[click,cash_journal_id]) from cash_journal group by fairian_id);
DELETE 7

test=> SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;
 click | cash_journal_id | fairian_id | debit | credit | balance |           description
-------+-----------------+------------+-------+--------+---------+----------------------------------
   412 |               1 |          7 |     5 |        |      14 | Sold food quantity 7 units.
    37 |               7 |          8 |     8 |        |       8 | Ratified contract f1abd670358e03
    37 |               9 |          9 |     7 |        |       7 | Ratified contract 1574bddb75c78a
    36 |              14 |         18 |     0 |      0 |       0 | initial cash balance
   413 |               1 |         25 |       |    995 |       0 | Redeemed bond 7719a1c782a1ba
(5 rows)

--
Adrian Klaver
adrian.klaver@aklaver.com
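The ARRAY trick works because Postgres compares arrays element by element, left to right, the same way Python compares tuples. A small stand-alone check of that bookkeeping, on the thread's sample data (an illustrative sketch, not the query itself):

```python
# Mimics max(ARRAY[click, cash_journal_id]) ... GROUP BY fairian_id with
# Python tuples, which also compare lexicographically.
rows = [(36, 3, 7), (36, 4, 8), (36, 5, 9), (36, 14, 18),
        (37, 5, 7), (37, 7, 8), (37, 9, 9),
        (411, 1, 25), (411, 2, 25), (412, 1, 7), (412, 2, 25), (413, 1, 25)]

latest = {}  # fairian_id -> greatest (click, cash_journal_id)
for click, cj_id, fairian in rows:
    pair = (click, cj_id)
    if fairian not in latest or pair > latest[fairian]:
        latest[fairian] = pair

# Rows the NOT IN delete would remove: everything below its group maximum.
doomed = [r for r in rows if (r[0], r[1]) != latest[r[2]]]
print(len(doomed))  # 7, matching the DELETE 7 reported above
```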
Steve Crawford wrote:
> If I understand correctly the value of "click" always advances and within
> a "click" the "cash_journal_id" always advances ...
>
> Typing without testing and ignoring performance optimizations, something
> along the lines of the following should work and covers the "last 5"
> issue as well.
> ...

Assessing without testing, I like that. Thanks!

Although the above is not exactly the form I was using, an earlier iteration of a related problem employed window functions. But as the data set grew, performance suffered: if deletes were not done on a regular, continuing basis in order to keep the historical data set approximately "small", the process execution time using a windowing scheme eventually exceeded the extent of my patience.

That "non-scalable" situation is actually what motivated the deliberate de-normalization (retaining the "running balance" in a separate column) and the desire to delete old data. The original implementation calculated the running balance on the fly, employing windowing per fairian_id, and those tallies of the net balance entailed increasingly lengthy execution times as the number of rows increased. Hence I was motivated to retain only a relatively constant-sized per-fairian history, and I dismissed the use of windowing for the delete problem since it was so problematic for the running-balance-without-delete problem.

Thanks for knocking some sense into me!
Adrian Klaver wrote:
> On 12/05/2015 08:08 AM, Berend Tober wrote:
>> /*
>> Deletion Challenge
>>
>> I want to delete all but the most recent transaction, per person, from a
>> table that records a transaction history because at some point the
>> transaction history grows large enough to adversely affect performance,
>> and also becomes less relevant for retention.
>> ...
>
> test=> delete from cash_journal where ARRAY[click, cash_journal_id] NOT in
> (select max(ARRAY[click,cash_journal_id]) from cash_journal group by fairian_id);
> DELETE 7
> ...

Nice.

The idea of a NOT IN query had occurred to me briefly, but I failed to pursue it because at some point in the distant past I had gained the impression that NOT IN queries were not computationally efficient. During one round of testing I had like a million rows. I'll have to run some EXPLAIN query testing with a larger data sample for comparison. Thanks!
On 12/09/2015 12:24 AM, Berend Tober wrote:
> The idea of a NOT IN query had occurred to me briefly, but I failed to
> pursue it because at some point in the distant past I had gained the
> impression that NOT IN queries were not computationally efficient.
> During one round of testing I had like a million rows. I'll have to run
> some EXPLAIN query testing with a larger data sample for comparison.
> Thanks!

Plan B:

WITH d AS
  (SELECT * FROM
     cash_journal
   LEFT JOIN
     (SELECT
        MAX(ARRAY[click,cash_journal_id]) AS mx
      FROM
        cash_journal
      GROUP BY
        fairian_id) AS mxa
   ON
     mxa.mx = ARRAY[click, cash_journal_id]
   WHERE
     mx IS NULL)
DELETE FROM cash_journal
USING d
WHERE d.click = cash_journal.click
  AND d.cash_journal_id = cash_journal.cash_journal_id;

--
Adrian Klaver
adrian.klaver@aklaver.com
Adrian Klaver wrote:
> Plan B:
>
> WITH d AS
>   (SELECT * FROM
>      cash_journal
>    LEFT JOIN
>      (SELECT
>         MAX(ARRAY[click,cash_journal_id]) AS mx
>       FROM
>         cash_journal
>       GROUP BY
>         fairian_id) AS mxa
>    ON
>      mxa.mx = ARRAY[click, cash_journal_id]
>    WHERE
>      mx IS NULL)
> DELETE FROM cash_journal
> USING d
> WHERE d.click = cash_journal.click
>   AND d.cash_journal_id = cash_journal.cash_journal_id;
Couldn't the LEFT JOIN relation in the CTE be better written using "SELECT DISTINCT ON (fairian_id) click, cash_journal_id [...] ORDER BY fairian_id, click DESC, cash_journal_id DESC" or something similar? It doesn't seem like you should need to introduce an array and an aggregate here.
It does have the negative property of only providing a single row per fairian_id, which excludes using it for the "last 5" part, but I suspect it will be considerably faster for the single-row version.
David J.
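The DISTINCT ON behavior being suggested (one newest row per fairian_id) can be emulated in plain Python to see which rows such a CTE would yield. This is a hypothetical sketch over the thread's sample data, not the actual Postgres query:

```python
# Emulates SELECT DISTINCT ON (fairian_id) ... ORDER BY fairian_id,
# click DESC, cash_journal_id DESC: keep the first row seen per player
# when scanning most-recent-first.
rows = [(36, 3, 7), (36, 4, 8), (36, 5, 9), (36, 14, 18),
        (37, 5, 7), (37, 7, 8), (37, 9, 9),
        (411, 1, 25), (411, 2, 25), (412, 1, 7), (412, 2, 25), (413, 1, 25)]

seen, newest = set(), []
for click, cj_id, fairian in sorted(rows, reverse=True):
    if fairian not in seen:
        seen.add(fairian)
        newest.append((fairian, click, cj_id))
print(sorted(newest))
```

The five rows it keeps are the same ones every solution in the thread retains for N = 1.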
The two general solutions are the "keep the last one" proposed by Adrian and the "keep the last N" that I sent.
But it might be worth stepping back a bit. You said you are having performance problems that you feel would be improved by removing only a million rows which doesn't sound like that much to me. It's less than half of what I *add* to just one of my tables every week and my database is dwarfed by those of many of the participants on this list.
This suggests that there may be other issues such as tuning, indexing or query optimization at play. Depending on your requirements, partitioning might be useful. It wouldn't be last N but could easily be done to partition by date-ranges which makes archiving and purging a low-cost operation.
You might want to expand a bit on the core issue you are trying to solve.
Cheers,
Steve
On Wed, Dec 9, 2015 at 12:43 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
> ...
Steve Crawford wrote: > The two general solutions are the "keep the last one" proposed by Adrian > "keep the last N" that I sent. > > But it might be worth stepping back a bit. You said you are having > performance problems that you feel would be improved by removing only a > million rows which doesn't sound like that much to me. It's less than > half of what I *add* to just one of my tables every week and my database > is dwarfed by those of many of the participants on this list. > > This suggests that there may be other issues such as tuning, indexing or > query optimization at play. Depending on your requirements, partitioning > might be useful. It wouldn't be last N but could easily be done to > partition by date-ranges which makes archiving and purging a low-cost > operation. > > You might want to expand a bit on the core issue you are trying to solve. > I really appreciate the deep-dive. I'm quite sure the performance issue is mostly a hardware limitation at this point. The application is in a developmental phase and running for test purposes on grossly inadequate hardware for production purposes ... think home laptop computer in one case. The issue is that I'd like the application (that is, the data base and its stored procedures) to be robust enough to be a "long-running" application, i.e. one that doesn't suffer gradual performance degradation as time and the accumulated data increase. For the cash and similar journals, I'd like to retain some amount of history so that players can review "recent" transactions so as to understand and verify how the current balance was arrived at, but at some point let old transactions age-off and be deleted. This question was sort of addressed at the "query tuning" aspect, and I'm confident that partitioning would help. But since this is just a game, retention of a full and auditable history is not really required: I have a lot of flexibility to determine what to keep and in fact am not exactly sure how much to keep. 
I know I need at least one row in order to retain the current balance, but I'm thinking something on the order of scores or hundreds, maybe a thousand, transactions per player in each of several similar journals retained at any point in time would be sufficient.

This project is a game, btw, described at https://github.com/bmtober/fairwinds for those interested in the backstory. I am eager to get some real-world experience with multiple players actually using the application and providing feedback, which is gradually happening by means of presentations I have made and am scheduled to make at local user groups. Eventually I want to raise some money to rent some virtual private server space and host the game on a publicly available site, once obvious scalability issues like this are mitigated.
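[Editor's note: the "keep the most recent N transactions per player" retention described above can be done in one statement with a window function. A sketch, assuming the cash_journal table from the original post and N = 5 as an illustrative value; this is not from the thread itself:]

```sql
-- Rank each Fairian's rows from newest to oldest, then delete everything
-- ranked past the Nth most recent row.
DELETE FROM cash_journal
WHERE (fairian_id, click, cash_journal_id) IN (
    SELECT fairian_id, click, cash_journal_id
    FROM (
        SELECT fairian_id, click, cash_journal_id,
               row_number() OVER (
                   PARTITION BY fairian_id
                   ORDER BY click DESC, cash_journal_id DESC
               ) AS rn
        FROM cash_journal
    ) ranked
    WHERE rn > 5   -- N = 5: keep the five most recent rows per player
);
```

Setting N = 1 reduces this to the original "keep only the last row" case.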
On 12/9/15 7:59 PM, Berend Tober wrote:
> The issue is that I'd like the application (that is, the data base and
> its stored procedures) to be robust enough to be a "long-running"
> application, i.e. one that doesn't suffer gradual performance
> degradation as time and the accumulated data increase.

In my experience, no such thing exists. This is one of the things that makes database development very different from other forms of programming. Once you write a program, it's basically always going to perform the same. Database performance slowly changes over time, not only due to different amounts of data, but also different *composition* of the data stored.

Of course, it is wise to avoid things that will obviously hurt future performance, so it's good that you're thinking about this. But don't expect something you can "set and forget". :)

> This question was sort of addressed at the "query tuning" aspect, and
> I'm confident that partitioning would help.

Keep in mind that partitioning isn't a magic bullet either, though in this case I agree it would help. Sometimes something as simple as having "active" and "history" partitions is enough.

> This project is a game, btw, described at

You might be interested in https://schemaverse.com/

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Saturday, December 05, 2015 11:08:05 AM Berend Tober wrote:
> WITH max_click AS (
>     SELECT
>         cash_journal.fairian_id,
>         max(cash_journal.click) AS click
>     FROM cash_journal
>     GROUP BY cash_journal.fairian_id
> )
> delete from cash_journal j
> using max_click b
> where j.fairian_id = b.fairian_id
> and j.click < b.click;
>
> WITH max_journal_id AS (
>     SELECT
>         cash_journal.fairian_id,
>         cash_journal.click,
>         max(cash_journal.cash_journal_id) AS cash_journal_id
>     FROM cash_journal
>     GROUP BY cash_journal.fairian_id, cash_journal.click
> )
> delete from cash_journal j
> using max_journal_id b
> where j.fairian_id = b.fairian_id
> and j.click = b.click
> and j.cash_journal_id < b.cash_journal_id;

Although I couldn't be sure if this would provide atomicity, I'd merge these into one query like:

WITH max_click AS (
    SELECT
        cash_journal.fairian_id,
        max(cash_journal.click) AS click
    FROM cash_journal
    GROUP BY cash_journal.fairian_id
),
max_journal_id AS (
    SELECT
        cash_journal.fairian_id,
        cash_journal.click,
        max(cash_journal.cash_journal_id) AS cash_journal_id
    FROM cash_journal
    GROUP BY cash_journal.fairian_id, cash_journal.click
),
delete_journal1 AS (
    delete from cash_journal j
    using max_click b
    where j.fairian_id = b.fairian_id
    and j.click < b.click
    returning *, 'journal1'::varchar AS source
),
delete_journal2 AS (
    delete from cash_journal j
    using max_journal_id b
    where j.fairian_id = b.fairian_id
    and j.click = b.click
    and j.cash_journal_id < b.cash_journal_id
    returning *, 'journal2'::varchar AS source
)
-- AND THEN TO FIND OUT WHAT HAPPENED
SELECT * FROM delete_journal1
UNION ALL
SELECT * FROM delete_journal2;
> test=> delete from cash_journal where ARRAY[click, cash_journal_id] NOT in
> (select max(ARRAY[click,cash_journal_id]) from cash_journal group by
> fairian_id);
> DELETE 7

For what it's worth, we've run into *severe* performance issues using in() if there are a large number of values in conjunction with a complex query (e.g. more than 10,000). Using a with() prefix table and joining against that doesn't seem to carry anything like a similar performance penalty.
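[Editor's note: a sketch of the kind of rewrite described above, moving the aggregate into a CTE and joining against it via USING instead of feeding it to NOT IN. It assumes the cash_journal table from the original post and is not from the thread itself:]

```sql
-- Equivalent to the NOT IN version: within each fairian_id, delete every
-- row whose (click, cash_journal_id) pair is not the per-player maximum.
WITH keep AS (
    SELECT fairian_id,
           max(ARRAY[click, cash_journal_id]) AS last_pair
    FROM cash_journal
    GROUP BY fairian_id
)
DELETE FROM cash_journal j
USING keep k
WHERE j.fairian_id = k.fairian_id
  AND ARRAY[j.click, j.cash_journal_id] <> k.last_pair;
```

The join form lets the planner pick a hash or merge join against the aggregated CTE rather than probing a long value list.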
Jim Nasby wrote:
> On 12/9/15 7:59 PM, Berend Tober wrote:
>> This project is a game, btw, described at
>
> You might be interested in https://schemaverse.com/

Schemaverse looks somewhat interesting. It seems that it and Fairwinds share PostgreSQL as a foundation, but they are very different games: while they both have an aspect of conquest, one by battle and one by economic dominance, Fairwinds involves a different kind of strategy that entails a balance of cooperation and competition.

Also, whereas in Schemaverse you get "free money" to build your space ship just by joining the game, a fundamental feature of Fairwinds is wealth creation ... you have to make your own success out of nothing.