Thread: Select for update, locks and transaction levels
Hi,
I am trying to gather stats about how many times a resource in our web app is viewed, i.e. just a COUNT. There are potentially millions of resources within the system.
I thought of two methods:
1. An extra column in the resource table which contains a count.
a. Each time a resource is viewed an UPDATE statement is run.
UPDATE res_table SET view_count = view_count + 1 WHERE res_id=2177526::bigint;
b. The count is just SELECTed from the resource table.
2. A separate table that contains a count using an algorithm similar to the method presented here:
http://archives.postgresql.org/pgsql-performance/2004-01/msg00059.php
a. Each time a resource is viewed a new row is inserted with a count of 1.
b. Each time the view count is needed, rows from the table are SUMmed together.
c. A compression script runs regularly to group and sum the rows together.
I personally did not like the look of 1 so I thought about using 2. The main reason being there would be no locks that would interfere with “updating” the view count because in fact this was just an INSERT statement. Also, vacuuming the new table is preferable as it is considerably thinner (i.e. fewer columns) than the resource table. The second method allows me to capture more data too, such as who viewed the resource and which resource they viewed next, but I digress :-).
Q1. Have I missed any methods?
I thought I would take a further look at method 2, and I have some questions about that too.
The schema for this new table is shown below.
-- SCHEMA ---------------------------------------------------------------
CREATE TABLE view_res (
    res_id int8,
    count int8
) WITHOUT OIDS;
CREATE INDEX view_res_res_id_idx ON view_res (res_id);
------------------------------------------------------------------------
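For concreteness, the per-view write and the count lookup under this scheme would be something like the following (just a sketch; the res_id literal is the example value used throughout this post):
-- QUERY ---------------------------------------------------------------
-- On each view: append a single row; no existing rows are touched,
-- so there is no lock contention with other viewers.
INSERT INTO view_res (res_id, count) VALUES (2177526::bigint, 1);

-- To read the total: sum the (possibly many) rows for that resource.
SELECT COALESCE(sum(count), 0) AS view_count
FROM view_res
WHERE res_id = 2177526::bigint;
------------------------------------------------------------------------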
And the compression script should reduce the following rows:
-- QUERY ---------------------------------------------------------------
db_dev=# select * from view_res where res_id=2177526::bigint;
res_id | count
---------+-------
2177526 | 1
2177526 | 1
2177526 | 1
2177526 | 1
2177526 | 1
2177526 | 1
2177526 | 1
2177526 | 1
(8 rows)
------------------------------------------------------------------------
to the following
-- QUERY ---------------------------------------------------------------
db_dev=# select * from view_res where res_id=2177526::bigint;
res_id | count
---------+-------
2177526 | 8
(1 row)
------------------------------------------------------------------------
Now I must admit I have never really played around with select for update, locks or transaction levels, hence the questions. I have looked in the docs and think I have figured out what I need to do. The following is pseudo-code for the compression script.
------------------------------------------------------------------------
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT res_id, sum(count) AS res_count FROM view_res GROUP BY res_id FOR UPDATE;
For each row
{
DELETE FROM view_res WHERE res_id=<res_id>::bigint;
INSERT INTO view_res (res_id, count) VALUES (<res_id>, <res_count>);
}
COMMIT;
------------------------------------------------------------------------
Right, the questions for this method:
Q2. Will a “group by” used with a “select … for update” lock all the rows used for the sum?
Q3. Am I right in saying that freshly inserted rows will not be affected by the delete, because of the SERIALIZABLE transaction level?
Q4. Are there any other concurrency issues that I have not thought of?
BTW, this is still at the planning phase so a complete redesign is perfectly fine. Just seeing if anyone has greater experience than me at this sort of thing.
TIA
Nick Barr
"Nick Barr" <nick.barr@webbased.co.uk> writes: > I personally did not like the look of 1 so I thought about using 2. The > main reason being there would be no locks that would interfere with > "updating" the view count because in fact this was just an INSERT > statement. INSERTs are good. > Q2.Will a "group by" used with a "select . for update" lock all the rows > used for the sum? No; it won't work at all. regression=# select hundred,count(*) from tenk1 group by hundred for update; ERROR: SELECT FOR UPDATE is not allowed with GROUP BY clause regression=# However, AFAICS it will not matter if you are using a serializable transaction. If two such transactions try to delete the same row, one of them will error out, so you do not need FOR UPDATE. regards, tom lane
on 2/16/04 10:51 AM, nick.barr@webbased.co.uk purportedly said:

> I am trying to gather stats about how many times a resource in our web
> app is viewed, i.e. just a COUNT. There are potentially millions of
> resources within the system.
>
> I thought of two methods:
>
> 1. An extra column in the resource table which contains a count.

Not a good idea if you expect a high concurrency rate--you will create a
superfluous bottleneck in your app.

> 2. A separate table that contains a count using an algorithm similar
>    to the method presented here:
>
>    http://archives.postgresql.org/pgsql-performance/2004-01/msg00059.php
>
>    a. Each time a resource is viewed a new row is inserted with a count
>       of 1.
>    b. Each time the view count is needed, rows from the table are SUMmed
>       together.
>    c. A compression script runs regularly to group and sum the rows
>       together.

I am assuming that you are concerned about storage size, which is why you
want to "compress". You are probably better off (both by performance and
storage) with something like the following approach:

CREATE TABLE view_res (
    res_id int8,
    stamp timestamp
) WITHOUT OIDS;

CREATE TABLE view_res_arch (
    res_id int8,
    cycle date,
    hits int8
);

By using a timestamp instead of a count you can archive using a date/time
range and avoid any concurrency/locking issues:

INSERT INTO view_res_arch (res_id, cycle, hits)
    SELECT res_id, '2003-12-31', COUNT(res_id) FROM view_res
    WHERE stamp >= '2003-12-01' AND stamp <= '2003-12-31 23:59:59'
    GROUP BY res_id;

then:

DELETE FROM view_res
WHERE stamp >= '2003-12-01' AND stamp <= '2003-12-31 23:59:59';

With this kind of approach you have historicity and extensibility, so you
could, for example, show historical trends with only minor modifications.

Best regards,

Keary Suska
Esoteritech, Inc.
"Leveraging Open Source for a better Internet"
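Under this scheme the current total for a resource has to come from both tables; a sketch of the read, reusing the res_id from the original post:
------------------------------------------------------------------------
-- Total views for one resource: live (unarchived) rows plus all
-- archived cycles. COALESCE covers resources with no archive rows yet.
SELECT (SELECT count(*)
        FROM view_res
        WHERE res_id = 2177526::bigint)
     + (SELECT COALESCE(sum(hits), 0)
        FROM view_res_arch
        WHERE res_id = 2177526::bigint) AS total_views;
------------------------------------------------------------------------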
Maybe filesystem fragmentation is a problem? The usual advice is that fragmentation does not matter on a multiuser system (for example on an ext2 filesystem), because many users and tasks share the disk I/O subsystem, so a defragmented disk brings no benefit. But in my situation, running PostgreSQL with PHP as an Apache module, I made a backup and ran an ext2 defragmentation program on the relevant partitions (i.e. /home and /var/, where the PHP files and the database cluster live).

The result? About a 40% (!) performance boost...