Thread: Threaded Records in SQL: Advice Needed
Hi Everyone, I'm planning to implement a project using a php/postgres 6.5.3 combination. Part of what I need to do is develop a model in SQL to keep track of "threaded records." What I mean by this, is that I need to keep track of parent-child relationships in exactly the same way that a threaded-discussion group does. The design should be able to support approximately 100 users, mostly readers, and be able to handle approximately 1 insert per minute. So far, I've found two methods/algorithms for implenting this in SQL, but there's a problem with each. The first one is extremely efficient and inexpensive for selecting and ordering all of the records in a thread (taking only one select to get all of the records and in order), but requires that ALL records be updated on insert, which is very expensive. Because I'm expecting about 120000 records a year, or ca. 450 new records every business day, the prospect of each insert updating each record in the database is not appealing. At first it would be ok, but as the record count increased, the inserts would become extremely slow. This method works by modeling the records as "nested sets" and was explained by Joe Celko in one of his articles. The second method is simply adding fields to keep track of an id, parent_id, and root_id. Its benefit is that it is inexpensive on inserts, requiring only one insert and no updates, but it is costly to select and order all the records in a thread. Essentially, the only way I have found to do this is to implement a recursive function that selects one record at a time, checks for children, ad infinitum. Not very efficient. My concern with method 1 is that the great expense of inserting new records will slow the database down to unacceptable levels. This is based on my past experience with postgres when updating a database of 1.5 million rows (which took an hour and a half with syncing turned off.) Method 2 is problematic because it uses recursion first of all (which I understand to be heavy in resource usage) and because it makes a select for each and every record in a thread to be displayed. What I'd like to find is either a better way that I'm not aware of, or some kind of middle ground. Also, what do you make of these methods. Are they as inefficient as I want to believe? Thanks! Bryan Ingram Database Programmer S!XTYFOOTSP!DER 972-492-5676 bingram@sixtyfootspider.com <mailto:bingram@sixtyfootspider.com> <http://www.sixtyfootspider.com/> www.sixtyfootspider.com
> Hi Everyone, > > I'm planning to implement a project using a php/postgres 6.5.3 combination. > > Part of what I need to do is develop a model in SQL to keep track of > "threaded records." What I mean by this, is that I need to keep track of > parent-child relationships in exactly the same way that a > threaded-discussion group does. Hi Bryan check out the PDF files at http://ourworld.compuserve.com/homepages/Graeme_Birchall/HTM_COOK.HTM The "SQL Cookbook" that this guy writes is oriented towards IBM's DB2 database, however, he does talk about a number of different ways of handling parent-child relationships in an SQL table; his versions are able to be nested as well. Actually, this guy seems to know his stuff when it comes to SQL, so it is probably worth having a look at anyways...thus I have also posted this response to the list. Cordially Patrick Giagnocavo a222@redrose.net
In order to simplify the regular expressions, I propose to change the "." as "field separator" in the ids: "." has a special meaning in regular expressions, let us avoid escaping all over the place. So, take for instance "/" as separator, so that the message in my previous example is now "25/7/19/2". In order to compute the correct id at insert time, I would suggest keeping a sequence root_seq for the root messages (see CREATE SEQUENCE in the postgres manual), and just "select nextval(root_seq)" each time you insert a new root message. Alternatively, if you write root message ids as e.g. "/25" and keep the previous structure, you can use the method described below also for root messages. In this case, you just would be interpreting the root messages as "replies to the (fictitious) message with an empty index". Now assume you want to insert a new reply to message with id X (which could be at any level, e.g. X = 25/7/19). You can get the number of the next response to X = 25/7/19 using the regular expression capabilities of postgres "select count(id)+1 as Y from your_table where id ~ '25/7/19/[^/]*$' " and then compute the index to be inserted as "X/Y" In the regexp you are requesting something that matches 25/7/19/(any number of symbols different from "/") so that all direct replies are selected, but NOT the replies to them - as they would have a "/" somewhere before the end. You can automatize this with the sql functions create function next_reply_num(text) returns int4 as 'select count(*)+1from ids where id ~ ($1|| ''/[^/]*$'') ' language 'sql'; create function next_reply_id(text) returns text as 'select ($1 || ''/'' || next_reply_num($1)::text)' language'sql'; You could then insert the next reply to message "25/19/2" using insert into messages(id, ...) values(next_reply_id('25/19/2'), ...); I do not know how to do this within a single call to an sql function; it would be easy to do if you use either PL/tcl or PL/pgsql procedural languages. Thanks for the "challenge": this IS fun. Miguel
Thanks for the ideas on the functions, that'll work nicely. The only other problem I see in actually implementing this, is that the id column i.e. /25/10/2/ will not be ordered correctly because it relies on ascii values. You get alphabetic orderings, rather than numerical. Such as: 1 10 11 12 14 2 20 25 3 4 5 instead of 1 2 3 4 5 10 11 12 14 20 25 Any ideas how to get around this? I'm working on the problem right now, but haven't found anything yet. Bryan
In reference to the ascii/numeric ordering problem .. But first let me say ..if someone knows of a way to get ascii values to order as if they were numerics ..please let me in on the secret .. Instead of using numbers, if letters were used the ordering would be correct. The id string would become something like: A/A/A for the 1st reply to the 1st reply of the 1st root message. This will work fine for threads with relatively few replies on the same level. For instance, the number 100 would take only 4 characters to encode, however, the number 1000 would need 39 characters to encode. For my purpose 100-200 replies on any given level will suffice nicely. Even though text fields are limited to 4096 characters, even if each level needed 40 characters for encoding, I would still have room for approximately 100 levels (plenty) .. Adding a step to the procedure you developed, we could convert the returned "next reply number" into its alphabetic equivalent. e.g. 1 = A2 = B26 = Z27 = ZA28 = ZB52 = ZZ53 = ZZA The procedure is this: next_id/26 = number of Z's mod of next_id/26 = numeric position within alphabet So .. if you wanted to add the 100th reply to the 2nd root topic ... 100/26 = 3 (ZZZ) (drop everything but the integer) mod of 100/26 = 12 (L) So we'd have 3 Z's and an L for the second root topic. The fully assembled ID would be /B/ZZZL Then a reply to this would become /B/ZZZL/A So on, and so forth .. This is verging on getting kludgey, but it still looks like it meets all of the criteria: 1) Fast inserts 2) Fast selects of a thread or part of a thread Actually this criterion hinges on the quality of the indexing 3) Rows are returned in order, with only one select. So ..I really wish there was a better way ..but this isn't too big of a price to pay. Thanks, Bryan -----Original Message----- From: Ingram, Bryan Sent: Tuesday, April 11, 2000 3:05 PM To: 'mig@utdt.edu' Cc: pgsql-sql@postgresql.org Subject: RE: [SQL] Threaded Records in SQL: Advice Needed Thanks for the ideas on the functions, that'll work nicely. The only other problem I see in actually implementing this, is that the id column i.e. /25/10/2/ will not be ordered correctly because it relies on ascii values. You get alphabetic orderings, rather than numerical. Such as: 1 10 11 12 14 2 20 25 3 4 5 instead of 1 2 3 4 5 10 11 12 14 20 25 Any ideas how to get around this? I'm working on the problem right now, but haven't found anything yet. Bryan
lexicographic ordering is what you want ... So, new proposal (described from the start) which solves all this and should (I think) be rather fast. On the other hand, the id fields get larger and larger ... (1) Ids of messages are of the form /(field_0)/(field_1)/.../(field_n) where each field counts the number of message atthat level: field_0 counts the root messages, field_1 the replies to the root message at field_0, etc ... (2) The field counters are numbers where (a) the first digit gives the count of digits in the index (if youthink that youwill have more than nine digits in any field,make it a two-digit counter: 01, 02, ... 99 (b) the rest is the index Same example as before: the second reply to the nineteenth reply to the seventh reply to root message 25 has id /225/17/219/12 (field_0=225 means: two digits in the counter, the counter is 25) This does solve the issue of correctordering: lexicographic ordering of THESE indexes is what you want ... shorter numbers do come before the longerones (3) The issuing of new indexes can be implemented in three sql functions as follows: create function next_reply_num(text)returns text as 'select (count(*)+1)::text from ids where id ~ ($1|| ''/[^/]*$'') ' language 'sql'; create function add_ct(text) returns text as 'select (char_length($1)::text || $1)' language 'sql'; create function next_reply_id(text) returns text as 'select ($1 || ''/'' || add_ct(next_reply_num($1)))' language'sql'; The function next_reply_num gets the next counter; add_ct prepends the digit count; next_reply_id assemblesthe lot. Again, maybe this can be implemented with fewer sql function calls; it can definitely be implemented in a single tcl or pgSQL function ... Another remark on these functions: if you EVER delete a message from the list, the procedure next_reply_num will wreak havoc in your numbering! If that is a possibility, you should probably select the largest reply number to that message, parse it, add one to the counter, prepend the byte count and then compute the index for the new reply. PS: I just received your e-mail which solves these issues; here is an alternative ... It is of course compatible with usingletters instead of numbers, as you propose - the new idea is adding a "byte count" to each field, which can of coursealso be encoded in a letter --text follows this line-- >From: "Ingram, Bryan" <BIngram@sixtyfootspider.com> >Cc: pgsql-sql@postgresql.org >Date: Tue, 11 Apr 2000 14:57:00 -0500 >Content-Type: text/plain; > charset="iso-8859-1" > >Thanks for the ideas on the functions, that'll work nicely. > >The only other problem I see in actually implementing this, is that the id >column i.e. /25/10/2/ will not be ordered correctly because it relies on >ascii values. You get alphabetic orderings, rather than numerical. > >Such as: > >1 >10 >11 >12 >14 >2 >20 >25 >3 >4 >5 > >instead of > >1 >2 >3 >4 >5 >10 >11 >12 >14 >20 >25 > >Any ideas how to get around this? I'm working on the problem right now, but >haven't found anything yet. > >Bryan >
Would it be too obvious to mention that leading zeroes would give you what you need? //\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\ Michael McCarthy TCSI Corporation michael@tcsi.com 1080 Marina Village Parkway (510) 749-8739 Alameda, CA 94501 > The only other problem I see in actually implementing this, is that the id > column i.e. /25/10/2/ will not be ordered correctly because it relies on > ascii values. You get alphabetic orderings, rather than numerical.
Bryan, You can numerically, even if the field is a text field, if all fields have a number. Just cst to an int4. E.g. create table testtable (num text, num2 text); select num from testtable order by num::int4; Troy > > Thanks for the ideas on the functions, that'll work nicely. > > The only other problem I see in actually implementing this, is that the id > column i.e. /25/10/2/ will not be ordered correctly because it relies on > ascii values. You get alphabetic orderings, rather than numerical. > > Such as: > > 1 > 10 > 11 > 12 > 14 > 2 > 20 > 25 > 3 > 4 > 5 > > instead of > > 1 > 2 > 3 > 4 > 5 > 10 > 11 > 12 > 14 > 20 > 25 > > Any ideas how to get around this? I'm working on the problem right now, but > haven't found anything yet. > > Bryan > > >
----- Original Message ----- From: "Ingram, Bryan" <BIngram@sixtyfootspider.com> To: "Ingram, Bryan" <BIngram@sixtyfootspider.com>; <mig@utdt.edu> Cc: <pgsql-sql@postgresql.org> Sent: Tuesday, April 11, 2000 9:57 PM Subject: RE: [SQL] Threaded Records in SQL: Advice Needed > In reference to the ascii/numeric ordering problem .. > > But first let me say ..if someone knows of a way to get ascii values to > order as if they were numerics ..please let me in on the secret .. > > Instead of using numbers, if letters were used the ordering would be > correct. > > The id string would become something like: A/A/A for the 1st reply to the > 1st reply of the 1st root message. > > This will work fine for threads with relatively few replies on the same > level. For instance, the number 100 would take only 4 characters to encode, > however, the number 1000 would need 39 characters to encode. More efficient to use letters as numbers in base 26? Then for up to 676 replies per level you need only 2 letters. If you allowed 3, you'd have some headroom. ---------------------------------------------------------------- Moray.McConnachie@computing-services.oxford.ac.uk
Hello, I don't know if it has been answered yet (bit late I know). I solved the ascii numeric ordering as followed: - concatenate a lot of zeroes in front (at least the length of the result value) - cut the the result length. snlsor=> select * from asciinum order by number; number ------ 1 10 111 1211 2 20 211 22 (8 rows) snlsor=> select * from asciinum order by lpad(number, 4, '0'); number ------ 1 2 10 20 22 111 211 1211 (8 rows) Best regards, Roelof > -----Original Message----- > From: Ingram, Bryan [SMTP:BIngram@sixtyfootspider.com] > Sent: dinsdag 11 april 2000 22:58 > To: Ingram, Bryan; mig@utdt.edu > Cc: pgsql-sql@postgresql.org > Subject: RE: [SQL] Threaded Records in SQL: Advice Needed > > In reference to the ascii/numeric ordering problem .. > > But first let me say ..if someone knows of a way to get ascii values to > order as if they were numerics ..please let me in on the secret .. > > Instead of using numbers, if letters were used the ordering would be > correct. > > The id string would become something like: A/A/A for the 1st reply to the > 1st reply of the 1st root message. > > This will work fine for threads with relatively few replies on the same > level. For instance, the number 100 would take only 4 characters to > encode, > however, the number 1000 would need 39 characters to encode. > > For my purpose 100-200 replies on any given level will suffice nicely. > > Even though text fields are limited to 4096 characters, even if each level > needed 40 characters for encoding, I would still have room for > approximately > 100 levels (plenty) .. > > Adding a step to the procedure you developed, we could convert the > returned > "next reply number" into its alphabetic equivalent. > > e.g. 1 = A > 2 = B > 26 = Z > 27 = ZA > 28 = ZB > 52 = ZZ > 53 = ZZA > > The procedure is this: > > next_id/26 = number of Z's > mod of next_id/26 = numeric position within alphabet > > So .. if you wanted to add the 100th reply to the 2nd root topic ... > > 100/26 = 3 (ZZZ) (drop everything but the integer) > mod of 100/26 = 12 (L) > > So we'd have 3 Z's and an L for the second root topic. > > The fully assembled ID would be /B/ZZZL > > Then a reply to this would become /B/ZZZL/A > > So on, and so forth .. > > This is verging on getting kludgey, but it still looks like it meets all > of > the criteria: > > 1) Fast inserts > 2) Fast selects of a thread or part of a thread > Actually this criterion hinges on the quality of the indexing > 3) Rows are returned in order, with only one select. > > So ..I really wish there was a better way ..but this isn't too big of a > price to pay. > > Thanks, > Bryan > > > > > > > -----Original Message----- > From: Ingram, Bryan > Sent: Tuesday, April 11, 2000 3:05 PM > To: 'mig@utdt.edu' > Cc: pgsql-sql@postgresql.org > Subject: RE: [SQL] Threaded Records in SQL: Advice Needed > > > Thanks for the ideas on the functions, that'll work nicely. > > The only other problem I see in actually implementing this, is that the id > column i.e. /25/10/2/ will not be ordered correctly because it relies on > ascii values. You get alphabetic orderings, rather than numerical. > > Such as: > > 1 > 10 > 11 > 12 > 14 > 2 > 20 > 25 > 3 > 4 > 5 > > instead of > > 1 > 2 > 3 > 4 > 5 > 10 > 11 > 12 > 14 > 20 > 25 > > Any ideas how to get around this? I'm working on the problem right now, > but > haven't found anything yet. > > Bryan >