Thread: COPY and default values
The attached patch changes COPY so that if no value for an attribute is specified, the default value for the attribute (or the attribute's type) will be assigned, rather than NULL. I'll write some regression tests and update the docs later. Also, there's a possible optimization that I didn't implement: if the default value Node is a 'Const', we can safely evaluate and store the default value once for the lifetime of the COPY, rather than every time that it is needed. However, Const Nodes are evaluated very quickly by ExecEvalExpr(), so I'm not sure if it's worth worrying about... There's one major problem with the patch, which I haven't been able to fix: when there are multiple rows of data which require default values, the COPY fails: nconway=# create table test (a int, b text default 'def value'); CREATE TABLE nconway=# copy test from stdin; >> 5 >> \. nconway=# copy test from stdin; >> 10 >> 15 >> \. ERROR: copy: line 2, Memory exhausted in AllocSetAlloc(1065320379) I think I've mismanaged the memory contexts involved somehow, but I'm not sure what the problem is. Any help would be appreciated... Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
Attachment
Neil Conway <nconway@klamath.dyndns.org> writes: > ERROR: copy: line 2, Memory exhausted in AllocSetAlloc(1065320379) > I think I've mismanaged the memory contexts involved somehow, but > I'm not sure what the problem is. Any help would be appreciated... Well, for one thing you're doing + for (i = 0; i < attr_count; i++) + { + if (nulls[i] == 'd' && values[i] == 0) + { + bool isNull; + + values[i] = ExecEvalExprSwitchContext(defaults[i], econtext, + &isNull, NULL); + + /* If it's NULL, return value is meaningless */ + if (isNull) + { + values[i] = 0; + nulls[i] = 'n'; + } + else + nulls[i] = ' '; + + ResetPerTupleExprContext(estate); + } + } which means you reset the memory context containing the default results before those results can ever get used, leaving dangling pointers in values[i]. (Hint: there's a reason why the econtext is called the "per-tuple" context, not "per-column" context. The one reset that's in there now is sufficient.) The nulls = 'd' mechanism is ugly and unnecessary, I think. We were intending to modify COPY's behavior to prohibit missing/extra fields in incoming rows anyway, so there's no reason not to know in advance exactly which columns need to have defaults inserted, and to set up default info for only those columns. I'm also somewhat uncomfortable with the fact that the patch invokes ExecEvalExpr on a NULL pointer if there is a default-less column involved --- perhaps that works at the moment but I don't like it. Should special-case that. The pfree's you've added near line 1021 look rather pointless, seeing as how (a) they're only pfree'ing the topmost node of the default expressions, and (b) the whole context is about to get reset anyhow. There isn't a lot of percentage in any of those end-of-statement pfrees that are there now... I didn't like the aspect of the patch that moves build_column_default into execUtils.c. It's not an executor routine (as evidenced by the fact that you had to add a pile of new inclusions to execUtils.c to make it compile). I'm not sure of the cleanest place for it; maybe someplace in backend/catalog/, though really there's nothing wrong with keeping it in the rewriter. I haven't tried to run the patch, but those were some things I noticed in a quick eyeball review... regards, tom lane
On Mon, 27 May 2002 17:15:21 -0400 "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > Neil Conway <nconway@klamath.dyndns.org> writes: > > ERROR: copy: line 2, Memory exhausted in AllocSetAlloc(1065320379) > > > I think I've mismanaged the memory contexts involved somehow, but > > I'm not sure what the problem is. Any help would be appreciated... > [You] reset the memory context containing the default results > before those results can ever get used, leaving dangling pointers in > values[i]. (Hint: there's a reason why the econtext is called the > "per-tuple" context, not "per-column" context. The one reset that's in > there now is sufficient.) Woops :-) Ok, removed that. Actually, I had just inserted that when I was trying to debug the assertion failure I mentioned earlier. > The nulls = 'd' mechanism is ugly and unnecessary, I think. We were > intending to modify COPY's behavior to prohibit missing/extra fields > in incoming rows anyway, so there's no reason not to know in advance > exactly which columns need to have defaults inserted, and to set up > default info for only those columns. Can you elaborate on exactly how you'd like to see COPY's behavior changed? What's the rationale for disallowing missing fields in COPY input? > I'm also somewhat uncomfortable > with the fact that the patch invokes ExecEvalExpr on a NULL pointer > if there is a default-less column involved --- perhaps that works at > the moment but I don't like it. Should special-case that. Good spot -- there's a special case for that now. > The pfree's you've added near line 1021 look rather pointless, seeing > as how (a) they're only pfree'ing the topmost node of the default > expressions, and (b) the whole context is about to get reset anyhow. > There isn't a lot of percentage in any of those end-of-statement > pfrees that are there now... OK, removed them. > I didn't like the aspect of the patch that moves build_column_default > into execUtils.c. It's not an executor routine (as evidenced by the > fact that you had to add a pile of new inclusions to execUtils.c to > make it compile). I'm not sure of the cleanest place for it; maybe > someplace in backend/catalog/, though really there's nothing wrong with > keeping it in the rewriter. Oh, yeah -- I forgot to mention that. I wasn't really sure where that code should go, so I basically picked execUtils.c at random -- if you'd prefer that I move the code to somewhere in catalog/ that's fine, just let me know where. I don't have a strong preference myself. > I haven't tried to run the patch, but those were some things I noticed > in a quick eyeball review... Thanks Tom. However, after applying the changes noted above, the test case still fails (and I'm still scratching my head about what's causing the problem). A new version of the patch is attached. Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
Attachment
Neil Conway <nconway@klamath.dyndns.org> writes: > Can you elaborate on exactly how you'd like to see COPY's behavior > changed? What's the rationale for disallowing missing fields in > COPY input? Robustness. We've seen problem reports in the past when COPY got out of sync with the data stream because of loss or insertion of a newline (eg, due to faulty quoting logic in an application; this scenario is not as improbable as you might think). Rejecting lines that don't seem to have the right number of fields would go a long way towards detecting that sort of mistake. And, since the existing behavior with unexpected numbers of fields has never been documented, no one feels any strong compunction about changing it. This has been hashed out in several prior threads, and in fact I think Brent Verner has a nearly-ready-to-go patch implementing the agreed-on behavior. Now that I think about it, your patch is heading off in a quite different direction from what was agreed to: we were never intending that omitting trailing fields would be the cue for inserting default values. Without a way to specify a target column list, that's far too restrictive; and with it, it's unnecessary. regards, tom lane
On Mon, 27 May 2002 19:28:30 -0400 "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > This has been hashed out in several prior threads, and in fact I think > Brent Verner has a nearly-ready-to-go patch implementing the agreed-on > behavior. Now that I think about it, your patch is heading off in a > quite different direction from what was agreed to: we were never > intending that omitting trailing fields would be the cue for inserting > default values. Without a way to specify a target column list, that's > far too restrictive; and with it, it's unnecessary. Ok, sounds fair. I think my patch is an improvement over the current behavior (if you're going to silently insert arbitrary data for missing attributes, I'd say the default value is a better choice than NULL), but I would agree that this kind of behavior as a whole is not optimal. Assuming that the patch from Brent materializes, I'll be happy to take his solution over mine. Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
Neil Conway <nconway@klamath.dyndns.org> writes: > Thanks Tom. However, after applying the changes noted above, the test > case still fails (and I'm still scratching my head about what's > causing the problem). A new version of the patch is attached. I got curious, so I tried out this version of the patch. Some debugging demonstrated that the value being pointed to by the default expression's Const node was getting freed :-(, evidently by the loop at copy.c lines 1014-1018 (as patched). Said loop is really pretty bogus anyway, since it effectively assumes that every pass-by-ref datatype's input function always returns a freshly-palloc'd value. (This isn't the only place that assumes that, if memory serves; but over time I'd like to eliminate such assumptions.) In the presence of values generated by means other than datatype input functions, the assumption breaks down completely. What I would like to do about this is to eliminate the explicit pfree's entirely; they are both dangerous, as shown by this example, and inadequate to guarantee no memory leak. (For example, any storage leaked internally by a datatype conversion function cannot get reclaimed by this approach.) A much more bulletproof approach in both COPY IN and COPY OUT would be to run the per-tuple loop in a memory context that we reset at the top of the loop. Then any palloc'd leaked storage will be reclaimed at the end of the tuple cycle. COPY IN may as well use the per-tuple exprcontext's memory context, which we're already using and resetting. COPY OUT doesn't seem to need an exprcontext at the moment, but it might be best to use parallel coding anyway. Brent Verner hasn't submitted any version of his patch yet, but he did pop up a day or so ago to say that he still intended to do so. Maybe you should offer to help him whip it into shape ... regards, tom lane
[2002-05-27 19:28] Tom Lane said: | Neil Conway <nconway@klamath.dyndns.org> writes: | > Can you elaborate on exactly how you'd like to see COPY's behavior | > changed? What's the rationale for disallowing missing fields in | > COPY input? | | Robustness. We've seen problem reports in the past when COPY got out of | sync with the data stream because of loss or insertion of a newline (eg, | due to faulty quoting logic in an application; this scenario is not as | improbable as you might think). Rejecting lines that don't seem to have | the right number of fields would go a long way towards detecting that | sort of mistake. And, since the existing behavior with unexpected | numbers of fields has never been documented, no one feels any strong | compunction about changing it. | | This has been hashed out in several prior threads, and in fact I think | Brent Verner has a nearly-ready-to-go patch implementing the agreed-on | behavior. Now that I think about it, your patch is heading off in a | quite different direction from what was agreed to: we were never | intending that omitting trailing fields would be the cue for inserting | default values. Without a way to specify a target column list, that's | far too restrictive; and with it, it's unnecessary. attached is an updated patch (copy.colspec.20020527.diff), against HEAD, implementing the following COPY syntax COPY sleep (n,o,w) FROM stdin; where (n,o,w) may be a subset of columns in the "sleep" table. If a column is not specified, and has an attribute default, the default value will be evaluated and used. There is no support for COPY (<attlist>) TO yet, but that should not be too hairy. A second patch for pg_dump to support this syntax is attached, but I have not even tried to apply it, sorry. The pg_dump patch is against a circa 7.2-RELEASE cvs head. I attach it in case someone else wants to help wrap this up. I have been death-marching for 3-4 months and I'm not sure I see the top of the hill yet... How it does what it does. I modified gram.y to add an opt_column_list to CopyStmt. This List is sent down to DoCopy. If no attlist is present, I build an attlist from the relation's available attributes, enabling the old behavior. Once inside CopyFrom, the List attlist is scanned to build two arrays, one representing the actual index of the specified column, and another containing available default exprs. Attributes are read from the input stream and assigned to their appropriate location in the new tuple. If there are any defaults available, they are then evaluated and placed in the new tuple arrays. That's really it. I'm not _at all_ confident that there is not a better way to do this, but this is what I've done. The patch will elog(ERROR) in DoCopy, before Copy(To|From) is entered, if a column list is specified that cannot be handled (non-existent columns, duplicate columns). Should I be waiting until I'm _in_ Copy(To|From) to elog(ERROR) out? I'll try to add some commentary to that snakey code over the next week, and will resubmit an updated patch. Anyone questions regarding this patch should be sent 'To:' me and 'CC:' the list, as I haven't had the luxury of reading the good list lately :-(. comments/brain-death-fixes are welcome. cheers. b -- "Develop your talent, man, and leave the world something. Records are really gifts from people. To think that an artist would love you enough to share his music with anyone is a beautiful thing." -- Duane Allman