Thread: archive not showing all attachements

archive not showing all attachements

From
Andres Freund
Date:
Hi,

In a recent email Horiguchi-san attached an incremental patch as .txt,
presumably to avoid confusing cfbot. I just noticed that the archives don't
offer a download link:
https://www.postgresql.org/message-id/20220323.163833.1929400812667041796.horikyota.ntt%40gmail.com

but in my mail reader it's perfectly visible.

I don't know if it's caused by the file ending or Horiguchi-san's email client
using text/Plain rather than text/plain. As far as I understand mime types are
supposed to be case insensitive, so text/Plain is perfectly correct, albeit a
little unusual.

Greetings,

Andres Freund



Re: archive not showing all attachements

From
"Jonathan S. Katz"
Date:
On 3/24/22 3:24 PM, Andres Freund wrote:
> Hi,
> 
> In a recent email Horiguchi-san attached an incremental patch as .txt,
> presumably to avoid confusing cfbot. I just noticed that the archives don't
> offer a download link:
> https://www.postgresql.org/message-id/20220323.163833.1929400812667041796.horikyota.ntt%40gmail.com
> 
> but in my mail reader it's perfectly visible.
> 
> I don't know if it's caused by the file ending or Horiguchi-san's email client
> using text/Plain rather than text/plain. As far as I understand mime types are
> supposed to be case insensitive, so text/Plain is perfectly correct, albeit a
> little unusual.

I ran into a similar issue while using Mac mail, where attached patches 
wouldn't post to the archives. I switched to Thunderbird solely for 
posting to the mailing lists.

That said, that may be a thread worth pulling. It looks like we do 
case-sensitivity checking on the MIME type based on the 
"get_content_type" docs[1][2].

If you have the raw message, it may be good to manually run through the 
code and see what's happening in the the "recursive_get_attachments" to 
see root cause and potentially fix this.

(I don't have access to the server side directly, so I can't test it. I 
may be able to parse it from the mbox, which in that case, I'll try this 
out later).

Jonathan

[1] 
https://git.postgresql.org/gitweb/?p=pgarchives.git;a=blob;f=loader/lib/parser.py;hb=HEAD#l257
[2] 
https://docs.python.org/3/library/email.message.html#email.message.EmailMessage.get_content_type

Attachment

Re: archive not showing all attachements

From
Magnus Hagander
Date:
On Thu, Mar 24, 2022 at 9:18 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
>
> On 3/24/22 3:24 PM, Andres Freund wrote:
> > Hi,
> >
> > In a recent email Horiguchi-san attached an incremental patch as .txt,
> > presumably to avoid confusing cfbot. I just noticed that the archives don't
> > offer a download link:
> > https://www.postgresql.org/message-id/20220323.163833.1929400812667041796.horikyota.ntt%40gmail.com
> >
> > but in my mail reader it's perfectly visible.
> >
> > I don't know if it's caused by the file ending or Horiguchi-san's email client
> > using text/Plain rather than text/plain. As far as I understand mime types are
> > supposed to be case insensitive, so text/Plain is perfectly correct, albeit a
> > little unusual.
>
> I ran into a similar issue while using Mac mail, where attached patches
> wouldn't post to the archives. I switched to Thunderbird solely for
> posting to the mailing lists.
>
> That said, that may be a thread worth pulling. It looks like we do
> case-sensitivity checking on the MIME type based on the
> "get_content_type" docs[1][2].
>
> If you have the raw message, it may be good to manually run through the
> code and see what's happening in the the "recursive_get_attachments" to
> see root cause and potentially fix this.

There is a link to the raw message directly from that page in the
archives, remember? :)

That said, it's not as simple as a upper/lowercase. get_content_type()
returns 'text/plain' in lowercase for both parts so it takes care of
that.

The problem is that:
1. It's not an attachment. That is:
Content-Disposition: inline; filename="0001-doc_wip-diff.txt"
explicitly says it's not an attachment.

So the reason it's not listed as an attachment (per the subject of the
email) is because it *isn't* an attachment.


2. We only show the first plaintext part of messages

This is the actual problem. Our archives don't know how to merge
multiple plaintext parts when they are set to be viewed inline and not
as attachments.

That said, I notice that it's shown as an attachment in gmail as well,
even though it's not an attachment. I have no idea why :) MIME parsing
in general is black magic, but in this case it seems pretty clear that
showing it as an attachment is plain wrong. Not showing it at all,
however, is also wrong, but I'm not sure how much complexity it's
worth to handle this case, given it's the first time I've heard of the
problem.


//Magnus



Re: archive not showing all attachements

From
Andres Freund
Date:
Hi,

On 2022-03-26 18:41:39 +0100, Magnus Hagander wrote:
> That said, it's not as simple as a upper/lowercase. get_content_type()
> returns 'text/plain' in lowercase for both parts so it takes care of
> that.
> 
> The problem is that:
> 1. It's not an attachment. That is:
> Content-Disposition: inline; filename="0001-doc_wip-diff.txt"
> explicitly says it's not an attachment.

It's not actually that unreasonable for a doc patch ;)


> 2. We only show the first plaintext part of messages
> 
> This is the actual problem. Our archives don't know how to merge
> multiple plaintext parts when they are set to be viewed inline and not
> as attachments.
> 
> That said, I notice that it's shown as an attachment in gmail as well,
> even though it's not an attachment. I have no idea why :)

The archive code appears to actually try to handle the case of multiple inline
plain text attachements, from what I can see:
https://git.postgresql.org/gitweb/?p=pgarchives.git;a=blob;f=loader/lib/parser.py;hb=HEAD#l381
 381             # If we have already found one text/plain part, make all
 382             # further text/plain parts attachments
 383             if self.attachments_found_first_plaintext:


So there's something more than just the content-disposition: inline; going on
I think?


> MIME parsing in general is black magic, but in this case it seems pretty
> clear that showing it as an attachment is plain wrong. Not showing it at
> all, however, is also wrong, but I'm not sure how much complexity it's worth
> to handle this case, given it's the first time I've heard of the problem.

I'm not sure we'd really notice. I dimly recall encountering something like
this before and somebody else complaining about it.

Greetings,

Andres Freund



Re: archive not showing all attachements

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-03-26 18:41:39 +0100, Magnus Hagander wrote:
>> MIME parsing in general is black magic, but in this case it seems pretty
>> clear that showing it as an attachment is plain wrong. Not showing it at
>> all, however, is also wrong, but I'm not sure how much complexity it's worth
>> to handle this case, given it's the first time I've heard of the problem.

> I'm not sure we'd really notice. I dimly recall encountering something like
> this before and somebody else complaining about it.

I've definitely seen other cases where the archives fail to show a portion
of a message that I can see by other means.  Whether this particular
mechanism was the cause, I can't say of course.

I recall an example recently [ digs... ] ah, here:

https://www.postgresql.org/message-id/7f6fabaa-3f8f-49ab-89ca-59fbfe633105@me.com

There is a patch in there, but it looks like the boundary line
for the preceding section might be messed up?

            regards, tom lane



Re: archive not showing all attachements

From
Kyotaro Horiguchi
Date:
At Sat, 26 Mar 2022 18:41:39 +0100, Magnus Hagander <magnus@hagander.net> wrote in 
> The problem is that:
> 1. It's not an attachment. That is:
> Content-Disposition: inline; filename="0001-doc_wip-diff.txt"
> explicitly says it's not an attachment.

RFC2183 says that the header suggests how the multipart body is
displayed, not what it is.

https://www.ietf.org/rfc/rfc2183.txt

> 4.  Summary
> 
>    Content-Disposition takes one of two values, `inline' and
>    `attachment'.  `Inline' indicates that the entity should be
>    immediately displayed to the user, whereas `attachment' means that
>    the user should take additional action to view the entity.
> 
>    The `filename' parameter can be used to suggest a filename for
>    storing the bodypart, if the user wishes to store it in an external
>    file.

Doesn't the fact that the Contnet-Disposition has the "filename"
attribute suggests the part is stored as an external content apart
from the difference between inline and attachement?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive not showing all attachements

From
"David G. Johnston"
Date:
On Sat, Mar 26, 2022 at 10:41 AM Magnus Hagander <magnus@hagander.net> wrote:

The problem is that:
1. It's not an attachment. That is:
Content-Disposition: inline; filename="0001-doc_wip-diff.txt"
explicitly says it's not an attachment.

So the reason it's not listed as an attachment (per the subject of the
email) is because it *isn't* an attachment.


2. We only show the first plaintext part of messages

This is the actual problem. Our archives don't know how to merge
multiple plaintext parts when they are set to be viewed inline and not
as attachments.


It would be nice if our archive viewer conformed on the point that anything showing inline show up, in order, in the "body" section of the viewer.  We'd probably add some structural separators to distinguish the parts.

If that is not possible then every part except the first needs to be made accessible as an attachment regardless of the Content-Disposition suggestion.

There is an argument for making every individual part of the message individually downloadable, even inline ones, the first part included.  I think this would be nice, instead of the only way to save the message body locally being to copy-paste, or to download the raw message which contains more data than one probably wishes to download.  This would be my preference.  We'd name the inline parts which lack filenames something like "inline_n.ext" and decide the extension from the Content-Type.  This does bring up the question - what are we doing if an attachment doesn't also have a filename?

If that UI choice goes too far, and we solve the multiple inline problem, then we should at least provide a download option for any part that has a filename provided.  Inline parts with filenames would be shown on the webpage body and would also have a download option.  Frankly, for many patches (small-ish ones at least) this would be an ideal combination, but I doubt most people have fine enough control provided by their mail clients to make these decisions intentionally and on a per-message basis.

I haven't tried to reconcile this with the cf bot's needs.

I also haven't worked through any dynamics involving child message parts that are themselves multi-part messages.  That seems like a "let the user download it and figure it out themselves" thing instead of trying to make our viewer interactive to the extent necessary to actually handle such a message.  Having some stats on whether we even encounter this situation would help.

David J.

Re: archive not showing all attachements

From
"David G. Johnston"
Date:
On Sat, Mar 26, 2022 at 12:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I recall an example recently [ digs... ] ah, here:

https://www.postgresql.org/message-id/7f6fabaa-3f8f-49ab-89ca-59fbfe633105@me.com

There is a patch in there, but it looks like the boundary line
for the preceding section might be messed up?


IIUC...(I am by no means experienced here, and also haven't tried to examine other good ones this evening.  Just skimming RFC 1341 and trying to make sense of what is shown)

Root uses a boundary of [42], which is structured as multipart/alternative and has two mutually exclusive children:
Text Body (structured as text/plain)
HTML Body + Attachment (structured as a multipart/mixed [64]) - to make the html (structured pointlessly? as multipart/related [86] and an embedded text/html) and a attachment (structured as text/x-patch) independent

So a viewer displaying the text body alternative should not present the attachment.  But one presenting the HTML body would present the attachment.  Apparently the multipart/alternative section that makes text and html mutually exclusive should ideally be in a multipart/mixed container so that the shared attachment can be considered "mixed" with whichever one of them is chosen.

We basically present the plain text version on our archive viewer so the absence of the attachment is technically correct.  Gmail for me defaults to displaying html and so I too get the attachment download option.

Probably the archive should deal with the html branch in a better way than a complete ignore.  This does seem to indicate a need for at least somewhat better handling of nested multipart messages - my advice in the prior email to "let the user download and deal with it on their own" is probably insufficient.  The UI part of this, aside from agreement on what would be nice to do, seems straight-forward enough; I don't have much feel for the complexity of implementation though.  We probably need to at least pull out any explicit text attachments regardless of which branch we are showing. Whether to extend that to any attachment type is probably easiest but worth discussing.  That would fall under "bug fix" for me.  Converting the html body branch into a download option would an improvement worth considering along the lines of "everything is either displayed on the page or is an attachment".  Or, taken further, everything on the page (each part) can be individually downloaded as content only.

Done for tonight...will probably explore a "good" example in light of my understanding tomorrow.

David J.

Re: archive not showing all attachements

From
Magnus Hagander
Date:


On Sat, Mar 26, 2022 at 8:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
> On 2022-03-26 18:41:39 +0100, Magnus Hagander wrote:
>> MIME parsing in general is black magic, but in this case it seems pretty
>> clear that showing it as an attachment is plain wrong. Not showing it at
>> all, however, is also wrong, but I'm not sure how much complexity it's worth
>> to handle this case, given it's the first time I've heard of the problem.

> I'm not sure we'd really notice. I dimly recall encountering something like
> this before and somebody else complaining about it.

I've definitely seen other cases where the archives fail to show a portion
of a message that I can see by other means.  Whether this particular
mechanism was the cause, I can't say of course.

I recall an example recently [ digs... ] ah, here:

https://www.postgresql.org/message-id/7f6fabaa-3f8f-49ab-89ca-59fbfe633105@me.com

There is a patch in there, but it looks like the boundary line
for the preceding section might be messed up?

Oh, there's definitely emails it mis-parses, just not for this reason. In your example for example, the problem is the one where the attachment is only attached to the HTML part and not the text part (and not the email itself), and we don't show the HTML part. Which is a known, but different, case. 

--

Re: archive not showing all attachements

From
Magnus Hagander
Date:


On Sat, Mar 26, 2022 at 8:11 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2022-03-26 18:41:39 +0100, Magnus Hagander wrote:
> That said, it's not as simple as a upper/lowercase. get_content_type()
> returns 'text/plain' in lowercase for both parts so it takes care of
> that.
>
> The problem is that:
> 1. It's not an attachment. That is:
> Content-Disposition: inline; filename="0001-doc_wip-diff.txt"
> explicitly says it's not an attachment.

It's not actually that unreasonable for a doc patch ;)

Agreed. But it does explain why it's not shown *as an attachment*.

 
> 2. We only show the first plaintext part of messages
>
> This is the actual problem. Our archives don't know how to merge
> multiple plaintext parts when they are set to be viewed inline and not
> as attachments.
>
> That said, I notice that it's shown as an attachment in gmail as well,
> even though it's not an attachment. I have no idea why :)

The archive code appears to actually try to handle the case of multiple inline
plain text attachements, from what I can see:
https://git.postgresql.org/gitweb/?p=pgarchives.git;a=blob;f=loader/lib/parser.py;hb=HEAD#l381
 381             # If we have already found one text/plain part, make all
 382             # further text/plain parts attachments
 383             if self.attachments_found_first_plaintext:


So there's something more than just the content-disposition: inline; going on
I think?

Huh. I think you're right. I guess I was just reading it too quickly.
 
And yup, there's the bug.

Line 388 gets the payload in this case -- which returns in type "bytes".
Line 395 verifies that it's a string (which it isn't).

This is a bug in the conversion to python 3.

I've applied a fix and re-parsed the message.

Thanks for being persistent!

--