On Wednesday, October 22, 2008 11:31 AM +0200 Kai Schaetzl wrote:

> This is far away from reality. What makes you think that XHTML mail would
> be any better formed than HTML? I bet some makers of those many crap
> HTML web mailers will just rename the Doctype if a client asks them
> about XHTML compatibility. Or add that Doctype and DTD finally.


It's a question of standards, and of enforcing them going forward. We don't
have to live in a world where we accept sloppy (and malicious) input.

I don't really care if I don't get mail from a "crappy web mailer",
especially if it was written recently enough to think it knows anything
about XML. If it promises well-formed XML and then fails to deliver on
that promise, I don't see any reason to let it through, any more than I
should let through mail with fouled-up headers or improper SMTP
transactions.

I hate that I have to tolerate bad stuff from old and broken but
widely-deployed mail clients, but why should we have to tolerate it from
new senders?

> Have you already tested on the sheer use of XHTML? Maybe there's no
> spammer using it, so you could use it for whitelist scoring, at least
> for a while?


I started the thread precisely because I was poking through some uncaught
spam and found a bunch claiming to be XHTML, but with all kinds of parsing
errors, mostly mismatched quotes and tags. I'd like to know whether any
legitimate senders claim XML compliance while being just as broken in
their implementation, so that I can tell whether an actual test is worth
pursuing.

> XML parsing is slow. You probably gain little accuracy with a lot of
> performance hit.


How does a simple well-formedness test compare to virus scanning or SA's
regex scan? (I don't need the actual parse tree and I suspect it's not
necessary to check the DTD, so it should be possible to stream the test.)
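
To make the streaming idea concrete, here is a rough sketch of what I have
in mind. It is not SpamAssassin code and not a proposed plugin. It is
Python, using the standard library's expat binding, and the function name
is_well_formed and the chunk_size parameter are made up for illustration.
It feeds the body to the parser in chunks, builds no tree, and only
reports whether expat raised an error:

```python
import io
from xml.parsers import expat


def is_well_formed(stream, chunk_size=8192):
    """Return True if the bytes read from `stream` parse as well-formed XML.

    Hypothetical helper: no handlers are registered, so no parse tree is
    built, and the input is consumed chunk by chunk.
    """
    parser = expat.ParserCreate()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # isfinal=False: more data may follow this chunk.
            parser.Parse(chunk, False)
        # isfinal=True: tell expat the document is complete.
        parser.Parse(b"", True)
    except expat.ExpatError:
        # Raised on any well-formedness error (bad nesting, bad quoting, ...).
        return False
    return True


# Usage: a clean fragment versus one with a mismatched tag.
good = io.BytesIO(b"<html><body><p>hi</p></body></html>")
bad = io.BytesIO(b"<html><body><p>hi</body></html>")
print(is_well_formed(good), is_well_formed(bad))
```

I haven't measured how this compares in cost to a virus scan or the regex
pass. The only point is that the check needs no more memory than one chunk
plus the parser's own state.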