HTML in emails
Ian Batten
ukcrypto at chiark.greenend.org.uk
Fri, 26 Sep 2008 15:30:53 +0100
On 26 Sep 08, at 1336, Tom Thomson wrote:
> I think Ian that you are perhaps unaware of the horrors perpetrated by
> some user agents when it is permitted to transmit email in html -
> it's not
> unusual to see a 20 line message (maybe 1.25 kbytes in plain text
> format)
> take up 12.5 kbytes in html.
Geek Hat: That's terrible. An extra 11 KBytes! How wasteful.
Realistic Hat: The volumes you're talking about are minuscule, and
what you're doing is looking for a vaguely technical sounding reason
for a personal preference. I'm fine with ``HTML annoys me'' and I'm
quite happy to go with the flow, but I'd rather people didn't dress it
up with a technical argument that simply doesn't add up in 2008. It's
like Goldacre on Brain Gym: having a stretch mid-lesson is great, just
don't claim it's neuroscience.
The average size of a message on ukcrypto over the past year is 5.8K
#
# pwd; ls -l | nawk '$NF~"[0-9]\.$" {t+=$5;c++}
> END {printf "total %d count %d avsize %.1f\n", t, c, t/c}'
/var/imap/partition1/user/igb/ukcrypto
total 15503375 count 2672 avsize 5802.2
#
The average size of messages that have text/html in them is
# ls -l `grep -l text/html *.` | awk '{t+=$5;c++}
> END {printf "total %d count %d avsize %.1f\n", t, c, t/c}'
total 892302 count 95 avsize 9392.7
#
so they are on average 9.4K as against 5.8K. Ah, someone says, but
that 5.8K includes the HTML. I'm there ahead of you, so let's look at
the average size of messages that don't include text/html:
# grep -l text/html *. > /tmp/html; ls *. | join -v1 - /tmp/html |
xargs ls -l | awk '{t+=$5;c++}
> END {printf "total %d count %d avsize %.1f\n", t, c, t/c}'
total 14622828 count 2578 avsize 5672.2
#
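A minimal sketch of the same split done in one pass, for anyone who wants to repeat the measurement on their own archive. It assumes Cyrus-style message files whose names end in a dot, as in the transcripts above; the function name html_split is hypothetical and was not part of the original session. It reads sizes with wc -c rather than parsing ls -l columns, so odd filenames do not upset it.

```shell
# Hypothetical helper: classify each message file in a directory by whether
# it mentions text/html, then report total, count and mean size per group.
html_split() {
    dir=$1
    for f in "$dir"/*.; do
        [ -f "$f" ] || continue
        # wc -c < file prints only the byte count
        size=$(wc -c < "$f")
        if grep -q 'text/html' "$f"; then
            echo "html $size"
        else
            echo "plain $size"
        fi
    done | awk '
        { total[$1] += $2; count[$1]++ }
        END {
            for (k in total)
                printf "%s total %d count %d avsize %.1f\n", k, total[k], count[k], total[k] / count[k]
        }' | sort
}
```

Run as: html_split /var/imap/partition1/user/igb/ukcrypto

It prints one line per group in the same "total ... count ... avsize ..." shape as the transcripts, prefixed with html or plain.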
So in fact, the ratio of the average size of the (95) messages that
reference text/html to the average size of those that don't is
# bc
scale=2
9392.7/5672.2
1.65
So less of the ``ten times'' or ``five times'' expansion: we're
talking about 65% (the reason it's not even the 100% minimum you'd
expect from multipart/alternative is because the messages are
dominated by headers, see below). And, more to the point, for a nine
month archive we're talking about an excess of:
95 * (9392.7 - 5672.2)
353447.5
So yes, I confess: the use of text/html is `wasting' perhaps 500
kilobytes per year on this mailing list. If anyone would like to
send me a stamped addressed envelope, I'll be happy to send them a
1.44MB floppy disk for the next three years' excess.
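The arithmetic above can be rechecked in one go. This is only a sketch: the function name html_excess is made up for illustration, the inputs are the figures from the transcripts, and the nine-month span is taken from the text as stated. Scaling to a full year is a simple proportion, which assumes the posting rate stays steady.

```shell
# Hypothetical helper.
# Arguments: mean html size, mean plain size, html message count, months covered.
html_excess() {
    awk -v h="$1" -v p="$2" -v n="$3" -v m="$4" 'BEGIN {
        excess = n * (h - p)              # extra bytes attributable to text/html
        printf "ratio %.2f\n", h / p
        printf "excess %.1f\n", excess
        printf "per year %.0f\n", excess * 12 / m
    }'
}

html_excess 9392.7 5672.2 95 9
```

This prints a ratio of 1.66 (bc at scale=2 truncates to 1.65, printf rounds), an excess of 353447.5 bytes, and 471263 bytes per year, which is where the "perhaps 500 kilobytes" comes from.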
On the other hand, if you want to know where your disk space is going,
consider mail headers:
# for i in *.; do sed '/^^M/q' $i; echo $i 1>&2; done | wc -c
[...]
10956857
# bc
10956857/2672
4100
So for an archive totalling 15.5MB, 10MB of it is headers, an average
of 4100 bytes per message. A quick glance reveals they're roughly
evenly distributed between sender->list processor, list processor,
list processor->recipient.
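A sketch of the header count as a pair of functions, under the same assumption about file naming. The names header_bytes and header_total are hypothetical. The ^M in the sed pattern is a literal carriage return, since the stored messages use CRLF line endings; the awk version below accepts either CRLF or bare LF, and forces the C locale so that length() counts bytes rather than characters.

```shell
# Hypothetical helper: bytes from the start of a message up to and including
# the first empty line, i.e. the header block plus its terminating blank line.
header_bytes() {
    LC_ALL=C awk '
        { n += length($0) + 1 }   # +1 for the newline awk strips from each record
        /^\r?$/ { exit }          # blank line, with or without a trailing CR
        END { print n + 0 }' "$1"
}

# Hypothetical helper: sum header_bytes over every message file in a directory.
header_total() {
    total=0
    for f in "$1"/*.; do
        [ -f "$f" ] || continue
        total=$((total + $(header_bytes "$f")))
    done
    echo "$total"
}
```

Run as: header_total /var/imap/partition1/user/igb/ukcrypto

Dividing the result by the message count gives the per-message header average.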
So if you're worried about the 350K of excess HTML data, you might
more profitably worry about the 10MB of headers you're storing.
ian