HTML in emails

Ian Batten ukcrypto at chiark.greenend.org.uk
Fri, 26 Sep 2008 15:30:53 +0100


On 26 Sep 08, at 1336, Tom Thomson wrote:

> I think Ian that you are perhaps unaware of the horrors perpetrated by
> some user agents when it is permitted to transmit email in html - it's
> not unusual to see a 20 line message (maybe 1.25 kbytes in plain text
> format) take up 12.5 kbytes in html.

Geek Hat: That's terrible.  An extra 11 KBytes!  How wasteful.

Realistic Hat: The volumes you're talking about are minuscule, and  
what you're doing is looking for a vaguely technical sounding reason  
for a personal preference.  I'm fine with ``HTML annoys me'' and I'm  
quite happy to go with the flow, but I'd rather people didn't dress it  
up with a technical argument that simply doesn't add up in 2008.  It's  
like Goldacre on Brain Gym: having a stretch mid-lesson is great, just  
don't claim it's neuroscience.

The average size of a message on ukcrypto over the past year is 5.8K

#
# pwd; ls -l | nawk '$NF~"[0-9]\.$" {t+=$5;c++}
 > END {printf "total %d count %d avsize %.1f\n", t, c, t/c}'
/var/imap/partition1/user/igb/ukcrypto
total 15503375 count 2672 avsize 5802.2
#



The average size of messages that have text/html in them is

# ls -l `grep -l text/html *.` | awk '{t+=$5;c++}
 > END {printf "total %d count %d avsize %.1f\n", t, c, t/c}'
total 892302 count 95 avsize 9392.7
#

so they are on average 9.4K as against 5.8K.  Ah, someone says, but  
that 5.8K includes the HTML.  I'm there ahead of you, so let's look at  
the average size of messages that don't include text/html:

# grep -l text/html *. > /tmp/html; ls *. | join -v1 - /tmp/html |  
xargs ls -l | awk '{t+=$5;c++}
 > END {printf "total %d count %d avsize %.1f\n", t, c, t/c}'
total 14622828 count 2578 avsize 5672.2
#
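
The two measurements above can also be taken in a single pass. What follows is only a sketch, not the commands actually run: the function name avg_by_html is made up for illustration, and it assumes one message per file with names ending in "." as in this spool. Like the grep above, it counts any message that mentions text/html anywhere, including in the body.

```shell
# Hypothetical helper (not part of the original transcript): report total,
# count and mean size for messages with and without the string text/html.
# Assumes one message per file, with file names ending in "." as in the spool.
avg_by_html() {
    dir=$1
    for f in "$dir"/*.; do
        [ -f "$f" ] || continue
        if grep -q 'text/html' "$f"; then kind=html; else kind=plain; fi
        size=$(wc -c < "$f")
        # awk's default field splitting absorbs any padding wc adds.
        printf '%s %s\n' "$kind" "$size"
    done | awk '
        { t[$1] += $2; c[$1]++ }
        END {
            n = split("html plain", kinds, " ")
            for (i = 1; i <= n; i++) {
                k = kinds[i]
                if (c[k])
                    printf "%s total %d count %d avsize %.1f\n", k, t[k], c[k], t[k] / c[k]
            }
        }'
}
```

Run as "avg_by_html /var/imap/partition1/user/igb/ukcrypto", it prints one line for each class in the same format as the awk output above. It is slower than the join pipeline because it starts grep and wc once per message.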

So in fact, the average size of the (95) messages that reference
text/html, as a multiple of the average size of those that don't, is

# bc
scale=2
9392.7/5672.2
1.65

So less of the ``ten times'' or ``five times'' expansion: we're  
talking about 65% (the reason it's not even the 100% minimum you'd  
expect from multipart/alternative is because the messages are  
dominated by headers, see below).  And, more to the point, for a nine  
month archive we're talking about an excess of:

95 * (9392.7 - 5672.2)
353447.5

So yes, I confess: the use of text/html is `wasting' perhaps 500  
kilobytes per year on this mailing list.  If anyone would like to  
send me a stamped addressed envelope, I'll be happy to send them a  
1.44MB floppy disk for the next three years' excess.
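
For anyone checking the arithmetic, here is a small sketch of how the yearly figure follows from the numbers above. The function name excess_bytes is invented for illustration, and the scaling assumes the archive really does cover nine months, as stated above.

```shell
# Hypothetical helper: excess bytes attributable to HTML, and the same
# figure scaled to twelve months.
# Arguments: html message count, html average size, plain average size,
# number of months the archive covers.
excess_bytes() {
    awk -v n="$1" -v h="$2" -v p="$3" -v m="$4" 'BEGIN {
        e = n * (h - p)                 # excess over the archive period
        printf "excess %.1f per_year %.0f\n", e, e * 12 / m
    }'
}

excess_bytes 95 9392.7 5672.2 9
# prints: excess 353447.5 per_year 471263
```

So "perhaps 500 kilobytes" is a generous rounding of roughly 470K a year, and three years of that is about 1.41MB, which is why it fits on the floppy.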

On the other hand, if you want to know where your disk space is going,  
consider mail headers:

# for i in *.; do sed '/^^M/q' $i; echo $i 1>&2; done | wc -c
[...]
10956857
# bc
10956857/2672
4100

So for an archive totalling 15.5MB, nearly 11MB of it is headers, an  
average of about 4100 bytes per message.   A quick glance reveals  
they're roughly evenly distributed between sender->list processor,  
list processor, list processor->recipient.
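
The header count above depends on a literal control-M typed into the sed expression, because the spool stores messages with CRLF line endings. A sketch of the same measurement that avoids the literal control character and also copes with LF-only files is below. The function name header_bytes is made up for illustration. Like the original, it counts the blank separator line as part of the headers.

```shell
# Hypothetical helper: total bytes of header across the given message files.
# The header block runs through the first empty line, which may or may not
# carry a trailing carriage return.
header_bytes() {
    cr=$(printf '\r')
    for f in "$@"; do
        # Print lines up to and including the first blank line, then quit.
        sed "/^$cr\{0,1\}\$/q" "$f"
    done | wc -c | tr -d ' '
}
```

Something like "header_bytes *." in the spool directory should give a figure comparable to the one above, and dividing by the message count gives the per-message average.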

So if you're worried about the 350K of excess HTML data, you might  
more profitably worry about the 11MB of headers you're storing.

ian