Instructing spiders/crawlers

David Biggins ukcrypto at chiark.greenend.org.uk
Thu, 10 May 2007 16:34:16 +0100


There are currently three standardised mechanisms for instructing
spiders and crawlers.

The ROBOTS.TXT file

http://www.robotstxt.org/wc/norobots.html

http://www.w3.org/TR/1998/REC-html40-19980424/appendix/notes.html#h-B.4.1

The "ROBOTS" meta tag

http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt

http://www.w3.org/TR/1998/REC-html40-19980424/appendix/notes.html#h-B.4.1
(yes, that's the same link as the ROBOTS.TXT one).

And the XML sitemap.

http://www.google.com/support/webmasters/bin/answer.py?answer=40318&ctx=sibling
http://www.sitemaps.org/protocol.php

Right now, it's increasingly advisable to use all three, though I expect
the sitemap will eventually substantially dominate because it is by far
the most powerful.
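
For anyone who wants to see the three side by side, here is a minimal
Python sketch, not a definitive implementation. The site name, the
paths, the "ExampleBot" agent and the helper names are all hypothetical
and chosen only for illustration; the robots.txt is checked with the
standard library's urllib.robotparser, and the sitemap is a bare
<urlset> in the style of the sitemaps.org protocol.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

SITE = "http://www.example.com"  # hypothetical site, for illustration only

# 1. robots.txt: site-wide rules, served from the root of the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

# 2. ROBOTS meta tag: per-page rules, placed in that page's <head>.
META_TAG = '<meta name="robots" content="noindex,nofollow">'


def build_sitemap(paths):
    """3. XML sitemap: return a <urlset> document listing the given paths."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for path in paths:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = SITE + path
    return ET.tostring(urlset, encoding="unicode")


def allowed(path, agent="ExampleBot"):
    """Ask whether a well-behaved crawler may fetch path under ROBOTS_TXT."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, SITE + path)


if __name__ == "__main__":
    print(allowed("/index.html"))   # not covered by any Disallow rule
    print(allowed("/tmp/scratch"))  # covered by "Disallow: /tmp/"
    print(build_sitemap(["/", "/index.html"]))
```

The first two mechanisms tell a crawler what to leave alone; the sitemap
works the other way round and tells it what to fetch.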

If anyone were to mistake obscurity for security and leave data in a
folder not linked by other pages but without other protection, it's
worth pointing out that adding the folder to the ROBOTS.TXT or the
sitemap is of course merely creating a signpost to it.
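
To make the signpost point concrete, here is a small Python sketch,
assuming a made-up robots.txt (the "/private-reports/" folder and the
function name are hypothetical). Anyone who can fetch the file can do
the same thing: the Disallow lines read as a list of places worth
looking at.

```python
def disallowed_paths(robots_txt):
    """Return every path named on a Disallow line, in order of appearance."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt, as any visitor could download it.
EXAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private-reports/   # unlinked, but now advertised
"""

if __name__ == "__main__":
    print(disallowed_paths(EXAMPLE))
```

A polite crawler skips those paths; nothing stops anyone else from
requesting them, so the folder still needs real access control.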

Dave.


