Question for Duncan Campbell re: Word-Spotting Capabilities

Ian G Batten I.G.Batten at ftel.co.uk
Thu, 5 Aug 1999 10:53:47 +0100 (BST)


Duncan writes:
> In other words, you could show the computer the last six months of
> uk-crypto, and then say, find me anybody else talking about this stuff in
> the world's communications.   The topic spotting system then ranks orders
> the target communications as to how closely the topics match to the
> uk-crypto corpus.

I worked up some code of a similar nature to spot keywords you don't
even know are keywords from a corpus (I hang out with computational
lexicographers socially and do odds and ends of consulting, but this was
just out of interest).  I computed frequency counts for the whole
corpus, then for sub-corpora, and looked for words that appeared
disproportionately often in a sub-corpus.  My statistics aren't very
good, so the measures were crude, but the top twenty words that appear
in a given play by Shakespeare far more often than in all of Shakespeare
give you interesting results.  I meant to write it up, but I never got
around to it.
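
The comparison described above can be sketched roughly as follows.  This
is not the original code, which was never written up; the function
names, the tokenising regex, the add-one smoothing and the min_count
cut-off are all assumptions made for illustration, and the score is just
a ratio of relative frequencies, about as crude as the measures
mentioned above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters (with an optional inner apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def disproportionate_words(sub_text, whole_text, top=20, min_count=2):
    """Rank words by how over-represented they are in sub_text relative to whole_text.

    Score = relative frequency in the sub-corpus divided by relative
    frequency in the whole corpus.  Add-one smoothing (an assumption)
    keeps words missing from the whole corpus from dividing by zero.
    """
    sub = Counter(tokenize(sub_text))
    whole = Counter(tokenize(whole_text))
    sub_total = sum(sub.values())
    whole_total = sum(whole.values())
    vocab = len(set(sub) | set(whole))

    scores = {}
    for word, count in sub.items():
        if count < min_count:
            continue  # ignore words too rare in the sub-corpus to judge
        p_sub = (count + 1) / (sub_total + vocab)
        p_whole = (whole[word] + 1) / (whole_total + vocab)
        scores[word] = p_sub / p_whole
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

Called with the text of one play as sub_text and the collected works as
whole_text, this returns up to twenty candidate keywords.  Very common
words such as "the" score close to 1 because they are about equally
frequent everywhere, which is why no stop list or seeding is needed.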

It would seem to me that, say, comparing a given flow of email with
_all_ email would be quite interesting, and require no seeding.  You'd
strip out all the grammatical words, all the phatic stuff, all the
general chit-chat and be left with a few key terms.  That would be
language independent.  If you knew the language, it would be better to
lemmatise (reduce plural to singular, reduce all tenses to present,
etc).
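
As a rough illustration of what lemmatising buys you, here is a
deliberately crude English-only sketch.  The exception table and the
suffix rules are toy assumptions, nowhere near a real lemmatiser (it
will mangle plenty of words), but it shows how variant forms collapse
into a single count before any frequency comparison is done.

```python
from collections import Counter

# Toy exception table (hypothetical and far from complete).
IRREGULAR = {
    "is": "be", "are": "be", "was": "be", "were": "be",
    "sent": "send", "went": "go",
}


def crude_lemma(word):
    """Reduce a word to a rough base form using a few suffix rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"      # policies -> policy
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]            # encrypted -> encrypt
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]            # keys -> key
    return word


def lemma_counts(words):
    """Count words after collapsing them to their crude base forms."""
    return Counter(crude_lemma(w) for w in words)
```

With this, "key", "keys" and "Keys" are counted as one term rather than
three, so a topic word is not diluted across its inflected forms.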

ian
