Question for Duncan Campbell re: Word-Spotting Capabilities
Ian G Batten
I.G.Batten at ftel.co.uk
Thu, 5 Aug 1999 10:53:47 +0100 (BST)
Duncan writes:
> In other words, you could show the computer the last six months of
> uk-crypto, and then say, find me anybody else talking about this stuff in
> the world's communications. The topic spotting system then ranks orders
> the target communications as to how closely the topics match to the
> uk-crypto corpus.
I worked up some code of a similar nature to spot keywords you don't
even know are keywords in a corpus (I hang out with computational
lexicographers socially and do odds and ends of consulting, but this was
just out of interest). I computed frequency counts for the whole
corpus, then for sub-corpora, and looked for words that appeared
disproportionately often in a sub-corpus. My statistics aren't very
good, so the measures were crude, but taking the top twenty words that
appear in a given play by Shakespeare far more often than in all of
Shakespeare gives you interesting results. I meant to write it up, but
I never got around to it.
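
A minimal sketch of that kind of comparison, for illustration only. It
is not the code described above. The names tokenize and
disproportionate_words, the add-one smoothing, and the min_count cutoff
are all assumptions made for the example, and the score (log ratio of
relative frequencies) is just one crude measure among several possible.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and pull out runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def disproportionate_words(sub_text, whole_text, top_n=20, min_count=2):
    """Rank words by how over-represented they are in sub_text vs whole_text.

    The score is log(p_sub / p_whole), where each p is a relative
    frequency with add-one smoothing (an assumption of this sketch) so
    that a word missing from the whole corpus cannot divide by zero.
    """
    sub = Counter(tokenize(sub_text))
    whole = Counter(tokenize(whole_text))
    sub_total = sum(sub.values())
    whole_total = sum(whole.values())
    vocab = len(set(sub) | set(whole))

    scores = {}
    for word, count in sub.items():
        if count < min_count:
            continue  # ignore words too rare to say anything about
        p_sub = (count + 1) / (sub_total + vocab)
        p_whole = (whole[word] + 1) / (whole_total + vocab)
        scores[word] = math.log(p_sub / p_whole)

    # Highest score first: most over-represented in the sub-corpus.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Called with the text of one play as sub_text and the collected works as
whole_text, it returns up to top_n words ranked by that score. Words
that occur at about the same rate in both get a score near zero and
sink down the list.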
It would seem to me that, say, comparing a given flow of email with
_all_ email would be quite interesting, and would require no seeding.
You'd strip out all the grammatical words, all the phatic stuff, all the
general chit-chat (since those turn up at much the same rate everywhere)
and be left with a few key terms. That would be language independent.
If you knew the language, it would be better to lemmatise (reduce
plural to singular, reduce all tenses to present, etc).
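
As a rough illustration of the lemmatising step, here is a toy
normaliser that folds a few English inflections together before
counting. It is a crude suffix stripper, not a real lemmatiser. The
SUFFIXES table and the names crude_lemma and lemma_counts are
hypothetical, and it will mangle plenty of ordinary words.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny rule set for English; checked in order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_lemma(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemma_counts(text):
    """Count words after folding inflected forms onto a shared stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_lemma(w) for w in words)
```

With this, "keys" and "key" are counted together, as are "encrypted",
"encrypting" and "encrypts", so a term's weight is not split across its
inflected forms.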
ian