Question for Duncan Campbell re: Word-Spotting Capabilities
Brian Randell
Brian.Randell at newcastle.ac.uk
Thu, 5 Aug 1999 14:58:30 +0100
Hi Duncan:
>The topic spotting methods that NSA is working on are based on n-gram
>analysis, which in my crude way I understand to be based on a comparison of
>n-dimensional matrices setting out the relative probability of any text
>string of length n in two corpora of texts. One is the
>surveillance data, which can be massive, the second is the seed corpus, a
>chosen set of documents which are about the topic of interest.
>
>In other words, you could show the computer the last six months of
>uk-crypto, and then say, find me anybody else talking about this stuff in
>the world's communications. The topic spotting system then rank-orders
>the target communications by how closely their topics match the
>uk-crypto corpus.
>
>NSA has patented this method, and claims that it is completely language
>independent (true, if each corpus is in the same language) and highly
>effective despite high error rates (which seems very plausible). It is
>this latter claim that makes me suspect that it may work when they apply it
>to phoneme strings in the speech recognition problem. If you can do that,
>you don't need to go through the actual transcription phase.
>
>I find the method elegant, as it neatly sidesteps all the well-understood
>problems of Boolean-based searches.
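
As a rough illustration of the ranking step described in the quoted text, the
sketch below builds character n-gram frequency profiles for a seed corpus and
for each target message, then orders the targets by cosine similarity to the
seed profile. This is not the patented NSA method, whose details are not given
here; the function names, the choice of n=3, and the use of cosine similarity
are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def ngram_profile(text, n=3):
    """Relative frequency of each character n-gram in text (hypothetical helper)."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def rank_by_topic(seed_docs, targets, n=3):
    """Return (score, text) pairs, best match to the seed corpus first."""
    seed = ngram_profile(" ".join(seed_docs), n)
    scored = [(cosine(seed, ngram_profile(t, n)), t) for t in targets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

seed_docs = ["public key encryption and key escrow policy",
             "strong encryption export controls"]
targets = ["the football match ended in a draw",
           "encryption keys and escrow proposals"]
for score, text in rank_by_topic(seed_docs, targets):
    print(f"{score:.3f}  {text}")
```

Because the profiles are built from raw character strings rather than words,
nothing in the sketch depends on a particular language, and a few garbled
characters only disturb the handful of n-grams that overlap them.
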
I don't claim personal expertise in Information Retrieval, but I've heard a
number of presentations on IR research which give me the impression that
the idea of saying "find me lots of other documents which match the
statistical characteristics that are possessed by this selected base set of
documents and that differentiate this base set of documents from some much
more general set" is a very well-established approach in IR. (The selected
base set of documents might, for example, have been found using typical
Boolean search terms, with the irrelevant documents then weeded out.)
However, someone - anyone! - who is into IR research should be able to
confirm or deny the
accuracy of this impression of mine.
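
A minimal sketch of that idea, under stated assumptions: weight each term by
how much more likely it is in the base set than in the general set (a smoothed
log-ratio), then score a new document by the average weight of its terms. The
function names, the whitespace tokenisation, and the add-one smoothing are
illustrative choices, not taken from any particular IR system.

```python
from collections import Counter
from math import log

def term_counts(docs):
    """Count whitespace-separated, lower-cased terms across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def term_weights(base_docs, general_docs):
    """Log-ratio of smoothed term probabilities: base set versus general set."""
    base, general = term_counts(base_docs), term_counts(general_docs)
    vocab = set(base) | set(general)
    # Add-one smoothing so a term unseen in one set does not give log(0).
    n_base = sum(base.values()) + len(vocab)
    n_general = sum(general.values()) + len(vocab)
    return {t: log((base[t] + 1) / n_base) - log((general[t] + 1) / n_general)
            for t in vocab}

def score(doc, weights):
    """Average weight of the document's terms; unknown terms count as 0."""
    terms = doc.lower().split()
    return sum(weights.get(t, 0.0) for t in terms) / len(terms) if terms else 0.0

base_docs = ["encryption key escrow", "encryption export controls"]
general_docs = ["football match today", "weather report today",
                "encryption in the news"]
weights = term_weights(base_docs, general_docs)
for doc in ["encryption key policy", "football weather today"]:
    print(f"{score(doc, weights):+.3f}  {doc}")
```

Terms that are common in the base set but rare in the general set get positive
weights, so documents resembling the base set score above zero and unrelated
ones score below it.
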
Cheers
Brian Randell
Dept. of Computing Science, University of Newcastle, Newcastle upon Tyne,
NE1 7RU, UK
EMAIL = Brian.Randell@newcastle.ac.uk PHONE = +44 191 222 7923
FAX = +44 191 222 8232 URL = http://www.cs.ncl.ac.uk/~brian.randell/