Question for Duncan Campbell re: Word-Spotting Capabilities

Duncan Campbell duncan at gn.apc.org
Thu, 05 Aug 1999 01:33:07 +0100


The topic spotting methods that NSA is working on are based on n-gram 
analysis, which in my crude way I understand to be based on a comparison of 
n-dimensional matrices setting out the relative probability of any text 
string of length n in two corpora of texts.   One is the 
surveillance data, which can be massive; the second is the seed corpus, a 
chosen set of documents which are about the topic of interest.

In other words, you could show the computer the last six months of 
uk-crypto, and then say, find me anybody else talking about this stuff in 
the world's communications.   The topic spotting system then rank-orders 
the target communications by how closely their topics match the 
uk-crypto corpus.
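
To make the idea concrete, here is a minimal sketch in Python of ranking 
texts against a seed corpus by n-gram profile.   It is my own illustration 
and not the patented method: the function names, the choice of character 
trigrams, the cosine measure and the tiny example texts are all assumptions.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Relative frequency of every character n-gram in text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(w * q.get(gram, 0.0) for gram, w in p.items())
    norm = sqrt(sum(w * w for w in p.values())) * sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


def rank_by_topic(seed_docs, targets, n=3):
    """Return (score, text) pairs, closest match to the seed corpus first."""
    seed = ngram_profile(" ".join(seed_docs), n)
    scored = [(cosine(seed, ngram_profile(t, n)), t) for t in targets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical seed corpus (stands in for six months of uk-crypto).
seed_docs = [
    "key escrow and strong encryption policy in the uk",
    "public key encryption, key lengths and export controls",
]
# Hypothetical intercepted texts to be rank-ordered.
targets = [
    "the yankees lost again last night",
    "new encryption export controls and key escrow proposals",
]
ranking = rank_by_topic(seed_docs, targets)
for score, text in ranking:
    print(f"{score:.3f}  {text}")
```

Nothing here looks at words or grammar, only at which short strings occur 
and how often, which is why the approach does not care what language the 
texts are in so long as both corpora share it.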

NSA has patented this method, and claims that it is completely language 
independent (true, if each corpus is in the same language) and highly 
effective despite high error rates (which seems very plausible).   It is 
this latter claim that makes me suspect that it may work when they apply it 
to phoneme strings in the speech recognition problem.   If you can do that, 
you don't need to go through the actual transcription phase.
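
A toy illustration of why errors need not be fatal, again only a sketch 
under my own assumptions: ordinary characters stand in for phoneme symbols, 
and the error model (every fourth symbol replaced, so 25% wrong) is far 
cruder than real recognition errors, which also insert and delete symbols.

```python
from collections import Counter
from math import sqrt


def profile(text, n=3):
    """Character n-gram counts; a stand-in for n-grams over phoneme symbols."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(c * q[g] for g, c in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


def corrupt(text, every=4):
    """Replace every `every`-th symbol with '?' (a deliberately crude error model)."""
    return "".join("?" if i % every == every - 1 else ch for i, ch in enumerate(text))


# Hypothetical texts, invented for this example.
seed = "key escrow and strong encryption policy, encryption export controls and key lengths"
on_topic = "new encryption export controls and key escrow proposals"
off_topic = "the yankees lost again last night in new york"

seed_profile = profile(seed)
clean_on = cosine(seed_profile, profile(on_topic))
noisy_on = cosine(seed_profile, profile(corrupt(on_topic)))
noisy_off = cosine(seed_profile, profile(corrupt(off_topic)))

print(f"on-topic, clean:  {clean_on:.3f}")
print(f"on-topic, noisy:  {noisy_on:.3f}")
print(f"off-topic, noisy: {noisy_off:.3f}")
```

The noisy on-topic score drops well below the clean one, but it still sits 
above the off-topic score, so the ranking survives.   Whether that holds at 
70-80% error on real phoneme output is exactly the open question.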

I find the method elegant, as it neatly sidesteps all the well-understood 
problems of Boolean-based searches.

Duncan Campbell


> >>>Seems to me that topic spotting is a more useful goal anyway.  Even if
> >>>you have 100% accuracy in word spotting, you will generate too many
> >>>false positive hits when that word appears out of context (eg "Boy did
> >>>the Yankees bomb tonight").
> >>
> >>
> >> Topic spotting where the transcription is so bad it introduced
> >> a 70-80% error rate in the words?  Do tell how........
> >
> >Er, I didn't say it was possible, I said that I thought it was a more
> >useful goal to aim for than spotting isolated words.
>
>
>
>  What I meant was, if you cannot spot the words on a noisy any-voice
>  channel (to such an extent they are 70 or 80% wrong),
>
>  how you gonna spot topics in the transcribed words?
>
>
>
>--
>    ^-^-^-@@-^-;-^   http://www.xemu.demon.co.uk/
>         (..)__u     news:alt.smoking.mooses