Question for Duncan Campbell re: Word-Spotting Capabilities
Duncan Campbell
duncan at gn.apc.org
Thu, 05 Aug 1999 01:33:07 +0100
The topic spotting methods that NSA is working on are based on n-gram
analysis, which in my crude way I understand to be based on a comparison of
n-dimensional matrices setting out the relative probability of any text
string of length n in two corpora of texts. One is the
surveillance data, which can be massive; the second is the seed corpus, a
chosen set of documents which are about the topic of interest.
In other words, you could show the computer the last six months of
uk-crypto, and then say, find me anybody else talking about this stuff in
the world's communications. The topic spotting system then rank-orders
the target communications by how closely their topics match the
uk-crypto corpus.
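To make the idea concrete, here is a minimal sketch of n-gram topic ranking in Python. It is not the patented NSA method, whose details I have not described here; it assumes character trigrams, relative-frequency profiles, and cosine similarity as the comparison, and the function names (ngram_profile, cosine, rank_by_topic) are purely illustrative.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Relative frequency of every character n-gram in text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(w * q.get(g, 0.0) for g, w in p.items())
    norm_p = sqrt(sum(w * w for w in p.values()))
    norm_q = sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def rank_by_topic(seed_docs, targets, n=3):
    """Return (score, text) pairs, best match to the seed corpus first."""
    # Assumption: the seed corpus is pooled into a single profile.
    seed = ngram_profile(" ".join(seed_docs), n)
    scored = [(cosine(seed, ngram_profile(t, n)), t) for t in targets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    seed = ["public key encryption and key escrow policy",
            "strong encryption export controls and key recovery"]
    traffic = ["the yankees lost the baseball game tonight",
               "new encryption key escrow proposals"]
    for score, text in rank_by_topic(seed, traffic):
        print(f"{score:.3f}  {text}")
```

Because the profiles are built from short overlapping substrings rather than whole words, a garbled word still contributes most of its n-grams to the score. In principle the same functions could be fed strings of phoneme symbols instead of letters, though that is speculation on my part.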
NSA has patented this method, and claims that it is completely language
independent (true, if each corpus is in the same language) and highly
effective despite high error rates (which seems very plausible). It is
this latter claim that makes me suspect that it may work when they apply it
to phoneme strings in the speech recognition problem. If you can do that,
you don't need to go through the actual transcription phase.
I find the method elegant, as it neatly sidesteps all the well-understood
problems of Boolean-based searches.
Duncan Campbell
> >>>Seems to me that topic spotting is a more useful goal anyway. Even if
> >>>you have 100% accuracy in word spotting, you will generate too many
> >>>false positive hits when that word appears out of context (eg "Boy did
> >>>the Yankees bomb tonight").
> >>
> >>
> >> Topic spotting where the transcription is so bad it introduced
> >> a 70-80% error rate in the words? Do tell how........
> >
> >Er, I didn't say it was possible, I said that I thought it was a more
> >useful goal to aim for than spotting isolated words.
>
>
>
> What I meant was, if you cannot spot the words on a noisy any-voice
> channel (to such an extent they are 70 or 80% wrong),
>
> how you gonna spot topics in the transcribed words?
>
>
>
>--
> ^-^-^-@@-^-;-^ http://www.xemu.demon.co.uk/
> (..)__u news:alt.smoking.mooses