1. The regexp was too loose and matched /in/, not just /\bin\b/.
2. chiark.peer.fu-berlin.de consists mostly of stopwords by this rule.
3. A bug meant that when it got to the end, it didn't stop, but always ate the TLD as if it were a stopword.
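
As a rough illustration of points 1 and 3, the two substitutions from the diff can be compared side by side. This is only a sketch: the helper names strip_loose and strip_strict and the sample hostnames are hypothetical, and the reading of point 3 (the old pattern running across the space-separated suffix) is an assumption drawn from the patterns themselves.

```perl
use strict;
use warnings;

# Old, loose pattern: strips the leading label if a stopword appears
# anywhere inside it, even as part of a longer word.
sub strip_loose {
    my ($s) = @_;
    $s =~ s/^[^.]*(?:chiark|greenend|news|nntp|peer|feed|in|out)[^.]*\.//;
    return $s;
}

# New, stricter pattern: the stopword must be delimited by word
# boundaries, and the leading run may not cross a space.
sub strip_strict {
    my ($s) = @_;
    $s =~ s/^[^. ]*\b(?:chiark|greenend|news|nntp|peer|feed|in|out)\b[^.]*\.//;
    return $s;
}

# Point 1: "berlin" contains "in" but is not the word "in".
print strip_loose("berlin.example.org"), "\n";    # example.org
print strip_strict("berlin.example.org"), "\n";   # berlin.example.org

# A genuine stopword label is still stripped by the new pattern.
print strip_strict("news-in.example.org"), "\n";  # example.org

# Point 3 (assumed reading): once stripped labels have been appended
# after a space, the old pattern can match across that space and eat
# what is left of the name; the new one cannot.
print "[", strip_loose("de chiark."), "]\n";      # []
print "[", strip_strict("de chiark."), "]\n";     # [de chiark.]
```

The [^. ] in the new pattern matters because the loop appends each stripped label to $sk after a space, so the leading run must stop at the first space as well as at the first dot.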
my $sk= $site;
for (;;) {
last unless $sk =~
- s/^[^.]*(?:chiark|greenend|news|nntp|peer|feed|in|out)[^.]*\.//;
+ s/^[^. ]*\b(?:chiark|greenend|news|nntp|peer|feed|in|out)\b[^.]*\.//;
$sk .= " $&";
}
foreach my $inout (keys %{ $news_sources{$site} }) {