2/9 What are some good OSS search engines that can parse an HTML page
and spit out the top X relevant keywords? TIA.
\_ lynx -dump "$URL" | sed '/^References$/,$d' | perl -ne\
'while(s/([a-z]+)//i){print lc($1),"\n";}' | sort | uniq -c | sort -rn
\_ Pretty cool, but I think the OP was thinking of something
more like Google.
\_ Yeah, I figured as much, but Google or anything remotely
of the sort relies on _multiple_ documents linking to each
other to establish relevance/importance/etc. If all you have
to work with is a single document with no context, there's
rather little you can do unless you want to get neck-deep
in natural-language issues (well, knee-deep if you hack up
something to figure out which words are "unusually" common
in this document compared to the language at large, but
any serious solution would require some amount of parsing
and language understanding). Hence the above silly hack,
which I meant largely as a joke. -alexf
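\_ For what it's worth, the "knee-deep" version (words that are
unusually common in this document compared to the language at
large) might look roughly like the sketch below. Assumptions, none
of which come from the thread: the function name unusual_words is
made up, the background word list is a file of "count word" lines
that you build yourself from a big pile of unrelated text, and the
add-one smoothing is just one simple way to keep unseen words from
dividing by zero. It scores each word by its share of the document
divided by its share of the background, nothing smarter.

```shell
# Sketch only: rank words by how over-represented they are in the
# document (stdin) relative to a background frequency file.
# $1 = background file, one "count word" pair per line (assumed format).
unusual_words() {
    bg=$1
    # Split stdin into one lowercase word per line, then score in awk.
    tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | awk -v bgfile="$bg" '
        BEGIN {
            # Load background counts and their total.
            while ((getline line < bgfile) > 0) {
                split(line, f, " ")
                bgcount[f[2]] = f[1]
                bgtotal += f[1]
            }
        }
        NF { doc[$1]++; total++ }      # count words in the document
        END {
            for (w in doc) {
                p_doc = doc[w] / total
                # Add-one smoothing so unseen words get a small share.
                p_bg = (bgcount[w] + 1) / (bgtotal + 1)
                printf "%.2f %s\n", p_doc / p_bg, w
            }
        }' | sort -rn
}
```

Usage would be something like
lynx -dump "$URL" | sed '/^References$/,$d' | unusual_words bg.txt | head
where bg.txt is the hypothetical background list. Common words like
"the" end up with scores near 1 or below, so they drop out of the top.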
\_ What if you can assume that the page authors aren't
trying to game the system with off-topic keywords, etc.?