Berkeley CSUA MOTD:Entry faq2
2024/11/22 [General] UID:1000 Activity:popular
How does it work?
\_ Using an automated process described below:
1) pure randomly timed fetch
2) parse raw motd entries, do a simple similarity comparison
(O(n) time where n is the size of the entire motd history),
append/update, date them, store them
3) categorization
4) HTML generation
X) (optional) check for duplicate entries using complicated
paragraph structure similarity algorithms.
It's similar to a compiler. Phases 1 and 2 are lexical analysis and
parsing. Phase 3 is optimization (analysis on an intermediate format).
Phase 4 is code generation. Ok, maybe it's not exactly compiler work,
but it's cool to think of it that way.
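In sketch form (this is not the actual doall.pl, just the shape of it),
the driver is little more than running the phases in order:

    #!/usr/bin/perl
    # A rough sketch of a phase driver; the real doall.pl differs.
    use strict;
    use warnings;

    my @phases = (
        'fetch.pl',           # step 1: grab the raw motd, O(1)
        'parseAndInsert.pl',  # step 2: parse entries, compare, insert, O(n)
        'categorize.pl',      # step 3: assign categories, O(n*d)
        'genhtml.pl',         # step 4: generate the HTML, O(n)
    );

    for my $phase (@phases) {
        system($^X, $phase) == 0
            or die "phase $phase failed: $?";
    }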
Is it fully automatic?
\_ Yes. All tasks are run from cron.
What is it written in?
\_ Look below.
How many lines of code?
\_
1375 4088 45270 common.pm
23 29 561 config.pm
850 4040 24748 Diff.pm <--- I didn't write this
1354 3573 47242 genhtml.pm
345 849 9437 search.pm // search, O(lg n)
114 201 3632 urlcheck.pm
7 29 249 backup.pl
544 1674 19036 categorize.pl // step 3, O(n*d)
39 86 1133 compareBin.pl
551 1571 18264 convertURL.pl // step 6, O(n), mostly network
56 173 1730 doall.pl
151 343 3470 fetch.pl // step 1, O(1)
81 246 2313 genhtml.pl // step 4, O(n)
26 69 564 keyword.pl
110 314 2993 mos.pl // step 5, O(n^2)
114 325 3477 motdRCSFetch.pl
466 1343 16116 parseAndInsert.pl // step 2, O(n)
131 317 3586 ../index.cgi
6337 19270 203821 total
How fast is it?
\_ See the annotated complexities above, where n is the number of motd
entries and d is the number of keywords. Note that "mos.pl" is
Measure Of Similarity and does a comprehensive comparison between
one entry and another. The comparison is *not* unix file
diff; it ignores whitespace and small wording differences. If an
entry is at least 85% similar to another entry, the later entry
is kept and the responses from both are merged.
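A rough sketch of the whitespace normalization and the 85% cutoff (the
real mos.pl does more than this; is_duplicate is just an illustrative
name):

    use strict;
    use warnings;

    # Collapse whitespace so two copies of the same entry that were merely
    # reflowed or re-indented still compare as identical.
    sub normalize {
        my ($text) = @_;
        $text = lc $text;
        $text =~ s/\s+/ /g;       # collapse runs of whitespace
        $text =~ s/^ | $//g;      # trim the ends
        return $text;
    }

    # 0.85 is the threshold mentioned above: at or above it, the later
    # entry's text is kept and the responses from both entries are merged.
    sub is_duplicate {
        my ($similarity) = @_;    # a ratio in [0,1], see the LCS sketch below
        return $similarity >= 0.85;
    }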
What is the algorithm you use for comparison?
\_ What else, the most basic Longest Common Subsequence algorithm :)
I'd like to use something similar to MOSS (measure of software similarity)
where I check for structural similarities instead of statistical and
lexical similarities as is currently done. If you're familiar
with it, email me.
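For the curious, here's the textbook word-level LCS in Perl. The ratio
below (LCS length over the longer entry's word count) is a
simplification of what mos.pl actually computes:

    use strict;
    use warnings;
    use List::Util qw(max);

    # Textbook dynamic-programming LCS over words, O(len_a * len_b) time.
    sub lcs_length {
        my ($a, $b) = @_;            # array refs of words
        my @dp = map { [ (0) x (@$b + 1) ] } 0 .. @$a;
        for my $i (1 .. @$a) {
            for my $j (1 .. @$b) {
                $dp[$i][$j] = $a->[$i-1] eq $b->[$j-1]
                    ? $dp[$i-1][$j-1] + 1
                    : max($dp[$i-1][$j], $dp[$i][$j-1]);
            }
        }
        return $dp[@$a][@$b];
    }

    # Turn the LCS length into a similarity ratio in [0,1].
    sub similarity {
        my ($text_a, $text_b) = @_;
        my @a = split ' ', lc $text_a;
        my @b = split ' ', lc $text_b;
        return 0 unless @a && @b;
        return lcs_length(\@a, \@b) / max(scalar @a, scalar @b);
    }

    print similarity("George Bush should use Linux",
                     "George W. Bush should use Linux"), "\n";  # roughly 0.83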
Do you use a database?
\_ Yes, MySQL.
You're from Berkeley, shouldn't you use PostgreSQL?
\_ Yeah, and I'm supposed to use BSD as well. I've used PostgreSQL a lot
before and I just don't need transactions and other fancy features.
Besides, the support for PostgreSQL sucks. Just go to a bookstore
and compare the number of MySQL books vs. PostgreSQL books. It's
like comparing the number of Oracle books vs. Informix/Sybase books,
the difference is astonishing (20 vs. 1).
What is the algorithm for categorization?
\_ It's quite ad hoc and complicated, and I'm not happy with it. I'd
like to rewrite it using Bayes and/or a hybrid of other techniques
like SVM, C4.5, and what not, but I don't have the expertise to
do so. If AI is in your field of study and you're good with Perl,
please email me! I'll add your name to the credits.
At any rate, automatic categorization is more of an art
than a science, so imperfection is expected. In short, my algorithm
is simple-- every path has a rule, and every path inherits its
parent's rules. The rules are applied to every single motd entry,
and each entry is assigned points for each rule it matches (there's
a small sketch of the idea after this answer).
Then it sorts those results and finds the most specific path.
If something happens to have too many categories, then the
algorithm finds the "least upper bound" path, which finds the
common parent. So it traverses hierarchies both up and down to try
to find a range of 1-3 categories.
Some entries will still have too many categories even after applying
the rules, and are put into Uncategorized/Multicategory. How would you
personally categorize an entry that contains the following: "George
Bush should use Linux, convert to Christianity, and ride bike."
If it's hard for humans to categorize that, then it's even harder
for computers.
Anyways, as a computer scientist, I'm always worried about the run
time. Categorization is still O(n). For 3000 entries, it takes this
long to categorize (unix time):
231.730u 1.530s 4:05.18 95.1% 0+0k 0+0io 1361pf+0w
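Here's a toy sketch of the rule/path idea (vastly simplified; the rule
set and the scoring in categorize.pl are much more involved):

    use strict;
    use warnings;

    # Toy version of the path/rule idea above: a rule is just a list of
    # keyword patterns attached to a path, and a path also gets every rule
    # of its ancestors, so "Computer/SW/OS" inherits the "Computer" rules.
    my %rules = (
        'Computer'       => [ qr/\blinux\b/i, qr/\bunix\b/i ],
        'Computer/SW/OS' => [ qr/\bkernel\b/i ],
        'Politics'       => [ qr/\bbush\b/i, qr/\brepublican/i ],
    );

    sub score_entry {
        my ($text) = @_;
        my %points;
        for my $path (keys %rules) {
            my @parts = split m{/}, $path;
            # walk up the hierarchy so a path inherits its parents' rules
            for my $depth (1 .. @parts) {
                my $ancestor = join '/', @parts[0 .. $depth - 1];
                next unless $rules{$ancestor};
                for my $pattern (@{ $rules{$ancestor} }) {
                    $points{$path}++ if $text =~ $pattern;
                }
            }
        }
        # highest scoring candidate paths first
        return sort { $points{$b} <=> $points{$a} }
               grep { $points{$_} } keys %points;
    }

    my @cats = score_entry(
        "George Bush should use Linux, convert to Christianity, and ride bike.");
    print "@cats\n";   # hits both the Politics and Computer trees -- the
                       # Uncategorized/Multicategory situation described above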
Is it adaptive?
\_ Not at this point, but the rules for categorization already work
pretty well, at about 80% accuracy. If you're familiar with
fast AND accurate adaptive algorithms, please email me. Two
heads are better than one.
How come categorization is inaccurate?
\_ It's not because the algorithm is dumb, but that human beings
are smart... way smarter than computers.
Categorization is very, very context sensitive and it's more of
an art than a science, so you can't simply grep for keywords like
"logic" (do you mean "logic design" or "bush logic"?) and you
can't simply grep for keywords like "proof" (do you mean
"p/np proof" or "bush has no proof"?).
To go a step further, how would a computer categorize the
following sentence: "How many Republicans does it take to screw
in a lightbulb?" Well, it's easy for human beings to recognize that
it is a prelude to a joke, but I know of no algorithm that can
reliably recognize this highly context sensitive sentence. If
you have such an algorithm, tell me and we'll work on it. It'll be a
great PhD thesis.
In summary, it's hard to make something accurate, AND FAST. If
you'd like to help, you can volunteer to be a Berkeley Motd superuser
and manually categorize and/or set rules. Email me if you're
interested.
Do you plan to use mod_perl to make CGI search faster?
\_ Most definitely. If you're technically capable of doing so,
email me.
How often do you poll the motd?
\_ Prior to 4/18/04, 4-8 times a day, on a purely random schedule.
From 4/18/04 to now, 10-12 times a day, on a purely random schedule.
So trying to time a highly controversial troll to be archived is
futile, because it'll most likely be deleted by other people before
it's polled. In addition, even if it's polled and archived, other
Berkeley Motd superusers (not me) will most likely delete it.
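The randomization itself is nothing fancy; something along these lines
is all it takes (a sketch, not the actual cron setup):

    #!/usr/bin/perl
    # Sketch: cron fires this several times a day, and a random sleep of
    # up to an hour decides when the actual fetch happens, so poll times
    # can't be predicted or gamed.
    use strict;
    use warnings;

    sleep int rand 3600;           # wait anywhere from 0 to 59 minutes
    system($^X, 'fetch.pl') == 0   # step 1: grab the current motd
        or die "fetch failed: $?";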
Why is your search faster than regular UNIX grep?
\_ No grep. It's all MySQL FULLTEXT preindexing, which is O(lg n).
kchang> time index.cgi q=kchang
[results deleted here]
0.120u 0.020s 0:00.17 82.3% 0+0k 0+0io 586pf+0w
If I do a naive unix grep, this is how long it would take:
kchang> time sh -c "find . -name '*.meta'|xargs grep kchang > /dev/null"
1.320u 0.410s 0:02.20 78.6% 0+0k 0+0io 1671pf+0w
kchang>
That's roughly a factor of 10 difference, and it's easy to see why. Unix grep
is O(n), does not normalize data, and has a tremendous I/O bottleneck.
MySQL FULLTEXT search, on the other hand, is cached in memory,
works on normalized data, and the lookup itself is only O(lg n).
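A stripped-down version of what the search does (table, column, and
login names here are made up for illustration):

    use strict;
    use warnings;
    use DBI;

    # Assumes a FULLTEXT index exists on the body column, e.g.
    #   ALTER TABLE entries ADD FULLTEXT (body);
    my $dbh = DBI->connect('DBI:mysql:database=motd', 'someuser', 'somepass',
                           { RaiseError => 1 });

    my $rows = $dbh->selectall_arrayref(
        'SELECT id, body FROM entries WHERE MATCH(body) AGAINST (?)',
        undef, 'kchang');

    printf "%d matching entries\n", scalar @$rows;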
Can I see your source code?
\_ If you can help me do the following much faster, I will give you
all the source. I need someone who's not only good at programming
but is strong algorithmically. The duplicate comparison, "mos.pl",
or Measure of Similarity, takes O(n^2) time. You can't just
hash entries with MD5 or the like, because a hash only catches
exact matches (deleting just one
character changes the entire hash). So to check for duplicates,
you compare one entry to all others, then another entry to all
others, and so on, and eventually you'll need to make
(summation of i where i=1 to n) comparisons, or (n*(n+1))/2 = O(n^2)
comparisons! (There's a sketch of that loop at the end of this
answer.) A month of motd (300-400 entries) takes about
1 minute to check for duplicates. 6 months of motd takes about
70 minutes. Projecting forward, we have the following:
year # entries sqr entries mos.pl min search.pl sec
1/12 400 160000 1 (tested) x
1/2 2500 6250000 70 (tested) 0.1
1 5000 25000000 218 0.2 (projected)
2 10000 100000000 873 0.4 (projected)
3 15000 225000000 1963 0.6 (projected)
4 20000 400000000 3490 0.8 (projected)
5 25000 625000000 5453 1.0 (projected)
"# entries" is the estimated number of entries by year 1 thru 5,
based on past history of August 2003 to March 2004, and assuming
the growth is constant, the number of entries will grow linearly
to 25000 by the 5th year. Because of the squared number of entries
needed for comparison, mos.pl will take 5453 minutes, or 3.7 days to
check for duplicates! Essentually the program will not be
able to keep up with the rate of growth.
Now compare mos.pl to search.pl which does normal user search.
The time required to search will grow from 200 ms by year 1 to
1 whole second by the 5th year, which is still quite acceptable.
Now do you see why it's critical to find an algorithm that does
this faster? Keep in mind these numbers are based on a P4 quad
Linux machine with 4 gigs of RAM.
We can find ad hoc solutions like minimizing n, better hashing,
and/or others, but it'll essentially still be O(n^2). Finding
better/faster/more accurate ways to check for similarity, at
least for this project, is a fun challenge. If you're up to the
job and would like to help, email me.
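To make the O(n^2) concrete, this is the shape of the duplicate-checking
loop (a sketch, not the real mos.pl):

    use strict;
    use warnings;

    # Every entry is compared against every later entry, so similarity()
    # gets called on the order of n^2/2 times. With the year-5 projection
    # of 25000 entries, that's roughly 3*10^8 comparisons, and similarity()
    # (the LCS-based measure sketched earlier) is itself expensive per call.
    sub find_duplicates {
        my ($entries, $similarity) = @_;    # array ref of texts, code ref
        my @dup_pairs;
        for my $i (0 .. $#$entries) {
            for my $j ($i + 1 .. $#$entries) {
                push @dup_pairs, [ $i, $j ]
                    if $similarity->($entries->[$i], $entries->[$j]) >= 0.85;
            }
        }
        return @dup_pairs;
    }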
I'm playing with your code right now. I find it annoying to have
"use strict;" in your code. Can I change that?
\_ NO! You will use strict or you don't check in the code at all!!!
What are other things I need to know before checking in the code?
\_ Please write your code the way you would write C code.
I don't want overly special Perl syntax that can really mess up
the aesthetics of the code. If you think you're cool and can write
super obfuscated Perl code, you'd better not check in the code.
Also please comment every routine and everything else. My code
has a lot of comments, and you should follow the conventions even
if you think they suck. "Closest to use" local var declarations, caps
for globals (avoid them), long intuitive variable names (use emacs
name completion if you're lazy), and consistent auto-indentation.
Speaking of indentation, only two whitespace characters are
allowed in the source code: newlines and spaces.
No DOS-style carriage returns, and especially no **TABS**.
NO TABS, nada, end of question. If you check in tabs, I will personally
give you a 2 hour lecture on why they're bad.
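If you want to keep yourself honest before checking in, a throwaway
helper like this will flag offending whitespace (not part of the
codebase, just a suggestion):

    #!/usr/bin/perl
    # Throwaway pre-checkin check: complain about tabs or carriage returns.
    use strict;
    use warnings;

    my $bad = 0;
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "can't open $file: $!";
        while (my $line = <$fh>) {
            if ($line =~ /\t/ or $line =~ /\r/) {
                print "$file line $.: tab or carriage return\n";
                $bad = 1;
            }
        }
        close $fh;
    }
    exit $bad;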
I can't really code well and I can't really help you out, but I'd
like to understand your code anyways. Can I see it?
\_ If you understand the problem with mos.pl and can help me out,
I'll give you everything.
However, if you don't understand it but just want to know the
system, I'll give you everything EXCEPT parseAndInsert.pl. The
reason is that by simply understanding parseAndInsert.pl, you
can make minimal changes and maximal damages to the database.
I'm not particularly keen on exposing all the loopholes through
which people can really pollute the database with minimal work.
How come some of your Perl code is messed up?
\_ I probably wrote some of it in the middle of the night (2-4AM).
Please fix it for me and CVS check in with your comments.
What version is this FAQ?
\_ $Author: $
$Date: 2004/07/29 01:54:14 $
$Id: info.faq2,v 1.4 2004/07/29 01:54:14 $
$State: Exp $