Entry faq2 (Berkeley CSUA MOTD)

Berkeley CSUA MOTD:Entry faq2
WIKI \| FAQ \| Tech FAQ
`http://csua.com/feed/`
2025/06/30 [General] UID:1000 Activity:popular
6/30
How does it work?
\_ Using an automated process described below:
   1) pure randomly timed fetch 
   2) parse raw motd entries, do a simple similarity comparison
      (O(n) time where n is the size of the entire motd history),
      append/update, date them, store them
   3) categorization
   4) HTML generation
   X) (optional) check for duplicate entries using complicated
      paragraph structure similarity algorithms.

   It's similar to a compiler. Phase 1/2 is lexical analysis and
   parsing. Phase 3 is optimization (analysis on intermediate format. 
   Phase 4 is code generation. Ok maybe not exactly compiler work
   but it's cool to think of it that way.

Is it fully automatic?
\_ Yes. All tasks are croned.

What is it written in?
\_ Look below.
 
How many lines of code?
\_
   1375    4088   45270 common.pm
     23      29     561 config.pm
    850    4040   24748 Diff.pm          <--- I didn't write this
   1354    3573   47242 genhtml.pm
    345     849    9437 search.pm         // search, O(lg n)
    114     201    3632 urlcheck.pm
      7      29     249 backup.pl
    544    1674   19036 categorize.pl     // step 3, O(n*d)
     39      86    1133 compareBin.pl
    551    1571   18264 convertURL.pl     // step 6, O(n), mostly network
     56     173    1730 doall.pl
    151     343    3470 fetch.pl          // step 1, O(1)
     81     246    2313 genhtml.pl        // step 4, O(n)
     26      69     564 keyword.pl
    110     314    2993 mos.pl            // step 5, O(n^2)
    114     325    3477 motdRCSFetch.pl
    466    1343   16116 parseAndInsert.pl // step 2, O(n)
    131     317    3586 ../index.cgi
   6337   19270  203821 total

How fast is it?
\_ read above, the annotated complexity where n is the size of the
   motd entries and d is the size of the keywords. Note that "mos.pl" is
   Measure Of Similarity and does a comprehensive comparison between
   one entry to another. Comparison of entries is *not* unix file
   diff, rather it is independent of white spaces and words. If an
   entry is at least 85% similar to another entry, the later entry
   is taken while both the responses are merged.

What is the algorithm you use for comparison?
\_ What else, the most basic Longest Common Subsequence algorithm :) 
   I'd like to use something similar to MOSS (measure of software similarity) 
   where I check for structural similarities instead of statistical and
   lexical similarities as is currently done. If you're familiar
   with it, email me.

Do you use a database?
\_ Yes, MySQL.

You're from Berkeley, should'nt you use PostgreSQL?
\_ Yeah and I'm suppose to use BSD as well. I've used PostgreSQL a lot
   before and I just don't need transaction and other fancy features.
   Besides, the support for PostgreSQL sucks. Just go to a bookstore 
   and compare the number of MySQL books vs. PostgreSQL books. It's
   like comparing the number of Oracle books vs. Informix/Sybase books,
   the difference is astonishing (20 vs. 1).
   
What is the algorithm for categorization?
\_ It's quite adhoc and complicated, and I'm not happy with it. I'd
   like to rewrite it using Bayes and/or hybrid of other techniques
   like SVM, C4.5, and what not, but I don't have the expertise to 
   do so. If AI is in your field of study and you're good with Perl,
   please email me! I'll add your name to the credit.
     At any rate, automatic categorization process is more of an art 
   than science, so imperfection is expected. In short, my algorithm
   is simple-- every path has a rule, and every path inherits a 
   parents' rule. The rules are applied to every single motd entry, 
   and are assigned points.
     Then it sorts those results and finds the most specific path.
   If something happens to have too many categories, then the
   algorithm finds the "least upper bound" path, which finds the
   common parent. So it traverses hierarchies both up and down to try
   to find a range of 1-3 categories.
      Some entries will still have too many categories even after applying
   the rules, and are put into Uncategorized/Multicategory. How would you
   personally categorize an entry that contains the following: "George
   Bush should use Linux, convert to Christianity, and ride bike."
   If it's hard for humans to categorize that, then it's even harder 
   for computers.
     Anyways, as a computer scientist, I'm always worried about the run
   time. Categorization is still O(n). For 3000 entries, it takes this 
   long to categorize (unix time): 
        231.730u 1.530s 4:05.18 95.1%   0+0k 0+0io 1361pf+0w

Is it adaptive?
\_ Not at this point, but the rules for categorization already works
   pretty well, at about 80% accuracy. If you're familiar with
   fast AND accurate adaptive algorithms, please email me. Two
   heads are better than one.

How come categorization is inaccurate?
\_ It's not because the algorithm is dumb, but that human beings
   are smart... way smarter than computers. 
     Categorization is very very context sensitive and it's more of
   an art than science, so you can't simply grep for keywords like 
   "logic" (do you mean "logic design" or "bush logic"?) and you
   can't simply grep for keywords like "proof" (do you mean 
   "p/np proof" or "bush has no proof"?)
     To go a step further, how would a computer categorize the
   following sentence: "How many Republicans does it take to screw 
   a lightbulb?" Well, it's easy for human beings to recognize that 
   it is a prelude to a joke, but I know of no algorithm that could 
   recognize this highly context sensitive sentence, reliably. If
   you have such an algorthm, tell me, we'll work on it. It'll be a
   great PhD thesis.
     In summary, it's hard to make something accurate, AND FAST. If 
   you'd like to help, you can volunteer to be a Berkeley Motd superuser 
   and manually categorize and/or set rules. Email me if you're
   interested.

Do you plan to use Perl mod to make CGI search faster?
\_ Most definitely. If you're technically capable of doing so
   email me.

How often do you poll the motd?
\_ Prior to 4/18/04, 4-8 times a day, on a pure random schedule.
   From 4/18/04-now, 10-12 times a day, on a pure random schedule.
   So trying to time a highly controversial troll to be archived is
   futile because it'll most likely be deleted by other people before 
   it's polled. In addition, even if it's polled and archived, other
   Berkeley Motd superusers (not me) will most likely delete it.

Why is your search faster than regular UNIX grep?
\_ No grep. It's all in MySQL FULLTEXT preindexing, which is lg n.

   kchang> time index.cgi q=kchang
   [results here deleted here]
        0.120u 0.020s 0:00.17 82.3%   0+0k 0+0io 586pf+0w

   If I do a naive unix grep, this is how long it would take:
   kchang> time sh -c "find . -name '*.meta'|xargs grep kchang > /dev/null"
        1.320u 0.410s 0:02.20 78.6%     0+0k 0+0io 1671pf+0w
   kchang>

   This is a factor of 10 difference. It's easy to see why. Unix grep 
   is O(n), does not normalize data, and has a tremendous I/O bottleneck. 
   On the other hand, MySQL FULLTEXT search is cached in memory,
   with normalized data, and search is only lg n.

Can I see your source code?
\_ If you can help me do the following much faster, I will give you 
   all the source. I need someone who's not only good at programming 
   but is strong algorithmically. The duplication comparison, "mos.pl", 
   or Measure of Similarity, takes O(n^2) time. You can't just
   "hash" entries in MD5 or others because hashes are dependent 
   on a few keywords (for example, deleting just one 
   character changes the entire hash). So to check for duplicates,
   you compare 1 entry to all others, then another entry to all
   others, so on so forth and eventually you'll need to make 
   (summation of i where i=0 to n) comparisons or (n*(n+1))/2=O(n^2)
   comparisons! A month of motd (300-400 entries) takes about 
   1 minute to check for duplicates. 6 months of motd takes about 
   70 minutes. Projecting time, we have the following:

   year   # entries   sqr entries   mos.pl min    search.pl sec
   1/12   400         160000          1 (tested)    x
   1/2    2500        6250000         70 (tested)   0.1
   1      5000        25000000        218           0.2  (projected)
   2      10000       100000000       873           0.4  (projected)
   3      15000       225000000       1963          0.6  (projected)
   4      20000       400000000       3490          0.8  (projected)
   5      25000       625000000       5453          1.0  (projected)

   "# entries" is the estimated number of entries by year 1 thru 5,
   based on past history of August 2003 to March 2004, and assuming
   the growth is constant, the number of entries will grow linearly 
   to 25000 by the 5th year. Because of the squared number of entries 
   needed for comparison, mos.pl will take 5453 minutes, or 3.7 days to
   check for duplicates! Essentually the program will not be
   able to keep up with the rate of growth.
     Now compare mos.pl to search.pl which does normal user search.
   The time required to search will grow from 200 ms by year 1 to 
   1 whole second by the 5th year, which is still quite acceptable.
     Now do you see why it's critical to find an algorithm that does
   this faster? Keep in mind these numbers are based on a P4 quad 
   Linux machine with 4 gigs of RAM.
     We can find adhoc solutions like minimizing n, better hashing,
   and/or others, but it'll essentially still be O(n^2). Finding 
   better/faster/more accurate ways to check for similarity, at
   least for this project, is a fun challenge. If you're up to the 
   job and would like to help, email me.

I'm playing with your code right now. I find it annoying to have
"use strict;" in your code. Can I change that?
\_ NO! You will use strict or you don't check in the code at all!!!

What are other things I need to know before checking in the code?
\_   Please write your code like the way you write C code.
   I don't want overly special Perl syntax that can really mess up
   the aesthetics of the code. If you think you're cool and can write
   super obfiscated Perl code, you better not check in the code.
     Also please comment every routine and everything else. My code
   has a lot of comments, and you should follow the conventions even
   if you think it sucks. "Closest to use" local var declarations, caps 
   for global (avoid this), long intuitive variable names (use emacs 
   for name completion if you're lazy), and consistent auto indentations.
     Speaking of indentations, there are only a few white spaces
   allowed in the source code. Returns and spaces only. 
   No line feed, DOS style carriage returns, and especially **TABS**.
   NO TAB, nada, end of question. If you check in tabs, I will personally 
   give you a 2 hour lecture on why they're bad.

I can't really code well and I can't really help you out, but I'd
like to understand your code anyways. Can I see it?
\_ If you understand the problem with mos.pl and can help me out, 
   I'll give you everything. 
     However, if you don't understand it but just want to know the
   system, I'll give you everything EXCEPT parseAndInsert.pl. The 
   reason is that by simply understanding parseAndInsert.pl, you
   can make minimal changes and maximal damages to the database.
   I'm not particularly keen to exposing all the loopholes in 
   which people can really pollute the database with minimal work.

How come some of your Perl code is messed up?
\_ I probably wrote some of it in the middle of the night (2-4AM). 
   Please fix it for me and CVS check in with your comments.

What version is this FAQ?
\_ $Author: $
   $Date: 2004/07/29 01:54:14 $
   $Id: info.faq2,v 1.4 2004/07/29 01:54:14 $
   $State: Exp $