Source: csua.com/?entry=faq2
Phase 3 is optimization (analysis on an intermediate format). OK, maybe it's not exactly compiler work, but it's cool to think of it that way. The .pl script is the Measure Of Similarity; it does a comprehensive comparison of one entry against another. Comparison of entries is *not* Unix file diff; rather, it is independent of white space and works at the word level. If an entry is at least 85% similar to another entry, the later entry is taken and the responses to both are merged.

I've used PostgreSQL a lot before, and I just don't need transactions and other fancy features. Just go to a bookstore and compare the number of MySQL books vs. Informix/Sybase books; the difference is astonishing (20 vs. ...).

If AI is in your field of study and you're good with Perl, please email me! At any rate, automatic categorization is more of an art than a science, so imperfection is expected. In short, my algorithm is simple: every path has a rule, and every path inherits its parent's rules. The rules are applied to every single motd entry, and each matching path is assigned points. The algorithm then sorts those results and finds the most specific path. If an entry ends up with too many categories, the algorithm finds the "least upper bound" path, i.e. the common parent. So it traverses the hierarchy both up and down to try to end up with a range of 1-3 categories (a rough sketch of this rule scoring appears further below). Some entries will still have too many categories even after applying the rules, and those are put into Uncategorized/Multicategory. How would you personally categorize an entry that contains the following: "George Bush should use Linux, convert to Christianity, and ride bike." If it's hard for humans to categorize that, then it's even harder for computers.

Anyways, as a computer scientist, I'm always worried about the run time. If you're familiar with fast AND accurate adaptive algorithms, please email me. Well, it's easy for human beings to recognize that the sentence above is a prelude to a joke, but I know of no algorithm that could reliably recognize such a highly context-sensitive sentence. If you have such an algorithm, tell me and we'll work on it. In summary, it's hard to make something accurate AND FAST. If you'd like to help, you can volunteer to be a Kais Motd superuser and manually categorize entries and/or set rules.

From 4/18/04 to now: 10-12 times a day, on a purely random schedule. So trying to time a highly controversial troll to be archived is futile, because it'll most likely be deleted by other people before it's polled. In addition, even if it's polled and archived, other Kais Motd superusers (not me) will most likely delete it.
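To make the rule-scoring description above concrete, here is a minimal sketch in Perl. It assumes a hypothetical %rules table keyed by category path; the paths, patterns, point values, and the helper names rules_for() and categorize() are invented for illustration, and instead of doing the real "least upper bound" traversal it simply punts to Uncategorized/Multicategory when too many paths match.

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Rules are keyed by category path; a path also inherits every
  # ancestor's rules, as described above. The paths, patterns, and
  # point values below are made up for illustration only.
  my %rules = (
      'Computer'          => [ [ qr/\blinux\b/i, 2 ], [ qr/\bunix\b/i, 2 ] ],
      'Computer/SW/OS'    => [ [ qr/\bkernel\b/i, 3 ] ],
      'Politics'          => [ [ qr/\bbush\b/i, 2 ], [ qr/\belection\b/i, 2 ] ],
      'Politics/Domestic' => [ [ qr/\bcongress\b/i, 3 ] ],
  );

  # A path's effective rules: its own plus every ancestor's.
  sub rules_for {
      my ($path) = @_;
      my @parts = split m{/}, $path;
      my @collected;
      for my $i ( 0 .. $#parts ) {
          my $prefix = join '/', @parts[ 0 .. $i ];
          push @collected, @{ $rules{$prefix} || [] };
      }
      return @collected;
  }

  # How deep a path is, used to prefer the most specific category.
  sub depth {
      my ($path) = @_;
      my @parts = split m{/}, $path;
      return scalar @parts;
  }

  # Score an entry against every path, sort by points and then by
  # specificity, and return 1-3 winning paths. The real algorithm also
  # walks back up to a "least upper bound" parent when too many paths
  # tie; this sketch simply punts to Uncategorized/Multicategory.
  sub categorize {
      my ($entry) = @_;
      my %score;
      for my $path ( keys %rules ) {
          for my $rule ( rules_for($path) ) {
              my ( $pattern, $points ) = @$rule;
              while ( $entry =~ /$pattern/g ) {
                  $score{$path} = ( $score{$path} || 0 ) + $points;
              }
          }
      }
      my @hits = sort { $score{$b} <=> $score{$a} || depth($b) <=> depth($a) }
                 grep { $score{$_} } keys %score;
      return ('Uncategorized')               if !@hits;
      return ('Uncategorized/Multicategory') if @hits > 3;
      return @hits;
  }

  print join( ', ', categorize('George Bush should use Linux.') ), "\n";

Run against the "George Bush should use Linux" example, these toy rules light up four different paths with equal scores, so the sketch drops the entry into Uncategorized/Multicategory, which is exactly the kind of entry the answer above says is hard to place.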
Unix grep is O(n), does not normalize the data, and has a tremendous I/O bottleneck. MySQL FULLTEXT search, on the other hand, is cached in memory, works on normalized data, and a search is only O(lg n).

Can I see your source code?

I need someone who's not only good at programming but also strong algorithmically. You can't just "hash" entries with MD5 or the like, because a hash depends on the exact content (deleting just one character changes the entire hash), so near-duplicates don't hash to the same value. To check for duplicates you compare one entry to all others, then another entry to all others, and so on, so eventually you need to make (summation of i for i=0 to n) = n*(n+1)/2 = O(n^2) comparisons! A month of motd (300-400 entries) takes about 1 minute to check for duplicates. Here are the measured and projected numbers (a rough sketch of the word-level comparison and the quadratic pass appears at the end of this section):

  year  # entries  n^2        dup-check .pl (min)  search (sec)
  1/12  400        160000     1                    x     (tested)
  1/2   2500       6250000    70                   0.1   (tested)
  1     5000       25000000   218                  0.2   (projected)
  2     10000      100000000  873                  0.4   (projected)
  3     15000      225000000  1963                 0.6   (projected)
  4     20000      400000000  3490                 0.8   (projected)
  5     25000      625000000  5453                 1.0   (projected)

"# entries" is the estimated number of entries by years 1 through 5, based on the history from August 2003 to March 2004; assuming the growth is constant, the number of entries will grow linearly to 25000 by the 5th year. By then the duplicate-check script will take 5453 minutes, or about 3.7 days, to check for duplicates! Essentially, the program will not be able to keep up with the rate of growth. The time required to search will grow from 200 ms by year 1 to 1 whole second by the 5th year, which is still quite acceptable. Now do you see why it's critical to find an algorithm that does this faster? Keep in mind these numbers are based on a P4 quad Linux machine with 4 gigs of RAM. We can find ad hoc solutions like minimizing n, better hashing, and/or others, but it'll essentially still be O(n^2). Finding better/faster/more accurate ways to check for similarity, at least for this project, is a fun challenge. If you're up to the job and would like to help, email me.

What are other things I need to know before checking in the code?

I don't want overly special Perl syntax that can really mess up the aesthetics of the code. If you think you're cool and can write super obfuscated Perl code, you'd better not check in the code. My code has a lot of comments, and you should follow the conventions even if you think they suck: "closest to use" local variable declarations, caps for globals (avoid globals), long intuitive variable names (use emacs name completion if you're lazy), and consistent auto-indentation. Speaking of indentation, only a few whitespace characters are allowed in the source code: no stray line feeds, no DOS-style carriage returns, and especially no **TABS**. If you check in tabs, I will personally give you a 2-hour lecture on why they're bad.

I can't really code well and I can't really help you out, but I'd like to understand your code anyways.

With the .pl scripts, you can make minimal changes and do maximal damage to the database. I'm not particularly keen on exposing all the loopholes through which people could really pollute the database with minimal work.
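To make the duplicate-check discussion above concrete, here is a minimal sketch in Perl. It assumes a simple bag-of-words measure; the names similarity() and find_duplicates() and the sample entries are invented for illustration, and only the white-space-independent, word-level comparison and the 85% threshold come from the answers above.

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Word-based similarity: lowercase both entries, split them into words
  # (so white space and punctuation differences don't matter), and return
  # the fraction of words the two word bags share (0.0 .. 1.0).
  sub similarity {
      my ( $first, $second ) = @_;
      my ( %count_a, %count_b );
      $count_a{$_}++ for grep { length } split /\W+/, lc $first;
      $count_b{$_}++ for grep { length } split /\W+/, lc $second;

      my ( $shared, $total_a, $total_b ) = ( 0, 0, 0 );
      $total_a += $_ for values %count_a;
      $total_b += $_ for values %count_b;
      for my $word ( keys %count_a ) {
          next unless exists $count_b{$word};
          $shared += $count_a{$word} < $count_b{$word}
                   ? $count_a{$word}
                   : $count_b{$word};
      }
      my $larger = $total_a > $total_b ? $total_a : $total_b;
      return $larger ? $shared / $larger : 0;
  }

  # The brute-force pass: every entry against every later one, roughly
  # n^2/2 similarity() calls, which is the O(n^2) growth discussed above.
  sub find_duplicates {
      my (@entries) = @_;
      my @duplicate_pairs;
      for my $i ( 0 .. $#entries ) {
          for my $j ( $i + 1 .. $#entries ) {
              push @duplicate_pairs, [ $i, $j ]
                  if similarity( $entries[$i], $entries[$j] ) >= 0.85;
          }
      }
      return @duplicate_pairs;
  }

  my @sample = (
      'George Bush should use Linux, convert to Christianity, and ride bike.',
      'George  Bush should use Linux, convert to Christianity and ride bike!',
      'Completely unrelated motd entry about ice cream.',
  );
  printf "entries %d and %d look like duplicates\n", @$_
      for find_duplicates(@sample);

The nested loop is the point: even with a fast similarity() function, the number of pairs grows quadratically with the number of entries, which is exactly why the table above blows up from 1 minute to thousands of minutes.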