Entry 26462 (Berkeley CSUA MOTD)

Berkeley CSUA MOTD:Entry 26462

WIKI \| FAQ \| Tech FAQ
`http://csua.com/feed/`

2025/07/05 [General] UID:1000 Activity:popular

7/5

2002/11/7-8 [Computer/Theory] UID:26462 Activity:kinda low

11/7    Anyone have a good intro (i'm in math54 atm) to bayesian analysis
        that a college sophomore could understand? i read that the new
        spamassassin uses it (like ifile) but cant find a good intro text on
        it other than:
        http://www.paulgraham.com/spam.html
        (which was actually pretty good)
        \_ If you are talking about that stupid /. thing, it's really simple.
           It's just counting words.  If a document has lots of words deemed
           'spammy', it does not get through.  This is literally all there is
           to this 'bayesian' approach.  AI people call this the 'naive bayes'
           algorithm.
           \_ i dont read /. -op
        \_ does spamassassin actually use this already?  If so, how do I slap
           it around and tell it which is spam and which isn't?
           \_ the beta/experimental release (2.5) does.
        \_ Bayesian analysis fundamentals are simple.  It all drives off of
           Bayes Theorem (mathworld has one, but there are plenty of better
           ones, if you google for it: http://mathworld.wolfram.com  In a
           nutshell, given the observation of data D, the probability that a
           hypothesis is true is equal to the conditional probability that the
           data is true given the hypothesis * the PRIOR probability of the
           hypothesis being true.  In essence, it ideally redistributes your
           prior probability distribution based on the data observed.  Bayes
           Theorem:  P(H(i)|D) = [P(D|H(i))*P(H(i))]   (formatd was here)
                where                  ---------------------------
                D = observ. data       Sum(j=1->n,P(D|H(j))*P(Hj))
                H(i) = hypothesis i
                1<=i<=n                         -nivra
        \_ What's the relation between Bayes Theorem and Neural Networks?
           \_ No relationship at all.  Neural networks are function approximators
              that work using hillclimbing.  Bayes theorem is a relationship
              between conditional probabilities of two events (it's not actually
              a theorem, it just follows straight from the definitions).
                -- card-carrying bayesian
           \_ Bayes Theorem: Conditional probability.  I don't remember Neural
              Networks using Bayes Theorem.  However, Bayesian networks use
              Bayes theory.
              \_ I don't know much about Neural Networks.  Presumably, they use
                 different learning algorithms for deciding what information to
                 weight, and how to weight its nodes/connections.  Thus,
                 presumably, you could have a Bayesian learning algorithm that
                 decides how to re-weight its connections based on observed
                 data.  -nivra
                 \_ google for bayesian networks or bayesian belief networks.
                    Neural Networks typically have an activation function
                    at the nodes to weight things.  The Backpropagation
                    Algorithm is typically run.  I don't think I've heard
                    of a variation of Neural Networks algorithm which uses
                    Bayes.

2025/07/05 [General] UID:1000 Activity:popular

7/5

You may also be interested in these entries...

2004/7/28-29 [Computer/SW/Graphics, Computer/Theory] UID:32556 Activity:very high

7/28    Dear CS PhDs, can someone please enlighten me on the different
        conferences out there, what they're good for, how prestigious
        they are, etc?      -cs dumb
        \_
        I'll start:
        // prestigious:
	...

2001/11/8-9 [Computer/Theory] UID:22971 Activity:very high

11/7    I hate math 55. If CS is an extension of math 55, I'm gonna drop out.
        \_ if you don't like math 55 you are better off doing something else
           with your life.  You will not be happy as a programmer.
           \_ I don't think math55 is a good measure of what a career in cs
              is like.
              \_ career in CS or a career in programming?  they are not the
	...

2000/8/14-15 [Computer/Theory] UID:18977 Activity:high

8/13    Is there such a thing as an algorithm that can output another
        algorithm? Kind of like self tuning, self evolving algorithm?
        \_ Yes. Not as glamorous as it sounds; see also: genetic algorithms,
           Remez algorithms, data-directed programming
        \_ bison, yacc, (f)lex, and many many more. not self tuning though.
        \_ There are algorithmns that can learn from data.  It's really not
	...

2000/5/3-4 [Computer/Theory] UID:18162 Activity:very high

4/32    Stat 200a vs Stat 101 vs Stat 134 -- any comments? Experience with
        any would be appreciated too. (Assuming that purpose is applications
        in an arbitrary area of theoretical CS)
        \-sigh, i guess i will wade in: if you let me know what topics you
        \_ Can you describe the subject matter please?  I am too lazy to
           look it up.
	...

2000/1/18-19 [Computer/SW/Languages/Perl] UID:17262 Activity:moderate

1/17    Where can I find perl in Japanese? I'd also like to know where
        I can get hold of character recognition software for Japanese/
        Chinese characters? -fab@csua
        \_ You misunderstand the nature of perl.  There's nothing inherintly
           american or japanese or anything else about it.  Perl relies on the
           underlying OS for almost everything.  What are you trying to do that
	...

Cache (8192 bytes)

www.paulgraham.com/spam.htmlIf we can write software that recognizes their messages, there is no way they can get around that. If you hired someone to read your mail and discard the spam, they would have little trouble doing it. How much do we have to do, short of AI, to automate this process? I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives. The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam. You look at spams and you think, the gall of these guys to try sending me mail that begins "Dear Friend" or has a subject line that's all uppercase and ends in eight exclamation points. I can filter out that stuff with about one line of code. A few simple rules will take a big bite out of your incoming spam. I spent about six months writing software that looked for individual spam features before I tried the statistical approach. What I found was that recognizing that last few percent of spams got very hard, and that as I made the filters stricter I got more false positives. False positives are innocent emails that get mistakenly identified as spams. For most users, missing legitimate email is an order of magnitude worse than receiving spam, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient. The more spam a user gets, the less likely he'll be to notice one innocent mail sitting in his spam folder. And strangely enough, the better your spam filters get, the more dangerous false positives become, because when the filters are really good, users will be more likely to ignore everything they catch. I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. It discovered, of course, that terms like "virtumundo" and "teens" were good indicators of spam. But it also discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term. I start with one corpus of spam and one of nonspam mail. I scan the entire text, including headers and embedded html and javascript, of each message in each corpus. I currently consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator. I count the number of times each token (ignoring case, currently) occurs in each corpus. At this stage I end up with two large hash tables, one for each corpus, mapping tokens to number of occurrences. If you've never seen a word before, it is probably fairly innocent; There are examples of this algorithm being applied to actual emails in an appendix at the end. But in practice it would not matter much where I put this threshold, because few probabilities end up in the middle of the range. Over the past six months, I've read literally thousands of spams, and it is really kind of demoralizing. Norbert Wiener said if you compete with slaves you become a slave, and there is something similarly degrading about competing with spammers. To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible. But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam. Ideally, of course, the probabilities should be calculated individually for each user. I get a lot of email containing the word "Lisp", and (so far) no spam that does. So a word like that is effectively a kind of password for sending mail to me. In my earlier spam-filtering software, the user could set up a list of such words and mail containing them would automatically get past the filters. On my list I put words like "Lisp" and also my zipcode, so that (otherwise rather spammy-sounding) receipts from online orders would get through. I thought I was being very clever, but I found that the Bayesian filter did the same thing for me, and moreover discovered of a lot of words I hadn't thought of. When I said at the start that our filters let through less than 5 spams per 1000 with 0 false positives, I'm talking about filtering my mail based on a corpus of my mail. But these numbers are not misleading, because that is the approach I'm advocating: filter each user's mail based on the spam and nonspam mail he receives. Essentially, each user should have two delete buttons, ordinary delete and delete-as-spam. Anything deleted as spam goes into the spam corpus, and everything else goes into the nonspam corpus. You could start users with a seed filter, but ultimately each user should have his own per-word probabilities based on the actual mail he receives. This makes the filters more effective, lets each user decide their own precise definition of spam, and perhaps best of all makes it hard for spammers to tune mails to get through the filters. If a lot of the brain of the filter is in the individual databases, then merely tuning spams to get through the seed filters won't guarantee anything about how well they'll get through individual users' varying and much more trained filters. Content-based spam filtering is often combined with a whitelist, a list of senders whose mail can be accepted with no filtering. One easy way to build such a whitelist is to keep a list of every address the user has ever sent mail to. If a mail reader has a delete-as-spam button then you could also add the from address of every email the user has deleted as ordinary trash. I'm an advocate of whitelists, but more as a way to save computation than as a way to improve filtering. I used to think that whitelists would make filtering easier, because you'd only have to filter email from people you'd never heard from, and someone sending you mail for the first time is constrained by convention in what they can say to you. Someone you already know might send you an email talking about sex, but someone sending you mail for the first time would not be likely to. The problem is, people can have more than one email address, so a new from-address doesn't guarantee that the sender is writing to you for the first time. It is not unusual for an old friend (especially if he is a hacker) to suddenly send you an email with a new from-address, so you can't risk false positives by filtering mail from unknown addresses especially stringently. In a sense, though, my filters do themselves embody a kind of whitelist (and blacklist) because they are based on entire messages, including the headers. So to that extent they "know" the email addresses of trusted senders and even the routes by which mail gets from them to me. And they know the same about spam, including the server names, mailer versions, and protocols. But it doesn't mean much to be ...

Cache (839 bytes)

mathworld.wolfram.comNKS 2004 Conference to Be Held in Boston NKS 2004, the second conference and minicourse devoted to the concepts in Stephen Wolfram's A New Kind of Science (NKS), will take place this April 22-25 in Boston. Arbitrarily Long Progressions of Primes A new preprint by Green and Tao appears to establish the existence of prime arithmetic progressions of any length k, a result that has been conjectured for three quarters of a century but until now has defied proof. New NKS|Online, Functions, and MathWorld Websites Unveiled 2004 brings the rollout of NKS|Online, a major upgrade to The Wolfram Functions Site, and a new look for MathWorld to the internet. RSA-576 Factored A 174-digit number proposed as a challenge for current factorization algorithms has been factored into two 87-digit primes. References Visible links 1. Hidden links: 39.