1. Training Strategies
Unlike the core classifier's algorithms, different approaches to training the classifier have not been well tested. Different people use different approaches for a number of reasons, including ease of use, effectiveness and curiosity. There seem to be two main training strategies used by most people: train on everything and train on mistakes. The second subsumes a third strategy, train on mistakes and unsures, if you pretend that unsures are mistakes (which they aren't).
1.1. Some Training Aphorisms
I post these here with no proof that they are true, just that they work for me and seem to have been used by more of the developers than just me (SkipMontanaro). Oh, and Tim told me to do it.
Don't be afraid to retrain from scratch. The system learns quickly. Retraining from scratch is often the quickest way to recover from training mistakes.
Bigger is not always better, no matter what all those enlargement messages would have you believe. A larger database is harder to examine for mistakes, and a few mistakes skewed in the same direction may be hard to overcome with correct training. You'll also reach a point where you want to just delete all that spam. Once you do that, you've completely lost the ability to find mistakes. If you only have a few messages in your training database things will be easier to manage.
Never train on the same message twice. Using iterative reasoning it's easy to see that you should never train on the same message 100 or 1000 times either. [Though the Outlook plug-in will not let you train on a particular copy of a message more than once, it will let you train on any number of copies of a message, which is the same thing in the end. TimPeters]
Seek balance in your training database. Similar numbers of ham and spam are good.
Don't automatically train on all incoming messages. If you get swamped with spam, you will quickly wind up with a training database which is wildly out-of-balance.
Don't worry about training on every unsure message either. Some messages just aren't amenable to a strict classification. For example, a bounce message from a mail server containing an attached spam may be best left untrained. It contains both strong ham clues (all the postmaster gibberish which you would get in a bounce of an otherwise valid message) and strong spam clues (the spam message itself). Pretending that message is definitely ham or spam is likely to worsen the classification of future mail bounces or future similar spam.
1.2. Training Tactics
Skip has laid out some good precepts and what seems to work best based on a lot of collective experience, even if Tim did make him do it. Now let's list all the current possibilities for training and then discuss them in more detail. If nothing else, more people will understand what works, what doesn't and why. These are listed in order of the number of new messages that require training from smallest to largest.
TrainOnErrors (spam scores less than ham threshold or ham scores greater than spam threshold)
TrainOnUnsures (message scores less than spam threshold and greater than ham threshold)
TrainOnErrorsAndUnsures (spam scores less than spam threshold or ham scores greater than ham threshold)
TrainOnAlmostEverything (spam scores less than an obvious spam threshold or ham scores greater than an obvious ham threshold)
TrainOnEverything (yup, every single message)
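As an illustration only (this is not SpamBayes code, and the threshold parameter names are made up for the sketch), the strategies above can be written as a single decision function:

```python
def should_train(score, is_spam, strategy,
                 ham_cut=0.2, spam_cut=0.9,
                 obvious_ham=0.05, obvious_spam=0.99):
    """Would `strategy` train a message with this score and true label?"""
    if strategy == "TrainOnEverything":
        return True
    if strategy == "TrainOnAlmostEverything":
        # Train everything except the messages that score as obvious.
        if is_spam:
            return score < obvious_spam
        return score > obvious_ham
    # An error is a spam classified as ham, or a ham classified as spam.
    error = (is_spam and score < ham_cut) or (not is_spam and score > spam_cut)
    # An unsure falls between the two thresholds.
    unsure = ham_cut < score < spam_cut
    if strategy == "TrainOnErrors":
        return error
    if strategy == "TrainOnUnsures":
        return unsure
    if strategy == "TrainOnErrorsAndUnsures":
        return error or unsure
    raise ValueError("unknown strategy: %s" % strategy)
```

Note how TrainOnErrorsAndUnsures reduces to "spam scored below the spam threshold, or ham scored above the ham threshold", as described above.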
For all of the above, there is the optional but recommended step of:
periodically train additional ham or spam to keep the database message counts in balance (note: this is problematic with TrainOnEverything).
There is an ongoing debate about the effects of database size on performance. So let's arbitrarily say the database size philosophies are:
super-sized (someone did call this enlarged)
Assuming there is no mechanism for pruning the token databases (more on that later), here are some approaches for when and if to terminate training. For all of these, the initial training set size and the training tactic selected determine the quality of the result.
train until performance is acceptable
train until database message count reaches certain size
train until database spans certain time frame
For all of these, there is the additional option to periodically adjust the ham and spam thresholds. Finally, if performance degrades over time, start over.
This is turning into a great outline of the various training strategies. Is there anyone out there now willing to put code/time into pulling out some numbers? I believe that the timcv.py script does (effectively) the 'train on everything' strategy. There's also a script to test 'incremental' training, which could probably be used as the basis for testing some of these other ideas (the ones that work with the current database setup). It would be great if someone would look into that and write up a recipe for testing that others can follow.
1.3. Picking the Initial Training Set
Most of us just collect what we think is a reasonable number of manually classified ham and spam into two folders and train on them. Depending on how you choose these, your mileage will really vary a lot. Aside from the number of ham and spam to choose (equal numbers seem to work better, I'm told), which ones you choose is more critical than you might think. The goal is to get the token databases to each have good coverage of the tokens we expect to see in our incoming mail stream, with correct probabilities. In fact, the choice of the message corpus to train on actually defines the training tactic.
After picking a training tactic, which determines which messages can go in the corpus, and picking a corpus size (how many messages), there are still different ways to select a training set out of the message corpus you just constructed. The most obvious approach is to just train on every message in the corpus. Aside from taking a long time and creating giant databases, this is not necessarily optimal, for the reasons Skip gives in several sections of this Wiki. As with any problem of this type, a much smaller subset of the messages can give an equally good statistical representation of the message corpus, with all the advantages of smaller databases.
So how do we identify this message subset to train on? Skip created a terrific, but labor intensive algorithm, which he put out on the mailing list. For anyone who missed it, here is SkipsRecursiveTrainingSetSelectionAlgorithm, shamelessly repeated without his permission. This is an incredibly good algorithm. The key is that adding a particular message to the database affects the classification of many other messages in not always obvious ways. You train on a very small initial training set, use the result to classify a larger message corpus, train a few of the messages that classify the worst and repeat this until you can't see straight. You will have a terrific, yet small, training set. I did this with the Outlook Plug-In rather than Skip's scripts, but the result is the same. Though it's only been a few days, it is performing much better than the random method I previously used plus the database size is much smaller. It also had better classification performance on a corpus of 6,000 previous messages than I imagined possible: no errors and 7 unsures. For anyone using the Outlook plug-in, SkipsRecursiveTrainingSetSelectionForOutlook is an adaptation of Skip's algorithm for Outlook.
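The selection loop Skip describes can be sketched in a few lines. The ToyClassifier below is a made-up stand-in so the sketch is runnable; the real scoring and training are done by the SpamBayes scripts or the Outlook plug-in:

```python
from collections import Counter

class ToyClassifier:
    """Stand-in word-count scorer; NOT the real SpamBayes classifier."""
    def __init__(self):
        self.spam, self.ham = Counter(), Counter()

    def train(self, msg, is_spam):
        (self.spam if is_spam else self.ham).update(msg.split())

    def score(self, msg):
        s = sum(self.spam[t] for t in msg.split())
        h = sum(self.ham[t] for t in msg.split())
        return 0.5 if s + h == 0 else s / (s + h)

def build_training_set(corpus, classifier, seed, rounds=10, per_round=5):
    """corpus/seed: lists of (message, is_spam); returns the trained subset."""
    trained = list(seed)
    for msg, is_spam in trained:
        classifier.train(msg, is_spam)
    for _ in range(rounds):
        remaining = [m for m in corpus if m not in trained]
        if not remaining:
            break
        # "Worst" = spam scoring lowest, ham scoring highest.
        def badness(item):
            msg, is_spam = item
            score = classifier.score(msg)
            return 1.0 - score if is_spam else score
        remaining.sort(key=badness, reverse=True)
        # Train only a few of the worst classifiers' victims, then rescore.
        for msg, is_spam in remaining[:per_round]:
            classifier.train(msg, is_spam)
            trained.append((msg, is_spam))
    return trained
```

The key point the loop captures is that each round rescores the whole corpus before picking the next few messages, since every addition changes how everything else classifies.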
Update May 19, 2004: I retrained on January 8, 2004 using SkipsRecursiveTrainingSetSelectionForOutlook with a training set size of 640 ham + 640 spam out of a corpus of about 7,000 messages. I have added a few messages per day if they are either unsure, mistakes or just classify 'too close' to their respective thresholds. Performance has remained pretty consistent with the following overall stats: unsures = 3.9% (284), false positives = 0.0% (none), false negatives = 0.3% (13) based on 5,032 incoming messages. The training set currently has 858 ham and 1023 spam.
1.4. When and If To Terminate Training
It's unlikely that you will ever completely stop training. Over time, the spammers try different tricks, they promote new come-ons, and they move from ISP to ISP. Your training database probably needs to evolve to adapt.
1.5. Manual Pruning of Databases
Currently, if the database grows out of bounds, the recipe is to throw it away and replace it with a new one.
Q: I'd like to see a user interface for manual maintenance of the database: see statistics, see the tokens and their ratings, sort them, remove tokens with a low count (hapaxes).
While most training ideas focus on automatic database adaptation, my proposal is manual maintenance. Let me explain in a bit more detail:
Many spam mails nowadays come with arbitrary text whose sole purpose is to add noise to Bayesian filters. This leads to a mass of tokens with count 1. My database actually reached a size that made SpamBayes thrash on my rather old system.
Deleting tokens with spam-count 1 could be seen as simply reverting the effect of the random noise added to some spams -- it gives the same result as training on "clean" spam messages and keeps the database small and handy.
A: You can do this now if you want to. Use the sb_dbexpimp.py script to convert the database to CSV. Open the CSV file in (e.g.) Excel. Sort the columns. Remove or change the counts as you like. Save the file. Use sb_dbexpimp.py to convert the database back to whatever format it was in before.
However, this is not a good idea. SpamBayes knows that tokens are added or removed from the database in groups (i.e. messages). If you remove individual tokens, rather than complete messages, you may very well stuff up the calculations. --TonyMeyer
Q: Thanks for this tip. I tried sb_dbexpimp.py and it turned out well.
Converting the database to CSV, I realized that only the total number of trained ham/spam messages is saved; there is no hint of which tokens came from the same message.
My idea was to keep the strong hints but prune the weak ones. Manually pruning all tokens with count < 3 reduced my database from 120000 tokens to 600. The number of unsure mails increased for a while, but with very little retraining the filter works satisfactorily again with my limited resources. -- GuenterMilde
A: There has been quite a lot of discussion about database pruning, including pruning hapaxes, on the lists (firstname.lastname@example.org and email@example.com) - for a good start, see this google search. It would be easier to continue this discussion there (and you'd get the benefit of opinions of others who might not read this wiki) if that was ok with you. --TonyMeyer
OK -- GuenterMilde
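The CSV round-trip described above can be partly scripted. This is only a sketch: it assumes the exported file has one token per row as token,spamcount,hamcount, so check an actual sb_dbexpimp.py export before trusting the column layout, and keep Tony's caveat about removing individual tokens in mind:

```python
import csv

def prune_hapaxes(src, dst, min_count=2):
    """Copy CSV src to dst, dropping tokens seen fewer than min_count times.

    Assumed row layout: token,spamcount,hamcount (verify against your
    own sb_dbexpimp.py export first).
    """
    kept = dropped = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        for token, spamcount, hamcount in csv.reader(fin):
            if int(spamcount) + int(hamcount) >= min_count:
                writer.writerow([token, spamcount, hamcount])
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

After pruning, convert the CSV back with sb_dbexpimp.py as described in the answer above.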
1.6. Self-Pruning Databases
If you keep adding to your training set, it will probably grow larger than you want over time. In addition, if the characteristics of your message stream change over time, the older messages in the training set no longer are a good representation of the current message stream. This motivates schemes for expiring tokens and/or messages to keep the training set to a reasonable size and representative of the current message stream. This is somewhat controversial and several people are experimenting with it to see if real gains can be had and if it is broadly applicable.
For more info, see AdaptiveTraining.
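As a sketch of the idea (with hypothetical data structures, not the SpamBayes or AdaptiveTraining implementation), message-level expiry can be done by remembering what was trained, in order, and reversing the oldest training once a cap is exceeded:

```python
from collections import Counter, deque

class ExpiringTrainer:
    """Toy illustration of expiring old messages from a training set."""
    def __init__(self, max_messages=1000):
        self.counts = {True: Counter(), False: Counter()}  # spam / ham tokens
        self.history = deque()                             # training order
        self.max_messages = max_messages

    def train(self, tokens, is_spam):
        self.counts[is_spam].update(tokens)
        self.history.append((tokens, is_spam))
        while len(self.history) > self.max_messages:
            # "Untrain" the oldest message by reversing its token counts.
            old_tokens, old_is_spam = self.history.popleft()
            self.counts[old_is_spam].subtract(old_tokens)
            # Adding an empty Counter drops non-positive counts,
            # keeping the database small.
            self.counts[old_is_spam] += Counter()
```

This keeps the database bounded and biased toward the current message stream, at the cost of storing per-message token lists so training can be reversed.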
1.7. Cluster Algorithm
We all know the technique of "spelling variations" like via-gra to avoid easily catchable keywords. A clustering algorithm could be used on a "grown up" database to find clusters of related tokens. The tokens would be replaced by a cluster representative, plus a new function to compare message tokens to these clusters.
Ratings are then based on the distance of the probed token to the cluster token, times the ham/spam value of the cluster. Besides keeping the database smaller, this also catches new variations -- GuenterMilde
How would you define "related"? Tokens that appear together? Tokens that substitute for each other? (If the latter, how do you figure that out?) Tokens that have a small edit distance? --TonyMeyer
There are several possibilities for calculating a distance between two strings. In PHP (unfortunately, not yet in Python), I found some ready-made functions:
int levenshtein ( string str1, string str2) -- Calculate Levenshtein distance between two strings
int similar_text ( string first, string second [, float percent]) -- Calculate the similarity between two strings
This calculates the similarity between two strings as described in Oliver. Note that this implementation does not use a stack as in Oliver's pseudocode, but recursive calls, which may or may not speed up the whole process. Note also that the complexity of this algorithm is O(N**3), where N is the length of the longest string.
A more specialized function could take into account the "graphical distance" of substitutions (i.e. "l" <-> "|" is nearer than "l" <-> "g")
Something similar is available for a "phonetic distance" in PHP as
soundex ( string str) -- Calculates the soundex key of str
metaphone ( string str) -- Calculates the metaphone key of str
The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. The complexity of the algorithm is O(m*n), where n and m are the length of str1 and str2.
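Although there is no built-in levenshtein() in Python, the O(m*n) dynamic-programming definition above is short to write. This is a generic textbook implementation, not SpamBayes code (the standard library's difflib offers a different, ratio-based similarity):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and substitutions
    needed to transform string `a` into string `b`."""
    # prev[j] holds the distance from the current prefix of `a` to b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]
```

For the spelling-variation case above, levenshtein("via-gra", "viagra") is 1, so such variants cluster tightly around the original token.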
1.8. Incremental Training
Some comments to add to the SpamBayes documentation seem to be in order.
Incremental training options are turned on by default in the Manager. When an imbalance occurs in the trained messages, it may be useful to intervene using this feature.
One approach is to move new ham out of the Inbox and then select "Recover from Spam"; this can improve the balance of ham and spam.