1. Adaptive Training/Self-Pruning Databases
If there were a mechanism for pruning the token databases, many more possibilities would open up. Why bother? Because pruning has the potential to make the program more robust, and it could also reduce false negatives. The databases would also be smaller, which is generally a good thing. The main advantage, though, is that the databases could track the message stream and continuously adapt to it in a fairly optimal way.
This has both advantages and disadvantages, as do all adaptive, self-learning systems. First, the advantages. Pilot error (a user training a message into the wrong class) is very hard to correct by subsequent training, and you often wind up starting over from scratch. Some users are more prone to this than others. If old messages were pruned from the databases, mistakes like this would be self-correcting. If the content of your message stream gradually changes over time (friends get crazier, spam mutates), the classifier's performance will gradually deteriorate. Adding new information to the database helps for a while, but as the old information in the database becomes a worse statistical estimate of the incoming message stream, performance will deteriorate. Rather than periodically starting over, a self-pruning, continuous learning database has the potential to track the current message stream.
Now for the disadvantages, which are mostly mathematical. Our training set selection problem belongs to a class of adaptive algorithms known in the field of digital signal processing as adaptive signal estimation, or more specifically, blind adaptive signal estimation. Since we don't know a priori what the future signal (message stream) will look like, we assume that the message stream statistics are stationary (don't change with time), or at least change only slowly. We then estimate the statistics using a recent sample of the message stream (a message corpus). We filter new messages (tokenize and assign scores) using the estimate, which yields a scalar result, and apply a decision rule to classify the message as ham, spam or unsure. The combined process of sampling, filtering and applying a decision rule (receiving a message, tokenizing, scoring and classifying) is known as an estimated classifier. Finally, we do further training to make the process adaptive, which we hope will improve the estimated classifier, or at least maintain its performance, over time.
Since the statistical properties of the message stream do slowly change over time (it is non-stationary), the further in the future we use this estimated classifier, the less correct we can expect it to be. Training tactics, such as TrainOnErrors, are attempts to correct the estimated classifier for statistical changes in the message stream so that it continues to classify future messages correctly. Those same training tactics also correct estimation errors caused by imperfect initial training set selection. The initial training set is only a sample of the current message stream, so the estimated classifier is generally imperfect and will not perform as well for subsequent messages. Because of this, we usually do additional training, so the system is already adaptive. How much additional training we need depends on 1) how closely the message corpus from which we selected the training set represents the statistical properties of the actual message stream, 2) how much the statistical properties of the actual message stream change over time and 3) our tolerance for errors and unsures.
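The filter-and-decide step above can be sketched in a few lines. SpamBayes does expose configurable ham and spam cutoffs, but the function name and the specific values here are illustrative, not its actual API:

```python
# Hypothetical sketch of the decision rule: the filtering step has
# already reduced a message to a scalar score in [0, 1], and we bin
# that score into ham, spam or unsure. Cutoff values are assumptions
# (SpamBayes makes the real ones user-configurable).

HAM_CUTOFF = 0.2
SPAM_CUTOFF = 0.9

def classify(score):
    """Apply the decision rule to a scalar message score in [0, 1]."""
    if score < HAM_CUTOFF:
        return "ham"
    if score > SPAM_CUTOFF:
        return "spam"
    return "unsure"
```

Adaptive training then amounts to feeding some of these classification outcomes back into the token databases, which is where the stability questions below come in.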
Another factor to consider is that even if the long-term average statistical properties of the message stream don't change over time (they are stationary), the messages that we receive on any given day are only a small sample of that stream. Since the sample is small, its statistical properties may differ from the long-term average enough to cause the estimated classifier to perform poorly on a different sample from the same message stream. For a stationary message stream and a fixed estimated classifier, the variance of the statistical properties of these daily samples sets an upper bound on the classification performance. If the estimated classifier is not fixed and can adapt, it could perform better as it will partially track the random walk of the statistical properties of the daily sample of the message stream. Note that the more agile we attempt to make the estimated classifier (the faster it adapts), the closer to instability we bring the entire adaptive system.
There are a few "gotchas" to a continuous learning adaptive approach:
- Any adaptive system can become unstable if its parameters are chosen poorly. So we won't do that. Easier said than done!
- All adaptive signal estimation algorithms have the potential to become unstable if the system input (in our case, a sample of the message stream) has a particular mathematical deficiency for a long enough period of time. The deficiency is known as non-persistent excitation and is not under our control. In terms of messages, this means that certain tokens will, for a time, disappear from the message stream, or at least appear much less frequently, due to statistical variation (each token score is a stochastic variable). This can cause unpredictable behavior in an adaptive system. Well-designed adaptive systems take this possibility into account and are somewhat resistant to it.
FWIW, having worked with adaptive algorithms for a long time, I'd guess that neither of these potential pitfalls will prove serious for our problem of classifying a slowly-changing future message stream with moderate sample variance into ham, spam and unsure.
One way to avoid all this is to simply start over and select a new training set whenever performance becomes poor. Though SpamBayes makes starting over easy, there are a lot of variables. The best initial training schemes are labor intensive, though they could be automated. Less optimal initial training schemes mean you get spam in your Inbox for a while. Some people don't mind starting over from scratch, some like the idea and some even do it daily. That's a perfectly reasonable solution for many, so this isn't necessarily a problem for everyone. For those of us who would prefer a more automated approach, a continuous learning system that prunes the databases introduces considerations that we don't currently have to worry about.
First, here are four tactics for pruning from a database the tokens associated with one or more trained messages, and two tactics for pruning individual tokens. We prune either periodically or when we want to train a new message into the database, depending on the tactic chosen.
- set a maximum message age (guarantees contemporaneous training sets, but a variable number of messages in each)
- set a maximum message count (guarantees fixed, equal-size training sets, but not contemporaneous ones)
- set a maximum message age plus minimum and maximum ham/spam ratios (guarantees that old messages expire and trained ham/spam ratios stay within fixed limits)
- set a maximum message age, minimum and maximum ham/spam ratios and a maximum message count (guarantees that old messages expire, trained ham/spam ratios stay within fixed limits and the number of trained messages is limited)
- set a minimum/maximum token count - TonyMeyer
- prune tokens that are 'weak' (0.4-0.6) and/or 'old'. Although tokens arrive together and it might be better to prune them together, you never know. If 'weak' tokens are pruned, it's probably better to ensure that they're common weak tokens - i.e. not hapaxes - TonyMeyer
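The fourth message-level tactic above (age, ratio limits and count combined) might look roughly like this. This is a sketch under assumptions: a trained message is represented as a (timestamp, classification) pair, and the function names are hypothetical, not part of SpamBayes:

```python
from datetime import datetime, timedelta

def ham_spam_ratio(msgs):
    """Ratio of ham to spam in a list of (timestamp, class) pairs."""
    spam = sum(1 for _, cls in msgs if cls == "spam")
    ham = len(msgs) - spam
    return ham / spam if spam else float("inf")

def select_for_untraining(messages, max_age_days, max_count,
                          min_ratio, max_ratio, now=None):
    """Return the messages to keep; everything else would be untrained."""
    now = now or datetime.now()
    # 1) expire messages older than the maximum age
    keep = [m for m in messages
            if now - m[0] <= timedelta(days=max_age_days)]
    keep.sort(key=lambda m: m[0])            # oldest first
    # 2) enforce the maximum message count, untraining oldest first
    while len(keep) > max_count:
        keep.pop(0)
    # 3) nudge the ham/spam ratio back inside [min_ratio, max_ratio]
    while ham_spam_ratio(keep) > max_ratio and any(c == "ham" for _, c in keep):
        keep.remove(next(m for m in keep if m[1] == "ham"))   # oldest ham
    while ham_spam_ratio(keep) < min_ratio and any(c == "spam" for _, c in keep):
        keep.remove(next(m for m in keep if m[1] == "spam"))  # oldest spam
    return keep
```

Note that steps 2 and 3 interact: untraining to fix the count can push the ratio out of bounds, which is one reason the combined tactic is listed separately from the simpler ones.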
Assuming that we implement one of these pruning tactics, here are some possible implementations:
- keep track of the last time each hapax token was seen in a scored message; expire hapaxes that have gone too long without appearing in the message stream (note Tony's warning above about expiring hapaxes)
- keep track of the last time each token was seen in a scored message; reduce the occurrence count of a token if it hasn't been seen for a long enough time; expire the token if the occurrence count reaches zero (this will tend to get rid of old hapaxes first; see Tony's warning above)
- if the preset maximum token count is exceeded, delete the oldest token (this doesn't discriminate between common and uncommon tokens)
- if the preset maximum token count is exceeded, delete the least recently used token (this will tend to prune old hapaxes, but not exclusively)
- extend the database to contain a list of tokens for each trained message, the message classification, the timestamp when trained and how many times the message was trained (for TrainToExhaustion); if any messages are older than the maximum message age, untrain them; if the maximum message count is exceeded, untrain the oldest trained message; if the ham/spam ratio is out of bounds, untrain additional messages as necessary
note: untraining whole messages, rather than tokens, preserves the underlying assumptions of the SpamBayes calculations (whether this really matters in practice is anyone's guess)
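The second implementation above (occurrence-count decay) can be sketched simply. The record layout and decay interval here are assumptions for illustration, not the actual SpamBayes database schema:

```python
import time

# Hypothetical sketch: each token carries an occurrence count and a
# last-seen timestamp. When a token goes unseen for too long, its count
# is decremented and its decay clock restarts; a token expires when its
# count reaches zero, so hapaxes (count of one) expire first.

DECAY_AFTER = 90 * 24 * 3600   # assumed: start decaying after 90 days unseen

def decay_tokens(db, now=None):
    """db maps token -> {'count': int, 'last_seen': epoch seconds}."""
    now = now if now is not None else time.time()
    for token in list(db):                     # copy keys: we may delete
        info = db[token]
        if now - info["last_seen"] > DECAY_AFTER:
            info["count"] -= 1
            info["last_seen"] = now            # restart the decay clock
            if info["count"] <= 0:
                del db[token]                  # token has expired
    return db
```

Because a hapax starts at a count of one, a single decay step removes it, which is exactly the "old hapaxes go first" behavior noted above.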
When it is time to train a new message into the databases, there are also choices as to how to add it:
- tokenize the message and add tokens to the appropriate database
- same as previous, but keep atypical messages and their tokens around longer
The latter choice is easier than it sounds, if we do it in a simple-minded way: add a parameter (in SpamBayes Manager or the .ini file) that says how much longer an atypical message stays in the database than a typical one. A typical message classifies perfectly (ham = 0, spam = 100). An atypical message does not classify perfectly: it can be close or totally wrong, but not perfect. Depending on how atypical (imperfectly classified) a message is, keep its tokens in the database longer, up to the maximum extra time specified in the parameter. To keep a message around one day longer, for example, set its timestamp one day in the future. The length of extra time a message's tokens stay in the database is determined by some formula (linear, quadratic, logarithmic?) and is between zero and the maximum extra time parameter.
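With the linear formula, the extra retention time is just the distance from a perfect score, scaled by the new parameter. The parameter name is hypothetical, and scores follow the 0-100 convention above (ham = 0, spam = 100):

```python
# Sketch of the linear version of the formula, under assumptions:
# scores run 0 (perfect ham) to 100 (perfect spam), and atypicality is
# the fractional distance from the perfect score for the message's
# class. MAX_EXTRA_DAYS stands in for the proposed .ini parameter.

MAX_EXTRA_DAYS = 30

def extra_retention_days(score, is_spam):
    """Extra days a trained message's tokens stay in the database."""
    atypicality = (100 - score) / 100 if is_spam else score / 100
    return atypicality * MAX_EXTRA_DAYS      # linear; could be quadratic etc.
```

A perfectly classified message gets no extension; a spam scored at 50 (maximally unsure) would get half the maximum, 15 extra days. A quadratic or logarithmic formula would simply replace the last line.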
As the training set changes, so does the contribution of each message. A message that was atypical when it was first trained may become rather typical later. This suggests that an optimal approach is to rescore all trained messages when we need to prune a message and select the most typical one. Since message-based pruning requires saving the list of tokens for each trained message, it's not as bad as tokenizing and scoring every message, but it's still a lot of overhead. It may turn out that TrainToExhaustion accomplishes the same thing.
Here's what we know so far. TrainOnErrorsAndUnsures, TrainOnAlmostEverything and TrainToExhaustion have all proven to be better tactics than TrainOnEverything or TrainOnErrors. Keeping spam/ham ratios within limits appears to help for most people, though some have reported good results with spam/ham ratios very different from unity. Very large training sets don't seem to perform better than moderate or small ones. The rest is conjecture.
Other ideas and comments are welcome.