UserPreferences

MultiWayClassification


1. MultiWayClassification

Having grown to appreciate automatic spam filtering, first via SpamAssassin and now via SpamBayes, I want the same abilities for email classifiction. This page records what I know about n-way classification, using SpamBayes and other Bayesian tools, and raises some issues.

1.1. Using SpamBayes for MultiWayClassification

Most n-way work assumes n-exclusive-classifications, e.g. for sorting email intoo folders. I am interested in non-excllusive tagging, where multiple tags can apply to same message.

Q: in what ways is SpamBayes strictly oriented towards spam? Probably in certain of the tokenizations... Would it harm MultiWayClassification to use the same mechanism?

1.1.1. More SpamBayes community discussion about MultiWayClassification

As already mentioned,

The Spambayes mailing lists contain several such discussions, such as

1.1.2. Binary vs. MultiWayClassification

SpamBayes returns a ternary result for a binary scale. I.e. we want spam/ham, or spam/non-spam. SpamBayes delivers spam/unsure/non-spam.

With non-exclusive MultiWayClassification, how should this be represented?

This raises isssues both with user interface and with the statistical model.

It appears to me that SpamBayes' chief advantage over othe Bayesian spam filters is that it scores "spam" and "ham" separately and independently. A message can have a high spam score and a low ham score, vice versa, or it could have both a high spam score and a high ham score. Or both low.

Q: is the SpamBayes scoring specifically binary spam/ham? I.e. does the chi-square test answer a specifically binary question? Or is it comparing two independent scores, ham and spam? Or could the Spambayes database be extended to handle multiple tags?

If the latter, it may be possible to persuade the Spambayes code to answer questions like

I.e. I am trying to ask if MultiWayClassification with SpamBayes really needs a database with tagN/non-tagN scores, one such pair for each tag? I.e.

or

I ask this question in part because I do not see the value of tag1/unsure/non-tag1 in a nonexlusive multiway classification.

But if I am just doing non-exclusive tagging, I think the user interface would just suggest keywords that exceeded a threshold - a 1-way test for each. I am having trouble imagining what the user interface would look like for "unsure", or other --- perhaps you would sort the keywords by score.

1.2. Original

I originally posted this on FrontPage, but then decided to refactor into a separate discussion. Here's the original duscussion:

Is this suppposed to be a WishList? If so, then... I wish there was a way to use Spambayes tp do N-ways sorting, not just spam/ham, but spam/ham/proprietary/personal/public/... I vaguely remember seeing this before, but haven't had much luck finding it now I need it. -- AndyGlew

No, this isn't really a good place to put feature requests. Suggestions about improvements for the classification system are ok, but if you have a feature request then it would be much better to put it on Sourceforge. SpamBayes is not designed to be an n-way classifier. If you want to try it anyway, look at the n-way.py script in the contrib directory of the source release. If you want n-way classified mail, then I recommend POPfile. --TonyMeyer

Fair enough... I expect you'll delete this soon, although it might be appropropriate to refactor.wiki into a separate page on n-way - maybe MultiWayClassification. My old spambayes installation did not have n-way.py, so I'll update and look. POPFile does not seem to be what I want: it does n-exclusive-ways, whereas I am interested in suggested n nonexlusive keywords for classification (like the keywords on gmail). Sourceforge seems to contain just an issue tracking system; I am more interested in discussion about the issues involved in MultiWayClassification. -- AndyGlew

I was wrong. I found contrib/nway.py. It does the "obvious" thing of using multiple databases... -- AndyGlew

1.3. Multiway Email Sorting

The following I originally posted on my blog, which is on a company intranet and not externally visible.

I am getting tired of sorting my email into folders. I have grown dependent on automatic spam filtering via spambayes... I want the same for sorting.

I could very easily go over to search based email handling, e.g. using Google desktop. But

Spambayes is binary, spam/ham. I've heard about people doing it N-way, but cannot find any examples right now.

I can imagine the N-way being done

POPFile dies naive bayesian N-way. But it is N exclusive ways.

It occurs to me that if I am applying classification tags they are not mutually exclusive. Therefore, running something like a Spambayes binary test N times, suggesting N tags, is not unreasonable. Since spambayes reports high-probability-yes, high-probability-no, unsure, I could imagine an interface that reported all tags, and allowed the user to correct them.

Some tags are mutually exclusive, others not. E.g. "spam" probably isn't associated with any ham tags - there isn't much spam that is also comp-arch-technical. However... I *do* enjoy Nigerian fraud emails, so spam spam-nigerian might be amusing.

Basically, I am wondering about how to handle various combinations of tags. I want to support multiple tags, but I want as few extraneous tag combinations as possible.

e.g. possibly train a final filter on the tags suggested by the initial filter. If the initial filter suggested (spam,technical-comp-arch), then the final filter might map this to (spam) only. But if (spam,family-recreation) usually means that it is family related, then it might elide the spam and map to (family-recreation).

i.e. we might be able to use Bayesian learning to automatically figure out which combinations of tags are meaningful.

The various spam filtering tools like Spambayes and POPFle integrate with mailers in many different ways. E.g. POPFile is a POP server. Spambayes is more of a library, that is integrated into proxy servers, but also email tools that filter various folders, etc.

Maybe what I really want is a generic tool that maps a file containing an email to a classification, returning the strings of the tags. I could then use this as a suggestion in Gnus under EMACS, and edit out the tags unwanted.

Given such suggestions, various other tools could place them in mail headers, or in the subject line, or whatever.

Basic structure

ISSUE: this implies 2N separate databases: one for each tag, on the raw message, and one for each tag, given the classifications. Since Spambayes is already slow, maybe it would be better to have a unified database, especially since all probably use the same features (tokens).

1.3.1. Usage Model

I used to use procmail rules to sort my mail into different folders, and then use gnus to take me to folders by different priorities.

I stopped doing this with Outlook, because Outlook mail sorting rules are so broken. Now that I have abandoned Outlook I could fall back to procmail, but I am still reluctant to maintain a ruleset. Hence the blurb about using N-way spambayes sorting.

When I sorted all incoming mail into folders, the folder I was in suggested how what folder any mail I generated should go to. I.e. my replies got stored in the same folder as the mail I was replying to.

That's still valid. Unfortunately, right now what I do is apply spam/ham classification, but receive all mail into my MH +inbox. So all of my replies stay in my inbox. And sorting out my replies is even more annoying than classifying incoming mail.

I suppose the general rule still applies: classify a reply like the email it is replying to. But I still probably want to generate content based classification for email I send. I mean, how often does a proprietary discussion turn into an invitation for lunch?

For email that I generate I can generate the classification tags immediately, and edit them. I don't want to do this manual editing much, but if the suggestion were close enough.

I also record a lifestream of all email. Duplicate emails in different folders is unfortunate. A unified store, with tags, would be nice, so long as the proprietary/nonproprietary distinction were maintained.

2. Full vs Partial Populations

Thinking about this has leed me to the following conclusion: only one "column", or set of counts, is necessary for a particular tag, if that tag classification has been performed for the entire population, and if all messages not tagged are, well, not in the classification.

In this case, probabilities would be calculated by dividing any of the (feature,tag) counts, and dividing by the total size of the population, which is separately known.

This might be okay if you establish your classification sccheme for email at the beginning of time, and have run all email through it. However, if you are like me your classification scheme evolves; you create new categories and discard old. So, although you may have a history of 100,000 messages, there may be only 10 that have been trained wrt the new classification tag.

To handle such evolving classification schemes, one must have explicit tag and not-tag counts (and assume unknown for features that have neither tag nor not-tag asssociated with them). Perhaps, when the size of the total population and the size of population for which tag/not-tag training has been done approach each other, the not-tag counting can be discarded. More important, when the populations almost completely overlap, more queries can be supported.

Given tag/not-tag feature classification, one can decide tag/not-tag.

Given tag1/not-tag1 and tag2/not-tag2 feature classifications, one can do tag1/tag2 classification, but one must make assumptions:

The basic problem with doing such tag1/tag2 multiway tests, if we do not know to what degree the populations overlap, is that we do not know how independent the variables are: we do not know the size of the population that is tagged tag1*tag2.

We can imagine that the database storing the feature=>classification counts might be augmented to track the various population sizes:

If these counts are all available, if we do a query or test involving tagI and tagJ, then we will have the population sizes necessary. Assuming stationary distributions would allow us to figure out the appropriate feature=>classification weights.