I (TonyMeyer) tried this out, both adding an additional token and replacing the original token, in header tokenization only, in body tokenization only, and in both. None of the variants gave me results I'd call a win (this was only one corpus, but it didn't look promising). Another case of stupid beats smart, I guess. (Replacing the token also makes the token list harder to read, of course.)
The patch for adding the token only in the body (you should be able to work out the others from this) is:
*** tokenizer.py    Wed Jan 19 12:04:21 2005
--- tokenizer2.py   Wed Jan 19 11:59:29 2005
***************
*** 1593,1598 ****
--- 1593,1603 ----
              n = len(w)
              # Make sure this range matches in tokenize_word().
              if 3 <= n <= maxword:
+                 if options["Tokenizer", "x-de-anagram"]:
+                     yield w
+                     w = list(w)
+                     w.sort()
+                     w = "".join(w)
                  yield w
  
              elif n >= 3:
*** Options.py      Wed Jan 19 12:04:20 2005
--- Options2.py     Wed Jan 19 11:59:14 2005
***************
*** 179,184 ****
--- 179,188 ----
       the ability to reduce the nine tokens to one.  (This option has no
       effect if 'Search for Habeas Headers' is False)"""),
       BOOLEAN, RESTORE),
+ 
+     ("x-de-anagram", "Sort all words into alphabetical order", False,
+      """(EXPERIMENTAL)""",
+      BOOLEAN, RESTORE),
    ),
  
    # These options are all experimental; it seemed better to put them into
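For anyone who wants to play with the idea outside the tokenizer, here is a minimal standalone sketch of what the patch does to body words. It is not the SpamBayes code: `de_anagram` and `tokenize_body_words` are hypothetical names, the `maxword=12` default is an assumption, and words longer than `maxword` (which the real tokenizer hands to `tokenize_word()`) are simply skipped here.

```python
def de_anagram(word):
    """Return the word with its characters sorted alphabetically."""
    return "".join(sorted(word))


def tokenize_body_words(text, maxword=12, add_original=True):
    """Yield de-anagrammed tokens for whitespace-separated words.

    Hypothetical helper, not part of SpamBayes. Only words with
    3 <= len(word) <= maxword are handled; longer words are skipped.
    With add_original=True the unsorted word is yielded as well (the
    "add" variant); with False only the sorted form is yielded (the
    "replace" variant).
    """
    for w in text.split():
        if 3 <= len(w) <= maxword:
            if add_original:
                yield w
            yield de_anagram(w)


# Scrambled and unscrambled spellings collapse to the same token.
print(de_anagram("viagra"), de_anagram("vaigra"))
print(list(tokenize_body_words("cheap vaigra")))
print(list(tokenize_body_words("cheap vaigra", add_original=False)))
```

The second call also shows why the replace variant makes the token list harder to read: only the sorted forms survive.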
A table of results is:
-> <stat> tested 4690 hams & 384 spams against 17688 hams & 1539 spams
(etc)
filename:           defaults deanagram_header   deanagram_both   deanagram_body    deanagram_add
ham:spam:         22378:1923       18166:1923       23454:1923       22378:1923       22378:1923
fp total:                  5                7                5                7                6
fp %:                   0.02             0.04             0.02             0.03             0.03
fn total:                 23               28               23               20               26
fn %:                   1.20             1.46             1.20             1.04             1.35
unsure t:                152              176              156              160              159
unsure %:               0.63             0.88             0.61             0.66             0.65
real cost:           $103.40          $133.20          $104.20          $122.00          $117.80
best cost:            $83.60          $103.20           $83.40           $86.80           $96.00
h mean:                 0.12             0.15             0.11             0.13             0.12
h sdev:                 2.39             2.94             2.31             2.66             2.44
s mean:                96.29            95.62            96.24            96.56            96.13
s sdev:                14.55            15.50            14.62            13.87            14.77
mean diff:             96.17            95.47            96.13            96.43            96.01
k:                      5.68             5.18             5.68             5.83             5.58
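As a sanity check on the "real cost" row, the sketch below recomputes it from the fp, fn and unsure totals. The weights ($10 per false positive, $1 per false negative, $0.20 per unsure) are an assumption on my part; they are used here because they reproduce every figure in that row. `real_cost` is a hypothetical helper, not a SpamBayes function.

```python
# Assumed per-message costs; chosen because they reproduce the table.
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20


def real_cost(fp, fn, unsure):
    """Weighted cost of a run from its fp, fn and unsure totals."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


# (fp total, fn total, unsure t) copied from the table above.
runs = {
    "defaults": (5, 23, 152),
    "deanagram_header": (7, 28, 176),
    "deanagram_both": (5, 23, 156),
    "deanagram_body": (7, 20, 160),
    "deanagram_add": (6, 26, 159),
}

for name, counts in runs.items():
    print(f"{name:<17} ${real_cost(*counts):.2f}")
```

Under these weights a single extra false positive outweighs several false negatives, which is why deanagram_body costs more than the defaults despite having the fewest false negatives.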
If you'd like cmp.py results, which tell you much more, let me (TonyMeyer) know and I'll happily provide them.