Better spam defense for Django comments
I've finally found the time to package up an improved version of my Django comment validation.
The original's simple link count worked surprisingly well for a while, but microbes are always evolving, so this had to as well. (For those of you wondering when I'm going to stop being stubborn and use TypePad's AntiSpam service ... not yet. :^)
The first improvement is a list of banned IPs: if the commenter is posting from one, the comment is killed. Even if the IP hasn't been banned, it's checked for previous comments that weren't clean enough to be marked public; each one found counts against the current comment.
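To make the IP check concrete, here's a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the function name ip_penalty, the shape of the comment history, and the penalty of 1.0 per earlier non-public comment are hypothetical, not the app's actual code or schema.

```python
# Hypothetical sketch: names and values are illustrative, not the app's API.
PENALTY_PER_BAD_COMMENT = 1.0  # assumed weight for each earlier non-public comment


def ip_penalty(ip, banned_ips, history):
    """Score a commenter's IP.

    banned_ips: set of banned IP address strings.
    history: iterable of (ip_address, is_public) pairs for earlier comments.
    Returns None if the IP is banned (the comment should be killed),
    otherwise a penalty that grows with each earlier non-public comment.
    """
    if ip in banned_ips:
        return None
    bad = sum(1 for prev_ip, is_public in history
              if prev_ip == ip and not is_public)
    return bad * PENALTY_PER_BAD_COMMENT


banned = {"203.0.113.7"}
history = [
    ("198.51.100.4", False),
    ("198.51.100.4", False),
    ("192.0.2.10", True),
]
```

With this data, a comment from 203.0.113.7 is killed outright, while one from 198.51.100.4 picks up a penalty for its two earlier non-public comments.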
The second is more complex. You can now create blacklists of phrases that count against comments. The comment text, after removing stop words, is compared to each blacklist to derive their Tanimoto coefficient, and that is multiplied by the weight assigned to the blacklist. The weighted score lets you be more aggressive about certain phrases.
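The blacklist scoring can be sketched roughly as follows. This assumes the set form of the Tanimoto coefficient (shared words divided by distinct words across both sets), a tiny made-up stop-word list, and that weighted scores from several blacklists are simply summed. The names tokenize, tanimoto, and blacklist_score are hypothetical; the app's real implementation may differ on any of these points.

```python
import re

# Tiny illustrative stop-word list (an assumption; the app's list may differ).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "for", "in", "is", "at"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def tanimoto(a, b):
    """Tanimoto coefficient of two sets: shared items / distinct items."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def blacklist_score(comment, blacklists):
    """Sum weight * coefficient over (phrases, weight) blacklists.

    Summing across blacklists is an assumption of this sketch.
    """
    words = tokenize(comment)
    return sum(weight * tanimoto(words, tokenize(phrases))
               for phrases, weight in blacklists)


short_list = [("cheap pills casino", 2.0)]
long_list = [("cheap pills casino poker loans mortgage viagra "
              "replica watches lottery", 2.0)]
comment = "Buy cheap pills at the pharmacy"
```

Here the comment shares two words with each blacklist, but the longer list produces a lower coefficient because it contributes more distinct words to the denominator. Raising a blacklist's weight is the other way to compensate.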
Finally, this version includes batch tools for comment administration and adds the ability to ban the IPs of multiple comments at once.
You still just need to add this version to your INSTALLED_APPS setting, but you'll also need to run manage.py syncdb to install the tables and the initial blacklist data (don't look too closely if you're easily offended).
This has been working pretty well here, and it's pretty tweakable. Some things to consider:
The Tanimoto coefficient is sensitive to the number of words in the comparison. You may find you want a greater number of short blacklists, instead of a few comprehensive ones. Or you can just play with the blacklist weights.
The default threshold for marking comments non-public is pretty aggressive; you may want to raise this if you're spending too much time in the admin marking comments public.
If you never want to reject a comment outright, just set the rejection threshold really high. Be aware that the spambots think they're getting through if not rejected, so you'll accumulate lots of comments from their repeat visits.
You can still add your own validators via the FCCV_VALIDATORS setting.
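The two thresholds mentioned above can be pictured with a small sketch. The constant names, the default values, and the use of "greater than or equal" comparisons are all assumptions for illustration; the app's actual settings and defaults aren't shown in this post.

```python
# Hypothetical thresholds; the app's real setting names and defaults may differ.
PUBLIC_THRESHOLD = 1.0  # at or above this score, the comment is saved as non-public
REJECT_THRESHOLD = 5.0  # at or above this score, the comment is rejected outright


def classify(score, public_threshold=PUBLIC_THRESHOLD,
             reject_threshold=REJECT_THRESHOLD):
    """Map a spam score to one of "public", "non-public", or "rejected"."""
    if score >= reject_threshold:
        return "rejected"
    if score >= public_threshold:
        return "non-public"
    return "public"


# "Never reject" mode: an unreachable rejection threshold means even a
# high-scoring comment is stored as non-public instead of being refused.
never_reject = classify(7.0, reject_threshold=float("inf"))
```

Raising public_threshold lets more borderline comments through as public; setting reject_threshold very high keeps everything, at the cost of storing the spambots' repeat visits.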
Update: 18 May 2009
After several requests and way too long, I've set this up as a Bitbucket project. There you'll find better instructions, the full source in both downloadable form and a Mercurial repository, and an issue tracker. The code itself has had numerous improvements since this was posted, too, including support for Akismet or TypePad AntiSpam.
Comments (3)
You know, wiki, bugtracking, more visibility...