Despamming Shortstat
I've been using Shaun Inman's Shortstat package for a short while now as my main source of web statistics. However, as with most other blog-related things these days, it's fairly susceptible to the innovation known as referer spam.
Note: This script has been updated somewhat now, but the basic story here remains the same.
Anyway, this got me thinking. To cut a long story short, it occurred to me that I already had a great blacklist of spam domains supplied by Jay Allen's MT-Blacklist, and that it shouldn't be hard to use this list as a basis for removing the referer spam from the Shortstat database. So, here are the results of about 13 minutes of investigation:
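<?php
// Reuse Shortstat's own settings and helpers for the database connection.
include_once("configuration.php");
include_once("functions.php");

if ($shortstat) {
    SI_pconnect();
    // Every MT-Blacklist entry is treated as a domain suffix to purge.
    $urlpatterns = mysql_query("SELECT ext_bl_item_text FROM mt_ext_bl_item");
    while ($row = mysql_fetch_array($urlpatterns, MYSQL_NUM)) {
        // Escape the entry so quotes in it cannot break the DELETE statement.
        $pattern = mysql_real_escape_string($row[0]);
        $query = "DELETE FROM si_shortstat WHERE domain LIKE '%$pattern'";
        @mysql_query($query);
    }
}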
I've called this "_despam.php". Installed in the Shortstat installation directory, it will use your existing database connection settings. Note: the script assumes that MT-Blacklist has been set up to use the same MySQL database that Shortstat uses, but I imagine that's most setups.
It could certainly do with a few more features (actually reporting back what it has done would be a start), but the basic functionality is there, and lovely, shiny, clean reports are the result.
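As a rough idea of what that reporting might look like, here is a minimal sketch of a dry run that works on plain arrays rather than the database. The function name despam_report() and the sample domains are made up for illustration; neither is part of Shortstat or MT-Blacklist. It only approximates the LIKE test in the script, since it treats each blacklist entry as a literal string, whereas in SQL any "%" or "_" inside an entry acts as a wildcard.

```php
<?php
// Hypothetical helper, not part of Shortstat or MT-Blacklist.
// Approximates the SQL test `domain LIKE '%entry'` for literal entries:
// a domain is flagged when it ends with a blacklist entry. The comparison
// is lowercased here on the assumption of a case-insensitive collation.
function despam_report(array $blacklist, array $domains)
{
    $report = array();
    foreach ($blacklist as $entry) {
        $entry = strtolower(trim($entry));
        if ($entry === '') {
            continue; // an empty entry would match every domain
        }
        $hits = array();
        foreach ($domains as $domain) {
            // Take the tail of the domain that is as long as the entry.
            $tail = substr(strtolower($domain), -strlen($entry));
            if ($tail === $entry) {
                $hits[] = $domain;
            }
        }
        if ($hits) {
            $report[$entry] = $hits;
        }
    }
    return $report;
}

// Made-up sample data standing in for the two database tables.
$report = despam_report(
    array('cheap-pills.example', 'casino.example'),
    array('www.cheap-pills.example', 'shauninman.com', 'casino.example')
);
foreach ($report as $entry => $hits) {
    echo $entry . ': ' . count($hits) . " matching domain(s)\n";
}
```

Feeding it the rows of mt_ext_bl_item and the distinct domains from si_shortstat before running the DELETE would show what is about to disappear, which is handy for spotting a blacklist entry that is too greedy.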
Update: Tony at juju.org has taken things a step further with a Perl script that 'de-spams' your server log files using MT-Blacklist too.