Despamming Shortstat

I’ve been using Shaun Inman‘s Shortstat package for a short while now as my main source of web statistics. However, as with most other blog-related things these days, it’s fairly susceptible to the innovation known as referer spam.

Note: This script has been updated somewhat now, but the basic story here remains the same.

Anyway, this got me thinking, and cutting a long story short it occurred to me that I already had a great blacklist of spam domains supplied by Jay Allen‘s MT-Blacklist, and it shouldn’t be so hard to use this list as a basis to remove the referer spam from the Shortstat database – and so, here’s the results of about 13 minutes of investigation:


include_once("configuration.php");
include_once("functions.php");
if ($shortstat) {
SI_pconnect();
$urlpatterns = mysql_query("SELECT ext_bl_item_text FROM mt_ext_bl_item");
while ($row = mysql_fetch_array($urlpatterns, MYSQL_NUM)) {
$query = "DELETE FROM si_shortstat WHERE domain LIKE \"%$row[0]\"";
@mysql_query($query);
}
}

I’ve called this “_despam.php” and installed in the Shortstat installation directory it will use your existing database connection settings. Note – the script makes the assumption that MT-Blacklist has been set up to use the same MySQL database that Shortstat uses, but that said, I imagine that’s most setups.

It certainly could do with a few more features (actually reporting back what it’s done/doing would be a start), but the basic functionality is there and lovely shiny clean reports are the result.

Update: Tony at juju.org has taken things a step further with a Perl script that ‘de-spams’ your server log files using MT-Blacklist too.

This entry was posted in Scripting and tagged , , , , , , , , . Bookmark the permalink.

6 Responses to Despamming Shortstat

  1. Hans says:

    How might one go about doing this without having MovableType installed? (I use Textpattern.)

  2. jimmy says:

    GREAT IDEA! How can this be modified if shortstat and movabletype are separate databases in mysql?

  3. I don’t do too much work with MT or MySQL, but couldn’t you combine your two queries into one that would achieve the desired effect without having to loop through a possibly large dataset (blacklist items) executing SQL commands to delete records one at a time?

    Seems like a quite a resource intensive action to be executing against your DB. If you’re in a shared web hosting environment, your webhost might frown on CPU pegging.

    I don’t know, maybe something like:

    delete from si_shortstat where domain in (select ext_bl_item_text FROM mt_ext_bl_item)

    Just a guess at first glance, I have no idea if that would work. :)

  4. Hans: The blacklist is also available as a plain text file which could be downloaded and used instead. You could then create a script that cycled through that list and doing much the same thing as the code above.

    Jimmy: I honestly don’t know the best way around. I’m supposing you’d have to query MT-Blacklist first, store that in an array and then loop through Shortstat using that as your criteria.

    Justin: Fair point – my method above is a bit, er, clumsy at least. That said, subqueries in MySQL don’t work in versions less than 4.1 (my host runs 4.0) and so I was pretty much out of luck.

    There are of course workarounds but as you may have guessed, my SQL query knowledge starts drawing a big brick wall when faced with that sort of thing!

  5. I figured there was something like that going on (lack of subquery support), it’s obvious to me you are not an amateur at server-side programming.

    I figured I’d throw that information out there just in case, it would’ve come up at some point anyway :)

    You seem to be loving the rel=”nofollow” attribute eh? I’ve got some rules in my user stylesheet (as per recommendation on Andy Budd) and I see them everywhere. I’d argue that reference links in your post should be OK to weigh in on a sites Page Rank, no?

  6. jotsheet says:

    Proposal for a solution to referrer spam: Using MT-Blacklist and other blacklists to filter spamming URLs

    Referrer (or referer) spam has become a serious problem in the blogosphere. We need an intelligent way to eliminate this growing nuisance. I’ve thought about and researched this for the past few days, and below I offer a proposal for a technological so…

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>