Undercrank

Icon

Despamming Shortstat

I’ve been using Shaun Inman’s Shortstat package for a short while now as my main source of web statistics. However, as with most other blog-related things these days, it’s fairly susceptible to the innovation known as referer spam.

Note: This script has been updated somewhat now, but the basic story here remains the same.

Anyway, this got me thinking, and cutting a long story short it occurred to me that I already had a great blacklist of spam domains supplied by Jay Allen’s MT-Blacklist, and it shouldn’t be so hard to use this list as a basis to remove the referer spam from the Shortstat database – and so, here’s the results of about 13 minutes of investigation:


include_once("configuration.php");
include_once("functions.php");
if ($shortstat) {
SI_pconnect();
$urlpatterns = mysql_query("SELECT ext_bl_item_text FROM mt_ext_bl_item");
while ($row = mysql_fetch_array($urlpatterns, MYSQL_NUM)) {
$query = "DELETE FROM si_shortstat WHERE domain LIKE \"%$row[0]\"";
@mysql_query($query);
}
}

I’ve called this “_despam.php” and installed in the Shortstat installation directory it will use your existing database connection settings. Note – the script makes the assumption that MT-Blacklist has been set up to use the same MySQL database that Shortstat uses, but that said, I imagine that’s most setups.

It certainly could do with a few more features (actually reporting back what it’s done/doing would be a start), but the basic functionality is there and lovely shiny clean reports are the result.

Update: Tony at juju.org has taken things a step further with a Perl script that ‘de-spams’ your server log files using MT-Blacklist too.

Category: Scripting

Tagged: , , , , , , , ,

6 Responses

  1. Hans says:

    How might one go about doing this without having MovableType installed? (I use Textpattern.)

  2. jimmy says:

    GREAT IDEA! How can this be modified if shortstat and movabletype are separate databases in mysql?

  3. I don’t do too much work with MT or MySQL, but couldn’t you combine your two queries into one that would achieve the desired effect without having to loop through a possibly large dataset (blacklist items) executing SQL commands to delete records one at a time?

    Seems like a quite a resource intensive action to be executing against your DB. If you’re in a shared web hosting environment, your webhost might frown on CPU pegging.

    I don’t know, maybe something like:

    delete from si_shortstat where domain in (select ext_bl_item_text FROM mt_ext_bl_item)

    Just a guess at first glance, I have no idea if that would work. :)

  4. Hans: The blacklist is also available as a plain text file which could be downloaded and used instead. You could then create a script that cycled through that list and doing much the same thing as the code above.

    Jimmy: I honestly don’t know the best way around. I’m supposing you’d have to query MT-Blacklist first, store that in an array and then loop through Shortstat using that as your criteria.

    Justin: Fair point – my method above is a bit, er, clumsy at least. That said, subqueries in MySQL don’t work in versions less than 4.1 (my host runs 4.0) and so I was pretty much out of luck.

    There are of course workarounds but as you may have guessed, my SQL query knowledge starts drawing a big brick wall when faced with that sort of thing!

  5. I figured there was something like that going on (lack of subquery support), it’s obvious to me you are not an amateur at server-side programming.

    I figured I’d throw that information out there just in case, it would’ve come up at some point anyway :)

    You seem to be loving the rel=”nofollow” attribute eh? I’ve got some rules in my user stylesheet (as per recommendation on Andy Budd) and I see them everywhere. I’d argue that reference links in your post should be OK to weigh in on a sites Page Rank, no?

  6. jotsheet says:

    Proposal for a solution to referrer spam: Using MT-Blacklist and other blacklists to filter spamming URLs

    Referrer (or referer) spam has become a serious problem in the blogosphere. We need an intelligent way to eliminate this growing nuisance. I’ve thought about and researched this for the past few days, and below I offer a proposal for a technological so…

Leave a Reply

Disclaimer

The opinions expressed on Undercrank.com are personal and do not represent Skype or any other organisation I belong to.

Flickr

Mark McLaughlin posted a photo:Need to capture this little beauty for the few short days it shows up on my pond.  Fortunately the iPhone 3GS is decent enough now to spare me finding the DSLR in a hurry.Mark McLaughlin posted a photo:Managed to miss the whole hoopla about this place in Covent Garden; it's almost more of a statement having it abandoned, just like all the other Woolworths.www.bbc.co.uk/london/content/articles/2009/05/18/josef_va...Mark McLaughlin posted a video:Mark McLaughlin posted a photo:First test shot from the iPhone 3G S - some real actual depth of field going on there.Mark McLaughlin posted a photo:Mark McLaughlin posted a photo:Probably as quiet as it'll be this week.Mark McLaughlin posted a photo:Mark McLaughlin posted a photo:Mark McLaughlin posted a photo: