Undercrank

Scripting

January 21, 2005

Despamming Shortstat

I've been using Shaun Inman's Shortstat package for a short while now as my main source of web statistics. However, as with most other blog-related things these days, it's fairly susceptible to the innovation known as referer spam.

Note: This script has been updated somewhat now, but the basic story here remains the same.

Anyway, this got me thinking, and cutting a long story short it occurred to me that I already had a great blacklist of spam domains supplied by Jay Allen's MT-Blacklist, and it shouldn't be so hard to use this list as a basis to remove the referer spam from the Shortstat database - and so, here's the results of about 13 minutes of investigation:

  include_once("configuration.php");
  include_once("functions.php");
  if ($shortstat) {
     SI_pconnect();
     $urlpatterns = mysql_query("SELECT ext_bl_item_text FROM mt_ext_bl_item");
     while ($row = mysql_fetch_array($urlpatterns, MYSQL_NUM)) {
        $query = "DELETE FROM si_shortstat WHERE domain LIKE \"%$row[0]\"";
        @mysql_query($query);
      }
   }

I've called this "_despam.php" and installed in the Shortstat installation directory it will use your existing database connection settings. Note - the script makes the assumption that MT-Blacklist has been set up to use the same MySQL database that Shortstat uses, but that said, I imagine that's most setups.

It certainly could do with a few more features (actually reporting back what it's done/doing would be a start), but the basic functionality is there and lovely shiny clean reports are the result.

Update: Tony at juju.org has taken things a step further with a Perl script that 'de-spams' your server log files using MT-Blacklist too.

More of the same RSS

Yahoo! Web Services news over Atom

aka: "When you realise Yahoo! already serve their news over RSS..."

Yahoo! Web Services news over RSS

I've put together a small project that will create an dynamic RSS 2.0 feed based on the Yahoo! News Search hooks.

Despamming Shortstat (Part 2)

Further to my earlier post about Despamming Shortstat, I've made a small update to the code that has a few improvements on the original.

Importing CSV files to SQLite

"SQLite is different from most other SQL database engines in that its primary design goal is to be simple". Which isn't strictly true...

Trackbacks

Trackback URL for this entry is:

Proposal for a solution to referrer spam: Using MT-Blacklist and other blacklists to filter spamming URLs on January 25, 2005

JOTSHEET - Referrer (or referer) spam has become a serious problem in the blogosphere. We need an intelligent way to eliminate this growing nuisance. I've thought about and researched this for the past few days, and below I offer a proposal for a technological so...

Comments

Hans on January 23, 2005

How might one go about doing this without having MovableType installed? (I use Textpattern.)

jimmy on January 23, 2005

GREAT IDEA! How can this be modified if shortstat and movabletype are separate databases in mysql?

Justin Perkins on January 23, 2005

I don't do too much work with MT or MySQL, but couldn't you combine your two queries into one that would achieve the desired effect without having to loop through a possibly large dataset (blacklist items) executing SQL commands to delete records one at a time?

Seems like a quite a resource intensive action to be executing against your DB. If you're in a shared web hosting environment, your webhost might frown on CPU pegging.

I don't know, maybe something like:

delete from si_shortstat where domain in (select ext_bl_item_text FROM mt_ext_bl_item)

Just a guess at first glance, I have no idea if that would work. :)

Mark McLaughlin on January 23, 2005

Hans: The blacklist is also available as a plain text file which could be downloaded and used instead. You could then create a script that cycled through that list and doing much the same thing as the code above.

Jimmy: I honestly don't know the best way around. I'm supposing you'd have to query MT-Blacklist first, store that in an array and then loop through Shortstat using that as your criteria.

Justin: Fair point - my method above is a bit, er, clumsy at least. That said, subqueries in MySQL don't work in versions less than 4.1 (my host runs 4.0) and so I was pretty much out of luck.

There are of course workarounds but as you may have guessed, my SQL query knowledge starts drawing a big brick wall when faced with that sort of thing!

Justin Perkins on January 24, 2005

I figured there was something like that going on (lack of subquery support), it's obvious to me you are not an amateur at server-side programming.

I figured I'd throw that information out there just in case, it would've come up at some point anyway :)

You seem to be loving the rel="nofollow" attribute eh? I've got some rules in my user stylesheet (as per recommendation on Andy Budd) and I see them everywhere. I'd argue that reference links in your post should be OK to weigh in on a sites Page Rank, no?

Post a comment

Remember personal info?