While this could be a really awesome title for an Arduino project (hmm…. maybe it will be, one day), this post is actually about spamming software mechanisms, as exposed by an interesting piece of tell-tale spam I got a few days ago. Let’s peek into the murky depths where spammers and spam blockers collide…
How The SEO Arms Race Started
At the beginning, web search engines counted keywords in pages to determine their relevance to future queries. This produced pretty lame results, because the number of keywords is by no means an indication of the quality of the text they’re in. Also, dishonest website owners started stuffing their pages with redundant – even irrelevant – keywords just to push themselves higher in the search results.
When this practice became unbearable, search providers stepped up their game by, among other things, actually analyzing the texts. If the algorithms determined there were too many keywords, or detected similar monkey business, the page rank actually dropped. To get good scores, website owners now had to publish good – well, at least adequate-looking – texts.
However, no one said this text had to be their own, and this little loophole gave rise to the phenomenon of duplicate content – i.e. shameless copy-pasting of good content, usually made by others and without permission of course. But no one is better than computers at detecting exact duplicates, so the villains had to resort quickly to Article Spinning: giving a few pennies to some poor schmuck to rewrite the original. Nothing major, just enough to make it pass the automated comparison test.
As search engine background checks became more and more rigorous, even the most desperate article spinners could not churn these variations fast enough, and even at a penny apiece, this method became too expensive for these SEO scammers. Thus appeared automatic spinning software, usually based on some database of synonyms and alternative expressions. The battle between these bots and the search engine algorithms is raging as I’m writing this, consuming unbelievable amounts of computing power everywhere.
What’s Spam Got To Do With This
The case of blog comment spam is quite similar in principle. Spammers try to put links to their products (or whatever) all over the web in other people’s blogs. But this is too obvious, so they disguise them as legitimate-looking messages (e.g. “This post changed my life, thank you!”). Automatic spam blockers learned to recognize these canned comments, so the spammers have to “spin” the text, and since doing it manually is too slow and expensive for them, they now let bots do it. All this explains the revealing message I got; here’s an excerpt:
{Hello|Hi|Hello there|Hi there|Howdy|Good day}! I could have sworn I've {been to|visited} {this blog|this web site|this website|this site|your blog} beffore but after {browsing through|going through|looking at} {some of the|a feww of the|many of the} {posts|articles} I realpized it's new to me. {Anyways|Anyhow|Nonetheless|Regardless}, I'm {definitely|certainly} {happy|pleased|delighted} {I found|I discovered|I came across|I stumbled upon} it andd I'll be {bookmarking|book-marking} it and checking back {frequently|regularly|often}!
Inside each pair of curly braces there’s a bunch of alternative words/phrases separated by the “|” character. The whole thing was written, again, by a poor human schmuck, but in a format that makes it very easy for a software bot to manipulate: it prints the words out, and whenever it encounters a curly brace pair it picks one of the alternatives inside at random and prints that instead. The above text has 6, 2, 5, 3, 3, 2, 4, 2, 3, 4, 2, and 3 alternatives, if I counted correctly, and multiplying these together gives us 622,080 possible variations to try and confuse the spam blocker.
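A quick way to double-check that arithmetic is to let Python do the counting. This is only a sketch: it assumes the braces are balanced and never nested, it needs Python 3.8 or later for math.prod, and count_variations is just a name made up for this illustration.

```python
import math
import re

# The excerpt quoted above, typos and all
TEMPLATE = (
    "{Hello|Hi|Hello there|Hi there|Howdy|Good day}! I could have sworn I've "
    "{been to|visited} {this blog|this web site|this website|this site|your blog} "
    "beffore but after {browsing through|going through|looking at} "
    "{some of the|a feww of the|many of the} {posts|articles} I realpized it's "
    "new to me. {Anyways|Anyhow|Nonetheless|Regardless}, I'm "
    "{definitely|certainly} {happy|pleased|delighted} "
    "{I found|I discovered|I came across|I stumbled upon} it andd I'll be "
    "{bookmarking|book-marking} it and checking back {frequently|regularly|often}!"
)


def count_variations(template):
    """Multiply the number of alternatives in every {...} group."""
    # Each match is the text between one pair of braces
    groups = re.findall(r"\{([^{}]*)\}", template)
    # Every group is chosen independently, so the counts multiply
    return math.prod(len(group.split("|")) for group in groups)


print(count_variations(TEMPLATE))  # 622080
```

Since every group is picked independently of the others, adding just one more three-way choice to the template would triple the total.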
I wrote a small Python program to produce such variations – one at a time, mercifully. It assumes the original text is in a file called “SpamTemplate.txt” in the same directory as the program itself:
import random

# Read the whole spam template into memory
with open("SpamTemplate.txt") as fIn:
    origTxt = fIn.read()

while True:
    txt = origTxt
    while True:
        # Find the next {...} group; stop when none are left
        p1 = txt.find("{")
        if p1 < 0:
            break
        p2 = txt.find("}", p1)
        # Replace the group with one randomly chosen alternative
        options = txt[p1+1 : p2].split("|")
        txt = txt[:p1] + random.choice(options) + txt[p2+1:]
    print(txt)
    cont = input("Continue? Y/N ")
    if cont.upper() != "Y":
        break
This Python code is not 100% safe – it makes some assumptions about input validity, and if these don’t hold, it will print unexpected strings. That is exactly what caused the raw message to be sent to me in the first place, instead of just one variant: somewhere along the line, at least one person did a half-assed job, and the original template went through unprocessed, revealing a little trade secret. How I {love|enjoy|appreciate} it when this happens!
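For the curious, here is one way a spinner could check its input before letting anything out the door. It is a rough sketch, not a description of what any real spam tool does: it assumes groups are never nested and that the message itself never needs a literal curly brace, and the names spin and GROUP are made up for this example.

```python
import random
import re

# One {a|b|c} group with no braces inside it
GROUP = re.compile(r"\{([^{}]*)\}")


def spin(template, rng=random):
    """Return one random variant, or raise ValueError on a malformed template."""
    # Replace every well-formed group with one randomly chosen alternative
    result = GROUP.sub(lambda m: rng.choice(m.group(1).split("|")), template)
    # Any brace still left over was unbalanced or nested
    if "{" in result or "}" in result:
        raise ValueError("malformed template: stray or nested brace")
    return result


print(spin("{Hello|Hi} there, {great|nice} post!"))
```

A sender that catches the ValueError can skip the message altogether, instead of mailing out a half-processed template.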
I {realise|understand|know} this post is {old|ancient} but {great|nice} job anyway, it was {well written|interesting}.