Comment spam arms race

While a lot of people are quick to institute draconian rule over their weblogs and email clients, installing any widget that zaps spam before the spammer has even conceived of it, I tend to take a much more latent approach. My Bayesian email filter and MTBlacklist give me the control I need to make sure my world isn’t taken over by garbage, but at the same time I can pay attention to the tactics and technologies that these infidels are employing. It makes me feel empowered.

While cleaning up a few spam comments today, I noticed the next effort in the spamming arms race: encampment. The purpose of comment spam, as we all know, is to harvest PageRank from weblogs that have it and aren’t paying attention. The problem with this strategy is that there is more than one contending force attempting to take over this blog ghetto. The more links that appear on a given post, the less each individual link is worth in Google’s currency.

Instead of spreading links across thousands of pages, the new technique I’ve become aware of is to take a single weblog post, obviously deserted, and use comment spam on other sites to give support. Here’s the link I received today:

Hello, I just wanted to say you have a very informative site which really made me think, thanks very much! Have a nice Day!!

best online casinos

Except for the text (online casinos), this link looks pretty innocuous. And clicking through to the site appears to be no big deal either, since it’s just some other weblog. But looking at the comments on this post shows the true purpose: pushing PageRank to any number of other sites. This is a serious ghetto, kind of like the Robert Taylor homes of blog posts, with hundreds of links to other sites.

It seems to me that the most effective strategy would be to find a little corner of the web that no other spammer has found, place a few links to your sites there, and use that to elevate the given PageRank. But that’s just from my understanding of the algorithm, and maybe these spammers have something up their sleeve that I don’t know about.
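To make the dilution argument concrete, here is a minimal sketch. It assumes the simplified PageRank model from the originally published formulation, in which a page splits its damped score evenly among its outbound links. The damping factor of 0.85, the function name, and the example numbers are illustrative assumptions; whatever Google actually runs is surely more involved.

```python
def link_share(page_rank: float, outbound_links: int, damping: float = 0.85) -> float:
    """Score passed along each outbound link under the simplified PageRank model."""
    if outbound_links < 1:
        raise ValueError("a page needs at least one outbound link to pass on rank")
    return damping * page_rank / outbound_links


# Hypothetical page with a score of 1.0: a quiet post carrying 5 links
# versus a spam "encampment" carrying 500 links.
quiet = link_share(1.0, 5)
crowded = link_share(1.0, 500)
print(f"per-link share, quiet post:   {quiet:.4f}")
print(f"per-link share, crowded post: {crowded:.4f}")
```

Under this model every extra link on the encampment post shrinks the share that each spammer gets, which is why an undiscovered corner of the web looks like the better deal.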

Weblogs and authority

This week I’ll be presenting a paper at the International Communication Association Conference in New Orleans titled Audience, Structure and Authority in the Weblog Community. The paper is an analysis of two different metrics for measuring authority within weblogs:

  • Blogroll: A link from one weblog to the top level of another (e.g., links to http://overstated.net, http://www.overstated.net or http://overstated.net/index.asp). I assume this is a proxy for popularity.
  • Permalink: Any link from one weblog to deep content on another (e.g., a link to http://overstated.net/04/05/24-weblogs-and-authority.asp). I assume this is a proxy for influence.
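The two definitions can be sketched as a small classifier. This is only an illustration of the distinction, not the code used for the paper: the set of "top level" paths and both function names are assumptions.

```python
from urllib.parse import urlparse

# Assumed set of paths that still count as the "top level" of a weblog.
TOP_LEVEL_PATHS = {"", "/", "/index.asp", "/index.html", "/index.php"}


def classify_link(url: str) -> str:
    """Return 'blogroll' for a link to a site's top level, else 'permalink'."""
    parts = urlparse(url)
    if parts.path.lower() in TOP_LEVEL_PATHS and not parts.query:
        return "blogroll"
    return "permalink"


def site_key(url: str) -> str:
    """Treat www.example.com and example.com as the same weblog."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


for link in (
    "http://overstated.net",
    "http://www.overstated.net",
    "http://overstated.net/index.asp",
    "http://overstated.net/04/05/24-weblogs-and-authority.asp",
):
    print(site_key(link), classify_link(link))
```

One limitation of this sketch: weblogs that live under a path rather than at a domain root (littlegreenfootballs.com/weblog, for instance) would be counted as permalinks unless their home paths were listed explicitly.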

The following table shows the top 20 for each measure. One observation is that many of the top ranked sites are community weblogs (e.g. Slashdot or Memepool). These sites play the important role of hubs, maintaining ties to more weblogs than a single person would be able to. They allow information to diffuse quickly between distant parts of the network of readership.

     Blogroll Degree Rank                     Permalink Degree Rank
Rank Links  URL                               Links  URL
-----------------------------------------------------------------------------------------------
1.   2581   metafilter.com                    1322   boingboing.net
2.   2434   slashdot.org                      1270   diveintomark.org
3.   2146   boingboing.net                    1096   metafilter.com
4.   1825   kottke.org                        1073   slashdot.org
5.   1604   instapundit.com                   982    kottke.org
6.   1527   scripting.com                     976    weblog.siliconvalley.com/column/dangillmor
7.   1307   evhead.com                        956    instapundit.com
8.   1220   andrewsullivan.com                828    andrewsullivan.com
9.   1062   memepool.com                      827    themorningnews.org
10.  1007   doc.weblogs.com                   826    rathergood.com
11.  977    megnut.com                        819    textism.com
12.  961    littlegreenfootballs.com/weblog   683    denbeste.nu
13.  899    diveintomark.org                  626    doc.weblogs.com
14.  880    littleyellowdifferent.com         625    asmallvictory.net
15.  848    textism.com                       582    rightwingnews.com
16.  846    rebeccablood.net                  577    microcontentnews.com
17.  758    plasticbag.org                    568    joi.ito.com
18.  737    dashes.com/anil                   560    buzzmachine.com
19.  719    ftrain.com                        553    waxy.org
20.  714    plastic.com                       522    a.wholelottanothing.org

A second observation is that the lists are fairly distinct. While some webloggers hold top positions in both rankings, the lists diverge considerably further down. While blogrolls tend to support the weblog elders (scripting.com, evhead.com, etc.), permalinks suggest a different set of authors as influencers (joi.ito.com, buzzmachine.com, etc.). Looking at the differential between the ranks in the figure below, it is apparent that as soon as the rank passes 100, the correlation between Blogroll and Permalink rank becomes less defined.

Permalink and Blogroll rank differential
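For readers who want to reproduce the idea on a small scale, here is a sketch that computes the differential for the sites appearing in both top-20 lists. The ranks are copied from the table; the sign convention (permalink rank minus blogroll rank) and the function name are assumptions, since the figure's exact definition lives in the paper.

```python
# Ranks copied from the top-20 table, limited to sites present in both lists.
blogroll_rank = {
    "metafilter.com": 1, "slashdot.org": 2, "boingboing.net": 3,
    "kottke.org": 4, "instapundit.com": 5, "andrewsullivan.com": 8,
    "doc.weblogs.com": 10, "diveintomark.org": 13, "textism.com": 15,
}
permalink_rank = {
    "boingboing.net": 1, "diveintomark.org": 2, "metafilter.com": 3,
    "slashdot.org": 4, "kottke.org": 5, "instapundit.com": 7,
    "andrewsullivan.com": 8, "textism.com": 11, "doc.weblogs.com": 13,
}


def rank_differential(first: dict, second: dict) -> dict:
    """Second rank minus first rank, for sites ranked in both lists."""
    shared = first.keys() & second.keys()
    return {site: second[site] - first[site] for site in shared}


diff = rank_differential(blogroll_rank, permalink_rank)
# Negative values mean the site ranks higher by permalinks than by blogroll links.
for site, delta in sorted(diff.items(), key=lambda item: item[1]):
    print(f"{delta:+4d}  {site}")
```

Even within the top 20, only nine sites appear in both lists, which is a small-scale version of the divergence described above.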

This sheds new light on the age-old weblog power law debate. While the blogroll rankings (reflected by Shirky’s original analysis) suggest a model of preferential attachment, many of the weblogs listed in the top permalink ranks are much younger. If the weblog social structure were governed by a law of the “rich getting richer,” we would expect older weblogs to have more influence, and hence more links to their entries.

There are obviously many caveats and details, all of which are listed in the full paper below. Since I’m presenting it this coming Friday, I’d appreciate any feedback you may have.

Full paper: Audience, Structure and Authority in the Weblog Community (pdf 228k)

Popular press and weblogs

While working on a paper for a conference at the end of the month, I did some research on the coverage of weblogs in the popular press. I queried the LexisNexis database for references to "weblog," "web log," and "blog," resulting in 4051 magazine and newspaper articles from 1998 to the present. The first article, published in the Independent on February 18, 1998, isn’t actually a reference to weblogs as we know them, but rather another invention of the term:

Just how tricky the whole thing is is shown by the many drafts through which that note has already gone. Some of these drafts are available on the Internet, and for those of you unfortunate enough to be without a weblog* , I bring you today some of the first versions of that note to Saddam Hussein.

* Weblog. This is a new Internet word I have made up, which I hope will catch on. If it does, I will work out a meaning for it later.

The second reference is an article published in the Guardian, November 11, 1998, citing Jorn Barger’s Robot Wisdom:

Can computers model the human predicament? John Barger’s page sets out to tackle the idea of ‘robot wisdom’, taking in James Joyce, artificial intelligence and Internet issues along the way. The real gem is the weblog, a daily account of John’s travels around the web. Watch a highly observant and thoughtful surfer at work.

The story behind weblogs becomes more complete when they start receiving attention in mid-1999, with weblog exclusives by Jim McClellan of the Guardian (June 3, 1999) and Dan Gillmor (June 14, 1999). Both of these articles followed shortly after a piece by Scott Rosenberg in Salon (May 28, 1999), which unfortunately is not indexed by LexisNexis.

Weblog citations over time

The chart above shows the number of articles citing weblogs over time, along with the average number of times the term was used per article in each month. The data have been normalized so that they can be seen on the same plot; the maximum value for occurrences of the term per article was 31, in October 1999, and the maximum number of articles published was 296, in April 2004.
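The post does not say which normalization was used. One plausible reading is that each series was divided by its own maximum, sketched below. Only the two peaks (31 and 296) come from the text; every other number is a made-up placeholder, and the function name is an assumption.

```python
def normalize_by_max(series):
    """Scale a series so that its largest value becomes 1.0."""
    peak = max(series)
    if peak == 0:
        return [0.0 for _ in series]
    return [value / peak for value in series]


# Placeholder monthly values, NOT the real data; only the peaks are from the post.
uses_per_article = [31, 12, 6, 4]
articles_per_month = [3, 40, 150, 296]

scaled_uses = normalize_by_max(uses_per_article)
scaled_articles = normalize_by_max(articles_per_month)
print(scaled_uses)
print(scaled_articles)
```

After scaling, both series run from 0 to 1, so their shapes can be compared on one axis even though the raw magnitudes differ by an order of magnitude.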

The exponential growth of attention to the topic is striking, although it appears to have tapered off in the last month. Comparing this trend with the average number of uses of the term per article, it appears that the more frequently the concept is cited, the fewer times the word is used per article. The obvious interpretation is that the term is slowly becoming part of our vernacular, and when journalists write about weblogs today, much less context is necessary than in 1999. Also, the number of articles exclusively about weblogs is probably on the decline, while stories only tangentially related to weblogs are on the rise.

Another surprising characteristic of the media presentation of weblogs is the oversight of the most popular tools:

Weblog tool # of articles
Blogger 1913
MovableType 919
LiveJournal 181
DiaryLand 114
Xanga 31

While an extremely large contingent of weblog users rely on the last three tools in this list, all of the attention has been on MovableType and Blogger. Given that these tools are private communities, it could be simply that the press is not aware of how explosive their growth is.

If you’re interested in working with the data, I’m offering it up in zip (21 MB) and gzip (19 MB) formats. I’ve stripped the HTML of unnecessary cruft, but it could still use being converted to XML.

Occam’s Razor

After sitting on the Hot Abercrombie Chick story for a week, I still had an unsettling feeling that the story was unresolved. I decided to go back to my initial premonitions and check the IP addresses of the comments she posted to a few weblogs. Then I realized that I had been sitting on my own data the whole time, the IP addresses of sites added to Blogdex. Here’s what I’ve found:

   Date		    Source	     IP		  Owner
----------------------------------------------------------
2004-02-11	Blogdex		68.91.65.131	swbell.net
2004-02-21	Anil Dash	65.69.87.117	swbell.net
2004-02-28	Undisclosed	68.90.64.194	swbell.net
2004-04-14	Blogdex		68.89.157.59	swbell.net
2004-04-02	Blogdex		65.69.86.105	swbell.net
2004-04-07	Blogdex		128.252.173.54	wustl.edu

All of these IP addresses originate from St. Louis, Missouri, one directly from Washington University where Daniel Zeigenbein transferred to after leaving Vassar (also compare this to Amanda Doerty’s profile on MSN). While many open questions remain (the motive of the author, the identity of the girl in the picture, etc.), this is enough evidence for me to close the case on this one.

I find a great deal of pleasure in doing research on the internet, especially when it involves unearthing information that is nonobvious or difficult to procure. It’s the same type of excitement that one gets watching Woodward and Bernstein slowly uncover the trail of leads that eventually finds Nixon. I never had any intention of defaming HAC or Daniel Zeigenbein; it was simply a mystery where the facts didn’t add up and the truth should exist somewhere, buried in the internet.

Justin has also posted a more in-depth analysis of the evidence.

Dude, where’s my Google?

I’m not much of a conspiracy theorist, but I have to take notice when events coincide. As many people noticed yesterday, my Hot Abercrombie Chick post had a quick rise to prominence on Google, ranking at #1 for the query "Abercrombie Chick" and #2 for "Hot Abercrombie Chick." I was shocked to find that this page was no longer in any of the results for these searches. On the other hand, the post on Wizbangblog still maintains its original rank on both queries. In fact, my site has been removed from Google’s index while his remains. Check the screengrabs (click for more detail):

Search for my post
Search for response post

Google’s policy about page removal is quite explicit:

Except in instances involving legal issues or spam, Google’s policy for removing a page from our index requires that we obtain the permission of that page’s webmaster. This prevents competitors from sabotaging each other’s listings.

I’m assuming my page is not spam. Without any emails from Google, I can only assume that it has been removed for legal reasons. Has Amanda or some other party emailed Google to remove my allegations? I’ve sent an email to Google to inquire.

April 23: The page is back in the index as of sometime this morning. Looking back on the whole incident, it’s pretty amazing that Google had the page in the index within 2 days, along with its PageRank. They have been caching weblogs within a few hours but this is the first time I’ve seen an individual weblog post go online in that amount of time. I now render this post defunct.

“Amanda Doerty”

Amanda Doerty is a name like no other. In fact, I don’t think it’s a real name at all. After becoming interested in Hot Abercrombie Chick I decided to do a little follow-up research to see what the Internet had to offer on our enigmatic blogstress. Of course a Google search for her real name, Amanda Doerty, returns just about nothing. It appears that her debut on the web was her first post to Hot Abercrombie Chick. Nothing on Usenet, but then again I’d be surprised if a 19 year old had heard of such antiquated things.

In her first post she mentions two friends, one Mr. Daniel Zeigenbein and another Sebastian Bach (not the lead singer of Skid Row). While there’s no mention of Amanda on anything related to Sebastian, I did find a glowing review of HAC by Daniel which confirms that not only do they know each other, but they’re acquainted through a mutual college friend. Daniel is wearing a Vassar cap on his website and Amanda claims her residence is Poughkeepsie, NY, so they must be chums through Vassar, right?

So one would think. I called the Vassar Registrar’s office to get to the bottom of this. It turns out that Daniel Zeigenbein was once a student there, never graduated, and apparently no longer is. Amanda Doerty on the other hand, well, they’ve never heard of her.

So I’m assuming that HAC and Amanda Doerty are in some way a creation of Zeigenbein and/or friends.

Game my system and pay the consequences, beeatch.

7:30pm: Foster reports that Amanda has a vanilla Movable Type site set up.

8:11pm: Amanda responds to these questions, revealing her name is a pseudonym. I have responded in her comments.

April 22: Jason Carter has uncovered an original Hot or Not post circa 2002 (click 1083 votes) for a familiar face named Ashley. Props to the hive mind.

Hot Abercrombie hoax

Anyone watching Blogdex over the past few months knows Hot Abercrombie Chick, a.k.a. Amanda Doerty. This site has popped into the top 10 at least 5 times since I started noticing. I became interested last week when I saw it for the third or fourth time, and delved a little deeper. As it would turn out, all of the sites that Amanda was posting to were weblogs that posted their most recent comments on their front page, hence exposing comments to Blogdex. In other words, Hot Abercrombie Chick has been gaming Blogdex.

The notion that this attractive college freshman was spending all of her time trolling weblogs looking for exposed weblogs seemed implausible to me. But looking through the comments themselves, it appeared that most, if not all, were at least marginally on topic. In addition, Abercrombie Chick was interacting with hundreds of commenters on her own site, and doing quite a good job of it. A person this prolific would have to be unemployed and completely focused, which anyone who has been unemployed knows is impossible.

Something was amiss, and I had to prove that Hot Abercrombie Chick was either a) a totally different girl, b) a guy or c) some team of people creating an identity. And I was devoted to outing this fraud. It turns out that Julia Set beat me to it:

Just received an inside tip that the recently famous Hot Abercrombie Chick is really a male college student capitalizing on cute pictures of his girlfriend (previously unbeknownst to her) in a rush of “beggars” trackbacks. In retrospect, it’s pretty obvious that he is quite the player. Over the course of the last couple of months, “Mr. Abercrombie” has played every text-book trick for raising his popularity on the blogosphere.

Unfortunately there’s still no reference to this indictment on Amanda’s site, and still very little evidence beyond Julia’s post that this inside tip is true. Recent links to the site on Blogdex reveal that someone else is using Amanda’s tactics to call her out ("Comment Spamming Bitch Riding High On Blogdex!").

And I feel like a tool… she’s a man, duh!

Continue reading

Redesign

After staring at my site for 2 years, and needing some serious procrastinatory work, I decided to redesign overstated.net. My goal was to remove as much of the cruft as possible from the interface and focus on the readability. My main inspiration was Tufte, hence the use of whitespace in layout and small caps for various typographic functions. There’s also quite a bit of Dean Allen in there too because he’s the man when it comes to weblog layout.

The image in the header of the page was ganked from a 17th century map by Fredrick de Wit. I had originally intended to use one of Jesuit scientist Athanasius Kircher’s engravings, but this king was too badass to pass up.

Let me know what you think and, of course, if you encounter any bugs.

Low threshold links

Sometime around the beginning of this year, I realized that I was encountering way too many sites to write an individual weblog post about each and every one. My threshold for what to post was way too high to catch many of the sites I was laughing at, engaged by, and sending on to my friends. Instead of losing these links thanks to my imperfect brain, I decided like many others to create a separate weblog just for the ephemeral sites that didn’t deserve discussion.

And so my oddments was born.

Ever since, I’ve become obsessed with finding more of them. They’re like crack. The part I love best is that when I’m truly bored, hitting reload in my RSS reader almost always turns up something. And more than anything else, these new lists facilitate the rapid spread of memes across the universe.

Following is a list of my favorite low-threshold link sites, roughly in the order that I discovered them. Send me an email or post a comment with yours and I’ll add you to the list.

Continue reading

Weblogs and churn rate

The first question that every journalist asks about weblogs, how many are there, has been a source of constant debate over the past year. I was cited in the Economist with the number 500,000, which prompted a response, as well as a number of new efforts for estimating this number:

Blogcensus is a funded project crawling and classifying content as weblog or not weblog. Pages identified as weblogs are then categorized by their native language using simple heuristics. This project is the sole work of Maciej Ceglowski.

Blogcount is a self-proclaimed aggregator of other data sources. The site is making press releases based on the management reports of centrally hosted weblogs/journals (i.e. Blogger, LiveJournal, etc.). Using the data collected by Blogcensus, the original numbers are adjusted to account for international and non-hosted entities.

A word to the wise: online communities can appear much more active than they actually are (and I’ve got some data to show it!).

Continue reading