The continued growth of the Internet has seen a large increase in the number of sites where content is contributed by general users, not just by the owners or managers of the site. One category of sites has users “tag”, or categorize, material that is on the site. (Del.icio.us is one example.) These sites can be particularly vulnerable to spammers who post advertising material, because they have traditionally weighted a user's tags in proportion to the number of items that user has tagged, possibly with more weight given to recent activity. So the spammers' strategy has been to look for the most popular tags, and then insert their advertisements “pre-marked” to match them.
Some new research, presented at the recent SIGIR conference in Boston and reported in the Technology Review, attempts to address this problem. A group of European researchers has developed and tested a new algorithm that attempts to rate users’ contributions for quality, and to identify the most important content. The approach is in many ways similar to the one that Google uses to rank search results:
… the algorithm evaluates popular users and popular content and declares expert users to be the ones who identify the most important content, while important content is that which is identified by the most expert users.
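The circular definition in that description (experts are defined by important content, and important content by experts) can be resolved by iterating until the scores settle. What follows is only a minimal sketch of that general idea, in the style of the well-known HITS link-analysis scheme; it is not the researchers' actual algorithm, and the function names, the data format, and the sample bookmarks are all made up for illustration.

```python
import math
from collections import defaultdict


def _normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {key: value / norm for key, value in scores.items()}


def rank_users_and_documents(bookmarks, iterations=50):
    """Hypothetical helper: bookmarks is an iterable of (user, document) pairs.

    Returns (expertise, importance), two dicts of relative scores.
    """
    docs_by_user = defaultdict(set)
    users_by_doc = defaultdict(set)
    for user, doc in bookmarks:
        docs_by_user[user].add(doc)
        users_by_doc[doc].add(user)

    # Start with everyone (and everything) equal.
    expertise = {user: 1.0 for user in docs_by_user}
    importance = {doc: 1.0 for doc in users_by_doc}

    for _ in range(iterations):
        # A document is important if expert users bookmarked it.
        importance = _normalize({
            doc: sum(expertise[user] for user in users)
            for doc, users in users_by_doc.items()
        })
        # A user is expert if they bookmarked important documents.
        expertise = _normalize({
            user: sum(importance[doc] for doc in docs)
            for user, docs in docs_by_user.items()
        })
    return expertise, importance


# Made-up sample data: three ordinary users and one isolated spammer.
sample = [
    ("alice", "a"), ("alice", "b"),
    ("bob", "a"),
    ("carol", "a"), ("carol", "b"),
    ("spammer", "ad"),
]
expertise, importance = rank_users_and_documents(sample)
```

In this toy data set, the spammer's advertisement is bookmarked by nobody else, so its score shrinks toward zero with each round, no matter how many times the spammer repeats the tags. A spammer who also bookmarks genuinely popular pages would not be caught by this simple version.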
The algorithm also attempts to identify “trend-setting” users — those who on average identify quality content earlier than most other users. These “discoverers” are especially unlikely to be spammers, according to the researchers.
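To make the “discoverer” idea concrete, here is a rough sketch of one way such a score could be computed. It is an assumption of mine, not the formula from the research: for each bookmark, it measures what fraction of the document's other bookmarkers arrived later, and then averages that fraction per user, weighted by a document quality score supplied by the caller. The names `discoverer_scores` and `importance`, and the sample data, are hypothetical.

```python
from collections import defaultdict


def discoverer_scores(bookmarks, importance):
    """Hypothetical helper.

    bookmarks:  iterable of (user, document, timestamp) triples,
                at most one bookmark per user per document.
    importance: dict mapping document -> quality weight.
    Returns a dict mapping user -> score between 0.0 and 1.0.
    """
    bookmarks = list(bookmarks)  # we need to pass over the data twice

    times_by_doc = defaultdict(list)
    for _user, doc, timestamp in bookmarks:
        times_by_doc[doc].append(timestamp)

    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for user, doc, timestamp in bookmarks:
        others = len(times_by_doc[doc]) - 1
        if others == 0:
            continue  # nobody else bookmarked it, so "early" means nothing
        later = sum(1 for t in times_by_doc[doc] if t > timestamp)
        weight = importance.get(doc, 0.0)
        weighted_sum[user] += weight * later / others
        total_weight[user] += weight

    return {
        user: weighted_sum[user] / total_weight[user]
        for user in weighted_sum
        if total_weight[user] > 0
    }


# Made-up sample data: the spammer always arrives last.
sample = [
    ("alice", "a", 1), ("bob", "a", 2), ("carol", "a", 3), ("spammer", "a", 4),
    ("alice", "b", 1), ("spammer", "b", 2),
]
scores = discoverer_scores(sample, {"a": 1.0, "b": 0.5})
```

The point of the sketch is that a spammer who chases tags that are already popular is late by construction, so a score of this kind stays low for them while rewarding users who found the same pages first.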
The developers claim that this approach does a much better job of distinguishing between quality content and spam than any existing system. They have performed tests using actual data from the del.icio.us site:
The researchers tested their algorithm using data from Delicious, analyzing over 71,000 Web documents, 0.5 million users, and 2 million shared bookmarks.
As I’ve mentioned before, there is an ongoing arms race between the spammers and the folks whose job it is to keep them out. This research is interesting because it relies on information, such as who found good content early and which other users agree, that I think will be considerably harder for the spammers to counterfeit.