Crowdsourcing is thriving as a low-cost outsourcing mechanism. Unfortunately, many Web service abusers are using crowdsourcing to mount attacks on popular Web services such as Google, Yahoo and Facebook. Thus, crowdsourcing sites and Web service providers need ways to control the influx of abuse-related jobs. In this project, we use topic modeling to identify abuse-related job postings on Freelancer.com, a popular site for crowdsourcing.
In our AISec-11 paper, we explored the use of latent Dirichlet allocation (LDA) on our Freelancer data set. Our analysis suggests that LDA can provide an effective and largely automated tool for monitoring abuse jobs.
Going forward, we are applying more sophisticated topic models to the data set. Please feel free to check back for updates. The Freelancer data set is available in the Data Set section.
Do-kyum Kim, Marti Motoyama, Geoffrey M. Voelker and Lawrence K. Saul,
Topic Modeling of Freelance Job Postings to Monitor Web Service Abuse
Proceedings of the 4th ACM workshop on Artificial Intelligence and Security (AISec), Chicago, IL, Oct 2011.
From the job postings on Freelancer.com, we built a term-document matrix of 27,600 terms and 355,386 documents. For the preprocessing procedures, please refer to our AISec-11 paper. If you use the data set in published work, please cite the AISec-11 paper.
Description of the Dictionary File
Each line of the file contains a term and its document frequency, separated by a tab character.
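As an illustration, the dictionary format described above could be read with a few lines of Python. This is a minimal sketch, not code shipped with the data set: the function name parse_dictionary and the sample terms and counts are made up for the example, and the only assumption about the file is the one stated above (one term and one integer document frequency per line, separated by a tab).

```python
def parse_dictionary(lines):
    """Parse dictionary lines of the form '<term>\t<document frequency>'.

    Returns a list of (term, document_frequency) tuples in file order.
    """
    entries = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines, e.g. at the end of the file
        # Split on the last tab so the frequency is always the final field.
        term, freq = line.rsplit("\t", 1)
        entries.append((term, int(freq)))
    return entries


# Hypothetical sample lines; the real file would be opened with open(path).
sample_lines = ["captcha\t350\n", "account\t1200\n"]
entries = parse_dictionary(sample_lines)
```

To read the actual file, pass an open file object instead of the sample list, since iterating over a file yields its lines.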
Description of the Term-document Matrix Files
The first line of the file contains the number of documents, and the second line has the number of terms.
Next, each document is represented as:
# of distinct terms in the document
index of the first term: term frequency of the first term
index of the second term: term frequency of the second term
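The layout above can be read as a stream of integers: the document count, the term count, and then for each document a count of distinct terms followed by that many index/frequency pairs. The sketch below takes that view. It is an illustration under stated assumptions rather than an official loader: the function name parse_term_document_matrix is hypothetical, all values are assumed to be non-negative integers, and because the exact line breaks and spacing around the colon are not specified here, the parser ignores whitespace and colons and relies only on the order of the numbers. It also makes no assumption about whether term indices start at 0 or 1.

```python
import re


def parse_term_document_matrix(text):
    """Parse a term-document matrix given as text.

    Returns (n_docs, n_terms, docs), where docs is a list with one dict per
    document mapping term index -> term frequency.
    """
    # Treat the file as a flat sequence of integers; whitespace and ':' are
    # only separators, so one-line and multi-line document layouts both work.
    numbers = [int(tok) for tok in re.findall(r"\d+", text)]
    n_docs, n_terms = numbers[0], numbers[1]

    pos = 2
    docs = []
    for _ in range(n_docs):
        n_distinct = numbers[pos]  # number of distinct terms in this document
        pos += 1
        doc = {}
        for _ in range(n_distinct):
            term_index, term_freq = numbers[pos], numbers[pos + 1]
            doc[term_index] = term_freq
            pos += 2
        docs.append(doc)

    if pos != len(numbers):
        raise ValueError("unexpected trailing data after the last document")
    return n_docs, n_terms, docs


# Tiny made-up example: 2 documents over 3 terms.
example = "2\n3\n2 0:1 2:4\n1 1:2\n"
n_docs, n_terms, docs = parse_term_document_matrix(example)
```

This version reads the whole file into memory, which keeps the sketch short; for the full matrix a sparse matrix type or a streaming reader may be a better fit.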