Introduction

Welcome! The long-term goal of this research is to construct a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, we have explored techniques that classify URLs based on their lexical and host-based features, and that use online learning to process large numbers of examples and adapt quickly as malicious URLs evolve over time.
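To make the idea concrete, here is a minimal sketch of online learning over lexical URL features. It is illustrative only: the tokenization rule, the names lexical_features and OnlinePerceptron, and the choice of a plain perceptron are assumptions made for this example, not a description of the classifiers or features used in our experiments. Host-based features are omitted.

```python
import re


def lexical_features(url):
    """Return the set of lowercase tokens obtained by splitting the URL
    on non-alphanumeric characters (a binary bag-of-words; hypothetical rule)."""
    return {tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok}


class OnlinePerceptron:
    """Mistake-driven linear classifier over sparse binary features.

    A stand-in for an online learner; labels are +1 (malicious) or -1 (benign).
    """

    def __init__(self):
        self.weights = {}  # feature token -> weight; absent features weigh 0

    def score(self, features):
        return sum(self.weights.get(f, 0.0) for f in features)

    def predict(self, features):
        return 1 if self.score(features) > 0 else -1

    def update(self, features, label):
        """Predict first, then adjust weights only if the prediction was wrong.
        Returns the prediction made before the update."""
        prediction = self.predict(features)
        if prediction != label:
            for f in features:
                self.weights[f] = self.weights.get(f, 0.0) + label
        return prediction


if __name__ == "__main__":
    # Made-up example stream: (URL, label) pairs arrive one at a time.
    stream = [
        ("http://example.com/login/verify-account", 1),
        ("http://example.org/docs/index.html", -1),
    ]
    model = OnlinePerceptron()
    for url, label in stream:
        model.update(lexical_features(url), label)
    print(model.predict(lexical_features("http://example.net/login")))
```

Because each example is seen once and the model changes only on mistakes, memory grows with the number of distinct features rather than the number of examples, which is what makes this style of learner attractive for large, evolving streams.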

Our work is ongoing, so please feel free to check back for updates. In the meantime, you can download an anonymized version of our ICML-09 data set as described in the Data Sets section.

Publications

People

Data Sets

An anonymized 120-day subset of our ICML-09 data set is available from the following links:

The data set consists of about 2.4 million URLs (examples) and 3.2 million features. If you use the data set in published work, please cite the ICML-09 paper that introduced it.


Description of Data (Matlab)

The file url.mat contains variables which we describe as follows:

Description of Data (SVM-light)

Uncompressing the archive url_svmlight.tar.gz yields a directory url_svmlight/ containing the following files:
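For readers unfamiliar with the format, SVM-light files store one example per line as a label followed by sparse index:value pairs. The sketch below parses a single such line; the function name parse_svmlight_line and the sample line are hypothetical, and the sketch assumes plain integer feature indices (no qid: fields). It does not assume any particular file name inside the archive.

```python
def parse_svmlight_line(line):
    """Parse one SVM-light line of the form '<label> <index>:<value> ...'.

    Returns (label, {index: value}), or None for blank or comment-only lines.
    Assumes integer feature indices; text after '#' is treated as a comment.
    """
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    parts = line.split()
    label = int(float(parts[0]))  # accepts '+1', '-1', '1.0', and so on
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


if __name__ == "__main__":
    # Made-up line for illustration; not taken from the data set.
    print(parse_svmlight_line("+1 4:0.5 17:1 # sample"))
```

Storing each example as a dictionary keeps only the nonzero entries, which matters when the feature space has millions of dimensions but each URL activates only a small fraction of them.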

Affiliations

UCSD Computer Science and Engineering