Welcome! The long-term goal of this research is to construct a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, we have explored techniques that classify URLs based on their lexical and host-based features, as well as online learning methods that can process large numbers of examples and adapt quickly as malicious URLs evolve over time.

Our work is ongoing, so please feel free to check back for updates. In the meantime, you can download an anonymized version of our ICML-09 data set as described in the Data Sets section. You can also download sample code for the factor analysis approach we described in our AISTATS-10 paper in the Code section.



Data Sets

An anonymized 120-day subset of our ICML-09 data set is available from the following links:

The data set consists of about 2.4 million URLs (examples) and 3.2 million features. If you use the data set in published work, please cite the ICML-09 paper in which it was introduced and first described.

Description of Data (Matlab)

The file url.mat contains variables which we describe as follows:

Description of Data (SVM-light)

Uncompressing the archive url_svmlight.tar.gz will yield a directory url_svmlight/ containing the following files:
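For readers who have not worked with the SVM-light format before, here is a minimal Python sketch of reading it. It assumes the standard sparse layout of one example per line, written as `<label> <index>:<value> ...`. The function names and the sample lines are hypothetical illustrations, not part of the distributed archive, and the sketch does not handle `qid:` tokens.

```python
from typing import Dict, Iterable, Iterator, Tuple

# One example: (label, sparse feature map from index to value).
Example = Tuple[int, Dict[int, float]]


def parse_svmlight_line(line: str) -> Example:
    """Parse one line of the form '<label> <index>:<value> ...'."""
    # Drop an optional trailing '# comment'.
    body = line.split("#", 1)[0].strip()
    if not body:
        raise ValueError("line contains no example")
    tokens = body.split()
    label = int(float(tokens[0]))  # accepts '+1', '-1', '1.0', ...
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def read_svmlight(lines: Iterable[str]) -> Iterator[Example]:
    """Yield examples one at a time, skipping blank and comment-only lines."""
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            yield parse_svmlight_line(line)


# Made-up sample lines (not taken from the data set).
sample = [
    "+1 4:0.5 17:1 3200000:1\n",
    "-1 2:1 17:0.25 # made-up comment\n",
]
examples = list(read_svmlight(sample))
```

Because `read_svmlight` is a generator, an open file object can be passed in place of `sample`, so examples are streamed one at a time rather than loaded into memory all at once. That suits the online learning setting, and with millions of features a sparse dictionary per example is far smaller than a dense vector.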


Code

Here is our code for approximating the covariance matrix using the factor analysis-based approach we introduced in our AISTATS-10 paper. It also contains code to run synthetic experiments over different online algorithms. [Download]


UCSD Computer Science and Engineering