Two tools were integrated to provide the ability to supply HTML documents and run machine learning algorithms over them. The effectiveness of each machine learning algorithm is used as a basis to determine its accuracy across a different yet similar taxonomy. For example, the ODP (Open Directory Project) provides a taxonomy that has been populated with resources by volunteer classifiers. Similarly, CNN has established its own taxonomy. Given these two example taxonomies, it is quite likely that some overlap exists; for example, both have a Sports category containing subcategories such as Baseball, Football, and Basketball, to name just a few.
A series of machine learning algorithms will be evaluated to determine their ability to be trained with resources from a subset of one taxonomy A that is determined, via human expertise, to overlap some subset of a second taxonomy B. Classification accuracy will then be based on the ability of the algorithms to properly classify resources from B into their corresponding categories in A. In addition, classifier committees will be evaluated to determine whether, and by what factor, classification accuracy can be improved.
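As a rough illustration of the committee idea, the simplest combination rule is a majority vote over the labels emitted by the individual classifiers. The sketch below computes such a vote with standard shell tools; the three labels are hypothetical stand-ins for the output of three trained classifiers on a single document.

```shell
# Hypothetical labels assigned to one document by three classifiers;
# the committee's decision is the most frequent label (majority vote).
printf 'Sports\nSports\nWorld\n' |
  sort | uniq -c | sort -rn |    # count the votes, most frequent first
  head -n 1 | awk '{ print $2 }' # keep only the winning label
# prints: Sports
```

More sophisticated committees would weight each vote by the classifier's confidence or its measured accuracy, but the majority vote is the baseline against which such schemes are usually compared.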
Rainbow is an executable program that performs document classification. While designed primarily for classification by naive Bayes, it also provides TFIDF/Rocchio, Probabilistic Indexing, and K-nearest-neighbor methods.
Rainbow's underlying classification code comes from the Bow library. Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling, and information retrieval programs. The current distribution includes the library as well as front-ends for document classification (rainbow), document retrieval (arrow), and document clustering (crossbow). For more information about obtaining the source and citing its use, see the Bow home page.
Rainbow first indexes a file by turning the file's stream of characters into tokens, a process called tokenization or "lexing". It supports several options for tokenizing text; for example, the --skip-headers (or -h) option causes rainbow to skip newsgroup or email headers before beginning tokenization. Once indexing is performed and a model has been archived to disk, rainbow can perform document classification. Statistics from a set of training documents determine the parameters of the classifier; classification results for a set of testing documents are then output.
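A typical two-step session might look like the following sketch. It assumes the rainbow binary is installed and on the path; the corpus and model directory names are hypothetical, and only the --skip-headers option is documented above, so the exact spelling of the other options should be checked against the local installation's --help output.

```shell
# Step 1: index training documents, one directory per category
# (directory names hypothetical). --skip-headers strips newsgroup/email
# headers before tokenization; -d names the directory where the model
# is archived to disk.
rainbow -d ./model --skip-headers --index corpus/Baseball corpus/Football

# Step 2: with the model archived in ./model, classify a new document;
# rainbow reports a score for each category in the model.
rainbow -d ./model --query=unseen/article.html
```

The separation of the two steps matters for the evaluation described earlier: the model can be trained once on resources from taxonomy A and then queried repeatedly with resources drawn from taxonomy B.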
Learning algorithms used:
GNU Wget is a freely available network utility that retrieves files from the World Wide Web using HTTP and FTP. This tool was integrated into our project to download files for training, testing, and classification. It has many useful features: