Support Vector Machines for Text Management
Jimi Shanahan, Clairvoyance Corporation
Support vector machines (SVMs) are a general-purpose family of machine
learning algorithms for classification and regression that were
introduced by Vapnik and his colleagues in 1992. Generic SVMs perform
very well on a variety of learning problems, including handwritten
character recognition, face detection and, most recently, text
categorization.
Practical applications of SVMs to text can be found in areas such as
knowledge management, process improvement, CRM (Customer Relationship
Management), text mining, alerting, intelligence and law enforcement,
spam and porn filtering, and bioinformatics. Most fall under the
generic umbrellas of text classification and adaptive filtering.
Research on customizing SVMs for text processing (ranging from
classification to clustering) has exploded in the past five years.
This has resulted in a plethora of new approaches and many interesting
and commercially viable applications. However, the concrete steps
required to reduce SVM theory to practice are often not obvious or
clearly explained in the research literature.
This tutorial provides a self-contained and systematic exposition of
the following key concepts in SVMs:
- Support Vector Machines
  - Linear Separators
  - Primal SVMs
  - Dual-space SVMs
  - Hard- and Soft-Margin SVMs
- Kernels
  - Linear, Polynomial, RBF
  - String Kernels
  - Latent Semantic Kernels
- SVM Learning Algorithms
  - Perceptron
  - Kernel Adatron
  - SMO learning algorithm (and improvements)
  - Quadratic Programming
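As a rough illustration of the kernels named in the outline above, the
following minimal Python sketch computes linear, polynomial, and RBF
kernel values for two small document vectors. The function names, the
default parameters (degree, coef0, gamma), and the toy term-weight
vectors are assumptions chosen for illustration only; they are not
taken from the tutorial materials.

```python
import math

def linear_kernel(x, z):
    # Plain dot product between two equal-length feature vectors.
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    # (x . z + coef0) ** degree; degree and coef0 are illustrative defaults.
    return (linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # exp(-gamma * ||x - z||^2); gamma is an illustrative default.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical term-weight vectors for two short documents.
doc_a = [1.0, 0.0, 2.0]
doc_b = [0.0, 1.0, 2.0]

print(linear_kernel(doc_a, doc_b))
print(polynomial_kernel(doc_a, doc_b))
print(rbf_kernel(doc_a, doc_b))
```

Each function returns a similarity score for a pair of vectors; in a
dual-space SVM, such a function stands in for the dot product between
training examples.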
This tutorial will show how generic SVMs can be customized for text
processing tasks such as classification and filtering. Topics covered
here will include Uneven Margin Based Learning, Thresholding SVMs,
Transductive SVMs, and Shrinkage.
SVM-based solutions to actual text-processing problems will be
described and compared to other more traditional approaches.
Discussion of techniques will be supported by live demonstrations.
After attending this tutorial, you should:
- Understand the fundamentals and the important ideas behind SVMs and
kernels, with the help of illustrative examples in the domain of
text classification and filtering.
- Have an in-depth understanding of how to implement an SVM learning
algorithm or use a publicly available SVM package for the tasks of
text classification and filtering.
- Have technical insight into technologies that you may see in
commercial applications such as knowledge management, process
improvement, CRM (Customer Relationship Management), text mining,
alerting, intelligence and law enforcement, spam and porn filtering,
and bioinformatics.
- Understand the intellectual property surrounding SVM technology.
- Be familiar with the exciting and promising areas of research in this
domain.
Audience
This tutorial is especially targeted at people who are interested in
implementing support vector machines for text processing tasks; people
who are responsible for understanding implications of using such
systems; people who want to understand and extend the core concepts;
and people who want to understand the intellectual property
surrounding the core algorithms.
About the Instructor
Dr. James G. Shanahan is Senior Research Scientist at Clairvoyance
Corporation where he heads the Filtering and Machine Learning Group.
At Clairvoyance Corp, he is actively involved in developing
cutting-edge information management systems that harness information
retrieval, linguistics, text/data mining and machine learning. Prior
to joining Clairvoyance, he was a research scientist at Xerox Research
Center Europe (XRCE), Grenoble, France, where, as a member of the
Co-ordination Technologies Group, he developed and patented new
document-centric approaches to information access (known as Document
Souls). Before joining Xerox, he completed his PhD in 1998 at the
University of Bristol in probabilistic fuzzy approaches to machine
learning. He has extensive industrial experience both at the AI group
at Mitsubishi in Tokyo, Japan, and at the satellite-scheduling group
of the Iridium project at Motorola, Phoenix, AZ (over 5 years).
Dr. Shanahan has published four books in the area of machine learning,
including a book on knowledge discovery, "Soft Computing for
Knowledge Discovery: Introducing Cartesian Granule Features". In
addition, he has authored over 40 research publications and holds twelve
patents. He is on the editorial board of the Journal of Automation and
Soft Computing. He has served on the program committees of numerous
international conferences (e.g., ACM SIGIR, ACM CIKM, IEEE FUZZ-IEEE)
and workshops, and is an active journal reviewer. He was
co-organizer of the AAAI Spring Symposium, EAAT, on Affect and Opinion
Modeling (Stanford, 2004). He is a member of IEEE and ACM.
His research interests include Information Management Systems, Text
Retrieval and Filtering, Support Vector Machines, Probabilistic Learning
(Expectation Maximisation, Naïve Bayes, Bayesian Networks, HMMs, Language
Modeling), Clustering, and Uncertainty Modeling.