ACM Thirteenth Conference on
Information and Knowledge Management (CIKM)
CIKM and Workshops 2004
Tutorials

November 8, 2004 Morning (8:30am - 12pm)
T2: Text Analytics: Theory and Practice (Feldman)
T4: Scalable Distributed Data Structures (Litwin and Schwarz)

November 8, 2004 Afternoon (1:30pm - 5pm)
T5: Sensor Data Mining: Similarity Search and Pattern Analysis (Faloutsos)
T6: Support Vector Machines for Text Management (Shanahan)

Text Analytics: Theory and Practice
Ronen Feldman, ClearForest Corporation

The information age has made it easy to store large amounts of data. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, while the amount of data available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes. Text analytics is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, NLP, IR, and knowledge management. Text analytics involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, techniques to analyze these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results.

In this tutorial we will present the general theory of text analytics and will demonstrate several systems that use these principles to enable interactive exploration of large textual collections. We will present a general architecture for text analytics and will outline the algorithms and data structures behind the systems. Special emphasis will be given to efficient algorithms for very large document collections, tools for visualizing such collections, the use of intelligent agents to perform text analytics on the Internet, and the use of information extraction to better capture the major themes of the documents. The tutorial will cover the state of the art in this rapidly growing area of research. Several real-world applications of text analytics will be presented.

Audience

The tutorial should be of interest to practitioners from Data Mining, Bioinformatics, NLP, IR, and Knowledge Management, as well as the general AI audience interested in this fast-growing research area.

About the Instructor

Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel, and the Director of its Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, his M.Sc. in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in NY. He was an Adjunct Professor at the NYU Stern School of Business. He is the founder, president, and chief scientist of ClearForest Corporation, a NY-based company specializing in the development of text analytics tools and applications.

Scalable Distributed Data Structures
Witold Litwin, University Paris 9
Thomas Schwarz, Santa Clara University

Commodity computers interconnected through high-speed networks are becoming basic hardware. Such configurations are called multicomputers, clusters, superservers, and, more recently, grid-computing systems and P2P systems. They include dozens, hundreds, or even thousands of clients and servers. Their cumulative resources are impressive: dozens of GBytes of distributed RAM and TBytes of disk accessible for GMips of highly parallel and distributed processing. They offer potentially unbeatable performance and price/performance ratios, opening up new perspectives for applications. Major hardware and software makers present multicomputers as the next step in their business strategy. The technology is also seen as the next step for the Internet, which should evolve from a data-providing utility today into a computing utility. In the US, it is the technology recommended at the highest governmental level for keeping the lead over the rest of the world in the next millennium. One also foresees very large multicomputers, grid computing, and P2P computing as new tools for academia, e.g., a 10,000-node multicomputer for Stanford University within 10 years.

Multicomputers need new system software that takes full advantage of the distributed RAM and of parallel and distributed processing on multiple CPUs. One especially needs new data structures, spanning multiple servers and scaling to as many sites as needed, transparently for the application. Such structures should preferably reside in distributed RAM for processing, providing access performance unattainable by disk files. They should be accessible to multiple autonomous clients, including mobile ones. They should not require any centralized data-access computation or directories, to avoid hot spots. Finally, they should be highly available, meaning they should survive failures of storage nodes.

One solution towards this goal is a new class of data structures called Scalable Distributed Data Structures (SDDSs). First proposed in 1993, they have become well known to the research community. SDDSs gave rise to an important research effort, materialized in several algorithms, papers, and implementations. It was shown that SDDS files can scale to thousands of sites and terabytes in distributed RAM, with constant access performance and key-search times under a millisecond. Multi-key searches requiring an hour or so in a traditional file, e.g., a k-d file, should succeed in less than a second in an SDDS file. All these properties should be of prime importance for applications, especially in the DBMS design arena. They open new perspectives for VLDB design, for multimedia databases, for real-time and high-availability databases, for decision support systems, and for high-performance computing in general.

The tutorial will introduce the attendees to the SDDS technology. We will start with the general principles and an overview of the major research projects, especially SDDS-2000 at U. Paris 9. We will then discuss design, performance, and application issues of the main families of SDDS algorithms known to date. These are specifically:

  • Hash-partitioned SDDSs. These structures offer scalable hash partitioning, generalizing the static hash partitioning popular with the current generation of parallel DBMSs. They extend the more traditional dynamic hash data structures, especially the popular linear hashing and dynamic hashing, to network multicomputers. These are typically networks of Wintel computers, or specific configurations such as the 1024-node Windows NT cluster at NCSA, as well as switched multicomputers, e.g., the Parsytec with 128 PowerPCs or the IBM SP-2 supercomputer with up to 512 processors. Versions of LH* are now patented by HP and IBM. (A minimal sketch of linear-hashing-style addressing appears after this list.)
  • Range-partitioned SDDSs. These extend the traditional ordered data structures, B-trees or binary trees, as well as the static range partitioning also popular with the current generation of parallel DBMSs, to multicomputers. Access performance for large files is greatly improved compared to more traditional implementations.
  • Multi-key SDDSs. These extend the traditional k-d data structures to multicomputers. The performance of multi-attribute search improves by orders of magnitude for larger files, as compared to traditional implementations.
  • High-availability SDDSs. These new structures have no known counterpart among traditional data structures. They transparently survive failures of n >= 1 data servers. This capability is of basic importance in the Web environment, which requires 24/7 serviceability; otherwise a failure can be expensive, e.g., $4B in lost market value and operational losses of $25-30M. Some high-availability SDDSs apply principles of record mirroring or of record striping, enhanced to support file scalability. Others use a new principle of record grouping. This technology provides much better storage efficiency than mirroring and, unlike striping, also allows for parallel scans and function shipping, since the records remain entire. The high availability results from various methods of parity calculus. These include the so-called grouping functions (patent pending by IBM) and a specific version of the Reed-Solomon erasure-correcting codes. This version introduces the so-called logarithmic parity matrix in the Galois field GF(2^16), with one column and one row of 1's, which makes it, to the best of our knowledge, the fastest RS matrix known. We will present in particular performance measures on 100 Mb/s and 1 Gb/s Ethernet supporting typical Wintel PCs as SDDS clients and servers.
  • Scalable distributed relational tables. This topic covers the SDDS schemes one may apply to a relational table so as to make it scale over more and more nodes transparently for the application. We will present the prototype termed SD-SQL Server, running on top of SQL Server. It manages (transparently) the scalable distributed tables, using a distributed partitioned view as the SDDS client view. It is the only DBMS with this feature we are aware of at present.
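To make the hash-partitioned family concrete, here is a minimal sketch of linear-hashing-style bucket addressing, the mechanism that LH*-type schemes generalize to multiple servers. It assumes a client image consisting of a file level i and a split pointer n (names chosen here for illustration); the actual LH* protocol adds possibly outdated client images, server-side forwarding, and image-adjustment messages, all omitted below.

```python
def lh_bucket(key: int, i: int, n: int) -> int:
    """Linear-hashing address computation, the core of LH*-style addressing.

    i : file level, giving hash functions h_i(k) = k mod 2**i
    n : split pointer, with 0 <= n < 2**i
    Buckets below the split pointer have already been split, so they
    are re-addressed with the next-level hash h_{i+1}.
    """
    a = key % (2 ** i)            # candidate bucket under h_i
    if a < n:                     # bucket a was already split
        a = key % (2 ** (i + 1))  # use h_{i+1} instead
    return a


# Example: a file at level i = 3 with split pointer n = 2 spreads keys
# over buckets 0 .. 2**3 + n - 1 = 9, growing one bucket per split.
if __name__ == "__main__":
    for k in (7, 10, 18, 25, 1024):
        print(k, "->", lh_bucket(k, i=3, n=2))
```

The point of the scheme is that the file grows one bucket (one server) at a time, while any client can compute an address locally from just the pair (i, n), which is what makes the partitioning scalable.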

We will in particular address the implementation issues of SDDSs. We will focus on the Wintel prototype termed SDDS-2004 (U. Paris 9), available for download at the CERIA Web site, http://ceria.dauphine.fr/. It has been installed and experimented with in several places. We will describe the experimental SDDS protocol and the software architecture of SDDS-2004. We will also address the original technique of algebraic signatures, built in for faster update processing and disk back-up. This technique is the only one known, to the best of our knowledge, to attain zero probability of collision (false positives). We will also present experimental performance measurements. These show in particular access performance reaching 30 µs per key search in distributed RAM, where SDDS-2004 keeps the (possibly very large) data for the search operations. This access time is about 300 times faster than access to a local disk, and will probably remain forever out of reach for mechanical devices.
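As a toy illustration of the algebraic-signature idea (not the SDDS-2004 implementation), the sketch below treats a record as a string of symbols in a Galois field and computes a weighted sum with powers of a generator. For brevity it works in GF(2^8) with the AES polynomial and generator 0x03; the scheme mentioned above works in GF(2^16) and differs in details, so treat this purely as the flavor of the technique.

```python
def gf256_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        high = a & 0x80
        a = (a << 1) & 0xFF
        if high:
            a ^= 0x1B
        b >>= 1
    return p


def algebraic_signature(data: bytes, alpha: int = 0x03) -> int:
    """sig(p_1 .. p_n) = XOR_i p_i * alpha^i over GF(2^8).

    Changing any single symbol is guaranteed to change the signature,
    which is what makes such signatures useful for quickly detecting
    updated records during update processing and disk back-up.
    """
    sig, power = 0, alpha                 # power holds alpha^i
    for symbol in data:
        sig ^= gf256_mul(symbol, power)
        power = gf256_mul(power, alpha)
    return sig


if __name__ == "__main__":
    r1 = b"contents of an SDDS record"
    r2 = b"contents of an SDDS record!"   # a one-symbol update
    print(hex(algebraic_signature(r1)), hex(algebraic_signature(r2)))
```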

We will also discuss the relationship of SDDS design principles to recent ideas in unstructured and structured P2P systems such as Freenet, Chord, and Pastry. We will show practical implications for information, knowledge, and data management. Finally, we will discuss the new perspectives that SDDSs open up for applications, especially DBMSs and the Web servers supporting the discussed applications, including the related problems of concurrency control, transaction management, and query optimization.

Audience

The tutorial is intended for researchers and practitioners designing basic software, middleware, new data management systems, and advanced applications, who are interested in the revolutionary possibilities that multicomputers bring to their fields.

About the Instructors

Dr. Witold A. Litwin has been Professor of Computer Science at the University Paris 9 (Dauphine) since 1990. He is also Director of the Centre d'Etudes et de Recherches en Informatique Appliquée (CERIA) of U. Paris 9. His research areas are multidatabase systems, interoperability, data structures, and scalable distributed multicomputer data structures. The multidatabase system design techniques and linear hash data structures he proposed in the 1980s are considered among the most important practical contributions to these domains. They are routinely taught in database courses and are present in major database systems and popular applications: the Netscape browser & servers, the Unify DBMS, LH-Server, and Microsoft's IIS (Internet Information Server), FrontPage, and Exchange. Dr. Litwin has been an invited lecturer and scientist at several universities. In the US, he taught database courses at, among others, UC Berkeley in 1992-94, Santa Clara University in 1991, and Stanford University in 1990-91. His course on Multidatabase Systems was broadcast on the Stanford Educational TV Network (SITN). He was a visiting scientist at prominent US research centers, including IBM Almaden Research Center in 1997 and 1998, part-time at Hewlett-Packard Palo Alto Laboratories between 1990 and 1994, and the U. of Maryland in 1989. Between 1980 and 1989 Dr. Litwin was Research Director at the Institut National de Recherche en Informatique et Automatique (INRIA, France) and an Expert for the ISO Committee on Open Systems. Dr. Litwin has written one hundred and fifty research papers, edited or contributed to eleven books, and has served as a Program Committee member of fifty international database conferences. He is an ACM Fellow and a member of IEEE. The publication list of Dr. Litwin is at http://ceria.dauphine.fr/witold.html.

Prof. Thomas Schwarz is currently Associate Professor in the Department of Computer Engineering at Santa Clara University, Santa Clara, CA. He holds a Ph.D. in Computer Science and Engineering from UC San Diego (1994) and a Dr. rer. nat. in non-associative algebra from FernUniversität Hagen (1984). His research addresses, among other goals, signature calculus and erasure-correcting codes, especially for high-availability scalable distributed data structures. Prof. Schwarz has (co)authored numerous publications in these areas.

Sensor Data Mining: Similarity Search and Pattern Analysis
Christos Faloutsos, Carnegie Mellon University

How can we find patterns in a sequence of sensor measurements (e.g., a sequence of temperatures, or water-pollutant measurements)? How can we compress it? What are the major tools for forecasting and outlier detection? The objective of this tutorial is to provide a concise and intuitive overview of the most important tools that can help us find patterns in sensor sequences. Sensor data analysis is becoming increasingly important, thanks to the decreasing cost of hardware and the increasing on-sensor processing abilities. We review the state of the art in three related fields: (a) fast similarity search for time sequences, (b) linear forecasting with the traditional AR (autoregressive) and ARIMA methodologies, and (c) non-linear forecasting for chaotic/self-similar time sequences, using lag-plots and fractals. The emphasis of the tutorial is on giving the intuition behind these powerful tools, which is usually lost in the technical literature, as well as on case studies that illustrate their practical use.

CONTENT AND OUTLINE

  • Similarity Search
    • why we need similarity search
    • distance functions (Euclidean, LP norms, time-warping)
    • fast searching (R-trees, M-trees)
    • feature extraction (DFT, Wavelets, SVD, FastMap)
  • Linear Forecasting
    • main idea behind linear forecasting
    • AR methodology (a minimal least-squares AR fit is sketched after this outline)
    • multivariate regression
    • Recursive Least Squares
    • de-trending; periodicities
  • Non-linear/chaotic forecasting
    • main idea: lag-plots
    • 'fractals' and 'fractal dimensions'
      • definition and intuition
      • algorithms for fast computation
    • case studies
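As a hedged illustration of the linear-forecasting part of the outline above, the sketch below fits an AR(p) model by ordinary least squares with NumPy and produces a one-step-ahead forecast. The data, the lag order, and the function names are invented for the example; the tutorial's own material (e.g., Recursive Least Squares for online updating, de-trending) is not reproduced here.

```python
import numpy as np


def fit_ar(x: np.ndarray, p: int) -> np.ndarray:
    """Fit an AR(p) model x[t] ~ a_1*x[t-1] + ... + a_p*x[t-p] by least squares."""
    # Each row of the design matrix holds the p values preceding x[t].
    X = np.array([x[t - p:t][::-1] for t in range(p, len(x))])
    y = x[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs


def forecast_next(x: np.ndarray, coeffs: np.ndarray) -> float:
    """One-step-ahead forecast from the last p observations."""
    p = len(coeffs)
    return float(np.dot(coeffs, x[-1:-p - 1:-1]))


if __name__ == "__main__":
    # Hypothetical sensor readings: a noisy periodic signal.
    t = np.arange(200)
    x = np.sin(2 * np.pi * t / 25) + 0.05 * np.random.randn(200)
    a = fit_ar(x[:-1], p=4)               # train on all but the last reading
    print("predicted:", forecast_next(x[:-1], a), "actual:", x[-1])
```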

Audience

Researchers who want to get up to speed with the major tools in time-sequence analysis, as well as practitioners who want a concise, intuitive overview of the state of the art.

About the Instructor

Christos Faloutsos is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award from the National Science Foundation (1989), three "best paper" awards (SIGMOD 94, VLDB 97, and a KDD 01 runner-up), and four teaching awards. He is a member of the executive committee of SIGKDD; he has published over 100 refereed articles and one monograph, and holds four patents. His research interests include data mining, fractals, indexing methods for multimedia and text databases, and database performance.

Support Vector Machines for Text Management
Jimi Shanahan, Clairvoyance Corporation

Support vector machines (SVMs) are a general-purpose suite of machine learning algorithms for classification and regression that were introduced by Vapnik and his colleagues in 1992. Generic SVMs provide excellent performance on a variety of learning problems, including handwritten character recognition, face detection, and, most recently, text categorization.

Practical applications of SVMs to text can be found in areas such as knowledge management, process improvement, CRM (Customer Relationship Management), text mining, alerting, intelligence and law enforcement, spam and porn filtering, and bioinformatics. Most fall under the generic umbrellas of text classification and adaptive filtering.

Research on customizing SVMs for text processing (ranging from classification to clustering) has exploded in the past five years. This has resulted in a plethora of new approaches and many interesting and commercially viable applications. However, the concrete steps required to reduce SVM theory to practice are often not obvious or clearly explained in the research literature.

This tutorial provides a self-contained and systematic exposition of the following key concepts in SVMs:

  • Support Vector Machines
    • Linear Separators
    • Primal SVMs
    • Dualspace SVMs
    • Hard and Soft SVMs
  • Kernels
    • Linear, Polynomial, RBF
    • String Kernels
    • Latent Semantic Kernels
  • SVM Learning Algorithms
    • Perceptron (a minimal kernel-perceptron sketch follows this list)
    • Kernel Adatron
    • SMO learning algorithm (and improvements)
    • Quadratic Programming
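To ground the learning-algorithms part of this outline, here is a minimal kernel perceptron in NumPy over a toy dense data set; the kernel choice, data, and function names are made up for illustration. It conveys the dual-space view that the kernel Adatron and SMO refine, but it is not an SVM trainer, and real text applications would work with sparse term vectors.

```python
import numpy as np


def rbf_kernel(X1: np.ndarray, X2: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Gram matrix of the RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)


def kernel_perceptron(K: np.ndarray, y: np.ndarray, epochs: int = 10) -> np.ndarray:
    """Dual (kernelized) perceptron: one mistake-count weight per training example."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            score = np.dot(alpha * y, K[:, i])   # current decision value for example i
            if y[i] * score <= 0:                # mistake (or undecided): update
                alpha[i] += 1.0
    return alpha


def predict(K_test_train: np.ndarray, alpha: np.ndarray, y: np.ndarray) -> np.ndarray:
    return np.sign(K_test_train @ (alpha * y))


if __name__ == "__main__":
    # Toy, linearly inseparable 2-D data (a stand-in for document vectors).
    X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
    y = np.array([1, 1, -1, -1])                 # XOR-style labels
    K = rbf_kernel(X, X)
    alpha = kernel_perceptron(K, y)
    print(predict(K, alpha, y))                  # reproduces y for this toy set
```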

This tutorial will show how generic SVMs can be customized for text processing tasks such as classification and filtering. Topics covered here will include Uneven Margin Based Learning, Thresholding SVMs, Transductive SVMs, and Shrinkage.

SVM-based solutions to actual text-processing problems will be described and compared to other more traditional approaches.

Discussion of techniques will be supported by live demonstrations.

After attending this tutorial, you should:

  • Understand the fundamentals and the important ideas behind SVMs and kernels, with the help of illustrative examples in the domain of text classification and filtering.
  • Have an in-depth understanding of how to implement an SVM learning algorithm or use a publicly available SVM package for the tasks of text classification and filtering.
  • Have technical insight into technologies that you may see in commercial applications such as knowledge management, process improvement, CRM (Customer Relationship Management), text mining, alerting, intelligence and law enforcement, spam and porn filtering, and bioinformatics.
  • Understand the intellectual property surrounding SVM technology.
  • Be familiar with the exciting and promising areas of research in this domain.

Audience

This tutorial is especially targeted at people who are interested in implementing support vector machines for text processing tasks; people who are responsible for understanding implications of using such systems; people who want to understand and extend the core concepts; and people who want to understand the intellectual property surrounding the core algorithms.

About the Instructor

Dr. James G. Shanahan is a Senior Research Scientist at Clairvoyance Corporation, where he heads the Filtering and Machine Learning Group. At Clairvoyance, he is actively involved in developing cutting-edge information management systems that harness information retrieval, linguistics, text/data mining, and machine learning. Prior to joining Clairvoyance, he was a research scientist at Xerox Research Centre Europe (XRCE), Grenoble, France, where, as a member of the Co-ordination Technologies Group, he developed and patented new document-centric approaches to information access (known as Document Souls). Before joining Xerox, he completed his PhD in 1998 at the University of Bristol on probabilistic fuzzy approaches to machine learning. He also has extensive industrial experience, both in the AI group at Mitsubishi in Tokyo, Japan, and in the satellite-scheduling group of the Iridium project at Motorola, Phoenix, AZ (over 5 years).

Dr. Shanahan has published four books in the area of machine learning, including a book on knowledge discovery -- "Soft computing for knowledge discovery: Introducing Cartesian granule features". In addition he has authored over 40 research publications and holds twelve patents. He is on the editorial board of the Journal of Automation and Soft Computing. He has been a member of the program committees of numerous international conferences (e.g., ACM SIGIR, ACM CIKM, IEEE FUZZ-IEEE) and workshops, and is an active journal reviewer. He was co-organizer of the AAAI Spring Symposium, EAAT, on Affect and Opinion Modeling (Stanford, 2004). He is a member of IEEE and ACM.

His research interests include Information Management Systems, Text Retrieval and Filtering, Support Vector Machines, Probabilistic Learning (Expectation Maximisation, Naïve Bayes, Bayesian Networks, HMMs, Language Modeling), Clustering, and Uncertainty Modeling.