20 Text Mining

This chapter explains how you can use Oracle Data Mining to mine text.

This chapter includes the following topics:

About Unstructured Data
How Oracle Data Mining Supports Unstructured Data
Preparing Text for Mining
Sample Text Mining Problem
Oracle Data Mining and Oracle Text

About Unstructured Data

Data mining algorithms act on numerical and categorical data stored in relational databases or spreadsheets. Numerical data has a type such as INTEGER, DECIMAL, or FLOAT. Categorical data has a type such as CHAR or VARCHAR2.

What if you want to mine data items that are neither numerical nor categorical? There are many examples: web pages, document libraries, PowerPoint presentations, product specifications, emails, sound files, and digital images, to name a few. What if you want to mine the information stored in long character strings, such as product descriptions, comment fields in reports, or call center notes?

Data that cannot be meaningfully interpreted as numerical or categorical is considered unstructured for purposes of data mining. It has been estimated that as much as 85% of enterprise data falls into this category. Extracting meaningful information from this unstructured data can be critical to the success of a business.

How Oracle Data Mining Supports Unstructured Data

Unstructured data may be binary objects, such as image or audio files, or text objects, which are language-based. Oracle Data Mining supports text objects.

The case table for Data Mining may include one or more columns of text (see "Mixed Data"), which can be designated as attributes. A text column cannot be used as a target. The case table itself must be a relational table; it cannot be created as a view.

Text must undergo a transformation process before it can be mined. Once the data has been properly transformed, the case table can be used for building, testing, or scoring data mining models. Most Oracle Data Mining algorithms support text. (See "Text Mining Algorithms".)

Mixed Data

Much of today's enterprise information includes both structured and unstructured content related to a given item of interest. Customer account data may include text fields that describe support calls and other interactions with the customer. Insurance claim data may include a claim status description, supporting documents, email correspondence, and other information. It is often essential that analytic applications evaluate the structured information together with the related unstructured information.

Oracle Data Mining offers this capability. You can use Oracle Data Mining to mine data sets that contain regular relational information (numeric and character columns), as well as one or more text columns.

Text Data Types

Oracle Data Mining supports text columns that have any of the data types shown in Table 20-1.

Table 20-1 Data Types for Text Columns

Data Type	Description
BFILE	Locator to a large binary file stored outside the database
BLOB	Binary large object
CHAR	Fixed length character string
CLOB	Character large object
LONG	Long variable length character string
LONG RAW	Long variable length raw binary data
RAW	Raw binary data
VARCHAR2	Variable length character string
XMLTYPE	XML data

See Also:

Oracle Database SQL Language Reference for information about Oracle data types

Text Mining Algorithms

The Oracle Data Mining algorithms shown in Table 20-2 can be used for text mining.

Table 20-2 Oracle Data Mining Algorithms that Support Text

Algorithm	Mining Function
Naive Bayes	Classification
Generalized Linear Models	Classification, Regression
Support Vector Machine	Classification, Regression, Anomaly Detection
k-Means	Clustering
Non-Negative Matrix Factorization	Feature Extraction
Apriori	Association Rules
Minimum Description Length	Attribute Importance

Oracle Data Mining supports text with all mining functions. As shown in Table 20-2, at least one algorithm per mining function has text mining capability.

Classification, clustering, and feature extraction have important applications in pure text mining. Other functions, such as regression and anomaly detection, are more suited for mining mixed data (both structured and unstructured).

Text Classification

Text classification is the process of categorizing documents: for example, by subject or author. Most document classification applications use either multi-class classification or multi-target classification.

Multi-Class Document Classification

In multi-class document classification, each document is assigned a probability for each category, and the probabilities add to 1. For example, if the categories are economics, math, and physics, document A might be 20% likely to be economics, 50% likely to be math, and 30% likely to be physics.

This approach to document classification is supported by Oracle Data Mining and by Oracle Text.
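The following Python sketch illustrates the defining property of multi-class scoring: one probability per category, with the probabilities summing to 1. It is not Oracle Data Mining code. The raw scores are hypothetical, and the softmax normalization is an assumption chosen only to make the example concrete.

```python
import math

def multiclass_probabilities(scores):
    """Convert raw per-category scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {cat: math.exp(s - peak) for cat, s in scores.items()}
    total = sum(exps.values())
    return {cat: e / total for cat, e in exps.items()}

# Hypothetical raw scores for document A (not produced by a real model)
raw_scores = {"economics": 0.2, "math": 1.1, "physics": 0.6}
probs = multiclass_probabilities(raw_scores)

# The predicted category is the one with the highest probability
best = max(probs, key=probs.get)
print(probs, best)
```

Because the categories compete for a single unit of probability, raising the score of one category necessarily lowers the probability of the others.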

Multi-Target Document Classification

In multi-target document classification, each document is assigned a probability for either being in a category or not being in a category, and the probabilities for each category add to 1. Given categories economics, math, and physics, document A might be classified as: 30% likely to be economics and 70% likely not to be economics; 65% likely to be math and 35% likely not to be math; 40% likely to be physics and 60% likely not to be physics.

In multi-target document classification, each category is a separate binary target. Each document is scored for each target.

This approach to document classification is supported by Oracle Text but not by Oracle Data Mining. However, you can obtain similar results by building a single binary classification model for each category and then scoring all the models separately in a single SQL scoring query.
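The one-model-per-category workaround can be sketched in Python as follows. This is an illustration of the idea, not Oracle Data Mining code: the keyword weights, the bias, and the logistic scoring function are all hypothetical stand-ins for trained binary classification models.

```python
import math

# Hypothetical keyword weights standing in for three trained binary models,
# one model per category.
BINARY_MODELS = {
    "economics": {"market": 1.5, "price": 1.0},
    "math": {"theorem": 2.0, "proof": 1.5},
    "physics": {"energy": 1.8, "quantum": 2.2},
}
BIAS = -1.0  # assumed intercept shared by all models

def score_binary(weights, terms):
    """Return (P(in category), P(not in category)) for one binary model."""
    z = BIAS + sum(weights.get(t, 0.0) for t in terms)
    p_in = 1.0 / (1.0 + math.exp(-z))
    return p_in, 1.0 - p_in

def score_all(terms):
    """Score a document against every binary model independently."""
    return {cat: score_binary(w, terms) for cat, w in BINARY_MODELS.items()}

doc_terms = "a proof of the energy theorem".split()
results = score_all(doc_terms)
for category, (p_in, p_out) in results.items():
    print(category, round(p_in, 2), round(p_out, 2))
```

Each pair of probabilities sums to 1 within its own category, but the "in category" probabilities across categories are independent and need not sum to 1.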

See Also:

"Oracle Data Mining and Oracle Text"

Document Classification Algorithms

Oracle Data Mining supports three classification algorithms that are well suited to text mining applications. All three can easily process thousands of text features (see "Preparing Text for Mining" for information about text features), and all three are easy to train with small or large amounts of data. The algorithms are:

Support Vector Machine (SVM), described in Chapter 18
Logistic Regression (GLM), described in Chapter 12
Naive Bayes (NB), described in Chapter 15

See Also:

Chapter 5, "Classification"

Text Clustering

The main applications of clustering in text mining are:

Simple clustering. This refers to the creation of clusters of text features (see "Preparing Text for Mining" for information about text features). For example: grouping the hits returned by a search engine.
Taxonomy generation. This refers to the generation of hierarchical groupings. For example: a cluster that includes text about car manufacturers is the parent of child clusters that include text about car models.
Topic extraction. This refers to the extraction of the most typical features of a group. For example: the most typical characteristics of documents in each document topic.

The Oracle Data Mining enhanced k-Means clustering algorithm, described in Chapter 13, supports text mining.

See Also:

Chapter 7, "Clustering"

Text Feature Extraction

Feature extraction is central to text mining. Feature extraction is used for text transformation at two different stages in the text mining process:

A feature extraction process must be performed on text documents before they can be mined. This preprocessing step transforms text documents into small units of text called features or terms.

See Also:
"Preparing Text for Mining".
The text transformation process generates large numbers (potentially many thousands) of text features from a text document. Oracle Data Mining algorithms treat each feature as a separate attribute. Thus text data may present a huge number of attributes, many of which provide little significant information for training a supervised model or building an unsupervised model.

Oracle Data Mining supports the Non-Negative Matrix Factorization (NMF) algorithm for feature extraction. You can create an NMF model to consolidate the text attributes derived from the case table and generate a reduced set of more meaningful attributes. The results can be far more effective for use in classification, clustering, or other types of data mining models. See Chapter 16 for information on NMF.
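To make the idea of consolidating many term attributes into a few features concrete, here is a minimal pure-Python sketch of non-negative matrix factorization using the standard multiplicative update rules. It is not the Oracle Data Mining implementation. The document-by-term count matrix, the number of features, and all function names are assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W*H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Hypothetical term counts: 4 documents (rows) x 5 terms (columns)
v = [
    [3, 2, 0, 0, 1],
    [2, 3, 0, 0, 0],
    [0, 0, 4, 3, 0],
    [0, 1, 3, 4, 0],
]
w, h = nmf(v, k=2)
print("reduced document features:", w)
print("reconstruction error:", frobenius_error(v, w, h))
```

In this sketch each row of `w` is the reduced representation of one document (two features instead of five term attributes), and each row of `h` shows how strongly each term contributes to a feature.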

Text Association

Association models can be used to uncover the semantic meaning of words. For example, suppose that the word account co-occurs with words like customer, explanation, churn, story, region, debit, and memo. An association model would produce rules connecting account with these concepts. Inspection of the rules would provide context for account in the document collection. Such associations can improve information retrieval engines.

Oracle Data Mining supports Apriori for association. See Chapter 10 for information on Apriori.
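As a rough illustration of how co-occurrence yields rules, the following Python sketch counts term pairs across a small hypothetical document collection and reports support and confidence for each pair that meets a minimum support. It covers only two-term rules and is a simplified stand-in, not the full Apriori algorithm or Oracle's implementation. The documents and the threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical documents, each reduced to its set of extracted terms
documents = [
    {"account", "customer", "churn"},
    {"account", "debit", "memo"},
    {"account", "customer", "region"},
    {"customer", "story"},
]

def pair_rules(docs, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(docs)
    item_counts = Counter(t for d in docs for t in d)
    pair_counts = Counter(p for d in docs for p in combinations(sorted(d), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of documents containing both terms
        if support < min_support:
            continue
        # confidence(a -> b) = count(a and b) / count(a)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

rules = pair_rules(documents)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Inspecting which terms appear as consequents of a term such as account gives the kind of context described above.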

See Also:

Chapter 8, "Association"

Text Attribute Importance

Attribute importance can be used to find terms that distinguish the values of a target column. Attribute importance ranks the relative importance of the terms in predicting the target. For example, certain words and phrases might distinguish the writing style of one writer from another.

Oracle Data Mining supports Minimum Description Length (MDL) for attribute importance. See Chapter 14 for information on MDL.
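The ranking idea can be sketched in Python. Note that this example uses information gain as a simple stand-in for scoring terms; it is not the MDL algorithm and not Oracle Data Mining code. The documents, writer labels, and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Reduction in target entropy from knowing whether a term is present."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (with_term, without_term) if part)
    return entropy(labels) - remainder

# Hypothetical documents (as term sets) and the writer of each document
docs = [
    {"whilst", "upon"},
    {"whilst", "the"},
    {"gonna", "the"},
    {"gonna", "upon"},
]
labels = ["writer_a", "writer_a", "writer_b", "writer_b"]

vocabulary = sorted(set().union(*docs))
ranking = sorted(vocabulary,
                 key=lambda t: information_gain(docs, labels, t),
                 reverse=True)
for term in ranking:
    print(term, round(information_gain(docs, labels, term), 3))
```

Terms that appear only in one writer's documents rank highest, while terms used equally by both writers carry no information about the target.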

Preparing Text for Mining

Before text can be mined, it must undergo a special preprocessing step known as term extraction, also called feature extraction. This process breaks the text down into units (terms) that can be mined. Text terms may be keywords or other document-derived features.

The Oracle Data Miner graphical tool performs term extraction transparently when you create or apply a text mining model. The Oracle Data Mining Java API provides term extraction classes for use in Java applications. If you are using the PL/SQL API, you can use a set of Oracle Text table functions to extract the terms.

See Also:

Oracle Data Mining Application Developer's Guide for instructions on term extraction using the PL/SQL API

Oracle Data Mining Administrator's Guide for information about sample PL/SQL term extraction code provided with the Oracle Data Mining sample programs

Oracle Data Mining Application Developer's Guide for information about term extraction using the Java API

The term extraction process uses Oracle Text routines to transform a text column into a nested column. Each term is the name of a nested attribute. The value of each nested attribute is a number that uniquely identifies the term. Thus each term derived from the text is used as a separate numerical attribute by the data mining algorithm.
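The shape of this transformation can be sketched in Python. The sketch does not use Oracle Text; the tokenizer, the stop word list, the sample rows, and the use of a term count as each attribute's numeric value are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed stop word list; a real term extraction process uses its own
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "i", "it"}

def extract_terms(text):
    """Turn free text into a {term: numeric value} mapping (here, a count)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(t for t in tokens if t not in STOP_WORDS))

# Hypothetical case table rows with one text column named "comments"
case_table = [
    {"cust_id": 101, "age": 41,
     "comments": "The card is great and the rewards are great"},
    {"cust_id": 102, "age": 27,
     "comments": "I rarely use the card"},
]

# Replace the text column with a nested collection of term attributes
prepared = [
    {"cust_id": row["cust_id"],
     "age": row["age"],
     "comments": extract_terms(row["comments"])}
    for row in case_table
]
for row in prepared:
    print(row)
```

After the transformation, the structured columns are unchanged and the text column holds one named numeric attribute per extracted term, which is the form a mining algorithm can consume.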

All Oracle Data Mining algorithms that support nested data can be used for text mining. These algorithms are listed in Table 20-2.

See Also:

"Oracle Data Mining and Oracle Text"

Oracle Data Mining Application Developer's Guide for information about nested data

Sample Text Mining Problem

Suppose you want to predict whether customers will increase spending with an affinity card. You want to include the comments field from a customer survey as one of the predictors. Before building the classification model, you want to create a reduced set of text features to improve the predictive power of the comments. To do this, you will create a feature extraction model.

This example uses the table mining_build_text, which is provided with the Oracle Data Mining sample programs. Since you will be including a text column, the build data must come from a table, not a view. Figure 20-1 shows ten rows and three columns from the table. The comments column is truncated in the display.

Figure 20-1 Sample Build Data for Text Mining

A model build activity in Oracle Data Miner automatically transforms the text in the comments column to a nested column of text terms, then creates a feature extraction model. The NMF algorithm uses these text terms along with the other attributes as input. The results are a set of features, which are linear combinations of the model attributes.

Some of the attribute/value pairs that contribute to Feature 1 are shown in Figure 20-2 and in Figure 20-3. They are listed in descending order by coefficient. The attribute name/value with the highest coefficient has the most predictive importance in this feature. The attribute name/value with the lowest coefficient has the least predictive importance.

In Figure 20-2, the feature list is expanded to show that at least eight features were generated by the model.

Figure 20-2 Feature 1 Attributes and Feature List 1-8

In Figure 20-3, the feature list is expanded to show that a total of 23 features were generated by the model.

Figure 20-3 Feature 1 Attributes and Feature List 11-23

The text terms derived from the comments column are listed with the other attributes. The numeric value of each text term is displayed. If you click the Filter button, you will see the first word of the name of the text terms. The name of the text term is the actual extracted text. In the Filter Attributes dialog, shown in Figure 20-4, you can select specific attributes to include or exclude for this feature.

Figure 20-4 Filter Attributes of a Feature

You can use the results of the feature extraction model as build data for another model, in this case a classification model to predict spending with an affinity card. To do this, score the NMF model and use the resulting data as the build data for the classification model.

Oracle Data Mining and Oracle Text

Oracle Text is a technology included in the base functionality offered by Oracle Database. Oracle Text uses internal components of Oracle Data Mining to provide some data mining capabilities.

Oracle Data Mining is an option of the Enterprise Edition of Oracle Database. To use Oracle Data Mining, you must have a license for the Data Mining option. To use Oracle Text and its data mining capabilities, you do not need to license the Data Mining option.

Oracle Text consists of a set of PL/SQL packages and related data structures that support text query and document classification. Oracle Text routines can be used to:

Query document collections, such as web sites and online libraries
Query document catalogs, such as author and publisher descriptions
Perform document classification and clustering

The primary functional differences between Oracle Data Mining and Oracle Text can be summarized as follows:

Oracle Data Mining supports the mining of mixed data, as described in "Mixed Data". Oracle Text mining capabilities only support text; they do not support mixed structured and unstructured data.
Oracle Data Mining supports mining more than one text column at once. Oracle Text routines operate on a single column.
Oracle Data Mining supports text with all data mining functions. Oracle Text has limited support for data mining. The differences are summarized in Table 20-3.
Oracle Data Mining and Oracle Text both support text columns with any of the data types listed in Table 20-1. Oracle Data Mining requires text feature extraction transformation prior to mining, as described in "Preparing Text for Mining". Oracle Text operates on native text; it performs text feature extraction internally.
Oracle Data Mining and Oracle Text both support document classification, as described in "Text Classification". Oracle Data Mining supports multi-class classification. Oracle Text supports multi-class and multi-target classification.

Table 20-3 Mining Functions: Oracle Data Mining and Oracle Text

Mining Function	Oracle Data Mining	Oracle Text
Anomaly detection	Text or mixed data can be mined using One-Class SVM	No support
Association	Text or mixed data can be mined using Apriori	No support
Attribute importance	Text or mixed data can be mined using MDL	No support
Classification	Text or mixed data can be mined using SVM, GLM, or Naive Bayes	Text can be mined using SVM, decision trees, or user-defined rules
Clustering	Text or mixed data can be mined using k-Means	Text can be mined using k-Means
Feature extraction	Text or mixed data can be mined using NMF	No support
Regression	Text or mixed data can be mined using SVM or GLM	No support

See Also:

Oracle Text Reference