4 Preparing Text for Mining

Oracle Data Mining supports the mining of data sets that have one or more text columns. These columns must undergo a special preprocessing step whereby text tokens known as terms are extracted and stored in a nested column. The transformed text can then be used as any other attribute in the building, testing, and scoring of models. Any algorithm that supports nested data can be used for text mining. (See Table 3-2 in Chapter 3.)

This chapter explains how to use PL/SQL to prepare a column of text for mining.

Note:

Oracle Data Mining includes sample programs that illustrate text transformation and text mining in both PL/SQL and Java. Refer to Oracle Data Mining Administrator's Guide for information about the Oracle Data Mining sample programs.

Oracle Text Routines for Term Extraction

Oracle Data Mining uses specialized Oracle Text routines to preprocess text data. Oracle Text is a technology within the database for building text querying and classification applications.

Oracle Text provides the following facilities for the Oracle Data Mining term extraction process:

SVM_CLASSIFIER, defined in the CTX_DLL Oracle Text PL/SQL package, specifies an index preference for Oracle Data Mining term extraction.
The CTXSYS.DRVODM Oracle Text PL/SQL package defines the table functions, FEATURE_PREP and FEATURE_EXPLAIN, which generate intermediate and final tables of text terms for Oracle Data Mining.

In PL/SQL, the term extraction process requires the use of these Oracle Text facilities. Java developers can use the OraTextTransform interface, which presents the Oracle Text term extraction capability within the context of a Java environment. (See "Using Text Transformation".)

Term Extraction in the Sample Programs

A good place to start in learning the term extraction process is with the sample programs. The following sample programs contain term extraction code for text mining:

dmsh.sql — Prepares the build, test, and scoring data for the sample programs, including the text mining programs. dmsh.sql creates views for data mining and tables and indexes for text mining.
dmtxtfe.sql — Using the indexed text column created by dmsh.sql, creates a nested column suitable for text mining.

The dmtxtfe.sql program is a sample term extractor. It contains extensive comments that explain the code in a step-by-step fashion. You can expand this program into a complete term extraction solution by adding index creation and the preparation of test and scoring data (as in dmsh.sql).

Two sample PL/SQL programs use the data prepared by dmsh.sql to mine text.

dmtxtnmf.sql creates and applies a model that uses Non-Negative Matrix Factorization.
dmtxtsvm.sql creates and applies a model that uses SVM classification.

Both these programs mine a table of customer data that includes a nested column of text data called COMMENTS. The COMMENTS column has been pre-processed by dmsh.sql. The models created by these programs are referenced in the following example from a Linux system.

-- Run the programs to create the models
SQL> @ $ORACLE_HOME/rdbms/demo/dmtxtnmf.sql
SQL> @ $ORACLE_HOME/rdbms/demo/dmtxtsvm.sql
-- List the models created by the programs
SQL> SELECT model_name, mining_function, algorithm FROM user_mining_models;
 
MODEL_NAME                MINING_FUNCTION              ALGORITHM
------------------------  ---------------------------  ------------------------
T_SVM_CLAS_SAMPLE         CLASSIFICATION               SUPPORT_VECTOR_MACHINES
T_NMF_SAMPLE              FEATURE_EXTRACTION           NONNEGATIVE_MATRIX_FACTOR

From Unstructured Data to Structured Data

The pre-processing steps for text mining create nested table columns of type DM_NESTED_NUMERICALS from columns of type VARCHAR2 or CLOB. Each row of the nested table specifies an attribute name and a value. The DM_NESTED_NUMERICALS type defines the following columns.

attribute_name     VARCHAR2(4000)
value               NUMBER)

The term extraction process treats the text in each row of the original table as a separate document. Each document is transformed to a set of terms that have a numeric value and a text label. Within the nested table column, the attribute_name column holds the text and the value column holds the numeric value of the term, which is derived using the term frequency in the document and in the document collection (other rows).

For example, the following query returns various attributes of customer 102998, including a text column of comments. The text column has not been transformed.

SQL> select cust_id, cust_gender, cust_income_level, affinity_card, comments
             from mining_build_text
             where cust_id = 102998;
 
CUST_ID C  CUST_INCOME_LEVEL     AFFINITY_CARD COMMENTS
------- -- --------------------  ------------- --------------------------------
102998  M  J: 190,000 - 249,999  1             I wanted to write you to let you
                                               know that I've purchased several
                                               items at your store recently and
                                               have been very satisfied with my
                                               purchases. Keep up the good work.

The following query returns the same attributes of customer 102998, but the text in the comments column has been transformed. The query extracts the ATTRIBUTE_NAME and VALUE columns from the nested table that holds the transformed text.

SQL> select b.cust_id, b.cust_gender, b.cust_income_level, b.affinity_card, n.*
             from mining_build_nested_text b,
                  table(b.comments) n
             where b.cust_id = 102998
             order by n.attribute_name;
 
CUST_ID  C  CUST_INCOME_LEVEL    AFFINITY_CARD  ATTRIBUTE_NAME  VALUE
-------  -- -------------------  -------------  --------------  --------
102998   M  J: 190,000 - 249,999  1              GOOD            .26894
102998   M  J: 190,000 - 249,999  1              ITEMS            158062
102998   M  J: 190,000 - 249,999  1              KEEP             238765
102998   M  J: 190,000 - 249,999  1              KNOW              .2006
102998   M  J: 190,000 - 249,999  1              LET              299856
102998   M  J: 190,000 - 249,999  1              PURCHASED        142743
102998   M  J: 190,000 - 249,999  1              PURCHASES        173146
102998   M  J: 190,000 - 249,999  1              RECENTLY        .195223
102998   M  J: 190,000 - 249,999  1              SATISFIED       .355851
102998   M  J: 190,000 - 249,999  1              SEVERAL         .355851
102998   M  J: 190,000 - 249,999  1              STORE          .0712537
102998   M  J: 190,000 - 249,999  1              UP              .159838
102998   M  J: 190,000 - 249,999  1              WANTED          .355851
102998   M  J: 190,000 - 249,999  1              WORK            .299856
102998   M  J: 190,000 - 249,999  1              WRITE           .355851

The ATTRIBUTE_NAME column holds an item of text from the original comments column. The VALUE column holds the term value. Note that not all words from the original comments column are extracted as terms. For example, the articles the and to are not included.

Steps in the Term Extraction Process

The steps in the term extraction process are summarized in this section. Further details and specific syntax requirements are explained later in this chapter.

Transform a Text Column in the Build Table

First transform the text in the build data. During this process you will generate the text term definitions, which you will reuse for the test and apply data. Perform the following steps:

Create an index on the text column in the build table.
Create an SVM_CLASSIFIER preference for the index.
Define a table to hold the categories specified by the SVM_CLASSIFIER index.
Use the FEATURE_PREP table function to create the term definitions and populate an intermediate terms table.
Use the FEATURE_EXPLAIN table function to populate the final terms table.
Replicate the columns of the original build table (using a view or another table), replacing the text column with a nested table column. Load the terms from the final terms table into the nested table column.

Transform a Text Column in the Test and Apply Tables

The test and apply data must undergo the same pre-processing as the build data. To transform the test and apply data, you will reuse the term definitions generated for the build data. Perform the following steps:

Create an index on the text column in the test or apply table.
Use the FEATURE_PREP table function to populate an intermediate terms table. Use the term definitions previously generated for the build data.
Use the FEATURE_EXPLAIN table function to populate the final terms table.
Replicate the columns of the original test or apply table, replacing the text column with a nested table column. Load the terms from the final terms table into the nested table column.

Create the Index and Index Preference

Oracle Text processing requires a text index. Oracle Text supports several types of indexes for querying, cataloging, and classifying text documents. The Oracle Data Mining term extraction process requires a CONTEXT index for text querying.

You must create an index for each text column to be transformed. Use the following syntax to create the index.

SQL>CREATE INDEX index_name ON table_name(column_name)
                   INDEXTYPE IS ctxsys.context PARAMETERS ('nopopulate');

Note:

This statement creates a basic CONTEXT index. You can further define the characteristics of the index by specifying additional arguments to the CREATE INDEX statement. Refer to Oracle Text Reference for details.

Oracle Text supports index preferences for overriding the default characteristics of an index. The CREATE_PREFERENCE procedure in the Oracle Text package CTX_DDL creates a preference with the name and type that you specify. The SVM_CLASSIFIER preference type defines the characteristics of an index for Oracle Data Mining.

You must create an index preference when you prepare the build data. It will be reused when you prepare the test and apply data. Use the following syntax to create the index preference.

SQL>EXECUTE ctx_ddl.create_preference('preference_name', 'SVM_CLASSIFIER');

The SVM_CLASSIFIER index preference uses a predefined table with two numeric columns: ID and CAT. ID holds the case ID; CAT holds the category. The category table is used for internal processing. You must create the category table using the following syntax.

SQL>CREATE TABLE category_table_name(id NUMBER, cat NUMBER);

Create the Intermediate Terms Table

The FEATURE_PREP table function in the CTXSYS.DRVODM Oracle Text package extracts terms from a text column using an index preference of type SVM_CLASSIFIER. FEATURE_PREP creates a table of term definitions from the build data and reuses these definitions for the test and apply data.

FEATURE_PREP returns an intermediate terms table.

FEATURE_PREP Calling Syntax

FEATURE_PREP is an over-loaded function that accepts two different sets of arguments. You will specify one set of arguments for the build data and another set for the test and apply data.

--- syntax for build data ---
            CTXSYS.DRVODM.FEATURE_PREP (
                   text_index                IN   VARCHAR2,
                   case_id                   IN   VARCHAR2,
                   category_tbl              IN   VARCHAR2,
                   category_tbl_id_col       IN   VARCHAR2,
                   category_tbl_cat_col      IN   VARCHAR2,
                   feature_definition_tbl    IN   VARCHAR2,
                   index_preference          IN   VARCHAR2)
                RETURN DRVODM;

--- syntax for test/apply data ---
           CTXSYS.DRVODM.FEATURE_PREP (
                   text_index                IN   VARCHAR2,
                   case_id                   IN   VARCHAR2,
                   feature_definition_tbl    IN   VARCHAR2,
                RETURN DRVODM;

FEATURE_PREP Return Value

FEATURE_PREP returns the following columns. The SEQUENCE_ID column holds the case ID; the ATTRIBUTE_ID column holds the term ID.

Name                   NULL?   Type
---------------------- ------- ------
SEQUENCE_ID                    NUMBER
ATTRIBUTE_ID                   NUMBER
VALUE                          NUMBER

FEATURE_PREP Arguments

FEATURE_PREP accepts the arguments described in Table 4-1.

Table 4-1 FEATURE_PREP Table Function Arguments

Argument Name	Data Type
`text_index`	`VARCHAR2`	Name of the index on the text column in the build, test, or apply table.
`case_ID`	`VARCHAR2`	Name of the case ID column in the build, test, or apply table.
`category_tbl`	`VARCHAR2`	Name of the table used by the `SVM_CLASSIFIER` index preference. Specify this argument only for build data.
`category_tbl_id_col`	`VARCHAR2`	Specify 'id'. This is the name of the `ID` column in the table used by the `SVM_CLASSIFIER` index preference. Specify this argument only for build data.
`category_tbl_cat_col`	`VARCHAR2`	Specify 'cat'. This is the name of the `CAT` column in the table used by the `SVM_CLASSIFIER` index preference. Specify this argument only for build data.
`feature_definition_tbl`	`VARCHAR2`	Name of the term definition table created by `FEATURE_PREP`. The columns of the term definition table are: Name Null? Type --------------------------------- CAT_ID NUMBER TYPE NUMBER RULE BLOB
`index_preference`	`VARCHAR2`	Name of the `SVM_CLASSIFIER` index preference. Specify this argument only for build data.

FEATURE_PREP Example

The following example creates an intermediate terms table called txt_term_out. The FEATURE_PREP table function extracts terms from a text column with an index called build_text_idx. The text column is in a build table with a case ID column called cust_id. The index preference txt_pref is applied to the index using the id and cat columns in the table cat_tbl. FEATURE_PREP creates a table of term definitions called txt_pref_terms.

CREATE TABLE txt_term_out AS
SELECT *
  FROM TABLE(ctxsys.drvodm.feature_prep (
               'build_text_idx',
               'cust_id',
               'cat_tbl',
               'id',
               'cat',
               'txt_pref_terms',
               'txt_pref'));

Create the Final Terms Table

The FEATURE_EXPLAIN table function in the CTXSYS.DRVODM Oracle Text package extracts the term values from the definitions created by FEATURE_PREP and appends the associated word to each value.

FEATURE_EXPLAIN returns the final terms table.

FEATURE_EXPLAIN Calling Syntax

The calling syntax of FEATURE_EXPLAIN is described as follows.

CTXSYS.DRVODM.FEATURE_EXPLAIN (
                   feature_definition_tbl     IN   VARCHAR2,
                RETURN DRVODM;

FEATURE_EXPLAIN Return Value

FEATURE_EXPLAIN returns the following columns.

Name              Type
---------------    ---------------
text               VARCHAR2(160)
type               NUMBER(3)
ID                 NUMBER
score              NUMBER

FEATURE_EXPLAIN Arguments

FEATURE_EXPLAIN accepts a single argument: the terms definition table created by FEATURE_PREP.

FEATURE_EXPLAIN Example

The following example creates a final terms table called txt_final_terms using the intermediate terms table txt_term_out. The FEATURE_EXPLAIN table function returns the terms specified in the terms definition table txt_pref_terms.

SQL> create table txt_final_terms as
                   select A.sequence_id, B.text, A.value
                         FROM txt_term_out A,
                              TABLE(ctxsys.drvodm.feature_explain(
                                    'txt_pref_terms')) B
                         WHERE A.attribute_id = B.id;

Populate a Nested Table Column

Use the final terms table to populate a nested table column of type DM_NESTED_NUMERICALS.

The following example creates the table mining_build_nested_text. (Alternatively, you could create a view.) The table has a case ID column of customer IDs and three customer attribute columns: age, education, and occupation. It also includes a comments column of type DM_NESTED_NUMERICALS created from the terms table txt_final_terms.

SQL> CREATE TABLE mining_build_nested_text
       NESTED TABLE comments store AS build_comments
      AS
    SELECT non_text.cust_id,
      non_text.age,
      non_text.education,
      non_text.occupation,
     txt.comments
     FROM
     mining_build_text non_text,
     ( SELECT features.sequence_id,
              cast(COLLECT(dm_nested_numerical(features.text,features.value))
                           as dm_nested_numericals)  comments
       FROM txt_final_terms features
       group by features.sequence_id) txt
     WHERE non_text.cust_id = txt.sequence_id(+);

Example: Transforming a Text Column

In the following example, a text column in MINING_BUILD_TEXT is transformed to a nested table column in MINING_BUILD_NESTED_TEXT. The same text column in MINING_APPLY_TEXT is transformed to a nested table column in MINING_APPLY_NESTED_TEXT.

Both MINING_BUILD_TEXT and MINING_APPLY_TEXT have the following columns.

Name                              Null?    Type
 --------------------------------- -------- ---------------------------
 CUST_ID                           NOT NULL NUMBER
 AGE                                        NUMBER
 EDUCATION                                  VARCHAR2(21)
 OCCUPATION                                 VARCHAR2(21)
 COMMENTS                                   VARCHAR2(4000)

The following statements create the indexes.

SQL> create index build_text_idx on mining_build_text (comments)
             indextype is ctxsys.context parameters ('nopopulate');
SQL> create index apply_text_idx ON mining_apply_text (comments)
             indextype is ctxsys.context parameters ('nopopulate');

The following statements create the index preference and its table.

SQL> execute ctx_ddl.create_preference('idx_pref', 'SVM_CLASSIFIER');
SQL> create table idx_pref_cat (id number, cat number);

The following statement returns the intermediate terms in the table BUILD_TERMS_OUT. It also creates the table FEATURE_DEFS and populates it with the term definitions.

SQL>  create table build_terms_out as
              select * from
                     table (ctxsys.drvodm.feature_prep
                                  ('build_text_idx',
                                   'cust_id',
                                   'idx_pref_cat',
                                   'id',
                                   'cat',
                                   'feature_defs',
                                   'idx_pref'));

The following statement returns the final terms in the table BUILD_EXPLAIN_OUT.

SQL> create table build_explain_out as
             select a.sequence_id,
                    b.text,
                    a.value
             from build_terms_out a,
             table (ctxsys.drvodm.feature_explain('feature_defs')) b
             where a.attribute_id = b.id;

The following statement creates the table MINING_BUILD_NESTED_TEXT. This table contains the non-text attributes from the original build table and a nested table of comments. This table can be used to build a model.

SQL> create table mining_build_nested_text
        nested table comments store as build_comments
          as select non_text.cust_id,
                    non_text.age,
                    non_text.education,
                    non_text.occupation,
                    txt.comments
              from mining_build_text non_text,
             (select features.sequence_id,
                cast(collect(dm_nested_numerical(features.text,features.value))
                as dm_nested_numericals)  comments
              from build_explain_out features
              group by features.sequence_id) txt
              where non_text.cust_id = txt.sequence_id(+);

The following statement creates the intermediate terms table for the comments column in the apply table, MINING_APPLY_TEXT. It uses the term definitions in the FEATURE_DEFS table, which was created during the pre-processing of the comments column in MINING_BUILD_TEXT.

SQL>  create table apply_terms_out as
              select * from
                     table (ctxsys.drvodm.feature_prep
                                  ('build_text_idx',
                                   'cust_id',
                                   'feature_defs'));

The following statement creates the final terms table for apply.

SQL> create table apply_explain_out as
             select a.sequence_id,
                    b.text,
                    a.value
             from apply_terms_out a,
             table (ctxsys.drvodm.feature_explain('feature_defs')) b
             where a.attribute_id = b.id;

The following statement creates the table MINING_APPLY_NESTED_TEXT. This table contains the non-text attributes from the original apply table and a nested table of comments. This table can be used to apply the model.

SQL> create table mining_apply_nested_text
        nested table comments store as apply_comments
          as select non_text.cust_id,
                    non_text.age,
                    non_text.education,
                    non_text.occupation,
                    txt.comments
              from mining_apply_text non_text,
             (select features.sequence_id,
                cast(collect(dm_nested_numerical(features.text,features.value))
                as dm_nested_numericals)  comments
              from apply_explain_out features
              group by features.sequence_id) txt
              where non_text.cust_id = txt.sequence_id(+);