Oracle® Warehouse Builder Data Modeling, ETL, and Data Quality Guide 11g Release 2 (11.2) Part Number E10935-02 |
|
|
View PDF |
This chapter describes Oracle Warehouse Builder data rules and their applications, and describes how to design data rules and derive them from data profiling results.
This chapter contains the following topics:
Data rules are definitions for valid data values and relationships that can be created in Warehouse Builder. They can be applied to tables, views, materialized views, and external tables. They determine legal data within a table (or other object) or legal relationships between data in different columns of a table or different tables.
Data rules are central to the data quality features of Warehouse Builder.
After data profiling, data rules can be automatically generated based upon any discovered data relationships. Data profiling can also test data against specific data rules.
Data auditors use data rules to test data for compliance, and generate statistics about noncompliant data.
Automatically generated data cleansing mappings use data rules as the basis of determining which data is noncompliant. Selecting a data rule and the cleansing strategy to apply for noncompliant data determines the content of the cleansing mapping.
Schema cleansing translates data rules bound to an object into constraints defined on the cleansed schema.
In ETL mappings, if a table or other object has one or more data rules applied in the context of the mapping, then Warehouse Builder automatically creates and manages error tables and other logic to manage and audit noncompliant rows.
There are two ways to create a data rule:
Derive one or more rules from data profiling results, as described in "Deriving Data Rules From Data Profiling Results"
Define a data rule directly, as described in "Creating Data Rules Using the Create Data Rule Wizard"
Every Warehouse Builder workspace also has a certain number of predefined public data rules, accessible in all projects in the workspace, for common conditions such as testing for non-null values, testing for common data formats, and so on.
The metadata for a data rule is stored in the workspace. To use a data rule, you apply the data rule to a data object. For example, you create a data rule called gender_rule
that specifies that valid values are 'M' and 'F'. You can apply this data rule to the emp_gender
column of the Employees
table. Applying the data rule ensures that the values stored for the emp_gender
column are either 'M' or 'F'. You can view the details of the data rule bindings on the Data Rule tab of the Table Editor for the Employees
table.
Table 19-1 describes the types of data rules.
Table 19-1 Types of Data Rules
Data Rule Type | Description | Example |
---|---|---|
Domain List |
Defines a list of values that an attribute is allowed to have |
The Gender attribute can have "M" or "F". |
Domain Pattern List |
Defines a list of patterns that an attribute is allowed to conform to. The patterns are defined in the Oracle Database regular expression syntax. |
A pattern for telephone number is: (^[[:space:]]*[0-9]{ 3 }[[:punct:]|:space:]]?[0-9]{ 4 }[[:space:]]*$) |
Domain Range |
Defines a range of values that an attribute is allowed to have |
The value of the Salary attribute can be between 100 and 10000. |
Common Format |
Defines a known common format that an attribute is allowed to conform to This rule type has many subtypes: Telephone Number, IP Address, SSN, URL, E-mail Address. Each type has predefined formats listed. You can add more formats to this list. |
An E-mail address should be in the following format: ^(mailto:[-_a-z0-9.]+@[-_a-z0-9.]+$) |
No Nulls |
Specifies that the attribute cannot have null values |
The |
Functional Dependency |
Defines that the data in the data object may be normalized |
The |
Unique Key |
Defines whether an attribute or group of attributes are unique in the given data object |
The name of a department must be unique. |
Referential |
Defines the type of relationship (1:n) a value must have with another value |
The |
Name and Address |
Uses the Name and Address support to evaluate a group of attributes as a name or address |
|
Custom |
Applies a SQL expression that you specify to its input parameters |
A custom rule called VALID_DATE has two input parameters, START_DATE and END_DATE. A valid expression for this rule is defined as follows: "THIS"."END_DATE" > "THIS"."START_DATE". |
Data rules are objects in each workspace, independent of individual data elements that a rule may govern. For example, the rule "No Nulls" exists independent of any particular column that the rule is applied to, and a Common Format rule such as Email Address is independent of any specific column in any specific table which is subject to that rule.
In the Projects Navigator, the Data Rules node for a project contains Data Rule folders, which group one or more data rules. There are also public data rules defined for all projects in a workspace.
Data rules have input parameters that identify the objects to which the rules are applied in a given instance. To bind a data rule is to associate that data rule with particular data objects that the rule governs, such as one or more columns in one or several tables. Note that a data rule can be bound multiple times, to several columns in a single table or across multiple tables.
For example, the gender_rule
domain list rule that limits a column to the values M and F can be applied to several different columns in a table or even several different tables, such as EMPLOYEES.emp_gender
, EMPLOYEES.manager_gender
, and CHILDREN.child_gender
. In such a case, if the rule is updated, all places where the rule is bound are affected. The next time you generate and deploy code using that data rule, the updated data rule definition is used. For example, if you changed the gender_rule to accept X to indicate unknown gender, then you can enforce that rule change everywhere by regenerating ETL for the objects to which the gender_rule is bound.
In addition to deriving data rules based on the results of data profiling, you can define your own data rules. You can bind a data rule to multiple tables within the project in which the data rule is defined. An object can contain any number of data rules.
Use the Design Center to create and edit data rules. Once you create a data rule, you can use it in any of the following scenarios.
Using Data Rules in Data Profiling
When you are using data profiling to analyze tables, you can use data rules to analyze how well data complies with a given rule, and to collect statistics. From the results, you can derive a new data rule. If data profiling determines that the majority of records have a value of red, white, and blue for a particular column, a new data rule can be derived that defines the color domain (red, white, and blue). This rule can then be reused to profile other tables, or used like other rules in schema correction, data cleansing, and data auditing.
Using Data Rules in Schema Correction
Data rules can be used to convert a source schema into a new target schema where the structure of the new tables strictly adheres to the data rules. In the new schema, table columns have data types consistent with the data rules, constraints based upon the rules are generated and applied to the tables and enforced, and schemas are normalized.
Using Data Rules in Data Cleansing
The second way that data rules are used is in a correction mapping that validates the data in a source table against the data rules, to determine which records comply and which do not. The analyzed data set is then corrected (for example, orphan records are removed, domain value inaccuracies are corrected, and so on) and the cleansed data set is loaded into the corrected target schema.
Using Data Rules in Data Auditing
Data rules are also used in data auditing. Data auditors are processes that validate data against a set of data rules to determine which records comply and which do not. Data auditors gather statistical metrics on how well the data in a system complies with a rule, and they report defective data into auditing and error tables. In that sense they are like data-rule-based correction mappings, which also offer a report-only option for data that does not comply with the data rules.
Each data rule belongs to a data rule folder, which is a container object that groups related data rules. Before you can create any data rules, you must first create at least one data rule folder.
To create a data rule folder, in the navigation tree, right-click Data Rules and select New Data Rule Folder. The Create Data Rule Folder dialog box is displayed. Enter a name for the new data rule folder and click OK.
Based on the results of data profiling, you can derive data rules that can be used to cleanse your data. Although you can create data rules and apply them manually to your data profile, deriving data rules based on data profiling results enhances productivity and ensures that your rules do reflect the underlying data.
A data rule is an expression that determines the set of legal data that can be stored within a data object. Use data rules to ensure that only values compliant with the data rules are allowed within a data object. Data rules will form the basis for correcting or removing data if you decide to cleanse the data. You can also use data rules to report on noncompliant data.
For example, you have a table called Employees with the following columns: Employee_Number
, Gender
, Employee_Name
. The profiling result shows that 90% of the values in the Employee_Number column are unique, making it a prime candidate for the unique key. The results also show that 85% of the values in the Gender
column are either 'M' or 'F', making it a good candidate for a domain. You can then derive these rules directly from the Profile Results Canvas.
Select a data profile in the navigation tree, right-click, and select Open.
The Data Profile Editor is displayed with the profiling results.
Review the profiling results and determine the findings from which you want to derive data rules.
The types of results that warrant data rules vary. Some results commonly derived into data rules include a detected domain, a functional dependency between two attributes, or a unique key.
Select the tab that displays the results from which you want to derive a data rule.
For example, to create a data rule that enforces a unique key rule for the EMPLOYEE_NUMBER
column, navigate to the Unique Key tab.
Select the cell that contains the results from which you want to derive a data rule.
You can define a data rule on a cell that contains a blue arrow icon. If the cell contains a green arrow icon, a data rule has already been defined for the column represented in that cell
From the Profile menu select Derive Data Rule. Or click the Derive Data Rule button.
For example, to create a Unique Key rule on the EMPLOYEE_NUMBER column, select this column and click Derive Data Rule.
The Derive Data Rule Wizard is displayed.
On the Welcome page, click Next.
On the Name and Description page, the Name field displays a default name for the data rule. To specify a new name, select the name, enter the new name, and click Next.
On the Define Rule page, provide details about the data rule parameters and click Next.
The Type field is automatically populated depending on the type of data being derived. You cannot edit the type of data rule.
Additional fields in the lower portion of this page define the parameters for the data rule. Some of these fields are populated with values based on the result of data profiling. The number and type of fields depend on the type of data rule.
On the Summary page, review the options that you set in the wizard using this page. Click Back to change any of the selected values. Click Finish to create the data rule.
The data rule is created and it appears in the Data Rule panel of the Data Profile Editor. The derived data rule is also added to the Derived_Data_Rules node under the Data Rules node in the Projects Navigator. You can reuse this data rule by attaching it to other data objects.
The Data Rules folder in the Projects Navigator contains the data rules. Every data rule must belong to a data rule folder. The subfolder DERIVED_DATA_RULES in each project contains the data rules derived as a result of data profiling. You can create additional data rule folders to contain any data rules that you create.
To create a data rule:
Right-click the data rule folder in which you want to create a data rule and select New Data Rule.
The Welcome page of the Create Data Rule Wizard is displayed.
On the Welcome page, click Next.
On the Name and Description page, specify a name and an optional description for the data rule. Click Next.
On the Define Rule page, specify the type of data rule to create. Also specify any additional information required to create the data rule. Click Next.
For example, when you create a Domain Range rule, you must specify the values that represent the valid domain values.
See "Defining the Data Rule" for information about defining the data rule.
On the Summary page, review the selections that you made in the wizard. Click Back to modify any selected values. Click Finish to create the data rule.
The data rule is added to the data rule folder under which you created the data rule.
Use the Define Rule page or the Define Rule tab to provide details about the data rule. The top section of the page displays the Type list that represents the type of data rule. When you are creating a data rule, expand the Type field to view the types of data rules, and select the type that you want to create. When you edit a data rule, the Type field is disabled, as you cannot change the type of data rule once it is created. For more information about types of data rules, see "Types of Data Rules".
Note:
When you are deriving a data rule, the Type field is automatically populated and you cannot edit this value.The bottom section of this page specifies additional details about the data rule. The number and names of the fields displayed in this section depend on the type of data rule that you create.
For example, if you select Custom as the type, use the Attributes section to define the attributes required for the rule. Use the Ellipsis button on the Expression field to define a custom expression involving the attributes that you defined in the Attributes section.
If you select Domain Range as the type of data rule, the bottom section of the page provides fields to specify the data type of the range, the minimum value, and the maximum value. When you are deriving a data rule, some of these fields are populated based on the profiling results from which you are deriving the rule. You can edit these values.
After you create a data rule, you can edit its definition. You can rename the data rule and edit its description. You cannot change the type of data rule. However, you can change the other parameters specified for the data rule. For example, for a Domain Range type of data rule, you can edit the data type of the range, the minimum range value, and the maximum range value.
To edit a data rule:
In the Projects Navigator, right-click the data rule and select Open.
The Edit Data Rule dialog box is displayed.
On the Name tab, you can perform the following tasks:
To rename a data rule, select the name and enter the new name.
To edit the description for the data rule, select the description and enter the new description.
On the Define tab, edit the properties of the data rule.
Note:
You cannot change the type of data rule. You can only modify the properties related to that type of data rule, such as the domain bounds, domain list, and number of attributes in a unique key.Applying a data rule to an object binds the definition of the data rule to the object. For example, binding a rule to the table Dept
ensures that the rule is implemented for the specified attribute in the table. You apply a data rule using the object editor. You can also apply a derived data rule from the Data Rule panel of the Data Profile Editor.
The Apply Data Rule Wizard enables you to apply a data rule to a data object. You can apply precreated data rules, or any data rule you created to data objects. The types of data objects to which you can apply data rules are tables, views, materialized views, and external tables.
To apply a data rule to a data object:
In the Projects Navigator, right-click the object to which you want to apply a data rule and select Open.
The editor for the data object is displayed.
On the Data Rules tab, any data rules already bound to the data object are displayed. Click Apply Rule to apply a new data rule.
The Apply Data Rule Wizard is displayed.
On the Welcome page, click Next.
On the Select Rule page, select the data rule that you want to apply to the data object and click Next.
Data rules are grouped under the nodes BUILT_IN, DERIVED_DATA_RULES, and any other data rule folders that you create.
The BUILT_IN node contains the default data rules defined in the workspace. These include rules such as foreign key, unique key, and not null.
The DERIVED_DATA_RULES node lists all the data rules that were derived as a result of data profiling.
On the Name and Description page, enter a name and an optional description for the data rule and click Next.
On the Bind Rule Parameters page, use the Binding list to select the column to which the data rule must be applied and click Next.
On the Summary page, review the selections that you made on the previous wizard pages. Click Back to modify selected values. Click Finish to apply the data rule.
The data rule is bound to the data object and is listed on the Data Rules tab.