Oracle® Real Application Clusters Administration and Deployment Guide 11g Release 2 (11.2) Part Number E10718-03 |
|
|
View PDF |
This chapter describes how to monitor and tune Oracle Real Application Clusters (Oracle RAC) performance.
This chapter includes the following topics:
This section includes the following topics:
See Also:
Oracle Database 2 Day + Real Application Clusters Guide, Oracle Enterprise Manager Concepts,
The Oracle Enterprise Manager Online Help
Oracle Database 2 Day DBA for more information about basic database tuning
Oracle Database 2 Day + Performance Tuning Guide for more information about general performance tuning
Oracle Clusterware Administration and Deployment Guide for more information about diagnosing problems for Oracle Clusterware components
Using Oracle Enterprise Manager is the preferred method for monitoring the Oracle RAC and Oracle Clusterware environment. Oracle Enterprise Manager is an Oracle Web-based integrated management solution for monitoring and administering your computing environment. From any location where you can access a web browser, you can manage Oracle RAC databases, application servers, host computers, and Web applications, as well as related hardware and software. For example, you can monitor your Oracle RAC database performance from your office, home, or a remote site, as long as you have access to a Web browser.
Both Oracle Enterprise Manager Database Control and Oracle Enterprise Manager Grid Control are cluster-aware and provide a central console to manage your cluster database. From the Cluster Database Home page, you can do all of the following:
View the overall system status, such as the number of nodes in the cluster and their current status. This high-level view capability means that you do not have to access each individual database instance for details if you just want to see inclusive, aggregated information.
View alert messages aggregated across all the instances with lists for the source of each alert message. An alert message is an indicator that signifies that a particular metric condition has been encountered. A metric is a unit of measurement used to report the system's conditions.
Review issues that are affecting the entire cluster as well as those that are affecting individual instances.
Monitor cluster cache coherency statistics to help you identify processing trends and optimize performance for your Oracle RAC environment. Cache coherency statistics measure how well the data in caches on multiple instances is synchronized. If the data caches are completely synchronized with each other, then reading a memory location from the cache on any instance will return the most recent data written to that location from any cache on any instance.
Oracle Enterprise Manager accumulates data over specified periods of time, called collection-based data. Oracle Enterprise Manager also provides current data, called real-time data.
Oracle Database 2 Day + Real Application Clusters Guide
Automatic Database Diagnostic Monitor and Oracle RAC Performance
When you log in to Oracle Enterprise Manager using a client browser, the Cluster Database Home page appears where you can monitor the status of both Oracle Clusterware and the Oracle RAC environments. Monitoring can include such things as:
Notification if there are any VIP relocations
Status of the Oracle Clusterware on each node of the cluster using information obtained through the Cluster Verification Utility (cluvfy
)
Notification if node applications (nodeapps
) start or stop
Notification of issues in the Oracle Clusterware alert log for the OCR, voting disk issues (if any), and node evictions
The Cluster Database Home page is similar to a single-instance Database Home page. However, on the Cluster Database Home page, Oracle Enterprise Manager displays the system state and availability. This includes a summary about alert messages and job activity, as well as links to all the database and Oracle Automatic Storage Management (Oracle ASM) instances. For example, you can track problems with services on the cluster including when a service is not running on all of the preferred instances or when a service response time threshold is not being met.
You can use the Oracle Enterprise Manager Interconnects page to monitor the Oracle Clusterware environment. The Interconnects page shows the public and private interfaces on the cluster, the overall throughput on the private interconnect, individual throughput on each of the network interfaces, error rates (if any) and the load contributed by database instances on the interconnect, including:
Overall throughput across the private interconnect
Notification if a database instance is using public interface due to misconfiguration
Throughput and errors (if any) on the interconnect
Throughput contributed by individual instances on the interconnect
All of this information also is available as collections that have a historic view. This is useful in conjunction with cluster cache coherency, such as when diagnosing problems related to cluster wait events. You can access the Interconnects page by clicking the Interconnect tab on the Cluster Database home page or clicking the Interconnect Alerts link under Diagnostic Findings on the Oracle RAC database home page.
Also, the Oracle Enterprise Manager Cluster Database Performance page provides a quick glimpse of the performance statistics for a database. Statistics are rolled up across all the instances in the cluster database in charts. Using the links next to the charts, you can get more specific information and perform any of the following tasks:
Identify the causes of performance issues.
Decide whether resources need to be added or redistributed.
Tune your SQL plan and schema for better optimization.
Resolve performance issues
The charts on the Cluster Database Performance page include the following:
Chart for Cluster Host Load Average: The Cluster Host Load Average chart in the Cluster Database Performance page shows potential problems that are outside the database. The chart shows maximum, average, and minimum load values for available nodes in the cluster for the previous hour.
Chart for Global Cache Block Access Latency: Each cluster database instance has its own buffer cache in its System Global Area (SGA). Using Cache Fusion, Oracle RAC environments logically combine each instance's buffer cache to enable the database instances to process data as if the data resided on a logically combined, single cache.
Chart for Average Active Sessions: The Average Active Sessions chart in the Cluster Database Performance page shows potential problems inside the database. Categories, called wait classes, show how much of the database is using a resource, such as CPU or disk I/O. Comparing CPU time to wait time helps to determine how much of the response time is consumed with useful work rather than waiting for resources that are potentially held by other processes.
Chart for Database Throughput: The Database Throughput charts summarize any resource contention that appears in the Average Active Sessions chart, and also show how much work the database is performing on behalf of the users or applications. The Per Second view shows the number of transactions compared to the number of logons, and the amount of physical reads compared to the redo size per second. The Per Transaction view shows the amount of physical reads compared to the redo size per transaction. Logons is the number of users that are logged on to the database.
In addition, the Top Activity drill down menu on the Cluster Database Performance page enables you to see the activity by wait events, services, and instances. Plus, you can see the details about SQL/sessions by going to a prior point in time by moving the slider on the chart.
For example, the Cluster Database Performance page provides a quick glimpse of the performance statistics for an Oracle RAC database. Statistics are rolled up across all of the instances in the cluster database so that users can identify performance issues without going through all the instances. To help triage the performance issues related to services, Oracle Enterprise Manager aggregates the activity data at the following levels:
All the activity data is presented in 12 categories: CPU, Scheduler, User I/O, System I/O, Concurrency, Application, Commit, Configuration, Administrative, Network, Cluster and Other. The data presented is rolled up from all of the running instances.
All the activity data is rolled up for each service. When the activity data is presented in this way, it is easy to identify which service is most active, and needs more analysis.
As a similar effort, the activity data is rolled up for each instance, if services are not the interested ones.
The aggregates are provided on the pages where the activity data is presented including: Database Performance Page, Top Activity Page, Wait Details Page and Service Details Page.
All single-instance tuning practices for Oracle Database apply to Oracle RAC databases. Therefore, implement the single-instance tuning methodologies described in the Oracle Database 2 Day + Performance Tuning Guide and the Oracle Database Performance Tuning Guide.
The interconnect and internode communication protocols can affect Cache Fusion performance. In addition, the interconnect bandwidth, its latency, and the efficiency of the IPC protocol determine the speed with which Cache Fusion processes block transfers.
To verify the interconnect settings, query the V$CLUSTER_INTERCONNECTS
and the V$CONFIGURED_INTERCONNECTS
. For example:
Example 11-1 Verify Interconnect Settings with V$CLUSTER_INTERCONNECTS
SQL> SELECT * FROM V$CLUSTER_INTERCONNECTS; NAME IP_ADDRESS IS_ SOURCE --------------- ---------------- --- ------------------------------- eth2 10.137.20.181 NO Oracle Cluster Repository
Example 11-2 Verify Interconnect Settings with V$CONFIGURED_INTERCONNECTS
SQL> SELECT * FROM V$CONFIGURED_INTERCONNECTS; NAME IP_ADDRESS IS_ SOURCE --------------- ---------------- --- ------------------------------- eth2 10.137.20.181 NO Oracle Cluster Repository eth0 10.137.8.225 YES Oracle Cluster Repository SQL> DESC V$CONFIGURED_INTERCONNECTS Name Null? Type ----------------------------------------- -------- ---------------------------- NAME VARCHAR2(15) IP_ADDRESS VARCHAR2(16) IS_PUBLIC VARCHAR2(3) SOURCE VARCHAR2(31)
Once your interconnect is operative, you cannot significantly influence its performance. However, you can influence an interconnect protocol's efficiency by adjusting the IPC buffer sizes.
The Oracle Cluster Registry (OCR) stores your system's interconnect information. Use the oifcfg getif
command or the OCRDUMP
utility to identify the interconnect that you are using. You can then change the interconnect that you are using by running an oifcfg
command.
See Also:
Your vendor-specific interconnect documentation for more information about adjusting IPC buffer sizesAlthough you rarely need to set the CLUSTER_INTERCONNECTS
parameter, you can use it to assign a private network IP address or NIC as in the following example:
CLUSTER_INTERCONNECTS=10.0.0.1
If you are using an operating system-specific vendor IPC protocol, then the trace information may not reveal the IP address.
Note:
You can also use theoifcfg
command as described in the Oracle Clusterware Administration and Deployment Guide for more information about enabling and using the Oracle Interface Configuration (OIFCFG) command-line utility to assign private network or private IP addresses.Each instance has a set of instance-specific views, which are prefixed with V$
. You can also query global dynamic performance views to retrieve performance information from all of the qualified instances. Global dynamic performance view names are prefixed with GV$
.
Querying a GV$
view retrieves the V$
view information from all qualified instances. In addition to the V$
information, each GV$
view contains an extra column named INST_ID
of data type NUMBER
. The INST_ID
column displays the instance number from which the associated V$
view information was obtained.
You can use the INST_ID
column as a filter to retrieve V$
information from a subset of available instances. For example, the following query retrieves the information from the V$LOCK
view for instances 2 and 5:
SQL> SELECT * FROM GV$LOCK WHERE INST_ID = 2 OR INST_ID = 5;
If you did not create your Oracle RAC database with the Database Configuration Assistant (DBCA), then you must run the CATCLUST.SQL
script to create views and tables related to Oracle RAC. You must have SYSDBA
privileges to run this script.
See Also:
The platform-specific Oracle RAC installation guides for more information about creating your Oracle RAC databaseThis section provides an overview of the V$
and GV$
views that provide statistics that you can use evaluate block transfers in your cluster. Use these statistics to analyze interconnect block transfer rates as well as the overall performance of your Oracle RAC database.
Oracle RAC statistics appear as message request counters or as timed statistics. Message request counters include statistics showing the number of certain types of block mode conversions. Timed statistics reveal the total or average time waited for read and write I/O for particular types of operations.
You can use Automatic Workload Repository (AWR) to monitor performance statistics related to Oracle RAC databases. AWR automatically generates snapshots of the performance data once every hour and collects the statistics in the workload repository. In Oracle RAC environments, each AWR snapshot captures data from all active instances in the cluster. The data for each snapshot set is captured from the same point in time. AWR stores the snapshot data for all instances in the same table and the data is identified by an instance qualifier. For example, the BUFFER_BUSY_WAIT
statistic shows the number of buffer waits on each instance. AWR does not store data that is aggregated from across the entire cluster. In other words, the data is stored for each individual instance.
Using the Automatic Database Diagnostic Monitor (ADDM), you can analyze the information collected by AWR for possible performance problems with Oracle Database. ADDM presents performance data from a cluster-wide perspective, thus enabling you to analyze performance on a global basis. In an Oracle RAC environment, ADDM can analyze performance using data collected from all instances and present it at different levels of granularity, including:
Analysis for the entire cluster
Analysis for a specific database instance
Analysis for a subset of database instances
To perform these analyses, you can run the ADDM Advisor in ADDM for Oracle RAC mode to perform an analysis of the entire cluster, in Local ADDM mode to analyze the performance of an individual instance, or in Partial ADDM mode to analyze a subset of instances. You activate ADDM analysis using the advisor framework through Advisor Central in Oracle Enterprise Manager, or through the DBMS_ADVISOR
and DBMS_ADDM
PL/SQL packages.
See Also:
Oracle Database Performance Tuning Guide for information about AWR.
Oracle Database 2 Day + Real Application Clusters Guide for more information about how to access and analyze global and local ADDM data using Oracle Enterprise Manager
Oracle Database PL/SQL Packages and Types Reference for more information about the DBMS_ADVISOR
and DBMS_ADDM
packages
Oracle Database Reference for more information about the DBMS_ADDM_FINDINGS
view
This section describes Active Session History (ASH) reports for Oracle RAC under the following topics:
ASH is an integral part of the Oracle Database self-management framework and is useful for diagnosing performance problems in Oracle RAC environments. ASH report statistics provide details about Oracle Database session activity. Oracle Database records information about active sessions for all active Oracle RAC instances and stores this data in the System Global Area (SGA). Any session that is connected to the database and using CPU is considered an active session. The exception to this is sessions that are waiting for an event that belongs to the idle wait class.
ASH reports present a manageable set of data by capturing only information about active sessions. The amount of the data is directly related to the work being performed, rather than the number of sessions allowed on the system.
ASH statistics that are gathered over a specified duration can be put into ASH reports. Each ASH report is divided into multiple sections to help you identify short-lived performance problems that do not appear in the ADDM analysis. Two ASH report sections that are specific to Oracle RAC are Top Cluster Events and Top Remote Instance as described in the next two sections.
The ASH report Top Cluster Events section is part of the Top Events report that is specific to Oracle RAC. The Top Cluster Events report lists events that account for the highest percentage of session activity in the cluster wait class event along with the instance number of the affected instances. You can use this information to identify which events and instances caused a high percentage of cluster wait events.
The ASH report Top Remote Instance section is part of the Top Load Profile report that is specific to Oracle RAC. The Top Remote Instance report shows cluster wait events along with the instance numbers of the instances that accounted for the highest percentages of session activity. You can use this information to identify the instance that caused the extended cluster wait period.
See Also:
Oracle Database Performance Tuning Guide for more information about ASH reportsThis section explains wait events and statistics specific to Oracle RAC and how to interpret them when assessing performance data generated by the Automatic Workload Repository, Statspack, or by ad-hoc queries of the dynamic performance views.
This section includes the following topics:
See Also:
Oracle Database Performance Tuning Guide for more information about wait event analysis and thespdoc.txt
file for details about the Statspack utilityThe statistics snapshots generated by AWR and Statspack can be evaluated by producing reports displaying summary data such as load and cluster profiles based on regular statistics and wait events gathered on each instance.
Most of the relevant data is summarized on the Oracle RAC Statistics Page. This information includes:
Global cache efficiency percentages—workload characteristics
Global cache and Enqueue Service (GES)—messaging statistics
Additional Oracle RAC sections appear later in the report:
Analyzing and interpreting what sessions are waiting for is an important method to determine where time is spent. In Oracle RAC, the wait time is attributed to an event which reflects the exact outcome of a request. For example, when a session on an instance is looking for a block in the global cache, it does not know whether it will receive the data cached by another instance or whether it will receive a message to read from disk. The wait events for the global cache now convey precise information and waiting for global cache blocks or messages is:
Summarized in a broader category called Cluster Wait Class
Temporarily represented by a placeholder event which is active while waiting for a block, for example:
gc current block request
gc cr block request
Attributed to precise events when the outcome of the request is known, for example:
gc current block 3-way
gc current block busy
gc cr block grant 2-way
In summary, the wait events for Oracle RAC convey information valuable for performance analysis. They are used in Automatic Database Diagnostic Monitor (ADDM) to enable precise diagnostics of the effect of cache fusion.
In order to determine the amount of work and cost related to inter-instance messaging and contention, examine block transfer rates, remote requests made by each transaction, the number and time waited for global cache events as described under the following headings:
The effect of accessing blocks in the global cache and maintaining coherency is represented by
The Global Cache Service statistics for current and cr blocks, for example, gc current blocks received, gc cr blocks received, and so on
The Global Cache Service wait events, for gc current block 3-way, gc cr grant 2-way, and so on.
The response time for cache fusion transfers is determined by the messaging and processing times imposed by the physical interconnect components, the IPC protocol and the GCS protocol. It is not affected by disk I/O factors other than occasional log writes. The cache fusion protocol does not require I/O to data files in order to guarantee cache coherency and Oracle RAC inherently does not cause any more I/O to disk than a non-clustered instance.
This section describes how to monitor Global Cache Service performance by identifying data blocks and objects which are frequently used (hot) by all instances. High concurrency on certain blocks may be identified by Global Cache Service wait events and times.
The gc current block busy
wait event indicates that the access to cached data blocks was delayed because they were busy either in the remote or the local cache. This means that the blocks were pinned or held up by sessions or delayed by a log write on a remote instance or that a session on the same instance is already accessing a block which is in transition between instances and the current session needs to wait behind it (for example, gc current block busy
).
The V$SESSION_WAIT
view to identify objects and data blocks with contention. The gc
wait events contain the file and block number for a block request in p1 and p2, respectively.
An additional segment statistic, gc buffer busy
, has been added to quickly determine the busy objects without recourse to the query on V$SESSION_WAIT
mentioned earlier.
The AWR infrastructure provides a view of active session history which can also be used to trace recent wait events and their arguments. It is therefore useful for hot block analysis. Most of the reporting facilities used by AWR and Statspack contain the object statistics and cluster wait class category, so that sampling of the views mentioned earlier is largely unnecessary.
It is advisable to run ADDM on the snapshot data collected by the AWR infrastructure to obtain an overall evaluation of the impact of the global cache. The advisory will also identify the busy objects and SQL highest cluster wait time.
This section describes how to monitor Global Cache Service performance by identifying objects read and modified frequently and the service times imposed by the remote access. Waiting for blocks to arrive may constitute a significant portion of the response time, in the same way that reading from disk could increase the block access delays, only that cache fusion transfers in most cases are faster than disk access latencies.
The following wait events indicate that the remotely cached blocks were shipped to the local instance without having been busy, pinned or requiring a log flush:
gc current block 2-way
gc current block 3-way
gc cr block 2-way
gc cr block 3-way
The object statistics for gc current blocks received and gc cr blocks received enable quick identification of the indexes and tables which are shared by the active instances. As mentioned earlier, creating an ADDM analysis will, in most cases, point you to the SQL statements and database objects that could be impacted by inter-instance contention.
Any increases in the average wait times for the events mentioned earlier could be caused by the following occurrences:
High load: CPU shortages, long run queues, scheduling delays
Misconfiguration: using public instead of private interconnect for message and block traffic
If the average wait times are acceptable and no interconnect or load issues can be diagnosed, then the accumulated time waited can usually be attributed to a few SQL statements which need to be tuned to minimize the number of blocks accessed.
The column CLUSTER_WAIT_TIME
in V$SQLAREA
represents the wait time incurred by individual SQL statements for global cache events and will identify the SQL which may need to be tuned.
Note:
Oracle recommends using ADDM and AWR. However, Statspack is available for backward compatibility. Statspack provides reporting only. You must run Statspack at level 7 to collect statistics related to block contention and segment block waits.Most global cache wait events that show a high total time as reported in the AWR and Statspack reports or in the dynamic performance views are normal and may present themselves as the top database time consumers without actually indicating a problem. This section describes the most important and frequent wait events that you should be aware of when interpreting performance data.
If user response times increase and a high proportion of time waited is for global cache (gc), then you should determine the cause. Most reports include a breakdown of events sorted by percentage of the total time.
It is useful to start with an ADDM report, which analyzes the routinely collected performance statistics with respect to their impact, and points to the objects and SQL contributing most to the time waited, and then moves on to the more detailed reports produced by AWR and Statspack.
The most important wait events for Oracle RAC include various categories, including:
Block-oriented
gc current block 2-way
gc current block 3-way
gc cr block 2-way
gc cr block 3-way
Message-oriented
gc current grant 2-way
gc cr grant 2-way
Contention-oriented
gc current block busy
gc cr block busy
gc buffer busy acquire/release
Load-oriented
gc current block congested
gc cr block congested
The block-oriented wait event statistics indicate that a block was received as either the result of a 2-way or a 3-way message, that is, the block was sent from either the resource master requiring 1 message and 1 transfer, or was forwarded to a third node from which it was sent, requiring 2 messages and 1 block transfer.
The gc current block busy
and gc cr block busy
wait events indicate that the local instance that is making the request did not immediately receive a current or consistent read block. The term busy in these events' names indicates that the sending of the block was delayed on a remote instance. For example, a block cannot be shipped immediately if Oracle Database has not yet written the redo for the block's changes to a log file.
In comparison to block busy
wait events, a gc buffer busy
event indicates that Oracle Database cannot immediately grant access to data that is stored in the local buffer cache. This is because a global operation on the buffer is pending and the operation has not yet completed. In other words, the buffer is busy and all other processes that are attempting to access the local buffer must wait to complete.
The existence of gc buffer busy events also means that there is block contention that is resulting in multiple requests for access to the local block. Oracle Database must queue these requests. The length of time that Oracle Database needs to process the queue depends on the remaining service time for the block. The service time is affected by the processing time that any network latency adds, the processing time on the remote and local instances, and the length of the wait queue.
The average wait time and the total wait time should be considered when being alerted to performance issues where these particular waits have a high impact. Usually, either interconnect or load issues or SQL execution against a large shared working set can be found to be the root cause.
The message-oriented wait event statistics indicate that no block was received because it was not cached in any instance. Instead a global grant was given, enabling the requesting instance to read the block from disk or modify it.
If the time consumed by these events is high, then it may be assumed that the frequently used SQL causes a lot of disk I/O (in the event of the cr grant) or that the workload inserts a lot of data and needs to find and format new blocks frequently (in the event of the current grant).
The contention-oriented wait event statistics indicate that a block was received which was pinned by a session on another node, was deferred because a change had not yet been flushed to disk or because of high concurrency, and therefore could not be shipped immediately. A buffer may also be busy locally when a session has already initiated a cache fusion operation and is waiting for its completion when another session on the same node is trying to read or modify the same data. High service times for blocks exchanged in the global cache may exacerbate the contention, which can be caused by frequent concurrent read and write accesses to the same data.
The load-oriented wait events indicate that a delay in processing has occurred in the GCS, which is usually caused by high load, CPU saturation and would have to be solved by additional CPUs, load-balancing, off loading processing to different times or a new cluster node.For the events mentioned, the wait time encompasses the entire round trip from the time a session starts to wait after initiating a block request until the block arrives.