

52

Parallel Processing

In recent years, Multi-Processors (commonly referred to as MP), Symmetrical Multi-Processors (SMP), and Massively Parallel Processors (MPP) have been sweeping the marketplace, taking on vast amounts of data processing that is performed in "parallel" among the available Central Processing Units (CPUs). In this chapter, I discuss how to choose, leverage, optimize, and convert to an Oracle parallel processing platform.

Understanding the Requirements for Parallel Processing

The following configurations describe which features, functions, and benefits are available through parallel processing.

Symmetrical Multi-Processors are usually a 1- to 49-processor system configuration with limited scalability, because I/O per CPU grows as additional processors are added.

In Parallel Processors, the database query is divided into logical components that can be run in parallel on MP servers. Oracle7 Parallel Query uses this feature.

Massively Parallel Processors involve eight or more processors, as in the nCUBE 2 and 3 models with 65,536 parallel processors available.

Tightly coupled servers are MP servers in which all CPUs address shared memory, or distributed memory that is addressable by the other processors. Tightly coupled servers provide very scalable operation without increased CPU I/O overhead.

Loosely coupled servers are MP servers that contain all CPUs in a multiple server configuration, such as a DEC VAX or ALPHA cluster.

If the system you develop performs as planned but is modified or used by more users than originally designed for, you can add an additional processor and RAM for about $10,000. This assumes, of course, that your file server selection has a multi-processor upgrade path.

MP file servers feature a preconfigured, fixed number of processors and cannot be expanded without an expense nearly equal to the original file server purchase price. The advantage of an MP system is that the fixed number of processors are optimized to run in parallel with minimal parallel processing CPU overhead.

Symmetrical Multi-Processors enable the file server to incorporate additional processors as needed. Scalability does have its drawbacks: as more processors are added to the system, the law of diminishing returns applies, and each additional CPU contributes less to the combined parallel processing capability.

Massively Parallel Processors incorporate an initially expensive architecture that delivers cheaper overall performance. As more processors are added, performance increases without the I/O loss associated with scalability in SMP systems.

Operating System Impact in Parallel Processing

Some operating systems are more scalable than others. As a system accumulates processors, additional tasks are required of the hardware, operating system, and Distributed Lock Manager (DLM) for that platform. The DLM facilitates processor availability for tasks to be scheduled between processors, on both local and remote clustered file server configurations.

The DEC Alpha 2100 is exemplary in demonstrating that a great file server is only as great as its chosen operating system.

A DEC Alpha 2100 file server, model EV5/275 (275 MHz), running MS-NT achieves its best processing results with no more than three processors, and in fact gets slower overall as additional processors are added!

That very same DEC Alpha 2100 EV5/275 file server running UNIX continues to achieve the best processing results regardless of how many processors are added to the system. In fact, the first four processors each add well over 90% additional throughput in a UNIX configuration.

This compares with only a 50% throughput gain from the second MS-NT processor added to the exact same system; the operating system is the only difference between these two examples.

New World Processing

Parallel processing was previously performed only by mainframes, which currently cost an average of $6 to $28 million to operate over five years, versus any faster 32-bit or 64-bit SMP file server with 30 gigabytes of RAID5 storage. Today, turnkey conversions can be done by AIS for less than $500,000, including fault-tolerant hardware, database software, application software, and a legacy conversion of the older application onto the new client/server platform.

The hardware costs for MP are approximately 1/32 that of the traditional mainframe hardware cost, while delivering very high numbers of transactions per second (TPS). Due to very high mainframe costs, price/performance ratios are not available, as illustrated by the lack of audited benchmarks from the mainframe arena.

Mainframe Benchmarks

The best recent public TPC benchmark for a mainframe is that of a $27,913,514.00 ES/9000 model 511 TPF mainframe running the TPF 3.1 database at 3,504.93 tpsA (TPC-A v1.2). The test results report that this mainframe's total five-year system cost, including software, equates to $7,964.00 per transaction per second ($27,913,514 divided by 3,504.93 tpsA).

This compares with MP industry price/performance trends of 147 to 713 tpsA (TPC-A v1.2) for similar UNIX platforms. The best UNIX system costs vary from $1,000.00 to $5,941.00 per tpsA.

These tests, like all TPC benchmarks, also include a complete five-year system cost, with maintenance and all software, totaling between $20,000.00 and $4,239,200.00.

Leveraging Parallel Processing Platforms

How does Oracle7 leverage these newer parallel processing platforms? The Oracle7 Enterprise Edition has some very "advanced" features. The major enhanced features missing from the Oracle7 Workgroup Server that the Enterprise Edition leverages for parallel processing are the following options:

The Parallel Query Option distributes queries among the available processors to complete complex tasks much more quickly than a single CPU can. Even a full table scan can be distributed among separate CPUs on the same file server with this feature. Oracle Parallel Query employs "query slaves" that schedule these parallel job streams. These parallel features are developed by Oracle Corporation in conjunction with the parallel platform designers from each hardware vendor, and they are transparent to Oracle programmers and users! Oracle7.1 delivers the Parallel Query Option and leverages MP by providing sophisticated high-speed record latching, distributing parts of queries among the available processors for significant throughput advantages. The Parallel Query Option is also transparent to the DBA, because the Oracle7.1 software and host operating system automatically distribute these transactions across any MP, SMP, PP, or MPP system.
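For example, here is a sketch of a parallelized full table scan using the Oracle7 PARALLEL hint. The table, alias, and degree of parallelism are hypothetical; without the hint, the table's default degree applies.

-- Ask Oracle7 to scan SALES with eight query slaves (illustrative names and degree)
SELECT /*+ FULL(s) PARALLEL(s, 8) */
       region, SUM(amount)
FROM   sales s
GROUP  BY region;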

The Parallel Server Option creates a real-time copy of the Oracle instance and database on one or more file servers, or nodes on a network. This serves a dual purpose, as it can be used to do both of the following:

  1. Balance the user load proportionally between separate file servers on a network

  2. Enable an on- or off-site hot standby, real-time updated copy of the Oracle RDBMS

Balancing of the Client/Server user load proportionally between parallel server file servers and the associated parallel instances on a network can be achieved in three ways.

In the first method, as a PC windows client requests a connection from the client to the server, SQL*Net 2.2 is invoked and calls the /windows/oracle.ini variable:

LOCAL=TNS:your_connect_string or "alias"

The search path next defaults to the user's ORA_HOME/network/admin directory to locate the user's tnsnames.ora file in search of your_connect_string or "alias." If half of the users have been defaulted to host1 and the other half to host2, a proportional balance is achieved.
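As a minimal sketch (the alias and host names are hypothetical), every client PC can carry the same LOCAL setting in /windows/oracle.ini, while the tnsnames.ora file distributed to each half of the user community resolves that alias to a different host:

; /windows/oracle.ini on every client PC
LOCAL=TNS:sales_db

# tnsnames.ora shipped to the first half of the users
sales_db = (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(Host = host1)(Port = 1521))
    (CONNECT_DATA = (SID = ORAC)))

# tnsnames.ora shipped to the second half of the users
sales_db = (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(Host = host2)(Port = 1521))
    (CONNECT_DATA = (SID = ORAC)))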

A second method of parallel load balancing is more difficult to accomplish. This balance is achieved at the TCP/IP or other protocol level by programming an automated switch sequence, triggered by a predetermined user limit.

When this high-water user connection mark is reached, users are shuttled to the next host specified in the program. If the primary host is unavailable, the secondary host is the recipient of all users requesting connections.

This fault-tolerant, fail-over mechanism can also be achieved by another method, enabling fault tolerance in an on- or off-site hot standby. This is accomplished through the parallel server's real-time updated copy of the Oracle RDBMS and is achieved through the tnsnames.ora file as well.

In the address list section of the tnsnames.ora file shown following, a second address entry specifies a redundant host name or Internet Protocol address (IP).

If the primary host database instance and SQL*Net listener are unavailable, then the second address connection is made to that second host's database listener, database, and associated database instance.

The Oracle "SID" or instance is the portion of the database cached in real memory, for that host database file server.

################
# Filename......: tnsnames.ora
# Name..........: LOCAL_REGION
# Date..........: 23-SEP-95 22:22:39
################
your_connect_string_is_this =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS =
        # Primary host; used whenever the primary SID is available
        (PROTOCOL = TCP)
        (Host = VENUS)
        (Port = 1521)
      )
      (ADDRESS =
        # Secondary host; used only if the primary SID is unavailable
        (PROTOCOL = TCP)
        (Host = URANUS)
        (Port = 1521)
      )
    )
    (CONNECT_DATA =
      # A single instance name, or "SID," shared by all Parallel Server instances
      (SID = ORAC)
    )
  )

In the preceding scenario, all of the fault tolerance is transparent to the user. The user was never required to enter a connect string or to change a connect string.

If a tool asks the user to enter an Oracle connect string, the user can simply press Enter to accept the connect string or alias from the manually set LOCAL=TNS:your_connect_string line in the /windows/oracle.ini file.

This will then facilitate the user's connection to the first available Oracle database through that database's already running listener process.

The Oracle parallel server option always synchronizes transactions between the instances in memory and the database or databases. This assures transaction integrity in that all inserts, updates, and deletes committed on one parallel instance are immediately available to all users of any other parallel instance. These transactions can be applied to any parallel server instance on any other cluster or network node.

Parallel Server should not be confused with Parallel Processing, in that it can be used with the following:

Such a file server configuration should have at least 16M of RAM for each Oracle instance that must be simultaneously resident in memory.

The Distributed Database Option joins two or more remote databases as one database. The remote database location and/or server name are transparent to the users.

The Data Replication Option is useful when isolated information, from a laptop or local regional office, is to be duplicated and updated separately from the master database.

The Oracle7.2 64-bit Option allows for extensive Parallel Processing capabilities, including actual performance increases of up to 107 times over 32-bit systems, as well as the ability to leverage the following:

Comparing Parallel Processing to Mainframes

The following test results provided by the Oracle Corporation illustrate how these newer parallel processing platforms are faster, cheaper, and as fault-tolerant as any very expensive mainframe with terabytes of data.

Test Results

The tests shown in Table 52.1 were performed using maintenance releases 7.1, 7.2, and 7.3 of the Oracle7 server as a demonstration of the performance and scalability improvements engineered into these releases. Due to time constraints, not all of the tests could be run on releases 7.1 and 7.2. Some estimations and extrapolations based on available data were used to fill out the performance matrix.

Displaying time in minutes, the test results illustrate dramatic performance gains for all the major operations involved—summary creation, index builds, and complex queries. Release 7.2 provides a ten-fold improvement in summary creation performance over release 7.1. Index builds and complex query execution are about twice as fast. These performance gains are further extended with release 7.3: summary creation exhibits an amazing twenty-fold improvement over release 7.1. Complex query execution is five times faster and index creation delivers over a three-fold improvement.

Table 52.1. Oracle7 server performance results by release (times in minutes).

Operation                          Release 7.1    Release 7.2    Improvement    Release 7.3    Improvement
                                   Results        Results        over 7.1       Results        over 7.1
----------------------------------------------------------------------------------------------------------
Summary creation (6-D cube)        13,400.0**     1320.8         1015%          647.4          2070%
Single key index creation          967.7          398.6          243%           304.3          318%
Concatenated key index creation    1719.6         620.0          277%           389.9          441%
Complex query                      149.1          75.4           198%           29.9           498%

**This number is an estimate. The operation was not run to completion.

Oracle7 Server Scalable Parallel Architecture for Open Data Warehousing

Oracle's scalable parallel database architecture combines the best elements of traditional shared-nothing and shared-disk approaches to provide highly scalable database system performance for open data warehousing on all parallel hardware systems—from tightly coupled Symmetric Multi-Processor systems (SMP) to loosely coupled systems such as clusters and Massively Parallel systems (MPP). This chapter provides an outline of Oracle's parallel database architecture and illustrates the superiority of Oracle's "best-of-both-worlds" architecture over pure shared-nothing database systems in terms of performance, scalability, resource utilization, manageability, availability, and consistency with emerging trends in technology.

The increasing use of information as a key strategic weapon in business decision-making has led to an explosive growth in complex, data-intensive decision support applications in recent years. This growth has been further accelerated by the availability of cost-effective parallel processing systems from open system vendors. This section discusses the common architectural approaches that underlie parallel hardware and database systems.

Parallel Hardware Systems

Parallel hardware systems utilizing inexpensive commodity components have the potential to provide excellent price/performance advantages over traditional mainframe systems in data-intensive decision support applications. Tightly coupled Symmetric Multi-Processor systems (SMP) have been the most widely used parallel hardware systems. These systems utilize multiple processors that share common memory and disk resources and hence are also known as "shared-everything" systems. Primary advantages of SMP systems include simplicity of application development and ease of administration. These systems, however, do not provide any inherent fault-tolerance: the failure of a single critical component such as a CPU could bring the entire system down. Further, they are currently somewhat limited in terms of scalability and growth due to limitations in available system bus bandwidth and operating system software scalability.

Loosely coupled systems such as clusters and Massively Parallel Processing (MPP) systems eliminate some of the drawbacks of SMP systems, providing improved fault tolerance and easy incremental system growth. These systems are made up of multiple nodes, with each node consisting of a single processing unit—a single CPU or an SMP unit, with its own dedicated system memory. In terms of disk sharing, these systems come in a variety of flavors. At one extreme are "shared-nothing" systems where each set of disk devices has physical connectivity only to a single node; at the other end are "shared-disk" cluster systems with each node having direct physical access to all of the disks. There are also hybrid systems where each node has direct physical access to a subset of devices but has logical access, enabled through an operating system abstraction layer, to all disk devices. The level of physical connectivity determines the potential for fault tolerance available on these systems. On true shared-disk systems, for example, it is possible to implement database systems that provide access to all data as long as at least one node is available.

Loosely coupled systems, however, present greater challenges in terms of system administration and application development, as compared to SMP systems. One of the primary requirements for a parallel database system is to hide these complexities, presenting a logical unified view to the users, enabling them to transparently exploit all available resources on a loosely coupled system.

Traditionally, two distinct approaches—shared-nothing and shared-disk—have been used in the implementation of database systems on parallel hardware. Each approach, in its pure form, offers certain unique benefits and tradeoffs. This section provides a conceptual foundation for the central argument in this chapter: a hybrid architecture that combines the strengths of each approach is the most pragmatic real-world solution.

The Shared-Nothing Approach

In a pure shared-nothing architecture, database files are partitioned among the instances running on the nodes of a multi-computer system. As illustrated in Figure 52.1, each instance or node "owns" a subset of the data, and all access to this data is performed exclusively by the owning instance. In other words, a pure shared-nothing system uses a partitioned or restricted access scheme to divide the work among multiple processing nodes. Data ownership by nodes changes relatively infrequently—database reorganization and node failure are the typical reasons for change in ownership.

Parallel execution in a shared-nothing system is directly based upon the data partitioning scheme. Each partition is accessed in parallel by a single process or thread, with no provision for intrapartition parallelism. Conceptually, it is useful to think of a pure shared-nothing system as being very similar to a distributed database. A transaction executing on a given node has to send messages to other nodes that own the data being accessed and coordinate the work done on the other nodes, to perform the required read/write activity. Such message passing is commonly known as function shipping.


Figure 52.1. A shared-nothing database system.

In principle, this is a very reasonable approach on shared-nothing parallel hardware: The approach is simple and elegant. It employs a software architecture that directly maps to the hardware system and has the potential to provide scalable performance on loosely coupled systems. Function shipping is an efficient execution strategy and typically provides significant performance gains over the alternative data shipping approach. However, as you will see in further detail later, the real-world applicability of a pure shared-nothing database architecture is seriously limited by certain drawbacks inherent to this scheme.

Firstly, the shared-nothing approach is not appropriate for use on shared-everything SMP hardware. The requirement to physically partition data in order to derive the benefits of parallelism is clearly an artificial requirement in a shared-everything SMP system, where every processor has direct, equal access to all the data. Secondly, the rigid partitioning-based parallel execution strategy employed in the shared-nothing approach often leads to skewed resource utilization. The tight ownership model, which prevents intrapartition parallel execution, fails to utilize all available processing power in the presence of data or workload skew.

Thirdly, the use of data partitioning as the exclusive basis for parallelism forces a trade-off between manageability and parallel execution performance, often introducing serious administrative complexities.

Finally, shared-nothing systems, due to their use of a rigid restricted access scheme, fail to fully exploit the potential for high fault-tolerance available in clustered systems.

The Shared-Disk Approach

This approach is based on the assumption that every processing node has equal access to all of the disks (data). In a pure shared-disk architecture, database files are logically shared among the nodes of a loosely coupled system with each instance having access to all the data. As illustrated in Figure 52.2, shared-disk access is accomplished either through direct hardware connectivity or by using an operating system abstraction layer that provides a single view of devices on all nodes. Therefore, a transaction running on any instance can directly read or modify any part of the database. Such systems require the use of interinstance communication to synchronize update activities performed from multiple instances.


Figure 52.2. A shared-disk database system.

Pure shared-disk is a good approach on clustered systems where equal and direct access to all disks is typically available from every node. A single node variant of the shared-disk scheme also ideally maps to SMP systems. Shared-disk has the potential to offer excellent resource utilization because there is no concept of data ownership and every processing node can participate in accessing all data. Further, this approach provides unmatched levels of fault tolerance, with all data remaining accessible even with a single surviving node. However, on shared-nothing hardware systems with local affinity between processing nodes and disks, the assumption of direct data access is not valid. Logical shared access can usually be accomplished, but the exclusive use of such a scheme can result in unneeded internode data shipping, incurring significant performance penalties.

What the Real World Needs

Over the years, there has been much religious debate among academic researchers and industry analysts on what the ideal parallel database architecture is—shared-nothing or shared-disk. Although such debate still goes on, often producing interesting technical insights into the merits and drawbacks of either scheme, it's clear that no single scheme is without flaws. Given that, the authors believe that the most pragmatic real-world solution has to be a hybrid architecture that incorporates elements of each approach to provide the best of both worlds. Such an approach would provide the elegance and efficiency of pure shared-nothing systems, while avoiding their drawbacks in terms of resource utilization, manageability, and availability. At the same time, such a hybrid approach would also incorporate key attributes of pure shared-disk systems to deliver excellent resource utilization and fault tolerance.

Oracle7 Parallel Architecture: An Overview

The Oracle server was the first open relational database product to provide complete support for all parallel hardware architectures, with production availability on a variety of SMP systems for several years, and on loosely coupled cluster and MPP systems since 1990. The Oracle Parallel Server technology with its advanced parallel cache management facilities and the unique high performance, nonblocking concurrency mechanism is years ahead of any other commercial open systems product in performance, proven reliability, and unlimited scalability. Parallel query execution technology was introduced in Oracle7 Release 7.1 to serve as the basis for enabling data-intensive decision support applications on cost-effective open systems. Parallel execution capabilities were designed as core internal facilities to achieve highly scalable performance on all parallel hardware architectures. The initial release provided support for parallel execution of most operations involved in query execution including table scans, sorts, joins, aggregations, and ordering. In addition, this release included parallel execution of data loads, index creation, and recovery operations. Each subsequent release of the Oracle7 server has added significant functional and performance improvements to this initial offering. This section provides an outline of Oracle's server architecture and an overview of the dynamic parallel query execution technology.

Oracle Parallel Server Technology

The Oracle Parallel Server technology is at the heart of Oracle's server implementation on loosely coupled clustered and MPP systems. As shown in Figure 52.3, the parallel server utilizes a robust, fault-tolerant shared data access scheme and Oracle's proven, scalable parallel cache management technology to provide unmatched levels of availability and scalable performance on parallel hardware systems. This section outlines key elements of this technology with specific emphasis on decision support applications. The Oracle Parallel Server utilizes a shared data access mechanism that enables multiple instances of the database server, with an instance on each node of the loosely coupled system, to transparently share a common set of database files. The shared data access is direct on clustered systems where there is physical connectivity of every disk device to all nodes. On shared-nothing and hybrid hardware systems, the shared access is enabled through an efficient operating system abstraction layer. As you will see later in further detail, Oracle's parallel architecture utilizes an intelligent mix of efficient local data access and transparent shared access on these systems to provide a superior combination of performance, resource utilization, and fault tolerance.


Figure 52.3. Oracle Parallel Server.

Each server instance utilizes a dedicated buffer cache, with the consistency of data updates across the multiple caches ensured by Oracle's parallel cache management technology. Oracle's proven, scalable parallel cache management technology represents the result of years of joint development efforts with leading parallel hardware system vendors. Key components of this technology include an efficient interinstance communication mechanism and a distributed lock manager (DLM) subsystem.

Parallel cache management plays a key role in providing scalable performance for on-line transaction processing (OLTP) applications characterized by highly concurrent update activities. In typical data warehouse systems, however, since the workload mostly consists of read-only access and bulk data additions and purges, parallel cache management doesn't play as much of a role. Oracle's unique nonblocking concurrency mechanism provides virtually unlimited concurrent data access to multiple readers and writers, practically eliminating contention and locking issues.

The shared data access scheme in Oracle's server architecture provides a logical single-system view of the data on all nodes, simplifying system administration issues. It also forms the foundation for providing superior fault tolerance—on most systems, Oracle's Parallel Server technology enables access to all data even with a single surviving node.

Oracle Parallel Query Technology

The term Parallel Query is somewhat incomplete because this technology encompasses a range of other data management operations such as data loads, index creation, summary creation and recovery, in addition to parallel query execution. Oracle's parallel technology is designed as a core internal capability of the server, providing superior performance and reliability without any compromises in the range of available server functionality. The key elements of this industry-leading technology are described in this chapter.

Dynamic Parallel Execution: Key Elements

This chapter also describes the basic building blocks of Oracle's dynamic parallel architecture. The key elements are consistent across all parallel hardware architectures—SMP, clustered, and MPP systems. As you will see later, there are some unique optimizations that leverage characteristics of loosely coupled hardware systems.

Parallel Operators

The basic unit of parallel execution is a Data Flow Operator (or simply operator). An operator is a higher-level task and often combines multiple relational subtasks into a composite parallel operation. For example, in a query like select ename, empno from emp where salary > 50000, application of the filter SALARY > 50000 could be combined with the table scan into a single operator. Subtasks that can be executed in parallel include:

Query Servers

A set of processes known as Query Servers (sometimes called query slaves) execute each operator in parallel. Query servers are drawn from a configurable, system-wide pool of available servers. The user process where the query originates serves as the Query Coordinator and manages the tasks performed by multiple query servers. The coordinator also assembles and returns the result set.
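The size of that pool is set through init.ora parameters; the following is a minimal sketch with illustrative values (tune the numbers to your CPU count and workload):

# init.ora: size the system-wide pool of query servers (values are illustrative)
parallel_min_servers = 4      # query servers started at instance startup
parallel_max_servers = 16     # upper bound on concurrently running query servers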

Intraoperator and Interoperator Parallelism

As illustrated in Figure 52.4, parallel execution occurs at two distinct levels, intraoperator parallelism and interoperator parallelism. Intraoperator parallelism, sometimes called horizontal parallelism, is the execution of a single operation such as a table scan, sort, or join in parallel by multiple servers. Interoperator parallelism refers to the parallel execution of multiple distinct parallel operations concurrently. For example, while a set of query servers are scanning a table in parallel, another set of servers could be sorting the scanned rows in parallel. This is also known as pipelined parallelism or vertical parallelism.


Figure 52.4. Parallel execution: building blocks.

Table Queues: Dynamic Data Redistribution

It's useful to visualize the parallel execution in terms of an operator tree with directly connected parent-child nodes having a producer-consumer relationship. An efficient, dynamic interprocess communication mechanism called the Table Queue connects each producer-consumer pair and effectively redistributes the output from a set of producer servers to the next set of consumers. The redistribution happens dynamically and takes into account the optimal data partitioning requirement for a given operation. In the example of a table scan followed by a sort, producers generate output that is randomly partitioned, and the appropriate partitioning for a sort operation is by key value range. The table queue mechanism illustrated in Figure 52.5 takes rows retrieved by the table scan processes and transparently redistributes them, partitioned by key value ranges, to the sort servers.

Depending on the type of operation, the table queue mechanism dynamically chooses an appropriate redistribution method from the available partitioning schemes: hash, key range, round robin, or broadcast. In performing the data redistribution, the table queue mechanism automatically incorporates several intelligent optimizations to achieve workload balance among the query servers. In the key range partitioning case, for example, the determination of range values is based on a dynamic sampling of the incoming rows to achieve equal-sized partitions. The medium for interprocess communication among the query servers is shared memory on SMP systems and the high-speed interconnect for clustered and MPP systems.


Figure 52.5. Table queues.

Dynamic Table Partitioning

Parallel query execution typically begins with a parallel table scan operation as the first task. Oracle7 uses a dynamic table-partitioning scheme to automatically partition the table among the participating query servers. The intelligent performance and load-balancing optimizations incorporated in dynamic table partitioning underscore the superiority of Oracle's dynamic parallel architecture and its ability to deliver scalable, optimal performance without rigid physical data partitioning. The table being scanned is initially divided into logical partitions, one per available scan process. The initial partitioning takes physical location of data into consideration and minimizes disk contention. Further, in order to achieve optimal load balancing, each logical partition is subdivided into more granular chunks, and the query servers are assigned a chunk at a time to scan. If some query servers fall behind others in completing the scan, processes that have completed their work "steal" chunks of work from the busy servers, achieving improved throughput and optimal load balancing. As explained later, this strategy pays even richer dividends when combined with Oracle's shared-disk access scheme on clusters and MPP systems.

Oracle's dynamic table partitioning, coupled with intelligent redistribution of intermediate results using the table queue mechanism, enables customers to realize the benefits of parallel hardware immediately, without any need to perform expensive data reorganizations. Further, this approach delivers superior performance for a wide range of access patterns including ad hoc queries by knowledge workers. Although a static data partitioning approach could offer performance gains if the access patterns and data distributions are predictable and fixed, such a requirement is completely unrealistic in the real world. The frequent need to perform physical repartitioning of data, with changes in access patterns, additions in hardware, changes in data distributions, or growth in data volumes, imposes an undue administrative burden. This chapter presents a more detailed examination of this issue in the section "Internode Parallelism."

Flexible Parallelism

Each database table (or index) is assigned a degree of parallelism that's appropriate for operations on that object. The degree of parallelism is a user-defined attribute of the table, specified as a parameter at table creation time or assigned later using a simple SQL command. An appropriate degree of parallelism takes into account factors such as table size, available processors, and the number of disk devices that store the table data. If a degree of parallelism is not explicitly specified, the system computes an intelligent default value. Because table sizes and available processing resources change over time, Oracle's dynamic architecture also provides easy ways to change the degree of parallelism. Users can alter the parallelism using a direct SQL command or even override the value at run time using a query hint. It's useful to contrast this flexibility against the static partitioning-based approach where a physical repartitioning of the table is typically required to alter the degree of parallel execution.
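A brief sketch of these controls, using hypothetical object names and degrees (the PARALLEL clause and hint shown here follow Oracle7 syntax):

-- Assign a default degree of parallelism when the table is created
CREATE TABLE sales (sale_date DATE, region VARCHAR2(20), amount NUMBER)
  PARALLEL (DEGREE 8);

-- Change the degree later as data volumes or hardware change
ALTER TABLE sales PARALLEL (DEGREE 16);

-- Override the table's setting for a single query with a hint
SELECT /*+ PARALLEL(s, 4) */ COUNT(*) FROM sales s;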

Dynamic Parallel-Aware Query Optimization

Oracle's cost-based optimizer incorporates parallel execution considerations as a fundamental component in arriving at optimal query execution plans. As mentioned earlier, the optimizer dynamically computes intelligent heuristic defaults for parallelism based on the number of processors and the number of disk devices that store a given table. Evaluation of the costs of alternative access paths (table scan versus indexed access, for example) takes into account the degree of parallelism available for the operation, leading to execution plans optimized for parallel execution.

In addition to parallelism considerations, Oracle's cost-based optimizer includes a wealth of superior optimization techniques specifically targeted at common data warehouse operations. Star query optimization is a good example of this. Star queries involving a large "fact" table and multiple small "dimension" tables are intelligently optimized, delivering performance that matches special-purpose products from niche vendors. As a further example, parallel access of multiple underlying tables in a UNION ALL view incorporates several intelligent optimizations to deliver optimal performance.
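A hypothetical star query of the kind this optimization targets, with one large fact table and two small dimension tables (all names are illustrative):

SELECT   p.product_line, SUM(f.revenue), SUM(f.profit)
FROM     sales_fact f, product_dim p, period_dim t
WHERE    f.product_id = p.product_id
AND      f.period_id  = t.period_id
AND      t.fiscal_quarter = '1995-Q3'
GROUP BY p.product_line;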

Intelligent Performance Optimizations

In addition to dynamic parallel execution, Oracle's scalable parallel technology incorporates several efficient performance optimizations that specifically benefit typical data warehouse operations. Most of the parallel operations bypass the buffer cache, performing direct parallel database reads to deliver superior performance. Further, sorts, database loads, summary table creation, and index builds perform asynchronous, parallel direct writes to disk, achieving maximum throughput. These facilities effectively eliminate contention for the buffer cache between concurrent DSS operations and isolate DSS queries from concurrent OLTP activities on the system. Parallel table scans utilize an asynchronous read-ahead mechanism to achieve an overlap between I/O and processing, delivering much higher throughput for queries involving large tables. Logging can be optionally disabled for bulk operations such as data loads, index builds, and summary creation using the CREATE TABLE ... AS SELECT operation to achieve substantial improvements in performance.
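As an illustration (object names are hypothetical, and exact clause placement can vary by release), a summary table and an index can both be built in parallel with redo logging suppressed:

-- Summary table built with parallel direct writes and no redo logging
CREATE TABLE sales_summary
  PARALLEL (DEGREE 8) UNRECOVERABLE
  AS SELECT region, TRUNC(sale_date, 'MM') AS sales_month, SUM(amount) AS total_amount
     FROM   sales
     GROUP BY region, TRUNC(sale_date, 'MM');

-- Index build using the same options
CREATE INDEX sales_region_idx ON sales (region)
  PARALLEL (DEGREE 8) UNRECOVERABLE;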

Oracle Parallel Query on Loosely Coupled Systems

The basic architectural elements of Oracle's parallel technology remain consistent across all hardware systems. There are, however, some key optimizations that utilize Oracle's unique shared access technology to leverage distinct characteristics of loosely coupled systems and deliver superior scalability, resource utilization, and availability benefits on these systems.

Function Shipping

Oracle's parallel execution on loosely coupled systems extensively uses the function-shipping strategy to perform work on remote nodes. Query server processes located on remote nodes are sent efficient messages, often in the form of modified SQL subqueries, to indicate the work that needs to be done.


This may come as a surprise to many purists because function shipping is typically associated with shared-nothing database systems, and on systems where shared-disk access is available, data shipping is the typical approach. Oracle's parallel architecture, with its combination of key elements of both systems, makes intelligent use of function shipping when the data to be accessed is located at the remote node, to deliver efficient parallel execution eliminating unneeded internode data transfer over the interconnect.

Exploitation of Data Locality

Each node on a shared-nothing hardware system has direct hardware connectivity to a subset of disk devices, and it is more efficient to access these local devices from the "owning" node. Oracle's parallel query execution exploits this affinity of devices to nodes and delivers superior performance on these multi-computer systems. As with other elements of Oracle's dynamic parallel architecture, this strategy works transparently, without any need to perform value-based partitioning of data. The system dynamically detects the locality of data and makes intelligent use of it in two ways:

Such dynamic exploitation of data locality maximizes local data access and minimizes suboptimal internode data transfer over the interconnect, delivering optimal performance on shared-nothing hardware systems.

Internode Parallelism

In principle, local data access is the preferred strategy on shared-nothing hardware systems; however, exclusive use of this approach, as in pure shared-nothing database architectures, leads to inefficient resource utilization. A number of factors can lead to this: Firstly, data distribution across nodes typically gets skewed over time, with additions and purges. Secondly, the data accessed by a number of queries may relate only to a subset of nodes. For example, if sales data is partitioned by month, a drill-down query on a specific calendar quarter involves only 25% of the nodes. In such situations, a pure partitioned execution strategy cannot leverage unutilized or underutilized processing power on other nodes, resulting in highly skewed resource utilization and suboptimal performance.

Oracle's intelligent combination of shared-nothing and shared-disk architectures makes efficient use of local access and shared access to avoid this problem. Initial work assignment to query servers is based on locality of data. In skewed cases, however, Oracle makes dynamic use of internode parallelism. Query servers located on remote nodes compensate for busy nodes by utilizing shared access, to deliver improved throughput and transparent utilization of all available processing power.

Superior Fault Tolerance

Oracle's implementation of the shared access scheme enables any node in a loosely coupled system to have direct logical access to all data and delivers unmatched levels of availability. In the worst case, even if only one node is available, all of the data is still accessible. Live instances recover transactions on failed instances, and the recovery is performed automatically without any additional administrative work. It's impossible to achieve such complete and transparent fault tolerance in a shared-nothing architecture. On these systems, there's at best a provision for supporting dual-ported disk systems resulting in a single backup node for each primary "owner" node. If both the primary and backup nodes fail, the subset of data owned by the primary node remains inaccessible.

The Oracle7 Advantage over Pure Shared-Nothing Systems

This section contrasts Oracle's scalable "best of both worlds" architecture against pure shared-nothing database systems, illustrating the superior real-world applicability of Oracle's approach.

Static Data Partitioning and Parallelism

Pure shared-nothing database systems base their parallel execution strategy exclusively on static data partitioning, resulting in a scheme that is fundamentally flawed. Let's explain why. Static data partitioning is a valuable data management facility that can improve the manageability and availability of very large objects in Very Large Data Base (VLDB) systems. The use of value-based partitioning as a VLDB administration facility has been around a long time, long before the advent of parallel processing. Partitioning is also useful for query optimization in some cases, as it eliminates the need to examine unneeded data.

Deriving parallel query execution exclusively from this administrative segregation of data, however, introduces substantial administrative complexities and severely limits the applicability of parallel processing. A rigid partitioning-based execution approach can provide optimal performance only if the query patterns can be predicted ahead of time. In real-world data warehouse environments, this is an unrealistic requirement because what's "interesting" to analyze is neither predictable nor fixed and is often data-dependent—the knowledge worker decides what to look at based on results of the current analysis. Even if a subset of access requirements can be predicted, because partitioning can be done only along one set of attributes, this approach optimizes performance only for a restricted subset of queries, limiting the applicability of parallel processing.

Frequent Need for Repartitioning

When data access requirements change over time, as they frequently do in real life, data has to be repartitioned to achieve acceptable performance levels. Further, repartitioning is also required with any of the following:

Such frequent repartitioning of data adds substantial administrative costs and application downtime, limiting the practical applicability of such a scheme. Partitioning-based parallel execution is an inappropriate approach on shared-everything SMP systems. Some level of data partitioning may be desirable on these systems from a manageability point of view as data volumes grow. However, the need to partition all data, irrespective of the administrative needs, is an artificial requirement imposed solely based on the architectural limitation of pure shared-nothing database systems. Such a requirement introduces unneeded administrative complexities and prevents a vast majority of users on SMP systems from realizing the benefits of parallel technology. In short, the static parallel execution strategy requires frequent data reorganizations at substantial cost and downtime, just to achieve reasonable performance for a limited range of queries.

In contrast, Oracle's dynamic parallel execution strategy delivers the full potential of parallel processing for a much broader range of access patterns, including ad hoc queries, without any need for frequent, expensive repartitioning of data. The manageability benefits of data partitioning can be achieved today using a manual partitioning scheme involving multiple underlying physical tables combined using a UNION ALL view. Comprehensive support for static data partitioning will be available in the next version of the Oracle server. The key difference, however, is that Oracle's dynamic parallel execution strategy will provide all the benefits of data partitioning—improved VLDB manageability and optimization without incurring the costs of a strategy that's exclusively based on partitioning.
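A sketch of that manual scheme, with hypothetical monthly tables combined under a UNION ALL view:

-- One physical table per month
CREATE TABLE sales_9507 (sale_date DATE, region VARCHAR2(20), amount NUMBER);
CREATE TABLE sales_9508 (sale_date DATE, region VARCHAR2(20), amount NUMBER);
CREATE TABLE sales_9509 (sale_date DATE, region VARCHAR2(20), amount NUMBER);

-- The view presents the monthly pieces as one logical table
CREATE VIEW sales_all AS
  SELECT * FROM sales_9507
  UNION ALL
  SELECT * FROM sales_9508
  UNION ALL
  SELECT * FROM sales_9509;

Rolling the window forward is then a matter of creating and loading the next month's table, dropping the oldest one, and re-creating the view.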

Leverage of Processing Power

The tight data ownership scheme that is central to the shared-nothing architecture prevents the exploitation of all available processing resources. There are many cases during normal operation where the available processing power of a pure shared-nothing system is limited to a subset of nodes. Data distribution across partitions typically tends to get skewed over time, with additions and purges. In such cases, the largest partition, or equivalently the slowest processing node, dominates the response time, resulting in suboptimal resource utilization. Even in cases where data is fairly uniformly spread across partitions, each query may not access all partitions. An obvious worst-case example is a query that does a full scan of a single partition—this will be exclusively processed by the owning node, resulting in serial execution, even on massively parallel hardware!

This fundamental limitation of pure shared-nothing systems is particularly significant in the following common situation. Data warehouse environments frequently maintain rolling windows of data partitioned by some time unit, say one month. For example, sales data from a fixed number of months may be kept in a table; new data is rolled in on a monthly basis while the oldest month's data is rolled out. Drill-down queries against such data typically involve a particular month or subset of months. For example, a query might generate a report of revenue and profit by product line for a particular fiscal quarter. In a pure shared-nothing system, the processing power available for this query would be restricted to just the nodes that own the three months' data, leveraging only a fraction of the full processing power of the system.

Some shared-nothing systems attempt to rectify their lack of intrapartition parallelism using a multilevel partitioning scheme, sometimes called "hybrid partitioning." In this scheme, data is initially distributed across all nodes using a hash function. At the next level, data within each node is partitioned by key range into a number of subpartitions. The idea is that the hash step will insure the participation of all nodes in processing a query because the distribution is somewhat random. Within each node, if a significant number of partitions can be eliminated based on the query predicate, each node will process only a subset of the range partitions, improving response time. This may seem like a good idea in principle until you look at the underlying administration issues. On an MPP system with 32 nodes, for example, if each hash partition is subdivided into 25 ranges, there will be a total of 800 partitions to create and manage! What's worse, as the partitions get skewed over time, as they often do, the administrator has to repartition all these 800 pieces all over again. This results in nothing short of an administrative nightmare. Furthermore, if these 800 partitions relate only to a single table there will likely be several thousand of these partitions to manage in a typical system.

Oracle's parallel architecture, with its intelligent combination of data locality and shared access strategies, is far less prone to these data skew problems. Initial partitioning of data is based on maximizing locality. In cases of data skew, however, idle or underutilized nodes can transparently compensate for busy nodes, utilizing shared-disk access, to deliver optimal performance.

Ease of Administration

The excessive dependence of shared-nothing systems on static data partitioning introduces substantial administrative complexities. Users have just one control—static data partitioning—for achieving two potentially conflicting objectives: manageability and parallel execution performance. What's desirable from a manageability and availability point of view may not yield acceptable performance and vice versa, forcing users to compromise one or the other. Further, the frequent need to perform expensive offline data repartitioning imposes further administrative burden, potentially rendering this approach impractical in a number of situations. The rigid data ownership model in shared-nothing systems also prevents incremental system growth—data has to be repartitioned before any additional processing resources can be utilized—negating one of the major benefits of parallel hardware systems. Oracle's real-world architecture, in contrast, delivers the true potential of parallel processing with minimal administrative complexity.

Robust Scalability

In a pure shared-nothing architecture, the ideal speedup goal of processing a query in time (T/N) is seriously compromised by the failure of a single processor. In the simplest pure shared-nothing configuration, each processing unit masters a set of disks and no other processing unit has direct access to that set of disks. If the work of executing a query has truly been spread over the N processors, then the failure of a single processor stops the execution of the query for the duration of the outage because no other processing unit has access to the failed processor's set of disks.

To get around this problem, some shared-nothing implementations utilize dual-ported disks, and each processing unit serves as both a master to a particular set of disks and as a backup for another processor unit. If a processor fails, the associated backup does double duty, taking over the work of the failed processor as well as processing its own normal workload. Therein lies the rub. Because the backup now has 2/N of the total work to do, it will finish in 2T/N the time. The completion time for the query is really the completion time of the slowest processing unit, so the entire query will finish in 2T/N the time instead of T/N. In other words, during periods of outage of even a single processor, one gets only half the performance that one paid for. To make matters worse, the time that some processing unit in the system is down is roughly proportional to the number of processing units, so this problem actually gets worse as one adds processors—a kind of reverse scalability feature. Oracle's unrestricted access implementation does not have this problem. Because every processing unit has logical access to every disk, work associated with a failed processor can be spread among all the remaining processors, providing robust scalability and virtually no degradation in performance.

Emerging Trends in Technology

Oracle's parallel technology, with its ability to provide transparent shared data access, is consistent with emerging trends in hardware technology. The next generation of mass storage products in the open systems market will most likely be based on high-performance switched protocols such as Fiber Channel, capable of supporting 100MB per second in each direction (200MB/sec total) per connection. Fiber Channel is gaining considerable momentum in the mass storage world, and large-scale, shared-disk physical connectivity is likely to be available in the near future.

Arbitrated-loop Fiber Channel configurations connecting large numbers of processors to a shared set of disks are being developed even now. Switch-based Fiber Channel configurations will offer even more flexibility and scalability; the aggregate bandwidth of a switched configuration will be proportional to the number of ports on the switch. This greatly enhanced physical connectivity can be transparently exploited by the unrestricted access solution provided by Oracle's parallel architecture. In fact, in the face of ever-increasing shared-disk physical connectivity, it will become increasingly difficult to defend rigid shared-nothing implementations in which each processor can only get to the vast majority of the data on the system via a networked request to another processor.

Another significant direction in storage technology is the availability of storage subsystems that present a single device appearance to the operating system, but internally use many disks and gigabytes of cache memory to hide device boundaries and mask failures. With a single logical device capable of storing hundreds of gigabytes and transfer rates of hundreds of megabytes per second, the pure shared-nothing limitation of access to the device by one instance only is a severe compromise in processing power. With Oracle's unrestricted access implementation, all the processors in the system can access all the disks in the logical device without going through a "master" instance, delivering scalable performance.

Parallel Processing Platform Hardware Configuration Requirements

In this section, I discuss a few aspects of the required hardware configuration for a parallel processing platform.

Fail-Safe Redundant Hosts

HP, Sun, and Digital 64-bit UNIX provide for an instantaneous redundant host. In the event of a system-wide failure, the production UNIX server's disks and software can be switched over to a development host automatically. The HP Switch Over daemon sends a regularly scheduled signal, or "heartbeat," to the standby host and receives state-of-health diagnostic information in response to that signal (or notes the absence of a response).

When the standby host determines that the heartbeat messages have stopped, users are warned of a pending shutdown. HP Switch Over then initiates a takeover, locking the failed host's disks and rebooting as the current host. The heartbeat mechanism notifies the dead or distressed host that a switchover sequence has occurred and issues the appropriate operating system shutdown, ensuring that the lame machine notices the takeover and halts.

The DEC Alpha file servers 2100, 8200, and 8400, utilizing the EV5 processor, deliver added redundancy through "CPU failover" and "memory failover." If a single CPU or RAM failure occurs during processing, the SMP file server continues processing uninterrupted! At the time of failure, a "system in distress" page is transmitted to service support personnel for immediate notification of the particular problem.

Cross or Remote Mounted Disk Drives

Remote storage devices should not be used in parallel processing because they must interface with at least two file servers. Read/write activity between the disks of a remote-mounted file system is effectively doubled, because each request must also pass through the remote disk's own processor in the configuration.

Remote file systems must also interface with the network routers, cables, connectors, concentrators, Attachment Unit Interfaces (AUIs), host adapters, smart controller cards, and buses in the path connecting the two systems. Remote file systems effectively double seek time at a minimum, and they can also degrade network performance for the entire Local Area Network involved.

Disk Drive Allocation

The number of physical disks mounted and available to the file system is critical to the performance of Oracle7 and related client/server applications. Performance bottlenecks can occur when there is access to only a single "RAID5 stripe" or a single input/output disk read/write "needle," regardless of the number of logical volumes per physical disk drive.

Physical volume devices, not logical volumes, are what you use to balance the I/O of the background Oracle processes across the disks.

Disk Partitioning

Physical disks may be "partitioned," or configured by a logical device manager into subunits, which can make a single disk drive appear to be multiple disk drives with different device names. The converse is also true: several disks may appear to be a single disk partition.

Before you install Oracle7, check with your operating system administrator for clarification of the physical, not logical, device names. Also verify the necessary paths, permissions, and storage space for the data files that you can create on each disk drive.

Maximizing Parallel Platform Database and Disk Performance Through Balancing the Oracle Processes

The system tablespace, rollback segments, control files, and redo logs are always active for every Oracle instance. If archive mode is enabled, transaction logging will "journal" all database activity, enabling a point-in-time recovery, in conjunction with your full database backup, if needed. Archive mode should write to duplexed tape and disk media; the fastest recovery is directly from disk when that is possible.
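
If you are unsure whether a database is already running in archive mode, a quick check against the standard V$DATABASE view (a minimal sketch; run it as a DBA user) is:

Select log_mode
from v$database;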

The files supporting these components should be distributed across different disk drives for optimal database speed.

Be certain that all Oracle background and foreground processes have the same operating system process priority. Contention will arise as a high-priority process waits for a low-priority process, which may never swap back in.

Disk Optimization For Oracle7 and the Parallel Server Oracle7 Option

The optimum number of disks for an Oracle installation should, at a minimum, enable the separation of table data on disk one and index data on disk two, whatever their sizes.

With additional disks available, place the Oracle7 system tablespace on disk three, the rollback segments on disk four, the temporary tablespace on disk five, and the redo logs on disk six.

The triple-copied control files are always written to when major RDBMS events occur and are not a serious performance bottleneck. The control files should always be separated across multiple disks, whenever more than one disk drive is available by direct mount to the file system. This triple redundancy ensures rapid recovery in the event of a disk drive failure or accidental erasure.
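
As a hedged initSID.ora sketch (the file paths shown are hypothetical), the three control file copies can be placed on separate disks with the control_files parameter:

# Three control file copies on three separate physical disks (hypothetical paths)
control_files = (/disk01/oradata/PROD/control01.ctl,
                 /disk02/oradata/PROD/control02.ctl,
                 /disk03/oradata/PROD/control03.ctl)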

Time zones and business application groups should influence the tablespace design so that a database maintenance window for one application does not interfere with the operation of 24x7 maximum-availability systems.

Disk Storage Devices

The Oracle disk storage devices in Table 52.2 can be standard devices, mirrored devices, Write Once Read Many optical discs (WORM), Write Many Read Many optical discs (WMRM), Redundant Arrays of Independent Disks (RAID), and other media types. These disks should have relatively fast access times (8 to 9 milliseconds for standard disk drives) and should not be cross-mounted from other file system servers, except in an out-of-file-space emergency.

Optical Storage

The media read and write rates for optical storage are significantly slower, so optical devices should not be used as high-performance file system devices for normal database query activity. These devices are very well suited for backups and long-term data archival, and they are usually guaranteed for long-term readability of 50 years or more.

Table 52.2. Disk and tape storage devices.

Disk/Tape Storage Device   Storage Media Type        Maximum     Average          Average Media      Storage
                                                     Gigabytes   Access Time      Write Time/Rate    Drive Cost
-------------------------  ------------------------  ----------  ---------------  -----------------  ----------
Standard "Barracuda"       Fast/Wide 20MB@Second     4GB         8 ms             8 ms               $1,400.00
Optical WORM               12 Inch                   25GB        135 ms           68 ms              $7,000.00
Cybernetics 2500           Optical WMRM, 5.25 Inch   1.3GB       19.8 ms          500 KB/s           $3,495.00
Cybernetics CY-8505        8mm tape                  25GB        67.5 s           12-90M@min.        $4,250.00
DEC TZ-80                  Tape cartridge            40GB        10 s             333M@min.          $5,500.00
Exabyte EXB-8505           8mm tape                  5-10GB      67.5 s           10-35M@min.        $4,125.00
Mirrored Disks             Fast/Wide 20MB@Second     4GB         4-9 ms/2 disks   4-9 ms/2 disks     $1,400.00
RAID 5 Array               Fast/Wide 20MB@Second     4GB         12-16 ms         12-16 ms           $1,400.00
Microtek Floptical         Compact disk              120M        --               60k @Sec           $5,000.00
SONY minidisk              Read/write CD             120M        --               600k @Sec          $1,000.00

RAID Versus the Speed of Disk Mirroring

Commonly referred to as fault-tolerant disk systems, Redundant Arrays of Independent Disks (RAID) and "mirrored disk arrays" use different RAID levels to provide redundant duplication of data across disk storage devices.

In RAID5, when a single disk drive failure occurs, the redundant parity information spread across the remaining disks is used to recreate the data that was stored on the now unavailable disk drive. A "scratch pool" contains extra disk drives reserved for each RAID5 array. In the event that a failure occurs in any of the RAID5 arrays, an extra disk from the scratch pool is temporarily added to redefine that RAID5 stripe.

This recovery of data should be considered complete only after a new disk is installed and mounted onto the file system.

Performance of common RAID levels, other than RAID levels 1 and 5, is significantly slower because the data from each drive is duplicated among an average of three to five other disk drives. This not only increases the number of physical reads and writes but also degrades CPU and I/O performance across the processor bus.

The database implications are twofold: file writing is significantly slower due to duplication across three to five disks on average, and with many common RAID configurations, real-time reads are possible only from the single master disk, not from the redundant copy.

What are RAID Levels 0 - 5? Each RAID level offers a different approach to data redundancy and fast access to that data. Fault tolerance, read/write performance, and storage capacity are unique to each RAID level.

Mirrored Disk Drives

RAID levels 1 and 5 enable a noticeable performance increase; level 1 (mirroring) additionally maintains a consistent one-for-one copy of the data on each mirrored disk drive.

Performance is optimized by the operating system because queries for data on mirrored disks are split among the duplexed physical media. This improves the physical read/write ratio: when two files are accessed from a disk that has been mirrored, both disks are used to retrieve the data, speeding up reads and enabling faster throughput.

The data actually travels from the disks, through the host adapter and finally across the CPU bus for I/O.

Mirroring also allows for separation of a live system into two (2) segments, one of which remains active to the users. This momentarily "breaks" the mirror while data can be copied via any high-speed means to an archive disk or compressed tape for backup and recovery purposes.

The mirror is then "re-silvered" as it is re-attached and rolls forward to update any file server activity since the initial separation from the master disk drive.

Breaking the mirror momentarily removes the second disk mirror from your recovery contingency plan and should therefore be done only after other alternative backups have been verified.

Parallel Processor Types, Smart Controller Cards, and Bus I/O

If the processor is a 32-bit Intel i486 and no more than two processors are connected to a 64-bit bus via the host adapters, no input/output contention will result, provided the MHz rating of the bus is verified to be identical to or greater than the CPU clock speed rating. This may also be referred to as a "PCI" or "local" bus.

As other more powerful processors enter the marketplace, specific attention to the architecture, processor type, the number of processors per bus and their maximum throughput rates per clock tick cycle (the MHz rating) must be addressed.

When a single 64-bit processor is connected to a 64-bit bus, having the same clock speed or MHz rating, then no contention or delay of information processing results. If, however, two (2) such 64-bit processors share the same single 64-bit bus with the same clock speed, each individual processor must wait a full cycle while the other processor is active.

This also means that when both processors are active, each processor can only receive input/output half of the time, from the common bus peripheral devices such as disks, tape devices, or network host adapter interfaces.

It is therefore wise to verify that your expansion processor bus is rated "wide and fast" for 20M@Second throughput, or "fiber channel" for 200M@Second processor throughput. This will help eliminate parallel processor contention for the CPU bus and allow for the fastest possible computation.

If the Intel DX processor is a DX2, the processor's internal clock speed is doubled; in the DX4, it is tripled. This applies only to internal CPU operations, NOT to the input/output across the common bus. The MHz rating of the bus is the speed, or bandwidth, at which information input/output actually travels.

Making the Right Choice

With all of these MP, SMP, and MPP platforms, how can you be certain you are making the right choice for your business applications today and tomorrow?

Transaction Processing Performance Council (TPC) Benchmarks

The Transaction Processing Performance Council benchmarks, described below, effectively illustrate comparative performance and cost across platforms.

The best single resource from the TPC is probably the Complete Listing of TPC Results, which includes up-to-date listings of all TPC/A, B, C, D, and E benchmarks, detailing the company sponsor, system, database, and operating system tested, and the complete five-year costs including all software and system maintenance support fees.

TPC/A: Online Transaction Processing (OLTP) test that scales upward to measure database update, disk I/O, and transaction integrity. Throughput is measured in Transactions Per Second (TPS) when driven from multiple terminals, and all costs are included in the total system cost used for the price/performance ratio.

TPC/B: Current database stress test (not OLTP) detailing significant disk I/O, transaction integrity, all costs, moderate system and application execution time.

TPC/C: Simulated order-entry environment test of five complex, concurrent OLTP transaction types (order entry, delivery, payment recording, order status checks, and warehouse stock-level checks), including all costs, measured in Transactions Per Minute (TPM).

TPC/D: New decision support (DS) benchmark for large systems, processing 17 long-running "real-world" queries against complex data structures. Available since Q2 1995.

TPC/E: New large business "Enterprise" database stress test for complex query processing on very large databases, available Q1 1996.

TPC quarterly reports contain executive summaries of the TPC results published in the quarter as well as a complete historical benchmark listing. Executive summaries are two-page summations of the lengthy Full Disclosure Reports, describing major components, performance data, detailed pricing of the system configurations, and their related diagrams. Supplementary information covers competitive system and database analysis topics.

Complete Full Disclosure Reports are an average of 100 pages but include detailed price breakdowns by components and the specific optimizations used to achieve the best performance results.

The TPC Price Performance ratio reports are available from within your organization, or you can telephone the TPC for reports and information directly in Sunnyvale, California, at (408) 295-8894.

Sample TPC Benchmarks

Table 52.3 lists a few sample TPC/A and TPC/C benchmarks. The TPC consists of 44 members from the hardware and software industry.

TPC results are subject to challenge by competitors and are independently audited for 60 days prior to their publication. Benchmarks are summarized to include the complete five-year cost of the system, including software with full support and administration costs. These figures are then expressed in Transactions Per Second (TPS) or Transactions Per Minute (TPM) and the associated price/performance ratio.

Table 52.3. Sample TPC/A and TPC/C benchmarks.

Company   System                    TPC Rev   Throughput       $Price/Perf      5-Yr. Total Cost   RDBMS              O/S                 Date
-------   ------------------------  -------   --------------   --------------   ----------------   ----------------   -----------------   -----
AT&T      NCR 3655/48 CPU C/S       v 1.2     713.56 TPS/A     $5,941           $4,239,200.00      Oracle7            MP/RAS SVR4.2       10/93
Compaq    Prolinea 2000 5864200     v 1.2     240.68 TPS/A     $4,890           $1,176,922.00      Oracle7.0.15       SCO UNIX 3.2.4      9/93
Compaq    Systempro/XL 5862100      v 1.2     242.23 TPS/A     $5,130           $1,242,662.00      Oracle7.0.15       SCO UNIX 3.2.4      8/93
Compaq    Systempro/XL 5862040      v 1.2     171.82 TPS/A     $5,437           $834,146.00        Oracle7.0.10       SCO UNIX 3.2.4      1/93
HP        9000 T500 C/S             v 3.00    5070 tpm/C       $530.00@tpm/C    $2,872,151.00      Oracle7.3          HP-UX 10.10         7/95
HP        9000 K410 C/S             v 3.00    3809 tpm/C       $364.00@tpm/C    $1,384,763.00      Oracle7.3          HP-UX 10.01         9/95
HP        9000 H70 C/S              v 1.2     411.73 TPS/A     $6,866           $2,745,542.00      Oracle7.0.11       HP-UX 9.0           6/93
HP        9000 H50 C/S              v 1.1     184.55 TPS/A     $8,637           $1,593,366.00      Oracle7.0.10       HP-UX 9.0           6/93
I.B.M.    ES/9000 742 Fastpath      v 1.2     1427.07 TPS/A    $13,348          $19,175,741.00     IMS DB 4.1 w/RSR   MVS/ESA 4.3         12/93
I.B.M.    RS/6000 570 C/S           v 1.2     128.50 TPS/A     $6,536           $1,095,880.00      Oracle7            AIX 3.2.3 UNIX      12/93
Sequent   S2000/750 C/S             v 1.2     1002.37 TPS/A    $9,313           $9,335,113.00      Oracle7.0.12       DYNIX/ptx V 2.0.1   6/93

Historical IBM 3090 Model 600J performance is 25 Transactions Per Second, per processor. At six processors, this equates to 150 TPS/A v1.2. These machines cost $6 million to operate for five years, including software, maintenance, and support, thus generating a price performance ratio of $40,000.00@Transaction. The figures shown in Table 52.3 reflect why the current trend is towards client/server and away from the mainframe.

Parallel Processing Platform Selection Criteria

Choosing your parallel processing platform requires specific selection criteria. Decisions you make regarding how you will evaluate the information and select the best current and future solution will have a long-term impact on your entire network user community.

When selecting a platform, consider your real-world experiences with your current file servers.

When the data answering these questions has been collected, it should illustrate which file servers you should consider. The selection should strategically allow for the inclusion of one or more additional Parallel Processors (PP), Symmetrical Multi-Processors (SMP), or Massively Parallel Processors (MPP) as an optional upgrade path for performance improvements if needed.

The logical design should be used to guide the platform design process by providing maximum throughput, ease of use, and ease of maintenance.

The platform and physical design information is then used to distribute the tablespaces of high-volume, heavy-load queries onto opposite disks so as to minimize disk thrashing of tablespace files and to eliminate I/O contention.

The high-volume queries should access table and index data, spread across different disks so as to minimize disk contention between users and queries, referencing the proposed physical design.

Remember when calculating index space estimates that indexes can be as large as the tables themselves, and frequently total up to 150% of the actual table size in megabytes.

Parallel Index Platform Design

Disk access is a major performance bottleneck. Server performance degrades as disk I/O occurs. An index on selective columns can greatly improve server performance as the index reduces disk I/O. All database servers support B-tree indexes. B-trees store key values and their physical storage addresses in a hierarchical tree of index pages.
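
As a brief illustration (the index, table, column, and tablespace names here are hypothetical), a B-tree index on a selective column is created with a single DDL statement, and its pages can be directed to a separate disk by naming an index tablespace:

Create index ord_cust_idx            -- hypothetical index name
on orders (customer_id)              -- hypothetical table and selective column
tablespace indexes01;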

Calculating Legacy Tablespace and Data File Sizes for the Parallel Platform

Tablespaces are composed of the physical data files on the disks where the data actually resides.

Select * from sys.DB_FILES returns the maximum number of Oracle data files that can be opened per database on that operating system. Select * from sys.DBA_DATA_FILES lists the current data files, their sizes, current status, and the paths used by that Oracle database.
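
As a minimal sketch using the standard columns of the DBA_DATA_FILES view, the following query lists each data file with its tablespace and size in megabytes:

Select tablespace_name,
       file_name,
       bytes/1048576 megabytes       -- convert bytes to megabytes
from sys.DBA_DATA_FILES
order by tablespace_name;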

An index tablespace is identical to any other tablespace except that index objects should exclusively occupy the space. Because of the physical disk separation of tables and indexes, queries that use the indexes and retrieve table data from non-indexed columns use two different disks, cutting the read/write time roughly in half.

When a table is "striped," it uses multiple data files. These are optimally spread across multiple disk drives for the fastest table and index access possible.
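
A hedged sketch of such striping (the tablespace name, file paths, and sizes are hypothetical) simply names one data file per physical disk when the tablespace is created:

Create tablespace app_data
datafile '/disk01/oradata/PROD/app_data01.dbf' size 200M,
         '/disk02/oradata/PROD/app_data02.dbf' size 200M,
         '/disk03/oradata/PROD/app_data03.dbf' size 200M;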

Table Sizing and Types for Parallel Processing Platform Disk Design

When tables are created through the execution of Data Definition Language (DDL), storage parameters specify the initial sizing and the data file, that is, the specific disk and full path where the object is created, as well as the subsequent or future space allocation extensions when additional storage space is required.

The initial parameter is the size, in kilobytes or megabytes, of the first contiguous extent. An extent is composed of Oracle blocks, in which the data is actually stored internally to the database. The normal block size is 2KB, minus about 50 to 100 bytes that Oracle uses internally.
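
A minimal DDL sketch follows (the table, tablespace, and sizes are hypothetical); the INITIAL value must fit within contiguous free space on the disk holding the named tablespace:

Create table orders (
  order_id    number(10),
  entered_on  date
)
tablespace app_data                              -- hypothetical tablespace
storage (initial 10M next 10M pctincrease 0);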

The available contiguous disk space must be at least as large as these "initial" values. A buffer of 33% to 50% additional RDBMS-reserved disk space is reasonable as a platform design standard. Free space can be queried from the Oracle view by using the following command:

Select * from sys.DBA_FREE_SPACE

When an existing table has two or more extents, it should be reorganized for performance and to avoid the famous ORA-1547 Failed to allocate extent of size (size needed) in tablespace (tablespace name) error message. This error indicates that your object cannot grow without the addition of a new data file for the tablespace, or a table or tablespace export and import using the COMPRESS=Y export parameter option.
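
A hedged command-line sketch of such an export (the user, password, file, and table names are hypothetical) uses the COMPRESS=Y parameter so that the subsequent import recreates the table in one large initial extent:

exp app_user/app_pass file=/disk05/exports/orders.dmp tables=orders compress=y log=/disk05/exports/orders_exp.log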

Before redesigning your tablespaces, be sure that tablespaces having more than one data file can be unified on your proposed disk layout, unless they are striped tablespaces. This will ensure that your largest objects fit into the new single extents. Also verify that percent increase is zero (0), or you may be surprised at how rapidly your contiguous free space is used up when new data requires additional extents.
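
To see how much free space, and how large a contiguous chunk, each tablespace still offers before such a reorganization, a grouped form of the same view can be used (a sketch):

Select tablespace_name,
       sum(bytes)/1048576 free_mb,              -- total free space per tablespace
       max(bytes)/1048576 largest_chunk_mb      -- largest single contiguous piece
from sys.DBA_FREE_SPACE
group by tablespace_name;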

Temporary Segment Parallel Processing Platform Design

A temporary segment is opened for each active user in the Oracle RDBMS. Optimally, the temporary tablespace is a "striped" tablespace, one in which the multiple data files associated with the tablespace are spread across multiple disks. The temporary tablespace and default tablespace are both ALTER USER modifiable parameters.
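
A brief sketch of those settings (the user and tablespace names are hypothetical):

Alter user app_user
  default tablespace app_data
  temporary tablespace temp_stripe;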

Parallel Processing Platform Backup Device Design

If archive mode is enabled, the archive destination path should point to a fast, removable media such as a WORM-CD, Cybernetics CY-8505 8mm tape device, or Digital Linear Tape device (DLT) featuring compression capable of storing up to 24 gigabytes per tape or 5 gigabytes uncompressed. A standard 2.2–5 gigabyte Exabyte 8mm tape drive may also be used, offering less capacity for the same price.

These three tape devices offer the capability to store from 8 gigabytes up to 40 gigabytes compressed onto a single tape, or five gigabytes uncompressed per tape. This eliminates the need for attended backup tape changers and automated tape-changing hardware, which can fail. The Cybernetics read/write rates vary up to 85M per minute when used with an optimum SCSI II interface.

The fastest backup device available is the Digital Linear Tape device (DLT). The data backup or "transfer" rate is a speedy one gigabyte per three minutes! Cartridges are independently rated for over 40,000 cycles of tape writes. The DLT has a storage capacity of 20 to 40 gigabytes per cartridge. The DLT is also priced at less than $5,500.00, which is in the same price range as the other two tape devices.

Parallel Processing Platform Database Creation Requirements

The Oracle installer utility for Oracle7 now avoids the previous limit of 32 maximum data files and other limitations encountered when creating databases from the Oracle version 6 installation utility.

The utility is accessed initially by copying it from the installation media into a temporary location. Do this only after you have read the installation guide specifically written for your hardware platform and operating system version release. About 200M may be required for the temp directory until you remove it, after your installation.

After loading the system variables and specifying the product directory locations, give special attention to not accepting the default locations and tablespace sizes for the system tablespace, rollback segments, and control files.

These database files should be spread across several different physical disk drives for performance optimization. If the online help facilities of many products are installed, the sizing of the tablespaces must increase to hold this data. Specifically, the Oracle CASE tools and SQL*TextRetrieval will require at least a 25-megabyte tablespace for their help examples when loaded.

New Data Types That Enhance Parallel Processing Platforms

The Blob and VARCHAR2 column data types are new to Oracle7 and should be used accordingly.

Oracle's Media Server uses the Blob to store video and audio content for real-time playback, fast forward, and rewind directly from an MP, SMP, or MPP server, as delivered to between 30,000 and 150,000 concurrent users over cellular, twisted pair, co-ax, and fiber-optic cabling.

Blob or "Binary Large Object" may be up to a 2-gigabyte long RAW field which should be used in a "striped" disk tablespace configuration. This will enable the maximum disk I/O performance.

Blob data types are used in the storage of large data objects, where searching and indexing would otherwise not be possible.
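
A sketch of such a column (the table and tablespace names are hypothetical), declared as a 2-gigabyte-capable LONG RAW field placed in a striped tablespace:

Create table video_clips (
  clip_id    number(10),
  clip_data  long raw                -- binary data, up to 2 gigabytes per row
)
tablespace media_stripe;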

The Oracle Call Interface (OCI) may be required to read this data type as a bitstream from the file server, using Oracle's Pro*C tool.

VARCHAR2 data types are variable-length character fields used to avoid wasted column space when data is missing from your stored records. VARCHAR2 should be used wherever possible, unless math is or will be done with the data stored in that particular column, which requires data validation. A 2-byte marker is the cost of this feature, while any space saved beyond this is available for other data storage uses.
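
As a small illustration (the table is hypothetical), character columns that never require arithmetic are declared VARCHAR2, while a value used in calculations stays a validated NUMBER:

Create table customers (
  customer_id    number(10),          -- numeric key, used in joins and arithmetic
  customer_name  varchar2(80),        -- variable length, no wasted trailing space
  postal_code    varchar2(12)         -- stored as text; no math is performed on it
);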

Routine Parallel Platform Server Maintenance

Database errors are logged throughout the Oracle system according to configurable parameters in the initSID.ora file located in the $ORACLE_HOME/dbs directory.

The default path for the Oracle system errors that a user may have seen before contacting you for advice is $ORACLE_HOME/rdbms/log. This path may be altered by setting the user dump destination (user_dump_dest) parameter in your initSID.ora file.
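
A hedged initSID.ora sketch of those settings (the directory paths are hypothetical):

# Trace and dump destinations (hypothetical paths)
user_dump_dest       = /u01/app/oracle/admin/PROD/udump
background_dump_dest = /u01/app/oracle/admin/PROD/bdump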

The tkprof tool can be used to analyze these errors if they do not seem obvious to you, or they can be faxed to Oracle for assistance and interpretation: call (415) 506-1500 and speak to an analyst, who will assist you in the analysis and in logging a TAR, identified by your temporary order number or permanent Customer Service Information (CSI) number.

Parallel Processing Database and Tool Upgrades

When you are upgrading your file server Operating System, a new version of Oracle will most likely have to be installed to support this upgrade after the new OS is installed. Prior to this, always ensure that full database exports and backups are available, and that redundant copies of the new media are available if possible.

When upgrading the Oracle source code, be sure to select "Database upgrade" from the menu and not the "Install New Database" from the orainst menu. If you do select "Install New Database" by mistake and then you enter your existing Oracle SID name, your old instance and database will be initialized or purged of all data, user IDs, and objects and will therefore be useless except as a fresh database.

If you do want to create a user test-bed or staging area, you can use the "orainst" installer utility for new database SID creation from your existing source code and executables.


Before any such maintenance or installation proceeds, be sure to have full database exports and verified backups available to you, and absolutely do not use an existing Oracle SID or any current database files from your file server. Select * from DBA_DATA_FILES will echo the current status of data files, for that database only.

The SQL*Net versions installed on the server must match the client versions to allow connections between client and server.

You must also keep or install any other SQL*Net versions that your clients may use, such as version 1.1 for Apple Macintosh clients.

Parallel Processing Platform Free Space Calculations

The current Oracle7 Administration guide contains the sizing calculations for both free tablespace (select * from SYS.DBA_FREE_SPACE) and table size estimates. These include the reasons why not to use percent increase other than zero (0) in your Data Definition Language, except in high-availability systems where you may not have the option of adding a data file in an emergency. You should frequently monitor available disk space, data file free space, chained rows, and the number of extents.
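
A monitoring sketch using the standard DBA_SEGMENTS view lists the objects that have grown into multiple extents and are therefore candidates for reorganization:

Select owner,
       segment_name,
       segment_type,
       extents,
       bytes/1048576 megabytes
from sys.DBA_SEGMENTS
where extents > 1
order by extents desc;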

To calculate Oracle block space used by an Oracle table or index, execute this query:

Select count(distinct(substr(rowid, 1, 8)||substr(rowid, 15, 4)))
from tablename;

Multiply this total by your Oracle block size (usually 2,048 to 8,192 bytes, and up to 32KB), giving the bytes of data, and deduct the PCTFREE value, if any.
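
For example, if the query returns 5,000 distinct blocks and the database uses a 2,048-byte block size, the table occupies roughly 5,000 x 2,048 = 10,240,000 bytes, or about 10 megabytes, before deducting the PCTFREE reserve. The same arithmetic can be folded into the query (assuming a 2,048-byte block size):

Select count(distinct(substr(rowid, 1, 8)||substr(rowid, 15, 4))) * 2048
       approximate_bytes
from tablename;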

As with Oracle version upgrades in the future, you may want to contact Oracle at (415) 506-1500 to verify that the printed material you are using in your manual is the best and most up-to-date resource for these calculations. As the architecture shifts, these manuals become outdated and are supplemented by definitive white papers on various subjects.

For more information regarding parallel platform optimization, refer to the Oracle7 server documentation, Oracle's supplementary white papers, and the TPC reports described earlier in this chapter.

Summary

Traditionally, two distinct approaches—shared nothing and shared disk—have been used in the implementation of database systems on parallel hardware. Each approach in its pure form offers certain unique benefits and tradeoffs. This chapter illustrated a conceptual foundation for its central argument: a hybrid Oracle architecture that combines the strengths of each approach is the most pragmatic real-world solution.

The TPC benchmarks are an invaluable resource for independent verification of realistic performance gains. Conversions can include a fault-tolerant SMP or MPP configuration, in conjunction with application migration, reengineering, and new development.

Oracle's scalable parallel architecture combines the best elements of pure shared-disk and shared-nothing approaches to provide a proven real-world architecture that enables customers to realize the true potential of parallel processing. Oracle's unique approach optimally exploits the distinct characteristics of all parallel hardware architectures (SMP, clusters, and MPP systems) to deliver scalable performance with superior manageability and unmatched levels of availability. As President Clinton said recently, "There is no mainframe explanation for the PC (client/server) world in which we are now living."
