

17

Designing a Database

There are several steps involved in developing an effective database design. As with all types of applications, the process begins with requirements analysis. In terms of relational database design, this phase answers questions regarding what data elements must be stored, who will access them, and how.

The second step is to define the logical database. This phase does not deal with how the data will be stored physically, but with how information is grouped logically. The requirements are translated into a model that provides a level of abstraction from the physical database, representing data in terms of business entities and relationships, rather than in terms of tables and columns.

Physical design is the final phase, in which individual data elements are given attributes and defined as columns in tables. This phase also deals with performance considerations relating to the creation of indexes, rollback segments, temporary segments, and the physical layout of data files on disk. DDL (Data Definition Language) scripts are written to create database objects and to be used for capacity planning.

A simple contact manager will be used as a sample application throughout this chapter, illustrating the concepts and techniques presented for each phase of the design process.

Requirements Definition

System requirements are typically gathered through a series of interviews with the end users. This is an iterative process in which systems designers provide structure to the ongoing dialog, document their findings, and solicit feedback from the users. Although requirements definition is not normally considered part of the design process, the design is driven by the requirements, and the two processes often overlap. For example, the logical model may bring out new requirements that were not recognized in earlier phases of analysis. It is important, however, to identify all requirements before developing a physical design, because capacity planning and hardware purchasing decisions are ineffective without a full understanding of system requirements.

A common technique used to define and document database requirements is to develop a data dictionary. As the name implies, a data dictionary simply enumerates and defines the individual data elements that must be stored. An initial draft of the data dictionary for a simple contact manager might look like Table 17.1.

Table 17.1. Initial data dictionary for the contact manager.

Last Name: The individual contact's last name
First Name: The individual's first name
Middle Initial: The individual's middle initial
Contact Type: Standardized description indicating whether this individual represents a client, a prospect, a vendor, or some other type of contact
Individual Notes: Additional information related to the individual
Company: The name of the company that the individual represents
Company Notes: Additional information related to the individual's company
Address Line 1: Line 1 of the individual's street address
Address Line 2: Line 2 of the individual's street address
Address Line 3: Line 3 of the individual's street address
City: City name of the individual's mailing address
State: State name for the individual's mailing address
Zip Code: Zip code for the individual's mailing address
Address Type: Standardized description indicating whether this is a work, home, or some other type of address
Phone Number: The individual's area code and phone number
Phone Type: Standardized description indicating whether this is a home, office, or other type of phone number
Date Contacted: The date that this individual was contacted
Contacted By: The name of the salesperson or employee who contacted this individual
Contact Method: Standardized description indicating whether the individual was contacted by phone, mail, fax, or some other method
Contact Reason: Standardized description of the reason that the individual was contacted
Contact Notes: Additional information related to this specific contact

Although this is a good way to start defining database requirements, there are obvious shortcomings. The data dictionary does not describe how these individual items are related. It also lacks information regarding how the data is created, updated, and retrieved, among other things.

A functional specification documents the system requirements in plain English and should fill in details concerning who will be using the system, when, and how. Information concerning the number of concurrent users accessing the system, how frequently records are inserted and updated, and how information will be retrieved is particularly important to cover in the functional specification. These factors help determine hardware and software licensing requirements, and have a significant impact on issues relating to performance, security, and database integrity.

The functional description for the sample contact manager might include a summary similar to the text that follows:

The system will be available to 40 sales representatives, 5 sales managers, 26 sales assistants, 6 purchasing agents, 1 purchasing department manager, and 2 purchasing assistants, for a total of 80 users. Of these 80 possible users, it is expected that a maximum of 20 would be actively using the system at any given time. Purchasing department personnel should have access only to purchasing department contacts, and sales department personnel should have access only to sales contacts.

All users may add information regarding a specific contact at any time, but whereas sales representatives and purchasing agents can add new prospects, only assistants can add new vendors and clients (after obtaining proper approval from a manager). Sales representatives and purchasing agents should have access only to their accounts and prospects, whereas managers should have full access to the entire database for their specific departments.

One assistant from each department will be designated as a system administrator. Only the system administrators will be able to add and modify address, phone, contact types, contact methods, and contact reasons. With the approval of a manager, a system administrator will be able to reassign a vendor or client to a new purchasing agent or sales representative.

For audit purposes, every time information is added or modified, the individual who made the modification, and the date and time that the information was modified, should be recorded.

In the preceding example, the functional specification added several new data elements to the requirements, in addition to pertinent information regarding access and security. The functional specification and data dictionary are often developed simultaneously, as one document may provide relevant information that should be reflected in the other.

An important part of requirements analysis is to anticipate the needs of the users, because they will not always be able to fully explain the system requirements on their own. Based on information from the previous examples, the system designers may have these follow-up questions:

These are obviously just a few of the questions that come to mind. In practice, the functional description should describe the system to the fullest extent and detail possible. The importance of thorough requirements analysis and documentation is often underestimated. Put simply, poor requirements definition will most likely result in poor or inadequate design, because these requirements provide the foundation for the later phases of design, including the logical and physical data models.

The Logical Model

A common way to represent the logical model is through an Entity-Relationship (E-R) Diagram. For the purposes of this type of model, an entity is defined as a discrete object for which items of data are being stored, and a relationship refers to an association between two entities.

In the contact manager example, there are five main entities for which data is being stored: Employees, Individuals, Contacts, Addresses, and Phone Numbers.

The relationships between these entities can be summarized in plain terms: each individual is assigned to one employee, and an employee may be assigned many individuals; an individual has one or more addresses and one or more phone numbers; an individual may be the subject of zero, one, or many contacts; and each contact is made by one employee.

These entities and their relationships can be represented graphically by an E-R diagram, as shown in Figure 17.1.


Figure 17.1. E-R diagram for the contact manager sample application.


Observe how the one-to-one-or-many and one-to-zero-one-or-many relationships are represented. One-to-one and one-to-zero-or-one relationships can be represented using similar notation.

This may seem to be a simplistic approach to modeling the application's data, but it is often a good first step. In larger applications with hundreds of entities, these models can become extremely complex.

This model can be taken a step further by defining attributes for each entity. An entity's attributes are the individual items of data to be stored that relate specifically to the object. The attributes for each entity in the example are listed in Table 17.2.

Table 17.2. Attributes for each entity.

Employees: Employee Number, User ID
Individuals: Last Name, First Name, Middle Initial, Company, Individual Notes, Company Notes
Contacts: Contact Date, Contacted By, Contact Reason, Contact Method, Contact Type, Contact Notes
Addresses: Address Line 1, Address Line 2, Address Line 3, City, State, Zip Code, Address Type
Phone Numbers: Phone Number, Phone Type

Note that several items of information are missing. The audit information mentioned in the functional specification is omitted. This can be handled by adding a Last Update User ID and Last Update Date/Time Stamp attribute to each entity. More important, there are attributes missing that are required to relate entities to each other. These data items will be handled somewhat differently because they are not "natural" attributes belonging to a specific entity.

This is a highly abstract view of the data, concerned only with broad categories of data (entities) and the logical relationships between them. The E-R model, although good at representing basic data concepts, is not of much use when it comes to physical implementation. The relational model helps bridge this gap.

The relational model is characterized by its use of keys and relations, among other things. The term relation in the context of relational database theory should not be confused with a relationship. A relation can be viewed as an unordered, two-dimensional table, where each row is distinct. Relationships are built between relations (tables) through common attributes. These common attributes are called keys.

There are several types of keys, and they sometimes differ only in terms of their relationships to other attributes and relations. A primary key uniquely identifies a row in a relation and each relation can have only one primary key, even if more than one attribute is unique. In some cases, it takes more than one attribute to uniquely identify each row in a relation. The aggregate of these attributes is called a concatenated key, or a composite key. In other cases, a primary key must be generated. The entity Individuals in the preceding example illustrates this point. Although it may be likely, there is no guarantee that the combination of the entity's attributes will be unique. A new attribute should be created based on generated values to make Individuals a relation. This can be accomplished in Oracle through the use of a SEQUENCE.
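As a minimal sketch of this technique (the sequence name is hypothetical, and the insert columns follow the physical Individuals table defined later in this chapter), a SEQUENCE might be created and used as follows:

      CREATE SEQUENCE Individual_Seq
            START WITH 1
            INCREMENT BY 1;

      -- NEXTVAL supplies the generated value for the primary key column
      INSERT INTO Individuals
            (Individual_ID, Last_Name, First_Name, Middle_Initial,
             Last_Updt_User, Last_Updt_Timestamp)
      VALUES
            (Individual_Seq.NEXTVAL, 'Smith', 'Pat', 'J', USER, SYSDATE);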

Another type of key, called a foreign key, exists only in terms of the relationship between two relations. A foreign key in a relation is a nonkey attribute that is a primary key (or part of the primary key) in another relation. This is the shared attribute that forms a relationship between two relations (tables). Primary and foreign key relationships are illustrated in Table 17.3.
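A brief sketch of the idea in DDL terms (the table and column names are illustrative only, and the chapter returns to constraint enforcement later):

      CREATE TABLE Companies (
             Company_ID      NUMBER(10)    PRIMARY KEY
            ,Company_Name    VARCHAR2(40)  NOT NULL
      );

      -- Company_ID is the primary key of Companies, so in this
      -- relation it acts as a foreign key back to that table
      CREATE TABLE Individual_Company_Relation (
             Individual_ID   NUMBER(10)    NOT NULL
            ,Company_ID      NUMBER(10)    NOT NULL
                             REFERENCES Companies (Company_ID)
      );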

Referring back to the example, the entities' attributes can be extended to fulfill the audit requirements and make the model relational. Note that the ID attribute is a generated primary key in each relation in which it appears. The reasons for this will be explained in further detail in the following section on performance considerations.

Table 17.3. The relations extended with primary keys, foreign keys, and audit attributes.

Employees: Employee Number (PK), User ID, Last Update User ID, Last Update Date/Time
Individuals: ID (PK), Last Name, First Name, Middle Initial, Company, Contact Type, Assigned Employee (FK), Individual Notes, Company Notes, Last Update User ID, Last Update Date/Time
Addresses: ID (PK), Individual ID (FK), Address Line 1, Address Line 2, Address Line 3, City, State, Zip Code, Address Type, Last Update User ID, Last Update Date/Time
Phone Numbers: ID (PK), Individual ID (FK), Phone Number, Phone Type, Last Update User ID, Last Update Date/Time
Contacts: ID (PK), Individual ID (FK), Contacted By (FK), Contact Date, Contact Reason, Contact Method, Contact Notes, Last Update User ID, Last Update Date/Time

(PK) indicates that the attribute is the primary key of the relation. (FK) indicates that the attribute is a foreign key.

There are numerous limitations and redundancies in this model. For example, if one individual works for more than one company, he or she must be stored as two separate individuals to be associated with both companies. Redundancies are also introduced when multiple contacts share the same address and phone number.

The Normalization Process

A process known as normalization is a technique used to group attributes in ways that eliminate these types of problems. More specifically, the goals of normalization are to minimize redundancy and functional dependency. Functional dependencies occur when the value of one attribute can be determined from the value of another attribute. The attribute that can be determined is said to be functionally dependent on the attribute that is the determinant. By definition, then, all nonkey attributes will be functionally dependent on the primary key in every relation (because the primary key uniquely defines each row). When one attribute of a relation does not uniquely define another attribute, but limits it to a set of predefined values, this is called a multivalued dependency. A partial dependency exists when an attribute of a relation is functionally dependent on only one attribute of a concatenated key. Transitive dependencies occur when a nonkey attribute is functionally dependent on one or more other nonkey attributes in the relation.

Normalization is measured in terms of normal forms, and the process of normalization consists of taking appropriate steps to reach the next normal form:

A relation is in first normal form when every attribute contains only atomic values and there are no repeating groups.

A relation is in second normal form when it is in first normal form and contains no partial dependencies; every nonkey attribute depends on the entire primary key.

A relation is in third normal form when it is in second normal form and contains no transitive dependencies.

A relation is in fourth normal form when it is in third normal form and contains no multivalued dependencies.

(There are other normal forms, but they are beyond the scope of this discussion.)

Third normal form can be reached in the example by removing the transitive dependencies that exist in the Address relation. This can be achieved by removing city and state from the Address relation and creating two new relations, as you will note in Table 17.4.

Table 17.4. The Cities and States relations.

Cities: Zip Code (PK), City (PK)
States: Zip Code (PK), State


The (PK) next to both Zip Code and City in the Cities relation indicates that the attributes make up the concatenated primary key. It would be very unusual to have such a relation, where all attributes are part of the primary key.

This makes both the zip code and city attributes in the Address relation foreign keys. (Note that City is made part of the key because, in rural areas, one zip code may have more than one city.) The model can be further normalized by eliminating several of the multivalued dependencies, as you will note in Table 17.5.

Table 17.5. Lookup relations for the standardized descriptions.

Address Type: ID (PK), Type
Phone Type: ID (PK), Type
Contact Type: ID (PK), Type
Contact Method: ID (PK), Method
Contact Reason: ID (PK), Reason

Where these attributes exist in the previously defined relations, they will remain as foreign keys. As these examples illustrate, the dependencies are not eliminated, but their nature is changed so that dependencies exist on key attributes rather than nonkey attributes. In this example, however, fourth normal form is still not achieved. In the Individual relation, Company has a multivalued dependency (at best) on the Last Name, First Name, and Middle Initial.

Separating Company from Individuals brings the model to fourth normal form. Additional relations allow addresses and phones to be associated with either companies or individuals, and allow individuals to be related to more than one company, as you will note in Table 17.6.

Table 17.6. The model in fourth normal form.

Individuals: ID (PK), Last Name, First Name, Middle Initial, Contact Type (FK), Assigned Employee (FK), Individual Notes, Last Update User ID, Last Update Date/Time
Individual-Company Relation: Individual ID (FK), Company ID (FK)
Companies: ID (PK), Company, Company Notes
Entity Type: ID (PK), Type
Addresses: ID (PK), Entity Type (FK), Entity ID (FK), Address Line 1, Address Line 2, Address Line 3, City (FK), Zip Code (FK), Address Type (FK), Last Update User ID, Last Update Date/Time
Phone Numbers: ID (PK), Entity Type (FK), Entity ID (FK), Phone Number, Phone Type (FK), Last Update User ID, Last Update Date/Time

An additional aspect of the logical model is the design of tablespaces. A tablespace consists of one or more data files and, as the name implies, a tablespace houses one or more database objects. Before proceeding to the physical design, designers should consider how they may want to use tablespaces to group database objects along logical boundaries.

One tablespace is always created when Oracle is installed. This tablespace, called SYSTEM, houses all system tables and other system objects used by Oracle itself. Although this tablespace can be used to house additional tables, it is preferable to create separate tablespaces for application-specific objects. In many cases, it is desirable to create several tablespaces to house different types of database objects. A common logical division is to create separate tablespaces for rollback segments, indexes, tables, and temporary segments. This topic will be discussed in greater detail in the following section, but these logical design issues are worthy of some consideration before proceeding to the physical design.

The Physical Model

The physical database consists of data files, tablespaces, rollback segments, tables, columns and indexes. There are dependencies between these elements that impose an order on the design process. The process often starts with designing the smallest units of physical storage (the column) and proceeds, in order, to each successively larger unit of storage. Overall capacity and performance considerations provide constraints to the design, and should be considered at every step. As with logical modeling, developing a physical design can be a somewhat iterative process.

Column Attributes and DDL

Designing the physical database begins with assigning column attributes. The attributes of a column determine how it will be physically stored in the database by defining its data type and maximum length. The data type and length of a column should be carefully chosen at design time, because it is sometimes difficult to change these attributes after data has been loaded. Consider the following summarizations of each of the Oracle data types:

CHAR(SIZE)

Used to store fixed-length alphanumeric data. The (SIZE) determines the number of characters that will be stored for each value in the column. If a value inserted into the column is shorter than (SIZE), it is padded with spaces on the right until it reaches (SIZE) characters in length. The maximum length of this data type is 255 bytes. If (SIZE) is omitted, the default is 1 byte.

VARCHAR2(SIZE)

Used to store variable-length alphanumeric data. This data type differs from CHAR in that inserted values will not be padded with spaces, and the maximum (SIZE) for this type is 2000 bytes.

NUMBER(P, S)

Used to store positive or negative, fixed or floating-point numeric values. The precision, (P), determines the maximum length of the data, whereas the scale, (S), determines the number of places to the right of the decimal. If scale is omitted, the default is 0. If precision is omitted, values are stored with their original precision up to the maximum of 38 digits.

DATE

Used to store dates and times. Oracle uses its own internal format, a fixed 7 bytes holding one byte each for the century, year, month, day, hour, minute, and second. This is worth remembering in capacity planning: the full 7 bytes are stored for every value, even if only a portion of the information is used. The default representation for dates in Oracle is DD-MON-YY. For example, '01-JAN-95' is used to represent January 1, 1995.

LONG

Used to store up to 2 gigabytes of alphanumeric data. As with VARCHAR2, values are stored at their actual lengths. LONG values cannot be indexed, and the normal character functions such as SUBSTR cannot be applied to LONG values.

RAW

Used to store binary data with no character set conversion. RAW data cannot exceed 255 bytes. RAW data can be indexed, but no functions can be performed on RAW values.

LONG RAW

Used to store large binary objects such as whole documents, video, and graphics, or entire compiled programs. LONG RAW can store up to 2 gigabytes of information, but cannot be indexed.

ROWID

Used to represent a row's physical address. Every table has a ROWID pseudo-column, which is not evident when describing the table or issuing SELECT * FROM table_name. This address will remain unchanged unless the table is exported and imported, or otherwise physically moved on disk. In practice, this value is rarely used.
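It can, however, be selected explicitly by name; a small sketch (the table and column names are illustrative):

      SELECT ROWID, Last_Name
      FROM   Individuals
      WHERE  Last_Name = 'Smith';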

You should consider additional factors besides the nature of the data and its length when selecting a data type for a column. For example, one might assume that it is always better to use VARCHAR2 instead of CHAR, so that only the actual number of bytes present in the data will be stored. There are differences, however, in how comparisons are handled with these data types. Two VARCHAR2 values must be the same length to be considered equal, whereas two CHAR values are compared without consideration of trailing spaces. As a result, if the values 'WORD' and 'WORD ' are being compared, they will compare as equal if they are CHAR values, but will not be considered equal if they are VARCHAR2 values because of the trailing space in the second value.
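A small sketch that demonstrates the difference (the table is hypothetical and exists only for this illustration):

      CREATE TABLE Datatype_Test (
             Char_Col       CHAR(10)
            ,Varchar_Col    VARCHAR2(10)
      );

      INSERT INTO Datatype_Test (Char_Col, Varchar_Col)
            VALUES ('WORD', 'WORD');

      -- Returns the row: both operands are CHAR, so trailing blanks are ignored
      SELECT * FROM Datatype_Test WHERE Char_Col = 'WORD';

      -- Returns no rows: VARCHAR2 comparison treats 'WORD' and 'WORD  ' as different
      SELECT * FROM Datatype_Test WHERE Varchar_Col = 'WORD  ';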

When using the NUMBER data type, the declared precision and scale greatly affect how the data is stored. If not fully understood, these values may not behave as intended. For example, assume that a column has been declared NUMBER(10,4). One might conclude that this means up to 10 digits can be stored in this column, as many as four of which may be to the right of the decimal. This is not necessarily the case, however. An attempt to insert a value of 1234567 into this column will fail with an error of ORA-01438 (value larger than specified precision allows for this column). The declaration of NUMBER(10,4) does allow up to 10 digits of precision, but only 6 of them may be to the left of the decimal.
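The following sketch, using a throwaway table, illustrates the behavior:

      CREATE TABLE Number_Test ( Amount NUMBER(10,4) );

      -- Fails with ORA-01438: seven digits to the left of the decimal point
      INSERT INTO Number_Test (Amount) VALUES (1234567);

      -- Succeeds: six digits to the left of the decimal, four to the right
      INSERT INTO Number_Test (Amount) VALUES (123456.1234);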

The default DATE format can be changed for the database instance, or for a particular session. If you wish to store times in a column declared as DATE, you must alter the default date format to include the time, or all values inserted will have the default time of 12:00 A.M. (with the exception of the SYSDATE system variable). To insert date values from an application using the default date format, a statement such as:

INSERT INTO table_name (column_name) VALUES ('01-JAN-95')

must be issued. Because there is no way to specify a time using the default date format, Oracle will store the default value of 12:00 A.M. If this statement is changed to:

INSERT INTO table_name (column_name) VALUES(SYSDATE)

the time is inserted accurately because SYSDATE shares the same internal representation. The date format can be altered for a session by issuing a statement such as:

ALTER SESSION SET NLS_DATE_FORMAT = 'MM-DD-YYYY HH:MI A.M.'

After this statement has been issued, times can be inserted accurately using the new format:

INSERT INTO table_name (column_name) VALUES('12-31-1994 11:59 P.M.')

Note that this format is valid only for the session. In any subsequent sessions, the statement:

SELECT column_name FROM table_name

will return 31-DEC-94 for the previously inserted value until the session is altered again.


The NLS_DATE_FORMAT can be set for the database instance by including it in the initialization file.
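For example, a line such as the following in the instance's parameter (INIT.ORA) file would make an extended format the instance-wide default; the exact format mask is, of course, up to the designer:

      NLS_DATE_FORMAT = "MM-DD-YYYY HH:MI A.M."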

Using the contact manager application example, the column attributes for the Addresses table might be defined as in Table 17.7.

Table 17.7. Column attributes for the Addresses table.

ID: NUMBER(10)
Entity Type: NUMBER(10)
Entity ID: NUMBER(10)
Address Line 1: VARCHAR2(40)
Address Line 2: VARCHAR2(40)
Address Line 3: VARCHAR2(40)
City: VARCHAR2(30)
Zip Code: NUMBER(5)
Address Type: NUMBER(10)
Last Update User ID: VARCHAR2(20)
Last Update Date/Time: DATE

Defining the column attributes is an important step in capacity planning. From this information, the maximum record size for each table can be determined. This combined with an estimate of the total number of rows helps determine the amount of storage required to house the data.

The next step is to begin creating Data Definition Language (DDL) scripts that will be used to create the tables. This may seem like a step toward implementation, but DDL can be used as a tool for capacity planning and the design of tablespaces and data file layout. The DDL for creating tables consists of defining column attributes and constraints, storage specification, table constraints, and rules. Constraints and rules are discussed in detail in the chapter on enforcing integrity; therefore, for now, the description of DDL focuses on column attributes and storage specification.

Referring back to the sample application, assume that it is estimated that 5,000 address records will be stored initially, and that the number of records is expected to double in the next several years. From the definition of the column attributes for the address table, it is apparent that the maximum size of any single record is roughly 220 bytes. (Remember that a DATE column always occupies its full 7 bytes, even when only the date portion is used.) Assume further that the primary and foreign key IDs will be sequenced starting with 1, and that Address Line 2 and Address Line 3 are rarely populated. Based on this additional information, a conservative estimate would be that the average record size will be 200 bytes. The total size of the table can then be estimated at 1 MB initially. The script in Listing 17.1 can then be written with an appropriate storage specification.

      CREATE TABLE Addresses (
             Address_ID              NUMBER(10)    PRIMARY KEY
            ,Address_Type_ID         NUMBER(10)    NOT NULL
            ,Entity_ID               NUMBER(10)    NOT NULL
            ,Entity_Type_ID          NUMBER(10)    NOT NULL
            ,Address_Line1           VARCHAR2(40)  NOT NULL
            ,Address_Line2           VARCHAR2(40)
            ,Address_Line3           VARCHAR2(40)
            ,City                    VARCHAR2(30)  NOT NULL
            ,Zip_Code                NUMBER(5)     NOT NULL
            ,Last_Updt_User          VARCHAR2(20)  NOT NULL
            ,Last_Updt_Timestamp     DATE          NOT NULL
      )
      TABLESPACE Contact_Main
      STORAGE (    INITIAL             1M
                   NEXT              100K
                   MINEXTENTS           1
                   MAXEXTENTS         100
                   PCTINCREASE         10  );

Note that, although the columns have been rearranged and renamed, the physical table still reflects the logical relation, with a few exceptions. The foreign key constraints have been omitted. There are a number of ways to enforce the foreign key constraints, through column constraints, table constraints, or other means. The issue of enforcing integrity through table and column constraints is revisited briefly later in this chapter, and all options will be discussed in full detail in the chapter on enforcing database integrity.

The primary key constraint is an important consideration for a number of reasons. At this point, it is important to recognize that Oracle will automatically generate a unique index for this column. This should not be overlooked in capacity planning.

The tablespace Contact_Main does not exist yet, but scripts referencing the tablespace help determine its size. The STORAGE specification indicates that 1 megabyte will be allocated initially, that additional extents will start at 100 kilobytes, that there will be a minimum of 1 extent and a maximum of 100 extents, and that each extent will be 10 percent larger than the previous extent. Obviously, this specification will allow the table to grow to well over 2 megabytes, which is the planned maximum. Although storage specifications can be modified using ALTER TABLE, in the initial design phase and for capacity planning purposes, it is usually best to allow for the maximum estimated size or more. This is based on the assumption that it is better to overestimate storage requirements than to underestimate them.
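If the table later grows beyond the plan, the storage parameters other than INITIAL can be raised; a sketch of such a change, with illustrative values:

      ALTER TABLE Addresses
            STORAGE ( NEXT           200K
                      MAXEXTENTS      200 );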

After DDL scripts have been created for each of the tables, scripts can be written for the tablespaces that will house them. Assume that the sample application will store all application tables in the tablespace named Contact_Main. The data file(s) created with the tablespace should be sufficiently large to contain all tables at their full size (this can be calculated based on the INITIAL, NEXT, and PCTINCREASE parameters in the STORAGE clause of each table). The script to create Contact_Main might look like Listing 17.2.

      CREATE TABLESPACE Contact_Main
            DATAFILE '/oradata/tables/main1.dat' SIZE 10M REUSE
            DEFAULT STORAGE (  INITIAL            2K
                               NEXT               2K
                               MAXEXTENTS        100
                               PCTINCREASE         0  );

The DEFAULT STORAGE clause of CREATE TABLESPACE determines how space for tables will be allocated when tables are created without a STORAGE specification. Lookup tables are typically numerous and very small, so it may be desirable to create a default storage that will be sufficient to handle all lookup tables. The STORAGE clause can then be omitted from those CREATE TABLE scripts.
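For example, a small lookup table might be created without a STORAGE clause and simply inherit the tablespace defaults; the column names and lengths here are illustrative:

      CREATE TABLE Address_Types (
             Address_Type_ID    NUMBER(10)    PRIMARY KEY
            ,Type               VARCHAR2(20)  NOT NULL
      )
      TABLESPACE Contact_Main;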

Performance Considerations

When designing the physical database, performance is an important consideration. There are numerous factors related to the design that will affect the overall performance of the database. These factors include the data model itself, indexing, rollback and temporary segments, and the physical location of the data on the disks.


For more on performance tuning issues, see Chapter 15, "Performance Tuning and Optimizing."

A factor that can heavily impact overall performance stems from the logical model. The degree of normalization in the model often comes down to a trade-off between flexibility and performance. In the example of normalization presented in the section on the logical model, several relations were created that improved the flexibility of the model, as shown in Table 17.8.

Table 17.8. Relations added to the model for flexibility.

Individual-Company Relation: Individual ID (FK), Company ID (FK)
Companies: ID (PK), Company, Company Notes
Entity Type: ID (PK), Type

Separating company information from the Individuals relation added a considerable amount of flexibility. This allowed individuals to be related to more than one company, and it allowed addresses and phones to be related to either an individual or a company. Another nice feature of the normalized model is that it allows any number of phones and addresses to be related to an individual or a company as in Table 17.9.

Table 17.9. Address and phone relations in the normalized model.

Addresses: ID (PK), Entity Type (FK), Entity ID (FK), Address Line 1, Address Line 2, Address Line 3, City (FK), Zip Code (FK), Address Type (FK), Last Update User ID, Last Update Date/Time
Phone Numbers: ID (PK), Entity Type (FK), Entity ID (FK), Phone Number, Phone Type (FK), Last Update User ID, Last Update Date/Time
Address Type: ID (PK), Type
Phone Type: ID (PK), Type

A less useful feature of the normalized model is the separation of city and state as in Table 17.10.

Table 17.10. The Cities and States relations.

Cities: Zip Code (PK), City (CK)
States: Zip Code (PK), State

The end result is a very flexible (but possibly overly complex) data model. Assume, for example, that one of the primary uses of the database is to generate a listing of the names of contacts and their companies, addresses, and phone numbers. This is a fairly complex query and illustrates a potential flaw in the model: Although addresses and phones can be related to either companies or individuals, there is nothing in the model that allows phones and addresses to be related to an individual at a company. Assume that, as a workaround, a third entity type is created for an individual at a company, and a rule is imposed that the Individual ID is used for that entity wherever an entity's ID is required as a foreign key. The phone list can be generated under this scenario, but it requires joining nine tables. Individuals must be joined to Addresses, Phone Numbers, and the Individual-Company Relation, which in turn must be joined to Companies to get the company name, and Addresses must be joined to Cities and States. In addition, Phone Numbers and Addresses must be joined to Phone Types and Address Types to get their respective standardized type descriptions. Although joining nine tables is not a particularly difficult task, if the database contains millions of Individuals, the number of joins can have a very significant impact on performance. If this report is generated up-to-the-minute online, and the database has a high volume of transactions, the impact is further magnified.

Denormalization, the opposite of normalization, can be used to improve performance under these circumstances. By combining some of the relations, the number of joins can be reduced without sacrificing flexibility. In Table 17.11, the listed denormalizations may be appropriate.

Table 17.11. Denormalized address and phone relations.

Addresses: ID (PK), Individual ID, Company ID, Address Line 1, Address Line 2, Address Line 3, City, State, Zip Code, Address Type, Last Update User ID, Last Update Date/Time
Phone Numbers: ID (PK), Individual ID, Company ID, Phone Number, Phone Type, Last Update User ID, Last Update Date/Time
Individual-Company Relation: Individual ID (FK), Company ID (FK)
Companies: ID (PK), Company

With this model, only seven tables must be joined to generate the list, and no flexibility is lost. Note that the foreign key constraints must be removed from Individual ID and Company ID on the Addresses and Phone Numbers tables because one or the other might be NULL. Listing 17.3 demonstrates the SQL used to create the report after this denormalization.

       SELECT First_Name, Middle_Initial, Last_Name, Company_Name,
              F.Type, Address_Line1, Address_Line2,
              Address_Line3, City, State, Zip_Code, G.Type,
              Phone_Number
       FROM   Individuals A, Individual_Company_Relation B,
              Companies C, Addresses D, Phone_Numbers E,
              Address_Types F, Phone_Types G
       WHERE  A.ID = B.Individual_ID
              AND B.Company_ID = C.ID
              AND B.Individual_ID = D.Individual_ID
              AND B.Company_ID = D.Company_ID
              AND B.Individual_ID = E.Individual_ID
              AND B.Company_ID = E.Company_ID
              AND D.Address_Type = F.ID
              AND E.Phone_Type = G.ID
Aliasing table names with a single letter can save a lot of typing.

Additional denormalization could improve performance further, but probably at the cost of flexibility.

The previous example can also be used to illustrate the importance of indexes. Indexes can be created on single or multiple columns, and may or may not be unique. When creating an index on multiple columns, the order in which the columns are declared is particularly important, because Oracle treats the values of such an index as an aggregate. The column that will be used the most should be declared first in a multicolumn index. In the previous example, the Individual_Company_Relation table is a prime candidate for an index. If both columns are indexed in aggregate, the table itself should never need to be read. Individual_ID should be declared as the first column in the index because it is used for one more join. The DDL to create this table and its index might look like the script in Listing 17.4.

      CREATE TABLE Individual_Company_Relation (
             Individual_ID        NUMBER(10)    NOT NULL
            ,Company_ID           NUMBER(10)    NOT NULL
      )
      TABLESPACE Contact_Main
      STORAGE (INITIAL        50K
               NEXT           10K
               MAXEXTENTS      10  );

      CREATE INDEX Indiv_Co
            ON Individual_Company_Relation (Individual_ID, Company_ID)
      TABLESPACE Contact_Index
      STORAGE (INITIAL        50K
               NEXT           10K
               MAXEXTENTS      10);

Note that the MINEXTENTS parameter to the STORAGE clause was not used. The default value of 1 is acceptable in most circumstances.

The creation of indexes should be planned very carefully, because improper use of indexes can have a damaging effect on performance. Even where indexes improve the performance of SELECT statements, they have a negative impact on INSERTs and UPDATEs, because the indexes must be modified in addition to the tables.

The column attributes themselves play a role in performance as well. Wherever possible, integers should be used as keys because they can be compared faster than any other data type. Column and table constraints should be avoided because they must be checked whenever a value is inserted or updated. Although these constraints are often necessary, integrity should be enforced by other means when it is possible to do so safely.

Rollback segments also play an important role in the overall performance of the database. As the name would imply, Oracle uses rollback segments as temporary storage for data needed to reverse a transaction. This data must be stored until the transaction is committed. Rollback segments must be sufficiently large to store this data for all transactions occurring at a given time. If rollback segments are not large enough, transactions will fail.

To properly estimate the size of the rollback segments needed, the designer must know how many users will be submitting transactions, and the maximum size of the rows affected by a single transaction. In many large databases, transactions are initiated by batch processes used to load and update data from external sources, or to create summary tables. These batch processes often generate much larger transactions than the user community, and should be considered when planning rollback segments. A rollback segment, like other database objects, can be created with a script, as demonstrated in Listing 17.5.

      CREATE PUBLIC ROLLBACK SEGMENT contact_rbs1
            TABLESPACE contact_rb_segs
            STORAGE (INITIAL            100K
                     NEXT               100K
                     OPTIMAL            500K
                     MAXEXTENTS          100);

      ALTER ROLLBACK SEGMENT contact_rbs1 ONLINE;

When a rollback segment is created, it is not immediately available for use. The ALTER ROLLBACK SEGMENT command must be issued to bring the rollback segment ONLINE before it can be used.

The OPTIMAL parameter to the STORAGE clause indicates that once extents have been allocated, the rollback segment will not be deallocated below this value. This, in effect, sets the minimum size of the rollback segment after that threshold is reached.

Rollback segments are typically created in a separate tablespace. The size of the data files in this tablespace should be sufficient to hold the rollback segments at their maximum extents.
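The tablespace referenced in Listing 17.5 might be created as shown below (the file name and size are illustrative), and a large batch job can then be pointed at a specific rollback segment with SET TRANSACTION:

      CREATE TABLESPACE contact_rb_segs
            DATAFILE '/oradata/rbs/contact_rbs1.dat' SIZE 20M REUSE;

      -- Must be the first statement of the transaction
      SET TRANSACTION USE ROLLBACK SEGMENT contact_rbs1;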

Another performance consideration relates to the creation of temporary segments. Temporary segments are similar to rollback segments, except that they are used to store intermediate result sets rather than transaction information. When a SELECT statement produces a result set that is too large to be processed in memory, a temporary segment is used to store the results until the cursor is closed. Temporary segments may also be created by Oracle to hold intermediate results for complex joins, sorts, or unions. As with rollback segments, these temporary segments must be sufficiently large to store this data, or SELECT statements may fail.

A temporary tablespace must be assigned to users explicitly; if none is assigned, the SYSTEM tablespace is used by default. It is preferable to create a separate tablespace for these temporary segments, and assign it to users using the TEMPORARY TABLESPACE clause of the CREATE USER command. When designing temporary tablespaces, keep in mind any batch processes that may create large cursors. These, too, will require the use of temporary segments.
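A sketch of this approach, with hypothetical user names, file names, and sizes:

      CREATE TABLESPACE contact_temp
            DATAFILE '/oradata/temp/contact_temp1.dat' SIZE 20M REUSE;

      CREATE USER sales_rep1 IDENTIFIED BY secret
            DEFAULT   TABLESPACE Contact_Main
            TEMPORARY TABLESPACE contact_temp;

      -- Existing users can be switched with ALTER USER
      ALTER USER sales_mgr1 TEMPORARY TABLESPACE contact_temp;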

It may be preferable to create separate segments (both temporary and rollback) for different groups of users, based on the transactions and result sets generated by different groups of users.

Other performance considerations relate to the physical layout of files on disk. Proper use of multiple disks and controllers, clustering, and striping can improve performance greatly in certain situations.

In the example on the creation of indexes (see Listing 17.4), notice that the table and the index were created in separate tablespaces. The example assumes that the tablespaces were created on separate disks, using separate controllers. Keeping indexes on separate physical devices with separate controllers allows the index and the tables to be read almost simultaneously, and minimizes the movement of the read-write heads. In the sample SQL statement, this would allow the read-write head of one drive to continue reading the index while a separate controller reads the Addresses and Phone Numbers tables to find the corresponding values. If the index were on the same disk, either the whole index would have to be read into memory before the table, or the heads would have to move back and forth, reading part of the index and part of the table.
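The Contact_Index tablespace referenced in Listing 17.4 might be created on such a separate device; the path, size, and default storage values below are illustrative:

      CREATE TABLESPACE Contact_Index
            DATAFILE '/disk2/oradata/index1.dat' SIZE 5M REUSE
            DEFAULT STORAGE (  INITIAL            50K
                               NEXT               50K
                               MAXEXTENTS        100
                               PCTINCREASE         0  );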

The use of separate controllers and disks also applies to rollback and temporary segments. In an ideal configuration, tables, indexes, rollback segments, and temporary segments would all be on separate disks using separate controllers. This configuration would greatly improve overall performance, particularly for batch processes such as the creation of summary tables. In practice, however, this configuration is rarely possible. Regardless of the actual hardware configuration, the designer should carefully consider how these data files will be accessed. The design should attempt to minimize the movement of read/write heads for the most common or mission-critical database operations.

Clustering is another way to improve performance through physical layout on the disk. It is sometimes advantageous to create an indexed cluster for a group of tables that are frequently joined. The index used to join the tables in the cluster must be declared as part of the cluster. Tables created in the cluster must specify this index in the CLUSTER clause of the CREATE TABLE statement. When an indexed cluster is created, each value for the cluster index (or cluster key) is stored only once. The rows of the tables that contain the clustered key value are physically stored together, as if already joined. This method minimizes the movement of read-write heads when accessing these tables and conserves disk space by storing the key value only once.

In the contact manager example, assume that the users will typically be contacting individuals by telephone, and that they will be retrieving this information much more frequently than they will be updating it. Clustering the Individuals and Phone Numbers relations should be considered in this case. The DDL script in Listing 17.6 illustrates how this cluster might be created.

      CREATE CLUSTER Individual_Phone_Numbers
            (Individual_ID      NUMBER(10))
            SIZE    256
            STORAGE (INITIAL         1M
                     NEXT            1M
                     MAXEXTENTS     100
                     PCTINCREASE     10);

      CREATE TABLE Individuals (
             Individual_ID          NUMBER(10)    PRIMARY KEY
            ,Last_Name              VARCHAR2(30)  NOT NULL
            ,First_Name             VARCHAR2(20)  NOT NULL
            ,Middle_Initial         CHAR(1)
            ,Last_Updt_User         VARCHAR2(20)  NOT NULL
            ,Last_Updt_Timestamp    DATE          NOT NULL
      )
      CLUSTER Individual_Phone_Numbers(Individual_ID);

      CREATE TABLE Phone_Numbers (
             Phone_ID               NUMBER(10)    PRIMARY KEY
            ,Individual_ID          NUMBER(10)    NOT NULL
            ,Company_ID             NUMBER(10)    NOT NULL
            ,Phone_Number           NUMBER(10)    NOT NULL
            ,Phone_Type_ID          NUMBER(10)    NOT NULL
            ,Last_Updt_User         VARCHAR2(20)  NOT NULL
            ,Last_Updt_Timestamp    DATE          NOT NULL
      )
      CLUSTER Individual_Phone_Numbers(Individual_ID);

      CREATE INDEX Indiv_Phone ON CLUSTER Individual_Phone_Numbers;

The storage specification is not needed for the individual tables. Their rows will be stored in the cluster, and the storage specification for the cluster will be used by all objects that it will contain.

In Listing 17.6, the SIZE keyword indicates the size in bytes needed to store the rows corresponding to one key value. The size is always rounded up to the nearest block size. In the example, because the cluster key is stored only once, it will take only 245 bytes to store the rows for each cluster key. The example assumes that 256 bytes is the closest block size.

A second type of cluster, known as a hash key cluster, can also be utilized to improve performance. Rather than store rows based on a common indexed key, rows are stored together based on a common hash value, which is not physically stored in the database. The hash value is calculated at run time, using either a user-defined function or Oracle's internal hashing function. Although this reduces physical disk accesses, hash key clustering should be used only on small tables, or tables that will have a maximum size that is known at design time. This is because the number of hash keys is fixed at design time, and resolving collisions requires additional reads.
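A sketch of a hash cluster for one of the small lookup tables follows; the names and sizes are illustrative. HASHKEYS fixes the number of hash values, and Oracle's internal hash function is used because no HASH IS clause is given:

      CREATE CLUSTER Contact_Type_Cluster ( Contact_Type_ID NUMBER(10) )
            SIZE      64
            HASHKEYS 100;

      CREATE TABLE Contact_Types (
             Contact_Type_ID    NUMBER(10)    PRIMARY KEY
            ,Type               VARCHAR2(20)  NOT NULL
      )
      CLUSTER Contact_Type_Cluster (Contact_Type_ID);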

Clustering should not be overused because there can be a negative impact on performance when clustered tables are accessed separately. Insert, update, and delete operations on clustered tables will typically be slower as well. For these reasons, the designer should carefully consider how tables will be used before creating a cluster.

Striping is a technique that consists of spreading a large database object across multiple disks. Performance can be greatly improved by striping, particularly when large tables are accessed by full table scans. The striping of a particular table can be forced by creating a tablespace with multiple data files on separate disks, each of which is smaller than the table itself. To provide maximum control over how the table is striped, it should be the only object in the tablespace. Listing 17.7 provides one example of how striping can be accomplished in Oracle.

      CREATE TABLESPACE Individual_Stripes
            DATAFILE 'disk1/oradata/stripe1.dat' SIZE 100K REUSE,
                     'disk2/oradata/stripe2.dat' SIZE 100K REUSE
            DEFAULT STORAGE (  INITIAL           200K
                               NEXT              200K
                               MAXEXTENTS         100
                               PCTINCREASE          0  );

      CREATE TABLE Individuals (
             Individual_ID          NUMBER(10)    PRIMARY KEY
            ,Last_Name              VARCHAR2(30)  NOT NULL
            ,First_Name             VARCHAR2(20)  NOT NULL
            ,Middle_Initial         CHAR(1)
            ,Last_Updt_User         VARCHAR2(20)  NOT NULL
            ,Last_Updt_Timestamp    DATE          NOT NULL
      )
      TABLESPACE Individual_Stripes
            STORAGE (INITIAL             90K
                     NEXT                90K
                     MINEXTENTS            2
                     MAXEXTENTS          100
                     PCTINCREASE           0  );

Obviously, a third extent cannot be allocated for this table, making this approach to striping a high-maintenance proposition. The size of the table must be continually monitored, and new data files must be added to the tablespace when needed, using the ALTER TABLESPACE command with ADD DATAFILE. Although this approach requires additional maintenance, performance gains can be very significant, particularly if the disks have separate controllers. The designer should consider the trade-offs carefully before recommending that a table be striped.
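Following the pattern of Listing 17.7, a new stripe might be added when the table needs room to grow; the file name and size below are illustrative:

      ALTER TABLESPACE Individual_Stripes
            ADD DATAFILE 'disk3/oradata/stripe3.dat' SIZE 100K REUSE;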

Capacity Planning

Capacity planning is important in ensuring that adequate storage is available for future growth. The DDL scripts for each database object are invaluable in determining the overall storage required by the database. In fact, the process of capacity planning actually begins with the DDL scripts.

It starts with defining the column attributes. These column attributes, in turn, determine the size of each row in the table. The column attributes also determine the size of each row in indexes created on the columns. These attributes, combined with the estimated total number of rows (including provisions for future growth), are used in defining the storage clause for tables and indexes. For purposes of capacity planning, it should be assumed that all objects will reach their maximum extents.


The MAXEXTENTS parameter to the STORAGE clause has an upper limit that is determined by the operating system and file system environment in which the database resides. It is a bad practice to set MAXEXTENTS equal to the maximum allowable value. If the object grows more than expected, the NEXT parameter can be changed using ALTER object_type. If the maximum extents have already been allocated, however, this will have no effect. In this situation, the size of the object cannot be extended unless MAXEXTENTS can be increased.

The next step is creating DDL for tablespaces. The data files created by these scripts should be sufficiently large to hold all the objects that the tablespace will contain. When determining the size of the data files, it should be assumed that all objects within the tablespace will reach their maximum size, as defined by the STORAGE clause in the DDL script for each object.

The total size of the database can then be determined by simply adding the sizes of the data files. Never assume, however, that this estimate is accurate or complete; there are a number of additional considerations to be made.
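Once the database exists, these paper estimates can be checked against the data dictionary; a sketch of two such queries (they require DBA privileges):

      -- Space allocated to each tablespace, in megabytes
      SELECT   tablespace_name, SUM(bytes)/1024/1024 AS allocated_mb
      FROM     dba_data_files
      GROUP BY tablespace_name;

      -- Space still free within each tablespace, in megabytes
      SELECT   tablespace_name, SUM(bytes)/1024/1024 AS free_mb
      FROM     dba_free_space
      GROUP BY tablespace_name;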


Remember to consider the sizes of log files. These are defined when the database is created.

In capacity planning, the designer must accommodate for unexpected growth. As a general rule of thumb, at least 25 percent (preferably 50 percent) of each disk should be free after the initial installation. This will allow additional data files to be created wherever necessary if tables grow larger than expected. It is common to experience unanticipated growth, and it is even more common to underestimate the requirements for temporary and rollback segments, or to identify the need for additional indexes after the initial design is complete.

The initial hardware configuration and data file layout should accommodate these possibilities. Capacity limitations can be crippling in the event that additional rollback segments or temporary segments cannot be created when needed.

For this reason, the importance of capacity planning should not be underestimated.

Summary

Designing a database is a methodical process. It begins with defining the requirements of the application. Entities and attributes are grouped into a logical data model that meets the business needs, and the logical model can be represented graphically as an Entity-Relationship diagram. The entities and attributes are then translated into a physical model, in which data types, column attributes, and constraints are defined. The normalization process is used to eliminate redundancy and unwanted dependencies, whereas denormalization selectively breaks the normalization rules to gain performance. A well-designed database is important both for future expansion and for ease of application programming.
