Monday, February 16, 2009

ICAS4113B Identify and resolve common database performance problems

Activity 1 Research the Database Performance Problems topic

There are five common PHP database problems, and for each of them there is a straightforward solution.


Problem 1: Using MySQL directly

One common problem is older PHP code using the mysql_ functions to access the database directly. Listing 1 shows how to access the database directly.

Listing 1. Access/get.php

Notice the use of the mysql_connect function to access the database. Also notice the query, in which we use string concatenation to add the $name parameter to the query.
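The original listing is not reproduced in this post, so here is a minimal sketch of what such code might look like; the database name, credentials, and the users table are assumptions, not taken from the original:

<?php
// Hypothetical reconstruction: connection details and the users
// table/columns are assumed for illustration only.
function get_user( $name )
{
    $db = mysql_connect( 'localhost', 'root', 'password' );
    mysql_select_db( 'users_db', $db );

    // String concatenation like this leaves the query open to SQL injection
    $res = mysql_query( "SELECT id FROM users WHERE login='".$name."'", $db );
    $row = mysql_fetch_array( $res );
    return $row[0];
}
?>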

This technique has two good alternatives: the PEAR DB module and the PHP Data Objects (PDO) classes. Both provide abstraction from the choice of a particular database. Therefore, your code can run without too much adjustment on IBM® DB2®, MySQL, PostgreSQL, or any other database you want to connect to.

The other value in using the abstraction layers of the PEAR DB module and PDO is that you can use the ? operator in your SQL statements. Doing so makes the SQL easier to maintain and secures your application from SQL injection attacks.

The alternative code using PEAR DB is shown below.

Listing 2. Access/get_good.php

Notice that all direct mentions of MySQL are gone except for the database connection string in $dsn. In addition, we use the $name variable in the SQL through the ? operator. Then, the data for the query is sent in through an array at the end of the query() method.
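Again the listing itself is missing here; a sketch of the PEAR DB version, with the same assumed database and table names as above, might look like this:

<?php
require_once 'DB.php';

// Hypothetical reconstruction; the DSN and table name are assumptions.
function get_user( $name )
{
    $dsn = 'mysql://root:password@localhost/users_db';
    $db =& DB::connect( $dsn );
    if ( PEAR::isError( $db ) ) { die( $db->getMessage() ); }

    // The ? placeholder lets PEAR DB quote $name safely
    $res = $db->query( 'SELECT id FROM users WHERE login = ?',
        array( $name ) );
    $row = $res->fetchRow();
    return $row[0];
}
?>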

Problem 2: Not using auto-increment functionality

Like most modern databases, MySQL can create auto-incrementing unique identifiers on a per-record basis. Despite that, we still see code that first runs a SELECT statement to find the maximum id, adds one to that id, and then inserts a new record with it.

Listing 3 shows a sample bad schema.

Listing 3. Badid.sql
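The schema is not reproduced in this post; a hypothetical reconstruction, with assumed column names, might look like this:

DROP TABLE IF EXISTS users;
CREATE TABLE users (
    id MEDIUMINT,
    login TEXT,
    password TEXT
);

INSERT INTO users VALUES ( 1, 'jack', 'pass' );
INSERT INTO users VALUES ( 2, 'joan', 'pass' );
INSERT INTO users VALUES ( 2, 'jane', 'pass' );  -- nothing stops a duplicate id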

The id field here is specified simply as an integer. So, although it should be unique, we can add any value we like, as shown in the INSERT statements that follow the CREATE statement. Listing 4 shows the PHP code that adds users into this type of schema.

Listing 4. Add_user.php

The code in add_user.php first performs a query to find the maximum value of the id. Then the file runs an INSERT statement with the id value plus one. This code could fail in race conditions on servers with a heavy load. Plus, it's just inefficient.
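A sketch of the race-prone approach described above (function and table names are assumptions):

<?php
require_once 'DB.php';

function add_user( $name, $pass )
{
    $db =& DB::connect( 'mysql://root:password@localhost/users_db' );

    // Step 1: find the current maximum id
    $res = $db->query( 'SELECT max(id) FROM users' );
    $row = $res->fetchRow();
    $id = $row[0] + 1;

    // Step 2: insert using that id -- another client may have grabbed
    // the same id between these two queries
    $db->query( 'INSERT INTO users VALUES ( ?, ?, ? )',
        array( $id, $name, $pass ) );
    return $id;
}
?>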

So what's the alternative? Use the auto-increment feature in MySQL to create unique IDs for each insertion automatically. The updated schema is shown below.

Listing 5. Goodid.sql
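A reconstruction of the updated schema, matching the flags described below:

DROP TABLE IF EXISTS users;
CREATE TABLE users (
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    login TEXT NOT NULL,
    password TEXT NOT NULL,
    PRIMARY KEY ( id )
);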

We added the NOT NULL flag to indicate that the fields must not be null. We also added the AUTO_INCREMENT flag to indicate that the field is auto-incrementing, as well as the PRIMARY KEY flag to indicate which field is an id. These changes speed things up a bit. Listing 6 shows the updated PHP code that inserts users into the table.

Listing 6. Add_user_good.php

Instead of getting the maximum id value, I now just use the INSERT statement to insert the data, then use a SELECT statement to retrieve the id of the last inserted record. This code is a lot simpler and more efficient than the original version and its related schema.
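A sketch of that simpler version (same assumed table as before):

<?php
require_once 'DB.php';

function add_user( $name, $pass )
{
    $db =& DB::connect( 'mysql://root:password@localhost/users_db' );

    // Let MySQL assign the id, then ask for it afterwards
    $db->query( 'INSERT INTO users ( login, password ) VALUES ( ?, ? )',
        array( $name, $pass ) );
    $res = $db->query( 'SELECT last_insert_id()' );
    $row = $res->fetchRow();
    return $row[0];
}
?>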

An alternative to using MySQL's auto-increment functionality is to use the nextId() method in the PEAR DB system. In the case of MySQL, this creates a new sequence table and manages that using an elaborate locking mechanism. The advantage of using this method is that it will work across different database systems.
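As a rough illustration (the sequence name is an assumption; PEAR DB creates and manages a users_seq table behind the scenes):

<?php
require_once 'DB.php';

$db =& DB::connect( 'mysql://root:password@localhost/users_db' );

// nextId() returns the next value from the named sequence,
// portably across database backends
$id = $db->nextId( 'users' );
$db->query( 'INSERT INTO users VALUES ( ?, ?, ? )',
    array( $id, 'jack', 'pass' ) );
?>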

Either way, you should use a system that manages incrementing unique IDs for you and not rely on a system where you query first, then increment the value yourself, and add the record. The latter approach is susceptible to race conditions on high-volume sites.

Problem 3: Using multiple databases

Once in a while, we see an application in which each table is in a separate database. There are reasons for doing that in extraordinarily large databases, but for an average application, you don't need this level of segmentation. In addition, even though it's possible to perform relational queries across multiple databases, I strongly recommend against it. The syntax is more complex. Backup and restore are not easily managed. The syntax may or may not work across different database engines. And it's difficult to follow the relational structure when the tables are split over multiple databases.

So, what would multiple databases look like? To begin, you need some data. Listing 7 shows this data divided into four files.

Listing 7. The database files
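The four files are not reproduced in this post; a hypothetical reconstruction (file names, tables, and sample rows are all assumptions) might look like this:

-- users.sql
CREATE TABLE users (
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    login TEXT NOT NULL,
    PRIMARY KEY ( id )
);

-- files.sql
CREATE TABLE files (
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    user_id MEDIUMINT NOT NULL,
    name TEXT NOT NULL,
    path TEXT NOT NULL,
    PRIMARY KEY ( id )
);

-- load_users.sql
INSERT INTO users ( login ) VALUES ( 'jack' );

-- load_files.sql
INSERT INTO files ( user_id, name, path ) VALUES ( 1, 'test.jpg', 'media/test.jpg' );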
In the multiple-database version of these files, you would load the files SQL statements into one database, then load the users SQL statements into another database. The PHP code to query the database for the files associated with a particular user is shown below.

Listing 8. Getfiles.php

The get_user function connects to the database that contains the users table and retrieves the ID for a given user. The get_files function connects to the database that contains the files table and retrieves the file rows associated with the given user.
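A sketch of that two-database code (the two DSNs and function names are assumptions):

<?php
require_once 'DB.php';

function get_user( $name )
{
    // One connection for the database holding the users table
    $db =& DB::connect( 'mysql://root:password@localhost/users_db' );
    $res = $db->query( 'SELECT id FROM users WHERE login = ?',
        array( $name ) );
    $row = $res->fetchRow();
    return $row[0];
}

function get_files( $user_id )
{
    // A second connection for the database holding the files table
    $db =& DB::connect( 'mysql://root:password@localhost/files_db' );
    return $db->getAll( 'SELECT * FROM files WHERE user_id = ?',
        array( $user_id ) );
}

$files = get_files( get_user( 'jack' ) );
?>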

A better way to do all this is to load the data into one database, then perform a query, such as that shown below.

Listing 9. Getfiles_good.php

This code is not only shorter but it's also easier to understand and more efficient. Instead of performing two queries, we're performing one.
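A sketch of the single-database, single-query version (database name assumed):

<?php
require_once 'DB.php';

function get_files( $name )
{
    $db =& DB::connect( 'mysql://root:password@localhost/my_db' );
    // One join replaces the two separate queries above
    return $db->getAll(
        'SELECT files.* FROM users, files '.
        'WHERE users.login = ? AND users.id = files.user_id',
        array( $name ) );
}

$files = get_files( 'jack' );
?>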

While this problem sounds a bit far-fetched, we've seen it in practice enough times to know that all tables should be in the same database unless there's a pressing reason otherwise.

Problem 4: Not using relations

Relational databases aren't like programming languages. They don't have array types. Instead, they use relations among tables to create a one-to-many structure between objects, which has the same effect as an array. One problem I've seen with applications is when engineers attempt to use a database as though it were a programming language, creating arrays by using text strings with comma-separated identifiers. Look at the schema below.

Listing 10. Bad.sql

One user in the system can have multiple files. In a programming language, you would use an array to represent the files associated with a user. In this example, the programmer chose to create a files field containing a list of file ids separated by commas. To get a list of all the files for a particular user, the programmer must first read the row from the users table, then parse the files text and run an individual SELECT statement for each file. This code is shown in Listing 11.
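A hypothetical reconstruction of such a schema (column names assumed):

DROP TABLE IF EXISTS files;
CREATE TABLE files (
    id MEDIUMINT,
    name TEXT,
    path TEXT
);

DROP TABLE IF EXISTS users;
CREATE TABLE users (
    id MEDIUMINT,
    login TEXT,
    files TEXT          -- e.g. '1,2': comma-separated file ids
);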

Listing 11. Get.php

This technique is slow, difficult to maintain, and doesn't make good use of the database. The only solution is to re-architect the schema to turn it back into a traditional relational form, as shown in Listing 12.
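First, a sketch of what the slow, string-parsing code might look like (names assumed):

<?php
require_once 'DB.php';

function get_files( $name )
{
    $db =& DB::connect( 'mysql://root:password@localhost/bad_db' );

    $res = $db->query( 'SELECT files FROM users WHERE login = ?',
        array( $name ) );
    $row = $res->fetchRow();

    // Parse the comma-separated ids, then run one query per file
    $rows = array();
    foreach ( explode( ',', $row[0] ) as $file_id )
    {
        $res = $db->query( 'SELECT * FROM files WHERE id = ?',
            array( $file_id ) );
        $rows[] = $res->fetchRow();
    }
    return $rows;
}
?>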

Listing 12. Good.sql

Here, each file is related to the user through the user_id column in the files table. This probably seems backwards to anyone looking at it as an array. Certainly, arrays don't reference their containing objects -- in fact, just the opposite. But in a relational database, this is how things work, and it's why queries are so much faster and easier. Listing 13 shows the corresponding PHP code.
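A reconstruction of the relational form (column names assumed):

DROP TABLE IF EXISTS users;
CREATE TABLE users (
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    login TEXT NOT NULL,
    PRIMARY KEY ( id )
);

DROP TABLE IF EXISTS files;
CREATE TABLE files (
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    user_id MEDIUMINT NOT NULL,   -- each file points back to its user
    name TEXT NOT NULL,
    path TEXT NOT NULL,
    PRIMARY KEY ( id )
);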

Listing 13. Get_good.php
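A sketch of the corresponding code, a single join query (names assumed):

<?php
require_once 'DB.php';

function get_files( $name )
{
    $db =& DB::connect( 'mysql://root:password@localhost/good_db' );
    // One query retrieves every file row for the user
    return $db->getAll(
        'SELECT files.* FROM users, files '.
        'WHERE users.login = ? AND users.id = files.user_id',
        array( $name ) );
}
?>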
Here, we make one query to the database to get all the rows. The code isn't complex, and it uses the database as it was intended.

Problem 5: The n+1 pattern

I can't tell you how many times we've seen large applications in which the code first retrieves a list of entities -- say, customers -- then comes back and retrieves them one by one to get the details for each entity. We call it the n+1 pattern because that's how many queries will be performed -- one query to retrieve the list of all the entities, then one query for each of the n entities. This isn't a problem when n=10, but what about when n=100 or n=1000? Then the inefficiency really kicks in. Listing 14 shows an example of such a schema.

Listing 14. Schema.sql
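A hypothetical reconstruction of the schema, inferred from the authors-and-books discussion below:

DROP TABLE IF EXISTS authors;
CREATE TABLE authors (
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    name TEXT NOT NULL,
    PRIMARY KEY ( id )
);

DROP TABLE IF EXISTS books;
CREATE TABLE books (
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    author_id MEDIUMINT NOT NULL,
    name TEXT NOT NULL,
    PRIMARY KEY ( id )
);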
This schema is solid. There's nothing wrong here. The problem is in the code that accesses the database to find all the books for a given author, as shown below.

Listing 15. Get.php
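A sketch of the n+1 code described below; the function names are assumptions:

<?php
require_once 'DB.php';

$db =& DB::connect( 'mysql://root:password@localhost/books_db' );

function get_author_id( $db, $name )
{
    $res = $db->query( 'SELECT id FROM authors WHERE name = ?',
        array( $name ) );
    $row = $res->fetchRow();
    return $row[0];
}

function get_books( $db, $author_id )
{
    $res = $db->query( 'SELECT id FROM books WHERE author_id = ?',
        array( $author_id ) );
    $ids = array();
    while ( $row = $res->fetchRow() ) { $ids[] = $row[0]; }
    return $ids;
}

function get_book( $db, $book_id )
{
    $res = $db->query( 'SELECT * FROM books WHERE id = ?',
        array( $book_id ) );
    return $res->fetchRow();
}

// 1 query for the author id + 1 for the book list + 1 per book:
// with 3 books, that is 5 queries
$author_id = get_author_id( $db, 'Jack Herrington' );
foreach ( get_books( $db, $author_id ) as $book_id )
{
    $book = get_book( $db, $book_id );
    var_dump( $book );
}
?>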

If you look at the code at the bottom, you are likely to think to yourself, "Hey, this is really clean." First, get the author id, then get a list of the books, then get information about each book. Sure, it's clean -- but is it efficient? No. Look at how many queries we had to perform to retrieve only the books by Jack Herrington. One to get an id, another to get the list of books, then one for each book. Five queries for three books!

The solution is to have one function that performs one bulk query, as shown below.

Listing 16. Get_good.php

Now retrieving the list requires a fast, single query. It means that I will likely have to have several of these types of methods with different parameters, but there really is no choice. If you want a PHP application that scales, you must make efficient use of the database, and that means smarter queries.
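A sketch of the bulk-query version (function name assumed):

<?php
require_once 'DB.php';

function get_books_by_author( $db, $name )
{
    // One join retrieves everything in a single round trip
    return $db->getAll(
        'SELECT books.* FROM authors, books '.
        'WHERE authors.name = ? AND authors.id = books.author_id',
        array( $name ) );
}

$db =& DB::connect( 'mysql://root:password@localhost/books_db' );
var_dump( get_books_by_author( $db, 'Jack Herrington' ) );
?>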

The problem with this example is that it's a bit too clear-cut. Typically, these types of n+1 or n*n problems are much more subtle. And they only appear when the database administrator runs a query profiler on your system when it has performance problems.

Activity 2 For database administrators, T-SQL is the place to start to learn the specific ‘flavour’ of SQL that Microsoft uses.

1. A good place to start is with SQL Server 2005 Books Online.

Learning to program in T-SQL is largely a matter of learning the slightly different ways T-SQL expresses statements compared with other versions of SQL. The main thing you need to do is learn the different syntax rules for T-SQL, along with any other differences in the development environment and the SQL Server 2005 editor (SSMS).

To begin, take a look at this reference via Help in SSMS.

Click on Help in SSMS to get to Microsoft SQL Server 2005 – Microsoft Document Explorer.

Click on the Plus sign to open ‘SQL Server 2005 Books Online’

Click on the Plus sign to open ‘SQL Server Language Reference’

Click on ‘Transact-SQL Reference’ and begin reading

Print off ‘Transact-SQL Syntax Conventions’ and use it as a reference every time you look at ‘SQL Server 2005 Books Online’; these are the conventions used to display the SQL code and rules.

2. The SQL Server Management Studio design can be customized to your personal preference and work requirements.

Open SSMS from the Microsoft SQL Server 2005 shortcut from the Start menu

The following are best practices for using the Management Studio workspace efficiently:

* Close windows you do not expect to use right away.

* If the tool you want is not visible, select it from the View menu.

* Use Auto Hide to provide more room in the environment layout.

3. Do a search on “Logical Database Design for Performance” in SQL Server 2005 Books Online. (Look for the topic ‘Normalization’.)

What are some of the topics and what do they relate to? Do they talk about the performance problems you saw last week when you searched the Internet?

It may be a good idea to print off the topic on Normalization. Any others?

Considerations in Design for Performance:

Are there large tables? Look for ways to increase the number of users that can access them at the same time.

Use an additional column for aggregated or summary information, say for reporting.

Databases can be over-normalised; denormalising the database slightly to simplify complex processes can improve performance.

Activity 4 Managing indexes (Advanced)

1. Partition & Allocation unit example

The following example returns partition and allocation unit data for two tables: DatabaseLog, a heap with LOB (large object) data and no nonclustered indexes, and Currency, a clustered table without LOB data and with one nonclustered index. Both tables have a single partition.
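The query itself is not reproduced in this post; a sketch along the lines of the Books Online example, assuming the AdventureWorks sample database, might look like this:

USE AdventureWorks;
GO
-- Join partitions to their allocation units to see how each
-- table's pages are organized
SELECT o.name AS table_name,
       p.index_id,
       i.name AS index_name,
       au.type_desc AS allocation_type,
       au.data_pages,
       p.partition_number
FROM sys.allocation_units AS au
    JOIN sys.partitions AS p ON au.container_id = p.partition_id
    JOIN sys.objects AS o ON p.object_id = o.object_id
    JOIN sys.indexes AS i ON p.index_id = i.index_id
        AND i.object_id = p.object_id
WHERE o.name = N'DatabaseLog' OR o.name = N'Currency'
ORDER BY o.name, p.index_id;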

In the result set, notice that the DatabaseLog table uses all three allocation unit types, because it contains both data and text/image page types. The Currency table does not have LOB data, but does have the allocation unit required to manage data pages. If the Currency table is later modified to include a LOB data type column, a LOB_DATA allocation unit is created to manage that data.

2. Creating a simple nonclustered index

The following example creates a nonclustered index on the VendorID column of the Purchasing.ProductVendor table.
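A sketch of such a statement, assuming the AdventureWorks sample database; the index name is an assumption, not from the original example:

USE AdventureWorks;
GO
-- Index name IX_ProductVendor_VendorID is chosen for illustration
CREATE NONCLUSTERED INDEX IX_ProductVendor_VendorID
    ON Purchasing.ProductVendor (VendorID);
GO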
