Optimizing hierarchical SQL queries

in #steempress5 years ago


Overview


 

Every enterprise system has hierarchical data. The reporting engine of these systems generates reports displaying hierarchical tree structures. These kinds of reports are always a performance bottleneck for these systems.

This paper provides you an innovative way of extracting the hierarchical data from the database and thereby paves the way for optimizing SQL queries. In this paper Oracle SQL is used to demonstrate the new approach, however the new approach can be used in any other DBMS. This paper assumes that you have some basic knowledge of hierarchical SQL queries especially Oracle CONNECT BY queries. You can go thru this link https://docs.oracle.com/cd/B19306_01/server.102/b14200/queries003.htm to get an idea of hierarchical queries.

 

Hierarchical data example


 

The database of enterprise systems is mostly in normalized form, BCNF or 3NF hence the information is scattered among various database tables. In the Bill of Material (“BOM”) system the data is stored in tables such as Parts, Line of assembly, Line of usages etc. Line of assembly represents the hierarchical relationship between the parts.

 

Part Table:

 

Part table stores part information such as cost, quantity and mass etc. A part can have multiple versions.

No alt text provided for this image

 

Line of assembly table:

 

Line of assembly represents the hierarchical relationship between the parts. For example, in our case part piston is linked with part engine and the engine is a component of the Body.

 

No alt text provided for this image

 

Conventional approach of accessing hierarchical data:

 

Part expansion which is called as BOM expansion in bill of material world is a very common use case of bill of material management system. In part expansion a part structure is recursively traversed to expand a part to its sub-parts and their sub-parts and so on.

 

In our case if a part expansion for part “Body” is done then following tree structure will be returned.

 

Report for: Expand part “Body” having version = 1

 

Part: Body Version: 1

Part: Engine Version: 1

Part: Piston Version: 1

Part: Axel Version: 1

Part: Wheels Version: 1

The conventional approach of getting the above structure from Oracle database is to use a hierarchical ( CONNECT – BY ) query as shown below.

 

Conventional Query:

 

No alt text provided for this image
However, these kinds of queries have very poor performance and on database with millions of records these queries get chocked.

 

Conventional approach block diagram:

 

No alt text provided for this image
Any kind of hierarchical query works on the final cartesian product of the tables used in the query. In our case, it is a cartesian product between LineofAssemblyTable and PartTable (Part table is joined twice). Depending on the join criteria a cartesian product can be exponentially bigger than the main table.

 

A hierarchical query fetches records by evaluating the relations among all records qualified by the cartesian join. Hence, a huge cartesian product increases the number of relations exponentially. For example, if there are n records in the LineofAssemblyTable and each record is related to m other records then there will be total n * m relations among these records. Now, if each record in LineofAssemblyTable has three matching records in the PartTable then the size of the cartesian product will be n * 3 and total number of relations in the cartesian product will be (n * m) ** k (where k =3). The number of relations grow exponentially with the cartesian product. Hence, on huge tables these kind of hierarchical queries shows very poor performance.

 

Solution:


The solution is to bring down the size of the cartesian product and to bring down the total number of relations. This can be achieved by splitting the conventional process into two steps, traversing and filtering as shown below.

Step1: In the first step which is the traversing step the LineofAssemblyTable structure is traversed hierarchically without joining with PartTable and the traversed records are stored in a temporary table. In this step we are going to extract only those records from LineofAssemblyTable which belong to the input assembly structure i.e. “Body” in our case.

 

Step1 Query:

No alt text provided for this image
In this case we are avoiding the power of k ( k = number of matching records from the PartTable ) in the total number of relations. We are only traversing n * m relations and extracting some records ( say j ) from them. Here j << n ( much smaller than n ) as j consists of only those records which are part of the input assembly part.

 

Step2: In the second step which is the filtering step the temporary table is joined with LineofAssemblyTable and PartTable and the CONNECT BY SQL is applied again on the resultant table join. The CONNECT BY SQL query is needed in the second step to maintain the hierarchy of the traversed structure. Hence, if a record gets dropped because of Part filtering criteria then its successors will also be dropped, and the hierarchy of the structure is maintained.

 

As the temporary table contains only matching records from the given input assembly structure, the cartesian product of the resultant join is much smaller than that of the cartesian product of the join in the conventional approach. Total number of relations in the cartesian product in step2 will be ( ( j * m ) ** k ), which is much smaller than ( ( n * m ) ** k ) as ( j << n ).

 

Following is the final new improved query formed after combining step1 and step2.

 

New improved query:

No alt text provided for this image
New approach block diagram:

 

No alt text provided for this image
Why does it work?

 

In the new approach the hierarchical traversing is divided into two steps, traversing and filtering. In each step very less number of records and relations are processed by the CONNECT BY SQL query as compare to the conventional approach. Hence, the query takes less time to execute. In the real-life production system, we have seen that a conventional CONNECT BY query used to take 900 seconds, however after modifying it using the new improved approach the same query took only 1.5 seconds.

 

Posted from my blog with SteemPress : https://www.golibrary.co/optimizing-hierarchical-sql-queries/
Sort:  
 5 years ago  Reveal Comment