About Me

My photo
Mumbai, Maharastra, India
He has more than 7.6 years of experience in the software development. He has spent most of the times in web/desktop application development. He has sound knowledge in various database concepts. You can reach him at viki.keshari@gmail.com https://www.linkedin.com/in/vikrammahapatra/ https://twitter.com/VikramMahapatra http://www.facebook.com/viki.keshari

Search This Blog

Showing posts with label Cluster Index. Show all posts
Showing posts with label Cluster Index. Show all posts

Monday, February 20, 2017

Heap Scan (Table Scan) Vs Clustered Index Scan

Let’s try out which scan would be more effective with below scenario
   1) When selecting all columns without filter and order by
   2)  When selecting all columns with order by.

Before that let’s understand what HEAP and CI are

Heap: Heap is table without Cluster Index, it has one row in sys.partition with index=0, by default a heap has a single partition. SQL Server uses IAM (Internal Allocation Map) to move through the leaf. The data pages and the rows within them are not arranged in any order and are not linked. The information with data page connection are recorded in IAM pages.

Clustered Index:  Clustered Index are organized in B-Tree structure with top most node called Root and bottom most are leaf nodes, in between top and bottom are intermediates nodes. The Leaf node consists of Index Key column value and reference to data page of underlying table. The pages in each level are linked in doubly linked list pattern. The pages in the data chain and the rows in them are ordered on the value of the clustered index key. All inserts are made at the point where the key value in the inserted row fits in the ordering sequence among existing rows.

Now let’s create a table and insert around 4 Million records in it.

CREATE TABLE HeapVsClusteredScanTable
(
        ColNum BIGINT NOT NULL,
        ColBigNum BIGINT NOT NULL,
        CharacterColumn CHAR(50)
)
GO
INSERT INTO HeapVsClusteredScanTable
(
        ColNum, ColBigNum, CharacterColumn
)
SELECT
        ColNum,
        ColNum + 5000000,
        LEFT(REPLICATE((CAST(ColNum as VARCHAR(50))),50),50)
FROM
(
        SELECT
                ColNum = row_number() over(order by newid() asc)
        FROM master..spt_values a
        CROSS APPLY master..spt_values b
        WHERE a.type = 'P' AND a.number <= 2000 AND a.number > 0 AND
        b.type = 'P' AND b.number <= 2000 AND b.number > 0
) mydata

We have not created any index on this table, so it is a heap. Lets query index stat. Below it shows total pagecount  37384 and there is only one page level.

select page_count, index_depth, page_level = index_level, page_count, record_count
from sys.dm_db_index_physical_stats(db_id(),object_id('HeapVsClusteredScanTable'),null,null,'DETAILED');
GO
page_count           index_depth page_level page_count           record_count
-------------------- ----------- ---------- -------------------- --------------------
37384                1           0          37384                4000000

(1 row(s) affected)

Now lets query page allocation

select allocated_page_page_id, next_page_page_id, previous_page_page_id
from sys.dm_db_data base_page_allocations(db_id(),object_id('HeapVsClusteredScanTable'),null,null,'DETAILED')
where page_type_desc is not null and page_type_desc = 'DATA_PAGE'

allocated_page_page_id next_page_page_id previous_page_page_id
---------------------- ----------------- ---------------------
400                    NULL              NULL
401                    NULL              NULL
402                    NULL              NULL
403                    NULL              NULL
404                    NULL              NULL
405                    NULL              NULL
406                    NULL              NULL

Above result tells in case of Heap there no information stored for next and previous page.

Now lets see the IO measure, CPU time and Elapsed time when querying Heap with and without ORDER BY

Without Order By

SET STATISTICS IO on
set statistics time on
go
select * from HeapVsClusteredScanTable
(4000000 row(s) affected)
Table 'HeapVsClusteredScanTable'. Scan count 1, logical reads 37384, physical reads 0, read-ahead reads 4, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 6844 ms,  elapsed time = 93444 ms.

With Order By

SET STATISTICS IO on
set statistics time on
go
select * from HeapVsClusteredScanTable order  by ColNum
(4000000 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 36373, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'HeapVsClusteredScanTable'. Scan count 5, logical reads 37384, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 21109 ms,  elapsed time = 206015 ms.

Above result shows that with “Order By” Claus the execution time and read-ahead read increate significantly.

Introduce Clustered Index: Now Let’s create a clustered index on table

CREATE CLUSTERED INDEX cix_HeapVsClusteredScanTable
ON HeapVsClusteredScanTable(ColNum)
WITH(MAXDOP=1)
Lets query index physical stat

select page_count, index_depth, page_level = index_level, record_count
from sys.dm_db_index_physical_stats(db_id(),object_id('HeapVsClusteredScanTable'),null,null,'DETAILED');
GO
page_count           index_depth page_level record_count
-------------------- ----------- ---------- --------------------
37384                3           0          4000000
79                   3           1          37384
1                    3           2          79

(3 row(s) affected)

Above output shows that the index has 3 level, the leaf level consist of 37384 pages whereas intermediate level consist of 79 pages and root level consist of 1 page.

Lets now query page allocation dmv

select allocated_page_page_id, next_page_page_id, previous_page_page_id, page_level, page_type_desc
from sys.dm_db_database_page_allocations(db_id(),object_id('HeapVsClusteredScanTable'),null,null,'DETAILED')
where page_type_desc is not null and page_type_desc IN( 'DATA_PAGE', 'INDEX_PAGE')
ORDER BY CASE WHEN page_type_desc = 'INDEX_PAGE' THEN 0 ELSE allocated_page_page_id END ASC

allocated_page_page_id next_page_page_id previous_page_page_id page_level page_type_desc
---------------------- ----------------- --------------------- ---------- ---------------
61128                  61129             45911                 1          INDEX_PAGE
61129                  61130             61128                 1          INDEX_PAGE
61130                  61131             61129                 1          INDEX_PAGE
61131                  61132             61130                 1          INDEX_PAGE
61132                  61133             61131                 1          INDEX_PAGE
61133                  61134             61132                 1          INDEX_PAGE

Here the above output shows that each page in clustered index has pointer to previous and next page. (Doubly linked list pattern)

Now lets see the IO measure, CPU time and Elapsed time when querying Clustered Indexed table with and without ORDER BY

Without Order By

SET STATISTICS IO on
set statistics time on
go
select * from HeapVsClusteredScanTable

(4000000 row(s) affected)

Table 'NumbersTable'. Scan count 1, logical reads 37465, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 6672 ms,  elapsed time = 82045 ms.

With Order By

SET STATISTICS IO on
set statistics time on
go
select * from HeapVsClusteredScanTable order by ColNum

(4000000 row(s) affected)

Table 'NumbersTable'. Scan count 1, logical reads 37465, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 SQL Server Execution Times:
   CPU time = 6657 ms,  elapsed time = 86031 ms.

Conclusion: Here we can see, with Cluster Index on table, the Order By query make a significant performance compare to Heap table with order by Clause. However general select statement without order by doesn’t make big difference in term of elapsed and cpu time with Clustered table.

Further we can conclude that you can keep your table in heap if it is very.


 Enjy coding…SQL J


Post Reference: Vikram Aristocratic Elfin Share

Friday, February 13, 2015

Myth, Group by always result in sorting the table on Group by Column


There are lot threads on various forum discussing automatic sorting behaviour of Group By clause, lets put extra mile to prove this myth incorrect.

So here is my query, I am running this dummy query on AdventureWorks2012 database

select  pc.Name, sum( psc.ProductSubcategoryID) as 'Sum of Sub Product Id' from Production.ProductCategory pc
inner join Production.ProductSubcategory psc
on pc.ProductCategoryID = psc.ProductCategoryID
group by pc.Name

Name                                               Sum of Sub Product Id
-------------------------------------------------- ---------------------
Accessories                                        378
Bikes                                              6
Clothing                                           172
Components                                         147

Let’s see from the prospective of execution plan,



From execution plan we can see the Group By physical operator is replaced by Stream Aggregate logical operator. And since Stream Operator always want its input to be in sorted order, we can see an existence of Sort operator in the execution plan which is sorting the ProductCategory data on ProductCategory.Name wise. And this is the reason why we get our result in sorted order when we use Group by clause our query.

Now just think about a situation where instead of Stream Aggregate, optimizer feels to use Hash Aggregate, which don’t require its input to be in sorted order.

Let’s try to bring Hash Operator in the plan in place of Stream Operator.

I am not able to stimulate the situation where Hash Aggregate appears in plan. So lets play a tricky game with Optimizer, and tell there is no such Stream aggregate operator present thus don’t make it use for creating plan

You can disable Stream Aggregate operator by

DBCC TRACEON (3604);
DBCC RULEOFF('GbAggToStrm');
GO

But later after your operation don’t forget to Rule on for Stream Aggregate operator.
DBCC TRACEON (3604);
DBCC RULEOFF('GbAggToStrm');
GO
select  pc.Name, sum( psc.ProductSubcategoryID) as 'Sum of Sub Product Id' from Production.ProductCategory pc
inner join Production.ProductSubcategory psc
on pc.ProductCategoryID = psc.ProductCategoryID
group by pc.Name
OPTION (RECOMPILE);
GO
DBCC RULEON('GbAggToStrm');

Name                                               Sum of Sub Product Id
-------------------------------------------------- ---------------------
Bikes                                              6
Clothing                                           172
Accessories                                        378
Components                                         147

Here we can see the result is didn’t get sorted by optimizer, lets see how optimizer executed this query.



We can see the Stream Aggregate operator replaced by Hash Aggregate which don’t require its input to be in sorted manner and that is the reason why our result resulted in unsorted manner.

So to be in safe side, if you want your group by data in sorted manner do use order by clause along with group by, this way you can guarantee your data to be in sorted manner.
 
DBCC TRACEON (3604);
DBCC RULEOFF('GbAggToStrm');
GO
select  pc.Name, sum( psc.ProductSubcategoryID) as 'Sum of Sub Product Id' from Production.ProductCategory pc
inner join Production.ProductSubcategory psc
on pc.ProductCategoryID = psc.ProductCategoryID
group by pc.Name
order by pc.Name
OPTION (RECOMPILE);
GO
DBCC RULEON('GbAggToStrm');

Name                                               Sum of Sub Product Id
-------------------------------------------------- ---------------------
Accessories                                        378
Bikes                                              6
Clothing                                           172
Components                                         147

Conclusion: Always use order by along with Group By, if you want your data to be in sorted manner.
 
SQL Server with a Tea glass makes a perfect combination to deal with darkness :)


Post Reference: Vikram Aristocratic Elfin Share