SQL Server 2012 Windowing Functions Part 2 of 2: New Analytic Functions
This is the second part of my two-part article on windowing functions in SQL Server 2012. In Part 1, I explained the new running and sliding aggregation capabilities added to the OVER clause in SQL Server 2012. In this post, I’ll explain the new T-SQL analytic windowing functions. All of these functions operate using the windowing principles I explained in Part 1.
Eight New Analytic Functions
There are eight new analytic functions that have been added to T-SQL. All of them work in conjunction with an ordered window defined with OVER and an associated ORDER BY clause that can be optionally partitioned with a PARTITION BY clause and framed with a BETWEEN clause. The new functions are:
In the following code listing, the FIRST_VALUE, LAST_VALUE, LAG, and LEAD functions are used to analyze a set of orders at the product level.
DECLARE @Orders AS table(OrderDate date, ProductID int, Quantity int) INSERT INTO @Orders VALUES ('2011-03-18', 142, 74), ('2011-04-11', 123, 95), ('2011-04-12', 101, 38), ('2011-05-21', 130, 12), ('2011-05-30', 101, 28), ('2011-07-25', 123, 57), ('2011-07-28', 101, 12) SELECT OrderDate, ProductID, Quantity, WorstOn = FIRST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity), BestOn = LAST_VALUE(OrderDate) OVER(PARTITION BY ProductID ORDER BY Quantity ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), PrevOn = LAG(OrderDate, 1) OVER(PARTITION BY ProductID ORDER BY OrderDate), NextOn = LEAD(OrderDate, 1) OVER(PARTITION BY ProductID ORDER BY OrderDate) FROM @Orders ORDER BY OrderDate OrderDate ProductID Quantity WorstOn BestOn PrevOn NextOn ---------- --------- -------- ---------- ---------- ---------- ---------- 2011-03-18 142 74 2011-03-18 2011-03-18 NULL NULL 2011-04-11 123 95 2011-07-25 2011-04-11 NULL 2011-07-25 2011-04-12 101 38 2011-07-28 2011-04-12 NULL 2011-05-30 2011-05-21 130 12 2011-05-21 2011-05-21 NULL NULL 2011-05-30 101 28 2011-07-28 2011-04-12 2011-04-12 2011-07-28 2011-07-25 123 57 2011-07-25 2011-04-11 2011-04-11 NULL 2011-07-28 101 12 2011-07-28 2011-04-12 2011-05-30 NULL
In this query, four analytic functions specify an OVER clause that partitions the result set by ProductID. The product partitions defined for FIRST_VALUE and LAST_VALUE are sorted by Quantity, while the product partitions for LAG and LEAD are sorted by OrderDate. The full result set is sorted by OrderDate, so you need to visualize the sorted partition for each of the four functions to understand the output—the result set sequence is not the same as the row sequence used in the windowing functions.
FIRST_VALUE and LAST_VALUE
The WorstOn and BestOn columns use FIRST_VALUE and LAST_VALUE respectively to return the “worst” and “best” dates for the product in each partition. Performance is measured by quantity, so sorting each product’s partition by quantity will position the worst order at the first row in the partition and the best order at the last row in the partition. FIRST_VALUE and LAST_VALUE can return the value of any column (OrderDate, in this case), not just the aggregate column itself. For LAST_VALUE, it is also necessary to explicitly define a window over the entire partition with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Otherwise, as explained in my coverage of OVER clause enhancements in Part 1, the default window is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which frames (constrains) the window, and does not consider the remaining rows in the partition that are needed to obtain the highest quantity for LAST_VALUE.
In the output, notice that OrderDate, LowestOn, and HighestOn for the first order (product 142) are all the same value (3/18). This is because product 142 was only ordered once, so FIRST_VALUE and LAST_VALUE operate over a partition that has only this one row in it, with an OrderDate value of 3/18. The second row is for product 123, quantity 95, ordered on 4/11. Four rows ahead in the result set (not the partition) there is another order for product 123, quantity 57, placed on 7/25. This means that, for this product, FIRST_VALUE and LAST_VALUE operate over a partition that has these two rows in it, sorted by quantity. This positions the 7/25 order (quantity 57) first and the 4/11 (quantity 95) last within the partition. As a result, rows for product 123 report 7/25 for WorstDate and 4/11 for BestDate. The next order (product 101) appears two more times in the result set, creating a partition of three rows. Again, based on the Quantity sort of the partition, each row in the partition reports the product’s worst and best dates (which are 7/28 and 4/12, respectively).
LAG and LEAD
The PrevOn and NextOn columns use LAG and LEAD to return the previous and next date that each product was ordered. They specify an OVER clause that partitions by ProductId as before, but the rows in these partitions are sorted by OrderDate. Thus, the LAG and LEAD functions examine each product’s orders in chronological sequence, regardless of quantity. For each row in each partition, LAG is able to access previous (lagging) rows within the same partition. Similarly, LEAD can access subsequent (leading) rows within the same partition. The first parameter to LAG and LEAD specifies the column value to be returned from a lagging or leading row, respectively. The second parameter specifies the number of rows back or forward LAG and LEAD should seek within each partition, relative to the current row. The query passes OrderDate and 1 as parameters to LAG and LEAD, using product partitions that are ordered by date. Thus, the query returns the most recent past date, and nearest future date, that each product was ordered.
Because the first order’s product (142) was only ordered once, its single-row partition has no lagging or leading rows, and so LAG and LEAD both return NULL for PrevOn and NextOn. The second order (on 4/11) is for product 123, which was ordered again on 7/25, creating a partition with two rows sorted by OrderDate, with the 4/11 order positioned first and the 7/25 order positioned last within the partition. The first row in a multi-row window has no lagging rows, but at least one leading row. Similarly, the last order in a multi-row window has at least one lagging row, but no leading rows. As a result, the first order (4/11) reports NULL and 7/25 for PrevOn and NextOn (respectively), and the second order (7/25) reports 4/11 and NULL for PrevOn and NextOn (respectively). Product 101 was ordered three times, which creates a partition of three rows. In this partition, the second row has both a lagging row and a leading row. Thus, the three orders report PrevOn and NextOn values for product 101, respectively indicating NULL-5/30 for the first (4/12) order, 4/12-7/28 for the second (5/30) order, and 5/30-NULL for the third and last order.
The last functions to examine are PERCENT_RANK (rank distribution), CUME_DIST (cumulative distribution, or percentile), PERCENTILE_CONT (continuous percentile lookup), and PERCENTILE_DISC (discreet percentile lookup). The following queries demonstrate these functions, which are all closely related, by querying sales figures across each quarter of two years.
DECLARE @Sales table(Yr int, Qtr int, Amount money) INSERT INTO @Sales VALUES (2010, 1, 5000), (2010, 2, 6000), (2010, 3, 7000), (2010, 4, 2000), (2011, 1, 1000), (2011, 2, 2000), (2011, 3, 3000), (2011, 4, 4000) -- Distributed across all 8 quarters SELECT Yr, Qtr, Amount, R = RANK() OVER(ORDER BY Amount), PR = PERCENT_RANK() OVER(ORDER BY Amount), CD = CUME_DIST() OVER(ORDER BY Amount) FROM @Sales ORDER BY Amount -- Distributed (partitioned) by year with percentile lookups SELECT Yr, Qtr, Amount, R = RANK() OVER(PARTITION BY Yr ORDER BY Amount), PR = PERCENT_RANK() OVER(PARTITION BY Yr ORDER BY Amount), CD = CUME_DIST() OVER(PARTITION BY Yr ORDER BY Amount), PD5 = PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr), PD6 = PERCENTILE_DISC(.6) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr), PC5 = PERCENTILE_CONT(.5) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr), PC6 = PERCENTILE_CONT(.6) WITHIN GROUP (ORDER BY Amount) OVER(PARTITION BY Yr) FROM @Sales ORDER BY Yr, Amount Yr Qtr Amount R PR CD ---- --- ------- - ----------------- ----- 2011 1 1000.00 1 0 0.125 2011 2 2000.00 2 0.142857142857143 0.375 2010 4 2000.00 2 0.142857142857143 0.375 2011 3 3000.00 4 0.428571428571429 0.5 2011 4 4000.00 5 0.571428571428571 0.625 2010 1 5000.00 6 0.714285714285714 0.75 2010 2 6000.00 7 0.857142857142857 0.875 2010 3 7000.00 8 1 1 Yr Qtr Amount R PR CD PD5 PD6 PC5 PC6 ---- --- ------- - ------------- ---- ------- ------- ---- ---- 2010 4 2000.00 1 0 0.25 5000.00 6000.00 5500 5800 2010 1 5000.00 2 0.33333333333 0.5 5000.00 6000.00 5500 5800 2010 2 6000.00 3 0.66666666667 0.75 5000.00 6000.00 5500 5800 2010 3 7000.00 4 1 1 5000.00 6000.00 5500 5800 2011 1 1000.00 1 0 0.25 2000.00 3000.00 2500 2800 2011 2 2000.00 2 0.33333333333 0.5 2000.00 3000.00 2500 2800 2011 3 3000.00 3 0.66666666667 0.75 2000.00 3000.00 2500 2800 2011 4 4000.00 4 1 1 2000.00 3000.00 2500 2800
The new functions are all based on the RANK function introduced in SQL Server 2005. So both these queries also report on RANK, which will aid both in my explanation and your understanding of each of the new functions.
PERCENT_RANK and CUME_DIST
In the first query, PERCENT_RANK and CUME_DIST (aliased as PR and CD respectively) rank quarterly sales across the entire two year period. Look at the value returned by RANK (aliased as R). It ranks each row in the unpartitioned window (all eight quarters) by dollar amount. Both 2011Q2 and 2010Q4 are tied for $2,000 in sales, so RANK assigns them the same value (2). The next row break the tie, so RANK continues with 4, which accounts for the “empty slot” created by the two previous rows that were tied.
Now examine the values returned by PERCENT_RANK and CUME_DIST. Notice how they reflect the same information as RANK with decimal values ranging from 0 and 1. The only difference between the two is a slight variation in their formulaic implementation, such that PERCENT_RANK always starts with 0 while CUME_DIST always starts with a value greater than 0. Specifically, PERCENT_RANK returns (RANK – 1) / (N – 1) for each row, where N is the total number of rows in the window. This always returns 0 for the first (or only) row in the window. CUME_DIST returns RANK / N, which always returns a value greater than 0 for the first row in the window (which would be 1, if there’s only row). For windows with two or more rows, both functions return 1 for the last row in the window with decimal values distributed among all the other rows.
The second query examines the same sales figures, only this time the result set is partitioned by year. There are no ties within each year, so RANK assigns the sequential numbers 1 through 4 to each of the quarters, for 2010 and 2011, by dollar amount. You can see that PERCENT_RANK and CUME_DIST perform the same RANK calculations as explained for the first query (only, again, partitioned by year this time).
PERCENTILE_DISC and PERCENTILE_CONT
This query also demonstrates PERCENTILE_DISC and PERCENTILE_CONT. These very similar functions each accept a percentile parameter (the desired CUME_DIST value) and “reach in” to the window for the row at or near that percentile. The code demonstrates by calling both functions twice, once with a percentile parameter value of .5 and once with .6, returning columns aliased as PD5, PD6, PC5, and PC6. Both functions examine the CUME_DIST value for each row in the window to find the one nearest to .5 and .6. The subtle difference between them is that PERCENTILE_DISC will return a precise (discreet) value from the row with the matching percentile (or greater), while PERCENTILE_CONT interpolates a value based on a continuous range. Specifically, PERCENTILE_CONT returns a value ranging from the row matching the specified percentile—or a calculated value higher than that (based on the specified percentile) if there is no exact match—and the row with the next higher percentile in the window. This explains the values they return in this query.
Notice that these functions define their window ordering using ORDER BY in a WITHIN GROUP clause rather than in the OVER clause. Thus, you do not (and cannot) specify ORDER BY in the OVER clause. The OVER clause is still required, however, so OVER (with empty parentheses) must be specified even if you don’t want to partition using PARTITION BY.
For the year 2010, the .5 percentile (CUME_DIST value) is located exactly on quarter 1, which had $5,000 in sales. Thus PERCENTILE_DISC(.5) returns 5000. There is no row in the window with a percentile of .6, so PERCENTILE_DISC(.6) matches up against the first row with a percentile greater than or equal to .6, which is the row for quarter 2 with $6,000 in sales, and thus returns 6000. In both cases, PERCENTILE_DISC returns a discreet value from a row in the window at or greater than the specified percentile. The same calculations are performed for 2011, returning 2000 for PERCENTILE_DISC(.5) and 3000 for PERCENTILE_DISC(.6), corresponding to the $2,000 in sales for quarter 2 (percentile .5) and the $3,000 in sales for quarter 3 (percentile .75)
As I stated, PERCENTILE_CONT is very similar. It takes the same percentile parameter to find the row in the window matching that percentile. If there is no exact match, the function calculates a value based on the scale of percentiles distributed across the entire window, rather than looking ahead to the row having the next greater percentile value, as PERCENTILE_DISC does. Then it returns the median between that value and the value found in the row with the next greater percentile. For 2010, the .5 percentile matches up with 5000 (as before). The next percentile in the window is for .75 for 6000. The median between 5000 and 6000 is 5500 and thus, PERCENTILE_CONT(.5) returns 5500. There is no row in the window with a percentile of .6, so PERCENTILE_CONT(.6) calculates what the value for .6 would be (somewhere between 5000 and 6000, a bit closer to 5000) and then calculates the median between that value and the next percentile in the window (again, .75 for 6000). Thus, PERCENTILE_CONT(.6) returns 5800; slightly higher than the 5500 returned for PERCENTILE_CONT(.5).
This post explained the eight new analytic functions added to T-SQL in SQL Server 2012. These new functions, plus the running and sliding aggregation capabilities covered in Part 1, greatly expand the windowing capabilities of the OVER clause available since SQL Server 2005.