Saturday, April 26, 2014

SQL Server analytic functions: trap in LAST_VALUE

Analytic functions introduced in SQL Server 2012 appeared to be really awesome. LAG, LEAD, FIRST_VALUE and LAST_VALUE can simplify queries in particular when dealing with time series. However, there is a potential problem which I trapped into recently.

Here is the syntax of FIRST_VALUE and LAST_VALUE functions as by MSDN:

FIRST_VALUE ( [scalar_expression ] ) 
    OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] )

LAST_VALUE ( [scalar_expression ) 
    OVER ( [ partition_by_clause ] order_by_clause rows_range_clause )

Note that rows_range_clause is mandatory in LAST_VALUE and optional in FIRST_VALUE. However, rows_range_clause can be omitted in LAST_VALUE and SQL server will not report any syntax error and the result will be somewhat unexpected.
Example:

CREATE TABLE #transaction (
 [id] INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
 [date] DATE NOT NULL,
 [type] VARCHAR(10) NOT NULL,
 [amount] MONEY NOT NULL,
 [description] VARCHAR(100) NOT NULL
)

INSERT #transaction ([date], [type], [amount], [description])
VALUES
 ('2012-01-01', 'DEBIT', 100.0, 'First debit in Jan')
, ('2012-01-01', 'CREDIT', 120.0, 'First credit in Jan')
, ('2012-01-05', 'DEBIT', 77.10, 'Last debit in Jan')
, ('2012-01-08', 'CREDIT', 55.60, '')
, ('2012-01-20', 'CREDIT', 80.00, 'Last credit in Jan')
, ('2012-02-08', 'CREDIT', 155.60, 'First credit in Feb')
, ('2012-02-09', 'CREDIT', 200.80, '')
, ('2012-02-09', 'CREDIT', 14.12, '')
, ('2012-02-09', 'DEBIT', 148.00, 'First debit in Feb')
, ('2012-02-13', 'CREDIT', 66.00, 'Last credit in Feb')
, ('2012-02-14', 'DEBIT', 20.00, 'Last debit in Feb')
Selecting all transactions along with [description] of the first transaction of each type in each month:
SELECT
 [id], [date], [type], [amount],
 FIRST_VALUE([description]) OVER (
   PARTITION BY YEAR([date]), MONTH([date]), [type] 
   ORDER BY [date], [id]
   ) [description]
FROM #transaction
The result is as expected:
id date type amount description
1 2012-01-01 DEBIT 100,00 First debit in Jan
2 2012-01-01 CREDIT 120,00 First credit in Jan
3 2012-01-05 DEBIT 77,10 First debit in Jan
4 2012-01-08 CREDIT 55,60 First credit in Jan
5 2012-01-20 CREDIT 80,00 First credit in Jan
6 2012-02-08 CREDIT 155,60 First credit in Feb
7 2012-02-09 CREDIT 200,80 First credit in Feb
8 2012-02-09 CREDIT 14,12 First credit in Feb
9 2012-02-09 DEBIT 148,00 First debit in Feb
10 2012-02-13 CREDIT 66,00 First credit in Feb
11 2012-02-14 DEBIT 20,00 First debit in Feb
Now selecting all transactions along with [description] of the last transaction of each type in each month using FIRST_VALUE function by simly reversing the order in OVER clause:
SELECT DISTINCT
 [id], [date], [type], [amount],
 FIRST_VALUE([description]) OVER (
   PARTITION BY YEAR([date]), MONTH([date]), [type] 
   ORDER BY [date] DESC, [id] DESC
   ) [description]
FROM #transaction
and the result is as expected:
id date type amount description
1 2012-01-01 DEBIT 100,00 Last debit in Jan
2 2012-01-01 CREDIT 120,00 Last credit in Jan
3 2012-01-05 DEBIT 77,10 Last debit in Jan
4 2012-01-08 CREDIT 55,60 Last credit in Jan
5 2012-01-20 CREDIT 80,00 Last credit in Jan
6 2012-02-08 CREDIT 155,60 Last credit in Feb
7 2012-02-09 CREDIT 200,80 Last credit in Feb
8 2012-02-09 CREDIT 14,12 Last credit in Feb
9 2012-02-09 DEBIT 148,00 Last debit in Feb
10 2012-02-13 CREDIT 66,00 Last credit in Feb
11 2012-02-14 DEBIT 20,00 Last debit in Feb
Now selecting all transactions along with [description] of the last transaction of each type in each month using LAST_VALUE function, but we forget to set rows_range_clause as required by documentation:
SELECT
 [id], [date], [type], [amount],
 LAST_VALUE([description]) OVER (
   PARTITION BY YEAR([date]), MONTH([date]), [type] 
   ORDER BY [date], [id]
   ) [description]
FROM #transaction
And here is the trap. We are not getting the last value in the group:
id date type amount description
1 2012-01-01 DEBIT 100,00 First debit in Jan
2 2012-01-01 CREDIT 120,00 First credit in Jan
3 2012-01-05 DEBIT 77,10 Last debit in Jan
4 2012-01-08 CREDIT 55,60 
5 2012-01-20 CREDIT 80,00 Last credit in Jan
6 2012-02-08 CREDIT 155,60 First credit in Feb
7 2012-02-09 CREDIT 200,80 
8 2012-02-09 CREDIT 14,12 
9 2012-02-09 DEBIT 148,00 First debit in Feb
10 2012-02-13 CREDIT 66,00 Last credit in Feb
11 2012-02-14 DEBIT 20,00 Last debit in Feb
The reason for this is that SQL Server uses default rows_range_clause for LAST_VALUE function and it is:
"ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW", which gives the last value in the group up to current row, effectively returning description from the current row as rows are selected in the same order as specified in OVER clause. To fix this a rows_range_clause must be set to the full group via:
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
SELECT
 [id], [date], [type], [amount],
 LAST_VALUE([description]) OVER (
   PARTITION BY YEAR([date]), MONTH([date]), [type] 
   ORDER BY [date], [id] 
   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
   ) [description]
FROM #transaction
And this is exactly what we want:
id date type amount description
1 2012-01-01 DEBIT 100,00 Last debit in Jan
2 2012-01-01 CREDIT 120,00 Last credit in Jan
3 2012-01-05 DEBIT 77,10 Last debit in Jan
4 2012-01-08 CREDIT 55,60 Last credit in Jan
5 2012-01-20 CREDIT 80,00 Last credit in Jan
6 2012-02-08 CREDIT 155,60 Last credit in Feb
7 2012-02-09 CREDIT 200,80 Last credit in Feb
8 2012-02-09 CREDIT 14,12 Last credit in Feb
9 2012-02-09 DEBIT 148,00 Last debit in Feb
10 2012-02-13 CREDIT 66,00 Last credit in Feb
11 2012-02-14 DEBIT 20,00 Last debit in Feb
In my view Microsoft had to enforce rows_range_clause in LAST_VALUE on a syntax level.