Monday, October 7, 2013

Analytic Functions in SQL 2012 - Part 2 - LEAD and LAG

In the Part 1 of this series we looked at how to get the extreme (Maximum, Minimum, Least, Most, etc) records from an ordered list using two fnctions First_Value and Last_Value. However, there could be situation when you want to access a record that is at a position relative to a current record.

For example, if you want to compare the value of a record with the previous or the next record. There could be a situation when you have a monthly data for multiple years and you want to comnpare values of current month with the same month previous year, assuming that all the 12 months appear in the list whether or not they have values. For these kind of situations, we have next 2 functions - LEAD() and LAG().

Just like other Analytic Functions, these functions too wotk with OVER clause only.


LEAD and LAG


LEAD function provides access to a row at a given offset following the current row. Similarly, LAG function proveds access to a row at a given offset preceding the current row. Before we go to the examples, lets have a close look at the syntax of these functions because they expect two extra input parameters.

LEAD ( return value found expression, [offset expression], [default return value expression])
LAG ( return value found expression, [offset expression], [default return value expression])

The second parameter "offset expression" would be an expression returning positive BIGINT value. This is optional and the default value is 1. In case of LEAD, the offset is moved forward and in case of LAG the offset is moved backwards. However, it can not be negative!! The offset value could be dynamically evaluated using the expression. This really makes this function very powerful and flexible.

The Third parameter "default return value expression" would be an expression which will be returned if no record is found at the given offset position for the current record. This could happen for Last records.


Examples:


To compare the next Quarter Quota Amount with the Current Quarter Quota Amount for a specific SalesRep and Year 


SELECT


SalesQuarter = CalendarQuarter,
CurrentQuarterQuota = SalesAmountQuota,
NextQuarterQuota = LEAD(SalesAmountQuota, 1,0) OVER (ORDER BY CalendarQuarter)
FROM dbo.FactSalesQuota
WHERE EmployeeKey = 272
AND CalendarYear = 2007
 

The OVER Clause first orders the list by CalendarQuarter in ascending order and then the LEAD function is applied to get the next Quarter (offset = 1) in the list. In this ORDER BY clause within the OVER clause does affect the output sequence.




To compare the next Aggregated Quarter Quota Amount with the Aggregated Current Quarter Quota Amount for Year = 2007. This example shows the use of expression instead of just a column name in the LEAD function


SELECT


SalesQuarter = CalendarQuarter,
CurrentQuarterQuota = SUM(SalesAmountQuota),
NextQuarterQuota = LEAD(SUM(SalesAmountQuota), 1,0) OVER (ORDER BY CalendarQuarter)
FROM dbo.FactSalesQuota
WHERE CalendarYear = 2007
GROUP BY CalendarQuarter


Now lets say we want to do this for EACH Year:

SELECT
CalendarYear,


SalesQuarter = CalendarQuarter,
CurrentQuarterQuota = SUM(SalesAmountQuota),
NextQuarterQuota = LEAD(SUM(SalesAmountQuota), 1,0) OVER (PARTITION BY CalendarYear ORDER BY CalendarQuarter)
FROM dbo.FactSalesQuota
GROUP BY CalendarYear, CalendarQuarter
 
 

In this case the Partition By clause will first create small partitions on CalendarYear and then Order these Partitions by CalendarQuarter. Finally, the LEAD function will get the Next Record (second parameter, offset = 1) for the records with Quarter 1, 2 or 3. For the 4th Quarter, the function will return 0 (third parameter, default = 0) as there won't be any next record in the CalendarYear partition for Quarter 4.



What if we want the comparison of the Quarter over Quarter across the years. That means, For Quarter 4 of a current year, the Next Quarter should be Quarter 1 of the Next year. This will be the SQL:

SELECT CalendarYear,
 SalesQuarter = CalendarQuarter,
 CurrentQuarterQuota = SUM(SalesAmountQuota), 
 NextQuarterQuota = LEAD(SUM(SalesAmountQuota), 1,0) OVER (ORDER BY CalendarYear ,  CalendarQuarter)
FROM dbo.FactSalesQuota
GROUP BY CalendarYear, CalendarQuarter

Now instead of creating Partition on CalendarYear, we will work on the WHOLE dataset as one big partition, but order the parition by CalendarYear and CalendarQuarter.



Finally, let's look at an example to use an expression for setting the value of OFFSET parameter instead of a constant. In this case, we need to compare the current quarter SalesQuota with the SalesQuota of First Quarter of the same Year.

SELECT
CalendarYear,
SalesQuarter = CalendarQuarter,
CurrentQuarterQuota = SUM(SalesAmountQuota),
FirstQuarterQuota = LAG(SUM(SalesAmountQuota),
CASE
WHEN CalendarQuarter = 2 THEN 1
WHEN CalendarQuarter = 3 THEN 2
WHEN CalendarQuarter = 4 THEN 3
ELSE 0
END ,0)
OVER (PARTITION BY CalendarYear ORDER BY CalendarQuarter)
FROM dbo.FactSalesQuota
GROUP BY CalendarYear, CalendarQuarter


The CASE Statement can be replaced with a simple expression as below:

SELECT CalendarYear,
SalesQuarter = CalendarQuarter,
CurrentQuarterQuota = SUM(SalesAmountQuota),
FirstQuarterQuota = LAG(SUM(SalesAmountQuota), CalendarQuarter -1, 0)
OVER (PARTITION BY CalendarYear ORDER BY CalendarQuarter)
FROM dbo.FactSalesQuota
GROUP BY CalendarYear, CalendarQuarter




Ofcourse, if you recollect from previous part, the same statement can be written using FIRST_VALUE function as below. However, this statement will not return correct result for years where you do not have data for First Quarter, because FIRST_VALUE function will return the very first record in the year partition. So, if you suspect the table may not have the First Quarter data, it's better to go with LAG function.


SELECT CalendarYear,
SalesQuarter = CalendarQuarter,
CurrentQuarterQuota = SUM(SalesAmountQuota),
FirstQuarterQuota = FIRST_VALUE(SUM(SalesAmountQuota)) OVER (PARTITION BY CalendarYear ORDER BY CalendarQuarter)
FROM dbo.FactSalesQuota
GROUP BY CalendarYear, CalendarQuarter




That's the end of this part.

In the next part of the series, I am going to cover CUME_DIST and PERCENT_RANK. Stay tuned!

 

Wednesday, October 2, 2013

Analytic Functions in SQL 2012 - Part 1 - FIRST_VALUE and LAST_VALUE

SQL 2012 comes with lot of new features and capabilities. One of the interesting add-ins are the new set of Analytic Functions. I am sure most of you (if not everyone) like me used to struggle a lot writing stored procedures for these standard statistic functions. Well, without wasting more time lets have a quick look and then in later posts, I will try to get into more depths of these functions and where possibly we could use them. Your inputs and feedbacks would be on icing on cake - goes without saying.


What are Analytic Functions?

According to technet "Analytic functions compute an aggregate value based on a group of rows. However, unlike aggregate functions, they can return multiple rows for each group."

The second sentence is very important and what makes these functions more interesting. What it means is with aggregate functions like SUM, AVG, MAX, MIN - we do not have luxury of displaying the individual records. Moreover, these aggregate functions return only 1 result and hence one record per group. Whereas, the analytic functions can return multiple rows. Therefore, GROUPING works differently in case of analytic functions.

Due to this, some practical scenarios that were more difficult to implement using aggregate functions have become simple using Analytic functions. Now lets have a look at these functions.

FIRST_VALUE and LAST_VALUE


As the name suggests, these function return the First Value or the Last Value from an ORDERED list. This can be useful in situations where you want to get lets say: Name of Employee with Least Salary, Name of a Product with Maximum Sales, Date on which Least Sales was done and so on..

Alternatives:


In the absence of this function, we would first need to get the Minimum value and then do a Lookup for the minimum value to get the corresponding record.

A second alternative would be to do a TOP 1 statement.


Example:


To get the name of the Least expensive Product.

SELECT
Product = P.EnglishProductName,
P.ListPrice,
LeastExpensive =
FIRST_VALUE(P.EnglishProductName) OVER (ORDER BY P.ListPrice ASC) FROM dbo.DimProduct P

The OVER Clause first orders the list by ListPrice in ascending order and then the FIRST VALUE fnction is applied to get the First Value of EnglishProductName in the list. However, this ORDER BY clause within the OVER clause does not affect the output sequence.



Another example: To get the Least expensive product in each ProductSubCateogry instead of whole list. This can be done by adding "PARTITION" clause in the OVER clause. The Partition clause divides the whole list into multiple categories and then FIRST_VALUE is computed for each of these small partitions. Let's have a look at the SQL for this.

SELECT
SubCategory = S.EnglishProductSubcategoryName,
Product = P.EnglishProductName,
P.ListPrice,
LeastExpensive  = FIRST_VALUE(P.EnglishProductName)
     OVER (PARTITION BY S.EnglishProductSubcategoryName ORDER BY P.ListPrice ASC)
FROM dbo.DimProduct P
JOIN dbo.DimProductSubcategory S ON P.ProductSubcategoryKey = S.ProductSubcategoryKey





To find the difference between the Price of Current Product as compared to the Price of Least Expensive and Most Expensive Product in the Same SubCategory

SELECT SubCategory = S.EnglishProductSubcategoryName,
Product = P.EnglishProductName,
P.ListPrice,

 
PriceDifference_LeastExpensive  = P.ListPrice - FIRST_VALUE(P.ListPrice)
OVER (PARTITION BY S.EnglishProductSubcategoryName ORDER BY P.ListPrice ASC)


PriceDifference_MostExpensive = FIRST_VALUE(P.ListPrice)
OVER (PARTITION BY S.EnglishProductSubcategoryName ORDER BY P.ListPrice DESC) - P.ListPrice

FROM dbo.DimProduct P
JOIN dbo.DimProductSubcategory S ON P.ProductSubcategoryKey = S.ProductSubcategoryKey




This can be rewritten using LAST_VALUE function as


SELECT SubCategory = S.EnglishProductSubcategoryName,
Product = P.EnglishProductName,
P.ListPrice,


PriceDifference_LeastExpensive = P.ListPrice - FIRST_VALUE(P.ListPrice)
OVER (PARTITION BY S.EnglishProductSubcategoryName ORDER BY P.ListPrice ASC)


PriceDifference_MostExpensive = LAST_VALUE(P.ListPrice)
OVER (PARTITION BY S.EnglishProductSubcategoryName ORDER BY P.ListPrice ASC) - P.ListPrice

FROM dbo.DimProduct P
JOIN dbo.DimProductSubcategory S ON P.ProductSubcategoryKey = S.ProductSubcategoryKey




Please do contribute where else you see these functions could be used in practical scenarios.

Part 2 will cover LEAD and LAG functions.

Tuesday, September 3, 2013

Is your Data Warehouse / BI solution PASS or No PASS?

PASS stands for
  • Performance
  • Accuracy
  • Security
  • Scalability
  
Performance
- Extracting data
- Data cleansing
- Transformation
- Loading to Data Warehouse
- Fetching data from DWH / marts
   
Accuracy
- Business logic understanding
- Error Handling
- Missing data
- Data configuration
- Multiple source integration
- SCD
- KPI Calculation
- Aggregations / Roll ups
- Single Version of Truth
   
Security
- Dimension Security
- AD user access
- Sharepoint Taxonomy
- Kerberos
   
Scalable
- Expansion of scope
- Adding more sources
- Adding more vertical information
- Adding more users
- Handling more volume of data
- Change of Hardware architecture
- Support multiple Visualization Tools



 

Thursday, August 1, 2013

SSAS (Left) Outer Join

PROBLEM:


Generally, MDX statement returns the resultset that closely resembles SQL INNER JOIN. That means only matching records are returned. Sometimes, we have a requirement where we need to display all the Dimension records, regardless whether they exist in the Fact or not.


RESOLUTION:


The MDX returns resultset as INNER JOIN when the MDX contains NON EMPTY behavior. If this is removed from the Dimension axis, then the MDX will behave like OUTER JOIN.


INNER JOIN with NON EMPTY key word

SELECT
NON EMPTY {[DimDate].[DateYear].[DateYear]} ON ROWS,
NON EMPTY {[Measures].[VisitCount]} ON COLUMNS
FROM [FactVisit];
 

OUTER JOIN without NON EMPTY

SELECT
{[DimDate].[DateYear].[DateYear]} ON ROWS,
NON EMPTY {[Measures].[VisitCount]} ON COLUMNS
FROM [FactVisit];
 

 

Thursday, July 25, 2013

SSRS Chart Dynamic Axis

This post is related SQL Server Reporting Services 2008 R2.

Problem:

I recently faced a situation when I was creating a column chart, showing Monthly Visit Counts. As the VisitCount Values were quite small (maximum being 2) and since the axis has default setting for Max value and Increment as Auto, it started showing me decimal values or repeating the values (if you switch off the decimal) for the ranges on the Axis. Now, since I am displaying Counts that are supposed to be "integers", the decimals could be confusing to the end-user.




We have following dataset:
MonthName | VisitCount

We need to create a Column Chart with following:
MonthName will go on X-Axis
VisitCount will go on Y-Axis



Trick:

Compute the Max value property of the Y Axis dynamically depending on the Max value in the dataset.


Solution:

Step #1: Add Custom Code

Here we are assuming that we can have the Maximum value for the Axis to be the 10x factor of either 10, 25 or 50. That is 10, 25, 50, 100, 250, 500, 1000, 2500 and so on. If you want other ranges you can modify the varibale values and add more variables.

This code computes the Max value property to be set for the Axis Options, depending on the Maximum value of the Measure in the data set.


Public Function GetScaleMaxVal(ByVal DataMaxVal As Integer) As Integer

   Dim ScaleMaxVal As Integer ScaleMaxVal = 10
Dim ScaleMaxVal10 As Integer
Dim ScaleMaxVal25 As Integer
Dim ScaleMaxVal50 As Integer

 
ScaleMaxVal10 = 10
ScaleMaxVal25 = 25
ScaleMaxVal50 = 50
 

While DataMaxVal > ScaleMaxVal

   If DataMaxVal < ScaleMaxVal10 Then
      ScaleMaxVal = ScaleMaxVal10

   ElseIf DataMaxVal < ScaleMaxVal25 Then
      ScaleMaxVal = ScaleMaxVal25

   Else
      ScaleMaxVal = ScaleMaxVal50

   End If

   ScaleMaxVal10 = ScaleMaxVal10 * 10
   ScaleMaxVal25 = ScaleMaxVal25 * 10
   ScaleMaxVal50 = ScaleMaxVal50 * 10
End While

Return ScaleMaxVal

End Function



Step #2: Call the Function

As we are done writing the function, time to give it a call.
Go to your Chart
-> Vertical Axis properties
-> Axis Options
-> Maximum
Click on Expression button (Fx ) and add following code:

=Code.GetScaleMaxVal( Max(Fields!VisitCount.Value))
Here we are using in built mathematical function Max along with our custom function.

Thats it and you are done!