Featuring datetrunc and date_bucket
An SQL text by Erland Sommarskog, SQL Server
MVP. Most recent update 2024-10-31.
Copyright applies to this text. See here for font conventions used in this article.
SQL 2022 introduced the function generate_series which permits us to solve problems for which we previously had to use a table of numbers (also known as a "tally table"). generate_series is a great addition, because when using a table of numbers there was always a risk that you could run out of numbers. Or some joker could have deleted rows from the table. All these worries are gone with generate_series.
The first chapter in this article presents a number of problems which are easily solved with generate_series. Many problems where generate_series is useful relate to date and time. In the second chapter, I explore such problems more closely, and I introduce tables of dates and months. I also discuss the functions datetrunc and date_bucket, two other functions introduced in SQL 2022. Particularly powerful is date_bucket, which can map a date/time value to a span of your choice, be that 15-minute or three-day intervals.
This short story presumes that you are using SQL 2022 or later or one of Azure SQL Database or Azure SQL Managed Instance. More precisely, your database needs to be set to compatibility level 160 for generate_series to be available. (This limitation does not apply to datetrunc and date_bucket.) If you are using SQL 2019 or earlier or you are stuck on a lower compatibility level, you should instead read the short story Using a Table of Numbers, which discusses the same problems as this article, but solves them with what is available in older versions.
The examples in this article work with a demo database called NorthNumbers, and you can download a script to create it here. This database is cloned from Microsoft's old demo database Northwind, to which I have made a few alterations for the benefit of my examples. The database is small, around 100 MB in size.
Table of Contents
Using generate_series to Solve Practical Problems
Working with Date and Time Problems
Weeks and Introducing datetrunc
The date_bucket Function, Take One
Before we look at actual problems, let's first look at generate_series as such. It is a built-in function that accepts two or three parameters. The parameters are known as start, stop and step. They must all be of one of the data types tinyint, smallint, int, bigint, decimal or numeric. Furthermore, in a specific call to generate_series, all parameters must be of the same type.
The result set has a single column value of the same data type as the input parameters, and it consists of all values from start to stop in increments of step.
Here are some examples.
SELECT * FROM generate_series(1, 10);
The output is:
value
-----------
1
2
3
4
5
6
7
8
9
10
As we can tell from this example, we can leave out step and it appears to default to 1. However, the story is a little more complex. Look at this example:
SELECT * FROM generate_series(10, 1);
The output:
value
-----------
10
9
8
7
6
5
4
3
2
1
Here step defaulted to -1, because start is > stop. At first glance, this may seem like a nice helpful feature, but in my opinion it is not. It can cause surprises when you pass variables or columns for start and stop and do not want any rows when start is ahead of stop. We will look at a practical example where this causes problems later. Because of this uncertainty, I recommend that you always specify step explicitly, with the possible exception of when you pass constant values for start and stop.
Here is an example with higher numbers and a step > 1:
SELECT * FROM generate_series(4732, 4770, 7);
The result:
value
-----------
4732
4739
4746
4753
4760
4767
Here is an example with decimal values and a negative value for step:
DECLARE @d1 decimal(9,2) = 1.00,
        @d2 decimal(9,2) = -1.00,
        @d3 decimal(9,2) = -0.25;
SELECT value FROM generate_series(@d1, @d2, @d3);
The output this time:
value
---------------------------------------
1.00
0.75
0.50
0.25
0.00
-0.25
-0.50
-0.75
-1.00
On some platforms, for instance Postgres, start and stop can also be date/time data types, but this is not supported in SQL Server. (And speaking of Postgres, on Postgres step always defaults to 1.)
Now that we have looked at the function as such, let's look at some practical problems where generate_series plays an essential role in the solutions.
Say that we want to review the sales for NorthNumbers Traders for a period, for instance the month of December 1997. This is a query to do this:
DECLARE @startdate date = '19971201',
        @enddate   date = '19971231';
SELECT O.OrderDate, COUNT(*) AS [Count], SUM(OD.Amount) AS TotAmount
FROM   dbo.Orders O
JOIN   (SELECT OrderID, SUM(UnitPrice * Quantity) AS Amount
        FROM   dbo.[Order Details] OD
        GROUP  BY OrderID) OD ON O.OrderID = OD.OrderID
WHERE  O.OrderDate BETWEEN @startdate AND @enddate
GROUP  BY O.OrderDate
ORDER  BY O.OrderDate;
This query returns 23 rows. But say now that the requirement is that all dates in the period should appear in the output, and not only those that actually have sales. In order to do this, you need something that spans the date dimension, and this is where generate_series comes in handy. Here is a CTE that produces all dates in the period:
WITH dates AS (
    SELECT dateadd(DAY, value, @startdate) AS d
    FROM   generate_series(0, datediff(DAY, @startdate, @enddate), 1)
)
Note: In case you have not seen a CTE before, CTE stands for Common Table Expression. CTEs serve as logical building blocks, so that you can build more complex queries from smaller ones. A CTE is not necessarily computed as such, as the optimizer may recast the computation order.
In the input to generate_series, we set start to 0 and stop to the length of the interval minus 1, and conveniently enough, this is what the datediff function returns. Following the recommendation above, we explicitly set step to 1. If for some reason @startdate would be after @enddate, we don't want any output at all.
We use the dateadd function to generate the dates by adding value to @startdate. You can run the query on its own to verify that it returns all dates in December 1997 (with the two variables set as above).
Now that we have this CTE, we need to inject it into the rest of the query. It is the CTE that drives the query, as it provides the dates. Therefore, the CTE is what should appear directly after FROM, and to be sure that all dates are retained in the output, we should left-join it to the rest of the query. This leads us to:
DECLARE @startdate date = '19971201',
        @enddate   date = '19971231';
WITH Dates AS (
    SELECT dateadd(DAY, value, @startdate) AS Date
    FROM   generate_series(0, datediff(DAY, @startdate, @enddate), 1)
)
SELECT D.Date AS OrderDate, COUNT(O.OrderID) AS [Count],
       isnull(SUM(OD.Amount), 0) AS TotAmount
FROM   Dates D
LEFT   JOIN (dbo.Orders O
             JOIN (SELECT OrderID, SUM(UnitPrice * Quantity) AS Amount
                   FROM   dbo.[Order Details] OD
                   GROUP  BY OrderID) OD ON O.OrderID = OD.OrderID)
       ON O.OrderDate = D.Date
GROUP  BY D.Date
ORDER  BY D.Date;
This form of nested joins may be new to you. We use parentheses to mark that Orders and the aggregation from Order Details should logically be joined first, before being left-joined to the dates CTE. (The actual physical join order is as always up to the optimizer.)
Note two changes in the SELECT list: I have changed COUNT(*) to COUNT(O.OrderID). This is required to get a count of zero for the days with no orders. COUNT(*) counts all rows, whereas COUNT(col) only counts rows with a non-NULL value in col. I have also wrapped the sum of the amounts in isnull, so that I get zero rather than NULL in the output for the dates without sales.
We will revisit this example a few more times in this article.
Say that you have a table where the business rules mandate the IDs to be contiguous. Every once in a while, you may want to check that there are no gaps, to verify that the code that generates the IDs is working properly. Here is a simple-minded query to find out if there are any IDs missing in the Orders table in NorthNumbers:
SELECT gs.value AS MissingID
FROM   generate_series((SELECT MIN(O.OrderID) FROM dbo.Orders O),
                       (SELECT MAX(O.OrderID) FROM dbo.Orders O), 1) AS gs
WHERE  NOT EXISTS (SELECT *
                   FROM   dbo.Orders O
                   WHERE  O.OrderID = gs.value);
It returns four ids: 10319, 10320, 10550 and 11064.
From a logical perspective, there is nothing wrong with this query. Performance may however not be fantastic. When I run it against a database with one million orders, it takes a few seconds.
There is a better solution using the LAG or LEAD functions. These functions return a value from the previous (LAG) or the next (LEAD) row in the result set. If we only want to find the start and the end of the gaps, we can do:
; WITH IdAndNext AS (
     SELECT OrderID, LEAD(OrderID) OVER(ORDER BY OrderID) AS NextOrderID
     FROM   dbo.Orders
)
SELECT OrderID + 1 AS StartRange, NextOrderID - 1 AS EndRange
FROM   IdAndNext
WHERE  NextOrderID - OrderID > 1
ORDER  BY StartRange;
This query uses LEAD. I leave as an exercise to the reader to rewrite it to use LAG instead.
To get a listing of all missing IDs, we need something to produce all values in a gap, and generate_series is a perfect match here:
; WITH IdAndNext AS (
     SELECT OrderID, LEAD(OrderID) OVER(ORDER BY OrderID) AS NextOrderID
     FROM   dbo.Orders
)
SELECT I.OrderID + gs.value AS MissingID
FROM   IdAndNext I
CROSS  APPLY generate_series(1, I.NextOrderID - I.OrderID - 1, 1) gs
WHERE  I.NextOrderID - I.OrderID > 1
ORDER  BY I.OrderID;
You may not have seen the CROSS APPLY operator before. CROSS APPLY is a sort of a join operator, but there is an important difference. If you say A JOIN B, B itself cannot refer to A, but all correlations need to be in the ON clause that follows. With A APPLY B, you don't have any ON clause, but instead B can refer to A. In this example B is a call to a table-valued function, but B can also be a subquery. We will use CROSS APPLY with generate_series more than once in this article.
Sometimes you encounter problems where you need to loop over all characters in a string. Here is an example that computes the frequency of all letters in the column Notes in the Employees table.
WITH chars AS (
    SELECT lower(substring(E.Notes, gs.value, 1)) COLLATE Latin1_General_CI_AS ch
    FROM   dbo.Employees E
    CROSS  APPLY generate_series(convert(bigint, 1), len(E.Notes),
                                 convert(bigint, 1)) AS gs
)
SELECT ch, COUNT(*) AS charfreq
FROM   chars
WHERE  ch BETWEEN 'a' AND 'z'
GROUP  BY ch
ORDER  BY charfreq DESC;
For every employee we generate as many numbers as the Notes column is long for that employee. Since generate_series requires all input parameters to be of the same type, and the len function returns bigint when the input is nvarchar(MAX), we need to cast the 1 for the start and step parameter to bigint.
Note: the purpose of the COLLATE clause is to ensure that a and z are really the first and last letters so that the BETWEEN operator works as intended. The alphabets of some languages have letters that come after z. Swedish is one example of this.
Here is a more practical application. In the Customers table there is a column OrganisationNo. This is intended to be a Swedish organisation number for a juridical person. This is a 10-digit string, where the last digit is a check digit which is computed like this:

1. Take the first nine digits of the number.
2. Multiply the digits in odd positions (the first, third, fifth and so on) by 2 and the digits in even positions by 1.
3. Sum the digits of all the products. (That is, a product of 14 contributes 1 + 4 = 5 to the sum.)
4. Take the sum modulo 10 and subtract it from 10.
5. The last digit of this result is the check digit. (That is, if the result is 10, the check digit is 0.)
Here is a query that uses generate_series to validate the check digits in the Customers table, and which returns the invalid organisation numbers together with the expected check digit.
WITH products AS (
    SELECT C.OrganisationNo,
           IIF(gs.value % 2 = 1, 2, 1) *
              try_cast(substring(C.OrganisationNo, gs.value, 1) AS int) AS product
    FROM   dbo.Customers C
    CROSS  JOIN generate_series(1, 9, 1) AS gs
    WHERE  len(C.OrganisationNo) = 10
      AND  C.OrganisationNo NOT LIKE '%[^0-9]%'
), checkdigits AS (
    SELECT OrganisationNo,
           (10 - SUM(product / 10 + product % 10) % 10) % 10 AS checkdigit
    FROM   products
    GROUP  BY OrganisationNo
)
SELECT OrganisationNo, checkdigit
FROM   checkdigits
WHERE  substring(OrganisationNo, 10, 1) <> checkdigit;
In the products CTE, I perform steps 1 and 2 above. I also filter out any values that have an illegal format, that is, the wrong number of characters or strings that include non-digits. I still need to use try_cast in the SELECT list to avoid conversion errors for non-digits, as SQL Server may prefer to perform the filtering after doing the computation in the SELECT list. (As it happens, there are no non-digit characters in the OrganisationNo column, but as this is a general principle, I include it as a matter of best practice.)
The next CTE, checkdigits, performs steps 3 to 5 in the list above. The final query selects the rows with incorrect check digits. In total, there are six of them in NorthNumbers.
Here is another example on this theme. Say that you want to retrieve the orders for a number of customers during a certain date interval. The customer IDs in NorthNumbers are five-character strings, so you decide to pass them as a string of concatenated values with fixed length. You extract the values from this string with help of generate_series:
CREATE OR ALTER PROCEDURE get_orders @customers nvarchar(500),
                                     @startdate date,
                                     @enddate   date AS
SELECT O.OrderID, O.CustomerID, C.CompanyName, O.OrderDate, O.ShippedDate,
       OD.Amount + O.Freight AS TotalAmount
FROM   dbo.Orders O
JOIN   dbo.Customers C ON O.CustomerID = C.CustomerID
JOIN   (SELECT OrderID, SUM(UnitPrice * Quantity) AS Amount
        FROM   dbo.[Order Details]
        GROUP  BY OrderID) AS OD ON O.OrderID = OD.OrderID
WHERE  O.OrderDate BETWEEN @startdate AND @enddate
  AND  EXISTS (SELECT *
               FROM   generate_series(1, len(@customers) / 5, 1) AS gs
               WHERE  O.CustomerID = substring(@customers, (gs.value - 1) * 5 + 1, 5));
go
EXEC get_orders 'RATTCFOLIGRICARLAZYK', '19970301', '19970331';
You may ask why there is no convert this time when we have the len function. The difference to the above is that @customers is a regular nvarchar, not nvarchar(MAX), and in this case, the return type of len is int.
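If you want to verify this yourself, the sql_variant_property function can reveal the type that len actually returns. Here is a small test of my own (the variable names are just examples):

-- Check the return type of len() for plain nvarchar vs. nvarchar(MAX).
DECLARE @regular nvarchar(500) = N'RATTCFOLIGRICARLAZYK',
        @maxtype nvarchar(MAX) = N'RATTCFOLIGRICARLAZYK';
SELECT sql_variant_property(len(@regular), 'BaseType') AS len_regular,  -- int
       sql_variant_property(len(@maxtype), 'BaseType') AS len_max;      -- bigint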
More often, though, you have numeric ids. And while it is possible to create fixed-length strings from numeric ids, it is very popular to pass them as a comma-separated list. You can crack such a list into table format with help of generate_series as well, although this is a little more complex. Here is a function to do this, together with some examples:
CREATE OR ALTER FUNCTION split_me(@param nvarchar(MAX), @delim nvarchar(10))
RETURNS TABLE AS
RETURN(SELECT try_cast(substring(@param, value,
                  charindex(@delim, @param + @delim, value) - value) AS int) AS NumVal,
              row_number() OVER(ORDER BY value) AS Position
       FROM   generate_series(convert(bigint, 1), len(@param) + len(@delim) - 1,
                              convert(bigint, 1))
       WHERE  substring(@delim + @param, value, len(@delim)) = @delim);
go
SELECT NumVal, Position FROM dbo.split_me('12222,7,59644,19,1', ',');
SELECT NumVal, Position FROM dbo.split_me('12222<->7<->59644<->19<->1', '<->');
This is the output from both queries:
NumVal Position
----------- --------------------
12222 1
7 2
59644 3
19 4
1 5
You may know that SQL Server comes with a built-in function string_split for this purpose, but while string_split only accepts a single-character delimiter, split_me above accepts a delimiter that can be up to 10 characters long. This function is tailored to handle integer lists, so that you don't need to litter the queries with convert or cast. Also, split_me always returns the list position; with string_split you need to request this with an extra parameter. (This extra parameter to string_split is new to SQL 2022.)
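For comparison, here is what the same list looks like with string_split and the ordinal argument enabled (this third argument is available in SQL 2022 and Azure SQL Database):

-- string_split with the third argument set to 1 also returns the list position,
-- but the values come back as strings and the delimiter must be a single character.
SELECT try_cast(value AS int) AS NumVal, ordinal AS Position
FROM   string_split('12222,7,59644,19,1', ',', 1);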
What the function does, logically, is to loop over the string with help of generate_series to see if a delimiter starts in position value. @delim is prepended to @param, so that we also get a hit for the first element in the list. The value starts in position value of @param (since we added @delim before @param) and lasts until we find the next delimiter. Here we take advantage of the fact that charindex accepts a third parameter which specifies the starting position for the search. In the call to charindex, we append @delim to @param, so that we also get a hit for the last element in the list. The extraction is wrapped in try_cast, so that any non-numeric value results in NULL rather than a conversion error.
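In case you have not used that third parameter to charindex before, this little example shows how it shifts the starting point of the search:

-- charindex with a start position: the search begins at position 4,
-- so the comma in position 3 is skipped and the one in position 6 is found.
SELECT charindex(',', '12,34,56')    AS first_comma,   -- returns 3
       charindex(',', '12,34,56', 4) AS next_comma;    -- returns 6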
While this function works and is usable, there are a few possible improvements that are outside the scope for this article. In my article Arrays and Lists in SQL Server, I discuss all sorts of methods to expand a list into table format, and one chapter has the title Using a Table of Numbers where I discuss some enhancements. (Although, as of this writing, I have yet to update that article for generate_series.)
The final example in this chapter may seem esoteric to some, but it is drawn from something I actually developed for a client. Here we also see what can happen if you leave out step, thinking that it always defaults to 1. (As I did, until I found out the hard way.)
In the Customers table in NorthNumbers there is a column BonusPoints. NorthNumbers Traders have decided to conduct a lottery among their customers, and the rule is that each bonus point equates to a lottery ticket. (And thus a customer may win more than one prize.) Winning a prize means that you lose a bonus point.
The task is to write a query which generates the ten lucky winners. The trick is to join the Customers table with generate_series through the APPLY operator, so that we get one row for every bonus point. Then we apply the row_number function, ordering by newid() which is good for generating a random order, and then we pick the first ten of these. We save the winners into a temp table, and in a separate query we deduct the bonus points.
DROP TABLE IF EXISTS #winners
CREATE TABLE #winners (CustomerID  nchar(5)     NOT NULL,
                       CompanyName nvarchar(40) NOT NULL,
                       ContactName nvarchar(40) NOT NULL,
                       PrizeNo     int          NOT NULL PRIMARY KEY
);
WITH numbering AS (
    SELECT C.CustomerID, C.CompanyName, C.ContactName,
           row_number() OVER (ORDER BY newid()) AS PrizeNo
    FROM   dbo.Customers C
    CROSS  APPLY generate_series(convert(smallint, 1), C.BonusPoints) gs
)
INSERT #winners(CustomerID, CompanyName, ContactName, PrizeNo)
   SELECT CustomerID, CompanyName, ContactName, PrizeNo
   FROM   numbering
   WHERE  PrizeNo <= 10;

SELECT * FROM #winners ORDER BY PrizeNo;

UPDATE dbo.Customers
SET    BonusPoints -= w.cnt
FROM   dbo.Customers C
JOIN   (SELECT CustomerID, COUNT(*) AS cnt
        FROM   #winners
        GROUP  BY CustomerID) AS w ON C.CustomerID = w.CustomerID;
You can run this a few times to see some lucky winners. Keep an eye out for these five customers: BSBEV, MORGK, QUICK, SAVEA and WOLZA. So what is special about them? Run this query to see:
SELECT * FROM Customers WHERE BonusPoints = 0
These are the five customers that do not have any bonus points at all, and therefore are not entitled to win anything. While it is not very likely, one or more of them can still appear in the output, because the above script has a bug. But run this:
UPDATE Customers SET BonusPoints = 0
Now, no customers have any bonus points, and the lottery batch should return an empty result. But it keeps returning ten rows. This is because generate_series(convert(smallint, 1), C.BonusPoints) in this case is the same as generate_series(1, 0), which returns two rows, with the values 1 and 0. This happens because when stop is < start, the step parameter defaults to -1. Thus, the net effect is that all customers have two bonus points. (And since the final UPDATE deducts the prizes from BonusPoints, the winners' values sink further and further below zero, which the defaulted step translates into even more rows, further increasing their chances of winning in the next lottery.) Here is a correct version of the lottery script:
DROP TABLE IF EXISTS #winners
CREATE TABLE #winners (CustomerID  nchar(5)     NOT NULL,
                       CompanyName nvarchar(40) NOT NULL,
                       ContactName nvarchar(40) NOT NULL,
                       PrizeNo     int          NOT NULL PRIMARY KEY
);
WITH numbering AS (
    SELECT C.CustomerID, C.CompanyName, C.ContactName,
           row_number() OVER (ORDER BY newid()) AS PrizeNo
    FROM   dbo.Customers C
    CROSS  APPLY generate_series(1, convert(int, C.BonusPoints), 1) gs
)
INSERT #winners(CustomerID, CompanyName, ContactName, PrizeNo)
   SELECT CustomerID, CompanyName, ContactName, PrizeNo
   FROM   numbering
   WHERE  PrizeNo <= 10;

SELECT * FROM #winners ORDER BY PrizeNo;

UPDATE dbo.Customers
SET    BonusPoints -= w.cnt
FROM   dbo.Customers C
JOIN   (SELECT CustomerID, COUNT(*) AS cnt
        FROM   #winners
        GROUP  BY CustomerID) AS w ON C.CustomerID = w.CustomerID;
Rather than wrapping the 1s for start and step in convert, I've opted to convert the BonusPoints column to reduce the noise level. (Recall that generate_series requires that all parameters have the same data type.)
If you like, you can go back and revisit the other examples to see what happens if the step parameter is left out. As it turns out, there are mitigations in the queries which prevent an incorrect result when start is ahead of stop. And in one example we use MIN and MAX to set them, so it is logically impossible for start to be > stop. Nevertheless, rather than doing an analysis every time whether you actually need to specify step to be 1, it is a lot better to be safe than sorry and always specify step.
This is a beginner-level article, so this is not the place to go into deep discussions about performance. Nevertheless, I like to mention that there is an issue with generate_series which can give you incorrect estimates in situations where there is no reason for it. Incorrect estimates often lead to query plans with less-than-optimal performance. For details, see the section on generate_series in my SQL Server Wishlist.
We have already seen an example where we use generate_series with dates to get a row for all dates in an interval. In this chapter we will look a little more closely at such problems, for dates as well as other date/time intervals. We will see that sometimes it may be useful to create a table of dates and months. We will also work with the new functions datetrunc and date_bucket that were added in SQL 2022.
Earlier we saw how we could use generate_series to generate all dates in a period. If you run many queries of that type, you may grow a little tired of doing that dance with dateadd and datediff. An alternative is to create a table of dates, with one row for each date in the range you ever expect to work with. This can help to make your queries a little shorter and easier to read. Here is a script that creates and populates a Dates table with all dates from 1990-01-01 to 2149-12-31:
DROP TABLE IF EXISTS Dates
CREATE TABLE Dates (Date date NOT NULL PRIMARY KEY);
INSERT Dates(Date)
   SELECT dateadd(DAY, value, '19900101')
   FROM   generate_series(0, datediff(DAY, '19900101', '21491231'), 1);
It goes without saying that if you decide to implement a Dates table, you need to make a conscious decision about the range of dates, so that it includes all dates you can ever expect in your data.
In the beginning of the article we looked at a query that gave us the number of orders and their amount for each day in a period, including the days when no orders were placed. Equipped with the table Dates, we can now revisit that query to make it somewhat shorter:
DECLARE @startdate date = '19971201',
        @enddate   date = '19971231';
SELECT D.Date AS OrderDate, COUNT(O.OrderID) AS [Count],
       isnull(SUM(OD.Amount), 0) AS TotAmount
FROM   Dates D
LEFT   JOIN (dbo.Orders O
             JOIN (SELECT OrderID, SUM(UnitPrice * Quantity) AS Amount
                   FROM   dbo.[Order Details] OD
                   GROUP  BY OrderID) OD ON O.OrderID = OD.OrderID)
       ON O.OrderDate = D.Date
WHERE  D.Date BETWEEN @startdate AND @enddate
GROUP  BY D.Date
ORDER  BY D.Date;
Let's change the requirement a little bit and say that we only want to fill out the result set for weekdays, Monday to Friday. This can easily be achieved by adding this condition to the WHERE clause:
AND datename(weekday, D.Date) NOT IN ('Saturday', 'Sunday')
Note: I prefer to use datename over datepart here, since datepart is dependent on the SET DATEFIRST setting, and if you don't watch out, you could be filtering for Friday and Saturday when you wanted to filter for the weekend. True, the output from datename depends on the SET LANGUAGE setting and could return Samstag or domenica. But if so, you would not be filtering for anything at all, which is likely to stand out more than if you filter for the wrong days of the week. If nothing else, datename works with both SET LANGUAGE us_english and SET LANGUAGE British, whereas datepart does not.
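To see the difference for yourself, you can run this quick demonstration. 2023-03-26 is a Sunday, and the number from datepart moves around with SET DATEFIRST, while datename stays put:

DECLARE @d date = '20230326';   -- a Sunday
SET DATEFIRST 7;   -- week starts on Sunday
SELECT datepart(weekday, @d) AS dp, datename(weekday, @d) AS dn;   -- 1, Sunday
SET DATEFIRST 1;   -- week starts on Monday
SELECT datepart(weekday, @d) AS dp, datename(weekday, @d) AS dn;   -- 7, Sunday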
A further refinement of this requirement could be that we only want to display business days. For instance, we may not want a row for 1997-12-25, since that was Christmas Day and a holiday in more than one country. This leads to an extension of the Dates table which is known as a calendar table. Just like Dates, a calendar table has one row per day, and in addition there are columns with properties of the days. The exact set of columns depends on your needs, but common examples are IsHoliday, IsBankingDay etc. If you think about the filter with datename above, this filter means that orders placed during the weekends are lost. One way to address this is to have a column NearestWorkingDay, so you can specify that a Sunday should map to the Monday following or whatever you prefer.
Yet another refinement for an international application is to have a two-column key in your calendar table, where the second column is a country code, so that you can track that 17 May is always a holiday in Norway and likewise 6 June is always a holiday in Sweden.
A discussion of how to design and fill up a calendar table would take up too much space in this article and distract readers from its main focus; I only wanted to introduce you to the concept. For further reading I refer you to Ed Pollack's good article Designing a Calendar Table which explores this topic in detail.
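Just to make the concept a little more concrete before we move on, here is a minimal sketch of what such a table could look like. The columns are only examples of my own; Ed Pollack's article goes through the design in depth:

-- A minimal calendar-table sketch; the columns and the country dimension are
-- examples only. For a single-country application, Date alone would be the key.
CREATE TABLE Calendar
      (Country           char(2) NOT NULL,   -- e.g. 'SE', 'NO'
       Date              date    NOT NULL,
       IsHoliday         bit     NOT NULL,
       IsBankingDay      bit     NOT NULL,
       NearestWorkingDay date    NOT NULL,   -- e.g. maps a Sunday to the Monday after
       CONSTRAINT pk_Calendar PRIMARY KEY (Country, Date)
);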
Let's now look at running queries with other date and time intervals, starting with months. Say that rather than getting sales per day in December 1997, we want to see the monthly sales from 1996 to 1998 (the three years for which there is data in NorthNumbers). Here is a query to do this:
DECLARE @startdate date = '19960101',
        @enddate   date = '19981231';
WITH OrderData AS (
    SELECT convert(char(6), O.OrderDate, 112) AS OrderMonth, OD.Amount
    FROM   dbo.Orders O
    JOIN   (SELECT OrderID, SUM(UnitPrice * Quantity) AS Amount
            FROM   dbo.[Order Details] OD
            GROUP  BY OrderID) OD ON O.OrderID = OD.OrderID
    WHERE  O.OrderDate BETWEEN @startdate AND @enddate
)
SELECT OrderMonth, COUNT(*) AS [Count], SUM(Amount) AS Amount
FROM   OrderData
GROUP  BY OrderMonth
ORDER  BY OrderMonth;
Here are the first and last lines of the result set:
OrderMonth Count Amount
---------- ----------- ---------------------
199607 22 30192.10
199608 25 26609.40
199609 23 27636.00
199610 24 39496.40
...
199802 54 104561.95
199803 73 109825.45
199804 74 134630.56
199805 13 15176.36
This query shows an important device to get the month from a date. Use the convert function with style 112 (YYYYMMDD), but convert to char(6), to shave off the day part. I realise that not everyone may fancy this format, but a big advantage of this format is that it is easy to sort. The sole reason I introduce a Common Table Expression (CTE) here is to save me from having to repeat the convert in the GROUP BY and ORDER BY clauses.
Note: If you would prefer to have a dash between year and month, you can use format code 121 instead. The data type would need to be char(7) in this case.
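This small test of my own shows the two variants side by side:

-- Month in YYYYMM format vs. YYYY-MM, using styles 112 and 121.
DECLARE @d date = '19971204';
SELECT convert(char(6), @d, 112) AS Style112,   -- 199712
       convert(char(7), @d, 121) AS Style121;   -- 1997-12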
As before, we also want the months without any orders, and we can use our Dates table to this aim:
DECLARE @startdate date = '19960101',
        @enddate   date = '19981231';
WITH OrderData AS (
    SELECT convert(char(6), D.Date, 112) AS OrderMonth, O.OrderID,
           isnull(OD.Amount, 0) AS Amount
    FROM   Dates D
    LEFT   JOIN (dbo.Orders O
                 JOIN (SELECT OrderID, SUM(UnitPrice * Quantity) AS Amount
                       FROM   dbo.[Order Details] OD
                       GROUP  BY OrderID) OD ON O.OrderID = OD.OrderID)
           ON O.OrderDate = D.Date
    WHERE  D.Date BETWEEN @startdate AND @enddate
)
SELECT OrderMonth, COUNT(OrderID), SUM(Amount) AS TotAmount
FROM   OrderData
GROUP  BY OrderMonth
ORDER  BY OrderMonth;
This is the start and the end of the result set:
OrderMonth TotAmount
---------- ----------- ---------------------
199601 0 0.00
199602 0 0.00
199603 0 0.00
199604 0 0.00
199605 0 0.00
199606 0 0.00
199607 22 30192.10
...
199805 13 15176.36
199806 0 0.00
199807 0 0.00
199808 0 0.00
199809 0 0.00
199810 0 0.00
199811 0 0.00
199812 0 0.00
If you are working a lot with months, you may prefer to have a Months table. Not only does it serve to make the queries a little sleeker, but it can also make them a little more efficient, since a Months table obviously needs far fewer rows than Dates. Besides the month itself, we will add two more columns, to make it a little easier to use with date values.
DROP TABLE IF EXISTS Months
CREATE TABLE Months(Month      int  NOT NULL,
                    LowLimit   date NOT NULL,
                    UpperLimit date NOT NULL,
                    CONSTRAINT pk_Months PRIMARY KEY (Month)
);
INSERT Months(Month, LowLimit, UpperLimit)
   SELECT convert(int, convert(char(6), Date, 112)), Date, dateadd(MONTH, 1, Date)
   FROM   Dates
   WHERE  Day(Date) = 1;
Here I have opted to make the Month column an int column, but you could make it char(6) if you prefer. If you prefer to have a display of the month like Apr 1997, you could add a MonthName column to the table.
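For instance, you could add such a column like this. This is only a sketch; pick whatever display format and culture you prefer:

-- One possible way to add a display name such as 'Apr 1997' to the Months table.
ALTER TABLE Months ADD MonthName varchar(20) NULL;
go
UPDATE Months SET MonthName = format(LowLimit, 'MMM yyyy', 'en-US');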
If you look at the query, you realise that the column LowLimit is the first day of the month, while UpperLimit is the first day of the next month. You might have expected UpperLimit to be the end of the month instead. To understand this design, let's look at a query using the Months table:
DECLARE @startmon int = 199601,
        @endmon   int = 199812;
SELECT M.Month, COUNT(O.OrderID) AS [Count], isnull(SUM(OD.Amount), 0) AS TotAmount
FROM   Months M
LEFT   JOIN (dbo.Orders O
             JOIN (SELECT OrderID, SUM(UnitPrice * Quantity) AS Amount
                   FROM   dbo.[Order Details] OD
                   GROUP  BY OrderID) OD ON O.OrderID = OD.OrderID)
       ON O.OrderDate >= M.LowLimit AND O.OrderDate < M.UpperLimit
WHERE  M.Month BETWEEN @startmon AND @endmon
GROUP  BY M.Month
ORDER  BY M.Month;
The output is the same as above.
For the join of Months and Orders, I use a condition with >= and <, rather than something like this:

OrderDate BETWEEN FirstDayOfMonth AND LastDayOfMonth

There is a reason for this, although it does not matter in this particular example, since OrderDate is date only. But imagine that there were a time portion as well. With the BETWEEN pattern, LastDayOfMonth would have to be adjusted to include a time portion very close to midnight. But that can be tricky, because which time to add depends on the data type on the left side. By setting UpperLimit to the first day of the next month and using <, we avoid this issue altogether.
What if we want another type of interval? Say for instance that we want sales data per week. The challenge compared to monthly sales is to denote the week in an easy manner. There is no convert trick here. Rather, we should explore more general solutions. One such solution is to have a calendar table with a Week column. That is not something I will explore here, but I mention it, since it certainly is a viable option.
SQL 2022 introduced two functions, datetrunc and date_bucket, that are very useful in this context. Let's start with datetrunc, which is the simpler of the two. datetrunc takes a value of any date/time data type and truncates the value to the datepart you specify. Here are some examples:
DECLARE @d datetime2(3) = '20230326 23:46:14.123';
SELECT datetrunc(MINUTE, @d)   AS Minute,
       datetrunc(DAY, @d)      AS Day,
       datetrunc(ISO_WEEK, @d) AS Monday,
       datetrunc(QUARTER, @d)  AS Quarter;
Here is the output (folded for legibility):
Minute Day
--------------------------- ---------------------------
2023-03-26 23:46:00.000 2023-03-26 00:00:00.000
Monday Quarter
--------------------------- ---------------------------
2023-03-20 00:00:00.000 2023-01-01 00:00:00.000
Note that if you specify WEEK for the datepart rather than ISO_WEEK, which day of the week you get depends on the SET DATEFIRST setting, which we briefly touched on previously. (We will discuss it a little more closely at the end of the article.) With ISO_WEEK, on the other hand, you are guaranteed to always get back a Monday with datetrunc.
Here is a query that reports orders in NorthNumbers per week, with each week being represented by the Monday of that week.
DECLARE @startdate date = '19970101',
        @enddate   date = '19971231';
WITH OrderData AS (
    SELECT datetrunc(ISO_WEEK, OrderDate) AS OrderWeek, OD.Amount
    FROM   dbo.Orders O
    JOIN   (SELECT OrderID, SUM(UnitPrice * Quantity) AS Amount
            FROM   dbo.[Order Details] OD
            GROUP  BY OrderID) OD ON O.OrderID = OD.OrderID
    WHERE  O.OrderDate BETWEEN @startdate AND @enddate
)
SELECT OrderWeek, COUNT(*) AS [Count], SUM(Amount) AS Amount
FROM   OrderData
GROUP  BY OrderWeek
ORDER  BY OrderWeek;
Here is the start and end of the output:
OrderWeek Count Amount
---------- ----------- ---------------------
1996-12-30 5 12326.00
1997-01-06 7 7564.30
1997-01-13 7 16647.00
1997-01-20 7 18421.80
...
1997-12-08 8 10762.47
1997-12-15 12 18715.50
1997-12-22 13 17678.54
1997-12-29 7 14871.25
Extending this query, so that it includes a week, even if there are no orders that week, is left as an exercise to the reader. It should be straightforward – just apply the pattern above.
datetrunc works well as long as you want to group by a whole datepart, be that a year, a week, a minute etc. But what if you want something more odd-sized? A three-day interval? 15 minutes? Or bi-weekly sales data? This is when you turn to the date_bucket function. The name for the function comes from the idea that you take a date/time value and map it to a bucket. This function first appeared in Azure SQL Edge, which is for IoT devices, and where you work a lot with time series. (Edge was also the first in the SQL Server family to sport generate_series, again because of time series.)
But you can also think of it as a more general version of datetrunc that is able to truncate a date/time value to an interval of any length. date_bucket takes three mandatory parameters: datepart, intervallength and datetimeval. There is also a fourth optional parameter, which I will come back to in the next section. The datepart argument accepts the same set of values that we know from other date/time functions (although there are a few that are not permitted). The function truncates datetimeval to an interval of datepart, and the length of the interval is specified by the second parameter, intervallength. The return value has the same data type as datetimeval.
It may be easier to understand if we look at some examples:
DECLARE @d datetime2(3) = '20230326 23:46:12.123'
SELECT date_bucket(MINUTE, 15, @d) AS "15 minutes",
       date_bucket(HOUR, 6, @d)    AS "Six Hours",
       date_bucket(MONTH, 1, @d)   AS Month
The output:
15 minutes Six Hours Month
--------------------------- --------------------------- ---------------------------
2023-03-26 23:45:00.000 2023-03-26 18:00:00.000 2023-03-01 00:00:00.000
That is, with datepart = MINUTE and an interval length of 15 minutes, we get a return value which is 0, 15, 30 or 45 minutes sharp of the hour. With the parameters HOUR and 6, we get values at midnight, at six o'clock in the morning, at noon, and at six o'clock in the evening. As the last example suggests, if you set the interval length to 1, the result is the same as for the corresponding call to datetrunc.
If you have a log table of some sort, you may want to get a summary of how many entries there are during a certain interval. That could be two hours, 15 minutes or something else. In the NorthNumbers database there is a table ErrorLog, which is good for an exercise like this. (If you look at the data, you will realise that it comes from an SQL Server errorlog.) Say that we want to see how many entries there are for each 15-minute interval, with each interval being represented by the start of the interval. Here is a query using date_bucket for this purpose:
DECLARE @date date = '20230311';
WITH Buckets AS (
    SELECT date_bucket(MINUTE, 15, LogDate) AS IntervalStart
    FROM   dbo.ErrorLog
    WHERE  LogDate >= @date AND LogDate < dateadd(DAY, 1, @date)
)
SELECT IntervalStart, COUNT(*) AS "Count"
FROM   Buckets
GROUP  BY IntervalStart
ORDER  BY IntervalStart;
Here is the start and the end of the result set:
IntervalStart Count
--------------------------- -----------
2023-03-11 00:00:00.000 1
2023-03-11 00:15:00.000 2
2023-03-11 01:00:00.000 20
2023-03-11 01:15:00.000 2
2023-03-11 02:15:00.000 2
...
2023-03-11 21:45:00.000 20
2023-03-11 22:15:00.000 2
2023-03-11 22:45:00.000 20
2023-03-11 23:15:00.000 2
2023-03-11 23:45:00.000 20
Not unexpectedly, the query only returns the intervals for which there are any entries. Say that we want to see a row for all 15-minute intervals for the day. As before, we can use generate_series to produce these extra rows. Here is one solution:
DECLARE @date date = '20230311';
WITH LogBuckets AS (
    SELECT date_bucket(MINUTE, 15, LogDate) AS IntervalStart
    FROM   dbo.ErrorLog
    WHERE  LogDate >= @date AND LogDate < dateadd(DAY, 1, @date)
), AllBuckets AS (
    SELECT dateadd(MINUTE, value, convert(datetime2(0), @date)) AS IntervalStart
    FROM   generate_series(0, 1439, 15)
)
SELECT A.IntervalStart, COUNT(L.IntervalStart) AS "Count"
FROM   AllBuckets A
LEFT   JOIN LogBuckets L ON A.IntervalStart = L.IntervalStart
GROUP  BY A.IntervalStart
ORDER  BY A.IntervalStart;
Here I make use of the third argument to generate_series to produce 0, 15, 30 etc. (If you wonder where 1439 comes from: there are 1440 minutes in a day.) This is the start and the end of the result set:
IntervalStart Count
--------------------------- -----------
2023-03-11 00:00:00 1
2023-03-11 00:15:00 2
2023-03-11 00:30:00 0
2023-03-11 00:45:00 0
2023-03-11 01:00:00 20
...
2023-03-11 22:30:00 0
2023-03-11 22:45:00 20
2023-03-11 23:00:00 0
2023-03-11 23:15:00 2
2023-03-11 23:30:00 0
2023-03-11 23:45:00 20
But come to think of it, couldn't we write this query without date_bucket, like this:
DECLARE @date date = '20230311';
WITH AllBuckets AS (
    SELECT dateadd(MINUTE, value, convert(datetime2(0), @date)) AS IntervalStart
    FROM   generate_series(0, 1439, 15)
)
SELECT A.IntervalStart, COUNT(L.LogDate) AS "Count"
FROM   AllBuckets A
LEFT   JOIN dbo.ErrorLog L ON L.LogDate >= A.IntervalStart
                          AND L.LogDate < dateadd(MINUTE, 15, A.IntervalStart)
GROUP  BY A.IntervalStart
ORDER  BY A.IntervalStart;
Rather than using date_bucket, we create the intervals directly from the output of generate_series and make a left join to the ErrorLog table for each interval. This is certainly shorter, and may therefore be preferable. But which one gives the best performance? The ErrorLog table is too small to make any such investigations meaningful, but it's possible to make some general observations. The first query is more likely to scan the log table for the given date once, whereas the second is more prone to make one search per 15-minute interval. This leads to the expectation that the more intervals your query covers, the costlier the second query will be, whereas the first query will have a more or less constant cost. I like to stress that the actual outcome depends on several things, including available indexes. If you faced this situation in the real world, you would have to run a test to see which strategy works best.
The key takeaway is that both patterns are valid, and you should have both in your toolbox. If you use one of them and run into poor performance, you could try the other one to see if it works better.
Let's close this article by looking at the fourth argument to date_bucket. The documentation calls it origin, but I don't think that is the best of names. I prefer to call it basedate instead. What it does is to set a reference point from which all other buckets are computed. It defaults to midnight 1900-01-01 (as dates often default to in SQL Server). The data type of basedate must be the same as the data type of datetimeval.
In many cases, you don't have to think about it, because date_bucket will do what you want anyway. But let's look at some examples where we need to set the base date.
Say that we have a starting point from which we want to do our log-file analysis, and we want 15-minute intervals counted from that point. But that starting point is not necessarily on 0, 15, 30 or 45 minutes of the hour. The simple solution is to use the starting date as the base date. Say that we want to see statistics for the six hours following that point. Here is a solution (without adding extra rows for intervals without events):
DECLARE @start datetime2(3) = '20230311 13:17';
WITH Buckets AS (
    SELECT date_bucket(MINUTE, 15, LogDate, @start) AS IntervalStart
    FROM   dbo.ErrorLog
    WHERE  LogDate >= @start AND LogDate < dateadd(HOUR, 6, @start)
)
SELECT IntervalStart, COUNT(*) AS "Count"
FROM   Buckets
GROUP  BY IntervalStart
ORDER  BY IntervalStart;
Here is the output:
IntervalStart Count
--------------------------- -----------
2023-03-11 13:17:00.000 2
2023-03-11 13:47:00.000 20
2023-03-11 14:17:00.000 67
2023-03-11 14:47:00.000 20
2023-03-11 15:17:00.000 2
2023-03-11 15:47:00.000 20
2023-03-11 16:17:00.000 2
2023-03-11 16:47:00.000 20
2023-03-11 17:17:00.000 2
2023-03-11 17:47:00.000 20
2023-03-11 18:02:00.000 1
2023-03-11 18:17:00.000 67
2023-03-11 18:47:00.000 20
If you like, you can try with a value of @start that includes seconds and milliseconds. You will find that they will reappear in the IntervalStart column in the output.
Let's go back to our sales statistics. Say that rather than weekly data, we want bi-weekly data, being marked by the Monday of the first week of the period. Above we used datetrunc(ISO_WEEK, OrderDate), so a simple-minded approach is to change datetrunc to date_bucket and set the parameter intervallength to 2:

date_bucket(ISO_WEEK, 2, OrderDate)
But not so fast! First of all, ISO_WEEK is one of those dateparts that are not permitted with date_bucket. The reason for this is that ISO_WEEK does not serve any purpose with date_bucket. You see, while date_bucket is like a generalised version of datetrunc, date_bucket is insensitive to the SET DATEFIRST setting, thanks to the basedate parameter. These queries illustrate this:
DECLARE @date date = '20230326'
SET DATEFIRST 7   -- Sunday
SELECT datename(WEEKDAY, @date)     AS DayOfWeek,
       datetrunc(WEEK, @date)       AS datetrunc,
       date_bucket(WEEK, 1, @date)  AS date_bucket
SET DATEFIRST 1   -- Monday
SELECT datename(WEEKDAY, @date)     AS DayOfWeek,
       datetrunc(WEEK, @date)       AS datetrunc,
       date_bucket(WEEK, 1, @date)  AS date_bucket
The output is this:
DayOfWeek datetrunc date_bucket
------------------- ------------ -----------
Sunday 2023-03-26 2023-03-20
DayOfWeek datetrunc date_bucket
------------------- ------------ -----------
Sunday 2023-03-20 2023-03-20
date_bucket with three parameters always returns the Monday on or before the given date. This is because, very conveniently, the default base date 1 Jan 1900 was a Monday. If you want a different starting weekday, provide a fourth parameter with a date that falls on that weekday, for instance 1900-01-07 for Sunday.
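For example, with 1900-01-07 as the base date, the weekly buckets start on Sundays instead (my own illustration, reusing the Sunday 2023-03-26 from above):

DECLARE @date date = '20230326';   -- a Sunday
SELECT date_bucket(WEEK, 1, @date) AS MondayBased,                            -- 2023-03-20
       date_bucket(WEEK, 1, @date, convert(date, '19000107')) AS SundayBased; -- 2023-03-26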
Now that we know that we should use WEEK with date_bucket, we may try this:
DECLARE @startdate date = '19970101',
        @enddate   date = '19971231';
WITH OrderData AS (
    SELECT date_bucket(WEEK, 2, OrderDate) AS OrderWeek, OD.Amount
    FROM   dbo.Orders O
    JOIN   (SELECT OrderID, SUM(UnitPrice * Quantity) AS Amount
            FROM   dbo.[Order Details] OD
            GROUP  BY OrderID) OD ON O.OrderID = OD.OrderID
    WHERE  O.OrderDate BETWEEN @startdate AND @enddate
)
SELECT OrderWeek, COUNT(*) AS [Count], SUM(Amount) AS Amount
FROM   OrderData
GROUP  BY OrderWeek
ORDER  BY OrderWeek;
But the output is not really to our liking. Here are the first few rows:
OrderWeek Count Amount
---------- ----------- ---------------------
1996-12-23 5 12326.00
1997-01-06 14 24211.30
1997-01-20 14 30155.50
1997-02-03 15 18364.40
The first interval starts at 1996-12-23, but we want the output to start at 1996-12-30. Not only does the date look funny, but we also get a conspicuously low order count, since of the 14 days in the two-week interval, only five fall between @startdate and @enddate. The remedy is to pass a base date to date_bucket. We could hard-code this, but it seems better to use the input parameter @startdate, and derive the starting Monday from this parameter. We can do this with datetrunc. Thus, we get:
DECLARE @startdate date = '19970101',
        @enddate   date = '19971231';
WITH OrderData AS (
    SELECT date_bucket(WEEK, 2, OrderDate, datetrunc(ISO_WEEK, @startdate)) AS OrderWeek,
           OD.Amount
    FROM   dbo.Orders O
    JOIN   (SELECT OrderID, SUM(UnitPrice * Quantity) AS Amount
            FROM   dbo.[Order Details] OD
            GROUP  BY OrderID) OD ON O.OrderID = OD.OrderID
    WHERE  O.OrderDate BETWEEN @startdate AND @enddate
)
SELECT OrderWeek, COUNT(*) AS [Count], SUM(Amount) AS Amount
FROM   OrderData
GROUP  BY OrderWeek
ORDER  BY OrderWeek;
The top rows in the output now look like this:
OrderWeek Count Amount
---------- ----------- ---------------------
1996-12-30 12 19890.30
1997-01-13 14 35068.80
1997-01-27 14 17825.80
1997-02-10 15 22325.70
1997-02-24 14 19394.50
We have now looked at how you can use the function generate_series to solve various problems. While I have given some examples, it is not a complete catalogue, and I am sure that once you have added this function to your toolbox, you will find other uses for generate_series.
We have looked specifically at a number of examples with date and time intervals, and we have learnt a number of ways to express these intervals: with help of the convert, datetrunc and date_bucket functions. We have also seen that in some situations it may be useful to create a table of dates or some other date/time entity. Depending on your needs, a dates table can evolve into a calendar table, and I recommend reading Ed Pollack's article to learn more about this. Or google for "calendar table" to find other sources.
If you have questions or comments directly related to this article, feel free to drop me a mail on esquel@sommarskog.se. I like to stress that you are more than welcome to point out spelling or grammar errors. On the other hand, if you have a specific problem you need help with, I recommend that you ask your question in a public forum for SQL Server, as more people will see your question and you may get help quicker than if you mail an individual.