I have a list of bad practices in SQL Server below. If I am missing something can you please share it? 1) Do not use cursors. Cursor "type" opersations should be done in a middle tier. 2) Do not use temp-tables. Use derived-table or view 3) Too much normalization is not good. 4) Too many joins in quesries is not good.
[quoted text, click to view] >> I have a list of bad practices in SQL Server below.
It would help others to understand your perspective better, if you care to post the reasons behind each points in your "list". -- Anith
[quoted text, click to view] >> I have a list of bad practices in SQL Server below. If I
am missing something can you please share it? << This is funny! I am about 6000 words into my next book on SQL programming style and explain some of these things with a rationale and an exception. Here is some feedback: 2) One problem with temp tables in T-SQL is that they do not follow the model for Standard SQL temp tables, so they will not port easily. But I agree that most of the time programmers use temp tables to hold steps in a process rather than writing a single declarative statement. Why learn RDBMS when you can fake your old skills in the new language? In fact, a lot T-SQL is structured exactly like mag tape file systems -- the old "WORKnnn" tape files are simply replaced by "#temp_nn" tables. 3) I am not sure what "over-normalize" means. If you split up a table beyond a certain point, you destroy the model for entity it was supposed to represent. Properly normalized to AT LEAST 3NF and preferably to 5NF is the minimum. 5) I think you meant DRI actions should replace triggers when possible. --CELKO-- Please post DDL, so that people do not have to guess what the keys, constraints, Declarative Referential Integrity, datatypes, etc. in your schema are. Sample data is also a good idea, along with clear specifications. *** Sent via Developersdex http://www.developersdex.com ***
Well, I didn't want to turn this into a religous debate. You are right, these statements are not true if you are dealing with a mom and pop database. It really doesn't hinder any performance. Even if it did you can always set up an SQL Server cluster using Active/Active. I should refresh my quote. It should be: What are "optimal" practices versus "other" practices in SQL Server. [quoted text, click to view] >-----Original Message----- >I don't think that defining "bad practices" as blanket statements like this >is very useful to anyone. For what it's worth, I almost entirely disagree >with everything you've written! Are these ideas something that you actually >practice today or are you looking for guidance on what good and bad >practices *should* be? > >-- >David Portas >SQL Server MVP >-- > > >.
This is a weird list IMO. I disagree with the "absolute-ness" of 2-5 and my only feeling about cursors is I've never had a compelling reason to turn to using them. Only once I've seen a cursor out perform a set-based solution and that's because I saw Itzik Ben-Gan prove it in a class. The problem with a blanket list is everything should contain proofs and caveats. The "it depends" mantra in SQL Server is true because it really does depend. Eric [quoted text, click to view] <anonymous@discussions.microsoft.com> wrote in message news:166a01c49e0d$26906fe0$a501280a@phx.gbl... >I have a list of bad practices in SQL Server below. If I > am missing something can you please share it? > > 1) Do not use cursors. Cursor "type" opersations should > be done in a middle tier. > > 2) Do not use temp-tables. Use derived-table or view > > 3) Too much normalization is not good. > > 4) Too many joins in quesries is not good. > > 5) Avoid triggers if you can. Use constraints
I don't think that defining "bad practices" as blanket statements like this is very useful to anyone. For what it's worth, I almost entirely disagree with everything you've written! Are these ideas something that you actually practice today or are you looking for guidance on what good and bad practices *should* be? -- David Portas SQL Server MVP --
I agree. That is more appropriate [quoted text, click to view] >-----Original Message----- > A better list would be good practices, > starting with "know why you're using > a technique and what the advantages > and disadvantages are in > your particular environment." > Jeff
See http://www.nigelrivett.net/BadThings.html Especially the paragraph at the top. Think I disagree with all your statements but particularly 2 which in some cases is just wrong. [quoted text, click to view] "anonymous@discussions.microsoft.com" wrote: > I have a list of bad practices in SQL Server below. If I > am missing something can you please share it? > > 1) Do not use cursors. Cursor "type" opersations should > be done in a middle tier. > > 2) Do not use temp-tables. Use derived-table or view > > 3) Too much normalization is not good. > > 4) Too many joins in quesries is not good. > > 5) Avoid triggers if you can. Use constraints
Cursors implemented on the client side can still hold locking transactions. Views are not a replacement for temporary tables. [quoted text, click to view] <anonymous@discussions.microsoft.com> wrote in message news:166a01c49e0d$26906fe0$a501280a@phx.gbl... > I have a list of bad practices in SQL Server below. If I > am missing something can you please share it? > > 1) Do not use cursors. Cursor "type" opersations should > be done in a middle tier. > > 2) Do not use temp-tables. Use derived-table or view > > 3) Too much normalization is not good. > > 4) Too many joins in quesries is not good. > > 5) Avoid triggers if you can. Use constraints
lol. 1st and 3rd are the same really. The whole point is so that you can change the database structure without changing the application - often for performance, as long as the SPs have the same interface then application will never know anything has happened. I do it quite often. Saying that it will never happen probably means that you have bound the application to the database structure. That sort of thing usually means that everything is designed by an application programmer which is what the warning is really about. Just because microsoft does something doesn't mean it's a good idea - look at the system SPs for previous versions of sql server. 2nd one So you're using SP to genrate the change scripts which is fine - the problem is when people use it to make changes without knowing what it is really going to do and don't save any record of what it has done. Read the first paragraph again. These are just things to think about and not necessarily bad in all cases. [quoted text, click to view] "Richard Quinn" wrote: > On Sun, 19 Sep 2004 10:37:03 -0700, Nigel Rivett <sqlnr@hotmail.com> > wrote: > > >See > > http://www.nigelrivett.net/BadThings.html > >Especially the paragraph at the top. > > > > Let's see... > > > Stored procedures to access tables rather than business entities. > > I disagree. Exactly why do you consider this as bad? AFAIK, this > practice is a cornerstone of Microsoft's DNA architecture, FWIW. > > > Use of enterprise manager for data or schema updates. > > I disagree. I always store the changes to a file and then put that in > VSS or add it to my Vis.Studio "Change Scripts" folder. > > > Application bound to the database structure (application should have no knowledge of tables). > > I disagree (mostly). Why do you consider this as bad? > If my application knows nothing about the database (at any layer), > then why am I using a relational database in the first place? I have > yet to see an example of a "database application" which will continue > to work if somebody changes the schema behind its back. > > > - Richard > > --- > Richard Quinn > MCSD, MCDBA firstname.lastname@ieee.org
On Sat, 18 Sep 2004 22:54:39 -0700, [quoted text, click to view] <anonymous@discussions.microsoft.com> wrote: >I have a list of bad practices in SQL Server below. If I >am missing something can you please share it? > >1) Do not use cursors. Cursor "type" opersations should >be done in a middle tier. > >2) Do not use temp-tables. Use derived-table or view > >3) Too much normalization is not good. > >4) Too many joins in quesries is not good. > >5) Avoid triggers if you can. Use constraints
The problem is, all these "bad" practices have valid reasons for use. Kind of like the "Thou shalt not kill" thing getting in the way when someone is intent on taking your own life. A better list would be good practices, starting with "know why you're using a technique and what the advantages and disadvantages are in your particular environment."
[quoted text, click to view] > It really doesn't > hinder any performance. Even if it did you can always > set up an SQL Server cluster using Active/Active.
I must be missing the point here. Clusters have absolutely nothing to do with performance. They are for hardware failover. -- Andrew J. Kelly SQL MVP [quoted text, click to view] <anonymous@discussions.microsoft.com> wrote in message news:30ef01c49e52$47d6c1a0$a401280a@phx.gbl... > Well, I didn't want to turn this into a religous debate. > You are right, these statements are not true if you are > dealing with a mom and pop database. It really doesn't > hinder any performance. Even if it did you can always > set up an SQL Server cluster using Active/Active. I > should refresh my quote. It should be: > > What are "optimal" practices versus "other" practices in > SQL Server. > > >-----Original Message----- > >I don't think that defining "bad practices" as blanket > statements like this > >is very useful to anyone. For what it's worth, I almost > entirely disagree > >with everything you've written! Are these ideas > something that you actually > >practice today or are you looking for guidance on what > good and bad > >practices *should* be? > > > >-- > >David Portas > >SQL Server MVP > >-- > > > > > >. > >
[quoted text, click to view] >> Only once I've seen a cursor out perform a set-based solution and
that's because I saw Itzik Ben-Gan prove it in a class. << I have written five cursors in my career and I know that if we had a CASE expression in the old days, I could have avoided at least three of them. The other two were NP complete things (a version of knapsack and traveling salesman problems) where we wanted to find any answer under a certain cost. The relational solution would find ALL the answers -- sometime after we were all dead. Do you happen to have that example and is it generic enough to use for Standard SQL classes? --CELKO-- Please post DDL, so that people do not have to guess what the keys, constraints, Declarative Referential Integrity, datatypes, etc. in your schema are. Sample data is also a good idea, along with clear specifications. *** Sent via Developersdex http://www.developersdex.com ***
Odd - I'm at a place now where someone implemented a cluster to improve performance then used one half for something else when it didn't. Still have half a cluster. Nigel Rivett www.nigelrivett.net *** Sent via Developersdex http://www.developersdex.com ***
[quoted text, click to view] > 1) Do not use cursors. Cursor "type" opersations should > be done in a middle tier.
Cursor operations should be avoided altogether, which is possible with set-based SQL in at least 99% of cases no matter what the size of your data. Writing cursor-like data manipulation operations in the middle tier is unlikely to scale well. [quoted text, click to view] > 3) Too much normalization is not good.
This statement is pretty meaningless because you don't say how much is too much. Third Normal Form should be required for any relational design. De-normalization below 3NF is a mistake every time in my experience. [quoted text, click to view] > 4) Too many joins in quesries is not good.
Why? Joins in a normalized database can improve performance by reducing the number of physical reads required. Your statement implies that you have a choice in the number of joins you might implement in a query. Can you give an example of this? I'm not sure I understand what kind of scenario you have in mind. [quoted text, click to view] > 5) Avoid triggers if you can. Use constraints
Triggers and Constraints are different solutions to different problems. DRI should usualy be implemented through constraints where possible. I do agree that minimizing the use of triggers is wise; the obvious alternative is to implement DML logic in SPs. -- David Portas SQL Server MVP --
On Sun, 19 Sep 2004 10:37:03 -0700, Nigel Rivett <sqlnr@hotmail.com> [quoted text, click to view] wrote: >See > http://www.nigelrivett.net/BadThings.html >Especially the paragraph at the top. > Let's see... [quoted text, click to view] > Stored procedures to access tables rather than business entities.
I disagree. Exactly why do you consider this as bad? AFAIK, this practice is a cornerstone of Microsoft's DNA architecture, FWIW. [quoted text, click to view] > Use of enterprise manager for data or schema updates.
I disagree. I always store the changes to a file and then put that in VSS or add it to my Vis.Studio "Change Scripts" folder. [quoted text, click to view] > Application bound to the database structure (application should have no knowledge of tables).
I disagree (mostly). Why do you consider this as bad? If my application knows nothing about the database (at any layer), then why am I using a relational database in the first place? I have yet to see an example of a "database application" which will continue to work if somebody changes the schema behind its back. - Richard --- Richard Quinn
I think that Eric refers to the following problem: http://www.winnetmag.com/Article/ArticleID/37636/37636.html If you can't access the complete problem, here goes: --------------------------------------------------------------------- -- Concurrent time intervals --------------------------------------------------------------------- CREATE TABLE Sessions ( key_col INT NOT NULL PRIMARY KEY, app VARCHAR(10) NOT NULL, usr VARCHAR(10) NOT NULL, host VARCHAR(10) NOT NULL, starttime SMALLDATETIME NOT NULL, endtime SMALLDATETIME NOT NULL, CHECK(endtime > starttime) ) INSERT INTO Sessions VALUES(1, 'app1', 'user1', 'host1', '20030212 08:30', '20030212 10:30') INSERT INTO Sessions VALUES(2, 'app1', 'user2', 'host1', '20030212 08:30', '20030212 08:45') INSERT INTO Sessions VALUES(3, 'app1', 'user3', 'host2', '20030212 09:00', '20030212 09:30') INSERT INTO Sessions VALUES(4, 'app1', 'user4', 'host2', '20030212 09:15', '20030212 10:30') INSERT INTO Sessions VALUES(5, 'app1', 'user5', 'host3', '20030212 09:15', '20030212 09:30') INSERT INTO Sessions VALUES(6, 'app1', 'user6', 'host3', '20030212 10:30', '20030212 14:30') INSERT INTO Sessions VALUES(7, 'app1', 'user7', 'host4', '20030212 10:45', '20030212 11:30') INSERT INTO Sessions VALUES(8, 'app1', 'user8', 'host4', '20030212 11:00', '20030212 12:30') INSERT INTO Sessions VALUES(9, 'app2', 'user8', 'host1', '20030212 08:30', '20030212 08:45') INSERT INTO Sessions VALUES(10, 'app2', 'user7', 'host1', '20030212 09:00', '20030212 09:30') INSERT INTO Sessions VALUES(11, 'app2', 'user6', 'host2', '20030212 11:45', '20030212 12:00') INSERT INTO Sessions VALUES(12, 'app2', 'user5', 'host2', '20030212 12:30', '20030212 14:00') INSERT INTO Sessions VALUES(13, 'app2', 'user4', 'host3', '20030212 12:45', '20030212 13:30') INSERT INTO Sessions VALUES(14, 'app2', 'user3', 'host3', '20030212 13:00', '20030212 14:00') INSERT INTO Sessions VALUES(15, 'app2', 'user2', 'host4', '20030212 14:00', '20030212 16:30') INSERT INTO Sessions VALUES(16, 'app2', 'user1', 'host4', '20030212 15:30', '20030212 17:00') -- Calculate the maximum number of concurrent sessions for each user,application -- The datetime ranges provided include starttime and exclude datetime -- i.e., the session was active >= starttime < endtime -- Desired results: app mx ---------- ----------- app1 4 app2 3 --------------------------------------------------------------------- -- Concurrent time intervals --------------------------------------------------------------------- Here's one (of several) pure set-based solution: -- Set-based solution SELECT app, MAX(concurrent) AS mx FROM (SELECT app, starttime, (SELECT COUNT(*) FROM Sessions AS S2 WHERE S1.app = S2.app AND S1.starttime >= S2.starttime AND S1.starttime < S2.endtime) AS concurrent FROM Sessions AS S1) AS C GROUP BY app The cursor-based solution is a cursor based on a query that marks events as increasing or decreasing the count of concurrent sessions. I haven't found a better performing pure set-based solution but I'd be more than glad if I someone did (test against larger tables, 10000, 100000, 1000000 rows). -- Iterative solution DECLARE @app AS varchar(10), @prevapp AS varchar (10), @ts AS datetime, @type AS int, @concurrent AS int, @mx AS int DECLARE @AppsMx TABLE ( app varchar (10) NOT NULL PRIMARY KEY, mx int NOT NULL ) DECLARE sessions_cur CURSOR FAST_FORWARD FOR SELECT app, starttime AS ts, 1 AS type FROM Sessions UNION ALL SELECT app, endtime, -1 FROM Sessions ORDER BY app, ts, type OPEN sessions_cur FETCH NEXT FROM sessions_cur INTO @app, @ts, @type SET @prevapp = @app SET @concurrent = 0 SET @mx = 0 WHILE @@FETCH_STATUS = 0 BEGIN IF @app <> @prevapp BEGIN INSERT INTO @AppsMx VALUES(@prevapp, @mx) SET @concurrent = 0 SET @mx = 0 SET @prevapp = @app END SET @concurrent = @concurrent + @type IF @concurrent > @mx SET @mx = @concurrent FETCH NEXT FROM sessions_cur INTO @app, @ts, @type END IF @prevapp IS NOT NULL INSERT INTO @AppsMx VALUES(@prevapp, @mx) CLOSE sessions_cur DEALLOCATE sessions_cur SELECT * FROM @AppsMx This problem is one of two examples I show in the course where a cursor outperforms a pure set-based query. The second problem has to do with calculating, dare I say, "row numbers" ;-) based on some desired sort criteria. e.g., a simple scenario of calculating row numbers for orders based on orderid sort: USE Northwind SELECT orderid, (SELECT COUNT(*) FROM Orders AS O2 WHERE O2.orderid <= O1.orderid) AS rownum FROM Orders As O1 These problems where a cursor outperforms a set-based solution are typically characterized by an exponential performance degradation as the table grows larger. e.g., for the latter, the optimizer counts the number of rows that have an orderid which is <= the outer orderid, so even with an index you get (n+n^2)/2 rows scanned. The cursor performs better in such a case because the data is scanned only once. So the overhead of the curser is negligible compared to the overhead of the multiple scans. As an aside, this was addressed in SQL Server 2005 by providing the ANSI-SQL 99 ROW_NUMBER() function: SELECT orderid, ROW_NUMBER() OVER(ORDER BY orderid) AS rownum FROM Orders ROW_NUMBER() scans the data only once. It has allows many alternate very efficient solutions to existing problems that had slow set-based solutions. -- BG, SQL Server MVP www.SolidQualityLearning.com [quoted text, click to view] "Joe Celko" <jcelko212@earthlink.net> wrote in message news:OUusbFrnEHA.2096@TK2MSFTNGP15.phx.gbl... >>> Only once I've seen a cursor out perform a set-based solution and > that's because I saw Itzik Ben-Gan prove it in a class. << > > I have written five cursors in my career and I know that if we had a > CASE expression in the old days, I could have avoided at least three of > them. The other two were NP complete things (a version of knapsack and > traveling salesman problems) where we wanted to find any answer under a > certain cost. The relational solution would find ALL the answers -- > sometime after we were all dead. > > Do you happen to have that example and is it generic enough to use for > Standard SQL classes? > > --CELKO-- > Please post DDL, so that people do not have to guess what the keys, > constraints, Declarative Referential Integrity, datatypes, etc. in your > schema are. Sample data is also a good idea, along with clear > specifications. > > > *** Sent via Developersdex http://www.developersdex.com *** > Don't just participate in USENET...get rewarded for it!
"David Portas" <REMOVE_BEFORE_REPLYING_dportas@acm.org> wrote in news:66edncmusabM3dDcRVn-jQ@giganews.com: [quoted text, click to view] > I don't think that defining "bad practices" as blanket statements like > this is very useful to anyone. For what it's worth, I almost entirely > disagree with everything you've written!
Very true... There are times when one must use these bad practices because it's the best way to do things. -- Lucas Tam (REMOVEnntp@rogers.com) Please delete "REMOVE" from the e-mail address when replying.
Hi Anonymous, On Sat, 18 Sep 2004 22:54:39 -0700, <anonymous@discussions.microsoft.com> [quoted text, click to view] wrote: >I have a list of bad practices in SQL Server below. If I >am missing something can you please share it? > >1) Do not use cursors. Cursor "type" opersations should >be done in a middle tier.
I do think that cursors should be avoided as much as possible. But the same holds true for blanket statements. [quoted text, click to view] >2) Do not use temp-tables. Use derived-table or view
Depends on what you want to do. Not everything can be done with derived tables or views. And even when it can be done, there are cases where using a temp table gives better performance. To prove my point, I'll show you three different queries that will find any couple of authors that have the exact same earnings (based on sales_ytd, price and royalty percentages), but only if the earnings are between $4,000 and $4,500. (Note - there might be better ways to solve this query; that's not my point. My point is that the optimizer will not choose the best execution plan if the same view or derived table is used multiple times in the query). Try the below queries. Compare the difference from the set statistics io output. Check the execution plans as well. You'll see that the solutions with view and derived table actually construct the contents of the derived table (or view) twice, making execution more expensive than the version with the temp table. (And the last version can actually be optimized even further by adding a primary key constraint and an extra index - commented out in the code below). --Solution #1, using a view: create view earnings as select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper / 100) as earnings from authors a join titleauthor ta on ta.au_id = a.au_id join titles t on t.title_id = ta.title_id group by a.au_id having sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper / 100) between 4000 and 4500 go set statistics io on select e1.au_id, e2.au_id, e1.earnings from earnings AS e1 join earnings AS e2 on e1.earnings = e2.earnings and e1.au_id < e2.au_id drop view earnings set statistics io off -- Solution #2, using a derived table: set statistics io on select e1.au_id, e2.au_id, e1.earnings from (select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper / 100) AS earnings from authors a join titleauthor ta on ta.au_id = a.au_id join titles t on t.title_id = ta.title_id group by a.au_id having sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper / 100) between 4000 and 4500) AS e1 join (select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper / 100) AS earnings from authors a join titleauthor ta on ta.au_id = a.au_id join titles t on t.title_id = ta.title_id group by a.au_id having sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper / 100) between 4000 and 4500) AS e2 on e1.earnings = e2.earnings and e1.au_id < e2.au_id set statistics io off -- Solution #3, using a temp table: set statistics io on create table #tmp (au_id varchar(11) not null ,earnings money not null -- ,primary key (au_id) ) --create index ixtmp on #tmp(earnings) insert #tmp (au_id, earnings) select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper / 100) from authors a join titleauthor ta on ta.au_id = a.au_id join titles t on t.title_id = ta.title_id group by a.au_id having sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper / 100) between 4000 and 4500 select e1.au_id, e2.au_id, e1.earnings from #tmp AS e1 join #tmp AS e2 on e1.earnings = e2.earnings and e1.au_id < e2.au_id drop table #tmp set statistics io off [quoted text, click to view] >3) Too much normalization is not good.
I disagree. In OLTP databases, the design should be normalized down to fifth normal form, with only very few exception (and these should be documented, as they'll probably prove to be the future bottleneck anyway). Requirements for OLAP databases are different though. That might be a place where carefull denormalization has it's place. But denormalization should (as the name implies) follow normalization - stopping before you have a completely normalized design won't yield a "denormalized" design - you'll end up with just another "unnormalized" design. [quoted text, click to view] >4) Too many joins in quesries is not good.
Define "too many". If you mean: don't join in tables that you don't need in the query, then I would aagree. If you mean: no more than (fill in your arbitrary number) of joins, then this is nonsense. If the query needs data from many tables, you'll just have to get them all into the query, or you'll never get the desired results. (Unless of course you use temp tables for intermediate results, but in THIS case, that WOULD be bad). [quoted text, click to view] >5) Avoid triggers if you can. Use constraints
Use constraints and DRI if they can achieve what you need. Use triggers if you need business logic that constraints and DRI don't offer. Best, Hugo --
[quoted text, click to view] >> The execution time for the set-based SQL was 0.606 seconds on my test
system; the cursor-based SQL required 5.483 seconds. << Another victory for non-procedural code! I was thinking about building a table of points in time, like a more detailed calendar table, then using BETWEEN predicates to count how many sessions each of those points falls inside of. It gives you control of grandularity and you can ask other questions with it. But if all we needed was an index ... --CELKO-- Please post DDL, so that people do not have to guess what the keys, constraints, Declarative Referential Integrity, datatypes, etc. in your schema are. Sample data is also a good idea, along with clear specifications. *** Sent via Developersdex http://www.developersdex.com ***
Thanks for posting that Itzik. Also, thanks for the 2005 - Row_Number() information. [quoted text, click to view] Itzik Ben-Gan wrote: > I think that Eric refers to the following problem: > > http://www.winnetmag.com/Article/ArticleID/37636/37636.html > > If you can't access the complete problem, here goes: > > --------------------------------------------------------------------- > -- Concurrent time intervals > --------------------------------------------------------------------- > CREATE TABLE Sessions > ( > key_col INT NOT NULL PRIMARY KEY, > app VARCHAR(10) NOT NULL, > usr VARCHAR(10) NOT NULL, > host VARCHAR(10) NOT NULL, > starttime SMALLDATETIME NOT NULL, > endtime SMALLDATETIME NOT NULL, > CHECK(endtime > starttime) > ) > > INSERT INTO Sessions VALUES(1, 'app1', 'user1', 'host1', '20030212 > 08:30', '20030212 10:30') > INSERT INTO Sessions VALUES(2, 'app1', 'user2', 'host1', '20030212 > 08:30', '20030212 08:45') > INSERT INTO Sessions VALUES(3, 'app1', 'user3', 'host2', '20030212 > 09:00', '20030212 09:30') > INSERT INTO Sessions VALUES(4, 'app1', 'user4', 'host2', '20030212 > 09:15', '20030212 10:30') > INSERT INTO Sessions VALUES(5, 'app1', 'user5', 'host3', '20030212 > 09:15', '20030212 09:30') > INSERT INTO Sessions VALUES(6, 'app1', 'user6', 'host3', '20030212 > 10:30', '20030212 14:30') > INSERT INTO Sessions VALUES(7, 'app1', 'user7', 'host4', '20030212 > 10:45', '20030212 11:30') > INSERT INTO Sessions VALUES(8, 'app1', 'user8', 'host4', '20030212 > 11:00', '20030212 12:30') > INSERT INTO Sessions VALUES(9, 'app2', 'user8', 'host1', '20030212 > 08:30', '20030212 08:45') > INSERT INTO Sessions VALUES(10, 'app2', 'user7', 'host1', '20030212 > 09:00', '20030212 09:30') > INSERT INTO Sessions VALUES(11, 'app2', 'user6', 'host2', '20030212 > 11:45', '20030212 12:00') > INSERT INTO Sessions VALUES(12, 'app2', 'user5', 'host2', '20030212 > 12:30', '20030212 14:00') > INSERT INTO Sessions VALUES(13, 'app2', 'user4', 'host3', '20030212 > 12:45', '20030212 13:30') > INSERT INTO Sessions VALUES(14, 'app2', 'user3', 'host3', '20030212 > 13:00', '20030212 14:00') > INSERT INTO Sessions VALUES(15, 'app2', 'user2', 'host4', '20030212 > 14:00', '20030212 16:30') > INSERT INTO Sessions VALUES(16, 'app2', 'user1', 'host4', '20030212 > 15:30', '20030212 17:00') > > -- Calculate the maximum number of concurrent sessions for each > user,application > -- The datetime ranges provided include starttime and exclude datetime > -- i.e., the session was active >= starttime < endtime > -- Desired results: > app mx > ---------- ----------- > app1 4 > app2 3 > > --------------------------------------------------------------------- > -- Concurrent time intervals > --------------------------------------------------------------------- > Here's one (of several) pure set-based solution: > > -- Set-based solution > SELECT app, MAX(concurrent) AS mx > FROM (SELECT app, starttime, > (SELECT COUNT(*) > FROM Sessions AS S2 > WHERE S1.app = S2.app > AND S1.starttime >= S2.starttime > AND S1.starttime < S2.endtime) AS concurrent > FROM Sessions AS S1) AS C > GROUP BY app > > The cursor-based solution is a cursor based on a query that marks > events as increasing or decreasing the count of concurrent sessions. > I haven't found a better performing pure set-based solution but I'd > be more than glad if I someone did (test against larger tables, > 10000, 100000, 1000000 rows). > > -- Iterative solution > DECLARE > @app AS varchar(10), > @prevapp AS varchar (10), > @ts AS datetime, > @type AS int, > @concurrent AS int, > @mx AS int > > DECLARE @AppsMx TABLE > ( > app varchar (10) NOT NULL PRIMARY KEY, > mx int NOT NULL > ) > > DECLARE sessions_cur CURSOR FAST_FORWARD FOR > SELECT app, starttime AS ts, 1 AS type > FROM Sessions > > UNION ALL > > SELECT app, endtime, -1 > FROM Sessions > > ORDER BY app, ts, type > > OPEN sessions_cur > > FETCH NEXT FROM sessions_cur > INTO @app, @ts, @type > > SET @prevapp = @app > SET @concurrent = 0 > SET @mx = 0 > > WHILE @@FETCH_STATUS = 0 > BEGIN > IF @app <> @prevapp > BEGIN > INSERT INTO @AppsMx VALUES(@prevapp, @mx) > SET @concurrent = 0 > SET @mx = 0 > SET @prevapp = @app > END > > SET @concurrent = @concurrent + @type > IF @concurrent > @mx SET @mx = @concurrent > > FETCH NEXT FROM sessions_cur > INTO @app, @ts, @type > END > > IF @prevapp IS NOT NULL > INSERT INTO @AppsMx VALUES(@prevapp, @mx) > > CLOSE sessions_cur > > DEALLOCATE sessions_cur > > SELECT * FROM @AppsMx > > This problem is one of two examples I show in the course where a > cursor outperforms a pure set-based query. The second problem has to > do with calculating, dare I say, "row numbers" ;-) based on some > desired sort criteria. e.g., a simple scenario of calculating row > numbers for orders based on orderid sort: > > USE Northwind > > SELECT orderid, > (SELECT COUNT(*) > FROM Orders AS O2 > WHERE O2.orderid <= O1.orderid) AS rownum > FROM Orders As O1 > > These problems where a cursor outperforms a set-based solution are > typically characterized by an exponential performance degradation as > the table grows larger. e.g., for the latter, the optimizer counts > the number of rows that have an orderid which is <= the outer > orderid, so even with an index you get (n+n^2)/2 rows scanned. The > cursor performs better in such a case because the data is scanned > only once. So the overhead of the curser is negligible compared to > the overhead of the multiple scans. As an aside, this was addressed in SQL > Server 2005 by providing the > ANSI-SQL 99 ROW_NUMBER() function: > > SELECT orderid, ROW_NUMBER() OVER(ORDER BY orderid) AS rownum > FROM Orders > > ROW_NUMBER() scans the data only once. It has allows many alternate > very efficient solutions to existing problems that had slow set-based > solutions. > > "Joe Celko" <jcelko212@earthlink.net> wrote in message > news:OUusbFrnEHA.2096@TK2MSFTNGP15.phx.gbl... >>>> Only once I've seen a cursor out perform a set-based solution and >> that's because I saw Itzik Ben-Gan prove it in a class. << >> >> I have written five cursors in my career and I know that if we had a >> CASE expression in the old days, I could have avoided at least three >> of them. The other two were NP complete things (a version of >> knapsack and traveling salesman problems) where we wanted to find >> any answer under a certain cost. The relational solution would find >> ALL the answers -- sometime after we were all dead. >> >> Do you happen to have that example and is it generic enough to use >> for Standard SQL classes? >> >> --CELKO-- >> Please post DDL, so that people do not have to guess what the keys,
[quoted text, click to view] >> I guess you focussed too much on the SQL and forgot to look at indexes.
I was about to comment on Itzik's choice of indexes before I saw your post and it is to the point. Even otherwise, there may be several examples where we find cursors in certain special cases perform better than corresponding set-based solutions. However these "out-performances" are simply attributions of the physical model which vary from OS to OS, hardware to hardware, DBMS to DBMS, underlying dataset variations, indexes involved & lots of factors, which makes any generic performance claims suspect. In general, I wouldn't give much encouragement to loop-based solutions, since it is a deviation from the functional programming approach which is desirable in data management. -- Anith
Well, I'll be damned! :-) An index on app, starttime of course seems like a natural choice here. I tested against larger tables, and both solutions seem to have fairly linear performance degradation. I'll go to my original benchmarks and see what went wrong there. The set-based solution did logically seem to have the potential for exponential performance degradation... Here's the results I got from the new benchmark with the index: #rows set-based cursor-based ---------- ------------- --------------- 16 0 0.01 16,000 0.08 1.463 160,000 0.96 12.68 1,600,000 10.423 136.686 16,000,000 130.503 still running... I used the following code to populate the test tables. SELECT IDENTITY(INT, 1, 1) AS key_col, app, usr, host, starttime, endtime INTO Sessions16k FROM Sessions, Nums WHERE n <= 1000 -- also 10000, 100000, 1000000 CREATE CLUSTERED INDEX ImproveIt ON Sessions16K (app, starttime) ALTER TABLE Sessions16k ADD PRIMARY KEY NONCLUSTERED(key_col) Remind me to get you a beer when we meet next time. Mind if I shared this with SQLMag's readers? Now that this problem was solved, maybe someone can figure out a way to solve the row-numbers problem with a faster set-based solution? Against a table with 100K rows (with an index on orderid), I got 26 seconds with a cursor based solution, and about half an hour for the set-based solution. -- BG, SQL Server MVP www.SolidQualityLearning.com [quoted text, click to view] "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message news:5mgtk0tv1bssjkii6231v68p7oqq893mjo@4ax.com... > On Sun, 19 Sep 2004 23:21:04 -0500, Itzik Ben-Gan wrote: > > >I haven't found a better performing pure set-based solution but I'd be more > >than glad if I someone did (test against larger tables, 10000, 100000, > >1000000 rows). > > Hi Itzik, > > I guess you focussed too much on the SQL and forgot to look at indexes. > > CREATE CLUSTERED INDEX ImproveIt ON Sessions (app, starttime) > > I added this line to your creation script, then ran the following lines > several times until I had 65536 rows in the table: > > INSERT Sessions (key_col, app, usr, host, starttime, endtime) > SELECT key_col + (SELECT MAX(key_col) FROM Sessions), app, usr, host, > starttime, endtime > FROM Sessions > > INSERT Sessions (key_col, app, usr, host, starttime, endtime) > SELECT key_col + (SELECT MAX(key_col) FROM Sessions), > app + 'X', usr, host, starttime, endtime > FROM Sessions > SELECT COUNT(*) FROM Sessions > > With this bloated table, I ran your own setbased SQL and your own > cursorbased SQL, both after running DBCC DROPCLEANBUFFERS and DBCC > FREEPROCCACHE, of course. Both produced the same output: > > app mx > ---------- ----------- > app1 256 > app1X 1536 > app1XX 3840 > app1XXX 5120 > app1XXXX 3840 > app1XXXXX 1536 > app1XXXXXX 256 > app2 192 > app2X 1152 > app2XX 2880 > app2XXX 3840 > app2XXXX 2880 > app2XXXXX 1152 > app2XXXXXX 192 > > The execution time for the setbased SQL was 0.606 seconds on my test > system; the cursorbased SQL required 5.483 seconds. > > I don't have the time available right now to test against even larger > tables. > > Best, Hugo > -- > > (Remove _NO_ and _SPAM_ to get my e-mail address)
[quoted text, click to view] On Sun, 19 Sep 2004 23:21:04 -0500, Itzik Ben-Gan wrote: >I haven't found a better performing pure set-based solution but I'd be more >than glad if I someone did (test against larger tables, 10000, 100000, >1000000 rows).
Hi Itzik, I guess you focussed too much on the SQL and forgot to look at indexes. CREATE CLUSTERED INDEX ImproveIt ON Sessions (app, starttime) I added this line to your creation script, then ran the following lines several times until I had 65536 rows in the table: INSERT Sessions (key_col, app, usr, host, starttime, endtime) SELECT key_col + (SELECT MAX(key_col) FROM Sessions), app, usr, host, starttime, endtime FROM Sessions INSERT Sessions (key_col, app, usr, host, starttime, endtime) SELECT key_col + (SELECT MAX(key_col) FROM Sessions), app + 'X', usr, host, starttime, endtime FROM Sessions SELECT COUNT(*) FROM Sessions With this bloated table, I ran your own setbased SQL and your own cursorbased SQL, both after running DBCC DROPCLEANBUFFERS and DBCC FREEPROCCACHE, of course. Both produced the same output: app mx ---------- ----------- app1 256 app1X 1536 app1XX 3840 app1XXX 5120 app1XXXX 3840 app1XXXXX 1536 app1XXXXXX 256 app2 192 app2X 1152 app2XX 2880 app2XXX 3840 app2XXXX 2880 app2XXXXX 1152 app2XXXXXX 192 The execution time for the setbased SQL was 0.606 seconds on my test system; the cursorbased SQL required 5.483 seconds. I don't have the time available right now to test against even larger tables. Best, Hugo --
[quoted text, click to view] On Mon, 20 Sep 2004 14:17:14 +0200, Hugo Kornelis wrote: >CREATE CLUSTERED INDEX ImproveIt ON Sessions (app, starttime) > >I added this line to your creation script
And I added the NONCLUSTERED option on the primary key constraint, of course. Apologies for leaving this out of my post. Best, Hugo --
Re the row-numbers problem, solutions with IDENTITY are not valid because SQL Server 2000 doesn't guarantee that identity values would be generated according to the sort specified in an ORDER BY clause. Identity values sometimes are generated prior to the sort, and there's no way to force SQL Server to assign them otherwise. -- BG, SQL Server MVP www.SolidQualityLearning.com [quoted text, click to view] "Itzik Ben-Gan" <itzik@REMOVETHIS.SolidQualityLearning.com> wrote in message news:uq9Bx6znEHA.3900@TK2MSFTNGP10.phx.gbl... > Well, I'll be damned! :-) > An index on app, starttime of course seems like a natural choice here. > > I tested against larger tables, and both solutions seem to have fairly > linear performance degradation. > I'll go to my original benchmarks and see what went wrong there. The > set-based solution did logically seem to have the potential for exponential > performance degradation... > Here's the results I got from the new benchmark with the index: > > #rows set-based cursor-based > ---------- ------------- --------------- > 16 0 0.01 > 16,000 0.08 1.463 > 160,000 0.96 12.68 > 1,600,000 10.423 136.686 > 16,000,000 130.503 still running... > > I used the following code to populate the test tables. > > SELECT IDENTITY(INT, 1, 1) AS key_col, app, usr, host, starttime, endtime > INTO Sessions16k > FROM Sessions, Nums > WHERE n <= 1000 -- also 10000, 100000, 1000000 > > CREATE CLUSTERED INDEX ImproveIt ON Sessions16K (app, starttime) > ALTER TABLE Sessions16k ADD PRIMARY KEY NONCLUSTERED(key_col) > > Remind me to get you a beer when we meet next time. > Mind if I shared this with SQLMag's readers? > > Now that this problem was solved, maybe someone can figure out a way to > solve the row-numbers problem with a faster set-based solution? > Against a table with 100K rows (with an index on orderid), I got 26 seconds > with a cursor based solution, and about half an hour for the set-based > solution. > > -- > BG, SQL Server MVP > www.SolidQualityLearning.com > > > "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message > news:5mgtk0tv1bssjkii6231v68p7oqq893mjo@4ax.com... > > On Sun, 19 Sep 2004 23:21:04 -0500, Itzik Ben-Gan wrote: > > > > >I haven't found a better performing pure set-based solution but I'd be > more > > >than glad if I someone did (test against larger tables, 10000, 100000, > > >1000000 rows). > > > > Hi Itzik, > > > > I guess you focussed too much on the SQL and forgot to look at indexes. > > > > CREATE CLUSTERED INDEX ImproveIt ON Sessions (app, starttime) > > > > I added this line to your creation script, then ran the following lines > > several times until I had 65536 rows in the table: > > > > INSERT Sessions (key_col, app, usr, host, starttime, endtime) > > SELECT key_col + (SELECT MAX(key_col) FROM Sessions), app, usr, host, > > starttime, endtime > > FROM Sessions > > > > INSERT Sessions (key_col, app, usr, host, starttime, endtime) > > SELECT key_col + (SELECT MAX(key_col) FROM Sessions), > > app + 'X', usr, host, starttime, endtime > > FROM Sessions > > SELECT COUNT(*) FROM Sessions > > > > With this bloated table, I ran your own setbased SQL and your own > > cursorbased SQL, both after running DBCC DROPCLEANBUFFERS and DBCC > > FREEPROCCACHE, of course. Both produced the same output: > > > > app mx > > ---------- ----------- > > app1 256 > > app1X 1536 > > app1XX 3840 > > app1XXX 5120 > > app1XXXX 3840 > > app1XXXXX 1536 > > app1XXXXXX 256 > > app2 192 > > app2X 1152 > > app2XX 2880 > > app2XXX 3840 > > app2XXXX 2880 > > app2XXXXX 1152 > > app2XXXXXX 192 > > > > The execution time for the setbased SQL was 0.606 seconds on my test > > system; the cursorbased SQL required 5.483 seconds. > > > > I don't have the time available right now to test against even larger > > tables. > > > > Best, Hugo > > -- > > > > (Remove _NO_ and _SPAM_ to get my e-mail address) > >
Very interesting... and makes sense when thinking about the exponential growth in the number of rows scanned. The hash table probably holds a significantly smaller number of elements in the buckets when the distribution is non-realistic (i.e., just duplicate rows). Well, thanks for having a look! I still didn't go over my original benchmark, but I'm quite sure I had an index on app, starttime, but probably more realistic distribution of values. I'll be back after going over this more thoroughly tonight. Thanks again! -- BG, SQL Server MVP www.SolidQualityLearning.com [quoted text, click to view] "Gert-Jan Strik" <sorry@toomuchspamalready.nl> wrote in message news:414F5EC3.AAFEB5BB@toomuchspamalready.nl... > Itzik (and Hugo), > > the problem is, that the testdata is not realistic, and this influences > the results in a big way. There are too few distinct starttime values. > This doesn't really affect the cursor based solution but masively > changes the efficiency of the set based solution. > > I have a table with realistic data that I transformed to the DDL that > was posted earlier (I converted some INTs to VARCHAR(10)s and DATETIME > to SMALLDATETIME). I indexed it the way Hugo described. The table had > 10803979 rows in it (782976 KB, 1369 distinct apps). > > If I understand the performance table correctly, the set based solution > should have finished anywhere between 10.423 and 130.503 seconds. Well, > I had to cancel it after 30 minutes. (BTW: I also had to cancel the > cursor based solution after 30 minutes) > > So then I kept the first 1832314 rows (129760 KB) which contain (the > first) 399 distinct apps and reindexed the table. > > The cursor based solution then took 18 minutes and 15 seconds to finish. > During the execution, there was hardly any CPU activity. > > Then I ran the set based solution. I cancelled it after 40 minutes. At > that point, it had been overloading 3 of the available 4 CPU's for about > the entire period. > > I ran the test on SQL7 with 2 HT CPU's and enough memory with the > following query and query plan. > > SELECT app, MAX(concurrent) AS mx > FROM (SELECT app, starttime, > (SELECT COUNT(*) > FROM Sessions AS S2 > WHERE S1.app = S2.app > AND S1.starttime >= S2.starttime > AND S1.starttime < S2.endtime) AS concurrent > FROM Sessions AS S1) AS C > GROUP BY app > option (maxdop 3) > > |--Hash Match(Aggregate, HASH:([S1].[app]), > RESIDUAL:([S1].[app]=[S1].[app]) DEFINE:([Expr1005]=MAX([Expr1002]))) > |--Parallelism(Gather Streams) > |--Hash Match(Aggregate, HASH:([S1].[app]), > RESIDUAL:([S1].[app]=[S1].[app]) DEFINE:([Expr1005]=MAX([Expr1002]))) > |--Compute Scalar(DEFINE:([Expr1002]=[Expr1002])) > |--Nested Loops(Inner Join) > |--Sort(ORDER BY:([S1].[starttime] ASC, > [S1].[app] ASC)) > | |--Parallelism(Repartition Streams, > PARTITION COLUMNS:([S1].[app], [S1].[starttime])) > | |--Index > Scan(OBJECT:([mydb].[dbo].[Sessions].[PK_Sessions] AS [S1])) > |--Table Spool > |--Stream > Aggregate(DEFINE:([Expr1002]=Count(*))) > |--Clustered Index > Seek(OBJECT:([mydb].[dbo].[Sessions].[ImproveIt] AS [S2]), > SEEK:([S2].[app]=[S1].[app] AND [S2].[starttime] <= [S1].[starttime]), > WHERE:([S2].[endtime]>[S1].[starttime]) ORDERED) > > > This should get some people thinking again... :-) > > Gert-Jan > > > > Itzik Ben-Gan wrote: >> >> Well, I'll be damned! :-) >> An index on app, starttime of course seems like a natural choice here. >> >> I tested against larger tables, and both solutions seem to have fairly >> linear performance degradation. >> I'll go to my original benchmarks and see what went wrong there. The >> set-based solution did logically seem to have the potential for >> exponential >> performance degradation... >> Here's the results I got from the new benchmark with the index: >> >> #rows set-based cursor-based >> ---------- ------------- --------------- >> 16 0 0.01 >> 16,000 0.08 1.463 >> 160,000 0.96 12.68 >> 1,600,000 10.423 136.686 >> 16,000,000 130.503 still running... >> >> I used the following code to populate the test tables. >> >> SELECT IDENTITY(INT, 1, 1) AS key_col, app, usr, host, starttime, endtime >> INTO Sessions16k >> FROM Sessions, Nums >> WHERE n <= 1000 -- also 10000, 100000, 1000000 >> >> CREATE CLUSTERED INDEX ImproveIt ON Sessions16K (app, starttime) >> ALTER TABLE Sessions16k ADD PRIMARY KEY NONCLUSTERED(key_col) >> >> Remind me to get you a beer when we meet next time. >> Mind if I shared this with SQLMag's readers? >> >> Now that this problem was solved, maybe someone can figure out a way to >> solve the row-numbers problem with a faster set-based solution? >> Against a table with 100K rows (with an index on orderid), I got 26 >> seconds >> with a cursor based solution, and about half an hour for the set-based >> solution. >> >> -- >> BG, SQL Server MVP >> www.SolidQualityLearning.com >> >> "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message >> news:5mgtk0tv1bssjkii6231v68p7oqq893mjo@4ax.com... >> > On Sun, 19 Sep 2004 23:21:04 -0500, Itzik Ben-Gan wrote: >> > >> > >I haven't found a better performing pure set-based solution but I'd be >> more >> > >than glad if I someone did (test against larger tables, 10000, 100000, >> > >1000000 rows). >> > >> > Hi Itzik, >> > >> > I guess you focussed too much on the SQL and forgot to look at indexes. >> > >> > CREATE CLUSTERED INDEX ImproveIt ON Sessions (app, starttime) >> > >> > I added this line to your creation script, then ran the following lines >> > several times until I had 65536 rows in the table: >> > >> > INSERT Sessions (key_col, app, usr, host, starttime, endtime) >> > SELECT key_col + (SELECT MAX(key_col) FROM Sessions), app, usr, host, >> > starttime, endtime >> > FROM Sessions >> > >> > INSERT Sessions (key_col, app, usr, host, starttime, endtime) >> > SELECT key_col + (SELECT MAX(key_col) FROM Sessions), >> > app + 'X', usr, host, starttime, endtime >> > FROM Sessions >> > SELECT COUNT(*) FROM Sessions >> > >> > With this bloated table, I ran your own setbased SQL and your own >> > cursorbased SQL, both after running DBCC DROPCLEANBUFFERS and DBCC >> > FREEPROCCACHE, of course. Both produced the same output: >> > >> > app mx >> > ---------- ----------- >> > app1 256 >> > app1X 1536 >> > app1XX 3840 >> > app1XXX 5120 >> > app1XXXX 3840 >> > app1XXXXX 1536 >> > app1XXXXXX 256 >> > app2 192 >> > app2X 1152 >> > app2XX 2880 >> > app2XXX 3840
I'm sure that Mr. Anonymous believes that 4 is an obvious conclusion based on 3 - a big flat table doesn't require joins. ;-) p.s. - It takes a real lamer to post under an anonymous identity. Secure in his opinions? I wouldn't count on it. Follow the advice of such a bold soul? I'd vote for John "Feel-the-wind" Kerry first. -- Michael D. Long
[quoted text, click to view] On Mon, 20 Sep 2004 06:52:02 -0700, Joe Celko wrote: >I was thinking about building a table of points in time, like a more >detailed calendar table, then using BETWEEN predicates to count how many >sessions each of those points falls inside of.
Hi Joe, That was my first thought too. But how many point would you need? Every 5 minutes? Every minute? Every second? The beauty of Itzik's set-based query is that the derived table in effect builds a collection of all points in time that might be relevant for the query. I've been thinking about the puzzle some more this evening, but I don't think I could come up with anything better than Itzik's query. Best, Hugo --
[quoted text, click to view] On Mon, 20 Sep 2004 13:20:46 -0500, Itzik Ben-Gan wrote: >I tested against larger tables, and both solutions seem to have fairly >linear performance degradation.
Hi Itzik, Thanks for runnning these tests. I've already been thinking about how I could create more test data for this puzzle, but your post save me the hassle (and the time). [quoted text, click to view] >Remind me to get you a beer when we meet next time.
Okay. But do you expect to be in the Netherlands soon? :-) [quoted text, click to view] >Mind if I shared this with SQLMag's readers?
Not at all. [quoted text, click to view] >Now that this problem was solved, maybe someone can figure out a way to >solve the row-numbers problem with a faster set-based solution? >Against a table with 100K rows (with an index on orderid), I got 26 seconds >with a cursor based solution, and about half an hour for the set-based >solution.
Hmmm. Adding a non-clustered index on orderid should improve the set-based version, but pprobably not enough to beat the cursor-based version. I'll think about this before going to sleep, maybe my subconsious gets creative tonight. Don't hold your breath for it, though :-) Best, Hugo --
[quoted text, click to view] On Tue, 21 Sep 2004 00:02:32 +0200, Hugo Kornelis wrote: >>Mind if I shared this with SQLMag's readers? > >Not at all.
Forgot to add: When your article is published, I'd like to get a notification and a URL, if possible. My email adress is Hugo at perFact dot info. Best, Hugo --
Itzik (and Hugo), the problem is, that the testdata is not realistic, and this influences the results in a big way. There are too few distinct starttime values. This doesn't really affect the cursor based solution but masively changes the efficiency of the set based solution. I have a table with realistic data that I transformed to the DDL that was posted earlier (I converted some INTs to VARCHAR(10)s and DATETIME to SMALLDATETIME). I indexed it the way Hugo described. The table had 10803979 rows in it (782976 KB, 1369 distinct apps). If I understand the performance table correctly, the set based solution should have finished anywhere between 10.423 and 130.503 seconds. Well, I had to cancel it after 30 minutes. (BTW: I also had to cancel the cursor based solution after 30 minutes) So then I kept the first 1832314 rows (129760 KB) which contain (the first) 399 distinct apps and reindexed the table. The cursor based solution then took 18 minutes and 15 seconds to finish. During the execution, there was hardly any CPU activity. Then I ran the set based solution. I cancelled it after 40 minutes. At that point, it had been overloading 3 of the available 4 CPU's for about the entire period. I ran the test on SQL7 with 2 HT CPU's and enough memory with the following query and query plan. SELECT app, MAX(concurrent) AS mx FROM (SELECT app, starttime, (SELECT COUNT(*) FROM Sessions AS S2 WHERE S1.app = S2.app AND S1.starttime >= S2.starttime AND S1.starttime < S2.endtime) AS concurrent FROM Sessions AS S1) AS C GROUP BY app option (maxdop 3) |--Hash Match(Aggregate, HASH:([S1].[app]), RESIDUAL:([S1].[app]=[S1].[app]) DEFINE:([Expr1005]=MAX([Expr1002]))) |--Parallelism(Gather Streams) |--Hash Match(Aggregate, HASH:([S1].[app]), RESIDUAL:([S1].[app]=[S1].[app]) DEFINE:([Expr1005]=MAX([Expr1002]))) |--Compute Scalar(DEFINE:([Expr1002]=[Expr1002])) |--Nested Loops(Inner Join) |--Sort(ORDER BY:([S1].[starttime] ASC, [S1].[app] ASC)) | |--Parallelism(Repartition Streams, PARTITION COLUMNS:([S1].[app], [S1].[starttime])) | |--Index Scan(OBJECT:([mydb].[dbo].[Sessions].[PK_Sessions] AS [S1])) |--Table Spool |--Stream Aggregate(DEFINE:([Expr1002]=Count(*))) |--Clustered Index Seek(OBJECT:([mydb].[dbo].[Sessions].[ImproveIt] AS [S2]), SEEK:([S2].[app]=[S1].[app] AND [S2].[starttime] <= [S1].[starttime]), WHERE:([S2].[endtime]>[S1].[starttime]) ORDERED) This should get some people thinking again... :-) Gert-Jan [quoted text, click to view] Itzik Ben-Gan wrote: > > Well, I'll be damned! :-) > An index on app, starttime of course seems like a natural choice here. > > I tested against larger tables, and both solutions seem to have fairly > linear performance degradation. > I'll go to my original benchmarks and see what went wrong there. The > set-based solution did logically seem to have the potential for exponential > performance degradation... > Here's the results I got from the new benchmark with the index: > > #rows set-based cursor-based > ---------- ------------- --------------- > 16 0 0.01 > 16,000 0.08 1.463 > 160,000 0.96 12.68 > 1,600,000 10.423 136.686 > 16,000,000 130.503 still running... > > I used the following code to populate the test tables. > > SELECT IDENTITY(INT, 1, 1) AS key_col, app, usr, host, starttime, endtime > INTO Sessions16k > FROM Sessions, Nums > WHERE n <= 1000 -- also 10000, 100000, 1000000 > > CREATE CLUSTERED INDEX ImproveIt ON Sessions16K (app, starttime) > ALTER TABLE Sessions16k ADD PRIMARY KEY NONCLUSTERED(key_col) > > Remind me to get you a beer when we meet next time. > Mind if I shared this with SQLMag's readers? > > Now that this problem was solved, maybe someone can figure out a way to > solve the row-numbers problem with a faster set-based solution? > Against a table with 100K rows (with an index on orderid), I got 26 seconds > with a cursor based solution, and about half an hour for the set-based > solution. > > -- > BG, SQL Server MVP > www.SolidQualityLearning.com > > "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message > news:5mgtk0tv1bssjkii6231v68p7oqq893mjo@4ax.com... > > On Sun, 19 Sep 2004 23:21:04 -0500, Itzik Ben-Gan wrote: > > > > >I haven't found a better performing pure set-based solution but I'd be > more > > >than glad if I someone did (test against larger tables, 10000, 100000, > > >1000000 rows). > > > > Hi Itzik, > > > > I guess you focussed too much on the SQL and forgot to look at indexes. > > > > CREATE CLUSTERED INDEX ImproveIt ON Sessions (app, starttime) > > > > I added this line to your creation script, then ran the following lines > > several times until I had 65536 rows in the table: > > > > INSERT Sessions (key_col, app, usr, host, starttime, endtime) > > SELECT key_col + (SELECT MAX(key_col) FROM Sessions), app, usr, host, > > starttime, endtime > > FROM Sessions > > > > INSERT Sessions (key_col, app, usr, host, starttime, endtime) > > SELECT key_col + (SELECT MAX(key_col) FROM Sessions), > > app + 'X', usr, host, starttime, endtime > > FROM Sessions > > SELECT COUNT(*) FROM Sessions > > > > With this bloated table, I ran your own setbased SQL and your own > > cursorbased SQL, both after running DBCC DROPCLEANBUFFERS and DBCC > > FREEPROCCACHE, of course. Both produced the same output: > > > > app mx > > ---------- ----------- > > app1 256 > > app1X 1536 > > app1XX 3840 > > app1XXX 5120 > > app1XXXX 3840 > > app1XXXXX 1536 > > app1XXXXXX 256 > > app2 192 > > app2X 1152 > > app2XX 2880 > > app2XXX 3840 > > app2XXXX 2880 > > app2XXXXX 1152 > > app2XXXXXX 192 > > > > The execution time for the setbased SQL was 0.606 seconds on my test > > system; the cursorbased SQL required 5.483 seconds. > > > > I don't have the time available right now to test against even larger > > tables. > > > > Best, Hugo > > -- > > > > (Remove _NO_ and _SPAM_ to get my e-mail address) --
After examining the issue more thoroughly and running a realistic benchmark, the original assumption still stands, i.e., the cursor based solution has linear performance degradation, while the set-based solution has an exponential one. The set-based solution is much slower even with a small number of rows. I'll try to explain the logic behind the assumption, with supporting numbers, and also why the non-realistic sample data caused the set-based solution to run faster then the cursor-based solution, and to also incur a linear performance degradation. I'm afraid this problem remains open, but of course I welcome anyone that likes challenges to try and find a set-based solution that outperforms the cursor-based. I'll keep on looking myself, of course. Here's a more realistic benchmark. Suppose we're tracking a month worth of sessions data, with 10 applications, and an average session duration of ~50.5 minutes (ranging from 1 through 100). The most efficient index (at least the one that I came up with) is actually a covering non-clustered index placed on (app, starttime, endtime). Examining the execution plan, the number of rows that are scanned for each base row are the actual number of matching rows. i.e., if we have r rows in the table, and each session overlaps with o rows in average, the total number of rows that are going to be scanned are ~ r x o. The exceptions are duplicate rows (same app, starttime) for which the cached aggregate information in the hash table reduces the need to actually scan duplicates multiple times, hence the fast results of the non-realistic sample data. Now for realistic sample data. I'll try to explain the reason for the exponential performance degradation. To populate the table with realistic sample data, you need to calculate random start times within the month, and end times which are start time + random duration (e.g., in my benchmark in the range 1 through 100 minutes). The more sessions there are for each application, the probability for duplicates gets higher, and this will have an increasing positive performance affect on the set-based solution because of the hashed aggregates. But in practical terms even tables with large numbers of rows, taking into consideration production systems that store session data in DATETIME datatypes with an accuracy of 3.333 ms, we can basically ignore duplicates. I already mentioned that the number of rows that are going to be scanned is roughly r x o (minus a bit because of duplicates). Now, populate the table in a factor of f, still using random values within the same month. The table is now f times larger (r x f), but the probability for session overlap also increased by a factor of f (o x f), so the total number of rows that are going to be scanned is ~ r x o x f^2 (-duplicates affect). I/O has a close to linear affect on run-time; if it takes t seconds to query r rows, we should expect ~ t x f^2 seconds to query r x f rows (assuming a non-significant number of duplicates), hence the exponential performance degradation. Here are some estimations based on measured run time of 1 second (floored for the sake of simplicity): estimates: rows factor runtime ---- ------------- ------- 10K 1^2 = 1 1 20K 2^2 = 4 4 40K 4^2 = 16 16 80K 8^2 = 64 64 160K 16^2 = 256 256 1M 100^2 = 10000 10000 And here are the actuals from a benchmark I ran (code below): rows factor runtime reads ---- ------ ------- ----------- 10K 1 1.4 39851 20K 3.5 5 120628 40K 11.4 16 403296 80K 41 57 1454080 160K 150 210 40263 1M still running As you can see the actuals closely resemble the exponential pattern I tried to illustrate. I can attribute the increase in a slightly smaller factor than a power of 2 to duplicates, but there may be other influencing factors as well. So, Hugo, Celko, others, want to take a shot at it and come up with a creative set-based solution that does run fast? BTW, I did try many variations including an auxiliary table of time slots (which is of course less efficient than using the actual start times as time slots). Benchmark's code: USE tempdb SET NOCOUNT ON GO IF OBJECT_ID('Sessions') IS NOT NULL DROP TABLE Sessions GO DECLARE @numrows AS INT SET @numrows = 10000 -- test with 10K - 160K increasing by a factor of 2 SET STATISTICS IO OFF SELECT IDENTITY(INT, 1, 1) AS key_col, D.*, DATEADD( minute, CAST(ABS(RAND(CHECKSUM(NEWID()))*100-0.0000000001) AS INT)+1, starttime) AS endtime INTO Sessions FROM ( SELECT 'app' + CAST(CAST(ABS(RAND(CHECKSUM(NEWID()))*10-0.0000000001) AS INT)+1 AS VARCHAR(10)) AS app, 'user1' AS usr, 'host1' AS host, DATEADD( minute, CAST(ABS(RAND(CHECKSUM(NEWID()))*44540-0.0000000001) AS INT)+1, '20040101') AS starttime FROM Nums WHERE n <= @numrows ) AS D ALTER TABLE Sessions ADD PRIMARY KEY(key_col) CREATE NONCLUSTERED INDEX IDX_NCI_app_se ON Sessions (app, starttime, endtime) DBCC FREEPROCCACHE WITH NO_INFOMSGS DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS SET STATISTICS IO ON DECLARE @dt AS DATETIME SET @dt = GETDATE() SELECT app, MAX(concurrent) AS mx FROM (SELECT app, starttime, (SELECT COUNT(*) FROM Sessions AS S2 WHERE S1.app = S2.app AND S1.starttime >= S2.starttime AND S1.starttime < S2.endtime) AS concurrent FROM Sessions AS S1) AS C GROUP BY app PRINT CAST(@numrows AS VARCHAR(10)) + ' rows: ' + CAST(DATEDIFF(ms, @dt, GETDATE()) / 1000. AS VARCHAR(30)) + ' seconds.' /* -- Auxiliary table of numbers CREATE TABLE Nums(n INT NOT NULL PRIMARY KEY) SET NOCOUNT ON DECLARE @max AS INT, @rc AS INT SET @max = 1000000 SET @rc = 1 BEGIN TRAN INSERT INTO Nums VALUES(1) WHILE @rc * 2 <= @max BEGIN INSERT INTO Nums SELECT n + @rc FROM Nums SET @rc = @rc * 2 END INSERT INTO Nums SELECT n + @rc FROM Nums WHERE n + @rc <= @max COMMIT TRAN */ -- BG, SQL Server MVP www.SolidQualityLearning.com [quoted text, click to view] "Gert-Jan Strik" <sorry@toomuchspamalready.nl> wrote in message news:414F5EC3.AAFEB5BB@toomuchspamalready.nl... > Itzik (and Hugo), > > the problem is, that the testdata is not realistic, and this influences > the results in a big way. There are too few distinct starttime values. > This doesn't really affect the cursor based solution but masively > changes the efficiency of the set based solution. > > I have a table with realistic data that I transformed to the DDL that > was posted earlier (I converted some INTs to VARCHAR(10)s and DATETIME > to SMALLDATETIME). I indexed it the way Hugo described. The table had
You're welcome. I'm glad to see you active in the newsgroup! -- BG, SQL Server MVP www.SolidQualityLearning.com [quoted text, click to view] "Eric Sabine" <mopar41@mail_after_hot_not_before.com> wrote in message news:eks1lnwnEHA.3712@TK2MSFTNGP15.phx.gbl... > Thanks for posting that Itzik. Also, thanks for the 2005 - Row_Number() > information. > > > > Itzik Ben-Gan wrote: >> I think that Eric refers to the following problem: >> >> http://www.winnetmag.com/Article/ArticleID/37636/37636.html >> >> If you can't access the complete problem, here goes: >> >> --------------------------------------------------------------------- >> -- Concurrent time intervals >> --------------------------------------------------------------------- >> CREATE TABLE Sessions >> ( >> key_col INT NOT NULL PRIMARY KEY, >> app VARCHAR(10) NOT NULL, >> usr VARCHAR(10) NOT NULL, >> host VARCHAR(10) NOT NULL, >> starttime SMALLDATETIME NOT NULL, >> endtime SMALLDATETIME NOT NULL, >> CHECK(endtime > starttime) >> ) >> >> INSERT INTO Sessions VALUES(1, 'app1', 'user1', 'host1', '20030212 >> 08:30', '20030212 10:30') >> INSERT INTO Sessions VALUES(2, 'app1', 'user2', 'host1', '20030212 >> 08:30', '20030212 08:45') >> INSERT INTO Sessions VALUES(3, 'app1', 'user3', 'host2', '20030212 >> 09:00', '20030212 09:30') >> INSERT INTO Sessions VALUES(4, 'app1', 'user4', 'host2', '20030212 >> 09:15', '20030212 10:30') >> INSERT INTO Sessions VALUES(5, 'app1', 'user5', 'host3', '20030212 >> 09:15', '20030212 09:30') >> INSERT INTO Sessions VALUES(6, 'app1', 'user6', 'host3', '20030212 >> 10:30', '20030212 14:30') >> INSERT INTO Sessions VALUES(7, 'app1', 'user7', 'host4', '20030212 >> 10:45', '20030212 11:30') >> INSERT INTO Sessions VALUES(8, 'app1', 'user8', 'host4', '20030212 >> 11:00', '20030212 12:30') >> INSERT INTO Sessions VALUES(9, 'app2', 'user8', 'host1', '20030212 >> 08:30', '20030212 08:45') >> INSERT INTO Sessions VALUES(10, 'app2', 'user7', 'host1', '20030212 >> 09:00', '20030212 09:30') >> INSERT INTO Sessions VALUES(11, 'app2', 'user6', 'host2', '20030212 >> 11:45', '20030212 12:00') >> INSERT INTO Sessions VALUES(12, 'app2', 'user5', 'host2', '20030212 >> 12:30', '20030212 14:00') >> INSERT INTO Sessions VALUES(13, 'app2', 'user4', 'host3', '20030212 >> 12:45', '20030212 13:30') >> INSERT INTO Sessions VALUES(14, 'app2', 'user3', 'host3', '20030212 >> 13:00', '20030212 14:00') >> INSERT INTO Sessions VALUES(15, 'app2', 'user2', 'host4', '20030212 >> 14:00', '20030212 16:30') >> INSERT INTO Sessions VALUES(16, 'app2', 'user1', 'host4', '20030212 >> 15:30', '20030212 17:00') >> >> -- Calculate the maximum number of concurrent sessions for each >> user,application >> -- The datetime ranges provided include starttime and exclude datetime >> -- i.e., the session was active >= starttime < endtime >> -- Desired results: >> app mx >> ---------- ----------- >> app1 4 >> app2 3 >> >> --------------------------------------------------------------------- >> -- Concurrent time intervals >> --------------------------------------------------------------------- >> Here's one (of several) pure set-based solution: >> >> -- Set-based solution >> SELECT app, MAX(concurrent) AS mx >> FROM (SELECT app, starttime, >> (SELECT COUNT(*) >> FROM Sessions AS S2 >> WHERE S1.app = S2.app >> AND S1.starttime >= S2.starttime >> AND S1.starttime < S2.endtime) AS concurrent >> FROM Sessions AS S1) AS C >> GROUP BY app >> >> The cursor-based solution is a cursor based on a query that marks >> events as increasing or decreasing the count of concurrent sessions. >> I haven't found a better performing pure set-based solution but I'd >> be more than glad if I someone did (test against larger tables, >> 10000, 100000, 1000000 rows). >> >> -- Iterative solution >> DECLARE >> @app AS varchar(10), >> @prevapp AS varchar (10), >> @ts AS datetime, >> @type AS int, >> @concurrent AS int, >> @mx AS int >> >> DECLARE @AppsMx TABLE >> ( >> app varchar (10) NOT NULL PRIMARY KEY, >> mx int NOT NULL >> ) >> >> DECLARE sessions_cur CURSOR FAST_FORWARD FOR >> SELECT app, starttime AS ts, 1 AS type >> FROM Sessions >> >> UNION ALL >> >> SELECT app, endtime, -1 >> FROM Sessions >> >> ORDER BY app, ts, type >> >> OPEN sessions_cur >> >> FETCH NEXT FROM sessions_cur >> INTO @app, @ts, @type >> >> SET @prevapp = @app >> SET @concurrent = 0 >> SET @mx = 0 >> >> WHILE @@FETCH_STATUS = 0 >> BEGIN >> IF @app <> @prevapp >> BEGIN >> INSERT INTO @AppsMx VALUES(@prevapp, @mx) >> SET @concurrent = 0 >> SET @mx = 0 >> SET @prevapp = @app >> END >> >> SET @concurrent = @concurrent + @type >> IF @concurrent > @mx SET @mx = @concurrent >> >> FETCH NEXT FROM sessions_cur >> INTO @app, @ts, @type >> END >> >> IF @prevapp IS NOT NULL >> INSERT INTO @AppsMx VALUES(@prevapp, @mx) >> >> CLOSE sessions_cur >> >> DEALLOCATE sessions_cur >> >> SELECT * FROM @AppsMx >> >> This problem is one of two examples I show in the course where a >> cursor outperforms a pure set-based query. The second problem has to >> do with calculating, dare I say, "row numbers" ;-) based on some >> desired sort criteria. e.g., a simple scenario of calculating row >> numbers for orders based on orderid sort: >> >> USE Northwind >> >> SELECT orderid, >> (SELECT COUNT(*) >> FROM Orders AS O2 >> WHERE O2.orderid <= O1.orderid) AS rownum >> FROM Orders As O1 >> >> These problems where a cursor outperforms a set-based solution are >> typically characterized by an exponential performance degradation as >> the table grows larger. e.g., for the latter, the optimizer counts >> the number of rows that have an orderid which is <= the outer >> orderid, so even with an index you get (n+n^2)/2 rows scanned. The >> cursor performs better in such a case because the data is scanned >> only once. So the overhead of the curser is negligible compared to >> the overhead of the multiple scans. As an aside, this was addressed in >> SQL Server 2005 by providing the >> ANSI-SQL 99 ROW_NUMBER() function: >> >> SELECT orderid, ROW_NUMBER() OVER(ORDER BY orderid) AS rownum >> FROM Orders >> >> ROW_NUMBER() scans the data only once. It has allows many alternate >> very efficient solutions to existing problems that had slow set-based >> solutions. >> >> "Joe Celko" <jcelko212@earthlink.net> wrote in message >> news:OUusbFrnEHA.2096@TK2MSFTNGP15.phx.gbl... >>>>> Only once I've seen a cursor out perform a set-based solution and >>> that's because I saw Itzik Ben-Gan prove it in a class. << >>> >>> I have written five cursors in my career and I know that if we had a >>> CASE expression in the old days, I could have avoided at least three >>> of them. The other two were NP complete things (a version of
I'm afraid not this time. See my reply to Gert-Jan. But, it was an interesting review of the problem I must say. And it's a good thing that we practice logic and not other stuff. Or, in the words of Arthur Conan Doyle: "Crime is common. Logic is rare. Therefore it is upon the logic rather than upon the crime that you should dwell." :-) -- BG, SQL Server MVP www.SolidQualityLearning.com [quoted text, click to view] "Joe Celko" <jcelko212@earthlink.net> wrote in message news:%23mX9CkxnEHA.2340@TK2MSFTNGP11.phx.gbl... >>> The execution time for the set-based SQL was 0.606 seconds on my test > system; the cursor-based SQL required 5.483 seconds. << > > Another victory for non-procedural code! > > I was thinking about building a table of points in time, like a more > detailed calendar table, then using BETWEEN predicates to count how many > sessions each of those points falls inside of. It gives you control of > grandularity and you can ask other questions with it. But if all we > needed was an index ... > > --CELKO-- > Please post DDL, so that people do not have to guess what the keys, > constraints, Declarative Referential Integrity, datatypes, etc. in your > schema are. Sample data is also a good idea, along with clear > specifications. > > > *** Sent via Developersdex http://www.developersdex.com *** > Don't just participate in USENET...get rewarded for it!
I'm afraid I'm not attending PASS this year. -- BG, SQL Server MVP www.SolidQualityLearning.com [quoted text, click to view] "Eric Sabine" <mopar41@mail_after_hot_not_before.com> wrote in message news:%23vwZWe9nEHA.2052@TK2MSFTNGP10.phx.gbl... > I try to contribute as much as I can. Maybe I will see you at PASS next > week if you're there but I don't see you as a presenter. I will buy _you_ a > beer if you're there. > > Eric > > > Itzik Ben-Gan wrote: > > You're welcome. I'm glad to see you active in the newsgroup! > > > > > > "Eric Sabine" <mopar41@mail_after_hot_not_before.com> wrote in message > > news:eks1lnwnEHA.3712@TK2MSFTNGP15.phx.gbl... > >> Thanks for posting that Itzik. Also, thanks for the 2005 - > >> Row_Number() information. > >> > >> > >> > >> Itzik Ben-Gan wrote: > >>> I think that Eric refers to the following problem: > >>> > >>> http://www.winnetmag.com/Article/ArticleID/37636/37636.html > >>> > >>> If you can't access the complete problem, here goes: > >>> > >>> --------------------------------------------------------------------- > >>> -- Concurrent time intervals > >>> --------------------------------------------------------------------- > >>> CREATE TABLE Sessions > >>> ( > >>> key_col INT NOT NULL PRIMARY KEY, > >>> app VARCHAR(10) NOT NULL, > >>> usr VARCHAR(10) NOT NULL, > >>> host VARCHAR(10) NOT NULL, > >>> starttime SMALLDATETIME NOT NULL, > >>> endtime SMALLDATETIME NOT NULL, > >>> CHECK(endtime > starttime) > >>> ) > >>> > >>> INSERT INTO Sessions VALUES(1, 'app1', 'user1', 'host1', '20030212 > >>> 08:30', '20030212 10:30') > >>> INSERT INTO Sessions VALUES(2, 'app1', 'user2', 'host1', '20030212 > >>> 08:30', '20030212 08:45') > >>> INSERT INTO Sessions VALUES(3, 'app1', 'user3', 'host2', '20030212 > >>> 09:00', '20030212 09:30') > >>> INSERT INTO Sessions VALUES(4, 'app1', 'user4', 'host2', '20030212 > >>> 09:15', '20030212 10:30') > >>> INSERT INTO Sessions VALUES(5, 'app1', 'user5', 'host3', '20030212 > >>> 09:15', '20030212 09:30') > >>> INSERT INTO Sessions VALUES(6, 'app1', 'user6', 'host3', '20030212 > >>> 10:30', '20030212 14:30') > >>> INSERT INTO Sessions VALUES(7, 'app1', 'user7', 'host4', '20030212 > >>> 10:45', '20030212 11:30') > >>> INSERT INTO Sessions VALUES(8, 'app1', 'user8', 'host4', '20030212 > >>> 11:00', '20030212 12:30') > >>> INSERT INTO Sessions VALUES(9, 'app2', 'user8', 'host1', '20030212 > >>> 08:30', '20030212 08:45') > >>> INSERT INTO Sessions VALUES(10, 'app2', 'user7', 'host1', '20030212 > >>> 09:00', '20030212 09:30') > >>> INSERT INTO Sessions VALUES(11, 'app2', 'user6', 'host2', '20030212 > >>> 11:45', '20030212 12:00') > >>> INSERT INTO Sessions VALUES(12, 'app2', 'user5', 'host2', '20030212 > >>> 12:30', '20030212 14:00') > >>> INSERT INTO Sessions VALUES(13, 'app2', 'user4', 'host3', '20030212 > >>> 12:45', '20030212 13:30') > >>> INSERT INTO Sessions VALUES(14, 'app2', 'user3', 'host3', '20030212 > >>> 13:00', '20030212 14:00') > >>> INSERT INTO Sessions VALUES(15, 'app2', 'user2', 'host4', '20030212 > >>> 14:00', '20030212 16:30') > >>> INSERT INTO Sessions VALUES(16, 'app2', 'user1', 'host4', '20030212 > >>> 15:30', '20030212 17:00') > >>> > >>> -- Calculate the maximum number of concurrent sessions for each > >>> user,application > >>> -- The datetime ranges provided include starttime and exclude > >>> datetime -- i.e., the session was active >= starttime < endtime > >>> -- Desired results: > >>> app mx > >>> ---------- ----------- > >>> app1 4 > >>> app2 3 > >>> > >>> --------------------------------------------------------------------- > >>> -- Concurrent time intervals > >>> --------------------------------------------------------------------- > >>> Here's one (of several) pure set-based solution: > >>> > >>> -- Set-based solution > >>> SELECT app, MAX(concurrent) AS mx > >>> FROM (SELECT app, starttime, > >>> (SELECT COUNT(*) > >>> FROM Sessions AS S2 > >>> WHERE S1.app = S2.app > >>> AND S1.starttime >= S2.starttime > >>> AND S1.starttime < S2.endtime) AS concurrent > >>> FROM Sessions AS S1) AS C > >>> GROUP BY app > >>> > >>> The cursor-based solution is a cursor based on a query that marks > >>> events as increasing or decreasing the count of concurrent sessions. > >>> I haven't found a better performing pure set-based solution but I'd > >>> be more than glad if I someone did (test against larger tables, > >>> 10000, 100000, 1000000 rows). > >>> > >>> -- Iterative solution > >>> DECLARE > >>> @app AS varchar(10), > >>> @prevapp AS varchar (10), > >>> @ts AS datetime, > >>> @type AS int, > >>> @concurrent AS int, > >>> @mx AS int > >>> > >>> DECLARE @AppsMx TABLE > >>> ( > >>> app varchar (10) NOT NULL PRIMARY KEY, > >>> mx int NOT NULL > >>> ) > >>> > >>> DECLARE sessions_cur CURSOR FAST_FORWARD FOR > >>> SELECT app, starttime AS ts, 1 AS type > >>> FROM Sessions > >>> > >>> UNION ALL > >>> > >>> SELECT app, endtime, -1 > >>> FROM Sessions > >>> > >>> ORDER BY app, ts, type > >>> > >>> OPEN sessions_cur > >>> > >>> FETCH NEXT FROM sessions_cur > >>> INTO @app, @ts, @type > >>> > >>> SET @prevapp = @app > >>> SET @concurrent = 0 > >>> SET @mx = 0 > >>> > >>> WHILE @@FETCH_STATUS = 0 > >>> BEGIN > >>> IF @app <> @prevapp > >>> BEGIN > >>> INSERT INTO @AppsMx VALUES(@prevapp, @mx) > >>> SET @concurrent = 0 > >>> SET @mx = 0 > >>> SET @prevapp = @app > >>> END > >>> > >>> SET @concurrent = @concurrent + @type > >>> IF @concurrent > @mx SET @mx = @concurrent > >>> > >>> FETCH NEXT FROM sessions_cur > >>> INTO @app, @ts, @type > >>> END > >>> > >>> IF @prevapp IS NOT NULL > >>> INSERT INTO @AppsMx VALUES(@prevapp, @mx) > >>> > >>> CLOSE sessions_cur > >>> > >>> DEALLOCATE sessions_cur > >>> > >>> SELECT * FROM @AppsMx > >>> > >>> This problem is one of two examples I show in the course where a > >>> cursor outperforms a pure set-based query. The second problem has to > >>> do with calculating, dare I say, "row numbers" ;-) based on some > >>> desired sort criteria. e.g., a simple scenario of calculating row > >>> numbers for orders based on orderid sort: > >>> > >>> USE Northwind > >>> > >>> SELECT orderid, > >>> (SELECT COUNT(*) > >>> FROM Orders AS O2 > >>> WHERE O2.orderid <= O1.orderid) AS rownum > >>> FROM Orders As O1 > >>> > >>> These problems where a cursor outperforms a set-based solution are > >>> typically characterized by an exponential performance degradation as > >>> the table grows larger. e.g., for the latter, the optimizer counts > >>> the number of rows that have an orderid which is <= the outer > >>> orderid, so even with an index you get (n+n^2)/2 rows scanned. The > >>> cursor performs better in such a case because the data is scanned
I try to contribute as much as I can. Maybe I will see you at PASS next week if you're there but I don't see you as a presenter. I will buy _you_ a beer if you're there. Eric [quoted text, click to view] Itzik Ben-Gan wrote: > You're welcome. I'm glad to see you active in the newsgroup! > > > "Eric Sabine" <mopar41@mail_after_hot_not_before.com> wrote in message > news:eks1lnwnEHA.3712@TK2MSFTNGP15.phx.gbl... >> Thanks for posting that Itzik. Also, thanks for the 2005 - >> Row_Number() information. >> >> >> >> Itzik Ben-Gan wrote: >>> I think that Eric refers to the following problem: >>> >>> http://www.winnetmag.com/Article/ArticleID/37636/37636.html >>> >>> If you can't access the complete problem, here goes: >>> >>> --------------------------------------------------------------------- >>> -- Concurrent time intervals >>> --------------------------------------------------------------------- >>> CREATE TABLE Sessions >>> ( >>> key_col INT NOT NULL PRIMARY KEY, >>> app VARCHAR(10) NOT NULL, >>> usr VARCHAR(10) NOT NULL, >>> host VARCHAR(10) NOT NULL, >>> starttime SMALLDATETIME NOT NULL, >>> endtime SMALLDATETIME NOT NULL, >>> CHECK(endtime > starttime) >>> ) >>> >>> INSERT INTO Sessions VALUES(1, 'app1', 'user1', 'host1', '20030212 >>> 08:30', '20030212 10:30') >>> INSERT INTO Sessions VALUES(2, 'app1', 'user2', 'host1', '20030212 >>> 08:30', '20030212 08:45') >>> INSERT INTO Sessions VALUES(3, 'app1', 'user3', 'host2', '20030212 >>> 09:00', '20030212 09:30') >>> INSERT INTO Sessions VALUES(4, 'app1', 'user4', 'host2', '20030212 >>> 09:15', '20030212 10:30') >>> INSERT INTO Sessions VALUES(5, 'app1', 'user5', 'host3', '20030212 >>> 09:15', '20030212 09:30') >>> INSERT INTO Sessions VALUES(6, 'app1', 'user6', 'host3', '20030212 >>> 10:30', '20030212 14:30') >>> INSERT INTO Sessions VALUES(7, 'app1', 'user7', 'host4', '20030212 >>> 10:45', '20030212 11:30') >>> INSERT INTO Sessions VALUES(8, 'app1', 'user8', 'host4', '20030212 >>> 11:00', '20030212 12:30') >>> INSERT INTO Sessions VALUES(9, 'app2' |