Groups | Blog | Home
all groups > sql server programming > september 2004 >

sql server programming : List of bad practices



anonymous NO[at]SPAM discussions.microsoft.com
9/18/2004 10:54:39 PM
I have a list of bad practices in SQL Server below. If I
am missing something can you please share it?

1) Do not use cursors. Cursor "type" opersations should
be done in a middle tier.

2) Do not use temp-tables. Use derived-table or view

3) Too much normalization is not good.

4) Too many joins in quesries is not good.

Anith Sen
9/19/2004 1:44:08 AM
[quoted text, click to view]

It would help others to understand your perspective better, if you care to
post the reasons behind each points in your "list".

--
Anith

Joe Celko
9/19/2004 6:10:33 AM
[quoted text, click to view]
am missing something can you please share it? <<

This is funny! I am about 6000 words into my next book on SQL
programming style and explain some of these things with a rationale and
an exception. Here is some feedback:

2) One problem with temp tables in T-SQL is that they do not follow the
model for Standard SQL temp tables, so they will not port easily.

But I agree that most of the time programmers use temp tables to hold
steps in a process rather than writing a single declarative statement.
Why learn RDBMS when you can fake your old skills in the new language?

In fact, a lot T-SQL is structured exactly like mag tape file systems --
the old "WORKnnn" tape files are simply replaced by "#temp_nn" tables.

3) I am not sure what "over-normalize" means. If you split up a table
beyond a certain point, you destroy the model for entity it was supposed
to represent. Properly normalized to AT LEAST 3NF and preferably to 5NF
is the minimum.

5) I think you meant DRI actions should replace triggers when possible.

--CELKO--
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are. Sample data is also a good idea, along with clear
specifications.


*** Sent via Developersdex http://www.developersdex.com ***
anonymous NO[at]SPAM discussions.microsoft.com
9/19/2004 7:09:30 AM
Well, I didn't want to turn this into a religous debate.
You are right, these statements are not true if you are
dealing with a mom and pop database. It really doesn't
hinder any performance. Even if it did you can always
set up an SQL Server cluster using Active/Active. I
should refresh my quote. It should be:

What are "optimal" practices versus "other" practices in
SQL Server.

[quoted text, click to view]
Eric Sabine
9/19/2004 8:48:35 AM
This is a weird list IMO. I disagree with the "absolute-ness" of 2-5 and my
only feeling about cursors is I've never had a compelling reason to turn to
using them. Only once I've seen a cursor out perform a set-based solution
and that's because I saw Itzik Ben-Gan prove it in a class. The problem
with a blanket list is everything should contain proofs and caveats. The
"it depends" mantra in SQL Server is true because it really does depend.

Eric




[quoted text, click to view]

David Portas
9/19/2004 9:18:17 AM
I don't think that defining "bad practices" as blanket statements like this
is very useful to anyone. For what it's worth, I almost entirely disagree
with everything you've written! Are these ideas something that you actually
practice today or are you looking for guidance on what good and bad
practices *should* be?

--
David Portas
SQL Server MVP
--

anonymous NO[at]SPAM discussions.microsoft.com
9/19/2004 9:35:34 AM
I agree. That is more appropriate
[quoted text, click to view]
Nigel Rivett
9/19/2004 10:37:03 AM
See
http://www.nigelrivett.net/BadThings.html
Especially the paragraph at the top.

Think I disagree with all your statements but particularly 2 which in some
cases is just wrong.



[quoted text, click to view]
<WKidd>
9/19/2004 12:01:10 PM
Cursors implemented on the client side can still hold locking transactions.
Views are not a replacement for temporary tables.

[quoted text, click to view]

Nigel Rivett
9/19/2004 12:49:03 PM
lol.
1st and 3rd are the same really.
The whole point is so that you can change the database structure without
changing the application - often for performance, as long as the SPs have the
same interface then application will never know anything has happened. I do
it quite often. Saying that it will never happen probably means that you have
bound the application to the database structure. That sort of thing usually
means that everything is designed by an application programmer which is what
the warning is really about.
Just because microsoft does something doesn't mean it's a good idea - look
at the system SPs for previous versions of sql server.

2nd one
So you're using SP to genrate the change scripts which is fine - the problem
is when people use it to make changes without knowing what it is really going
to do and don't save any record of what it has done.

Read the first paragraph again. These are just things to think about and not
necessarily bad in all cases.



[quoted text, click to view]
jeff.nospam NO[at]SPAM zina.com
9/19/2004 3:57:46 PM
On Sat, 18 Sep 2004 22:54:39 -0700,
[quoted text, click to view]

The problem is, all these "bad" practices have valid reasons for use.
Kind of like the "Thou shalt not kill" thing getting in the way when
someone is intent on taking your own life.

A better list would be good practices, starting with "know why you're
using a technique and what the advantages and disadvantages are in
your particular environment."

Andrew J. Kelly
9/19/2004 4:17:22 PM
[quoted text, click to view]


I must be missing the point here. Clusters have absolutely nothing to do
with performance. They are for hardware failover.


--
Andrew J. Kelly SQL MVP


[quoted text, click to view]

Joe Celko
9/19/2004 6:30:04 PM
[quoted text, click to view]
that's because I saw Itzik Ben-Gan prove it in a class. <<

I have written five cursors in my career and I know that if we had a
CASE expression in the old days, I could have avoided at least three of
them. The other two were NP complete things (a version of knapsack and
traveling salesman problems) where we wanted to find any answer under a
certain cost. The relational solution would find ALL the answers --
sometime after we were all dead.

Do you happen to have that example and is it generic enough to use for
Standard SQL classes?

--CELKO--
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are. Sample data is also a good idea, along with clear
specifications.


*** Sent via Developersdex http://www.developersdex.com ***
Nigel Rivett
9/19/2004 6:30:13 PM
Odd - I'm at a place now where someone implemented a cluster to improve
performance then used one half for something else when it didn't.
Still have half a cluster.

Nigel Rivett
www.nigelrivett.net

*** Sent via Developersdex http://www.developersdex.com ***
David Portas
9/19/2004 6:50:56 PM
[quoted text, click to view]

Cursor operations should be avoided altogether, which is possible with
set-based SQL in at least 99% of cases no matter what the size of your data.
Writing cursor-like data manipulation operations in the middle tier is
unlikely to scale well.

[quoted text, click to view]

This statement is pretty meaningless because you don't say how much is too
much. Third Normal Form should be required for any relational design.
De-normalization below 3NF is a mistake every time in my experience.

[quoted text, click to view]

Why? Joins in a normalized database can improve performance by reducing the
number of physical reads required. Your statement implies that you have a
choice in the number of joins you might implement in a query. Can you give
an example of this? I'm not sure I understand what kind of scenario you have
in mind.

[quoted text, click to view]

Triggers and Constraints are different solutions to different problems. DRI
should usualy be implemented through constraints where possible. I do agree
that minimizing the use of triggers is wise; the obvious alternative is to
implement DML logic in SPs.

--
David Portas
SQL Server MVP
--

Richard Quinn
9/19/2004 8:30:11 PM
On Sun, 19 Sep 2004 10:37:03 -0700, Nigel Rivett <sqlnr@hotmail.com>
[quoted text, click to view]

Let's see...

[quoted text, click to view]

I disagree. Exactly why do you consider this as bad? AFAIK, this
practice is a cornerstone of Microsoft's DNA architecture, FWIW.

[quoted text, click to view]

I disagree. I always store the changes to a file and then put that in
VSS or add it to my Vis.Studio "Change Scripts" folder.

[quoted text, click to view]

I disagree (mostly). Why do you consider this as bad?
If my application knows nothing about the database (at any layer),
then why am I using a relational database in the first place? I have
yet to see an example of a "database application" which will continue
to work if somebody changes the schema behind its back.


- Richard

---
Richard Quinn
Itzik Ben-Gan
9/19/2004 11:21:04 PM
I think that Eric refers to the following problem:

http://www.winnetmag.com/Article/ArticleID/37636/37636.html

If you can't access the complete problem, here goes:

---------------------------------------------------------------------
-- Concurrent time intervals
---------------------------------------------------------------------
CREATE TABLE Sessions
(
key_col INT NOT NULL PRIMARY KEY,
app VARCHAR(10) NOT NULL,
usr VARCHAR(10) NOT NULL,
host VARCHAR(10) NOT NULL,
starttime SMALLDATETIME NOT NULL,
endtime SMALLDATETIME NOT NULL,
CHECK(endtime > starttime)
)

INSERT INTO Sessions VALUES(1, 'app1', 'user1', 'host1', '20030212 08:30',
'20030212 10:30')
INSERT INTO Sessions VALUES(2, 'app1', 'user2', 'host1', '20030212 08:30',
'20030212 08:45')
INSERT INTO Sessions VALUES(3, 'app1', 'user3', 'host2', '20030212 09:00',
'20030212 09:30')
INSERT INTO Sessions VALUES(4, 'app1', 'user4', 'host2', '20030212 09:15',
'20030212 10:30')
INSERT INTO Sessions VALUES(5, 'app1', 'user5', 'host3', '20030212 09:15',
'20030212 09:30')
INSERT INTO Sessions VALUES(6, 'app1', 'user6', 'host3', '20030212 10:30',
'20030212 14:30')
INSERT INTO Sessions VALUES(7, 'app1', 'user7', 'host4', '20030212 10:45',
'20030212 11:30')
INSERT INTO Sessions VALUES(8, 'app1', 'user8', 'host4', '20030212 11:00',
'20030212 12:30')
INSERT INTO Sessions VALUES(9, 'app2', 'user8', 'host1', '20030212 08:30',
'20030212 08:45')
INSERT INTO Sessions VALUES(10, 'app2', 'user7', 'host1', '20030212 09:00',
'20030212 09:30')
INSERT INTO Sessions VALUES(11, 'app2', 'user6', 'host2', '20030212 11:45',
'20030212 12:00')
INSERT INTO Sessions VALUES(12, 'app2', 'user5', 'host2', '20030212 12:30',
'20030212 14:00')
INSERT INTO Sessions VALUES(13, 'app2', 'user4', 'host3', '20030212 12:45',
'20030212 13:30')
INSERT INTO Sessions VALUES(14, 'app2', 'user3', 'host3', '20030212 13:00',
'20030212 14:00')
INSERT INTO Sessions VALUES(15, 'app2', 'user2', 'host4', '20030212 14:00',
'20030212 16:30')
INSERT INTO Sessions VALUES(16, 'app2', 'user1', 'host4', '20030212 15:30',
'20030212 17:00')

-- Calculate the maximum number of concurrent sessions for each
user,application
-- The datetime ranges provided include starttime and exclude datetime
-- i.e., the session was active >= starttime < endtime
-- Desired results:
app mx
---------- -----------
app1 4
app2 3

---------------------------------------------------------------------
-- Concurrent time intervals
---------------------------------------------------------------------
Here's one (of several) pure set-based solution:

-- Set-based solution
SELECT app, MAX(concurrent) AS mx
FROM (SELECT app, starttime,
(SELECT COUNT(*)
FROM Sessions AS S2
WHERE S1.app = S2.app
AND S1.starttime >= S2.starttime
AND S1.starttime < S2.endtime) AS concurrent
FROM Sessions AS S1) AS C
GROUP BY app

The cursor-based solution is a cursor based on a query that marks events as
increasing or decreasing the count of concurrent sessions.
I haven't found a better performing pure set-based solution but I'd be more
than glad if I someone did (test against larger tables, 10000, 100000,
1000000 rows).

-- Iterative solution
DECLARE
@app AS varchar(10),
@prevapp AS varchar (10),
@ts AS datetime,
@type AS int,
@concurrent AS int,
@mx AS int

DECLARE @AppsMx TABLE
(
app varchar (10) NOT NULL PRIMARY KEY,
mx int NOT NULL
)

DECLARE sessions_cur CURSOR FAST_FORWARD FOR
SELECT app, starttime AS ts, 1 AS type
FROM Sessions

UNION ALL

SELECT app, endtime, -1
FROM Sessions

ORDER BY app, ts, type

OPEN sessions_cur

FETCH NEXT FROM sessions_cur
INTO @app, @ts, @type

SET @prevapp = @app
SET @concurrent = 0
SET @mx = 0

WHILE @@FETCH_STATUS = 0
BEGIN
IF @app <> @prevapp
BEGIN
INSERT INTO @AppsMx VALUES(@prevapp, @mx)
SET @concurrent = 0
SET @mx = 0
SET @prevapp = @app
END

SET @concurrent = @concurrent + @type
IF @concurrent > @mx SET @mx = @concurrent

FETCH NEXT FROM sessions_cur
INTO @app, @ts, @type
END

IF @prevapp IS NOT NULL
INSERT INTO @AppsMx VALUES(@prevapp, @mx)

CLOSE sessions_cur

DEALLOCATE sessions_cur

SELECT * FROM @AppsMx

This problem is one of two examples I show in the course where a cursor
outperforms a pure set-based query. The second problem has to do with
calculating, dare I say, "row numbers" ;-) based on some desired sort
criteria. e.g., a simple scenario of calculating row numbers for orders
based on orderid sort:

USE Northwind

SELECT orderid,
(SELECT COUNT(*)
FROM Orders AS O2
WHERE O2.orderid <= O1.orderid) AS rownum
FROM Orders As O1

These problems where a cursor outperforms a set-based solution are typically
characterized by an exponential performance degradation as the table grows
larger. e.g., for the latter, the optimizer counts the number of rows that
have an orderid which is <= the outer orderid, so even with an index you get
(n+n^2)/2 rows scanned. The cursor performs better in such a case because
the data is scanned only once. So the overhead of the curser is negligible
compared to the overhead of the multiple scans.
As an aside, this was addressed in SQL Server 2005 by providing the ANSI-SQL
99 ROW_NUMBER() function:

SELECT orderid, ROW_NUMBER() OVER(ORDER BY orderid) AS rownum
FROM Orders

ROW_NUMBER() scans the data only once. It has allows many alternate very
efficient solutions to existing problems that had slow set-based solutions.

--
BG, SQL Server MVP
www.SolidQualityLearning.com


[quoted text, click to view]

Lucas Tam
9/19/2004 11:25:43 PM
"David Portas" <REMOVE_BEFORE_REPLYING_dportas@acm.org> wrote in
news:66edncmusabM3dDcRVn-jQ@giganews.com:

[quoted text, click to view]

Very true...

There are times when one must use these bad practices because it's the best
way to do things.

--
Lucas Tam (REMOVEnntp@rogers.com)
Please delete "REMOVE" from the e-mail address when replying.
Hugo Kornelis
9/19/2004 11:41:34 PM
Hi Anonymous,

On Sat, 18 Sep 2004 22:54:39 -0700, <anonymous@discussions.microsoft.com>
[quoted text, click to view]

I do think that cursors should be avoided as much as possible. But the
same holds true for blanket statements.


[quoted text, click to view]

Depends on what you want to do. Not everything can be done with derived
tables or views. And even when it can be done, there are cases where using
a temp table gives better performance. To prove my point, I'll show you
three different queries that will find any couple of authors that have the
exact same earnings (based on sales_ytd, price and royalty percentages),
but only if the earnings are between $4,000 and $4,500.

(Note - there might be better ways to solve this query; that's not my
point. My point is that the optimizer will not choose the best execution
plan if the same view or derived table is used multiple times in the
query).

Try the below queries. Compare the difference from the set statistics io
output. Check the execution plans as well. You'll see that the solutions
with view and derived table actually construct the contents of the derived
table (or view) twice, making execution more expensive than the version
with the temp table. (And the last version can actually be optimized even
further by adding a primary key constraint and an extra index - commented
out in the code below).

--Solution #1, using a view:

create view earnings
as
select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100) as earnings
from authors a
join titleauthor ta
on ta.au_id = a.au_id
join titles t
on t.title_id = ta.title_id
group by a.au_id
having sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper /
100) between 4000 and 4500
go
set statistics io on
select e1.au_id, e2.au_id, e1.earnings
from earnings AS e1
join earnings AS e2
on e1.earnings = e2.earnings
and e1.au_id < e2.au_id
drop view earnings
set statistics io off

-- Solution #2, using a derived table:

set statistics io on
select e1.au_id, e2.au_id, e1.earnings
from (select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100) AS earnings
from authors a
join titleauthor ta
on ta.au_id = a.au_id
join titles t
on t.title_id = ta.title_id
group by a.au_id
having sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100) between 4000 and 4500) AS e1
join (select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100) AS earnings
from authors a
join titleauthor ta
on ta.au_id = a.au_id
join titles t
on t.title_id = ta.title_id
group by a.au_id
having sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100) between 4000 and 4500) AS e2
on e1.earnings = e2.earnings
and e1.au_id < e2.au_id
set statistics io off

-- Solution #3, using a temp table:

set statistics io on
create table #tmp (au_id varchar(11) not null
,earnings money not null
-- ,primary key (au_id)
)
--create index ixtmp on #tmp(earnings)
insert #tmp (au_id, earnings)
select a.au_id, sum(t.price * t.ytd_sales * t.royalty / 100 *
ta.royaltyper / 100)
from authors a
join titleauthor ta
on ta.au_id = a.au_id
join titles t
on t.title_id = ta.title_id
group by a.au_id
having sum(t.price * t.ytd_sales * t.royalty / 100 * ta.royaltyper /
100) between 4000 and 4500
select e1.au_id, e2.au_id, e1.earnings
from #tmp AS e1
join #tmp AS e2
on e1.earnings = e2.earnings
and e1.au_id < e2.au_id
drop table #tmp
set statistics io off


[quoted text, click to view]

I disagree. In OLTP databases, the design should be normalized down to
fifth normal form, with only very few exception (and these should be
documented, as they'll probably prove to be the future bottleneck anyway).

Requirements for OLAP databases are different though. That might be a
place where carefull denormalization has it's place. But denormalization
should (as the name implies) follow normalization - stopping before you
have a completely normalized design won't yield a "denormalized" design -
you'll end up with just another "unnormalized" design.


[quoted text, click to view]

Define "too many".

If you mean: don't join in tables that you don't need in the query, then I
would aagree.

If you mean: no more than (fill in your arbitrary number) of joins, then
this is nonsense. If the query needs data from many tables, you'll just
have to get them all into the query, or you'll never get the desired
results. (Unless of course you use temp tables for intermediate results,
but in THIS case, that WOULD be bad).


[quoted text, click to view]

Use constraints and DRI if they can achieve what you need. Use triggers if
you need business logic that constraints and DRI don't offer.

Best, Hugo
--

Joe Celko
9/20/2004 6:52:02 AM
[quoted text, click to view]
system; the cursor-based SQL required 5.483 seconds. <<

Another victory for non-procedural code!

I was thinking about building a table of points in time, like a more
detailed calendar table, then using BETWEEN predicates to count how many
sessions each of those points falls inside of. It gives you control of
grandularity and you can ask other questions with it. But if all we
needed was an index ...

--CELKO--
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are. Sample data is also a good idea, along with clear
specifications.


*** Sent via Developersdex http://www.developersdex.com ***
Eric Sabine
9/20/2004 8:03:50 AM
Thanks for posting that Itzik. Also, thanks for the 2005 - Row_Number()
information.



[quoted text, click to view]
Anith Sen
9/20/2004 9:16:51 AM
[quoted text, click to view]

I was about to comment on Itzik's choice of indexes before I saw your post
and it is to the point.

Even otherwise, there may be several examples where we find cursors in
certain special cases perform better than corresponding set-based solutions.
However these "out-performances" are simply attributions of the physical
model which vary from OS to OS, hardware to hardware, DBMS to DBMS,
underlying dataset variations, indexes involved & lots of factors, which
makes any generic performance claims suspect.

In general, I wouldn't give much encouragement to loop-based solutions,
since it is a deviation from the functional programming approach which is
desirable in data management.

--
Anith

Itzik Ben-Gan
9/20/2004 1:20:46 PM
Well, I'll be damned! :-)
An index on app, starttime of course seems like a natural choice here.

I tested against larger tables, and both solutions seem to have fairly
linear performance degradation.
I'll go to my original benchmarks and see what went wrong there. The
set-based solution did logically seem to have the potential for exponential
performance degradation...
Here's the results I got from the new benchmark with the index:

#rows set-based cursor-based
---------- ------------- ---------------
16 0 0.01
16,000 0.08 1.463
160,000 0.96 12.68
1,600,000 10.423 136.686
16,000,000 130.503 still running...

I used the following code to populate the test tables.

SELECT IDENTITY(INT, 1, 1) AS key_col, app, usr, host, starttime, endtime
INTO Sessions16k
FROM Sessions, Nums
WHERE n <= 1000 -- also 10000, 100000, 1000000

CREATE CLUSTERED INDEX ImproveIt ON Sessions16K (app, starttime)
ALTER TABLE Sessions16k ADD PRIMARY KEY NONCLUSTERED(key_col)

Remind me to get you a beer when we meet next time.
Mind if I shared this with SQLMag's readers?

Now that this problem was solved, maybe someone can figure out a way to
solve the row-numbers problem with a faster set-based solution?
Against a table with 100K rows (with an index on orderid), I got 26 seconds
with a cursor based solution, and about half an hour for the set-based
solution.

--
BG, SQL Server MVP
www.SolidQualityLearning.com


[quoted text, click to view]

Hugo Kornelis
9/20/2004 2:17:14 PM
[quoted text, click to view]

Hi Itzik,

I guess you focussed too much on the SQL and forgot to look at indexes.

CREATE CLUSTERED INDEX ImproveIt ON Sessions (app, starttime)

I added this line to your creation script, then ran the following lines
several times until I had 65536 rows in the table:

INSERT Sessions (key_col, app, usr, host, starttime, endtime)
SELECT key_col + (SELECT MAX(key_col) FROM Sessions), app, usr, host,
starttime, endtime
FROM Sessions

INSERT Sessions (key_col, app, usr, host, starttime, endtime)
SELECT key_col + (SELECT MAX(key_col) FROM Sessions),
app + 'X', usr, host, starttime, endtime
FROM Sessions
SELECT COUNT(*) FROM Sessions

With this bloated table, I ran your own setbased SQL and your own
cursorbased SQL, both after running DBCC DROPCLEANBUFFERS and DBCC
FREEPROCCACHE, of course. Both produced the same output:

app mx
---------- -----------
app1 256
app1X 1536
app1XX 3840
app1XXX 5120
app1XXXX 3840
app1XXXXX 1536
app1XXXXXX 256
app2 192
app2X 1152
app2XX 2880
app2XXX 3840
app2XXXX 2880
app2XXXXX 1152
app2XXXXXX 192

The execution time for the setbased SQL was 0.606 seconds on my test
system; the cursorbased SQL required 5.483 seconds.

I don't have the time available right now to test against even larger
tables.

Best, Hugo
--

Hugo Kornelis
9/20/2004 2:21:51 PM
[quoted text, click to view]

And I added the NONCLUSTERED option on the primary key constraint, of
course. Apologies for leaving this out of my post.

Best, Hugo
--

Itzik Ben-Gan
9/20/2004 2:52:12 PM
Re the row-numbers problem, solutions with IDENTITY are not valid because
SQL Server 2000 doesn't guarantee that identity values would be generated
according to the sort specified in an ORDER BY clause.
Identity values sometimes are generated prior to the sort, and there's no
way to force SQL Server to assign them otherwise.

--
BG, SQL Server MVP
www.SolidQualityLearning.com


[quoted text, click to view]

Itzik Ben-Gan
9/20/2004 6:06:15 PM
Very interesting... and makes sense when thinking about the exponential
growth in the number of rows scanned. The hash table probably holds a
significantly smaller number of elements in the buckets when the
distribution is non-realistic (i.e., just duplicate rows).

Well, thanks for having a look!

I still didn't go over my original benchmark, but I'm quite sure I had an
index on app, starttime, but probably more realistic distribution of values.

I'll be back after going over this more thoroughly tonight.

Thanks again!
--
BG, SQL Server MVP
www.SolidQualityLearning.com


[quoted text, click to view]
Michael D. Long
9/20/2004 8:05:54 PM
I'm sure that Mr. Anonymous believes that 4 is an obvious conclusion based
on 3 - a big flat table doesn't require joins. ;-)

p.s. - It takes a real lamer to post under an anonymous identity. Secure in
his opinions? I wouldn't count on it. Follow the advice of such a bold
soul? I'd vote for John "Feel-the-wind" Kerry first.

--
Michael D. Long

Hugo Kornelis
9/20/2004 11:53:09 PM
[quoted text, click to view]

Hi Joe,

That was my first thought too. But how many point would you need? Every 5
minutes? Every minute? Every second?

The beauty of Itzik's set-based query is that the derived table in effect
builds a collection of all points in time that might be relevant for the
query. I've been thinking about the puzzle some more this evening, but I
don't think I could come up with anything better than Itzik's query.

Best, Hugo
--

Hugo Kornelis
9/21/2004 12:02:32 AM
[quoted text, click to view]

Hi Itzik,

Thanks for runnning these tests. I've already been thinking about how I
could create more test data for this puzzle, but your post save me the
hassle (and the time).


[quoted text, click to view]

Okay. But do you expect to be in the Netherlands soon? :-)


[quoted text, click to view]

Not at all.


[quoted text, click to view]

Hmmm. Adding a non-clustered index on orderid should improve the set-based
version, but pprobably not enough to beat the cursor-based version. I'll
think about this before going to sleep, maybe my subconsious gets creative
tonight.

Don't hold your breath for it, though :-)

Best, Hugo
--

Hugo Kornelis
9/21/2004 12:46:49 AM
[quoted text, click to view]

Forgot to add: When your article is published, I'd like to get a
notification and a URL, if possible. My email adress is Hugo at perFact
dot info.

Best, Hugo
--

Gert-Jan Strik
9/21/2004 12:50:43 AM
Itzik (and Hugo),

the problem is, that the testdata is not realistic, and this influences
the results in a big way. There are too few distinct starttime values.
This doesn't really affect the cursor based solution but masively
changes the efficiency of the set based solution.

I have a table with realistic data that I transformed to the DDL that
was posted earlier (I converted some INTs to VARCHAR(10)s and DATETIME
to SMALLDATETIME). I indexed it the way Hugo described. The table had
10803979 rows in it (782976 KB, 1369 distinct apps).

If I understand the performance table correctly, the set based solution
should have finished anywhere between 10.423 and 130.503 seconds. Well,
I had to cancel it after 30 minutes. (BTW: I also had to cancel the
cursor based solution after 30 minutes)

So then I kept the first 1832314 rows (129760 KB) which contain (the
first) 399 distinct apps and reindexed the table.

The cursor based solution then took 18 minutes and 15 seconds to finish.
During the execution, there was hardly any CPU activity.

Then I ran the set based solution. I cancelled it after 40 minutes. At
that point, it had been overloading 3 of the available 4 CPU's for about
the entire period.

I ran the test on SQL7 with 2 HT CPU's and enough memory with the
following query and query plan.

SELECT app, MAX(concurrent) AS mx
FROM (SELECT app, starttime,
(SELECT COUNT(*)
FROM Sessions AS S2
WHERE S1.app = S2.app
AND S1.starttime >= S2.starttime
AND S1.starttime < S2.endtime) AS concurrent
FROM Sessions AS S1) AS C
GROUP BY app
option (maxdop 3)

|--Hash Match(Aggregate, HASH:([S1].[app]),
RESIDUAL:([S1].[app]=[S1].[app]) DEFINE:([Expr1005]=MAX([Expr1002])))
|--Parallelism(Gather Streams)
|--Hash Match(Aggregate, HASH:([S1].[app]),
RESIDUAL:([S1].[app]=[S1].[app]) DEFINE:([Expr1005]=MAX([Expr1002])))
|--Compute Scalar(DEFINE:([Expr1002]=[Expr1002]))
|--Nested Loops(Inner Join)
|--Sort(ORDER BY:([S1].[starttime] ASC,
[S1].[app] ASC))
| |--Parallelism(Repartition Streams,
PARTITION COLUMNS:([S1].[app], [S1].[starttime]))
| |--Index
Scan(OBJECT:([mydb].[dbo].[Sessions].[PK_Sessions] AS [S1]))
|--Table Spool
|--Stream
Aggregate(DEFINE:([Expr1002]=Count(*)))
|--Clustered Index
Seek(OBJECT:([mydb].[dbo].[Sessions].[ImproveIt] AS [S2]),
SEEK:([S2].[app]=[S1].[app] AND [S2].[starttime] <= [S1].[starttime]),
WHERE:([S2].[endtime]>[S1].[starttime]) ORDERED)


This should get some people thinking again... :-)

Gert-Jan



[quoted text, click to view]

--
Itzik Ben-Gan
9/21/2004 1:05:48 AM
After examining the issue more thoroughly and running a realistic benchmark,
the original assumption still stands, i.e., the cursor based solution has
linear performance degradation, while the set-based solution has an
exponential one. The set-based solution is much slower even with a small
number of rows.
I'll try to explain the logic behind the assumption, with supporting
numbers, and also why the non-realistic sample data caused the set-based
solution to run faster then the cursor-based solution, and to also incur a
linear performance degradation.
I'm afraid this problem remains open, but of course I welcome anyone that
likes challenges to try and find a set-based solution that outperforms the
cursor-based. I'll keep on looking myself, of course.

Here's a more realistic benchmark.
Suppose we're tracking a month worth of sessions data, with 10 applications,
and an average session duration of ~50.5 minutes (ranging from 1 through
100).
The most efficient index (at least the one that I came up with) is actually
a covering non-clustered index placed on (app, starttime, endtime).
Examining the execution plan, the number of rows that are scanned for each
base row are the actual number of matching rows. i.e., if we have r rows in
the table, and each session overlaps with o rows in average, the total
number of rows that are going to be scanned are ~ r x o. The exceptions are
duplicate rows (same app, starttime) for which the cached aggregate
information in the hash table reduces the need to actually scan duplicates
multiple times, hence the fast results of the non-realistic sample data.

Now for realistic sample data. I'll try to explain the reason for the
exponential performance degradation.
To populate the table with realistic sample data, you need to calculate
random start times within the month, and end times which are start time +
random duration (e.g., in my benchmark in the range 1 through 100 minutes).
The more sessions there are for each application, the probability for
duplicates gets higher, and this will have an increasing positive
performance affect on the set-based solution because of the hashed
aggregates. But in practical terms even tables with large numbers of rows,
taking into consideration production systems that store session data in
DATETIME datatypes with an accuracy of 3.333 ms, we can basically ignore
duplicates.
I already mentioned that the number of rows that are going to be scanned is
roughly r x o (minus a bit because of duplicates). Now, populate the table
in a factor of f, still using random values within the same month. The table
is now f times larger (r x f), but the probability for session overlap also
increased by a factor of f (o x f), so the total number of rows that are
going to be scanned is ~ r x o x f^2 (-duplicates affect). I/O has a close
to linear affect on run-time; if it takes t seconds to query r rows, we
should expect ~ t x f^2 seconds to query r x f rows (assuming a
non-significant number of duplicates), hence the exponential performance
degradation.

Here are some estimations based on measured run time of 1 second (floored
for the sake of simplicity):

estimates:

rows factor runtime
---- ------------- -------
10K 1^2 = 1 1
20K 2^2 = 4 4
40K 4^2 = 16 16
80K 8^2 = 64 64
160K 16^2 = 256 256
1M 100^2 = 10000 10000

And here are the actuals from a benchmark I ran (code below):

rows factor runtime reads
---- ------ ------- -----------
10K 1 1.4 39851
20K 3.5 5 120628
40K 11.4 16 403296
80K 41 57 1454080
160K 150 210 40263
1M still running

As you can see the actuals closely resemble the exponential pattern I tried
to illustrate. I can attribute the increase in a slightly smaller factor
than a power of 2 to duplicates, but there may be other influencing factors
as well.

So, Hugo, Celko, others, want to take a shot at it and come up with a
creative set-based solution that does run fast?
BTW, I did try many variations including an auxiliary table of time slots
(which is of course less efficient than using the actual start times as time
slots).

Benchmark's code:

USE tempdb
SET NOCOUNT ON
GO
IF OBJECT_ID('Sessions') IS NOT NULL
DROP TABLE Sessions
GO

DECLARE @numrows AS INT
SET @numrows = 10000
-- test with 10K - 160K increasing by a factor of 2

SET STATISTICS IO OFF

SELECT
IDENTITY(INT, 1, 1) AS key_col,
D.*,
DATEADD(
minute,
CAST(ABS(RAND(CHECKSUM(NEWID()))*100-0.0000000001) AS INT)+1,
starttime) AS endtime
INTO Sessions
FROM
(
SELECT
'app' + CAST(CAST(ABS(RAND(CHECKSUM(NEWID()))*10-0.0000000001)
AS INT)+1 AS VARCHAR(10)) AS app,
'user1' AS usr,
'host1' AS host,
DATEADD(
minute,
CAST(ABS(RAND(CHECKSUM(NEWID()))*44540-0.0000000001)
AS INT)+1,
'20040101') AS starttime
FROM Nums
WHERE n <= @numrows
) AS D

ALTER TABLE Sessions ADD PRIMARY KEY(key_col)
CREATE NONCLUSTERED INDEX IDX_NCI_app_se ON Sessions (app, starttime,
endtime)

DBCC FREEPROCCACHE WITH NO_INFOMSGS
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS

SET STATISTICS IO ON

DECLARE @dt AS DATETIME
SET @dt = GETDATE()

SELECT app, MAX(concurrent) AS mx
FROM (SELECT app, starttime,
(SELECT COUNT(*)
FROM Sessions AS S2
WHERE S1.app = S2.app
AND S1.starttime >= S2.starttime
AND S1.starttime < S2.endtime) AS concurrent
FROM Sessions AS S1) AS C
GROUP BY app

PRINT CAST(@numrows AS VARCHAR(10)) + ' rows: '
+ CAST(DATEDIFF(ms, @dt, GETDATE()) / 1000. AS VARCHAR(30))
+ ' seconds.'

/*
-- Auxiliary table of numbers
CREATE TABLE Nums(n INT NOT NULL PRIMARY KEY)

SET NOCOUNT ON

DECLARE @max AS INT, @rc AS INT

SET @max = 1000000
SET @rc = 1

BEGIN TRAN
INSERT INTO Nums VALUES(1)

WHILE @rc * 2 <= @max
BEGIN
INSERT INTO Nums
SELECT n + @rc FROM Nums

SET @rc = @rc * 2
END

INSERT INTO Nums
SELECT n + @rc FROM Nums
WHERE n + @rc <= @max
COMMIT TRAN
*/

--
BG, SQL Server MVP
www.SolidQualityLearning.com


[quoted text, click to view]
Itzik Ben-Gan
9/21/2004 1:06:44 AM
You're welcome. I'm glad to see you active in the newsgroup!

--
BG, SQL Server MVP
www.SolidQualityLearning.com


[quoted text, click to view]
Itzik Ben-Gan
9/21/2004 1:15:24 AM
I'm afraid not this time. See my reply to Gert-Jan. But, it was an
interesting review of the problem I must say. And it's a good thing that we
practice logic and not other stuff. Or, in the words of Arthur Conan Doyle:
"Crime is common. Logic is rare. Therefore it is upon the logic rather than
upon the crime that you should dwell."


:-)

--
BG, SQL Server MVP
www.SolidQualityLearning.com


[quoted text, click to view]

Itzik Ben-Gan
9/21/2004 8:03:24 AM
I'm afraid I'm not attending PASS this year.

--
BG, SQL Server MVP
www.SolidQualityLearning.com


[quoted text, click to view]
Eric Sabine
9/21/2004 8:36:14 AM
I try to contribute as much as I can. Maybe I will see you at PASS next
week if you're there but I don't see you as a presenter. I will buy _you_ a
beer if you're there.

Eric


[quoted text, click to view]