all groups > sql server programming > january 2006 >
You're in the

sql server programming

group:

Outer Join and WHERE clause strange behavior


Outer Join and WHERE clause strange behavior Mark Williams
1/13/2006 11:06:01 PM
sql server programming:
I am writing a stored procedure to get aggregrate data out of a web site log
table by joining it with a view

DDL for the table:

CREATE TABLE w3cexlog (
pacifictime datetime,
[cs-method] varchar(20),
target varchar(8000),
query varchar(2000),
username varchar(100),
browser varchar(8000),
referrer varchar(2000),
status int,
substatus int,
win32status int,
[sc-bytes] bigint,
[cs-bytes] bigint,
[time-taken] bigint
)

The view is through ADSI, and pulls the Active Directory OU, samAccountName
(which corresponds to username), and full name for the user.

If the view where a base table, it's DDL would look like this

CREATE TABLE PortalUsers (
name nvarchar(512),
samAccountName nvarchar(512),
OU nvarchar(8000)
)

I want to find out how many hits where generated by users in each OU, along
with visitors from each OU, by joining PortalUsers to w3cexlog on
PortalUsers.samAccountName = w3cexlog.username. (The web site uses AD
authentication, so they must match).

Here is the query:

SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors
FROM portalusers p LEFT JOIN w3cexlog w on
p.samaccountname=w.username
GROUP BY p.ou
ORDER BY 2 DESC, 1

This query gives the expected results; OUs for which none of its users
produced a "hit" still show in the output, but with zero's in the "hits" and
"visitors" columns. However, if I add a WHERE clause to query a specific date
range:


SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors
FROM portalusers p LEFT JOIN w3cexlog w on
p.samaccountname=w.username
WERE w.pacifictime BETWEEN '20060102' AND '20060112'
GROUP BY p.ou
ORDER BY 2 DESC, 1

The OUs that produces no hits are now gone from the result set! That WHERE
clause wouldn't have removed any rows for the OUs that didn't have hits; they
where never there to begin with; so why does the behavior of the outer join
change?

In order to get around this, I constructed the following query

SELECT t1.OU, t2.hits, t2.visitors FROM
(
SELECT DISTINCT OU FROM PortalUsers
) t1
LEFT JOIN
(
SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors
FROM portalusers p LEFT JOIN w3cexlog w on
p.samaccountname=w.username
WERE w.pacifictime BETWEEN '20060102' AND '20060112'
GROUP BY p.ou
) t2
ON t1.OU = t2.OU
ORDER BY 2 DESC, 1

But the performance of this is terrible.

--
Re: Outer Join and WHERE clause strange behavior Steve Kass
1/14/2006 2:30:30 AM

[quoted text, click to view]
Mark,

This is the way things work. An outer join may contain results
that fail the ON condition, but it will never contain results that fail
the WHERE condition. The WHERE condition is a real filter
than can be thought of as being applied to the result of the
outer join.

If you have result rows with 0 hits, all the result columns from
w3cexlog in those result rows contain NULL. NULL will never
be between '20060102' and '20060112', so you'll see no rows
that failed the ON condition.

Maybe what you want is just
.... on p.samaccountname=w.username
AND w.pacifictime BETWEEN '20060102' AND '20060112'
....

Steve Kass
Drew University


Re: Outer Join and WHERE clause strange behavior ML
1/14/2006 4:46:01 AM
Another option is to leave the join unchanged and alter the where caluse like
this:

....
WHERE (w.pacifictime BETWEEN '20060102' AND '20060112' or w.pacifictime is
null)
....


ML

---
Re: Outer Join and WHERE clause strange behavior Mark Williams
1/14/2006 11:02:02 AM
Thanks Steve and ML. I should have seen it, but I was a little tired! I had
the incorrect thinking that the WHERE clause was applied before the join,
instead of after.

--


[quoted text, click to view]
Re: Outer Join and WHERE clause strange behavior Louis Davidson
1/14/2006 2:35:16 PM
It will work, but I wouldn't suggest this as a good option, I'm afraid.
This is like adding "and I really meant that is was an outer join" to the
where clause.

Or did you see some advantage with putting it in the where clause, like some
problem that doing it this way might solve?

--
----------------------------------------------------------------------------
Louis Davidson - http://spaces.msn.com/members/drsql/
SQL Server MVP
"Arguments are to be avoided: they are always vulgar and often convincing."
(Oscar Wilde)

[quoted text, click to view]

Re: Outer Join and WHERE clause strange behavior Alexander Kuznetsov
1/15/2006 12:49:35 PM
Another option is to move the condition to the ON clause:

SELECT t1.OU, t2.hits, t2.visitors FROM
(
SELECT DISTINCT OU FROM PortalUsers
) t1
LEFT JOIN
(
SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS
visitors, pacifictime
FROM portalusers p LEFT JOIN w3cexlog w on
p.samaccountname=w.username
GROUP BY p.ou
) t2
ON (t1.OU = t2.OU )
AND t2.pacifictime BETWEEN '20060102' AND '20060112'
ORDER BY 2 DESC, 1

That might perform better
Re: Outer Join and WHERE clause strange behavior ML
1/15/2006 2:09:01 PM
Since both methods yield same results the only argument would be performance.
So, which one would perform better?


ML

---
Re: Outer Join and WHERE clause strange behavior Louis Davidson
1/15/2006 9:03:01 PM
I didn't even think about this in the original reply. You would actually
have to use:

or w.username is null (since this is the criteria that the join was done
on.)

--
----------------------------------------------------------------------------
Louis Davidson - http://spaces.msn.com/members/drsql/
SQL Server MVP
"Arguments are to be avoided: they are always vulgar and often convincing."
(Oscar Wilde)

[quoted text, click to view]

Re: Outer Join and WHERE clause strange behavior Louis Davidson
1/15/2006 9:15:21 PM
Beyond what was discussed about how if the pacifictime column was nullable
(you would want to check for the join columns being null) the argument is
style and symantics.

I would argue that it is very important to write readable queries that can
be understood by the next reader AND that make sense. Performance is
important, and if the less clear version peformed measurably better (if
anything it would likely be no better than equal, but sometimes there can be
reasons to do things that don't seem kosher for performance sake, but this
is unlikely to be one of those cases.

--
----------------------------------------------------------------------------
Louis Davidson - http://spaces.msn.com/members/drsql/
SQL Server MVP
"Arguments are to be avoided: they are always vulgar and often convincing."
(Oscar Wilde)

[quoted text, click to view]

Re: Outer Join and WHERE clause strange behavior Hugo Kornelis
1/15/2006 10:33:03 PM
[quoted text, click to view]

Hi ML,

That would also include rows with w.pacifictime equal to NULL in the
base table (assuming the column is nullable); I don't think that this is
what Mark wants.

--
Re: Outer Join and WHERE clause strange behavior Hugo Kornelis
1/15/2006 11:58:09 PM
[quoted text, click to view]

Hi ML,

Have you testeed them?

(I'll gladly admit I didn't - but due to the OR in your method, I'd be
surprised if it turns out to be the faster version).

Of course, they don't yield the same results if the column is nullible.

--
Re: Outer Join and WHERE clause strange behavior ML
1/16/2006 2:08:02 AM
When discussing the term "readability" two things IMHO should be taken into
account:

1) what is the widely accepted standard? Above all, the base of readability
is whether a piece of code can be understood by most peers in the specific
industry; and

2) what is the propriatary standard in a segment of industry (e.g. a
company)? In addition to the widely accepted standards special rules and/or
principles can constitute a standard used by a company - be it internally or
in co-operation with other companies.

This is what I meant in my pervious post. When performance is the key and
the product cannot simply follow the two bases of readability an exception to
the rules is made and should be designated as such, or should constitute a
new (special) rule.

Well, at least that's my personal rule.


ML

---
Re: Outer Join and WHERE clause strange behavior ML
1/16/2006 2:10:02 AM
[quoted text, click to view]

Which column, again? The one used in the join or the one used as the filter?


ML

---
Re: Outer Join and WHERE clause strange behavior Mark Williams
1/16/2006 12:10:03 PM
I hadn't been following this thread since after the initial responses. Here
are how the queries perform (15578 rows in w3cexlog, with nonclustered
indexes on pacifictime and username)

SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors
FROM portalusers p LEFT JOIN w3cexlog w
ON p.samaccountname=w.username
WHERE (w.pacifictime BETWEEN '20060102' AND '20060114') OR w.pacifictime IS
NULL
GROUP BY p.ou
ORDER BY 2 DESC, 1

Account to SQL profiler, the above batch shows a duration of 1250
milliseconds, CPU time of 1031 milliseconds.

SELECT p.ou, count(w.username) AS hits, count(distinct w.username) AS visitors
FROM portalusers p LEFT JOIN w3cexlog w
ON p.samaccountname=w.username AND w.pacifictime BETWEEN '20060102' AND
'20060114'
GROUP BY p.ou
ORDER BY 2 DESC, 1

According to SQL Profiler, the above batch has a duration of 1233
milliseconds, and CPU time of 961 seconds. The query with the date range
restriction applied to the join appears to run ever-so-slightly faster. As
the table gets larger over time, a clearer performance difference may appear.

For the record, none of the columns in the table are nullable, so there will
never be a pacifictime or username of NULL.

Alexander, your query as written will not work because it would have to
group the 'hits' by date. I modified your suggestion into

SELECT t1.OU, COUNT(t2.username) as "hits", COUNT(DISTINCT t2.username) AS
"visitors"
FROM
(
SELECT DISTINCT OU FROM PortalUsers
) t1
LEFT JOIN
(
SELECT p.ou, w.username, w.pacifictime
FROM portalusers p LEFT JOIN w3cexlog w on
p.samaccountname=w.username
) t2
ON (t1.OU = t2.OU )
AND t2.pacifictime BETWEEN '20060102' AND '20060112'
GROUP BY t1.OU
ORDER BY 2 DESC, 1

Profiler shows a duration of 2250 milliseconds for this query.

In my very non-professional opinion, the query that has the date restriction
in the join is slightly more readable.


[quoted text, click to view]
Re: Outer Join and WHERE clause strange behavior ML
1/16/2006 3:21:02 PM
Oh, I see! Thank you for the explanation. So, is it safe to assume that the
correct method of filtering outer sets would be to do it in the join?


ML

---
Re: Outer Join and WHERE clause strange behavior Louis Davidson
1/16/2006 8:43:33 PM
[quoted text, click to view]

This is often the case :) These things take on a life of their own as we
start discussing philosopy instead of helping anymore (but it is fun
nevertheless!)

[quoted text, click to view]

It may also get worse. Query performance gets wierd as results get bigger
and bigger. Luckily it is fun for most DBA's, and it keeps us employed.

[quoted text, click to view]

This to me is the biggest concern when writing queries like this. The
NULLibility of the column should not come into play by accident. Like I
said somewhere else, just change the where clause version to a primary key
column.


[quoted text, click to view]

Re: Outer Join and WHERE clause strange behavior Hugo Kornelis
1/16/2006 9:04:19 PM
[quoted text, click to view]

Hi ML,

The one used in the filter. Here's a repro. Both queries return three
rows, but only one is equal.

CREATE TABLE Tab1 (JoinCol int NOT NULL, OtherCol int)
CREATE TABLE Tab2 (JoinCol int NOT NULL, OtherCol int)
go
INSERT INTO Tab1 (JoinCol, OtherCol)
SELECT 1, 1
UNION ALL
SELECT 2, 0
UNION ALL
SELECT 3, NULL
INSERT INTO Tab2 (JoinCol, OtherCol)
SELECT 1, 1
UNION ALL
SELECT 1, NULL
UNION ALL
SELECT 2, 0
UNION ALL
SELECT 3, NULL
go
SELECT *
FROM Tab1
LEFT JOIN Tab2
ON Tab2.JoinCol = Tab1.JoinCol
WHERE Tab2.OtherCol = 1 OR Tab2.OtherCol IS NULL
go
SELECT *
FROM Tab1
LEFT JOIN Tab2
ON Tab2.JoinCol = Tab1.JoinCol
AND Tab2.OtherCol = 1
go
DROP TABLE Tab2
DROP TABLE Tab1
go


--
Re: Outer Join and WHERE clause strange behavior Alexander Kuznetsov
1/17/2006 10:08:06 AM
[quoted text, click to view]

As the data grows, you could encounter at least 3 different situations:
1. users have on average just a few rows in w3cexlog, just a few users
don't have child rows
2. users have on average hundreds or thousands of rows in w3cexlog,
just a few users don't have child rows
3. most users don't have any rows in w3cexlog, those that do have on
average just a few rows

What performs best in situation 2 may be the worst performer in
situation 1.

[quoted text, click to view]
group the 'hits' by date.

thanks for the correction
Re: Outer Join and WHERE clause strange behavior Hugo Kornelis
1/17/2006 10:00:22 PM
[quoted text, click to view]

Hi ML,

That depends on the results that you need to get.

Your method is correct if you want to join to rows with matching account
id and time in the requested range OR no time stored in the DB.

My method (or Louis' adapted version of yours) is correct if you want to
join to rows with matching account id and time in the requested range.

In the latter case, I still rpefer my syntax to Louis' syntax becuase it
shows much clearer what the join conditions are. But that's a personal
issue, and in a performance-sensitive DB, I'd choose Louis' version
without hesitation if it turned out to be faster.

--
AddThis Social Bookmark Button