Groups | Blog | Home
all groups > sql server programming > march 2004 >

sql server programming : Counting with distinct - Optimizer Problems


Shawn Meyer
3/16/2004 9:55:14 PM
I have a table defined as :

CREATE TABLE [dbo].[job_13_event] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[recipient_id] [int] NULL ,
[event_type] [int] NULL ,
[mail_type] [int] NULL ,
) ON [PRIMARY]

There is about 5 million rows in this table;
1,063,337 have event_type=3;

When I run the folowing statement it takes 13 seconds
to run. The execution plan shows a table spool/eager spool taking
32%.

SELECT
count(recipient_id) mail_type_count,
count(distinct recipient_id) dist_count,
mail_type
from
job_13_event where event_type = 3
GROUP BY mail_type

When I run the two counts as seperate statements, the count returns in 2
seconds
and the count distinct returns in 3.

SELECT
count(recipient_id) mail_type_count,
mail_type
from
job_13_event where event_type = 3
GROUP BY mail_type

SELECT
count(distinct recipient_id) dist_count,
mail_type
from
job_13_event where event_type = 3
GROUP BY mail_type

If I do a join of the two statements and use FORCEPLAN,
I can get the results to come back in 7 Seconds. Here is the sql for that:

SET FORCEPLAN ON
select a.dist_count, a.mail_type, b.mail_type_count from
( SELECT
count(distinct recipient_id) dist_count,
mail_type
from job_13_event
where event_type = 3
GROUP BY mail_type) a,
( SELECT
count(1) mail_type_count,
mail_type
from job_13_event
where event_type = 3
GROUP BY mail_type ) b
where a.mail_type = b.mail_type

I would think that the optimizer could calculate the first simple statment
by :
sorting by the mail_type and then recipient_id and doing a count for one and
a distinct count for the recipient_id
in one scan. In that case you would get the faster operation the count(1)
essentially for free, resulting in a near 4 second
result time.

Is there something that I am missing that would make my first statement run
as fast or faster than using
the forceplan ???

Thanks.
Shawn

oj
3/16/2004 10:48:24 PM
Uri,

The trick here is that the indexed view has already materialized. Thus,
sqlserver will use this materialized plan instead of going back to the base
table(s) and perform compilation/execution. So, it's really not the same if you
compare the view's plan with the basetable's plan. my 2c.

--
-oj
http://www.rac4sql.net


[quoted text, click to view]

Steve Kass
3/17/2004 12:53:16 AM
Nice trick, Uri. ;)

I'll add that Shawn isn't missing anything, in my opinion. Queries with
both count(*) and count(distinct ...) are often real trouble for the
optimizer, and some user intervention is probably necessary. The view
trick will help with the count(*) and can be joined to the rest of the
query, but it won't speed up the count(distinct..). I don't know if
NOEXPAND and a join will force a good plan - my guess is it will help,
but whether this solution is better than user intervention without a
view depends on how often this query needs to be run and on how
frequently there are inserts and deletes from the table. In any case,
if Shawn uses this trick, the column to replace T.i should be the
primary key column or any column that is never or very infrequently
updated, so that only deletes and inserts will contribute to the overhead.

SK

[quoted text, click to view]
Steve Kass
3/17/2004 7:18:31 AM
I was actually hoping that Uri would mention that the code he posted was
mine and not his idea.
http://groups.google.com/groups?q=%22cast+i+as+bit

SK

[quoted text, click to view]
Uri Dimant
3/17/2004 7:34:14 AM
Shawn
I'd recommend you create an indexed view .
create table T (
i int
)
go

create view T_count with schemabinding
as
select cast(i as bit) as val, count_big(*) T_count from dbo.T
group by cast(i as bit)
go

create unique clustered index T_count_uci on T_count(val)
go
insert into T
select OrderID
from Northwind..[Order Details]
go

select sum(T_count) from T_count with (noexpand)
-- uses an efficient query plan on the materialized view

go
drop view T_count
drop table T


[quoted text, click to view]

Uri Dimant
3/17/2004 8:34:14 AM
Steve
As I mentioned yesterday I use this your script to demonstrate to our
develpers that a indexed view uses more efficient execution query plan.
Thank you.

[quoted text, click to view]

Uri Dimant
3/17/2004 8:56:39 AM
OJ
You are right. I was talking about an indexed views and did not mention
about 'normal' views
..

[quoted text, click to view]

Steve Kass
3/17/2004 11:41:52 AM
Uri,

You didn't declare it as your own, but I think it is polite to mention
the source when you cut and paste someone else's idea like you did.

Steve


[quoted text, click to view]
Uri Dimant
3/17/2004 3:02:39 PM
Steve
Don't understand you
Did you mean that i am stolen your code and declared it as my own?




[quoted text, click to view]

Gert-Jan Strik
3/17/2004 3:22:26 PM
Shawn,

The following rewrite of the statement might perform considerably
better. I agree that the optimizer can always be improved, even in
relatively complex situation as the one you describe.

SELECT SUM(T1.RecipientCount) AS mail_type_count
, COUNT(*) AS dist_count
, T1.mail_type
FROM (
SELECT mail_type,recipient_id,COUNT(*) AS RecipientCount
FROM job_13_event
WHERE event_type=3
AND recipient_id IS NOT NULL
GROUP BY mail_type,recipient_id
) AS T1
GROUP BY T1.mail_type

Hope this helps,
Gert-Jan


[quoted text, click to view]

--
Uri Dimant
3/18/2004 7:42:34 AM
Steve
I got it.

[quoted text, click to view]

Delbert Glass
3/18/2004 10:55:57 AM
The query shown below should be faster.

-- the table

--drop table job_13_event
CREATE TABLE [dbo].[job_13_event] (
[id] [int] IDENTITY (1, 1) NOT NULL ,
[recipient_id] [int] NULL ,
[event_type] [int] NULL ,
[mail_type] [int] NULL ,
) ON [PRIMARY]

-- some raw data

insert into job_13_event
(recipient_id,event_type,mail_type)
values (null,null,null)

insert into job_13_event
(recipient_id,event_type,mail_type)
values (100,1,300)

insert into job_13_event
(recipient_id,event_type,mail_type)
values (111,3,310)
insert into job_13_event
(recipient_id,event_type,mail_type)
values (111,3,310)
insert into job_13_event
(recipient_id,event_type,mail_type)
values (112,3,311)

insert into job_13_event
(recipient_id,event_type,mail_type)
values (121,3,320)
insert into job_13_event
(recipient_id,event_type,mail_type)
values (122,3,320)
insert into job_13_event
(recipient_id,event_type,mail_type)
values (122,3,320)
insert into job_13_event
(recipient_id,event_type,mail_type)
values (122,3,320)

-- peek at the raw data in the table
select *
from job_13_event
order by mail_type, recipient_id

-- sad query
SELECT
count(recipient_id) mail_type_count,
count(distinct recipient_id) dist_count,
mail_type
from
job_13_event where event_type = 3
GROUP BY mail_type
ORDER BY mail_type

-- happy query
-- But not joyful due to
-- second sort operator which seems to be unnecessary.
SELECT
sum(cnt)/2 AS mail_type_count,
sum(1-G_recipient_id) AS dist_count,
mail_type
FROM (
SELECT
GROUPING(mail_type) AS G_mail_type,
mail_type,
GROUPING(recipient_id) AS G_recipient_id,
recipient_id,
count(*) AS cnt
FROM job_13_event
WHERE event_type = 3
AND recipient_id IS NOT NULL
GROUP BY mail_type, recipient_id WITH ROLLUP
) DT
WHERE G_mail_type = 0
GROUP BY mail_type
ORDER BY mail_type

Bye,
Delbert Glass

[quoted text, click to view]

Delbert Glass
3/18/2004 10:57:51 AM
[quoted text, click to view]

(letting "it" refer to the wrong thing)
We know you have it ;-)

Bye,
Delbert Glass

Shawn Meyer
3/18/2004 4:31:51 PM
Thanks for your help.

[quoted text, click to view]

AddThis Social Bookmark Button