
sql server (alternate) : Update in SQL Server 2000 slow?


dberlin NO[at]SPAM alum.rpi.edu
5/7/2004 10:38:19 AM
I have two tables:

T1 : Key as bigint, Data as char(20) - size: 61M records
T2 : Key as bigint, Data as char(20) - size: 5M records

T2 is the smaller, with 5 million records.

They both have clustered indexes on Key.

I want to do:

update T1 set Data = T2.Data
from T2
where T2.Key = T1.Key

The goal is to match Key values and update the Data field of T1 only
where they match. SQL Server seems to optimize this query fairly well,
doing an inner merge join on the Key fields. However, it then does a
Hash Match to get the Data fields, and this is taking FOREVER. The
query above takes something like 40 minutes, where it seems to me the
data could be updated much more efficiently. I would expect to see
just a merge and update, like I would see in the following query:

update T1 set Data = [someconstantdata]
from T2
where T2.Key = T1.Key and T2.Data = [someconstantdata]

The above works VERY quickly, and if I were to perform the above query
5 million times (assuming my data in T2 is completely unique, so that I
would need to), it would finish very quickly, much sooner than the
previous query. Why won't SQL Server just match these up while it is
merging the data, and update in one step? Can I make it do this? If I
extracted the data in sorted order into a flat file, I could write a
program in ten minutes to merge the two tables and update in one
step, and it would fly through this, but I imagine that SQL Server is
capable of doing it, and I am just missing it.

Erland Sommarskog
5/7/2004 10:39:29 PM
Dan Berlin (dberlin@alum.rpi.edu) writes:

This query is quite different. Here SQL Server can scan T2, and for
every row where Data has a matching value it can look up the key in T1.
Since SQL Server has statistics about the data, it can tell how many
hits the condition on T2.Data will get.

In your first query, you are not restricting T2, so all of it has to
be scanned. A nested loop join would mean 5 million lookups in T1 -
probably not good. I would expect a merge join to be possible, but that
still means a scan of both tables.

First I would add this condition to the existing WHERE clause:

AND (T1.Data <> T2.Data OR
     T1.Data IS NULL AND T2.Data IS NOT NULL OR
     T1.Data IS NOT NULL AND T2.Data IS NULL)

That way you update only the rows that actually need to change.
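
Combined with the statement from the first post, the restricted update
might look roughly like the sketch below. It assumes the table and
column names given in the original question; Key is written as [Key]
because KEY is a reserved word in T-SQL, and the explicit JOIN is only
a stylistic choice, not something the thread prescribes.

```sql
-- Sketch only: T1, T2, [Key] and Data are the names from the first post.
-- Rows whose Data already matches are skipped, so they are not rewritten.
UPDATE T1
SET    Data = T2.Data
FROM   T1
JOIN   T2 ON T2.[Key] = T1.[Key]
WHERE  (T1.Data <> T2.Data
    OR (T1.Data IS NULL     AND T2.Data IS NOT NULL)
    OR (T1.Data IS NOT NULL AND T2.Data IS NULL))
```

The two IS NULL branches are there because T1.Data <> T2.Data evaluates
to unknown, not true, when either side is NULL.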

If there are plenty of other columns in the tables, I would add non-clustered
indexes on (Key, Data) for both tables, since these indexes would
cover the query.
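
As an illustration, such covering indexes could be created along these
lines. The index names are made up for the example, and whether the
indexes pay off depends on how wide the real tables are; with only Key
and Data in each table, the clustered index already covers the query.

```sql
-- Hypothetical index names; [Key] and Data are the columns from the first post.
CREATE NONCLUSTERED INDEX IX_T1_Key_Data ON T1 ([Key], Data)
CREATE NONCLUSTERED INDEX IX_T2_Key_Data ON T2 ([Key], Data)
```

Keep in mind that the index on T1 has to be maintained for every row
the UPDATE changes, so it is worth measuring both with and without it.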

--
Erland Sommarskog, SQL Server MVP, sommar@algonet.se

Books Online for SQL Server SP3 at
dberlin NO[at]SPAM alum.rpi.edu
5/10/2004 7:33:40 AM

This was very helpful, thank you!

However, there is still a large Hash Match/Aggregate being performed
that requires 45% of the resources for the query (for a T2 of 2.5M
records). A complete scan of the larger table accounts for 34% of
the query, the merge join for 19%, and the Hash Match for 45%,
effectively doubling the time the query takes to run. The larger my
T2 table is, the longer the hash takes, on a scale that increases
faster than linearly (exponentially? I'm not sure). The hash seems to
be doing the following: HASH: bmk1000, RESIDUAL: (bmk1000=bmk1000)
(T2.Data = ANY(T2.Data))
This is from Query Analyzer's estimated execution plan. Do you
know how I can avoid this hash, or why it is necessary? It really
slows the query down to an unacceptable level.

Thanks again for the help!
Erland Sommarskog
5/10/2004 9:39:51 PM
Dan Berlin (dberlin@alum.rpi.edu) writes:

Again, without access to the tables, it is difficult to give very good
suggestions. Query tuning is very much a hands-on activity.

But if the hashing is a bottleneck and is growing more than linearly,
one idea is to run the update in chunks, taking a reasonably sized
interval of key values at a time.
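
A minimal sketch of such a chunked update follows. The chunk width of
100000 is an arbitrary assumption to be tuned by testing, the variable
names are invented for the example, and the loop assumes the key values
are reasonably dense and nowhere near the upper limit of bigint.

```sql
-- Sketch only: walks the key range of T2 in fixed-width intervals.
DECLARE @lo bigint, @hi bigint, @max bigint, @step bigint

SELECT @lo = MIN([Key]), @max = MAX([Key]) FROM T2
SET @step = 100000          -- assumed chunk width, tune as needed

WHILE @lo <= @max           -- never entered if T2 is empty (@lo is NULL)
BEGIN
    SET @hi = @lo + @step

    -- Each UPDATE handles only keys in the half-open interval [@lo, @hi).
    UPDATE T1
    SET    Data = T2.Data
    FROM   T1
    JOIN   T2 ON T2.[Key] = T1.[Key]
    WHERE  T1.[Key] >= @lo AND T1.[Key] < @hi
      AND  T2.[Key] >= @lo AND T2.[Key] < @hi

    SET @lo = @hi
END
```

Outside an explicit transaction each UPDATE commits on its own, which
also keeps the individual transactions small. If the keys are sparse
over a very wide range, many iterations will touch no rows, and picking
the interval boundaries from the data itself may work better.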

The hashing is on Data, I would guess in order to locate the rows that
need updating. Hashing is probably better than a nested-loop join.

Could you post:

o CREATE TABLE and CREATE INDEX statements for your tables?
o The query as it looks now?
o The query plan you get?

This would leave me a little less in the dark.


--
Erland Sommarskog, SQL Server MVP, sommar@algonet.se

Books Online for SQL Server SP3 at