Groups | Blog | Home
all groups > sql server data warehouse > november 2007 >

sql server data warehouse : Re: BULK INSERT performance


Scott Nichol
11/1/2007 1:37:32 PM
Thanks for the responses. If the speed of the BULK INSERT were the only criterion, dropping the non-clustered index would be great.
Without it, but with the clustered index and data still there, I load at about 20,000 rows/sec. Unfortuntely, re-building an index
on a table with 500+ million records takes much longer than the 3000 seconds to load with the index intact.

What I am really wondering is whether anyone else has a similar scenario (clustered and non-clustered indexes on a large table) and
whether the affect on BULK INSERT performance is the same order of magnitude as I am experiencing (2 order of magnitude drop).

--
Scott Nichol

[quoted text, click to view]

Tom Moreau
11/1/2007 3:46:54 PM
What fill factors do you have? I'm thinking that there may be a lot of page
splitting going on.

There could also be disk issues, but you did say that things looked OK
there - "Perfmon does not indicate any resource being pinned". What is the
average disk queue length? Is it > 24?

It may come to pass that you need to use horizontal partitioning and break
the table up.

--
Tom

----------------------------------------------------
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA, MCITP, MCTS
SQL Server MVP
Toronto, ON Canada
https://mvp.support.microsoft.com/profile/Tom.Moreau


[quoted text, click to view]
Thanks for the responses. If the speed of the BULK INSERT were the only
criterion, dropping the non-clustered index would be great.
Without it, but with the clustered index and data still there, I load at
about 20,000 rows/sec. Unfortuntely, re-building an index
on a table with 500+ million records takes much longer than the 3000 seconds
to load with the index intact.

What I am really wondering is whether anyone else has a similar scenario
(clustered and non-clustered indexes on a large table) and
whether the affect on BULK INSERT performance is the same order of magnitude
as I am experiencing (2 order of magnitude drop).

--
Scott Nichol

[quoted text, click to view]

Scott Nichol
11/1/2007 11:54:08 PM
Good questions. Fill factor is 80% (the default?). The write pattern in general is no writes, then bursts every two minutes or so
at checkpoint. During bursts, write queue goes to 100 or more. Each burst is about 8 seconds. The read pattern is more consistent
and lighter, with read queue never higher than 6, but normally 0 (and that's with other concurrent activity in the database). The
reads are from the RAID 10 set with the non-clustered index. There is very little reading from the set with the clustered index.

On a development machine where I can run the BULK INSERT without other activity, the disk activity on the RAID set with the
clustered index consists of one write burst of 4 seconds. The disk activity on the RAID set with the non-clustered index shows
consistent reads for the rest of the load, with intermiittent write bursts.

The feeling we get from the reads is that SQL Server is reading through the existing index nodes to find leaves that need to be
updated (splitting when necessary, etc.).

The values in the non-clustered index are GUIDs and thus distributed throughout the index.

--
Scott Nichol

[quoted text, click to view]

Tom Moreau
11/2/2007 6:31:25 AM
The fact that you're using GUIDs says much. In SQL 2005, they have a new
function - NEWSEQUENTIALID() - that can alleviate the problems that you get
with NEWID(). NEWID() creates random GUIDs, so when you do INSERTs, they're
all over the place. I would try lowering the fill factor on the
nonclustered index. Start with 50 or 60. After the bulk insert, defrag the
index.

Try running DBCC SHOWCONTIG to report how fragmented that index is.

As for the clustered index, is it a monotonically increasing key? If so,
make the fill factor 100, since a lower fill factor won't buy you anything.

--
Tom

----------------------------------------------------
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA, MCITP, MCTS
SQL Server MVP
Toronto, ON Canada
https://mvp.support.microsoft.com/profile/Tom.Moreau


[quoted text, click to view]
Good questions. Fill factor is 80% (the default?). The write pattern in
general is no writes, then bursts every two minutes or so
at checkpoint. During bursts, write queue goes to 100 or more. Each burst
is about 8 seconds. The read pattern is more consistent
and lighter, with read queue never higher than 6, but normally 0 (and that's
with other concurrent activity in the database). The
reads are from the RAID 10 set with the non-clustered index. There is very
little reading from the set with the clustered index.

On a development machine where I can run the BULK INSERT without other
activity, the disk activity on the RAID set with the
clustered index consists of one write burst of 4 seconds. The disk activity
on the RAID set with the non-clustered index shows
consistent reads for the rest of the load, with intermiittent write bursts.

The feeling we get from the reads is that SQL Server is reading through the
existing index nodes to find leaves that need to be
updated (splitting when necessary, etc.).

The values in the non-clustered index are GUIDs and thus distributed
throughout the index.

--
Scott Nichol

[quoted text, click to view]

Scott Nichol
11/2/2007 8:14:55 AM
[quoted text, click to view]

That sounds very useful. One more reason to look forward to migration next year.

[quoted text, click to view]

Good idea: I will definitely check/alleviate fragmentation. I may toy with the fill factor on the dev machine first.

[quoted text, click to view]

No such luck. It is 6 columns. Because the first column is a date, *most* of the data added each day is at the "end" of the index.
Unfortunately, there are some data laggards that arrive days later, so the BULK INSERT is not a pure append.

Thanks for all your help.

--
Scott Nichol

[quoted text, click to view]
Dan Guzman
11/2/2007 8:51:08 AM
[quoted text, click to view]

As Tom said, random GUID values are a real bulk insert performance killer.
Buffer efficiency with these random decreases proportionally with table size
until you get to the point of requiring at least one I/Os for every row
inserted.

One workaround I've used with SQL 2000 is to use a composite key instead of
GUID alone. I don't know of the details of your application but in my case,
I used a unique index consisting of FileId (a sequential integer) and
RecordId (a GUID value). This greatly improved bulk insert performance with
very large tables.

--
Hope this helps.

Dan Guzman
SQL Server MVP

[quoted text, click to view]
Scott Nichol
11/2/2007 11:19:14 AM
Thanks. You confirm our suspicions about the GUID: we seem to have lots of single I/Os when the index gets updated. There's
virtually no I/O queuing and the machine does not seem taxed at all.

If we use the sequential integer approach, we only need the GUID in the index when we hit 4G rows and start duplicating the integer
values we generate (assuming we use negative as well as positive values), right?

Since the value of the index is used as a pseudo-FK from another table, we'd have some work there as well. Not to mention, of
course, adding the new sequential column to the existing half billion rows. Too bad we did not attack the problem earlier!

It sounds like NEWSEQUENTIALID() would help us tremendously if we moved to SQL Server 2005. Is that correct? We want to migrate
during 2008 anyway. If it might do at least nearly as well, it would certainly be easier than retrofitting with the sequential
integer.

--
Scott Nichol

[quoted text, click to view]
Dan Guzman
11/2/2007 11:47:10 AM
[quoted text, click to view]

You don't need the GUID in the index at all if you go with a bigint and let
SQL Server assign the values with IDENTITY. The application I mentioned in
my response provided both the integer and GUID values to SQL Server. This
allowed the application to assign the surrogate key and bulk insert into
related tables.

[quoted text, click to view]

I learned this same lesson the same lesson the hard way. I started with
only GUID clustered PK and found that bulk performance dropped at about 5M
rows. I then changed to a non-clustered index to mitigate the problem but
performance still dropped off with larger tables. Once I added the integer,
I was able to use the clustered primary key with good insert performance.

My experience was also a good lesson on PMON. The only way to really know
if you have a disk bottleneck with a batch application is to know the
maximum throughput the disk subsystem can sustain. Once the limit is hit,
the only ways to get more throughput is to either add hardware or revise the
indexing strategy/application design.

[quoted text, click to view]

Yes, a NEWSEQUENTIALID() default constraint (available in SQL 2005 and
2008), a bigint identity, or the composite key method will certainly improve
insert performance.

--
Hope this helps.

Dan Guzman
SQL Server MVP

[quoted text, click to view]
Scott Nichol
11/2/2007 10:08:15 PM
Trying to avoid the pain of making any "real" change, I implemented my own alternative to NEWSEQUENTIALID(), and it ROCKS.

Using the function to generate GUID values (during data massaging) to load, BULK INSERT on my development machine took just 23
seconds to load each of two files, one 110,766 rows, the other 118,829. That's 4991 rows/sec on average. For comparison, 117,755
rows loaded in 14 seconds (8411 rows/sec) with the non-clustered index dropped. With "normal" GUIDs, I was loading just 90 rows/sec
on that rig. I look forward to seeing this on our production box.

The caveats are

1. I had dropped the non-clustered index to run the test without it, so the index that subsequent tests ran against is a fresh
re-build, presumably as unfragmented as possible.

2. The function I used to generate GUIDs was taken from an article on the web at
http://www.informit.com/articles/article.aspx?p=25862&seqNum=7. It combines 10 bytes of a normally generated GUID with 6 bytes of
the current time. The time portion provides the sequential-ness; the GUID part the random-ness. The algorithm seems very unlikely
to produce a duplicate, but I will review it to confirm before trying in production.

Thanks for all the great feedback.
--
Scott Nichol

[quoted text, click to view]
Dan Guzman
11/3/2007 7:40:02 AM
[quoted text, click to view]

I'm glad you found an easy solution. As long as you use a primary key to
guarantee unique values, I don't see a problem using it.

--
Hope this helps.

Dan Guzman
SQL Server MVP

[quoted text, click to view]
AddThis Social Bookmark Button