all groups > sql server dts > july 2006 >
You're in the

sql server dts

group:

6-7 days to Load Data


RE: 6-7 days to Load Data Frans van Bree
7/13/2006 7:43:02 AM
sql server dts:
Personally I would first load it into a staging table, e.g. STG_SOURCE with 1
columns called SOURCEFIELD of NVARCHAR(1000) So, only using the datapump's
Copy Column function. Or even better, do a BULK INSERT. Then you would have
one table with one field with 56,171,938 records.

After that, do your cleansing, e.g. with a massive SQL insert statement into
a second table with the proper columns. Let that statement perform your
calculations, etc.

6-7 days to Load Data MikeV06
7/13/2006 8:12:11 AM
I am importing a flat file consisting of variable length records terminated
by {CR}{LF}. The record sizes vary between 65 bytes and 79 bytes (the last
field is a variable length name field). I am reading in the record as a
single field and then using a modification to the Transform VBScript to
load the database.

The DTS process has been running about 12 hours and has processed 4,521,000
rows (according to the DTS process status) resulting in a database mdf
having a size of about 378,496 KB. The source file is a text file whose
size is 3,651,176 KB. If each source record averages about 65 bytes, the
file contains 56,171,938 records. At this rate it will take 6 days to load
the database which will be about 4GB in size. I have looked at the initial
data being loaded and it looks correct (compared to the source records).

Can anyone please suggest anything I can do to speed up this load?

Thanks.

'**********************************************************************
' Visual Basic Transformation Script
' Copy each source column to the
' destination column
'************************************************************************
'Put in Debug Loop
Public Counter
Function Main()
Dim Rec, Amt, Dig, NDig, Neg, Amt2
If isNull(Counter) Then Counter = 1
Rec = DTSSource("Col001")
DTSDestination("exp_big_RecordType") = Mid(Rec,1,1)
DTSDestination("exp_big_BatchNum") = Mid(Rec,2,2)
DTSDestination("exp_big_SupplierNum") = Mid(Rec,4,5)
DTSDestination("exp_big_VouchNum") = Mid(Rec,9,6)
DTSDestination("exp_big_InvoiceNum") = Mid(Rec,15,8)
DTSDestination("exp_big_AcctNum") = Mid(Rec,23,4)
DTSDestination("exp_big_StoreNum") = Mid(Rec,27,3)
DTSDestination("exp_big_Date") = Mid(Rec,50,8)
DTSDestination("exp_big_SupplierName") = Mid(Rec,58,23)
Amt = Mid(Rec,30,8)
Dig = Right(Amt,1)
Neg = 1
Select Case Dig
Case "@" NDig = "0"
Case "A" NDig = "1"
Case "B" NDig = "2"
Case "C" NDig = "3"
Case "D" NDig = "4"
Case "E" NDig = "5"
Case "F" NDig = "6"
Case "G" NDig = "7"
Case "H" NDig = "8"
Case "I" NDig = "9"
Case "P" NDig = "0"
Neg = -1
Case "Q" NDig = "1"
Neg = -1
Case "R" NDig = "2"
Neg = -1
Case "S" NDig = "3"
Neg = -1
Case "T" NDig = "4"
Neg = -1
Case "U" NDig = "5"
Neg = -1
Case "V" NDig = "6"
Neg = -1
Case "W" NDig = "7"
Neg = -1
Case "X" NDig = "8"
Neg = -1
Case "Y" NDig = "9"
Neg = -1
End Select
Amt = Left(Amt,7) & NDig
Amt2 = Round(((Amt * Neg) /100),2)

DTSDestination("exp_big_Amount") = Amt2

Main = DTSTransformStat_OK
Re: 6-7 days to Load Data Frans van Bree
7/13/2006 10:13:01 AM
[quoted text, click to view]

I don't know his source, so this is the safest.

[quoted text, click to view]

Batches would indeed be better. If a batch fails, just pickup where you left
off. You're right, just wanted to keep this as simple as possible.

[quoted text, click to view]

Space is not released to the OS if these actions are logged. It is not
recommended to turn on "auto-shrink" for a database, so the db would no
shrink/remain the same, but still increase in your scenario. In the same
order as in my scenario. Ofcourse, in my scenario you could truncate the
staging after you're finished. > 1000x faster than deleting 100000 rows after
each batch. I even expect the deletes would turn him back to his original
scenario, that slow it is.

Thanks for your addition to the tread.

Greetings,

Re: 6-7 days to Load Data Roy Harvey
7/13/2006 11:00:09 AM
That sounds like a workable alternative to me, though I would like to
add a couple of minor points.

First, unless the input is unicode I would use VARCHAR rather than
NVARCHAR. NVARCHAR would take twice the space, rather costly on a
table of this size if not required.

Second, I suggest that it be a two-column table, an INT identity
column and the (N)VARCHAR column. This could be a major advantage
when it comes to loading the real table from the staging table. I
would VERY much want to be able to do that in batches, and having an
identity column would make working in batches trivial.

And on the subject of working in batches, if the entire table were
transformed in one pass it would require double storage space. If
done in batches - say 100,000 rows at a time, to pick a number out of
the air - the processed rows for a batch could be deleted from the
staging table after the batch is loaded. Space released from the
staging table would become available for the real table.

Roy Harvey
Beacon Falls, CT

On Thu, 13 Jul 2006 07:43:02 -0700, Frans van Bree
[quoted text, click to view]
Re: 6-7 days to Load Data Frans van Bree
7/13/2006 11:15:02 AM
[quoted text, click to view]

You're right. I was thinking of that after I sent the post. I am still
thinking on doing a test to see it the database file (MDF) will or will not
increase in your scenario if auto-shrink is off. Because I am not so sure
after all that it will increase (that much). {Depending of course on the how
much bytes there are extra in the fields of the second table, because the
Re: 6-7 days to Load Data MikeV06
7/13/2006 12:03:11 PM
Thanks to both of you for your ideas. I will go give it a whirl and see
what I can make of it.

[quoted text, click to view]
Re: 6-7 days to Load Data Roy Harvey
7/13/2006 1:17:43 PM
On Thu, 13 Jul 2006 10:13:01 -0700, Frans van Bree
[quoted text, click to view]

Good point on the performance of the DELETEs. However, I was not
referring to releasing space back to the OS, just freeing up space
within the database for the other table.

Re: 6-7 days to Load Data Roy Harvey
7/13/2006 3:05:03 PM
On Thu, 13 Jul 2006 11:15:02 -0700, Frans van Bree
[quoted text, click to view]

There is another factor to consider about SQL Server re-using freed
pages. It takes time for the pages to be returned to the pool. That
process is done asyncronously, as the system has time for it, so it
might not actually be back in the pool and available if needed right
away.

Re: 6-7 days to Load Data MikeV06
7/15/2006 10:20:58 AM
[quoted text, click to view]

A couple of update comments about my big file processing.

CREATE TABLE Stage_Big (
-- stg_big_Id_Num int Identity (1,1),
stg_big_Record varchar (200) Null
)
GO

Doing the identity column seems to add additional seconds to the process.
Further, another problem exists which I will mention below.

This BULK INSERT takes 3.24 minutes (1 M of rows = 3.24 minutes). That
means I can insert my 57 Meg of records in about 3 hours. A definite
improvement, but I have not yet worked on the script to convert it to
useable columns.

BULK INSERT Exp_Big.dbo.[Stage_Big]
FROM 'C:\Work\Exp_Big.asc'
WITH (
-- FORMATFILE = 'c:\work\exp_big.fmt',
BATCHSIZE = 10000,
FIRSTROW = 1,
LASTROW = 1000000)
Go

One must use a format file if the Identity column is defined since the
source data does not have an identity field.


8.0
2
1 SQLCHAR 0 0 "" 0 stg_big_Id_Num
SQL_Latin1_General_Cp437_BIN
2 SQLCHAR 0 200 "\r\n" 2 stg_big_Record
SQL_Latin1_General_Cp437_BIN


Question: The log file starts to get really big. I looked for a way to
compact the data/log file but did not find anything. Does a way exist to
Re: 6-7 days to Load Data Roy Harvey
7/15/2006 11:33:35 AM
[quoted text, click to view]

Or, create a view that excludes the identity column and load to the
view name.

[quoted text, click to view]

For a process like this I would generally want the recovery model for
the database set to Simple. This will allow the log to be truncated
after each batch - each 10,000 rows in your sample code. That assumes
that the database is not being used for anything else at the time
which requires maintaining logs, and given what you are doing that
would be a really good idea in any case.

It would also be a good idea, before the mass load, to make sure the
database size is set large enough, or that the increment of growth is
something large.

Roy Harvey
Re: 6-7 days to Load Data MikeV06
7/15/2006 11:50:25 AM
[quoted text, click to view]

Yes, I shoulda thought of that; thank you.

[quoted text, click to view]

Simple is exactly what I needed. Again, thank you.

[quoted text, click to view]

Have it set to unlimited.

[quoted text, click to view]

Re: 6-7 days to Load Data Roy Harvey
7/15/2006 3:15:24 PM
[quoted text, click to view]

The max size can be unlimited, but the initial allocation might not be
large, and the unit of growth must be finite. It is set for each
file, either as a percent or a number of megabytes.

Re: 6-7 days to Load Data MikeV06
7/15/2006 11:29:40 PM
[quoted text, click to view]

Gotcha. Mine was set to 10%. I have it set to 25% now.

Based on my last set of computations, it will now take about *24 hours*
rather than 6-7 days to load the final table. If anyone sees anything else
I should do to speed things up, please tell me.

Thanks.

To summarize what I have done thus far:

To load ~47M records into the stage table using a view required a little
over an hour:

CREATE TABLE Stage_Big (
stg_big_Id_Num int Identity (1,1),
stg_big_Record varchar (200) Null
)

I then indexed stg_big_Id_Num which took about 15 minutes.

This required 3 minutes

INSERT INTO Expenses_Big
(exp_big_RecordType, exp_big_BatchNum, exp_big_SupplierNum,
exp_big_VouchNum,
exp_big_InvoiceNum, exp_big_AcctNum, exp_big_StoreNum,
exp_big_Date, exp_big_SupplierName,
exp_big_Amount)
SELECT SUBSTRING(stg_big_Record, 1, 1),
SUBSTRING(stg_big_Record, 2, 2),
SUBSTRING(stg_big_Record, 4, 5),
SUBSTRING(stg_big_Record, 9, 6),
SUBSTRING(stg_big_Record, 15, 8),
SUBSTRING(stg_big_Record, 23, 4),
SUBSTRING(stg_big_Record, 27, 3),
SUBSTRING(stg_big_Record, 50, 8),
SUBSTRING(stg_big_Record, 58, 23),
dbo.fnComputeAmt(SUBSTRING(stg_big_Record, 30, 8))
FROM Stage_Big
WHERE stg_big_Id_Num < 100000

Hence, it will take *24* hours to load the entire expense table. Better
than 6-7 days, but still a long time.

The function is

CREATE FUNCTION fnComputeAmt (@Amt_string AS char (8))
RETURNS money
AS
BEGIN
DECLARE @Neg INT,
@DIG char (1),
@Ndig char (1),
@AmtMoney money

SET @Dig = Right(@Amt_string,1)
SET @Neg = 1
IF @Dig BETWEEN 'P' and 'Y'
BEGIN
SET @Neg = -1
END
SELECT @Ndig =
CASE @DIG
WHEN '@' THEN '0'
WHEN 'A' THEN '1'
WHEN 'B' THEN '2'
WHEN 'C' THEN '3'
WHEN 'D' THEN '4'
WHEN 'E' THEN '5'
WHEN 'F' THEN '6'
WHEN 'G' THEN '7'
WHEN 'H' THEN '8'
WHEN 'I' THEN '9'
WHEN 'P' THEN '0'
WHEN 'Q' THEN '1'
WHEN 'R' THEN '2'
WHEN 'S' THEN '3'
WHEN 'T' THEN '4'
WHEN 'U' THEN '5'
WHEN 'V' THEN '6'
WHEN 'W' THEN '7'
WHEN 'X' THEN '8'
WHEN 'Y' THEN '9'
END
SET @Amt_string = LEFT(@Amt_string,7) + @Ndig
SET @AmtMoney = Round( CONVERT(Money, @Amt_string) * @Neg /100, 2)

RETURN @AmtMoney
Re: 6-7 days to Load Data Paul Shapiro
7/16/2006 11:09:46 AM
You might get a significant improvement if you can eliminate the function
call. Maybe add a simple table to map the @DIG to @Ndig and then make the
rest as inline sql?
Paul Shapiro

[quoted text, click to view]

Re: 6-7 days to Load Data Roy Harvey
7/16/2006 12:46:38 PM
On Sun, 16 Jul 2006 11:09:46 -0400, "Paul Shapiro"
[quoted text, click to view]

I agree it is worth trying. I think thes expression below does the
job to resolve to the string expression that needs to be
converted/rounded.

CASE WHEN SUBSTRING(stg_big_Record, 37, 1) BETWEEN 'P' and 'Y'
THEN '-'
ELSE '0'
END +
SUBSTRING(stg_big_Record, 30, 7) +
CASE SUBSTRING(stg_big_Record, 37, 1)
WHEN '@' THEN '0'
WHEN 'A' THEN '1'
WHEN 'B' THEN '2'
WHEN 'C' THEN '3'
WHEN 'D' THEN '4'
WHEN 'E' THEN '5'
WHEN 'F' THEN '6'
WHEN 'G' THEN '7'
WHEN 'H' THEN '8'
WHEN 'I' THEN '9'
WHEN 'P' THEN '0'
WHEN 'Q' THEN '1'
WHEN 'R' THEN '2'
WHEN 'S' THEN '3'
WHEN 'T' THEN '4'
WHEN 'U' THEN '5'
WHEN 'V' THEN '6'
WHEN 'W' THEN '7'
WHEN 'X' THEN '8'
WHEN 'Y' THEN '9'
END

Another thought is to create a view on Stage_Big that handles all the
substrings and just leaves you with strings to convert. The
expression above could be used in the view.

And it is worth mentioning that the MONEY data type is generally to be
avoided. DECIMAL is much preferred.

Roy Harvey
Beacon Falls, CT

[quoted text, click to view]
Re: 6-7 days to Load Data Frans van Bree
7/17/2006 1:31:01 AM
Mike,

Good to see there is so much response to your question! Still, the major
part of the performance impact in your proces is in the BULK INSERT step. I
think you can optimize it considerable using the TABLOCK hint in the
WITH-clause. Then, there are no more row-level locks AND the process will be
minimally logged in your transaction log file.

From BOL:
"To help prevent the transaction log from running out of space, a minimally
logged bulk copy can be performed if all of these conditions are met:
The recovery model is simple or bulk-logged.
The target table is not being replicated.
The target table does not have any triggers.
The target table has either 0 rows or no indexes.
Re: 6-7 days to Load Data MikeV06
7/20/2006 7:51:32 PM
[quoted text, click to view]

Always hard to know if a function is faster/slower than a lookup table. I
will try some timing runs to see if I can determine which is the case for
this data.

Re: 6-7 days to Load Data MikeV06
7/20/2006 7:54:53 PM
[quoted text, click to view]

Very nice, thank you.

[quoted text, click to view]

Uhm, that is useful to try as well.

[quoted text, click to view]

I had been trying both. Why is this so?

Re: 6-7 days to Load Data MikeV06
7/20/2006 7:56:00 PM
[quoted text, click to view]

Off to BOL I must go.

Re: 6-7 days to Load Data Roy Harvey
7/21/2006 11:16:47 AM
[quoted text, click to view]

MONEY is not a standard SQL datatype, and I have heard there are
problems with rounding. Use of the MONEY datatype goes back to the
days before SQL Server had DECIMAL or NUMERIC, the behavior of which
follows industry standards.

AddThis Social Bookmark Button