all groups > sql server programming > august 2004 >
You're in the

sql server programming

group:

converting log file to db table


converting log file to db table gordy
8/7/2004 11:33:10 PM
sql server programming: Dear Experts,
I have a log file that is stored in a nested tree format.
The logical record is this:
int datetime
int eventid
int deploymentid
int recipientid
int templateid

If the data were stored in a typed file it would be 20bytes/record which
would be a problem because we add about 50-100 million records a
month. So the physical record is stored like so:
year
-month
--deploymentid
---templateid
----eventid
-----day
------hour
-------minute
--------recipientid

Querying the file is quick and the growth rate is manageable even at
100m/month.

I would like to store this data in the database so I could use it in
other queries, but I'm having trouble coming up with an efficient way to
store it. My first attempt, I just created a table to store the records
flat. There doesn't seem to be a way to let sql exploit the redundancy
in the data so that it doesn't create a 20gb database. After importing a
flattened file and creating a single clustered index (same column order
as the nested file) the database size was rediculous. Also, queries on
this table are taking 5 to 10 minutes which is unacceptable for my
application (should be less than 30 seconds).

Is there a table type or index type I should use for this type of data?
Does anyone have any ideas? btw I'm using MSSQL Server 2000 Enterprise
Re: converting log file to db table John Bell
8/8/2004 8:43:46 AM
Hi

Thanks for the description although it is still not clear what you are doing
in respect to the SQL Server side. Posting DDL (CREATE TABLE... CREATE INDEX
statements etc... http://www.aspfaq.com/etiquette.asp?id=5006 ) and example
data ( as INSERT statements http://vyaskn.tripod.com/code.htm#inserts )
along with the query you are running then it may be possible to help you. In
the mean time you may want to try looking at the query plans and possibly
using the index tuning wizard to see if you can improve things. More
information on these is in books online. Also, you don't say when the index
is created. If it was created before the data load, try changing this to
afterwards.

John

[quoted text, click to view]

Re: converting log file to db table Richard Quinn
8/8/2004 10:25:06 AM
On Sat, 07 Aug 2004 23:33:10 -0700, gordy <gordy@dynamicsdirect.com>
[quoted text, click to view]

Nice. This is already a particularly efficient way to store data,
balanced trees would give you problems updating the index.

[quoted text, click to view]

How many bytes / record does your current storage scheme require?

[quoted text, click to view]

A bad idea. You should use several tables so that SQL server can
exploit the relationships. Considering that a datetime field alone
requires 8 bytes, you are going to need a lot of space very soon with
your single table.

[quoted text, click to view]

Yes, the problem you have is from using an ISAM style schema.

[quoted text, click to view]

Your hardware is probably not the problem, if your file was running on
it then the DB should provide at least equal performance. There are no
different table types to choose from. The index could no doubt be
improved (did you update the statistics? Is it a covering index?) but
that won't solve your problem: your data is not normalized.

Create a table with one record for each month and year.
Create a table each for deploymentid, eventid, templateid, recipientid
(assuming they are not unique each time but draw from a limited set of
possibles).

Create intersection tables comprising of all the possibilities of
combining the previous IDs (assuming we are not talking huge numbers
here).

For each log, you need now create a single row with no more
information than the time at which it occurred, and any other
information unique to this event (perhaps the eventid, it is unclear
from your description). The general idea is to normalize your data so
that you need not repeat the same discrete data with each new tuple.
If you do it cleverly, you can minimize the amount of data you need to
store with each event to a minimum.

My tip: if you are storing data at that rate be very careful about
choosing an indexing and schema strategy, you want to avoid things
like index fragmentation and disk hotspotting. You can look these up
in BOL.

Additionally, you should consult some resources on Database Design,
and maybe ask a professional to help you design the database and an
indexing strategy.

- Richard

---
Richard Quinn
Re: converting log file to db table gordy
8/8/2004 7:41:33 PM
[quoted text, click to view]

with the nested file, worse case (one recipient one event all year)
would be 54 bytes (each nested tree also keeps a count of subtrees and
pointer to the next tree on the same level). in practice there are only
24 hour records under day, 60 minute records under hour and usually
hundreds of recipients under a minute record. Actual avg cost per record
is about 4.0005 bytes


[quoted text, click to view]

this is the table I created
year,month,deploymentid,templateid,eventid,day,hour,minute,recipientid

I loaded the table with 700 million records and created a covering
clustered index w/same order.

[quoted text, click to view]


If I understand correctly.. you're saying to use multiple tables and
autoincrement columns to sort of emulate the nested tree. Something like
this 8 tables

[id1+,year]
[id2+,id1^,month]
[id3+,id2^,deploymentid]
[id4+,id3^,eventid]
[id5+,id4^,day]
[id6+,id5^,hour]
[id7+,id6^,minute]
[id7^,recipientid]

I suppose I could merge columns together to cut down on the extra
autoincrement fields? Maybe year,month,deploymentid could go in one,
eventid,day,hour in another and finally minute,recipientid. Am I
following correctly? It seems like this could give me around 8.0
bytes/record average

Would it be best then to create an indexed view for making updates/queries?

[quoted text, click to view]

thanks for your help. I'm going to be finalizing the model at the mssql
lab in redmond in a few weeks, but I'd like to get as much hashed out as
Re: converting log file to db table Richard Quinn
8/9/2004 10:35:23 AM
On Sun, 08 Aug 2004 19:41:33 -0700, gordy <gordy@dynamicsdirect.com>
[quoted text, click to view]

Exactly. Take a look at this introductory article:
http://support.microsoft.com/default.aspx?scid=kb;EN-US;283878

SQL Server works on sets of data, stored in relations (otherwise known
as tables). The sets should be related to each other through a primary
key in table A and most likely have this key posted to table B. You
can also create a relationship between tables A and B by creating a
third table AB, containing no more than each of the primary keys from
tables A and B, but this involves more overhead and gives more
flexibility.

[quoted text, click to view]

No! ;) Please do not merge columns (into a single column), that
violates relational theory and will make your data again denormalized.
This is guaranteed to cause you severe problems. Each discreet data
field must have its own column. Databases work very differently to how
you are accustomed to storing data. Of course you can merge columns
into tables, which is not normalized but I'm going to do it anyway.

[quoted text, click to view]

You could have the following schema (simplified):

TTIMEDEPLOYMENT
--------------
TTDID DAY DEPLOYMENTID
1 2005-01-01 981729873
2 2005-01-01 123123324
3 2005-01-01 123124241
4 2005-01-02 981729873
5 2005-01-02 123123324
6 2005-01-02 123124241
7 2005-01-03 981729873
.....

Of course, you would end up with a fixed amount of records: (no. of
days) x (no. of DeploymentIDS). This is a non normalized table, the
idea being to save space in your log table.

And a second base table:
THOUREVENT
----------
THID HOUR EVENTID
1 1 889877986
2 1 776656576
3 1 655426552
.....
212 2 889877986
213 2 776656576
214 2 655426552
....


And a third Base Table:
TD_HE
---------------
ALL_ID THID TTDID
1 1 1
2 1 2
3 1 3
....

[quoted text, click to view]

And then your log table might look like this:
TLOG
------------------
ALL_ID MINUTE RECIPIENTID
928323 0 6152761
23123 45 21321312
312312 59 12312312

ALL_ID is 4 Bytes, Minute is tinyint (1 byte), and EventID is 4 Bytes
(or possibly smaller, I don't know how many recipients you have).

[quoted text, click to view]

Yep, this scheme would have 9 Bytes / record, plus however many MBs
were necessary to create the base tables. (Plus some overhead for page
records, plus overhead for whatever indexes you want).

Please be aware that the schema I showed is not normalized fully (only
2NF), and many professionals will give you opposing advise (I hope
with the same validity!). I have used auto incrementing ID fields,
whereas most people would look for a primary key from within the
available fields. The reasons I took these design choices is to save
space, you will be logging millions of records in TLOG and I wanted to
log the minimum amount of information necessary. (I suppose one could
even push up the MINUTE field into one of the Cartesian base tables)

There is a large variety of possible schemas available, ones which
maximise search efficiency, ones which maximise insert speed, ones
which are fully normalized, ones which use federated servers, etc etc.
What you choose depends on lots of circumstances that I know nothing
about.

[quoted text, click to view]

All of the tables - with the possible exception of TLOG, should be
indexed, and the clustered index should cover all the over fields.
This would mean that lookups need never access the heaps, but could
just get the data from the index.

TLOG would be very different, because of the number of inserts. A
possibility would be to perhaps archive the data in TLOG in an N day
interval, and create the index (and update the statistics) on the now
even larger archived table afterwards. Having an index on a table you
are not actively searching is pointless. Having an index on a table
receiving millions of inserts is a big slow down.

Whether to use an indexed view would depend on your schema. Creating
an indexed view might well help if you want to be able to retrieve all
relevant fields for a record in one resultset, by joining several
tables. However in the schema here, I doubt it would help. Each time
you do a reindex on TLOG the indexed view would have to be rebuilt
too... It would depend on the frequency of lookups and the frequency
of your update interval.

HTH, - Richard
---
Richard Quinn
Re: converting log file to db table gordy
8/9/2004 3:42:46 PM
[quoted text, click to view]

actually I meant merge tables, sorry. like:
[id1+,year,month,deploymentid]
[id2+,id1^,templateid,eventid]
[id3+,id2^,day,hour]
[id3^,minute,recipientid]

this is basically the example you gave right?
besides the order that is, I think this order would be most effective at
eliminating redundancy based on what we've collected so far.

[quoted text, click to view]

I wonder if the autoinc fields are going to add too much overhead when
inserting records. For instance, say I have a new record
[2004,8,10,1,1,8,12,59,1234] I would first have to check for the
existing id1 [2004,8,10] and if none exists create one. Then with that
id1 I would check for an existing id2 and so on until I have looked up
or created an id3 to be inserted along with the minute and recipiendid.
I suppose it's not unlike what I currently do with the nested file, just
that it would now be 3 passes (queries?) where as currently it's done in
one pass. Is there a common best practice for doing inserts this way?

[quoted text, click to view]

Thank you so much for your help, I really appreciate it. I'm going to
create these tables and fill'em with a few hundred million and see what
Re: converting log file to db table Richard Quinn
8/10/2004 10:35:57 AM
On Mon, 09 Aug 2004 15:42:46 -0700, gordy <gordy@dynamicsdirect.com>
[quoted text, click to view]

Yes. I kept minute in the last table originally, because it has only
60 discrete values.

[quoted text, click to view]

In that case I would suggest prepopulating the table. It is easy to
predict the sequence! Whether you use an IDENTITY field or some
Primary Key based on (a combination of) your fields does not matter
much, since you will always need to use a primary key in the next
related table for referencing. If you use an IDENTITY field then your
PK and FK will only use 4 bytes.

But that would depend on your already knowing what the set of
deployment, template and event IDs are. If you don't then a different
schema would probably be necessary, a more normalized one.

[quoted text, click to view]

These should be prepopulated too, if possible.

[quoted text, click to view]

Nop, lots of different mediocre practices ;)

What you could do, and I would recommend it, is to pass the data for
each new log record into a stored procedure:

USE tempdb
Create procedure procNewLogRecord1(
@templateid int, -- or the type you need
@deploymentid int,
@eventid int,
@logtime as smalldatetime,
@recipientid int
) AS
-- do lookups and inserts where necessary
return

The @logtime parameter would be something like:
'2004-07-31 12:35'
Then from within the sproc you could do the database intensive tasks
of looking up appropriate key values.

An alternative would be to store a lookup table in your application of
suitable values. It might take a few MBs of RAM but it would be fast
too. Then you would modify the stored procedure thus:

USE tempdb
Create procedure procNewLogRecord2(
@templateid char(12) = NULL,
@deploymentid as char(12) = NULL,
@eventid as char(12) = NULL,
@ID3 as Int = NULL,
@logtime as smalldatetime,
@recipientid as char (12)
) AS return

Meaning that the SP now does not expect to have all of its parameters
filled (some take a default of NULL). If you already know what ID3
will be you need only pass that, the logtime and the recipientid, some
logic in the sproc will skip over looking up the ID3 value.

[quoted text, click to view]

Keep us posted! Don't forget to create the indexes after loading the
data! I suppose you want to test your software by loading data the
same way you would during production. If you want it to be quicker,
try the BULK INSERT command.


---
Richard Quinn
AddThis Social Bookmark Button