sql server programming:
Dear Experts, I have a log file that is stored in a nested tree format. The logical record is this: int datetime int eventid int deploymentid int recipientid int templateid If the data were stored in a typed file it would be 20bytes/record which would be a problem because we add about 50-100 million records a month. So the physical record is stored like so: year -month --deploymentid ---templateid ----eventid -----day ------hour -------minute --------recipientid Querying the file is quick and the growth rate is manageable even at 100m/month. I would like to store this data in the database so I could use it in other queries, but I'm having trouble coming up with an efficient way to store it. My first attempt, I just created a table to store the records flat. There doesn't seem to be a way to let sql exploit the redundancy in the data so that it doesn't create a 20gb database. After importing a flattened file and creating a single clustered index (same column order as the nested file) the database size was rediculous. Also, queries on this table are taking 5 to 10 minutes which is unacceptable for my application (should be less than 30 seconds). Is there a table type or index type I should use for this type of data? Does anyone have any ideas? btw I'm using MSSQL Server 2000 Enterprise
Hi Thanks for the description although it is still not clear what you are doing in respect to the SQL Server side. Posting DDL (CREATE TABLE... CREATE INDEX statements etc... http://www.aspfaq.com/etiquette.asp?id=5006 ) and example data ( as INSERT statements http://vyaskn.tripod.com/code.htm#inserts ) along with the query you are running then it may be possible to help you. In the mean time you may want to try looking at the query plans and possibly using the index tuning wizard to see if you can improve things. More information on these is in books online. Also, you don't say when the index is created. If it was created before the data load, try changing this to afterwards. John [quoted text, click to view] "gordy" <gordy@dynamicsdirect.com> wrote in message news:4115c92a$1_1@Usenet.com... > Dear Experts, > I have a log file that is stored in a nested tree format. > The logical record is this: > int datetime > int eventid > int deploymentid > int recipientid > int templateid > > If the data were stored in a typed file it would be 20bytes/record which > would be a problem because we add about 50-100 million records a > month. So the physical record is stored like so: > year > -month > --deploymentid > ---templateid > ----eventid > -----day > ------hour > -------minute > --------recipientid > > Querying the file is quick and the growth rate is manageable even at > 100m/month. > > I would like to store this data in the database so I could use it in > other queries, but I'm having trouble coming up with an efficient way to > store it. My first attempt, I just created a table to store the records > flat. There doesn't seem to be a way to let sql exploit the redundancy > in the data so that it doesn't create a 20gb database. After importing a > flattened file and creating a single clustered index (same column order > as the nested file) the database size was rediculous. Also, queries on > this table are taking 5 to 10 minutes which is unacceptable for my > application (should be less than 30 seconds). > > Is there a table type or index type I should use for this type of data? > Does anyone have any ideas? btw I'm using MSSQL Server 2000 Enterprise > on a dual P4 xeon w/4gb of ram. the io subsystem is a 10 drive raid.
On Sat, 07 Aug 2004 23:33:10 -0700, gordy <gordy@dynamicsdirect.com> [quoted text, click to view] wrote: >Dear Experts, >I have a log file that is stored in a nested tree format.
Nice. This is already a particularly efficient way to store data, balanced trees would give you problems updating the index. [quoted text, click to view] >The logical record is this: >int datetime >int eventid >int deploymentid >int recipientid >int templateid > >If the data were stored in a typed file it would be 20bytes/record which > would be a problem because we add about 50-100 million records a >month. So the physical record is stored like so: >year >-month >--deploymentid >---templateid >----eventid >-----day >------hour >-------minute >--------recipientid
How many bytes / record does your current storage scheme require? [quoted text, click to view] >Querying the file is quick and the growth rate is manageable even at >100m/month. > >I would like to store this data in the database so I could use it in >other queries, but I'm having trouble coming up with an efficient way to >store it. My first attempt, I just created a table to store the records >flat.
A bad idea. You should use several tables so that SQL server can exploit the relationships. Considering that a datetime field alone requires 8 bytes, you are going to need a lot of space very soon with your single table. [quoted text, click to view] > There doesn't seem to be a way to let sql exploit the redundancy >in the data so that it doesn't create a 20gb database. After importing a >flattened file and creating a single clustered index (same column order >as the nested file) the database size was rediculous. Also, queries on >this table are taking 5 to 10 minutes which is unacceptable for my >application (should be less than 30 seconds).
Yes, the problem you have is from using an ISAM style schema. [quoted text, click to view] >Is there a table type or index type I should use for this type of data? >Does anyone have any ideas? btw I'm using MSSQL Server 2000 Enterprise >on a dual P4 xeon w/4gb of ram. the io subsystem is a 10 drive raid.
Your hardware is probably not the problem, if your file was running on it then the DB should provide at least equal performance. There are no different table types to choose from. The index could no doubt be improved (did you update the statistics? Is it a covering index?) but that won't solve your problem: your data is not normalized. Create a table with one record for each month and year. Create a table each for deploymentid, eventid, templateid, recipientid (assuming they are not unique each time but draw from a limited set of possibles). Create intersection tables comprising of all the possibilities of combining the previous IDs (assuming we are not talking huge numbers here). For each log, you need now create a single row with no more information than the time at which it occurred, and any other information unique to this event (perhaps the eventid, it is unclear from your description). The general idea is to normalize your data so that you need not repeat the same discrete data with each new tuple. If you do it cleverly, you can minimize the amount of data you need to store with each event to a minimum. My tip: if you are storing data at that rate be very careful about choosing an indexing and schema strategy, you want to avoid things like index fragmentation and disk hotspotting. You can look these up in BOL. Additionally, you should consult some resources on Database Design, and maybe ask a professional to help you design the database and an indexing strategy. - Richard --- Richard Quinn
[quoted text, click to view] > > How many bytes / record does your current storage scheme require? >
with the nested file, worse case (one recipient one event all year) would be 54 bytes (each nested tree also keeps a count of subtrees and pointer to the next tree on the same level). in practice there are only 24 hour records under day, 60 minute records under hour and usually hundreds of recipients under a minute record. Actual avg cost per record is about 4.0005 bytes [quoted text, click to view] > > > Your hardware is probably not the problem, if your file was running on > it then the DB should provide at least equal performance. There are no > different table types to choose from. The index could no doubt be > improved (did you update the statistics? Is it a covering index?) but > that won't solve your problem: your data is not normalized.
this is the table I created year,month,deploymentid,templateid,eventid,day,hour,minute,recipientid I loaded the table with 700 million records and created a covering clustered index w/same order. [quoted text, click to view] > > Create a table with one record for each month and year. > Create a table each for deploymentid, eventid, templateid, recipientid > (assuming they are not unique each time but draw from a limited set of > possibles). > > Create intersection tables comprising of all the possibilities of > combining the previous IDs (assuming we are not talking huge numbers > here). > > For each log, you need now create a single row with no more > information than the time at which it occurred, and any other > information unique to this event (perhaps the eventid, it is unclear > from your description). The general idea is to normalize your data so > that you need not repeat the same discrete data with each new tuple. > If you do it cleverly, you can minimize the amount of data you need to > store with each event to a minimum.
If I understand correctly.. you're saying to use multiple tables and autoincrement columns to sort of emulate the nested tree. Something like this 8 tables [id1+,year] [id2+,id1^,month] [id3+,id2^,deploymentid] [id4+,id3^,eventid] [id5+,id4^,day] [id6+,id5^,hour] [id7+,id6^,minute] [id7^,recipientid] I suppose I could merge columns together to cut down on the extra autoincrement fields? Maybe year,month,deploymentid could go in one, eventid,day,hour in another and finally minute,recipientid. Am I following correctly? It seems like this could give me around 8.0 bytes/record average Would it be best then to create an indexed view for making updates/queries? [quoted text, click to view] > > My tip: if you are storing data at that rate be very careful about > choosing an indexing and schema strategy, you want to avoid things > like index fragmentation and disk hotspotting. You can look these up > in BOL. > > Additionally, you should consult some resources on Database Design, > and maybe ask a professional to help you design the database and an > indexing strategy. > > - Richard > > --- > Richard Quinn > MCSD, MCDBA firstname.lastname@ieee.org
thanks for your help. I'm going to be finalizing the model at the mssql lab in redmond in a few weeks, but I'd like to get as much hashed out as
On Sun, 08 Aug 2004 19:41:33 -0700, gordy <gordy@dynamicsdirect.com> [quoted text, click to view] wrote: >If I understand correctly.. you're saying to use multiple tables and >autoincrement columns to sort of emulate the nested tree. Something like >this 8 tables
Exactly. Take a look at this introductory article: http://support.microsoft.com/default.aspx?scid=kb;EN-US;283878 SQL Server works on sets of data, stored in relations (otherwise known as tables). The sets should be related to each other through a primary key in table A and most likely have this key posted to table B. You can also create a relationship between tables A and B by creating a third table AB, containing no more than each of the primary keys from tables A and B, but this involves more overhead and gives more flexibility. [quoted text, click to view] >[id1+,year] >[id2+,id1^,month] >[id3+,id2^,deploymentid] >[id4+,id3^,eventid] >[id5+,id4^,day] >[id6+,id5^,hour] >[id7+,id6^,minute] >[id7^,recipientid] > >I suppose I could merge columns together to cut down on the extra >autoincrement fields?
No! ;) Please do not merge columns (into a single column), that violates relational theory and will make your data again denormalized. This is guaranteed to cause you severe problems. Each discreet data field must have its own column. Databases work very differently to how you are accustomed to storing data. Of course you can merge columns into tables, which is not normalized but I'm going to do it anyway. [quoted text, click to view] > Maybe year,month,deploymentid could go in one, >eventid,day,hour in another
You could have the following schema (simplified): TTIMEDEPLOYMENT -------------- TTDID DAY DEPLOYMENTID 1 2005-01-01 981729873 2 2005-01-01 123123324 3 2005-01-01 123124241 4 2005-01-02 981729873 5 2005-01-02 123123324 6 2005-01-02 123124241 7 2005-01-03 981729873 ..... Of course, you would end up with a fixed amount of records: (no. of days) x (no. of DeploymentIDS). This is a non normalized table, the idea being to save space in your log table. And a second base table: THOUREVENT ---------- THID HOUR EVENTID 1 1 889877986 2 1 776656576 3 1 655426552 ..... 212 2 889877986 213 2 776656576 214 2 655426552 .... And a third Base Table: TD_HE --------------- ALL_ID THID TTDID 1 1 1 2 1 2 3 1 3 .... [quoted text, click to view] > and finally minute,recipientid.
And then your log table might look like this: TLOG ------------------ ALL_ID MINUTE RECIPIENTID 928323 0 6152761 23123 45 21321312 312312 59 12312312 ALL_ID is 4 Bytes, Minute is tinyint (1 byte), and EventID is 4 Bytes (or possibly smaller, I don't know how many recipients you have). [quoted text, click to view] > Am I >following correctly? It seems like this could give me around 8.0 >bytes/record average
Yep, this scheme would have 9 Bytes / record, plus however many MBs were necessary to create the base tables. (Plus some overhead for page records, plus overhead for whatever indexes you want). Please be aware that the schema I showed is not normalized fully (only 2NF), and many professionals will give you opposing advise (I hope with the same validity!). I have used auto incrementing ID fields, whereas most people would look for a primary key from within the available fields. The reasons I took these design choices is to save space, you will be logging millions of records in TLOG and I wanted to log the minimum amount of information necessary. (I suppose one could even push up the MINUTE field into one of the Cartesian base tables) There is a large variety of possible schemas available, ones which maximise search efficiency, ones which maximise insert speed, ones which are fully normalized, ones which use federated servers, etc etc. What you choose depends on lots of circumstances that I know nothing about. [quoted text, click to view] >Would it be best then to create an indexed view for making updates/queries?
All of the tables - with the possible exception of TLOG, should be indexed, and the clustered index should cover all the over fields. This would mean that lookups need never access the heaps, but could just get the data from the index. TLOG would be very different, because of the number of inserts. A possibility would be to perhaps archive the data in TLOG in an N day interval, and create the index (and update the statistics) on the now even larger archived table afterwards. Having an index on a table you are not actively searching is pointless. Having an index on a table receiving millions of inserts is a big slow down. Whether to use an indexed view would depend on your schema. Creating an indexed view might well help if you want to be able to retrieve all relevant fields for a record in one resultset, by joining several tables. However in the schema here, I doubt it would help. Each time you do a reindex on TLOG the indexed view would have to be rebuilt too... It would depend on the frequency of lookups and the frequency of your update interval. HTH, - Richard --- Richard Quinn
[quoted text, click to view] >> >>I suppose I could merge columns together to cut down on the extra >>autoincrement fields? > > > No! ;) Please do not merge columns (into a single column), that > violates relational theory and will make your data again denormalized. > This is guaranteed to cause you severe problems. Each discreet data > field must have its own column. Databases work very differently to how > you are accustomed to storing data. Of course you can merge columns > into tables, which is not normalized but I'm going to do it anyway. >
actually I meant merge tables, sorry. like: [id1+,year,month,deploymentid] [id2+,id1^,templateid,eventid] [id3+,id2^,day,hour] [id3^,minute,recipientid] this is basically the example you gave right? besides the order that is, I think this order would be most effective at eliminating redundancy based on what we've collected so far. [quoted text, click to view] > > Please be aware that the schema I showed is not normalized fully (only > 2NF), and many professionals will give you opposing advise (I hope > with the same validity!). I have used auto incrementing ID fields, > whereas most people would look for a primary key from within the > available fields. The reasons I took these design choices is to save > space, you will be logging millions of records in TLOG and I wanted to > log the minimum amount of information necessary. (I suppose one could > even push up the MINUTE field into one of the Cartesian base tables)
I wonder if the autoinc fields are going to add too much overhead when inserting records. For instance, say I have a new record [2004,8,10,1,1,8,12,59,1234] I would first have to check for the existing id1 [2004,8,10] and if none exists create one. Then with that id1 I would check for an existing id2 and so on until I have looked up or created an id3 to be inserted along with the minute and recipiendid. I suppose it's not unlike what I currently do with the nested file, just that it would now be 3 passes (queries?) where as currently it's done in one pass. Is there a common best practice for doing inserts this way? [quoted text, click to view] > > There is a large variety of possible schemas available, ones which > maximise search efficiency, ones which maximise insert speed, ones > which are fully normalized, ones which use federated servers, etc etc. > What you choose depends on lots of circumstances that I know nothing > about. > > >>Would it be best then to create an indexed view for making updates/queries? > > > All of the tables - with the possible exception of TLOG, should be > indexed, and the clustered index should cover all the over fields. > This would mean that lookups need never access the heaps, but could > just get the data from the index. > > TLOG would be very different, because of the number of inserts. A > possibility would be to perhaps archive the data in TLOG in an N day > interval, and create the index (and update the statistics) on the now > even larger archived table afterwards. Having an index on a table you > are not actively searching is pointless. Having an index on a table > receiving millions of inserts is a big slow down. > > Whether to use an indexed view would depend on your schema. Creating > an indexed view might well help if you want to be able to retrieve all > relevant fields for a record in one resultset, by joining several > tables. However in the schema here, I doubt it would help. Each time > you do a reindex on TLOG the indexed view would have to be rebuilt > too... It would depend on the frequency of lookups and the frequency > of your update interval. > > HTH, - Richard > --- > Richard Quinn > MCSD, MCDBA firstname.lastname@ieee.org
Thank you so much for your help, I really appreciate it. I'm going to create these tables and fill'em with a few hundred million and see what
On Mon, 09 Aug 2004 15:42:46 -0700, gordy <gordy@dynamicsdirect.com> [quoted text, click to view] wrote: >actually I meant merge tables, sorry. like: >[id1+,year,month,deploymentid] >[id2+,id1^,templateid,eventid] >[id3+,id2^,day,hour] >[id3^,minute,recipientid] > >this is basically the example you gave right? >besides the order that is, I think this order would be most effective at >eliminating redundancy based on what we've collected so far.
Yes. I kept minute in the last table originally, because it has only 60 discrete values. [quoted text, click to view] >I wonder if the autoinc fields are going to add too much overhead when >inserting records. For instance, say I have a new record >[2004,8,10,1,1,8,12,59,1234] I would first have to check for the >existing id1 [2004,8,10] and if none exists create one.
In that case I would suggest prepopulating the table. It is easy to predict the sequence! Whether you use an IDENTITY field or some Primary Key based on (a combination of) your fields does not matter much, since you will always need to use a primary key in the next related table for referencing. If you use an IDENTITY field then your PK and FK will only use 4 bytes. But that would depend on your already knowing what the set of deployment, template and event IDs are. If you don't then a different schema would probably be necessary, a more normalized one. [quoted text, click to view] > Then with that >id1 I would check for an existing id2 and so on until I have looked up >or created an id3 to be inserted along with the minute and recipiendid.
These should be prepopulated too, if possible. [quoted text, click to view] >I suppose it's not unlike what I currently do with the nested file, just >that it would now be 3 passes (queries?) where as currently it's done in >one pass. Is there a common best practice for doing inserts this way?
Nop, lots of different mediocre practices ;) What you could do, and I would recommend it, is to pass the data for each new log record into a stored procedure: USE tempdb Create procedure procNewLogRecord1( @templateid int, -- or the type you need @deploymentid int, @eventid int, @logtime as smalldatetime, @recipientid int ) AS -- do lookups and inserts where necessary return The @logtime parameter would be something like: '2004-07-31 12:35' Then from within the sproc you could do the database intensive tasks of looking up appropriate key values. An alternative would be to store a lookup table in your application of suitable values. It might take a few MBs of RAM but it would be fast too. Then you would modify the stored procedure thus: USE tempdb Create procedure procNewLogRecord2( @templateid char(12) = NULL, @deploymentid as char(12) = NULL, @eventid as char(12) = NULL, @ID3 as Int = NULL, @logtime as smalldatetime, @recipientid as char (12) ) AS return Meaning that the SP now does not expect to have all of its parameters filled (some take a default of NULL). If you already know what ID3 will be you need only pass that, the logtime and the recipientid, some logic in the sproc will skip over looking up the ID3 value. [quoted text, click to view] >I'm going to >create these tables and fill'em with a few hundred million and see what >breaks next :)
Keep us posted! Don't forget to create the indexes after loading the data! I suppose you want to test your software by loading data the same way you would during production. If you want it to be quicker, try the BULK INSERT command. --- Richard Quinn
Don't see what you're looking for? Try a search.
|