sql server data warehouse:
Hi all: I just learned today that a project I'm working on has a couple of very large tables that need, somehow, to be optimized for searching. One table has roughly 7 million rows, and another, its relative, has over 16 million. At present, client applications only insert into these tables, but our users have expressed a desire to select from them as well, using a small set of search criteria, so we need to come up with an appropriate approach. At the very least, to continue the application as it exists today, it seems to me the two tables need to be trimmed down -- perhaps have their data archived into separate tables, or databases, or some other approach -- but if users also want to begin reporting on this data, allowing them to select against it could bring the server, which is already overburdened, to its knees. I'm an application developer and not a DBA, so I'm looking to you all for some best-practices suggestions, here. I've looked briefly into SQL Analysis Services, but I'm not sure thatt's the right approach for us, since it appears Analysis Services is merely an interface for handling data processing, and our users need real-time data (or as close to real-time as possible -- big surprise, eh?). I'm wondering if simply segmenting the data (which is very simply stored -- user ID, thing ID, date/time stamp, and that's pretty much it) into separate tables might be an acceptable approach, but that approach also seems a little crude to me. How do DBAs handle very large t ables like this in a way that's least cumbersome to the client-side developers and end users? Thanks hugely in advance,
So you have very large tables in the operational environment that you want to partition, but you still need to provide a convenient method for developers to query against them? This is a common architectural issue, and the standard answer is called [partitioned views]. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/createdb/cm_8_des_06_17zr.asp http://www.microsoft.com/sql/prodinfo/previousversions/scalability.mspx Also, there are techniques for near real time refreshing of OLAP data, but whether or not you choose to go this additional step should perhaps depend on what impact would result from the users querying against the tables directly from the operational system. If the data is not that rich (just a log table with few attributes of interest to users), then it may not be worth the expense and maintenance of setting up a server for Analysis Services. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql2k/html/sql_real-timeolap.asp http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/anservog.mspx "Christian Nunciato" <ChristianNunciato@discussions.microsoft.com> wrote in message news:5BF9AF00-3A74-4044-A203-2E5CAA48D18D@microsoft.com... [quoted text, click to view] > Hi all: > > I just learned today that a project I'm working on has a couple of very > large tables that need, somehow, to be optimized for searching. One table > has roughly 7 million rows, and another, its relative, has over 16 > million. > > At present, client applications only insert into these tables, but our > users > have expressed a desire to select from them as well, using a small set of > search criteria, so we need to come up with an appropriate approach. At > the > very least, to continue the application as it exists today, it seems to me > the two tables need to be trimmed down -- perhaps have their data archived > into separate tables, or databases, or some other approach -- but if users > also want to begin reporting on this data, allowing them to select against > it > could bring the server, which is already overburdened, to its knees. > > I'm an application developer and not a DBA, so I'm looking to you all for > some best-practices suggestions, here. I've looked briefly into SQL > Analysis > Services, but I'm not sure thatt's the right approach for us, since it > appears Analysis Services is merely an interface for handling data > processing, and our users need real-time data (or as close to real-time as > possible -- big surprise, eh?). I'm wondering if simply segmenting the > data > (which is very simply stored -- user ID, thing ID, date/time stamp, and > that's pretty much it) into separate tables might be an acceptable > approach, > but that approach also seems a little crude to me. How do DBAs handle > very > large t ables like this in a way that's least cumbersome to the > client-side > developers and end users? > > Thanks hugely in advance, > > Chris
Chris, JT has some good points and links in his response, and represent best practices. I'll add to what he has said and give you the flip side - shall we say, not so best practices, but a way to get by until you can implement them. By no means are these recommendations for every situation, but they may fit your needs. 16 million rows is a large set of data but depending on your circumstances, may be manageable as it exists. Indexes of course, play a major role in search speed and server workload. BUT, don't add any new indexes on that large of a table unless you have a test environment to see how long the index will take to build. Indexes also have the potential to increase your save times as well as storage requirements substantially. You may be able to tweak indexes so your save times increase a little but your searches return data rapidly. Schema optimization is also paramount in larger tables, but again, don't adjust it without having a lot of available time and disk space to support the modifications. If you can control the search select statements, keep the columns returned to a minimum. If your environment allows potentially dirty reads, check into using the WITH (NOLOCK) hint in the select statements. All of this also depends on how fast you expect your data to grow. Additionally, the introduction of partitioned views or any other large table management methods will require additional maintenance. You may be able to get by cheaper in the long run by upgrading your server now. -Mike [quoted text, click to view] "Christian Nunciato" wrote: > Hi all: > > I just learned today that a project I'm working on has a couple of very > large tables that need, somehow, to be optimized for searching. One table > has roughly 7 million rows, and another, its relative, has over 16 million. > > At present, client applications only insert into these tables, but our users > have expressed a desire to select from them as well, using a small set of > search criteria, so we need to come up with an appropriate approach. At the > very least, to continue the application as it exists today, it seems to me > the two tables need to be trimmed down -- perhaps have their data archived > into separate tables, or databases, or some other approach -- but if users > also want to begin reporting on this data, allowing them to select against it > could bring the server, which is already overburdened, to its knees. > > I'm an application developer and not a DBA, so I'm looking to you all for > some best-practices suggestions, here. I've looked briefly into SQL Analysis > Services, but I'm not sure thatt's the right approach for us, since it > appears Analysis Services is merely an interface for handling data > processing, and our users need real-time data (or as close to real-time as > possible -- big surprise, eh?). I'm wondering if simply segmenting the data > (which is very simply stored -- user ID, thing ID, date/time stamp, and > that's pretty much it) into separate tables might be an acceptable approach, > but that approach also seems a little crude to me. How do DBAs handle very > large t ables like this in a way that's least cumbersome to the client-side > developers and end users? > > Thanks hugely in advance, >
my first suggestion is: disk subsystem! increase the number of disks to increase the throughput. and setup a Raid 0+1. scanning big tables with or without indexes is (near) allways slowdown by the disks. 1 good IDE drive can read 60MB/s, if your table contain 5Gb of data... then reading the table is long. if you have 4 disks in Raid 0 (stripped) you can raise this value to 60 * 4 = 240MB/s. its not so simple, because index and file fragmentation reduce this throughput, but its a starting point. for example, I have a database with a partitionned table which contains 110 millions of rows. Doing a query like: select D.column1, count(*) from table T inner join dimension D on D.key = T.Key group by D.column1 (the key used is the partitionned key) takes 8minutes on a dev server (1 dual core opteron, 3 IDE drives in raid 0) During this process the CPU is used at 25% The same query is slower (10 minutes) on the production server (4 Xeon, SAN) because we have a SAN issue. so you see the impact of the disk subsystem. also, if you plan to do some group by queries, the tempdb database will be used. so if the tempdb database use his own set of disks, you'll improve the performance. from a developper point of view, maybe you have some control of the queries executed against the database. So if you plan to aggregate data, maybe your application (or your storedprocedures) can update the big table and the aggregated table at the same time, then your "reports" will execute queries against the summarized table. but if you can't control what the user ask for, try using AS, specially AS2005 which provide a lot of options for real time cubes. A webcast demonstrate how to update a MOLAP cube based on a large table (more then 100millions of rows) in near real time. "Christian Nunciato" <ChristianNunciato@discussions.microsoft.com> wrote in message news:5BF9AF00-3A74-4044-A203-2E5CAA48D18D@microsoft.com... [quoted text, click to view] > Hi all: > > I just learned today that a project I'm working on has a couple of very > large tables that need, somehow, to be optimized for searching. One table > has roughly 7 million rows, and another, its relative, has over 16 > million. > > At present, client applications only insert into these tables, but our > users > have expressed a desire to select from them as well, using a small set of > search criteria, so we need to come up with an appropriate approach. At > the > very least, to continue the application as it exists today, it seems to me > the two tables need to be trimmed down -- perhaps have their data archived > into separate tables, or databases, or some other approach -- but if users > also want to begin reporting on this data, allowing them to select against > it > could bring the server, which is already overburdened, to its knees. > > I'm an application developer and not a DBA, so I'm looking to you all for > some best-practices suggestions, here. I've looked briefly into SQL > Analysis > Services, but I'm not sure thatt's the right approach for us, since it > appears Analysis Services is merely an interface for handling data > processing, and our users need real-time data (or as close to real-time as > possible -- big surprise, eh?). I'm wondering if simply segmenting the > data > (which is very simply stored -- user ID, thing ID, date/time stamp, and > that's pretty much it) into separate tables might be an acceptable > approach, > but that approach also seems a little crude to me. How do DBAs handle > very > large t ables like this in a way that's least cumbersome to the > client-side > developers and end users? > > Thanks hugely in advance, > > Chris
I'm a big fan of using continuous SQL replication to a reporting server, then you can index all you want on that box. "Christian Nunciato" <ChristianNunciato@discussions.microsoft.com> wrote in message news:5BF9AF00-3A74-4044-A203-2E5CAA48D18D@microsoft.com... [quoted text, click to view] > Hi all: > > I just learned today that a project I'm working on has a couple of very > large tables that need, somehow, to be optimized for searching. One table > has roughly 7 million rows, and another, its relative, has over 16 million. > > At present, client applications only insert into these tables, but our users > have expressed a desire to select from them as well, using a small set of > search criteria, so we need to come up with an appropriate approach. At the > very least, to continue the application as it exists today, it seems to me > the two tables need to be trimmed down -- perhaps have their data archived > into separate tables, or databases, or some other approach -- but if users > also want to begin reporting on this data, allowing them to select against it > could bring the server, which is already overburdened, to its knees. > > I'm an application developer and not a DBA, so I'm looking to you all for > some best-practices suggestions, here. I've looked briefly into SQL Analysis > Services, but I'm not sure thatt's the right approach for us, since it > appears Analysis Services is merely an interface for handling data > processing, and our users need real-time data (or as close to real-time as > possible -- big surprise, eh?). I'm wondering if simply segmenting the data > (which is very simply stored -- user ID, thing ID, date/time stamp, and > that's pretty much it) into separate tables might be an acceptable approach, > but that approach also seems a little crude to me. How do DBAs handle very > large t ables like this in a way that's least cumbersome to the client-side > developers and end users? > > Thanks hugely in advance, > > Chris
[quoted text, click to view] "J?j?" <willgart@bbbhotmailaaa.com> wrote: > my first suggestion is: > disk subsystem! >
If you need a raid array, you have too little ram. If it doesn't fit in one machine, split it. 16 Million rows is not so large that it would not fit. 16 GByte of ram is below EUR 3000.
Hi Christian, 6 million and 17 million rows is a rather trivial amount of data nowadays..... Buy the hardware to support what you need to do....it will cost less than spending time trying to figure out anything else....really, it will. Peter
Are you suggesting that he pin the entire table to RAM? [quoted text, click to view] "Stephan Eggermont" <stephan@stack.nl> wrote in message news:44057cbc$0$11069$e4fe514c@news.xs4all.nl... > "J?j?" <willgart@bbbhotmailaaa.com> wrote: >> my first suggestion is: >> disk subsystem! >> > > If you need a raid array, you have too little ram. > If it doesn't fit in one machine, split it. > 16 Million rows is not so large that it would > not fit. 16 GByte of ram is below EUR 3000. > > Stephan
Don't see what you're looking for? Try a search.
|