Bart Torbert (barthome1@comcast.net) writes:
[quoted text, click to view] > There are a couple of aspects of our app that makes size usage a big
> factor. First of all we have a database with about 5 million
> individual entities. These entities many children tables. These
> children tables can have multiple children per entity. So the children
> tables can have 10 million plus rows. We have a lot of data to deal
> with.
This far, it does by no means sounds like a huge database. You later
mention 23 GB. As Greg said, that's a not a big database these days.
I would call it mid-sized.
And, I echo Greg and say that size alone is not the measure you should
go with. If you can add an indexed view which takes up 1GB, but which
changes response time of a critical query from 1 hour to 5 seconds, it
would be a folly not to add the view.
Disk is cheap these days, and 23 GB or 70 GB for the database is no
big issue. (Had you been in for 23 TB it would have been another matter!)
The main cost for a bigger database is not in the hardware, but longer
times for backup and restore.
[quoted text, click to view] > Can someone explain the clustering of indices? How does this affect
> performance and storage size?
In a clustered index, the leaf pages are the data pages. It is considered
best practice to always have a clustered index on a table in SQL Server.
There may be situations where it is better to have a heap (as is the name
of a table without a clustered index), but in doubt assume clustering
is the way to go.
From this follows that if the only index of your table is the primary key,
this index should be clustered.
If you have more than one index, it takes some judgement to find the
best index to cluster. Usually, the clustered index should be on a column
on which you do range queries. For instance, for an Orders table, the
OrderID is a poor candidate to cluster on, because you usually look at
orders one by one, and rarely at orders 10222 to 11762. The customer ID
or the order date may be better choices. On the other hand in an
OrderDetails table with a PK of (OrderID, Rowno), this PK may be an
excellent choice to cluster on, because you often want to see all rows
for an order.
One thing to keep in mind is that you should keep your clustered index
small bytewise, because in the non-clustered indexes, the clustering
key serve as row pointers to the data. So a wide cluster key, makes the
non-clustered indexes wide too. And the main reason to keep down the non-
clustered indexes in size is not to conserve disk space, but to speed up
response time. The narrower the index, the more index nodes on a page,
and the faster the index can be searched.
If you go for a non-unique clustered index, SQL Server will add a 4-byte
uniquifier to the clustered index. For this reason, it may be better to
add the PK to the clustered index to make it unique, if the PK is a
four-byte value.
[quoted text, click to view] > On one of our tables with 3 million rows, I tried to apply an index on
> a text field. It took 45 minutes for the index to process. I did not
> turn on clustering. Is this normal or do I not have something setup
> right. The PC I a running on has 1 gig of memory and I think 1 gig
> CPU. Do we need more horsepower for SQLServer with this large of a
> database? Right now, without too many indices, the MDF file is about
> 23 GIG.
45 minutes to add a non-clustered index on a three-million row table
sounds a long time to me, but you did not mention the size of the column.
You said "text field", but you cannot add an index on a column of the
text datatype, so I assume that it was a varchar column. But it might
take longer time to add an index on a heap; I rarely do that, so I don't
have experience.
Changing the clustered index on a table (including adding one, or dropping
one) can take considerable time, since all data has to be moved, and non-
clustered indexes must be updated. Thankfully you can change a clustered
index by adding WITH DROP_EXISTING when you create the new definition,
so you don't have to move data twice.
[quoted text, click to view] > Which brings up another question. Is there some threshold within
> SQLServer as far as number of rows, number of tables, overall DB size
> that it is known that performance start to take a hit.
No, there are no such magic numbers.
However, for some database applications, there can indeed be some magic
number over which performance dips considerably. Consider a database
where the active data is small enough to fit entirely in cache. Response
time will be good, maybe even if some queries are less-than-optimal.
Now, if the portion of active data outgrows the cache, and on top of
that, the access pattern is somewhat cyclic, there will be a lot more
disk accesses, because data falls out of the cache. The decisive factor
is of course how much main memory you have in the machine.
How much that active data is, depends on the application and its business
needs and how well-written the queries are. If a sloppy programmer
introduces a table scan over a big table with historic data of which
the data normally is dormant, you can see the performance dip allover,
because of the the big table taking over thc cache. I would assume that
Oracle would hehave in the same way.
--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server SP3 at