[quoted text, click to view] On Tue, 7 Jun 2005 09:22:09 -0400, Ben wrote:
> I have a database that stores survey information. I've been asked to add
> information to it to test scalability. Well, its' hurting.
>
> What I have is a 'sample' table or the people that took the survey and then
> several related tables.
>
> Lets' say the sample table looks something like
> tblData_Sample
> SampleID INT IDENTITY(1,1)
> FullName VARCHAR(20)
> AgeGroupID INT
> IncomeID INT
>
> With a clustered index on SampleID and Indexes on AgeGroupID and IncomeID
>
> I also have a table for storing the survey responses. Right now its' setup
> like
> tblData_Grid
> SampleID INT
> RestaurantID INT
> QuestionID INT
> Response INT
>
> There is also a table for storing complete information
> tblData_Session
> SampleID INT
> StartDate DATETIME
> EndDate DATETIME
>
> With a clustered index on SampleID
>
> With a clustered index on SampleID, RestaurantID, QuestionID and Indexes on
> RestaurantID and QuestionID.
>
> The problem I'm having is that after scaling out to only 93,000 completed
> surveys, the data table is at 10,000,000 records.
>
> Needless to say OUCH!!
>
> Queries of the data table are painful... I've rebuilt all of my indexes, so
> I know they're up-to-date and I'm kinda of stuck on what to do for speed
> gains at this point.
>
> It's taking way to long to process (10 seconds on the development server, I
> haven't bothered trying the live cluster yet)
> SELECT DISTINCT g.SampleID, g.RestaurantID
> FROM tblData_Grid g
> WHERE g.SampleID IN
> (
> SELECT s.SampleID
> FROM tblData_Session s
> WHERE s.EndDate IS NOT NULL
> )
>
> So what are things I can to do speed things up here? I've thought about
> taking all questions that are asked of each respondent and moving them to a
> horizontal table.
>
10 Million rows is not really that many. SQL server regularly handles large
OLTP tables in the billions.
SELECT DISTINCT is killing you, though. If there are multiple questions for
each sample/restaurant combination, then you should consider that an entity
of its own and give it a table. It may seem like redundant storage of the
same first few columns as are in Data_Grid, but it will help.
(by the way, "tbl" prefixes are not considered good style, they just
clutter things up. Furthermore, the Data_ prefix isn't much use either -
you know you're dealing with data, right?)
CREATE TABLE Sample (
SampleID INT IDENTITY(1,1) PRIMARY KEY,
FullName VARCHAR(20) UNIQUE NOT NULL,
AgeGroupID INT,
IncomeID INT
)
CREATE TABLE Restaurant (
SampleID INT NOT NULL
REFERENCES Sample (SampleID),
RestaurantID INT NOT NULL,
PRIMARY KEY (SampleID, RestaurantID)
)
CREATE TABLE Grid (
SampleID INT NOT NULL,
RestaurantID INT NOT NULL,
QuestionID INT NOT NULL,
Response INT NOT NULL,
PRIMARY KEY (SampleID, RestaurantID, QuestionID),
FOREIGN KEY (SampleID, RestaurantID)
REFERENCES Restaurant (SampleID, RestaurantID)
)
CREATE TABLE Session (
SampleID INT NOT NULL
REFERENCES Sample (SampleID),
StartDate DATETIME NOT NULL,
EndDate DATETIME NULL,
PRIMARY KEY (SampleID, StartDate)
)
Make sure you insert into Restaurant before inserting into Grid.
Now your query becomes
SELECT r.SampleID, r.RestaurantID
FROM Restaurant r
INNER JOIN Session s ON r.SampleID=s.SampleID
WHERE s.EndDate IS NOT NULL