Groups | Blog | Home
all groups > sql server (microsoft) > june 2005 >

sql server (microsoft) : Efficient record storage 10+ Million records


Ben
6/7/2005 9:22:09 AM
I have a database that stores survey information. I've been asked to add
information to it to test scalability. Well, its' hurting.

What I have is a 'sample' table or the people that took the survey and then
several related tables.

Lets' say the sample table looks something like
tblData_Sample
SampleID INT IDENTITY(1,1)
FullName VARCHAR(20)
AgeGroupID INT
IncomeID INT

With a clustered index on SampleID and Indexes on AgeGroupID and IncomeID

I also have a table for storing the survey responses. Right now its' setup
like
tblData_Grid
SampleID INT
RestaurantID INT
QuestionID INT
Response INT

There is also a table for storing complete information
tblData_Session
SampleID INT
StartDate DATETIME
EndDate DATETIME

With a clustered index on SampleID

With a clustered index on SampleID, RestaurantID, QuestionID and Indexes on
RestaurantID and QuestionID.

The problem I'm having is that after scaling out to only 93,000 completed
surveys, the data table is at 10,000,000 records.

Needless to say OUCH!!

Queries of the data table are painful... I've rebuilt all of my indexes, so
I know they're up-to-date and I'm kinda of stuck on what to do for speed
gains at this point.

It's taking way to long to process (10 seconds on the development server, I
haven't bothered trying the live cluster yet)
SELECT DISTINCT g.SampleID, g.RestaurantID
FROM tblData_Grid g
WHERE g.SampleID IN
(
SELECT s.SampleID
FROM tblData_Session s
WHERE s.EndDate IS NOT NULL
)


So what are things I can to do speed things up here? I've thought about
taking all questions that are asked of each respondent and moving them to a
horizontal table.

I've also considered using partitioned views. Perhaps store data in
mini-grid tables based on the sample quarter or restaurantid and then using
a partitioned view to simulate the grid table...

Before I start making any sweeping changes, I was hoping that someone here
could steer me in the right direction.

Thanks,
Ben

Ross Presser
6/7/2005 1:09:39 PM
[quoted text, click to view]

10 Million rows is not really that many. SQL server regularly handles large
OLTP tables in the billions.

SELECT DISTINCT is killing you, though. If there are multiple questions for
each sample/restaurant combination, then you should consider that an entity
of its own and give it a table. It may seem like redundant storage of the
same first few columns as are in Data_Grid, but it will help.

(by the way, "tbl" prefixes are not considered good style, they just
clutter things up. Furthermore, the Data_ prefix isn't much use either -
you know you're dealing with data, right?)

CREATE TABLE Sample (
SampleID INT IDENTITY(1,1) PRIMARY KEY,
FullName VARCHAR(20) UNIQUE NOT NULL,
AgeGroupID INT,
IncomeID INT
)

CREATE TABLE Restaurant (
SampleID INT NOT NULL
REFERENCES Sample (SampleID),
RestaurantID INT NOT NULL,
PRIMARY KEY (SampleID, RestaurantID)
)

CREATE TABLE Grid (
SampleID INT NOT NULL,
RestaurantID INT NOT NULL,
QuestionID INT NOT NULL,
Response INT NOT NULL,
PRIMARY KEY (SampleID, RestaurantID, QuestionID),
FOREIGN KEY (SampleID, RestaurantID)
REFERENCES Restaurant (SampleID, RestaurantID)
)

CREATE TABLE Session (
SampleID INT NOT NULL
REFERENCES Sample (SampleID),
StartDate DATETIME NOT NULL,
EndDate DATETIME NULL,
PRIMARY KEY (SampleID, StartDate)
)

Make sure you insert into Restaurant before inserting into Grid.

Now your query becomes

SELECT r.SampleID, r.RestaurantID
FROM Restaurant r
INNER JOIN Session s ON r.SampleID=s.SampleID
WHERE s.EndDate IS NOT NULL

Ben
6/7/2005 3:07:03 PM
Yeah I've been leaning towards a sample_restaurant table. I just finished
partitioning my data based on question groups, next stop
sample_restaurant...

Thanks for the reply, sorry about the old naming convention, its something
I'm used to. I like the data prefix because we have several different types
of tables with similar names and using the data prefix avoids collisions and
me having to be creative with names, as for the tbl :( I'm so in bed with
that right now I can't see a way out...

[quoted text, click to view]

aaron.kempf NO[at]SPAM gmail.com
6/14/2005 12:03:16 PM
the only option with decent performance-- with that record count-- is
to use analysis services.

with OLAP (analysis services) it is easy to throw around 100 million
records with sub-second response times (if you do it correctly)
AddThis Social Bookmark Button