VERY challenging question

VERY challenging question groupy
5/30/2006 10:39:23 AM
input: a table of 1.5 million records describing users, with 4 nvarchar
fields: A, B, C, D
the problem: there are many records with duplicate A's, or duplicate
B's, or duplicate A+B's, or duplicate B+C+D's & so on. Mathematically
there are 16-1 = 15 possible column combinations for each duplication.

aim: find the duplicates & filter them out, leaving only the unique users
which don't have ANY duplication.

We can do it with a simple select query that logically checks for
duplication using OR operators.
But it takes about 16 days on a very fast PC.
The DB is in SQL Server; converting it to Oracle might bring that down to
8 days.

How can I do it in a few hours?
Remember that filtering the users first by parameter A & then by
parameter B & so on will give a wrong final result, because it
loses the information about the users already filtered out - maybe in
parameter C they are equal to other users in the table...

THANK YOU
Re: VERY challenging question Tom Moreau
5/30/2006 2:07:40 PM
Moving it to Oracle won't buy you anything. Perhaps indexing on each of the
columns to be filtered will help you.
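
A minimal sketch of what per-column indexing might look like. The table name tbl, the nvarchar(25) column lengths and the index names are placeholders chosen for illustration, not details from the original question:

```sql
-- Stand-in for the real table; name and column lengths are assumptions
CREATE TABLE tbl
   (A nvarchar(25) NOT NULL,
    B nvarchar(25) NOT NULL,
    C nvarchar(25) NOT NULL,
    D nvarchar(25) NOT NULL);

-- One nonclustered index per column that is checked for duplicates
-- (index names are hypothetical)
CREATE INDEX ix_tbl_A ON tbl (A);
CREATE INDEX ix_tbl_B ON tbl (B);
CREATE INDEX ix_tbl_C ON tbl (C);
CREATE INDEX ix_tbl_D ON tbl (D);
```

With an index on each column, a GROUP BY or a self-join on that single column may be answered from the index instead of by repeated scans of the whole table.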

--
Tom

----------------------------------------------------
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Toronto, ON Canada
Re: VERY challenging question Tony Rogerson
5/30/2006 9:18:39 PM
Post the SQL you have so far.

Also post the hardware specification you are using this on.

I regularly deal with queries that consume tables with multi-millions of
rows in seconds without problem; the size of your data looks to be around
190MBytes based on 4 columns of 25 characters, so basically it's piddly.

Tony.

--
Tony Rogerson
SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson - technical commentary from a SQL
Server Consultant
http://sqlserverfaq.com - free video tutorials


Re: VERY challenging question Gert-Jan Strik
5/30/2006 9:21:04 PM
If the table contains 1.5 million rows, and the query runs for 16 days,
then there must be something wrong with the query or with the table
setup (including indexes).

From your narrative I do not really understand what you are trying to
achieve. Please post DDL (including indexes), some sample data and the
results you are trying to achieve.

Gert-Jan


Re: VERY challenging question Erland Sommarskog
5/30/2006 9:21:06 PM

The description is vague, but sounds like you should run:

SELECT userid, A, B, C, D, COUNT(*)
FROM tbl
GROUP BY userid, A, B, C, D
HAVING COUNT(*) >1

While that will not run in a snap, it should not take 16 days for 1.5
million rows.



--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
Re: VERY challenging question - Explanation Tony Rogerson
5/31/2006 12:00:00 AM
What's your hardware?

Please post the CREATE TABLE with any indexes for your schema.

What version of SQL Server?

--
Tony Rogerson
SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson - technical commentary from a SQL
Server Consultant
http://sqlserverfaq.com - free video tutorials


Re: VERY challenging question - Explanation Erland Sommarskog
5/31/2006 12:00:00 AM

I sincerely doubt that this statement takes two weeks to run for 1.5
million rows. Had you said 1.5 milliard rows, I could maybe have
believed it.

Anyway, first index each column individually. Then try:

DELETE a
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.B > b.B OR
      a.C > b.C OR
      a.D > b.D

DELETE a
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.C > b.C OR
      a.D > b.D

DELETE a
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.D > b.D

After this operation, you still have the rows that have the same values
in all four columns. But it is not clear from your description whether you
have such duplicates. If you do, this may be the best approach:

ALTER TABLE tbl ADD ident int IDENTITY
go

-- For each column, keep only the row with the lowest ident per value
DELETE a
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.ident > b.ident

DELETE a
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.ident > b.ident

DELETE a
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.ident > b.ident

DELETE a
FROM tbl a
JOIN tbl b ON a.D = b.D
WHERE a.ident > b.ident
go

ALTER TABLE tbl DROP COLUMN ident

Note: all the above is untested. For tested solutions (at least with
regards to correctness), please post:

o CREATE TABLE statement for the table.
o INSERT statements with sample data.
o The desired result given the sample.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
Re: VERY challenging question Hugo Kornelis
5/31/2006 12:30:21 AM
Hi groupy,

No. Only four possibilities: duplicate A, duplicate B, duplicate C, and
duplicate D. Combinations are just a special case (you can only have a
duplicate A+B if you have both a duplicate A and a duplicate B - though
you can have duplicate A and duplicate B but no duplicate A+B).
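
A small sketch of that last case. The table variable @t and its three rows are made up for this example: column A contains a duplicate and column B contains a duplicate, yet no (A, B) pair repeats.

```sql
-- Hypothetical three-row sample
DECLARE @t TABLE (A nvarchar(25) NOT NULL, B nvarchar(25) NOT NULL);
INSERT INTO @t (A, B) VALUES (N'a1', N'b1');
INSERT INTO @t (A, B) VALUES (N'a1', N'b2');
INSERT INTO @t (A, B) VALUES (N'a2', N'b1');

-- Duplicate values in A alone: returns a1
SELECT A, COUNT(*) AS cnt FROM @t GROUP BY A HAVING COUNT(*) > 1;

-- Duplicate values in B alone: returns b1
SELECT B, COUNT(*) AS cnt FROM @t GROUP BY B HAVING COUNT(*) > 1;

-- Duplicate (A, B) pairs: returns no rows
SELECT A, B, COUNT(*) AS cnt FROM @t GROUP BY A, B HAVING COUNT(*) > 1;
```

So checking the four single columns is enough: any duplicate on a combination of columns already shows up as a duplicate on each column in that combination.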

This specification is incorrect. For instance, with the input like this:

num  A    B    C    D
---  ---  ---  ---  ---
1    a1   b1   c1   d1
2    a2   b2   c2   d2
3    a1   b2   c3   d3
4    a2   b1   c4   d4

there are two possible result sets, both containing two rows, that have
no duplicates anymore (1 + 2 or 3 + 4).

If the answer is "I don't care - any resultset without duplicates will
do", then the code below should run pretty fast:

CREATE TABLE #Temp
(A nvarchar(25) NOT NULL,
B nvarchar(25) NOT NULL,
C nvarchar(25) NOT NULL,
D nvarchar(25) NOT NULL)
go
CREATE UNIQUE INDEX x_A ON #Temp(A) WITH (IGNORE_DUP_KEY = ON)
CREATE UNIQUE INDEX x_B ON #Temp(B) WITH (IGNORE_DUP_KEY = ON)
CREATE UNIQUE INDEX x_C ON #Temp(C) WITH (IGNORE_DUP_KEY = ON)
CREATE UNIQUE INDEX x_D ON #Temp(D) WITH (IGNORE_DUP_KEY = ON)
go
INSERT INTO #Temp (A, B, C, D)
SELECT A, B, C, D
FROM YourBigTable
-- Show results
SELECT * FROM #Temp
go
DROP TABLE #Temp
go


--
VERY challenging question - Explanation groupy
5/31/2006 1:37:22 AM
OK, let's take a look at a sample table representing the problem:

A  | B  | C  | D
------------------
a1   b1   c1   d1
a1   b2   c2   d2
a1   b1   c3   d3
a4   b4   c4   d3
a5   b5   c5   d5
a6   b6   c6   d3

The duplications are:
rows 1+2+3 on A
rows 1+3 on B
rows 3+4+6 on D
The only row that is unique in all parameters is row 5.
Note: first finding that row 1 is similar to row 2 on A & deleting it will
lose information, because we WON'T know that row 1 is similar to row 3 on B.
The same goes for deleting row 3: it loses the information
about its similarity to row 4 on D.


The simple query for retrieving all duplicated rows, which consumes the most
time, is:
SELECT COUNT(*),A,B,C,D
FROM tbl
GROUP BY A,B,C,D
HAVING count(*)>1
It takes about 2 weeks on 1.5 million rows; all fields are
nvarchar & the DB is in SQL Server

THANK YOU ALL
Re: VERY challenging question - Explanation jsfromynr
5/31/2006 4:35:24 AM
Hi There,

If there is no identity column, then we may use:
SELECT IDENTITY(int,1,1) AS myid, A, B, C, D INTO tmpTable FROM BASETABLE;
create index tmpA on tmpTable(A,myid);
create index tmpB on tmpTable(B,myid);
create index tmpC on tmpTable(C,myid);
create index tmpD on tmpTable(D,myid);

Assuming that there is a column myid which is monotonically increasing
and there are as many covering indexes as there are columns, the query
can become like this:


Delete from tmpTable where myId in
(
Select myID from tmpTable group by A having count(*)>1
Union All
Select myID from tmpTable group by B having count(*)>1
Union All
Select myID from tmpTable group by C having count(*)>1
Union All
Select myID from tmpTable group by D having count(*)>1
);

Hope this serves the purpose.
With Warm regards
Jatinder Singh


Re: VERY challenging question - Explanation jsfromynr
5/31/2006 5:44:18 AM
Hi There,

Sorry for providing an incorrect answer!
A correct one that you may like to try is:
Create Table myData
(
a varchar(2),
b varchar(2),
c varchar(2),
d varchar(2)
)
insert into myData
Select 'a1', 'b1', 'c1', 'd1'
Union
Select 'a1', 'b2', 'c2', 'd2'
Union
Select 'a1', 'b1', 'c3', 'd3'
Union
Select 'a4', 'b4', 'c4', 'd3'
Union
Select 'a5', 'b5', 'c5', 'd5'
Union
Select 'a6', 'b6', 'c6', 'd3'
Alter Table myData add myid int identity(1,1)
go

-- Rows before the delete
Select * from myData

-- Delete every row whose A, B, C or D value occurs more than once
Delete from myData Where myID in
(
Select myID From myData MA, (Select A from myData group by A having count(*)>1) AA
Where MA.A = AA.A
Union All
Select myID From myData MB, (Select B from myData group by B having count(*)>1) BB
Where MB.B = BB.B
Union All
Select myID From myData MC, (Select C from myData group by C having count(*)>1) CC
Where MC.C = CC.C
Union All
Select myID From myData MD, (Select D from myData group by D having count(*)>1) DD
Where MD.D = DD.D
)

-- Rows after the delete: only a5, b5, c5, d5 remains
Select * from myData
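
An alternative sketch, not part of the original post: on SQL Server 2005 or later, the same rule (keep only rows with no duplicate in any column) could also be expressed with windowed counts instead of four derived tables. The table variable @myData and the cntA to cntD column names are made up here; the rows are the six sample rows from the thread.

```sql
-- Hypothetical stand-in for myData, loaded with the thread's sample rows
DECLARE @myData TABLE (a varchar(2), b varchar(2), c varchar(2), d varchar(2));
INSERT INTO @myData (a, b, c, d) VALUES ('a1', 'b1', 'c1', 'd1');
INSERT INTO @myData (a, b, c, d) VALUES ('a1', 'b2', 'c2', 'd2');
INSERT INTO @myData (a, b, c, d) VALUES ('a1', 'b1', 'c3', 'd3');
INSERT INTO @myData (a, b, c, d) VALUES ('a4', 'b4', 'c4', 'd3');
INSERT INTO @myData (a, b, c, d) VALUES ('a5', 'b5', 'c5', 'd5');
INSERT INTO @myData (a, b, c, d) VALUES ('a6', 'b6', 'c6', 'd3');

-- For every row, count how many rows share its value in each column,
-- then keep only the rows where all four counts are 1
SELECT a, b, c, d
FROM (
    SELECT a, b, c, d,
           COUNT(*) OVER (PARTITION BY a) AS cntA,
           COUNT(*) OVER (PARTITION BY b) AS cntB,
           COUNT(*) OVER (PARTITION BY c) AS cntC,
           COUNT(*) OVER (PARTITION BY d) AS cntD
    FROM @myData
) AS x
WHERE cntA = 1 AND cntB = 1 AND cntC = 1 AND cntD = 1;
-- Returns the single row a5, b5, c5, d5
```

Whether this runs faster than the IN (...) delete on 1.5 million rows would need to be measured; it is shown only as another way to state the rule.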

With Warm regards
Jatinder Singh

Re: VERY challenging question - Explanation groupy
5/31/2006 7:04:41 AM
Thank you all very much, I think I've managed something..
Re: VERY challenging question - Explanation Madhivanan
6/2/2006 2:35:05 AM

How long does it take to run the query now?

Madhivanan


Re: VERY challenging question - Explanation groupy
6/2/2006 3:46:29 AM
It takes about 2 hours...