sql server (alternate):
Input: a table of 1.5 million user records with 4 nvarchar fields: A, B, C, D.

The problem: many records have duplicate A's, duplicate B's, duplicate A+B's, duplicate B+C+D's, and so on. Mathematically there are 16-1 possible combinations for each duplication.

Aim: find the duplicates and filter them out, leaving only the unique users that don't have ANY duplication.

We can do it with a simple SELECT query that logically checks for duplication using an OR operator, but it takes about 16 days on a very fast PC. The DB is in SQL Server; converting it to Oracle might bring that down to 8 days.

How can I do it in a few hours? Remember that filtering the users first by parameter A, then by parameter B, and so on, will produce an error in the final result, because it loses the information about the filtered users: maybe in parameter C they are equal to other users in the table...

THANK YOU
Moving it to Oracle won't buy you anything. Perhaps indexing on each of the columns to be filtered will help you.

--
Tom
----------------------------------------------------
Thomas A. Moreau, BSc, PhD, MCSE, MCDBA
SQL Server MVP
Toronto, ON Canada
Post the SQL you have so far. Also post the hardware specification you are running this on.

I regularly deal with queries that consume tables with multi-millions of rows in seconds without problem. The size of your data looks to be around 190 MBytes based on 4 columns of 25 characters; basically it's piddly.

Tony.

--
Tony Rogerson
SQL Server MVP
http://sqlblogcasts.com/blogs/tonyrogerson - technical commentary from a SQL Server Consultant
http://sqlserverfaq.com - free video tutorials
If the table contains 1.5 million rows and the query runs for 16 days, then there must be something wrong with the query or with the table setup (including indexes).

From your narrative I do not really understand what you are trying to achieve. Please post DDL (including indexes), some sample data and the results you are trying to achieve.

Gert-Jan
The description is vague, but it sounds like you should run:

    SELECT userid, A, B, C, D, COUNT(*)
    FROM   tbl
    GROUP  BY userid, A, B, C, D
    HAVING COUNT(*) > 1

While that will not run in a snap, it should not take 16 days for 1.5 million rows.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
What's your hardware? Please post the CREATE TABLE with any indexes for your schema. What version of SQL Server?

--
Tony Rogerson
SQL Server MVP
groupy (liav.ezer@gmail.com) writes:
> A  | B  | C  | D
> --------------------
> a1   b1   c1   d1
> a1   b2   c2   d2
> a1   b1   c3   d3
> a4   b4   c4   d3
> a5   b5   c5   d5
> a6   b6   c6   d3
>
> The duplications are:
> rows 1+2+3 on A
> rows 1+3 on B
> rows 3+4+6 on D
> The only unique (in all params) row is 5.
> Note: finding first that row 1 is similar to row 2 on A and deleting it will
> lose information, because we WON'T know whether row 1 is similar to row 3 on B.
> The same goes for the deletion of row 3: it will cause loss of data
> regarding its similarity to row 4 on D.
>
> The simple query for retrieving all duplicated rows, which consumes most
> of the time, is:
> SELECT COUNT(*),A,B,C,D
> FROM tbl
> GROUP BY A,B,C,D
> HAVING count(*)>1
> It takes about 2 weeks on 1.5 million rows, while all fields are
> nvarchars and the DB is in SQL Server.
I sincerely doubt that this statement takes two weeks to run for 1.5 million rows. Had you said 1.5 milliard rows, I could maybe have believed it.

Anyway, first index each column individually. Then try:

    DELETE a
    FROM   tbl a
    JOIN   tbl b ON a.A = b.A
    WHERE  a.B > b.B OR
           a.C > b.C OR
           a.D > b.D

    DELETE a
    FROM   tbl a
    JOIN   tbl b ON a.B = b.B
    WHERE  a.C > b.C OR
           a.D > b.D

    DELETE a
    FROM   tbl a
    JOIN   tbl b ON a.C = b.C
    WHERE  a.D > b.D

After this operation, you still have the rows that have the same values in all four columns. But it is not clear from your description whether you have such duplicates. If you have, this may be the best:

    ALTER TABLE tbl ADD ident int IDENTITY

    DELETE a
    FROM   tbl a
    JOIN   tbl b ON a.A = b.A
    WHERE  a.ident > b.ident

    DELETE a
    FROM   tbl a
    JOIN   tbl b ON a.B = b.B
    WHERE  a.ident > b.ident

    DELETE a
    FROM   tbl a
    JOIN   tbl b ON a.C = b.C
    WHERE  a.ident > b.ident

    DELETE a
    FROM   tbl a
    JOIN   tbl b ON a.D = b.D
    WHERE  a.ident > b.ident

    ALTER TABLE tbl DROP COLUMN ident

Note: all the above is untested. For tested solutions (at least with regards to correctness), please post:

o CREATE TABLE statement for the table.
o INSERT statements with sample data.
o The desired result given the sample.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
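For anyone who wants to try the identity-column idea without a SQL Server instance at hand, here is a minimal sketch. It is a stand-in under stated assumptions, not the code above: it uses Python's sqlite3 module, an INTEGER PRIMARY KEY in place of IDENTITY, and a correlated EXISTS in place of the DELETE ... JOIN form, which SQLite does not support. The function name keep_first_per_column is hypothetical, and the six rows are the sample posted in this thread.

```python
import sqlite3

# The six sample rows from the thread (columns A, B, C, D).
SAMPLE = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b2", "c2", "d2"),
    ("a1", "b1", "c3", "d3"),
    ("a4", "b4", "c4", "d3"),
    ("a5", "b5", "c5", "d5"),
    ("a6", "b6", "c6", "d3"),
]


def keep_first_per_column(rows):
    """Delete, column by column, every row that shares a value with an
    earlier row (lower ident). Returns the surviving ident values."""
    conn = sqlite3.connect(":memory:")
    # INTEGER PRIMARY KEY plays the role of the IDENTITY column here.
    conn.execute(
        "CREATE TABLE tbl (ident INTEGER PRIMARY KEY, A TEXT, B TEXT, C TEXT, D TEXT)"
    )
    conn.executemany("INSERT INTO tbl (A, B, C, D) VALUES (?, ?, ?, ?)", rows)
    # Column names come from this fixed tuple, never from user input.
    for col in ("A", "B", "C", "D"):
        conn.execute(f"CREATE INDEX ix_tbl_{col} ON tbl ({col}, ident)")
        conn.execute(
            f"DELETE FROM tbl WHERE EXISTS ("
            f"SELECT 1 FROM tbl AS b "
            f"WHERE b.{col} = tbl.{col} AND b.ident < tbl.ident)"
        )
    survivors = [r[0] for r in conn.execute("SELECT ident FROM tbl ORDER BY ident")]
    conn.close()
    return survivors


print(keep_first_per_column(SAMPLE))
```

On the sample this leaves rows 1, 4 and 5: one representative per duplicated value survives. That is a different result from keeping only rows that never had a duplicate, which on the same sample would be row 5 alone.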
On 30 May 2006 10:39:23 -0700, groupy wrote:
>input: 1.5 million records table consisting of users with 4 nvarchar
>fields: A, B, C, D
>the problem: there are many records with duplicate A's or duplicate
>B's or duplicate A+B's or duplicate B+C+D's & so on. Mathematically
>there are 16-1 possibilities for each duplication.
Hi groupy,

No. Only four possibilities: duplicate A, duplicate B, duplicate C, and duplicate D. Combinations are just a special case (you can only have a duplicate A+B if you have both a duplicate A and a duplicate B - though you can have duplicate A and duplicate B but no duplicate A+B).

>aim: find the duplicates & filter them, leave only the unique users
>which don't have ANY duplication.
This specification is incorrect. For instance, with the input like this:

    num  A    B    C    D
    ---  ---  ---  ---  ---
    1    a1   b1   c1   d1
    2    a2   b2   c2   d2
    3    a1   b2   c3   d3
    4    a2   b1   c4   d4

there are two possible result sets, both containing two rows, that have no duplicates anymore (1 + 2 or 3 + 4).

If the answer is "I don't care - any resultset without duplicates will do", then the code below should run pretty fast:

    CREATE TABLE #Temp
          (A nvarchar(25) NOT NULL,
           B nvarchar(25) NOT NULL,
           C nvarchar(25) NOT NULL,
           D nvarchar(25) NOT NULL)
    go
    CREATE UNIQUE INDEX x_A ON #Temp(A) WITH (IGNORE_DUP_KEY = ON)
    CREATE UNIQUE INDEX x_B ON #Temp(B) WITH (IGNORE_DUP_KEY = ON)
    CREATE UNIQUE INDEX x_C ON #Temp(C) WITH (IGNORE_DUP_KEY = ON)
    CREATE UNIQUE INDEX x_D ON #Temp(D) WITH (IGNORE_DUP_KEY = ON)
    go
    INSERT INTO #Temp (A, B, C, D)
    SELECT A, B, C, D
    FROM   YourBigTable
    -- Show results
    SELECT * FROM #Temp
    go
    DROP TABLE #Temp
    go
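The "two possible result sets" point can be tried out with a small sketch. This is only an analogue, assuming SQLite through Python's sqlite3 module: unique indexes plus INSERT OR IGNORE stand in for SQL Server's IGNORE_DUP_KEY indexes, and the table name temp_users and function name insert_ignoring_duplicates are made up for the illustration. The four rows are the ones from the example above.

```python
import sqlite3

# The four rows from the example above (columns A, B, C, D).
ROWS = [
    ("a1", "b1", "c1", "d1"),
    ("a2", "b2", "c2", "d2"),
    ("a1", "b2", "c3", "d3"),
    ("a2", "b1", "c4", "d4"),
]


def insert_ignoring_duplicates(rows):
    """Insert rows one by one into a table with a unique index on every
    column, silently skipping any row that collides with an earlier one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE temp_users ("
        "A TEXT NOT NULL, B TEXT NOT NULL, C TEXT NOT NULL, D TEXT NOT NULL)"
    )
    for col in "ABCD":
        conn.execute(f"CREATE UNIQUE INDEX x_{col} ON temp_users ({col})")
    # OR IGNORE skips a row that violates any of the unique indexes.
    conn.executemany(
        "INSERT OR IGNORE INTO temp_users (A, B, C, D) VALUES (?, ?, ?, ?)", rows
    )
    kept = conn.execute(
        "SELECT A, B, C, D FROM temp_users ORDER BY rowid"
    ).fetchall()
    conn.close()
    return kept


print(insert_ignoring_duplicates(ROWS))                  # rows 1 and 2 survive
print(insert_ignoring_duplicates(list(reversed(ROWS))))  # rows 4 and 3 survive
```

The outcome depends on the order in which rows arrive. In this sketch the order is explicit; with an INSERT ... SELECT that has no ORDER BY, as far as I know the server is free to pick the order, so which of the valid result sets you get should not be relied on.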
OK, let's take a look at a sample table representing the problem:

    A  | B  | C  | D
    --------------------
    a1   b1   c1   d1
    a1   b2   c2   d2
    a1   b1   c3   d3
    a4   b4   c4   d3
    a5   b5   c5   d5
    a6   b6   c6   d3

The duplications are:
rows 1+2+3 on A
rows 1+3 on B
rows 3+4+6 on D
The only unique (in all params) row is 5.

Note: finding first that row 1 is similar to row 2 on A and deleting it will lose information, because we WON'T know whether row 1 is similar to row 3 on B. The same goes for the deletion of row 3: it will cause loss of data regarding its similarity to row 4 on D.

The simple query for retrieving all duplicated rows, which consumes most of the time, is:

    SELECT COUNT(*), A, B, C, D
    FROM tbl
    GROUP BY A, B, C, D
    HAVING COUNT(*) > 1

It takes about 2 weeks on 1.5 million rows, while all fields are nvarchars and the DB is in SQL Server.

THANK YOU ALL
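A quick way to see what that query does on the sample is the sketch below. It assumes SQLite via Python's sqlite3 module as a stand-in for SQL Server, and the helper names load, whole_row_duplicates and column_duplicates are hypothetical. It runs the GROUP BY A,B,C,D query as posted, then the same pattern grouped on one column at a time.

```python
import sqlite3

# The six sample rows from the post above.
SAMPLE = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b2", "c2", "d2"),
    ("a1", "b1", "c3", "d3"),
    ("a4", "b4", "c4", "d3"),
    ("a5", "b5", "c5", "d5"),
    ("a6", "b6", "c6", "d3"),
]


def load(rows):
    """Create an in-memory table tbl(A, B, C, D) holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (A TEXT, B TEXT, C TEXT, D TEXT)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?)", rows)
    return conn


def whole_row_duplicates(conn):
    """The query as posted: groups on all four columns at once."""
    return conn.execute(
        "SELECT A, B, C, D, COUNT(*) FROM tbl "
        "GROUP BY A, B, C, D HAVING COUNT(*) > 1"
    ).fetchall()


def column_duplicates(conn, col):
    """Values of a single column that occur in more than one row."""
    if col not in ("A", "B", "C", "D"):
        raise ValueError("unknown column")
    return conn.execute(
        f"SELECT {col}, COUNT(*) FROM tbl "
        f"GROUP BY {col} HAVING COUNT(*) > 1 ORDER BY {col}"
    ).fetchall()


conn = load(SAMPLE)
print(whole_row_duplicates(conn))
for col in "ABCD":
    print(col, column_duplicates(conn, col))
```

Grouping on all four columns only reports rows that are identical in every column, and the sample has none, so that query comes back empty here. The per-column version reports a1 (3 rows), b1 (2 rows) and d3 (3 rows), which matches the duplications listed in the post.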
Hi there,

If there is no identity column then we may use:

    SELECT IDENTITY(int, 1, 1) AS myid, A, B, C, D
    INTO   tmpTable
    FROM   BASETABLE;

    CREATE INDEX tmpA ON tmpTable(A, myid);
    CREATE INDEX tmpB ON tmpTable(B, myid);
    CREATE INDEX tmpC ON tmpTable(C, myid);
    CREATE INDEX tmpD ON tmpTable(D, myid);

Assuming that there is a column myid which is monotonically increasing, and that there are as many covering indexes as there are columns, the query can become like this:

    DELETE FROM tmpTable
    WHERE myid IN (
        SELECT myid FROM tmpTable GROUP BY A HAVING COUNT(*) > 1
        UNION ALL
        SELECT myid FROM tmpTable GROUP BY B HAVING COUNT(*) > 1
        UNION ALL
        SELECT myid FROM tmpTable GROUP BY C HAVING COUNT(*) > 1
        UNION ALL
        SELECT myid FROM tmpTable GROUP BY D HAVING COUNT(*) > 1
    );

Hope this serves the purpose.

With warm regards,
Jatinder Singh
Hi there,

Sorry for providing an incorrect answer. The correct one that you may like to try is:

    CREATE TABLE myData
    (
        a varchar(2),
        b varchar(2),
        c varchar(2),
        d varchar(2)
    )

    INSERT INTO myData
    SELECT 'a1', 'b1', 'c1', 'd1' UNION
    SELECT 'a1', 'b2', 'c2', 'd2' UNION
    SELECT 'a1', 'b1', 'c3', 'd3' UNION
    SELECT 'a4', 'b4', 'c4', 'd3' UNION
    SELECT 'a5', 'b5', 'c5', 'd5' UNION
    SELECT 'a6', 'b6', 'c6', 'd3'

    ALTER TABLE myData ADD myid int IDENTITY(1,1)

    SELECT * FROM myData

    DELETE FROM myData
    WHERE myid IN (
        SELECT myid
        FROM   myData MA,
               (SELECT A FROM myData GROUP BY A HAVING COUNT(*) > 1) AA
        WHERE  MA.A = AA.A
        UNION ALL
        SELECT myid
        FROM   myData MB,
               (SELECT B FROM myData GROUP BY B HAVING COUNT(*) > 1) BB
        WHERE  MB.B = BB.B
        UNION ALL
        SELECT myid
        FROM   myData MC,
               (SELECT C FROM myData GROUP BY C HAVING COUNT(*) > 1) CC
        WHERE  MC.C = CC.C
        UNION ALL
        SELECT myid
        FROM   myData MD,
               (SELECT D FROM myData GROUP BY D HAVING COUNT(*) > 1) DD
        WHERE  MD.D = DD.D
    )

    SELECT * FROM myData

With warm regards,
Jatinder Singh
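The delete-everything-that-shares-a-value approach can be checked against the sample with the sketch below. It is an adaptation under assumptions, not the T-SQL above: it runs on SQLite through Python's sqlite3 module, it collects the ids to remove in a separate temp table (called doomed here, a made-up name) before deleting, and the function name rows_without_any_duplicate is hypothetical.

```python
import sqlite3

# The six sample rows from the thread (columns a, b, c, d).
SAMPLE = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b2", "c2", "d2"),
    ("a1", "b1", "c3", "d3"),
    ("a4", "b4", "c4", "d3"),
    ("a5", "b5", "c5", "d5"),
    ("a6", "b6", "c6", "d3"),
]


def rows_without_any_duplicate(rows):
    """Remove every row that shares a value with another row in any single
    column, and return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE myData (myid INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, d TEXT)"
    )
    conn.executemany("INSERT INTO myData (a, b, c, d) VALUES (?, ?, ?, ?)", rows)
    # Collect the ids to delete first, so the duplicate counts are always
    # taken from the full, unmodified table.
    conn.execute("CREATE TEMP TABLE doomed (myid INTEGER PRIMARY KEY)")
    for col in ("a", "b", "c", "d"):  # fixed list, not user input
        conn.execute(f"CREATE INDEX ix_myData_{col} ON myData ({col})")
        conn.execute(
            f"INSERT OR IGNORE INTO doomed (myid) "
            f"SELECT myid FROM myData WHERE {col} IN ("
            f"SELECT {col} FROM myData GROUP BY {col} HAVING COUNT(*) > 1)"
        )
    conn.execute("DELETE FROM myData WHERE myid IN (SELECT myid FROM doomed)")
    left = conn.execute("SELECT a, b, c, d FROM myData ORDER BY myid").fetchall()
    conn.close()
    return left


print(rows_without_any_duplicate(SAMPLE))
```

On the sample only the a5/b5/c5/d5 row survives, which is the result the original poster asked for. Each column needs one grouped pass over its index, so the work grows with the number of columns rather than with the number of column combinations.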
Thank you all very much, I think I managed something...
How long does it take to run the query now?

Madhivanan
It takes about 2 hours...