Does anyone have a neat way to scramble data in one table? I have a Person table with fields (PersonID, LastName, FirstName, MiddleName). PersonID is a sequential number from 1 - 1,000,000 and there are NO gaps in the sequence. I need the last names to be scrambled with last names and so forth. ie. Smith, John D. could become Adams, Greg B. The only way I know how to accomplish this is by looping through each Person record and then select 3 random PersonID using the RAND() function and then update the current PersonID with the LastName, FirstName, and Middle names of the rand() persons. This process works but, since I have a million records, it takes my server close to 30 minutes to process. Does anyone know of a faster process to accomplish this?
If you don't mind using modulo on PersonID instead of RAND, you can accomplish this with set-based processing using CASE expressions. UPDATE Person SET LastName = CASE PersonID % 3 WHEN 0 THEN 'Smith' WHEN 1 THEN 'Adams' WHEN 2 THEN 'Doe' END, FirstName = CASE PersonID % 3 WHEN 0 THEN 'John' WHEN 1 THEN 'Greg' WHEN 2 THEN 'John' END, MiddleName = CASE PersonID % 3 WHEN 0 THEN 'D' WHEN 1 THEN 'B' WHEN 2 THEN 'C' END -- Hope this helps. Dan Guzman SQL Server MVP [quoted text, click to view] "SND" <sndip@cox.net> wrote in message news:d4WXb.1477$Mf.1107@okepread04... > Does anyone have a neat way to scramble data in one table? > > I have a Person table with fields (PersonID, LastName, FirstName, > MiddleName). PersonID is a sequential number from 1 - 1,000,000 and there > are NO gaps in the sequence. I need the last names to be scrambled with > last names and so forth. ie. Smith, John D. could become Adams, Greg B. > > The only way I know how to accomplish this is by looping through each Person > record and then select 3 random PersonID using the RAND() function and then > update the current PersonID with the LastName, FirstName, and Middle names > of the rand() persons. This process works but, since I have a million > records, it takes my server close to 30 minutes to process. > > Does anyone know of a faster process to accomplish this? > >
This won't work in my case since I have a million records and I would have to build a very large dynamic Case statement. Sorry for making it confusing with my sample data. [quoted text, click to view] "Dan Guzman" <danguzman@nospam-earthlink.net> wrote in message news:Oi1hirD9DHA.2672@TK2MSFTNGP10.phx.gbl... > If you don't mind using modulo on PersonID instead of RAND, you can > accomplish this with set-based processing using CASE expressions. > > UPDATE Person > SET > LastName = CASE PersonID % 3 > WHEN 0 THEN 'Smith' > WHEN 1 THEN 'Adams' > WHEN 2 THEN 'Doe' > END, > FirstName = CASE PersonID % 3 > WHEN 0 THEN 'John' > WHEN 1 THEN 'Greg' > WHEN 2 THEN 'John' > END, > MiddleName = CASE PersonID % 3 > WHEN 0 THEN 'D' > WHEN 1 THEN 'B' > WHEN 2 THEN 'C' > END > > -- > Hope this helps. > > Dan Guzman > SQL Server MVP > > "SND" <sndip@cox.net> wrote in message news:d4WXb.1477$Mf.1107@okepread04... > > Does anyone have a neat way to scramble data in one table? > > > > I have a Person table with fields (PersonID, LastName, FirstName, > > MiddleName). PersonID is a sequential number from 1 - 1,000,000 and there > > are NO gaps in the sequence. I need the last names to be scrambled with > > last names and so forth. ie. Smith, John D. could become Adams, Greg B. > > > > The only way I know how to accomplish this is by looping through each > Person > > record and then select 3 random PersonID using the RAND() function and > then > > update the current PersonID with the LastName, FirstName, and Middle names > > of the rand() persons. This process works but, since I have a million > > records, it takes my server close to 30 minutes to process. > > > > Does anyone know of a faster process to accomplish this? > > > > > >
Sorry, I misread your post. Hopefully, Steve's answer is more help. -- Dan Guzman SQL Server MVP [quoted text, click to view] "SND" <sndip@cox.net> wrote in message news:G0YXb.1489$Mf.977@okepread04... > This won't work in my case since I have a million records and I would have > to build a very large dynamic Case statement. Sorry for making it confusing > with my sample data. > > "Dan Guzman" <danguzman@nospam-earthlink.net> wrote in message > news:Oi1hirD9DHA.2672@TK2MSFTNGP10.phx.gbl... > > If you don't mind using modulo on PersonID instead of RAND, you can > > accomplish this with set-based processing using CASE expressions. > > > > UPDATE Person > > SET > > LastName = CASE PersonID % 3 > > WHEN 0 THEN 'Smith' > > WHEN 1 THEN 'Adams' > > WHEN 2 THEN 'Doe' > > END, > > FirstName = CASE PersonID % 3 > > WHEN 0 THEN 'John' > > WHEN 1 THEN 'Greg' > > WHEN 2 THEN 'John' > > END, > > MiddleName = CASE PersonID % 3 > > WHEN 0 THEN 'D' > > WHEN 1 THEN 'B' > > WHEN 2 THEN 'C' > > END > > > > -- > > Hope this helps. > > > > Dan Guzman > > SQL Server MVP > > > > "SND" <sndip@cox.net> wrote in message > news:d4WXb.1477$Mf.1107@okepread04... > > > Does anyone have a neat way to scramble data in one table? > > > > > > I have a Person table with fields (PersonID, LastName, FirstName, > > > MiddleName). PersonID is a sequential number from 1 - 1,000,000 and > there > > > are NO gaps in the sequence. I need the last names to be scrambled with > > > last names and so forth. ie. Smith, John D. could become Adams, Greg > B. > > > > > > The only way I know how to accomplish this is by looping through each > > Person > > > record and then select 3 random PersonID using the RAND() function and > > then > > > update the current PersonID with the LastName, FirstName, and Middle > names > > > of the rand() persons. This process works but, since I have a million > > > records, it takes my server close to 30 minutes to process. > > > > > > Does anyone know of a faster process to accomplish this? > > > > > > > > > > > >
First of all, RAND() may do more than just scramble the data. It will likely cause you to get some duplicates and lose some rows, since if you evaluate 1000000*rand() 1,000,000 times, the probability is vanishingly small that you will generate each integer from 1 to 1000000 exactly once each. But even if that's not a problem, I doubt there will be a solution with fewer than 3,000,000 steps. If you want to permute 1,000,000 values randomly, you need to perform some operation that will result in each of the 1,000,000! (one million factorial) possible permutations with equal likelihood. A couple hundred swaps of values will only give rise to 2^(a couple hundred) possible permutations, or far less than 1,000,000! possibilities. What will work is a loop that does something like this: Let N = 1,000,000 while N > 1 begin Select a random integer R between 1 and N inclusive Swap the first name of PersonID = R with the first name of PersonID = N Select another random integer R between 1 and N inclusive Swap the middle name of PersonID = R with the middle name of PersonID = N Select another random integer R between 1 and N inclusive Swap the last name of PersonID = R with the last name of PersonID = N set N = N - 1 end This isn't a simple job in any language. While there may be compact solutions in T-SQL with something like ORDER BY NEWID(), realize that if you had to actually sort 1,000,000 items three times, it would take at least 90,000,000 steps instead of 3 million, so a loop isn't so bad at all. SK [quoted text, click to view] SND wrote: >Does anyone have a neat way to scramble data in one table? > >I have a Person table with fields (PersonID, LastName, FirstName, >MiddleName). PersonID is a sequential number from 1 - 1,000,000 and there >are NO gaps in the sequence. I need the last names to be scrambled with >last names and so forth. ie. Smith, John D. could become Adams, Greg B. > >The only way I know how to accomplish this is by looping through each Person >record and then select 3 random PersonID using the RAND() function and then >update the current PersonID with the LastName, FirstName, and Middle names >of the rand() persons. This process works but, since I have a million >records, it takes my server close to 30 minutes to process. > >Does anyone know of a faster process to accomplish this? > > > >
That is exactly what I am doing. Duplicate and lose rows is not an issue. I also had to re-rand() all rows that ended up with the same PersonID for Last, Middle or First in an infinite loop until none is found. (I think this is ok, since the number of these records to are very low.) If there isn't a set based method, I guess I am doing it right. Thanks. [quoted text, click to view] "Steve Kass" <skass@drew.edu> wrote in message news:OvaSo3E9DHA.2736@TK2MSFTNGP10.phx.gbl... > First of all, RAND() may do more than just scramble the data. It will > likely cause you to get some duplicates and lose some rows, since if you > evaluate 1000000*rand() 1,000,000 times, the probability is vanishingly > small that you will generate each integer from 1 to 1000000 exactly once > each. > > But even if that's not a problem, I doubt there will be a solution with > fewer than 3,000,000 steps. If you want to permute 1,000,000 values > randomly, you need to perform some operation that will result in each of > the 1,000,000! (one million factorial) possible permutations with equal > likelihood. A couple hundred swaps of values will only give rise to > 2^(a couple hundred) possible permutations, or far less than 1,000,000! > possibilities. What will work is a loop that does something like this: > > Let N = 1,000,000 > > while N > 1 begin > Select a random integer R between 1 and N inclusive > Swap the first name of PersonID = R with the first name of PersonID = N > Select another random integer R between 1 and N inclusive > Swap the middle name of PersonID = R with the middle name of > PersonID = N > Select another random integer R between 1 and N inclusive > Swap the last name of PersonID = R with the last name of PersonID = N > set N = N - 1 > end > > This isn't a simple job in any language. While there may be compact > solutions in T-SQL with something like ORDER BY NEWID(), realize that if > you had to actually sort 1,000,000 items three times, it would take at > least 90,000,000 steps instead of 3 million, so a loop isn't so bad at all. > > SK > > > SND wrote: > > >Does anyone have a neat way to scramble data in one table? > > > >I have a Person table with fields (PersonID, LastName, FirstName, > >MiddleName). PersonID is a sequential number from 1 - 1,000,000 and there > >are NO gaps in the sequence. I need the last names to be scrambled with > >last names and so forth. ie. Smith, John D. could become Adams, Greg B. > > > >The only way I know how to accomplish this is by looping through each Person > >record and then select 3 random PersonID using the RAND() function and then > >update the current PersonID with the LastName, FirstName, and Middle names > >of the rand() persons. This process works but, since I have a million > >records, it takes my server close to 30 minutes to process. > > > >Does anyone know of a faster process to accomplish this? > > > > > > > > >
Yes, I do want to munge a production db for demo purposes. Swapping IDs will not work in my case since data is bound to a spefice type of person and will not make sense in a demo if the person types are switched. Whiping out first names is not an option either, since a search on first name is needed. A dictionary table sounds like a good idea, and I will try that. Thanks. [quoted text, click to view] "Tim" <Tim@NoSpam> wrote in message news:uZJJLJG9DHA.2924@tk2msftngp13.phx.gbl... > SND, > > Have you thought about creating a dictionary table of Surnames and in their > creating a mapping of where to go? > Are you trying to munge a production database for demo purposes? If so, then > you could also consider not replacing name parts, but swapping ID values > around instead and destroying the other identifying information - whiping > out the firstnames, dictionary on surnames, randomising middle initials, or > some combination of these methods. > > - Tim > > "SND" <sndip@cox.net> wrote in message news:d4WXb.1477$Mf.1107@okepread04... > > Does anyone have a neat way to scramble data in one table? > > > > I have a Person table with fields (PersonID, LastName, FirstName, > > MiddleName). PersonID is a sequential number from 1 - 1,000,000 and there > > are NO gaps in the sequence. I need the last names to be scrambled with > > last names and so forth. ie. Smith, John D. could become Adams, Greg B. > > > > The only way I know how to accomplish this is by looping through each > Person > > record and then select 3 random PersonID using the RAND() function and > then > > update the current PersonID with the LastName, FirstName, and Middle names > > of the rand() persons. This process works but, since I have a million > > records, it takes my server close to 30 minutes to process. > > > > Does anyone know of a faster process to accomplish this? > > > > > >
[quoted text, click to view] >> PersonID is a sequential number from 1 - 1,000,000 and there are NO
gaps in the sequence. I need the last names to be scrambled with last names and so forth. ie. Smith, John D. could become Adams, Greg B. << Go to a procdural language of your choice that has a random number generator and build a table like this: CREATE TABLE SwapLIst (seq_nbr INTEGER NOT NULL PRIMARY KEY, swapper INTEGER NOT NULL UNIQUE, CHECK (swapper BETWEEN 1 AND 1000000)); Swapper is the row whose lastname will replace the current last name. It has to be unique and between 1 and 1000000. For example, the RANDBETWEEN() in Excel is a quick way to do this. Now do one update: UPDATE Foobar SET lastname = (SELECT lastname FROM Foobar AS F1 WHERE Foobar.seq = F1.swapper); One reason for creating the SwapList table is to be able to un-do the swaps. [quoted text, click to view] >> The only way I know how to accomplish this is by looping through each
Person record [sic] and then select 3 random PersonID using the RAND() function and then update the current PersonID with the LastName, FirstName, and Middle names of the rand() persons. << Rows are not records. SQL is not meant for sequential processing. You're still not thinking in sets. Another thought -- there was an old UNIX routine that generated pronounciable nonsense words for passwords; plug that into the last names column for demo data. They are usually pretty funny, producing things like "John Paul Depfarb" or worse. --CELKO-- =========================== Please post DDL, so that people do not have to guess what the keys, constraints, Declarative Referential Integrity, datatypes, etc. in your schema are. *** Sent via Developersdex http://www.developersdex.com ***
Nice solution! I just wanted to know if it could be done entirely in SQL, [quoted text, click to view] "Joe Celko" <joe.celko@northface.edu> wrote in message news:u5qMr$K9DHA.2308@TK2MSFTNGP11.phx.gbl... > >> PersonID is a sequential number from 1 - 1,000,000 and there are NO > gaps in the sequence. I need the last names to be scrambled with last > names and so forth. ie. Smith, John D. could become Adams, Greg B. << > > Go to a procdural language of your choice that has a random number > generator and build a table like this: > > CREATE TABLE SwapLIst > (seq_nbr INTEGER NOT NULL PRIMARY KEY, > swapper INTEGER NOT NULL UNIQUE, > CHECK (swapper BETWEEN 1 AND 1000000)); > > Swapper is the row whose lastname will replace the current last name. It > has to be unique and between 1 and 1000000. > > For example, the RANDBETWEEN() in Excel is a quick way to do this. Now > do one update: > > UPDATE Foobar > SET lastname > = (SELECT lastname > FROM Foobar AS F1 > WHERE Foobar.seq = F1.swapper); > > One reason for creating the SwapList table is to be able to un-do the > swaps. > > >> The only way I know how to accomplish this is by looping through each > Person record [sic] and then select 3 random PersonID using the RAND() > function and then > update the current PersonID with the LastName, FirstName, and Middle > names of the rand() persons. << > > Rows are not records. SQL is not meant for sequential processing. > You're still not thinking in sets. > > Another thought -- there was an old UNIX routine that generated > pronounciable nonsense words for passwords; plug that into the last > names column for demo data. They are usually pretty funny, producing > things like "John Paul Depfarb" or worse. > > --CELKO-- > =========================== > Please post DDL, so that people do not have to guess what the keys, > constraints, Declarative Referential Integrity, datatypes, etc. in your > schema are. > > *** Sent via Developersdex http://www.developersdex.com *** > Don't just participate in USENET...get rewarded for it!
SND, Have you thought about creating a dictionary table of Surnames and in their creating a mapping of where to go? Are you trying to munge a production database for demo purposes? If so, then you could also consider not replacing name parts, but swapping ID values around instead and destroying the other identifying information - whiping out the firstnames, dictionary on surnames, randomising middle initials, or some combination of these methods. - Tim [quoted text, click to view] "SND" <sndip@cox.net> wrote in message news:d4WXb.1477$Mf.1107@okepread04... > Does anyone have a neat way to scramble data in one table? > > I have a Person table with fields (PersonID, LastName, FirstName, > MiddleName). PersonID is a sequential number from 1 - 1,000,000 and there > are NO gaps in the sequence. I need the last names to be scrambled with > last names and so forth. ie. Smith, John D. could become Adams, Greg B. > > The only way I know how to accomplish this is by looping through each Person > record and then select 3 random PersonID using the RAND() function and then > update the current PersonID with the LastName, FirstName, and Middle names > of the rand() persons. This process works but, since I have a million > records, it takes my server close to 30 minutes to process. > > Does anyone know of a faster process to accomplish this? > >
[quoted text, click to view] >> Nice solution! I just wanted to know if it could be done entirely in
SQL, << Sure; I believe that everything CAN be done in SQL; I do not believe that everything SHOULD be done in SQL :) --CELKO-- =========================== Please post DDL, so that people do not have to guess what the keys, constraints, Declarative Referential Integrity, datatypes, etc. in your schema are. *** Sent via Developersdex http://www.developersdex.com ***
Hello SND, Thank you for your email about how to randomly mix 1 million names in a table. From the responses and description of your issue, it sounds like what you are looking for is: 1. To take a table with a million rows and mix the names within that table. 2. It doesn't matter if the result contains duplicates and it doesn't matter if all the names are used. The key factor is that the real identity for the record no longer applies. 3. Looking for other ways of doing this besides using RAND. Essentially, you just need a random 1 million row table with bogus names and instead of trying to create new names, just mix up the ones you have. If this is correct, there are a couple of ways to do this. Below are 3 examples I came up with. 1. There are 3 columns in this table you wish to mix up. They are FirstName, LastName, and MiddleInitial. Given all the names in the table do not have to be used, you can use a Cartesian product to generate 1 million rows. For example, if I take 100 lastnames, 100 FirstNames, and 100 MiddleNames I can yield 1 Million rows using a Cartesian product: 100 x 100 x 100 = 1 Million. Or any combination thereof: 20 x 500 x 100 = 1 Million or 2000 * 5 * 100 = 1 Million. And so forth. In my example, to save time, I had a table with just 100 rows in it in which I ran the following query to generate my 1 Million row Cartesian product. Select a.Lname, b.Fname, c.Mid from NameTest a, NameTest b, NameTest c In your case where you have a million rows already in the table, you could do something like this to get kind of a mix with the names and end up with 1 Million rows. Again, I just used 100 x 100 x 100 to get my million. You can use any kind of combination that results in 1 Million rows to generate the kind of mix you want: Select a.Lname, b.Fname, c.Mid from (Select Top 100 Lname from NameTest) a, (Select Top 100 Fname from NameTest Order By Fname Desc) b, (Select Top 100 mid from NameTest) c This ran in 8 seconds on my machine and generated the 1 million rows. The disadvantage (depending on how important this is) is that not all names are used. Also what ever numbers you specify determines how many repeats there are. For example if you select 100 LNames and 100 FNames, each last name will be repeated 100 times for each first name. Throw in the Middle Initial 100 times and that adds even more duplicate names. 2. I haven't tried this one, but another way of doing the mix (and preserve all names) would be to use an algorithem similar to a quick sort in which for last name you could swap the first have of the table with the second half. For first name, you could swap the first quarter with the last quarter, the second quarter with the third quarter. This would likely run slower than the cartesian example in 1. 3. I think someone else mentioned creating a swapping table? This is a good way to swap the names without actually changing the data in your existing table and would also use all names. You essentially create a table that contains four fields. This table would consist of 1 million rows. The first field would be 1 through 1 Million in that order. The second column would be LNameKey and would consist of randomly picked numbers between 1 and 1 Million. Third column would be FNamekey and would consist of Random numbers between 1 and 1 Mil. Fourth Column would be MiddleNameKey and consist of Random numbers between 1 and 1 mil. Then you could use these columns as joins between this junction table and the real table joined on the primary key. For example, the query would look something like the following. In this example, my Intermediary table is called IntTbl: SELECT IntTbl.PID, NameTest.lname, NameTest_1.fname, NameTest_2.Mid FROM ((IntTbl INNER JOIN NameTest ON IntTbl.LNameKey = NameTest.PID) INNER JOIN NameTest AS NameTest_1 ON IntTbl.FNameKey = NameTest_1.PID) INNER JOIN NameTest AS NameTest_2 ON IntTbl.MidKey = NameTest_2.PID The disadvantage to this method is that you first have to create the junction table that would consist as your template and contain the randomly generated numbers for each key. The advantage is that once created, executes fast and the original table stays intact. I hope this helps!:) Sincerely, Rob Beene, MCSD, MCDBA Microsoft SQL Server Support This posting is provided "AS IS" with no warranties, and confers no rights.
Don't see what you're looking for? Try a search.
|