Groups | Blog | Home
all groups > sql server programming > february 2004 >

sql server programming : how to scramble data in a table


SND
2/15/2004 7:37:02 PM
Does anyone have a neat way to scramble data in one table?

I have a Person table with fields (PersonID, LastName, FirstName,
MiddleName). PersonID is a sequential number from 1 - 1,000,000 and there
are NO gaps in the sequence. I need the last names to be scrambled with
last names and so forth. ie. Smith, John D. could become Adams, Greg B.

The only way I know how to accomplish this is by looping through each Person
record and then select 3 random PersonID using the RAND() function and then
update the current PersonID with the LastName, FirstName, and Middle names
of the rand() persons. This process works but, since I have a million
records, it takes my server close to 30 minutes to process.

Does anyone know of a faster process to accomplish this?

Dan Guzman
2/15/2004 9:13:42 PM
If you don't mind using modulo on PersonID instead of RAND, you can
accomplish this with set-based processing using CASE expressions.

UPDATE Person
SET
LastName = CASE PersonID % 3
WHEN 0 THEN 'Smith'
WHEN 1 THEN 'Adams'
WHEN 2 THEN 'Doe'
END,
FirstName = CASE PersonID % 3
WHEN 0 THEN 'John'
WHEN 1 THEN 'Greg'
WHEN 2 THEN 'John'
END,
MiddleName = CASE PersonID % 3
WHEN 0 THEN 'D'
WHEN 1 THEN 'B'
WHEN 2 THEN 'C'
END

--
Hope this helps.

Dan Guzman
SQL Server MVP

[quoted text, click to view]

SND
2/15/2004 9:49:44 PM
This won't work in my case since I have a million records and I would have
to build a very large dynamic Case statement. Sorry for making it confusing
with my sample data.

[quoted text, click to view]

Dan Guzman
2/15/2004 11:43:04 PM
Sorry, I misread your post. Hopefully, Steve's answer is more help.

--
Dan Guzman
SQL Server MVP

[quoted text, click to view]

Steve Kass
2/16/2004 12:29:23 AM
First of all, RAND() may do more than just scramble the data. It will
likely cause you to get some duplicates and lose some rows, since if you
evaluate 1000000*rand() 1,000,000 times, the probability is vanishingly
small that you will generate each integer from 1 to 1000000 exactly once
each.

But even if that's not a problem, I doubt there will be a solution with
fewer than 3,000,000 steps. If you want to permute 1,000,000 values
randomly, you need to perform some operation that will result in each of
the 1,000,000! (one million factorial) possible permutations with equal
likelihood. A couple hundred swaps of values will only give rise to
2^(a couple hundred) possible permutations, or far less than 1,000,000!
possibilities. What will work is a loop that does something like this:

Let N = 1,000,000

while N > 1 begin
Select a random integer R between 1 and N inclusive
Swap the first name of PersonID = R with the first name of PersonID = N
Select another random integer R between 1 and N inclusive
Swap the middle name of PersonID = R with the middle name of
PersonID = N
Select another random integer R between 1 and N inclusive
Swap the last name of PersonID = R with the last name of PersonID = N
set N = N - 1
end

This isn't a simple job in any language. While there may be compact
solutions in T-SQL with something like ORDER BY NEWID(), realize that if
you had to actually sort 1,000,000 items three times, it would take at
least 90,000,000 steps instead of 3 million, so a loop isn't so bad at all.

SK


[quoted text, click to view]
SND
2/16/2004 8:30:29 AM
That is exactly what I am doing. Duplicate and lose rows is not an issue.
I also had to re-rand() all rows that ended up with the same PersonID for
Last, Middle or First in an infinite loop until none is found. (I think this
is ok, since the number of these records to are very low.)

If there isn't a set based method, I guess I am doing it right. Thanks.


[quoted text, click to view]

SND
2/16/2004 8:46:48 AM
Yes, I do want to munge a production db for demo purposes. Swapping IDs
will not work in my case since data is bound to a spefice type of person and
will not make sense in a demo if the person types are switched. Whiping out
first names is not an option either, since a search on first name is needed.
A dictionary table sounds like a good idea, and I will try that. Thanks.


[quoted text, click to view]

Joe Celko
2/16/2004 9:11:28 AM
[quoted text, click to view]
gaps in the sequence. I need the last names to be scrambled with last
names and so forth. ie. Smith, John D. could become Adams, Greg B. <<

Go to a procdural language of your choice that has a random number
generator and build a table like this:

CREATE TABLE SwapLIst
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
swapper INTEGER NOT NULL UNIQUE,
CHECK (swapper BETWEEN 1 AND 1000000));

Swapper is the row whose lastname will replace the current last name. It
has to be unique and between 1 and 1000000.

For example, the RANDBETWEEN() in Excel is a quick way to do this. Now
do one update:

UPDATE Foobar
SET lastname
= (SELECT lastname
FROM Foobar AS F1
WHERE Foobar.seq = F1.swapper);

One reason for creating the SwapList table is to be able to un-do the
swaps.

[quoted text, click to view]
Person record [sic] and then select 3 random PersonID using the RAND()
function and then
update the current PersonID with the LastName, FirstName, and Middle
names of the rand() persons. <<

Rows are not records. SQL is not meant for sequential processing.
You're still not thinking in sets.

Another thought -- there was an old UNIX routine that generated
pronounciable nonsense words for passwords; plug that into the last
names column for demo data. They are usually pretty funny, producing
things like "John Paul Depfarb" or worse.

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
SND
2/16/2004 3:37:15 PM
Nice solution! I just wanted to know if it could be done entirely in SQL,


[quoted text, click to view]

Tim
2/16/2004 8:55:32 PM
SND,

Have you thought about creating a dictionary table of Surnames and in their
creating a mapping of where to go?
Are you trying to munge a production database for demo purposes? If so, then
you could also consider not replacing name parts, but swapping ID values
around instead and destroying the other identifying information - whiping
out the firstnames, dictionary on surnames, randomising middle initials, or
some combination of these methods.

- Tim

[quoted text, click to view]

Joe Celko
2/17/2004 8:10:29 AM
[quoted text, click to view]
SQL, <<

Sure; I believe that everything CAN be done in SQL; I do not believe
that everything SHOULD be done in SQL :)

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
rbeene NO[at]SPAM online.microsoft.com
2/20/2004 5:19:49 AM
Hello SND,

Thank you for your email about how to randomly mix 1 million names in a
table. From the responses and description of your issue, it sounds like
what you are looking for is:

1. To take a table with a million rows and mix the names within that table.
2. It doesn't matter if the result contains duplicates and it doesn't
matter if all the names are used. The key factor is that the real identity
for the record no longer applies.
3. Looking for other ways of doing this besides using RAND.

Essentially, you just need a random 1 million row table with bogus names
and instead of trying to create new names, just mix up the ones you have.

If this is correct, there are a couple of ways to do this. Below are 3
examples I came up with.

1. There are 3 columns in this table you wish to mix up. They are
FirstName, LastName, and MiddleInitial. Given all the names in the table
do not have to be used, you can use a Cartesian product to generate 1
million rows. For example, if I take 100 lastnames, 100 FirstNames, and
100 MiddleNames I can yield 1 Million rows using a Cartesian product: 100
x 100 x 100 = 1 Million. Or any combination thereof: 20 x 500 x 100 = 1
Million or 2000 * 5 * 100 = 1 Million. And so forth. In my example, to
save time, I had a table with just 100 rows in it in which I ran the
following query to generate my 1 Million row Cartesian product.

Select a.Lname, b.Fname, c.Mid
from NameTest a, NameTest b, NameTest c

In your case where you have a million rows already in the table, you could
do something like this to get kind of a mix with the names and end up with
1 Million rows. Again, I just used 100 x 100 x 100 to get my million. You
can use any kind of combination that results in 1 Million rows to generate
the kind of mix you want:

Select a.Lname, b.Fname, c.Mid
from (Select Top 100 Lname from NameTest) a, (Select Top 100 Fname from
NameTest Order By Fname Desc) b, (Select Top 100 mid from NameTest) c

This ran in 8 seconds on my machine and generated the 1 million rows. The
disadvantage (depending on how important this is) is that not all names are
used. Also what ever numbers you specify determines how many repeats there
are. For example if you select 100 LNames and 100 FNames, each last name
will be repeated 100 times for each first name. Throw in the Middle Initial
100 times and that adds even more duplicate names.

2. I haven't tried this one, but another way of doing the mix (and
preserve all names) would be to use an algorithem similar to a quick sort
in which for last name you could swap the first have of the table with the
second half. For first name, you could swap the first quarter with the
last quarter, the second quarter with the third quarter. This would likely
run slower than the cartesian example in 1.

3. I think someone else mentioned creating a swapping table? This is a
good way to swap the names without actually changing the data in your
existing table and would also use all names. You essentially create a
table that contains four fields. This table would consist of 1 million
rows. The first field would be 1 through 1 Million in that order. The
second column would be LNameKey and would consist of randomly picked
numbers between 1 and 1 Million. Third column would be FNamekey and would
consist of Random numbers between 1 and 1 Mil. Fourth Column would be
MiddleNameKey and consist of Random numbers between 1 and 1 mil. Then you
could use these columns as joins between this junction table and the real
table joined on the primary key. For example, the query would look
something like the following. In this example, my Intermediary table is
called IntTbl:

SELECT IntTbl.PID, NameTest.lname, NameTest_1.fname, NameTest_2.Mid
FROM ((IntTbl INNER JOIN NameTest ON IntTbl.LNameKey = NameTest.PID) INNER
JOIN NameTest AS NameTest_1 ON IntTbl.FNameKey = NameTest_1.PID) INNER JOIN
NameTest AS NameTest_2 ON IntTbl.MidKey = NameTest_2.PID

The disadvantage to this method is that you first have to create the
junction table that would consist as your template and contain the randomly
generated numbers for each key. The advantage is that once created,
executes fast and the original table stays intact.

I hope this helps!:)

Sincerely,

Rob Beene, MCSD, MCDBA
Microsoft SQL Server Support

This posting is provided "AS IS" with no warranties, and confers no rights.
AddThis Social Bookmark Button