I know this question has been asked. And the usual answer is don't use cursors or any other looping method. Instead, try to find a solution that uses set-based queries. But this brings up several questions / senarios: * I created several stored procedures that take parameters and inserts the data into the appropriate tables. This was done for easy access/use from client side apps (i.e. web-based). Proper development tactics says to try and do "code reuse". So, if I already have stored procs that do my logic, should I be writing a second way of handling the data? If I ever need to change the way the data is handled, I now have to make the same change in two (or more) places. * Different data from the same row needs to be inserted into multiple tables. "Common sense" (maybe "gut instinct" is better) says to handle each row as a "unit". Seems weird to process the entire set for one table, then to process the entire set AGAIN for another table, and then YET AGAIN for a third table, and so on. * Exception handling. Set based processing means that if one row fails the entire set fails. Looping through allows you to fail a row but allow everything else to be processed properly. It also allows you to gather statistics. (How many failed, how many worked, how many were skipped, etc.) ?? Good idea ?? The alternative is to create a temporary table (sandbox or workspace type thing), copy the data to there along with "status" or "valdation" columns, run through the set many times over looking for any rows that may fail, marking them as such, and then at the end only dealing with those rows which "passed" the testing. Of course, in order for this to work you must know (and duplicate) all constraints so you know what to look for in your testing.
[quoted text, click to view] JayCallas@hotmail.com wrote: > I know this question has been asked. And the usual answer is don't use > cursors or any other looping method. Instead, try to find a solution > that uses set-based queries. > > But this brings up several questions / senarios: > > * I created several stored procedures that take parameters and inserts > the data into the appropriate tables. This was done for easy access/use > from client side apps (i.e. web-based). > > Proper development tactics says to try and do "code reuse". So, if I > already have stored procs that do my logic, should I be writing a > second way of handling the data? If I ever need to change the way the > data is handled, I now have to make the same change in two (or more) > places. > > * Different data from the same row needs to be inserted into multiple > tables. "Common sense" (maybe "gut instinct" is better) says to handle > each row as a "unit". Seems weird to process the entire set for one > table, then to process the entire set AGAIN for another table, and then > YET AGAIN for a third table, and so on. > > * Exception handling. Set based processing means that if one row fails > the entire set fails. Looping through allows you to fail a row but > allow everything else to be processed properly. It also allows you to > gather statistics. (How many failed, how many worked, how many were > skipped, etc.) > > ?? Good idea ?? The alternative is to create a temporary table (sandbox > or workspace type thing), copy the data to there along with "status" or > "valdation" columns, run through the set many times over looking for > any rows that may fail, marking them as such, and then at the end only > dealing with those rows which "passed" the testing. Of course, in order > for this to work you must know (and duplicate) all constraints so you > know what to look for in your testing.
Another reason why code re-use is less of an issue in SQL compared to general programming languages is that it's so easy to generate scripts automatically from your database's metadata. That's particularly the case for CRUD scripts, transformation scripts, archiving, auditing, etc. If you have good constraints and naming conventions then you can automate the production of thousands of lines of code in seconds. That maybe not as important as Robert and Hugo's other sensible comments but it is a distinct advantage. -- David Portas, SQL Server MVP Whenever possible please post enough code to reproduce your problem. Including CREATE TABLE and INSERT statements usually helps. State what version of SQL Server you are using and specify the content of any error messages. SQL Server Books Online: http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx --
Thanks for replying. As you probably guessed, there are specific reasons why I am asking these questions. Maybe a bit more info would be helpful. * I am not talking about the same data being written to multiple tables. I referring to situations where different columns from the same row are written to different tables. A concrete example would be security (stocks) data. There are different types of securities. (Equities and Options are just two of them). Both types of securities share common information like symbol, name, and security type. But options have additional data (strike price, class, and expiration). My current schema has a table named SecurityMaster where common data goes with the primary key being Symbol. I also have a SecurityMasterDerivative table where I put Option specific data which has a foreign key reference to the SecurityMaster table. So when I get a new security master file, the set-based approach says to insert all the new common security data into the SecurityMaster table and then go back and and reprocess the new data to insert the option specific data. My dilemma is that I feel it is more "natural" to treat each row as a single entity. * The stored procedure approach I mentioned in the first point is related to the example I just gave. When inserting a new security into the "Security Master database" (I do not mean the single table), there is a at least one table involved but possibly up to 4 tables depending on the type of security that we are dealing with. * As far as exception handling goes, why is it considered "normal" to want all the data or just some of it? What about the idea of getting "most" of the data in and then dealing with those "exceptions"? Is it considered "business logic" to say "I rather have the majority of the data as opposed to NONE of the data"? I understand that I can make one large INSERT query which also performs ALL the required checks in order to filter out "bad" rows. But I would think that the query could get very complex very quickly. I regularly deal with situations where the data can not only NOT be trusted to be complete but where column names and formats are VERY different even for "similar" information. I find that I need to "backfill" a lot of data or perform a good number of value conversions. One example that comes to mind of using a single query resulted in code that took 25+ minutes ro run. In this situation I was dealing with position data for securities. For Equities, I trusted the CUSIP value they gave me but not the symbol. So I decided to "verify" the symbol against what I knew it as (ie. backfill the column). But it was not enough to just do a left join. Because it was possible to have the CUSIP repeated in the lookup table (the PK was CUSIP and Symbol). So my Symbol lookup logic (remember this was a single query), was to see if there was only one row with the given cusip, then check to see if there was only one active security row with the given cusip, THEN check to see if there was only one north american active security with the given cusip, otherwise use what the source gave us. And that was only for equities... For options the logic is 100% opposite... Talk about complex queries that SUCK in performance... The reason why I suggested the multiple pass validation approach was to be able to record or mark not only those rows that failed but WHY they failed. (Assume that there is a [ValidationCode] and [ValidationReason] column in the "sandbox" table.) I need to be able to create an exception report so somebody can go back and "fix" the data. ----- I guess the bottom line is that I am probably trying to apply my many years of application development approaches and opinions to database development and the two do not mesh.
(JayCallas@hotmail.com) writes: [quoted text, click to view] > * I created several stored procedures that take parameters and inserts > the data into the appropriate tables. This was done for easy access/use > from client side apps (i.e. web-based). > > Proper development tactics says to try and do "code reuse". So, if I > already have stored procs that do my logic, should I be writing a > second way of handling the data? If I ever need to change the way the > data is handled, I now have to make the same change in two (or more) > places.
Undeniably, this is surely a common reason why people end up with cursors. They already have a procedure that performs an operation on a scalar set of data, and now they need to do it multiple times. We have plenty of such examples in our database. When you face this situation you have a tradeoff: speed up development and run a cursor and take the performance cost. Or rewrite that stored procedure so that it deals with data in a table. Typically when we do this, we use a temp table for input or more often a process- keyed table. (See http://www.sommarskog.se/share_data.html#usingtable for a discussion on this.) You keep the old scalar procedure, but to not duplicate logic, you make it a wrapper on the set-based procedure. Rewriting a scalar procedure into set-based is not a trivial task, not the least when you code with performance in mind. And not the least that at least during a transitional period there will be loops that call the set-based procedure for one row at a time. There is certainly an overhead of getting data from a table rather than from parameters. So there is definitely a trade-off here. But if a loop today is the best bet, it may not be tomorrow, because as the data grows in size, performance becomes an issue. I should add that I'm here talking of procedure where logic is really complex. As the others have said, code reuse is not equally much a virtue in SQL as it is application code. [quoted text, click to view] > * Different data from the same row needs to be inserted into multiple > tables. "Common sense" (maybe "gut instinct" is better) says to handle > each row as a "unit". Seems weird to process the entire set for one > table, then to process the entire set AGAIN for another table, and then > YET AGAIN for a third table, and so on.
I'm not sure that I understand this point. As I mentioned, I usually have my input to my set-based procedures in a table, so I don't really see the problem. [quoted text, click to view] > * Exception handling. Set based processing means that if one row fails > the entire set fails. Looping through allows you to fail a row but > allow everything else to be processed properly. It also allows you to > gather statistics. (How many failed, how many worked, how many were > skipped, etc.)
Yes, this is a point that is often over-looked. Judging from some of the other replies, some people appears to prefer ignore this issue, but depending on your business requirement and the likelyhood for errors, this may be a decisive reason to run a cursor. For instance, I recently rewrote a stored procedure in our system which makes a contract note definitive. A stock broker have thousands of contract notes every day. Typically customer notes stay open all day, but at the end of the day they should all be made definitive. In conjunction with this there are a number of updates to perform. With the old procedure, the notes were handled one by one, and if an occasional note failed, that was no disaster. It could be sorted out the next day. For the new procedure, I do perform some initial validations, and notes that fails these validations will not be updated whereas the rest will be. But from the point that I've started to perform updates, there is not really an easy way out if there is some problem with a single note. That means that the entire operation will fail. In our case, this should only happen exceptionally, but if it happens the customer be in dire straits, not the least that this will typically happen after hours. (Please don't suggest that I should validate everything in advance. We're not only talking constraints, but assertions in sub-procedures, whereof some are very complex.) A possible fall back in this case, that I have implemented, but maybe I should would be that if there is an error, I would then start to take the notes one-by-one. -- Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at
(JayCallas@hotmail.com) writes: [quoted text, click to view] > So when I get a new security master file, the set-based approach says > to insert all the new common security data into the SecurityMaster > table and then go back and and reprocess the new data to insert the > option specific data. My dilemma is that I feel it is more "natural" to > treat each row as a single entity.
That is a feeling that you should overcome. It's perfectly normal to read the same row all over again. (Maybe one day we will get a multi-table INSERT, but I am not holding my breath.) [quoted text, click to view] > * As far as exception handling goes, why is it considered "normal" to > want all the data or just some of it? What about the idea of getting > "most" of the data in and then dealing with those "exceptions"? Is it > considered "business logic" to say "I rather have the majority of the > data as opposed to NONE of the data"?
Of course, sometimes that is the business rule: all or nothing. But there are also lots of processes where it's OK that some data slips through the crack, as long what is a unit is a unit. (You don't want an option to be imported into SecurityMaster, but then not getting the option-specific data in place. Then you rather lose it.) In the case where you import data, validation errors may be anticipated, and you could try to check for the most likely in advance. In the end what matters is of course, if the performance for the solution you have is acceptable or not. -- Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at
Disclaimer: difficult to answer without more specific info. [quoted text, click to view] JayCallas@hotmail.com wrote: > I know this question has been asked. And the usual answer is don't use > cursors or any other looping method. Instead, try to find a solution > that uses set-based queries. > > But this brings up several questions / senarios: > > * I created several stored procedures that take parameters and inserts > the data into the appropriate tables. This was done for easy access/use > from client side apps (i.e. web-based). > > Proper development tactics says to try and do "code reuse". So, if I > already have stored procs that do my logic, should I be writing a > second way of handling the data? If I ever need to change the way the > data is handled, I now have to make the same change in two (or more) > places.
If a SP encapsulates a plain INSERT I'd throw it out. If there is more complex logic involved I'd probable leave it in. [quoted text, click to view] > * Different data from the same row needs to be inserted into multiple > tables. "Common sense" (maybe "gut instinct" is better) says to handle > each row as a "unit". Seems weird to process the entire set for one > table, then to process the entire set AGAIN for another table, and then > YET AGAIN for a third table, and so on.
This smells like a design issue. If you have to insert the exact same data into multiple tables chances are that your table layout is flawed. [quoted text, click to view] > * Exception handling. Set based processing means that if one row fails > the entire set fails. Looping through allows you to fail a row but > allow everything else to be processed properly. It also allows you to > gather statistics. (How many failed, how many worked, how many were > skipped, etc.)
Normally you want all or nothing. This one sounds as if you placed business logic into to database. This might or might not be a good idea but there seems to be a chance that this is better done in a middle tier (for example because it eases porting to another RDBMS). But it depends on the error handling and operations you do in SP's. [quoted text, click to view] > ?? Good idea ?? The alternative is to create a temporary table (sandbox > or workspace type thing), copy the data to there along with "status" or > "valdation" columns, run through the set many times over looking for > any rows that may fail, marking them as such, and then at the end only > dealing with those rows which "passed" the testing. Of course, in order > for this to work you must know (and duplicate) all constraints so you > know what to look for in your testing.
I wouldn't recommend one or the other direction with so few info. Kind regards robert
[quoted text, click to view] On 28 Mar 2006 12:06:31 -0800, JayCallas@hotmail.com wrote:
(snip) [quoted text, click to view] >Proper development tactics says to try and do "code reuse".
Hi Jay, This is not always true when dealing with databases. You have to weigh the benefit of code reuse (cheaper maintenance) against the benefit of code duplication (better performance). In databases where performance matters, I won't hesitate a single second to duplicate a stored procedure and change it slightly to optimize for specific situations. [quoted text, click to view] >* Different data from the same row needs to be inserted into multiple >tables. "Common sense" (maybe "gut instinct" is better) says to handle >each row as a "unit". Seems weird to process the entire set for one >table, then to process the entire set AGAIN for another table, and then >YET AGAIN for a third table, and so on.
I agree with Robert's reaction about the validitiy of such a design. But if we assume that this really is valid, then you'll probablly find the cost of the second, third, etc. access to the data faster than the first access, because the data can all be fetched from cache and no physical disk I/O is needed. Of course, you can always test both versions against each other - I think that you'll be hard-pressed to find a scenario where using a cursor outperforms using the same SELECT statement two or three times in a row. [quoted text, click to view] >* Exception handling. Set based processing means that if one row fails >the entire set fails. Looping through allows you to fail a row but >allow everything else to be processed properly. It also allows you to >gather statistics. (How many failed, how many worked, how many were >skipped, etc.)
In an RDBMS, transactions should be atomic (the A in the ACID properties of transactions) - they either succeed as a whole, or they fail as a whole. If you need to exclude rows that would violate a constraint, check the constraint in the WHERE clause. [quoted text, click to view] > The alternative is to create a temporary table (sandbox >or workspace type thing), copy the data to there along with "status" or >"valdation" columns, run through the set many times over looking for >any rows that may fail, marking them as such,
No need to loop over the data many times. In most cases, you only need a single UPDATE with a CASE to check the variuous constraints and set the "status" column accordingly. [quoted text, click to view] > and then at the end only >dealing with those rows which "passed" the testing. Of course, in order >for this to work you must know (and duplicate) all constraints so you >know what to look for in your testing.
Yes. You need to know your constraints. Isn't that a normal part of your job? --
[quoted text, click to view] Erland Sommarskog wrote: >> * Different data from the same row needs to be inserted into multiple >> tables. "Common sense" (maybe "gut instinct" is better) says to handle >> each row as a "unit". Seems weird to process the entire set for one >> table, then to process the entire set AGAIN for another table, and then >> YET AGAIN for a third table, and so on. > > I'm not sure that I understand this point. As I mentioned, I usually > have my input to my set-based procedures in a table, so I don't really > see the problem.
I think what Jay is up to here is mapping of an OO model to a relational model. There is one "master" table which represents the root class of his hierarchy and several tables for specific types. I've never worked with user defined types but maybe these are an option here as well. Kind regards
Robert Klemme (bob.news@gmx.net) writes: [quoted text, click to view] > Erland Sommarskog wrote: >>> * Different data from the same row needs to be inserted into multiple >>> tables. "Common sense" (maybe "gut instinct" is better) says to handle >>> each row as a "unit". Seems weird to process the entire set for one >>> table, then to process the entire set AGAIN for another table, and then >>> YET AGAIN for a third table, and so on. >> >> I'm not sure that I understand this point. As I mentioned, I usually >> have my input to my set-based procedures in a table, so I don't really >> see the problem. > > I think what Jay is up to here is mapping of an OO model to a relational > model. There is one "master" table which represents the root class of > his hierarchy and several tables for specific types. I've never worked > with user defined types but maybe these are an option here as well.
I suppose you are thinking of the CLR UDTs added to SQL 2005. No, they are not an option here. CLR UDTs are for small values that always go together, and where you rarely would search for the individual values. Personally, I see CLR UDTs as a fairly marginal benefit. As for the structure of Jay's tables, I have no problem to understand it, as I work with securities myself, and different classes of securities have different properties. -- Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at
[quoted text, click to view] Erland Sommarskog wrote: > Robert Klemme (bob.news@gmx.net) writes: >> Erland Sommarskog wrote: >>>> * Different data from the same row needs to be inserted into multiple >>>> tables. "Common sense" (maybe "gut instinct" is better) says to handle >>>> each row as a "unit". Seems weird to process the entire set for one >>>> table, then to process the entire set AGAIN for another table, and then >>>> YET AGAIN for a third table, and so on. >>> I'm not sure that I understand this point. As I mentioned, I usually >>> have my input to my set-based procedures in a table, so I don't really >>> see the problem. >> I think what Jay is up to here is mapping of an OO model to a relational >> model. There is one "master" table which represents the root class of >> his hierarchy and several tables for specific types. I've never worked >> with user defined types but maybe these are an option here as well. > > I suppose you are thinking of the CLR UDTs added to SQL 2005. No, they are > not an option here. CLR UDTs are for small values that always go together, > and where you rarely would search for the individual values. Personally, I > see CLR UDTs as a fairly marginal benefit.
Ah, good to know. [quoted text, click to view] > As for the structure of Jay's tables, I have no problem to understand it, > as I work with securities myself, and different classes of securities > have different properties.
Then I probably misunderstood your posting. Sorry for the noise. Kind regards
[quoted text, click to view] JayCallas@hotmail.com wrote: > Thanks for replying. As you probably guessed, there are specific > reasons why I am asking these questions. Maybe a bit more info would be > helpful. > > * I am not talking about the same data being written to multiple > tables. I referring to situations where different columns from the same > row are written to different tables. > > A concrete example would be security (stocks) data. There are different > types of securities. (Equities and Options are just two of them). Both > types of securities share common information like symbol, name, and > security type. But options have additional data (strike price, class, > and expiration). > > My current schema has a table named SecurityMaster where common data > goes with the primary key being Symbol. I also have a > SecurityMasterDerivative table where I put Option specific data which > has a foreign key reference to the SecurityMaster table. > > So when I get a new security master file, the set-based approach says > to insert all the new common security data into the SecurityMaster > table and then go back and and reprocess the new data to insert the > option specific data. My dilemma is that I feel it is more "natural" to > treat each row as a single entity. > > * The stored procedure approach I mentioned in the first point is > related to the example I just gave. When inserting a new security into > the "Security Master database" (I do not mean the single table), there > is a at least one table involved but possibly up to 4 tables depending > on the type of security that we are dealing with. > > * As far as exception handling goes, why is it considered "normal" to > want all the data or just some of it? What about the idea of getting > "most" of the data in and then dealing with those "exceptions"? Is it > considered "business logic" to say "I rather have the majority of the > data as opposed to NONE of the data"? > > I understand that I can make one large INSERT query which also performs > ALL the required checks in order to filter out "bad" rows. But I would > think that the query could get very complex very quickly. I regularly > deal with situations where the data can not only NOT be trusted to be > complete but where column names and formats are VERY different even for > "similar" information. > > I find that I need to "backfill" a lot of data or perform a good number > of value conversions. > > One example that comes to mind of using a single query resulted in code > that took 25+ minutes ro run. In this situation I was dealing with > position data for securities. > > For Equities, I trusted the CUSIP value they gave me but not the > symbol. So I decided to "verify" the symbol against what I knew it as > (ie. backfill the column). But it was not enough to just do a left > join. Because it was possible to have the CUSIP repeated in the lookup > table (the PK was CUSIP and Symbol). So my Symbol lookup logic > (remember this was a single query), was to see if there was only one > row with the given cusip, then check to see if there was only one > active security row with the given cusip, THEN check to see if there > was only one north american active security with the given cusip, > otherwise use what the source gave us. > > And that was only for equities... For options the logic is 100% > opposite... Talk about complex queries that SUCK in performance... > > The reason why I suggested the multiple pass validation approach was to > be able to record or mark not only those rows that failed but WHY they > failed. (Assume that there is a [ValidationCode] and [ValidationReason] > column in the "sandbox" table.) I need to be able to create an > exception report so somebody can go back and "fix" the data. > > ----- > > I guess the bottom line is that I am probably trying to apply my many > years of application development approaches and opinions to database > development and the two do not mesh.
You are describing some common data integration scenarios. SQL isn't always an ideal integration tool. Certainly it is possible to do most kinds of transformation and validation in SQL, especially if you build a good framework and use a staging database to help you. However, it's fair to say that SQL works best under the assumption that your data as a whole will conform to a set of business rules enforced by constraints. Validating data at row level can therefore be hard and complex. This is a major reason why the market for enterprise integration tools exists. Micrsoft's offerings are DTS and Integration Services but there are a host of other solutions too. -- David Portas, SQL Server MVP Whenever possible please post enough code to reproduce your problem. Including CREATE TABLE and INSERT statements usually helps. State what version of SQL Server you are using and specify the content of any error messages. SQL Server Books Online: http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx --
[quoted text, click to view] >> * Different data from the same row needs to be inserted into multiple tables. "Common sense" (maybe "gut instinct" is better) says to handle
each row as a "unit". Seems weird to process the entire set for one table, then to process the entire set AGAIN for another table, and then YET AGAIN for a third table, and so on. << You still think sequentially and procedurally. The SQL engine is free to hold the data internally and parallelize the process within a single transaction. It is not like you must open and close a file over and over. In Standard SQL/PSM, the keywords are BEGIN ATOMIC.. END; What a series of inserts must do is is see that the order of insertion is EFFECTIVELY preserved. That is a technical thing in "SQL-Speak" that says if I can put data item X in tables T1, T2, ..Tn all at once, I can do it inparallel. I might have to defer constraints, etc., but if the effect is the same, sequential order is not required. [quoted text, click to view] >> * Exception handling. Set based processing means that if one row fails the entire set fails. Looping through allows you to fail a row but allow everything else to be processed properly. It also allows you to gather statistics. (How many failed, how many worked, how many were skipped, etc.)<<
Look up the concept of SAVEPOINTs and transaction grandularity. A SAVEPOINT is a "mini-COMMIT"; when you hit an error, you can roll back to the last save point, do something and try again. The other method is to break the batch into the smallest possible transaction grandularity and run each one. The gimmick is that they have to be separable. Example: making bank deposits to single accounts. Univac experimented with a concurency control method they called logical concurency control. The idea was to look at the job queue, see which statements were done on disjoint sets (posting deposits and bowling league scores) and let them run at the same time without any locks or further checking. I do not know what happened to the project. [quoted text, click to view] >> The alternative is to create a temporary table (sandbox or workspace type thing), copy the data to there along with "status" or "valdation" columns, run through the set many times over looking for any rows that may fail, marking them as such, and then at the end only dealing with those rows which "passed" the testing. Of course, in order for this to work you must know (and duplicate) all constraints so you
know what to look for in your testing. << This is the approach used for data scrubbing in Data Warehouses. It is also the reason that we are seeing Business Rules engines that could used to generate code for the RDBMS, the DW, the front end code, etc. with the certainity that they will all match.
[quoted text, click to view] On 28 Mar 2006 13:50:45 -0800, JayCallas@hotmail.com wrote: >Thanks for replying. As you probably guessed, there are specific >reasons why I am asking these questions. Maybe a bit more info would be >helpful.
(snip) Hi Jay, David and Erland have already addressed many of your points. I'll skip most of your message and only chime in where I feel that I have sometinh useful to add. [quoted text, click to view] >* As far as exception handling goes, why is it considered "normal" to >want all the data or just some of it? What about the idea of getting >"most" of the data in and then dealing with those "exceptions"? Is it >considered "business logic" to say "I rather have the majority of the >data as opposed to NONE of the data"?
Because, in most cases, all the data that is being handled in the same batch is in some way related. Like the classic example of a debit and a credit booking in a double-entry bookkeeping system - you definitely don't want one to fail and the other to succeed. The ""all or nothing" principle is the most common situation in databases. There are example, of course. Like yours. These examples are often found in subsystems that deal with importing lots of (potentially corrupt) data - again, like yours. David already said that a DB might not be the ideal tool for this job. I recommend that you look into ETL tools. SQL Server ships with DTS (for SQL Server 2000) or Integration Services (SSIS - for SQL Server 2005), but there are many more available. If you can't or won't use an ETL tool, than my recommendation would be to import to a staging table, massage the data until you have a completely valid set of data, then import into the main table. Most of the massaging can (and should) be done with set-based operation - but there might be exception where a cursor performs better; as I've already indicated in my first reply, testing is the only way to find out. [quoted text, click to view] >The reason why I suggested the multiple pass validation approach was to >be able to record or mark not only those rows that failed but WHY they >failed. (Assume that there is a [ValidationCode] and [ValidationReason] >column in the "sandbox" table.) I need to be able to create an >exception report so somebody can go back and "fix" the data.
Your proposed approach was, if I understood correctly, to just try and insert the data and catch errors. You would then try to parse the error and translate it to a valid ValidationCode and ValidationReason for your application? Using a set-based construct might be faster and is (IMO) definitely easier: UPDATE StagingTable SET ValidationCode = 123, ValidationReason = 'Negative amount' WHERE Amount < 0 Or, for failures that would have a similar where clause: UPDATE s SET ValidationCode = CASE WHEN t.Symbol IS NULL THEN 124 WHEN t.Market <> 'DJI' THEN 125 END, ValidationReason = CASE WHEN t.Symbol IS NULL THEN 'Unknown ticker symbol' WHEN t.Market <> 'DJI' THEN 'Not traded on Dow Jones' END FROM StagingTable AS s INNER JOIN TickerSymbols AS t ON t.Symbol = s.Symbol WHERE t.Symbol IS NULL OR t.Market <> 'DJI' And then for the final import: INSERT INTO RealTable (....) SELECT ..... FROM StagingTable WHERE ValidationCode IS NULL --
I'm a little late to this discussion, hopefully still being monitored. The ultimate goal is that the data supports the business needs, and provides solutions for your customers. For lack of a better word, lets use the word "business transaction." If the customer wants to insert as much data from a business transaction is valid, then your code should do it. For a complex example like yours, I can't help but believe you will have to look at a given input row, and process and validate that across all the tables and business rules that apply, and then implement the changes. Then, repeat for the rest of the input rows. So you have cursors, and custom logic, and repeated validation across many tables and business rules and foreign keys. The primary goal is to solve the business needs. A secondary goal is supportability. Perhaps you end up with a bunch of custom stored procedures. That is ok, as long as they are documented, readable by others, and supportable. In your examples, the set based answer really won't work. As an example, there could be input Row Q and Row B which you processed and inserted. Then, later you Process Row T which it turns out is prevented from being inserted by part of the existance of Row Q and Row B. You could have maybe figured it out ahead of time, but the odds say you are better off not doing it that way. In general, I'd rather flunk data on import out then get bad data into my main database. It is easier to find and understand data that flunked, then is to clean up a "mistake" that has been in the database a few months. The hardest part for me in projects like this is to come up with a logical way to go about the big picture. Talking it over with someone, and explaining it to someone is a great way to solidify your thoughts, and make sure you are keeping the big picture in mind. regards, doug
Don't see what you're looking for? Try a search.
|