all groups > dotnet odbc.net > april 2006 >
I am using Odbc to read cvs files. Unfortunately, some cvs files are not formated correctly (out of my control). One particular problem is that a quote within an item is not put in double quotes, i.e. the file says "this "item" is bad", "this item is ok" rather then "this ""item"" is bad", "this item is ok" odbc now thinks 'this ' is the first item rather then 'this "item" is bad'. Excel reads the file just fine, though. Is there some workaround, short of fixing the file myself, to make the driver more error tolerant? If it helps anything bellow is how I read the excel file. Thanks connectionString = @"Driver={Microsoft Text Driver (*.txt; *.csv)};DBQ=" + Path.GetDirectoryName(filename); connection = new OdbcConnection(connectionString); connection.Open(); command = new OdbcCommand("Select * FROM " + Path.GetFileName(filename), connection); reader = command.ExecuteReader();
The first line "this "item" is bad", "this item is ok" is correct CSV format. There is no field delimiter (comma) between the two quotes around "item". Either your ODBC handler has a bug or you misconfiguring it. Please post your code that configures your ODBC reader. Your other option is to write a CSV line parser yourself, which isn't too difficult to do. Mike Ober. [quoted text, click to view] "rob" <rmdiv2000@yahoo.com> wrote in message news:1146296378.467061.100080@j33g2000cwa.googlegroups.com... > I am using Odbc to read cvs files. Unfortunately, some cvs files are > not formated correctly (out of my control). One particular problem is > that a quote within an item is not put in double quotes, i.e. the file > says > > "this "item" is bad", "this item is ok" > > rather then > > "this ""item"" is bad", "this item is ok" > > odbc now thinks 'this ' is the first item rather then 'this "item" is > bad'. Excel reads the file just fine, though. Is there some workaround, > short of fixing the file myself, to make the driver more error > tolerant? If it helps anything bellow is how I read the excel file. > > Thanks > > > connectionString = @"Driver={Microsoft Text Driver (*.txt; > *.csv)};DBQ=" + Path.GetDirectoryName(filename); > connection = new OdbcConnection(connectionString); > connection.Open(); > command = new OdbcCommand("Select * FROM " + > Path.GetFileName(filename), connection); > reader = command.ExecuteReader(); > >
Thanks for the reply. I do not do any special configuration but just what is shown in the original post. Then I read the fields in the following ways: while (reader.Read()) { for (i = 0; i < reader.FieldCount; i++){ reader.GetValue(i); }} What kind of configuration do I have to do to get this working? Thanks
[quoted text, click to view] Otis Mukinfus wrote: > On 29 Apr 2006 00:39:38 -0700, "rob" <rmdiv2000@yahoo.com> wrote: > > >I am using Odbc to read cvs files. Unfortunately, some cvs files are > >not formated correctly (out of my control). One particular problem is > >that a quote within an item is not put in double quotes, i.e. the file > >says > > > >"this "item" is bad", "this item is ok" > > > >rather then > > > >"this ""item"" is bad", "this item is ok" > > > >odbc now thinks 'this ' is the first item rather then 'this "item" is > >bad'. Excel reads the file just fine, though. Is there some workaround, > >short of fixing the file myself, to make the driver more error > >tolerant? If it helps anything bellow is how I read the excel file. > > > >Thanks > > > > > >connectionString = @"Driver={Microsoft Text Driver (*.txt; > >*.csv)};DBQ=" + Path.GetDirectoryName(filename); > >connection = new OdbcConnection(connectionString); > >connection.Open(); > >command = new OdbcCommand("Select * FROM " + > >Path.GetFileName(filename), connection); > >reader = command.ExecuteReader(); > > Have you tried reading the files directly, writing code to split the columns out > with the split command? If the data has the commas in the correct places, the > double quotes won't make any difference. If you don't want the quotes you can > replace all of them with string.Empty AFTER YOU SPLIT THE LINE. > > If you're trying to load the files into MS SQL Server, create a DTS Package > (assuming you're using SQL 2K) and do a bulk import into the table you're > filling. > > I never use ODBC or any other data provider to read text files, for the very > reason you are experiencing. > > This is cheating, but have you considered reading the file with Excel and saving > it as a tab delimited or comma delimited file that has no quotes? It won't be a > good solution if you are going to have to automate the process ;o) > > What are you doing with the file after converting it? > > Good luck with your project,
My first approach was using split. The problem that the individual items often also contain comas so split faild almost always. I tried regex as well but again due to many special cases that fails as well. I know that Excel can read the files correctly including all these special (and maybe even wrong) cases. Hopeing that excel uses odbc (rather then duplicating some code) I tried that approach. But it does not seem to work, either. I guess I might end up writing my own parser. It's probably not that big an issue but of course it would be nicer if I had a done solution that handles all the special cases. My final goal is to download cvs files from the internet. Actually, I don't store them as files but have them right in memory. Then I have to do some parsing (avoiding duplicates, extract the right data, do some error checking, etc) on the files and whatever I parsed out goes into an SQL database. Thanks
[quoted text, click to view] Otis Mukinfus wrote: > On 30 Apr 2006 10:10:05 -0700, "rob" <rmdiv2000@yahoo.com> wrote: > > [snip] > >My first approach was using split. The problem that the individual > >items often also contain comas so split faild almost always. I tried > >regex as well but again due to many special cases that fails as well. I > >know that Excel can read the files correctly including all these > >special (and maybe even wrong) cases. Hopeing that excel uses odbc > >(rather then duplicating some code) I tried that approach. But it does > >not seem to work, either. > > > >I guess I might end up writing my own parser. It's probably not that > >big an issue but of course it would be nicer if I had a done solution > >that handles all the special cases. > > > >My final goal is to download cvs files from the internet. Actually, I > >don't store them as files but have them right in memory. Then I have to > >do some parsing (avoiding duplicates, extract the right data, do some > >error checking, etc) on the files and whatever I parsed out goes into > >an SQL database. > > > >Thanks > > I've run into that situation where the commas are inside quotes. What I have > done in the past is to replace all the instances of "," with tabs '\t' and > remove the double quote left at the beginning and end of the file. Then you can > try splitting it on the tabs. It doesn't always work, but it's worth a try.
Thanks for your input. I know this approach will not work for me, though. For instance I have cases where a list of words are put in double quotes and separated by comas. Doing a general replace of "," with [Tab] would separate them where it shouldn't. There might be other scenarios where some items are put in double quotes (the once cotaining comas and double quotes) and items that are not put in double quotes (the once NOT containing comas or double quotes). I am sure the more I will use the stuff the more other scenarios will come up. That is why I would prefer not to write all that stuff myself but use some existing (Microsoft) solution. The only thing I can imagine right now is downloading the data from the internet, store it to a file, load the file in Excel and then read each individual cell from Excel. It's a huge overkill compared to just download the data and parse it myself in memory, though. Any other ideas would be appreciated. Thanks
I'd disagree based on http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm and http://en.wikipedia.org/wiki/Comma-separated_values Where are you getting that this is correct? It also looks like there's a space between the comma and the next quote. This looks like typical bad exports you encounter in the wild where people just drop all contents between quotes. Bruce Dunwiddie http://www.csvreader.com [quoted text, click to view] Michael D. Ober wrote: > The first line "this "item" is bad", "this item is ok" is correct CSV > format. There is no field delimiter (comma) between the two quotes around > "item". Either your ODBC handler has a bug or you misconfiguring it. > Please post your code that configures your ODBC reader. Your other option > is to write a CSV line parser yourself, which isn't too difficult to do. > > Mike Ober. > > "rob" <rmdiv2000@yahoo.com> wrote in message > news:1146296378.467061.100080@j33g2000cwa.googlegroups.com... > > I am using Odbc to read cvs files. Unfortunately, some cvs files are > > not formated correctly (out of my control). One particular problem is > > that a quote within an item is not put in double quotes, i.e. the file > > says > > > > "this "item" is bad", "this item is ok" > > > > rather then > > > > "this ""item"" is bad", "this item is ok" > > > > odbc now thinks 'this ' is the first item rather then 'this "item" is > > bad'. Excel reads the file just fine, though. Is there some workaround, > > short of fixing the file myself, to make the driver more error > > tolerant? If it helps anything bellow is how I read the excel file. > > > > Thanks > > > > > > connectionString = @"Driver={Microsoft Text Driver (*.txt; > > *.csv)};DBQ=" + Path.GetDirectoryName(filename); > > connection = new OdbcConnection(connectionString); > > connection.Open(); > > command = new OdbcCommand("Select * FROM " + > > Path.GetFileName(filename), connection); > > reader = command.ExecuteReader(); > > > >
You said that Excel opens this correctly. When I tried it with my Excel '03, it did not appear to me to open the line "correctly". It dropped the quote before item, left both fields surrounded by quotes, and left a space in front of the second field. You're going to have a rough time finding anything that parses this "correctly" as it's actually an impossibility to pull the data out of this in a stable manner, which is why the rules are there in the first place. If the rules are broken, all parsing rules are out. You're going to have to decide what rules you can make about the file, and what you're willing to give up. If you can for instance always know that there's supposed to be 7 fields per line, you can use a normal parser first. If it doesn't return 7 fields per line, you can bail out to a different parser that can try the replacing of "," with tab or whatever to see if you can then get 7 fields per line. Obviously, there's going to be situations where this will read the data incorrectly, but when you're dealing with an incorrect format, sometimes you have to just get the highest percentages possible and go from there. If you can't find any rules like there must be a certain number of fields, or certain fields must contain certain data, then I wish you all the best in trying to find a way of making it stable, but it's very unlikely. Bruce Dunwiddie http://www.csvreader.com
In my first post I actually said that it is not in a correct format. Michael responded and said it IS a correct csv format. Since I am no expert in csv I did not want to claim otherwise but it was hard to believe. The reason is that then "this ""item"" is bad" should show up in Excel as 'this ""item"" is bad' which it does not, i.e. it shows up as 'this "item" is bad'. The space is an error on my side. I just wrote down an representative example and accidentaly put the space in there. [quoted text, click to view] shriop wrote: > I'd disagree based on > http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm > and > http://en.wikipedia.org/wiki/Comma-separated_values > > Where are you getting that this is correct? It also looks like there's > a space between the comma and the next quote. This looks like typical > bad exports you encounter in the wild where people just drop all > contents between quotes. > > Bruce Dunwiddie > http://www.csvreader.com > > Michael D. Ober wrote: > > The first line "this "item" is bad", "this item is ok" is correct CSV > > format. There is no field delimiter (comma) between the two quotes around > > "item". Either your ODBC handler has a bug or you misconfiguring it. > > Please post your code that configures your ODBC reader. Your other option > > is to write a CSV line parser yourself, which isn't too difficult to do. > > > > Mike Ober. > > > > "rob" <rmdiv2000@yahoo.com> wrote in message > > news:1146296378.467061.100080@j33g2000cwa.googlegroups.com... > > > I am using Odbc to read cvs files. Unfortunately, some cvs files are > > > not formated correctly (out of my control). One particular problem is > > > that a quote within an item is not put in double quotes, i.e. the file > > > says > > > > > > "this "item" is bad", "this item is ok" > > > > > > rather then > > > > > > "this ""item"" is bad", "this item is ok" > > > > > > odbc now thinks 'this ' is the first item rather then 'this "item" is > > > bad'. Excel reads the file just fine, though. Is there some workaround, > > > short of fixing the file myself, to make the driver more error > > > tolerant? If it helps anything bellow is how I read the excel file. > > > > > > Thanks > > > > > > > > > connectionString = @"Driver={Microsoft Text Driver (*.txt; > > > *.csv)};DBQ=" + Path.GetDirectoryName(filename); > > > connection = new OdbcConnection(connectionString); > > > connection.Open(); > > > command = new OdbcCommand("Select * FROM " + > > > Path.GetFileName(filename), connection); > > > reader = command.ExecuteReader(); > > > > > >
Bruce, Thanks for the reply. You are correct that the fields do not show up correctly. What I meant to say is that Excel is able to separate the fields correctly. Actually, I am not too much interested in the fields that might not be rendered correctly. The important fields contain very simple data that is hard to get wrong. The more complex fields with double quotes, new line characters, commas, etc just have to be separated correctly. Excel is able to do that and this is a valid assumption that I can make (otherwise the publisher would be in troubles). So I was hoping that the parsing algorithm from Excel would in one way or another be accessible through the .NET framework. In any case, with the assumption I can make about the file, while not trivial, it should not be that hard to write my own parsing algorithm. But of course if I can assume that the file can be read by Excel and the parsing algorithm is accessible then this is my prefered method. Thanks for the input. [quoted text, click to view] shriop wrote: > You said that Excel opens this correctly. When I tried it with my Excel > '03, it did not appear to me to open the line "correctly". It dropped > the quote before item, left both fields surrounded by quotes, and left > a space in front of the second field. You're going to have a rough time > finding anything that parses this "correctly" as it's actually an > impossibility to pull the data out of this in a stable manner, which is > why the rules are there in the first place. If the rules are broken, > all parsing rules are out. You're going to have to decide what rules > you can make about the file, and what you're willing to give up. If you > can for instance always know that there's supposed to be 7 fields per > line, you can use a normal parser first. If it doesn't return 7 fields > per line, you can bail out to a different parser that can try the > replacing of "," with tab or whatever to see if you can then get 7 > fields per line. Obviously, there's going to be situations where this > will read the data incorrectly, but when you're dealing with an > incorrect format, sometimes you have to just get the highest > percentages possible and go from there. If you can't find any rules > like there must be a certain number of fields, or certain fields must > contain certain data, then I wish you all the best in trying to find a > way of making it stable, but it's very unlikely. > > Bruce Dunwiddie > http://www.csvreader.com
In the definition for CSV format, quotes are included in the text by using back-to-back quotes. So "Hi ""George"" Mason" should be interpreted by a good CSV parser as Hi "George" Mason Problem is that a lot of CSV parsers are poorly written, and a lot of CSV exporters are poorly written. Personally I like to go with Tab-delimited format or some other non-CSV format when I can... Makes life a lot easier. You can Google a CSV parser Regular Expression as well. [quoted text, click to view] "rob" wrote: > In my first post I actually said that it is not in a correct format. > Michael responded and said it IS a correct csv format. Since I am no > expert in csv I did not want to claim otherwise but it was hard to > believe. The reason is that then "this ""item"" is bad" should show up > in Excel as 'this ""item"" is bad' which it does not, i.e. it shows up > as 'this "item" is bad'. > > The space is an error on my side. I just wrote down an representative > example and accidentaly put the space in there. > > > shriop wrote: > > I'd disagree based on > > http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm > > and > > http://en.wikipedia.org/wiki/Comma-separated_values > > > > Where are you getting that this is correct? It also looks like there's > > a space between the comma and the next quote. This looks like typical > > bad exports you encounter in the wild where people just drop all > > contents between quotes. > > > > Bruce Dunwiddie > > http://www.csvreader.com > > > > Michael D. Ober wrote: > > > The first line "this "item" is bad", "this item is ok" is correct CSV > > > format. There is no field delimiter (comma) between the two quotes around > > > "item". Either your ODBC handler has a bug or you misconfiguring it. > > > Please post your code that configures your ODBC reader. Your other option > > > is to write a CSV line parser yourself, which isn't too difficult to do. > > > > > > Mike Ober. > > > > > > "rob" <rmdiv2000@yahoo.com> wrote in message > > > news:1146296378.467061.100080@j33g2000cwa.googlegroups.com... > > > > I am using Odbc to read cvs files. Unfortunately, some cvs files are > > > > not formated correctly (out of my control). One particular problem is > > > > that a quote within an item is not put in double quotes, i.e. the file > > > > says > > > > > > > > "this "item" is bad", "this item is ok" > > > > > > > > rather then > > > > > > > > "this ""item"" is bad", "this item is ok" > > > > > > > > odbc now thinks 'this ' is the first item rather then 'this "item" is > > > > bad'. Excel reads the file just fine, though. Is there some workaround, > > > > short of fixing the file myself, to make the driver more error > > > > tolerant? If it helps anything bellow is how I read the excel file. > > > > > > > > Thanks > > > > > > > > > > > > connectionString = @"Driver={Microsoft Text Driver (*.txt; > > > > *.csv)};DBQ=" + Path.GetDirectoryName(filename); > > > > connection = new OdbcConnection(connectionString); > > > > connection.Open(); > > > > command = new OdbcCommand("Select * FROM " + > > > > Path.GetFileName(filename), connection); > > > > reader = command.ExecuteReader(); > > > > > > > > >
Don't see what you're looking for? Try a search.
|
|
|