DevelopmentNow Blog
 Tuesday, September 23, 2008

I stumbled across two lightweight web-based database administration tools: php Mini Admin and QDBA. Both are single PHP files, so you can just upload them to a server, specify your database login information, and be able to browse tables, edit data, and do other administrative tasks without installing the effective-but-heavy phpMyAdmin or allowing direct remote database access to your MySQL server.

Note that you should still take proper security measures when using these web-based tools (SSL, restrict access using ACLs or IP, don't let users log in as root), but overall they're easy & helpful. Of the two, QDBA is more attractive & usable, but is primarily used to view and edit data. phpMiniAdmin is spartan but offers many additional built-in features like import/export, table optimization & repair, and other "DBA" tasks. Both tools offer a SQL query window were you can execute raw SQL commands against the server.

phpMiniAdmin Screenshot:

 

QDBA Screenshot

September 23, 2008    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Monday, April 07, 2008

The easiest way I've found involves two steps:

Adjust SSMS Settings

  1. Go to Tools->Options
  2. Query Results->SQL Server->Results to Grid
  3. Check "Include column headers when copying or saving results"
  4. Click OK.
  5. Note that the new settings won't affect any existing Query tabs -- you'll need to open new ones and/or restart SSMS.

Now next time you run a query, do this

  1. Make sure the results are displayed in a grid (CTRL+D or Query->Results To->Results to Grid)
  2. Right click in the grid, and click Select All
  3. Right click in the grid again & click Copy
  4. Open up a new Excel spreadsheet, and paste the data in
  5. Do a global search & replace, replacing "NULL" with an empty string.

Voila!

I had tried before with SSMS's export to CSV feature, and it just didn't escape data the way I needed it to.

April 7, 2008    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Monday, September 24, 2007

Been a while since mylast post. Anyhow, recently we needed to change the minimum word length for MySQL Full Text Searches and then rebuild all our full text indexes, which requires doing a quick repair on any relevant tables.

I knew it was a lot of tables, so I came up with a quick query to list all the tables with full text indexes:

select DISTINCT TABLE_SCHEMA, TABLE_NAME
from information_schema.STATISTICS
where INDEX_TYPE = 'FULLTEXT';

You could also change the query to list out the relevant REPAIR commands.

select DISTINCT
CONCAT('repair table ',
TABLE_SCHEMA, '.', TABLE_NAME,
' quick;')
from information_schema.STATISTICS
where INDEX_TYPE = 'FULLTEXT';

September 24, 2007    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Saturday, May 05, 2007

I found a good script for backing up MySQL databases that's simple to install and configure -- automysqlbackup (alternate sourceforge link). I suggest reading Marius Ducea's post about automysqlbackup for some tips on how to use it, but you basically download it, edit a few settings at the top (e.g. database connection info), make the script executable, and then either run it manually or add it to a cron job for regular scheduling. It uses the well-known mysqldump utility to make organized directories of backups, and it rolls backups so you don't end up with a bajillion dump files after running it for a month.

MySQLhotcopy is another Perl script (that isn't based on mysqldump) that can also back up your MySQL databases. It actually backs up the database files themselves instead of creating a big text dump, and some people say it's faster and better than mysqldump-based solutions, especially for very large and/or active databases. I haven't used MySQLhotcopy, but it may be worth checking out if for some reason automysqlbackup or mysqldump won't work for you.

Lastly, 33% of our readers submitted this great tip for copying a MySQL database from one server to another (thx Scott):

mysqldump --opt --compress --user=USERHERE --password=PWHERE
 --host=SOURCE.HOST.HERE SOURCE_DB_NAME | mysql --user=USERHERE
 --password=PWHERE --host=TARGET.HOST.HERE -D TARGET_DB_NAME -C
 TARGET_DB_NAME

Note that it should be entered all on one line.

May 5, 2007    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Tuesday, April 10, 2007

I may have mentioned before that I'm doing a LAMP project these days (among other things). As I go along I'm taking notes on various tasks, etc. so I figured I could post some of them to my blog. Currently the site is hosted on a virtual dedicated server running Fedora Core 4. Anyhow...

Logging MySQL Queries

You may need to turn on query logging for mysql from time to time in order to see what SQL is being passed to the database. Here's how to do so in Linux. This makes some assumptions about directories, so you may need to execute the commands in different places depending on your setup. FYI, I did the following on a Fedora Core 4 distro. Also, if you're running a busy server, your query log will probably get really huge, so be forewarned! I wouldn't leave query logging running all the time, or at least not on a production machine.

Anyhow, shell into your server, log in with an administratively enabled account, and follow the below steps.


Create the MySQL logging file

If you haven't created the MySQL logging file before, you'll need to do so.

First create an empty file with

touch /var/log/mysqlq.log

That makes an empty file. Now change the ownership to the mysql account to it via

chown mysql /var/log/mysqlq.log

Now the mysql user account can write to that file.


Enable MySQL logging

Now you need to stop mysql

/usr/bin/mysqladmin -u root -p shutdown

you'll be prompted for the root password, enter it and mysql will shut down. You may need to hit ENTER after you get the shutdown message in order to get your prompt back.

Now start mysql with logging enabled.

/usr/bin/mysqld_safe --log="/var/log/mysqlq.log" &

The trailing ampersand is important, otherwise you won't get your shell prompt back.

You should see a message about logging and mysql starting. Hit ENTER to get your shell prompt.


Using the Log

You can now run some web sites or do some things that query the MySQL database and the results should be logged.

You can look at the log using vim or other editors. You can look at the last 100 lines of the log via

tail -100 /var/log/mysqlq.log

Note that your log will get pretty big, so to turn off logging, stop MySQL using the shutdown command above, and then start it back up with

/usr/bin/mysqld_safe &


Clearing the Log

You may want to clear your log from time to time if it gets too big. To do first, first make sure that MySQL is not currently logging to the file. You can see if any process is accessing the log by running this command & seeing if anything is returned. If no lines return then nothing is using the log.

fuser /var/log/mysqlq.log

Before clearing the log you could make a backup of it using

cp /var/log/mysqlq.log "$(date +%Y%m%d)-mysqlq.log"

That will create a copy of the log file with today's date in the filename, e.g. 20070408-mysqlq.log

You can then clear the log file using

> /var/log/mysqlq.log

Yes, you type the > in that command. :)


Troubleshooting

If you don't see commands being written to your log, look at the /var/log/mysqld.log log file to see what the problem is.

April 10, 2007    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Thursday, January 04, 2007

Normally one would think that adding a new, unused column to an existing database table wouldn't break anything, right?

Well, normally it wouldn't, unless the column name is the same as on another table, and if some of your queries don't use the table.column syntax when referring to columns.

For example, assume you have two tables with a many-to-one relationship:

Employee
EmployeeID
EmployeeName
DepartmentID
Active

Department
DepartmentID
DepartmentName

Notice that the Employee table has an "Active" column, but the Department table does not.

Down the road, you decide to add an "Active" column to the Department table, too. You figure since it's a brand-new column, it shouldn't break anything. However, if your application uses queries like this:

SELECT EmployeeName, DepartmentName, Active
FROM Employee e INNER JOIN Department d ON e.DepartmentID = d.DepartmentID
WHERE Active=1

they'll break with an "ambiguous column name 'Active'" error as soon as you add an "Active" column to the Department table, because now the "Active" column in the above query is ambiguous: does it refer to Employee.Active or Department.Active?

The solution to all this is establishing some good naming & query-writing habits:

  1. If you're going to add a new column, do a search to see if other tables have columns with the same name (e.g. "select * from syscolumns where name = '<columnname>'" on SQL Server). If they do, double check procs & queries.
  2. Try to use specific column names, e.g. EmployeeName instead of Name. That'll reduce the change of conflict, as well as make the field names, proc parameters, etc more self-documenting.
  3. Make a habit of always using the table.column name convention in queries. So the above query would instead be written as

    SELECT e.EmployeeName, d.DepartmentName, e.Active
    FROM Employee e INNER JOIN Department d ON e.DepartmentID = d.DepartmentID
    WHERE e.Active=1

    which would then protect it if the tables get new, ambiguously-named columns in the future.
January 4, 2007    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Wednesday, December 13, 2006

A service pack for MySQL Enterprise 5.0 was released, bringing the version number up to 5.0.30 and including changes like:

  • Bug Fix: InnoDB showed substandard performance with multiple queries running concurrently.
  • Bug Fix: InnoDB exhibited thread thrashing with more than 50 concurrent connections under an update-intensive workload
  • Bug Fix: Some queries that used MAX() and GROUP BY could incorrectly return an empty result

Plus a new beta version of MySQL 5.1 is out, with various fixes and enhancements.

December 13, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Tuesday, October 24, 2006

Data Dictionary Creator was recently released by Jon Galloway. It's a UI allowing you to document your SQL Server databases by storing the documentation in extended properties, and then exporting it as Excel, XML, etc. The nice thing is it stores the documentation in the same place that SQL Server normally stores its descriptions.

So instead of using the SQL Server tools to update your descriptions like this (ugh):

sqlserverfielddescription.gif

You can now use a cool lil' UI to do it, like this!

Screenshot

October 24, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Saturday, October 21, 2006

One of the tricky things with unit testing is getting your test system back to a known state after a test run. If your code involves a lot of database changes, then you often have to go to elaborate lengths like keeping track of all objects & deleting them afterwards, or restoring a database from a recent backup. Ugh.

I just read about EntryZero's dataFresh and it looks interesting. From their site:

Entropy's dataFresh is a toolkit that assists test driven development projects in restoring their database to a known state before each test within a test fixture.  The time consuming effort of having to write tear down methods to clean up the database after running your tests are a thing of the past.

Our appoach is unlike others as we do not attempt to rip and replace the entire database.  Instead we track database modifications to the table level and only work with those tables that have been modified.

So, it might be worth checking it. Currently it only works with SQL Server 2000 & 2005.

October 21, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Thursday, September 21, 2006
...will run slower, because SQL Server will first check the master database for them. Weird but true, according to SQLMag. So, don't start your procs with sp_.
September 21, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [1]



 Wednesday, August 30, 2006

I recently needed to strip out non-alphanumeric characters in SQL Server. I initially thought I might be able to use a managed stored procedure and C# regular expressions to do so, but I thought the performance would be bad (e.g. you'd have to cursor through a table, extract a field value, use RegEx on it, go to the next row, etc.). So I came up with the below function using T-SQL's quasi-regular expressions in PATINDEX:

/*******************************************************************
dbo.fnStripNonAlphaNumeric

Removes all non-alphanumeric characters (including spaces) from
@input, e.g.

select dbo.fnStripNonAlphaNumeric('Help, I "think" I''m falling!')

returns

HelpIthinkImfalling

*******************************************************************/

CREATE FUNCTION dbo.fnStripNonAlphaNumeric
(
    @input varchar(500)
)
RETURNS varchar(500)
AS
BEGIN
    
    DECLARE @i int
    DECLARE @result varchar(500)
    SET @result = @input
    SET @i = patindex('%[^a-zA-Z0-9]%', @result)
    WHILE @i > 0
    BEGIN
        SET @result = STUFF(@result, @i, 1, '')
        SET @i = patindex('%[^a-zA-Z0-9]%', @result)
    END

    RETURN @result

END

Then in use it's something like

SELECT dbo.fnStripNonAlphaNumeric(FieldWithAlphaNumerics) as AlphaCleanValue
FROM MyTable

FWIW, to strip non-alphanumeric in C# you can use the one-liner (assuming you have a initial string called "input")

System.Text.RegularExpressions.Regex.Replace(input, @"[\s\W]*", "")

:)

 

 

 

August 30, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Monday, August 21, 2006

I went into this in my post about SQL Server remote connections, but basically, for each SQL Server instance running on your database server, you can enable remote access, control which IPs and ports SQL Server listens on, and which IPs are allows to access which port.

Use Windows Security when possible

Connecting via a Windows account is more secure than connecting with a SQL Server login and password. In the past those values were sent in plaintext (eek), although now SQL Server 2005 offers lockouts, password complexity enforcements, and password expiration for SQL Server logins, so they're not as unsecure as they used to be. Also if you're using the SQL Native Client to connect, your login packet is encrypted, so the password isn't in plaintext anymore.

If you need to use SQL Server logins, don't use the sa account -- log in with a different account with limited permissions. And use strong passwords. If it's available over the internet, you should use an SSL certificate to encrypt connections, too (see Encrypting Connections to SQL Server in Books Online).

Allowing Remote Access

SQL Server doesn't allow remote access by default. So if you want other machines to access SQL Server, open up SQL Server Surface Area Configuration (in Programs->Microsoft SQL Server 2005->Configuration Tools) and click Surface Area Configuration for Services and Connections. On the left side, navigate to the Remote Connections node under your SQL Server instance's Database Engine node, and ensure Local and Remote connections is selected with Using TCP/IP only. Then click OK.

CropperCapture[23].gif

 

Configuring IPs

You do so via opening up SQL Server Configuration Manager, opening the Network Configuration node and clicking the Protocols section. You'll see TCP/IP on the right.

CropperCapture[14].gif

 

Double click TCP/IP. You'll see a dialog like below with some settings on the Protocol tab:

CropperCapture[15].gif

By default, Listen All is set to Yes, which means SQL Server is listening on every IP the server has. If the server has a public IP, or if you don't want it listening on certain IPs, set Listen All to No. Now switch to the IP Addresses tab.

CropperCapture[17].gif

You'll see entries for all the machine's bound IPs -- in the above example 192.168.0.10 is the internal IP, 77.89.121.42 is a public IP, and 127.0.0.1 is the local loopback IP (note those aren't my real IPs, they're just for show).  Active means the IP address is a working IP network-wise -- changing the value via the dropdown doesn't do anything. If Listen All is set to No, you can set specific ports for various IPs, or tell SQL Server to not listen on them by setting Enabled to No. Notice how I've used the port 2000 and disabled access on the public IP.

FYI, Dynamic Ports is a setting where SQL Server finds an available port at runtime. I don't recommend Dynamic Ports because a) you have to run the SQL Server Browser service (a security risk) in order to inform clients which port is being used, and b) it's hard to protect SQL Server with a firewall because you never know which port is going to be used. So you're stuck opening up a bunch of ports, which is bad.

Anyhow, if Listen All was set to Yes on the Protocol tab, then you can't apply settings for individual IPs -- you instead apply settings for all IPs in the IPAll section at the bottom.

CropperCapture[18].gif

 

Configuring the Firewall

Now you can create a firewall rule allowing certain IPs access to your machine over certain ports. If you're using Windows Firewall, you can do so by first opening Windows Firewall from the Control Panel and clicking the Exceptions tab. You'll see a list of current exceptions.

CropperCapture[19].gif

Click Add Port. Give your rule a name (e.g. "SQL Server Rule") and specify the port you set in SQL Configuration Manager.

CropperCapture[20].gif

Now we want to specify who can access our server over this port. By default, everyone can (even folks over the internet), which IMO isn't super secure. So click Change Scope. You can choose My Network to allow all computers on your network to access your server, but if you want a more narrow range (good if you have a large network but only a few machine should be able to access your machine), you can enter the specific IPs and/or masks in the Custom List section.

CropperCapture[22].gif

Now click OK, and click OK again, and your new firewall rule is applied.

Finishing Up

Now restart your SQL Server service, and you're ready to connect. You can also feel better that you've protected your SQL Server instance.

 

 

August 21, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



By default, remote connections are disabled for all versions of SQL Server 2005, including SQL Server 2005 Express. If you try to connect from a different machine, you'll get an error message like this:

An error has occurred while establishing a connection to the server when connecting to SQL server 2005, this failure may be caused by the fact that under default settings SQL server does not allow remote connection.

To resolve it, you need to enable remote connections in the SQL Server Surface Area Configuration Tool (under Start->Programs->Microsoft SQL Server 2005->Configuration Tools) and then restart the SQL Server Express service. See this full walkthrough is on MSDN on how to do so.

If you then get errors like "Error Locating Service/Instance Specified" you have two options: using the SQL Server Browser, or using TCP/IP & specifying the port when you connect.

SQL Server Browser

The SQL Server Browser service tells interested parties the SQL Server instances available on the machine. It's a mild security risk (especially if you expose it over the internet), but it lets you easily connect to SQL Server instances, especially if your database server has multiple instances running. You'll need to ensure that the SQL Server Browser Service is running on the database server, and that the database server's firewall isn't blocking the SQL Server Express service or the SQL Server Browser service. See this MSDN article on how to do so.

TCP/IP Ports

This method is more secure but a little more work. You don't have to run the SQL Server Browser service, and you get to pick (and manage traffic on) the IP and port that each SQL Server instance listens on. Worth it for a public server, IMO. If you're using TCP/IP and don't want to run SQL Server Browser for security reasons, or if you have multiple SQL Server instances on your server, you'll need to check the TCP/IP ports that your SQL Server Database Engines are listening on, and ensure each running instance has its own port.

Open up SQL Server Configuration Manager (under Start->Programs->Microsoft SQL Server 2005->Configuration Tools), expand SQL Server Network Configuration, click Protocols for SQLEXPRESS, double-click TCP/IP. Note whether "Listen All" is Yes or No. Click the IP Addresses tab. If "Listen All" was Yes, then you can set the ports in the "IPAll" section below. If "Listen All" was set to No, then ensure that there's a port specified for the listed IP addresses, or disable (set Enabled to "No") any public IPs you don't want to expose SQL Server on.

Then open up Windows Firewall on the database server (or whatever your firewall program is) and allow TCP traffic in for the port you specified. Be sure to indicate who you want to access this port, too (e.g. the internet, your local network, a specific subnet, or just the local machine).

Then you should be able to connect to the server using "<ip address>,<port>" (e.g. "192.168.0.10,8000") from SQL Server Management Studio.

 

August 21, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Saturday, August 12, 2006

SQL Server has a decent full text search engine (IMO), but if you have HTML data in your database, it can be tricky searching on it. For example, if users search on the word "strong" you don't want to bring back data like "<strong>this text is emphasized</strong>". Also, there were early problems with SQL 2000's word breaker, in that it didn't treat > or < as a word delimiter (this problem has since been resolved).

Since SQL 2005 is around I thought I'd throw out a few ways I've noticed to store & search on HTML data.

Just Store HTML data in a varchar or text column

First of all, you can go the simple route & store it in a varchar(max) or text column, and create a full text index on that column. 

The upsides

  • It's a simple approach
  • FTS Change Tracking will track the values for varchar columns. If you're using a text column, changes are tracked unless made via WRITETEXT and UPDATETEXT. That's not much of an issue with SQL 2005, though, since WRITETEXT and UPDATETEXT are now deprecated.
  • It's easy to update values in varchar or text columns
  • You'll get matches on all the words

The downside

  • You'll also get matches on words inside comments and HTML tags (e.g. "font", "arial", "body")

So this might be a place to start for an 80/20 HTML search engine approach, and you could maybe treat words like "font" "td" etc as noise words so they're ignored in searches. Not a perfect solution, though, especially if your users like to search on the word "title."

Store HTML data in an XML Column

Now that SQL 2005 has an "XML" data type column, you can store your HTML data in that instead and search on it.

Upsides:

  • Search results won't include tagnames, attribute names, or words within comments
  • Change tracking will track XML column changes

Downsides:

  • Could be hard to update values in the column, I don't know how easy it is to programmatically interact with XML column data types
  • Your HTML needs to be well-formed. "< font > hey there </ font>" will return an error. "<font> hey there </font>" won't.
  • Full Text Search won't match on tag and attribute names (good), but will match on attribute values (bad). For example, if your data is "<font face="Arial">hi there</font>", searching on "font" won't return a match, but searching on "Arial" will.

Here's a complete script to try out in your own database (SQL Server 2005 only). It creates a new database called "ftstest" and a table called "Test".

create database ftstest
go
use ftstest
go
sp_fulltext_database 'enable'
go
Create fulltext catalog FTSCatalog as default
go
CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description XML)
go

insert into Test (Title, Description) values
('some stuff goes here', '<font face="Arial">test1</font>')
go
insert into Test (Title, Description) values
('some second row', '<font> test2 </font>')
go
insert into Test (Title, Description) values
('some stuff here', '<font> test3 foobar </font>')
go
insert into Test (Title, Description) values
('some stuff here', '<font>boogie</font>')
go

CREATE FULLTEXT INDEX ON Test(Title, Description) KEY INDEX PK_Test
GO

-- these queries return data
select * from Test where FREETEXT(*,'stuff')
select * from Test where FREETEXT(*,'test1')
select * from Test where FREETEXT(*,'test2')
select * from Test where FREETEXT(*,'boogie')
select * from Test where FREETEXT(*,'Arial')

-- these don't
select * from Test where FREETEXT(*,'face')
select * from Test where FREETEXT(*,'font')

Store the HTML in an Image Column

This is the old standby. SQL Server can automatically ignore HTML markup in search results if you store your HTML data in a column of the image data type. You also need a second column whose value (e.g. 'htm') indicates the type of data.

Upsides:

  • All HTML markup is ignored for searches (except for a questionable feature where if you have spaces around your tags like this "< strong >" the tagname will be included in the search results).
  • You can actually use this feature to store & perform FTS searches on other types of documents, like PPT, PDF, DOC, etc. So it's good if you're doing a document management system & need to search on not only HTML documents but other kinds, too.

Downsides:

  • You have to deal with updating Image data types, which can be a huge PITA. I really wish SQL supported this for varchar or text columns.

Here's a sample script, in this case the DescriptionContentType column contains the value 'htm', telling SQL Server FTS that the Description column contains HTML data & that the indexer should use the HTML iFilter:

CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description image,
DescriptionContentType char(3) default 'htm'
)
go

CREATE FULLTEXT INDEX ON Test(Title, Description TYPE COLUMN DescriptionContenttype)
KEY INDEX PK_Test
go

INSERT INTO Test (Title, Description) VALUES ('hi','<strong>test</strong>')
go

SELECT * FROM Test WHERE CONTAINS(*,'hi')    -- returns results
SELECT * FROM Test WHERE CONTAINS(*,'test')    -- returns results
SELECT * FROM Test WHERE CONTAINS(*,'strong')    -- no results

Use a Separate Keywords Column

This is a complex but common approach. Basically, you store your HTML in a varchar or text column, then strip out the HML markup & store the resulting text in a separate keyword column. You then perform searches on the keyword column. 

Upsides:

  • You get to avoid working with Image columns
  • HTML markup is avoided in search results
  • Change tracking will handle the keyword column (provided it's varchar or text w/o using WRITETEXT)

Downsides

  • You need to find or write a function to strip HTML from your column (easy enough with RegEx)
  • Extra storage space is consumed since your storing a lot of the data twice
  • You have to maintain two columns for HTML data: the HTML column, and the keyword column. Thus more work, more risk of bugs, & possibly more confusion. Plus all code that interacts with that table may need to be aware of & correctly use the columns correctly.

Here's a SQL blurb illustrating the concept.

CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description varchar(max),
DescriptionKeywords varchar(max)
)
go

CREATE FULLTEXT INDEX ON Test(Title, DescriptionKeywords)
KEY INDEX PK_Test
go

INSERT INTO Test (Title, Description, DescriptionKeywords)
VALUES ('hi','<strong>test</strong>', fnStripHtmlFromText('<strong>test</strong>'))
go

SELECT * FROM Test WHERE CONTAINS(*,'hi')    -- returns results
SELECT * FROM Test WHERE CONTAINS(*,'test')    -- returns results
SELECT * FROM Test WHERE CONTAINS(*,'strong')    -- no results

Notice the fnStripHtmlFromText function -- that's the function you'd need to write to strip HTML from incoming data. For better protection, you could restrict access to the table to store procedures only, and only expose the Description column, like this:

CREATE PROCEDURE spInsertTest (
    @Title varchar(1000),
    @Description varchar(max)
)
AS
BEGIN
    INSERT INTO Test (Title, Description, DescriptionKeywords)
    VALUES (@Title,@Description, fnStripHtmlFromText(@Description))

    RETURN @@IDENTITY
END

Alternately, if you needed to use raw SQL instead of stored procedures, you could use INSERT and UPDATE triggers to maintain the DescriptionKeywords column, and your SQL could just interact with the Description column. Sorta like this:

CREATE TRIGGER dbo.tuTest
ON dbo.Test
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

-- update keyword column with keywords from html column
    UPDATE Test SET
        DescriptionKeywords = fnStripHtmlFromText(i.Description)
        FROM Test t INNER JOIN inserted i ON t.ID = i.ID
END

Use a Separate Image Column

Like the separate keyword column solution, above, except that instead of parsing out the keywords, you just store a second copy of your HTML data in an Image column along with a content type column. The upside is you don't need to write an HTML keyword parser, but the downside is your keywords are in an image column (which may be a non issue since you shouldn't interact with it directly). Here's a sample script

CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description varchar(max),
FTSDescription image,
FTSDescriptionContenttype char(3) default 'htm'
)
go

CREATE FULLTEXT INDEX ON Test(Title, FTSDescription TYPE COLUMN FTSDescriptionContenttype)
KEY INDEX PK_Test
go

Conclusion

Well that was a long post. The solution depends on what your goals are, but I'd recommend architecting your application in such a way that if you start out with the first, simplest solution, you can enhance your system later to a more sophisticated implementation without breaking everything. That means you should ideally be interacting with your system either via store procedures or objects. Then if you need to change the underlying database schema in order to handle a better search feature, you can do it in your DAL and/or procedures.

Since I'm doing this for an existing project (building a light CMS), I'm personally leaning towards a separate keyword or separate image column approach. I don't want to directly interact with Image columns at all if I can help it. :)

August 12, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Monday, August 07, 2006

So I was having a little trouble getting full text search to work with the GUI in SQL Server Express with Advanced Services (formerly SQL Server 2005 Express SP1), so I had to do things manually. It was probably a permissions or setup issue with SQL Server Expres or the tools. In addition to setting up FTS, I wanted a search query to weight columns differently in the search rankings -- something that SQL Server FTS doesn't really support.

Setting up Full Text Search

First I had to download and install SQL Server Express with Advanced Services. It's big, but comes with the goodies I wanted.

Then I connected to my SQL Server Express database using SQL Server Management Studio so I could type in some queries. If your SQL Server Express database is in your Visual Studio Project's App_Data folder, you may be out of luck -- I wasn't able to get full text search to work on those, although maybe adjusting permissions would do it.

Once connected to the database, I created a full text catalog

CREATE FULLTEXT CATALOG MyFTCatalog

Next I needed to get the name of a unique index for my table. You can only create full-text indexes on tables with a single-key unique index (e.g. an autonumber primary key index). Remember that your unique index doesn't have to be on the columns that you want to perform full text searches on.

I had a table called Listing a primary key of IdListing and three varchar fields I wanted to search on: Address, Realtor, and Notes. My table already had a unique index called PK_Listing_IdListing, so it was time to create a full-text index on the three columns I wanted to be able to search on:

CREATE FULLTEXT INDEX ON Listing (Address, Realtor, Notes)
KEY INDEX PK_Listing_IdListing
ON MyFTCatalog
WITH CHANGE_TRACKING AUTO

What the above query did is create a full-text index on those three Listing table columns and store it in the full-text catalog named MyFTCatalog. I indicated PK_Listing_IdListing as the index to help uniquely identify rows on the Listing table, and I told the Full Text Search engine to automatically update the full-text catalog if values in the table change.

Lastly I did a quick check to confirm the catalog existed and wasn't still building

SELECT FULLTEXTCATALOGPROPERTY('MyFTCatalog', 'Populatestatus')

And we're set up. Now it was time to query. And man is it hot in here. I guess overclocking your PC makes for a sweaty summer. Anyhow...moving on.

Performing Weighted Queries

There are plenty of pages about performing full-text queries in SQL Server. Here's a place to start.

So my first query looked like this

SELECT IdListing, Address, Realtor, Notes
FROM Listing
WHERE FREETEXT(*,'some keywords')

The * tells FTS to perform the search on all columns in the full-text index. But the query wasn't going to work for me, since it doesn't give more weight to one column over the other. Plus, in order to sort results by ranking, I needed to use the *TABLE full-text queries. I'm partial to FREETEXTTABLE because it already does all the stemming/etc for me.

Then I did a UNION query like this

SELECT TOP 100 Rank, Address, Realtor, Notes
FROM
(
    SELECT f.Rank, l.Address, l.Realtor, l.Notes
    FROM listing l INNER JOIN
    FREETEXTTABLE(listing, Address, 'some keywords') as f
    ON l.idListing = f.[KEY]
    UNION
    SELECT f.Rank, l.Address, l.Realtor, l.Notes
    FROM listing l INNER JOIN
    FREETEXTTABLE(listing, Realtor, 'some keywords') as f
    ON l.idListing = f.[KEY]
    UNION
    SELECT f.Rank, l.Address, l.Realtor, l.Notes
    FROM listing l INNER JOIN
    FREETEXTTABLE(listing, Notes, 'some keywords') as f
    ON l.idListing = f.[KEY]
) as myTable
ORDER BY Rank DESC

which I quickly rewrote to

SELECT TOP 100 f.Rank, l.Address, l.Realtor, l.Notes
FROM Listing l INNER JOIN
(
SELECT Rank, [KEY] from FREETEXTTABLE(listing, Address, 'some keywords')
UNION
select Rank, [KEY] from FREETEXTTABLE(listing, Realtor, 'some keywords')
UNION
select Rank, [KEY] from FREETEXTTABLE(listing, Notes, 'some keywords')
) as f
ON l.IdListing = f.[KEY]
ORDER BY f.Rank DESC

and then added some weights to the rankings, like so.

SELECT TOP 100 f.WeightedRank, l.Address, l.Realtor, l.Notes
FROM listing l INNER JOIN
(
        SELECT Rank * 5.0 as WeightedRank, [KEY] from FREETEXTTABLE(listing, Address, 'some keywords')
        UNION
        select Rank * 3.0 as WeightedRank, [KEY] from FREETEXTTABLE(listing, Realtor, 'some keywords')
        UNION
        select Rank * 1.0 as WeightedRank, [KEY] from FREETEXTTABLE(listing, Notes, 'some keywords')
) as f
ON l.idListing = f.[KEY]
ORDER BY f.WeightedRank DESC

Pretty good. You have the column weighting, and you could wrap it up in a nice little stored procedure and be good to go.

However, there was one last thing I needed. I really wanted a query that would to combine column rankings, so that if there were hits in multiple columns, the rank would be higher than a hit in a single column. So this is what I came up with.

SELECT TOP 100 f.WeightedRank, l.Address, l.Realtor, l.Notes
FROM listing l INNER JOIN
(
    SELECT [KEY], SUM(Rank) AS WeightedRank
    FROM
    (
        SELECT Rank * 5.0 as Rank, [KEY] from FREETEXTTABLE(listing, Address, 'some keywords')
        UNION
        select Rank * 3.0 as Rank, [KEY] from FREETEXTTABLE(listing, Realtor, 'some keywords')
        UNION
        select Rank * 1.0 as Rank, [KEY] from FREETEXTTABLE(listing, Notes, 'some keywords')
    ) as x
    GROUP BY [KEY]
) as f
ON l.idListing = f.[KEY]
ORDER BY f.WeightedRank DESC

Notice how I'm grouping the inner UNION query by [KEY] (in this case, Listing.IdListing) and SUMming the weighted ranks. That allows us to push results with hits in multiple columns higher up in the search rankings. Obviously it's not going to perform as well as a simpler query, but the ranking was important for this project.

Conclusion

So, there ya go. Installing SQL Server Express isn't too bad, although it's a big download. Setting up Full Text Search seemed to work best for me from the command line. And, now you have a way to rank matches with different columns having different weights.

Update: An Alternate Approach

Hilary Cotter (SQL MVP & FTS guru) provided an alternate query. I did a few tests & both seemed comparable in performance, although I didn't test using very large data sets. I made a slight change to his query and added a WHERE clause so that only matches are returned.

select TOP 100
    idListing, Address, Realtor, Notes,
    RankTotal=isnull(RankAddress,0)+isnull(RankRealtor,0)+isnull(RankNotes,0)
from listing
left join (SELECT Rank * 5.0 as RankAddress, [KEY] from
    FREETEXTTABLE(listing, Address, 'Street')) as k
    on k.[key]=Listing.idListing
left join (select Rank * 3.0 as RankRealtor, [KEY] from
    FREETEXTTABLE(listing, Realtor, 'Street')) as l
    on l.[key]=Listing.idListing
left join (select Rank * 1.0 as RankNotes, [KEY] from
    FREETEXTTABLE(listing, Notes, 'Street')) as m
    on m.[key]=Listing.idListing
WHERE RankAddress IS NOT NULL OR RankRealtor IS NOT NULL OR RankNotes IS NOT NULL
ORDER BY RankTotal DESC

Hilary also provided a script (run it in Query Analyzer or in a Query Tab in SQL Mgmt Studio) to set up a test database so you can try the query out yourself. I modified it to seed the test database with a bunch of records (since with only a few records, even LIKE is faster that FTS):

create database realtor
go
use realtor
GO
sp_fulltext_database 'enable'
GO
Create fulltext catalog realtor as default
GO
create table Listing(
    idListing int not null identity constraint ListingPK primary key,
    Address varchar(200), Realtor varchar(200), Notes varchar(200))
GO
-- add initial seed records
insert into Listing(Address, Realtor, Notes)
values('123 Any Street','John Street','the word on the street is good')
insert into Listing(Address, Realtor, Notes)
values('123 Any Road','John Street','the word of mouth is good')
insert into Listing(Address, Realtor, Notes)
values('123 Any Road','John Smith','the word on the street is good')
insert into Listing(Address, Realtor, Notes)
values('123 Any Street','John Smith','the word of mouth is good')
GO
-- multiply seed records, get up over 1M rows
-- might take a while
PRINT 'Please wait a few minutes while the database is seeded'
DECLARE @i int
SET @i = 0
WHILE (@i < 18)
BEGIN
    insert into Listing(Address, Realtor, Notes)
    select TOP 10 Address, Realtor, Notes from Listing

    SET @i = @i + 1
    PRINT convert(varchar,@i)
END
PRINT 'Database has been seeded'
GO
PRINT 'Please wait a few minutes while the fulltext index is built'
GO
create fulltext index on listing(Address, Realtor, Notes)
key index ListingPK
GO
-- check the below query. When it returns zero, the FT index is done building.
SELECT FULLTEXTCATALOGPROPERTY('realtor', 'Populatestatus')
GO


 

August 7, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Thursday, June 22, 2006

Saw this SQL Login Killer Script on SQL Server Central (yes I know it's old, but it's still good) ...

Handy when you need to kill everyone off in order to perform a restore. I even use it on my dev workstation, since IIS, ColdFusion, & a million other things like to hang on.

Although, I wonder if SQL Server 2005 has an easier way to boot everyone off. Hmm.

June 22, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Thursday, May 04, 2006

I'm currently working on a property tax consulting startup, and I was noticing that some queries were running slowly. I'm familiar with database indexing, but I wanted to see what the new Database Tuning Advisor that comes with SQL Server 2005 could come up with. It supposedly helps database newbs tune their database without having to know how to create indexes, etc. Seasoned DBAs will probably gasp in horror (with good reason) at automated tuning processes, but what the heck...let's take her for a spin.

Don't Start Without a Trace

The first thing you want to do is create a trace file containing the SQL statements you want analyzed. So fire up SQL Server Profiler, click File->New Trace, and connect to your database. In the trace properties screen you'll want to use the Tuning template, click Save to File, choose a filename, check Server processes trace data, and click Run. See the screenshot below for details.

Now you need to execute the SQL statements that are running slowly. In my case, I just run through a number of test cases in my web application, ensuring realistic database calls. But you could also run SQL statements directly from Query Analyzer or SQL Server Management Studio. You'll need to ensure that your database command timeout is set high enough so your queries finish running instead of rolling back.

Once the SQL statements have finished, stop the trace and exit SQL Server Profiler. You now have a trace file containing all the SQL statements you want tuned. It's a binary file, so don't bother reading it in Notepad, although you can open it up and view it using the Profiler.

Generate Database Tuning Recommendations

Now it's time to tune your database by starting a new Session in the Database Engine Tuning Advisor (DTA -- don't ask why the E is dropped). The DTA groups its recommendations into Sessions. Each session starts out with a Workload (a list of SQL statements to evaluate, e.g. from your trace file) and a set of databases & tables to tune. A Session can either be used to generate tuning recommendations (e.g. Indexes, Statistics, etc), or evaluate pre-specified recommendations and let you know how well they work.

So, open up the DTA -- it'll start making a new Session for you automatically. Connect to your database, select the trace file, choose the database to be tuned from the database dropdown, and put checkboxes next to the databases/tables you want to tune. I didn't bother with the Tuning Options tab, but you can take a look out of curiosity.

Now click Start Analysis at the top, and wait a bit. The Tuning Advisor will run the SQL statements from the trace file, think a bit, and then spit out some recommendations. In my case (see below) it recommended some indexes and recomputing some statistics. It also gave a guesstimate on the speed improvement.

Using the Recommendations

Now you have a set of recommendations, and you have a few choices under the Actions menu item. You can Save Recommendations, which will generate a .sql script file you can apply to your database. You can Apply Recommendations, which just means the Tuning Advisor will apply them to your database, either now or at a specific time (like 2am) in the future. You can also Evaluate Recommendations, allowing you to try them out & see how much they improve performance, without permanently affecting your database. Before Saving, Applying, or Evaluating, you can uncheck any recommendations you want to skip.

I decided to evaluate the recommendations, which creates a new Session (11:32:46 AM) displayed a screen like the one below. Note how the tabs on the top allow you to switch between various Sessions. The tab on the left (the one from 11:08:53 AM) is the first Session that generated the recommendations.

Now that we've set up a "what-if" Session, go ahead and click Start Analysis again, and a Progress screen will come up. The Tuning Advisor will apply the changes, run some comparison tests, undo the changes, and display the results. In the reports tab, you can view reports like Statement detail report (see below) to see how much faster your queries are executing. You can also see how much disk space the indexes take up, and other information.

If you're happy with the results, you can go ahead and Save or Apply the recommendations.

But let's say that you want to try some of the recommendations but not others. If so, you can go back to the original set of recommendations (click its tab, or double click it in the Session Monitor window on the left), check/uncheck some recommendations you want to try, and Evaluate Recommendations again. The ability to play with different sets of recommendations and evaluate them without permanently affecting your database is handy and powerful.

Once I had a set of recommendations whose results I liked, I applied them to my database. I then fired up my application and run through a few test cases, and it ran much faster (as I suspected).

Final Thoughts

The SQL Server Database Tuning Advisor is a nice tool. It probably won't out-tune a skilled DBA, but it does a decent job in recommending indexes. I also like the ability to try "what if" scenarios to evaluate different indexes.

I had a few beefs, however. I didn't like the default names of the recommended indexes, but there was no way to edit them in the GUI. Similarly, there's no way to use the DTA to evaluate your own recommendations -- you can only evaluate recommendations that the DTA comes up with*.

Overall, though, it was a fun tool, and easy to use.

 

* - you can actually use the DTA with custom recommendations (i.e. ones that you came up with) by editing XML files and importing them into the DTA using a command line tool. You can also edit existing DTA recommendations by exporting them to XML, editing them, and re-importing them. I think that's pretty lame that you can't use the GUI for it, but oh well.

May 4, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Monday, April 10, 2006

I was looking into ways to easily back up my local SQL Server database from the command line, and copy the resulting backup to my fileserver. That way I could kick off a backup manually or from a scheduled task. There are a few articles out there but they're more specific to SQL 2000 or earlier. Here's how to do it for SQL Server 2005.

I first made directory c:\sqlbackup and added a SQL script called backup.sql:

BACKUP DATABASE streamline TO DISK = 'c:\sqlbackup\streamline.bak' WITH INIT
GO
BACKUP LOG streamline TO DISK = 'c:\sqlbackup\streamline_log.bak' WITH INIT
GO

The above commands simply create log & database backups of the "streamline" database. I used the INIT parameter so that the backup files only create one version of the database (otherwise the backup files will grow & grow, containing every version of the database, each time you run a BACKUP command).

Then I made a simple batch file called backup.bat:

@echo off 
REM get today's date and time as one big string
for /f "tokens=2-4 delims=/ " %%i in ( 'date /t') do set theday=%%k%%i%%j
for /f "tokens=1-2 delims=: " %%i in ( 'time /t') do set thetime=%%i%%j
set now=%theday%%thetime%

REM create backup files
C:
cd "\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn"
sqlcmd -S localhost -U sa -P somepassword -i c:\sqlbackup\backup.sql -o c:\sqlbackup\log.txt

REM copy backup files to backup device
xcopy "C:\sqlbackup\*.*" \\10.1.10.50\ben_backup\sqlbackup\%now%\ /E /H /R 

The first section uses some DOS batch trickery to create a variable called %now% that contains the current day and time in the form of yyyymmddhhmm. The second section runs the backup script & writes the output to c:\sqlbackup\log.txt. The last section copies the all the backup files to my fileserver, inside a date & time specific folder. Notice how I used the %now% variable in the xcopy destination parameter.

Now my backup folder contains timestamped folders containing everything I need to restore my "streamline" database. You can use a similar technique to make file backups.

April 10, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Wednesday, February 15, 2006

I was struggling for a while with how to get a profile of the queries being sent to MySQL. All my searches suggested running the debug version of MySQL, but it would either log WAY WAY too much information, or else not log the queries at all. I wanted something akin to the SQL Server Profiler, so I could just see the SQL and proc calls. Is that so much to ask? 

I eventually found the solution in the NHibernate forums (of all places), and it doesn't require running the debug version of MySQL at all. Inside your MySQL option file (e.g. "C:\Program Files\MySQL\MySQL Server 5.0\my.ini" or some other location), add the following line inside the [mysqld] section:

log="C:/Program Files/MySQL/MySQL Server 5.0/mysql.log"

Restart the MySQL service, and you'll get a nice log file continuously updated with all queries being sent to MySQL. You can view it with TextPad or some other text editor that can auto-refresh files when they change.

 

February 15, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Sunday, February 05, 2006

I was using Kayak today and noticed they had a link about their technology. One thing I noticed was that they use Ruby, which I've been reading about more lately, especially the uber-sexy Ruby on Rails. But Kayak also mentioned that they use memcached to cache common database results. It's basically a super-fast distributed dictionary (key/value pairs) object that you can spread across multiple servers without replicating data all over the place.

Since I'm a Microsoft developer, I normally cache web server stuff using the good ol' Cache object, or maybe the Caching Enterprise Library. Which works fine. But memcached makes a good point, in that if you have, say, 50 web servers, running multiple processes, then you're going to be potentially storing the same data multiple times on every machine. memcached stores your data once, and scales across as many machines as you run it on. And since memcached is built as a caching system, it could potentially be much faster than caching things in a database.

But don't take my word for it, especially since I'm just paraphrasing their "about" page. Read more about how it works on the memcached home page ... if it sounds intriguing to you, you should check it out. What's the worst that could happen? :)

February 5, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Sunday, January 22, 2006

I was restoring a database backup on my new project when I noticed that some of the stored procedures where owned by a specific user instead of 'dbo.' That's a nono for several reasons -- not all the procs were owned by the same user, plus it can get messy if you have two copies of the same proc, one owned by dbo & one owned by the user, plus if stored procs are owned by a specific user then you can't remove the user from that database, plus other users won't be able to access those procs by default. Plus other stuff. :)

Steve Shofield has a quick blog post about changing ownership using the INFORMATION_SCHEMA views, but I didn't see an equivalent script for stored procedures that works in SQL Server 2000. So here's my versions:

-- convert tables to dbo
select 'sp_changeobjectowner ''[' + table_schema + '].[' + table_name + ']'', ''dbo'''+char(13)+char(10)+'go' from information_schema.tables where table_schema <> 'dbo'
-- convert stored procedures to dbo
select 'sp_changeobjectowner ''[' + routine_schema + '].[' + routine_name + ']'', ''dbo'''+char(13)+char(10)+'go' from information_schema.routines where routine_schema <> 'dbo'


Log into query analyzer, hit CTRL+T to get results in text mode, run the above, and you'll see the in results window some SQL code that you can copy & paste into the query analyzer & run again. You may get a warning saying "Caution: Changing any part of an object name could break scripts and stored procedures", which basically means that if your stored procedures or queries are written to refer to objects by the username.objectname syntax, they might break. But that probably isn't an issue for you if you're running this script.

January 22, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Wednesday, January 18, 2006

In one of my projects, the client is concerned about the size of the database. He wants to save money on hosting, so I was going to poke around & see where I could cull data (if any).

After a quick surf, Bill Graziano's script from SQLTeam.com did what I needed. Make sure to run DBCC UPDATEUSAGE('databasename') to get the most accurate results.

/**************************************************************************************
*
* BigTables.sql
* Bill Graziano (SQLTeam.com)
* graz@sqlteam.com
* v1.11
*
**************************************************************************************/


declare @id int
declare @type character(2)
declare @pages int
declare @dbname sysname
declare @dbsize dec(15,0)
declare @bytesperpage dec(15,0)
declare @pagesperMB dec(15,0)


create table #spt_space
(
objid int null,
rows int null,
reserved dec(15) null,
data dec(15) null,
indexp dec(15) null,
unused dec(15) null
)

set nocount on

-- Create a cursor to loop through the user tables
declare c_tables cursor for
select id
from sysobjects
where xtype = 'U'

open c_tables

fetch next from c_tables
into @id

while @@fetch_status = 0
begin

/* Code from sp_spaceused */
insert into #spt_space (objid, reserved)
select objid = @id, sum(reserved)
from sysindexes
where indid in (0, 1, 255)
and id = @id

select @pages = sum(dpages)
from sysindexes
where indid < 2
and id = @id
select @pages = @pages + isnull(sum(used), 0)
from sysindexes
where indid = 255
and id = @id
update #spt_space
set data = @pages
where objid = @id


/* index: sum(used) where indid in (0, 1, 255) - data */
update #spt_space
set indexp = (select sum(used)
from sysindexes
where indid in (0, 1, 255)
and id = @id)
- data
where objid = @id

/* unused: sum(reserved) - sum(used) where indid in (0, 1, 255) */
update #spt_space
set unused = reserved
- (select sum(used)
from sysindexes
where indid in (0, 1, 255)
and id = @id)
where objid = @id

update #spt_space
set rows = i.rows
from sysindexes i
where i.indid < 2
and i.id = @id
and objid = @id

fetch next from c_tables
into @id
end


select top 25
Table_Name = (select left(name,25) from sysobjects where id = objid),
rows = convert(char(11), rows),
reserved_KB = ltrim(str(reserved * d.low / 1024.,15,0) + ' ' + 'KB'),
data_KB = ltrim(str(data * d.low / 1024.,15,0) + ' ' + 'KB'),
index_size_KB = ltrim(str(indexp * d.low / 1024.,15,0) + ' ' + 'KB'),
unused_KB = ltrim(str(unused * d.low / 1024.,15,0) + ' ' + 'KB'),
idx_data_ratio = ltrim(str(indexp*100 /data) + '%'),
unused_pct = ltrim(str(unused * 100 /reserved) + '%')

from #spt_space, master.dbo.spt_values d
where d.number = 1
and d.type = 'E'
order by reserved desc

drop table #spt_space
close c_tables
deallocate c_tables

January 18, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Tuesday, January 17, 2006

One of my current projects involves creating a data warehouse from multiple datafeeds. For cost reasons we're going with MySQL Community Edition, but we're sticking with ASP.NET, C#, and Windows on the code end.

I'll admit that Windows, C#, and MySQL are an unusual combination, but I love C#, and MySQL recently came out with version 5.0, which features stored procedures, transactions, triggers, foreign keys -- lots of good stuff. How could I resist? If you've never installed MySQL before, don't worry...the installer is pretty easy to follow, and the install guide helps you through. Just make sure you also download the tools and the ODBC connector driver.

Since I'm going with an ELT approach (bulk load into an unconstrained temporary table, perform cleanup & transformations, then import into the "real" tables), I needed a way to review the import process and check the errors. A lot of the examples on the web have you doing everything from the command line, calling the MySQL command line tools from PHP, piping stuff out to text files, etc. I wanted to instead wrap the import process into a nice little C# app. So let's jump into the code!

First we need to connect to the database. I'm using the ODBC driver instead of the new ADO.NET provider, because the ODBC overhead isnt' an issue with so few queries, and the ADO.NET provider is so new that I don't quite trust it yet. :)


string conString = "DRIVER={MySQL ODBC 3.51 Driver}; SERVER=localhost; DATABASE=test; UID=theuser; PASSWORD=thepassword; OPTION=3";
OdbcConnection mysqlConnection = new OdbcConnection(conString);
mysqlConnection.Open();

Next we're going to bump the max error count to something big. The default of 64 just isn't enough when you're importing a lot of data. You want to see all the problems encountered during the import.


OdbcCommand updateErrorCount = new OdbcCommand("set max_error_count = 10000;", mysqlConnection);
updateErrorCount.ExecuteNonQuery();

Now let's import data from a flat file into a staging table using the MySQL LOAD DATA bulk import command. This staging table should have a schema that matches the flat file, but shouldn't have any constraints other than a primary key. If there are any constraints (foreign key, etc), and even one record violates them, then the whole import will roll back. And if you have 1 bad row out of 100,000, you would probably prefer to get the 99,999 good records instead of zero. If you designate a primary key, then you can write a LOAD DATA statement that will only import records whose primary key isn't already in the table. This allows you to import multiple files (or even the same file) repeatedly without worrying about duplicate keys.


string importData =
"LOAD DATA LOCAL INFILE 'F:/imports/mls_import_file.txt' "+
"IGNORE "+
"INTO TABLE mls_staging "+
"FIELDS TERMINATED BY '|' "+
"LINES STARTING BY '' "+
"TERMINATED BY '\r\n' "+
"IGNORE 1 LINES; ";

Let me quickly go over the different lines:

  • LOAD DATA LOCAL INFILE 'F:/imports/mls_import_file.txt' causes a bulk load from the F:/imports/mls_import_file.txt file. The LOCAL keyword has the client program read the file and send it to the server. If you omit the LOCAL keyword, then MySQL will try to read the file directly, which is faster, but may have permissions issues.
  • IGNORE tells MySQL to not import any records whose primary key already exists in the destination table
  • INTO TABLE mls_staging is the destination table. It needs to have the same columns as the flat file, otherwise you need to specify which columns to import.
  • FIELDS TERMINATED BY '|' says that the fields are pipe-delimited
  • LINES STARTING BY '' means the records don't start with any special character. Those are two single-quotes, btw. 
  • TERMINATED BY '\r\n' means each record is terminated by a CRLF
  • IGNORE 1 LINES; tells MySQL to skip the first line (e.g. if it contains column header information). And the semicolon indicates the end of a statement, just like in C#.

Now we're actually execute the import statement:


OdbcCommand importCommand = new OdbcCommand(importData, mysqlConnection);
int importRecords = importCommand.ExecuteNonQuery();
Debug.WriteLine(importRecords + " records imported.");

When the LOAD DATA command runs, it'll pull all those rows and load as many into the table as possible, skipping any duplicates. Since there are no foreign key constraints, the main errors we'll see are if the file can't be read or the server is having problems. However, we might see a number of warnings, e.g. missing columns, too-large field values that get truncated, etc. We want to know about all that stuff, so we're first going to get the number of warnings and errors.


OdbcCommand warningCommand = new OdbcCommand("select @@warning_count;", mysqlConnection);
int warningCount = Convert.ToInt32(warningCommand.ExecuteScalar());
OdbcCommand errorCommand = new OdbcCommand("select @@error_count;", mysqlConnection);
int errorCount = Convert.ToInt32(errorCommand.ExecuteScalar());
Debug.WriteLine(String.Format("{0} warnings, {1} errors", warningCount, errorCount));

Now we can inspect the number of warnings and errors, and if there are any, we can log them, email someone, etc. The below code just loops through the warnings and errors and displays them in the output window.


if (warningCount > 0)
{
// show any warnings
OdbcCommand warningListCommand = new OdbcCommand("show warnings;", mysqlConnection);
OdbcDataReader warningList = warningListCommand.ExecuteReader();
while (warningList.Read())
{
Debug.WriteLine(warningList.GetString(0) + " " + warningList.GetString(1) + " " + warningList.GetString(2));
}
warningList.Close();
}

if (errorCount > 0)
{
// show any errors
OdbcCommand errorListCommand = new OdbcCommand("show errors;", mysqlConnection);
OdbcDataReader errorList = errorListCommand.ExecuteReader();
while (errorList.Read())
{
Debug.WriteLine(errorList.GetString(0) + " " + errorList.GetString(1) + " " + errorList.GetString(2));
}
errorList.Close();
}

And lastly we close up shop by calling

mysqlConnection.Close();

Hopefully the above will help you out when importing data in the unholy(?) alliance of MySQL and C#. MySQL's new version bring along a ton of goodies, so don't be surprised if you see MySQL databases popping up more frequently.

 

 

January 17, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Thursday, December 08, 2005

The released version of SQL Server 2005 is missing the Import/Export Wizard that was available during beta. Now Microsoft wants you to practice using the Business Intelligence Development Studio to create an Integration Services Project. There's even an Integration Services ETL tutorial on MSDN about it.

However, you can still get to the Import/Export (aka DTS) Wizard by typing dtswizard at the command prompt. :)

December 8, 2005    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]