Wednesday, January 28, 2009

SELECT FROM Excel Spreadsheets

Q: How do I run SQL SELECT queries against data stored in Excel spreadsheets?

(This is the opposite direction from importing data into a spreadsheet from a database; e.g., via Excel - Data - From Other Sources - From Microsoft Query, etc.)
A: You can use SQL Anywhere's proxy table feature to treat an Excel spreadsheet like a relational table, thus bringing the power of SQL SELECT to bear on spreadsheet data.

First, a simple example, then some tips and suggestions. Here is an Excel spreadsheet containing website hit counts by browser (MSIE 7, Firefox 3.03, etc):



Here's the code to answer the question, "Who's winning the browser wars, IE or Firefox?". Hit counts are summarized by browser brand name, getting rid of the version number differences:
CREATE SERVER EXCEL 
CLASS 'ODBC' USING 'Excel 2007';

CREATE EXISTING TABLE proxy_browsers
AT 'EXCEL;C:/temp/browsers.xlsx;;Sheet1$';

SELECT proxy_browsers.browser AS brand_name,
SUM ( proxy_browsers.hits ) AS hit_count,
hit_count / total_hits * 100.0 AS percent
FROM proxy_browsers
CROSS JOIN ( SELECT SUM ( hits ) AS total_hits
FROM proxy_browsers )
AS summary
WHERE browser IS NOT NULL
GROUP BY proxy_browsers.browser,
summary.total_hits
ORDER BY hit_count DESC;
Here is what the output looks like; Firefox is winning for this particular website (this blog, actually):



You can read about setting up proxy tables for Excel here:Here is what the table definition looks like in the SQL Anywhere catalog:
CREATE EXISTING TABLE DBA.proxy_browsers -- proxy
-- (
-- hits DOUBLE NULL,
-- pct DOUBLE NULL,
-- browser NVARCHAR ( 255 ) NULL,
-- version NVARCHAR ( 255 ) NULL
-- )
AT 'EXCEL;C:/temp/browsers.xlsx;;Sheet1$';
Here's what the data looks like:
SELECT * 
FROM proxy_browsers
ORDER BY hits DESC;


Here are some tips and suggestions:
  • The official name of the "proxy table" feature is "Remote Data Access"... it helps to know that when you're searching the docs.

  • The syntax for the CREATE EXISTING TABLE ... AT clause goes like this:
    '<CREATE SERVER name>;<filespec>;;<sheet name>$'

  • You may or may not need to code the "magic dollar sign" after the sheet name in the AT clause. If you do need it, and you don't code it, you will see this bogus message bubbling up from the Excel driver:
    The remote table 'EXCEL.C:/temp/browsers.xlsx..Sheet1' could not be found
    SQLCODE=-666, ODBC 3 State="HY000"
    Aside: Don't you try to code the AT clause using periods '.' like in that error message, you need to use semicolons ';' to avoid ambiguity with the filespec.

  • You only have to set up one ODBC DSN for everything, not one DSN for each sheet or workbook. You do have to use "Select Workbook" to point the DSN at a specific file, but after that the file and sheet names are specified in your SQL code. The file named in the "Select Workbook" doesn't even have to remain in existence.

  • Here's how the ODBC Administrator was used to create the "Excel 2007" DSN for this article:






  • The ODBC DSN must be set up on the same computer that's running the SQL Anywhere engine. If it's a network server on some other computer, that's where the DSN is located, and (unless you're using a mapped drive or UNC file specifications) that's where the Excel file is located as well.

    Plus... if the server is started as a service, the DSN must be a "System DSN", not a "User DSN".

  • In related news, the ISQL INPUT and OUTPUT statements no longer support the Excel format in SQL Anywhere 11. Plus, ISQL INPUT cannot read Excel 2007 *.xlsx files in SQL Anywhere 10: "The file may be in an unsupported format, or it could be corrupt."

    These changes are no big deal, since proxy tables work so well, except for the fact that any behavior change is a pain in someone's patootie, and the INPUT and OUTPUT commands specify files that are relative to the client whereas (see above) proxy table DSNs are relative to the server.

    And before anyone says "LOAD and UNLOAD now have CLIENT FILE clauses!", check out how well LOAD and UNLOAD work with proxy tables:
    Feature 'unload from non-SQL Anywhere tables' not implemented
    SQLCODE=-134, ODBC 3 State="HY000"
    Workarounds are welcome, this blog invites comments!

  • You might get away with having the workbook open in Excel at the same time you're running proxy table queries... or not.

  • The names in row 1 of the spreadsheet are used to generate (and match) column names in the proxy table. If you see an error like "Too few parameters. Expected 8." it may be because you have included spaces in the row 1 column names.

    If you really want spaces, you'll have to code an explicit column name list in the CREATE EXISTING TABLE, with "quoted column names" and data types... DOUBLE and NVARCHAR ( 255 ) seem to be good choices.

  • The Microsoft ODBC drivers for Excel 2003 and 2007 seem to impose limits on string lengths: 64 bytes if you're not lucky, 255 bytes if you are. "NVARCHAR" is a good luck talisman here.

  • Only the final cell data values are received via the proxy table, not formulas.

  • Excel data is not intended for simultaneous access by multiple connections, so don't go trying serve up your spreadsheets in an OLTP system.

  • Tip: A quick and dirty way to copy and paste data from the web or some other source into Excel without carrying over fancy formatting is to copy and paste into Notepad first, then copy and paste from Notepad to Excel; e.g., by Ctrl-A and Ctrl-C from Notepad and then Ctrl-V into Excel. This works well if the data pasted into Notepad turns out to be tab-delimited, which it often is when coming from a website.

  • Tip: Here's how to turn a SQL Anywhere result set into a tab-delimited text file, suitable copy and paste from Notepad into Excel:
    UNLOAD 
    SELECT *
    FROM ttt
    ORDER BY ccc
    TO 'c:/temp/ttt.txt'
    DELIMITED BY '\X09' HEXADECIMAL OFF ESCAPES OFF QUOTES OFF;

  • Proxy tables work with more than Excel; you can join Oracle tables with SQL Server, for example. SQL Anywhere's own Migrate Database Wizard uses proxy tables to set up the schema and suck in the data when you're converting from some other DBMS.

  • High volume proxy table queries sometimes run slowly. For example, complex cross-server queries don't get optimized the same way local queries do. Cross-server optimization may improve in the future but it's likely that local optimization will always be better. That's another reason not to serve up proxy tables in an OLTP environment.

  • Having said that, folks who like proxy tables like them a LOT. It's been one of the most popular features in SQL Anywhere since 1998.

10 comments:

Anonymous said...

Hey, that's pretty cool. I've occasionally used Excel to pull data from SQL Anywhere, but have never gone in the other direction. Particularly amused by the proximity to your recent "Everything looks like a database" post, because for me, Excel is my primary go-to tool for most data processing, with SA a close second. Getting them to play nicely together provides some nice synergy.

Breck Carter said...

There is a BUG in this code! A serious logic error! Gasp!

First person who posts a fix, here, gets 100 points. I am planning to write a whole article about it ( yeah, it's that embarrassing :)... the maximum score possible will be reduced as more clues are given.

Anonymous said...

You mean the integer division in "hit_count / total_hits"?

Simplest fix is "100.0 * hit_count / total_hits".

Breck Carter said...

Good thinking, unexpected integer arithmetic is a characteristic error; i.e., a human error that is so commonly made when some computerized device or program is used that it can be called a characteristic of the device itself.

One must always be on guard... or at least, suspicious when things go wrong. I think the NTSB spends half its time thinking about such things, with respect to cockpit user interfaces.

However, in this case proxy_browsers.hits is DOUBLE (scroll down to see the table definition, I usually put those BEFORE the queries, sorry).

Here's a clue: It's not a problem with floating point arithmetic, either. That's a whole other class of characteristic errors, using floating point when fixed point should be used, but it doesn't apply here either.

Another clue: It's a really really simple mistake, with a really really subtle effect. That's 4 "really's", out of a possible maximum of 5, on the sliding SMSE scale (simple mistake subtle effect).

And thanks for reminding me about the contest, I'd forgotten which one of my recent mistakes it applied to (REGEXP_SUBSTR kinda crowded it out :)

Anonymous said...
This comment has been removed by a blog administrator.
Breck Carter said...

Jonathan, I've TEMPORARILY deleted your entry so other folks can play too! I will put it back later.

Breck Carter said...

Not only did Jonathan provide "the fix", for 80 points, he provided an alternative fix that some may find better... for an extra 10 points.

The contest is still open, see Find The Mistake Contest.

Anonymous said...

Thats not a FIX...
Thats the REAL approach for
Multi-Platform
Database-Engineering
Thru Federated-Server...

Dig Deeper and you all see why
everybody going for
FEDERATED QUERIES
and
HETEROGENOUS QUERIES

Check out those KEYWORDS
SYBASE is ...FINALLY....moving
slowly forward ...not there yet...

Anonymous said...

- Remember ...?
- Symantec DBAnywhere
- Borland InterBase ?
- Ms-JetEngine 4.0 ?
- Oracle Linked Server ?
- MS-SQLServer Heterogenous Queries ?

Same concept different syntax in SQL
Same behavior for Remote Databases
Same behavior for Local Databases ( Excel )

Anonymous said...

is it possible to use something like
------------------------
where fields(0)='test'
------------------------
in your sql query? I know in which column the data resides, but I cannot use the header row with field names because the field names have non valid sql characters like points, slashes spaces etc...