Friday, July 6, 2012

JOIN to a text file without a proxy table

Question: I want to check a text file against a table... can I do that without messing around with CREATE SERVER statements and proxy tables and ODBC drivers?

Answer: Well, sure, you could write a loop in Java and use JDBC instead... oh, you don't want to do that either?

I guess unearthing an ORM solution is out of the question, too... you just want to code a SELECT and be done with it, right?

Acceptable Answer: Use OpenString() in the FROM clause.

But first, here's a table and a text file.

The table is used to generate the SQL Anywhere Technical Documents web page, and the text file is sent to Google for the corresponding Custom Search Engine:

CREATE TABLE DBA.article ( -- 526 rows, 284k total = 248k table + 28k ext + 8k index, 553 bytes per row
   artificial_id    /* PK        */ BIGINT NOT NULL DEFAULT autoincrement,
   article_number                   DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0,
   eyecatcher                       LONG VARCHAR NOT NULL DEFAULT '',
   display_date                     LONG VARCHAR NOT NULL DEFAULT '',
   order_by_date                    DATE NOT NULL DEFAULT '1901-01-01',
   versions                         LONG VARCHAR NOT NULL DEFAULT '',
   url                              LONG VARCHAR NOT NULL DEFAULT '',
   title                            LONG VARCHAR NOT NULL DEFAULT '',
   description                      LONG VARCHAR NOT NULL DEFAULT '',
   added_on_date                    DATE NOT NULL DEFAULT '1901-01-01',
   authors                          LONG VARCHAR INLINE 256 PREFIX 8 NOT NULL DEFAULT '',
   updated_at                       TIMESTAMP NOT NULL DEFAULT timestamp,
   CONSTRAINT ASA81 PRIMARY KEY ( -- 8k
      artificial_id )
 );

-- Text file...

http://www.sybase.com/detail?id=1098527
http://www.sybase.com/detail?id=1067224
http://www.sybase.com/detail?id=1098128
http://www.sybase.com/files/White_Papers/wp-ias-MobiLink12Performance.pdf
http://www.sybase.com/detail?id=1096289
http://www.sybase.com/detail?id=1095051
http://www.sybase.com/detail?id=1096331
http://www.sybase.com/detail?id=1095775
http://www.sybase.com/detail?id=1095123
http://www.sybase.com/detail?id=1094921
http://www.sybase.com/detail?id=1094907
http://download.sybase.com/presentation/TW2011/DATSQL16.pdf
http://download.sybase.com/presentation/TW2011/DATSQL12_SQLAnywhereOData.pdf
http://download.sybase.com/presentation/TW2011/DATSQL11.pdf
...
The question now becomes "How do I join article.url with the data in the text file?"

Voila! A single SELECT!

That's all there is to it...
SELECT article.artificial_id,
       article.url, 
       text_file.url 
  FROM article INNER JOIN OpenString ( FILE 'temp_include_sites_in_bulk.txt' )
                                WITH ( url LONG VARCHAR )
                          AS text_file
       ON text_file.url = article.url
 ORDER BY article.artificial_id;

1    http://www.sybase.com/detail?id=1095775  http://www.sybase.com/detail?id=1095775
2    http://www.sybase.com/detail?id=1095123  http://www.sybase.com/detail?id=1095123
3    http://www.sybase.com/detail?id=1094491  http://www.sybase.com/detail?id=1094491
...
550  http://download.sybase.com/presentat...  http://download.sybase.com/presentat...
551  http://download.sybase.com/presentat...  http://download.sybase.com/presentat...
552  http://download.sybase.com/presentat...  http://download.sybase.com/presentat... 
The OpenString() clause on lines 4 to 6 works like this: the FILE clause points to the text file, the WITH clause specifies the layout of the text file (one single string column), and the AS clause gives the OpenString() a table name to be used in the rest of the SELECT.

After that, it's just an ordinary SELECT: the ON clause on line 7 refers to the url column in the two tables (one real table, article, and one not-so-real table, text_file).

The SELECT list, the ORDER BY clause, all business as usual, and the result set (lines 10 to 16) looks like it should.



The REAL question was this: How do I make sure that all the urls in the table also exist in the text file, and vice versa?

Stuff happens when you have the same data stored in two different places: Maybe the table's more up to date than the text file, or maybe the the most recent changes got rolled back, or maybe it got restored from a backup.

Or maybe we're looking at the wrong copy of the text file... a girl can't be too careful these days.

The solution: Change the INNER JOIN to a FULL OUTER JOIN.

Not a LEFT OUTER JOIN, that only tells you about data that's in the table but missing from the text file... a FULL OUTER JOIN tells you about the vice versa too:
SELECT article.artificial_id,
       article.url, 
       text_file.url 
  FROM article FULL OUTER JOIN OpenString ( FILE 'temp_include_sites_in_bulk.txt' )
                                     WITH ( url LONG VARCHAR )
                               AS text_file
       ON text_file.url = article.url
 WHERE text_file.url IS NULL
    OR article.url   IS NULL
 ORDER BY article.artificial_id;

(NULL)  (NULL)         http://www.sybase.com/detail?id=109166
327     not available  (NULL)
337     not available  (NULL)
338     not available  (NULL)
339     not available  (NULL)
344     not available  (NULL)
349     not available  (NULL)
362     too large      (NULL)
366     too large      (NULL)
368     too large      (NULL)
379     empty          (NULL)
502     not available  (NULL)
505     not available  (NULL)
506     not available  (NULL)
507     not available  (NULL)
510     not available  (NULL)
511     not available  (NULL)
530     not available  (NULL)
531     not available  (NULL)
532     not available  (NULL)
533     not available  (NULL)
Line 12 shows an url that exists in the file but not the table: that's a real problem, and it makes writing this query a worthwhile endeavour.

The other rows (lines 13 to 32) show entries in the table that don't have urls at all, so they're not going to show up in the text file sent to Google: that's not a problem.

Ooops, almost forgot... the WHERE clause on lines 8 and 9 cuts the output down to just the Possibly Bad Stuff: if the query returns no rows there's nothing to look at, let alone worry about.

Where did "OpenString" come from?

Here's the question as it was asked on sqlanywhere.forum.com:

Lo these many years, based on nothing more than the prefix "Open", I have assumed that the OpenString function clause was based on some pre-existing feature in common use by some (unknown to me) portion of the computing community.

However, two possibilities now come to mind: first, that the "Open" in OpenString has nothing to do with any utopian philosophy but that it might as well have been called "UnloadString"; and second, that OpenString might be an invention of Watcom, er, iAnywhere Solutions, er, Sybase, er, SAP.

Is it possible?
Here's the answer from Glenn Paulley:

As far as I know that is true - OPENSTRING is Anil's invention, I haven't seen anything like it with any other product.
That being Anil Goel... and here's some more information from John Smirnios:

It was an excellent evolution of LOAD TABLE and, not surprisingly, they both share the same code. Many people needed the ability to massage the data coming from a LOAD TABLE so they would insert the data into a temporary table then massage the data in the temp table. That was wasteful and cumbersome. OpenString allowed it to happen in one step. The rewrite of the execution engine (many versions ago) made it much more feasible to do. LOAD TABLE is now very little other than an INSERT (with special properties and special logging) from OpenString.
If the Wikipedia entry for "innovation" had a photo (it doesn't), it would be Anil's.

If I was Anil's boss (I'm not)... or John Smirnios, I'm not his boss either... but if I was, this would not happen...

Dilbert.com


No comments: