Monday, June 17, 2013

Building a Link Checker Inside SQL Anywhere

This is the second article about the foundations for client-server communications inside SQL Anywhere, where the server is a website and the client is a stored procedure inside a SQL Anywhere database.


Resources...
- The web-style CREATE PROCEDURE statement as described in the SQL Anywhere 16 Help.

- Descriptions of the various HTTP request methods like GET and HEAD, in Wikipedia.

- Descriptions of the associated HTTP headers, also in Wikipedia.

- Kurt Lichtner's excellent 2005 Techwave presentation on the new HTTP client introduced in SQL Anywhere 9.0.2.
... yes, this stuff's been around for a while.

The previous article, Client-Server Via Web Services Inside SQL Anywhere, showed how to build a variety of web client and server functions and procedures, ending with a SQL Anywhere stored procedure that used an HTTP GET request to retrieve the HTML body of an Amazon product web page plus all the associated HTTP status and header fields.

This article delves a bit lower, showing how to build a SQL Anywhere stored procedure that uses HTTP HEAD requests to check the validity of external URLs stored in a database. The database in question is the one used to construct the SQL Anywhere Technical Documents web page, and the motivation is link rot: many of the 1,156 technical document links no longer work, and it's time to find out just how bad the situation is without checking ... each ... link ... by ... hand (hint: it's dire!)
[click here to go straight to the code]

Why not just run a link checker on the web page? Why DIY?

The obvious answer is, that's what this blog is all about! (doing stuff with SQL Anywhere databases)

Dilbert.com 1996-02-27

But... it's one thing to Do It Yourself, it's quite another to DIY without ever seeing an example of the "It" you're trying to "Do".

I can't remember ever having run a "link checker" program, so here goes...
  1. Google it: how do I check links on a website page

  2. First hit: Link Checker - The W3C Markup Validation Service

  3. Go there, feed it this URL: http://sqlanywhere.blogspot.ca/2012/03/tales-from-doc-face.html

    Figure 1: Home page for the W3C Link Checker

  4. Wait ... for ... a ... long ... time ... Done. Document processed in 651.56 seconds

    Figure 2: Processing report displayed by the W3C Link Checker for the Technical Documents page


The situation is dire!

Figure 3 shows that the W3C Link Checker reports 345 "server side problem" links... yup, it's a problem.

Figure 3: Summary of results displayed by the W3C Link Checker for the Technical Documents page


Isn't that all you need to know? Why DIY?

There's no question that the W3C Link Checker does a good job of parsing and checking a web page, and then telling you about all the problems.

What it doesn't do, is this:
  • The W3C Link Checker doesn't limit itself to the important content, in this case the Technical Documents links. That's understandable: it's parsing a web page, so it doesn't know what's important and what isn't.

  • The W3C Link Checker doesn't report the total number of links on the page, or how many were "good" links. You can determine the total by counting the number of "Checking link ..." messages in the output (1,298) and then doing some arithmetic to calculate the number of "good" links, but that's a pain in the patootie.
    Checking link http://iablog.sybase.com/efarrar/2011/10/is-fuji-really-a-cloud/
    HEAD http://iablog.sybase.com/efarrar/2011/10/is-fuji-really-a-cloud/  fetched in 1.20 seconds
    
    Checking link http://www.sybase.com/detail?id=47925
    HEAD http://www.sybase.com/detail?id=47925  fetched in 1.30 seconds
    
    ... and so on, 1,296 more times
    

  • The W3C Link Checker doesn't provide any mechanism for ad hoc reporting. Sure, you could scan the output, muck around with regular expressions... but I won't :)


  • The W3C Link Checker doesn't let you customize the process to, say, automate problem investigation. Yes, yes, it's open source... see cartoon above.
Note: One of the "Settings used" in Figure 2 is "Sleeping 1 second between requests to each server" which seems to indicate that it's not a good idea for a robot (like a link checker) to pound other people's servers. Being flagged as an Evil Doer is something no self-respecting client IP address wants...

...so, this DIY project will make use of the SQL Anywhere WAITFOR statement.
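
In SQL Anywhere, sleeping is a single statement; this exact form appears inside the link checker loop shown later:
WAITFOR DELAY '00:00:01';  -- pause for one second before the next request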

SQL Anywhere Link Checker: The Code

Here's the basic building block for the SQL Anywhere Link Checker:
CREATE PROCEDURE http_head_attributes (
   IN url     LONG VARCHAR )
RESULT (
   attribute  LONG VARCHAR,
   value      LONG VARCHAR )
URL '!url' 
TYPE 'HTTP:HEAD';

When a single URL is passed to this web-style procedure, SQL Anywhere sends an HTTP HEAD request to the web server, and if that request is successful it gets back a variable number of HTTP headers. If the attempt is unsuccessful (and '404 Not Found' counts as unsuccessful) SQL Anywhere raises an exception, which can be handled in a BEGIN CATCH block.

A successful call to http_head_attributes will return a two-column attribute and value result set containing the header names and values.

Here are examples of successful and unsuccessful calls:
SELECT * FROM http_head_attributes ( 'http://www.google.com' );

attribute         value
Status            HTTP/1.0 200 OK
Body              
Expires           -1
Server            gws
X-XSS-Protection  1; mode=block
Content-Type      text/html; charset=ISO-8859-1
P3P               CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Cache-Control     private, max-age=0
X-Frame-Options   SAMEORIGIN
Date              Tue, 11 Jun 2013 23:06:04 GMT
Set-Cookie        NID=67=Dg4bUl ... etcetera ...UKCUD; expires=Wed, 11-Dec-2013 23:06:04 GMT; path=/; domain=.google.com; HttpOnly

SELECT * FROM http_head_attributes ( 'http://www.google.com/unknown' );

HTTP request failed. Status code '404 Not Found'
SQLCODE=-983, ODBC 3 State="HY000"
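
Inside a stored procedure or batch, that failure can be trapped rather than letting it reach the client. Here's a minimal sketch using the same BEGIN TRY / BEGIN CATCH pattern as the full link checker shown later:
BEGIN
   DECLARE @status LONG VARCHAR;
   BEGIN TRY
      SELECT value
        INTO @status
        FROM http_head_attributes ( 'http://www.google.com/unknown' )
       WHERE attribute = 'Status';
      MESSAGE STRING ( 'Status: ', @status ) TO CONSOLE;
   END TRY
   BEGIN CATCH
      -- ERROR_MESSAGE() returns the text of the exception just caught
      MESSAGE STRING ( 'Request failed: ', ERROR_MESSAGE() ) TO CONSOLE;
   END CATCH;
END;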
Tip: The "Body" value is empty, and it should be, but that is not always the case! For example, SELECT * FROM http_head_attributes ( 'http://www.sap.com' ) returns the full Body value, all 28,672 bytes of it. That behavior is rare (other pages on sap.com don't do it) but it's something to keep in mind when issuing an HTTP HEAD request.
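
If a rogue server does return a Body on a HEAD request, one simple defense is to filter that row out of the result set; for example:
SELECT *
  FROM http_head_attributes ( 'http://www.sap.com' )
 WHERE TRIM ( attribute ) <> 'Body';  -- skip the (unexpectedly non-empty) Body row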

Figure 4 shows where the URLs to be checked come from. One row exists in the article table for each entry displayed on the Technical Documents web page, one row is inserted into the link_check table each time an article is checked by the SQL Anywhere Link Checker, and one row is inserted into the http_header table for each row returned by the call to http_head_attributes:

Figure 4: ER Diagram for the Technical Documents database
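
The CREATE TABLE statements behind Figure 4 aren't shown in this article, so here's a rough sketch reverse-engineered from how the code uses the tables; the column names come from the queries, but the data types and constraints are assumptions:
CREATE TABLE article (
   artificial_id    INTEGER       NOT NULL PRIMARY KEY,
   title            LONG VARCHAR  NOT NULL,
   order_by_date    DATE          NOT NULL,
   url              LONG VARCHAR );  -- not every article has a link

CREATE TABLE link_check (
   artificial_id    INTEGER       NOT NULL REFERENCES article,
   link_checked_at  TIMESTAMP     NOT NULL,
   PRIMARY KEY ( artificial_id, link_checked_at ) );

CREATE TABLE http_header (
   artificial_id    INTEGER       NOT NULL,
   link_checked_at  TIMESTAMP     NOT NULL,
   attribute        LONG VARCHAR  NOT NULL,
   value            LONG VARCHAR,
   FOREIGN KEY ( artificial_id, link_checked_at )
      REFERENCES link_check ( artificial_id, link_checked_at ) );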


Here's the code for the main link checker procedure:
CREATE PROCEDURE catalog_link_checker()
BEGIN

DECLARE @url_count         INTEGER;
DECLARE @url_counter       INTEGER;
DECLARE @link_checked_at   TIMESTAMP;
DECLARE @status            LONG VARCHAR;

SELECT COUNT(*)
  INTO @url_count
  FROM article
 WHERE url LIKE 'http%';

SET @url_counter = 0;

TRUNCATE TABLE http_header;
TRUNCATE TABLE link_check;

FOR f_fetch AS c_fetch INSENSITIVE CURSOR FOR
SELECT artificial_id AS @artificial_id,
       url           AS @url
  FROM article
 WHERE url LIKE 'http%'
 ORDER BY order_by_date DESC,
       title DESC
FOR READ ONLY
DO

   WAITFOR DELAY '00:00:01';

   SET @url_counter     = @url_counter + 1;
   SET @link_checked_at = CURRENT TIMESTAMP;

   INSERT link_check VALUES ( @artificial_id, @link_checked_at );

   MESSAGE STRING ( 'URL ', @url_counter, ' of ', @url_count, ': ', @url ) TO CONSOLE;

   BEGIN TRY

      INSERT http_header
      SELECT @artificial_id,
             @link_checked_at,
             attribute,
             value
        FROM http_head_attributes ( @url );

      SET @status = '(unknown)';

      SELECT value
        INTO @status
        FROM http_header
       WHERE artificial_id   = @artificial_id
         AND link_checked_at = @link_checked_at
         AND attribute       = 'Status';

      MESSAGE STRING ( 'Status: ', @status ) TO CONSOLE;

   END TRY
   BEGIN CATCH
      MESSAGE STRING ( 'ERROR_MESSAGE() = ',  ERROR_MESSAGE() )  TO CONSOLE;
      MESSAGE STRING ( 'ERROR_SQLCODE() = ',  ERROR_SQLCODE() )  TO CONSOLE;
      MESSAGE STRING ( 'ERROR_SQLSTATE() = ', ERROR_SQLSTATE() ) TO CONSOLE;
      INSERT http_header VALUES ( @artificial_id, @link_checked_at, 'ERROR_MESSAGE()',  ERROR_MESSAGE() );
      INSERT http_header VALUES ( @artificial_id, @link_checked_at, 'ERROR_SQLCODE()',  ERROR_SQLCODE() );
      INSERT http_header VALUES ( @artificial_id, @link_checked_at, 'ERROR_SQLSTATE()', ERROR_SQLSTATE() );
   END CATCH;

   COMMIT;

END FOR;

END;
  • The FOR loop steps through each row of the article table in the same order the entries are displayed on the web page, and selects just the primary key and URL values.

  • The WAITFOR DELAY statement at the top of the loop is a crude implementation of the "Sleeping 1 second between requests to each server" suggestion taken from Figure 2. It's crude because this code sleeps between all requests, not just requests to the same server, which means it takes a bit longer to run than the W3C Link Checker; a sketch of a more polite per-server throttle follows this list.

  • The INSERT into link_check records the fact that this URL was checked on such-and-such a date and time. This level of record-keeping isn't important right now (hence the two TRUNCATE TABLE statements at the top of the procedure) but it might be in the future, if someone wanted to know when a particular document was and wasn't available, similar to the "Retrieved" dates that Wikipedia shows in its References sections.

  • The INSERT ... SELECT copies all the headers returned by the call to http_head_attributes into the http_header table. If the call fails, control passes to the BEGIN CATCH block, and three surrogate rows are inserted for the three basic SQL Anywhere exception diagnostic values: ERROR_MESSAGE(), ERROR_SQLCODE() and ERROR_SQLSTATE().
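
For the record, here's what a more polite per-server throttle might look like. This is just a sketch, not part of the actual link checker: the host_last_hit table, the @host variable and the crude host extraction are all inventions for illustration.
CREATE TABLE host_last_hit (
   host         LONG VARCHAR NOT NULL PRIMARY KEY,
   last_hit_at  TIMESTAMP    NOT NULL );

-- ...then, inside the FOR loop, instead of the unconditional WAITFOR DELAY...
-- (assumes DECLARE @host LONG VARCHAR; alongside the other declarations)

-- Crude host extraction: everything up to the first '/' after the scheme.
SET @host = IF LOCATE ( @url, '/', 9 ) > 0
            THEN LEFT ( @url, LOCATE ( @url, '/', 9 ) - 1 )
            ELSE @url
            ENDIF;

-- Sleep only if this host was hit less than one second ago.
IF EXISTS ( SELECT *
              FROM host_last_hit
             WHERE host = @host
               AND last_hit_at > DATEADD ( SECOND, -1, CURRENT TIMESTAMP ) ) THEN
   WAITFOR DELAY '00:00:01';
END IF;

-- Record the hit; ON EXISTING UPDATE overwrites the row for a repeat host.
INSERT host_last_hit ( host, last_hit_at )
    ON EXISTING UPDATE
    VALUES ( @host, CURRENT TIMESTAMP );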
Figure 5 shows that the Link Checker takes 2,084 seconds to run; 1,156 of those seconds are spent in the WAITFOR DELAY statements (one second per URL), and the remaining 928 or so are spent waiting for web servers to respond.

Figure 6 shows a few snippets from the dbsrv16 -o filespec.txt diagnostic log containing some URLs that worked (Status: 200 OK) and some that didn't (SQLCODE -981).

Figure 5: Running the Link Checker in ISQL


Figure 6: Link Checker MESSAGE TO CONSOLE Output
I. 06/14 04:39:15. URL 1 of 1156: http://download.sybase.com/ianywhere/documents/UsingCrystalWithSQLAnywhere_FINAL.pdf
I. 06/14 04:39:15. Status: HTTP/1.1 200 OK
I. 06/14 04:39:16. URL 2 of 1156: http://sqlanywhere.blogspot.ca/2012/10/example-recursive-union-inverted-order.html
I. 06/14 04:39:16. Status: HTTP/1.0 200 OK
I. 06/14 04:39:17. URL 3 of 1156: http://sqlanywhere.blogspot.ca/2012/10/example-recursive-union-tree-traversal.html
I. 06/14 04:39:17. Status: HTTP/1.0 200 OK
...
I. 06/14 04:40:10. URL 38 of 1156: http://iablog.sybase.com/paulley/2012/03/limitations-of-proxy-tables/
I. 06/14 04:40:11. ERROR_MESSAGE() = Unable to connect to the remote host specified by 
                      'http://iablog.sybase.com/paulley/2012/03/limitations-of-proxy-tables/'
I. 06/14 04:40:11. ERROR_SQLCODE() = -981
I. 06/14 04:40:11. ERROR_SQLSTATE() = WW050
I. 06/14 04:40:12. URL 39 of 1156: http://iablog.sybase.com/paulley/2012/03/nhibernate-3-3-0cr1-released/
I. 06/14 04:40:13. ERROR_MESSAGE() = Unable to connect to the remote host specified by 
                      'http://iablog.sybase.com/paulley/2012/03/nhibernate-3-3-0cr1-released/'
I. 06/14 04:40:13. ERROR_SQLCODE() = -981
I. 06/14 04:40:13. ERROR_SQLSTATE() = WW050
...
I. 06/14 05:13:53. URL 1154 of 1156: http://www.sybase.com/detail?id=47851
I. 06/14 05:13:55. Status: HTTP/1.1 200 OK
I. 06/14 05:13:56. URL 1155 of 1156: http://www.sybase.com/detail?id=20313
I. 06/14 05:13:56. Status: HTTP/1.1 200 OK
I. 06/14 05:13:57. URL 1156 of 1156: http://www.sybase.com/detail?id=10915
I. 06/14 05:13:57. Status: HTTP/1.1 200 OK

Here's an ad hoc query that summarizes the "dire situation" by classifying all the articles by the status of their URLs:
SELECT '1. Articles With Good Links:' AS title, 
       COUNT(*)                       AS count
  FROM article
       INNER JOIN link_check
          ON link_check.artificial_id = article.artificial_id
       INNER JOIN ( SELECT *
                      FROM http_header
                     WHERE TRIM ( attribute ) = 'Status' 
                       AND TRIM ( value ) LIKE 'HTTP/% 200 OK'
                  ) AS http_header
          ON http_header.artificial_id   = link_check.artificial_id
         AND http_header.link_checked_at = link_check.link_checked_at

UNION ALL
SELECT '2. Articles Without Links:', 
       COUNT(*)
  FROM article
 WHERE url NOT LIKE 'http%'

UNION ALL
SELECT '3. Rotted iAnywhere Blog Links:',
       COUNT(*)
  FROM article
       INNER JOIN link_check
          ON link_check.artificial_id = article.artificial_id
       INNER JOIN ( SELECT *
                      FROM http_header
                     WHERE TRIM ( attribute ) = 'ERROR_MESSAGE()' 
                       AND TRIM ( value ) LIKE 'Unable to connect to the remote host specified by ''http://iablog.sybase.com%'
                  ) AS http_header
          ON http_header.artificial_id   = link_check.artificial_id
         AND http_header.link_checked_at = link_check.link_checked_at

UNION ALL
SELECT '4. Invalid Response Links:',
       COUNT(*)
  FROM article
       INNER JOIN link_check
          ON link_check.artificial_id = article.artificial_id
       INNER JOIN ( SELECT *
                      FROM http_header
                     WHERE TRIM ( attribute ) = 'ERROR_MESSAGE()' 
                       AND TRIM ( value )     = 'Invalid response from the HTTP server'
                  ) AS http_header
          ON http_header.artificial_id   = link_check.artificial_id
         AND http_header.link_checked_at = link_check.link_checked_at

UNION ALL
SELECT '5. Other:',
       COUNT(*)
  FROM article
       INNER JOIN link_check
          ON link_check.artificial_id = article.artificial_id
 WHERE NOT EXISTS ( 
   SELECT *
     FROM http_header
    WHERE http_header.artificial_id = article.artificial_id
      AND (    (     TRIM ( attribute ) = 'Status' 
                 AND TRIM ( value )     LIKE 'HTTP/% 200 OK' )
            OR (     TRIM ( attribute ) = 'ERROR_MESSAGE()' 
                 AND TRIM ( value )     LIKE 'Unable to connect%http://iablog.sybase.com%' )
            OR (     TRIM ( attribute ) = 'ERROR_MESSAGE()' 
                 AND TRIM ( value )     = 'Invalid response from the HTTP server' ) ) ) 

UNION ALL
SELECT '6. Total Articles:',
       COUNT(*)
  FROM article
ORDER BY 1;

title                                 count 
------------------------------- ----------- 
1. Articles With Good Links:            791 
2. Articles Without Links:               18 
3. Rotted iAnywhere Blog Links:         345 
4. Invalid Response Links:               18 
5. Other:                                 2 
6. Total Articles:                     1174 

Execution time: 0.095 seconds
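
That's the payoff of keeping everything in tables: further ad hoc queries are cheap. For example, here's one more quick sketch, a breakdown of all the raw Status values the servers returned (the surrogate error rows from the CATCH block don't have a Status attribute, so they don't appear here):
SELECT TRIM ( value ) AS status,
       COUNT(*)       AS count
  FROM http_header
 WHERE TRIM ( attribute ) = 'Status'
 GROUP BY TRIM ( value )
 ORDER BY count DESC;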

What's Going To Happen To The Technical Documents Page?

As the summary shows, only (!) 29% of the links have rotted away; those are the iAnywhere blog posts on sybase.com. Some of them have been moved to sap.com, but it's not clear how many, nor is it clear how to map the old URLs to the new ones... it's probably going to be a long ... slow ... manual ... process to make the repairs, which is why it hasn't been started yet.

And it won't be, at least not until the fate of the remaining documents on sybase.com is determined. There are literally hundreds of whitepapers, Techwave presentations, Insider magazine articles and so on. Chances are, a few will move, many will be abandoned, and all links to sybase.com will rot away.

As the rot rate rises, enthusiasm sinks, and at some point (40 percent? 50? 60?) the link to the Technical Documents page itself will rot away.

Like I said, the situation is dire, but only for those who care :)

Dilbert.com 1998-04-11

1 comment:

Anonymous said...

"Rotted iAnywhere Blog Links" are a problem for several SA forum pages as well, since those pages link to them.

I hope SAP will help to re-locate those. Yes, that's just hope.

At least Volker does care.