Friday, June 29, 2012

Example: ROW_NUMBER()

Question: How do I check two columns to make sure they have the same ordering?

Answer: Why, that's EASY!

You just take the...

um...

you do a...

just a minnit...

Could you repeat the question, please?


Question: How do I make sure a table with two NOT NULL columns a and b yields the same row ordering no matter which column is used in the ORDER BY?

Whaddaya mean "the same row ordering"? What about duplicates?

OK, the "same row ordering for distinct column values".

The question can be restated as
How do I check that SELECT DISTINCT a, b FROM t ORDER BY a, b yields same result set as ORDER BY b, a?
Here's some data:
CREATE TABLE t (
   pkey   INTEGER NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   a      INTEGER NOT NULL,
   b      INTEGER NOT NULL );

INSERT t ( a, b ) VALUES ( 3, 22 );
INSERT t ( a, b ) VALUES ( 4, 45 );
INSERT t ( a, b ) VALUES ( 1, 10 );
INSERT t ( a, b ) VALUES ( 3, 22 );
INSERT t ( a, b ) VALUES ( 2, 22 );
INSERT t ( a, b ) VALUES ( 4, 44 );
INSERT t ( a, b ) VALUES ( 5, 33 );
INSERT t ( a, b ) VALUES ( 3, 22 );
COMMIT;

SELECT * FROM t ORDER BY a, b;

pkey  a  b
3     1  10
5     2  22
1     3  22
4     3  22
8     3  22
6     4  44
2     4  45
7     5  33

The order of the pkey column isn't important, what's important is the order of a as compared with the order of b; here's what that looks like:
SELECT DISTINCT a, b FROM t ORDER BY a, b;
SELECT DISTINCT a, b FROM t ORDER BY b, a;

a  b     a  b
1  10    1  10
2  22    2  22
3  22    3  22
4  44    5  33
4  45    4  44
5  33    4  45

Right away, you can see "5, 33" is a problem because it appears at row 6 in one result set and row 4 in the other.

This suggests a brute-force solution:
Compare the ROW_NUMBER() values for the two result sets.

I'm sure YOU can do better than this...

WITH vdistinct AS
        ( SELECT DISTINCT
                 a, 
                 b
            FROM t ),
     vjoin AS
        ( SELECT *
            FROM ( SELECT a                                   AS a1, 
                          b                                   AS b1,  
                          ROW_NUMBER() OVER ( ORDER BY a, b ) AS row1 
                     FROM vdistinct ) AS t1
                 INNER JOIN
                 ( SELECT a                                   AS a2,  
                          b                                   AS b2,  
                          ROW_NUMBER() OVER ( ORDER BY b, a ) AS row2 
                     FROM vdistinct ) AS t2
                 ON t1.a1 = t2.a2
                AND t1.b1 = t2.b2 )
SELECT *
  FROM vjoin
 WHERE row1 <> row2
 ORDER BY row1, row2;

The WITH clause on lines 1 to 18 defines two local views, also known as "common table expressions", called vdistinct and vjoin.

The first local view on lines 1 to 5 takes care of the fact that the DISTINCT keyword applies to the entire SELECT list, and if we want all the distinct pairs of a and b, there can't be anything else (say, ROW_NUMBER()) in the SELECT DISTINCT list.

The second local view on lines 6 to 18y joins the first view to itself on matching values of a and b, with the difference being the two ROW_NUMBER() OVER ( ORDER BY ... ) items.

The final SELECT on lines 19 to 22 looks for ( a, b ) ( a, b ) pairs with different ROW_NUMBER() values... if the result set is empty, then columns a and b are OK...

...but if anything shows up, there's a problem:
a1  b1  row1  a2  b2  row2
4   44  4     4   44  5
4   45  5     4   45  6
5   33  6     5   33  4

There are three rows in the result set but only one problem, why?

Because the question only asked "Is there a problem here?"

Is that good enough?


In 1972 finding all the errors in one run was critically important because you had to wait hours, or overnight, for each test.

But it's not 1972 any more, and finding one error at a time is perfectly OK.

Consider this real-world example: Only lines 3, 4 and 5 had to be changed to use the WITH SELECT on a different table.
WITH vdistinct AS
        ( SELECT DISTINCT
                 article_number AS a, 
                 order_by_date  AS b
            FROM article ),
     vjoin AS
        ( SELECT *
            FROM ( SELECT a                                   AS a1, 
                          b                                   AS b1,  
                          ROW_NUMBER() OVER ( ORDER BY a, b ) AS row1 
                     FROM vdistinct ) AS t1
                 INNER JOIN
                 ( SELECT a                                   AS a2,  
                          b                                   AS b2,  
                          ROW_NUMBER() OVER ( ORDER BY b, a ) AS row2 
                     FROM vdistinct ) AS t2
                 ON t1.a1 = t2.a2
                AND t1.b1 = t2.b2 )
SELECT *
  FROM vjoin
 WHERE row1 <> row2
 ORDER BY row1, row2;

Over 100 rows were returned, not because of data-entry errors but because of a logic error in an automatic load process:
a1,b1,row1,a2,b2,row2
161.011000,'2005-10-01',243,161.011000,'2005-10-01',262
161.020000,'2005-08-21',244,161.020000,'2005-08-21',243
161.030000,'2005-08-21',245,161.030000,'2005-08-21',244
161.040000,'2005-08-21',246,161.040000,'2005-08-21',245
161.050000,'2005-08-21',247,161.050000,'2005-08-21',246
161.060000,'2005-08-21',248,161.060000,'2005-08-21',247
161.070000,'2005-08-21',249,161.070000,'2005-08-21',248
161.080000,'2005-08-21',250,161.080000,'2005-08-21',249
161.090000,'2005-08-21',251,161.090000,'2005-08-21',250
161.100000,'2005-08-21',252,161.100000,'2005-08-21',251
161.110000,'2005-08-21',253,161.110000,'2005-08-21',252
161.120000,'2005-08-21',254,161.120000,'2005-08-21',253
161.130000,'2005-08-21',255,161.130000,'2005-08-21',254
161.140000,'2005-08-21',256,161.140000,'2005-08-21',255
161.150000,'2005-08-21',257,161.150000,'2005-08-21',256
161.160000,'2005-08-21',258,161.160000,'2005-08-21',257
161.170000,'2005-08-21',259,161.170000,'2005-08-21',258

After the logic error was fixed, only six rows showed up:
a1,b1,row1,a2,b2,row2
303.000000,'2011-02-01',759,303.000000,'2011-02-01',764
304.000000,'2011-01-04',760,304.000000,'2011-01-04',759
304.010000,'2011-01-11',761,304.010000,'2011-01-11',760
304.020000,'2011-01-17',762,304.020000,'2011-01-17',761
304.030000,'2011-01-24',763,304.030000,'2011-01-24',762
304.040000,'2011-01-31',764,304.040000,'2011-01-31',763

Now that WAS a data-entry error... one error involving two pairs (303, 2011-02-01) and (304, 2011-01-04). It caused a cascading effect in the subsequent automatic load process, hence the six rows in the WITH SELECT result set.

In other words, when that single error was fixed, the WITH SELECT returned no rows - no problems!



No comments: