Monday, December 8, 2008

Windows-Style ORDER BY

Question: How do I use ORDER BY to sort alphanumeric strings the same way Windows Explorer sorts file names?

Answer: By using CREATE FUNCTION to return an alternative string value that will have the desired result when used in ORDER BY, and then (optionally) calling that function in a new COMPUTE column.

Microsoft describes the modern Windows file name sort order in an article with this breathtakingly long title: The sort order for files and folders whose names contain numerals is different in Windows Vista, Windows XP, and Windows Server 2003 than it is in Windows 2000:

"By default, the newer sort order considers strings in file and folder names as numeric content, not text. Numerals in folder and file names are sorted according to their numeric value."
The article goes on to describe how to hack with the registry to get the old sort order back, but let's not go there :)

Here are some file names as sorted by Windows XP Explorer...
a5.1.txt
a5.02.txt
a5.003.txt
Ie4_01
Ie4_128
Ie5
Ie6
Ie401sp2
Ie501sp2
z8A88.txt
z9A99.txt
z88A8.txt
z99A9.txt
Here's how the traditional ORDER BY sorts those strings in SQL Anywhere 11...
( ...which is exactly the same order produced by the DOS DIR /ON command under Windows XP, but let's ignore that, too :)
'a5.003.txt'
'a5.02.txt'
'a5.1.txt'
'Ie4_01'
'Ie4_128'
'Ie401sp2'
'Ie5'
'Ie501sp2'
'Ie6'
'z88A8.txt'
'z8A88.txt'
'z99A9.txt'
'z9A99.txt'
Here is a test table with two string columns, one for the original name and the other to be used in the ORDER BY clause:
CREATE TABLE sorted (
original_name LONG VARCHAR NOT NULL PRIMARY KEY,
alternate_name LONG VARCHAR NOT NULL
COMPUTE ( reformat ( original_name ) ) );
The alternate_name column is defined as an automatic COMPUTE column based on this expression:
   reformat ( original_name )
where reformat performs the following transformation:
The original string is parsed from left to right looking for numeric and non-numeric substrings. Each non-numeric substring is returned as-is. Each numeric substring is right justified with leading zero characters to some long length (say 40 digits, although 5 is the number used here).
Here's what the function looks like:
CREATE FUNCTION reformat (
IN @original_name LONG VARCHAR )
RETURNS LONG VARCHAR
DETERMINISTIC
BEGIN

DECLARE @alternate_name LONG VARCHAR;
DECLARE @justify_to INTEGER;
DECLARE @pos BIGINT;
DECLARE @state VARCHAR ( 20 );
DECLARE @string_part LONG VARCHAR;
DECLARE @char VARCHAR ( 1 );

DECLARE LOCAL TEMPORARY TABLE string_part (
string_part_number BIGINT NOT NULL
DEFAULT AUTOINCREMENT
PRIMARY KEY,
string_part LONG VARCHAR NOT NULL )
NOT TRANSACTIONAL;

SET @justify_to = 5; -- use 40 if you want to be sure
SET @pos = 1;
SET @state = 'start';
SET @string_part = '';

WHILE @pos <= LENGTH ( @original_name ) LOOP

SET @char = SUBSTR ( @original_name, @pos, 1 );

IF @char LIKE '[0-9]' THEN
IF @state = 'non-numeric' THEN
INSERT string_part ( string_part )
VALUES ( @string_part );
SET @string_part = '';
END IF;
SET @string_part = @string_part + @char;
SET @state = 'numeric';
ELSE
IF @state = 'numeric' THEN
INSERT string_part ( string_part )
VALUES ( @string_part );
SET @string_part = '';
END IF;
SET @string_part = @string_part + @char;
SET @state = 'non-numeric';
END IF;
SET @pos = @pos + 1;
END LOOP;

IF @state = 'start' THEN
RETURN '';
ELSE
INSERT string_part ( string_part )
VALUES ( @string_part );
SELECT LIST (
IF LEFT ( string_part.string_part, 1 )
LIKE '[0-9]'
THEN RIGHT (
STRING (
REPEAT ( '0', @justify_to ),
string_part.string_part ),
@justify_to )
ELSE string_part.string_part
ENDIF,
'' ORDER BY string_part_number )
INTO @alternate_name
FROM string_part;
RETURN @alternate_name;
END IF;

END;
Here are some notes on the code:
  • The string_part table numbers and stores each numeric and non-numeric substring in a separate row so they can be reformatted and gathered back together again via the LIST function.

  • @justify_to controls the width of the right-justified numeric strings.

  • @pos steps through each @char in the @original_name string.

  • @state implements a state-driven parsing loop, starting with @state = 'start' and then flipping back and forth between 'numeric' and 'non-numeric'.

  • @string_part is used to gather up all the @char values with the same value of @state.

  • The REPEAT function is used pad each numeric substring with an enormous number of leading '0' characters, and then the RIGHT function is used to strip off the excess '0' characters from the front of the string.

  • The LIST function is used to gather all the reformatted substrings back together into the single @alternate_name return value.

  • And last but not least, if the input is the empty string, so is the output.
Here's what the test looks like:
INSERT sorted ( original_name ) VALUES ( 'a5.1.txt' );
INSERT sorted ( original_name ) VALUES ( 'a5.02.txt' );
INSERT sorted ( original_name ) VALUES ( 'a5.003.txt' );
INSERT sorted ( original_name ) VALUES ( 'Ie4_01' );
INSERT sorted ( original_name ) VALUES ( 'Ie4_128' );
INSERT sorted ( original_name ) VALUES ( 'Ie5' );
INSERT sorted ( original_name ) VALUES ( 'Ie6' );
INSERT sorted ( original_name ) VALUES ( 'Ie401sp2' );
INSERT sorted ( original_name ) VALUES ( 'Ie501sp2' );
INSERT sorted ( original_name ) VALUES ( 'z8A88.txt' );
INSERT sorted ( original_name ) VALUES ( 'z9A99.txt' );
INSERT sorted ( original_name ) VALUES ( 'z88A8.txt' );
INSERT sorted ( original_name ) VALUES ( 'z99A9.txt' );
COMMIT;

SELECT * FROM sorted ORDER BY sorted.alternate_name;

original_name, alternate_name
'a5.1.txt', 'a00005.00001.txt'
'a5.02.txt', 'a00005.00002.txt'
'a5.003.txt', 'a00005.00003.txt'
'Ie4_01', 'Ie00004_00001'
'Ie4_128', 'Ie00004_00128'
'Ie5', 'Ie00005'
'Ie6', 'Ie00006'
'Ie401sp2', 'Ie00401sp00002'
'Ie501sp2', 'Ie00501sp00002'
'z8A88.txt', 'z00008A00088.txt'
'z9A99.txt', 'z00009A00099.txt'
'z88A8.txt', 'z00088A00008.txt'
'z99A9.txt', 'z00099A00009.txt'
...which is the same as the way Windows XP Explorer does it:
a5.1.txt
a5.02.txt
a5.003.txt
Ie4_01
Ie4_128
Ie5
Ie6
Ie401sp2
Ie501sp2
z8A88.txt
z9A99.txt
z88A8.txt
z99A9.txt
Note that the reformat function translates substrings like 5.02 into 00005.00002, resulting in a different sort order than if the 5.02 was treated as a single decimal number and translated into 00005.02. If you want the latter functionality, the specs for the reformat function might have to be changed to something like this:
The original string is parsed from left to right looking for numeric and non-numeric substrings. Each non-numeric substring is returned as-is. Each numeric substring that does not follow a dot '.' is right justified with leading zero characters to some long length. Each numeric substring that does follow a dot '.' is returned as is.

No comments: