Proposal: Sorting in CQL

17th December 2005

1. Preface
2. Motivation
3. Examples
4. Description
5. Grammar
6. Context set
7. XCQL
8. Explain

1. Preface

This document proposes a way of adding sorting capability to CQL, the Common Query Languages used by the SRU protocol (and the related SRW). It is based on a consensus reached by the SRU Editorial Board on 15th-16th December 2005, and draws freely on prose contributed by the members of the board.

2. Motivation

It is widely agreed that the sorting mechanism in SRU 1.1 is inadequate for several reasons:

That the sorting capabilities of SRU 1.0 were not revised in version 1.1 does not reflect satisfaction with the old specification so much as lack of impetus at that time to design a better mechanism.

We feel that the most effective solution to the problem of sorting in SRU and SRW is to add sorting support to the existing query language of those protocols, CQL. This has the advantages that:

3. Examples

   kernighan sortby title
   kernighan and ritchie sortby title
   dc.creator=kernighan sortby dc.title
   dc.creator=kernighan sortby numberOfLegs/cql.number
   dc.creator=kernighan sortby dc.title/sort.respectCase
   dc.creator=kernighan sortby dc.title/sort.respectCase/sort.descending
   dc.creator=kernighan sortby dc.date dc.title
   dc.creator=kernighan sortby dc.date/sort.missingOmit
   dc.creator=kernighan sortby dc.date/sort.missingValue=1970
   >dc="http://deepcustard.org/1.0" blah sortby dc.custardDepth

4. Description

The new CQL syntax includes a sort-specification, introduced by the new sortby operator, which may only occur at the top level of a query, or governed by one or more top-level prefix-mappings (as in the last example above). The left-hand-side of the sortby operator is a "classic" CQL query; the right-hand-side is an ordered list of sort keys from most to least significant, each constisting of an index name together with zero or more modifiers.

Both the indexes and the modifiers within a sort-specification may be drawn from any context set. (The concept of an index modifier is new with this proposal; for now, we suggest that they be allowed only within a sort-specification and not within the core query). As in the core query, an unqualified index is drawn from the prevailing context-set, and an unqualified modifier from the CQL context set.

The examples above are now interpreted:

kernighan sortby title
Search for "kernighan", sort by "title". The "title" index is interpreted as being from the prevailing context-set, and so could be a Dublin Core title, a title to land, a heraldic title, etc.
kernighan and ritchie sortby title
Search for "kernighan and ritchie", sort by "title"
dc.creator=kernighan sortby dc.title
Search for "dc.creator=kernighan", sort by Dublin Core title.
dc.creator=kernighan sortby numberOfLegs/cql.number
Sort by "numberOfLegs" (from the prevailing context set), treated as a number (and thus sorted numerically rather than lexicographically: "2" sorts before "10").
dc.creator=kernighan sortby dc.title/sort.respectCase
Case-sensitive sorting: the default is usually case-insensitive (but a server can Explain that it defaults to case-sensitive).
dc.creator=kernighan sortby dc.title/sort.respectCase/sort.descending
Case-sensitive, descending sort: the default is usually ascending.
dc.creator=kernighan sortby dc.date dc.title
Multiple-key sorting: records are sorted first by date, and record with the same date are further sorted by title. Any number of sort keys may be specified in this manner.
dc.creator=kernighan sortby dc.date/sort.missingOmit
Sort by date: records that have no date field are omitted from the result set.
dc.creator=kernighan sortby dc.date/sort.missingValue=1970
Sort by date: records that have no date field are sorted as though they had a date of 1970.
>dc="http://deepcustard.org/1.0" blah sortby dc.custardDepth
Sort on "custardDepth" from the hypothetical Deep Custard context set: explicit context-set prefix-mapping may override whatever default mappings are implicitly provided by an application.

5. Grammar

The proposed new syntax requires minimal changes to the current CQL grammar: the addition of three new rules, and the modification of the existing term rule by the addition of the operator sortby:

   sortedQuery ::= prefixAssignment sortedQuery | scopedClause ['sortby' sortSpec]
   sortSpec    ::= sortSpec singleSpec | singleSpec
   singleSpec  ::= index [modifierList]

   term        ::= identifier | 'and' | 'or' | 'not' | 'prox' | 'sortby'

The start production of the new grammar is sortedQuery.

This grammar restricts the sortby keyword to the top level of the query except that it may be govered by prefix-mappings. For example, it does not admit the query:

   (kernighan sortby title) and ritchie

This grammar eagerly accepts the first occurrence of the word sortby in the place of an operator, so that the lame search:

   sortby sortby sortby sortby sortby

is interpreted as a search for "sortby", sorted by the three-key sort-specification consisting of the indexes "sortby", "sortby" (again) and "sortby" (yet again). The alternative interpretation as a search for "sortby" related by the relation "sortby" to an index called "sortby", all sorted by the index "sortby", is not supported; for this reason, "sortby" may no longer be used as a profiled relation name (but you should never have done that anyway.)

6. Context set

Although the index modifiers used in a sort specification may be drawn from any context set - an important point for extensibility - a core set of sort-related modifiers are provided in a special sorting context set, and these are expected to meet the needs of most sorting scenarios. This context set is as follows:

Name: The Sorting Context Set
URI: http://zing.z3950.org/cql/sorting/1.0
Recommended prefix: sort
Object types and names: as in the following table

Object type Name Meaning
Index None
Relation None
Relation modifier None
Boolean modifier None
Index modifier ignoreCase Case-insensitive sorting: for example, unit and UNIT sort together.
respectCase Case-sensitive sorting: for example, unit and UNIT sort separately.
ignoreAccents Accent-insensitive sorting: for example sorensen and sørensen sort together.
respectAccents Accent-sensitive sorting: for example sorensen and sørensen sort separately.
ascending Sort in ascending order.
descending Sort in descending order.
missingOmit Records that have no value for the specified index are omitted from the sorted result set.
missingFail Records that have no value for the specified index cause the search/sort operation to fail with the diagnostic info:srw/diagnostic/1/93
missingLow Records that have no value for the specified index are treated as if they had the lowest possible value, so that they sort first in ascending order and last in descending order.
missingHigh Records that have no value for the specified index are treated as if they had the highest possible value.
missingValue=value Records that have no value for the specified index are treated as if they had the specified value.
locale=value Sort according to the specified locale, which will in general include specifications for whether sorting is case-sensitive or insensitive, how it treats accents, etc. The value is usually of the form C, french, fr_CH, fr_CH.iso88591 or similar.
unicodeCollate=value Specfies the Unicode collation level. The value should be a small integer as described in the Unicode Collation Algorithm report at www.unicode.org/reports/tr10

7. XCQL

The XCQL schema will need to be updated to accomodate sorting.

The sole change is the addition of a new <sortKeys> element, which may occur only at the top level of the XCQL document - either after the index, relation and term in a top-level <searchClause> element, or after the boolean, leftOperand and rightOperand after a <triple> element.

This <sortKeys> element contains one or more keys, each represented by a <key> element containing an <index> element (with the qualified name of the index), and a <modifiers> group which is defined the same here as elsewhere in XCQL: it contains one or more <modifier> elements, each containing a <name> and an optional <value>.

For example, the query:

   dc.title = fish sortby dc.creator/sort.missingValue=frog/sort.descending

Would be encoded as:

   <searchClause>
     <index>dc.title</index>
     <relation><value>=</value></relation>
     <term>fish</term>
     <sortKeys>
       <key>
         <index>dc.creator</index>
         <modifiers>
           <modifier>
             <name>sort.missingValue</name>
             <value>frog</value>
           </modifier>
           <modifier>
             <name>sort.descending</name>
           </modifier>
         </modifiers>
       </key>
     </sortKeys>
   </searchClause>

8. Explain

Explain records may contain a <setting> called maximumSortKeys, being the number of sort keys that the server will accept within a single sort specification. Any further keys cause the sort operation to fail with the diagnostic info:srw/diagnostics/1/84

That the sort context set is available for a database is given in the same way as for other context sets - a <set> entry in the <indexInfo> section.

Indexes may be specified as available for sorting by an attribute called sort on the <index> element. See the ZeeRex documentation for more information on this attribute and its brethren search and scan.

The default case for sorting can be expressed in a <default> element with its type attribute set to sortCase and with value ignoreCase or respectCase. A default of case-insensitive searching is preferred.

The default direction for sorting can be expressed in a <default> element with its type attribute set to sortDirection and with value ascending or descending. A default of ascending searching is preferred.

The default handling of missing values during sorting can be expressed in a single <default> element with its type attribute set to one of the following: missingOmit, missingFail, missingLow, missingHigh or missingValue. If and only if the last of these is used, then the element should contain the value to be used in place of missing values.

Feedback to <mike@indexdata.com> is welcome!