Proposal: Sorting in CQL

17th December 2005

1. Preface
2. Motivation
3. Examples
4. Description
5. Grammar
6. Context set
7. XCQL
8. Explain

1. Preface

This document proposes a way of adding sorting capability to CQL, the Common Query Languages used by the SRU protocol (and the related SRW). It is based on a consensus reached by the SRU Editorial Board on 15th-16th December 2005, and draws freely on prose contributed by the members of the board.

2. Motivation

It is widely agreed that the sorting mechanism in SRU 1.1 is inadequate for several reasons:

It requires its own parser.
It uses XPath, a format and schema specific solution, and is incapable of expressing namespaces in the XPath.
It requires its own parameter in the request.
It is not easily extensible or profilable.
It is tied directly to the SRU version.
Sort criteria may not be specified as part of the query language, as is common in other query languages such as SQL.

That the sorting capabilities of SRU 1.0 were not revised in version 1.1 does not reflect satisfaction with the old specification so much as lack of impetus at that time to design a better mechanism.

We feel that the most effective solution to the problem of sorting in SRU and SRW is to add sorting support to the existing query language of those protocols, CQL. This has the advantages that:

Existing CQL indexes can be used in sort specifications (i.e. abstract access points).
Such indexes and their modifiers can be profiled in the same way as the core of CQL.
Only a single language parser is required in CQL, with minimal changes required to existing CQL parsers to accomodate the new syntax.
As a fringe benefit, applications of CQL other than SRU/W will benefit from the new capabilities.

3. Examples

   kernighan sortby title
   kernighan and ritchie sortby title
   dc.creator=kernighan sortby dc.title
   dc.creator=kernighan sortby numberOfLegs/cql.number
   dc.creator=kernighan sortby dc.title/sort.respectCase
   dc.creator=kernighan sortby dc.title/sort.respectCase/sort.descending
   dc.creator=kernighan sortby dc.date dc.title
   dc.creator=kernighan sortby dc.date/sort.missingOmit
   dc.creator=kernighan sortby dc.date/sort.missingValue=1970
   >dc="http://deepcustard.org/1.0" blah sortby dc.custardDepth

4. Description

The new CQL syntax includes a sort-specification, introduced by the new sortby operator, which may only occur at the top level of a query, or governed by one or more top-level prefix-mappings (as in the last example above). The left-hand-side of the sortby operator is a "classic" CQL query; the right-hand-side is an ordered list of sort keys from most to least significant, each constisting of an index name together with zero or more modifiers.

Both the indexes and the modifiers within a sort-specification may be drawn from any context set. (The concept of an index modifier is new with this proposal; for now, we suggest that they be allowed only within a sort-specification and not within the core query). As in the core query, an unqualified index is drawn from the prevailing context-set, and an unqualified modifier from the CQL context set.

The examples above are now interpreted:

kernighan sortby title: Search for "kernighan", sort by "title". The "title" index is interpreted as being from the prevailing context-set, and so could be a Dublin Core title, a title to land, a heraldic title, etc.
kernighan and ritchie sortby title: Search for "kernighan and ritchie", sort by "title"
dc.creator=kernighan sortby dc.title: Search for "dc.creator=kernighan", sort by Dublin Core title.
dc.creator=kernighan sortby numberOfLegs/cql.number: Sort by "numberOfLegs" (from the prevailing context set), treated as a number (and thus sorted numerically rather than lexicographically: "2" sorts before "10").
dc.creator=kernighan sortby dc.title/sort.respectCase: Case-sensitive sorting: the default is usually case-insensitive (but a server can Explain that it defaults to case-sensitive).
dc.creator=kernighan sortby dc.title/sort.respectCase/sort.descending: Case-sensitive, descending sort: the default is usually ascending.
dc.creator=kernighan sortby dc.date dc.title: Multiple-key sorting: records are sorted first by date, and record with the same date are further sorted by title. Any number of sort keys may be specified in this manner.
dc.creator=kernighan sortby dc.date/sort.missingOmit: Sort by date: records that have no date field are omitted from the result set.
dc.creator=kernighan sortby dc.date/sort.missingValue=1970: Sort by date: records that have no date field are sorted as though they had a date of 1970.
>dc="http://deepcustard.org/1.0" blah sortby dc.custardDepth: Sort on "custardDepth" from the hypothetical Deep Custard context set: explicit context-set prefix-mapping may override whatever default mappings are implicitly provided by an application.

5. Grammar

The proposed new syntax requires minimal changes to the current CQL grammar: the addition of three new rules, and the modification of the existing term rule by the addition of the operator sortby:

   sortedQuery ::= prefixAssignment sortedQuery | scopedClause ['sortby' sortSpec]
   sortSpec    ::= sortSpec singleSpec | singleSpec
   singleSpec  ::= index [modifierList]

   term        ::= identifier | 'and' | 'or' | 'not' | 'prox' | 'sortby'

The start production of the new grammar is sortedQuery.

This grammar restricts the sortby keyword to the top level of the query except that it may be govered by prefix-mappings. For example, it does not admit the query:

   (kernighan sortby title) and ritchie

This grammar eagerly accepts the first occurrence of the word sortby in the place of an operator, so that the lame search:

   sortby sortby sortby sortby sortby

is interpreted as a search for "sortby", sorted by the three-key sort-specification consisting of the indexes "sortby", "sortby" (again) and "sortby" (yet again). The alternative interpretation as a search for "sortby" related by the relation "sortby" to an index called "sortby", all sorted by the index "sortby", is not supported; for this reason, "sortby" may no longer be used as a profiled relation name (but you should never have done that anyway.)

6. Context set

Although the index modifiers used in a sort specification may be drawn from any context set - an important point for extensibility - a core set of sort-related modifiers are provided in a special sorting context set, and these are expected to meet the needs of most sorting scenarios. This context set is as follows:

Name: The Sorting Context Set
URI: http://zing.z3950.org/cql/sorting/1.0
Recommended prefix: sort
Object types and names: as in the following table

Object type	Name	Meaning
Index	None
Relation	None
Relation modifier	None
Boolean modifier	None
Index modifier	`ignoreCase`	Case-insensitive sorting: for example, `unit` and `UNIT` sort together.
	`respectCase`	Case-sensitive sorting: for example, `unit` and `UNIT` sort separately.
	`ignoreAccents`	Accent-insensitive sorting: for example `sorensen` and `sørensen` sort together.
	`respectAccents`	Accent-sensitive sorting: for example `sorensen` and `sørensen` sort separately.
	`ascending`	Sort in ascending order.
	`descending`	Sort in descending order.
	`missingOmit`	Records that have no value for the specified index are omitted from the sorted result set.
	`missingFail`	Records that have no value for the specified index cause the search/sort operation to fail with the diagnostic `info:srw/diagnostic/1/93`
	`missingLow`	Records that have no value for the specified index are treated as if they had the lowest possible value, so that they sort first in ascending order and last in descending order.
	`missingHigh`	Records that have no value for the specified index are treated as if they had the highest possible value.
	`missingValue=`value	Records that have no value for the specified index are treated as if they had the specified value.
	`locale=`value	Sort according to the specified locale, which will in general include specifications for whether sorting is case-sensitive or insensitive, how it treats accents, etc. The value is usually of the form `C`, `french`, `fr_CH`, `fr_CH.iso88591` or similar.
	`unicodeCollate=`value	Specfies the Unicode collation level. The value should be a small integer as described in the Unicode Collation Algorithm report at www.unicode.org/reports/tr10

7. XCQL

The XCQL schema will need to be updated to accomodate sorting.

The sole change is the addition of a new <sortKeys> element, which may occur only at the top level of the XCQL document - either after the index, relation and term in a top-level <searchClause> element, or after the boolean, leftOperand and rightOperand after a <triple> element.

This <sortKeys> element contains one or more keys, each represented by a <key> element containing an <index> element (with the qualified name of the index), and a <modifiers> group which is defined the same here as elsewhere in XCQL: it contains one or more <modifier> elements, each containing a <name> and an optional <value>.

For example, the query:

   dc.title = fish sortby dc.creator/sort.missingValue=frog/sort.descending

Would be encoded as:

   <searchClause>
     <index>dc.title</index>
     <relation><value>=</value></relation>
     <term>fish</term>
     <sortKeys>
       <key>
         <index>dc.creator</index>
         <modifiers>
           <modifier>
             <name>sort.missingValue</name>
             <value>frog</value>
           </modifier>
           <modifier>
             <name>sort.descending</name>
           </modifier>
         </modifiers>
       </key>
     </sortKeys>
   </searchClause>

8. Explain

Explain records may contain a <setting> called maximumSortKeys, being the number of sort keys that the server will accept within a single sort specification. Any further keys cause the sort operation to fail with the diagnostic info:srw/diagnostics/1/84

That the sort context set is available for a database is given in the same way as for other context sets - a <set> entry in the <indexInfo> section.

Indexes may be specified as available for sorting by an attribute called sort on the <index> element. See the ZeeRex documentation for more information on this attribute and its brethren search and scan.

The default case for sorting can be expressed in a <default> element with its type attribute set to sortCase and with value ignoreCase or respectCase. A default of case-insensitive searching is preferred.

The default direction for sorting can be expressed in a <default> element with its type attribute set to sortDirection and with value ascending or descending. A default of ascending searching is preferred.

The default handling of missing values during sorting can be expressed in a single <default> element with its type attribute set to one of the following: missingOmit, missingFail, missingLow, missingHigh or missingValue. If and only if the last of these is used, then the element should contain the value to be used in place of missing values.

Feedback to <mike@indexdata.com> is welcome!