This document proposes a way of adding sorting capability to CQL, the Common Query Languages used by the SRU protocol (and the related SRW). It is based on a consensus reached by the SRU Editorial Board on 15th-16th December 2005, and draws freely on prose contributed by the members of the board.
It is widely agreed that the sorting mechanism in SRU 1.1 is inadequate for several reasons:
That the sorting capabilities of SRU 1.0 were not revised in version 1.1 does not reflect satisfaction with the old specification so much as lack of impetus at that time to design a better mechanism.
We feel that the most effective solution to the problem of sorting in SRU and SRW is to add sorting support to the existing query language of those protocols, CQL. This has the advantages that:
kernighan sortby title kernighan and ritchie sortby title dc.creator=kernighan sortby dc.title dc.creator=kernighan sortby numberOfLegs/cql.number dc.creator=kernighan sortby dc.title/sort.respectCase dc.creator=kernighan sortby dc.title/sort.respectCase/sort.descending dc.creator=kernighan sortby dc.date dc.title dc.creator=kernighan sortby dc.date/sort.missingOmit dc.creator=kernighan sortby dc.date/sort.missingValue=1970 >dc="http://deepcustard.org/1.0" blah sortby dc.custardDepth
The new CQL syntax includes a sort-specification, introduced by the new sortby operator, which may only occur at the top level of a query, or governed by one or more top-level prefix-mappings (as in the last example above). The left-hand-side of the sortby operator is a "classic" CQL query; the right-hand-side is an ordered list of sort keys from most to least significant, each constisting of an index name together with zero or more modifiers.
Both the indexes and the modifiers within a sort-specification may be drawn from any context set. (The concept of an index modifier is new with this proposal; for now, we suggest that they be allowed only within a sort-specification and not within the core query). As in the core query, an unqualified index is drawn from the prevailing context-set, and an unqualified modifier from the CQL context set.
The examples above are now interpreted:
The proposed new syntax requires minimal changes to the current CQL grammar: the addition of three new rules, and the modification of the existing term rule by the addition of the operator sortby:
sortedQuery ::= prefixAssignment sortedQuery | scopedClause ['sortby' sortSpec] sortSpec ::= sortSpec singleSpec | singleSpec singleSpec ::= index [modifierList] term ::= identifier | 'and' | 'or' | 'not' | 'prox' | 'sortby'
The start production of the new grammar is sortedQuery.
This grammar restricts the sortby keyword to the top level of the query except that it may be govered by prefix-mappings. For example, it does not admit the query:
(kernighan sortby title) and ritchie
This grammar eagerly accepts the first occurrence of the word sortby in the place of an operator, so that the lame search:
sortby sortby sortby sortby sortby
is interpreted as a search for "sortby", sorted by the three-key sort-specification consisting of the indexes "sortby", "sortby" (again) and "sortby" (yet again). The alternative interpretation as a search for "sortby" related by the relation "sortby" to an index called "sortby", all sorted by the index "sortby", is not supported; for this reason, "sortby" may no longer be used as a profiled relation name (but you should never have done that anyway.)
Although the index modifiers used in a sort specification may be drawn from any context set - an important point for extensibility - a core set of sort-related modifiers are provided in a special sorting context set, and these are expected to meet the needs of most sorting scenarios. This context set is as follows:
Name: The Sorting Context Set
URI: http://zing.z3950.org/cql/sorting/1.0
Recommended prefix: sort
Object types and names: as in the following table
Object type | Name | Meaning |
---|---|---|
Index | None | |
Relation | None | |
Relation modifier | None | |
Boolean modifier | None | |
Index modifier | ignoreCase | Case-insensitive sorting: for example, unit and UNIT sort together. |
respectCase | Case-sensitive sorting: for example, unit and UNIT sort separately. | |
ignoreAccents | Accent-insensitive sorting: for example sorensen and sørensen sort together. | |
respectAccents | Accent-sensitive sorting: for example sorensen and sørensen sort separately. | |
ascending | Sort in ascending order. | |
descending | Sort in descending order. | |
missingOmit | Records that have no value for the specified index are omitted from the sorted result set. | |
missingFail | Records that have no value for the specified index cause the search/sort operation to fail with the diagnostic info:srw/diagnostic/1/93 | |
missingLow | Records that have no value for the specified index are treated as if they had the lowest possible value, so that they sort first in ascending order and last in descending order. | |
missingHigh | Records that have no value for the specified index are treated as if they had the highest possible value. | |
missingValue=value | Records that have no value for the specified index are treated as if they had the specified value. | |
locale=value | Sort according to the specified locale, which will in general include specifications for whether sorting is case-sensitive or insensitive, how it treats accents, etc. The value is usually of the form C, french, fr_CH, fr_CH.iso88591 or similar. | |
unicodeCollate=value | Specfies the Unicode collation level. The value should be a small integer as described in the Unicode Collation Algorithm report at www.unicode.org/reports/tr10 |
The XCQL schema will need to be updated to accomodate sorting.
The sole change is the addition of a new <sortKeys> element, which may occur only at the top level of the XCQL document - either after the index, relation and term in a top-level <searchClause> element, or after the boolean, leftOperand and rightOperand after a <triple> element.
This <sortKeys> element contains one or more keys, each represented by a <key> element containing an <index> element (with the qualified name of the index), and a <modifiers> group which is defined the same here as elsewhere in XCQL: it contains one or more <modifier> elements, each containing a <name> and an optional <value>.
For example, the query:
dc.title = fish sortby dc.creator/sort.missingValue=frog/sort.descending
Would be encoded as:
<searchClause> <index>dc.title</index> <relation><value>=</value></relation> <term>fish</term> <sortKeys> <key> <index>dc.creator</index> <modifiers> <modifier> <name>sort.missingValue</name> <value>frog</value> </modifier> <modifier> <name>sort.descending</name> </modifier> </modifiers> </key> </sortKeys> </searchClause>
Explain records may contain a <setting> called maximumSortKeys, being the number of sort keys that the server will accept within a single sort specification. Any further keys cause the sort operation to fail with the diagnostic info:srw/diagnostics/1/84
That the sort context set is available for a database is given in the same way as for other context sets - a <set> entry in the <indexInfo> section.
Indexes may be specified as available for sorting by an attribute called sort on the <index> element. See the ZeeRex documentation for more information on this attribute and its brethren search and scan.
The default case for sorting can be expressed in a <default> element with its type attribute set to sortCase and with value ignoreCase or respectCase. A default of case-insensitive searching is preferred.
The default direction for sorting can be expressed in a <default> element with its type attribute set to sortDirection and with value ascending or descending. A default of ascending searching is preferred.
The default handling of missing values during sorting can be expressed in a single <default> element with its type attribute set to one of the following: missingOmit, missingFail, missingLow, missingHigh or missingValue. If and only if the last of these is used, then the element should contain the value to be used in place of missing values.