ZING:CQL - "Z39.50 Next Generation -- Common Query Language"


Philosophy

All CQL queries must have an unambiguous mapping to classic Z39.50 Type-1 queries.

Syntax

Boolean::= 'AND' | 'OR' | 'NOT' | Adjacency

Adjacency::= 'W/' Digit

IndexQualifier::= [IndexSet '.'] IndexID

Relationship::= Equality | '>' | '<' | '>=' | '<=' | '@fuzzy@' | '@stem@' | '@relevance@'

Equality::= ':' | '='

QualifiedTerm::= [IndexQualifier Relationship] Term | QualifiedTerm Boolean QualifiedTerm | '(' QualifiedTerm Boolean QualifiedTerm ')'
Term::= NonBlankCharacter* | '"'Character*'"'
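
The grammar above can be exercised with a small recursive parser. What follows is only a sketch under stated assumptions, not a reference implementation: the token boundaries (relationship symbols, quotes and parentheses end an unquoted word), the left-to-right grouping of Booleans, and the names tokenize, parse and the "clause" tuple are all choices made here for illustration. It also accepts parentheses around a single clause, which the grammar as written does not.

```python
import re

# Assumed token rules: quotes, relationship symbols and parentheses
# terminate an unquoted word. The grammar itself does not say this.
TOKEN = re.compile(r'''
    \s*(?:
        (?P<quoted>"[^"]*")
      | (?P<rel>@fuzzy@|@stem@|@relevance@|>=|<=|>|<|:|=)
      | (?P<paren>[()])
      | (?P<word>[^\s()"<>=:]+)
    )''', re.VERBOSE)

BOOLEANS = {"AND", "OR", "NOT"}

def tokenize(query):
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if m is None:
            raise ValueError("cannot tokenize at offset %d" % pos)
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def is_boolean(token):
    kind, text = token
    return kind == "word" and (text in BOOLEANS or re.fullmatch(r"W/\d", text) is not None)

def parse_term(tokens, i):
    if i >= len(tokens) or tokens[i][0] not in ("word", "quoted"):
        raise ValueError("expected a term")
    kind, text = tokens[i]
    return (text[1:-1] if kind == "quoted" else text), i + 1

def parse_clause(tokens, i):
    if i >= len(tokens):
        raise ValueError("unexpected end of query")
    kind, text = tokens[i]
    if (kind, text) == ("paren", "("):
        node, i = parse_expr(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ("paren", ")"):
            raise ValueError("missing ')'")
        return node, i + 1
    # Optional "IndexQualifier Relationship" prefix before the term.
    if kind == "word" and i + 1 < len(tokens) and tokens[i + 1][0] == "rel":
        term, j = parse_term(tokens, i + 2)
        return ("clause", text, tokens[i + 1][1], term), j
    term, j = parse_term(tokens, i)
    return ("clause", None, None, term), j

def parse_expr(tokens, i):
    # Booleans are grouped left to right (an assumption; no precedence is defined).
    left, i = parse_clause(tokens, i)
    while i < len(tokens) and is_boolean(tokens[i]):
        op = tokens[i][1]
        right, i = parse_clause(tokens, i + 1)
        left = (op, left, right)
    return left, i

def parse(query):
    tokens = tokenize(query)
    node, i = parse_expr(tokens, 0)
    if i != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return node
```

For instance, parse('Author:"levan ralph"') yields a single clause tuple holding the index name, the relationship and the unquoted term text.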

Examples

Author:"levan ralph" does an adjacency word list search against the author index
Bib1.AuthorPhrase="levan, ralph" does a string match against the author index

Clarifications

IndexQualifiers map to combinations of Use and Structure attributes. IndexSets do not necessarily map to AttributeSets, but the IndexIDs within the IndexSets do get explicitly mapped to a combination of a Use and a Structure attribute from AttributeSets.

I have made the ‘:’ and ‘=’ characters equivalent.

The Structure attributes implicitly supported by CQL are String and AdjacencyWordList.

Type-1 Mappings

All terms are assumed to have a Truncation attribute of 104 (Z39.58 Masking). This supports the use of ‘?’ and ‘#’ as masking characters. (See http://lcweb.loc.gov/z3950/agency/defns/bib1.html#55.)
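
To make the masking behaviour concrete, here is a minimal sketch that turns a masked term into a regular expression. It assumes the usual Z39.58 reading, in which '#' stands for exactly one character and '?' for any run of characters (including none); the '?n' bounded form is not handled, and the function name mask_to_regex is hypothetical.

```python
import re

def mask_to_regex(term):
    """Translate a masked term into a compiled regular expression."""
    parts = []
    for ch in term:
        if ch == "#":
            parts.append(".")    # assumed: exactly one character
        elif ch == "?":
            parts.append(".*")   # assumed: zero or more characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))
```

Used with fullmatch, mask_to_regex("wom#n") accepts "woman" and "women", and mask_to_regex("comput?") accepts "computers".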

All terms are assumed to have a Completeness attribute of 1 (incomplete subfield).

All terms are assumed to have a Position attribute of 3 (any position in field).

IndexQualifiers map to combinations of Use and Structure attributes.

While the Structure attributes implicitly supported by CQL are String and AdjacencyWordList, a smart mapping to Type-1 will probably convert them to the more ambiguous Phrase and WordList Structure attributes.
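
The mapping rules above can be sketched as a lookup that produces the Bib-1 attribute list for one clause. The attribute type numbers (1 Use through 6 Completeness) and the Relation values are Bib-1's own; everything else is an assumption for illustration: the INDEX_TABLE contents, the choice of Structure 6 (word list) for the Author index and Structure 1 (phrase) for Bib1.AuthorPhrase, and the function name to_type1_attributes. '@fuzzy@' is left unmapped because the text does not say which Relation value it should take.

```python
# Bib-1 attribute type numbers.
USE, RELATION, POSITION, STRUCTURE, TRUNCATION, COMPLETENESS = 1, 2, 3, 4, 5, 6

# Defaults stated in the text: Z39.58 masking, incomplete subfield, any position.
DEFAULTS = {TRUNCATION: 104, COMPLETENESS: 1, POSITION: 3}

# Bib-1 Relation values; ':' and '=' are equivalent.
RELATION_VALUES = {
    "<": 1, "<=": 2, ":": 3, "=": 3, ">=": 4, ">": 5,
    "@stem@": 101, "@relevance@": 102,
}

# Hypothetical index table: IndexQualifier -> (Use, Structure).
INDEX_TABLE = {
    "Author": (1003, 6),             # Use 1003 = Author, Structure 6 = word list
    "Bib1.AuthorPhrase": (1003, 1),  # Structure 1 = phrase
}

def to_type1_attributes(index, relationship):
    """Return {attribute type: value} for one qualified term."""
    use, structure = INDEX_TABLE[index]
    attrs = {USE: use, RELATION: RELATION_VALUES[relationship], STRUCTURE: structure}
    attrs.update(DEFAULTS)
    return dict(sorted(attrs.items()))
```

A real server would presumably load the index table from its own configuration rather than hard-code it.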

Issues

Human readable?
Human enterable? CQL is a potential area of disagreement between SRW and SRU. SRW may assume that client software can manipulate a human-entered query into CQL. SRU has no such advantage; users must type in a CQL query. So we need to make CQL easy to enter.
Internationally friendly? E.g., by using numbers in preference to symbolic identifiers that mean something in English.
Support for multiple attribute sets in one query?
Distributed searching? (Ability to issue a single query against multiple collections without change.) Different servers provide different access points to their data. In Z39.50 that means different attribute combinations. So in SRW, does this mean different indexes? Or can we profile SRW as we do Z39.50?
Direct mapping to Z39.50 constructs?
How to define scope names? (Are they attribute sets or just a logical grouping for names? E.g., Dublin Core attributes are defined in the Bib-1 attribute set at present.) Use the current exact Z39.50 attribute sets etc. for mapping onto CQL field names? Or should creators of index sets describe each index in terms of Z39.50 attributes?
How to manage the population of field-set names? Should there be a central CQL registry of such names? If it can change per server, then reusing a query against multiple servers will be difficult. Should sites be able to define their own new, local sets without going to the global registry? Instead of 'dc.Title', should it be a URL? That is, Dublin Core XML namespace URI + DC element name? Or should queries be CQL text plus a set of definitions for mapping "dc" to "Dublin Core URI" etc.? A suggestion is to provide a URI in Explain that points the user/application to the Index to Attribute Set mapping.
The pattern match characters don't seem to follow any existing standard. (Rather, they mix several existing standards.) Stick to CCL (# and ?) and drop '*'? We want to map to Z39.50 easily, and Z39.50 already has a CCL regex attribute.
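
One of the options raised above, sending CQL text together with a set of definitions that map short scope names such as "dc" to a URI, could look roughly like this. It is a sketch of that one option only, not a decision: the PREFIXES table, the function name resolve_index and the choice to append the local name unchanged (no case folding) are all assumptions.

```python
# Hypothetical per-query definitions mapping scope names to namespace URIs.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}

def resolve_index(name, prefixes):
    """Expand a two-part index name such as 'dc.Title' into a full URI."""
    scope, dot, local = name.partition(".")
    if not dot or not local:
        raise ValueError("expected a two-part name like 'dc.Title': %r" % name)
    if scope not in prefixes:
        raise KeyError("no definition supplied for scope %r" % scope)
    # The local name is appended as written; case handling is left open.
    return prefixes[scope] + local
```

With the table above, resolve_index("dc.Title", PREFIXES) gives "http://purl.org/dc/elements/1.1/Title", while an undefined scope such as "agls" raises KeyError, which is the point at which a server-side or registry lookup would have to take over.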

Suggested Queries

dc.Title = "Power and Fame"
dc.Title = ("Power" AND "Fame")
dc.Contributor = "LOC" AND dc.Subject = "Standards"
dc.Contributor = "LOC" AND agls.Identifier = "xyzzy"
bib1.Author, dc.Contributor = "Smith"

Other Suggestions

Queries are Unicode text (UTF-8, etc).
All text to be searched always inside quotes. This allows new reserved words to be added later without breaking old queries.
Reserved words upper case? (But not for "EHE".)
Fields to be searched identified by a two-part identifier where the first part identifies the scope for the second part. E.g., dc.title.