A Gentle Introduction to CQL

16th September 2003

$Id: intro.html,v 1.30 2010-06-17 15:03:44 mike Exp $

1. Introduction: what CQL is
2. Simple queries
3. Booleans
        3.1. Proximity
4. Indexes
        4.1. Index-Set Mapping
        4.2. Index-Sets
5. Relations
        5.1. Relation Modifiers
6. Pattern Matching
        6.1. Word Anchoring
7. Bit by bit, putting it together

1. Introduction: what CQL is

CQL stands for Common Query Language. It is a formal language for representing queries to Information Retrieval systems such as web indexes, bibliographic catalogues and museum collection information. It is being developed by the Z39.50 Maintenance Agency as part of its ZING initiative (``Z39.50-International: Next Generation'').

Traditionally, query languages have fallen into two camps:

Powerful, expressive but unforgiving languages such as SQL, Index Data's cryptic PQF (Prefix Query Format) and the XML Query being worked on by the W3C.
Simple, intuitive but less powerful languages such as CCL, the Common Command Language and google.com's query language.

CQL's goal is to combine the simplicity and intuitiveness of google searching with the expressive power of the Z39.50 Type-1 query. Just as the Unix shells allow users to begin with very simple commands, and work their way up to arbitrarily complex expressions, so CQL is intended to ``do what you mean'' for simple, everyday queries, while also providing the means to express more complex concepts when necessary.

The formal definition of CQL is on the ZING site, at www.loc.gov/standards/sru/specs/cql.html

This document provides a more gently-paced tutorial approach to learning about CQL.

2. Simple queries

   fish
   dinosaur
   comp.sources.misc
   "dinosaur"
   "complete dinosaur"
   "the complete dinosaur"
   "ext->u.generic"
   "and"

The simplest CQL queries of all are unqualified single terms. Several possible terms are listed above. Terms which do not contain ``special characters'' and which are not CQL keywords need not be quoted (although they may be); but terms containing any of the following characters must be quoted so that the parser knows to treat them as single terms:

[space] (separates words of a CQL expression)
= (the equality relation)
< (an inequality relation)
> (an inequality relation)
/ (introduces a relation- or proximity-modifier)
( (introduces a parenthesised sub-search)
) (ends a parenthesised sub-search)

So in the examples above, comp.sources.misc need not be quoted, since . is not a special character; but ext->u.generic does need the quotes, since > is a special character.^[1]

Also, keywords such as and, all, etc. may have their special meanings suppressed by enclosing them in quotes.
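
The quoting rule above is mechanical enough to sketch in a few lines of Python. This is only an illustration, not part of CQL: the function name quote_term is made up, the keyword list is assembled from the keywords this tutorial mentions, and treating keywords case-insensitively and the double-quote as a character that forces quoting are both assumptions.

```python
# Characters listed in this tutorial as requiring quotes. The double-quote
# itself is added as an assumption, since it delimits quoted terms.
SPECIAL_CHARS = set(' =<>/()"')

# Keywords mentioned in this tutorial (an assumed list, matched
# case-insensitively, which is also an assumption).
KEYWORDS = {"and", "or", "not", "prox", "any", "all", "exact"}


def quote_term(term):
    """Return the term, wrapped in double quotes only when that is needed."""
    needs_quotes = (
        term == ""
        or term.lower() in KEYWORDS
        or any(ch in SPECIAL_CHARS for ch in term)
    )
    if not needs_quotes:
        return term
    # Escape embedded double quotes; backslashes and wildcards are left alone.
    return '"' + term.replace('"', '\\"') + '"'


for term in ["comp.sources.misc", "ext->u.generic", "and"]:
    print(quote_term(term))
```

Quoting a term that does not need it is harmless, so a cautious application could simply quote everything; the sketch only shows where the line falls.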

In general, multi-word terms are interpreted as requesting records in which a single field contains all the specified words, in the specified order, with no other words in between. This is a proximity search. But see the section below on relations for exceptions.

Some characters, when they occur in a search term, are ``wildcards'', which may stand for one or more other characters. See the section below on pattern matching.

3. Booleans

   dinosaur or bird
   dinosaur not reptile
   dinosaur and bird and reptile
   dinosaur and bird or dinobird
   (bird or dinosaur) and (feathers or scales)
   "feathered dinosaur" and (yixian or jehol)
   (((a and b) or (c not d) not (e or f and g)) and h not i) or j

Queries may be joined together using the three boolean operators, and, or and not. The last of these is a binary operator, finding records which contain ``this but not that''. So, for example, dinosaur not reptile finds records which contain the word ``dinosaur'' but not the word ``reptile''. I do not plan to insult your intelligence by explaining what and and or mean :-)
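
If it helps, the three operators can be pictured as set operations on the records that match each operand. The sketch below assumes a made-up table of record numbers for each word (no real server is obliged to work this way); the point is only that not behaves as set difference, not as a unary negation.

```python
# Hypothetical data: for each word, the ids of the records containing it.
postings = {
    "dinosaur": {1, 2, 3, 4},
    "bird": {3, 4, 5},
    "reptile": {2, 4, 6},
}


def records(word):
    """Ids of the records containing the word (empty set if unknown)."""
    return postings.get(word, set())


dinosaur_or_bird = records("dinosaur") | records("bird")         # union
dinosaur_and_bird = records("dinosaur") & records("bird")        # intersection
dinosaur_not_reptile = records("dinosaur") - records("reptile")  # difference

print(sorted(dinosaur_not_reptile))
```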

The queries either side of a boolean operator are known as operands, and may be arbitrarily complex. In particular, they may themselves be boolean combinations. All the boolean operators have the same precedence, and they associate from left to right. That means, for example, that the searches

   foo and bar or baz
   foo or bar and baz

mean

   (foo and bar) or baz
   (foo or bar) and baz

rather than

   foo and (bar or baz)
   foo or (bar and baz)

- not because and binds tighter than or or vice versa (they have the same priority) but because the leftmost pair is considered first.
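
The left-to-right rule can be made concrete with a small Python sketch that adds the implied parentheses to a flat query. The function name parenthesise is invented for this illustration, and it assumes the input is already split into alternating terms and operators with no parentheses of its own.

```python
def parenthesise(tokens):
    """Add the implied parentheses to a flat list: term, operator, term, ..."""
    if len(tokens) % 2 == 0:
        raise ValueError("expected alternating terms and operators")
    result = tokens[0]
    for i in range(1, len(tokens), 2):
        operator, operand = tokens[i], tokens[i + 1]
        # Everything gathered so far becomes the left operand of the next
        # operator, whichever operator that happens to be.
        if i > 1:
            result = "(" + result + ")"
        result = result + " " + operator + " " + operand
    return result


print(parenthesise("foo and bar or baz".split()))
print(parenthesise("foo or bar and baz".split()))
```

Note that the loop never looks at which operator it is handling: that is exactly what ``same precedence'' means.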

Sometimes that's not what you want. In that case, you can override the default interpretation by parenthesising sub-expressions. You're welcome to supply redundant parentheses if that helps you express your query more clearly. For example, the following queries are all equivalent:

   foo and (bar or baz)
   (foo) and (bar or baz)
   (foo) and ((bar) or (baz))
   (((((foo))) and (((((((((((((bar)))))))))))) or (baz))))

3.1. Proximity

WARNING. Proximity operators are an advanced topic. You may prefer to skip ahead to indexes, and come back and read this section later.

   complete prox dinosaur
   (caudal or dorsal) prox vertebra
   ribs prox//5 chevrons
   ribs prox//0/sentence chevrons
   ribs prox/>/0/paragraph chevrons

Apart from the more familiar and, or and not, CQL supports a fourth boolean operator, proximity. This is a special kind of ``and'' which requires its operands to occur close to each other. The precedence of proximity operators is the same as that of the more familiar booleans; and like the others, it associates to the left.

The simplest example above, complete prox dinosaur, finds records in which the two words are next to each other, in either order. It's nearly equivalent to the single multi-word term "complete dinosaur", except that it also finds dinosaur complete.

In the general case, the proximity operator's semantics are affected by four parameters:

relation indicates whether the operands should be separated by an exact distance, a distance greater than or equal to that specified, etc. It defaults to <= if not specified explicitly.
distance specifies the number of units (words, sentences, etc.) that may separate the operands, with zero indicating the same unit and one indicating adjacent units. The default is 1 if the unit is word, 0 otherwise.
unit specifies whether the proximity condition is about how many words, sentences, paragraphs or elements separate the operands. It defaults to word.
ordering may be ordered to specify that the operands must occur in the record in the same order as they do in the query, or unordered if they may occur in either order (which is the default).

That's a lot of information to pack into an operator. Here's how it's done: the operator is generally of the form

prox/relation/distance/unit/ordering

but any or all of the parameters may be omitted if the default values are required. Further, any trailing part of the operator consisting entirely of slashes (because the defaults are used) is omitted - so the following proximity operators are all equivalent:

prox/<=/1/word/unordered
prox/<=/1/word
prox/<=/1
prox/<=
prox

Here, then, are some examples of proximity searches and what they mean:

foo prox bar: The words foo and bar immediately adjacent to each other, in either order.
foo prox///sentence bar: The words foo and bar occurring anywhere in the same sentence. (Recall that the default distance is zero when the unit is not word.)
foo prox//3/element bar: The words must occur within three elements of each other: for example, if a record contains a list of authors, and author number 4 contains foo and author number 7 contains bar then this search will find that record.
foo prox/=/2/paragraph bar: The words must appear exactly two paragraphs apart: it is not good enough for them to appear in the same paragraph or in adjacent paragraphs.
foo prox/>/4/word/ordered bar: Finds records in which the words appear, with foo first, followed more than four words later by bar, in that order.
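
The defaulting rules described above can be sketched in Python as follows. This is a minimal illustration rather than a real CQL parser: the function name parse_prox and the dictionary it returns are inventions of this sketch, and it does no validation of the individual parameter values.

```python
def parse_prox(operator):
    """Split prox/relation/distance/unit/ordering and fill in the defaults."""
    parts = operator.split("/")
    if parts[0] != "prox" or len(parts) > 5:
        raise ValueError("not a proximity operator: " + operator)
    # Omitted trailing parameters behave like empty ones.
    parts += [""] * (5 - len(parts))
    _, relation, distance, unit, ordering = parts
    unit = unit or "word"
    if distance:
        distance = int(distance)
    else:
        # The default distance depends on the unit.
        distance = 1 if unit == "word" else 0
    return {
        "relation": relation or "<=",
        "distance": distance,
        "unit": unit,
        "ordering": ordering or "unordered",
    }


print(parse_prox("prox///sentence"))
print(parse_prox("prox/>/4/word/ordered"))
```

Because the unit has to be settled before the default distance can be chosen, the sketch resolves the parameters in that order rather than strictly left to right.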

NOTE. This is a complex and somewhat esoteric set of specifications. While CQL parsers should recognise and correctly interpret all of these proximity operators, there is no guarantee that servers will be able to honour all proximity-searching requests.

4. Indexes

   title=dinosaur
   title = dinosaur
   title = ((dinosaur and bird) or dinobird)
   dc.title = saurischia
   bath.title="the complete dinosaur"
   cql.serverChoice=foo
   cql.resultSetId=bar

All the queries discussed so far are targeted at whole records - a search for darwin will find both The Origin of Species and a biography of Darwin, since he is the author of the first and the subject of the second. Sometimes we need to be more specific, and limit a search to a particular field of the records we're interested in. In CQL, we do this using indexes.

An index is generally attached to its search-term with an equals sign (=), although see the section below on relations for the full story. Indexes indicate what part of the records is to be searched - in implementation terms, they frequently specify which index is to be inspected in the database. For example, an index of author typically indicates a search for the names of authors.

A fully-specified index is of the form indexset.name in which indexset is the name of an index-set, and name is the name of an index within this set. This convention enables multiple organisations to develop the indexes needed in their application domains without the possibility of name collisions. For example, both the bibliographic and heraldry communities might wish to use a title index, but they would have very different meanings. You can specify which one you mean by searching for either

bib.title = "Zen and the Art of Motorcycle Maintenance"

heraldry.title=viscount

When no index-set is specified (i.e. the index name does not include a period), the index is interpreted as being in the default index-set, whatever that may be in the context of the query. For example, a taxonomy application might define that the default index-set is taxa, so that the search family=brachiosauridae is interpreted as taxa.family=brachiosauridae rather than, for example, genealogy.family=brachiosauridae.
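
In code, the defaulting rule amounts to very little. The Python sketch below uses an invented function name, qualify, and takes the default index-set as a parameter, since which set is the default is up to the application.

```python
def qualify(index, default_set):
    """Return the fully-specified form (indexset.name) of an index."""
    if "." in index:
        return index  # an index-set was given explicitly
    return default_set + "." + index


print(qualify("family", "taxa"))
print(qualify("genealogy.family", "taxa"))
```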

4.1. Index-Set Mapping

WARNING. Index-set mapping is an advanced topic. You may prefer to skip ahead to relations, and come back and read this section later. Or indeed, never bother reading this. Most CQL users will never need to.

What exactly are the meanings of indexes? We have an idea that bath.title is some kind of bibliographic title search, but does it include journal titles as well as book titles? Does it include searching for words that occur only in subtitles? And should searches against the dc.subject index be simple word-searches, or use a controlled subject-vocabulary? And if the latter, which subject vocabulary? LCSH, the Library of Congress Subject Headings? MeSH, the Medical Subject Headings? Or some Dublin Core-specific vocabulary?

The only reliable way to find answers to such questions is by reference to the index-set definitions; but how can we tell which index-set we're dealing with? The short names we've been using - dc, bath, etc. - are not rigorous: the Deep Custard working group might also want to define a dc index-set containing the custardDepth and flavouring indexes, and the plumbing-supplies community might put together a bath set, containing indexes such as maximumWaterDepth, capacity and enamelColour.

The solution to this problem is that each index-set is assigned a truly unique identifier. These identifiers typically look like URIs, and the community that defines a set must choose a URI in a space which it owns. For example, the plumbing-supplies community may already own the domain-name plumbing-r-us.com, and so might choose the URI http://plumbing-r-us.com/cql/bath for its bath index-set. Since each Internet domain is owned by only one entity, there is no possibility of clashes.

(As with XML namespace URIs, index-set URIs need not actually point at anything. They are identifiers, not addresses. However, also as with XML namespaces, it is often convenient to put something in the pointed-to location - usually a definition of the index-set semantics, or at least a pointer to where those semantics are specified.)

The problem, then, becomes one of how to establish the association between a nice, easy-to-type index-set name like dc and an ugly but rigorous identifier like http://www.loc.gov/srw/index-sets/dc. In many contexts, the meanings of CQL index-set names will be fixed and immutable: for example, many applications will hardwire the definition of the dc prefix to the identifier of the Dublin Core index set. But the CQL language itself provides the means to establish the meanings of prefixes where necessary.

A prefix may be established by the > token, followed by a prefix name, an equals sign and the URI to which it is to be mapped, which will typically need to be quoted. The prefix thus established applies across the query that follows it: for example, in the following query:

   >dc="http://www.loc.gov/srw/index-sets/dc"
   	dc.title=dinosaur and dc.author=farlow

the meaning of the dc prefix on both of the search terms' indexes is governed by the mapping.

Note that the conventional prefix names are just that - conventions only. For example, the following query uses Dublin Core indexes, and is identical in meaning to the one above:

   (>x="http://www.loc.gov/srw/index-sets/dc"
   	x.title=dinosaur) and
   (>aVerySillyLongPrefix="http://www.loc.gov/srw/index-sets/dc"
   	aVerySillyLongPrefix.author=farlow)

Further, the default index-set may be established for a query by omitting the name= part of a prefix definition, so that the following query is equivalent to both of those above:

   >"http://www.loc.gov/srw/index-sets/dc"
   	title=dinosaur and author=farlow
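
To see why the queries above mean the same thing, it may help to picture what a CQL processor does with the mappings. The Python sketch below is an assumption about one reasonable way to do it, not a description of any particular implementation: the function name resolve and its parameters are invented here, and the prefix mappings are passed in as a plain dictionary rather than parsed from the query.

```python
DC = "http://www.loc.gov/srw/index-sets/dc"


def resolve(index, prefixes, default_uri=None):
    """Map an index such as 'dc.title' to an (index-set URI, name) pair."""
    prefix, dot, name = index.partition(".")
    if not dot:
        # No prefix: fall back on the default index-set, if one is known.
        if default_uri is None:
            raise ValueError("no default index-set for " + index)
        return (default_uri, index)
    if prefix not in prefixes:
        raise KeyError("unknown index-set prefix: " + prefix)
    return (prefixes[prefix], name)


print(resolve("dc.title", {"dc": DC}))
print(resolve("x.title", {"x": DC}))
print(resolve("title", {}, default_uri=DC))
```

All three calls give the same pair, which is the sense in which the prefix name itself carries no meaning.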

Finally, we must address the question of what defaults apply when prefixes and a default index-set are not established. The answer is that this is not defined as a part of CQL itself: the answers are given by the individual CQL application. For example, a bibliographic application may make the Bath index-set the default, and establish the index-set names bath and dc with conventional meanings, rejecting indexes with any other prefix. That's the reason why most CQL users will never need to bother reading the section that you, heroically, have just reached the end of :-)

4.2. Index-Sets

The procedure for creating a new index-set is described on the official ZING site, along with a list of some of the better-known extant index-sets: see www.loc.gov/standards/sru/resources/context-sets.html

At the time of writing, four index-sets are defined, with more sure to follow:

cql

This special index-set is defined as a part of the SRU protocol, under the auspices of which CQL was developed. It provides special indexes such as:

serverChoice - the server is allowed to choose which indexes to use in fulfilling the search. Servers are recommended to use their broadest searches for this. An unqualified CQL search term (such as fish) is equivalent to using this index (as in cql.serverChoice=fish).
resultSetId - indicates that the term is not text to be searched for, but the name that was assigned to a previous search, and that that search is to be incorporated into this one.
For example, if you search for taphonomy or sedimentology and get 3,456 hits, you may wish to narrow your search. If your application has given that search the name foo, you can restrict it to records concerning coelurosaurs by searching for cql.resultSetId=foo and coelurosauria.

See www.loc.gov/standards/sru/resources/cql-context-set-v1-2.html

dc

The Dublin Core index-set provides a set of fifteen indexes, including all the usual suspects - author, title, subject - to be used in cross-domain searching. See www.loc.gov/standards/sru/resources/dc-context-set.html

bath

The Bath index-set provides indexes enabling CQL to express the searches described by the Bath profile for bibliographic searching using Z39.50. See zing.z3950.org/srw/bath/2.0/#2

zthes

The Zthes index-set provides indexes for searching and navigating in hierarchical thesauri, as described by the Zthes profile for Z39.50. See zthes.z3950.org/cql/1.0

5. Relations

   year > 1998
   title all "complete dinosaur"
   title any "dinosaur bird reptile"
   title exact "the complete dinosaur"

We said above that an index is generally attached to its search-term with an equals sign. The equals sign is the relation that associates the term with its index.

CQL also supports a variety of other relations.

For numeric indexes, the obvious ordered relations may be used: so, for example, you can search for all the following:

   publicationYear < 1980
   numberOfWheels <= 3
   numberOfPlates = 18
   lengthOfFemur > 2.4
   bioMass >= 100
   numberOfToes <> 3

(The last one is a not-equals search, which finds animals with any number of toes other than three.)
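
For readers who think in code, the six symbolic relations line up one-for-one with the comparison functions in Python's standard operator module. The table and the function name satisfies below are just an illustration of that correspondence, assuming the indexed values are plain numbers.

```python
import operator

# Each symbolic CQL relation paired with the matching Python comparison.
RELATIONS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<>": operator.ne,
}


def satisfies(value, relation, term):
    """True if a record's value stands in the given relation to the term."""
    return RELATIONS[relation](value, term)


print(satisfies(4, "<>", 3))
print(satisfies(1975, "<", 1980))
```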

For word indexes, two more relations are supported:

any: The search succeeds if one or more of the words in the term can be found in the record. So, for example, title any "ocean sea lake" is a convenient shorthand for title="ocean" or title="sea" or title="lake".
all: The search succeeds if every one of the words in the term can be found in the record. So, for example, title all "old man sea" is a convenient shorthand for title="old" and title="man" and title="sea".
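
The shorthand can be spelled out mechanically. The Python sketch below expands an any or all search into the longer boolean form given above; the function name expand is invented, and it assumes the term contains only plain words separated by spaces.

```python
def expand(index, relation, term):
    """Rewrite an any/all search as the boolean query it abbreviates."""
    joiner = {"any": " or ", "all": " and "}[relation]
    return joiner.join('{}="{}"'.format(index, word) for word in term.split())


print(expand("title", "any", "ocean sea lake"))
print(expand("title", "all", "old man sea"))
```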

Finally, there is the exact relation. A query like title exact "the complete dinosaur" indicates a string search rather than a word search. It succeeds only for records whose title field consists exactly of the characters ``the complete dinosaur''.^[2] In particular, unlike the more common title = "the complete dinosaur", the exact version of this query will not find titles which begin with this string but have more words following.

exact searches are most useful on codified fields such as ISBN codes, telephone numbers, etc.

The keyword relations (any, all and exact) must be separated from their index and search-term by whitespace, so that the parser can recognise them. For the symbolic relations (<, <=, >, >=, = and <>), whitespace is optional: all of the following are equivalent:

   title=dinosaur
   title =dinosaur
   title= dinosaur
   title = dinosaur
   title     =            dinosaur

Careful readers will notice that you can't specify a relation without also supplying an index. If you find yourself wanting to do this, then just use the special cql.serverChoice index, like this: cql.serverChoice all "old man sea"

5.1. Relation Modifiers

   title all/stem "complete dinosaur"
   title any / relevant "dinosaur bird reptile"
   title exact/fuzzy "the complete dinosaur"
   author = /fuzzy tailor

Relations may have their behaviour modified by relation modifiers to indicate that special procedures are required in matching the search term against record data. Modifiers are separated from their relations by a forward slash (/). The recognised modifiers are:

stem

The words in the search term are stemmed before being matched against stemmed versions of those in the records: for example, walked, walking, walker etc. would all be chopped down to the stem word walk. This allows a search like title =/stem "these completed dinosaurs" to match everybody's favourite book, The Complete Dinosaur.

The stemming algorithm is implementation-dependent.

relevant

Indicates that the words in the search-term must be in some sense relevant to those in the records being searched. For example, the search subject any/relevant "fish frog" would find records whose subject field included any of the words shark, tuna, coelacanth, toad, amphibian, etc.

The relevance-matching algorithm is implementation-dependent.

fuzzy

A catch-all modifier indicating that the server can apply some form of ``fuzzy matching'' between the specified search-term and its records. This may be useful for badly-spelled search terms. For example, author all/fuzzy "kernaghan richie" might find Kernighan & Ritchie's The C Programming Language.

The fuzzy-matching algorithm is implementation-dependent.

phonetic

Indicates that the server should try to match the term not only against words that are spelled the same but also those that sound the same. For example, subject =/phonetic rose might match the words rows, rhos (more than one Greek letter) and roes (portions of fish eggs).

The phonetic matching algorithm is implementation-dependent.

The modified relation exact/fuzzy appears strange, but in fact has useful applications: for example, consider the case where you can more or less remember your favourite restaurant's telephone number, but you might have got a digit wrong, or got two of them in the wrong order or something:

   telephoneNumber exact/fuzzy "0208 346 6797"

6. Pattern Matching

   dinosaur*
   *sauria
   man?raptor
   man?raptor*
   "the comp*saur"
   char\*

As we mentioned above, certain characters have special meaning when they appear in a search term. These characters are known as wildcards, and can be used to match unspecified characters as follows:

?

Matches any single character, so that c?t will match any of the words cat, cot or cut, but not coat or indeed ct. Multiple adjacent ?s match the appropriate number of characters in the obvious way, so that c??t matches cart, cent, coat, etc., but not cat or crypt.

*

Matches any sequence of zero or more characters, so that c*t will match any of the words cat, coat, crypt and counterargument.

The ? and * characters may occur anywhere in a search term - at the beginning, in the middle or at the end; and they may be arbitrarily mixed, like this: ?in?s*r

^

The word-anchoring character is discussed in its own section below.

Because neither ? nor * is a ``special character'', terms containing them need not be quoted (although, of course, you're welcome to quote them anyway if you like).

The special meaning may be removed from wildcard characters by preceding them with a backslash (\). A backslash may also be used to escape a literal double-quote mark in a search term, like this: \"hello\". To include a literal backslash in a term, just precede it with a backslash (of course!) like this: \\.
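
One way to get a feel for the ? and * wildcards and their escaping is to translate a term into a regular expression. The Python sketch below does this for a single word; it is an illustration only (servers are free to implement pattern matching however they like), the function names are invented, and the ^ anchor is not handled here at all.

```python
import re


def wildcard_to_regex(term):
    """Translate a term containing ? and * wildcards into a compiled regex."""
    pieces = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term):
            # A backslash makes the next character literal.
            pieces.append(re.escape(term[i + 1]))
            i += 2
            continue
        if ch == "?":
            pieces.append(".")   # exactly one character
        elif ch == "*":
            pieces.append(".*")  # zero or more characters
        else:
            pieces.append(re.escape(ch))
        i += 1
    return re.compile("".join(pieces))


def matches(term, word):
    """True if the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None


print(matches("c?t", "cot"), matches("c?t", "coat"))
print(matches("char\\*", "char*"), matches("char\\*", "charles"))
```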

6.1. Word Anchoring

   title="^the complete dinosaur"
   author="bakker^"
   author all "^kernighan ritchie"
   author any "^kernighan ^ritchie ^thompson"

The final wildcard character, ^, is used for word anchoring - that is, indicating the position that a searched-for word must have in its field. A word beginning with ^ must be the first in its field; a word ending with ^ must be the last in its field; and so a word both beginning and ending with ^ must be the only one in its field. The ^ character may not occur in the middle of a word, nor on its own (not as a part of a word).

The special meaning of ^ may be removed, as with other wildcard characters, by preceding it with a backslash (\).

When used with the = operator, the word-anchor must appear at the beginning or end of the whole phrase (or both), and indicates that the whole phrase is anchored to the beginning or end of its field (or both).

When used with the any and all relations, ^ applies to each word individually, so that the search title any "^birds ^reptiles ^dinosaurs" will find books whose titles start with any of the three words.
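
The per-word behaviour with any can be sketched in Python like this. It is a simplified illustration under several assumptions: fields are split into words on whitespace, matching ignores case, no other wildcards or backslash escapes are present, and the function names and the sample titles are chosen just for this example.

```python
def word_matches(term_word, field_words):
    """Check one possibly-anchored query word against a field's word list."""
    at_start = term_word.startswith("^")
    at_end = term_word.endswith("^")
    word = term_word.strip("^")
    if not word:
        raise ValueError("^ may not occur on its own")
    if not field_words:
        return False
    if at_start and at_end:
        return field_words == [word]   # must be the only word in the field
    if at_start:
        return field_words[0] == word  # must be the first word
    if at_end:
        return field_words[-1] == word # must be the last word
    return word in field_words


def title_any(term, title):
    """Emulate: title any "<term>", with each word anchored individually."""
    field_words = title.lower().split()
    return any(word_matches(w, field_words) for w in term.lower().split())


print(title_any("^birds ^reptiles ^dinosaurs", "Dinosaurs of the Air"))
print(title_any("^birds ^reptiles ^dinosaurs", "The Complete Dinosaur"))
```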

Word anchoring may not be used with the ordering relations (it would be meaningless). Neither may it be used with the exact relation, since that does not operate on words at all, but on whole strings. (Note also that exact search-terms are in any case whole-field searches, so anchoring is not needed anyway.)

7. Bit by bit, putting it together

   dc.author=(kern* or ritchie) and
	(bath.title exact "the c programming language" or
	 dc.title=elements prox//4 dc.title=programming) and
	subject any/relevant "style design analysis"

That's all there is. Glue the bits together how you wish - for example, the search above finds records whose author (in the cross-domain sense) includes either a word beginning kern or the word ritchie, and which have either the exact title (in the sense of the Bath profile) the c programming language or a title containing the words elements and programming not more than four words apart, and whose subject is relevant to one or more of the words style, design or analysis.

Notes


[1]: If not quoted, the ``search term'' ext->u.generic would be interpreted as a search for records for which the index called ext- contained a value lexically greater than u.generic.
[2]: In practice, the ``exactly the same characters'' condition may be slightly relaxed: servers' string indexes may normalise upper- and lower-case, whitespace sequences, etc.

Feedback to <mike@indexdata.com> is welcome!