$Id: intro.html,v 1.30 2010-06-17 15:03:44 mike Exp $
CQL stands for Common Query Language. It is a formal language for representing queries to Information Retrieval systems such as web indexes, bibliographic catalogues and museum collection information. It is being developed by the Z39.50 Maintenance Agency as part of its ZING initiative (``Z39.50-International: Next Generation'').
Traditionally, query languages have fallen into two camps:
CQL's goal is to combine the simplicity and intuitiveness of Google searching with the expressive power of the Z39.50 Type-1 query. Just as the Unix shells allow users to begin with very simple commands, and work their way up to arbitrarily complex expressions, so CQL is intended to ``do what you mean'' for simple, everyday queries, while also providing the means to express more complex concepts when necessary.
The formal definition of CQL is on the ZING site, at www.loc.gov/standards/sru/specs/cql.html
This document provides a more gently-paced tutorial approach to learning about CQL.
fish
dinosaur
comp.sources.misc
"dinosaur"
"complete dinosaur"
"the complete dinosaur"
"ext->u.generic"
"and"
The simplest CQL queries of all are unqualified single terms. Several possible terms are listed above. Terms which do not contain ``special characters'' and which are not CQL keywords need not be quoted (although they may be); but terms containing any of the following characters must be quoted so that the parser knows to treat them as single terms:
So in the examples above, comp.sources.misc need not be quoted, since . is not a special character; but ext->u.generic does need the quotes, since > is a special character.[1]
Also, keywords such as and, all, etc. may have their special meanings suppressed by enclosing them in quotes.
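The quoting rule can be sketched in Python. This is a minimal sketch, not part of CQL itself: the function name quote_term is hypothetical, and both the set of special characters (< > = / ( ) plus whitespace and the double quote) and the keyword list are assumptions taken from the CQL specification rather than from this tutorial. The input is assumed to be raw text in which double quotes have not already been escaped.

```python
# Characters assumed to force quoting (from the CQL specification, not from
# this tutorial): < > = / ( ) and the double quote. Whitespace is checked
# separately below.
SPECIAL_CHARS = set('<>=/()"')

# Keywords assumed to be reserved; illustrative, not exhaustive.
KEYWORDS = {"and", "or", "not", "prox", "any", "all", "exact"}


def quote_term(term: str) -> str:
    """Return the term as it could be written in a CQL query."""
    needs_quotes = (
        term == ""
        or term.lower() in KEYWORDS
        or any(ch in SPECIAL_CHARS or ch.isspace() for ch in term)
    )
    if not needs_quotes:
        return term
    # Embedded double quotes are backslash-escaped; nothing else is changed.
    return '"' + term.replace('"', '\\"') + '"'


for example in ["fish", "comp.sources.misc", "ext->u.generic", "and"]:
    print(quote_term(example))
```

Under these assumptions, fish and comp.sources.misc come back unchanged, while ext->u.generic and the keyword and are wrapped in double quotes.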
In general, multi-word terms are interpreted as requesting records in which a single field contains all the specified words, in the specified order, with no other words in between. This is a proximity search. But see the section below on relations for exceptions.
Some characters, when they occur in a search term, are ``wildcards'', which may stand for one or more other characters. See the section below on pattern matching.
dinosaur or bird
dinosaur not reptile
dinosaur and bird and reptile
dinosaur and bird or dinobird
(bird or dinosaur) and (feathers or scales)
"feathered dinosaur" and (yixian or jehol)
(((a and b) or (c not d) not (e or f and g)) and h not i) or j
Queries may be joined together using the three boolean operators, and, or and not. The last of these is a binary operator, finding records which contain ``this but not that''. So, for example, dinosaur not reptile finds records which contain the word ``dinosaur'' but not the word ``reptile''. I do not plan to insult your intelligence by explaining what and and or mean :-)
The queries either side of a boolean operator are known as operands, and may be arbitrarily complex. In particular, they may themselves be boolean combinations. All the boolean operators have the same precedence, and they associate from left to right. That means, for example, that the searches
foo and bar or baz
foo or bar and baz
mean
(foo and bar) or baz
(foo or bar) and baz
rather than
foo and (bar or baz)
foo or (bar and baz)
- not because and binds tighter than or or vice versa (they have the same priority) but because the leftmost pair is considered first.
Sometimes that's not what you want. In that case, you can override the default interpretation by parenthesising sub-expressions. You're welcome to supply redundant parentheses if that helps you express your query more clearly. For example, the following queries are all equivalent:
foo and (bar or baz)
(foo) and (bar or baz)
(foo) and ((bar) or (baz))
(((((foo))) and (((((((((((((bar)))))))))))) or (baz))))
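The left-to-right rule described in this section can be made concrete with a short Python sketch. It is illustrative only: the function name group_left is hypothetical, and the sketch assumes a flat query made of unquoted single-word terms separated by boolean operators, with no parentheses of its own.

```python
BOOLEANS = {"and", "or", "not", "prox"}


def group_left(query: str) -> str:
    """Add the parentheses implied by left-to-right association."""
    tokens = query.split()
    # A flat query has the shape: term (operator term)*
    if len(tokens) % 2 == 0:
        raise ValueError("expected: term (operator term)*")
    grouped = tokens[0]
    for i in range(1, len(tokens), 2):
        operator, term = tokens[i], tokens[i + 1]
        if operator.lower() not in BOOLEANS:
            raise ValueError(f"not a boolean operator: {operator}")
        # Everything read so far becomes the left operand of this operator.
        if i > 1:
            grouped = f"({grouped})"
        grouped = f"{grouped} {operator} {term}"
    return grouped


print(group_left("foo and bar or baz"))   # (foo and bar) or baz
print(group_left("foo or bar and baz"))   # (foo or bar) and baz
```

Because every operator has the same precedence, the sketch never needs to compare operators: it only wraps whatever has been read so far.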
WARNING. Proximity operators are an advanced topic. You may prefer to skip ahead to indexes, and come back and read this section later.
complete prox dinosaur
(caudal or dorsal) prox vertebra
ribs prox//5 chevrons
ribs prox//0/sentence chevrons
ribs prox/>/0/paragraph chevrons
Apart from the more familiar and, or and not, CQL supports a fourth boolean operator, proximity. This is a special kind of ``and'' which requires its operands to occur close to each other. The precedence of proximity operators is the same as that of the more familiar booleans; and like the others, it associates to the left.
The simplest example above, complete prox dinosaur finds records in which the two words are next to each other, in either order. It's nearly equivalent to the single multi-word term "complete dinosaur", except that it also finds dinosaur complete.
In the general case, the proximity operator's semantics are affected by four parameters:
That's a lot of information to pack into an operator. Here's how it's done: the operator is generally of the form
prox/relation/distance/unit/ordering
but any or all of the parameters may be omitted if the default values are required. Further, any trailing part of the operator consisting entirely of slashes (because the defaults are used) is omitted - so the following proximity operators are all equivalent:
Here, then, are some examples of proximity searches and what they mean:
NOTE. This is a complex and somewhat esoteric set of specifications. While CQL parsers should recognise and correctly interpret all of these proximity operators, there is no guarantee that servers will be able to honour all proximity-searching requests.
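The way the proximity operator is assembled from its parameters, with trailing slashes dropped, can be sketched in Python. The function name prox_operator is hypothetical, and an empty string is used here to mean ``use the default''; the sketch does not check that the parameter values themselves are legal.

```python
from typing import Union


def prox_operator(relation: str = "", distance: Union[int, str] = "",
                  unit: str = "", ordering: str = "") -> str:
    """Build prox/relation/distance/unit/ordering without trailing slashes."""
    parts = [relation, str(distance), unit, ordering]
    # Trailing defaulted parameters would leave only slashes, so drop them.
    while parts and parts[-1] == "":
        parts.pop()
    return "/".join(["prox"] + parts)


print(prox_operator())                              # prox
print(prox_operator(distance=5))                    # prox//5
print(prox_operator(distance=0, unit="sentence"))   # prox//0/sentence
print(prox_operator(">", 0, "paragraph"))           # prox/>/0/paragraph
```

A defaulted parameter in the middle still needs its slash, which is why a distance of 5 with the default relation is written prox//5.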
title=dinosaur
title = dinosaur
title = ((dinosaur and bird) or dinobird)
dc.title = saurischia
bath.title="the complete dinosaur"
cql.serverChoice=foo
cql.resultSet=bar
All the queries discussed so far are targeted at whole records - a search for darwin will find both The Origin of Species and a biography of Darwin, since he is the author of the first and the subject of the second. Sometimes we need to be more specific, and limit a search to a particular field of the records we're interested in. In CQL, we do this using indexes.
An index is generally attached to its search-term with an equals sign (=), although see the section below on relations for the full story. Indexes indicate what part of the records is to be searched - in implementation terms, they frequently specify which index is to be inspected in the database. For example, an index of author typically indicates a search for the names of authors.
A fully-specified index is of the form indexset.name in which indexset is the name of an index-set, and name is the name of an index within this set. This convention enables multiple organisations to develop the indexes needed in their application domains without the possibility of name collisions. For example, both the bibliographic and heraldry communities might wish to use a title index, but they would have very different meanings. You can specify which one you mean by searching for either
bib.title = "Zen and the Art of Motorcycle Maintenance"
or
heraldry.title=viscount
When no index-set is specified (i.e. the index name does not include a period), the index is interpreted as being in the default index-set, whatever that may be in the context of the query. For example, a taxonomy application might define that the default index-set is taxa, so that the search family=brachiosauridae is interpreted as taxa.family=brachiosauridae rather than, for example, genealogy.family=brachiosauridae.
WARNING. Index-set mapping is an advanced topic. You may prefer to skip ahead to relations, and come back and read this section later. Or indeed, never bother reading this. Most CQL users will never need to.
What exactly are the meanings of indexes? We have an idea that bath.title is some kind of bibliographic title search, but does it include journal titles as well as book titles? Does it include searching for words that occur only in subtitles? And should searches against the dc.subject index be simple word-searches, or use a controlled subject-vocabulary? And if the latter, which subject vocabulary? LCSH, the Library of Congress Subject Headings? MeSH, the Medical Subject Headings? Or some Dublin Core-specific vocabulary?
The only reliable way to find answers to such questions is by reference to the index-set definitions; but how can we tell which index-set we're dealing with? The short names we've been using - dc, bath, etc. - are not rigorous: the Deep Custard working group might also want to define a dc index-set containing the custardDepth and flavouring indexes, and the plumbing-supplies community might put together a bath set, containing indexes such as maximumWaterDepth, capacity and enamelColour.
The solution to this problem is that each index-set is assigned a truly unique identifier. These identifiers typically look like URIs, and the community that defines a set must choose a URI in a space which it owns. For example, the plumbing-supplies community may already own the domain-name plumbing-r-us.com, and so might choose the URI http://plumbing-r-us.com/cql/bath for its bath index-set. Since each Internet domain is owned by only one entity, there is no possibility of clashes.
(As with XML namespace URIs, index-set URIs need not actually point at anything. They are identifiers, not addresses. However, also as with XML namespaces, it is often convenient to put something in the pointed-to location - usually a definition of the index-set semantics, or at least a pointer to where those semantics are specified.)
The problem, then, becomes one of how to establish the association between a nice, easy-to-type index-set name like dc and an ugly but rigorous identifier like http://www.loc.gov/srw/index-sets/dc. In many contexts, the meanings of CQL index-set names will be fixed and immutable: for example, many applications will hardwire the definition of the dc prefix to the identifier of the Dublin Core index set. But the CQL language itself provides the means to establish the meanings of prefixes where necessary.
A prefix may be established by the > token, followed by a prefix name, an equals sign and the URI to which it is to be mapped, which will typically need to be quoted. The prefix thus established applies across the query that follows it: for example, in the following query:
>dc="http://www.loc.gov/srw/index-sets/dc" dc.title=dinosaur and dc.author=farlow
the meaning of the dc prefix on both of the search terms' indexes is governed by the mapping.
Note that the conventional prefix names are just that - conventions only. For example, the following query uses Dublin Core indexes, and is identical in meaning to the one above:
(>x="http://www.loc.gov/srw/index-sets/dc" x.title=dinosaur) and (>aVerySillyLongPrefix="http://www.loc.gov/srw/index-sets/dc" aVerySillyLongPrefix.author=farlow)
Further, the default index-set may be established for a query by omitting the name= part of a prefix definition, so that the following query is equivalent to both of those above:
>"http://www.loc.gov/srw/index-sets/dc" title=dinosaur and author=farlow
Finally, we must address the question of what defaults apply when prefixes and a default index-set are not established. The answer is that this is not defined as a part of CQL itself: the answers are given by the individual CQL application. For example, a bibliographic application may make the Bath index-set the default, and establish the index-set names bath and dc with conventional meanings, rejecting indexes with any other prefix. That's the reason why most CQL users will never need to bother reading the section that you, heroically, have just reached the end of :-)
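The relationship between prefixes, the default index-set and index-set URIs can be sketched in Python. This is only an illustration of the lookup, not a CQL parser: the function name resolve_index and the use of a dictionary for the prefix mappings are assumptions made for the example, and the mappings are taken as already extracted from the query or hardwired by the application.

```python
from typing import Dict, Optional, Tuple

# The Dublin Core index-set URI used in the example queries above.
DC_URI = "http://www.loc.gov/srw/index-sets/dc"


def resolve_index(index: str, prefixes: Dict[str, str],
                  default_uri: Optional[str] = None) -> Tuple[str, str]:
    """Map an index such as dc.title to (index-set URI, index name)."""
    prefix, dot, name = index.partition(".")
    if not dot:
        # No period in the name: the index is in the default index-set.
        if default_uri is None:
            raise LookupError(f"no default index-set for {index!r}")
        return default_uri, index
    if prefix not in prefixes:
        raise LookupError(f"unknown index-set prefix {prefix!r}")
    return prefixes[prefix], name


# All three spellings identify the same index.
print(resolve_index("dc.title", {"dc": DC_URI}))
print(resolve_index("x.title", {"x": DC_URI}))
print(resolve_index("title", {}, default_uri=DC_URI))
```

Each call returns the same pair, which is the sense in which the three example queries above are identical in meaning: the prefix is only a local name for the URI.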
The procedure for creating a new index-set is described on the official ZING site, along with a list of some of the better-known extant index-sets: see www.loc.gov/standards/sru/resources/context-sets.html
At the time of writing, four index-sets are defined, with more sure to follow:
For example, if you search for taphonomy or sedimentology and get 3,456 hits, you may wish to narrow your search. If your application has given that search the name foo, you can restrict it to records concerning coelurosaurs by searching for cql.resultSet=foo and coelurosauria.
See www.loc.gov/standards/sru/resources/cql-context-set-v1-2.html
year > 1998
title all "complete dinosaur"
title any "dinosaur bird reptile"
title exact "the complete dinosaur"
We said above that an index is generally attached to its search-term with an equals sign. The equals sign is the relation that associates the term with its index.
CQL also supports a variety of other relations.
For numeric indexes, the obvious ordered relations may be used: so, for example, you can search for all the following:
publicationYear < 1980
numberOfWheels <= 3
numberOfPlates = 18
lengthOfFemur > 2.4
bioMass >= 100
numberOfToes <> 3
(The last one is a not-equals search, which finds animals with any number of toes other than three.)
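As a rough illustration of what a server might do with the ordered relations, here is a Python sketch that maps each symbol to a comparison. The names ORDERED_RELATIONS and satisfies are hypothetical, and real servers will usually evaluate these relations inside their own database rather than record by record.

```python
import operator

# Each symbolic relation paired with the comparison it denotes.
ORDERED_RELATIONS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<>": operator.ne,
}


def satisfies(value: float, relation: str, operand: float) -> bool:
    """Test one field value against a clause such as numberOfToes <> 3."""
    try:
        compare = ORDERED_RELATIONS[relation]
    except KeyError:
        raise ValueError(f"unknown relation: {relation!r}") from None
    return compare(value, operand)


print(satisfies(4, "<>", 3))        # True: four toes is not three
print(satisfies(1975, "<", 1980))   # True
print(satisfies(2.4, ">", 2.4))     # False
```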
For word indexes, two more relations are supported:
Finally, there is the exact relation. A query like title exact "the complete dinosaur" indicates a string search rather than a word search. It succeeds only for records whose title field consists exactly of the characters ``the complete dinosaur''.[2] In particular, unlike the more common title = "the complete dinosaur", the exact version of this query will not find titles which begin with this string but have more words following.
exact searches are most useful on codified fields such as ISBN codes, telephone numbers, etc.
The keyword relations (any, all and exact) must be separated from their index and search-term by whitespace, so that the parser can recognise them. For the symbolic relations (<, <=, >, >=, = and <>), whitespace is optional: all of the following are equivalent:
title=dinosaur
title =dinosaur
title= dinosaur
title = dinosaur
Careful readers will notice that you can't specify a relation without also supplying an index. If you find yourself wanting to do this, then just use the special cql.serverChoice index, like this: cql.serverChoice all "old man sea"
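The optional whitespace around symbolic relations can be illustrated with a small Python sketch that splits a simple clause into index, relation and term. The names CLAUSE and split_clause are hypothetical, and the sketch deliberately handles only the symbolic relations: keyword relations, relation modifiers and boolean combinations are out of its scope.

```python
import re

# index, optional whitespace, a symbolic relation, optional whitespace, term.
# Two-character relations are listed first so that <= is not read as <.
CLAUSE = re.compile(r"^\s*([^\s<>=]+)\s*(<=|>=|<>|<|>|=)\s*(.+?)\s*$")


def split_clause(clause: str):
    """Split a clause such as title=dinosaur into (index, relation, term)."""
    match = CLAUSE.match(clause)
    if match is None:
        raise ValueError(f"not a simple symbolic-relation clause: {clause!r}")
    return match.groups()


variants = ["title=dinosaur", "title =dinosaur",
            "title= dinosaur", "title = dinosaur"]
print({split_clause(v) for v in variants})   # a set with a single element
```

All four spellings produce the same triple, ('title', '=', 'dinosaur'), whereas a clause such as title any dinosaur is rejected because the keyword relation depends on the surrounding whitespace to be recognised.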
title all/stem "complete dinosaur"
title any / relevant "dinosaur bird reptile"
title exact/fuzzy "the complete dinosaur"
author = /fuzzy tailor
Relations may have their behaviour modified by relation modifiers to indicate that special procedures are required in matching the search term against record data. Modifiers are separated from their relations by a forward slash (/). The recognised modifiers are:
The stemming algorithm is implementation-dependent.
The relevance-matching algorithm is implementation-dependent.
The fuzzy-matching algorithm is implementation-dependent.
The phonetic matching algorithm is implementation-dependent.
The modified relation exact/fuzzy appears strange, but in fact has useful applications: for example, consider the case where you can more or less remember your favourite restaurant's telephone number, but you might have got a digit wrong, or got two of them in the wrong order or something:
telephoneNumber exact/fuzzy "0208 346 6797"
dinosaur*
*sauria
man?raptor
man?raptor*
"the comp*saur"
char\*
As we mentioned above, certain characters have special meaning when they appear in a search term. These characters are known as wildcards, and can be used to match unspecified characters as follows:
The ? and * characters may occur anywhere in a search term - at the beginning, in the middle or at the end; and they may be arbitrarily mixed, like this: ?in?s*r
Because neither ? nor * is a ``special character'', terms containing them need not be quoted (although, of course, you're welcome to quote them anyway if you like).
The special meaning may be removed from wildcard characters by preceding them with a backslash (\). A backslash may also be used to escape a literal double-quote mark in a search term, like this: \"hello\". To include a literal backslash in a term, just precede it with a backslash (of course!) like this: \\.
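One way to see how the wildcards and the backslash escape interact is to translate a CQL term into a regular expression. The Python sketch below does this under stated assumptions: following the CQL specification, * is taken to match any run of zero or more characters and ? exactly one character; the function name wildcard_match is hypothetical; and the anchoring character ^ is not handled.

```python
import re


def wildcard_match(pattern: str, text: str) -> bool:
    """Test whether text matches a CQL term containing * and ? wildcards."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # A backslash makes the following character literal.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "*":
            parts.append(".*")   # assumed: zero or more characters
        elif ch == "?":
            parts.append(".")    # assumed: exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.fullmatch("".join(parts), text, re.DOTALL) is not None


print(wildcard_match("man?raptor", "maniraptor"))   # True
print(wildcard_match("?in?s*r", "dinosaur"))        # True
print(wildcard_match("char\\*", "chars"))           # False: the * is literal
```

Note that the Python literal "char\\*" is the six-character CQL term char\*, which matches only the text char* itself.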
title="^the complete dinosaur"
author="bakker^"
author all "^kernighan ritchie"
author any "^kernighan ^ritchie ^thompson"
The final wildcard character, ^, is used for word anchoring - that is, indicating the position that a searched-for word must have in its field. A word beginning with ^ must be the first in its field; a word ending with ^ must be the last in its field; and so a word both beginning and ending with ^ must be the only one in its field. The ^ character may not occur in the middle of a word, nor on its own (not as a part of a word).
The special meaning of ^ may be removed, as with other wildcard characters, by preceding it with a backslash (\).
When used with the = relation, the word-anchor must appear at the beginning or end of the whole phrase (or both), and indicates that the whole phrase is anchored to the beginning or end of its field (or both).
When used with the any and all relations, ^ applies to each word individually, so that the search title any "^birds ^reptiles ^dinosaurs" will find books whose titles start with any of the three words.
Word anchoring may not be used with the ordering relations (it would be meaningless). Neither may it be used with the exact relation, since that does not operate on words at all, but on whole strings. (Note also that exact search-terms are in any case whole-field searches, so anchoring is not needed anyway.)
dc.author=(kern* or ritchie) and (bath.title exact "the c programming language" or dc.title=elements prox//4 dc.title=programming) and subject any/relevant "style design analysis"
That's all there is. Glue the bits together how you wish - for example, the search above finds records whose author (in the cross-domain sense) includes either a word beginning kern or the word ritchie, and which have either the exact title (in the sense of the Bath profile) the c programming language or a title containing the words elements and programming not more than four words apart, and whose subject is relevant to one or more of the words style, design or analysis.
Notes