Search Engine Overview

The concept of a search engine may not be new to the casual user of the Internet, but beyond the use of keywords to find pages or newsgroup postings of interest there is some complexity. That complexity, or technicality, is found in the Boolean test. For a programmer, this technicality is taken as a matter of fact; for the casual user, however, it may appear intimidating, especially in my presentation of it with added features. Some users may have used various engines and found them natural in their application of the common concepts of AND, OR, and NOT. These concepts are natural expressions in everyday usage, used hundreds of times a day in casual discussion or writing.

Keywords: A word or group of words you find significant. That is, they are key (important) to describing what you want to find (and thus keywords are important words). If you are looking for recommendations of where to stay in the Caribbean, you would probably begin by searching for postings that include such keywords as Caribbean (certainly that) or, more specifically, St. John if you wanted to know about that city in Antigua. However, the seasoned traveler to the Caribbean would already be aware that there is more than one city named St. John in the Caribbean. Thus a keyword-based search would return many references, both to that city in Antigua and to cities of the same name on other islands. You could restrict the number of references returned by enlarging your list of keywords to include both Antigua and St. John. This would have the effect of looking for all references that contain either BOTH or at least ONE of those words, depending upon the search engine being used. In a sense you have just enlarged your search to include a Boolean test. Nearly every engine will tell you in the small print just which distinction is being made, but you have to look. That is, some engines will return those references that contain Antigua AND St. John, while other engines will return those references that contain Antigua OR St. John. The first will be a shorter list of references than the second. You may of course find either useful, but you may as easily be disappointed.

Boolean test: A group of words linked by reserved words. Those reserved words are the already mentioned AND, OR, and NOT. These reserved words are commonly made distinct from your keywords by being entirely capitalized (as shown here and in the last sentences of the paragraph above). They are called reserved words because you are forbidden to use them as keywords. That is, these reserved words are not part of any engine's search for references of interest to you. A Boolean test is the means by which the engine determines whether a particular article or posting it is reading for you satisfies your interest. That is, if that article or posting contains: St. John AND Antigua, then the engine saves it for your consideration. On the other hand, all articles or postings that do not contain: St. John AND Antigua are ignored, and the engine moves on to read another in its search to satisfy your query. Thus the engine would, in this case of a Boolean test, restrict the number of references returned to you on the basis of what is called your query (which is also called, here, the Boolean test). In these engines you are perfectly free to pose a query that instead searches for St. John OR Antigua. In this case you will get more references, each containing at least one keyword and possibly both.
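The difference between the AND, OR, and NOT tests can be sketched in a few lines of Python. This is only an illustration, not the engine's own code: the function names and the sample postings are invented for the example, and a keyword is assumed to match by a simple case-insensitive comparison.

```python
# Illustrative sketch only; names and sample data are hypothetical.

def contains(text, keyword):
    """True if the keyword (or phrase) appears in the text, ignoring case."""
    return keyword.lower() in text.lower()

def match_and(text, keywords):
    """AND: every keyword must be present."""
    return all(contains(text, k) for k in keywords)

def match_or(text, keywords):
    """OR: at least one keyword must be present."""
    return any(contains(text, k) for k in keywords)

postings = [
    "We stayed in St. John on Antigua last spring.",
    "St. John in the Virgin Islands has great beaches.",
    "Antigua has a beach for every day of the year.",
    "Barbados was lovely in March.",
]

keywords = ["St. John", "Antigua"]
and_hits = [p for p in postings if match_and(p, keywords)]
or_hits = [p for p in postings if match_or(p, keywords)]
# NOT: keep postings with the first keyword but without the second.
not_hits = [p for p in postings
            if contains(p, "St. John") and not contains(p, "Antigua")]

print(len(and_hits), len(or_hits), len(not_hits))  # 1 3 1
```

The AND list is the shortest and the OR list the longest, which is the distinction described above.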

Query: This is another term for Boolean test, but for the purpose of this discussion it is also allowed to go beyond that form of testing with enhancements. A query is simply your interest stated as a logical question. We have moved beyond the simple keyword search to include the logic and some new commands. It is not always enough to ask for any reference that contains: St. John AND Antigua, because some articles or postings could have both words, yet separated by text and meaning. That is, the article may be talking about St. John the island, not the city, with a mention of Antigua in a remote paragraph. A useful command such as NEAR will allow you to force the engine to limit the proximity of these two keywords. There is still a risk that the engine will not know you mean the city on that island, but with the restriction imposed by this NEAR command you will get closer to your desired goal of having your query satisfied. Thus you would pose the query: St. John NEAR Antigua.

Command: We have already touched on one, NEAR. My engine uses this and two others that allow you to explicitly state the scope or range you wish to test over. The two other commands you will find in my engine are PRECEDES and FOLLOWS. Each of these commands takes a number that sets the range of testing. Thus the query:

'St. John' NEAR 10 'Antigua'

forces the engine to consider only those postings or articles that have these two keywords within 10 words of one another. NEAR does not distinguish whether they come in the order you wrote them or in reverse order. This is why my engine will also take the queries:

'St. John' PRECEDES 10 'Antigua'

'St. John' FOLLOWS 10 'Antigua'

and thus the engine is now position sensitive over the range of 10 (or any number you care to use) words. You will note that in these last three queries I have used the single quote mark to isolate the keywords. Marks used to isolate keywords in this way are called fences. Fences allow you to force the engine to consider not only one keyword, but many keywords as a phrase. For this set of commands, fencing is necessary to allow the engine to see the various elements of your query. This is a technical requirement of the engine, but it also helps you to see the distinctions yourself. That is, you can see that each of the queries above has four parts. It becomes even more significant if you were to pose the query:

'St. John Antigua' NEAR 25 'all inclusive resorts'

With this query you are searching for postings or articles that contain both phrases within a restricted range of each other.
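The three range commands can be pictured with a small Python sketch. It is not how the engine itself is written. It assumes that punctuation is ignored, that a single-quoted phrase matches when its words appear consecutively, and that the range is counted from the first word of one phrase to the first word of the other. The names words, phrase_positions, and proximity are made up for this example.

```python
import re

def words(text):
    """Lowercase word list; punctuation is dropped (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_positions(tokens, phrase):
    """Indexes at which the words of the phrase occur consecutively."""
    target = words(phrase)
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]

def proximity(text, left, command, distance, right):
    """Test 'left' COMMAND distance 'right' against one posting."""
    if command not in ("NEAR", "PRECEDES", "FOLLOWS"):
        raise ValueError("unknown command: " + command)
    tokens = words(text)
    for a in phrase_positions(tokens, left):
        for b in phrase_positions(tokens, right):
            gap = b - a  # positive when 'left' comes first
            if command == "NEAR" and 0 < abs(gap) <= distance:
                return True
            if command == "PRECEDES" and 0 < gap <= distance:
                return True
            if command == "FOLLOWS" and 0 < -gap <= distance:
                return True
    return False

posting = "We flew to St. John, the capital of Antigua, in May."
print(proximity(posting, "St. John", "NEAR", 10, "Antigua"))      # True
print(proximity(posting, "St. John", "PRECEDES", 10, "Antigua"))  # True
print(proximity(posting, "St. John", "FOLLOWS", 10, "Antigua"))   # False
```

NEAR accepts either order, while PRECEDES and FOLLOWS each accept only one, so a posting that passes NEAR will pass exactly one of the other two at the same range.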

Fences: These marks allow the engine to distinguish the various elements of your query. There are other fences that do more, serving logical grouping and other extensions of the engine's ability to search. The marks used as fences are:

"double-quote" For exact phrases where white space is important;

'single-quote' For simple phrases where only the words matter;

(parenthesis) For logical grouping;

[square-bracket] For command structures (not shown applied in sample commands above);

<angle-bracket> For Rule names.

Each of these fences allows you to develop sophisticated queries. The <angle-bracket> fences let you reuse queries that you have already written and find yourself repeating. When you place such a query between (parentheses) and name it inside <angle-bracket> fences, it is called a Rule.
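To show how fences let a program pick out the elements of a query, here is a short Python sketch. It is an assumption about how such splitting might be done, not a description of the engine: the pattern TOKEN and the function split_query are hypothetical, and the parentheses and square brackets are simply reported as single marks, since their full syntax is not covered in this overview.

```python
import re

# One alternative per kind of fence; anything else is a bare word.
TOKEN = re.compile(r"""
    "(?P<exact>[^"]*)"
  | '(?P<phrase>[^']*)'
  | <(?P<rule>[^<>]*)>
  | (?P<bracket>[()\[\]])
  | (?P<bare>[^\s"'<>()\[\]]+)
""", re.VERBOSE)

def split_query(query):
    """Return (kind, text) pairs for each element of the query."""
    return [(m.lastgroup, m.group(m.lastgroup)) for m in TOKEN.finditer(query)]

parts = split_query("'St. John' NEAR 10 'Antigua'")
for kind, text in parts:
    print(kind, "->", text)
# phrase -> St. John
# bare -> NEAR
# bare -> 10
# phrase -> Antigua
```

Without the fences, St. John would fall apart into two separate words and the query would no longer have four recognizable parts.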

Rules: These are queries that you name and store in a Rule Base. Once stored, they may be called from within your query. You may even use Rules to call other Rules. This level of complexity has the elements of a computer language, which allows you to do your own programming of a search.
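The idea of Rules calling other Rules can be sketched as simple text substitution. The Python below is an illustration under stated assumptions, not the engine's Rule Base format: the dictionary rule_base, the rule names, and the function expand are all invented here, and each stored query is wrapped in parentheses when it is substituted.

```python
import re

RULE_REF = re.compile(r"<([^<>]+)>")

def expand(query, rule_base, _active=()):
    """Replace every <name> with its stored query, recursively."""
    def substitute(match):
        name = match.group(1)
        if name in _active:
            # A Rule that ends up calling itself would never finish.
            raise ValueError(f"rule <{name}> refers to itself")
        body = rule_base[name]
        return "(" + expand(body, rule_base, _active + (name,)) + ")"
    return RULE_REF.sub(substitute, query)

# A hypothetical Rule Base: names mapped to stored queries.
rule_base = {
    "antigua town": "'St. John' NEAR 10 'Antigua'",
    "resort": "'all inclusive resorts' OR 'hotel'",
    "antigua stay": "<antigua town> AND <resort>",
}

expanded = expand("<antigua stay> AND 'beach'", rule_base)
print(expanded)
# (('St. John' NEAR 10 'Antigua') AND ('all inclusive resorts' OR 'hotel')) AND 'beach'
```

The check for a Rule that refers to itself is a design choice of this sketch; how the engine treats such a case is not stated here.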

As this is just an overview, further discussion of the fences above and of Rules and Rule Bases is left to the examples found in the engine and to your own experimentation when using it. There are also a number of differences between this discussion and the actual operation of the search engine. One example is that I use the symbols & for AND, | for OR and ! for NOT. Please consult the "instructions" tab of the search engine and review each panel for usage and definitions of what is called the "syntax" of the logic for the engine. The search engine includes a test area to allow you to develop your queries against known data. In this way you can judge the efficiency and thoroughness of your logic without having to do it blindly online. This test area will teach you more about the nature of logic and syntax than all the help buttons in Windows or on the Internet. All of the instructional material for the search engine is found in numerous text files loaded each time the search engine starts. All of the engine's default queries, Rule Bases, test area data and so on are also easily available. You may change these "default" files to suit your needs (there is a folder with the original text files so that you always have replacement copies). You may also open and save your own files anywhere else in your system (this is the basic reason I circulate this as an application rather than an applet).
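As a small aid for moving from the word forms used in this overview to the symbols & | and ! mentioned above, the following Python sketch rewrites one into the other. It is a hypothetical helper, not part of the engine, and it assumes that anything inside single or double quotes must be left untouched; the instructions tab remains the authority on the real syntax.

```python
import re

SYMBOLS = {"AND": "&", "OR": "|", "NOT": "!"}

# Group 1 catches a quoted span; group 2 catches a reserved word outside quotes.
PIECE = re.compile(r"""("[^"]*"|'[^']*')|\b(AND|OR|NOT)\b""")

def to_symbols(query):
    """Swap AND, OR, NOT for &, |, ! outside of quoted phrases."""
    return PIECE.sub(lambda m: m.group(1) or SYMBOLS[m.group(2)], query)

print(to_symbols("'St. John' AND 'Antigua'"))       # 'St. John' & 'Antigua'
print(to_symbols("'sand AND sea' OR NOT 'cruise'")) # 'sand AND sea' | ! 'cruise'
```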


R.W. Clark