Native Searches in the National Service Center for Environmental Publications Repository
To search effectively with a Metadata Fields search, or optionally with an Advanced search, in the NSCEP Repository, you need to understand how to control the behavior of the underlying search engine. The following sections explain how to craft searches that precisely target document content.
- Content Words and Phrases
- Wild Cards
- Boolean Operators
- Positional Operators
- Precedence and Parentheses
- Number Range Operator
- Quorum Operator
- Separators
- Fuzzy Searches
- Search Rules and Conventions
Content Words and Phrases
The simplest search statement contains a single content word or character string. For example, to retrieve all information in your files about Chicago, type the search statement:
chicago
directing the EPA Publications Web Server to retrieve every source document with the word chicago.
A content phrase consists of two or more content words appearing together, that is, without intervening operators such as AND or OR. The EPA Publications Web Server treats content phrases as one entity. The search statement:
chicago cubs
retrieves only those files with cubs immediately following chicago. A phrase can contain one or more noise words, for example:
Billy the Kid
The EPA Publications Web Server ignores the noise word, the, in this phrase. If you want to search for two words that do not form a phrase, connect them with a Boolean operator, either AND or OR, for example:
cleveland OR detroit
The EPA Publications Web Server will retrieve all documents that mention one or both cities. Refer to Boolean operators for additional information.
Wild Cards
Wild card symbols added to content words lend a great deal of flexibility to search statements. Use wild cards to search for prefix, root and suffix, and to find variations in spelling of a word. The EPA Publications Web Server uses two wild card symbols: ? and *.
Question mark ( ? ) replaces a single character, for example:
b?rn retrieves born and barn and burn.
?andy retrieves candy and dandy and sandy.
You can use more than one question mark in a word, for example:
sh??e retrieves shore and shade.
When you use ? the program retrieves only files containing words with exactly the same number of characters. For example, a search for 6060 without a wild card would not retrieve the zip code, 60607. A search for 60607 would retrieve only that zip code. Searching for 6060? would retrieve zip codes 60600 through 60609.
Asterisk ( * ) replaces zero or more characters, for example:
*vert retrieves convert and revert.
Use care when crafting search statements with multiple character wild cards to avoid results not related to the search topic. For example, to find information about automobiles, the search statement, auto*, would retrieve auto, automobile and automotive. It would also retrieve autobiography, autocracy and autograph. A more specific search statement would be auto OR automo*.
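The wild card behavior described above can be pictured with a short Python sketch. It is an illustration only, not the search engine's own code: the function names are hypothetical, and it assumes each wild card pattern must match a whole word.

```python
import re

def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate a search term containing ? and * into a regular expression."""
    parts = []
    for ch in term:
        if ch == "?":
            parts.append(".")       # exactly one character
        elif ch == "*":
            parts.append(".*")      # zero or more characters
        else:
            parts.append(re.escape(ch))
    # Case is ignored, as it is in the search engine.
    return re.compile("".join(parts), re.IGNORECASE)

def matching_words(term: str, words: list[str]) -> list[str]:
    """Return the words that the wild card term matches in full."""
    pattern = wildcard_to_regex(term)
    return [w for w in words if pattern.fullmatch(w)]

words = ["auto", "automobile", "autograph", "6060", "60607", "barn", "burn"]
print(matching_words("auto*", words))   # ['auto', 'automobile', 'autograph']
print(matching_words("6060?", words))   # ['60607']
print(matching_words("b?rn", words))    # ['barn', 'burn']
```

The first result shows why auto* alone is too broad for a search about automobiles: autograph matches as well.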
Boolean Operators
A search statement with only one content word retrieves every file containing that word. When you want to use more than one term in a search statement, insert operators between terms to indicate a relationship. The EPA Publications Web Server retrieves only files that meet the conditions of that relationship. You can use OR, AND and NOT.
• The Search Operator OR
OR instructs the program to retrieve files with at least one term from the search statement. OR enlarges the search topic; use it to look for terms that have similar meaning, or refer to similar subjects. The search statement:
car OR transportation retrieves all files with one or both terms: car or transportation. This search statement is more thorough and complete than if either word were used alone.
You can combine use of wild card characters with the OR operator in search statements containing content words with similar meaning for more complete results.
universit* - retrieves both university and universities
If the search topic is higher education in general, this is a better search statement:
college OR universit* OR higher education
• The Search Operator AND
The operator AND searches for files with terms found on both sides of AND in the search statement. While the operator OR broadens the search topic, AND narrows the topic. Use AND to connect terms with different meanings. Using this search statement:
new england AND north dakota - retrieved files contain at least one mention of each phrase. In this search statement:
conservation OR irrigation - retrieved files need contain only one term from the search statement, although they may contain both. AND searches for occurrences of terms on both sides of the operator; retrieved files must contain both.
• The Search Operator NOT
Use NOT to narrow the search topic. NOT stipulates that retrieved files must not contain the word immediately following NOT in the search statement. You can use NOT with AND or OR to form a single operator between two content words, for example:
bark AND NOT tree
You can use NOT alone when joining two content words, for example, ball not bat. In this example, NOT alone is equivalent to AND NOT.
To find all files with no mention of cars, use the search statement:
NOT cars
To locate information about cars but not used cars, use the search statement:
(cars) AND NOT used cars
Note: The order of content words in the search statement affects the result. Consider this statement:
used cars AND NOT cars
The program would retrieve no files, because every file referring to used cars also refers to cars.
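One way to reason about OR, AND and NOT is to treat each term as the set of files that contain it. The Python sketch below uses that model on three made-up files. The file names, the sample text and the helper functions are hypothetical, and the sketch does not handle noise words.

```python
import re

# Hypothetical sample files, used only for illustration.
FILES = {
    "a.txt": "New cars for sale",
    "b.txt": "Used cars at low prices",
    "c.txt": "Bicycle repair tips",
}

def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def files_with(phrase):
    """Return the set of file names in which the words of phrase appear in sequence."""
    wanted = tokens(phrase)
    hits = set()
    for name, text in FILES.items():
        words = tokens(text)
        for i in range(len(words) - len(wanted) + 1):
            if words[i:i + len(wanted)] == wanted:
                hits.add(name)
                break
    return hits

all_files = set(FILES)
cars = files_with("cars")
used_cars = files_with("used cars")

print(sorted(cars | files_with("bicycle")))  # OR is a union: all three files
print(sorted(cars & files_with("sale")))     # AND is an intersection: ['a.txt']
print(sorted(all_files - cars))              # NOT cars: ['c.txt']
print(sorted(cars - used_cars))              # cars AND NOT used cars: ['a.txt']
print(sorted(used_cars - cars))              # used cars AND NOT cars: []
```

The last line is empty for any collection, because every file that contains used cars also contains cars.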
Positional Operators
Positional Operators identify either a required proximity between content words, or a content word's proximity to other document elements.
• The WITHIN Operator: W/n
W/n limits the search to content words that appear within a defined range (n) of each other, in either direction. By contrast, AND, OR and NOT retrieve files if search statement terms appear anywhere in the same text file. Within n means that up to n-1 words can intervene. "n" can be any integer from 1 to 16,382. When you type the integer in a search statement, do not punctuate it with a comma as in the previous sentence.
When combining the W/n operator with other positional operators, the Within n relationship applies to adjacent components. Using the following as a search statement:
blue sky w/10 green grass w/10 clear water
in the retrieved text file, blue must be adjacent to sky; sky must be within 10 words of green; green must be adjacent to grass; grass must be within 10 words of clear; clear must be adjacent to water.
The WITHIN operator is especially useful when searching long documents. The search statement, lincoln AND illinois, retrieves a file even if Lincoln appears on page one and Illinois on page twenty. The search statement, lincoln W/10 illinois, requires that one word be within ten words of the other. This helps ensure that search terms are contextually related.
Example of W/n
Compare the following search statements for retrieving information from a company's internal sales reports:
client AND complaint
defines a broader search topic than:
client W/10 complaint
The AND operator retrieves any file with the term client if complaint is also present. When the operator W/10 replaces AND, the program retrieves only files that mention client within ten words of complaint.
Note: The position of content words connected by W/n does not affect search results. For example, 1983 W/8 tax* defines the same search topic as tax* W/8 1983.
A special use of W/n combines it with one of these positional separators: sentence (EOS), paragraph (EOP) or page (EOG) to carry out a search with this format: term1 W/n/sep term2; for example:
Minnesota W/3/EOP Maine AND fishing
This search statement would retrieve files that mention fishing, and where Minnesota appears within three paragraphs of Maine. In another example:
Supreme Court W/5/EOG civil rights
in retrieved files the phrase, Supreme Court would be within five pages of the phrase civil rights.
In this use of W/n, instead of counting individual words, the program counts lines or sentences or paragraphs or pages to meet the criterion represented by "n."
• The Precedes Operator: P/n
P/n works like W/n, with the added stipulation that the term preceding P/n in the search statement must also precede the other term, within n words, in any retrieved file. Using this search statement:
physical education P/100 fitness
The program retrieves files meeting two conditions:
- physical must be adjacent to education.
- education must precede fitness within 100 words.
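The W/n and P/n relationships can be pictured as comparisons of word positions. The following Python sketch is a simplified illustration, not the engine's implementation: the function names are hypothetical, it handles single words only (not phrases), and it assumes every word in the text, including noise words, counts toward n.

```python
import re

def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within(text, first, second, n, ordered=False):
    """True if first and second occur within n words of each other.

    A position difference of n means that n - 1 words intervene.
    With ordered=True, first must come before second, as with P/n.
    """
    words = tokens(text)
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    for i in first_positions:
        for j in second_positions:
            gap = j - i
            if ordered:
                ok = 0 < gap <= n
            else:
                ok = 0 < abs(gap) <= n
            if ok:
                return True
    return False

text = "The client called on Monday and later filed a formal complaint by mail."
print(within(text, "client", "complaint", 10))                # True
print(within(text, "client", "complaint", 5))                 # False
print(within(text, "complaint", "client", 10))                # True: order is ignored
print(within(text, "complaint", "client", 10, ordered=True))  # False: wrong order
```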
• The operator TO
Use TO to search for occurrences of a term falling between two other terms. In the following search statement:
sales TO product {results}
The program searches for occurrences of results falling between occurrences of sales and of product. This technique is similar to a proximity search, but much more powerful. It highlights only the term results in retrieved files. Sales and product are not objects of the search, except as delimiters of the range for locating the term results.
Precedence and Parentheses
When you use two or more operators in a search statement, the EPA Publications Web Server must give one operator precedence over the other to resolve the meaning of the statement. It evaluates a statement in an order determined by operator precedence, but you can always override normal order of evaluation by using parentheses, which have precedence higher than any operator.
The EPA Publications Web Server observes the following operator precedence from highest to lowest. Operators at the same level in the list are of equal precedence. The program evaluates them from left to right in the search expression:
- NOT
- OR
- W/n P/n
- AND
- TO
Why Use Parentheses?
Parentheses give you explicit control over the order of evaluation in complex search statements. When you use parentheses to group terms around operators, the EPA Publications Web Server interprets contents within parentheses as one unit. The use of parentheses is identical to that of algebra. We recommend that you always use parentheses when designing complex search statements (more than two operators). This helps ensure that searches function as expected.
Examples of Parentheses
To search for information discussing cars or synonyms for cars and also sales, use parentheses:
(cars OR car dealer) AND sales
First the EPA Publications Web Server searches for all files that contain at least one term within parentheses. Then from that group it selects only those files that also mention the other term.
You can use multiple sets of parentheses within one search statement:
(disk drive AND printer AND modem) OR (sales AND revenue AND profit)
The program retrieves files with all terms from at least one set of parentheses within the search statement. You can also nest parentheses, for example:
((cars AND trucks) OR trains) AND (ships OR submarines)
Note that AND is the primary operator. Only files that satisfy conditions on both sides of the statement are retrieved. If you had used OR as the primary operator, the program could retrieve files that satisfy conditions on only one side of the statement.
Examples of Precedence Ordering
Because OR has precedence over AND, the EPA Publications Web Server interprets the search statement:
chicago OR los angeles AND new york
to be the same as
(chicago OR los angeles) AND new york
and looks for files that mention either Chicago AND New York, or Los Angeles AND New York.
Parentheses can override precedence, for example:
chicago OR (los angeles AND new york)
Because parentheses have the highest precedence, the EPA Publications Web Server locates only files that mention Chicago, or both Los Angeles AND New York, or all three. It would not retrieve files that contain New York alone or Los Angeles alone.
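The difference between the two groupings can be checked with a small Python sketch that models each city as the set of files mentioning it. The document names are made up for illustration. Note that Python's own set operators follow the opposite convention (& binds tighter than |), so both groupings are written with explicit parentheses.

```python
# Hypothetical sets of files that mention each city.
chicago = {"doc1", "doc2"}
los_angeles = {"doc2", "doc3", "doc4"}
new_york = {"doc3", "doc5"}

# chicago OR los angeles AND new york, read with OR taking precedence.
default_reading = (chicago | los_angeles) & new_york

# chicago OR (los angeles AND new york), with parentheses overriding precedence.
with_parentheses = chicago | (los_angeles & new_york)

print(sorted(default_reading))   # ['doc3']
print(sorted(with_parentheses))  # ['doc1', 'doc2', 'doc3']
```

In this sample, doc5 mentions New York alone and doc4 mentions Los Angeles alone, so neither grouping retrieves them.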
Number Range Operator
You can search for numbers both as "terms," that is, alphanumeric character strings, and as numeric values. To locate a number as a term without regard to its value, enclose it in double quotes in the search statement, for example:
jones and "60615"
Use this search statement to retrieve letters to someone named Jones whose zip code is 60615.
If you omit the quotation marks, the program would search for the value, 60615, and all equivalent values, for example, 60615.00. When you use quotes, the search is limited to that enclosed character string.
You can use these math operators in number range searches:
- < less than
- < = less than or equal to
- = equal to
- < > not equal to
- > greater than
- > = greater than or equal to
The following are examples of number range search statements:
> = 65 w/10 social security
> 21 AND high school graduate
Use number range search statements to locate a value falling between two other values in the following format:
> or > = lower value : < or < = higher value
For example, the following search statement:
>1 : <10
would locate every number in the index meeting both conditions: greater than 1 and less than 10, whether integer or decimal. Searches of this type take time to execute, because every number must be examined. If your document collection has more than a few thousand numbers, this kind of search takes too long, and may error out due to lack of system resources.
The search statement, < > 5, is treated as identical to NOT 5.
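The distinction between a number as a term and a number as a value, and the range statement >1 : <10, can be illustrated with the Python sketch below. It is a rough model under stated assumptions: the function names and sample text are hypothetical, and the simple pattern used here recognizes only unsigned integers and decimals.

```python
import re

# Assumed number format for this sketch: digits with an optional decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numbers_in(text):
    """Return every number in the text as a numeric value."""
    return [float(m) for m in NUMBER.findall(text)]

def in_range(text, low, high):
    """Numbers strictly between low and high, like the statement >low : <high."""
    return [x for x in numbers_in(text) if low < x < high]

text = "Samples 3, 7.5 and 12 were taken; sample 1 was discarded and 10 remained."
print(in_range(text, 1, 10))   # [3.0, 7.5]

# As values, 60615 and 60615.00 are equal; as quoted terms, they differ.
print(float("60615") == float("60615.00"))   # True
print("60615" == "60615.00")                 # False
```

Every number in the text has to be converted and compared, which is why range searches are slow on collections that contain many numbers.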
Quorum Operator
The quorum operator searches for a specified number of terms within a search statement from one to all in the following format:
n of {term, term, .....}
where "term" is a single character string or a phrase. With the following search statement:
3 of {history, english, social studies, geography, humanities, psychology}
you could search a collection of resumes to locate applicants prepared to teach in a certain number of fields from a range of options.
When n = 1, the program converts the expression within brackets to a series of content words joined by OR, and retrieves a text file, even if it contains only one term from the search statement, for example:
1 of {mechanical drawing, drafting, prototype design, modeling}
When n equals the number of terms within brackets, the expression is converted to a series of ANDs, and a text file is retrieved only if it contains all terms from the search statement, for example:
3 of {word proc*, desktop pub*, spreadsheet}
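The quorum rule amounts to counting how many of the listed terms a file contains. The Python sketch below illustrates this, including the two special cases where n is 1 (equivalent to OR) and where n equals the number of terms (equivalent to AND). The function name and sample resume are hypothetical, and the sketch does not support wild cards inside terms.

```python
import re

def quorum(n, terms, text):
    """True if at least n of the terms occur in the text as whole words or phrases."""
    found = 0
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, re.IGNORECASE):
            found += 1
    return found >= n

terms = ["history", "english", "social studies",
         "geography", "humanities", "psychology"]
resume = "Taught history and geography for five years; minor in psychology."

print(quorum(3, terms, resume))            # True: three terms are present
print(quorum(4, terms, resume))            # False
print(quorum(1, terms, resume))            # True: same as joining the terms with OR
print(quorum(len(terms), terms, resume))   # False: same as joining the terms with AND
```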
Separators
The EPA Publications Web Server recognizes these separators:
- EOG: end of page
- EOL: end of line
- EOP: end of paragraph
- EOS: end of sentence
They limit a search to a physically defined range of a text file. In this sense, they are similar to proximity search statements. Separators are very useful when combined with the TO operator. For example, use the search statement:
experience TO EOP {(driver or chauffeur) and >= 3}
to locate resumes of persons with a minimum of three years' experience as a driver.
To locate a single paragraph that includes two terms, use a search similar to this:
EOP TO EOP {economic and policy}
Note: If you want to search for any of the separators as text strings, enclose them in quotes, for example, "EOG". If you do not do this, the search results will contain every file that has the End Of Page marker, which is, of course, every file.
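The effect of a statement such as EOP TO EOP {economic and policy} can be pictured as testing each paragraph separately. The Python sketch below is an illustration only. It assumes that paragraphs are separated by blank lines, which may not be how the engine detects the End Of Paragraph marker, and the function name and sample text are hypothetical.

```python
import re

def paragraphs_with_all(text, terms):
    """Return the paragraphs that contain every term."""
    # Assumption: a blank line marks the end of a paragraph.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    hits = []
    for paragraph in paragraphs:
        words = set(re.findall(r"[a-z0-9]+", paragraph.lower()))
        if all(term.lower() in words for term in terms):
            hits.append(paragraph)
    return hits

text = (
    "The economic outlook improved this year.\n\n"
    "New policy was announced on Tuesday.\n\n"
    "Economic growth depends on sound policy."
)
print(paragraphs_with_all(text, ["economic", "policy"]))
# ['Economic growth depends on sound policy.']
```

A plain economic AND policy search would retrieve the whole file on the strength of the first two paragraphs alone; limiting the range to one paragraph requires both terms to appear together.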
Fuzzy Searches
A fuzzy search can locate all occurrences of a word, plus all other words that are "close" in spelling to the original word. You specify the degree of closeness to the original word.
Examples of Fuzzy
Think of fuzzy search in terms of how similar one word is to another. To change one word into another, you can add, delete and replace single characters. A single degree is one change of one character. For example:
To change "commuter" into "computer" requires one replacement: the second "m" with "p." One degree.
To change "computw" into "computer" requires one replacement and one addition: replace "w" with "e" and add "r." Two degrees.
To change "coinputer" into "computer" requires one replacement and one deletion: replace "i" with "m," and delete "n." Two degrees.
The higher the degree, the greater the margin of error; the lower the degree, the less leeway is allowed in matching a search term with words in your files.
Degree of Fuzzy
Degree of Fuzzy ranges from 1 to 4 by default. We recommend that you set Degree to 2 for searching normal text. This provides for mistakes that occur in scanned text because of broken and joined characters. If you need to search for long words, set Degree to 3 or 4.
An additional constraint takes into account the length of the word you are searching for, to prevent the retrieval of too many irrelevant shorter words. This constraint limits the degree for a specific word to be the lesser of the Fuzzy Degree setting and 0.5 times the word's length. For example if you set Fuzzy Degree to 4 and the search term is six characters long, the actual Degree of Fuzzy will be 0.5 X 6 = 3 rather than 4.
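The degree counts in the examples above correspond to what is commonly called edit distance (Levenshtein distance). The Python sketch below computes that distance and applies the length constraint. The source does not state which algorithm the engine uses, so treat this as an illustrative model with hypothetical function names.

```python
def edit_distance(a, b):
    """Minimum number of single-character additions, deletions and replacements."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # add cb
                previous[j - 1] + (ca != cb),  # replace ca with cb, or keep if equal
            ))
        previous = current
    return previous[-1]

def effective_degree(setting, word):
    """The lesser of the Fuzzy Degree setting and 0.5 times the word's length."""
    return min(setting, 0.5 * len(word))

def fuzzy_match(term, candidate, setting=2):
    """True if candidate is within the effective degree of the search term."""
    return edit_distance(term, candidate) <= effective_degree(setting, term)

print(edit_distance("commuter", "computer"))    # 1
print(edit_distance("computw", "computer"))     # 2
print(edit_distance("coinputer", "computer"))   # 2
print(effective_degree(4, "search"))            # 3.0
print(fuzzy_match("computer", "coinputer"))     # True at the recommended Degree of 2
```

Because the distance is always a whole number, comparing it against 0.5 times an odd word length behaves the same as rounding that limit down.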
Search Rules and Conventions
- With the exception of NOT, place operators only between search terms, and never at the beginning or end of a search statement.
- Use NOT in conjunction with a single content word, for example: NOT car
NOT may never appear at the end of a search statement. You may also use NOT with a phrase in parentheses, for example: NOT (new york)
- With the exception of NOT, two operators cannot appear in sequence in a search statement. You can use the NOT operator with AND and OR, that is, AND NOT and OR NOT.
- Because all operators are noise words, you cannot use them as content words in search statements. For example, the search statement, and OR or, will not be accepted.
- The EPA Publications Web Server is not case sensitive; it regards uppercase and lowercase letters as identical. We show operators in upper case for emphasis and clarity.
- An operator can appear more than once in a search statement.
- The W/n operator must include an integer in the range 1 to 16,382, followed by a space and a content word. Omit the comma when you type the integer.
- You can use one term to retrieve both the hyphenated and non-hyphenated spellings of a term; for example, the search term:
- database retrieves database and data-base, but not data base
- data-base retrieves data-base, but not database or data base
- data base retrieves data-base and data base, but not database.
When a multi-syllable word begins near the end of a line, a word processor may force hyphenation. The EPA Publications Web Server can find such a word in either its hyphenated or non-hyphenated form. As a side effect of this capability, searches with duplicate words in series also find single occurrences of that word; for example, the search statement, sing sing, would find single occurrences of sing as well as the phrase, sing sing. The program recognizes words with normally appearing hyphens, for example, Winston-Salem.
- The EPA Publications Web Server recognizes all printable characters in the ASCII character set.
- The EPA Publications Web Server ignores a sentence-ending period and other trailing punctuation marks, when a space or a carriage return follows. The program recognizes periods when followed by a character, as in I.B.M. or in 292.004. It treats apostrophes as null characters, and ignores them.