UNIX Power Tools

Chapter 27: Searching Through Files

27.8 glimpse and agrep

Glimpse is an indexing and query system that lets you search huge amounts of text (for example, all of your files) very quickly. For example, if you're looking for the word something, just type glimpse something; all matching lines will appear with the filename at the start.

Before you use glimpse, you need to index your files by running glimpseindex. You'll probably want to run it every night from cron (40.12). Because the index is only as current as the last glimpseindex run, your searches will miss files that have been added since then. But, other than that problem (which can't be avoided in an indexed system like this), glimpse is fantastic - especially because it's (usually) so fast.

The speed depends on the size of the index file you build: a bigger index makes the searches faster. But even with the smallest index file, I can search my entire 70-megabyte email archive, on a fairly slow workstation, in less than 30 seconds. With faster CPUs and disks, the search could be much quicker. One weakness is a search pattern that could match many files: such a search can take a long time, so glimpse will print a warning and ask if you want to continue. (After glimpse checks its index for possible matches, it runs agrep on the possibly matching files to confirm the matches and get the exactly matching records.)
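
The two-stage idea in the parenthetical above (consult an index first, then scan only the candidate files) can be sketched in a few lines of Python. This is only an illustration of the principle: the in-memory `files` dictionary, the word-to-filename index, and the function names are assumptions made for the example, not glimpse's actual index format or code.

```python
import re
from collections import defaultdict

def build_index(files):
    """Map each lowercase word to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index

def search(index, files, word):
    """Stage 1: ask the index for candidate files.
    Stage 2: scan only those files for the matching lines."""
    word = word.lower()
    hits = []
    for name in sorted(index.get(word, ())):
        for line in files[name].splitlines():
            if word in line.lower():
                hits.append(f"{name}: {line}")
    return hits

# Hypothetical file contents standing in for files on disk.
files = {
    "notes.txt": "buy milk\nsomething to remember",
    "todo.txt": "call home\nfix the bike",
}
index = build_index(files)
for hit in search(index, files, "something"):
    print(hit)
```

As with glimpse, a file added to `files` after `build_index` has run is invisible to `search` until the index is rebuilt.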

agrep is one of the nicer additions to the grep family. Not only is it one of the faster greps around, but it also has the unique feature that it will look for approximate matches. It's also record-oriented rather than line-oriented. Glimpse calls agrep, but you can also use agrep without using glimpse. The three most significant features of agrep that are not supported by the rest of the grep family are:

  1. The ability to search for approximate patterns, with a user-definable level of accuracy. For example,

    % agrep -2 homogenos foo

    will find "homogeneous" as well as any other word that can be obtained from "homogenos" with at most 2 substitutions, insertions, or deletions.

    % agrep -B homogenos foo

    asks for the best match without your having to specify the number of errors; it will generate a message of the form:

    best match has 2 errors, there are 5 matches, output them? (y/n)

  2. agrep is record-oriented rather than just line-oriented; a record is by default a line, but it can be user-defined with the -d option specifying a pattern that will be used as a record delimiter. For example,

    % agrep -d '^From ' 'pizza' mbox

    outputs all mail messages (1.33) (delimited by a line beginning with From and a space) in the file mbox that contain the keyword pizza. Another example:

    % agrep -d '$$' pattern foo

    will output all paragraphs (separated by an empty line) that contain pattern.

  3. agrep allows multiple patterns with AND (or OR) logic queries. For example,

    % agrep -d '^From ' 'burger,pizza' mbox

    outputs all mail messages containing at least one of the two keywords (, stands for OR).

    % agrep -d '^From ' 'good;pizza' mbox

    outputs all mail messages containing both keywords (; stands for AND).
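
To make the record and AND/OR ideas from items 2 and 3 concrete, here is a rough Python sketch. It is not how agrep is implemented: the helper names `records` and `matches` are invented for the example, the keywords are treated as plain substrings, and the sketch assumes a query uses only one of the two operators.

```python
import re

def records(text, delimiter):
    """Split text into records, each one starting where the delimiter matches."""
    starts = [m.start() for m in re.finditer(delimiter, text, re.MULTILINE)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first delimiter
    return [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]

def matches(record, query):
    """';' between keywords means AND; ',' means OR."""
    if ";" in query:
        return all(term in record for term in query.split(";"))
    return any(term in record for term in query.split(","))

# A made-up three-message mailbox.
mbox = (
    "From alice Mon\nLet's get pizza.\n"
    "From bob Tue\nThe burger was good.\n"
    "From carol Wed\ngood pizza tonight\n"
)
recs = records(mbox, r"^From ")
either = [r for r in recs if matches(r, "burger,pizza")]
both = [r for r in recs if matches(r, "good;pizza")]
print(len(either), len(both))
```

The point of the sketch is that the unit being tested and printed is the whole record, so a message matches `good;pizza` even when the two words are on different lines.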

Putting these options together one can write queries like:

% agrep -d '$$' -2 '<CACM>;TheAuthor;Curriculum;<198[5-9]>' bib

which outputs all paragraphs referencing articles in CACM between 1985 and 1989 by TheAuthor dealing with Curriculum. Two errors are allowed, but they cannot be in either CACM or the year. (The <> brackets forbid errors in the pattern between them.)

Other agrep features include searching for regular expressions (with or without errors); unlimited wildcards; limiting the errors to only insertions, only substitutions, or any combination; weighting the error types so that each deletion, for example, is counted as, say, 2 substitutions or 3 insertions; restricting parts of the query to be exact and parts to be approximate; and many more.
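
For readers who want to see what "at most 2 substitutions, insertions, or deletions" means in practice, the following Python sketch computes the cheapest way to match a pattern against any substring of a line, with an optional cost for each kind of error. It is a textbook dynamic-programming illustration, not agrep's own (much faster) method. The names `min_cost` and `approx_grep` are invented here, and the sketch assumes that an "insertion" is an extra character in the text and a "deletion" is a pattern character missing from the text.

```python
def min_cost(pattern, text, sub=1, ins=1, dele=1):
    """Lowest total error cost of matching pattern against any substring of text."""
    # prev[i] = cheapest cost of matching pattern[:i] against text seen so far,
    # with the match ending at the current position.
    prev = [i * dele for i in range(len(pattern) + 1)]
    best = prev[-1]
    for ch in text:
        cur = [0]  # a match may begin at any position in the text
        for i, p in enumerate(pattern, 1):
            cur.append(min(
                prev[i - 1] + (0 if p == ch else sub),  # exact match or substitution
                prev[i] + ins,                          # extra character in the text
                cur[i - 1] + dele,                      # pattern character missing
            ))
        best = min(best, cur[-1])
        prev = cur
    return best

def approx_grep(pattern, lines, k, **costs):
    """Return the lines that contain pattern with a total error cost of at most k."""
    return [line for line in lines if min_cost(pattern, line, **costs) <= k]

lines = ["a homogeneous mixture", "nothing here"]
print(approx_grep("homogenos", lines, 2))
```

Raising one of the costs makes that kind of error count for more: with `dele=2`, matching `pizza` against the text `piza` costs 2 rather than 1, because the missing `z` is a deletion.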

Email glimpse-request@cs.arizona.edu to be added to the glimpse mailing list. Email glimpse@cs.arizona.edu to report bugs, ask questions, discuss tricks for using glimpse, etc. (This is a moderated mailing list with very little traffic, mostly announcements.)

- JP, SW, UM

