www.tgries.de/agrep -> www.tgries.de/agrep/
From the authors' notes: AGREP is a powerful tool for quickly searching one or many files for a string or regular expression, with approximate matching capabilities and user-definable records. AGREP is similar to egrep (or grep or fgrep), but it is much more general and usually faster. It also supports many kinds of queries, including arbitrary wild cards, sets of patterns, and, in general, regular expressions. It supports most of the options supported by the GREP family plus several more (but it is not 100% compatible with grep).

AGREP is the search engine of, and part of, the GLIMPSE tool for searching and indexing whole file systems. GLIMPSE stands for Global Implicit Search and is part of the HARVEST Information Discovery and Access System.

AGREP belongs to the University of Arizona, which licenses it (see copyright). It is not public-domain, but free for non-commercial use.

To run this, you need one, two, or three additional files, depending on your operating system (see next section).

[...] All these Zip files include the COPYRIGHT and README files.

[...] Remark: this does not search the C:\MAIL directory itself.

Another method to search multiple files and/or subdirectories: use AGREP's built-in @responsefile (listfile) option! [...]

Remember that AGREP is faster when you load it once and let it search a bunch of files:

    AGREP needle C:\MAIL\*

AGREP by Udi Manber, University of Arizona, Sun Wu, Thomas Gries.

The DPMI DOS extender RSX makes it possible to run 32-bit programs (like AGREP) in a DOS box of several operating systems.

More information on codepages can be found here:
codepage 437 (US) and 850 (Latin-1) (list of pointers)
ISO 8859-1 National Character Set FAQ (FAQ; [...]

Examples: "" matches "" "" matches "" but these do not match "a" or "" !

[...] This simply lets you force case-sensitive searches when the environment variable AGREPOPTS is also set to search case-insensitively.
[...]SYS)

    SET AGREPOPTS=-i -V4

makes AGREP search case-insensitively by default, at verbose level 4.

Different levels of the verbose option:

    -V     version information only
    -V0    no diagnostic messages at all (use the -V0 option together with the -s option to avoid any output)
    -V1    shows Grand Total (= count of records having matches; [...]

When calling AGREP, the return code reflects the number of matches:

    return code of AGREP    meaning
    >= 0                    the total number of matches (zero means no match)
    < 0                     syntax error/s or inaccessible file/s

There was a problem with infinite loops in older AGREP versions. The following can cause an infinite loop:

    agrep pattern * > output_file

If the number of matches is high, they may be written to output_file before it has been completely read, leading to more matches of the pattern within output_file (the matches are against the whole directory). It's not clear whether this is a "bug" (grep will do the same), but be warned.

[...]EXE for OS/2 cannot be run in a DOS box of Windows or OS/2.

There are a few restrictions regarding the possible combinations of search options. AGREP will display a message when it does not support the requested combination of options. As later versions may support them, please check this page regularly for amendments.

[...]EXE does not allow searching long file or directory names under Windows 95 or Windows NT. But AGREP does not fail to find your needle; it is only a problem of presenting this record.

When using AGREP with pipes, there could be some problems, such as the message "no target files found".

The program itself shows six pages of on-line help when you call it without any parameters. Visit the help pages for AGREP: there is a list of all options of AGREP and a lot of examples.

If you find a problem, please send me your bug report.

The Rexx API - An Introduction to Extending the Rexx Language (contents) (by Bill Potvin).
[...]CMD - Examples taken from \EMX\SAMPLES of emx (by Eberhard Mattes).
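The return-code convention described above (zero or more means a match count, negative means an error) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the executable name "agrep" and its presence on the search path are assumed, the helper names are hypothetical, and the sketch ignores any difference in how a given operating system reports negative exit codes to the calling process. The options -s and -V0 are the ones named above for suppressing all output.

```python
import subprocess


def run_agrep_silently(needle, path, executable="agrep"):
    """Run AGREP with -s and -V0 so that only the return code is used.

    Assumption: an executable called "agrep" is on the search path.
    """
    completed = subprocess.run(
        [executable, "-s", "-V0", needle, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # a non-zero code is expected here and is not an error
    )
    return completed.returncode


def interpret_return_code(code):
    """Map a return code to its meaning as stated in the table above."""
    if code < 0:
        return "error: syntax error or inaccessible file"
    if code == 0:
        return "no match"
    return "{} match(es)".format(code)


if __name__ == "__main__":
    # Interpret a few sample codes without actually calling AGREP.
    for sample_code in (3, 0, -1):
        print(sample_code, "->", interpret_return_code(sample_code))
```

Keeping the interpretation separate from the process call makes it possible to check the convention without having AGREP installed.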
Return codes:
If you intend to call this version of AGREP from a PERL script, you probably want to avoid any output while keeping AGREP's return code. In this case, please use the options -s (almost silent) together with -V0 (verbose nothing) to avoid the output of the Grand Total number of matches.

Retrieval in context (RIC) = focus function
One of the most powerful existing features is the -d option, which allows user-definable records. The proposed extension of this option would be very useful when the target files do not have a certain record structure. In this case, one would prefer to run AGREP line-oriented (the default), but giving -dn would allow a range of n target lines to be displayed around the line with the needle. Preferably in combination with the highlight function:

Highlight (mark, tag) the matches in the output record
Allow user-definable prefix and suffix strings to mark the needle in the output record. The strings could be composed of ANSI strings to select/deselect colours, or they could be used to generate HTML links.

Implementation of Sunday's Optimal Mismatch Algorithm (for exact pattern searches)

New metasymbols for three predefined sets of characters

    @      all letters
    %      all digits
    [...]  all the rest

Examples: to search for a car plate number which starts with "ABC" followed by 4 digits, use "ABC%%%%"; to search for US patents, use "US%%%%%%%"; for European patents, use "EP%%%%%%%".

Dynamic Metasymbol Assignment DMSA
In its current implementation, AGREP needs sixteen characters from the character set internally. These graphic characters, which are not common in text files, cannot be searched for at the moment and must therefore not appear in the needle string. It is planned to remove that restriction of AGREP in a later version.

Except for exact matching of simple patterns, for which we use a simple variation of the Boyer-Moore algorithm, all the algorithms (listed below) were designed by Sun Wu and Udi Manber.
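The metasymbols proposed earlier in this section ("@" for any letter, "%" for any digit) are a planned feature, not part of AGREP as described here. Their intended effect can be emulated with a short Python sketch that translates such a needle into a regular expression. This is an illustration only: the function names are hypothetical, "letter" is assumed to mean an ASCII letter, and the symbol for "all the rest" is omitted because it is not given in the text.

```python
from __future__ import annotations

import re

# Assumption: "@" stands for one ASCII letter and "%" for one digit.
# Every other character in the needle is matched literally.
_METASYMBOLS = {"@": "[A-Za-z]", "%": "[0-9]"}


def metasymbol_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a needle containing @ and % into a compiled regex."""
    parts = [_METASYMBOLS.get(ch, re.escape(ch)) for ch in pattern]
    return re.compile("".join(parts))


def find_all(pattern: str, text: str) -> list:
    """Return every non-overlapping substring of text matching the needle."""
    return metasymbol_pattern_to_regex(pattern).findall(text)


if __name__ == "__main__":
    sample = "plates: ABC1234, ABC12X4; patent US1234567"
    print(find_all("ABC%%%%", sample))    # car plate: "ABC" plus 4 digits
    print(find_all("US%%%%%%%", sample))  # "US" plus 7 digits
```

Escaping every non-metasymbol character keeps regular-expression operators in the needle (such as ".") from being interpreted, so the sketch covers only the literal-plus-metasymbol case from the examples.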
It supports many extensions such as approximate regular expression pattern matching, non-uniform costs, simultaneous matching of multiple patterns, mixed exact/approximate matching, etc.

It assumes that the set of patterns contains k patterns, and that the shortest pattern is of size m.

Let b = log_c(2*m), where c is the size of the alphabet. In the preprocessing, a table is built to determine whether a given substring of size b is in the pattern. Suppose we are looking for matches with at most k errors. The search is done in two passes: in the first pass (the filtering pass), the areas in the text that may contain matches are marked. The second pass finds the matches in those marked areas.

The search in the first pass is done in the following way. Suppose the end position of the pattern is currently aligned with position tx in the text. The algorithm scans backward from tx until either (k+1) blocks that do not occur in the pattern have been scanned, or the scan has passed position (tx-m+k). In the former case, the pattern is shifted forward to align the beginning position of the pattern with one character after the position in the text where the scan was stopped. In the latter case, we mark tx-m to tx+m as a candidate area.

For ASCII text and pattern, this algorithm is faster than amonkey.

If we partition A into (k+1) blocks, then the distance between A and B is > k if none of the blocks of A occur in B. This implies that to match A with no more than k errors, B has to contain a substring that exactly matches one of the blocks of A.

Permission is granted to copy this software, to redistribute it on a nonprofit basis, and to use it for any purpose, subject to the following restrictions and understandings. Any copy made of this software must include this copyright notice in full.
All materials developed as a consequence of the use of this software shall duly acknowledge such use, in accordance with the usual standards of acknowledging credit in academic research. The authors have made no warranty or representation that the operation of this software will be error-free or suitable for any application, and they are under no obligation to provide any services, by way of maintenance, update, or otherwise. The software is an experimental prototype offered on an as-is basis. Redistribution for profit requires the express, w...