XML, or text.
Frequently Asked Questions

General Questions

What is Swish-e?

Swish-e is Simple Web Indexing System for Humans - Enhanced. With it, you can quickly and easily index directories of files or remote web sites and search the generated indexes for words and phrases.

So, is Swish-e a search engine?

Well, yes. Probably the most common use of Swish-e is to provide a search engine for web sites. The Swish-e distribution includes CGI scripts that can be used with it to add a search engine for your web site. The CGI scripts can be found in the example directory of the distribution package. See the README file for information about the scripts.

But Swish-e can also be used to index all sorts of data, such as email messages, data stored in a relational database management system, XML documents, or documents such as Word and PDF documents -- or any combination of those sources at the same time. Searches can be limited to fields or MetaNames within a document, or limited to areas within an HTML document (e.g. body, title). Programs other than CGI applications can use Swish-e, as well.

Should I upgrade if I'm already running a previous version of Swish-e?

A large number of bug fixes, feature additions, and logic corrections were made in version 2.2. In addition, indexing speed has been drastically improved (reports of indexing times changing from four hours to five minutes), and major parts of the indexing and search parsers have been rewritten. There are better debugging options, enhanced output formats, more document meta data (e.g. last modified date, document summary), options for indexing from external data sources, and faster spidering, just to name a few changes. (See the CHANGES file for more information.) Since so much effort has gone into version 2.2, support for previous versions will probably be limited.

Are there binary distributions available for Swish-e on platform foo?

Foo? Well, yes, there are some binary distributions available. Please see the Swish-e web site for a list at http://swish-e.org/. In general, it is recommended that you build Swish-e from source, if possible.

Do I need to reindex my site each time I upgrade to a new Swish-e version?

At times it might not strictly be necessary, but since you don't really know if anything in the index has changed, it is a good rule to reindex.
It offers more features, and it does a much better job at extracting out
the text from a web page. In addition, you can use the
documents. It's also recommended for parsing XML, as it offers many more The internal HTML parser will have limited support, and does have a number of bugs. For example, HTML entities may not always be correctly converted and properties do not have entities converted. The internal parser tends to parser. If you are using the Perl module (the C interface to the Swish-e library) linked in the binary, and one without, and build the Perl module against Hopefully, the library will someday soon be split into indexing and searching code (volunteers welcome). [ TOC ] Does Swish-e include a CGI interface?
An example CGI script is included in the example directory. (Type

Please be careful when picking a CGI script to use with Swish-e. Quite a few of the scripts that have been available for it are insecure and should not be used.

The included example CGI script was designed with security in mind. Regardless, you are encouraged to have your local Perl expert review it (and all other CGI scripts you use) before placing it into production. This is just a good policy to follow.

How secure is Swish-e?

We know of no security issues with using Swish-e. Careful attention has been paid to common security problems such as buffer overruns when programming Swish-e.
The most likely security issue with Swish-e is when it is run via a poorly
written CGI interface. This is not limited to CGI scripts written in Perl,
as it's just as easy to write an insecure CGI script in C, Java, PHP, or
Python. A good source of information is included with the Perl
distribution. Type Note that there are many free yet insecure and poorly written CGI scripts available -- even some designed for use with Swish-e. Please carefully review any CGI script you use. Free is not such a good price when you get your server hacked... [ TOC ] Should I run Swish-e as the superuser (root)?No. Never. [ TOC ] What files does Swish-e write?
Swish writes the index file, of course. This is specified with the
IndexFile configuration directive or by the
The index file is actually a collection of files, but all start with the
file name specified with the IndexFile directive or the For example, the file ending in .prop contains the document properties.
When creating the index files Swish-e appends the extension .temp
to the index file names. When indexing is complete Swish-e renames the
.temp files to the index files specified by IndexFile or
Swish-e also writes temporary files in some cases during indexing (e.g.
The temporary files are created in the directory specified by the
environment variables [ TOC ] Can I index PDF and MS-Word documents?Yes, you can use a Filter to convert documents while indexing, or you can use a program that ``feeds'' documents to Swish-e that have already been converted. See <Indexing> below. [ TOC ] Can I index documents on a web server?
Yes, Swish-e provides two ways to index (spider) documents on a web server.
See

Swish-e can retrieve documents from a file system or from a remote web server. It can also execute a program that returns documents back to it. This program can retrieve documents from a database, filter compressed document files, convert PDF files, extract data from mail archives, or spider remote web sites.

Can I implement keywords in my documents?

Yes, Swish-e can associate words with MetaNames while indexing, and you can limit your searches to these MetaNames while searching.

In your HTML files you can put keywords in HTML META tags or in XML blocks.

META tags can have two formats in your source documents:
Then, to inform Swish-e about the existence of the meta name in your documents, edit the line in your configuration file:
When searching you can now limit some or all search terms to that MetaName. For example, to look for documents that contain the word apple and also have either fruit or cooking in the DC.subject meta tag. [ TOC ] What are document properties?A document property is typically data that describes the document. For example, properties might include a document's path name, its last modified date, its title, or its size. Swish-e stores a document's properties in the index file, and they can be reported back in search results. Swish-e also uses properties for sorting. You may sort your results by one or more properties, in ascending or descending order. Properties can also be defined within your documents. HTML and XML files can specify tags (see previous question) as properties. The contents of these tags can then be returned with search results. These user-defined properties can also be used for sorting search results. For example, if you had the following in your documents
and
Or for sorting:
What's the difference between MetaNames and PropertyNames?

MetaNames allow keyword searches in your documents. That is, you can use MetaNames to restrict searches to just parts of your documents. PropertyNames, on the other hand, define text that can be returned with results, and can be used for sorting.

Both use meta tags found in your documents (as shown in the above two questions) to define the text you wish to use as a property or meta name. You may define a tag as both a property and a meta name.

For example:
placed in your documents and then using configuration settings of:
will allow you to limit your searches to documents created by accounting:
That will find all documents with the word And you can also say:
which will return all documents with the word You can use properties and meta names at the same time, too:
That searches only in the
(See also the [ TOC ] Can Swish-e index multi-byte characters?
No. This will require much work to change. But Swish-e works with eight-bit
characters, so many character sets can be used. Note that it does call
the ANSI-C [ TOC ] Indexing[ TOC ] How do I pass Swish-e a list of files to index?Currently, there is not a configuration directive to include a file that contains a list of files to index. But, there is a directive to include another configuration file.
And in
You may also specify more than one configuration file on the command line:
Another option is to create a directory with symbolic links of the files to index, and index just that directory. [ TOC ] How does Swish-e know which parser to use?Swish can parse HTML, XML, and text documents. The parser is set by associating a file extension with a parser by the IndexContents directive. You may set the default parser with the DefaultContents directive. If a document is not assigned a parser it will default to the You may use Filters or an external program to convert documents to HTML, [ TOC ] Can I reindex and search at the same time?Yes. Starting with version 2.2 Swish-e indexes to temporary files, and then renames the files when indexing is complete. On most systems renames are atomic. But, since Swish-e also generates more than one file during indexing there will be a very short period of time between renaming the various files when the index is out of sync. Settings in src/config.h control some options related to temporary files, and their use during indexing. [ TOC ] Can I index phrases?Phrases are indexed automatically. To search for a phrase simply place double quotes around the phrase. For example:
[ TOC ] How can I prevent phrases from matching across sentences?Use the BumpPositionCounterCharacters configuration directive. [ TOC ] Swish-e isn't indexing a certain word or phrase.There are a number of configuration parameters that control what Swish-e considers a ``word'' and it has a debugging feature to help pinpoint any indexing problems.
Configuration file directives (SWISH-CONFIG)
WordCharacters, BeginCharacters, Swish-e also uses compile-time defaults for many settings. These are located in src/config.h file.
Use of the command line arguments
You may also wish to index a single file that contains words that are or are not indexing as you expect and use -T to output debugging information about the index. A useful command might be:
Once you see how Swish-e is parsing and indexing your words, you can adjust the configuration settings mentioned above to control what words are indexed. Another useful command might be:
This will show white-spaced words parsed from the document (PARSED_WORDS), and how those words are split up into separate words for indexing (INDEXED_WORDS).

How do I keep Swish-e from indexing numbers?

Swish-e indexes words as defined by the WordCharacters setting, as described above. So to avoid indexing numbers you simply remove digits from the WordCharacters setting.

There are also some settings in src/config.h that control what ``words'' are indexed. You can configure Swish-e to never index words that are all digits, vowels, or consonants, or that contain more than some consecutive number of digits, vowels, or consonants. In general, you won't need to change these settings.

Also, there's an experimental feature called IgnoreNumberChars which allows you to define a set of characters that describe a number. If a word is made up of only those characters it will not be indexed.

Swish-e crashes and burns on a certain file. What can I do?

This shouldn't happen. If it does, please post the details to the Swish-e discussion list so it can be reproduced by the developers.

In the meantime, you can use a FileRules directive to exclude the particular file name, or pathname, or its title.

If there are serious problems in indexing certain types of files, they may not have valid text in them (they may be binary files, for instance). You can use NoContents to exclude that type of file.
Swish-e will issue a warning if an embedded null character is found in a
document. This warning will be an indication that you are trying to index
binary data. If you need to index binary files try to find a program that
will extract out the text (e.g.

How do I prevent indexing of some documents?
When using the file system to index your files you can use the
FileRules directive. Other than

If you are spidering, use a robots.txt file in your document root. This is a standard way to exclude files from search engines, and is fully supported by Swish-e. See http://www.robotstxt.org/
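
As a rough illustration of what a robots.txt rule set excludes, the sketch below uses Python's standard `urllib.robotparser` module to test a few URLs against an inline rule set. The rules, the `example-spider` agent name, and the URLs are all made up for the example; it demonstrates the exclusion standard itself, not Swish-e's own implementation of it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might sit in the document root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def allowed_urls(robots_txt, agent, urls):
    """Return the URLs that the given rules permit the agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


if __name__ == "__main__":
    candidates = [
        "http://localhost/index.html",
        "http://localhost/private/report.html",
        "http://localhost/cgi-bin/swish.cgi",
    ]
    for url in allowed_urls(ROBOTS_TXT, "example-spider", candidates):
        print(url)
```

Checking a handful of representative URLs this way before a long spidering run can save a reindex when a Disallow line turns out to be broader than intended.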
You can also modify the spider.pl spider perl program to skip, index content only, or spider only listed web
pages. Type Robots Exclusion in your documents:
See the obeyRobotsNoIndex directive. [ TOC ] How do I prevent indexing parts of a document?
To prevent Swish-e from indexing a common header, footer, or navigation
HTML tag around the text you wish to ignore and use the
IgnoreMetaTags directive. This will generate an error message if the parser), but not with documents parsed by the text (TXT) parser. the following comments in your documents to prevent indexing:
and/or these may be used also:
How do I modify the path or URL of the indexed documents?
Use the ReplaceRules configuration directive to rewrite path names and URLs. If you are using [ TOC ] How can I index data from a database?
Use the ``prog'' document source method of indexing. Write a program to
extract out the data from your database, and format it as XML, HTML, or
text. See the examples in the [ TOC ] How do I index my PDF, Word, and compressed documents?Swish-e can internally only parse HTML, XML and TXT (text) files by default, but can make use of filters that will convert other types of files such as MS Word documents, PDF, or gzipped files into one of the file types that Swish-e understands.
Please see SWISH-CONFIG
and the examples in the filters and The FileFilter directive can be used in any of the input methods, but the -S http method (spidering with the SwishSpider program) only indexes ...) so filter programs can not be used to convert documents such as PDF files.
Another option is to use the prog document source input method. In this case you write a program (such as a
perl script) that will read and convert your data as needed and then output
one of the formats that Swish-e understands. Examples of using the prog input method for filtering are included in the

The disadvantage of using the prog input method is that you must write a program that reads the documents from the source (e.g. from the file system or via a spider to read files on a web server), and also include the code to filter the documents. It's much easier to use the FileFilter option since the filter can often be implemented with just a single configuration directive.

On the other hand, the advantage of using the prog input method for indexing is speed. Filtering within a prog input method program will be faster if your filtering program is something like a Perl script (something that has a large start-up cost). This may or may not be an issue for you, depending on how much time your indexing requires.

You can also use a combination of methods. For example, say you are indexing a directory that contains PDF files using a FileFilter directive. Now you want to index a MySQL database that also contains PDF files. You can write a prog input method program to read your MySQL database and use the same FileFilter configuration parameter (and filter program) to convert the PDF files into one of the native Swish-e formats (TXT, HTML, XML).

Do note that it will be slower to use the FileFilter method instead of running the filter directly from the prog input method program. When FileFilter is used with the prog input method Swish-e must create a temporary file containing the output from your prog method program, and then execute the filter program.

In general, use the FileFilter method to filter documents. If indexing speed is an issue, consider writing a prog input method program. If you are already using the prog method, then filtering will probably be best accomplished within that program.
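
To give a feel for what a prog input method program has to produce, here is a minimal sketch in Python (rather than Perl). It assumes the header names `Path-Name`, `Content-Length`, and `Document-Type` as the SWISH-RUN documentation describes them for the prog method; verify them against your Swish-e version. The `records` list and the `/db/...` path names are hypothetical stand-ins for rows read from a database or for output from a filter.

```python
import sys


def format_prog_document(path_name, content, doc_type=None):
    """Build one document: header lines, a blank line, then the body.

    The header names are an assumption taken from the SWISH-RUN
    description of the prog method, not from this FAQ.
    """
    # Swish-e works with 8-bit characters, so encode the body as Latin-1.
    body = content.encode("latin-1", errors="replace")
    headers = [
        f"Path-Name: {path_name}",
        f"Content-Length: {len(body)}",  # length in bytes, not characters
    ]
    if doc_type:
        headers.append(f"Document-Type: {doc_type}")
    head = "\n".join(headers) + "\n\n"
    return head.encode("latin-1", errors="replace") + body


def main():
    # Hypothetical records; a real program would read these from its source.
    records = [
        ("/db/1", "<html><head><title>One</title></head><body>first</body></html>"),
        ("/db/2", "<html><head><title>Two</title></head><body>second</body></html>"),
    ]
    out = sys.stdout.buffer
    for path, html in records:
        out.write(format_prog_document(path, html, "HTML"))


if __name__ == "__main__":
    main()
```

Any per-document filtering would happen inside the loop before the document is written, which is where the speed advantage described above comes from: the program starts once rather than once per document.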
Here are two examples of how to run a filter program, one using Swish-e's
FileFilter directive, another using a prog input method program. These filters simply use the program

First, using the FileFilter method, here's the entire configuration file (swish.conf):
and index with the command
Now, the same thing using the prog document source input method and a Perl program called catfilter.pl. You can see that it's much more work than using the FileFilter method above, but it provides a place to do additional processing. In this example, the prog method is only slightly faster. But if you needed a Perl script to run as a FileFilter, then prog will be significantly faster.
And index with the command:
This example will probably not work under Windows due to the '-|' open. A simple piped open may work just as well: That is, replace:
with this:
Perl will try to avoid running the command through the shell if meta
characters are not passed to the open. See

Eh, but I just want to know how to index PDF documents!

See the examples in the conf directory.

I'm using the prog method to index PDF documents, but the file contents are not indexed.

Some of the examples in the prog-bin directory use a module to convert the PDF files into XML. So you must tell Swish-e that you are indexing XML files for the PDF extension.
[ TOC ] I'm using Windows and can't get Filters or the prog input method to work!
Both the For example, you would need to specify the path to perl as (assuming this is where perl is on your system):
Or run a filter like:
[ TOC ] How do I index non-English words?Swish-e indexes 8-bit characters only. This is the ISO 8859-1 Latin-1 character set, and includes many non-English letters (and symbols). As long as they are listed in WordCharacters they will be indexed. Actually, you probably can index any 8-bit character set, as long as you don't mix character sets in the same index. The TranslateCharacters directive (SWISH-CONFIG) can translate characters while indexing and searching. You may specify the mapping of one character to another character with the TranslateCharacters directive.
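
To make the idea of a character-to-character mapping concrete, here is a small Python sketch that applies the same kind of translation to a string. It only illustrates the concept; the particular mapping and the `translate_word` helper are invented for the example and are not taken from the Swish-e source or its configuration syntax.

```python
# Hypothetical mapping: each character in SOURCE is replaced by the
# character at the same position in TARGET.
SOURCE = "àáâãäçèéêë"
TARGET = "aaaaaceeee"
TABLE = str.maketrans(SOURCE, TARGET)


def translate_word(word):
    """Map accented characters to their unaccented counterparts."""
    return word.translate(TABLE)


if __name__ == "__main__":
    for word in ["café", "garçon", "plain"]:
        print(word, "->", translate_word(word))
```

Because the translation is applied both when indexing and when searching, a query typed without accents can match a document word written with them, and the reverse.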
Latin-1 when indexing. In cases where a string can not be converted from
UTF-8 to ISO 8859-1 (because it contains non 8859-1 characters), the string
will be sent to Swish-e in UTF-8 encoding. This will result in some words
indexed incorrectly. Setting [ TOC ] Can I add/remove files from an index?Not really. Swish-e currently has no way to add or remove items from its index. But, Swish-e indexes so quickly that it's often possible to reindex the entire document set when a file needs to be added, modified or removed. If you are spidering a remote site then consider caching documents locally compressed.
Incremental additions can be handled in a couple of ways, depending on your
situation. It's probably easiest to create one main index every night (or
every week), and then create an index of just the new files between main
indexing jobs and use the You can merge the indexes into one index (instead of using -f), but it's not clear that this has any advantage over searching multiple indexes. How does one create the incremental index?
One method is by using the
This option has the disadvantage that Swish-e must process every file in
every directory as if they were going to be indexed (the test for
Also, if you use the Swish-e index file as the file passed to Another option is to maintain a parallel directory tree that contains symlinks pointing to the main files. When a new file is added (or changed) to the main directory tree you create a symlink to the real file in the parallel directory tree. Then just index the symlink directory to generate the incremental index. This option has the disadvantage that you need to have a central program that creates the new files that can also create the symlinks. But, indexing is quite fast since Swish-e only has to look at the files that need to be indexed. When you run full indexing you simply unlink (delete) all the symlinks. Both of these methods have issues where files could end up in both indexes, or files being left out of an index. Use of file locks while indexing, and hash lookups during searches can help prevent these problems. [ TOC ] I run out of memory trying to index my files.It's true that indexing can take up a lot of memory! Swish-e is extremely fast at indexing, but that comes at the cost of memory. The best answer is install more memory.
Another option is to use the
Here's an example of indexing all .html files in /usr/doc on Linux. This
first example is without
This is with
You can also build a number of smaller indexes and then merge together with
Finally, if you do build a number of smaller indexes, you can specify more
than one index when searching by using the

My system admin says Swish-e uses too much of the CPU!

That's a good thing! That expensive CPU is supposed to be busy. Indexing takes a lot of work -- to make indexing fast much of the work is done in memory which reduces the amount of time Swish-e is waiting on I/O. But, there are two things you can try:
The
The other thing is to simply lower the priority of the job using the
If concerned about searching time, make sure you are using the -b and -m switches to only return a page at a time. If you know that your result sets will be large, and that you wish to return results one page at a time, and that often times many pages of the same query will be requested, you may be smart to request all the documents on the first request, and then cache the results to a temporary file. The perl module File::Cache makes this very simple to accomplish. [ TOC ] Spidering[ TOC ] How can I index documents on a web server?
If possible, use the file system method
If this is impossible (the web server is not local, or documents are
dynamically generated), Swish-e provides two methods of spidering. First,
it includes the http method of indexing
As of Swish-e 2.2, there's a general purpose ``prog'' document source where
a program can feed documents to it for indexing. A number of example
programs can be found in the The advantage of the ``prog'' document source feature over the ``http'' method is that the program is only executed one time, where the swishspider.pl program used in the ``http'' method is executed once for every document read from the web server. The forking of Swish-e and compiling of the perl script can be quite expensive, time-wise.
The other advantage of the

Why does swish report "./swishspider: not found"?

Does the file swishspider exist where the error message displays? If not, either set the configuration option SpiderDirectory to point to the directory where the swishspider program is found, or place the swishspider program in the current directory when running swish-e.

If you are running Windows, make sure ``perl'' is in your path. Try typing perl from a command prompt.

If you are not running Windows, make sure that the shebang line (the first line of the swishspider program that starts with #!) points to the correct location of perl. Typically this will be /usr/bin/perl or /usr/local/bin/perl. Also, make sure that you have execute and read permissions on swishspider.

The swishspider perl script is only used with the -S http method of indexing.

I'm using the spider.pl program to spider my web site, but some large files are not indexed.
The [ TOC ] I still don't think all my web pages are being indexed.
The spider.pl program has a number of debugging switches and can be quite verbose in
telling you what's happening, and why. See [ TOC ] Swish is not spidering Javascript links!Swish cannot follow links generated by Javascript, as they are generated by the browser and are not part of the document. [ TOC ] How do I spider other websites and combine it with my own (filesystem) index?
You can either merge
You will have better results with the

Searching

How do I limit searches to just parts of the index?

If you can identify ``parts'' of your index by the path name you have two options.

The first option is to index the document path. Add this to your configuration:
Now you can search for words or phrases in the path name:
So that will only find documents with the word ``foo'' and where the file's path contains ``sales''. That might not work as well as you like, though, as both of these paths will match:
This can be solved by searching with a phrase (assuming ``/'' is not a WordCharacter):
The second option is a bit more powerful. With the ExtractPath directive you can use a regular expression to extract out a sub-set of the path and save it as a separate meta name:
Which says: match a path that starts with ``/web/'', extract everything after that up to, but not including, the next ``/'' and save it in variable $1, and then match everything from the ``/'' onward. Then replace the entire matched string with $1. And that gets indexed as meta name ``department''. Now you can search like:
and be sure that you will only match the documents in the /www/sales/* path. Note that you can map completely different areas of your file system to the same metaname:
Finally, if you have something more complicated, use [ TOC ] How can I limit searches to the title, body, or comment?
Use the

I can't limit searches to title/body/comment.

Or, I can't search with meta names, all the names are indexed as "plain".

Check in the config.h file if #define INDEXTAGS is set to 1. If it is, change it to 0, recompile, and index again. When INDEXTAGS is 1, ALL the tags are indexed as plain text, that is, you index ``title'', ``h1'', and so on, AND they lose their indexing meaning. If INDEXTAGS is set to 0, you will still index meta tags and comments, unless you have indicated otherwise in the user config file with the IndexComments directive.

Also, check for the UndefinedMetaTags setting in your configuration file.

I've tried running the included CGI script and I get an "Internal Server Error"
Debugging CGI scripts is beyond the scope of this document. Internal
Server Error basically means ``check the web server's log for an error
message'', as it can mean a bad shebang (#!) line, a missing perl module,
FTP transfer error, or simply an error in the program. The CGI script swish.cgi in the example directory contains some debugging suggestions. Type There are also many, many CGI FAQs available on the Internet. A quick web search should offer help. As a last resort you might ask your webadmin for help... [ TOC ] When I try to view the swish.cgi page I see the contents of the Perl program.
Your web server is not configured to run the program as a CGI script. This
problem is described in

How do I make Swish-e highlight words in search results?

Short answer: Use the supplied swish.cgi script located in the examples directory.

Long answer: Swish-e can't, because it doesn't have access to the source documents when returning results, of course. But a front-end program of your creation can highlight terms. Your program can open up the source documents and then use regular expressions to replace search terms with highlighted or bolded words.

But that will fail with all but the most simple source documents. For HTML documents, for example, you must parse the document into words and tags (and comments). A word you wish to highlight may span multiple HTML tags, or be a word in a URL and you wish to highlight the entire link text. Perl modules such as HTML::Parser and XML::Parser make word extraction possible.

Next, you need to consider that Swish-e uses settings such as WordCharacters, BeginCharacters, EndCharacters, IgnoreFirstChar, and IgnoreLastChar to define a ``word''. That is, you can't consider that a string of characters with white space on each side is a word.

Then things like TranslateCharacters and HTML entities may transform a source word into something else, as far as Swish-e is concerned.

Finally, searches can be limited by metanames, so you may need to limit your highlighting to only parts of the source document. Throw phrase searches and stopwords into the equation and you can see that it's not a trivial problem to solve.
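
For the simplest case, plain text with no markup, the regular-expression approach described above might look like the following Python sketch. The `highlight` function, the `<b>` wrapper, and the `WORD_CHARS` set are assumptions made for illustration (a real front end would have to mirror your actual WordCharacters setting). It deliberately ignores HTML tags, entities, phrases, and stopwords, which is exactly where this approach breaks down.

```python
import re

# Stand-in for the WordCharacters setting; adjust to match your configuration.
WORD_CHARS = "a-zA-Z0-9"


def highlight(text, terms, open_tag="<b>", close_tag="</b>"):
    """Wrap whole-word, case-insensitive matches of the terms in tags."""
    if not terms:
        return text
    # Try longer terms first so "apples" is preferred over "apple".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # A match must not be preceded or followed by another word character.
    pattern = re.compile(
        rf"(?<![{WORD_CHARS}])({alternation})(?![{WORD_CHARS}])",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"{open_tag}{m.group(1)}{close_tag}", text)


if __name__ == "__main__":
    print(highlight("Apple pie and pineapple.", ["apple"]))
```

The look-around assertions keep the term from matching inside a longer word, which is the plain-text analogue of respecting Swish-e's definition of a word.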
All hope is not lost, though, as Swish-e does provide some help. Using the

Do filters affect the performance during search?

No. Filters (FileFilter or via ``prog'' method) are only used for building the search index database. During search requests there will be no filter calls.

I have read the FAQ but I still have questions about using Swish-e.

The Swish-e discussion list is the place to go. http://swish-e.org/. Please do not email developers directly. The list is the best place to ask questions.

Before you post, please read QUESTIONS AND TROUBLESHOOTING located in the INSTALL page. You should also search the Swish-e discussion list archive, which can be found on the Swish-e web site.

In short, be sure to include the following when asking for help.
Document Info

$Id: SWISH-FAQ.pod,v 1.26 2002/09/11 00:54:09 whmoseley Exp $