diff --git a/debian/htdig/htdig-3.2.0b6/htdoc/require.html b/debian/htdig/htdig-3.2.0b6/htdoc/require.html new file mode 100644 index 00000000..d1975701 --- /dev/null +++ b/debian/htdig/htdig-3.2.0b6/htdoc/require.html @@ -0,0 +1,392 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> +<html> + <head> + <title> + ht://Dig: Features and System requirements + </title> + </head> + <body bgcolor="#eef7ff"> + <h1> + Features and System requirements + </h1> + <p> + ht://Dig Copyright © 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br> + Please see the file <a href="COPYING">COPYING</a> for + license information. + </p> + <hr noshade> + <h2> + Features + </h2> + <p> + Here are some of the major features of ht://Dig. They are in + no particular order. + </p> + <blockquote> + <dl> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Intranet searching</strong> + </dt> + <dd> + ht://Dig has the ability to search through many servers + on a network by acting as a WWW browser. + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + It is free</strong> + </dt> + <dd> + The whole system is released under the + <a href="COPYING">GNU Library General Public License (LGPL)</a> + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Robot exclusion is supported</strong> + </dt> + <dd> + The <a href="http://www.robotstxt.org/wc/norobots.html"> + Standard for Robot Exclusion</a> is + <a href="meta.html#robots">supported by ht://Dig.</a> + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Boolean expression searching</strong> + </dt> + <dd> + Searches can be arbitrarily complex using boolean + expressions. + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Phrase searching</strong> + </dt> + <dd> + A phrase can be searched for by enclosing it in quotes. 
+ Phrase searches can be combined with word searches, as in + <code>Linux and "high quality"</code>. + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Configurable search results</strong> + </dt> + <dd> + The output of a search can easily be tailored to your + needs by providing HTML templates. + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Fuzzy searching</strong> + </dt> + <dd> + Searches can be performed using various + <a href="attrs.html#search_algorithm">configurable algorithms</a>. + Currently the following algorithms are + supported (in any combination): + <ul> + <li> + exact + </li> + <li> + soundex + </li> + <li> + metaphone + </li> + <li> + common word endings + </li> + <li> + synonyms + </li> + <li> + accent stripping + </li> + <li> + substring and prefix + </li> + <li> + regular expressions + </li> + <li> + simple spelling corrections + </li> + </ul> + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Searching of many file formats</strong> + </dt> + <dd> + Both HTML documents and plain text files can be + searched directly by ht://Dig itself. There is also a + <a href="attrs.html#external_parsers">mechanism + to allow external programs ("external parsers")</a> to be used + while building the database so that arbitrary file formats + can be searched. <br> + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Document retrieval using many transport services</strong> + </dt> + <dd> + Several transport services can be handled by ht://Dig, + including http://, ftp:// and file:///. + There is also a + <a href="attrs.html#external_protocols">mechanism + to allow external programs ("external protocols")</a> to be used + while building the database so that arbitrary transport + services can be used.
<br> + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Keywords can be added to HTML documents</strong> + </dt> + <dd> + Any number of <a href="meta.html">keywords</a> + can be added to HTML documents + which will not show up when the document is viewed. + This is used to make a document more likely to be found + and also to make it appear higher in the list of + matches. + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Email notification of expired documents</strong> + </dt> + <dd> + Special meta information can be added to HTML documents + which can be used to + <a href="notification.html">notify the maintainer</a> of those + documents at a certain time. It is handy to be + reminded when to remove the "New" images from a certain + page, for example. + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + A protected server can be indexed</strong> + </dt> + <dd> + ht://Dig can be told to use a specific + <a href="attrs.html#authorization">username and password</a> + when it retrieves documents. This can be used + to index a server or parts of a server that are + protected by a username and password. + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Searches on subsections of the database</strong> + </dt> + <dd> + It is easy to set up a search which only returns + documents whose + <a href="hts_form.html#restrict">URL matches a certain pattern.</a> + This is very useful for people who want to make their + own data searchable without having to use a separate + search engine or database. + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Full source code included</strong> + </dt> + <dd> + The search engine comes with full source code.
The + whole system is released under the terms and conditions + of the <a href="COPYING">GNU Library General Public License (LGPL) version + 2.0</a> + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + The depth of the search can be limited</strong> + </dt> + <dd> + Instead of limiting the search to a set of machines, it + can also be restricted to documents that are a certain + number of <a href="attrs.html#max_hop_count">"mouse-clicks"</a> + away from the start document. + </dd> + <dt> + <strong><img src="bdot.gif" width=9 height=9 alt="*"> + Full support for the ISO-Latin-1 character set</strong> + </dt> + <dd> + Both SGML entities like '&agrave;' and ISO-Latin-1 + characters can be indexed and searched. + </dd> + </dl> + </blockquote> + <hr size="4" noshade> + <h1> + Requirements to build ht://Dig + </h1> + <p> + ht://Dig was developed under Unix using C++. + </p> + <p> + For this reason, you will need a Unix machine, a C compiler + and a C++ compiler. (The C compiler is needed to compile some + of the GNU libraries) + </p> + <p> + Unfortunately, we only have access to a couple of different + Unix machines. ht://Dig has been tested on these machines: + </p> + <ul> +<!-- + <li> + Sun Solaris 2.5 SPARC (using gcc/g++ 2.7.2) + </li> + <li> + Sun SunOS 4.1.4 SPARC (using gcc/gcc 2.7.0) + </li> + <li> + HP/UX A.09.01 (using gcc/g++ 2.6.0) + </li> + <li> + IRIX 5.3 (SGI C++ compiler. 
Don't know the version) + </li> + <li> + Debian Linux 2.0 (using egcs 1.1b) + </li> +--> + <li> + FreeBSD 4.6 (using gcc 2.95.3) <!-- lha --> + </li> + <li> + Mandrake Linux 8.2 (using gcc 3.2) <!-- lha --> + </li> + <li> + Debian, 2.2.19 kernel (using gcc 2.95.4) <!-- lha --> + </li> + <li> + Debian on an Alpha <!-- lha --> + </li> + <li> + RedHat 7.3, 8.0 <!-- Jim Cole --> + </li> + <li> + Sun Solaris 2.8 = SunOS 5.8 (using gcc 3.1) <!-- lha --> + </li> + <li> + Sun Solaris 2.8 = SunOS 5.8 (using Sun's cc / g++ 3.1) <!-- lha --> + </li> + <li> + Mac OS X 10.2 (using gcc) <!-- Jim Cole --> + </li> + + </ul> + There are reports of ht://Dig working on a number of other platforms. + <h3> + libstdc++ + </h3> + <p> + If you plan on using g++ to compile ht://Dig, you have to make + sure that libstdc++ has been installed. Unfortunately, libstdc++ is a + separate package from gcc/g++. You can get libstdc++ from the + <a href="ftp://ftp.gnu.org/pub/gnu/">GNU software archive</a>. + </p> + +<!-- The current Makefiles don't use include... + <h3> + Berkeley 'make' + </h3> + <p> + The building relies heavily on the make program. The problem + with this is that not all make programs are the same. The + requirement for the make program is that it understands the + 'include' statement as in + </p> + <blockquote> + <code>include somefile otherfile</code> + </blockquote> + <p> + The Berkeley 4.4 make program doesn't use this syntax, instead + it wants + </p> + <blockquote> + <code>.include "somefile"</code><br> + <code>.include "otherfile"</code> + </blockquote> + <p> + and hence it cannot be used to build ht://Dig. + </p> + <p> + If your make program doesn't understand the right 'include' + syntax, it is best if you get and install + <a href="ftp://ftp.gnu.org/pub/gnu/">gnumake</a> before you try + to compile everything. The alternative is to change all the + Makefiles. 
+ </p> +--> + <hr noshade> + <h1> + Disk space requirements + </h1> + <p> + The search engine will require lots of disk space to store + its databases. Unfortunately, there is no exact formula to + compute the space requirements. It depends on the number of + documents you are going to index but also on the various + options you use. + </p> + <p>As a temporary measure, 3.2 betas use a very inefficient + database structure to enable phrase searching. This will be + fixed before the release of 3.2.0. Currently, indexing a site of + around 10,000 documents gives a database of around 400MB using the + default setting for + <a href="attrs.html#max_doc_size">maximum document size</a> and storing the + <a href="attrs.html#max_head_length">first 50,000 bytes of each document</a> + to enable context to be displayed. + <!-- To give you an idea of the space + requirements, here is what I have deduced from our own + database size at San Diego State University. + </p> + <p> + If you keep around the wordlist database (for update digging + instead of initial digging) I found that multiplying the + number of documents covered by 12,000 will come pretty close + to the space required. + </p> + <p> + We have about 13,000 documents: + </p> +<pre> + 13,000 + 12,000 x + =========== + 156,000,000 +</pre> + or about 150 MB. + <p> + Without the wordlist database, the factor drops down to about + 7500: + </p> +<pre> + 13,000 + 7,500 x + =========== + 97,500,000 +</pre> + or about 93 MB. +--> + <p> + Keep in mind that we keep at most 50,000 bytes of each + document. This may seem like a lot, but most documents aren't very + big and it gives us a big enough chunk to almost always show + an excerpt of the matches. + </p> + <p> + You may find that if you store most of each document, the + databases are almost the same size, or even larger than the + documents themselves!
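+ </p> + <p> + Extrapolating from the figures above (around 400MB for + 10,000 documents, i.e. roughly 40,000 bytes of database per + document with the default settings), a rough estimate for a + hypothetical site of, say, 25,000 documents would be: + </p> +<pre> + 25,000 + 40,000 x + ============= + 1,000,000,000 +</pre> + or about 950 MB. Treat this as a ballpark figure only; the + actual size depends heavily on your configuration. + <p>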
Remember that if you're storing a + significant portion of each document (say 50,000 bytes as + above), you have that requirement, plus the size of the word + database and all the additional information about each document + (size, URL, date, etc.) required for searching. + </p> + <hr size="4" noshade> + + Last modified: $Date: 2004/05/28 13:15:19 $ + + </body> +</html>