Definition of terms:

To follow the material below, you need to know some of the terms WebMaven uses.
Term Definition

URI WebMaven uses the term URI to encompass URLs along with the other associated data needed to reference a file via HTTP. From RFC 2068:

URIs have been known by many names: WWW addresses, Universal Document Identifiers, Universal Resource Identifiers, and finally the combination of Uniform Resource Locators (URL) and Names (URN). As far as HTTP is concerned, Uniform Resource Identifiers are simply formatted strings which identify -- via name, location, or any other characteristic -- a resource.

Local path The local path is the directory into which a remote site is downloaded. The remote site can be a complete domain (e.g. www.cfsrexx.com) or one or more paths within a remote domain (e.g. www.cfsrexx.com/pub/ and www.cfsrexx.com/WebMaven/).

Note: Multiple remote paths can only be specified with the enterprise edition.


Remote path A remote path is a domain name, with or without a path. The remote path corresponds to the local path on a one-to-one basis when only one remote path is specified.

When more than one remote path is specified, each directory in the remote path corresponds, one-for-one, with a local directory subordinate to the local path. Multiple remote paths can include the same or different domains and protocols.

For example, assuming the following remote tree structure:


   Baseball
     |
     *-- American
     |    |
     |    *-- Red Sox
     |    |    |
     |    |    *-- Pawtucket
     |    |    |
     |    |    *-- Trenton
     |    |
     |    *-- Yankees
     |         |
     |         *-- Columbus
     |         |
     |         *-- Norwich
     |
     *-- National
          |
          *-- Mets
          |    |
          |    *-- Norfolk
          |    |
          |    *-- Binghamton
          |
          *-- Dodgers
               |
               *-- Albuquerque
               |
               *-- San Antonio
   

Enterprise-enabled versions of WebMaven could specify remote path values of Baseball/American/Red Sox and Baseball/National/Dodgers and a local path of MLB, which would result in the following tree structure on the local hard drive:


   MLB
    |
    *-- American
    |    |
    |    *-- Red Sox
    |         |
    |         *-- Pawtucket
    |         |
    |         *-- Trenton
    |
    *-- National
         |
         *-- Dodgers
              |
              *-- Albuquerque
              |
              *-- San Antonio
   

Note: Multiple remote paths can only be specified with the enterprise edition.
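
The mapping in the Baseball example can be sketched in a few lines of Python. This is only an illustration, not WebMaven's actual algorithm: the function name `map_remote_to_local` is hypothetical, and the rule it applies (the directories shared by every remote path are replaced by the local path) is an assumption inferred from the example above.

```python
from pathlib import PurePosixPath


def map_remote_to_local(remote_paths, local_path):
    """Return {remote path: local directory} for the given remote paths.

    Assumed rule (inferred from the Baseball/MLB example, not documented):
    the leading directories common to all remote paths are replaced by
    local_path; the remaining directories are kept beneath it.
    """
    parts = [PurePosixPath(p).parts for p in remote_paths]
    if len(parts) == 1:
        # A single remote path corresponds one-to-one with the local path.
        return {remote_paths[0]: str(PurePosixPath(local_path))}

    # Count how many leading components every remote path shares.
    common = 0
    for level in zip(*parts):
        if len(set(level)) != 1:
            break
        common += 1

    return {
        remote: str(PurePosixPath(local_path, *p[common:]))
        for remote, p in zip(remote_paths, parts)
    }


if __name__ == "__main__":
    mapping = map_remote_to_local(
        ["Baseball/American/Red Sox", "Baseball/National/Dodgers"], "MLB"
    )
    for remote, local in mapping.items():
        print(remote, "->", local)
```

Under this assumed rule, remote paths on different domains share no leading component, so each would land in its own subdirectory of the local path.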


Out of tree Out of tree designates HTML links that are not within the remote path(s). What happens with out of tree links is detailed in the Protocol processing table.
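
As a rough illustration of the out of tree test, the sketch below treats a link as in-tree when its host matches a remote path's domain and its path lies beneath that remote path. The function name `is_out_of_tree` is hypothetical, and the matching rules (host compared without regard to case, path compared with regard to case, HTTP assumed when no scheme is given) are assumptions; WebMaven's own rules may differ.

```python
from urllib.parse import urlsplit


def _split(uri):
    # Remote paths are written without a scheme in this document,
    # so assume HTTP when none is present.
    return urlsplit(uri if "://" in uri else "http://" + uri)


def _as_directory(path):
    # Normalise "/pub" and "/pub/" to "/pub/" so that "/public" does not match.
    return path if path.endswith("/") else path + "/"


def is_out_of_tree(link, remote_paths):
    """Return True if link lies outside every remote path (assumed rule)."""
    target = _split(link)
    target_path = _as_directory(target.path)
    for remote in remote_paths:
        root = _split(remote)
        same_host = target.netloc.lower() == root.netloc.lower()
        if same_host and target_path.startswith(_as_directory(root.path)):
            return False
    return True


if __name__ == "__main__":
    roots = ["www.cfsrexx.com/pub/", "www.cfsrexx.com/WebMaven/"]
    for link in ("http://www.cfsrexx.com/pub/tools/a.html",
                 "http://www.cfsrexx.com/public/a.html"):
        print(link, "out of tree" if is_out_of_tree(link, roots) else "in tree")
```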

 

Feature by level: (default is underlined)

Function                             Unlicensed  Personal license             Enterprise license
Default file name                    INDEX!.HTM  INDEX!.HTM | user specified  INDEX!.HTM | user specified
Maximum number of download subtasks  1           1-2; 2                       1-9; 3
Maximum download byte limit          Yes         Yes                          Yes
Full vs. relative localized paths    full only   relative only                relative | full
HTTP client version                  1.0 only    1.1 | 1.0                    1.1 | 1.0
HTML priority load balancing         No          Yes | No                     Yes | No
Case sensitive URIs                  No          Yes | No                     Yes | No
Useable with proxy server            No          Yes                          Yes
Remote paths per local tree          1           1                            unlimited
User ID / Password capable           No          Yes                          Yes

Sample Reports
Report Name                                    Unlicensed  Personal license  Enterprise license
Aged File List                                 No          No                Yes | No
Cookie Report                                  No          Yes | No          Yes | No
  (does not affect handling cookie requests)
Domain Name Server Lookup Time                 No          No                Yes | No
Unresolved Domain Names (Note 1)               Yes         Yes               Yes
Download Time Report                           No          No                Yes | No
E-mail Addresses                               No          No                Yes | No
Externally Processed MIME Types                No          No                Yes | No
HTML Syntax Errors (Note 1)                    Yes         Yes               Yes
Invalid HTML links (Note 1)                    Yes         Yes               Yes
Image Exception Report                         No          Yes | No          Yes | No
Image Group Report                             No          Yes | No          Yes | No
Java Class Report                              No          No                Yes | No
  (does not affect processing Java classes)
Out Of Tree Links                              No          No                No | Yes
Relocated URIs                                 No          No                No | Yes
Site Map                                       No          No                No | Yes
Summary of WebMaven reports                    Yes         Yes               Yes
Titles of HTML Pages & Images                  No          No                Yes | No
File Cross Reference Table                     No          No                No | Yes

Note 1: Each of these reports includes a courtesy e-mail report, addressed to the Webmaster at the referenced site, which the user may optionally send. If "Webmaster" does not appear as an e-mail address within the retrieved files, the most commonly used e-mail address in the retrieved files at that domain is used as the mailto: address for the report. The e-mail report contains the same detail shown on the WebMaven report, in ASCII format.

The user must confirm sending the e-mail report to the Webmaster.
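
The address selection described in Note 1 could look roughly like the following. This is a sketch under stated assumptions, not WebMaven's code: the function name `pick_report_address` is hypothetical, "Webmaster" is taken to mean the local part of an address at the referenced domain, and comparisons ignore case.

```python
from collections import Counter


def pick_report_address(addresses, domain):
    """Choose the mailto: address for a courtesy report (assumed rule).

    addresses: every e-mail address found in the retrieved files,
               repeated once per occurrence.
    domain:    the referenced site's domain.
    Returns None when no address at that domain was found.
    """
    suffix = "@" + domain.lower()
    at_domain = [a.lower() for a in addresses if a.lower().endswith(suffix)]

    # Prefer a Webmaster address if one appears in the retrieved files.
    for address in at_domain:
        if address.split("@", 1)[0] == "webmaster":
            return address

    if not at_domain:
        return None
    # Otherwise fall back to the most commonly used address at the domain.
    return Counter(at_domain).most_common(1)[0][0]


if __name__ == "__main__":
    found = ["bob@example.com", "amy@example.com", "bob@example.com"]
    print(pick_report_address(found, "example.com"))
```

Whatever address is chosen, the report is still only sent after the user confirms it.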

 

Protocol processing:

Protocol                    WebMaven action
HTTP; FTP                   Content-type=text/html:
                              Retrieves the file if it is within the remote
                              path tree; otherwise, creates an out of tree page.
                            Content-type=(all others):
                              Retrieves the file.
MAILTO                      Left unchanged (this will permit deferred delivery
                            of mail messages).
FILE; GOPHER; HTTPS; NEWS;  Creates an out of tree page.
NNTP; TELNET; WAIS
PNM (Real Audio)            Retrieves the referenced URI.
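
The protocol table can be read as a small decision function. The sketch below restates it in Python for illustration only; the function name `protocol_action`, its parameters, and the returned strings are all hypothetical, and protocols not listed in the table are reported as "unspecified" rather than guessed.

```python
from urllib.parse import urlsplit

# Protocols for which the table says an out of tree page is created.
OUT_OF_TREE_SCHEMES = {"file", "gopher", "https", "news", "nntp", "telnet", "wais"}


def protocol_action(uri, content_type=None, in_tree=True):
    """Return the action the protocol table lists for this URI."""
    scheme = urlsplit(uri).scheme.lower()

    if scheme in ("http", "ftp"):
        # Ignore parameters such as "; charset=..." on the content type.
        media_type = (content_type or "").split(";")[0].strip().lower()
        if media_type == "text/html" and not in_tree:
            return "create out of tree page"
        return "retrieve"
    if scheme == "mailto":
        return "leave unchanged"
    if scheme in OUT_OF_TREE_SCHEMES:
        return "create out of tree page"
    if scheme == "pnm":
        return "retrieve"
    return "unspecified"  # not covered by the table


if __name__ == "__main__":
    print(protocol_action("http://www.cfsrexx.com/a.html", "text/html", in_tree=False))
    print(protocol_action("mailto:someone@example.com"))
```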

 

HTTP response processing:

HTTP Response         WebMaven action
300+ - Redirection    Processes the redirected URI.
400+ - Not found;     HTML expected:
500+ - Server error     Creates an out of tree page that identifies the
                        referencing page.
                      Image expected:
                        Inserts the bad image icon.
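
For readers who prefer code to tables, the response handling above amounts to the following status-code check. It is a minimal sketch: the function name `response_action` and the returned strings are hypothetical, and the handling of status codes below 300, or of content that is neither HTML nor an image, is not described in the table and is labelled as such.

```python
def response_action(status, expected):
    """Return the action the response table lists for an HTTP status code.

    expected is "html" or "image", i.e. what the referencing page
    expected to find at the URI.
    """
    if 300 <= status < 400:
        return "process redirected URI"
    if status >= 400:
        if expected == "html":
            return "create out of tree page"
        if expected == "image":
            return "insert bad image icon"
        return "unspecified"  # the table covers only HTML and images
    return "no special handling"  # assumption: codes below 300 are not in the table


if __name__ == "__main__":
    for status, expected in ((302, "html"), (404, "html"), (500, "image")):
        print(status, expected, "->", response_action(status, expected))
```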

 

WebMaven related files:
File name Function
WebMaven.CHK This is WebMaven's checkpoint / restart file. It is created in the local path if WebMaven terminates without finishing the retrieval of all files in the remote tree(s).
WebMave?.DMP In the event that WebMaven terminates abnormally (abends), a .DMP file is created in the WebMaven program directory. DMP files must be ZIP'ed and sent to Program Product Support when reporting the problem.

ZIP'ed .DMP files, along with other relevant material, can be e-mailed to product support or they can be anonymously uploaded to ftp.cfsrexx.com/incoming/. Be sure to use a file name that is unique to you so that it will not conflict with any files uploaded by others.

WebMaven.LOG A WebMaven.LOG file is created / extended in the local path directory each time WebMaven is run. The contents of the .LOG file vary depending upon user-selected options.
WebMaven.!!! This sentinel file exists in the local path while WebMaven is running.
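
A script that works with a downloaded tree might use these files to check whether the local path is safe to read. The sketch below is an illustration, not part of WebMaven: the function name `local_path_state` and the state names are hypothetical, and it assumes the sentinel file is removed when WebMaven stops, which the table implies but does not state for abnormal terminations.

```python
from pathlib import Path


def local_path_state(local_path):
    """Classify a local path by the WebMaven files present in it."""
    root = Path(local_path)
    if (root / "WebMaven.!!!").exists():
        # The sentinel file exists while WebMaven is running.
        return "running"
    if (root / "WebMaven.CHK").exists():
        # A checkpoint file means the last run did not finish.
        return "incomplete"
    return "idle"


if __name__ == "__main__":
    print(local_path_state("."))
```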

 

HTML Tag Processing
HTML Tag Attributes processed
<A> HREF
<APPLET> ARCHIVE; CODE; CODEBASE; PARAM
<AREA> HREF
<BASE> HREF
<BGSOUND> SRC
<BLOCKQUOTE> CITE
<BODY> BACKGROUND; STYLESRC
<DEL> CITE
<EMBED> SRC
<FORM> ACTION; INPUT
<FRAME> LONGDESC; SRC
<HEAD> PROFILE
<IFRAME> LONGDESC; SRC
<IMG> LONGDESC; LOWSRC; SRC; USEMAP
<INPUT> SRC; USEMAP
<INS> CITE
<LINK> HREF
<META> HTTP-EQUIV
<OBJECT> ARCHIVE; CLASSID; CODEBASE; DATA; PARAM; USEMAP
<OPTGROUP> VALUE
<OPTION> VALUE
<PARAM> VALUE
<Q> CITE
<SCRIPT> SRC
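
To show how a tag/attribute table like this one drives link extraction, here is a short sketch built on Python's standard `html.parser` module. It is not WebMaven's parser: the names `LINK_ATTRS`, `LinkCollector`, and `collect_links` are hypothetical, and only a subset of the rows above is included to keep the example short.

```python
from html.parser import HTMLParser

# A subset of the table above: tag -> attributes that may carry a URI.
LINK_ATTRS = {
    "a": {"href"},
    "area": {"href"},
    "base": {"href"},
    "bgsound": {"src"},
    "body": {"background"},
    "frame": {"longdesc", "src"},
    "iframe": {"longdesc", "src"},
    "img": {"longdesc", "lowsrc", "src", "usemap"},
    "link": {"href"},
    "script": {"src"},
}


class LinkCollector(HTMLParser):
    """Collect (tag, attribute, value) for every attribute listed in LINK_ATTRS."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        wanted = LINK_ATTRS.get(tag, ())
        for name, value in attrs:
            if name in wanted and value:
                self.links.append((tag, name, value))


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<BODY BACKGROUND="bg.gif"><A HREF="a.html">x</A><IMG SRC="b.gif" ALT="y"></BODY>'
    for tag, attr, value in collect_links(page):
        print(tag, attr, value)
```

Each collected value would then be resolved against the page's base URI and checked against the remote path(s) before being retrieved.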