URLList Gen v0.1

About
-----

URLList is a very simple script for generating a list of all publicly accessible URLs on a given webserver. It
generates the list using HTTP requests, and scrubs out any pages that report errors (page not found etc.) so
that broken links do not end up in the list.

It will consume quite a lot of bandwidth if not run from the target webserver, so if possible it should be run
on the server. If that is not possible, note that the script does not actually retrieve much: it simply crawls
the target site, so images etc. are indexed but not downloaded.
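
As a rough illustration of the idea described above (checking a URL without downloading it, then dropping
anything that reports an error), here is a minimal sketch. The function names http_status and keep_ok and the
file candidates.txt are hypothetical and are not taken from URLList Gen, which may well do this differently.
It assumes curl and awk are installed.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; URLList Gen itself may work differently.

# Print the HTTP status code for a URL using a HEAD request, so the
# response body (an image, for example) is never downloaded.
http_status() {
    curl -s -o /dev/null -I -w '%{http_code}' "$1"
}

# Read "STATUS URL" pairs on stdin and print only the URLs whose status
# is 2xx, dropping error pages such as 404 (and redirects such as 301).
keep_ok() {
    awk '$1 ~ /^2[0-9][0-9]$/ { print $2 }'
}

# Example use, with candidate URLs listed one per line in candidates.txt:
#   while read -r url; do
#       echo "$(http_status "$url") $url"
#   done < candidates.txt | keep_ok
```

Keeping the status check and the filter separate means the filter can be tried on saved data without touching
the network.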

The software is released under the GNU GPL and is copyright Ben Tasker 2009
Please see LICENSE for more details, or visit http://benscomputer.no-ip.org/LICENSE


Configuration
-------------

The script needs a few variables set before it can be used. Open the program with your favourite text editor
and set the following as required:

WEBADDRESS should point to the web address of the target server. You should use the FQDN, as this is what
search engines will need to access your site; entering a local IP will cause you no end of grief down the line.

URLLISTLOCAT should contain the path (including filename) where you wish to store the URLLIST once it has been
generated. Ideally this should be in a web-facing directory, usually the root directory of your webserver
(/var/www/htdocs or whatever).

TMPLOCAT should be set to the directory in which the program will store temporary files. For some reason it
did not like /tmp on my system, and generated one-line URLLISTs. Any other directory with read and write
access seems fine though.
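
For illustration, the three settings might end up looking something like this. All of the values are
hypothetical placeholders (the hostname, the urllist.txt filename and the temporary directory are not from the
script), and whether WEBADDRESS should include the http:// prefix is an assumption, so check the script before
copying this.

```shell
# Hypothetical example values; adjust each one for your own server.

# FQDN of the target site, not a local IP (the http:// prefix is an assumption).
WEBADDRESS="http://www.example.com"

# Full path, including filename, inside the web root.
URLLISTLOCAT="/var/www/htdocs/urllist.txt"

# Any directory with read and write access, avoiding /tmp.
TMPLOCAT="$HOME/urllist_tmp"

# Create the temporary directory if it does not already exist.
mkdir -p "$TMPLOCAT"
```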


Running the Program
-------------------

To run the program, simply execute the URL_list_gen.sh file from the command line (you may need to make it
executable first). It will not output very much to stdout, but once it has completed you should have a fairly
comprehensive URLLIST on your webserver, ready for use with HTML_sitemap_gen, Yahoo! and Google Sitemap
Generator.
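
Because the script prints so little, a quick check of the result can be handy. The sketch below defines
check_urllist, a hypothetical helper that is not part of URLList Gen. It reports how many lines the list
contains and fails if the list is missing or has one line or fewer (the symptom seen with /tmp). The path in
the usage comment is only an example.

```shell
#!/bin/sh
# check_urllist is a hypothetical helper, not part of URLList Gen.
# It prints the line count of the generated list and returns non-zero
# if the list is missing or contains one line or fewer.
check_urllist() {
    list="$1"
    if [ ! -f "$list" ]; then
        echo "No URL list found at $list" >&2
        return 1
    fi
    # wc -l may pad its output with spaces on some systems, so strip them.
    count=$(wc -l < "$list" | tr -d ' ')
    echo "$list contains $count lines"
    [ "$count" -gt 1 ]
}

# Typical use, assuming the script is in the current directory:
#   chmod +x URL_list_gen.sh
#   ./URL_list_gen.sh
#   check_urllist /var/www/htdocs/urllist.txt
```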