URLList Gen v0.1

About
-----

URLList is a very simple script for generating a list of all publicly
accessible URLs on a given webserver. It builds the list using HTTP
requests and scrubs out any pages that report errors (page not found etc.)
so that broken links are not reproduced in the list.

It can consume quite a lot of bandwidth if it is not run from the target
webserver, so run it on the server itself where possible. If that is not
possible, note that the script does not actually retrieve much: it simply
crawls the target site, so images etc. are not downloaded, although they
are indexed.

The software is released under the GNU GPL and is copyright Ben Tasker 2009.
Please see LICENSE for more details, or visit
http://benscomputer.no-ip.org/LICENSE

Configuration
--------------

The script needs a few variables set before it can be used. Open the
program with your favourite text editor and set the following as required.

WEBADDRESS should point to the web address of the target server. Use the
FQDN, as this is what search engines will need to access your site;
entering a local IP will cause you no end of grief down the line.

URLLISTLOCAT should contain the path (including filename) where you wish to
store the URL list once it has been generated. Theoretically this should be
in a web-facing directory, usually the root directory of your webserver
(/var/www/htdocs or whatever).

TMPLOCAT should be set to the directory in which the program will store
temporary files. For some reason it did not like /tmp on my system and
generated one-line URL lists. Any other directory with read/write access
seems fine though.

Running the Program
-------------------

To run the program, simply execute the URL_list_gen.sh file from the
command line (you may need to make it executable first).

It will not output very much to STDOUT, but once it has completed you
should have a fairly comprehensive URL list on your webserver, ready for
use with HTML_sitemap_gen, Yahoo! and Google Sitemap Generator.
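
The settings described under Configuration can be sanity-checked before a
run. The following is only a rough sketch: check_config is a hypothetical
helper that is not part of URL_list_gen.sh, and the checks (a bare IPv4
test, directory permissions) are assumptions about what "a sensible value"
means rather than anything the script itself enforces.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the three README settings.
# Not part of URL_list_gen.sh; the rules below are illustrative assumptions.
# Note: plain POSIX sh has no "local", so these variables are global.
check_config() {
    webaddress=$1
    urllistlocat=$2
    tmplocat=$3
    status=0

    # Reduce WEBADDRESS to a bare host: drop any scheme, path and port.
    host=${webaddress#*://}
    host=${host%%/*}
    host=${host%%:*}

    # Crude test: a host made only of digits and dots is treated as an IP.
    case $host in
        '')        echo "WEBADDRESS is empty"; status=1 ;;
        *[!0-9.]*) ;;  # contains other characters, assume it is a hostname
        *)         echo "WEBADDRESS looks like an IP, use the FQDN: $host"
                   status=1 ;;
    esac

    # The directory that will hold the URL list must exist and be writable.
    listdir=$(dirname "$urllistlocat")
    if [ ! -d "$listdir" ] || [ ! -w "$listdir" ]; then
        echo "Cannot write URLLISTLOCAT into: $listdir"
        status=1
    fi

    # TMPLOCAT must be an existing directory with read and write access.
    if [ ! -d "$tmplocat" ] || [ ! -r "$tmplocat" ] || [ ! -w "$tmplocat" ]; then
        echo "TMPLOCAT is not a readable, writable directory: $tmplocat"
        status=1
    fi

    return $status
}
```

Called as check_config "$WEBADDRESS" "$URLLISTLOCAT" "$TMPLOCAT" with the
same values you set in the script, it prints one line per problem and
returns non-zero if any were found, so it could be used to gate the
"chmod +x URL_list_gen.sh" and "./URL_list_gen.sh" steps.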