Sitemap Generation Tools

This content was originally published on

This is the main project page for the Sitemap Generation tools.

All code and information is released under the GNU GPL and is copyright Ben Tasker 2009. The HOWTO, however, refers to Google Sitemap Generator, a utility that I neither wrote nor own; it is freely available under the BSD 2.0 License.

For a copy of the GNU GPL please either look within the downloads

The tools are designed to make creating and managing sitemaps far easier on a Linux server. If you use a Content Management System then it probably already generates sitemaps for you. If it doesn't, these tools can save you from doing it manually, or from being tied to the 500 page limit that many of the utilities have.

There is a range of dependencies, but most should already be present. You will need:

  • BASH
  • wget
  • tr
  • awk
  • sed
  • grep
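A quick way to confirm that the tools listed above are installed is sketched below. It is only an illustration and not part of the released scripts; the function name missing_deps is made up for this example.

```shell
# Report any of the named commands that are not on the PATH.
# Prints one missing command per line; prints nothing if all are found.
# (missing_deps is a hypothetical helper, not part of the released tools.)
missing_deps() {
    for cmd in "$@"; do
        # command -v is the POSIX way to test whether a command is available
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
    return 0
}

# Example: check the tools named in the list above
# missing_deps bash wget tr awk sed grep
```

If the example call prints nothing, everything in the list was found.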

    As ever, the software is made freely available with absolutely NO WARRANTY. It does no harm on my system and shouldn't on yours, but I make no claim that it definitely won't. All software is installed and run at your own risk.

    The software consists of three main sections. Each can be run independently, or from within the wrapper (that I haven't yet written, so check back!)
    All the software is designed to be run from a cron job, so set it on a convenient schedule.

    UPDATE: I've discovered a slight issue with the URL List Generation script. For some reason wget can be a bit picky about the use of --spider. If your URL list only contains one line after running the script, try configuring and running this script instead. It retrieves the pages and deletes them afterwards, so this will affect the amount of bandwidth used. Obviously if you are running the script from the target server this will have little impact.
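To illustrate the difference between the two approaches, here is a rough sketch. It is not the code from either released script: the function names, the exact combination of wget options and the log-parsing step are all assumptions made for the example.

```shell
# Crawl in spider mode (pages are not kept), writing wget's log to a file.
# $1 = site URL, $2 = log file. Both are supplied by the caller.
crawl_spider() {
    wget --spider --recursive --no-verbose --output-file="$2" "$1"
}

# Fallback: really download each page, deleting it once it has been fetched.
# This costs bandwidth, as noted above.
crawl_download() {
    wget --recursive --delete-after --no-directories --no-verbose \
         --output-file="$2" "$1"
}

# Pull every unique http(s) URL out of a wget log, one per line.
# No filtering is attempted (e.g. robots.txt or broken links stay in).
extract_urls() {
    grep -Eo 'https?://[^ ]+' "$1" | sort -u
}
```

A run might look like crawl_spider http://www.example.com/ crawl.log followed by extract_urls crawl.log > urllist.txt, switching to crawl_download if the resulting list only contains one line.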

    For information on how to set up and configure the Google Sitemap Generator to run automatically, please read this HOWTO (PDF)

    To generate HTML sitemaps please install, configure and run HTML Sitemap Gen V0.1. Installation instructions are contained within the README file.
    The program will generate an HTML sitemap, separated by file type: HTML is listed first (including PHP etc.), followed by PHP, images, plain text files and finally all other document types.
    The layout of the finished page is defined in a template; the script simply inserts the information. For an example of the output please visit my sitemap
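The grouping by file type could be done along the following lines. This is a simplified sketch rather than the code in HTML Sitemap Gen: the category names, the extension lists and both function names are assumptions, and no HTML escaping or template handling is attempted.

```shell
# Classify a URL by its file extension, ignoring any query string.
# Categories and extension lists are illustrative only. Matching is
# case-sensitive, and a bare host with no trailing slash falls into "other".
classify_url() {
    clean=${1%%\?*}                     # strip "?..." if present
    case "$clean" in
        */|*.html|*.htm|*.php)      echo "html"  ;;
        *.png|*.jpg|*.jpeg|*.gif)   echo "image" ;;
        *.txt)                      echo "text"  ;;
        *)                          echo "other" ;;
    esac
}

# Print an <li> entry for every URL in file $1 that falls in category $2.
list_items() {
    while IFS= read -r url; do
        [ "$(classify_url "$url")" = "$2" ] &&
            printf '<li><a href="%s">%s</a></li>\n' "$url" "$url"
    done < "$1"
    return 0
}
```

Calling list_items once per category, in the order the page should show them, gives blocks of list entries that could then be dropped into a template.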

    HTML_Gen_V0.1.tar.gz README CODE
    MD5 Checksum
    086b710d31de1412d4b93be7b5db6455 HTML_sitemap_gen_V0.1.tar.gz

    To generate the URLLIST upon which the above two programs rely (and which Yahoo! uses), please install, configure and run URL List Generator V0.1. Installation instructions are included within the README file.

    URL_List_gen_v0.1.tar.gz README CODE
    MD5 Checksum
    ca9fe330ea1b202665e0abe6aedfb87c URL_List_gen_v0.1.tar.gz
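To check a download against its published checksum, something like the sketch below should do. It assumes the md5sum utility from GNU coreutils is installed, and verify_md5 is just a name chosen for this example.

```shell
# Succeed only if the MD5 digest of file $1 equals the expected hash $2.
# Assumes md5sum (GNU coreutils) is available.
verify_md5() {
    actual=$(md5sum < "$1" | awk '{print $1}')
    [ "$actual" = "$2" ]
}

# Example, using a checksum published above:
# verify_md5 URL_List_gen_v0.1.tar.gz ca9fe330ea1b202665e0abe6aedfb87c \
#     && echo "checksum OK"
```

If the function fails, the archive is either corrupt or not the file that was published, so don't unpack it.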

    If utilising all three programs in cron jobs, it is worth noting that there is no specific order in which they must be called. There is, though, very little benefit in calling a program that relies on a URLLIST until the list itself has been refreshed, so most people would call the URL List Gen program first. Whilst it is unlikely that Google would ever crawl your site immediately after receiving a ping, on the off chance, it would be wise to wait until the HTML sitemap is created before running the Google Sitemap Generator.
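One way to enforce that order from a single cron entry is a small runner that stops as soon as a step fails. This is a sketch only: run_in_order is a made-up helper and the script paths in the example are placeholders, not the names the tools install under.

```shell
# Run each command in turn, stopping at the first one that fails,
# so later steps never work from a stale or missing URL list.
# (run_in_order is a hypothetical helper for illustration.)
run_in_order() {
    for step in "$@"; do
        "$step" || { echo "step failed: $step" >&2; return 1; }
    done
}

# Example with placeholder paths (adjust to wherever you installed things):
# run_in_order /path/to/url_list_gen \
#              /path/to/html_sitemap_gen \
#              /path/to/google_sitemap_gen
```

Each step is run as a bare command with no arguments, so anything that needs options would have to be wrapped in a small script of its own.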

    Be aware that several of the scripts write a lot to STDOUT, so your cron log may fill up quite quickly. The wrapper script will give you the option to specify a separate log file to use, and will also call the programs in the most logical order.

    The Generator Wrapper is intended to be called from a cron job (or the command line) and simply calls the programs in the most logical order. It also reduces the amount pushed to STDOUT: you can either specify a log file, or most output will be sent to /dev/null.

    Usage is as follows:
    sitemap_wrapper -h # Prints a short synopsis
    sitemap_wrapper # Sends all log information to /dev/null
    sitemap_wrapper /path/to/some/file # Sends all logging information to the specified file.
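The argument handling described above could be sketched like this. It is not the wrapper's actual source: wrapper_main, the synopsis text and the commented-out generator call are all assumptions for illustration.

```shell
# Hypothetical wrapper body following the usage above:
#   -h          prints a short synopsis
#   no argument sends log output to /dev/null
#   a path      sends log output to that file
wrapper_main() {
    case "${1:-}" in
        -h) echo "Usage: sitemap_wrapper [-h] [/path/to/logfile]"; return 0 ;;
    esac
    log=${1:-/dev/null}
    # The generator calls would go here, each appending to the log, e.g.:
    # /path/to/url_list_gen >> "$log" 2>&1
    echo "logging to $log"
}
```

Defaulting to /dev/null keeps the cron log quiet unless a log file is explicitly asked for.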

    Sitemap Wrapper