Keeping Hitcounts accurate when using an NGinx Caching Proxy

In previous documentation, we've configured sites to use NGinx as a Reverse Caching Proxy, leading to hugely improved response times on popular content. We've also implemented a custom configuration so that we can refresh the cache periodically, ensuring that dynamic content (such as Twitter modules) updates.

One thing we haven't done as yet, though, is to address the issue of internal hitcounts. We've looked specifically at using NGinx with Joomla, and noted that a side effect would be inaccurate hitcounts displayed within Joomla (which remains true even when using the internal page caching).

In this documentation, we'll be implementing a small script to ensure that hits served from the cache are still recorded within Joomla (or Wordpress, Drupal - whatever you happen to be using), albeit a little while after they happen.

Most people won't need a realtime view of hits (if you do, you'll need a different implementation to the one we're going to develop), so we're going to have the hitcounts update hourly.

Put simply, the aim is to parse NGinx's access logs and place a cache bypassing request for each cache hit that we find - as we're running hourly we'll only process requests which occurred in the last hour.

 

Log Format

Your log format may well differ to mine, so you'll need to adjust to suit. The most important thing, however, is that we have a log of whether or not the request was a cache hit. As a result, I've added the following to the http section of /etc/nginx/nginx.conf

log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" '
'"$http_host" CACHE_$upstream_cache_status'; 

I've included $http_host because I have NGinx log to a single access log. If you've split your logs by host, remember to adjust the processing script to reflect that.

Note: It makes life easier later if you wrap [$time_local] in double quotes, but (for some reason) I find it harder to manually read the log like that

We'll assume you're logging to /var/log/nginx/access.log, if not, adjust to suit.

 

Cache Bypass Headers

In a previous tutorial, we added support for a cache bypass header - so that we could force the cache to update. We don't necessarily want to do that here, it'd be better to bypass the cache and not update the cache with the response - we may well be placing numerous requests for a single page within a very short timespan.

Within the server block for your site (mine's in /etc/nginx/conf.d/virtual.conf) find the line containing proxy_cache (there may be multiple if you have multiple sites configured). Look around it for directives proxy_no_cache and proxy_cache_bypass. If they exist you'll want to append to them, if not insert the lines below

proxy_no_cache $http_x_logparsing_nocache;
proxy_cache_bypass $http_x_logparsing_nocache;

We've now stated that if the header X-LOGPARSING-NOCACHE exists, the cache should be completely ignored. This is important because we want the request to hit the origin site - otherwise the hit counts won't update.

 

Parsing the Logs

All we need to do now, is identify the requests that should have increased the hits, but didn't because they came from the cache (as we don't want to duplicate hits for requests which did increase the hitcount).

To parse logs in the format I'm using, I created a simple PHP script. The idea is to let you perform custom actions based on the log entry - in this case we want to repeat the request but completely bypass the cache. We'll also be setting a custom user-agent so that we can easily identify those requests in the logs.

You can grab the script from GitHub or download it here

We need to create a handler for the domain we're working with, so within the directory 'Domains' we create a new directory using our domain name (as it appears in the NGinx logs), so I'd create www.bentasker.co.uk.

Within this new directory we create process-entry.php with the following content

<?php
/** NGinx Log Actions
*
*
* @copyright 2013 B Tasker, All Rights Reserved
* @license GNU GPL V2
*
* Domain specific handler for www.bentasker.co.uk
*
*/

$domain="http://www.bentasker.co.uk"; // Change to suit
// Todo, implement support for the log_struct array
$cache_status = $data[12];

// Don't place requests for these pages
$excluded_paths = array("/","/index.php","/documentation/linux/",
"/documentation/","/documentation/joomla/","/blog/","/blog/opinion/");

// Work out if we need to place a request (let's not perpetuate any errors)
if ($cache_status == 'CACHE_HIT' && $data[6] == 200 && substr($data[5],0,3) == 'GET'){
$request = substr($data[5],4);
$request = explode(" ",$request);

if (!in_array($request[0],$excluded_paths)){

$ch = curl_init($domain.$request[0]);
curl_setopt($ch, CURLOPT_HTTPHEADER,array("X-LOGPARSING-NOCACHE: True"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,5);
curl_setopt($ch,CURLOPT_USERAGENT,"Log-Parse Cache-Hit Updater V1.1");
$resp = curl_exec($ch);
curl_close($ch);

// We don't actually want to do anything with the response.
unset($resp);
}

}

 

Save the file. Next we need to edit the script's main configuration, so open config.php in a text editor.

We need to set the location of the log to parse, the structure of the log and the date format used

<?php
/** NGinx Log Actions
*
*
* @copyright 2013 B Tasker, All Rights Reserved
* @license GNU GPL V2
*
* Configuration file
*

// Log directory
$log_location = '/var/log/nginx';
// Log filename
$log_name = 'access.log';

// We need to define the log record structure - you can grab this from your NGinx config file
// Note: If you've include time_local (and haven't wrapped it in quotes), follow it with an entry called timezone
$log_struct = array('remote_addr','-','remote_user', 'time_local','timezone','request','status',
'body_bytes_sent','http_referer','http_user_agent','http_x_forwarded_for',
'http_host','upstream_cache_status');

// What (PHP date) format are the dates in?
//Don't go any further than the end of the time (excluding minutes & seconds)
// Example for Nginx's 'local time' "[d/M/Y:H:"
$date_format = "[d/M/Y:H:";

I mentioned earlier that it makes life much easier if you wrap [$time_local] within double quotes - if you have then you don't need to include 'timezone' above. fgetcsv will only accept a single character as a field enclosure, and so breaks the date into two fields.

An alternative, if you'd prefer, would be to use tabs as the field delimiter when configuring NGinx.

 

Running the script

All that remains is to configure the server to run the script at regular intervals. We'll do this using cron.

I have a separate user set up - nginxloguser - but the most important thing is that the user you select needs to be able to read NGinx's logs, otherwise there'll be nothing to process.

If you're planning on running the script as root, ensure you move it to somewhere that only root can access it - the last thing you want is an unprivileged user editing a script that will automatically run with superuser privileges: bad things will happen!

To be safe, it's probably wise to take this precaution whichever user is running the script, so the best bet is to move it to their home directory - replace my username with the one appropriate to your system.

mv nginx_log_actions /home/nginxloguser/
chown -R nginxloguser /home/nginxloguser/nginx_log_actions

# Don't do this if you're running as root
# Prevent the user from being to update the script
chmod -R o-w  /home/nginxloguser/nginx_log_actions

# Switch to the user
su nginxloguser
# Run it once to make sure all is working properly
php /home/nginxloguser/nginx_log_actions/run_logs.php

# If all went well, add to the crontab
crontab -e

# Add the following, run once an hour at 1 minute past
1 * * * * php /home/nginxloguser/nginx_log_actions/run_logs.php
# Save and close

The script will now automatically run and you should see the hits update within whichever web application you're running. If you want to capture the output (at least for a little while) change the crontab line to

* * * php /home/nginxloguser/nginx_log_actions/run_logs.php | mail -s "Logs Parse" bob@example.com

 

Troubleshooting

The 'issue' you're most likely to encounter is that hits don't seem to be updating. If that's the case there are two things to check

  • Are there actually any cache_hits in your log file
  • Is your NGinx configuration actually letting your requests bypass the cache?

Other than that, check that you haven't introduced syntax errors and that the log schema you defined in config.php is correct.

One other issue that may occur - pay close attention to when your logfiles are rotated. You may need to adjust the interval you run at.

 

Conclusion

We've now configured a simple script to read our NGinx access logs line by line, placing a cache busting request for each recorded cache hit. Whilst it doesn't provide a realtime view, the hit counter within our web application can now remain accurate to within an hour.

It's not the most efficient implementation, and those with spectacularly busy websites may need to take the basic concept and adjust the code to squeeze some efficiency savings from it.