Tightening Controls over Public Activity Feeds on Mastodon
At times, the last couple of weeks have been fairly busy for privacy on Mastodon, with two different but interrelated concerns rearing their heads.
The first was Boyter's Mastinator.com, a federated solution which allowed the following of arbitrary accounts, potentially enabling others to circumvent blocks as well as being problematic for the privacy of "Follower Only" posts.
The second was Matt Cloy's #fediblock
post about a non-federated full-text search engine:
Inevitably, this provoked a strong reaction: full-text search is a hot topic, having historically been used to help target harrassment campaigns, something personally experienced by many of those objecting.
For avoidance of doubt, the solution was never publicly available (Matt has confirmed it now never will be and was only ever used for testing).
@cloy
's implementation worked by placing requests to Mastodon's public API endpoints and so could not simply be defederated in the way that an ActivityPub implementation like Mastinator might be addressed.
As noted in the opening #fediblock
message, the indexer's requests originate from various (changeable) IPs, so relying on simple IP blocklists would be ineffective (and even if that were not the case, would still serve only to block this particular instance). Although easily misinterpreted as a deliberate circumvention attempt, this kind of IP cycling is extremely common where public clouds are used to run workloads (whether on something like an AWS EC2 instance, or a container in AWS Fargate), which probably goes some way to explain the suggestion of using a common signal: robots.txt
.
This is clearly an example of a different threat model to that of Mastinator: it involves an external entity requesting and indexing the responses of public API endpoints, whilst potentially taking measures to circumvent targeted blocking attempts (even if that wasn't occurring here).
Of the two threats, it's implementations like @cloy
's that I intended to focus on in this post.
There's no denying that the way in which the issue was raised was extremely counter-productive, but there is some truth in Matt's later arguments that others are already quietly doing this and that Mastodon perhaps doesn't do enough to restrict access to these feeds.
Ultimately, Matt's account at techhub.social
was suspended and then deleted: an unfortunate (if predictable) result of perceived rage-baiting.
The whole affair prompted me to take a deeper look at exactly how such an implementation might work, in order to see what can be done to try and prevent (or at least mitigate) similar attempts in a way that delivers a better success rate than that achievable by reactively blocking IPs.
During the process, I reviewed my own instance's logs and stumbled across a handful of crawlers that are (or were) periodically scraping my instance's public APIs for whatever ends.
In this post, I'll provide details of those crawlers (so that other instance admims can proactively block any that they weren't already aware of), the defensive tools that Mastodon provides and a method for covering the gaps that Mastodon unfortunately leaves open.
Contents
- Individual User Protections
- Mastodon
- An Observed Scraper
- Requiring Authentication For Trends
- Active Checks
- Additional Protection Mechanisms
- Larger Instance Admin Considerations
- Cloudflare Worker Example
- Observed Scraper IPs
- Conclusion
Individual User Protections
This post is quite heavily focused on the instance level, and as such isn't really targeted at non-admin users, who generally lack the privileges to change anything detailed here.
However, there are things that an individual user can do to help (slightly) reduce the likelihood of their accounts being indexed/profiled
- Opt Out of Search Engine Indexing
- Consider disabling display of your social graph
- Consider whether you want to require approval of Follow Requests (
Preferences
->Profile
->Require follow requests
)
They may also want to add #nobots
to their profile as it's a common signal (if somewhat likely to be ignored by the sort of operations that we should be most concerned about).
Mastodon
Most of this section will probably be familiar to instance admins, but it's worth laying out a high level overview of what Mastodon provides, so that we know what we're working with.
Public API Endpoints
Mastodon exposes a number of public API endpoints, some of which you probably want to remain publicly available, for example
/api/v1/custom_emojis
/api/v2/instance
However, there are also endpoints which detail activity on the instance
- Trending tags:
/api/v1/trends/tags
- Trending toots:
/api/v1/trends/statuses
- Local feed:
/api/v1/timelines/public?local=true&only_media=false
- Federated Feed:
/api/v1/timelines/public?remote=false&allow_local_only=false&only_media=false
These are called to populate the tabs on the instance's default page
Requests are also made to them when logged in users view trends/hashtags etc, whether from the Web UI or a mobile app.
Even without the existence of indexers, these feeds are potentially problematic, and I've written previously about how the existence of these public feeds can reveal unwanted information, particularly on smaller instances.
Mastodon's Controls
Mastodon does include a configuration option to shield some of this information from public view.
Within the server preferences is the option (Preferences
-> Administration
-> Server Settings
-> Discovery
-> Allow unauthenticated access to public timelines
)
However, this only disables unauthenticated viewing of Local
and Federated
.
Crawlers will still be able to use the trends/
endpoints to access information about trending content on your instance. In practice, this means that the Mastodon instance is doing half the scrapers work for them by sorting and exposing only the most popular content (sorting the wheat from the chaff, as it were).
Mastodon 4.x introduced a new configuration variable: DISALLOW_UNAUTHENTICATED_API_ACCESS
(thanks to @jerry
for pointing me towards it).
When enabled, the instance will require all API calls to be authenticated.
Although this prevents unauthenticated access to the trends
endpoints it also means that links to profiles and posts will not work for unauthenticated users.
Whilst the setting obviously provides a much higher level of protection, it comes at the cost of some of the usual workflows:
- Users from other instances will be unable to view anything after selecting
Open Original Page
in the Web UI/App for one of your users (so they won't be able to see interactions that their instance doesn't display). - Admins on other servers won't be able to look at the overall behaviour of a user on your server whilst investigating an abuse report, or worse a
#fediblock
report (I've a horrible feeling that some of the less pleasant instances might start enabling this setting for exactly that reason).
As with any new functionality, there have also been reports of issues with its initial implementation.
Still, it is an option for admins who want the strongest level of protection and are willing to tolerate the drawbacks.
An Observed Scraper
When first looking into this, I had already unticked Allow unauthenticated access to public timelines
, preventing unauthenticated access to the Local
and Federated
feeds and so had assumed that I wouldn't find much in terms of successful scraping/crawling attempts in my logs.
I was wrong... I found log entries showing a crawler periodically pulling information on trends from my instance
129.153.55.48 - - [02/Jan/2023:06:59:13 +0000] "GET /api/v1/trends/statuses?limit=40&offset=0 HTTP/1.1" 200 3342 "-" "axios/1.2.1" "-" "mastodon.bentasker.co.uk" CACHE_- 0.129 mikasa -
129.153.55.48 - - [02/Jan/2023:07:18:30 +0000] "GET /api/v1/trends/statuses?limit=40&offset=0 HTTP/1.1" 200 2259 "-" "axios/1.2.1" "-" "mastodon.bentasker.co.uk" CACHE_- 0.082 mikasa -
The crawler running at 129.153.55.48
hits my instance every 20 minutes (at rough 8 minute offsets) and fetches the trending toot list (the user-agent implies it's built using Axios which is a NodeJS HTTP Request library).
There are occasional requests to fetch a toot (presumably because either my, or someone else's trending list) included it.
129.153.55.48 - - [28/Dec/2022:00:58:16 +0000] "GET /api/v1/trends/statuses?limit=40&offset=40 HTTP/1.1" 200 33 "-" "axios/1.2.1" "-" "mastodon.bentasker.co.uk" CACHE_- 0.050 mikasa - "-" "-" "-"
129.153.55.48 - - [28/Dec/2022:00:58:16 +0000] "GET /api/v1/trends/statuses?limit=40&offset=0 HTTP/1.1" 200 7855 "-" "axios/1.2.1" "-" "mastodon.bentasker.co.uk" CACHE_- 0.143 mikasa - "-" "-" "-"
129.153.55.48 - - [28/Dec/2022:01:00:26 +0000] "GET /api/v1/statuses/109585997399946330 HTTP/1.1" 200 1421 "-" "axios/1.2.1""-" "mastodon.bentasker.co.uk" CACHE_- 0.098 mikasa - "-" "-" "-"
The delay between requests might be a sign that the toot was referred to in a trending list on another instance, but may also just be a sign that the crawler adds toots to a queue rather than fetching them immediately.
As is my wont, I've done some osint on the system behind this IP. Sharing that is not the purpose of this post, however I have confirmed that
- it's not a residential IP, so the information above shouldn't help directly identify anybody
- it does not appear to be related to Matt Cloy's proposed crawler (the portal for which, incidentally, is no longer online)
I've spoken privately with a number of instance admins and have confirmed that this crawler shows up in the logs of quite a range of instances.
The Crawler's Purpose
There are a wide range of organisations interested in social media activity, most of which are pursuing one commercial model or another (I don't doubt though, that there are some intelligence agenices mixed in).
Amongst those will be organisations who specialise in brand reputation monitoring, as well as those who create social media walls for display in hotels, service stations and news feeds.
The observed behaviour of the crawler certainly fits the behavioural pattern you'd expect for something like a social media wall:
- Hit known instances and grab information on trending toots/tags
- Retrieve those toots (or a selection of)
- Compile results and display
Of course, as others pointed out during the #fediblock
discussion, this necessarily entails republishing content without the author's permission. This is not corporate social media, there is no perpetual worldwide license authorising re-use of user's toots and doing so is often considered unwelcome.
Crawling trends
rather than Local
or Federated
could also, theoretically, be used to address data reduction issues whilst indexing the fediverse. Using trending
as a signal allows an indexer to only use storage and compute resources indexing higher profile content (whether that results in a useful search experience depends on the intended use of the index).
I'm inclined to think that this particular crawler probably best fits the media wall use-case, especially given the regularity with which it checks back against even my (tiny) Mastodon instance.
IP Blocks
The obvious first impulse, is to block the crawler's IP: it's a known bad actor and there's no point wasting time sending it a SYN-ACK
, much less handling SSL handshakes and accepting a request.
However, blocking by IP is an incredibly limited approach and has long been unable to give any real assurance: circumventing IP blocks has never been particularly difficult and yet still manages to be far easier and cheaper than it's ever been.
The increased existence of CGNAT on ISP networks also means that there's a non-zero probability that any given IP block might accidentally block more than expected - potentially leading to legitimate users accidentally being denied service.
Even without those (significant) drawbacks, manually curated IP blocklists also don't do very much to protect instances against the next bad actor to pop up using the same technique - it's an entirely reactive defence, because the behaviour needs to occur for you to become aware and block.
So, whilst IP Blocking is an easy way to achieve short-term relief, where possible, it should always be followed up with behavioural rules.
Requiring Authentication For Trends
Mastodon doesn't currently provide a means to require authentication for the trends feed, other than by enabling DISALLOW_UNAUTHENTICATED_API_ACCESS
.
Whilst it's more than possible to fork Mastodon and implement additional protections, that approach isn't particular appetising: it'd mean committing to maintaining the fork everytime a new Mastodon instance was released.
Instead, I wanted to be able to place protections in front of Mastodon - on a CDN, a Web Application Firewall (WAF) or even just a reverse proxy - so that future software updates aren't delayed by the need to merge changes back into a fork.
Browser Behaviour
Using a browser's Developer Tools we can see that the way that Mastodon's UI forms it's requests for public endpoints differs depending on whether the user is authenticated or not.
For ease of comparison, the Copy as cURL
option has been used to extract the requests:
Unauthenticated
curl 'https://mastodon.bentasker.co.uk/api/v1/trends/statuses' \
-H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0' \
-H 'Accept: application/json, text/plain, */*' \
-H 'Accept-Language: en-GB,en;q=0.5' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Referer: https://mastodon.bentasker.co.uk/' \
-H 'X-CSRF-Token: <a token>' \
-H 'DNT: 1' \
-H 'Connection: keep-alive' \
-H 'Cookie: _mastodon_session=<a session id>' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Pragma: no-cache' \
-H 'Cache-Control: no-cache' \
-H 'TE: trailers'
Authenticated
curl 'https://mastodon.bentasker.co.uk/api/v1/trends/statuses' \
-H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0' \
-H 'Accept: application/json, text/plain, */*' \
-H 'Accept-Language: en-GB,en;q=0.5' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Referer: https://mastodon.bentasker.co.uk/' \
-H 'X-CSRF-Token: <a token>' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Authorization: Bearer <an auth token>' \
-H 'Connection: keep-alive' \
-H 'Cookie: _session_id=<another session id>; _mastodon_session=<a session id>' \
-H 'Pragma: no-cache' \
-H 'Cache-Control: no-cache' \
-H 'TE: trailers'
Despite the API not actually requiring authentication, the Authenticated request provides two extra items
- There's an extra cookie
session_id
- An API auth token is provided in the
Authorization
header
This may be a happy accident, or it might be that the result varies based on the user's ID. For our purposes, it doesn't really matter which, the important thing is that it means there's a way to differentiate between an authenticated and an unauthenticated request at the HTTP level.
Designing Auth Rules
The availability of these request specifics means that we can design some criteria to use so that a WAF rule can decide whether to permit a request or not:
In psuedo-code we could write a rule that looks like
if req.path == "/api/v1/trends/statuses":
if "authorization" not in req.headers:
deny()
if "_session_id" not in req.cookies:
deny()
allow()
The request items are also present in authenticated calls to the Local
and Federated
views, so the same ruleset can potentially be used to add an additional layer of protection to those too.
A ruleset like this should be absolutely trivial to configure in a variety of WAF products.
Testing for evasion
Before embarking on that path though, I wanted to check how easily this simple ruleset could be evaded.
Unfortunately there's a problem: testing for existence isn't enough.
If an invalid authentication token (or session id) is included in the request, the WAF will detects it's presence and proxy the request through to Mastodon. Unfortunately, rather than rejecting the request, Mastodon ignores the invalid items and returns the requested feed (most likely, on those endpoints, it's not actually checking the auth token's presence or validity at all).
This means that our simple ruleset can be circumvented by simply including nonsense
curl 'https://mastodon.bentasker.co.uk/api/v1/trends/statuses' \
-H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0' \
-H 'Accept: application/json, text/plain, */*' \
-H 'Accept-Language: en-GB,en;q=0.5' \
-H 'Referer: https://mastodon.bentasker.co.uk/' \
-H 'X-CSRF-Token: <a token>' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'Authorization: Bearer na-na-na-na-na-notarealtoken' \
-H 'Connection: keep-alive' \
-H 'Cookie: _session_id=geeIbetYouWishThisWereRealMrAdmin; _mastodon_session=<a session id>' \
-H 'Pragma: no-cache' \
-H 'Cache-Control: no-cache' \
-H 'TE: trailers'
This doesn't completely invalidate the ruleset: much like moving SSH from port 22, despite being easily defeated, it raises the bar a little, excluding some of the lower hanging fruit.
Of course, the more widely adopted the ruleset, the more likely that bot authors will include this simple circumvention.
Active Checks
Whilst it's dissapointing the Mastodon doesn't reject invalid tokens, there is still a path forwards: Rather than simply requiring the presence of a token, we should instead require the presence of a valid token.
Tokens are minted per-session rather than per-request, so our WAF can check validity by using it to make an upstream request to an authenticated API endpoint before placing the real request.
This would mean request flows something like the following
Valid user:
-------------------------
| Mastodon |
-------------------------
^ 3.| ^ 5. |
| | | |
GET | GET |
/test | (token) |
(token) | | |
| 200 | 200
| | | |
2.| V 4.| V
-------------------------
| WAF |
-------------------------
^ 6.|
| |
GET 200
(token) |
| |
1.| V
-------------------------
| Client |
-------------------------
-------------------------------
Bot:
-------------------------
| Mastodon |
-------------------------
-------------------------
| WAF |
-------------------------
^ 2.|
| |
GET GTFO
| |
1.| v
-------------------------
| Bot |
-------------------------
Testing the token also means that we no longer need to test for the presence of a _session_id
cookie: having a valid token is sufficient to prove access to the instance (Mastodon's API accepts it, so we should too).
However, this approach is not without it's dawbacks:
Using an active check makes implementation a little harder, because rather than a simple rules-based WAF, we now need something capable of running dynamic code against requests.
There is a small additional risk inherent in the design: for each (valid) downstream request to the protected API endpoint, the WAF would make two requests (the token-test probe and the real request), providing a small amplification vector to someone wanting to try and drive up origin load.
There are, however, mitigations which can be designed in
- Endpoints like
/api/v1/trends/statuses
should be fairly cacheable, so the second response could be served from a cache, preventing it from contributing load to the origin - A cache could also be used for the probe results so that the same token is not repeatedly re-checked
I prefer the second mitigation, because it's not so subject to future changes in Mastodon's behaviour.
In addition to being mitigable, the range of users able to exploit this should be relatively small: By definition, to achieve amplification, a valid token would need to be provided (otherwise the second upstream request will never be made), limiting the subset of possible adversaries to existing user accounts, rather than internet randomers. So any potential attacker would need to have (or compromise) an account on the targeted instance.
LUA Implementation
I've historically built quite a lot using OpenResty. Not only can it run LUA, but it can also usually act as a drop in replacement for a vanilla NGINX build, meaning that an implementation built with it would be potentially viable for any instance admins who currently front Mastodon with Nginx.
I've previously done some work on (and written about) analysing the results of a WAF ruleset written in LUA, which brings with it a lazy bonus: I wouldn't need to design and build a reporting solution, because I already have one.
Given the arguments in it's favour, I decided to build the first implementation using LUA and (based on little more than a finger in the air) to have it call the Suggestions API endpoint in order to verify that the token provided in the request from downstream is valid.
Doing things slightly backwards, to deploy, I did the following
mkdir -p /etc/nginx/domains.d/LUA/resty
cd /etc/nginx/domains.d/LUA/resty
# Get Deps
wget https://github.com/ledgetech/lua-resty-http/raw/master/lib/resty/http.lua \
https://github.com/ledgetech/lua-resty-http/raw/master/lib/resty/http_headers.lua \
https://github.com/ledgetech/lua-resty-http/raw/master/lib/resty/http_connect.lua
The following LUA was then written to implement the ruleset, saved to disk as /etc/nginx/domains.d/LUA/enforce_mastodon.lua
--
-- enforce_mastodon.lua
--
-- Nginx config requirements
--
-- Following vars must be defined in Nginx config
--
-- $origin Origin IP domain
-- $origin_port Origin Port
-- $origin_host_header the host header to pass through
-- $origin_do_ssl set to "no" to disable upstream SSL
--
-- It also requires that a shared dict called mastowaf_cache have
-- been defined
-- We rely on https://github.com/ledgetech/lua-resty-http
-- to make the upstream request
local http = require("resty.http")
function place_api_request(ngx, auth_header)
-- Place a request to the origin using the /api/v2/suggestions endpoint
--
-- This path requires authentication, so allows us to verify that a provided
-- authentication token is valid
-- Initiate the HTTP connector
local httpc = http.new()
ok,err = httpc:connect(ngx.var.origin,ngx.var.origin_port)
if not ok
then
ngx.log(ngx.ERR,"Connection Failed")
return false
end
if ngx.var.origin_do_ssl ~= "no" then
session, err = httpc:ssl_handshake(False, server, false)
if err ~= nil then
ngx.log(ngx.ERR,"SSL Handshake Failed")
return false
end
end
headers = {
["user-agent"] = "WAF Probe",
host = ngx.var.origin_host_header,
Authorization = auth_header
}
local res, err = httpc:request {
path = "/api/v2/suggestions",
method = 'GET',
headers = headers
}
-- We're done with the connection, send to keepalive pool
httpc:set_keepalive()
-- Check the connection worked
if not res then
ngx.log(ngx.ERR,"Upstream Request Failed")
return false
end
-- Check the status
if res.status == 200
then
-- Authorised!
return true
end
ngx.log(ngx.ERR, res.status)
return false
end
function validateRequest(ngx)
-- The main work horse
--
-- Take the Authorization from the request
-- Deny if there is none
-- Check cache for that token and return the result
-- If needed, place upstream request to check that the token is valid
-- Return true to allow the request, false to deny it
local auth_header = ngx.var.http_authorization
-- Is the header empty/absent?
if auth_header == nil then
ngx.log(ngx.ERR,"No Auth Header")
ngx.header["X-Fail"] = "mastoapi-no-auth"
return false
end
-- Check the cache
local cache = ngx.shared.mastowaf_cache
local cachekey = "auth-" .. auth_header
local e = cache:get(cachekey)
-- See if we hit the cache
if e ~= nil then
if e == "true" then
return true
end
ngx.header["X-Fail"] = "mastoapi-cached-deny"
return false
end
-- We got a value, do a test request with it
local api_req = place_api_request(ngx, auth_header)
if api_req ~= true then
ngx.log(ngx.ERR,"Failed to validate Auth Token")
ngx.header["X-Fail"] = "mastoapi-token-invalid"
return false
end
-- Update the cache
-- We cache results for 20 seconds
cache:set(cachekey,tostring(api_req),20)
-- Authorise the request
return true
end
ngx.log(ngx.ERR,"Loaded")
if validateRequest(ngx) ~= true then
ngx.log(ngx.ERR,"Blocking unauthorised request")
ngx.header["X-Denied-By"] = "edge mastodon_api_enforce"
ngx.status = 403
ngx.exit(403)
end
A downloadable copy of this script is available at https://github.com/bentasker/article_scripts/tree/main/restricting-unauthenticated-access-to-mastodons-public-feeds/LUA
The lines adding response headers X-Fail
and X-Denied-By
can safely be disabled, these headers are logged and used by my WAF analysis setup.
In order to work, the shared cache needs to be enabled in the http
section of nginx.conf
:
lua_shared_dict mastowaf_cache 10m;
This defines a 10MB shared memory area for the shared dictionary mastowaf_cache
. If/when it's full items will be LRU'd out.
Because I've deployed into a custom path, nginx
also needs to be told where to look for the lua-resty-http
dependency:
# Use my custom LUA directory for libraries
lua_package_path '/etc/nginx/domains.d/LUA/?.lua;;';
Finally, within the server
block responsible for proxying to Mastodon I included some location blocks which define config and then define the LUA as an authentication provider
location ~* ^/api/v1/trends/(statuses|tags) {
set $origin 1.2.3.4;
set $origin_port 443;
set $origin_host_header mastodon.bentasker.co.uk;
set $origin_do_ssl yes;
access_by_lua_file /etc/nginx/domains.d/LUA/enforce_mastodon.lua;
# Do proxying stuff
}
location /api/v1/timelines/public {
set $origin 1.2.3.4;
set $origin_port 443;
set $origin_host_header mastodon.bentasker.co.uk;
set $origin_do_ssl yes;
access_by_lua_file /etc/nginx/domains.d/LUA/enforce_mastodon.lua;
# Do proxying stuff
}
I added these on my CDN, but if this were being added to the config I describe in Running a Masto Instance it would look like this
server {
listen 443 ssl http2;
listen [::]:443 ssl http2;
root /mnt/none;
index index.html index.htm;
server_name mastodon.bentasker.co.uk; # Replace with your domain name
ssl on;
# Replace your domain in these paths
ssl_certificate /etc/letsencrypt/live/mastodon.bentasker.co.uk/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/mastodon.bentasker.co.uk/privkey.pem;
ssl_session_timeout 5m;
ssl_prefer_server_ciphers On;
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
absolute_redirect off;
server_name_in_redirect off;
error_page 404 /404.html;
error_page 410 /410.html;
location / {
proxy_set_header Host $http_host;
proxy_set_header X-Forwarded-Proto https;
proxy_pass http://web:3000;
}
location ~* ^/api/v1/trends/(statuses|tags) {
set $origin 1.2.3.4;
set $origin_port 443;
set $origin_host_header mastodon.bentasker.co.uk;
set $origin_do_ssl yes;
access_by_lua_file /etc/nginx/domains.d/LUA/enforce_mastodon.lua;
proxy_set_header Host $http_host;
proxy_set_header X-Forwarded-Proto https;
proxy_pass http://web:3000;
}
location /api/v1/timelines/public {
set $origin 1.2.3.4;
set $origin_port 443;
set $origin_host_header mastodon.bentasker.co.uk;
set $origin_do_ssl yes;
access_by_lua_file /etc/nginx/domains.d/LUA/enforce_mastodon.lua;
proxy_set_header Host $http_host;
proxy_set_header X-Forwarded-Proto https;
proxy_pass http://web:3000;
}
location ^~ /api/v1/streaming {
proxy_set_header Host $http_host;
proxy_set_header X-Forwarded-Proto https;
proxy_pass http://streaming:4000;
proxy_buffering off;
proxy_redirect off;
proxy_http_version 1.1;
tcp_nodelay on;
}
}
With the LUA live and enforcing, I awaited the next of the thrice-hourly visits by the crawler
129.153.55.48 - - [02/Jan/2023:16:58:40 +0000] "GET /api/v1/trends/statuses?limit=40&offset=0 HTTP/1.1" 403 148 "-" "axios/1.2.1" "-" "mastodon.bentasker.co.uk" CACHE_- 0.000 mikasa - "-" "-" "-"
129.153.55.48 - - [02/Jan/2023:16:58:40 +0000] "GET /api/v1/trends/statuses?limit=40&offset=40 HTTP/1.1" 403 148 "-" "axios/1.2.1" "-" "mastodon.bentasker.co.uk" CACHE_- 0.000 mikasa - "-" "-" "-"
It now correctly receives a HTTP 403
instead of the feed.
Testing
Blocking requests was working, but I also needed to make sure that legitimate traffic was allowed through and tested the following
- Web UI: All feeds work as they should (when logged in)
- Tusky (Android): Works fine (note: doesn't have a trending view, so can't check hashtags etc, but local/federated views work fine)
- Official Mastodon client (Android): Feeds work fine
- tooot (Android): Federated/Local Feeds work fine (can't find a trending hashtags view)
Not being an Apple user, I've not been able to test iOS specific clients (though I'd hope they work much the same).
Limitations
There are some limitations, most are inherent in any instance-level block, but still seem worth enumerating
- The defence only prevents fetching of information directly from this instance. If the instance publishes to ActivityPub Relays, activity could potentially be pulled from them instead (although there is potentially still some benefit/protection from being mixed in with activity from other instances)
- If public signups are permitted, an adversary could create a legitimate account in order to use a legitimate token in their scraping requests. However, this still serves to raise the bar and they'd need to do so on every instance they wished to scrape (increasing the chance of detection)
- This only offers any protection against external scrapers: federated approaches like Mastinator require a completely different approach.
- It doesn't prevent scraping of profiles and toots in the way that
DISALLOW_UNAUTHENTICATED_API_ACCESS
does, but also doesn't bring the drawbacks associated with that level of restriction.
Additional Protection Mechanisms
There are some additional mechanisms which can (and probably should) be considered, even though they're not specific to these public endpoints
- Basic user-agent checks: blocking empty and known bot user-agents will clear the low hanging fruit
- IP Behaviour Checks: I'm not a massive fan of IP based blocking, as it's inherently flawed, but they are popular with some - collaborative lists are generally more effective than individual manually curated lists (though not without abuse potential).
Larger Instance Admin Considerations
When it comes to blocking (or allowing) activity, admins of larger instances have much more to think about than I do.
The method that I've described has an obvious trade-off: it prevents access to public feeds by unauthenticated users.
Whilst this helps to prevent arbitrary scraping of activity, it also means that prospective users are unable to view the Local
activity feed on the instance's default page, so they'll be unable to get a feel for the instance that they're thinking of joining.
The crux of the matter though, is that although unethical, anything publicly available is liable to be scraped/indexed, so raising the bar by removing unauthenticated access to endpoints that are common between different installs is not without value.
As with most moderation, it boils down to finding the right balance for your particular users.
There is a middle ground: implement a neutered version of the ruleset that logs but does not enforce.
This would allow you to build instrumentation showing how regularly these endpoints are seeing unauthenticated requests, which can then either be used to reactively add IP blocks, or to support a later decision to start enforcing the ruleset.
Cloudflare Worker Example
Although my LUA implementation should be quite easy to follow, it may not be the most useful example for admins more used to dealing with other solutions/services.
So, I decided to also create an example for use within Cloudflare's Worker system.
Be aware that there is an element of caveat emptor here: I don't use their service (as well as privacy and technical concerns, there are also various moral arguments against using Cloudflare).
This means that I've built this example entirely based on examples in Cloudflare's documentation and haven't the means to test it beyond ensuring that it's actually valid Javascript
Update: This has now, very kindly, been tested for me and appears to be working as it should.
/* Cloudflare Worker Public Mastodon API Protection
*
* Hacked together based on the following docs
*
* https://developers.cloudflare.com/workers/examples/fetch-json/
* https://developers.cloudflare.com/workers/examples/auth-with-headers/
*
*/
// The origin to send probe requests to
const api_endpoint = "https://1.2.3.4/api/v2/suggestions";
// The host header to include in those requests
const masto_host_head = "mastodon.bentasker.co.uk"
/**
* Checks for an Authorization header and (if present) uses it to send a probe
* to an authenticated origin endpoint to gauge token validity
*
* No token, or invalid token results in a 403
*
* TODO: Add caching of results to prevent repeated upstream calls
*
*/
async function doProbe(request){
// The "Authorization" header is sent when authenticated.
if (request.headers.has('Authorization')) {
// Build a probe request
const init = {
headers: {
'content-type': 'application/json;charset=UTF-8',
'authorization': request.headers.get('Authorization'),
'host' : masto_host_head
},
};
const response = await fetch(api_endpoint, init);
if (response.status == 200){
return fetch(request);
}
// Otherwise, it failed
return new Response('Invalid auth', {
status: 403,
});
}
// No auth header
return new Response('You need to login.', {
status: 403,
});
}
/**
* Receives a HTTP request and replies with a response.
* @param {Request} request
* @returns {Promise<Response>}
*/
async function handleRequest(request) {
const { protocol, pathname } = new URL(request.url);
var resp
switch (pathname) {
case '/api/v1/timelines/public': {
resp = await doProbe(request);
return resp;
}
case '/api/v1/trends/statuses': {
resp = await doProbe(request);
return resp;
}
case '/api/v1/trends/tags': {
resp = await doProbe(request);
return resp;
}
default:
return fetch(request);
}
}
addEventListener('fetch', event => {
event.respondWith(handleRequest(event.request));
});
A copy of this script is available at https://github.com/bentasker/article_scripts/tree/main/restricting-unauthenticated-access-to-mastodons-public-feeds/Cloudflare.
Current Known Scraper IPs
One of the advantages of having the LUA implementation set X-Fail
and X-Denied-By
is that it enables me to easily identify log lines banned by the rule, and to quite trivially enumerate hosts observed trying to hit public API endpoints without authentication.
Extracting this information, in fact, is so trivial that I don't even need to open my Grafana dashboards
grep "/api/v1" /var/log/nginx/access.log | grep "edge mastodon_api_enforce" | awk -F'\t' '{print $1,$5,$9,$18}' | sort | uniq
Which looking at results since the protection went live, gives the following
129.105.31.75 "GET /api/v1/timelines/public?limit=80&local=true HTTP/1.1" "Python/3.6 aiohttp/3.6.2" "mastoapi-token-invalid"
129.153.55.48 "GET /api/v1/trends/statuses?limit=40&offset=0 HTTP/1.1" "axios/1.2.1" "mastoapi-no-auth"
129.153.55.48 "GET /api/v1/trends/statuses?limit=40&offset=40 HTTP/1.1" "axios/1.2.1" "mastoapi-no-auth"
154.3.44.201 "GET /api/v1/timelines/public?only_media=false HTTP/1.1" "Apache-HttpClient/4.5.11 (Java/1.8.0_171)" "mastoapi-no-auth"
158.101.19.243 "GET /api/v1/timelines/public?limit=40 HTTP/2.0" "Typhoeus - https://github.com/typhoeus/typhoeus" "mastoapi-no-auth"
168.119.64.252 "GET /api/v1/trends/tags HTTP/2.0" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)" "mastoapi-no-auth"
173.212.199.194 "GET /api/v1/timelines/public?local=true HTTP/2.0" "-" "mastoapi-no-auth"
35.232.35.2 "GET /api/v1/timelines/public?limit=40 HTTP/1.1" "python-requests/2.28.1" "mastoapi-no-auth"
46.226.110.114 "GET /api/v1/timelines/public?local=true&limit=40 HTTP/2.0" "-" "mastoapi-no-auth"
46.226.110.114 "GET /api/v1/trends/tags HTTP/2.0" "-" "mastoapi-no-auth"
87.157.136.163 "GET /api/v1/timelines/public?local=true&limit=40 HTTP/1.1" "fedi_stats/0.1.2 (by @lightmoll@social.nekover.se)" "mastoapi-no-auth"
99.105.215.234 "GET /api/v1/timelines/public?limit=500 HTTP/2.0" "-" "mastoapi-no-auth"
(I have stripped out some that were likely to identify residential users)
That first one is very interesting: it provided a token, but an invalid one.
The IP 129.105.31.75
is in a range associated with Northwestern University, if this is related to research the very least they could do is disclose it in the user-agent. It seems to visit once every 24 hours, so I may yet look into capturing more information about it.
The second thing interesting thing in the dataset is the relevant proportion of Oracle IPs - their Free Cloud Tier has obviously made them the location of choice for unwanted shit.
Because the exceptions are written into IOx, I've also been able to create a dashboard with the same list, using a simple SQL query
SELECT
DISTINCT IP, path, ua
FROM "waf_exceptions" WHERE
reason = 'mastodon_api_enforce' AND
$__timeFilter(time)
The report will improve over time (probably starting by adding the referer
column helps so it's easier to spot browsing end-users).
Exception Rate
Although the rate at which scrapers scrape is quite low, the frequency at which they're observed is quite high (although this is skewed quite heavily by 129.153.55.48
)
Conclusion
There are a number of external scrapers active within the fediverse (not that I expect that anyone to that particularly suprising).
Mastodon offers instance administrators a fairly coarse level of control over what information is made publicly available, via the controls Allow unauthenticated access to public timelines
(limited coverage) and DISALLOW_UNAUTHENTICATED_API_ACCESS
(near-total coverage, with multiple drawbacks).
Although unticking Allow unauthenticated access to public timelines
disables the public Local
and Federated
feeds, a public feed of trending toots and hashtags remains available.
This (IMO) is a failing in Mastodon that should be rectified. Similarly, the fact that Mastodon ignores invalid authentication credentials sent to it's API endpoints is problematic, if only because it means there's inconsistency across endpoints.
The availability of the trending
endpoints provides scrapers with a well known and common endpoints to request across disparate instances in order to capture an insight into activity on each of those instances.
Details of trending toots can, in some circumstances, reveal who a user follows as well as leaking information about which other instances are federated with (if toots from kinky.business
start showing up on an instance with a single active user, it really doesn't take much to put 2+2 together).
The Trending Toots
feed can also help indexers to identify popular posts (and posters), reducing their own operating costs (because they no longer have to index and store low value toots that no-one will likely ever search for again).
The availability of trending feeds really should be put under the control of instance administrators, seperate from the control (and associated drawbacks) offered by DISALLOW_UNAUTHENTICATED_API_ACCESS
.
The approach that I've used is relatively flexible because it's deployed seperately to Mastodon: there's no need to incur the overhead of forking Mastodon and it can initially be tested without needing to make production changes (just deploy an OpenResty install/container to test with whilst getting it just right).
The worker implementation should work for Cloudflare users and provide a reference for users of other providers/services to translate from.
It may not be possible for us to completely prevent scraping without having instances turn into walled gardens, but that doesn't mean that we have to make it easy for them.