+++
title = "OpenBSD: Blocking bad bots using Relayd"
author = ["Michał Sapka"]
date = 2023-12-11T19:08:00+01:00
categories = ["bsd"]
draft = false
weight = 2002
abstract = "How do I fight bad crawlers?"
[menu]
  [menu.bsd-openbsd]
    weight = 2002
    identifier = "openbsd-blocking-bad-bots-using-relayd"
    parent = "obsdweb"
    name = "Blocking bad bots using Relayd"
+++

The bane of existence for most small sites: web crawlers. They create most of the traffic this site sees and make my [site stats](https://michal.sapka.me/site/info/#site-stats) overly optimistic.

We can go with [robots.txt](https://en.wikipedia.org/wiki/Robots_Exclusion_Protocol), but what if it's not enough? I can tell a valuable bot not to index some part of my site, but:

a) some bots ignore it,

b) what if I don't want some bots to even have the chance to ask?

Get that SEO scanning and LLM training out of here!

## Blocking crawlers {#blocking-crawlers}

The rest of this guide assumes a web stack of Relayd and Httpd.

Relayd is great, and since it works at a higher level than pf, it can read headers. Luckily, those crawlers send usable "User-Agent" headers, which we can block on.

First, let's see who uses my site the most. Assuming you use the "forwarded"[^fn:1] style for logs, we can do:

```shell
# Replace the path with your httpd access log; busiest agents are listed first.
awk -F '"' '{print $6}' /path/to/access.log | sort | uniq -c | sort -rn
```

Then we need to manually select the agents we want to block. It won't be easy, as the strings are long and contain a lot of unnecessary information, including plain lies. You need to work out which part of the full User-Agent is common and can be used for blocking.

Then we can create block rules in a Relayd protocol. Relayd doesn't use regexp, and instead allows using case-sensitive Lua globs. Stars will match everything.

```shell
block request method "GET" header "User-Agent" value "*<part of the User-Agent>*"
```

Remember that the config works on a last-match-wins basis, so the block rules should be the last ones that match. I just put them at the end of my config.
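Picking fragments and typing the rules by hand gets tedious once the list grows. Below is a minimal sketch of a helper that could take over the boring part. The function name `suggest_block_rules` and its input format (one fragment per line on stdin) are made up for illustration and are not part of relayd. It assumes the same log layout as the awk command above, with the User-Agent as the sixth field when splitting on double quotes, and it counts plain case-sensitive substring matches, which only approximates what a `*fragment*` glob would catch.

```shell
#!/bin/sh
# Hypothetical helper, not part of relayd.
# Reads User-Agent fragments from stdin (one per line) and, for each one,
# prints how many logged requests contain it, followed by a block rule.
# $1 = path to an access log in "forwarded" style (assumption: the
#      User-Agent is field 6 when the line is split on double quotes).
suggest_block_rules() {
    log=$1
    while IFS= read -r fragment; do
        # Skip blank lines and '#' comments in the fragment list.
        case "$fragment" in
            ''|'#'*) continue ;;
        esac
        # index() is a plain, case-sensitive substring search.
        hits=$(awk -F '"' -v frag="$fragment" \
            'index($6, frag) > 0 { n++ } END { print n + 0 }' "$log")
        printf '# %s matched %s logged requests\n' "$fragment" "$hits"
        printf 'block request method "GET" header "User-Agent" value "*%s*"\n' "$fragment"
    done
}
```

Something like `printf '%s\n' Bytespider semrush | suggest_block_rules /path/to/access.log` would then print a commented rule per fragment, ready to be pasted at the end of the protocol. A fragment that matches zero requests is a hint that the spelling or the case is off.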
You can create a `block quick...` rule if you want; it will short-circuit the entire protocol.

As a result, my "https" protocol now has a series of blocks:

```shell
http protocol "https" {
    # most of the protocol omitted
    block request method "GET" header "User-Agent" value "*Bytespider*"
    block request method "GET" header "User-Agent" value "*ahrefs*"
    block request method "GET" header "User-Agent" value "*censys*"
    block request method "GET" header "User-Agent" value "*commoncrawl*"
    block request method "GET" header "User-Agent" value "*dataforseo*"
    block request method "GET" header "User-Agent" value "*mj12*"
    block request method "GET" header "User-Agent" value "*semrush*"
    block request method "GET" header "User-Agent" value "*webmeup*"
    block request method "GET" header "User-Agent" value "*zoominfo*"
}
```

(The use of globs was suggested to me on the [OpenBSD mailing list]().)

[^fn:1]: vide