From 5fd3c1d796123976c1896618cd6d6aab268119af Mon Sep 17 00:00:00 2001
From: mms
Date: Mon, 11 Dec 2023 21:33:18 +0100
Subject: feat: ban bad bots

---
 content/bsd/blocking-bad-bots-openbsd.md | 59 ++++++++++++++++++++++++++++++++
 content/bsd/home.md                      |  1 +
 2 files changed, 60 insertions(+)
 create mode 100644 content/bsd/blocking-bad-bots-openbsd.md

(limited to 'content/bsd')

diff --git a/content/bsd/blocking-bad-bots-openbsd.md b/content/bsd/blocking-bad-bots-openbsd.md
new file mode 100644
index 0000000..121f6d0
--- /dev/null
+++ b/content/bsd/blocking-bad-bots-openbsd.md
@@ -0,0 +1,59 @@
+---
+title: "Blocking bad bots using Relayd"
+category:
+- bsd
+- update
+- bsd-update
+abstract:
+date: 2023-12-10T12:27:54+02:00
+---
+The bane of existence for most small pages: web crawlers.
+They create most of the traffic this site sees and make my [site stats](https://michal.sapka.me/site/info/#site-stats) overly optimistic.
+We can go with [robots.txt](https://en.wikipedia.org/wiki/Robots_Exclusion_Protocol), but what if it's not enough?
+I can tell a valuable bot not to index some part of my site, but:
+a) some bots ignore it
+b) what if I don't want some bots to even have the chance to ask?
+
+Get that SEO scanning and LLM training out of here!
+
+## Blocking crawlers
+
+The rest of this guide assumes a web stack of Relayd and Httpd.
+Relayd is great, and since it works at a higher level than pf, it can read HTTP headers. Luckily, those crawlers send usable "User-Agent" headers, which we can block on.
+
+First, let's see who uses my site the most. Assuming you use the "forwarded" log style, we can do:
+
+{{}}
+awk -F '"' '{print $6}' | sort | uniq -c | sort
+{{}}
+
+Then we need to manually select the agents we want to block. It won't be easy, as the strings are long and contain a lot of unnecessary information - including plain lies. You need to work out which part of the full User-Agent string is common and can be used for blocking.
+
+Then we can create block rules in a Relayd protocol. Relayd doesn't use regexps; instead, it allows case-sensitive Lua globs. A star matches any sequence of characters.
+
+{{}}
+block request method "GET" header "User-Agent" value "**"
+{{}}
+
+Remember that the config is evaluated last-match-wins, so the block rules should be the last ones that match. I just put them at the end of my config. You can create a `block quick...` rule if you want - it will short-circuit the rest of the protocol.
+
+Therefore, my "https" protocol now has a series of blocks:
+
+{{}}
+http protocol "https" {
+# most of the protocol omitted
+    block request method "GET" header "User-Agent" value "*Bytespider*"
+    block request method "GET" header "User-Agent" value "*ahrefs*"
+    block request method "GET" header "User-Agent" value "*censys*"
+    block request method "GET" header "User-Agent" value "*commoncrawl*"
+    block request method "GET" header "User-Agent" value "*dataforseo*"
+    block request method "GET" header "User-Agent" value "*mj12*"
+    block request method "GET" header "User-Agent" value "*semrush*"
+    block request method "GET" header "User-Agent" value "*webmeup*"
+    block request method "GET" header "User-Agent" value "*zoominfo*"
+}
+{{}}
+
+*(using globs was proposed to me on the [OpenBSD mailing list](https://marc.info/?l=openbsd-misc&m=170206886109953&w=2))*
+
+
diff --git a/content/bsd/home.md b/content/bsd/home.md
index 0f5d9d5..2366758 100644
--- a/content/bsd/home.md
+++ b/content/bsd/home.md
@@ -32,3 +32,4 @@ Since at least a year, I've been a BSD type of a guy. My personal laptop is runn
 - OpenBSD server
   - [OpenBSD Amsterdam](/bsd/moved-to-openbsd/)
   - [Webstack - Httpd(8), Relayd(8)](/bsd/moved-to-openbsd/#httpd8-and-relayd8)
+  - [Blocking bad bots and crawlers](/bsd/blocking-bad-bots-openbsd/)
-- 
cgit v1.2.3