Diffstat (limited to 'content/bsd/blocking-bad-bots-openbsd.md')
-rw-r--r-- | content/bsd/blocking-bad-bots-openbsd.md | 71 |
1 files changed, 42 insertions, 29 deletions
diff --git a/content/bsd/blocking-bad-bots-openbsd.md b/content/bsd/blocking-bad-bots-openbsd.md
index a52c1b3b..cea98399 100644
--- a/content/bsd/blocking-bad-bots-openbsd.md
+++ b/content/bsd/blocking-bad-bots-openbsd.md
@@ -1,48 +1,61 @@
----
-title: "Blocking bad bots using Relayd"
-category:
-- bsd
-- update
-- bsd-update
-abstract:
-date: 2023-12-11T20:27:54+02:00
----
++++
+title = "OpenBSD: Blocking bad bots using Relayd"
+author = ["Michał Sapka"]
+date = 2023-12-11T19:08:00+01:00
+categories = ["bsd"]
+draft = false
+weight = 2002
+abstract = "How do I fight bad crawlers?"
+[menu]
+  [menu.bsd-openbsd]
+    weight = 2002
+    identifier = "openbsd-blocking-bad-bots-using-relayd"
+    name = "Blocking bad bots using Relayd"
++++
+
 The bane of existence for most of small pages: web crawlers. They create most traffic this site sees and makes my [site stats](https://michal.sapka.me/site/info/#site-stats) overly optimistic.

 We can go with [robots.txt](https://en.wikipedia.org/wiki/Robots_Exclusion_Protocol), but what if it's not enough?
-I can tell a valuable bot to not index some part of my site, but:
-a) some bots ignore it
-a) what if I don't want some bots to even have the chance to ask?
+I can tell a valuable bot to not index some part of my site, but:
+a) some bots ignore it
+b) what if I don't want some bots to even have the chance to ask?

 Get that SEO scanning and LLM training out of here!

-## Blocking crawlers
+
+## Blocking crawlers {#blocking-crawlers}

 The rest of this guide assumes webstack: Relayd and Httpd.
-Relayd is great and since it works on higher level than pf, we can read headers. Luckily, those crawlers send usable "User-Agents" which we can block.
+Relayd is great and since it works on higher level than pf, we can read headers.
+Luckily, those crawlers send usable "User-Agents" which we can block.

-First, let's see who uses my site the most. Assuming you use "forwarded"[^log-style] style for logs, we can do:
-[^log-style]: vide https://man.openbsd.org/httpd.conf.5#style
+First, let's see who uses my site the most. Assuming you use "forwarded"[^fn:1] style for logs, we can do:

-{{<highlight shell>}}
-awk -F '"' '{print $6}' <path to log file> | sort | uniq -c | sort
-{{</highlight>}}
+```shell
+awk -F '"' '{print $6}' <path to log file> | sort | uniq -c | sort#
+```

-Then we need to manually select agents we want to block. It won't be easy, as the strings are long and contain a lot of unnecessary information - which includes plain lies. You need to define which part of the full User-Agent is common and can be used for blocking.
+Then we need to manually select agents we want to block.
+It won't be easy, as the strings are long and contain a lot of unnecessary information - which includes plain lies.
+You need to define which part of the full User-Agent is common and can be used for blocking.

-Then we can create block rules in a Relayd protocol. Relayd doesn't use regexp, and instead allows using case-sensitive Lua globs. Stars will match everything.
+Then we can create block rules in a Relayd protocol.
+Relayd doesn't use regexp, and instead allows using case-sensitive Lua globs.
+Stars will match everything.

-{{<highlight shell>}}
+```shell
 block request method "GET" header "User-Agent" value "*<common part>*"
-{{</highlight>}}
+```

-Remember that config assumes last-one-wins, so the block rules should be the last matching. I just put those end the end of my config. You can create a `block quick...` rule if you want - it will short-circuit the entire protocol.
+Remember that config assumes last-one-wins, so the block rules should be the last matching.
+I just put those end the end of my config.
+You can create a \`block quick...\` rule if you want - it will short-circuit the entire protocol.

 Therefore, my "https" protocol now has a series of blocks:

-{{<highlight shell "linenos=inline">}}
+```shell
 http protocol "https" {
-# most of the procol omitted
+    # most of the procol omitted
     block request method "GET" header "User-Agent" value "*Bytespider*"
     block request method "GET" header "User-Agent" value "*ahrefs*"
     block request method "GET" header "User-Agent" value "*censys*"
@@ -53,8 +66,8 @@ http protocol "https" {
     block request method "GET" header "User-Agent" value "*webmeup*"
     block request method "GET" header "User-Agent" value "*zoominfo*"
 }
-{{</highlight>}}
-
-*(using globs was proposed to me on [OpenBSD mailing list](https://marc.info/?l=openbsd-misc&m=170206886109953&w=2)*
+```
+(usage of globs was proposed to me on [OpenBSD mailing list](<https://marc.info/?l=openbsd-misc&m=170206886109953&w=2>)
+[^fn:1]: : vide <https://man.openbsd.org/httpd.conf.5#style>
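
Note on the log one-liner in this change: the final `sort` compares the `uniq -c` output as text, so making the numeric, descending order explicit may be easier to read. Below is a minimal sketch under stated assumptions, not the post's own method. The function names `top_agents` and `agent_hits` are hypothetical, and the log is assumed to use the "forwarded" style, where splitting on double quotes puts the User-Agent in field 6. The second helper only approximates a `*<common part>*` glob with a case-sensitive substring match, to preview how many requests a candidate string would catch before it goes into a `block` rule.

```shell
# Hypothetical helper: list User-Agent strings, most frequent first.
# $1 is the path to an httpd log in "forwarded" (or "combined") style.
top_agents() {
    # Splitting on double quotes makes the User-Agent the 6th field.
    awk -F '"' '{print $6}' "$1" | sort | uniq -c | sort -rn
}

# Hypothetical helper: count requests whose User-Agent contains $2.
# index() is a case-sensitive substring test, used here as a rough
# stand-in for the "*<common part>*" glob in a block rule.
agent_hits() {
    awk -F '"' -v pat="$2" 'index($6, pat) { n++ } END { print n + 0 }' "$1"
}
```

Piping `top_agents` on the log file through `head` shows the busiest agents, and `agent_hits` on the same file with `Bytespider` as the second argument shows how many logged requests that substring would have matched.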