Diffstat (limited to 'content/bsd/blocking-bad-bots-openbsd.md')
-rw-r--r-- content/bsd/blocking-bad-bots-openbsd.md | 71
 1 file changed, 42 insertions(+), 29 deletions(-)
diff --git a/content/bsd/blocking-bad-bots-openbsd.md b/content/bsd/blocking-bad-bots-openbsd.md
index a52c1b3b..cea98399 100644
--- a/content/bsd/blocking-bad-bots-openbsd.md
+++ b/content/bsd/blocking-bad-bots-openbsd.md
@@ -1,48 +1,61 @@
----
-title: "Blocking bad bots using Relayd"
-category:
-- bsd
-- update
-- bsd-update
-abstract:
-date: 2023-12-11T20:27:54+02:00
----
++++
+title = "OpenBSD: Blocking bad bots using Relayd"
+author = ["MichaƂ Sapka"]
+date = 2023-12-11T19:08:00+01:00
+categories = ["bsd"]
+draft = false
+weight = 2002
+abstract = "How do I fight bad crawlers?"
+[menu]
+ [menu.bsd-openbsd]
+ weight = 2002
+ identifier = "openbsd-blocking-bad-bots-using-relayd"
+ name = "Blocking bad bots using Relayd"
++++
+
The bane of existence for most small pages: web crawlers.
They create most of the traffic this site sees and make my [site stats](https://michal.sapka.me/site/info/#site-stats) overly optimistic.
We can go with [robots.txt](https://en.wikipedia.org/wiki/Robots_Exclusion_Protocol), but what if that's not enough?
-I can tell a valuable bot to not index some part of my site, but:
-a) some bots ignore it
-a) what if I don't want some bots to even have the chance to ask?
+I can tell a valuable bot to not index some part of my site, but:
+a) some bots ignore it
+b) what if I don't want some bots to even have the chance to ask?
Get that SEO scanning and LLM training out of here!
-## Blocking crawlers
+
+## Blocking crawlers {#blocking-crawlers}
The rest of this guide assumes the OpenBSD web stack: Relayd and Httpd.
-Relayd is great and since it works on higher level than pf, we can read headers. Luckily, those crawlers send usable "User-Agents" which we can block.
+Relayd is great, and since it works at a higher level than pf, it can read headers.
+Luckily, those crawlers send usable "User-Agent" strings which we can block.
-First, let's see who uses my site the most. Assuming you use "forwarded"[^log-style] style for logs, we can do:
-[^log-style]: vide https://man.openbsd.org/httpd.conf.5#style
+First, let's see who uses my site the most. Assuming the "forwarded"[^fn:1] log style, we can do:
-{{<highlight shell>}}
-awk -F '"' '{print $6}' <path to log file> | sort | uniq -c | sort
-{{</highlight>}}
+```shell
+awk -F '"' '{print $6}' <path to log file> | sort | uniq -c | sort
+```
-Then we need to manually select agents we want to block. It won't be easy, as the strings are long and contain a lot of unnecessary information - which includes plain lies. You need to define which part of the full User-Agent is common and can be used for blocking.
+Then we need to manually select the agents we want to block.
+It won't be easy, as the strings are long and contain a lot of unnecessary information - including plain lies.
+You need to identify which part of the full User-Agent is stable and can be used for blocking.
-Then we can create block rules in a Relayd protocol. Relayd doesn't use regexp, and instead allows using case-sensitive Lua globs. Stars will match everything.
+Then we can create block rules in a Relayd protocol.
+Relayd doesn't use regexps; instead it allows case-sensitive Lua globs.
+An asterisk matches any sequence of characters.
-{{<highlight shell>}}
+```shell
block request method "GET" header "User-Agent" value "*<common part>*"
-{{</highlight>}}
+```
-Remember that config assumes last-one-wins, so the block rules should be the last matching. I just put those end the end of my config. You can create a `block quick...` rule if you want - it will short-circuit the entire protocol.
+Remember that the config is last-match-wins, so the block rules should be the last ones that match.
+I just put those at the end of my config.
+You can create a `block quick ...` rule if you want - it will short-circuit the rest of the protocol.
Therefore, my "https" protocol now has a series of blocks:
-{{<highlight shell "linenos=inline">}}
+```shell
http protocol "https" {
-# most of the procol omitted
+ # most of the protocol omitted
block request method "GET" header "User-Agent" value "*Bytespider*"
block request method "GET" header "User-Agent" value "*ahrefs*"
block request method "GET" header "User-Agent" value "*censys*"
@@ -53,8 +66,8 @@ http protocol "https" {
block request method "GET" header "User-Agent" value "*webmeup*"
block request method "GET" header "User-Agent" value "*zoominfo*"
}
-{{</highlight>}}
-
-*(using globs was proposed to me on [OpenBSD mailing list](https://marc.info/?l=openbsd-misc&m=170206886109953&w=2)*
+```
+(Using globs was proposed to me on the [OpenBSD mailing list](<https://marc.info/?l=openbsd-misc&m=170206886109953&w=2>).)
[^fn:1]: see <https://man.openbsd.org/httpd.conf.5#style>
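
As a sanity check outside the diff, the `awk` pipeline added above can be exercised against fabricated "forwarded"-style log entries (the hostname, addresses, and User-Agent strings below are made up for illustration; the User-Agent is the sixth double-quote-delimited field):

```shell
# Fabricated "forwarded"-style httpd log entries (illustration only).
cat > /tmp/access.log <<'EOF'
example.org 192.0.2.1 - - [11/Dec/2023:20:00:00 +0100] "GET / HTTP/1.1" 200 0 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
example.org 192.0.2.2 - - [11/Dec/2023:20:00:05 +0100] "GET /about HTTP/1.1" 200 0 "-" "Mozilla/5.0 (X11; OpenBSD amd64; rv:109.0) Gecko/20100101 Firefox/115.0"
example.org 192.0.2.1 - - [11/Dec/2023:20:00:09 +0100] "GET /feed HTTP/1.1" 200 0 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
EOF

# Split each line on double quotes: field 6 is the User-Agent.
# uniq -c counts repeats; the final sort puts the busiest agents last.
awk -F '"' '{print $6}' /tmp/access.log | sort | uniq -c | sort
# -> the Bytespider agent appears last, with a count of 2
```

Once a matching substring is chosen and the block rule is in place, the setup can be validated with `relayd -n -f /etc/relayd.conf` (config check) and tested with `curl -A "Bytespider" https://example.org/` (the hostname here is a placeholder for your own site).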