+++
title = "OpenBSD: Blocking bad bots using Relayd"
author = ["Michał Sapka"]
date = 2023-12-11T19:08:00+01:00
categories = ["bsd"]
draft = false
weight = 2002
primary_menu = "bsd"
abstract = "How do I fight bad crawlers?"
[menu]
  [menu.bsd]
    weight = 2002
    identifier = "openbsd-blocking-bad-bots-using-relayd"
    parent = "obsdweb"
    name = "Blocking bad bots using Relayd"
+++

The bane of existence for most small pages: web crawlers.
They create most of the traffic this site sees and make my [site stats](https://michal.sapka.me/site/info/#site-stats) overly optimistic.
We can go with [robots.txt](https://en.wikipedia.org/wiki/Robots_Exclusion_Protocol), but what if it's not enough?
I can tell a valuable bot not to index some part of my site, but:
a) some bots ignore it
b) what if I don't want some bots to even have the chance to ask?

Get that SEO scanning and LLM training out of here!


## Blocking crawlers {#blocking-crawlers}

The rest of this guide assumes my webstack: Relayd and Httpd.
Relayd is great, and since it works at a higher level than pf, it can read headers.
Luckily, those crawlers send recognizable "User-Agent" strings which we can block.

First, let's see who uses my site the most. Assuming you use the "forwarded"[^fn:1] log style, we can do:

```shell
awk -F '"' '{print $6}' <path to log file> | sort | uniq -c | sort -rn
```
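
The output is one count per unique User-Agent, busiest first. It will look roughly like this (the counts and agents below are made up for illustration, not copied from my logs):

```shell
 312 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
 205 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
 118 Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...
```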

Then we need to manually select the agents we want to block.
It won't be easy, as the strings are long and contain a lot of unnecessary information - including plain lies.
You need to decide which part of the full User-Agent is common to a given bot and can be used for blocking.
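
Before adding a rule, it's also worth checking what a candidate substring would actually match in the log. A quick sketch (the log path and the "ahrefs" substring are just examples):

```shell
# list every distinct User-Agent containing the candidate substring
awk -F '"' '{print $6}' /var/www/logs/access.log | grep -i ahrefs | sort -u
```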

Then we can create block rules in a Relayd protocol.
Relayd doesn't use regexps; instead it matches values with case-sensitive Lua globs, where a star matches anything.

```shell
block request method "GET" header "User-Agent" value "*<common part>*"
```

Remember that the config assumes last-one-wins, so the block rules should be the last ones matching.
I just put them at the end of my config.
You can create a `block quick ...` rule if you want - it will short-circuit the entire protocol.
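
For completeness, this is roughly how such a short-circuiting rule could look. Treat it as a sketch and verify the exact placement of the quick keyword against relayd.conf(5):

```shell
# "quick" stops rule evaluation at the first match
block request quick header "User-Agent" value "*Bytespider*"
```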

Therefore, my "https" protocol now has a series of blocks:

```shell
http protocol "https" {
    # most of the protocol omitted
    block request method "GET" header "User-Agent" value "*Bytespider*"
    block request method "GET" header "User-Agent" value "*ahrefs*"
    block request method "GET" header "User-Agent" value "*censys*"
    block request method "GET" header "User-Agent" value "*commoncrawl*"
    block request method "GET" header "User-Agent" value "*dataforseo*"
    block request method "GET" header "User-Agent" value "*mj12*"
    block request method "GET" header "User-Agent" value "*semrush*"
    block request method "GET" header "User-Agent" value "*webmeup*"
    block request method "GET" header "User-Agent" value "*zoominfo*"
}
```
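
After adding the rules, I validate the file and reload Relayd. A minimal sketch, assuming the default /etc/relayd.conf and doas:

```shell
# check that the new rules parse before touching the running daemon
doas relayd -n

# tell relayd to re-read its configuration
doas relayctl reload

# a blocked User-Agent should no longer get the page; the exact response
# depends on the rest of the protocol (e.g. whether it sets "return error")
curl -s -o /dev/null -w '%{http_code}\n' -A "Bytespider" https://michal.sapka.me/
```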

(Usage of globs was proposed to me on the [OpenBSD mailing list](https://marc.info/?l=openbsd-misc&m=170206886109953&w=2).)

[^fn:1]: See <https://man.openbsd.org/httpd.conf.5#style>