+++
title = "LLM honeypot"
author = ["Michał Sapka"]
date = 2024-06-28T14:14:00+02:00
categories = ["blog"]
draft = false
weight = 2002
abstract = "The only way to fight that I see"
+++

Big tech doesn't care about people; the LLM industry actively seeks harm.
We're [seeing it time and time again](https://www.theverge.com/2024/6/27/24187405/perplexity-ai-twitter-lie-plagiarism).
They consider the open web to be a resource that exists only for them to harvest.

But the web was designed with good intentions in mind.
There is no way to actively _block_ them.
Copyright? Nope, fair use.
Robots.txt? Nope, some don't care - others pretend to care after the theft.
Identifying them? Good luck. Not only are the IPs in the _millions_, but they also lie in their user agents.

Some are trying to poison LLMs with prompt injection, but this will not work against any larger dataset.

Personally, I want to at least try.
Therefore, my site contains a honeypot: _open [a git repository](https://michal.sapka.me/git/mms/Library-of-knowledge) and your IP will be logged_.
For now I only collect them, but soon they will be blocked on my firewall for some time - a week, maybe?
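
As an illustration only, the collect-then-block idea could look like the Python sketch below. It is not necessarily what runs on this site. It assumes the web server writes Common/Combined Log Format lines, takes the honeypot path from the link above, and uses the one-week ban as a placeholder. `honeypot_hits` and `active_bans` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta

# Assumption: access log lines follow the Common/Combined Log Format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)

# Path of the honeypot repository, taken from the link in the post.
HONEYPOT_PREFIX = "/git/mms/Library-of-knowledge"
# Assumption: bans last one week ("a week maybe").
BAN_DURATION = timedelta(weeks=1)


def honeypot_hits(lines, prefix=HONEYPOT_PREFIX):
    """Map each client IP to the time of its latest request under prefix."""
    hits = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or not match["path"].startswith(prefix):
            continue  # unparsable line, or a request outside the honeypot
        seen = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
        ip = match["ip"]
        if ip not in hits or seen > hits[ip]:
            hits[ip] = seen
    return hits


def active_bans(hits, now, duration=BAN_DURATION):
    """Return, sorted, the IPs whose latest hit is still inside the ban window."""
    return sorted(ip for ip, seen in hits.items() if now - seen < duration)
```

The list returned by `active_bans` is what would be handed to the firewall; how that happens depends on the firewall in use and is left out here.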

This repo is:

-   disallowed by robots.txt, so no well-behaved agent would harvest it
-   labeled as a ban hammer in the description.
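
To show what "disallowed by robots.txt" means for a well-behaved crawler, here is a small Python sketch using the standard library's `urllib.robotparser`. The rule in `ROBOTS_TXT` is a hypothetical example, not a copy of this site's real file. `is_allowed` and the `ExampleBot` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a hypothetical rule; the site's actual robots.txt is not shown here.
ROBOTS_TXT = """\
User-agent: *
Disallow: /git/mms/Library-of-knowledge
"""


def is_allowed(path, user_agent="ExampleBot", robots_txt=ROBOTS_TXT):
    """Return True if a crawler that honors robots.txt may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

A crawler that runs this check never requests anything under the honeypot path, so only agents that ignore robots.txt end up in the log.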

I'll wait a while and then publish the results.