Diffstat (limited to 'content')
-rw-r--r-- content/blog/2024/llm-honeypot.md | 32
1 files changed, 32 insertions, 0 deletions
diff --git a/content/blog/2024/llm-honeypot.md b/content/blog/2024/llm-honeypot.md
new file mode 100644
index 0000000..aa6240f
--- /dev/null
+++ b/content/blog/2024/llm-honeypot.md
@@ -0,0 +1,32 @@
++++
+title = "LLM honeypot"
+author = ["Michał Sapka"]
+date = 2024-06-28T14:14:00+02:00
+categories = ["blog"]
+draft = false
+weight = 2002
+abstract = "The only way to fight that I see"
++++
+
+Big tech doesn't care about people; the LLM industry actively seeks harm.
+We're [seeing it time and time again](https://www.theverge.com/2024/6/27/24187405/perplexity-ai-twitter-lie-plagiarism).
+They consider the open web to be a resource that exists only for them to harvest.
+
+But the web was designed with good intentions in mind.
+There is no way to actively _block_ them.
+Copyright? Nope, fair use.
+Robots.txt? Nope, some don't care; others pretend to care after the theft.
+Identifying them? Good luck. Not only are the IPs in the _millions_, but they also lie in their user agents.
+
+Some are trying to poison LLMs with prompt injection, but this will not work on any larger dataset.
+
+Personally, I want to at least try.
+Therefore, my site contains a honeypot: _open [a git repository](https://michal.sapka.me/git/mms/Library-of-knowledge) and your IP will be logged_.
+For now I collect them, but soon they will be blocked on my firewall for some time - a week maybe?
+
+This repo is:
+
+- disallowed by robots.txt, so no well-behaved agent would harvest it
+- labeled as a ban hammer in the description.
+
+I'll wait for some time and publish the results.
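
The honeypot in the post above (log every IP that opens the disallowed repository, then block it for about a week) could be sketched roughly as follows. This is only an illustration, not the author's setup: it assumes the web server writes access logs in the common/combined log format, and the names `HONEYPOT_PREFIX`, `BAN_DURATION`, `honeypot_hits` and `active_bans` are hypothetical. The path prefix is taken from the repository URL in the post, and the one-week duration from the "a week maybe?" remark.

```python
import re
from datetime import datetime, timedelta

# Assumptions: the path prefix comes from the repository URL in the post,
# and the ban length follows the "a week maybe?" remark.
HONEYPOT_PREFIX = "/git/mms/Library-of-knowledge"
BAN_DURATION = timedelta(days=7)

# Matches the start of a common/combined log format line (assumed format):
# client IP, identity, user, [timestamp], "METHOD path ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+)'
)


def honeypot_hits(lines):
    """Map each client IP that requested the honeypot to its latest hit time."""
    hits = {}
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None or not m.group("path").startswith(HONEYPOT_PREFIX):
            continue
        seen = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        ip = m.group("ip")
        if ip not in hits or seen > hits[ip]:
            hits[ip] = seen
    return hits


def active_bans(hits, now):
    """Return {ip: ban expiry} for IPs whose ban window has not yet expired."""
    return {
        ip: seen + BAN_DURATION
        for ip, seen in hits.items()
        if seen + BAN_DURATION > now
    }
```

Feeding the resulting IPs into a firewall table is left out on purpose, since the post does not say which firewall is used.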