Bot scrapers DoS me?

And a possible defense

I have a routine that runs every 15 minutes on my home machine and polls my other servers and collates the results.

Once or twice a day one of my machines, at linode, was refusing to talk. It wasn’t causing a problem since the data is replicated and the system catches up, but it was annoying.

Digging around, I found the machine looked like it was working normally. But one log showed that it had a load average of over 30 when the problem happened.

Now in the past when this occurred it was because of hardware issues at linode, but I wasn’t seeing anything like that in the logs. Normally sar data would show excessive I/O load or long service times, yet everything looked normal. Indeed it just showed higher CPU usage in user space (around 75%), but still over 20% idle time.
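For the record, the checks I mean are just the standard sysstat views of the day’s data; roughly:

% sar -q     # run queue and load averages
% sar -u     # CPU usage (user/system/iowait/idle)
% sar -d     # per-device I/O rates and service times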

CGI

As luck would have it, I was logged in when the problem occurred. A ps showed many dozens of CGI scripts running. Which was odd.
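Nothing fancy is needed to see that sort of pile-up; a count along these lines does it:

% ps -ef | grep '[.]pl' | wc -l    # count the *.pl CGI processes; the [.] stops grep matching itself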

Checking my Apache logs, I found entries like

34.10.139.117 - - [03/Jun/2025:00:25:52 -0400] "GET /sf.pl?ACTION=SHOW&DETAIL=9780575050044 HTTP/1.1" 200 936 "https://spuddy.org/sf.pl" "Scrapy/2.11.2 (+https://scrapy.org)"
34.10.139.117 - - [03/Jun/2025:00:25:53 -0400] "GET /sf.pl?ACTION=SHOW&DETAIL=9780575052482 HTTP/1.1" 200 945 "https://spuddy.org/sf.pl" "Scrapy/2.11.2 (+https://scrapy.org)"
34.10.139.117 - - [03/Jun/2025:00:25:53 -0400] "GET /sf.pl?ACTION=SHOW&DETAIL=9780575055261 HTTP/1.1" 200 954 "https://spuddy.org/sf.pl" "Scrapy/2.11.2 (+https://scrapy.org)"
34.10.139.117 - - [03/Jun/2025:00:25:53 -0400] "GET /sf.pl?ACTION=SHOW&DETAIL=9780380756674 HTTP/1.1" 200 924 "https://spuddy.org/sf.pl" "Scrapy/2.11.2 (+https://scrapy.org)"

Well, that explains a lot. But I thought I had a robots.txt entry to stop this (added after Google’s search indexer started hitting it).

Yup…

User-agent: *
Disallow: /cds.pl
Disallow: /sf.pl
Disallow: /video.pl

Looking at the Scrapy web site I saw, quite prominently featured, a ROBOTSTXT_OBEY setting which lets the scraper ignore robots.txt entries.
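As I understand it, that’s literally a one-line choice in a Scrapy project’s settings.py:

# the scraper's operator decides whether robots.txt gets honoured at all
ROBOTSTXT_OBEY = False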

That’s a pretty arsehole thing to build into your bot, to be honest.

Block by IP?

I originally thought of blocking by IP, but the scraper appears to be hosted on Google Cloud and keeps changing IP addresses.

In the past 2 weeks I’ve seen these sorts of volumes

% sudo grep  -h Scrapy *access_log* | cut -d' ' -f1 | sort | uniq -c | sort -nr
  74220 104.154.64.63
  60663 104.197.101.217
  54657 34.45.158.181
  50737 34.41.99.8
  50610 34.9.117.151
  49530 34.10.205.39
  38538 34.42.59.237
  38490 35.224.154.106
  30822 34.132.153.18
  23115 35.238.9.157
  23081 35.188.193.33
  15410 34.44.52.133
  15410 34.29.68.151
  15410 34.10.139.117
   7705 34.29.122.46
...

I’d be playing whack-a-mole if I went down this path!

Apache rate limiting with mod_evasive

I mentioned this else-net and someone responded that they’d seen similar behaviour against their web server and had configured nginx to start rate limiting.
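(For reference, I believe that’s nginx’s built-in limit_req module; a minimal sketch, with made-up zone name and rates, looks something like this.)

limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location / {
        # allow a short burst, then reject anything faster
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
    }
}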

That sounded like a good idea. To the best of my knowledge, Apache doesn’t have this built in (I might be wrong! Let me know…) but it looked like a third party module, mod_evasive, could do the job.

I noticed that EPEL had this for Red Hat 7, but not for 8 or 9. So I downloaded the SRPM and used it to build an RPM for my servers using mock.
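The rebuild itself is only a couple of commands; roughly this, with the URL, version and mock config name being placeholders for whatever matches your setup:

% # grab the source RPM from the EPEL 7 SRPMS tree
% wget https://.../mod_evasive-<version>.el7.src.rpm
% # rebuild it in a clean EL9 chroot; results land under /var/lib/mock/<config>/result/
% mock -r rocky+epel-9-x86_64 --rebuild mod_evasive-<version>.el7.src.rpm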

Config

After compiling and installing mod_evasive, I created this configuration (based on the defaults, all deployed by ansible).

LoadModule evasive20_module modules/mod_evasive24.so

<IfModule mod_evasive24.c>
    # hash table size for tracking clients
    DOSHashTableSize    3097
    # block if the same page is hit more than 10 times per page interval
    DOSPageCount        10
    # block if the same client makes more than 20 requests per site interval
    DOSSiteCount        20
    # the intervals, in seconds
    DOSPageInterval     1
    DOSSiteInterval     5
    # how long (seconds) a block lasts once triggered
    DOSBlockingPeriod   10
    # who gets the alert email
    DOSEmailNotify      sweh
</IfModule>

The results

Initially I was worried about false positives: would I start blocking legitimate spidering of my site (eg by Google search)? So each time I got a “blocked” alert I would log into the server and check it.

A few iterations of this, plus some code changes (changing the response from 403 to 429; fixing the email it sent out, which wasn’t quite right; and adding more data to the email so I could tell whether a block was a good one), and I had something that seemed to mostly work.

Now I get an alert that looks something like

  Subject: HTTP BLACKLIST 104.154.64.63

  mod_evasive HTTP Blacklisted 104.154.64.63
      URI: /sf.pl?ACTION=SHOW&DETAIL=9780441662517
    Agent: Scrapy/2.11.2 (+https://scrapy.org)

When I looked through my logs yesterday morning I saw, in a 2 hour window

% awk '$9==429' sweh-ssl.access_log | wc -l
18460

Of those, 17,940 were calls to a CGI.

Looking more closely at the entries, the first block was at “15/Jun/2025:05:19:17” and the last at “15/Jun/2025:05:21:49”. So it blocked over 18,000 calls in about two and a half minutes. That’s pretty good!
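Both of those figures fall straight out of the access log; something like this, assuming the standard combined log format (status in field 9, request path in field 7):

% awk '$9==429 && $7 ~ /\.pl/' sweh-ssl.access_log | wc -l        # blocked calls to a CGI
% awk '$9==429 {print $4}' sweh-ssl.access_log | sed -n '1p;$p'   # first and last blocked request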

Now it’s not perfect; I run Apache in prefork mode (’cos I’m so old school!) so each forked child keeps its own state. That means one web server process may block a client while another lets it through; and if a process terminates and a new one replaces it, the new one starts with no state and will allow traffic. eg at “15/Jun/2025:05:19:54” it blocked 262 attempts but let 3 through.
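You can see that leak-through by grouping a single second’s requests by status code; roughly:

% grep '15/Jun/2025:05:19:54' sweh-ssl.access_log | awk '{print $9}' | sort | uniq -c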

But the number it blocks is really helping.

Who would have thought a cheap linode could handle 260 requests per second! Of course it’s all in memory when it hits 429 land, but still…

Forks of mod_evasive

I’ve since checked on GitHub and there are lots of forks of it. One of those forks appears to use shared memory to keep state, so the prefork issue I’m seeing wouldn’t happen; but that fork also removes some functionality I like.

For the moment, I think I’ll stick with this.

Side effects

Funnily enough this has also been catching script kiddies running simple bots against my site:

eg

129.146.124.161 - - [15/Jun/2025:00:12:12 -0400] "GET /credentials.xml HTTP/1.1" 429 227 "http://spuddy.org/credentials.xml" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"

13.74.158.147 - - [15/Jun/2025:14:33:13 -0400] "GET /wp-includes/Text/Diff/Engine/ HTTP/1.1" 429 227 "-" "-"

47.251.102.239 - - [15/Jun/2025:17:58:46 -0400] "GET /index.php?lang=../../../../../../../../tmp/index1 HTTP/1.1" 429 227 "-" "Custom-AsyncHttpClient"

It’s also spotted broken bots that keep requesting the same page (I’ve seen this one before)

117.159.55.9 - - [15/Jun/2025:13:48:21 -0400] "GET /images/emblem.gif HTTP/1.1" 429 227 "-" "Go-http-client/1.1"
117.159.55.9 - - [15/Jun/2025:13:48:22 -0400] "GET /images/emblem.gif HTTP/1.1" 429 227 "-" "Go-http-client/1.1"
117.159.55.9 - - [15/Jun/2025:13:48:22 -0400] "GET /images/emblem.gif HTTP/1.1" 429 227 "-" "Go-http-client/1.1"
117.159.55.9 - - [15/Jun/2025:13:48:22 -0400] "GET /images/emblem.gif HTTP/1.1" 429 227 "-" "Go-http-client/1.1"

And, of course, OpenAI

20.171.207.4 - - [14/Jun/2025:22:40:37 -0400] "GET /post/2018-01-04-meltdown_spectre/ HTTP/1.1" 429 227 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"

Ranting

Web scrapers should obey the wishes of the content creator. robots.txt should be followed. At the very least they should act to not break the remote server. I know I can’t stop you stealing my stuff (and I’m resigned to that; theft of intellectual property is a different blog post entirely), but at least don’t break my machine.

Also you don’t need to scrape the same page a gazillion times. For example,

% cat *access_log* | grep -c 9780575050044
133

You’ve requested details about one book 133 times in 2 weeks? What the hell?

This is just abusive behaviour.

This web site is my hobby; it costs $20/month to run. It’s arseholes like this scraper that make this less fun.

Heh, at least I get the satisfaction of knowing that it’s costing you money to do this!

OK, rant over.

Summary

The modern internet is basically out of control. Mass theft of data for AI purposes is prevalent. I suspect that the majority of hits on my server are from bots, and not humans.

In the past 28 days there were 771,000 hits on this web server (across the various URLs it serves); 548,000 of those were from Scrapy. Of the rest, a quick check based on the user-agent showed 55% were trivially bots (scrapers, script kiddies, RSS readers, whatever). Some of those bots may be useful :-)
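The breakdown is nothing clever; roughly this, with the user-agent pattern being a very crude guess:

% cat *access_log* | wc -l                                                 # total hits
% grep -h Scrapy *access_log* | wc -l                                      # Scrapy alone
% grep -hv Scrapy *access_log* | grep -Eic 'bot|spider|crawl|scan|feed'    # crude "obviously a bot" match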

Many of those bots are broken in various ways and can cause harm to victims.

In amongst this noise are various so-called “white hat” organisations that are also scanning machines without permission. At least they normally obey robots.txt and just cause noise in the logs.

It seems clear that most of the traffic on the internet is just machines talking to machines and not human eyeballs at all.

Any website, even a hobby one like this, is going to need more robust defenses against this brokenness. Tools like mod_evasive are reactive, but if they react quickly enough they can prevent service degradation. You’d need different tools if you want to prevent data theft in the first place!

And an enterprise that needs to look at the logs to determine a real attack… good luck! All this brokenness and “white hat” traffic is just making your life harder.