Commit eb1c0ec

Adds detection for various bots (#8156)
* Improves detection for Cortex Xpanse
* Improves detection for generic bots
* Adds detection for Replicate-Bot
* Adds detection for Cypex
* Adds detection for TikTokSpider
* Adds detection for FHMS ITS Research Scanner
* Adds detection for Together-Bot
* Adds detection for xAI-Bot
* Adds detection for Groq-Bot
* Adds detection for Big Sur AI
* Improves detection for Cohere AI
* Adds detection for FirecrawlAgent
1 parent 46f90be commit eb1c0ec

2 files changed: +207 -15 lines changed

Tests/fixtures/bots.yml

Lines changed: 128 additions & 8 deletions
@@ -4271,12 +4271,12 @@
 -
   user_agent: 'Expanse indexes the network perimeters of our customers. If you have any questions or concerns, please reach out to: scaninfo@expanseinc.com'
   bot:
-    name: Expanse
+    name: Cortex Xpanse
     category: Security Checker
-    url: https://expanse.co/
+    url: https://docs-cortex.paloaltonetworks.com/r/1/Cortex-Xpanse/Scanning-activity
     producer:
-      name: Expanse Inc.
-      url: https://expanse.co/
+      name: Palo Alto Networks, Inc.
+      url: https://www.paloaltonetworks.com/
 -
   user_agent: HuaweiWebCatBot/6.0) (To acquire the allowed html pages as reliable information of URL categorization in the automatic process for Huawei Web Categorization.; https://isecurity.huawei.com/; sec at huawei dot com)
   bot:
@@ -5087,12 +5087,12 @@
 -
   user_agent: 'Expanse, a Palo Alto Networks company, searches across the global IPv4 space multiple times per day to identify customers' presences on the Internet. If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com'
   bot:
-    name: Expanse
+    name: Cortex Xpanse
     category: Security Checker
-    url: https://expanse.co/
+    url: https://docs-cortex.paloaltonetworks.com/r/1/Cortex-Xpanse/Scanning-activity
     producer:
-      name: Expanse Inc.
-      url: https://expanse.co/
+      name: Palo Alto Networks, Inc.
+      url: https://www.paloaltonetworks.com/
 -
   user_agent: Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/)
   bot:
@@ -8816,3 +8816,123 @@
     producer:
       name: 'Zoom Video Communications, Inc'
       url: 'https://www.zoom.com/'
+-
+  user_agent: Hello from Palo Alto Networks, find out more about our scans in https://docs-cortex.paloaltonetworks.com/r/1/Cortex-Xpanse/Scanning-activity
+  bot:
+    name: Cortex Xpanse
+    category: Security Checker
+    url: https://docs-cortex.paloaltonetworks.com/r/1/Cortex-Xpanse/Scanning-activity
+    producer:
+      name: Palo Alto Networks, Inc.
+      url: https://www.paloaltonetworks.com/
+-
+  user_agent: Mozilla/5.0 (Laravel Reaver .env Presence)
+  bot:
+    name: Generic Bot
+-
+  user_agent: Mozilla/5.0 (bang2013@atomicmail.io)
+  bot:
+    name: Generic Bot
+-
+  user_agent: libredtail-http
+  bot:
+    name: Generic Bot
+-
+  user_agent: Mozilla/5.0 (compatible; Replicate-Bot/1.0; +https://replicate.com/)
+  bot:
+    name: Replicate-Bot
+    category: Service Agent
+    url: https://replicate.com/
+    producer:
+      name: Replicate, Inc.
+      url: https://replicate.com/
+-
+  user_agent: cypex.ai/scanning Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/126.0.0.0 Safari/537.36
+  bot:
+    name: Cypex
+    category: Security Checker
+    url: https://cypex.ai/scanning/
+    producer:
+      name: Cypex
+      url: https://cypex.ai/
+-
+  user_agent: Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; TikTokSpider; ttspider-feedback@tiktok.com)
+  bot:
+    name: TikTokSpider
+    category: Search bot
+    url: https://www.tiktok.com/
+    producer:
+      name: TikTok Inc.
+      url: https://www.tiktok.com/
+-
+  user_agent: fhms-its-research-scanner/1.0 (+https://fb02itsscan02.fh-muenster.de)
+  bot:
+    name: FHMS ITS Research Scanner
+    category: Security Checker
+    url: https://fb02itsscan02.fh-muenster.de/
+    producer:
+      name: University of Applied Sciences Muenster
+      url: https://www.fh-muenster.de/
+-
+  user_agent: Mozilla/5.0 (compatible; Together-Bot/1.0; +https://together.ai/)
+  bot:
+    name: Together-Bot
+    category: Crawler
+    url: https://www.together.ai/
+    producer:
+      name: Together Computer Inc.
+      url: https://www.together.ai/
+-
+  user_agent: Mozilla/5.0 (compatible; xAI-Bot/1.0; +https://x.ai/)
+  bot:
+    name: xAI-Bot
+    category: Crawler
+    url: https://x.ai/
+    producer:
+      name: X.AI LLC
+      url: https://x.ai/
+-
+  user_agent: Mozilla/5.0 (compatible; Groq-Bot/1.0; +https://groq.com/)
+  bot:
+    name: Groq-Bot
+    category: Crawler
+    url: https://groq.com/
+    producer:
+      name: Groq, Inc.
+      url: https://groq.com/
+-
+  user_agent: Mozilla/5.0 (compatible; bigsur.ai/1.0)
+  bot:
+    name: Big Sur AI
+    category: Crawler
+    url: https://bigsur.ai/
+    producer:
+      name: Big Sur AI, Inc.
+      url: https://bigsur.ai/
+-
+  user_agent: Mozilla/5.0 (compatible; Cohere-Command/1.0; +https://cohere.com/)
+  bot:
+    name: Cohere AI
+    category: Crawler
+    url: https://cohere.com/
+    producer:
+      name: Cohere, Inc.
+      url: https://cohere.com/
+-
+  user_agent: cohere-training-data-crawler
+  bot:
+    name: Cohere AI
+    category: Crawler
+    url: https://cohere.com/
+    producer:
+      name: Cohere, Inc.
+      url: https://cohere.com/
+-
+  user_agent: Mozilla/5.0 (compatible; FirecrawlAgent/1.0)
+  bot:
+    name: FirecrawlAgent
+    category: Service Agent
+    url: https://www.firecrawl.dev/
+    producer:
+      name: SideGuide Technologies, Inc.
+      url: https://www.sideguide.dev/
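
The fixtures in Tests/fixtures/bots.yml pair each user agent with the bot it should resolve to. As a rough illustration only (this is not the project's test harness), the Python sketch below checks which of the Cortex Xpanse and Cohere AI fixtures are matched by the updated patterns in regexes/bots.yml but not by the patterns they replace. It assumes a plain case-insensitive search; `matches`, `newly_covered`, and the dictionary layout are hypothetical names chosen for this example.

```python
import re

# Patterns copied from the regexes/bots.yml diff in this commit.
OLD_PATTERNS = {
    # This entry was named 'Expanse' before the commit.
    "Cortex Xpanse": r"scaninfo@(?:expanseinc|paloaltonetworks)\.com",
    "Cohere AI": r"cohere-ai",
}
NEW_PATTERNS = {
    "Cortex Xpanse": r"(?:expanseinc|paloaltonetworks)\.com",
    "Cohere AI": r"cohere-(?:ai|command|training)",
}

# (expected bot name, user agent) pairs taken from Tests/fixtures/bots.yml.
FIXTURES = [
    ("Cortex Xpanse", "Expanse indexes the network perimeters of our customers. "
                      "If you have any questions or concerns, please reach out to: "
                      "scaninfo@expanseinc.com"),
    ("Cortex Xpanse", "Hello from Palo Alto Networks, find out more about our scans in "
                      "https://docs-cortex.paloaltonetworks.com/r/1/Cortex-Xpanse/Scanning-activity"),
    ("Cohere AI", "Mozilla/5.0 (compatible; Cohere-Command/1.0; +https://cohere.com/)"),
    ("Cohere AI", "cohere-training-data-crawler"),
]


def matches(pattern, user_agent):
    # Assumption: a case-insensitive substring search is close enough for a sketch.
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


def newly_covered():
    """Return fixture user agents matched by the new pattern but not the old one."""
    return [
        ua for name, ua in FIXTURES
        if matches(NEW_PATTERNS[name], ua) and not matches(OLD_PATTERNS[name], ua)
    ]


if __name__ == "__main__":
    for ua in newly_covered():
        print(ua)
```

Under these assumptions, the older `scaninfo@` fixture is matched by both versions of the pattern, while the "Hello from Palo Alto Networks" string and the two new Cohere user agents are matched only after the change.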

regexes/bots.yml

Lines changed: 79 additions & 7 deletions
@@ -2454,6 +2454,14 @@
     name: 'ByteDance Ltd.'
     url: 'https://bytedance.com/'

+- regex: 'TikTokSpider'
+  name: 'TikTokSpider'
+  category: 'Search bot'
+  url: 'https://www.tiktok.com/'
+  producer:
+    name: 'TikTok Inc.'
+    url: 'https://www.tiktok.com/'
+
 - regex: 'WikiDo'
   name: 'WikiDo'
   category: 'Search bot'
@@ -2773,13 +2781,13 @@
     name: 'ProjectDiscovery, Inc.'
     url: 'https://projectdiscovery.io/'

-- regex: 'scaninfo@(?:expanseinc|paloaltonetworks)\.com'
-  name: 'Expanse'
+- regex: '(?:expanseinc|paloaltonetworks)\.com'
+  name: 'Cortex Xpanse'
   category: 'Security Checker'
-  url: 'https://expanse.co/'
+  url: 'https://docs-cortex.paloaltonetworks.com/r/1/Cortex-Xpanse/Scanning-activity'
   producer:
-    name: 'Expanse Inc.'
-    url: 'https://expanse.co/'
+    name: 'Palo Alto Networks, Inc.'
+    url: 'https://www.paloaltonetworks.com/'

 - regex: 'HuaweiWebCatBot'
   name: 'HuaweiWebCatBot'
@@ -4045,7 +4053,7 @@
     name: 'Exipert, Inc.'
     url: 'https://www.checkmarknetwork.com/'

-- regex: 'cohere-ai'
+- regex: 'cohere-(?:ai|command|training)'
   name: 'Cohere AI'
   category: 'Crawler'
   url: 'https://cohere.com/'
@@ -5106,8 +5114,72 @@
     name: 'Zoom Video Communications, Inc'
     url: 'https://www.zoom.com/'

+- regex: 'Replicate-Bot'
+  name: 'Replicate-Bot'
+  category: 'Service Agent'
+  url: 'https://replicate.com/'
+  producer:
+    name: 'Replicate, Inc.'
+    url: 'https://replicate.com/'
+
+- regex: 'cypex\.ai'
+  name: 'Cypex'
+  category: 'Security Checker'
+  url: 'https://cypex.ai/scanning/'
+  producer:
+    name: 'Cypex'
+    url: 'https://cypex.ai/'
+
+- regex: 'fhms-its-research-scanner'
+  name: 'FHMS ITS Research Scanner'
+  category: 'Security Checker'
+  url: 'https://fb02itsscan02.fh-muenster.de/'
+  producer:
+    name: 'University of Applied Sciences Muenster'
+    url: 'https://www.fh-muenster.de/'
+
+- regex: 'Together-Bot'
+  name: 'Together-Bot'
+  category: 'Crawler'
+  url: 'https://www.together.ai/'
+  producer:
+    name: 'Together Computer Inc.'
+    url: 'https://www.together.ai/'
+
+- regex: 'xAI-Bot'
+  name: 'xAI-Bot'
+  category: 'Crawler'
+  url: 'https://x.ai/'
+  producer:
+    name: 'X.AI LLC'
+    url: 'https://x.ai/'
+
+- regex: 'Groq-Bot'
+  name: 'Groq-Bot'
+  category: 'Crawler'
+  url: 'https://groq.com/'
+  producer:
+    name: 'Groq, Inc.'
+    url: 'https://groq.com/'
+
+- regex: 'bigsur\.ai'
+  name: 'Big Sur AI'
+  category: 'Crawler'
+  url: 'https://bigsur.ai/'
+  producer:
+    name: 'Big Sur AI, Inc.'
+    url: 'https://bigsur.ai/'
+
+- regex: 'FirecrawlAgent'
+  name: 'FirecrawlAgent'
+  category: 'Service Agent'
+  url: 'https://www.firecrawl.dev/'
+  producer:
+    name: 'SideGuide Technologies, Inc.'
+    url: 'https://www.sideguide.dev/'
+
 # Generic bots
-- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherx?web|kirkland-signature|LinkChain|survey-security-dot-txt|infrawatch|Time/|r00ts3c-owned-you|nvdorz|Root Slut|NiggaBalls|BotPoke|GlobalWebSearch|xx032_bo9vs83_2a|sslshed|geckotrail|Wordup|Keydrop|\(compatible\)|John Recon|SPARK COMMIT|masjesu|Komaru_The_Cat|Jesus Christ of Nazareth is LORD|Kowai|Hakai|LoliSec|LMAO|^xenu|^(?:chrome|firefox|Abcd|Dark|KvshClient|Node.js|Report Runner|url|Zeus|ZmEu)$|OnlyScans|TheInternetSearchx'
+- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherx?web|kirkland-signature|LinkChain|survey-security-dot-txt|infrawatch|Time/|r00ts3c-owned-you|nvdorz|Root Slut|NiggaBalls|BotPoke|GlobalWebSearch|xx032_bo9vs83_2a|sslshed|geckotrail|Wordup|Keydrop|\(compatible\)|John Recon|SPARK COMMIT|masjesu|Komaru_The_Cat|Jesus Christ of Nazareth is LORD|Kowai|Hakai|LoliSec|LMAO|^xenu|^(?:chrome|firefox|Abcd|Dark|KvshClient|Node.js|Report Runner|url|Zeus|ZmEu)$|OnlyScans|TheInternetSearchx|Laravel Reaver|bang2013|libredtail'
   name: 'Generic Bot'

 # Generic detections
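
Entry order in regexes/bots.yml appears to matter: the generic alternation contains `cortex`, and the new Palo Alto Networks user agent includes `docs-cortex` in its URL. The sketch below illustrates this with a first-match-wins lookup over a small subset of the entries in this commit, kept in file order. It is a simplified model, assuming a case-insensitive search with no extra boundary handling; `BOT_REGEXES` and `match_bot` are hypothetical names, and the generic pattern is abbreviated to four of its alternatives.

```python
import re

# Subset of regexes/bots.yml entries touched by this commit, in file order.
# Assumption: the first pattern that matches decides the bot name.
BOT_REGEXES = [
    (r"TikTokSpider", "TikTokSpider"),
    (r"(?:expanseinc|paloaltonetworks)\.com", "Cortex Xpanse"),
    (r"cohere-(?:ai|command|training)", "Cohere AI"),
    (r"cypex\.ai", "Cypex"),
    (r"bigsur\.ai", "Big Sur AI"),
    (r"FirecrawlAgent", "FirecrawlAgent"),
    # Abbreviated generic alternation: one existing term plus the three new ones.
    (r"cortex|Laravel Reaver|bang2013|libredtail", "Generic Bot"),
]


def match_bot(user_agent):
    """Return the name of the first entry whose pattern matches, else None."""
    for pattern, name in BOT_REGEXES:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return name
    return None


if __name__ == "__main__":
    samples = [
        "Hello from Palo Alto Networks, find out more about our scans in "
        "https://docs-cortex.paloaltonetworks.com/r/1/Cortex-Xpanse/Scanning-activity",
        "Mozilla/5.0 (compatible; Cohere-Command/1.0; +https://cohere.com/)",
        "libredtail-http",
        "Mozilla/5.0 (Laravel Reaver .env Presence)",
    ]
    for ua in samples:
        print(match_bot(ua), "<-", ua)
```

In this model the Palo Alto Networks string resolves to Cortex Xpanse only because that entry precedes the generic one; if the two were swapped, the `cortex` alternative would claim it as a Generic Bot.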
