SEO-STRATEGY.md (105 additions, 0 deletions)
# SEO Strategy for npmx.dev

This document outlines the technical SEO strategy adopted for `npmx.dev`, given its nature as a dynamic SSR application with near-infinite content (the npm registry) and its current internationalization constraints.

## 1. Indexing & Crawling

### The Challenge

`npmx` acts as a mirror/browser for the npm registry. We do not know all valid URLs (`/package/[name]`) in advance, and there are millions of possible combinations. Additionally, invalid URLs could generate spam content or infinite loops.

### The Solution: Organic Crawling

We do not use a massive `sitemap.xml`. We rely on natural link discovery by bots (Googlebot, Bingbot, etc.):

1. **Entry Point:** The Home page (`/`) links to popular packages.
2. **Expansion:** Each package page links to its **Dependencies**, **DevDependencies**, and **PeerDependencies** (see the link sketch after this list).
Contributor

I'm not sure about this step.

That makes it possible that even super popular packages may be missed out completely, provided that no-one has created a package that depends on them, or that this problem exists further down the line.

In other words, we would only index stuff that - roughly - would get installed if you'd run `pnpm install nuxt vue nitro react svelte vite next astro typescript angular` - plus their devDependencies - which to me, intuitively, sounds like a tiny fraction of useful packages out there.

Contributor Author
@userquin, Feb 3, 2026

Infinite recursion, since the bot will follow those links (peer, dev and regular deps).

3. **Result:** Bots jump from package to package, indexing the npm dependency graph organically and efficiently.
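
What matters for this to work is that dependency links are rendered as plain anchors in the SSR HTML. A minimal sketch of such a component is shown below; the component shape and props are illustrative, and the actual `Dependencies.vue` in the repo may differ:

```vue
<!-- Illustrative sketch only: the real Dependencies.vue may differ. -->
<script setup lang="ts">
defineProps<{
  // e.g. the `dependencies` object from a package manifest: { vue: "^3.5.0", ... }
  dependencies: Record<string, string>
}>()
</script>

<template>
  <ul>
    <li v-for="(range, name) in dependencies" :key="name">
      <!-- NuxtLink renders as a real <a href> during SSR, which is what bots follow. -->
      <NuxtLink :to="`/package/${name}`">{{ name }}</NuxtLink>
      <span>{{ range }}</span>
    </li>
  </ul>
</template>
```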

### Error Handling (404)

To prevent indexing of non-existent URLs (`/package/fake-package`):

- The SSR server returns a real **HTTP 404 Not Found** status code when the npm API indicates the package does not exist.
- This causes search engines to immediately discard the URL and not index it, without needing an explicit `noindex` tag.
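
As a rough sketch of how this looks in a Nuxt page (the page path and the `/api/registry/...` endpoint are placeholders, not necessarily the names used in the codebase):

```vue
<!-- e.g. app/pages/package/[name].vue, heavily simplified -->
<script setup lang="ts">
const route = useRoute()

// Hypothetical internal endpoint that proxies the npm registry.
const { data: pkg } = await useFetch(`/api/registry/${route.params.name}`)

if (!pkg.value) {
  // Thrown during SSR, this makes the server respond with a real HTTP 404,
  // so crawlers drop the URL without needing a noindex tag.
  throw createError({ statusCode: 404, statusMessage: 'Package not found', fatal: true })
}
</script>

<template>
  <main>
    <h1>{{ route.params.name }}</h1>
  </main>
</template>
```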

## 2. `robots.txt` File

The goal of `robots.txt` is to optimize the _Crawl Budget_ by blocking low-value or computationally expensive areas.

**Proposed `public/robots.txt`:**

```txt
User-agent: *
Allow: /

# Block internal search results (duplicate/infinite content)
Disallow: /search

# Block user utilities and settings
Disallow: /settings
Disallow: /compare
Disallow: /auth/

# Block code explorer and docs (high crawl cost, low SEO value for general search)
Disallow: /package-code/
Disallow: /package-docs/

# Block internal API endpoints
Disallow: /api/
```
Comment on lines +32 to +50
Contributor

npmjs also blocks old versions from being indexed:

https://www.npmjs.com/robots.txt

I think it makes a lot of sense.


## 3. Internationalization (i18n) & SEO

### Current Status

- The application supports multiple languages (UI).
- **No URL prefixes are used** (e.g., `/es/package/react` does not exist, only `/package/react`).
- Language is determined on the client-side (browser) or defaults to English on the server.

### SEO Implications

- **Canonicalization:** There is only one canonical URL per package (`https://npmx.dev/package/react`).
- **Indexing Language:** Googlebot typically crawls from the US without specific cookies/preferences. The SSR server renders in `en-US` by default.
- **Result:** **Google will index the site exclusively in English.**

### Is this a problem?

**No.** For a global technical tool like `npmx`:

- Search traffic is predominantly in English (package names, technical terms).
- We avoid the complexity of managing `hreflang` and duplicate content across 20+ languages.
- User Experience (UX) remains localized: users land on the page (indexed in English), and the client hydrates the app in their preferred language.
Contributor

...and a vast majority of READMEs are in English anyway, and they take up a significant amount of npmx's displayed content.


## 4. Summary of Actions

1. ✅ **404 Status:** Ensured in SSR for non-existent packages.
2. ✅ **Internal Linking:** Dependency components (`Dependencies.vue`) generate crawlable links (`<NuxtLink>`).
3. ✅ **Dynamic Titles:** `useSeoMeta` correctly manages titles and descriptions, escaping special characters for security and proper display (a sketch follows this list).
4. 📝 **Pending:** Update `public/robots.txt` with the proposed blocking rules to protect the _Crawl Budget_.
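
For point 3, a minimal sketch of the intent (the exact title format, fields and data source used by npmx may differ):

```vue
<script setup lang="ts">
const route = useRoute()
const name = String(route.params.name ?? '')

// Assume the description comes from the fetched package manifest (not shown here).
const description = 'Package description from the registry'

useSeoMeta({
  title: `${name} - npmx`,
  description,
  ogTitle: `${name} - npmx`,
  ogDescription: description,
})
</script>
```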

## 5. Implementation Details: Meta Tags & Sitemap

### Pages Requiring `noindex, nofollow`

Based on the `robots.txt` strategy, the following Vue pages should explicitly include the `<meta name="robots" content="noindex, nofollow">` tag via `useSeoMeta`. This acts as a second layer of defense against indexing low-value content: compliant bots will not fetch these pages at all because of the `robots.txt` rules, so the meta tag mainly covers bots that ignore `robots.txt`, or the case where a Disallow rule is accidentally removed.

- **`app/pages/search.vue`**: Internal search results.
- **`app/pages/settings.vue`**: User preferences.
- **`app/pages/compare.vue`**: Dynamic comparison tool.
- **`app/pages/package-code/[...path].vue`**: Source code explorer.
- **`app/pages/package-docs/[...path].vue`**: Generated documentation (consistent with robots.txt block).
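
For any of the pages above, this can be as small as the following sketch (the actual wiring in the repo may differ):

```vue
<!-- e.g. at the top of app/pages/settings.vue -->
<script setup lang="ts">
useSeoMeta({
  // Second layer on top of the robots.txt Disallow rules: it only matters for
  // bots that fetch the page despite robots.txt, or if a Disallow rule is removed.
  robots: 'noindex, nofollow',
})
</script>
```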
Comment on lines +85 to +91
Contributor
@coderabbitai (bot), Feb 3, 2026

⚠️ Potential issue | 🟡 Minor

Robots.txt disallow prevents crawlers from seeing the noindex meta.

If a path is disallowed, most bots won’t fetch the page, so the meta tag is not an effective “second layer.” Either allow crawling and use noindex,follow, or keep the disallow and drop the meta requirement for those paths.

Contributor Author

That's correct: technically, if a path is blocked, the metadata isn't read. But we keep the Disallow directive to prioritize the crawl budget. The noindex directive is proposed as a defensive measure in case crawling is accidentally allowed, or for bots that don't strictly adhere to robots.txt but do respect the metadata.

Contributor

@userquin, that makes perfect sense—thank you for the clarification! You're absolutely right that this "defense in depth" approach is a solid SEO best practice. The Disallow directive optimizes your crawl budget for well-behaved bots, while the noindex meta tags provide a safety net for edge cases like accidental configuration changes or non-compliant crawlers. This dual-layer strategy is prudent for production environments.




### Canonical URLs & i18n

- **Canonical Rule:** The canonical URL is **always the English (default) URL**, regardless of the user's selected language or browser settings.
- Example: `https://npmx.dev/package/react`
- **Reasoning:** Since we do not use URL prefixes for languages (e.g., `/es/...`), there is technically only _one_ URL per resource. The language change happens client-side. Therefore, the canonical tag must point to this single, authoritative URL to prevent confusion for search engines.
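
One way to emit that tag from the package page (a sketch, not necessarily how the repo currently does it):

```vue
<script setup lang="ts">
const route = useRoute()

useHead({
  link: [
    {
      rel: 'canonical',
      // Always the unprefixed URL: the language never appears in the path.
      href: `https://npmx.dev/package/${route.params.name}`,
    },
  ],
})
</script>
```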
Comment on lines +93 to +97
Contributor

Given the i18n mechanics we have, is this section even relevant?

Contributor Author

I just added a draft version; we should discuss the document on Discord: https://discord.com/channels/1464542801676206113/1468368119528685620


### Sitemap Strategy

- **Decision:** **No `sitemap.xml` will be generated.**
- **Why?**
- Generating and maintaining a sitemap for 2+ million npm packages is impractical and expensive.
- A partial sitemap (e.g., top 50k packages) is redundant because these packages are already well-linked from the Home page and "Popular" lists.
- **Organic Discovery:** As detailed in Section 1, bots will discover content naturally by following dependency links, which is the most efficient way to index a graph-based dataset like npm.