chore: add SEO strategy cookbook #894
# SEO Strategy for npmx.dev

This document outlines the technical SEO strategy adopted for `npmx.dev`, given its nature as a dynamic SSR application with near-infinite content (the npm registry) and current internationalization constraints.

## 1. Indexing & Crawling

### The Challenge

`npmx` acts as a mirror/browser for the npm registry. We do not know all valid URLs (`/package/[name]`) in advance, and there are millions of possible combinations. Additionally, invalid URLs could generate spam content or infinite loops.
### The Solution: Organic Crawling

We do not use a massive `sitemap.xml`. Instead, we rely on natural link discovery by bots (Googlebot, Bingbot, etc.):

1. **Entry Point:** The Home page (`/`) links to popular packages.
2. **Expansion:** Each package page links to its **dependencies**, **devDependencies**, and **peerDependencies** (see the sketch after this list).
3. **Result:** Bots jump from package to package, indexing the npm dependency graph organically and efficiently.
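A rough sketch of what such a crawlable dependency list can look like (the component and prop names below are illustrative, not the actual `Dependencies.vue` code): because `<NuxtLink>` renders as a plain `<a href>` in the server-rendered HTML, bots can follow the graph without executing JavaScript.

```vue
<!-- DependencyList.vue (hypothetical name, sketch only) -->
<script setup lang="ts">
// The dependency map comes straight from the package manifest,
// e.g. { "vue": "^3.4.0", "vite": "^5.0.0" }.
defineProps<{ dependencies: Record<string, string> }>()
</script>

<template>
  <ul>
    <!-- Server-rendered <a> tags: crawlable without client-side JS -->
    <li v-for="(range, name) in dependencies" :key="name">
      <NuxtLink :to="`/package/${name}`">{{ name }}</NuxtLink>
      <span> {{ range }}</span>
    </li>
  </ul>
</template>
```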
### Error Handling (404)

To prevent indexing of non-existent URLs (`/package/fake-package`):

- The SSR server returns a real **HTTP 404 Not Found** status code when the npm API indicates the package does not exist (see the sketch after this list).
- This causes search engines to discard the URL immediately and not index it, without needing an explicit `noindex` tag.
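As a minimal sketch (assuming a Nuxt page and a hypothetical internal endpoint; not the exact npmx implementation), the package page can throw a fatal error during SSR so the response carries a real 404 status:

```vue
<!-- app/pages/package/[name].vue (sketch) -->
<script setup lang="ts">
const route = useRoute()

// Hypothetical internal endpoint that proxies the npm registry.
const { data: pkg, error } = await useAsyncData(
  `package:${route.params.name}`,
  () => $fetch(`/api/registry/${route.params.name}`),
)

if (error.value || !pkg.value) {
  // During SSR this sets the HTTP status code of the response to 404,
  // so crawlers drop the URL instead of indexing an error page.
  throw createError({ statusCode: 404, statusMessage: 'Package not found', fatal: true })
}
</script>
```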
## 2. `robots.txt` File

The goal of `robots.txt` is to optimize the _Crawl Budget_ by blocking low-value or computationally expensive areas.

**Proposed `public/robots.txt`:**

```txt
User-agent: *
Allow: /

# Block internal search results (duplicate/infinite content)
Disallow: /search

# Block user utilities and settings
Disallow: /settings
Disallow: /compare
Disallow: /auth/

# Block code explorer and docs (high crawl cost, low SEO value for general search)
Disallow: /package-code/
Disallow: /package-docs/

# Block internal API endpoints
Disallow: /api/
```
Comment on lines +32 to +50 (Contributor):

> npmjs also blocks old versions from being indexed: https://www.npmjs.com/robots.txt. I think it makes a lot of sense.
## 3. Internationalization (i18n) & SEO

### Current Status

- The application supports multiple languages (UI).
- **No URL prefixes are used** (e.g., `/es/package/react` does not exist, only `/package/react`).
- Language is determined on the client side (browser) or defaults to English on the server.

### SEO Implications

- **Canonicalization:** There is only one canonical URL per package (`https://npmx.dev/package/react`).
- **Indexing Language:** Googlebot typically crawls from the US without specific cookies/preferences. The SSR server renders in `en-US` by default.
- **Result:** **Google will index the site exclusively in English.**

### Is this a problem?

**No.** For a global technical tool like `npmx`:

- Search traffic is predominantly in English (package names, technical terms).
- We avoid the complexity of managing `hreflang` and duplicate content across 20+ languages.
- User Experience (UX) remains localized: users land on the page (indexed in English), and the client hydrates the app in their preferred language.
Contributor:

> ...and a vast majority of READMEs are in English anyway, and they take up a significant amount of npmx's displayed content.
## 4. Summary of Actions

1. ✅ **404 Status:** Ensured in SSR for non-existent packages.
2. ✅ **Internal Linking:** Dependency components (`Dependencies.vue`) generate crawlable links (`<NuxtLink>`).
3. ✅ **Dynamic Titles:** `useSeoMeta` correctly manages titles and descriptions, escaping special characters for security and proper display (see the sketch after this list).
4. 📝 **Pending:** Update `public/robots.txt` with the proposed blocking rules to protect the _Crawl Budget_.
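For item 3, here is a simplified sketch of how a package page might set these tags (the data shape is an assumption, not the exact npmx code); `useSeoMeta` renders the values as meta tag content rather than raw HTML, which is the kind of escaping behavior the note above relies on.

```vue
<!-- Inside a package page (sketch); ref and useSeoMeta are auto-imported in Nuxt -->
<script setup lang="ts">
// `pkg` stands in for the package data fetched by the page (see the 404 sketch above).
const pkg = ref<{ name: string, description?: string } | null>(null)

useSeoMeta({
  // Getters keep the tags in sync if the data changes client-side.
  title: () => (pkg.value ? `${pkg.value.name} - npmx` : 'npmx'),
  description: () => pkg.value?.description ?? 'Browse the npm registry on npmx.dev',
})
</script>
```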
## 5. Implementation Details: Meta Tags & Sitemap

### Pages Requiring `noindex, nofollow`

Based on the `robots.txt` strategy, the following Vue pages should explicitly include the `<meta name="robots" content="noindex, nofollow">` tag via `useSeoMeta`. This acts as a second layer of defense against indexing low-value content (a sketch follows the list).

- **`app/pages/search.vue`**: Internal search results.
- **`app/pages/settings.vue`**: User preferences.
- **`app/pages/compare.vue`**: Dynamic comparison tool.
- **`app/pages/package-code/[...path].vue`**: Source code explorer.
- **`app/pages/package-docs/[...path].vue`**: Generated documentation (consistent with the `robots.txt` block).
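A minimal sketch of what that could look like, shown for `search.vue` (the other pages would add the same call):

```vue
<!-- app/pages/search.vue (sketch) -->
<script setup lang="ts">
// `robots` maps to <meta name="robots" content="...">.
// useSeoMeta is auto-imported in Nuxt.
useSeoMeta({
  robots: 'noindex, nofollow',
})
</script>
```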
Comment on lines +85 to +91 (Contributor):

> A `robots.txt` `Disallow` prevents crawlers from seeing the `noindex` meta tag in the first place. If a path is disallowed, most bots won't fetch the page, so the meta tag is not an effective "second layer." Either allow crawling and use `noindex`, or rely on the `Disallow` rule alone.
Contributor (Author):

> That's correct: technically, if the path is blocked, the metadata isn't read. But we keep the `Disallow` directive to prioritize the crawl budget. The `noindex` directive is proposed as a defensive security measure in case crawling is accidentally allowed, or for bots that don't strictly adhere to `robots.txt` but do respect the metadata.
### Canonical URLs & i18n

- **Canonical Rule:** The canonical URL is **always the English (default) URL**, regardless of the user's selected language or browser settings.
  - Example: `https://npmx.dev/package/react`
- **Reasoning:** Since we do not use URL prefixes for languages (e.g., `/es/...`), there is technically only _one_ URL per resource; the language change happens client-side. Therefore, the canonical tag must point to this single, authoritative URL to prevent confusion for search engines (a sketch follows).
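A minimal sketch of how that canonical tag could be set from a shared composable (the composable name and hard-coded origin are assumptions, not the current npmx code):

```ts
// composables/useCanonicalUrl.ts (hypothetical name, sketch only)
import { useHead, useRequestURL } from '#imports'

export function useCanonicalUrl() {
  // useRequestURL works on both server and client in Nuxt.
  const url = useRequestURL()

  useHead({
    link: [
      // Always the single, language-less URL, e.g. https://npmx.dev/package/react
      { rel: 'canonical', href: `https://npmx.dev${url.pathname}` },
    ],
  })
}
```

Hard-coding the production origin here avoids emitting preview-deployment hostnames as canonical URLs.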
Comment on lines +93 to +97 (Contributor):

> Given the i18n mechanics we have, is this section even relevant?

Contributor (Author):

> I just added a draft version; we should discuss the document on Discord: https://discord.com/channels/1464542801676206113/1468368119528685620
### Sitemap Strategy

- **Decision:** **No `sitemap.xml` will be generated.**
- **Why?**
  - Generating a sitemap for 2+ million npm packages is technically infeasible and expensive to maintain.
  - A partial sitemap (e.g., the top 50k packages) is redundant, because those packages are already well linked from the Home page and "Popular" lists.
- **Organic Discovery:** As detailed in Section 1, bots will discover content naturally by following dependency links, which is the most efficient way to index a graph-based dataset like npm.
Comment:

> I'm not sure about this step. It makes it possible that even super popular packages may be missed out completely, provided that no one has created a package that depends on them, or that this problem exists further down the line. In other words, we would only index stuff that, roughly, would get installed if you ran `pnpm install nuxt vue nitro react svelte vite next astro typescript angular`, plus their devDependencies, which to me, intuitively, sounds like a tiny fraction of the useful packages out there.
Comment:

> Infinite recursion, since the bot will follow those links (peer, dev, and regular deps).