Draft: MkDocs documentation site #19
Open
Dahlializi wants to merge 19 commits into main from docs
19 commits
e565ab2 Add initial MkDocs documentation site
8bef376 Add MkDocs documentation pages and navigation
d5b5e27 Add mkdocs documentation dependencies (thompsonmj)
08e190a Merge remote-tracking branch 'origin/main' into docs (thompsonmj)
497f517 Banner and logo images (thompsonmj)
63fa90f Expand docs navigation and quick reference (thompsonmj)
38c0faa Update sample input datasets (thompsonmj)
c8da990 Add mkdocs-gen-files to docs extras (thompsonmj)
e76b1be Add docs page and PR preview deployment workflow (thompsonmj)
3e5f134 Wordsmith landing page (thompsonmj)
3ad259f Add tooltip explaning Metazoa→Animalia resolution (thompsonmj)
f055a83 Merge remote-tracking branch 'origin/main' into docs (thompsonmj)
723cd8e Add common name example (thompsonmj)
58e610e Update style (thompsonmj)
4f8605d Remove redundant info in docs from README (thompsonmj)
ff6cf81 Replace brief description in README (thompsonmj)
5df1ebd Update example output data for quick ref guide (thompsonmj)
3a43b04 Match example data paths to repo (thompsonmj)
f9623a5 Pronunciation tip update and GNVerifier capitalization (thompsonmj)
@@ -0,0 +1,103 @@
+name: Build & Deploy MkDocs (gh-pages with PR previews)
+
+on:
+  workflow_dispatch:
+  pull_request:
+    branches: [ main ]
+    types: [opened, synchronize, reopened, closed]
+  push:
+    branches: [ main ]
+
+permissions:
+  contents: write
+  pages: write
+
+jobs:
+  build:
+    # Run for push, workflow dispatch, PRs from SAME repo that are not closed
+    if: |
+      github.event_name == 'push' ||
+      github.event_name == 'workflow_dispatch' ||
+      (github.event_name == 'pull_request' &&
+      github.event.pull_request.head.repo.fork == false &&
+      github.event.action != 'closed')
+    runs-on: ubuntu-latest
+    concurrency:
+      group: ${{ github.workflow }}-${{ github.ref }}
+      cancel-in-progress: true
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - name: Install deps
+        run: |
+          python -m pip install --upgrade pip
+          pip install '.[docs]'
+      - name: Build with MkDocs
+        run: mkdocs build
+      - name: Upload built site as artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: site
+          path: ./site
+
+  deploy:
+    needs: build
+    # Deploy on push to main (root) or PRs from SAME repo (not closed) -> pr-<N>/
+    if: |
+      github.event_name == 'push' ||
+      (github.event_name == 'pull_request' &&
+      github.event.pull_request.head.repo.fork == false &&
+      github.event.action != 'closed')
+    runs-on: ubuntu-latest
+    concurrency:
+      group: ${{ github.workflow }}-${{ github.ref }}
+      cancel-in-progress: true
+    steps:
+      - name: Download built site
+        uses: actions/download-artifact@v4
+        with:
+          name: site
+          path: ./site
+      - name: Deploy to gh-pages
+        uses: peaceiris/actions-gh-pages@v4
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_branch: gh-pages
+          publish_dir: ./site
+          keep_files: true
+          destination_dir: ${{ github.event_name == 'pull_request' && format('pr-{0}', github.event.number) || '' }}
+
+  cleanup:
+    # Only when a same-repo PR closes
+    if: >
+      github.event_name == 'pull_request' &&
+      github.event.pull_request.head.repo.fork == false &&
+      github.event.action == 'closed'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout gh-pages
+        uses: actions/checkout@v4
+        with:
+          ref: gh-pages
+          fetch-depth: 0
+      - name: Configure git author
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "github-actions[bot]@users.noreply.github.com"
+      - name: Remove preview folder
+        shell: bash
+        run: |
+          set -euo pipefail
+          PR_DIR="pr-${{ github.event.number }}"
+          echo "Attempting to remove $PR_DIR"
+          if [ -d "$PR_DIR" ]; then
+            git rm -r "$PR_DIR"
+            git commit -m "Remove preview for PR #${{ github.event.number }}"
+            git push origin gh-pages
+          else
+            echo "No preview folder $PR_DIR found; nothing to do."
+          fi
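
The three `if:` conditions in this workflow decide which jobs run for a given event. As a review aid, the sketch below restates them in Python. The helper names `jobs_for_event` and `destination_dir` are hypothetical and are not part of the workflow. The model assumes every job that starts also succeeds, so the `needs: build` dependency never blocks `deploy`.

```python
def jobs_for_event(event_name, action=None, from_fork=False):
    """Return the names of the jobs whose `if:` condition would pass."""
    same_repo_pr = event_name == "pull_request" and not from_fork
    open_pr = same_repo_pr and action != "closed"

    jobs = set()
    # build: push, manual dispatch, or a same-repo PR that is not closing
    if event_name in ("push", "workflow_dispatch") or open_pr:
        jobs.add("build")
    # deploy: push or a same-repo PR that is not closing (no manual dispatch)
    if event_name == "push" or open_pr:
        jobs.add("deploy")
    # cleanup: only when a same-repo PR closes
    if same_repo_pr and action == "closed":
        jobs.add("cleanup")
    return jobs


def destination_dir(event_name, pr_number=None):
    """Mirror the `destination_dir` expression: pr-<N> for PRs, site root otherwise."""
    return f"pr-{pr_number}" if event_name == "pull_request" else ""


print(jobs_for_event("workflow_dispatch"))
print(destination_dir("pull_request", 19))
```

Read this way, a manual `workflow_dispatch` run builds the site but does not deploy it, and PRs from forks trigger none of the three jobs.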
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,194 +1,18 @@ | ||
| # TaxonoPy | ||
| <h1 align="center"> | ||
| <img src="docs/_assets/taxonopy_banner.svg" alt="TaxonoPy banner"> | ||
| </h1> | ||
|
|
||
| [](https://doi.org/10.5281/zenodo.15499454) | ||
|
|
||
| [](https://pypi.org/project/taxonopy) | ||
| [](https://pypi.org/project/taxonopy) | ||
|
|
||
| `TaxonoPy` (taxon-o-py) is a command-line tool for creating an internally consistent taxonomic hierarchy using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier). See below for the structure of inputs and outputs. | ||
| ## TaxonoPy: Reproducible, Traceable, and Scalable Biological Taxonomy Alignment | ||
|
|
||
| ## Purpose | ||
| The motivation for this package is to create an internally consistent and standardized classification set for organisms in a large biodiversity dataset composed from different data providers that may use very similar and overlapping but not identical taxonomic hierarchies. | ||
| TaxonoPy (taxon-o-pie) is a command-line tool for harmonizing large biodiversity datasets into a consistent taxonomy ready for AI applications. Built on the [Global Names Verifier (GNVerifier)](https://github.com/gnames/gnverifier), it provides complete provenance tracking, flexible resolution strategies, and batch processing of 100M+ records to address challenges in reproducibility and scale in massive multi-source taxonomy alignment. | ||
|
|
||
| Its development has been driven by its application in the TreeOfLife-200M (TOL) dataset. This dataset contains over 200 million samples of organisms from four core data providers: | ||
|
|
||
| - [The Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/) | ||
| - [BIOSCAN-5M](https://biodiversitygenomics.net/projects/5m-insects/) | ||
| - [FathomNet](https://www.fathomnet.org/) | ||
| - [The Encyclopedia of Life (EOL)](https://eol.org/) | ||
|
|
||
| The names (and classification) of taxa may be (and often are) inconsistent across these resources. This package addresses this problem by creating an internally consistent classification set for such taxa. | ||
|
|
||
| ### Input | ||
|
|
||
| A directory containing Parquet partitions of the seven-rank Linnaean taxonomic metadata for organisms in the dataset. Labels should include: | ||
| - `uuid`: a unique identifier for each sample (required). | ||
| - `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`: the taxonomic ranks of the organism (required, may have sparsity). | ||
| - `scientific_name`: the scientific name of the organism, to the most specific rank available (optional). | ||
| - `common_name`: the common (i.e. vernacular) name of the organism (optional). | ||
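
To make the column requirements above concrete, here is a minimal pre-flight check written against plain column names. It is a sketch only: `missing_required` is a hypothetical helper, not a TaxonoPy function, and it checks that the columns are present, not what they contain (the rank columns may be sparse).

```python
# Column names taken from the README's input description.
REQUIRED_COLUMNS = [
    "uuid", "kingdom", "phylum", "class", "order", "family", "genus", "species",
]
OPTIONAL_COLUMNS = ["scientific_name", "common_name"]


def missing_required(columns):
    """Return the required column names that are absent, in README order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example: a table that lacks the `order` column.
header = ["uuid", "kingdom", "phylum", "class", "family", "genus", "species"]
print(missing_required(header))
```

With a real Parquet partition, the list of column names would come from whatever reader is in use; the check itself does not depend on the file format.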
|
|
||
| See the example data in | ||
| - `examples/input/sample.parquet` | ||
| - `examples/resolved/sample.resolved.parquet` (generated with [`taxonopy resolve`](#command-resolve)) | ||
| - `examples/resolved_with_common_names/sample.resolved.parquet` (generated with [`taxonopy common-names`](#command-common-names)) | ||
|
|
||
| ### Challenges | ||
| This taxonomy information is provided by each data provider and the original sources, but the classification can be... | ||
|
|
||
| - **Inconsistent**: both between and within sources (e.g. kingdom Metazoa vs. Animalia). | ||
| - **Incomplete**: many samples are missing one or more ranks. Some have 'holes' where higher and lower ranks are present, but intermediate ranks are missing. | ||
| - **Incorrect**: some samples have incorrect classifications. This can come in the form of spelling errors, nonstandard idiosyncratic terms, or outdated classifications. | ||
| - **Ambiguous**: homonyms, synonyms, and other terms that can be interpreted in multiple ways unless handled systematically. | ||
|
|
||
| Taxonomic authorities exist to standardize classification, but ... | ||
| - There are many authorities. | ||
| - They may disagree. | ||
| - A given organism may be missing from some. | ||
|
|
||
| ### Solution | ||
| `TaxonoPy` uses the taxonomic hierarchies provided by the TOL core data providers to query GNVerifier and create a standardized classification for each sample in the TOL dataset. It prioritizes the [GBIF Backbone Taxonomy](https://verifier.globalnames.org/data_sources/11), since this represents the largest part of the TOL dataset. Where GBIF misses, backup sources such as the [Catalogue of Life](https://verifier.globalnames.org/data_sources/1) and [Open Tree of Life (OTOL) Reference Taxonomy](https://verifier.globalnames.org/data_sources/179) are used. | ||
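
The prioritization described above can be pictured as a priority-ordered fallback over data sources. The sketch below is illustrative only and does not show TaxonoPy's actual logic. The data source IDs (11, 1, 179) come from the GNVerifier links in the paragraph. The order of the two backup sources, the `pick_preferred` helper, and the `source_id` / `name` keys of each match are assumptions made for this example.

```python
# GNVerifier data source IDs from the links above; backup order is an assumption.
SOURCE_PRIORITY = [11, 1, 179]  # GBIF Backbone, Catalogue of Life, OTOL


def pick_preferred(matches):
    """Pick the match from the highest-priority source, or None if none qualify.

    `matches` is a list of dicts with a hypothetical `source_id` key.
    """
    rank = {source_id: position for position, source_id in enumerate(SOURCE_PRIORITY)}
    candidates = [m for m in matches if m["source_id"] in rank]
    if not candidates:
        return None
    return min(candidates, key=lambda m: rank[m["source_id"]])


# GBIF has no match here, so the Catalogue of Life result is chosen.
matches = [
    {"source_id": 179, "name": "Animalia"},
    {"source_id": 1, "name": "Animalia"},
]
print(pick_preferred(matches)["source_id"])
```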
|
|
||
| ## Installation | ||
|
|
||
| `TaxonoPy` can be installed with `pip` after setting up a virtual environment. | ||
|
|
||
| ### User Installation with `pip` | ||
|
|
||
| To install the latest version of `TaxonoPy`, run: | ||
| ```console | ||
| pip install taxonopy | ||
| ``` | ||
|
|
||
| ### Usage | ||
| You may view the help for the command line interface by running: | ||
| ```console | ||
| taxonopy --help | ||
| ``` | ||
| This will show you the available commands and options: | ||
| ```console | ||
| usage: taxonopy [-h] [--cache-dir CACHE_DIR] [--cache-input CACHE_INPUT] | ||
| [--show-cache-path] [--cache-stats] [--clear-cache] | ||
| [--show-config] [--version] | ||
| {resolve,trace,common-names} ... | ||
|
|
||
| TaxonoPy: Resolve taxonomic names using GNVerifier and trace data provenance. | ||
|
|
||
| positional arguments: | ||
| {resolve,trace,common-names} | ||
| resolve Run the taxonomic resolution workflow | ||
| trace Trace data provenance of TaxonoPy objects | ||
| common-names Merge vernacular names (post-process) into resolved outputs | ||
|
|
||
| options: | ||
| -h, --help show this help message and exit | ||
| --cache-dir CACHE_DIR | ||
| Directory for TaxonoPy cache (can also be set with TAXONOPY_CACHE_DIR environment variable) (default: None) | ||
| --cache-input CACHE_INPUT | ||
| Input dataset path to compute cache stats for when no command is provided (default: None) | ||
| --show-cache-path Display the current cache directory path and exit (default: False) | ||
| --cache-stats Display statistics about the cache and exit (default: False) | ||
| --clear-cache Clear the TaxonoPy object cache. May be used in isolation. (default: False) | ||
| --show-config Show current configuration and exit (default: False) | ||
| --version Show version number and exit | ||
| ``` | ||
|
|
||
| ### Cache behavior | ||
|
|
||
| `taxonopy resolve` caches parsed entries, entry groups, and every resolution attempt chain using [`diskcache`](https://grantjenks.com/docs/diskcache/) as a stable provenance artifact tied to the TaxonoPy version and input dataset. By default the cache root is `~/.cache/taxonopy`, but you can override it by setting the environment variable `TAXONOPY_CACHE_DIR` or specifying `--cache-dir`. Its primary purpose is to support the `trace` command, which allows you to trace the provenance of any taxonomic entry resolved by TaxonoPy. | ||
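
The cache root lookup described above can be sketched as follows. The source names the `--cache-dir` flag, the `TAXONOPY_CACHE_DIR` variable, and the `~/.cache/taxonopy` default, but it does not say which override wins when both are set. This sketch assumes the flag takes precedence. `resolve_cache_root` is a hypothetical helper, not TaxonoPy's code.

```python
import os
from pathlib import Path


def resolve_cache_root(cli_cache_dir=None, environ=None):
    """Return the cache root: flag first (assumed), then env var, then default."""
    environ = os.environ if environ is None else environ
    if cli_cache_dir:
        return Path(cli_cache_dir).expanduser()
    env_value = environ.get("TAXONOPY_CACHE_DIR")
    if env_value:
        return Path(env_value).expanduser()
    return Path.home() / ".cache" / "taxonopy"


print(resolve_cache_root(environ={}))
```

To see the value TaxonoPy itself resolves, use `taxonopy --show-cache-path`.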
|
|
||
| - Each resolve run writes into `resolve_v<version>_<fingerprint>` where the fingerprint is a SHA-256 hash of the input files’ metadata, so namespaces stay stable per combination of dataset and package version. | ||
| - Inspect a namespace without rerunning by invoking `taxonopy --cache-dir <root> --cache-input <input> --cache-stats`, which reports total size, entry counts, and key-prefix breakdowns. Passing `--cache-stats` after `resolve` or `trace` performs the same check and exits. | ||
| - If both the namespace and the output directory already contain data, `taxonopy resolve` warns and exits unless you pass `--full-rerun`, which clears the cache namespace and output before proceeding. Use `--clear-cache` to wipe only the namespace. | ||
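
The namespace naming in the first bullet can be illustrated with a short sketch. The README only says the fingerprint is a SHA-256 hash of the input files' metadata. Which metadata fields are hashed, how they are serialized, and whether the digest is truncated are all assumptions here. `namespace_name` is a hypothetical helper, and the version string is a placeholder.

```python
import hashlib


def namespace_name(version, file_metadata):
    """Build a `resolve_v<version>_<fingerprint>` style name.

    `file_metadata` is an iterable of (relative_path, size_bytes, mtime_ns)
    tuples; these fields are assumed, not taken from TaxonoPy.
    """
    digest = hashlib.sha256()
    # Sort so the fingerprint does not depend on directory listing order.
    for path, size, mtime_ns in sorted(file_metadata):
        digest.update(f"{path}\0{size}\0{mtime_ns}\n".encode("utf-8"))
    # Truncating to 16 hex characters is an arbitrary choice for readability.
    return f"resolve_v{version}_{digest.hexdigest()[:16]}"


files = [("part-0.parquet", 1024, 1), ("part-1.parquet", 2048, 2)]
print(namespace_name("0.0.0", files))
```

Because the name depends only on the package version and the input metadata, rerunning against an unchanged dataset lands in the same namespace, which is what lets `trace` find earlier results.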
|
|
||
| #### Command: `resolve` | ||
| The `resolve` command is used to perform taxonomic resolution on a dataset. It takes a directory of Parquet partitions as input and outputs a directory of resolved Parquet partitions. | ||
| ```console | ||
| usage: taxonopy resolve [-h] -i INPUT -o OUTPUT_DIR | ||
| [--output-format {csv,parquet}] | ||
| [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] | ||
| [--log-file LOG_FILE] [--force-input] [--full-rerun] | ||
| [--batch-size BATCH_SIZE] [--all-matches] | ||
| [--capitalize] [--fuzzy-uninomial] [--fuzzy-relaxed] | ||
| [--species-group] [--refresh-cache] [--cache-stats] | ||
|
|
||
| options: | ||
| -h, --help show this help message and exit | ||
| -i, --input INPUT Path to input Parquet or CSV file/directory | ||
| -o, --output-dir OUTPUT_DIR | ||
| Directory to save resolved and unsolved output files | ||
| --output-format {csv,parquet} | ||
| Output file format | ||
| --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} | ||
| Set logging level | ||
| --log-file LOG_FILE Optional file to write logs to | ||
| --force-input Force use of input metadata without resolution | ||
| --full-rerun Replace existing cache/output if detected for this input | ||
|
|
||
| GNVerifier Settings: | ||
| --batch-size BATCH_SIZE | ||
| Max number of name queries per GNVerifier API/subprocess call | ||
| --all-matches Return all matches instead of just the best one | ||
| --capitalize Capitalize the first letter of each name | ||
| --fuzzy-uninomial Enable fuzzy matching for uninomial names | ||
| --fuzzy-relaxed Relax fuzzy matching criteria | ||
| --species-group Enable group species matching | ||
|
|
||
| Cache Management: | ||
| --refresh-cache Force refresh of cached objects (input parsing, grouping) before running. | ||
| --cache-stats Display cache statistics for this input and exit. | ||
| ``` | ||
| It is recommended to keep GNVerifier settings at their defaults. | ||
|
|
||
| #### Command: `trace` | ||
| The `trace` command is used to trace the provenance of a taxonomic entry. It takes a UUID and an input path as arguments and outputs the full path of the entry through TaxonoPy. | ||
| ```console | ||
| usage: taxonopy trace [-h] {entry} ... | ||
|
|
||
| positional arguments: | ||
| {entry} | ||
| entry Trace an individual taxonomic entry by UUID | ||
|
|
||
| options: | ||
| -h, --help show this help message and exit | ||
|
|
||
| usage: taxonopy trace entry [-h] --uuid UUID --from-input FROM_INPUT [--format {json,text}] [--verbose] | ||
|
|
||
| options: | ||
| -h, --help show this help message and exit | ||
| --uuid UUID UUID of the taxonomic entry | ||
| --from-input FROM_INPUT | ||
| Path to the original input dataset | ||
| --format {json,text} Output format | ||
| --verbose Show full details including all UUIDs in group | ||
| ``` | ||
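
For scripted use, the `trace entry` invocation can be assembled as an argument list. This sketch uses only the flags shown in the help text above. `trace_entry_command` is a hypothetical helper, the UUID and path are placeholders, and `--format` is always passed explicitly because the help text does not state the default.

```python
def trace_entry_command(uuid, input_path, output_format="text", verbose=False):
    """Build the argv list for `taxonopy trace entry`."""
    if output_format not in ("json", "text"):
        raise ValueError("output_format must be 'json' or 'text'")
    command = [
        "taxonopy", "trace", "entry",
        "--uuid", uuid,
        "--from-input", str(input_path),
        "--format", output_format,
    ]
    if verbose:
        command.append("--verbose")
    return command


# Placeholder UUID and path, for illustration only.
print(trace_entry_command("00000000-0000-0000-0000-000000000000", "examples/input", "json"))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where `taxonopy` is installed.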
|
|
||
| #### Command: `common-names` | ||
| The `common-names` command is used to merge vernacular names into the resolved output. It takes a directory of resolved Parquet partitions as input and outputs a directory of resolved Parquet partitions with common names. | ||
| ```console | ||
| usage: taxonopy common-names [-h] --resolved-dir ANNOTATION_DIR --output-dir OUTPUT_DIR | ||
|
|
||
| options: | ||
| -h, --help show this help message and exit | ||
| --resolved-dir ANNOTATION_DIR | ||
| Directory containing your *.resolved.parquet files | ||
| --output-dir OUTPUT_DIR | ||
| Directory to write annotated .parquet files | ||
| ``` | ||
| Note that the `common-names` command is a post-processing step and should be run after the `resolve` command. | ||
|
|
||
| ### Example Usage | ||
|
|
||
| To perform taxonomic resolution on a dataset with subsequent common name annotation, run: | ||
| ```console | ||
| taxonopy resolve \ | ||
| --input /path/to/formatted/input \ | ||
| --output-dir /path/to/resolved/output | ||
| ``` | ||
| ```console | ||
| taxonopy common-names \ | ||
| --resolved-dir /path/to/resolved/output \ | ||
| --output-dir /path/to/resolved_with_common-names/output | ||
| ``` | ||
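
The two steps above can also be chained from Python so that `common-names` runs only if `resolve` succeeds. This is a minimal sketch: `resolve_then_annotate` is a hypothetical wrapper, the paths are placeholders, and the `run` parameter exists only so the commands can be inspected without executing them.

```python
import subprocess


def resolve_then_annotate(input_dir, resolved_dir, annotated_dir, run=subprocess.run):
    """Run `taxonopy resolve`, then `taxonopy common-names` on its output."""
    commands = [
        ["taxonopy", "resolve",
         "--input", str(input_dir),
         "--output-dir", str(resolved_dir)],
        ["taxonopy", "common-names",
         "--resolved-dir", str(resolved_dir),
         "--output-dir", str(annotated_dir)],
    ]
    for command in commands:
        # check=True raises CalledProcessError and stops at the first failing step.
        run(command, check=True)
    return commands


# Dry run: record the commands instead of executing them.
recorded = []
resolve_then_annotate(
    "/path/to/formatted/input",
    "/path/to/resolved/output",
    "/path/to/annotated/output",
    run=lambda command, check: recorded.append(command),
)
print(recorded)
```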
| ## Documentation | ||
| See https://imageomics.github.io/TaxonoPy for documentation on installation, usage, and more. | ||
|
|
||
| ## Development | ||
| See the [Wiki Development Page](https://github.com/Imageomics/TaxonoPy/wiki/Development) for development instructions. | ||
thompsonmj marked this conversation as resolved.