27 changes: 26 additions & 1 deletion docs/development/contributing/index.md
@@ -1,5 +1,30 @@
# Contributing

We welcome contributions to TaxonoPy. More detailed guidance will be added here.
We welcome contributions to TaxonoPy.

---


## Contribution Opportunities

Documented failure cases are valuable inputs for improving TaxonoPy.
Contributions may include:

* documenting additional failure patterns
* proposing secondary tie-breaking heuristics
* extending existing resolution profiles
* adding dataset-specific disambiguation rules

Clear documentation of *why* a resolution fails is often as important as
resolving it.

If you encounter recurring failure modes, consider opening an issue with:

* example UUIDs
* trace output
* GNVerifier results
* proposed resolution logic

---

If you have suggestions or run into a bug, please open an issue at [https://github.com/Imageomics/TaxonoPy/issues](https://github.com/Imageomics/TaxonoPy/issues).
140 changes: 140 additions & 0 deletions docs/development/failure_analysis_workflow/index.md
@@ -0,0 +1,140 @@
# Failure Analysis Workflow

A significant portion of TaxonoPy development involves understanding *why* certain taxonomic resolutions fail and whether those failures are expected, data-driven, or indicative of missing strategy coverage.

This workflow was developed during large-scale resolution of the **EOL dataset**, but applies broadly to other sources.

---

## 1. Identify Failed Resolution Entries

Start by locating entries marked as failed in resolved Parquet outputs.
A common failure status encountered during analysis is:

* `FAILED_FORCED_INPUT`

Example command:

```bash
parquet cat <resolved_parquet_files> \
| grep FAILED_FORCED_INPUT \
| head \
| jq .
```

This step yields candidate UUIDs for deeper inspection.
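
When the UUIDs need to be collected programmatically, the same filter can be sketched in Python. This is a minimal sketch and not part of TaxonoPy: it assumes the records arrive as one JSON object per line (as printed by `parquet cat`) and that each record has `uuid` and `resolution_status` keys. The `uuid` key, the helper name `failed_uuids`, and the sample values are assumptions for illustration.

```python
import json


def failed_uuids(json_lines, status="FAILED_FORCED_INPUT"):
    """Return UUIDs of records whose resolution status equals `status`.

    Assumes one JSON object per line with hypothetical `uuid` and
    `resolution_status` keys; adjust to the actual output schema.
    """
    uuids = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("resolution_status") == status:
            uuids.append(record.get("uuid"))
    return uuids


# Made-up records; SOME_OTHER_STATUS is a placeholder, not a real status.
sample = [
    '{"uuid": "a1", "resolution_status": "FAILED_FORCED_INPUT"}',
    '{"uuid": "b2", "resolution_status": "SOME_OTHER_STATUS"}',
]
print(failed_uuids(sample))
```

Unlike `grep`, this matches the status field exactly, so a record that merely mentions the status string in another field is not picked up.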

---

## 2. Compare Raw Input vs. Final Resolution

For each failed UUID, compare the **raw input taxonomy** with the **final resolved output**.

Typical fields to inspect include:

* `scientific_name`
* the rank fields from `kingdom` through `genus`
* `source_dataset`
* `resolution_status`
* `resolution_strategy`

This comparison often reveals inconsistencies in the input taxonomy (e.g., genus assignments that differ from authoritative sources).
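
The comparison can be sketched as a field-by-field diff of two records. This is an illustrative helper, not TaxonoPy code: the rank column names in `RANKS` are assumed to follow the usual Linnaean ranks, and the taxon names in the example are placeholders.

```python
# Assumed column names for the ranks from kingdom through genus.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]


def diff_taxonomy(raw, resolved, fields=RANKS):
    """Return {field: (raw_value, resolved_value)} for fields that differ."""
    changes = {}
    for field in fields:
        before, after = raw.get(field), resolved.get(field)
        if before != after:
            changes[field] = (before, after)
    return changes


# Placeholder records: only the genus differs between input and output.
raw_entry = {"family": "Aidae", "genus": "Aus"}
resolved_entry = {"family": "Aidae", "genus": "Bus"}
print(diff_taxonomy(raw_entry, resolved_entry))
```

An empty result means the resolved output kept every inspected rank unchanged, which narrows the search to fields outside the rank columns.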

---

## 3. Trace Resolution Decisions

Use the `trace` command to inspect how TaxonoPy attempted to resolve the entry and why it failed.

Example:

```bash
taxonopy --cache-dir <cache_directory> \
trace entry \
--uuid "<UUID>" \
--from-input <source_dataset_directory> \
--verbose
```

The trace output provides:

* grouping information
* query plan (term, rank, source)
* resolution strategies attempted
* explicit failure reasons
* metadata used for match selection
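
When many UUIDs need tracing, the command above can be assembled in Python and handed to `subprocess.run`. The sketch below only builds the argument list from the flags shown in the example; the helper name `build_trace_command` and the sample arguments are hypothetical.

```python
import shlex


def build_trace_command(cache_dir, uuid, input_dir):
    """Build the argv list for the trace invocation shown above."""
    return [
        "taxonopy", "--cache-dir", cache_dir,
        "trace", "entry",
        "--uuid", uuid,
        "--from-input", input_dir,
        "--verbose",
    ]


# Placeholder arguments; substitute real paths and a real UUID.
cmd = build_trace_command("cache", "example-uuid", "input_dir")
print(shlex.join(cmd))  # shell-quoted form, useful for logging
```

Passing the list form directly to `subprocess.run(cmd, capture_output=True, text=True)` avoids shell quoting issues with unusual paths.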

---

## 4. Verify Against External Authorities (GNVerifier)

To determine whether a failure is due to missing data or genuine ambiguity,
independently verify the same taxonomic name using **Global Names Verifier**.

=== "Docker Runtime"

```bash
docker run --rm -i gnames/gnverifier:v1.2.5 \
-j 1 \
--format compact \
--capitalize \
--all_matches \
--sources 11 \
"<scientific_name>" | jq .
```

Runs GNVerifier inside an official container image.
This method requires Docker but does not require local installation
of GNVerifier.

=== "Local CLI Installation"

```bash
gnverifier -j 1 \
--format compact \
--capitalize \
--all_matches \
--sources 11 \
"<scientific_name>" | jq .
```

Runs GNVerifier installed directly on the local system.
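
Once the JSON output is parsed, a quick tally of taxonomic statuses can help separate missing data from ambiguity. The sketch below assumes each verification record holds a `results` list whose items carry a `taxonomicStatus` value; both key names are assumptions that should be checked against the output of the GNVerifier version in use, and the sample record is made up.

```python
from collections import Counter


def summarize_results(verification):
    """Count results by taxonomic status in one verification record.

    The `results` and `taxonomicStatus` keys are assumed field names.
    """
    results = verification.get("results") or []
    return Counter(r.get("taxonomicStatus", "unknown") for r in results)


# Made-up record shaped like the assumed output.
sample_verification = {
    "results": [
        {"taxonomicStatus": "Accepted"},
        {"taxonomicStatus": "Accepted"},
        {"taxonomicStatus": "Synonym"},
    ]
}
print(summarize_results(sample_verification))
```

Under these assumptions, an empty tally points toward missing data in the selected source, while several accepted results for one name point toward ambiguity.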

---

## 5. Common Failure Pattern: Multi-Accepted Match Tie

Across analyzed EOL cases, the most frequent failure pattern observed was:

> **Tie between multiple accepted results with equal taxonomic matches**

These failures are typically produced by the strategy:

* `ExactMatchPrimarySourceMultiAcceptedTaxonomicMatch`

Example failure reason from trace output:

```json
{
"failure_reason": "Tie between N results with equal taxonomic matches"
}
```

---

## 6. Why This Strategy Fails

This strategy is intentionally conservative:

* it prioritizes correctness over forced resolution
* it fails when multiple equally valid “best” matches exist
* it avoids arbitrary selection without clear disambiguation signals

However, analysis shows that many tied matches differ subtly in ways not currently used for secondary discrimination, such as:

* author or publication year suffixes
* infraspecific placeholders (e.g., `spec`)
* rank depth differences
* minor spelling or canonical variations
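
As a starting point for exploring such signals, the differences listed above can be extracted as simple features per tied candidate. This is a hypothetical sketch, not TaxonoPy's resolution logic: it assumes each candidate has a matched name string and a classification path delimited by `|`, and the year and placeholder checks are rough heuristics. The names in the example are placeholders.

```python
import re

# A trailing four-digit year, optionally followed by a closing parenthesis,
# is treated as a hint of an authorship suffix (rough heuristic).
YEAR_SUFFIX = re.compile(r"\b\d{4}\)?\s*$")


def describe_candidate(name, classification_path):
    """Return features that might separate otherwise tied candidates."""
    tokens = name.lower().replace(".", " ").split()
    return {
        "has_year_suffix": bool(YEAR_SUFFIX.search(name)),
        "has_spec_placeholder": "spec" in tokens,
        # Number of non-empty elements in the "|"-delimited path.
        "rank_depth": len([p for p in classification_path.split("|") if p]),
    }


# Placeholder candidates that would tie on canonical name alone.
print(describe_candidate("Aus bus Smith, 1900", "Animalia|Chordata|Aus"))
print(describe_candidate("Aus spec", "Animalia|Aus"))
```

Comparing these feature dictionaries across tied candidates shows which signal, if any, distinguishes them, which is useful evidence when proposing a secondary tie-breaking heuristic.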

---
13 changes: 11 additions & 2 deletions mkdocs.yml
@@ -8,7 +8,10 @@ nav:
- TaxonoPy:
- User guide: index.md
- Quick reference: user-guide/quick-reference.md
- CLI help reference: command_line_usage/help.md
- CLI help reference:
- command_line_usage/help.md
- command_line_usage/tutorial.md

- Installation: user-guide/installation.md
- IO:
- user-guide/io/index.md
@@ -18,6 +21,9 @@ nav:
- Development:
- Contributing:
- development/contributing/index.md

- Failure Analysis Workflow:
- development/failure_analysis_workflow/index.md
- Acknowledgments: acknowledgments.md

theme:
@@ -46,7 +52,7 @@ theme:
- content.code.copy
- content.code.annotate
- content.tooltips

- content.tabs.link
extra_css:
- stylesheets/extra.css

@@ -79,3 +85,6 @@ markdown_extensions:
- pymdownx.details
- pymdownx.highlight
- pymdownx.superfences
- pymdownx.tabbed:
alternate_style: true