Statistics in Python

This issue gives an account of the current state of statistics in Python, summarized by AI from interviews with Python developers, data scientists, and data analysts. It is intended as a starting point for continued discussion about statistics in Python. A hand-curated list of potential action items is included as a TL;DR:

---
Potential sprintable items:
- Write a "Python vs. R for Statistics" translation table (@stefanv)
- Create webpage listing libraries and their uses (@stefanv)
- Compose educational materials, specifically for stats (@stefanv)
- Design a way for programmers to easily publish statistics routines individually (@stefanv / @mdhaber). The idea is to create a hybrid between a free-for-all (users publish their own package) and getting a PR accepted into an existing package (e.g. SciPy, statsmodels, scikit-learn, pingouin). It's a generalization of something @tupui and I have been discussing in the context of statistical distributions (e.g. see https://github.com/scikit-stats/scikit-stats/issues/4 and other issues at that repo)

---

@stefanv prepared a [Perplexity](https://www.perplexity.ai/) workspace with custom instructions that included the full transcripts of 49 interviews with Python developers, data scientists, and data analysts that he conducted for a recent grant.

Prompt to Perplexity: 
> Summarize the state of statistics in the Python ecosystem. What does it have, and what does it need? What is its relationship to other languages?

Response: (reviewed by humans, and it looks quite good)

**Core Offerings**

- Python has several major libraries for statistics:
    - *scipy.stats* for basic statistical tests and distributions
    - *statsmodels* for econometrics and classical statistics
    - *scikit-learn* for machine learning and some statistical modeling[^1][^2][^3].
- There are many smaller, specialized packages, but the ecosystem is dominated by these large, general-purpose libraries[^3].
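As a rough illustration of what "basic statistical tests and distributions" looks like in practice, here is a minimal sketch using `scipy.stats`. The two samples are synthetic and purely hypothetical. One detail relevant to anyone translating from R: R's `t.test` defaults to Welch's unequal-variance test, while `scipy.stats.ttest_ind` defaults to `equal_var=True`, so the flag is set explicitly here.

```python
import numpy as np
from scipy import stats

# Synthetic, hypothetical samples (not data from the interviews).
rng = np.random.default_rng(seed=0)
a = rng.normal(loc=0.0, scale=1.0, size=50)
b = rng.normal(loc=0.5, scale=1.0, size=50)

# Welch's t-test. equal_var=False matches the default behavior of R's t.test(a, b).
result = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")

# Distributions are objects exposing pdf/cdf/ppf/rvs and related methods.
std_normal = stats.norm(loc=0.0, scale=1.0)
print(std_normal.cdf(1.96))   # close to 0.975
print(std_normal.ppf(0.975))  # close to 1.96
```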

**Strengths**

- Python is widely used in data science and is increasingly adopted for statistics, especially in education and applied research[^2].
- Libraries like scikit-learn offer a clean, consistent interface for modeling, which is appreciated for its engineering rigor and integration with the broader Python data stack[^2].
- The core libraries are generally well-tested and reliable, with high software engineering standards[^3].

**Weaknesses and Needs**

- **Fragmentation**: The ecosystem is fragmented, with different libraries adopting incompatible approaches (e.g., statsmodels vs. scikit-learn), leading to confusion for users and students[^1][^2][^3].
- **User Experience**: There is no central landing place or unified entry point for statistics in Python, unlike R's tidyverse or RStudio. This makes it harder for newcomers to get started[^2].
- **Interoperability**: Data structures (Pandas, NumPy, etc.) do not always work seamlessly across libraries. Conversions are often required, and the types that functions return are less predictable than in R's tidyverse pipelines[^2].
- **Teaching Resources**: Python lacks the abundance of user-friendly, statistics-focused tutorials and case studies found in the R community[^2].
- **Contributor Barriers**: Contributing to core libraries is difficult due to high standards and lack of modularity. Small, specialized packages exist but are less visible and less used than in R[^3].
- **Statistical Methods Coverage**: Some advanced or niche statistical methods are missing or hard to find, especially compared to R's vast CRAN repository[^1][^3].
- **Community and Culture**: The Python statistics community is less cohesive and less connected than R's, which has a strong identity and established events[^2].


## Relationship to Other Languages

- **R**: R is still considered the gold standard for statistics, with better branding, a more cohesive ecosystem, and more teaching resources. R's tidyverse and RStudio provide a smoother user experience for statistics[^1][^2].
- **Interoperability**: While some users switch between Python and R in their workflow, true interoperability is limited. Most projects use only one language at any given stage, for example R for data manipulation and Python for modeling, or vice versa[^2].
- **Other Platforms**: Tools like GraphPad Prism are still popular among practicing scientists for basic statistical analyses, indicating that neither Python nor R fully dominates in all applied domains[^3].


## Summary Table: Python vs. R for Statistics

| Aspect | Python | R |
| :-- | :-- | :-- |
| Core Libraries | scipy.stats, statsmodels, scikit-learn | base R, tidyverse, many CRAN packages |
| User Experience | Fragmented, less cohesive | Cohesive, tidyverse pipelines, RStudio |
| Teaching Resources | Improving, but less abundant | Extensive, beginner-friendly |
| Community | Large, less connected in statistics | Strong, statistics-focused, welcoming |
| Package Development | High barriers, less modularity | Easy, many small packages, dev tools |
| Interoperability | Needs improvement (data structures, APIs) | Strong within tidyverse, RStudio |
| Branding | Data science/machine learning focus | Statistics-focused |

## What Python Needs

- A more unified, user-friendly interface for statistics, possibly modeled after R's tidyverse[^1][^2].
- Better interoperability between core data structures and libraries[^2].
- More accessible teaching resources and case studies[^2].
- Lower barriers for contributors and more visibility for specialized statistical packages[^3].
- Stronger community identity and central organization for statistics in Python[^2].


## Conclusion

Python's statistics ecosystem is powerful but fragmented, with significant room for improvement in usability, interoperability, teaching resources, and community cohesion. R remains the default for statistics, but Python is gaining ground, especially as data science and machine learning continue to grow. Stronger integration, better documentation, and a more unified vision could help Python become a true peer to R in the statistics domain[^1][^2][^3].
