Statistics in Python #40

@mdhaber

This issue gives an account of the current state of statistics in Python, summarized by AI from interviews with Python developers, data scientists, and data analysts. It is intended as a starting point for continued discussion about statistics in Python. A hand-curated list of potential action items is included as a TL;DR:


Potential sprintable items:

  • Write a "Python vs. R for Statistics" translation table (@stefanv)
  • Create webpage listing libraries and their uses (@stefanv)
  • Compose educational materials, specifically for stats (@stefanv)
  • Design a way for programmers to easily publish individual statistics routines (@stefanv / @mdhaber). The idea is a hybrid between a free-for-all (users publish their own packages) and getting a PR accepted into an existing package (e.g., SciPy, statsmodels, scikit-learn, pingouin). It generalizes something @tupui and I have been discussing in the context of statistical distributions (e.g., see "Expansion Plan", statistical-python/scikit-stats#4, and other issues in that repo).

@stefanv prepared a Perplexity workspace whose custom instructions included the full transcripts of 49 interviews with Python developers, data scientists, and data analysts, which he conducted for a recent grant.

Prompt to Perplexity:

Summarize the state of statistics in the Python ecosystem. What does it have, and what does it need? What is its relationship to other languages?

Response: (reviewed by humans, and it looks quite good)

Core Offerings

  • Python has several major libraries for statistics:
    • scipy.stats for basic statistical tests and distributions
    • statsmodels for econometrics and classical statistics
    • scikit-learn for machine learning and some statistical modeling[^1][^2][^3].
  • There are many smaller, specialized packages, but the ecosystem is dominated by these large, general-purpose libraries[^3].
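As a rough illustration of what "basic statistical tests and distributions" in scipy.stats looks like in practice, here is a minimal sketch. The data are synthetic and the variable names (`group_a`, `group_b`) are made up for this example; they do not come from the interviews.

```python
import numpy as np
from scipy import stats

# Synthetic data, assumed purely for illustration (fixed seed for reproducibility).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=0.5, scale=1.0, size=50)

# A basic statistical test: two-sample t-test on the two groups.
result = stats.ttest_ind(group_a, group_b)
print("t statistic:", result.statistic)
print("p-value:", result.pvalue)

# A distribution object: the 97.5% quantile of the standard normal.
z = stats.norm.ppf(0.975)
print("z:", round(z, 2))
```

The test result is a small object with `statistic` and `pvalue` attributes, while distributions such as `stats.norm` expose methods like `ppf`, `cdf`, and `rvs`.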

Strengths

  • Python is widely used in data science and is increasingly adopted for statistics, especially in education and applied research[^2].
  • Libraries like scikit-learn offer a clean, consistent interface for modeling, which is appreciated for its engineering rigor and integration with the broader Python data stack[^2].
  • The core libraries are generally well-tested and reliable, with high software engineering standards[^3].

Weaknesses and Needs

  • Fragmentation: The ecosystem is fragmented, with different libraries adopting incompatible approaches (e.g., statsmodels vs. scikit-learn), leading to confusion for users and students[^1][^2][^3].
  • User Experience: There is no central landing place or unified entry point for statistics in Python, unlike R's tidyverse or RStudio. This makes it harder for newcomers to get started[^2].
  • Interoperability: Data structures (Pandas, NumPy, etc.) do not always work seamlessly across libraries. Conversions are often required, and function outputs are less predictable than in R's tidyverse pipelines[^2].
  • Teaching Resources: Python lacks the abundance of user-friendly, statistics-focused tutorials and case studies found in the R community[^2].
  • Contributor Barriers: Contributing to core libraries is difficult due to high standards and lack of modularity. Small, specialized packages exist but are less visible and less used than in R[^3].
  • Statistical Methods Coverage: Some advanced or niche statistical methods are missing or hard to find, especially compared to R's vast CRAN repository[^1][^3].
  • Community and Culture: The Python statistics community is less cohesive and less connected than R's, which has a strong identity and established events[^2].

Relationship to Other Languages

  • R: R is still considered the gold standard for statistics, with better branding, a more cohesive ecosystem, and more teaching resources. R's tidyverse and RStudio provide a smoother user experience for statistics[^1][^2].
  • Interoperability: While some users switch between Python and R in their workflow, true interoperability is limited. Most projects use one language at a time, often using R for data manipulation and Python for modeling or vice versa[^2].
  • Other Platforms: Tools like GraphPad Prism are still popular among practicing scientists for basic statistical analyses, indicating that neither Python nor R fully dominates in all applied domains[^3].

Summary Table: Python vs. R for Statistics

| Aspect | Python | R |
| --- | --- | --- |
| Core Libraries | scipy.stats, statsmodels, scikit-learn | base R, tidyverse, many CRAN packages |
| User Experience | Fragmented, less cohesive | Cohesive, tidyverse pipelines, RStudio |
| Teaching Resources | Improving, but less abundant | Extensive, beginner-friendly |
| Community | Large, less connected in statistics | Strong, statistics-focused, welcoming |
| Package Development | High barriers, less modularity | Easy, many small packages, dev tools |
| Interoperability | Needs improvement (data structures, APIs) | Strong within tidyverse, RStudio |
| Branding | Data science/machine learning focus | Statistics-focused |

What Python Needs

  • A more unified, user-friendly interface for statistics, possibly modeled after R's tidyverse[^1][^2].
  • Better interoperability between core data structures and libraries[^2].
  • More accessible teaching resources and case studies[^2].
  • Lower barriers for contributors and more visibility for specialized statistical packages[^3].
  • Stronger community identity and central organization for statistics in Python[^2].

Conclusion

Python's statistics ecosystem is powerful but fragmented, with significant room for improvement in usability, interoperability, teaching resources, and community cohesion. R remains the default for statistics, but Python is gaining ground, especially as data science and machine learning continue to grow. Stronger integration, better documentation, and a more unified vision could help Python become a true peer to R in the statistics domain[^1][^2][^3].

Labels: Data analysis (statistics, optimization, sparse data, plotting, and dataframes)
