Potential refactor for large speed boost #291

@sdtaylor

Description

Hi, I recently had to do some optimizations for zonal stats on a very large number of polygons, and I thought the method might fit well here.
Essentially, instead of iterating through each feature one at a time, I use the rasterio.features.rasterize function to label every pixel of the raster with the id of the feature that covers it. Some array reshaping then produces a sorted masked array, and all stats can be generated in a vectorized fashion. This offers huge performance boosts as feature counts grow (see figure below). As a proof of concept I set up just a few stats (['mean','max','min','sum','std']) to compare against the original zonal method.
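The core of the idea can be sketched with plain NumPy. The arrays below are made-up toy data: the zone-id array stands in for what rasterio.features.rasterize would produce over the raster, and per-feature sums/means then fall out of np.bincount with no per-feature loop:

```python
import numpy as np

# Toy stand-in for the raster band and for the zone-id array that
# rasterio.features.rasterize would produce (0 = no feature).
values = np.array([[1., 2., 3.],
                   [4., 5., 6.],
                   [7., 8., 9.]])
zones = np.array([[1, 1, 2],
                  [1, 2, 2],
                  [0, 0, 2]])

v = values.ravel()
z = zones.ravel()
keep = z > 0                  # drop background pixels
v, z = v[keep], z[keep]

counts = np.bincount(z)[1:]            # pixels per feature id
sums = np.bincount(z, weights=v)[1:]   # per-feature sums
means = sums / counts                  # vectorized 'mean' stat
```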

[Figure: bar chart comparing runtimes of the original and experimental methods as polygon count increases]

Some downsides of this:

  • It creates a copy of the original raster array, so it will be more memory intensive with larger rasters.
  • It loads a larger chunk of the raster at once, also using more memory than processing 1 feature at a time.
  • It won't work with overlapping polygons as is.
  • Rasterizing all features at once means a pixel can only be assigned to a single feature, whereas before a pixel split by 2 polygons could be "shared". I suspect this is the cause of some slight differences in the final stats, but in my comparisons the differences are very tiny.
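For order statistics like min and max, which np.bincount cannot produce, the sorted-array trick mentioned above can be sketched as follows (pure NumPy with illustrative values; np.ufunc.reduceat is one way to reduce over each feature's run of pixels without a Python loop):

```python
import numpy as np

# Flattened pixel values and their feature ids (background already removed).
v = np.array([3., 1., 4., 1., 5., 9., 2., 6.])
z = np.array([1,  1,  2,  2,  2,  1,  2,  1])

order = np.argsort(z, kind='stable')   # group pixels by feature id
vs = v[order]
zs = z[order]

# Index where each feature's run of pixels begins in the sorted arrays.
starts = np.r_[0, np.flatnonzero(np.diff(zs)) + 1]

maxes = np.maximum.reduceat(vs, starts)  # per-feature max
mins = np.minimum.reduceat(vs, starts)   # per-feature min
```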

You can see the code here https://github.com/sdtaylor/python-rasterstats/tree/zonal_refactor_poc
And a testing/comparison script below.
@perrygeo any interest in having this in rasterstats somehow?

Test code

import numpy as np
import rasterstats
import rasterio

from shapely.geometry import Point, MultiPoint, box, MultiPolygon
from shapely.ops import voronoi_diagram

import pandas as pd

from timeit import default_timer as timer

test_tif = './tests/data/slope.tif'


def generate_random_polygons(n, bounds, seed=10):
    """
    Use random points and voronoi diagrams to create
    numerous polygon shapes completely covering everything
    within the  bounds 
    

    Parameters
    ----------
    n : int
        number of polygons.
    bounds : tuple
        (minx, miny, maxx, maxy)
    seed   : numeric
        random seed

    Returns
    -------
    [Polygon]
        List of polygons

    """
    minx, miny, maxx, maxy = bounds
    
    rng = np.random.default_rng(seed)
    
    points_x = rng.uniform(minx, maxx, n)
    points_y = rng.uniform(miny, maxy, n)
    
    bounding_box = box(*bounds)
    point_geom = MultiPoint([Point(x,y) for x,y in zip(points_x, points_y)])
    geoms = [g.intersection(bounding_box) for g in voronoi_diagram(point_geom).geoms]
    return geoms

with rasterio.open(test_tif) as src:
    tif_profile = src.profile
    tif_bounds  = src.bounds

#----------------
# Time both methods
stats = ['mean','max','min','sum','std']

timing_results = []
comparison_results = {s:[] for s in stats}


for n_polygons in [10,50,100,500]:
    print(f'testing with {n_polygons} polygons')
    geoms = generate_random_polygons(n=n_polygons, bounds=tif_bounds, seed=None)
    
    # Drop some geoms so the raster is not 100% covered
    geoms = geoms[5:]
        
    original_start = timer()
    original_method = list(rasterstats.main.gen_zonal_stats(geoms, test_tif, stats=stats, all_touched=False))
    original_end   = timer()
    original_total = round(original_end - original_start,5)
    
    
    exp_start = timer()
    # Consume the generator so the timing is comparable to the original method
    exp_method = list(rasterstats.main.gen_zonal_stats_experment(geoms, test_tif, stats=stats, all_touched=False))
    exp_end  = timer()
    exp_total = round(exp_end - exp_start, 5)
    
    timing_results.append({
        'n_polygons': n_polygons,
        'original_time': original_total,
        'experimental_time': exp_total
    })

    # Compare the difference in final numbers between the original method
    # and new method.
    for o, e in zip(original_method, exp_method):
        for s in stats:
            if o[s] is None or e[s] is None:
                # Feature had no valid pixels in at least one method;
                # count it as zero difference rather than erroring.
                comparison_results[s].append(0)
            else:
                comparison_results[s].append(o[s] - e[s])


pd.DataFrame(timing_results).plot(x='n_polygons', y=['original_time', 'experimental_time'], ylabel='seconds', kind='bar')

for stat, differences in comparison_results.items():
    print(f'{stat} largest abs difference: {np.abs(differences).max()}')
