Potential refactor for large speed boost #291

@sdtaylor

Description

Hi, I recently had to do some optimizations for zonal stats on a very large number of polygons, and I thought the method might fit well here.
Essentially, instead of iterating through each feature one at a time, I use the rasterio.features.rasterize function to label every pixel of the raster with the id of the feature that covers it. Some array reshaping then produces a sorted masked array, and all stats can be generated in a vectorized fashion. This offers huge performance boosts as feature counts grow (see figure below). As a proof of concept I set up just a few stats (['mean','max','min','sum','std']) to compare against the original zonal method.
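The core of the idea can be sketched with plain NumPy. The arrays below are made-up toy data: the zone-id array stands in for what rasterio.features.rasterize would produce over the raster, and per-feature sums/means then fall out of np.bincount with no per-feature loop:

```python
import numpy as np

# Toy stand-in for the raster band and for the zone-id array that
# rasterio.features.rasterize would produce (0 = no feature).
values = np.array([[1., 2., 3.],
                   [4., 5., 6.],
                   [7., 8., 9.]])
zones = np.array([[1, 1, 2],
                  [1, 2, 2],
                  [0, 0, 2]])

v = values.ravel()
z = zones.ravel()
keep = z > 0                  # drop background pixels
v, z = v[keep], z[keep]

counts = np.bincount(z)[1:]            # pixels per feature id
sums = np.bincount(z, weights=v)[1:]   # per-feature sums
means = sums / counts                  # vectorized 'mean' stat
```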

[Figure: bar chart comparing runtimes of the original and experimental methods as polygon count increases]

Some downsides of this:

  • It creates a copy of the original raster array, so it will be more memory intensive with larger rasters.
  • It loads a larger chunk of the raster at once, also using more memory than processing 1 feature at a time.
  • It won't work with overlapping polygons as is.
  • Rasterizing all features at once means a pixel can only be assigned to a single feature, whereas before a pixel split by 2 polygons could be "shared". I suspect this is the cause of some slight differences in the final stats, but in my comparisons the differences are very tiny.
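For order statistics like min and max, which np.bincount cannot produce, the sorted-array trick mentioned above can be sketched as follows (pure NumPy with illustrative values; np.ufunc.reduceat is one way to reduce over each feature's run of pixels without a Python loop):

```python
import numpy as np

# Flattened pixel values and their feature ids (background already removed).
v = np.array([3., 1., 4., 1., 5., 9., 2., 6.])
z = np.array([1,  1,  2,  2,  2,  1,  2,  1])

order = np.argsort(z, kind='stable')   # group pixels by feature id
vs = v[order]
zs = z[order]

# Index where each feature's run of pixels begins in the sorted arrays.
starts = np.r_[0, np.flatnonzero(np.diff(zs)) + 1]

maxes = np.maximum.reduceat(vs, starts)  # per-feature max
mins = np.minimum.reduceat(vs, starts)   # per-feature min
```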

You can see the code here https://github.com/sdtaylor/python-rasterstats/tree/zonal_refactor_poc
And a testing/comparison script below.
@perrygeo any interest in having this in rasterstats somehow?

Test code

import numpy as np
import rasterstats
import rasterio

from shapely.geometry import Point, MultiPoint, box, MultiPolygon
from shapely.ops import voronoi_diagram

import pandas as pd

from timeit import default_timer as timer

test_tif = './tests/data/slope.tif'


def generate_random_polygons(n, bounds, seed=10):
    """
    Use random points and voronoi diagrams to create
    numerous polygon shapes completely covering everything
    within the  bounds 
    

    Parameters
    ----------
    n : int
        number of polygons.
    bounds : tuple
        (minx, miny, maxx, maxy)
    seed   : numeric
        random seed

    Returns
    -------
    [Polygon]
        List of polygons

    """
    minx, miny, maxx, maxy = bounds
    
    rng = np.random.default_rng(seed)
    
    points_x = rng.uniform(minx, maxx, n)
    points_y = rng.uniform(miny, maxy, n)
    
    bounding_box = box(*bounds)
    point_geom = MultiPoint([Point(x,y) for x,y in zip(points_x, points_y)])
    geoms = [g.intersection(bounding_box) for g in voronoi_diagram(point_geom).geoms]
    return geoms

with rasterio.open(test_tif) as src:
    tif_profile = src.profile
    tif_bounds  = src.bounds

#----------------
# Time both methods
stats = ['mean','max','min','sum','std']

timing_results = []
comparison_results = {s:[] for s in stats}


for n_polygons in [10,50,100,500]:
    print(f'testing with {n_polygons} polygons')
    geoms = generate_random_polygons(n=n_polygons, bounds=tif_bounds, seed=None)
    
    # Drop some geoms so the raster is not 100% covered
    geoms = geoms[5:]
        
    original_start = timer()
    original_method = list(rasterstats.main.gen_zonal_stats(geoms, test_tif, stats=stats, all_touched=False))
    original_end   = timer()
    original_total = round(original_end - original_start,5)
    
    
    exp_start = timer()
    # Consume the generator so the timing is comparable to the original method
    exp_method = list(rasterstats.main.gen_zonal_stats_experment(geoms, test_tif, stats=stats, all_touched=False))
    exp_end  = timer()
    exp_total = round(exp_end - exp_start, 5)
    
    timing_results.append({
        'n_polygons': n_polygons,
        'original_time': original_total,
        'experimental_time': exp_total
    })

    # Compare the difference in final numbers between the original method
    # and new method.
    for o, e in zip(original_method, exp_method):
        for s in stats:
            if o[s] is None or e[s] is None:
                # Feature had no valid pixels in at least one method;
                # count it as zero difference rather than erroring.
                comparison_results[s].append(0)
            else:
                comparison_results[s].append(o[s] - e[s])


pd.DataFrame(timing_results).plot(x='n_polygons', y=['original_time', 'experimental_time'], ylabel='seconds', kind='bar')

for stat, differences in comparison_results.items():
    print(f'{stat} largest abs difference: {np.abs(differences).max()}')
