Multiprocess Spatial Join with Spatial Index

by rolo   Last Updated August 14, 2019 15:22 PM

I'm a newbie at implementing GIS in Python, but the lack of any GIS software at the office has forced me to learn it :).

I have a huge grid that I'm currently filling with counts of points using a spatial index. On a subset of the grid the code works just fine, but on anything bigger it takes forever.

This is the code that I'm currently using:

import numpy as np
import pandas as pd
from IPython.display import clear_output
from shapely.geometry import Polygon

# build the r-tree index
sindex = points.sindex

# the points that intersect each grid cell will be collected here
points_within_grid = pd.DataFrame()

# grid-level dataset
grid_data = pd.DataFrame()

i=1
for poly in grid:
    clear_output(wait=True)

    # buffer by the <1 micron dist to account for any space lost in the quadrat cutting
    # otherwise may miss point(s) that lay directly on quadrat line
    poly = poly.buffer(1e-14).buffer(0)

    # find approximate matches with r-tree, then precise matches from those approximate ones
    possible_matches_index = list(sindex.intersection(poly.bounds))
    possible_matches = points.iloc[possible_matches_index]
    precise_matches = possible_matches[possible_matches.intersects(poly)]

    # count number of firms within polygon
    x = pd.DataFrame({'grid_index': i,
                      'geometry': [Polygon(poly.envelope)],
                      'n_firms': len(precise_matches)})

    for poi in range(1, 9):
        x['sic_' + str(poi)] = len(precise_matches[precise_matches.sic1 == poi])

    grid_data = grid_data.append(x)

    # points within the grid
    points_within_grid = points_within_grid.append(precise_matches)
    print("Current Progress: ", np.round(i/len(grid)*100,0),"%")

    i=i+1


# drop duplicate points, if buffered poly caused an overlap on point(s) that lay directly on a quadrat line
points_within_grid = points_within_grid.drop_duplicates(subset=['x', 'y','poi_id'])
points_outside_grid = points[~points.poi_id.isin(points_within_grid.poi_id)]
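To make the two final filtering steps concrete, here is a tiny self-contained illustration of the same `drop_duplicates` + `isin` anti-join on a toy DataFrame (the `poi_id` values are made up, not real data):

```python
import pandas as pd

# toy stand-ins for `points` and `points_within_grid`
points = pd.DataFrame({'poi_id': [1, 2, 3, 4],
                       'x': [0, 0, 1, 1],
                       'y': [0, 1, 0, 1]})
inside = pd.DataFrame({'poi_id': [1, 2, 2],
                       'x': [0, 0, 0],
                       'y': [0, 1, 1]})

# drop the duplicate row a buffered polygon overlap would create
inside = inside.drop_duplicates(subset=['x', 'y', 'poi_id'])

# anti-join: keep only points whose poi_id never appeared inside the grid
outside = points[~points.poi_id.isin(inside.poi_id)]
print(sorted(outside.poi_id))  # [3, 4]
```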

So, I've been thinking about using multiprocessing.Pool to make it faster. I've been reading online but don't really know how to apply it here. The first step would be to define a function that can be passed to the Pool.

So far I have this function:

def grid_inter(poly):
    # buffer by the <1 micron dist to account for any space lost in the quadrat cutting
    # otherwise may miss point(s) that lay directly on quadrat line
    poly = poly.buffer(1e-14).buffer(0)

    # find approximate matches with r-tree, then precise matches from those approximate ones
    possible_matches_index = list(sindex.intersection(poly.bounds))
    possible_matches = points.iloc[possible_matches_index]
    precise_matches = possible_matches[possible_matches.intersects(poly)]

    # count number of firms within polygon; the loop counter i does not exist
    # inside a worker process, and Pool.map preserves input order anyway,
    # so the grid index can be assigned after the results come back
    x = pd.DataFrame({'geometry': [Polygon(poly.envelope)],
                      'n_firms': len(precise_matches)})

    for poi in range(1, 9):
        x['sic_' + str(poi)] = len(precise_matches[precise_matches.sic1 == poi])
    return x

But I'm stuck on the next step, which I think should be something like this:

import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())
results = pool.map(grid_inter, list(grid))
pool.close()
pool.join()

# Pool.map returns one DataFrame per grid cell, in input order
grid_data = pd.concat(results, ignore_index=True)
grid_data['grid_index'] = grid_data.index + 1

Any help to make it work would be super appreciated!

Thanks and have a great day!


