How to select nearest neighbors fulfilling a condition?

by Rasmus Mackeprang   Last Updated September 11, 2019 10:26 AM

Question: How do I select "n nearest neighbors that fullfill a condition" for all elements in a geodataframe?

Example: "For all trees in the forest, what are the heights of the two tallest pines within a radius of 100 m?" (Note that "tree" isn't necessarily "pine".)

If I just wanted the nearest neighboring trees for every tree, I could use

libpysal.weights.KNN.from_dataframe(df_g, k=2, radius=100)

(given a geodataframe)

I'm looking for a way to get the nearest neighbors that fulfill a condition.

Worked example

This piece of code defines a geodataframe with 9 points:

import pandas as pd, libpysal, geopandas as gp,matplotlib.pyplot as plt
from shapely.wkt import loads

# 18 points with values and types
points=['POINT (0.1 0.2)','POINT (-1 0)','POINT (1 0)','POINT (0 -1)','POINT (0 1)','POINT (-2 0)','POINT (2 0)','POINT (0 -2)','POINT (0 2)']

gdf=gp.GeoDataFrame(df,geometry=[loads(x) for x in df.points])

I want to look for neighbors within a radius of 2 having type 1.

So, for the central point, I want to look for neighbors among the orange points and not the blue:

Points with types

If type wasn't an issue, I could loop over nearest neighbors like:

knn2 = libpysal.weights.KNN.from_dataframe(gdf, k=2,radius=2)
for index,row in gdf.iterrows(): # Looping over all points
    knn_neighbors = knn2.neighbors[index] # Get neighbors
    knnsubset = gdf.iloc[knn_neighbors] # Get subdataframe
    print("Mean: ",knnsubset['value'].mean()) # Calculating mean of 'value'

For the central point, that will select the two green points as shown:

Selection without type

However, I want to only consider the orange points.

Simple "fix":

I can of course select "enough" neighbors and then filter them afterwards:

knn2 = libpysal.weights.KNN.from_dataframe(gdf, k=8,radius=2) # Select enough neighbors
for index,row in gdf.iterrows(): # Looping over all points
    knn_neighbors = knn2.neighbors[index] # Get neighbors
    knnsubset = gdf.iloc[knn_neighbors] # Get subdataframe
    knnsubset=knnsubset[knnsubset.types==1].head(2) #Require type 1 and take the two first
    print("Mean: ",knnsubset['value'].mean()) # Calculating mean of 'value'

Simple fix

As shown, it selects the right points. Howver, there are two problems:

  • There is no clear way to select "enough" neighbors. Had there been enough blue points in between. I would not have caught the orange points.
  • This scales poorly when talking millions of points with wildly varying densities. Selecting 100 neighbors to find 4 incurs a penalty in terms of processing time.

This seems like a problem that someone will have solved at some point. Any pointers? Should I be thinking sklearn?

Answers 1

AS mentioned in the comment, you can use DistanceBand instead of KNN for the first filtering (as in KNN you do not actually know how many you have to select to have at least 2 pines).

Starting from your gdf above:

W = libpysal.weights.DistanceBand.from_dataframe(gdf, threshold=2)

results = []  # it is faster to save to list than directly to gdf
for index, row in gdf.iterrows():
    neighbors = W.neighbors[index]
    subset = gdf.loc[neighbors]  # get all within threshold
    limited = subset.loc[subset.types == 1]  # limit to type you want - here you can do more filtering
    if len(limited) > 0:  # if something of type 1 is close enough
        limited['dist'] = limited.geometry.distance(row.geometry)  #get distances
        limited.nsmallest(n=2, columns='dist')
        results.append(list(limited.index))  # save indices of selection

gdf['result'] = results  # save to gdf

This will save indices of points fulfilling the conditions (or None).

I am not sure how it will behave for large datasets, though, but pandas filtering should not be a problem and getting distance to limited df is vectorized.

September 11, 2019 10:26 AM

Related Questions

computing haversine over two lists

Updated February 19, 2019 00:26 AM

Geopandas Rotate function

Updated January 18, 2019 04:26 AM

Sorting a Dataframe based on specified indices

Updated September 01, 2019 12:26 PM