# How to select nearest neighbors fulfilling a condition?

by Rasmus Mackeprang   Last Updated September 11, 2019 10:26 AM

Question: How do I select "n nearest neighbors that fullfill a condition" for all elements in a geodataframe?

Example: "For all trees in the forest, what are the heights of the two tallest pines within a radius of 100 m?" (Note that "tree" isn't necessarily "pine".)

If I just wanted the nearest neighboring trees for every tree, I could use

``````libpysal.weights.KNN.from_dataframe(df_g, k=2, radius=100)
``````

(given a geodataframe)

I'm looking for a way to get the nearest neighbors that fulfill a condition.

Worked example

This piece of code defines a geodataframe with 9 points:

``````import pandas as pd, libpysal, geopandas as gp,matplotlib.pyplot as plt

# 18 points with values and types
points=['POINT (0.1 0.2)','POINT (-1 0)','POINT (1 0)','POINT (0 -1)','POINT (0 1)','POINT (-2 0)','POINT (2 0)','POINT (0 -2)','POINT (0 2)']
values=[9,8,7,6,5,4,3,2,1]
types=[0,0,0,0,0,1,1,1,1]

df=pd.DataFrame({'points':points,'value':values,'types':types})
``````

I want to look for neighbors within a radius of 2 having type 1.

So, for the central point, I want to look for neighbors among the orange points and not the blue:

If type wasn't an issue, I could loop over nearest neighbors like:

``````knn2 = libpysal.weights.KNN.from_dataframe(gdf, k=2,radius=2)
for index,row in gdf.iterrows(): # Looping over all points
knn_neighbors = knn2.neighbors[index] # Get neighbors
knnsubset = gdf.iloc[knn_neighbors] # Get subdataframe
print("Mean: ",knnsubset['value'].mean()) # Calculating mean of 'value'
``````

For the central point, that will select the two green points as shown:

However, I want to only consider the orange points.

Simple "fix":

I can of course select "enough" neighbors and then filter them afterwards:

``````knn2 = libpysal.weights.KNN.from_dataframe(gdf, k=8,radius=2) # Select enough neighbors
for index,row in gdf.iterrows(): # Looping over all points
knn_neighbors = knn2.neighbors[index] # Get neighbors
knnsubset = gdf.iloc[knn_neighbors] # Get subdataframe
knnsubset=knnsubset[knnsubset.types==1].head(2) #Require type 1 and take the two first
print("Mean: ",knnsubset['value'].mean()) # Calculating mean of 'value'
``````

As shown, it selects the right points. Howver, there are two problems:

• There is no clear way to select "enough" neighbors. Had there been enough blue points in between. I would not have caught the orange points.
• This scales poorly when talking millions of points with wildly varying densities. Selecting 100 neighbors to find 4 incurs a penalty in terms of processing time.

This seems like a problem that someone will have solved at some point. Any pointers? Should I be thinking sklearn?

Tags :

AS mentioned in the comment, you can use DistanceBand instead of KNN for the first filtering (as in KNN you do not actually know how many you have to select to have at least 2 pines).

``````W = libpysal.weights.DistanceBand.from_dataframe(gdf, threshold=2)

results = []  # it is faster to save to list than directly to gdf
for index, row in gdf.iterrows():
neighbors = W.neighbors[index]
subset = gdf.loc[neighbors]  # get all within threshold
limited = subset.loc[subset.types == 1]  # limit to type you want - here you can do more filtering
if len(limited) > 0:  # if something of type 1 is close enough
limited['dist'] = limited.geometry.distance(row.geometry)  #get distances
limited.nsmallest(n=2, columns='dist')
results.append(list(limited.index))  # save indices of selection
else:
results.append(None)

gdf['result'] = results  # save to gdf
``````

This will save indices of points fulfilling the conditions (or None).

I am not sure how it will behave for large datasets, though, but pandas filtering should not be a problem and getting distance to limited df is vectorized.

martinfleis
September 11, 2019 10:26 AM