# Intercept calculation in Theil-Sen estimator

by EML   Last Updated June 30, 2020 01:19 AM

I've just written some C code for Theil-Sen, after some Googling (I don't have any definitive documentation on it). My understanding of the intercept calculation is that I first calculate the median slope, and then construct a line through every data point with this slope, find the intercept of every line, and then take the median intercept.

The only way I can find to test the code is to compare the results against the Kendall-Theil Robust Line program, from the USGS. On a dataset of 237 points (healthcare data, with a Pearson correlation of ~0.55), we agree exactly on the median slope, but disagree on the intercept (by 1.4%). According to my figures, the KTRL intercept isn't the median intercept, but is instead 46% of the way through the range.

After some digging around in the KTRL code, it appears that they calculate the intercept by creating a single 'median line', rather than the median of all intercepts. Their intercept is medianY - medianX * median slope.

Any feedback on which is the "right" way to do this, if there is one, or how this is handled in R/etc?

Thanks.

Tags :

The Theil-Sen estimator is essentially an estimator for the slope alone; the line has been constructed in a host of different ways - there are a large variety of ways to calculate the intercept.

You said:

My understanding of the intercept calculation is that I first calculate the median slope, and then construct a line through every data point with this slope, find the intercept of every line, and then take the median intercept.

A common one (probably the most common) is to compute median($$y-bx$$). This is what Sen looked at, for example; if I understand your intercept definition correctly this is the same as the intercept you mention.

There are a couple of approaches that compute the intercept of the line through each pair of points and attempts to get some kind of weighted-median but based off that (putting more weight on the points further apart in x-space).

Another is to try to get an estimator with higher efficiency at the normal (akin to that of the slope estimator in typical situations) and similar breakdown point to the slope estimate (there's probably little point in having better breakdown at the expense of efficiency), such as using the Hodges-Lehmann estimator (median of pairwise averages) on $$y-bx$$. This has a kind of symmetry in the way the slopes and intercepts are defined ... and generally gives something very close to the LS line when the normal assumptions nearly hold, whereas the Sen-intercept can be - relatively speaking - quite different.

Some people just compute the mean residual.

There are still other suggestions that have been looked at. There's really no 'one' intercept to go with the slope estimate.

Dietz lists several possibilities, possibly even including all the ones I mentioned, but that's by no means exhaustive.

The Kendall–Theil Robust Line program from the USGS has a companion PDF.

On page 8 (PDF page 15) it states the method used and formula as you found but gives the reference as Conover.

## Intercept

The estimate of the intercept is calculated by use of the Conover (1980) equation

$$b = Y_\text{median} -m\times X_\text{median}\ \ ,\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)$$

where
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ b\ \ \ \ \$ is the estimated intercept,
$\ \ \ \ \ \ \ Y_\text{median}\ \ \ \$ is the median of the response variables,
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ m\ \ \ \$ is the estimated slope,
and
$\ \ \ \ \ \ X_\text{median}\ \ \ \$ is the median of the explanatory variables.

I confirm this does produce the same result as the program. Whether there are superior methods and so on, as always, a matter of opinion and your particular circumstances.

The M-estimation algorithm is arguably erroneous.

for i = 1,   # dat-1 do
for j = i+1, # dat do


Change there is j indexing from i+1 and don't process instances of i == j.

Then either rank (sort) the result choosing index as described, or arguably take the median, which will give a slightly different result. If you plot, the data will look like a CDF plot.