How to choose interpolation points to reduce maximum error for inverse CDF lookup


QUESTION: How do I pick interpolation points that keep the maximum error for any point in each interpolated segment within a specified bound?

The goal is to shape a random distribution according to Zipf's law using inverse transform sampling. I have found a good approximation of the Zipf normalization factor here:

https://arxiv.org/abs/1511.01480 ("Approximation of the truncated Zeta distribution and Zipf's law" by Maurizio Naldi)

What I am now doing is creating a spline of the CDF values versus rank and trying to set an error bound such that for no rank is the interpolated value off by more than a certain percentage of its value (relative error). If I have too many spline points, that wastes memory. If I have too few points, that increases error. I am trying to find the optimal placement of spline points so as to balance these two.
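For context, the lookup being approximated can be sketched as follows. This is a minimal Python sketch using the full CDF table as the ground truth; the point of the question is to replace that table with a compact spline. The names `zipf_cdf_table` and `sample_rank` are illustrative, not from my actual code.

```python
import bisect

def zipf_cdf_table(n, s=1.0):
    """Exact truncated-Zipf CDF for ranks 1..n (the reference table
    that the spline is meant to replace)."""
    weights = [1.0 / k**s for k in range(1, n + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w
        cdf.append(acc / total)
    return cdf

def sample_rank(cdf, u):
    """Inverse transform sampling: smallest rank whose CDF >= u."""
    return bisect.bisect_left(cdf, u) + 1
```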

My current approach (that fails) is to loop over all ranks and look at the differences in PDF values between successive ranks. This difference is a proxy for the first derivative. I keep track of the MIN and MAX values for the difference, and when they differ by too much in proportion to the CDF, I assume that the curve has bent too much and I add a new interpolation point and start tracking differences all over. This algorithm produces too few interpolation points, because the error builds up faster than I expected.

I have another idea: maintain two pointers, one at the end of the growing segment and one at the midpoint that advances half as fast. When the midpoint's error becomes too large, add a new interpolation point. However, I am not sure that the worst error will necessarily be in the middle of the range.

I could store the CDF values for all ranks (using LOTS of memory - my current application has a maximum rank of half a million). Then I could exhaustively test every interpolated value for a proposed segment and stop when the error gets too big. I could overshoot and do a binary search to find the optimum segment lengths - faster, but still requiring lots of memory. I am hoping for a single- or double-pass online algorithm with modest memory overhead.
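The overshoot-then-binary-search idea could be sketched like this. It is a hypothetical sketch, parameterized over the fitting function; the linear fit below is only a stand-in for whatever per-segment model is used.

```python
def longest_segment(cdf, start, rel_tol, fit, evaluate):
    """Find the longest end index such that the interpolant over
    [start, end] stays within rel_tol relative error at every rank."""
    def ok(end):
        model = fit(cdf, start, end)
        return all(abs(evaluate(model, k) - cdf[k]) <= rel_tol * cdf[k]
                   for k in range(start, end + 1))

    lo, hi = start + 1, len(cdf) - 1
    step = 1
    while lo + step <= hi and ok(lo + step):   # exponential overshoot
        step *= 2
    hi = min(hi, lo + step)
    best = lo          # a two-point segment is exact at its endpoints
    while best < hi:   # binary search back to the longest passing end
        mid = (best + hi + 1) // 2
        if ok(mid):
            best = mid
        else:
            hi = mid - 1
    return best

# Stand-in linear fit through the segment endpoints.
def linear_fit(cdf, start, end):
    return start, cdf[start], (cdf[end] - cdf[start]) / (end - start)

def linear_eval(model, k):
    s, y0, slope = model
    return y0 + slope * (k - s)
```

This still touches every rank in each candidate segment during `ok`, so it needs the full CDF table; it only reduces the number of segment lengths tried, not the memory requirement.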

By the way, for each spline segment, I have tried the following interpolating functions:

  1. Linear - poor results.

  2. Quadratic - better, but still poor.

  3. Hyperbolic - works on segments twice as long as Quadratic, but still not good enough. (If I find a better way to pick interpolation points, this may be good enough.)

NOTE: Hyperbolic works better than expected because, whereas a Gaussian distribution's CDF is an S-curve (as is the CDF of most other common distributions), the Zipf PDF has its peak at the beginning, so its CDF looks like a hyperbola with a horizontal asymptote at one.
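To make the hyperbolic idea concrete: one simple form with a horizontal asymptote at one is y = 1 - a/(x + b), which can be fitted exactly through a segment's two endpoints. This is a sketch under that assumed form; the post does not specify the exact formula used.

```python
def fit_hyperbola(x1, y1, x2, y2):
    """Fit y = 1 - a/(x + b) through two points; assumes y1, y2 < 1
    (the Zipf CDF approaches its asymptote at 1 from below)."""
    r1, r2 = 1.0 - y1, 1.0 - y2
    b = (r2 * x2 - r1 * x1) / (r1 - r2)
    a = r1 * (x1 + b)
    return a, b

def eval_hyperbola(a, b, x):
    return 1.0 - a / (x + b)
```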

UPDATE:

I adopted one of my ideas. I keep track of the midpoint of the spline segment with a second pointer as the segment grows, and when the error at the midpoint grows too large, I end the segment and start another. As expected, the midpoint is not the place with the greatest error, so I divide my error bound by seven, and if the midpoint error exceeds that, I start a new segment.
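A sketch of this grow-until-midpoint-error scheme (hypothetical names; a linear fit stands in for the hyperbolic one, and the safety divisor of seven is exposed as a parameter):

```python
def place_knots(cdf, rel_tol, fit, evaluate, safety=7.0):
    """Single-pass knot placement: grow the current segment, test the
    interpolant only at the segment's midpoint, and close the segment
    when the midpoint's relative error exceeds rel_tol / safety.
    Dividing by `safety` compensates for the worst error not
    necessarily lying at the midpoint."""
    knots, start = [0], 0
    for end in range(2, len(cdf)):
        if end < start + 2:          # need at least one interior point
            continue
        mid = (start + end) // 2
        model = fit(cdf, start, end)
        if abs(evaluate(model, mid) - cdf[mid]) > (rel_tol / safety) * cdf[mid]:
            knots.append(end - 1)    # close the segment one rank back
            start = end - 1
    if knots[-1] != len(cdf) - 1:
        knots.append(len(cdf) - 1)   # final knot always at the last rank
    return knots

# Stand-in linear fit (the post fits a hyperbola per segment instead).
def linear_fit(cdf, start, end):
    return start, cdf[start], (cdf[end] - cdf[start]) / (end - start)

def linear_eval(model, k):
    s, y0, slope = model
    return y0 + slope * (k - s)
```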

The above was not sufficient. I discovered that since the last segment is guaranteed to end at CDF = 1, the last segment actually reaches the asymptote. My hyperbola curve-fitting formula does not expect the asymptote to be reached until the rank reaches infinity, so this introduces an error. Thus, for the last several segments near CDF = 1, I fall back to using a parabolic fit.

The above approach yields good results for CDF error, but sometimes leads to poor results on round-trip calculations. If I ask for the CDF for a given rank, then use that CDF in an inverse lookup to find the rank, the resulting rank does not always match the original rank. I am still working out how to correct for this round-trip error.

ANSWER:
I solved the problem of the poor round trip agreement. The rank to CDF function has a horizontal asymptote at 1. The inverse function has a vertical asymptote at N. I had the wrong interpolation function!
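This also suggests why a hyperbola of the form y = 1 - a/(x + b) round-trips cleanly away from the asymptote: it inverts in closed form. A sketch under the assumption of that particular form, which the post does not spell out:

```python
def hyperbola_eval(a, b, x):
    """Forward direction: rank -> CDF, horizontal asymptote at 1."""
    return 1.0 - a / (x + b)

def hyperbola_inverse(a, b, y):
    """Closed-form inverse (CDF -> rank), valid only for y < 1."""
    return a / (1.0 - y) - b
```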

I found that lengthening segments until the midpoint error grows too large was a good solution, in concert with fitting each segment to a hyperbola, except for the last segment (or last few segments), which I fit to a quadratic using Lagrange quadratic interpolation polynomials.
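The quadratic fallback can be written directly from the Lagrange basis. A minimal sketch, assuming three (rank, CDF) points per segment:

```python
def lagrange_quadratic(p0, p1, p2):
    """Return the degree-2 Lagrange interpolating polynomial through
    three (x, y) points; useful for the final segment(s), where a
    hyperbola with an asymptote at 1 misbehaves."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    def f(x):
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
              + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
              + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))
    return f
```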

For N = 10000, I need about 200 interpolation points to keep the error very low. But if I am just using the Zipf distribution for testing, thirty is sufficient.