Segmentation with piecewise linear regression


I want to segment a sequence into n subsequences (n known), such that the points within each subsequence can be fitted by a piecewise linear function (minimizing the fitting error within each subsequence and over the whole sequence). I have tried the ruptures package with the Binseg algorithm, which lets me specify the number of segments, and I have also used numpy/scipy to fit piecewise linear functions directly. I then realized I need to apply weights to my points, otherwise the results are poor. How could I adapt either solution, or is there another approach that directly accepts an array of weights?
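For what it's worth, one way to get weights into either approach is to define a weighted least-squares cost per segment yourself. Below is a minimal NumPy-only sketch; the weighting convention (sum of w_i · residual_i²) is an assumption about how the weights should apply, and `weighted_segment_cost` is a hypothetical helper, not a ruptures API:

```python
import numpy as np


def weighted_segment_cost(y: np.ndarray, weights: np.ndarray,
                          start: int, end: int) -> float:
    """Weighted SSE of a straight-line fit to y[start:end].

    np.polyfit minimizes sum((w_i * residual_i)**2), so passing
    sqrt(weights) yields the usual weighted least-squares objective
    sum(weights_i * residual_i**2).
    """
    x = np.arange(start, end)
    seg = y[start:end]
    coeffs = np.polyfit(x, seg, deg=1, w=np.sqrt(weights[start:end]))
    residuals = seg - np.polyval(coeffs, x)
    return float(np.sum(weights[start:end] * residuals**2))


y = np.array([0., 1., 2., 10., 11., 12.])
w = np.ones_like(y)
print(weighted_segment_cost(y, w, 0, 3))  # ~0: the first half is exactly linear
print(weighted_segment_cost(y, w, 0, 6))  # large: one line fits the jump poorly
```

A cost of this shape can then be minimized over candidate breakpoints, which is essentially what ruptures does internally with its built-in costs.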

Edit for more context:

  • The curve is usually flat, steepening, flattening, concave or convex, or a mix of these, and consists of at most 40 points (i.e. the maximum sequence length is 40)
  • There can be 1 or 2 outliers (usually the first point, but not necessarily)
  • I am aiming for n between 4 and 8 (i.e. between 4 and 8 subsequences, set as a parameter).
  • My sequence is at most 40 points, and it is processed in a loop with ~20 iterations. The whole loop must take less than 30 seconds, so at most ~1.5 s per sequence.

Here is an example of the data:

df = pd.DataFrame({
    'Term Dummy':[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32],
    'Shock':[131.759276601612,-5.28111953539055,-5.30412333137685,6.19553924065018,-5.97658803726517,-7.8325986545673,-9.50784210778306,-15.7385664305344,-23.3182508381464,-29.4897840376819,-31.467551725682,-33.4723203935889,-34.6650947285782,-35.7471724234754,-36.4799776375108,-37.3264043303424,-37.4155331344124,-37.8155991350952,-38.7550833588797,-38.3608088160098,-36.7211814243519,-35.7477615422699,-34.1458248652337,-32.8287847811565,-31.4018236645802,-29.9742754473972,-28.6193854123123,-24.90985538625,-21.3217573325541,-18.7350606702909,-16.0799516664911,-16.1549305201347,-16.1433168994669],
    'Weight':[1924,41170,120247,289092,311692,50265,127579,38255,225164,300420,96928,189792,177827,511969,417120,17840,72257,160679,89074,186051,102120,53770,662958,100838,765414,820977,533239,113092,60063,174082,238152,215960,115665]
})

Shock is the sequence that needs to be reduced to n subsequences. For example, a solution with n=6 could be (grouping the index/Term dummy):

[
[0],
[1,2],
[3],
[4,5,6,7,8,9,10,11,12,13,14,15,16,17,18],
[19,20,21,22,23,24,25,26,27,28,29],
[30,31,32]
]
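For clarity, that grouping corresponds to the knot indices of a piecewise-linear fit: each segment's first index, plus the final index (the same n+1-knot convention the answer below returns). A trivial conversion:

```python
groups = [
    [0],
    [1, 2],
    [3],
    list(range(4, 19)),
    list(range(19, 30)),
    [30, 31, 32],
]

# Segments are contiguous, so the knots are each segment's first index
# plus the very last index of the sequence.
knots = [g[0] for g in groups] + [groups[-1][-1]]
print(knots)  # [0, 1, 3, 4, 19, 30, 32]
```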

[Figure: Shocks and an example segmentation]

The plot below shows 3 Shock curves on top, with the Sensitivity (labelled Delta in the plot) and the Impact below. Weight is the absolute value of Sensitivity.

[Figure: 3 Shock curves (top); Sensitivity and Impact (bottom)]


1 Answer

Answered by Reinderien

This question is all over the place. I'm fairly convinced that it's an XY problem and you should actually be doing something else; but here is what you originally asked for:

import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import minimize, LinearConstraint
import scipy.sparse


def piecewise(y: np.ndarray, n: int) -> np.ndarray:
    """
    :param y: Length-m array to approximate
    :param n: Number of segments
    :return: (n+1) segment bound indices; will always include 0 and m-1
    """
    m = y.size
    x_inferred = np.arange(m)

    def cost(x: np.ndarray) -> float:
        # Candidate interior knots, with the endpoints pinned at 0 and m-1
        xshort = (0, *x, m-1)
        # Sample knot heights from the data, then evaluate the
        # piecewise-linear interpolant through them at every point
        yshort = np.interp(x=xshort, xp=x_inferred, fp=y)
        y_long = np.interp(x=x_inferred, xp=xshort, fp=yshort)
        error = y_long - y
        return error.dot(error)  # sum of squared errors

    # Interior knots start out evenly spaced
    x0 = np.linspace(start=m/n, stop=m*(n - 1)/n, num=n - 1)
    bounds = np.stack((
        np.arange(1, n),
        np.arange(m - n, m - 1),
    ), axis=1)
    # Keep the knots strictly increasing (at least one index apart)
    increasing = LinearConstraint(
        A=scipy.sparse.eye(n-2, n-1, k=1) - scipy.sparse.eye(n-2, n-1),
        lb=1,
    )
    res = minimize(fun=cost, x0=x0, bounds=bounds, constraints=increasing)
    assert res.success, res.message
    return np.concatenate(((0,), res.x.round().astype(int), (m-1,)))


def test() -> None:
    shock = np.array((
       131.7592766 ,  -5.28111954,  -5.30412333,   6.19553924,
        -5.97658804,  -7.83259865,  -9.50784211, -15.73856643,
       -23.31825084, -29.48978404, -31.46755173, -33.47232039,
       -34.66509473, -35.74717242, -36.47997764, -37.32640433,
       -37.41553313, -37.81559914, -38.75508336, -38.36080882,
       -36.72118142, -35.74776154, -34.14582487, -32.82878478,
       -31.40182366, -29.97427545, -28.61938541, -24.90985539,
       -21.32175733, -18.73506067, -16.07995167, -16.15493052,
       -16.1433169,
    ))

    x = piecewise(y=shock, n=6)
    fig, ax = plt.subplots()
    ax.set_xticks(x)
    ax.grid(True, color='#CCC')
    ax.scatter(np.arange(shock.size), shock, label='original')
    ax.plot(x, shock[x], label='approx')
    ax.legend()
    plt.show()


if __name__ == '__main__':
    test()

It doesn't do anything clever with regard to global optimization, so it may be prone to local minima, but it works well on your provided data.

[Figure: resulting piecewise-linear fit]

Well-ish. It works better with a 2nd-order-difference initialization vector (especially for a case like n=8):

from typing import Literal

import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import minimize, LinearConstraint
import scipy.sparse


def piecewise(y: np.ndarray, n: int, init: Literal['2diff', 'dist'] = '2diff') -> np.ndarray:
    """
    :param y: Length-m array to approximate
    :param n: Number of segments
    :param init: Initialization method, either 2nd-order differential or evenly distributed
    :return: (n+1) segment bound indices; will always include 0 and m-1
    """
    m = y.size
    x_inferred = np.arange(m)

    def cost(x: np.ndarray) -> float:
        xshort = (0, *x, m-1)
        yshort = np.interp(x=xshort, xp=x_inferred, fp=y)
        y_long = np.interp(x=x_inferred, xp=xshort, fp=yshort)
        error = y_long - y
        return error.dot(error)

    if init == '2diff':
        # Seed the knots at the n-1 points of greatest curvature
        diff2 = np.abs(np.diff(np.diff(y)))
        x0 = 1 + np.sort(diff2.argsort()[1-n:])
    else:
        # Evenly spaced interior knots
        x0 = np.linspace(start=m / n, stop=m * (n - 1) / n, num=n - 1)

    bounds = np.stack((
        np.arange(1, n),
        np.arange(m - n, m - 1),
    ), axis=1)
    increasing = LinearConstraint(
        A=scipy.sparse.eye(n-2, n-1, k=1) - scipy.sparse.eye(n-2, n-1),
        lb=1,
    )
    res = minimize(fun=cost, x0=x0, bounds=bounds, constraints=increasing)
    assert res.success, res.message
    return np.concatenate(((0,), res.x.round().astype(int), (m-1,)))


def test() -> None:
    shock = np.array((
       131.7592766 ,  -5.28111954,  -5.30412333,   6.19553924,
        -5.97658804,  -7.83259865,  -9.50784211, -15.73856643,
       -23.31825084, -29.48978404, -31.46755173, -33.47232039,
       -34.66509473, -35.74717242, -36.47997764, -37.32640433,
       -37.41553313, -37.81559914, -38.75508336, -38.36080882,
       -36.72118142, -35.74776154, -34.14582487, -32.82878478,
       -31.40182366, -29.97427545, -28.61938541, -24.90985539,
       -21.32175733, -18.73506067, -16.07995167, -16.15493052,
       -16.1433169,
    ))

    xdist = piecewise(y=shock, n=8, init='dist')
    x2diff = piecewise(y=shock, n=8, init='2diff')
    fig, ax = plt.subplots()
    ax.scatter(np.arange(shock.size), shock, label='original')
    ax.plot(xdist, shock[xdist], label='approx (dist)')
    ax.plot(x2diff, shock[x2diff], label='approx (2diff)')
    ax.legend()
    plt.show()


if __name__ == '__main__':
    test()

[Figure: fit with 2nd-order-difference initialization]
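Since the question also asked about weights, which the cost above ignores, the squared error can be replaced by a weighted sum. A standalone sketch of that variant (not tested against the full optimizer; zeroing out an outlier's weight is just one possible convention):

```python
import numpy as np


def weighted_cost(y: np.ndarray, w: np.ndarray, xshort) -> float:
    """Weighted SSE of the piecewise-linear interpolant through the
    knots ``xshort`` (same construction as ``cost`` above, with
    per-point weights added)."""
    x_inferred = np.arange(y.size)
    yshort = np.interp(x=xshort, xp=x_inferred, fp=y)
    y_long = np.interp(x=x_inferred, xp=xshort, fp=yshort)
    error = y_long - y
    return float(error.dot(w * error))  # sum_i w_i * error_i**2


y = np.array([0., 1., 5., 1., 0.])           # one outlier at index 2
w = np.array([1., 1., 0., 1., 1.])           # zero weight removes its influence
print(weighted_cost(y, w, [0, 4]))           # 2.0
print(weighted_cost(y, np.ones(5), [0, 4]))  # 27.0
```

Inside `piecewise`, this would simply replace the `cost` closure, with `w` passed in alongside `y`.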