R persistence homology - identify points that generate topological features

157 Views Asked by At

I am using the TDA package in R and successfully running persistence homology using the gridDiag() function. I am interested in the points that are related to the loops (a.k.a. 1 dimensional simplicial complexes). When I use the gridDiag() function with the parameter location = TRUE, you can recieve the cycleLocation to draw the simplicial complex back onto your point cloud. Here is some example code:

#generate data
set.seed(2)
x = runif(60, min=0, max=100)
y = runif(60, min=0, max=100)
coords <- cbind(x,y)
plot(coords)

#compute persistent homology, with location = TRUE
library(TDA)
Xlim=c(min(coords[,1]), max(coords[,1]))
Ylim=c(min(coords[,2]), max(coords[,2]))
by=1
lim = cbind(Xlim, Ylim)
Diag <- gridDiag(coords, distFct, lim = lim, by = by, sublevel = TRUE,
                 library = "Dionysus", location = TRUE, printProgress = TRUE)

#plot
par(mfrow = c(1, 3))
plot(coords, cex = 0.5, pch = 19)
title(main = "Data")
threshold = 1 #persistence value for topological features plotted
plot(Diag[["diagram"]], band = 2*threshold)
title(main = "Distance Function Diagram")
one <- which(Diag[["diagram"]][, 1] == 1 & sqrt(0.5*(Diag[["diagram"]][, "Death"]-Diag[["diagram"]][, "Birth"]))>threshold)
plot(coords, col = 2, main = "Representative loop of grid points")
for (i in seq(along = one)) {
  points(Diag[["birthLocation"]][one[i], , drop = FALSE], pch = 15, cex = 3,
         col = i)
  points(Diag[["deathLocation"]][one[i], , drop = FALSE], pch = 17, cex = 3,
         col = i)
  for (j in seq_len(dim(Diag[["cycleLocation"]][[one[i]]])[1])) {
    lines(Diag[["cycleLocation"]][[one[i]]][j, , ], pch = 19, cex = 1, col = i)
  }
}

Plot from the above example code.

However, the object you recieve is the null space between the growing radius balls. My question is whether there is a simple way to obtain the point coordinates that intiate the loop? Specifically, when the loop is born, can you identify the points that overlap in their radial balls that generate the loop.

A similar question was asked here, however, the solution uses a different clustering algorithm which works well only for the type of dataset given as an example. In my case, and the example data I give, the points are not clearly separated and I would like to know if I can get my answer from the computation already performed. Ideally, a list where each sub-list is for each thresholded simplicial complex, which contains a vector of the vertex indexes in coords which generates that simplicial complex.

2

There are 2 best solutions below

0
seb_math_bio On

I have a solution based on the fact that we know the radii of the balls and so can dilate the loop from cycleLocation by the same amount. Then, we identify all the points which then lie within that loop.

See Edit for update There is some discrepancy with the original cycleLocation and the polygon used as an input to the dilation function (i.e. ashape()) as cycleLocation vertices appeared unordered which makes it hard to convert to a standard polygon, hence the need to obtain a new polygon with a concave hull function. Here is the output I get so you can see for yourself:

Plot from solution code

New plot from updated solution code

N.B. Coordinates can be vertices of multiple simplicial complexes but since we have simplicial complexes that share vertices the plot has given the coordinate the last colour of the simplical complex to be computed.

It works pretty well but I think there is (or should be) a direct output of gridDiag() or gridFiltration() that simply identifies the coordinates of your vertices back onto your point cloud. Something I cannot work out at the moment...

  find_loop_vertex_indexes <- function(coords,
                                     ph,
                                     one,
                                     visualise=FALSE
) {
  #import
  library(alphahull)
  library(igraph)
  library(polyclip)
  library(sf)

  #main
  cmplx_generators <- list()
  if (visualise==TRUE) {
    par(mfrow = c(1, 1))
    plot(coords, cex = 0.5, pch = 19)
    
  }
  for (i in seq(along = one)) {
    # Extract loop coordinates ####
    loop_coords <- c()
    for (j in seq_len(dim(ph[["cycleLocation"]][[one[i]]])[1])) {
      loop_coords <- rbind(loop_coords, ph[["cycleLocation"]][[one[i]]][j,1, ])
    }
    poly_points <- unique(data.frame(x=loop_coords[,1], y=loop_coords[,2]))
    
    # Alpha shape ####
    #preamble
    for (id in 1:dim(poly_points)[1]) {
      separation_dist <- euc_dist_many(poly_points, poly_points[id,])
    }
    #parameters
    alphaRes <- unique(sort(separation_dist[separation_dist!=0]))
    minAlpha = alphaRes[1]
    maxAlphaIterations <- length(separation_dist[separation_dist!=0])
    boundary <- list()
    alphaParams <- list()
    cluster_coords <- poly_points
    aggregation = 1
    
    #main
    boundary <- list()
    alphaParams <- list()
    if (length(cluster_coords$x)<3) {
      warning("In a cluster, at least three non-collinear points are required.")
      alpha=NA
      boundary[[aggregation]] <- data.frame(NA)
      alphaParams[[aggregation]] <- data.frame(final_alpha=alpha, minAlpha=minAlpha, alphaRes=alphaRes)
    } else {
      alpha=alphaRes[1]
      loop=0
      nTimes=0
      linear=FALSE
      while (nTimes<maxAlphaIterations & loop==0) {
        linear <- are_points_on_line1(cluster_coords)
        if (linear) {
          warning("Aggregation identified lies on a line.")
          break
        }
        ashape.obj <- ashape(cluster_coords,alpha=alpha)
        
        # Convert alpha shapes to polygons ####
        ashape.obj$edges[,1] <- as.character(ashape.obj$edges[,1])
        while (nTimes<maxAlphaIterations & length(ashape.obj$edges[,1])<2) {
          nTimes = nTimes + 1
          alpha = alphaRes[nTimes]
          if (linear) {
            warning("Aggregation identified lies on a line.")
            break
          }
          ashape.obj <- ashape(cluster_coords,alpha=alpha)
          ashape.obj$edges[,1] <- as.character(ashape.obj$edges[,1])
        }
        ashape_graph <- graph_from_edgelist(matrix(ashape.obj$edges[,1:2],nrow=dim(ashape.obj$edges)[1],ncol=2), directed = FALSE)
        if (!igraph::is.connected(ashape_graph)) {
          nTimes = nTimes + 1
          alpha = alphaRes[nTimes]
          # warning("Graph not connected")
        } else if (any(igraph::degree(ashape_graph) != 2)) {
          nTimes = nTimes + 1
          alpha = alphaRes[nTimes]
          # warning("Graph not circular")
        } else if (igraph::clusters(ashape_graph)$no > 1) {
          nTimes = nTimes + 1
          alpha = alphaRes[nTimes]
          # warning("Graph composed of more than one circle")
        } else {
          loop=1
          # Delete one edge to create a chain
          cut_graph <- ashape_graph - E(ashape_graph)[1]
          # Find chain end points
          ends = names(which(degree(cut_graph) == 1))
          path = get.shortest.paths(cut_graph, ends[1], ends[2])$vpath[[1]]
          # this is an index into the points
          pathX = as.numeric(V(ashape_graph)[path]$name)
          # join the ends
          pathX = c(pathX, pathX[1])
          ashapePoly <- pathX
        }
      }
      if (nTimes>=maxAlphaIterations) {
        warning("No final boundary. Reached maximum iterations.")
        boundary[[aggregation]] <- data.frame(NA)
        alphaParams[[aggregation]] <- data.frame(final_alpha=alpha, minAlpha=minAlpha, alphaRes=alphaRes)
        alpha=minAlpha
      } else if (linear) {
        warning("No final boundary. Points are linear along an axis.")
        boundary[[aggregation]] <- data.frame(NA)
        alphaParams[[aggregation]] <- data.frame(final_alpha=alpha, minAlpha=minAlpha, alphaRes=alphaRes)
        alpha=minAlpha
      } else {
        boundary[[aggregation]] <- cluster_coords[ashapePoly,]
        alphaParams[[aggregation]] <- data.frame(final_alpha=alpha, minAlpha=minAlpha, alphaRes=alphaRes)
        alpha=minAlpha
      }
    }
    
    if (dim(boundary[[aggregation]])[1] == 0) {
      cmplx_generators[[i]] <- c()
    } else {
      # Offset ####
      poly_offset <- polyoffset(boundary, ph[["diagram"]][one[i], "Birth"])
      if (visualise == TRUE) {
        #visualise
        polygon(boundary[[1]]$x, boundary[[1]]$y, border = i)
        polygon(poly_offset[[1]]$x, poly_offset[[1]]$y, border = i, lty = "dashed")
      }
      
      #convert to sf ####
      points_sf_input <- as.matrix(coords)
      points_sf <- st_multipoint(points_sf_input)
      poly_points_sf_input <- as.matrix(cbind(poly_offset[[1]]$x, poly_offset[[1]]$y))
      poly_points_sf <- st_multipoint(poly_points_sf_input)
      poly_sf <- st_cast(poly_points_sf, "POLYGON")
      
      # Identify points ####
      vertices_sf <- st_intersection(points_sf, poly_sf)
      
      # Construct return object ####
      vertices <- as.matrix(st_coordinates(vertices_sf))
      cmplx_generators[[i]] <- which(coords[,1] %in% vertices[,1] & coords[,2] %in% vertices[,2])
    }
  }
  # Highlight coordinates used as a simplicial complex vertex
  if (visualise==TRUE) {
    for (loop_id in 1:length(cmplx_generators)) {
      points(coords[cmplx_generators[[loop_id]],], cex = 1, pch = 2, col = loop_id, )
    }
  }
  return(cmplx_generators)
}

For the function euc_dist_many(), this is a personal function to calculate the distance of many coordinates from one other coordinate. Here is the code for that:

euc_dist_many <- function(coords,
                          coords1
) {
  #main
  coords1 <- data.frame(x=coords1[1], y=coords1[2])
  coords1_many <- do.call("rbind", replicate(dim(coords)[1], coords1, simplify = FALSE))
  return(sqrt(rowSums((coords - coords1_many) ^ 2)))
}

Also for the function are_points_on_line1(), this is a personal function to check whether all coordinates sit on a line. Here is the code for that too:

are_points_on_line1 <- function(coords
) {
  #main
  index2=2
  coords1 <- as.numeric(coords[1,])
  coords2 <- as.numeric(coords[index2,])
  
  #check if coords are the same and change coords until they are not
  while ((coords1[1]-coords2[1])==0 & (coords1[2]-coords2[2])==0 & index2<dim(coords)[1]) {
    index2=index2+1
    coords2 <- as.numeric(coords[index2,])
  }
  
  index3=index2+1
  linear=TRUE
  while (linear & index3<dim(coords)[1]) {
    coords3 <- as.numeric(coords[index3,])
    P = (coords2[1]-coords1[1])*(coords3[2]-coords1[2]) - (coords2[2]-coords1[2])*(coords3[1]-coords1[1])
    if (P!=0) {
      linear=FALSE
    }
    index3=index3+1
  }
  
  return(linear)
}

Edit I have improved the parameter sweep for the concave algorithm ashape() and reformatted the main code as a function. The update in the parameter sweep means that the 'discrepancy' noted before is significantly less of a problem, if at all.

0
OnDraganov On

I do not know the particularities of the R's TDApackage, but you can get loop representatives from the reduced boundary matrices, if you have access to those.

If $R$ is the reduced boundary matrix, and you are interested in a bar $(i,j)$, where $i,j$ are the birth and the death simplex, respectively, then $R[j]$, the $j$-th column of $R$, is a representative of the feature that was born with the $i$-th simplex and died with the $j$-th. The advantage of this representative is that it is one that gets sent to zero when it dies (as opposed to getting merged with an older non-trivial feature). The disadvantage is that it is very non-canonical and will differ depending on the reduction algorithm used.

As an alternative, you can get a somewhat canonical representative in a similar way. Let $V$ be a matrix initialised as the identity, with which we perform every column operation that we perform to reduce the boundary matrix. That is, $R=B\cdot V$, where $B$ is the boundary matrix. If we have, again, a bar $(i,j)$, then $V[i]$ is the unique representative of the feature such that it is only composed of one birth simplex, and all the other are death simplices. This also makes it the lexicographically-minimal representative.

Here is a simple Python code to showcase those. I use gudhi there to compute the alpha complex, and then do the persistent homology from scratch to show where exactly you get the representatives. The plot shows the $R[j]$ representative in solid blue, and the $V[i]$ one in dashed red.

import gudhi
import random
import itertools
from matplotlib import pyplot as plt

n = 100 #number of random points to generate
points = [(random.random(), random.random()) for _ in range(n)]
gudhi_complex = gudhi.AlphaComplex(points).create_simplex_tree()
alpha_complex = {tuple(sorted(simplex)) : radius for simplex, radius in gudhi_complex.get_filtration()}
boundary_matrix = {simplex : set(itertools.combinations(simplex, len(simplex)-1))-{tuple()} for simplex in alpha_complex}

R = { k : v for k,v in boundary_matrix.items()}
V = { k : {k} for k in boundary_matrix}
lowinv = {} # lowinv[i]=index of column with the lowest 1 at i
order_function = lambda s: (alpha_complex[s], s)
for s in sorted(R, key=order_function):
    t = lowinv.get(max(R[s], key=order_function),-1) if len(R[s])!=0 else -1
    while t!=-1:
        R[s] = R[t]^R[s] # symmetric difference of t-th and s-th columns
        V[s] = V[t]^V[s]
        t = lowinv.get(max(R[s], key=order_function),-1) if len(R[s])!=0 else -1
    if len(R[s])!=0:
        lowinv[max(R[s], key=order_function)] = s

loops = [bar for bar in lowinv.items() if len(bar[0])==2]
longest_loop = max(loops, key=lambda bar: alpha_complex[bar[1]]-alpha_complex[bar[0]])
representative_1 = R[longest_loop[1]]
representative_2 = V[longest_loop[0]]

for edge in representative_1:
    plt.plot( *zip(points[edge[0]], points[edge[1]]), '-', color='blue', alpha=.6 )
for edge in representative_2:
    plt.plot( *zip(points[edge[0]], points[edge[1]]), '--', color='red', alpha=.6 )
plt.plot(*zip(*points),'o')
plt.show()

A few outputs of this code follow

figure_1 figure_2 figure_3