Warnings when transforming to logarithmic scale: a lot of NaNs produced


For a few weeks, I have been using the following script to produce a scatterplot of approximately 10,000 (non-zero, positive) datapoints. Only a few (<20) datapoints were excluded because of warnings during the transformation.

library(ggplot2)
library(scales)  # for trans_breaks(), trans_format(), math_format()

# Scatterplot with log10 scales, 10^x breaks and exponent-style labels
visual <- ggplot(data = dots, aes(GRNHLin, REDHLin)) +
    geom_point(colour = rgb(0.17, 0.44, 0.71), size = 0.500, alpha = 0.250) +
    scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                  labels = trans_format("log10", math_format(10^.x)), limits = c(1, 1e4)) +
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                  labels = trans_format("log10", math_format(10^.x)), limits = c(1, 1e3))
visual

This week, I started doing some model-based clustering. The script I wrote (see below) uses the same dataset (10,000 non-zero, positive datapoints) but drops more than 9,000 datapoints because of:

Warning messages:
1: In self$trans$transform(x) : NaNs produced
2: Transformation introduced infinite values in continuous x-axis 
3: In self$trans$transform(x) : NaNs produced
4: Transformation introduced infinite values in continuous y-axis 
5: Removed 9692 rows containing missing values (geom_point). 
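
For context, this is exactly what R's log transform does with non-positive input: log10 returns NaN for negative values and -Inf for zero, which matches the warnings above. A minimal illustration:

log10(c(100, 1, 0, -5))
# [1]    2    0 -Inf  NaN
# Warning message: In log10(c(100, 1, 0, -5)) : NaNs produced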

This is the second script:

library(mclust)      # Mclust()
library(factoextra)  # fviz_cluster()

# Fit an 8-component Gaussian mixture with unconstrained covariances
dots.Mclust <- Mclust(dots, modelNames = "VVV", G = 8)

visual <- fviz_cluster(dots.Mclust,
             ellipse = FALSE,
             shape = 20,
             geom = c("point")) +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x)), limits = c(1, 1e3)) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x)), limits = c(1, 1e4))
visual

EDIT

Some additional information:

The dataset contains only values that are larger than 0. head(dots.Mclust) shows the following:

$data
           GRNHLin    RED2HLin
   [1,]   81.50364  176.379654
   [2,]   57.94751  116.310577
   [3,]   42.89310  119.758621
   [4,]   41.82213  275.607971
   [5,]  437.14648  141.309647
   [6,]   15.20952  177.128616
   [7,]   18.88731  257.249207
   [8,]  768.64935  172.374069
   [9,]   24.66220  118.283150
  [10,]   17.12160   68.955154
  [11,]   73.00019   71.517052
  [12,] 1182.08911  180.694122
  [13,]  320.09827  224.808563
  [14,]  268.42401  235.375259
  [15,]  149.05655  205.708282
  [16,]   98.43160  152.093704
  [17,]   25.10120  177.061386
  [18,]  293.87103  239.007050
  [19,]  118.42249  295.722168
  [20,]  724.16718  243.950455
  [21,]  255.26083  128.209717
  [22,]  105.15983  247.946701
  [23,]   86.25691  220.004745
  [24,]  122.01743   32.232780
  [25,]   50.42104    9.923141

After removing the log scaling from the x-axis and y-axis, the graph looks as follows. Apparently something goes wrong with the datapoints: there are no negative values in the dataset, yet a lot of plotted points lie below 0. Furthermore, the x-axis and y-axis do not cover the values found in entry [12,]. This is probably the underlying cause of the problem, but how do these wrong values occur?

Graph after plotting (without scaling x-axis and y-axis).
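
A quick sanity check (a sketch, assuming the dots.Mclust object shown above) confirms the raw data contain no zeros or negative values:

# TRUE would indicate non-positive values in the raw data
any(dots.Mclust$data <= 0)
# [1] FALSE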

What is the underlying issue here?

ANSWER

It is indeed correct, as mentioned in the comments, that fviz_cluster centers and rescales (standardizes) the sample data before plotting. Standardized values below each column mean become negative, and the log transformation then turns them into NaNs. This behaviour can be turned off by including

stand = FALSE,

in the options of fviz_cluster.
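
A minimal sketch of the corrected call, reusing the arguments from the question with standardization disabled:

visual <- fviz_cluster(dots.Mclust,
             stand = FALSE,    # plot the raw data instead of the standardized copy
             ellipse = FALSE,
             shape = 20,
             geom = c("point")) +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x)), limits = c(1, 1e3)) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x)), limits = c(1, 1e4))
visual

With stand = FALSE the plotted coordinates match the raw (strictly positive) data, so the log10 scales no longer produce NaNs or infinite values.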