For a few weeks, I have used the following script to produce a scatterplot of approximately 10,000 strictly positive data points. Only a few (<20) data points were dropped because of transformation warnings.
library(ggplot2)
library(scales)  # trans_breaks(), trans_format(), math_format()

visual <- ggplot(data = dots, aes(GRNHLin, REDHLin)) +
  geom_point(colour = rgb(0.17, 0.44, 0.71), size = 0.5, alpha = 0.25) +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x)),
                limits = c(1, 1e4)) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x)),
                limits = c(1, 1e3))
visual
This week, I started doing some model-based clustering. The script I wrote (see below) uses the same dataset (10,000 strictly positive data points) but drops more than 9,000 of them, with these warnings:
Warning messages:
1: In self$trans$transform(x) : NaNs produced
2: Transformation introduced infinite values in continuous x-axis
3: In self$trans$transform(x) : NaNs produced
4: Transformation introduced infinite values in continuous y-axis
5: Removed 9692 rows containing missing values (geom_point).
This is the second script:
library(mclust)      # Mclust()
library(factoextra)  # fviz_cluster()
library(scales)      # trans_breaks(), trans_format(), math_format()

dots.Mclust <- Mclust(dots, modelNames = "VVV", G = 8)
visual <- fviz_cluster(dots.Mclust,
                       ellipse = FALSE,
                       shape = 20,
                       geom = c("point")) +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x)),
                limits = c(1, 1e3)) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x)),
                limits = c(1, 1e4))
visual
EDIT
Some additional information:
The dataset contains only values greater than 0. head(dots.Mclust) gives the following:
$data
GRNHLin RED2HLin
[1,] 81.50364 176.379654
[2,] 57.94751 116.310577
[3,] 42.89310 119.758621
[4,] 41.82213 275.607971
[5,] 437.14648 141.309647
[6,] 15.20952 177.128616
[7,] 18.88731 257.249207
[8,] 768.64935 172.374069
[9,] 24.66220 118.283150
[10,] 17.12160 68.955154
[11,] 73.00019 71.517052
[12,] 1182.08911 180.694122
[13,] 320.09827 224.808563
[14,] 268.42401 235.375259
[15,] 149.05655 205.708282
[16,] 98.43160 152.093704
[17,] 25.10120 177.061386
[18,] 293.87103 239.007050
[19,] 118.42249 295.722168
[20,] 724.16718 243.950455
[21,] 255.26083 128.209717
[22,] 105.15983 247.946701
[23,] 86.25691 220.004745
[24,] 122.01743 32.232780
[25,] 50.42104 9.923141
After removing the log scaling from the x-axis and y-axis, the graph shows that something goes wrong with the data points. There are no negative values in the dataset, yet a lot of points are plotted below 0. Furthermore, the x-axis and y-axis do not extend far enough to cover the values found in entry [12,]. This is probably the underlying cause of the problem. But how do these wrong values arise?
What is the underlying issue here?
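As mentioned in the comments, the sample data are indeed centered and rescaled before plotting: fviz_cluster standardizes the data by default. Standardized values are centered on 0, so many of them are negative, and the log10 transformation turns those into NaNs. The standardization can be turned off by including
stand = FALSE
in the options of fviz_cluster.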
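The effect can be reproduced without any plotting. The sketch below is only an illustration: it copies a few rows from the $data output in the question, standardizes them with base R's scale() (assumed here to mirror what fviz_cluster does by default), and shows that the standardized values can no longer be log-transformed. The second part assumes the option meant above is fviz_cluster's stand argument; it runs only if mclust and factoextra are installed, and it uses simulated log-normal data (sim is a hypothetical stand-in for dots).

```r
# A few rows copied from the $data output in the question
dots_sample <- cbind(
  GRNHLin  = c(81.50364, 57.94751, 437.14648, 1182.08911, 50.42104),
  RED2HLin = c(176.379654, 116.310577, 141.309647, 180.694122, 9.923141)
)

# scale() centers each column on 0 and divides it by its standard deviation
scaled <- scale(dots_sample)

all(dots_sample > 0)  # the raw values are strictly positive
any(scaled < 0)       # the standardized values are not

# log10() of a negative number is NaN, so a log axis drops those points
log_scaled <- suppressWarnings(log10(scaled))
sum(is.nan(log_scaled))

# Plotting on the original scale (assumes stand = FALSE is the intended option)
if (requireNamespace("mclust", quietly = TRUE) &&
    requireNamespace("factoextra", quietly = TRUE)) {
  library(mclust)      # attached because Mclust() relies on other mclust functions
  library(factoextra)  # also attaches ggplot2

  set.seed(1)
  # Simulated positive data standing in for the real dataset
  sim <- cbind(GRNHLin = rlnorm(500, 4.5, 1), RED2HLin = rlnorm(500, 5, 0.5))
  fit <- Mclust(sim, modelNames = "VVV", G = 2)

  visual <- fviz_cluster(fit, stand = FALSE, ellipse = FALSE,
                         shape = 20, geom = "point") +
    scale_x_log10() +
    scale_y_log10()
}
```

With stand = FALSE the plotted coordinates stay in the original units, so the log scales and the limits from the first script apply again.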