I am new to gnuplot. I am also new to the Kernel Density Resampling using a Gaussian supported by gnuplot using “smooth kdensity”. I played with the gnuplot demo script provided below. I am trying to understand what the meaning of the second column is of $kdensity1. If I print $kdensity1 I get these values:
(...)
140.76 13.1663 13.1663 i
146.032 12.5092 12.5092 i
151.304 11.6501 11.6501 i
156.575 10.6298 10.6298 i
161.847 9.5347 9.5347 i
167.119 8.48325 8.48325 i
172.391 7.56657 7.56657 i
177.662 6.80631 6.80631 i
(...)
The first column seems to be the computed Gaussian kernel over the sample values defined by the second column of $viol1 provided by the randomized expression. But I am trying to understand how the second column of $kdensity1 is computed since it defines the “density” or the spread of the violin plot. And it seems the 20.0 constant from the $2/20.0 computing is assumed. But this value must surely be different given another sample set that has a different range. Thus, how is column 2 of $kdensity1 computed (or what is its relationship) and how can I find the constant (20) for computing the spreading?
nsamp = 3000
set print $viol1
do for [i=1:nsamp] {
y = (i%4 == 0) ? 300. + 70.*invnorm(rand(0)) \
: (i%4 == 1) ? 400. + 10.*invnorm(rand(0)) \
: 120. + 40.*invnorm(rand(0))
print sprintf(" 35.0 %8.5g", y)
}
unset print
set title "kdensity mirrored sideways to give a violin plot"
set table $kdensity1
plot $viol1 using 2:(1) smooth kdensity bandwidth 10. with filledcurves above y lt 9 title 'B'
unset table
set border 2
unset margins
unset xtics
set ytics nomirror rangelimited
set xrange [-1:5]
plot $kdensity1 using (3 + $2/20.):1 with filledcurve x=3 lt 9 notitle, '' using (3 - $2/20.):1 with filledcurve x=3 lt 9 notitle
You are referring to this violin plot demo. Check on Wikipedia Kernel_(statistics) and from gnuplot
help kdensity:and have a look at the further minimized example.
$Datarequires only one column of datasmooth kdensitythe part(1)is the weight of that datapoint, here constant value of1(because of the parentheses). Ashelp kdensitysuggests, you can use(1/N)instead, whereNis the number of points to get a normalized violin plot.with filledcurves above y lt 9 title 'B'for plotting it to a table. This generates a unnecessary 3rd column which is identical to the 2nd column.As I understand, for each datapoint there will be a Gaussian kernel with a given width and the area 1. And you sum up all Gaussians which will give you the shape of the curve in the second column of
$kdensity.Concerning
$2/20., I guess, this factor20.in the example is just a scaling factor that the two violin plots do not collide in the graph.In order to find out a reasonable scaling factor automatically, you could do a
statsto get the maximum of the second column of the datablock$kdensity. If you want to compare several violin plots you should take the maximum of all maxima as scaling factor.And the
3 +and3 -in the plot command is simply an x-offset where your violin plot is mirrored.Script:
Actually, you could shorten the plot command to a single line:
Result: