Gnuplot: Meaning of the second column of smooth kdensity

Question

Asked by Guy B on 28 February 2024 at 20:02

I am new to gnuplot and to the Gaussian kernel density estimation it provides through “smooth kdensity”. I played with the gnuplot demo script shown below and am trying to understand the meaning of the second column of $kdensity1. If I print $kdensity1, I get these values:

(...)
140.76 13.1663 13.1663 i
146.032 12.5092 12.5092 i
151.304 11.6501 11.6501 i
156.575 10.6298 10.6298 i
161.847 9.5347 9.5347 i
167.119 8.48325 8.48325 i
172.391 7.56657 7.56657 i
177.662 6.80631 6.80631 i
(...)

The first column seems to be the Gaussian kernel computed over the sample values in the second column of $viol1, which come from the randomized expression. What I do not understand is how the second column of $kdensity1 is computed, since it defines the “density”, that is, the spread of the violin plot. The constant 20.0 in $2/20.0 seems to be simply assumed, but it would surely have to be different for a sample set with a different range. So how is column 2 of $kdensity1 computed (or what does it relate to), and how can I find the constant (20) used for the spread?

nsamp = 3000 
set print $viol1
do for [i=1:nsamp] {
    y = (i%4 == 0) ? 300. +  70.*invnorm(rand(0)) \
      : (i%4 == 1) ? 400. +  10.*invnorm(rand(0)) \
      :              120. +  40.*invnorm(rand(0))
    print sprintf(" 35.0 %8.5g", y)
}
unset print

set title "kdensity mirrored sideways to give a violin plot"

set table $kdensity1
plot $viol1 using 2:(1) smooth kdensity bandwidth 10. with filledcurves above y lt 9 title 'B'
unset table

set border 2
unset margins
unset xtics
set ytics nomirror rangelimited

set xrange [-1:5]
plot  $kdensity1 using (3 + $2/20.):1 with filledcurve x=3 lt 9 notitle, '' using (3 - $2/20.):1 with filledcurve x=3 lt 9 notitle

**theozh** · Accepted Answer · 2024-02-29T11:48:35.103000

You are referring to the gnuplot violin plot demo. See the Wikipedia article Kernel_(statistics) and this excerpt from gnuplot's help kdensity:

The smooth kdensity option generates and plots a kernel density estimate using Gaussian kernels for the distribution from which a set of values was drawn. Values are taken from the first data column, optional weights are taken from the second column. A Gaussian is placed at the location of each point and the sum of all these Gaussians is plotted as a function. To obtain a normalized histogram, each weight should be 1/number-of-points.
...

Also have a look at the further minimized example below. Some notes:

- The input data $Data requires only one column of data.
- When plotting to a table with the option smooth kdensity, the part (1) is the weight of each data point, here the constant value 1 (because of the parentheses). As help kdensity suggests, you can use (1./N) instead, where N is the number of points, to get a normalized violin plot. Note the decimal point: N is an integer, and gnuplot would evaluate 1/N as integer division.
- I don't understand why the original example uses with filledcurves above y lt 9 title 'B' when plotting to a table. This generates an unnecessary third column that is identical to the second.
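To illustrate the normalized variant, here is a minimal sketch that uses weights 1./N and then estimates the area under the resulting curve with a simple Riemann sum. The names $Norm, dx and area are only illustrative, and the fixed xrange and sample count are assumptions chosen so that the Gaussian tails are not cut off. The printed area should come out close to 1.

```gnuplot
### sketch: normalized kdensity weights (illustrative names: $Norm, dx, area)
reset session

$Data <<EOD
1.0
2.0
4.0
1.0
2.0
2.0
EOD

N = |$Data|                 # number of data points (an integer)

# fixed x-range, wide enough that the Gaussian tails are not cut off
set xrange [0:5]
set samples 200

set table $Norm
    # 1./N forces floating-point division; 1/N would be integer division, i.e. 0
    plot $Data u 1:(1./N) smooth kdensity bandwidth 0.2
unset table

# approximate the area under the curve by a Riemann sum over the table rows
stats [*:*][*:*] $Norm u 1:2 nooutput
dx   = (STATS_max_x - STATS_min_x) / (STATS_records - 1)
area = STATS_sum_y * dx
print sprintf("area under normalized curve = %.4f", area)
### end of sketch
```

With the weight (1) instead of (1./N), the same check should give an area close to N.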

As I understand it, each data point gets a Gaussian kernel with the given bandwidth and an area equal to its weight (here 1). Summing all these Gaussians gives the shape of the curve, and that sum is what the second column of $kdensity contains.
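Written as a formula, this is the usual weighted sum of Gaussian kernels. The symbols are notation introduced here, not taken from the gnuplot documentation: x_i are the data values, w_i the weights, h the bandwidth and N the number of points. The normalization of each kernel follows the description above and has not been checked against the gnuplot source code.

```latex
y(x) = \sum_{i=1}^{N} \frac{w_i}{h\sqrt{2\pi}}
       \exp\!\left(-\frac{(x - x_i)^2}{2h^2}\right),
\qquad
\int_{-\infty}^{\infty} y(x)\,\mathrm{d}x = \sum_{i=1}^{N} w_i
```

With all weights equal to 1 the area under the curve is N, so for the 3000 samples in the question the curve integrates to about 3000, which is consistent with column 2 holding values of order 10. With weights 1/N the area is 1.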

Concerning $2/20.: I guess the factor 20. in the example is just a scaling factor chosen so that neighbouring violin plots do not overlap in the graph.

To find a reasonable scaling factor automatically, you can run stats on the second column of the datablock $kdensity and use its maximum. If you want to compare several violin plots, take the largest of all the maxima as the common scaling factor.
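For several violins, the common scaling factor could be determined as in the following sketch. The datablocks $A and $B, their values, and the names $kdA, $kdB, maxA, maxB, width and scale are all made up for illustration; the half-width of 0.4 is an arbitrary choice that keeps violins placed one x-unit apart from touching.

```gnuplot
### sketch: common scaling factor for several violins
### ($A, $B, $kdA, $kdB, maxA, maxB, width, scale are illustrative names)
reset session

$A <<EOD
1.0
2.0
4.0
1.0
2.0
2.0
EOD

$B <<EOD
3.0
3.5
3.0
2.5
4.5
3.0
EOD

set table $kdA
    plot $A u 1:(1) smooth kdensity bandwidth 0.2
unset table
set table $kdB
    plot $B u 1:(1) smooth kdensity bandwidth 0.2
unset table

# maximum of column 2 for each density curve
stats [*:*][*:*] $kdA u 2 nooutput
maxA = STATS_max
stats [*:*][*:*] $kdB u 2 nooutput
maxB = STATS_max

# one common factor: the tallest curve gets the half-width `width`
width = 0.4
scale = width / (maxA > maxB ? maxA : maxB)

set key noautotitle
set xrange [0:3]
set yrange [0:6]
plot for [i=-1:1:2] $kdA u (1 + i*$2*scale):1 w filledcurve x=1 lt 9, \
     for [i=-1:1:2] $kdB u (2 + i*$2*scale):1 w filledcurve x=2 lt 10
### end of sketch
```

Because both violins share the same factor, their widths stay comparable: the wider violin belongs to the data set with the higher peak density.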

The 3 + and 3 - in the plot command are simply the x-offset (here x=3) about which the violin plot is mirrored.

Script:

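### violin plot
reset session

$Data <<EOD
1.0
2.0
4.0
1.0
2.0
2.0
EOD

N = |$Data|    # number of data points (only needed for normalized weights 1./N)
set table $kdensity
    # column 1: data values, (1): constant weight for every point
    plot $Data u 1:(1) smooth kdensity bandwidth 0.2
unset table

# maximum of the density curve (column 2), used as scaling factor
stats [*:*][*:*] $kdensity u 2
print STATS_max, STATS_min

set key noautotitle
set xrange[1:5]
set yrange[0:5]

# mirror the curve about x=3, scaled to a maximum half-width of 1
plot $kdensity u (3 + $2/STATS_max):1 w filledcurve x=3 lt 9, \
            '' u (3 - $2/STATS_max):1 w filledcurve x=3 lt 9
### end of script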

Actually, you could shorten the plot command to a single line:

plot for [i=-1:1:2] $kdensity u (3 + i*$2/STATS_max):1 w filledcurve x=3 lt 9

Result: [figure: violin plot produced by the script above]
