The Shannon entropy is H = -Σ Ti * log2(Ti), summed over all outcomes i, where Ti is the probability of outcome i.
\r\n\r\n marks the end of a complete HTTP header; an incomplete HTTP header lacks it.
I have a network dump in PCAP format (dump.pcap), and I am trying to use Python to compute and compare the entropy of the number of HTTP packets with \r\n\r\n in the header versus those without it. I read the packets using:
import pyshark
pkts = pyshark.FileCapture('dump.pcap')
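To keep only HTTP traffic, I think I can pass a Wireshark display filter when opening the capture (just a sketch continuing from the snippet above; I have not verified it against my dump):

# open the capture restricted to packets matching the 'http' display filter
http_pkts = pyshark.FileCapture('dump.pcap', display_filter='http')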
I think Ti in the Shannon formula corresponds to the data in my dump file.
dump.pcap: https://uploadfiles.io/y5c7k
I already computed the entropy of IP addresses:
import numpy as np
import collections

# sample IP addresses (with duplicates, so the counts differ)
sample_ips = [
    "131.084.001.031",
    "131.084.001.031",
    "131.284.001.031",
    "131.284.001.031",
    "131.284.001.000",
]

# count occurrences of each distinct address
C = collections.Counter(sample_ips)
counts = np.array(list(C.values()), dtype=float)

# turn counts into probabilities and apply H = -sum(p * log2(p))
prob = counts / counts.sum()
shannon_entropy = (-prob * np.log2(prob)).sum()
print(shannon_entropy)  # ~1.522 bits for the sample above
Any ideas? Is it possible to compute the entropy of the number of HTTP packets with \r\n\r\n in the header versus those without it, or is that a nonsensical idea?
A few lines of the dump:
30 2017/246 11:20:00.304515 192.168.1.18 192.168.1.216 HTTP 339 GET / HTTP/1.1
GET / HTTP/1.1
Host: 192.168.1.216
accept-language: en-US,en;q=0.5
accept-encoding: gzip, deflate
accept: */*
user-agent: Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0
Connection: keep-alive
content-type: application/x-www-form-urlencoded; charset=UTF-8
While I don't see why you want to do it, I disagree with others who believe it is nonsensical.
You could, for instance, take a coin, flip it, and measure its entropy. Suppose you flip it 1,000 times and get 500 heads and 500 tails. That is a frequency of 0.5 for each outcome, or what statisticians would formally call an 'event'.
Now, since the two Ti's are equal (0.5 each), and the log base 2 of 0.5 is -1, the entropy of the coin is -2 * (0.5 * -1) = 1 bit (the 2 appears because adding two identical terms is the same as multiplying by 2, and the leading minus sign is the one in front of the sum).
What if the coin came up heads 127 times more often than tails? Tails now occurs with probability 1/128, which has a log base 2 of -7. Multiplying -7 by 1/128 and flipping the sign gives a contribution of 7/128, roughly 0.055. Heads has a probability really close to 1, and the log base 2 (or base anything) of 1 is zero, so that term contributes roughly nothing. Thus, the entropy of that biased coin is only about 0.06 bits, far less than the fair coin's 1 bit.
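If you want to check those numbers mechanically, here is a quick sanity check in Python (my own illustration, not code from the question):

import numpy as np

def entropy(probs):
    # H = -sum(p * log2(p)) over the outcome probabilities
    p = np.asarray(probs, dtype=float)
    return float(-(p * np.log2(p)).sum())

print(entropy([0.5, 0.5]))        # fair coin: 1.0 bit
print(entropy([127/128, 1/128]))  # biased coin: ~0.066 bits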
So the trick for you is to collect lots of random messages, and count them into two buckets. Then just do the calculations as above.
If you are asking how to do that counting, and you have the messages on a computer, you can use a tool like grep (the regular-expression search tool on Unix) or a similar utility on other systems; it will do the filtering and counting for you. A Python sketch for your capture is below.
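Putting it together for your pcap, here is a minimal sketch. It assumes a few things I have not verified against your dump: that pyshark's display_filter, use_json/include_raw options and get_raw_packet() behave as I remember, and that "the raw frame contains \r\n\r\n" is an acceptable proxy for "complete header":

import collections
import numpy as np
import pyshark

# bucket HTTP packets by whether their raw bytes contain the header
# terminator \r\n\r\n, then compute the two-bucket Shannon entropy
cap = pyshark.FileCapture('dump.pcap', display_filter='http',
                          use_json=True, include_raw=True)

buckets = collections.Counter()
for pkt in cap:
    raw = pkt.get_raw_packet()  # raw bytes of the whole frame
    key = 'complete' if b'\r\n\r\n' in raw else 'incomplete'
    buckets[key] += 1
cap.close()

counts = np.array(list(buckets.values()), dtype=float)
prob = counts / counts.sum()
shannon_entropy = (-prob * np.log2(prob)).sum()
print(buckets, shannon_entropy)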