I am trying to work on this question, but I am not sure how to use the Euclidean distance formula to find the solution.
Question:
Following are keywords, frequencies, and token counts from 3 other documents.
Doc 4 – tablet: 7; memory: 5; apps: 8; sluggish: 5
Doc 5 – memory: 4; performance: 6; playbook: 8; apps: 6
Doc 6 – tablet: 6; performance: 3; playbook: 7; sluggish: 3
Token counts: Doc 4: 55; Doc 5: 60; Doc 6: 65
(i) Use Euclidean Distance to calculate similarity values for the three pairs of documents (4,5), (4,6), (5,6) with relative frequency values. State the distance for each pair to 4 decimal places (4 d.p.).
I have tried to use the Euclidean Distance formula with the given pairs of documents to find the distance for each pair.
This is the equation that I have tried to use:
dist((x, y), (a, b)) = √((x - a)² + (y - b)²)
According to the solutions this is what the answer should be:
Euclidean D4,D5 = 0.2343 to 4 d.p.
Euclidean D5,D6 = 0.1693 to 4 d.p.
Euclidean D4,D6 = 0.2153 to 4 d.p.
Any help would be appreciated.
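First, build the document-term matrix from the relative term frequencies. The relative frequency of a term is the number of times it occurs in a document divided by the number of tokens in that document. A term that does not occur in a document gets 0. That gives the table below:
Doc 4 – tablet: 7/55; memory: 5/55; apps: 8/55; sluggish: 5/55; performance: 0; playbook: 0
Doc 5 – tablet: 0; memory: 4/60; apps: 6/60; sluggish: 0; performance: 6/60; playbook: 8/60
Doc 6 – tablet: 6/65; memory: 0; apps: 0; sluggish: 3/65; performance: 3/65; playbook: 7/65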
Since you already have the distance formula, I will just calculate the distance between documents 4 and 5 as an example; the other two pairs work the same way.
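For reference, the two-coordinate formula from the question extends to any number of terms. Written out in LaTeX, with p_i and q_i as my own symbols for the relative frequencies of term i in the two documents and n as the number of distinct terms (6 for this question):

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```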
d(Doc 4, Doc 5) = [(7/55 - 0)^2 + (5/55 - 4/60)^2 + (8/55 - 6/60)^2 + (5/55 - 0)^2 + (0 - 6/60)^2 + (0 - 8/60)^2]^(1/2) ≈ 0.23430, which rounds to 0.2343 (terms in the order tablet, memory, apps, sluggish, performance, playbook).
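If you want to check all three pairs at once, here is a minimal Python sketch of the same calculation. The counts and token totals are copied from the question; the names `counts`, `tokens`, `vocab`, `relative_frequencies`, `euclidean`, and `distance` are my own choices for illustration, not part of the exercise.

```python
import math

# Raw keyword counts and token totals, copied from the question.
counts = {
    "Doc 4": {"tablet": 7, "memory": 5, "apps": 8, "sluggish": 5},
    "Doc 5": {"memory": 4, "performance": 6, "playbook": 8, "apps": 6},
    "Doc 6": {"tablet": 6, "performance": 3, "playbook": 7, "sluggish": 3},
}
tokens = {"Doc 4": 55, "Doc 5": 60, "Doc 6": 65}

# Every distinct term across the three documents, in a fixed order.
vocab = sorted({term for doc in counts.values() for term in doc})


def relative_frequencies(doc):
    """Return count / token total for each vocabulary term (0 if absent)."""
    return [counts[doc].get(term, 0) / tokens[doc] for term in vocab]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance(doc_a, doc_b):
    """Euclidean distance between the relative-frequency vectors of two docs."""
    return euclidean(relative_frequencies(doc_a), relative_frequencies(doc_b))


for a, b in [("Doc 4", "Doc 5"), ("Doc 4", "Doc 6"), ("Doc 5", "Doc 6")]:
    print(f"{a}, {b}: {distance(a, b):.4f}")
```

This should print 0.2343, 0.2153, and 0.1693 for the three pairs, which agrees with the values in your solutions.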