Druid Cohort Analysis?


We collect data on our website traffic, which results in about 50k to 100k unique visits a day.

Cohort analysis:

Of the users who register on the website within a 24-hour period, find the percentage who then actually go to our purchasing page, broken down by how long after registration they do so (within the first hour, the second hour, the third hour, and so on).

Two very abbreviated sample documents:

  • sessionId: our unique identifier for performing counts
  • url: the url for evaluating cohorts
  • time: Unix timestamp of the event, in milliseconds

{ "sessionId": "some-random-id", "time": 1428238800000, "url": "/register" }   (time: Apr 5th, 3:00 pm)

{ "sessionId": "some-random-id", "time": 1428241500000, "url": "/buy" }   (time: Apr 5th, 3:45 pm)
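To make the target metric concrete, here is a minimal exact (non-approximate) sketch in plain Python, assuming events shaped like the two sample documents above. The function name `hourly_conversion` and the second session are hypothetical and only there for illustration. It holds every session in memory, so it is only a small-scale baseline, not something that would cope with months of data.

```python
from collections import defaultdict

HOUR_MS = 3600 * 1000  # the sample timestamps are in milliseconds


def hourly_conversion(events, max_hours=24):
    """Map hour offset (1 = first hour after registration) to the
    percentage of registered sessions whose first /buy falls in that hour."""
    registered = {}  # sessionId -> time of first /register
    first_buy = {}   # sessionId -> time of first /buy seen after registering
    for e in sorted(events, key=lambda e: e["time"]):
        sid = e["sessionId"]
        if e["url"] == "/register":
            registered.setdefault(sid, e["time"])
        elif e["url"] == "/buy" and sid in registered:
            first_buy.setdefault(sid, e["time"])

    counts = defaultdict(int)
    for sid, t_buy in first_buy.items():
        hour = (t_buy - registered[sid]) // HOUR_MS + 1
        if hour <= max_hours:
            counts[hour] += 1

    total = len(registered)
    if total == 0:
        return {}
    return {h: 100.0 * c / total for h, c in sorted(counts.items())}


events = [
    {"sessionId": "some-random-id", "time": 1428238800000, "url": "/register"},
    {"sessionId": "some-random-id", "time": 1428241500000, "url": "/buy"},
    # Hypothetical second session that registers but never buys.
    {"sessionId": "another-id", "time": 1428238800000, "url": "/register"},
]
result = hourly_conversion(events)
print(result)
```

With these three events, one of the two registered sessions buys 45 minutes after registering, so the first-hour bucket comes out at 50%.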

What if I want to run the same aggregation over a period of, say, 6 months, and also compute cohorts for returning customers? The data set would be immense.

On a side note: I do not need 100% accurate results; an approximation would be sufficient for trend analysis.

Can we achieve this with Druid, or is it not suitable for this kind of analysis? Is there anything else that is better suited to cohort analysis?

Best answer:

I think you can do this with Druid and data sketches. Look at the last example on this page. If you want to go with this approximation method, you can look here to understand the error bounds of the approximation and the trade-off you can make between memory and accuracy.
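As a rough illustration of the sketch-based approach, the Python snippet below builds a Druid native query that intersects theta sketches, one for sessions that registered in a given hour and one per following hour for sessions that hit /buy. It assumes the druid-datasketches extension is loaded, a datasource hypothetically named `web_events`, and `sessionId` and `url` columns as in the question. It buckets by clock hour from the start of the cohort window rather than by each session's own registration time, which is a coarser approximation than the question asks for. Treat it as a starting point, not a tested production query.

```python
import json
from datetime import datetime, timedelta, timezone


def iso(dt):
    # Assumes dt is already in UTC.
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def filtered_sketch(name, url, start, end, column="sessionId"):
    """thetaSketch aggregator restricted to one url and one time window."""
    return {
        "type": "filtered",
        "filter": {"type": "and", "fields": [
            {"type": "selector", "dimension": "url", "value": url},
            {"type": "interval", "dimension": "__time",
             "intervals": [f"{iso(start)}/{iso(end)}"]},
        ]},
        "aggregator": {"type": "thetaSketch", "name": name, "fieldName": column},
    }


def cohort_query(datasource, cohort_start, hours=3):
    """Cohort = sessions registering in [cohort_start, cohort_start + 1h)."""
    hour = timedelta(hours=1)
    aggs = [filtered_sketch("registered", "/register",
                            cohort_start, cohort_start + hour)]
    posts = [{"type": "thetaSketchEstimate", "name": "registered_estimate",
              "field": {"type": "fieldAccess", "fieldName": "registered"}}]
    for k in range(1, hours + 1):
        bought = f"bought_h{k}"
        start = cohort_start + (k - 1) * hour
        aggs.append(filtered_sketch(bought, "/buy", start, start + hour))
        # Estimated number of sessions present in both sketches.
        posts.append({
            "type": "thetaSketchEstimate",
            "name": f"converted_h{k}",
            "field": {
                "type": "thetaSketchSetOp",
                "name": f"converted_h{k}_sketch",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "registered"},
                    {"type": "fieldAccess", "fieldName": bought},
                ],
            },
        })
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [f"{iso(cohort_start)}/{iso(cohort_start + hours * hour)}"],
        "aggregations": aggs,
        "postAggregations": posts,
    }


query = cohort_query("web_events",
                     datetime(2015, 4, 5, 13, tzinfo=timezone.utc), hours=3)
print(json.dumps(query, indent=2))
```

The resulting JSON would be POSTed to the broker's native query endpoint (`/druid/v2`). Each percentage is then `converted_hK / registered_estimate * 100`, computed on the client. The `thetaSketch` aggregator also accepts a `size` parameter (a power of 2), which is the knob for trading memory against accuracy; note that estimates of intersections are generally less accurate than estimates of the individual sketches.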