Intepreting WEKA data

26 Views Asked by At

I have a CSV file which contains about 10,000 entries. The CSV file contains data on individuals who are booking a holiday at a particular hotel. The CSV file contains the following columns

1: country_origin (as a nominal variable) 2: month_booking (as a nominal variable) 3: is_cancelled (as a binary variable)

I am trying to use WEKA to establish which countries are associated with the greatest frequency of cancellations.

I am not quite sure how to go about doing this - I considered using a tree (J48) classifier but I don't really understand what the results mean so I cannot intepret if they are correct.


This is what I have done

  1. Open file in WEKA
  2. Pre-processed to ensure all data is nominal so it can be used by J48
  3. Selected J48 classifier and used Cross-validation fold = 10. Selected my class as "is_cancelled".

I then got an output looking like this (non exhaustive). What does this mean?

enter image description here

1

There are 1 best solutions below

0
nekomatic On

J48 is trying to build a classification tree model that predicts whether is_cancelled is 1 or 0 based on the predictors country_origin and month_booking.

Your output shows you that if country_origin is PRT and month_booking is July the model predicts that is_cancelled will be 1, and that (if I'm interpreting it correctly) this prediction was correct for 4905 instances in the training data and incorrect for 2276 instances, and so on for the other possible values of these predictors.

If you just want to know which countries had the highest proportion of cancellations in your data, I don't think Weka is the right tool. You could do this in Excel for example by loading your data as a table and typing percentage is_cancelled by country_origin in the Analyze Data query box:

screenshot from Excel Analyze Data thing