h2o.automl: NaN values in leaderboard

I was running the h2o.automl() example from http://h2o-release.s3.amazonaws.com/h2o/master/3888/docs-website/h2o-docs/automl.html . Everything went fine, except that the leaderboard contains only NaN values. Predictions also work fine. Is this a bug, or am I doing something wrong?

library(h2o)

localH2O <- h2o.init(ip = "localhost",
                     port = 54321,
                     nthreads = -1,
                     min_mem_size = "20g")

train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

y <- "response"
x <- setdiff(names(train), y)

train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])

aml <- h2o.automl(x = x, y = y,
                  training_frame = train,
                  leaderboard_frame = test,
                  max_runtime_secs = 30)

lb <- aml@leaderboard
lb

                                   model_id auc logloss
1  StackedEnsemble_0_AutoML_20170908_094736 NaN     NaN
2  StackedEnsemble_0_AutoML_20170908_094407 NaN     NaN
3 GBM_grid_0_AutoML_20170908_094736_model_1 NaN     NaN
4 GBM_grid_0_AutoML_20170908_094407_model_0 NaN     NaN
5 GBM_grid_0_AutoML_20170908_094407_model_1 NaN     NaN
6 GBM_grid_0_AutoML_20170908_094736_model_0 NaN     NaN

I've checked, and H2O Flow on localhost:54321 shows normal values. I also get normal values using h2o.getFrame():

h2o.getFrame("leaderboard")
                                   model_id      auc  logloss
1  StackedEnsemble_0_AutoML_20170908_094736 0,787145 0,554983
2  StackedEnsemble_0_AutoML_20170908_094407 0,785154 0,556897
3 GBM_grid_0_AutoML_20170908_094736_model_1 0,778587 0,563741
4 GBM_grid_0_AutoML_20170908_094407_model_0 0,776755 0,564247
5 GBM_grid_0_AutoML_20170908_094407_model_1 0,776640 0,564436
6 GBM_grid_0_AutoML_20170908_094736_model_0 0,774611 0,566920

I'm using h2o v. 3.15.0.4018

h2o.clusterInfo()
R is connected to the H2O cluster: 
H2O cluster uptime:         2 hours 8 minutes 
H2O cluster version:        3.15.0.4018 
H2O cluster version age:    15 hours and 47 minutes  
H2O cluster name:           H2O_started_from_R_maju116_ozj558 
H2O cluster total nodes:    1 
H2O cluster total memory:   19.03 GB 
H2O cluster total cores:    8 
H2O cluster allowed cores:  8 
H2O cluster healthy:        TRUE 
H2O Connection ip:          localhost 
H2O Connection port:        54321 
H2O Connection proxy:       NA 
H2O Internal Security:      FALSE 
H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
R Version:                  R version 3.4.1 (2017-06-30) 

Session info:

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=pl_PL.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pl_PL.UTF-8        LC_COLLATE=pl_PL.UTF-8    
 [5] LC_MONETARY=pl_PL.UTF-8    LC_MESSAGES=pl_PL.UTF-8   
 [7] LC_PAPER=pl_PL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.7.2       purrr_0.2.3       readr_1.1.1       tidyr_0.7.1      
[5] tibble_1.3.4      ggplot2_2.2.1     tidyverse_1.1.1   h2oEnsemble_0.2.1
[9] h2o_3.15.0.4018  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12     cellranger_1.1.0 compiler_3.4.1   plyr_1.8.4      
 [5] bindr_0.1        forcats_0.2.0    bitops_1.0-6     tools_3.4.1     
 [9] lubridate_1.6.0  jsonlite_1.5     nlme_3.1-131     gtable_0.2.0    
[13] lattice_0.20-35  pkgconfig_2.0.1  rlang_0.1.2      psych_1.7.5     
[17] parallel_3.4.1   haven_1.1.0      bindrcpp_0.2     xml2_1.1.1      
[21] httr_1.3.1       stringr_1.2.0    hms_0.3          grid_3.4.1      
[25] glue_1.1.1       R6_2.2.2         readxl_1.0.0     foreign_0.8-69  
[29] modelr_0.1.1     reshape2_1.4.2   magrittr_1.5     scales_0.5.0    
[33] rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2
[37] stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3    RCurl_1.95-4.8  
[41] broom_0.4.2 

Best answer:

Just a hunch, but try running R in the en_US locale.
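
One way to try this from inside R is sketched below. It is only a sketch: it assumes the en_US.UTF-8 locale is installed on the machine, and it assumes that the H2O JVM launched by h2o.init() picks up its locale from the environment variables of the R process, which I have not verified. Run it in a fresh R session, before library(h2o) and h2o.init(), and shut down any H2O instance that is already running first.

```r
# Show the locale R is currently using, for comparison afterwards
print(Sys.getlocale())

# Request en_US for every locale category in this R session.
# Sys.setlocale() returns "" (with a warning) if the locale is unavailable.
new_locale <- suppressWarnings(Sys.setlocale("LC_ALL", "en_US.UTF-8"))
if (identical(new_locale, "")) {
  message("en_US.UTF-8 is not available here; the R locale was left unchanged")
}

# Assumption: processes started from R afterwards (such as the JVM that
# h2o.init() launches) inherit these environment variables.
Sys.setenv(LANG = "en_US.UTF-8", LC_ALL = "en_US.UTF-8")
```

Starting R from a shell with LC_ALL=en_US.UTF-8 already set should have a similar effect, and it avoids changing the locale mid-session.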

If that fixes it, I imagine that either aml@leaderboard or h2o.getFrame("leaderboard") is choking on the comma used as the decimal separator in the floating-point numbers, and that is where the NaN comes from. I.e. a display bug, not a data bug.
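
To illustrate the kind of failure I mean, here is a small base-R sketch. It is not H2O's actual conversion code; the two strings are simply copied from the h2o.getFrame() output in the question. Base R gives NA rather than NaN here, so treat it only as an analogy for how a comma decimal separator can break numeric parsing.

```r
# Leaderboard values as text, with a comma as the decimal separator
auc_text <- c("0,787145", "0,785154")

# as.numeric() only recognises "." as the decimal mark, so parsing fails
parsed_raw <- suppressWarnings(as.numeric(auc_text))
print(parsed_raw)  # NA NA

# Replacing the comma with a period makes the same strings parse
parsed_fixed <- as.numeric(sub(",", ".", auc_text, fixed = TRUE))
print(parsed_fixed)
```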

(If that does fix it, it might also be useful to know what happens if you run both H2O and R in the same pl_PL.UTF-8 locale.)