I have two SQL queries that read Parquet through parquet-arrow.
`table` has 50 columns:
sql1 = `select * from table`, total_data_size = 45GB
sql2 = `select value from table`, total_data_size = 30GB
I added profiling for I/O throughput (I drop the page cache before each run and only watch disk I/O).
I found:
Parquet on HDFS: sql2 is about 1.5 times faster than sql1, which is reasonable.
Parquet on local disk (1MB randread = 130MB/s; 1MB read = 250MB/s): sql1 is about 4 times faster than sql2, which is confusing.
Based on iostat, I have two guesses:
- The I/O load is high (about 100~130MB/s, util = 90%~100%) when executing sql2, which seems to mean that `select one column` does more random reads, and that lowers the I/O throughput.
- `select *` keeps more data in the page cache, so its hit ratio within the process is high even though I drop the page cache before executing. So for `select *`, the I/O throughput actually benefits from the cache hit ratio.
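To make the first guess concrete, here is a rough sketch of the access pattern, not a measurement. It assumes the usual Parquet layout where each row group stores its column chunks back to back, and it uses made-up numbers (4 row groups, 1 MiB per chunk, column index 7 standing in for `value`); real files have variable chunk sizes plus a footer. The helper names `column_chunk_ranges` and `coalesce` are hypothetical, not part of any Parquet library.

```python
MIB = 1 << 20

def column_chunk_ranges(num_row_groups, num_columns, chunk_bytes, wanted):
    """Return (offset, length) for each wanted column chunk.

    Assumes chunks are laid out row group by row group, column by column,
    and that every chunk has the same size (a simplification).
    """
    ranges = []
    for rg in range(num_row_groups):
        for col in sorted(wanted):
            offset = (rg * num_columns + col) * chunk_bytes
            ranges.append((offset, chunk_bytes))
    return ranges

def coalesce(ranges):
    """Merge ranges that touch, so each result is one sequential read."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((offset, length))
    return merged

# Hypothetical file: 50 columns (as in my table), 4 row groups, 1 MiB chunks.
all_cols = coalesce(column_chunk_ranges(4, 50, MIB, range(50)))
one_col = coalesce(column_chunk_ranges(4, 50, MIB, [7]))

print(len(all_cols), "sequential read(s) for select *")
print(len(one_col), "separate read(s) for select one column")
```

Under these assumptions `select *` collapses into a single long sequential read, while `select one column` becomes one small read per row group with a large gap between them, which is the kind of pattern I would expect to look like random reads in iostat.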
Any help is appreciated, thanks!
I used cachestat to get the page-cache hit ratio, and I found that `select *` has a higher ratio (50%) than `select one column` (27%), so the I/O throughput of `select *` is better because of the page cache.
I tried opening the Parquet file with `O_DIRECT` to confirm this conclusion, but it reports `errno: 22, strerror: Invalid argument`. I haven't found the root cause of that error, but I think the page-cache hit ratio is the root cause of the I/O throughput difference.
However, why does `select *` have a higher hit ratio?
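On the `errno: 22` with `O_DIRECT`: a common cause on Linux is that the buffer address, file offset, or read length is not aligned to the block size, although it can also mean the filesystem does not support `O_DIRECT` at all. Which of these applies here is not confirmed. The following is a minimal sketch of an aligned direct read, assuming Linux and a 4096-byte alignment (the real requirement depends on the device and filesystem); `aligned_span` and `read_direct` are hypothetical helpers, not part of Arrow.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; check the device's logical block size

def aligned_span(offset, length, block=BLOCK):
    """Widen (offset, length) so both start and size are multiples of block."""
    start = offset - offset % block
    end = offset + length
    end += -end % block  # round the end up to the next block boundary
    return start, end - start

def read_direct(path, offset, length, block=BLOCK):
    """Read `length` bytes at `offset`, bypassing the page cache (Linux only)."""
    start, size = aligned_span(offset, length, block)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, size)  # anonymous mmap memory is page-aligned
        try:
            got = os.preadv(fd, [buf], start)  # may be short near end of file
            skip = offset - start
            return bytes(buf[skip:min(got, skip + length)])
        finally:
            buf.close()
    finally:
        os.close(fd)

print(aligned_span(5000, 100))  # (4096, 4096)
```

Reading a single column chunk this way, for example `read_direct(path, chunk_offset, chunk_length)` with the offsets taken from the Parquet footer, would give a throughput number that cannot be influenced by the page cache.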