I've been confounded by how filters work in HBase (or, largely equivalently, in HappyBase--which I use to interact with HBase). The source of my confusion is that I can't seem to get a handle on what filters do.
Some filters, like SingleColumnValueFilter, cause rows not to be emitted based on the value of one of their columns. This makes sense--in my mind, this is what filters should be for. However, other filters, like FirstKeyOnlyFilter, appear not to filter in the row-wise sense, but rather filter the data that is surfaced to the requester--i.e., they filter columnwise, like the columns argument. Not only this, but they appear to affect whether or not other filters get access to data.
Perhaps I'm just using them wrong. But, to me, a "filter" should remove items based on a test of their properties, like "Find me all people over 7 feet tall!" But the behavior of FirstKeyOnlyFilter, at least in HBase, seems to be more akin to "Bring me everyone's left ear and nothing else!" Further, if I have a filter like:
SingleColumnValueFilter('body', 'height', =, 'regexstring:^over7ft$') AND FirstKeyOnlyFilter
then FirstKeyOnlyFilter appears to prevent the first filter from accessing the column family:column "body:height".
What is with this design choice? The filter above looks like it's saying, "Bring me the name of everyone exactly 7 feet tall!" but instead it's saying something more like "Bring me every name if the name is 7 feet tall!" The first key of a row doesn't have columns any more than names can be said to have a 'height.'
What am I doing wrong? Is this a peculiarity of HappyBase or is it the same in HBase proper?
Filters match on the columns available in each row.
As you have noticed, some HBase filters restrict the columns that are returned to the client. This is an intentional design choice to reduce the memory and network resources used during the client call.
Recall that HBase is really a rowkey mapping to a series of key-value pairs (the key in each key-value pair is referred to as the column qualifier). These pairs are not strictly a set, in that the underlying data abstraction is really a mapping from rowkey+columnQualifier to a value (a Cell). Filters work at the Cell level. This is also why short column qualifiers are recommended: they are actually stored with every row/value.
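To make the Cell-level view concrete, here is a minimal sketch in plain Python. It is a toy model, not HBase's storage engine or HappyBase's API: the `cells` dictionary, the `scan` helper, and the sample rows are all made up for illustration. It only shows why something that inspects one cell at a time can remove columns as easily as rows.

```python
# Toy model of the layout described above; NOT the real HBase implementation.
# Each cell is addressed by (rowkey, "family:qualifier") and holds one value.
cells = {
    ("alice", "body:height"): "over7ft",
    ("alice", "info:name"): "Alice",
    ("bob", "body:height"): "under7ft",
    ("bob", "info:name"): "Bob",
}


def scan(cells, cell_filter=lambda rowkey, column, value: True):
    """Visit cells in sorted key order and group the survivors back into rows."""
    rows = {}
    for (rowkey, column), value in sorted(cells.items()):
        if cell_filter(rowkey, column, value):
            rows.setdefault(rowkey, {})[column] = value
    return rows


# The filter is called once per cell, so dropping every cell outside
# "info:name" removes a column from each row rather than removing rows.
only_names = scan(cells, lambda rowkey, column, value: column == "info:name")
print(only_names)
```

In this model a "row" is just whatever cells happen to survive, which is roughly the sense in which a filter can behave like the columns argument.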
The FirstKeyOnlyFilter is designed to return as little data as possible, while maintaining the knowledge that a rowkey did exist with some key-value mapping. It could be any key-value mapping that is returned. Alternatively, you can use the KeyOnlyFilter instead of the FirstKeyOnlyFilter, which will null out the values associated with each column that is returned. This should give you the capability to match as needed while minimizing the data returned.
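The difference between the two filters can be sketched with plain Python stand-ins. These functions are hypothetical simplifications, not the HBase classes: the sample row is invented, and picking the first cell by sorted column name is an assumption of this toy model.

```python
# Toy stand-ins for the two filters; a row is a dict of column -> value.
# The row contents and both helper functions are illustrative only.
row = {"body:age": "41", "body:height": "over7ft", "info:name": "Alice"}


def first_key_only(row):
    """Keep a single cell: here, the first one in sorted column order."""
    first = sorted(row)[0]
    return {first: row[first]}


def key_only(row):
    """Keep every cell's key but replace its value with an empty string."""
    return {column: "" for column in row}


after_first_key_only = first_key_only(row)
after_key_only = key_only(row)

# Only one cell survives, so "body:height" is no longer there to be tested.
print("body:height" in after_first_key_only)
# Every column is still present, just without its payload.
print(sorted(after_key_only))
```

In this sketch the first stand-in leaves nothing for a column-specific test to look at unless the tested column happens to come first, while the second keeps every column visible and only drops the values.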