Spring Boot, Hadoop WebHDFS and Apache Knox


I have a Spring Boot application which accesses HDFS through WebHDFS, proxied by Apache Knox, which is in turn secured by Kerberos. I created my own KnoxWebHdfsFileSystem with a custom scheme (swebhdfsknox) as a subclass of WebHdfsFileSystem which only changes the URLs to contain the Knox proxy prefix. So it effectively remaps requests from the form:

http://host:port/webhdfs/v1/...

to the Knox one:

http://host:port/gateway/default/webhdfs/v1/...

I do this by overriding two methods (sketched below):

  1. public URI getUri()
  2. URL toUrl(Op op, Path fspath, Param<?, ?>... parameters)
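A simplified sketch of the subclass (approximate, not my exact code; the Knox prefix is hardcoded for brevity, and since toUrl has default visibility in Hadoop 2.6, the class has to live in the org.apache.hadoop.hdfs.web package):

    package org.apache.hadoop.hdfs.web;

    import java.io.IOException;
    import java.net.URI;
    import java.net.URL;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.web.resources.HttpOpParam;
    import org.apache.hadoop.hdfs.web.resources.Param;

    // Approximate sketch: remaps plain WebHDFS URLs to their Knox-proxied form.
    public class KnoxWebHdfsFileSystem extends WebHdfsFileSystem {

        // Knox context path inserted before /webhdfs/v1 (illustrative value).
        private static final String KNOX_PREFIX = "/gateway/default";

        @Override
        public String getScheme() {
            return "swebhdfsknox"; // custom scheme this filesystem registers under
        }

        @Override
        public URI getUri() {
            // Advertise the custom scheme so paths resolve against this filesystem.
            return URI.create(getScheme() + "://" + super.getUri().getAuthority());
        }

        @Override
        URL toUrl(HttpOpParam.Op op, Path fspath, Param<?, ?>... parameters)
                throws IOException {
            // Let the parent build the plain WebHDFS URL, then splice in the
            // Knox prefix: /webhdfs/v1/... -> /gateway/default/webhdfs/v1/...
            URL url = super.toUrl(op, fspath, parameters);
            return new URL(url.getProtocol(), url.getHost(), url.getPort(),
                    KNOX_PREFIX + url.getFile());
        }
    }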

So far so good. I let Spring Boot create an FsShell for me and use it for various operations such as listing files, mkdir, etc. All work fine except copyFromLocal, which, as documented, requires two steps and a redirect. On the last step, when the filesystem tries to PUT to the final URL received in the Location header, it fails with this error:

org.apache.hadoop.security.AccessControlException: Authentication required
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:334) ~[hadoop-hdfs-2.6.0.jar:na]
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:91) ~[hadoop-hdfs-2.6.0.jar:na]
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$FsPathOutputStreamRunner$1.close(WebHdfsFileSystem.java:787) ~[hadoop-hdfs-2.6.0.jar:na]
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:54) ~[hadoop-common-2.6.0.jar:na]
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112) ~[hadoop-common-2.6.0.jar:na]
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366) ~[hadoop-common-2.6.0.jar:na]
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338) ~[hadoop-common-2.6.0.jar:na]
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:302) ~[hadoop-common-2.6.0.jar:na]
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1889) ~[hadoop-common-2.6.0.jar:na]
    at org.springframework.data.hadoop.fs.FsShell.copyFromLocal(FsShell.java:265) ~[spring-data-hadoop-core-2.2.0.RELEASE.jar:2.2.0.RELEASE]
    at org.springframework.data.hadoop.fs.FsShell.copyFromLocal(FsShell.java:254) ~[spring-data-hadoop-core-2.2.0.RELEASE.jar:2.2.0.RELEASE]

I suspect the redirect is somehow the problem, but I can't figure out what exactly goes wrong here. If I perform the same requests via curl, the file is successfully uploaded to HDFS.
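For reference, the relevant usage is roughly this (simplified; bean wiring and file names are placeholders):

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.data.hadoop.fs.FsShell;
    import org.springframework.stereotype.Component;

    // Rough illustration of the FsShell usage, not the exact application code.
    @Component
    public class HdfsOperations {

        @Autowired
        private FsShell shell; // created for me by Spring Boot / spring-data-hadoop

        public void run() {
            shell.ls("/tmp");                             // works through Knox
            shell.mkdir("/tmp/demo");                     // works through Knox
            shell.copyFromLocal("data.csv", "/tmp/demo"); // fails on the redirected PUT
        }
    }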

BEST ANSWER

This is a known issue with using the existing Hadoop clients against Apache Knox when the HadoopAuth provider is used for Kerberos on Knox. If you were to use curl or some other REST client, it would likely work for you. The existing Hadoop Java client doesn't expect a SPNEGO challenge from the DataNode, which is what the PUT in the second step is talking to. The DataNode expects the block access token/delegation token issued by the NameNode in the first step to be present. The Knox gateway, however, requires SPNEGO authentication for every request to that topology.
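To make the two steps concrete, here is a rough sketch (placeholder URLs, not the actual client code) of what the client does during a CREATE, and where it trips over Knox:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Rough illustration of the two-step WebHDFS CREATE.
    public class TwoStepCreate {
        public static void main(String[] args) throws IOException {
            // Step 1: PUT with no data; the server answers 307 with a Location
            // header. The client authenticates this request via SPNEGO.
            URL step1 = new URL(
                    "http://host:port/gateway/default/webhdfs/v1/tmp/f?op=CREATE");
            HttpURLConnection c1 = (HttpURLConnection) step1.openConnection();
            c1.setRequestMethod("PUT");
            c1.setInstanceFollowRedirects(false);
            String location = c1.getHeaderField("Location");

            // Step 2: PUT the data to the Location URL. The stock client expects
            // this request to be authorized by the token embedded in that URL,
            // not by another SPNEGO challenge - but Knox challenges it anyway,
            // hence "Authentication required".
            HttpURLConnection c2 =
                    (HttpURLConnection) new URL(location).openConnection();
            c2.setRequestMethod("PUT");
            c2.setDoOutput(true);
            c2.getOutputStream().write("file contents".getBytes());
            System.out.println(c2.getResponseCode());
        }
    }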

This issue is on the roadmap to be addressed, and it will likely become more pressing as interest moves toward using Knox from inside the cluster rather than only accessing resources through it from the outside.

The following JIRA tracks this item; as you can see from the title, it is related to DistCp, which is a similar use case: https://issues.apache.org/jira/browse/KNOX-482

Feel free to take a look and lend a hand with testing or development - it would all be most welcome!

Another possibility would be to change the Hadoop Java client to handle a SPNEGO challenge from the DataNode as well.