Image via Future of CIO.
A few weeks ago, while working on a unified storage abstraction for Hadoop, I implemented the Hadoop FileSystem interface at work. To make it easier to test, I wanted to make my Filesystem the default for Hadoop URIs on the CLI. That way, I could test my FileSystem implementation by merely executing
‘hadoop fs -ls <path_on_filesystem>’
as opposed to
‘hadoop fs -ls <filesystem_scheme>://<filesystem_authority>:<filesystem_port>
While working on this, I had quite a few ‘Eureka!’ moments about how and where fs.default.name is used in Hadoop.
Going into this task, I had a basic understanding of fs.default.name. As the name of the property suggests, it points to the default URI for all filesystem requests in Hadoop. A filesystem path in Hadoop has two main components: a URI that identifies the filesystem, and a path specifying the location of a file/directory in the filesystem. fs.default.name the URI component is optional. So if the user makes a request for a file by specifying its path only, Hadoop tries to find that path on the filesystem defined by fs.default.name. If fs.default.name is set to an an HDFS URI like hdfs://<authority>:<port>, then Hadoop tries to find the path on HDFS whose namenode is running at <authority>:<port>. At the same time, if a user specifies both the URI and the path in the request, then the URI in the request overrides fs.default.name, and Hadoop tries to find the path on the filesystem identified by the URI in the request. This is a fairly simple and sensible design choice.
However, there are two other interesting characteristics of fs.default.name that I discovered:
1. In the Hadoop client, fs.default.name MUST point to a filesystem URI that can accept filesystem requests.
Assuming my understanding above was right, I set fs.default.name to a local HDFS namenode (hdfs://localhost:9000). I chose ‘xyz’ as the scheme for my filesystem, and localhost:16666 as the authority and port. I then executed my tests as ‘hadoop fs -ls xyz://localhost:16666/path/to/file’, expecting Hadoop to read /path/to/file on my filesystem. Note that I did not have an HDFS namenode running at hdfs://localhost:9000. I was presuming that since I was passing a fully qualified path in my request, fs.default.name was not needed. However, as it turned out, I started seeing Connection refused and Connection timed out errors from localhost/127.0.0.1:9000.
Upon digging further into this, I traced the path of my request through the FsShell class. That was when I had my first “Eureka!” moment of the day: for any filesystem request, Hadoop first tries to ensure that the URI fs.default.name points to is available to serve filesystem requests. This is regardless of whether the user’s request contains a fully qualified URI or just a path. My understanding is that this ensures that if the user specifies just a directory path, fs.default.name is available to serve the request. For example, you cannot set fs.default.name to ‘blah’, even if you always use fully qualified URIs for paths. fs.default.name MUST point to a URI that is available and capable of handling filesystem requests.
2. The Hadoop namenode uses fs.default.name to read not only its port number, but also its hostname.
The other usage of fs.default.name I discovered was that it is used by the Hadoop namenode to determine its address (hostname and port). This came as a surprise: since the Hadoop (1.x) namenode is single-host, I presumed that its hostname would always be ‘localhost’, allowing users to configure its port by specifying it in fs.default.name. Such behavior could present an issue when running Hadoop on a single-host setup, on which you would generally have one core-site.xml defining fs.default.name. However, core-site.xml is used by both the Hadoop client, to get the URI of the default filesystem, as well as by the namenode, to read its address.
This was surprising because my understanding was that the Hadoop namenode reads all its configuration parameters from hdfs-site.xml, and not core-site.xml. If you use the same core-site.xml for both client and hadoop-namenode, then you can run into the same issue I experienced.
To ease testing, I wanted to use my filesystem as the default. As a result, I set fs.default.name to xyz://localhost:16666 in the core-site.xml. I also wanted to run HDFS on the same host. Hence, I ran the start-dfs.sh script that comes bundled with Hadoop. This caused protocol mismatch errors, since the start-dfs.sh script was trying to start the Hadoop namenode at localhost:16666, instead of localhost:9000.
On a production setup, the Hadoop namenode, my filesystem, and the Hadoop client would each exist on separate hosts, reducing the chance this would be a problem. I could have separate core-site.xml’s for client and Hadoop namenode, each containing a different fs.default.name, which would work just fine. On a single host setup with just one core-site.xml, this could be a problem. The solution that we came up with was using two different core-site.xmls, one of them used in the classpath while running start-dfs.sh, the other while running the Hadoop client.
fs.default.name is a crucial HDFS configuration parameter that seems to be easy to comprehend at a high level. However, I hope this blog post helps clear two finer details about it.
About the AuthorMore Content by Bhooshan Mogal