The problem
A couple of months ago, one of our data analysts repeatedly ran into trouble when he wanted to run more resource-intensive Hive queries. Surprisingly, his queries were valid and syntactically correct, and ran successfully on small data, but they just failed on larger datasets. On the other hand, other users were able to run the same queries successfully on the same large datasets. Obviously, this sounds like some kind of permissions problem; however, the user had the right HDFS and Hive permissions.
The observations
We observed that when our user ran a more resource-intensive Hive query (one that spawns a lot of map tasks), the Hadoop cluster (especially the HDFS daemons) experienced stability problems: the NameNode became less responsive and froze, causing tens of DataNodes to lose connectivity and be marked “dead” (even though the DataNode daemons were still running on those servers).
The NameNode logs showed a lot of warnings and exceptions thrown in the method shown below.
The method
package org.apache.hadoop.security;
...
import org.apache.hadoop.util.Shell;
import org.apache.hadoop.util.Shell.ExitCodeException;

/**
 * A simple shell-based implementation of {@link GroupMappingServiceProvider}
 * that exec's the <code>groups</code> shell command to fetch the group
 * memberships of a given user.
 */
public class ShellBasedUnixGroupsMapping implements GroupMappingServiceProvider {
  ...
  /**
   * Get the current user's group list from Unix by running the command 'groups'
   * NOTE. For non-existing user it will return EMPTY list
   * @param user user name
   * @return the groups list that the <code>user</code> belongs to
   * @throws IOException if encounter any error when running the command
   */
  private static List<String> getUnixGroups(final String user) throws IOException {
    String result = "";
    try {
      result = Shell.execCommand(Shell.getGroupsForUserCommand(user));
    } catch (ExitCodeException e) {
      // if we didn't get the group - just return empty list;
      LOG.warn("got exception trying to get groups for user " + user, e);
    }

    StringTokenizer tokenizer = new StringTokenizer(result);
    List<String> groups = new LinkedList<String>();
    while (tokenizer.hasMoreTokens()) {
      groups.add(tokenizer.nextToken());
    }
    return groups;
  }
}
The Unix command used to find the user-group mapping is simply:
package org.apache.hadoop.util;
...
abstract public class Shell {

  /** a Unix command to get a given user's groups list */
  public static String[] getGroupsForUserCommand(final String user) {
    // 'groups username' command return is non-consistent across different unixes
    return new String[] {"bash", "-c", "id -Gn " + user};
  }
  ...
}
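The important consequence of this design is that every group lookup forks a shell process on the server that performs it. The behavior can be sketched with plain JDK calls (the class and method names below are illustrative, not Hadoop's; only the `bash -c "id -Gn <user>"` command comes from the source):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class GroupsLookup {

    // Mirrors ShellBasedUnixGroupsMapping: fork a shell, run `id -Gn <user>`,
    // and split the whitespace-separated output into group names.
    static List<String> getUnixGroups(String user) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("bash", "-c", "id -Gn " + user).start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append(' ');
            }
        }
        // A non-zero exit code (e.g. a user unknown to the local OS)
        // yields an EMPTY list, just like Hadoop's implementation.
        if (p.waitFor() != 0) {
            return new ArrayList<>();
        }
        List<String> groups = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(out.toString());
        while (tokenizer.hasMoreTokens()) {
            groups.add(tokenizer.nextToken());
        }
        return groups;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getUnixGroups(System.getProperty("user.name")));
    }
}
```

Note the failure mode: a user with no local account on the server resolves to an empty group list, rather than an error.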
Security in Apache Hadoop
Normally (with the default settings), Apache Hadoop is a very trusting elephant. The username of the user submitting a job is simply taken from the client machine (and not verified at all, so one user can easily impersonate another, e.g. by typing
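With simple authentication, the effective identity is whatever the client side reports, which can be sketched as follows (a minimal illustration, assuming the standard HADOOP_USER_NAME environment-variable override; not Hadoop's actual code):

```java
public class WhoAmI {
    public static void main(String[] args) {
        // Under simple security, the remote user is derived from client-side
        // values like these; HADOOP_USER_NAME (if set) takes precedence over
        // the OS-level username, so changing it is enough to "become" someone else.
        String env = System.getenv("HADOOP_USER_NAME"); // assumption: standard override
        String user = (env != null) ? env : System.getProperty("user.name");
        System.out.println("Effective user: " + user);
    }
}
```

Nothing on the server side challenges this claimed identity.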
Possible fixes
How could this problem be solved? Obviously, a quick and dirty solution is to create an account on the NameNode server for each user who accesses HDFS (directly, or by submitting MapReduce jobs to the cluster). However, for many reasons, you do not want to give everybody an account on the NameNode server.
User-group resolution with AD/LDAP
Instead, AD or LDAP could be used to resolve the group membership of users who access HDFS. Hadoop provides a couple of configuration settings
Alternatively, nss_ldap (which allows LDAP directory servers to be used as a primary source of name service information, including e.g. users, hosts, and groups) can be tried. In this case, setting configuration options
We actually solved this issue by using:

hadoop.security.group.mapping.ldap.search.filter.group = (objectClass=posixGroup)
hadoop.security.group.mapping.ldap.search.attr.member = memberUid
hadoop.security.group.mapping.ldap.search.attr.group.name = cn
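Rendered as core-site.xml entries (together with hadoop.security.group.mapping, which must point at LdapGroupsMapping for these settings to apply; the LDAP URL below is a placeholder for illustration), this might look like:

```xml
<!-- core-site.xml: illustrative fragment; the URL and any bind credentials
     are placeholders and must match your own directory layout -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ad.example.com:389</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.search.filter.group</name>
  <value>(objectClass=posixGroup)</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.search.attr.member</name>
  <value>memberUid</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.search.attr.group.name</name>
  <value>cn</value>
</property>
```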
The problematic one is
Strong authentication with Kerberos
One can go even one step further and use Kerberos. Although Kerberos is usually configured to take advantage of AD/LDAP servers (so the way user-group mapping is resolved does not change), it also provides full authentication of users accessing the cluster (so that user identity is verified and nobody can easily impersonate another user).
Just one thing to note: installing and configuring Kerberos involves many tedious and difficult steps (some of which can be automated by Cloudera Manager). Basically, it is not just changing the configuration property
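For context, the property meant here is presumably hadoop.security.authentication, switched from its default of "simple" to "kerberos" (an illustrative core-site.xml fragment; flipping it is only the first step, before principals, keytabs, and KDC setup):

```xml
<!-- core-site.xml: shown for illustration only -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- default is "simple" -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```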