summaryrefslogtreecommitdiff
path: root/kernel/capability.c
AgeCommit message (Collapse)AuthorFilesLines
2014-09-17CAPABILITIES: remove undefined caps from all processesEric Paris1-0/+4
commit 7d8b6c63751cfbbe5eef81a48c22978b3407a3ad upstream. This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744 plus fixing it a different way... We found, when trying to run an application from an application which had dropped privs that the kernel does security checks on undefined capability bits. This was ESPECIALLY difficult to debug as those undefined bits are hidden from /proc/$PID/status. Consider a root application which drops all capabilities from ALL 4 capability sets. We assume, since the application is going to set eff/perm/inh from an array that it will clear not only the defined caps less than CAP_LAST_CAP, but also the higher 28ish bits which are undefined future capabilities. The BSET gets cleared differently. Instead it is cleared one bit at a time. The problem here is that in security/commoncap.c::cap_task_prctl() we actually check the validity of a capability being read. So any task which attempts to 'read all things set in bset' followed by 'unset all things set in bset' will not even attempt to unset the undefined bits higher than CAP_LAST_CAP. So the 'parent' will look something like: CapInh: 0000000000000000 CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: ffffffc000000000 All of this 'should' be fine. Given that these are undefined bits that aren't supposed to have anything to do with permissions. But they do... So lets now consider a task which cleared the eff/perm/inh completely and cleared all of the valid caps in the bset (but not the invalid caps it couldn't read out of the kernel). We know that this is exactly what the libcap-ng library does and what the go capabilities library does. They both leave you in that above situation if you try to clear all of you capapabilities from all 4 sets. If that root task calls execve() the child task will pick up all caps not blocked by the bset. The bset however does not block bits higher than CAP_LAST_CAP. So now the child task has bits in eff which are not in the parent. These are 'meaningless' undefined bits, but still bits which the parent doesn't have. The problem is now in cred_cap_issubset() (or any operation which does a subset test) as the child, while a subset for valid cap bits, is not a subset for invalid cap bits! So now we set durring commit creds that the child is not dumpable. Given it is 'more priv' than its parent. It also means the parent cannot ptrace the child and other stupidity. The solution here: 1) stop hiding capability bits in status This makes debugging easier! 2) stop giving any task undefined capability bits. it's simple, it you don't put those invalid bits in CAP_FULL_SET you won't get them in init and you won't get them in any other task either. This fixes the cap_issubset() tests and resulting fallout (which made the init task in a docker container untraceable among other things) 3) mask out undefined bits when sys_capset() is called as it might use ~0, ~0 to denote 'all capabilities' for backward/forward compatibility. This lets 'capsh --caps="all=eip" -- -c /bin/bash' run. 4) mask out undefined bit when we read a file capability off of disk as again likely all bits are set in the xattr for forward/backward compatibility. This lets 'setcap all+pe /bin/bash; /bin/bash' run Signed-off-by: Eric Paris <eparis@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Andrew G. Morgan <morgan@kernel.org> Cc: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Kees Cook <keescook@chromium.org> Cc: Steve Grubb <sgrubb@redhat.com> Cc: Dan Walsh <dwalsh@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-16fs,userns: Change inode_capable to capable_wrt_inode_uidgidAndy Lutomirski1-11/+7
commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream. The kernel has no concept of capabilities with respect to inodes; inodes exist independently of namespaces. For example, inode_capable(inode, CAP_LINUX_IMMUTABLE) would be nonsense. This patch changes inode_capable to check for uid and gid mappings and renames it to capable_wrt_inode_uidgid, which should make it more obvious what it does. Fixes CVE-2014-4014. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-14Add file_ns_capable() helper function for open-time capability checkingLinus Torvalds1-0/+24
Nothing is using it yet, but this will allow us to delay the open-time checks to use time, without breaking the normal UNIX permission semantics where permissions are determined by the opener (and the file descriptor can then be passed to a different process, or the process can drop capabilities). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-15userns: Teach inode_capable to understand inodes whose uids map to other ↵Eric W. Biederman1-2/+4
namespaces. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-04-08userns: Replace the hard to write inode_userns with inode_capable.Eric W. Biederman1-0/+19
This represents a change in strategy of how to handle user namespaces. Instead of tagging everything explicitly with a user namespace and bulking up all of the comparisons of uids and gids in the kernel, all uids and gids in use will have a mapping to a flat kuid and kgid spaces respectively. This allows much more of the existing logic to be preserved and in general allows for faster code. In this new and improved world we allow someone to utiliize capabilities over an inode if the inodes owner mapps into the capabilities holders user namespace and the user has capabilities in their user namespace. Which is simple and efficient. Moving the fs uid comparisons to be comparisons in a flat kuid space follows in later patches, something that is only significant if you are using user namespaces. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-17Revert "capabitlies: ns_capable can use the cap helpers rather than lsm call"Linus Torvalds1-1/+1
This reverts commit d2a7009f0bb03fa22ad08dd25472efa0568126b9. J. R. Okajima explains: "After this commit, I am afraid access(2) on NFS may not work correctly. The scenario based upon my guess. - access(2) overrides the credentials. - calls inode_permission() -- ... -- generic_permission() -- ns_capable(). - while the old ns_capable() calls security_capable(current_cred()), the new ns_capable() calls has_ns_capability(current) -- security_capable(__task_cred(t)). current_cred() returns current->cred which is effective (overridden) credentials, but __task_cred(current) returns current->real_cred (the NFSD's credential). And the overridden credentials by access(2) lost." Requested-by: J. R. Okajima <hooanon05@yahoo.co.jp> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-15Merge branch 'for-linus' of git://selinuxproject.org/~jmorris/linux-securityLinus Torvalds1-34/+46
* 'for-linus' of git://selinuxproject.org/~jmorris/linux-security: capabilities: remove __cap_full_set definition security: remove the security_netlink_recv hook as it is equivalent to capable() ptrace: do not audit capability check when outputing /proc/pid/stat capabilities: remove task_ns_* functions capabitlies: ns_capable can use the cap helpers rather than lsm call capabilities: style only - move capable below ns_capable capabilites: introduce new has_ns_capabilities_noaudit capabilities: call has_ns_capability from has_capability capabilities: remove all _real_ interfaces capabilities: introduce security_capable_noaudit capabilities: reverse arguments to security_capable capabilities: remove the task from capable LSM hook entirely selinux: sparse fix: fix several warnings in the security server cod selinux: sparse fix: fix warnings in netlink code selinux: sparse fix: eliminate warnings for selinuxfs selinux: sparse fix: declare selinux_disable() in security.h selinux: sparse fix: move selinux_complete_init selinux: sparse fix: make selinux_secmark_refcount static SELinux: Fix RCU deref check warning in sel_netport_insert() Manually fix up a semantic mis-merge wrt security_netlink_recv(): - the interface was removed in commit fd7784615248 ("security: remove the security_netlink_recv hook as it is equivalent to capable()") - a new user of it appeared in commit a38f7907b926 ("crypto: Add userspace configuration API") causing no automatic merge conflict, but Eric Paris pointed out the issue.
2012-01-05capabilities: remove task_ns_* functionsEric Paris1-14/+0
task_ in the front of a function, in the security subsystem anyway, means to me at least, that we are operating with that task as the subject of the security decision. In this case what it means is that we are using current as the subject but we use the task to get the right namespace. Who in the world would ever realize that's what task_ns_capability means just by the name? This patch eliminates the task_ns functions entirely and uses the has_ns_capability function instead. This means we explicitly open code the ns in question in the caller. I think it makes the caller a LOT more clear what is going on. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
2012-01-05capabitlies: ns_capable can use the cap helpers rather than lsm callEric Paris1-1/+1
Just to reduce the number of places to change if we every change the LSM hook, use the capability helpers internally when possible. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
2012-01-05capabilities: style only - move capable below ns_capableEric Paris1-16/+16
Although the current code is fine for consistency this moves the capable code below the function it calls in the c file. It doesn't actually change code. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
2012-01-05capabilites: introduce new has_ns_capabilities_noauditEric Paris1-5/+25
For consistency in interfaces, introduce a new interface called has_ns_capabilities_noaudit. It checks if the given task has the given capability in the given namespace. Use this new function by has_capabilities_noaudit. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
2012-01-05capabilities: call has_ns_capability from has_capabilityEric Paris1-16/+10
Declare the more specific has_ns_capability first in the code and then call it from has_capability. The declaration reversal isn't stricty necessary since they are both declared in header files, but it just makes sense to put more specific functions first in the code. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
2012-01-05capabilities: remove all _real_ interfacesEric Paris1-3/+15
The name security_real_capable and security_real_capable_noaudit just don't make much sense to me. Convert them to use security_capable and security_capable_noaudit. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
2012-01-05capabilities: reverse arguments to security_capableEric Paris1-1/+1
security_capable takes ns, cred, cap. But the LSM capable() hook takes cred, ns, cap. The capability helper functions also take cred, ns, cap. Rather than flip argument order just to flip it back, leave them alone. Heck, this should be a little faster since argument will be in the right place! Signed-off-by: Eric Paris <eparis@redhat.com>
2011-10-31kernel: Map most files to use export.h instead of module.hPaul Gortmaker1-1/+1
The changed files were only including linux/module.h for the EXPORT_SYMBOL infrastructure, and nothing else. Revector them onto the isolated export header for faster compile times. Nothing to see here but a whole lot of instances of: -#include <linux/module.h> +#include <linux/export.h> This commit is only changing the kernel dir; next targets will probably be mm, fs, the arch dirs, etc. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-05-19Merge branch 'master' into nextJames Morris1-0/+12
Conflicts: include/linux/capability.h Manually resolve merge conflict w/ thanks to Stephen Rothwell. Signed-off-by: James Morris <jmorris@namei.org>
2011-05-13Cache user_ns in struct credSerge E. Hallyn1-0/+12
If !CONFIG_USERNS, have current_user_ns() defined to (&init_user_ns). Get rid of _current_user_ns. This requires nsown_capable() to be defined in capability.c rather than as static inline in capability.h, so do that. Request_key needs init_user_ns defined at current_user_ns if !CONFIG_USERNS, so forward-declare that in cred.h if !CONFIG_USERNS at current_user_ns() define. Compile-tested with and without CONFIG_USERNS. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> [ This makes a huge performance difference for acl_permission_check(), up to 30%. And that is one of the hottest kernel functions for loads that are pathname-lookup heavy. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-04capabilities: delete unused cap_set_fullEric Paris1-2/+0
unused code. Clean it up. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>
2011-04-04capabilities: do not drop CAP_SETPCAP from the initial taskEric Paris1-2/+0
In olden' days of yore CAP_SETPCAP had special meaning for the init task. We actually have code to make sure that CAP_SETPCAP wasn't in pE of things using the init_cred. But CAP_SETPCAP isn't so special any more and we don't have a reason to special case dropping it for init or kthreads.... Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>
2011-03-24userns: make has_capability* into real functionsSerge E. Hallyn1-0/+54
So we can let type safety keep things sane, and as a bonus we can remove the declaration of init_user_ns in capability.h. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-24userns: security: make capabilities relative to the user namespaceSerge E. Hallyn1-5/+37
- Introduce ns_capable to test for a capability in a non-default user namespace. - Teach cap_capable to handle capabilities in a non-default user namespace. The motivation is to get to the unprivileged creation of new namespaces. It looks like this gets us 90% of the way there, with only potential uid confusion issues left. I still need to handle getting all caps after creation but otherwise I think I have a good starter patch that achieves all of your goals. Changelog: 11/05/2010: [serge] add apparmor 12/14/2010: [serge] fix capabilities to created user namespaces Without this, if user serge creates a user_ns, he won't have capabilities to the user_ns he created. THis is because we were first checking whether his effective caps had the caps he needed and returning -EPERM if not, and THEN checking whether he was the creator. Reverse those checks. 12/16/2010: [serge] security_real_capable needs ns argument in !security case 01/11/2011: [serge] add task_ns_capable helper 01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion 02/16/2011: [serge] fix a logic bug: the root user is always creator of init_user_ns, but should not always have capabilities to it! Fix the check in cap_capable(). 02/21/2011: Add the required user_ns parameter to security_capable, fixing a compile failure. 02/23/2011: Convert some macros to functions as per akpm comments. Some couldn't be converted because we can't easily forward-declare them (they are inline if !SECURITY, extern if SECURITY). Add a current_user_ns function so we can use it in capability.h without #including cred.h. Move all forward declarations together to the top of the #ifdef __KERNEL__ section, and use kernel-doc format. 02/23/2011: Per dhowells, clean up comment in cap_capable(). 02/23/2011: Per akpm, remove unreachable 'return -EPERM' in cap_capable. (Original written and signed off by Eric; latest, modified version acked by him) [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: export current_user_ns() for ecryptfs] [serge.hallyn@canonical.com: remove unneeded extra argument in selinux's task_has_capability] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-11security: add cred argument to security_capable()Chris Wright1-1/+1
Expand security_capable() to include cred, so that it can be usable in a wider range of call sites. Signed-off-by: Chris Wright <chrisw@sous-sol.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>
2010-04-02sched: Remove remaining USER_SCHED codeLi Zefan1-1/+0
This is left over from commit 7c9414385e ("sched: Remove USER_SCHED"") Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Howells <dhowells@redhat.com> LKML-Reference: <4BA9A05F.7010407@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09capabilities: Use RCU to protect task lookup in sys_capgetThomas Gleixner1-2/+2
cap_get_target_pid() protects the task lookup with tasklist_lock. security_capget() is called under tasklist_lock as well but tasklist_lock does not protect anything there. The capabilities are protected by RCU already. So tasklist_lock only protects the lookup and prevents the task going away, which can be done with rcu_read_lock() as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: James Morris <jmorris@namei.org>
2009-11-24remove CONFIG_SECURITY_FILE_CAPABILITIES compile optionSerge E. Hallyn1-2/+0
As far as I know, all distros currently ship kernels with default CONFIG_SECURITY_FILE_CAPABILITIES=y. Since having the option on leaves a 'no_file_caps' option to boot without file capabilities, the main reason to keep the option is that turning it off saves you (on my s390x partition) 5k. In particular, vmlinux sizes came to: without patch fscaps=n: 53598392 without patch fscaps=y: 53603406 with this patch applied: 53603342 with the security-next tree. Against this we must weigh the fact that there is no simple way for userspace to figure out whether file capabilities are supported, while things like per-process securebits, capability bounding sets, and adding bits to pI if CAP_SETPCAP is in pE are not supported with SECURITY_FILE_CAPABILITIES=n, leaving a bit of a problem for applications wanting to know whether they can use them and/or why something failed. It also adds another subtly different set of semantics which we must maintain at the risk of severe security regressions. So this patch removes the SECURITY_FILE_CAPABILITIES compile option. It drops the kernel size by about 50k over the stock SECURITY_FILE_CAPABILITIES=y kernel, by removing the cap_limit_ptraced_target() function. Changelog: Nov 20: remove cap_limit_ptraced_target() as it's logic was ifndef'ed. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Andrew G. Morgan" <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>
2009-11-23Silence the existing API for capability version compatibility check.Andrew G. Morgan1-2/+2
When libcap, or other libraries attempt to confirm/determine the supported capability version magic, they generally supply a NULL dataptr to capget(). In this case, while returning the supported/preferred magic (via a modified header content), the return code of this system call may be 0, -EINVAL, or -EFAULT. No libcap code depends on the previous -EINVAL etc. return code, and all of the above three return codes can accompany a valid (successful) attempt to determine the requested magic value. This patch cleans up the system call to return 0, if the call is successfully being used to determine the supported/preferred capability magic value. Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Steve Grubb <sgrubb@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2009-10-13capabilities: simplify bound checks for copy_from_user()Arjan van de Ven1-3/+6
The capabilities syscall has a copy_from_user() call where gcc currently cannot prove to itself that the copy is always within bounds. This patch adds a very explicity bound check to prove to gcc that this copy_from_user cannot overflow its destination buffer. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: James Morris <jmorris@namei.org>
2009-01-14[CVE-2009-0029] System call wrappers part 04Heiko Carstens1-2/+2
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-06Merge branch 'next' into for-linusJames Morris1-1/+1
2009-01-06CRED: Fix regression in cap_capable() as shown up by sys_faccessat() [ver #3]David Howells1-1/+1
Fix a regression in cap_capable() due to: commit 3b11a1decef07c19443d24ae926982bc8ec9f4c0 Author: David Howells <dhowells@redhat.com> Date: Fri Nov 14 10:39:26 2008 +1100 CRED: Differentiate objective and effective subjective credentials on a task The problem is that the above patch allows a process to have two sets of credentials, and for the most part uses the subjective credentials when accessing current's creds. There is, however, one exception: cap_capable(), and thus capable(), uses the real/objective credentials of the target task, whether or not it is the current task. Ordinarily this doesn't matter, since usually the two cred pointers in current point to the same set of creds. However, sys_faccessat() makes use of this facility to override the credentials of the calling process to make its test, without affecting the creds as seen from other processes. One of the things sys_faccessat() does is to make an adjustment to the effective capabilities mask, which cap_capable(), as it stands, then ignores. The affected capability check is in generic_permission(): if (!(mask & MAY_EXEC) || execute_ok(inode)) if (capable(CAP_DAC_OVERRIDE)) return 0; This change passes the set of credentials to be tested down into the commoncap and SELinux code. The security functions called by capable() and has_capability() select the appropriate set of credentials from the process being checked. This can be tested by compiling the following program from the XFS testsuite: /* * t_access_root.c - trivial test program to show permission bug. * * Written by Michael Kerrisk - copyright ownership not pursued. * Sourced from: http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-10/6030.html */ #include <limits.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <sys/stat.h> #define UID 500 #define GID 100 #define PERM 0 #define TESTPATH "/tmp/t_access" static void errExit(char *msg) { perror(msg); exit(EXIT_FAILURE); } /* errExit */ static void accessTest(char *file, int mask, char *mstr) { printf("access(%s, %s) returns %d\n", file, mstr, access(file, mask)); } /* accessTest */ int main(int argc, char *argv[]) { int fd, perm, uid, gid; char *testpath; char cmd[PATH_MAX + 20]; testpath = (argc > 1) ? argv[1] : TESTPATH; perm = (argc > 2) ? strtoul(argv[2], NULL, 8) : PERM; uid = (argc > 3) ? atoi(argv[3]) : UID; gid = (argc > 4) ? atoi(argv[4]) : GID; unlink(testpath); fd = open(testpath, O_RDWR | O_CREAT, 0); if (fd == -1) errExit("open"); if (fchown(fd, uid, gid) == -1) errExit("fchown"); if (fchmod(fd, perm) == -1) errExit("fchmod"); close(fd); snprintf(cmd, sizeof(cmd), "ls -l %s", testpath); system(cmd); if (seteuid(uid) == -1) errExit("seteuid"); accessTest(testpath, 0, "0"); accessTest(testpath, R_OK, "R_OK"); accessTest(testpath, W_OK, "W_OK"); accessTest(testpath, X_OK, "X_OK"); accessTest(testpath, R_OK | W_OK, "R_OK | W_OK"); accessTest(testpath, R_OK | X_OK, "R_OK | X_OK"); accessTest(testpath, W_OK | X_OK, "W_OK | X_OK"); accessTest(testpath, R_OK | W_OK | X_OK, "R_OK | W_OK | X_OK"); exit(EXIT_SUCCESS); } /* main */ This can be run against an Ext3 filesystem as well as against an XFS filesystem. If successful, it will show: [root@andromeda src]# ./t_access_root /tmp/xxx 0 4043 4043 ---------- 1 dhowells dhowells 0 2008-12-31 03:00 /tmp/xxx access(/tmp/xxx, 0) returns 0 access(/tmp/xxx, R_OK) returns 0 access(/tmp/xxx, W_OK) returns 0 access(/tmp/xxx, X_OK) returns -1 access(/tmp/xxx, R_OK | W_OK) returns 0 access(/tmp/xxx, R_OK | X_OK) returns -1 access(/tmp/xxx, W_OK | X_OK) returns -1 access(/tmp/xxx, R_OK | W_OK | X_OK) returns -1 If unsuccessful, it will show: [root@andromeda src]# ./t_access_root /tmp/xxx 0 4043 4043 ---------- 1 dhowells dhowells 0 2008-12-31 02:56 /tmp/xxx access(/tmp/xxx, 0) returns 0 access(/tmp/xxx, R_OK) returns -1 access(/tmp/xxx, W_OK) returns -1 access(/tmp/xxx, X_OK) returns -1 access(/tmp/xxx, R_OK | W_OK) returns -1 access(/tmp/xxx, R_OK | X_OK) returns -1 access(/tmp/xxx, W_OK | X_OK) returns -1 access(/tmp/xxx, R_OK | W_OK | X_OK) returns -1 I've also tested the fix with the SELinux and syscalls LTP testsuites. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: J. Bruce Fields <bfields@citi.umich.edu> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2009-01-06Revert "CRED: Fix regression in cap_capable() as shown up by sys_faccessat() ↵James Morris1-1/+1
[ver #2]" This reverts commit 14eaddc967b16017d4a1a24d2be6c28ecbe06ed8. David has a better version to come.
2009-01-05CRED: Fix regression in cap_capable() as shown up by sys_faccessat() [ver #2]David Howells1-1/+1
Fix a regression in cap_capable() due to: commit 5ff7711e635b32f0a1e558227d030c7e45b4a465 Author: David Howells <dhowells@redhat.com> Date: Wed Dec 31 02:52:28 2008 +0000 CRED: Differentiate objective and effective subjective credentials on a task The problem is that the above patch allows a process to have two sets of credentials, and for the most part uses the subjective credentials when accessing current's creds. There is, however, one exception: cap_capable(), and thus capable(), uses the real/objective credentials of the target task, whether or not it is the current task. Ordinarily this doesn't matter, since usually the two cred pointers in current point to the same set of creds. However, sys_faccessat() makes use of this facility to override the credentials of the calling process to make its test, without affecting the creds as seen from other processes. One of the things sys_faccessat() does is to make an adjustment to the effective capabilities mask, which cap_capable(), as it stands, then ignores. The affected capability check is in generic_permission(): if (!(mask & MAY_EXEC) || execute_ok(inode)) if (capable(CAP_DAC_OVERRIDE)) return 0; This change splits capable() from has_capability() down into the commoncap and SELinux code. The capable() security op now only deals with the current process, and uses the current process's subjective creds. A new security op - task_capable() - is introduced that can check any task's objective creds. strictly the capable() security op is superfluous with the presence of the task_capable() op, however it should be faster to call the capable() op since two fewer arguments need be passed down through the various layers. This can be tested by compiling the following program from the XFS testsuite: /* * t_access_root.c - trivial test program to show permission bug. * * Written by Michael Kerrisk - copyright ownership not pursued. * Sourced from: http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-10/6030.html */ #include <limits.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <sys/stat.h> #define UID 500 #define GID 100 #define PERM 0 #define TESTPATH "/tmp/t_access" static void errExit(char *msg) { perror(msg); exit(EXIT_FAILURE); } /* errExit */ static void accessTest(char *file, int mask, char *mstr) { printf("access(%s, %s) returns %d\n", file, mstr, access(file, mask)); } /* accessTest */ int main(int argc, char *argv[]) { int fd, perm, uid, gid; char *testpath; char cmd[PATH_MAX + 20]; testpath = (argc > 1) ? argv[1] : TESTPATH; perm = (argc > 2) ? strtoul(argv[2], NULL, 8) : PERM; uid = (argc > 3) ? atoi(argv[3]) : UID; gid = (argc > 4) ? atoi(argv[4]) : GID; unlink(testpath); fd = open(testpath, O_RDWR | O_CREAT, 0); if (fd == -1) errExit("open"); if (fchown(fd, uid, gid) == -1) errExit("fchown"); if (fchmod(fd, perm) == -1) errExit("fchmod"); close(fd); snprintf(cmd, sizeof(cmd), "ls -l %s", testpath); system(cmd); if (seteuid(uid) == -1) errExit("seteuid"); accessTest(testpath, 0, "0"); accessTest(testpath, R_OK, "R_OK"); accessTest(testpath, W_OK, "W_OK"); accessTest(testpath, X_OK, "X_OK"); accessTest(testpath, R_OK | W_OK, "R_OK | W_OK"); accessTest(testpath, R_OK | X_OK, "R_OK | X_OK"); accessTest(testpath, W_OK | X_OK, "W_OK | X_OK"); accessTest(testpath, R_OK | W_OK | X_OK, "R_OK | W_OK | X_OK"); exit(EXIT_SUCCESS); } /* main */ This can be run against an Ext3 filesystem as well as against an XFS filesystem. If successful, it will show: [root@andromeda src]# ./t_access_root /tmp/xxx 0 4043 4043 ---------- 1 dhowells dhowells 0 2008-12-31 03:00 /tmp/xxx access(/tmp/xxx, 0) returns 0 access(/tmp/xxx, R_OK) returns 0 access(/tmp/xxx, W_OK) returns 0 access(/tmp/xxx, X_OK) returns -1 access(/tmp/xxx, R_OK | W_OK) returns 0 access(/tmp/xxx, R_OK | X_OK) returns -1 access(/tmp/xxx, W_OK | X_OK) returns -1 access(/tmp/xxx, R_OK | W_OK | X_OK) returns -1 If unsuccessful, it will show: [root@andromeda src]# ./t_access_root /tmp/xxx 0 4043 4043 ---------- 1 dhowells dhowells 0 2008-12-31 02:56 /tmp/xxx access(/tmp/xxx, 0) returns 0 access(/tmp/xxx, R_OK) returns -1 access(/tmp/xxx, W_OK) returns -1 access(/tmp/xxx, X_OK) returns -1 access(/tmp/xxx, R_OK | W_OK) returns -1 access(/tmp/xxx, R_OK | X_OK) returns -1 access(/tmp/xxx, W_OK | X_OK) returns -1 access(/tmp/xxx, R_OK | W_OK | X_OK) returns -1 I've also tested the fix with the SELinux and syscalls LTP testsuites. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2009-01-04sanitize audit_log_capset()Al Viro1-3/+1
* no allocations * return void * don't duplicate checked for dummy context Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-11-13CRED: Inaugurate COW credentialsDavid Howells1-54/+24
Inaugurate copy-on-write credentials management. This uses RCU to manage the credentials pointer in the task_struct with respect to accesses by other tasks. A process may only modify its own credentials, and so does not need locking to access or modify its own credentials. A mutex (cred_replace_mutex) is added to the task_struct to control the effect of PTRACE_ATTACHED on credential calculations, particularly with respect to execve(). With this patch, the contents of an active credentials struct may not be changed directly; rather a new set of credentials must be prepared, modified and committed using something like the following sequence of events: struct cred *new = prepare_creds(); int ret = blah(new); if (ret < 0) { abort_creds(new); return ret; } return commit_creds(new); There are some exceptions to this rule: the keyrings pointed to by the active credentials may be instantiated - keyrings violate the COW rule as managing COW keyrings is tricky, given that it is possible for a task to directly alter the keys in a keyring in use by another task. To help enforce this, various pointers to sets of credentials, such as those in the task_struct, are declared const. The purpose of this is compile-time discouragement of altering credentials through those pointers. Once a set of credentials has been made public through one of these pointers, it may not be modified, except under special circumstances: (1) Its reference count may incremented and decremented. (2) The keyrings to which it points may be modified, but not replaced. The only safe way to modify anything else is to create a replacement and commit using the functions described in Documentation/credentials.txt (which will be added by a later patch). This patch and the preceding patches have been tested with the LTP SELinux testsuite. This patch makes several logical sets of alteration: (1) execve(). This now prepares and commits credentials in various places in the security code rather than altering the current creds directly. (2) Temporary credential overrides. do_coredump() and sys_faccessat() now prepare their own credentials and temporarily override the ones currently on the acting thread, whilst preventing interference from other threads by holding cred_replace_mutex on the thread being dumped. This will be replaced in a future patch by something that hands down the credentials directly to the functions being called, rather than altering the task's objective credentials. (3) LSM interface. A number of functions have been changed, added or removed: (*) security_capset_check(), ->capset_check() (*) security_capset_set(), ->capset_set() Removed in favour of security_capset(). (*) security_capset(), ->capset() New. This is passed a pointer to the new creds, a pointer to the old creds and the proposed capability sets. It should fill in the new creds or return an error. All pointers, barring the pointer to the new creds, are now const. (*) security_bprm_apply_creds(), ->bprm_apply_creds() Changed; now returns a value, which will cause the process to be killed if it's an error. (*) security_task_alloc(), ->task_alloc_security() Removed in favour of security_prepare_creds(). (*) security_cred_free(), ->cred_free() New. Free security data attached to cred->security. (*) security_prepare_creds(), ->cred_prepare() New. Duplicate any security data attached to cred->security. (*) security_commit_creds(), ->cred_commit() New. Apply any security effects for the upcoming installation of new security by commit_creds(). (*) security_task_post_setuid(), ->task_post_setuid() Removed in favour of security_task_fix_setuid(). (*) security_task_fix_setuid(), ->task_fix_setuid() Fix up the proposed new credentials for setuid(). This is used by cap_set_fix_setuid() to implicitly adjust capabilities in line with setuid() changes. Changes are made to the new credentials, rather than the task itself as in security_task_post_setuid(). (*) security_task_reparent_to_init(), ->task_reparent_to_init() Removed. Instead the task being reparented to init is referred directly to init's credentials. NOTE! This results in the loss of some state: SELinux's osid no longer records the sid of the thread that forked it. (*) security_key_alloc(), ->key_alloc() (*) security_key_permission(), ->key_permission() Changed. These now take cred pointers rather than task pointers to refer to the security context. (4) sys_capset(). This has been simplified and uses less locking. The LSM functions it calls have been merged. (5) reparent_to_kthreadd(). This gives the current thread the same credentials as init by simply using commit_thread() to point that way. (6) __sigqueue_alloc() and switch_uid() __sigqueue_alloc() can't stop the target task from changing its creds beneath it, so this function gets a reference to the currently applicable user_struct which it then passes into the sigqueue struct it returns if successful. switch_uid() is now called from commit_creds(), and possibly should be folded into that. commit_creds() should take care of protecting __sigqueue_alloc(). (7) [sg]et[ug]id() and co and [sg]et_current_groups. The set functions now all use prepare_creds(), commit_creds() and abort_creds() to build and check a new set of credentials before applying it. security_task_set[ug]id() is called inside the prepared section. This guarantees that nothing else will affect the creds until we've finished. The calling of set_dumpable() has been moved into commit_creds(). Much of the functionality of set_user() has been moved into commit_creds(). The get functions all simply access the data directly. (8) security_task_prctl() and cap_task_prctl(). security_task_prctl() has been modified to return -ENOSYS if it doesn't want to handle a function, or otherwise return the return value directly rather than through an argument. Additionally, cap_task_prctl() now prepares a new set of credentials, even if it doesn't end up using it. (9) Keyrings. A number of changes have been made to the keyrings code: (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have all been dropped and built in to the credentials functions directly. They may want separating out again later. (b) key_alloc() and search_process_keyrings() now take a cred pointer rather than a task pointer to specify the security context. (c) copy_creds() gives a new thread within the same thread group a new thread keyring if its parent had one, otherwise it discards the thread keyring. (d) The authorisation key now points directly to the credentials to extend the search into rather pointing to the task that carries them. (e) Installing thread, process or session keyrings causes a new set of credentials to be created, even though it's not strictly necessary for process or session keyrings (they're shared). (10) Usermode helper. The usermode helper code now carries a cred struct pointer in its subprocess_info struct instead of a new session keyring pointer. This set of credentials is derived from init_cred and installed on the new process after it has been cloned. call_usermodehelper_setup() allocates the new credentials and call_usermodehelper_freeinfo() discards them if they haven't been used. A special cred function (prepare_usermodeinfo_creds()) is provided specifically for call_usermodehelper_setup() to call. call_usermodehelper_setkeys() adjusts the credentials to sport the supplied keyring as the new session keyring. (11) SELinux. SELinux has a number of changes, in addition to those to support the LSM interface changes mentioned above: (a) selinux_setprocattr() no longer does its check for whether the current ptracer can access processes with the new SID inside the lock that covers getting the ptracer's SID. Whilst this lock ensures that the check is done with the ptracer pinned, the result is only valid until the lock is released, so there's no point doing it inside the lock. (12) is_single_threaded(). This function has been extracted from selinux_setprocattr() and put into a file of its own in the lib/ directory as join_session_keyring() now wants to use it too. The code in SELinux just checked to see whether a task shared mm_structs with other tasks (CLONE_VM), but that isn't good enough. We really want to know if they're part of the same thread group (CLONE_THREAD). (13) nfsd. The NFS server daemon now has to use the COW credentials to set the credentials it is going to use. It really needs to pass the credentials down to the functions it calls, but it can't do that until other patches in this series have been applied. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13CRED: Separate task security context from task_structDavid Howells1-2/+2
Separate the task security context from task_struct. At this point, the security data is temporarily embedded in the task_struct with two pointers pointing to it. Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in entry.S via asm-offsets. With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-13CRED: Neuter sys_capset()David Howells1-204/+23
Take away the ability for sys_capset() to affect processes other than current. This means that current will not need to lock its own credentials when reading them against interference by other processes. This has effectively been the case for a while anyway, since: (1) Without LSM enabled, sys_capset() is disallowed. (2) With file-based capabilities, sys_capset() is neutered. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-11Capabilities: BUG when an invalid capability is requestedEric Paris1-0/+5
If an invalid (large) capability is requested the capabilities system may panic as it is dereferencing an array of fixed (short) length. Its possible (and actually often happens) that the capability system accidentally stumbled into a valid memory region but it also regularly happens that it hits invalid memory and BUGs. If such an operation does get past cap_capable then the selinux system is sure to have problems as it already does a (simple) validity check and BUG. This is known to happen by the broken and buggy firegl driver. This patch cleanly checks all capable calls and BUG if a call is for an invalid capability. This will likely break the firegl driver for some situations, but it is the right thing to do. Garbage into a security system gets you killed/bugged Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-11When the capset syscall is used it is not possible for audit to record theEric Paris1-0/+5
actual capbilities being added/removed. This patch adds a new record type which emits the target pid and the eff, inh, and perm cap sets. example output if you audit capset syscalls would be: type=SYSCALL msg=audit(1225743140.465:76): arch=c000003e syscall=126 success=yes exit=0 a0=17f2014 a1=17f201c a2=80000000 a3=7fff2ab7f060 items=0 ppid=2160 pid=2223 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=1 comm="setcap" exe="/usr/sbin/setcap" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null) type=UNKNOWN[1322] msg=audit(1225743140.465:76): pid=0 cap_pi=ffffffffffffffff cap_pp=ffffffffffffffff cap_pe=ffffffffffffffff Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2008-11-05file capabilities: add no_file_caps switch (v4)Serge E. Hallyn1-0/+11
Add a no_file_caps boot option when file capabilities are compiled into the kernel (CONFIG_SECURITY_FILE_CAPABILITIES=y). This allows distributions to ship a kernel with file capabilities compiled in, without forcing users to use (and understand and trust) them. When no_file_caps is specified at boot, then when a process executes a file, any file capabilities stored with that file will not be used in the calculation of the process' new capability sets. This means that booting with the no_file_caps boot option will not be the same as booting a kernel with file capabilities compiled out - in particular a task with CAP_SETPCAP will not have any chance of passing capabilities to another task (which isn't "really" possible anyway, and which may soon by killed altogether by David Howells in any case), and it will instead be able to put new capabilities in its pI. However since fI will always be empty and pI is masked with fI, it gains the task nothing. We also support the extra prctl options, setting securebits and dropping capabilities from the per-process bounding set. The other remaining difference is that killpriv, task_setscheduler, setioprio, and setnice will continue to be hooked. That will be noticable in the case where a root task changed its uid while keeping some caps, and another task owned by the new uid tries to change settings for the more privileged task. Changelog: Nov 05 2008: (v4) trivial port on top of always-start-\ with-clear-caps patch Sep 23 2008: nixed file_caps_enabled when file caps are not compiled in as it isn't used. Document no_file_caps in kernel-parameters.txt. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>
2008-08-14security: Fix setting of PF_SUPERPRIV by __capable()David Howells1-8/+13
Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>
2008-07-24security: filesystem capabilities refactor kernel codeAndrew G. Morgan1-117/+221
To date, we've tried hard to confine filesystem support for capabilities to the security modules. This has left a lot of the code in kernel/capability.c in a state where it looks like it supports something that filesystem support for capabilities actually suppresses when the LSM security/commmoncap.c code runs. What is left is a lot of code that uses sub-optimal locking in the main kernel With this change we refactor the main kernel code and make it explicit which locks are needed and that the only remaining kernel races in this area are associated with non-filesystem capability code. Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04security: filesystem capabilities: fix fragile setuid fixup codeAndrew G. Morgan1-0/+21
This commit includes a bugfix for the fragile setuid fixup code in the case that filesystem capabilities are supported (in access()). The effect of this fix is gated on filesystem capability support because changing securebits is only supported when filesystem capabilities support is configured.) [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-31capabilities: remain source compatible with 32-bit raw legacy capability ↵Andrew G. Morgan1-38/+73
support. Source code out there hard-codes a notion of what the _LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the raw capability system calls capget() and capset(). Its unfortunate, but true. Since the confusing header file has been in a released kernel, there is software that is erroneously using 64-bit capabilities with the semantics of 32-bit compatibilities. These recently compiled programs may suffer corruption of their memory when sys_getcap() overwrites more memory than they are coded to expect, and the raising of added capabilities when using sys_capset(). As such, this patch does a number of things to clean up the situation for all. It 1. forces the _LINUX_CAPABILITY_VERSION define to always retain its legacy value. 2. adopts a new #define strategy for the kernel's internal implementation of the preferred magic. 3. deprecates v2 capability magic in favor of a new (v3) magic number. The functionality of v3 is entirely equivalent to v2, the only difference being that the v2 magic causes the kernel to log a "deprecated" warning so the admin can find applications that may be using v2 inappropriately. [User space code continues to be encouraged to use the libcap API which protects the application from details like this. libcap-2.10 is the first to support v3 capabilities.] Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518. Thanks to Bojan Smojver for the report. [akpm@linux-foundation.org: s/depreciate/deprecate/g] [akpm@linux-foundation.org: be robust about put_user size] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Bojan Smojver <bojan@rexursive.com> Cc: stable@kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-02-05Add 64-bit capability support to the kernelAndrew Morgan1-9/+104
The patch supports legacy (32-bit) capability userspace, and where possible translates 32-bit capabilities to/from userspace and the VFS to 64-bit kernel space capabilities. If a capability set cannot be compressed into 32-bits for consumption by user space, the system call fails, with -ERANGE. FWIW libcap-2.00 supports this change (and earlier capability formats) http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/ [akpm@linux-foundation.org: coding-syle fixes] [akpm@linux-foundation.org: use get_task_comm()] [ezk@cs.sunysb.edu: build fix] [akpm@linux-foundation.org: do not initialise statics to 0 or NULL] [akpm@linux-foundation.org: unused var] [serue@us.ibm.com: export __cap_ symbols] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Uninline find_pid etc set of functionsPavel Emelyanov1-1/+1
The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its id, depending on whic id - global or virtual - is used. The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the stack to call another function - find_pid_ns(). It turned out, that this dereference together with the push itself cause the kernel text size to grow too much. Move all these out-of-line. Together with the previous patch this saves a bit less that 400 bytes from .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19Uninline find_task_by_xxx set of functionsPavel Emelyanov1-4/+2
The find_task_by_something is a set of macros are used to find task by pid depending on what kind of pid is proposed - global or virtual one. All of them are wrappers above the most generic one - find_task_by_pid_type_ns() - and just substitute some args for it. It turned out, that dereferencing the current->nsproxy->pid_ns construction and pushing one more argument on the stack inline cause kernel text size to grow. This patch moves all this stuff out-of-line into kernel/pid.c. Together with the next patch it saves a bit less than 400 bytes from the .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: changes to show virtual ids to userPavel Emelyanov1-6/+8
This is the largest patch in the set. Make all (I hope) the places where the pid is shown to or get from user operate on the virtual pids. The idea is: - all in-kernel data structures must store either struct pid itself or the pid's global nr, obtained with pid_nr() call; - when seeking the task from kernel code with the stored id one should use find_task_by_pid() call that works with global pids; - when showing pid's numerical value to the user the virtual one should be used, but however when one shows task's pid outside this task's namespace the global one is to be used; - when getting the pid from userspace one need to consider this as the virtual one and use appropriate task/pid-searching functions. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: nuther build fix] [akpm@linux-foundation.org: yet nuther build fix] [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19pid namespaces: define is_global_init() and is_container_init()Serge E. Hallyn1-1/+2
is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18whitespace fixes: capability syscallsDaniel Walker1-95/+95
Large chunks of 5 spaces instead of tabs. Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18V3 file capabilities: alter behavior of cap_setpcapAndrew Morgan1-4/+1
The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1, can change the capabilities of another process, p2. This is not the meaning that was intended for this capability at all, and this implementation came about purely because, without filesystem capabilities, there was no way to use capabilities without one process bestowing them on another. Since we now have a filesystem support for capabilities we can fix the implementation of CAP_SETPCAP. The most significant thing about this change is that, with it in effect, no process can set the capabilities of another process. The capabilities of a program are set via the capability convolution rules: pI(post-exec) = pI(pre-exec) pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI) pE(post-exec) = fE ? pP(post-exec) : 0 at exec() time. As such, the only influence the pre-exec() program can have on the post-exec() program's capabilities are through the pI capability set. The correct implementation for CAP_SETPCAP (and that enabled by this patch) is that it can be used to add extra pI capabilities to the current process - to be picked up by subsequent exec()s when the above convolution rules are applied. Here is how it works: Let's say we have a process, p. It has capability sets, pE, pP and pI. Generally, p, can change the value of its own pI to pI' where (pI' & ~pI) & ~pP = 0. That is, the only new things in pI' that were not present in pI need to be present in pP. The role of CAP_SETPCAP is basically to permit changes to pI beyond the above: if (pE & CAP_SETPCAP) { pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */ } This capability is useful for things like login, which (say, via pam_cap) might want to raise certain inheritable capabilities for use by the children of the logged-in user's shell, but those capabilities are not useful to or needed by the login program itself. One such use might be to limit who can run ping. You set the capabilities of the 'ping' program to be "= cap_net_raw+i", and then only shells that have (pI & CAP_NET_RAW) will be able to run it. Without CAP_SETPCAP implemented as described above, login(pam_cap) would have to also have (pP & CAP_NET_RAW) in order to raise this capability and pass it on through the inheritable set. Signed-off-by: Andrew Morgan <morgan@kernel.org> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>