summaryrefslogtreecommitdiff
path: root/fs/ceph
AgeCommit message (Collapse)AuthorFilesLines
2014-01-09ceph: allow sync_read/write return partial successed size of read/write.majianpeng1-1/+3
commit ee7289bfadda5f4ef60884547ebc9989c8fb314a upstream. For sync_read/write, it may do multi stripe operations.If one of those met erro, we return the former successed size rather than a error value. There is a exception for write-operation met -EOLDSNAPC.If this occur,we retry the whole write again. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: fix bugs about handling short-read for sync read mode.majianpeng1-23/+16
commit 02ae66d8b229708fd94b764f6c17ead1c7741fcf upstream. cephfs . show_layout >layyout.data_pool: 0 >layout.object_size: 4194304 >layout.stripe_unit: 4194304 >layout.stripe_count: 1 TestA: >dd if=/dev/urandom of=test bs=1M count=2 oflag=direct >dd if=/dev/urandom of=test bs=1M count=2 seek=4 oflag=direct >dd if=test of=/dev/null bs=6M count=1 iflag=direct The messages from func striped_read are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT ceph: file.c:350 : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT ceph: file.c:381 : zero tail 4194304 ceph: file.c:390 : striped_read returns 6291456 The hole of file is from 2M--4M.But actualy it zero the last 4M include the last 2M area which isn't a hole. Using this patch, the messages are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT ceph: file.c:358 : zero gap 2097152 to 4194304 ceph: file.c:350 : striped_read 4194304~2097152 (read 4194304) got 2097152 ceph: file.c:384 : striped_read returns 6291456 TestB: >echo majianpeng > test >dd if=test of=/dev/null bs=2M count=1 iflag=direct The messages are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT ceph: file.c:350 : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT ceph: file.c:390 : striped_read returns 11 For this case,it did once more striped_read.It's no meaningless. Using this patch, the message are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT ceph: file.c:384 : striped_read returns 11 Big thanks to Yan Zheng for the patch. Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: Add check returned value on func ceph_calc_ceph_pg.majianpeng1-2/+6
commit 2fbcbff1d6b9243ef71c64a8ab993bc3c7bb7af1 upstream. Func ceph_calc_ceph_pg maybe failed.So add check for returned value. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: cleanup types in striped_read()Dan Carpenter1-4/+4
commit 688bac461ba3e9d221a879ab40b687f5d7b5b19c upstream. We pass in a u64 value for "len" and then immediately truncate away the upper 32 bits. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <alex.elder@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: fix null pointer dereferenceNathaniel Yazdani1-0/+3
commit c338c07c51e3106711fad5eb599e375eadb6855d upstream. When register_session() is given an out-of-range argument for mds, ceph_mdsmap_get_addr() will return a null pointer, which would be given to ceph_con_open() & be dereferenced, causing a kernel oops. This fixes bug #4685 in the Ceph bug tracker <http://tracker.ceph.com/issues/4685>. Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: avoid accessing invalid memorySasha Levin1-1/+1
commit 5446429630257f4723829409337a26c076907d5d upstream. when mounting ceph with a dev name that starts with a slash, ceph would attempt to access the character before that slash. Since we don't actually own that byte of memory, we would trigger an invalid access: [ 43.499934] BUG: unable to handle kernel paging request at ffff880fa3a97fff [ 43.500984] IP: [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300 [ 43.501491] PGD 743b067 PUD 10283c4067 PMD 10282a6067 PTE 8000000fa3a97060 [ 43.502301] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 43.503006] Dumping ftrace buffer: [ 43.503596] (ftrace buffer empty) [ 43.504046] CPU: 0 PID: 10879 Comm: mount Tainted: G W 3.10.0-sasha #1129 [ 43.504851] task: ffff880fa625b000 ti: ffff880fa3412000 task.ti: ffff880fa3412000 [ 43.505608] RIP: 0010:[<ffffffff818f3884>] [<ffffffff818f3884>] parse_mount_options$ [ 43.506552] RSP: 0018:ffff880fa3413d08 EFLAGS: 00010286 [ 43.507133] RAX: ffff880fa3a98000 RBX: ffff880fa3a98000 RCX: 0000000000000000 [ 43.507893] RDX: ffff880fa3a98001 RSI: 000000000000002f RDI: ffff880fa3a98000 [ 43.508610] RBP: ffff880fa3413d58 R08: 0000000000001f99 R09: ffff880fa3fe64c0 [ 43.509426] R10: ffff880fa3413d98 R11: ffff880fa38710d8 R12: ffff880fa3413da0 [ 43.509792] R13: ffff880fa3a97fff R14: 0000000000000000 R15: ffff880fa3413d90 [ 43.509792] FS: 00007fa9c48757e0(0000) GS:ffff880fd2600000(0000) knlGS:000000000000$ [ 43.509792] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 43.509792] CR2: ffff880fa3a97fff CR3: 0000000fa3bb9000 CR4: 00000000000006b0 [ 43.509792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 43.509792] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 43.509792] Stack: [ 43.509792] 0000e5180000000e ffffffff85ca1900 ffff880fa38710d8 ffff880fa3413d98 [ 43.509792] 0000000000000120 0000000000000000 ffff880fa3a98000 0000000000000000 [ 43.509792] ffffffff85cf32a0 0000000000000000 ffff880fa3413dc8 ffffffff818f3c72 [ 43.509792] Call Trace: [ 43.509792] [<ffffffff818f3c72>] ceph_mount+0xa2/0x390 [ 43.509792] [<ffffffff81226314>] ? pcpu_alloc+0x334/0x3c0 [ 43.509792] [<ffffffff81282f8d>] mount_fs+0x8d/0x1a0 [ 43.509792] [<ffffffff812263d0>] ? __alloc_percpu+0x10/0x20 [ 43.509792] [<ffffffff8129f799>] vfs_kern_mount+0x79/0x100 [ 43.509792] [<ffffffff812a224d>] do_new_mount+0xcd/0x1c0 [ 43.509792] [<ffffffff812a2e8d>] do_mount+0x15d/0x210 [ 43.509792] [<ffffffff81220e55>] ? strndup_user+0x45/0x60 [ 43.509792] [<ffffffff812a2fdd>] SyS_mount+0x9d/0xe0 [ 43.509792] [<ffffffff83fd816c>] tracesys+0xdd/0xe2 [ 43.509792] Code: 4c 8b 5d c0 74 0a 48 8d 50 01 49 89 14 24 eb 17 31 c0 48 83 c9 ff $ [ 43.509792] RIP [<ffffffff818f3884>] parse_mount_options+0x1a4/0x300 [ 43.509792] RSP <ffff880fa3413d08> [ 43.509792] CR2: ffff880fa3a97fff [ 43.509792] ---[ end trace 22469cd81e93af51 ]--- Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Reviewed-by: Sage Weil <sage@inktan.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: Free mdsc if alloc mdsc->mdsmap failed.majianpeng1-1/+3
commit fb3101b6f0db9ae3f35dc8e6ec908d0af8cdf12e upstream. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: improve error handling in ceph_mdsmap_decodeEmil Goode1-1/+3
commit c213b50b7dcbf06abcfbf1e4eee5b76586718bd9 upstream. This patch makes the following improvements to the error handling in the ceph_mdsmap_decode function: - Add a NULL check for return value from kcalloc - Make use of the variable err Signed-off-by: Emil Goode <emilgoode@gmail.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: Avoid data inconsistency due to d-cache aliasing in readpage()Li Wang1-2/+6
commit 56f91aad69444d650237295f68c195b74d888d95 upstream. If the length of data to be read in readpage() is exactly PAGE_CACHE_SIZE, the original code does not flush d-cache for data consistency after finishing reading. This patches fixes this. Signed-off-by: Li Wang <liwang@ubuntukylin.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: wake up 'safe' waiters when unregistering requestYan, Zheng1-1/+2
commit fc55d2c9448b34218ca58733a6f51fbede09575b upstream. We also need to wake up 'safe' waiters if error occurs or request aborted. Otherwise sync(2)/fsync(2) may hang forever. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09ceph: cleanup aborted requests when re-sending requests.Yan, Zheng1-1/+4
commit eb1b8af33c2e42a9a57fc0a7588f4a7b255d2e79 upstream. Aborted requests usually get cleared when the reply is received. If MDS crashes, no reply will be received. So we need to cleanup aborted requests when re-sending requests. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Greg Farnum <greg@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26ceph: Don't forget the 'up_read(&osdc->map_sem)' if met error.majianpeng1-1/+3
commit 494ddd11be3e2621096bb425eed2886f8e8446d4 upstream. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-13ceph: fix sleeping function called from invalid context.majianpeng1-4/+5
commit a1dc1937337a93e699eaa56968b7de6e1a9e77cf upstream. [ 1121.231883] BUG: sleeping function called from invalid context at kernel/rwsem.c:20 [ 1121.231935] in_atomic(): 1, irqs_disabled(): 0, pid: 9831, name: mv [ 1121.231971] 1 lock held by mv/9831: [ 1121.231973] #0: (&(&ci->i_ceph_lock)->rlock){+.+...},at:[<ffffffffa02bbd38>] ceph_getxattr+0x58/0x1d0 [ceph] [ 1121.231998] CPU: 3 PID: 9831 Comm: mv Not tainted 3.10.0-rc6+ #215 [ 1121.232000] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011 [ 1121.232027] ffff88006d355a80 ffff880092f69ce0 ffffffff8168348c ffff880092f69cf8 [ 1121.232045] ffffffff81070435 ffff88006d355a20 ffff880092f69d20 ffffffff816899ba [ 1121.232052] 0000000300000004 ffff8800b76911d0 ffff88006d355a20 ffff880092f69d68 [ 1121.232056] Call Trace: [ 1121.232062] [<ffffffff8168348c>] dump_stack+0x19/0x1b [ 1121.232067] [<ffffffff81070435>] __might_sleep+0xe5/0x110 [ 1121.232071] [<ffffffff816899ba>] down_read+0x2a/0x98 [ 1121.232080] [<ffffffffa02baf70>] ceph_vxattrcb_layout+0x60/0xf0 [ceph] [ 1121.232088] [<ffffffffa02bbd7f>] ceph_getxattr+0x9f/0x1d0 [ceph] [ 1121.232093] [<ffffffff81188d28>] vfs_getxattr+0xa8/0xd0 [ 1121.232097] [<ffffffff8118900b>] getxattr+0xab/0x1c0 [ 1121.232100] [<ffffffff811704f2>] ? final_putname+0x22/0x50 [ 1121.232104] [<ffffffff81155f80>] ? kmem_cache_free+0xb0/0x260 [ 1121.232107] [<ffffffff811704f2>] ? final_putname+0x22/0x50 [ 1121.232110] [<ffffffff8109e63d>] ? trace_hardirqs_on+0xd/0x10 [ 1121.232114] [<ffffffff816957a7>] ? sysret_check+0x1b/0x56 [ 1121.232120] [<ffffffff81189c9c>] SyS_fgetxattr+0x6c/0xc0 [ 1121.232125] [<ffffffff81695782>] system_call_fastpath+0x16/0x1b [ 1121.232129] BUG: scheduling while atomic: mv/9831/0x10000002 [ 1121.232154] 1 lock held by mv/9831: [ 1121.232156] #0: (&(&ci->i_ceph_lock)->rlock){+.+...}, at: [<ffffffffa02bbd38>] ceph_getxattr+0x58/0x1d0 [ceph] I think move the ci->i_ceph_lock down is safe because we can't free ceph_inode_info at there. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-12Merge branch 'for-linus' of ↵Linus Torvalds3-58/+89
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph fixes from Sage Weil: "There is a pair of fixes for double-frees in the recent bundle for 3.10, a couple of fixes for long-standing bugs (sleep while atomic and an endianness fix), and a locking fix that can be triggered when osds are going down" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: fix cleanup in rbd_add() rbd: don't destroy ceph_opts in rbd_add() ceph: ceph_pagelist_append might sleep while atomic ceph: add cpu_to_le32() calls when encoding a reconnect capability libceph: must hold mutex for reset_changed_osds()
2013-05-17ceph: ceph_pagelist_append might sleep while atomicJim Schutt3-61/+89
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while holding a lock, but it's spoiled because ceph_pagelist_addpage() always calls kmap(), which might sleep. Here's the result: [13439.295457] ceph: mds0 reconnect start [13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58 [13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1 . . . [13439.376225] Call Trace: [13439.378757] [<ffffffff81076f4c>] __might_sleep+0xfc/0x110 [13439.384353] [<ffffffffa03f4ce0>] ceph_pagelist_append+0x120/0x1b0 [libceph] [13439.391491] [<ffffffffa0448fe9>] ceph_encode_locks+0x89/0x190 [ceph] [13439.398035] [<ffffffff814ee849>] ? _raw_spin_lock+0x49/0x50 [13439.403775] [<ffffffff811cadf5>] ? lock_flocks+0x15/0x20 [13439.409277] [<ffffffffa045e2af>] encode_caps_cb+0x41f/0x4a0 [ceph] [13439.415622] [<ffffffff81196748>] ? igrab+0x28/0x70 [13439.420610] [<ffffffffa045e9f8>] ? iterate_session_caps+0xe8/0x250 [ceph] [13439.427584] [<ffffffffa045ea25>] iterate_session_caps+0x115/0x250 [ceph] [13439.434499] [<ffffffffa045de90>] ? set_request_path_attr+0x2d0/0x2d0 [ceph] [13439.441646] [<ffffffffa0462888>] send_mds_reconnect+0x238/0x450 [ceph] [13439.448363] [<ffffffffa0464542>] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph] [13439.455250] [<ffffffffa0462e42>] check_new_map+0x352/0x500 [ceph] [13439.461534] [<ffffffffa04631ad>] ceph_mdsc_handle_map+0x1bd/0x260 [ceph] [13439.468432] [<ffffffff814ebc7e>] ? mutex_unlock+0xe/0x10 [13439.473934] [<ffffffffa043c612>] extra_mon_dispatch+0x22/0x30 [ceph] [13439.480464] [<ffffffffa03f6c2c>] dispatch+0xbc/0x110 [libceph] [13439.486492] [<ffffffffa03eec3d>] process_message+0x1ad/0x1d0 [libceph] [13439.493190] [<ffffffffa03f1498>] ? read_partial_message+0x3e8/0x520 [libceph] . . . [13439.587132] ceph: mds0 reconnect success [13490.720032] ceph: mds0 caps stale [13501.235257] ceph: mds0 recovery completed [13501.300419] ceph: mds0 caps renewed Fix it up by encoding locks into a buffer first, and when the number of encoded locks is stable, copy that into a ceph_pagelist. [elder@inktank.com: abbreviated the stack info a bit.] Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Jim Schutt <jaschut@sandia.gov> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-17ceph: add cpu_to_le32() calls when encoding a reconnect capabilityJim Schutt2-3/+6
In his review, Alex Elder mentioned that he hadn't checked that num_fcntl_locks and num_flock_locks were properly decoded on the server side, from a le32 over-the-wire type to a cpu type. I checked, and AFAICS it is done; those interested can consult Locker::_do_cap_update() in src/mds/Locker.cc and src/include/encoding.h in the Ceph server code (git://github.com/ceph/ceph). I also checked the server side for flock_len decoding, and I believe that also happens correctly, by virtue of having been declared __le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h. Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Jim Schutt <jaschut@sandia.gov> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-07aio: don't include aio.h in sched.hKent Overstreet1-0/+1
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ceph: use ceph_create_snap_context()Alex Elder1-2/+1
Now that we have a library routine to create snap contexts, use it. This is part of: http://tracker.ceph.com/issues/4857 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: kill off osd data write_request parametersAlex Elder2-7/+6
In the incremental move toward supporting distinct data items in an osd request some of the functions had "write_request" parameters to indicate, basically, whether the data belonged to in_data or the out_data. Now that we maintain the data fields in the op structure there is no need to indicate the direction, so get rid of the "write_request" parameters. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01ceph: fix printk format warnings in file.cRandy Dunlap1-2/+2
Fix printk format warnings by using %zd for 'ssize_t' variables: fs/ceph/file.c:751:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat] fs/ceph/file.c:762:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat] Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: ceph-devel@vger.kernel.org Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-01ceph: fix race between writepages and truncateYan, Zheng1-7/+7
ceph_writepages_start() reads inode->i_size in two places. It can get different values between successive read, because truncate can change inode->i_size at any time. The race can lead to mismatch between data length of osd request and pages marked as writeback. When osd request finishes, it clear writeback page according to its data length. So some pages can be left in writeback state forever. The fix is only read inode->i_size once, save its value to a local variable and use the local variable when i_size is needed. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01ceph: apply write checks in ceph_aio_writeYan, Zheng1-35/+59
copy write checks in __generic_file_aio_write to ceph_aio_write. To make these checks cover sync write path. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01ceph: take i_mutex before getting Fw capYan, Zheng2-12/+13
There is deadlock as illustrated bellow. The fix is taking i_mutex before getting Fw cap reference. write truncate MDS --------------------- -------------------- -------------- get Fw cap lock i_mutex lock i_mutex (blocked) request setattr.size -> <- revoke Fw cap Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01libceph: change how "safe" callback is usedAlex Elder1-24/+28
An osd request currently has two callbacks. They inform the initiator of the request when we've received confirmation for the target osd that a request was received, and when the osd indicates all changes described by the request are durable. The only time the second callback is used is in the ceph file system for a synchronous write. There's a race that makes some handling of this case unsafe. This patch addresses this problem. The error handling for this callback is also kind of gross, and this patch changes that as well. In ceph_sync_write(), if a safe callback is requested we want to add the request on the ceph inode's unsafe items list. Because items on this list must have their tid set (by ceph_osd_start_request()), the request added *after* the call to that function returns. The problem with this is that there's a race between starting the request and adding it to the unsafe items list; the request may already be complete before ceph_sync_write() even begins to put it on the list. To address this, we change the way the "safe" callback is used. Rather than just calling it when the request is "safe", we use it to notify the initiator the bounds (start and end) of the period during which the request is *unsafe*. So the initiator gets notified just before the request gets sent to the osd (when it is "unsafe"), and again when it's known the results are durable (it's no longer unsafe). The first call will get made in __send_request(), just before the request message gets sent to the messenger for the first time. That function is only called by __send_queued(), which is always called with the osd client's request mutex held. We then have this callback function insert the request on the ceph inode's unsafe list when we're told the request is unsafe. This will avoid the race because this call will be made under protection of the osd client's request mutex. It also nicely groups the setup and cleanup of the state associated with managing unsafe requests. The name of the "safe" callback field is changed to "unsafe" to better reflect its new purpose. It has a Boolean "unsafe" parameter to indicate whether the request is becoming unsafe or is now safe. Because the "msg" parameter wasn't used, we drop that. This resolves the original problem reportedin: http://tracker.ceph.com/issues/4706 Reported-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01ceph: let osd client clean up for interrupted requestAlex Elder1-6/+0
In ceph_sync_write(), if a safe callback is supplied with a request, and an error is returned by ceph_osdc_wait_request(), a block of code is executed to remove the request from the unsafe writes list and drop references to capabilities acquired just prior to a call to ceph_osdc_wait_request(). The only function used for this callback is sync_write_commit(), and it does *exactly* what that block of error handling code does. Now in ceph_osdc_wait_request(), if an error occurs (due to an interupt during a wait_for_completion_interruptible() call), complete_request() gets called, and that calls the request's safe_callback method if it's defined. So this means that this cleanup activity gets called twice in this case, which is erroneous (and in fact leads to a crash). Fix this by just letting the osd client handle the cleanup in the event of an interrupt. This resolves one problem mentioned in: http://tracker.ceph.com/issues/4706 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-01ceph: fix symlink inode operationsYan, Zheng1-0/+6
add getattr/setattr and xattrs related methods. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01ceph: Use pseudo-random numbers to choose mdsSam Lang1-3/+5
We don't need to use up entropy to choose an mds, so use prandom_u32() to get a pseudo-random number. Also, we don't need to choose a random mds if only one mds is available, so add special casing for the common case. Fixes http://tracker.ceph.com/issues/3579 Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01libceph: add, don't set data for a messageAlex Elder1-2/+2
Change the names of the functions that put data on a pagelist to reflect that we're adding to whatever's already there rather than just setting it to the one thing. Currently only one data item is ever added to a message, but that's about to change. This resolves: http://tracker.ceph.com/issues/2770 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: combine initializing and setting osd dataAlex Elder2-9/+6
This ends up being a rather large patch but what it's doing is somewhat straightforward. Basically, this is replacing two calls with one. The first of the two calls is initializing a struct ceph_osd_data with data (either a page array, a page list, or a bio list); the second is setting an osd request op so it associates that data with one of the op's parameters. In place of those two will be a single function that initializes the op directly. That means we sort of fan out a set of the needed functions: - extent ops with pages data - extent ops with pagelist data - extent ops with bio list data and - class ops with page data for receiving a response We also have define another one, but it's only used internally: - class ops with pagelist data for request parameters Note that we *still* haven't gotten rid of the osd request's r_data_in and r_data_out fields. All the osd ops refer to them for their data. For now, these data fields are pointers assigned to the appropriate r_data_* field when these new functions are called. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: specify osd op by index in requestAlex Elder1-1/+1
An osd request now holds all of its source op structures, and every place that initializes one of these is in fact initializing one of the entries in the the osd request's array. So rather than supplying the address of the op to initialize, have caller specify the osd request and an indication of which op it would like to initialize. This better hides the details the op structure (and faciltates moving the data pointers they use). Since osd_req_op_init() is a common routine, and it's not used outside the osd client code, give it static scope. Also make it return the address of the specified op (so all the other init routines don't have to repeat that code). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: add data pointers in osd op structuresAlex Elder2-5/+8
An extent type osd operation currently implies that there will be corresponding data supplied in the data portion of the request (for write) or response (for read) message. Similarly, an osd class method operation implies a data item will be supplied to receive the response data from the operation. Add a ceph_osd_data pointer to each of those structures, and assign it to point to eithre the incoming or the outgoing data structure in the osd message. The data is not always available when an op is initially set up, so add two new functions to allow setting them after the op has been initialized. Begin to make use of the data item pointer available in the osd operation rather than the request data in or out structure in places where it's convenient. Add some assertions to verify pointers are always set the way they're expected to be. This is a sort of stepping stone toward really moving the data into the osd request ops, to allow for some validation before making that jump. This is the first in a series of patches that resolve: http://tracker.ceph.com/issues/4657 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: keep source rather than message osd op arrayAlex Elder2-16/+11
An osd request keeps a pointer to the osd operations (ops) array that it builds in its request message. In order to allow each op in the array to have its own distinct data, we will need to keep track of each op's data, and that information does not go over the wire. As long as we're tracking the data we might as well just track the entire (source) op definition for each of the ops. And if we're doing that, we'll have no more need to keep a pointer to the wire-encoded version. This patch makes the array of source ops be kept with the osd request structure, and uses that instead of the version encoded in the message in places where that was previously used. The array will be embedded in the request structure, and the maximum number of ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP to 2 to reduce the size of the structure. The result of doing this sort of ripples back up, and as a result various function parameters and local variables become unnecessary. Make r_num_ops be unsigned, and move the definition of struct ceph_osd_req_op earlier to ensure it's defined where needed. It does not yet add per-op data, that's coming soon. This resolves: http://tracker.ceph.com/issues/4656 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: a few more osd data cleanupsAlex Elder1-13/+17
These are very small changes that make use osd_data local pointers as shorthands for structures being operated on. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: define osd data initialization helpersAlex Elder2-15/+8
Define and use functions that encapsulate the initializion of a ceph_osd_data structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01ceph: build osd request message later for writepagesAlex Elder1-26/+33
Hold off building the osd request message in ceph_writepages_start() until just before it will be submitted to the osd client for execution. We'll still create the request and allocate the page pointer array after we learn we have at least one page to write. A local variable will be used to keep track of the allocated array of pages. Wait until just before submitting the request for assigning that page array pointer to the request message. Create ands use a new function osd_req_op_extent_update() whose purpose is to serve this one spot where the length value supplied when an osd request's op was initially formatted might need to get changed (reduced, never increased) before submitting the request. Previously, ceph_writepages_start() assigned the message header's data length because of this update. That's no longer necessary, because ceph_osdc_build_request() will recalculate the right value to use based on the content of the ops in the request. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: hold off building osd requestAlex Elder2-5/+6
Defer building the osd request until just before submitting it in all callers except ceph_writepages_start(). (That caller will be handed in the next patch.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01ceph: kill ceph alloc_page_vec()Alex Elder1-27/+18
There is a helper function alloc_page_vec() that, despite its generic sounding name depends heavily on an osd request structure being populated with certain information. There is only one place this function is used, and it ends up being a bit simpler to just open code what it does, so get rid of the helper. The real motivation for this is deferring building the of the osd request message, and this is a step in that direction. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01ceph: define ceph_writepages_osd_request()Alex Elder1-10/+24
Mostly for readability, define ceph_writepages_osd_request() and use it to allocate the osd request for ceph_writepages_start(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: don't build request in ceph_osdc_new_request()Alex Elder2-20/+36
This patch moves the call to ceph_osdc_build_request() out of ceph_osdc_new_request() and into its caller. This is in order to defer formatting osd operation information into the request message until just before request is started. The only unusual (ab)user of ceph_osdc_build_request() is ceph_writepages_start(), where the final length of write request may change (downward) based on the current inode size or the oldest snapshot context with dirty data for the inode. The remaining callers don't change anything in the request after has been built. This means the ops array is now supplied by the caller. It also means there is no need to pass the mtime to ceph_osdc_new_request() (it gets provided to ceph_osdc_build_request()). And rather than passing a do_sync flag, have the number of ops in the ops array supplied imply adding a second STARTSYNC operation after the READ or WRITE requested. This and some of the patches that follow are related to having the messenger (only) be responsible for filling the content of the message header, as described here: http://tracker.ceph.com/issues/4589 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01ceph: use page_offset() in ceph_writepages_start()Alex Elder1-1/+1
There's one spot in ceph_writepages_start() that open-codes what page_offset() does safely. Use the macro so we don't have to worry about wrapping. This resolves: http://tracker.ceph.com/issues/4648 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01ceph: set up page array mempool with correct sizeAlex Elder1-2/+5
In create_fs_client() a memory pool is set up be used for arrays of pages that might be needed in ceph_writepages_start() if memory is tight. There are two problems with the way it's initialized: - The size provided is the number of pages we want in the array, but it should be the number of bytes required for that many page pointers. - The number of pages computed can end up being 0, while we will always need at least one page. This patch fixes both of these problems. This resolves the two simple problems defined in: http://tracker.ceph.com/issues/4603 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: wrap auth ops in wrapper functionsSage Weil1-14/+12
Use wrapper functions that check whether the auth op exists so that callers do not need a bunch of conditional checks. Simplifies the external interface. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01libceph: add update_authorizer auth methodSage Weil1-1/+6
Currently the messenger calls out to a get_authorizer con op, which will create a new authorizer if it doesn't yet have one. In the meantime, when we rotate our service keys, the authorizer doesn't get updated. Eventually it will be rejected by the server on a new connection attempt and get invalidated, and we will then rebuild a new authorizer, but this is not ideal. Instead, if we do have an authorizer, call a new update_authorizer op that will verify that the current authorizer is using the latest secret. If it is not, we will build a new one that does. This avoids the transient failure. This fixes one of the sorry sequence of events for bug http://tracker.ceph.com/issues/4282 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01ceph: fix buffer pointer advance in ceph_sync_writeHenry C Chang1-1/+1
We should advance the user data pointer by _len_ instead of _written_. _len_ is the data length written in each iteration while _written_ is the accumulated data length we have writtent out. Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com> Reviewed-by: Greg Farnum <greg@inktank.com> Tested-by: Sage Weil <sage@inktank.com>
2013-05-01ceph: use i_release_count to indicate dir's completenessYan, Zheng5-49/+45
Current ceph code tracks directory's completeness in two places. ceph_readdir() checks i_release_count to decide if it can set the I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE flag. This indirection introduces locking complexity. This patch adds a new variable i_complete_count to ceph_inode_info. Set i_release_count's value to it when marking a directory complete. By comparing the two variables, we know if a directory is complete Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-01ceph: only set message data pointers if non-emptyAlex Elder1-3/+10
Change it so we only assign outgoing data information for messages if there is outgoing data to send. This then allows us to add a few more (currently commented-out) assertions. This is related to: http://tracker.ceph.com/issues/4284 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01libceph: isolate other message data fieldsAlex Elder1-1/+1
Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and ceph_msg_data_set_trail() to clearly abstract the assignment of the remaining data-related fields in a ceph message structure. Use the new functions in the osd client and mds client. This partially resolves: http://tracker.ceph.com/issues/4263 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: set page info with byte lengthAlex Elder1-1/+1
When setting page array information for message data, provide the byte length rather than the page count ceph_msg_data_set_pages(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: isolate message page field manipulationAlex Elder1-2/+2
Define a function ceph_msg_data_set_pages(), which more clearly abstracts the assignment page-related fields for data in a ceph message structure. Use this new function in the osd client and mds client. Ideally, these fields would never be set more than once (with BUG_ON() calls to guarantee that). At the moment though the osd client sets these every time it receives a message, and in the event of a communication problem this can happen more than once. (This will be resolved shortly, but setting up these helpers first makes it all a bit easier to work with.) Rearrange the field order in a ceph_msg structure to group those that are used to define the possible data payloads. This partially resolves: http://tracker.ceph.com/issues/4263 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: record byte count not page countAlex Elder2-14/+21
Record the byte count for an osd request rather than the page count. The number of pages can always be derived from the byte count (and alignment/offset) but the reverse is not true. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>