summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2011-04-18md: fix up raid1/raid10 unplugging.NeilBrown2-28/+20
We just need to make sure that an unplug event wakes up the md thread, which is exactly what mddev_check_plugged does. Also remove some plug-related code that is no longer needed. Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-18md: incorporate new plugging into raid5.NeilBrown1-7/+16
In raid5 plugging is used for 2 things: 1/ collecting writes that require a bitmap update 2/ collecting writes in the hope that we can create full stripes - or at least more-full. We now release these different sets of stripes when plug_cnt is zero. Also in make_request, we call mddev_check_plug to hopefully increase plug_cnt, and wake up the thread at the end if plugging wasn't achieved for some reason. Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-18md: provide generic support for handling unplug callbacks.NeilBrown2-0/+60
When an md device adds a request to a queue, it can call mddev_check_plugged. If this succeeds then we know that the md thread will be woken up shortly, and ->plug_cnt will be non-zero until then, so some processing can be delayed. If it fails, then no unplug callback is expected and the make_request function needs to do whatever is required to make the request happen. Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-18md - remove old plugging code.NeilBrown4-104/+8
md has some plugging infrastructure for RAID5 to use because the normal plugging infrastructure required a 'request_queue', and when called from dm, RAID5 doesn't have one of those available. This relied on the ->unplug_fn callback which doesn't exist any more. So remove all of that code, both in md and raid5. Subsequent patches with restore the plugging functionality. Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-18md/dm - remove remains of plug_fn callback.NeilBrown1-8/+0
Now that unplugging is done differently, the unplug_fn callback is never called, so it can be completely discarded. Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-18md: use new plugging interface for RAID IO.NeilBrown3-1/+10
md/raid submits a lot of IO from the various raid threads. So adding start/finish plug calls to those so that some plugging happens. Signed-off-by: NeilBrown <neilb@suse.de>
2011-04-07Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6Linus Torvalds7-10/+10
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings
2011-04-05dm: improve block integrity supportMike Snitzer1-34/+80
The current block integrity (DIF/DIX) support in DM is verifying that all devices' integrity profiles match during DM device resume (which is past the point of no return). To some degree that is unavoidable (stacked DM devices force this late checking). But for most DM devices (which aren't stacking on other DM devices) the ideal time to verify all integrity profiles match is during table load. Introduce the notion of an "initialized" integrity profile: a profile that was blk_integrity_register()'d with a non-NULL 'blk_integrity' template. Add blk_integrity_is_initialized() to allow checking if a profile was initialized. Update DM integrity support to: - check all devices with _initialized_ integrity profiles match during table load; uninitialized profiles (e.g. for underlying DM device(s) of a stacked DM device) are ignored. - disallow a table load that would result in an integrity profile that conflicts with a DM device's existing (in-use) integrity profile - avoid clearing an existing integrity profile - validate all integrity profiles match during resume; but if they don't all we can do is report the mismatch (during resume we're past the point of no return) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-31Fix common misspellingsLucas De Marchi7-10/+10
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-28md: Fix integrity registration error when no devices are capableMartin K. Petersen1-6/+2
We incorrectly returned -EINVAL when none of the devices in the array had an integrity profile. This in turn prevented mdadm from starting the metadevice. Fix this so we only return errors on mismatched profiles and memory allocation failures. Reported-by: Giacomo Catenazzi <cate@cateee.net> Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dmLinus Torvalds9-28/+300
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: dm stripe: implement merge method dm mpath: allow table load with no priority groups dm mpath: fail message ioctl if specified path is not valid dm ioctl: add flag to wipe buffers for secure data dm ioctl: prepare for crypt key wiping dm crypt: wipe keys string immediately after key is set dm: add flakey target dm: fix opening log and cow devices for read only tables
2011-03-24Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds17-408/+102
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}
2011-03-24dm stripe: implement merge methodMustafa Mesanovic1-1/+22
Implement a merge function in the striped target. When the striped target's underlying devices provide a merge_bvec_fn (like all DM devices do via dm_merge_bvec) it is important to call down to them when building a biovec that doesn't span a stripe boundary. Without the merge method, a striped DM device stacked on DM devices causes bios with a single page to be submitted which results in unnecessary overhead that hurts performance. This change really helps filesystems (e.g. XFS and now ext4) which take care to assemble larger bios. By implementing stripe_merge(), DM and the stripe target no longer undermine the filesystem's work by only allowing a single page per bio. Buffered IO sees the biggest improvement (particularly uncached reads, buffered writes to a lesser degree). This is especially so for more capable "enterprise" storage LUNs. The performance improvement has been measured to be ~12-35% -- when a reasonable chunk_size is used (e.g. 64K) in conjunction with a stripe count that is a power of 2. In contrast, the performance penalty is ~5-7% for the pathological worst case stripe configuration (small chunk_size with a stripe count that is not a power of 2). The reason for this is that stripe_map_sector() is now called once for every call to dm_merge_bvec(). stripe_map_sector() will use slower division if stripe count isn't a power of 2. Signed-off-by: Mustafa Mesanovic <mume@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-24dm mpath: allow table load with no priority groupsMike Snitzer1-3/+10
This patch adjusts the multipath target to allow a table with both 0 priority groups and 0 for the initial priority group number. If any mpath device is held open when all paths in the last priority group have failed, userspace multipathd will attempt to reload the associated DM table to reflect the fact that the device no longer has any priority groups. But the reload attempt always failed because the multipath target did not allow 0 priority groups. All multipath target messages related to priority group (enable_group, disable_group, switch_group) will handle a priority group of 0 (will cause error). When reloading a multipath table with 0 priority groups, userspace multipathd must be updated to specify an initial priority group number of 0 (rather than 1). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Babu Moger <babu.moger@lsi.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-24dm mpath: fail message ioctl if specified path is not validMike Snitzer1-2/+2
Fail the reinstate_path and fail_path message ioctl if the specified path is not valid. The message ioctl would succeed for the 'reinistate_path' and 'fail_path' messages even if action was not taken because the specified device was not a valid path of the multipath device. Before, when /dev/vdb is not a path of mpathb: $ dmsetup message mpathb 0 reinstate_path /dev/vdb $ echo $? 0 After: $ dmsetup message mpathb 0 reinstate_path /dev/vdb device-mapper: message ioctl failed: Invalid argument Command failed $ echo $? 1 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-24dm ioctl: add flag to wipe buffers for secure dataMilan Broz1-2/+21
Add DM_SECURE_DATA_FLAG which userspace can use to ensure that all buffers allocated for dm-ioctl are wiped immediately after use. The user buffer is wiped as well (we do not want to keep and return sensitive data back to userspace if the flag is set). Wiping is useful for cryptsetup to ensure that the key is present in memory only in defined places and only for the time needed. (For crypt, key can be present in table during load or table status, wait and message commands). Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-24dm ioctl: prepare for crypt key wipingMilan Broz1-14/+11
Prepare code for implementing buffer wipe flag. No functional change in this patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-24dm crypt: wipe keys string immediately after key is setMilan Broz1-5/+14
Always wipe the original copy of the key after processing it in crypt_set_key(). Signed-off-by: Milan Broz <mbroz@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-24dm: add flakey targetJosef Bacik3-0/+219
This target is the same as the linear target except that it returns I/O errors periodically. It's been found useful in simulating failing devices for testing purposes. I needed a dm target to do some failure testing on btrfs's raid code, and Mike pointed me at this. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-24dm: fix opening log and cow devices for read only tablesMilan Broz2-2/+2
If a table is read-only, also open any log and cow devices it uses read-only. Previously, even read-only devices were opened read-write internally. After patch 75f1dc0d076d1c1168f2115f1941ea627d38bd5a block: check bdev_read_only() from blkdev_get() was applied, loading such tables began to fail. The patch was reverted by e51900f7d38cbcfb481d84567fd92540e7e1d23a block: revert block_dev read-only check but this patch fixes this part of the code to work with the original patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-03-23dm: use little-endian bitopsAkinobu Mita1-4/+4
As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23md: use little-endian bitopsAkinobu Mita1-3/+3
As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22block: fix non-atomic access to genhd inflight structuresShaohua Li1-3/+4
After the stack plugging introduction, these are called lockless. Ensure that the counters are updated atomically. Signed-off-by: Shaohua Li<shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6Linus Torvalds1-12/+10
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (170 commits) [SCSI] scsi_dh_rdac: Add MD36xxf into device list [SCSI] scsi_debug: add consecutive medium errors [SCSI] libsas: fix ata list corruption issue [SCSI] hpsa: export resettable host attribute [SCSI] hpsa: move device attributes to avoid forward declarations [SCSI] scsi_debug: Logical Block Provisioning (SBC3r26) [SCSI] sd: Logical Block Provisioning update [SCSI] Include protection operation in SCSI command trace [SCSI] hpsa: fix incorrect PCI IDs and add two new ones (2nd try) [SCSI] target: Fix volume size misreporting for volumes > 2TB [SCSI] bnx2fc: Broadcom FCoE offload driver [SCSI] fcoe: fix broken fcoe interface reset [SCSI] fcoe: precedence bug in fcoe_filter_frames() [SCSI] libfcoe: Remove stale fcoe-netdev entries [SCSI] libfcoe: Move FCOE_MTU definition from fcoe.h to libfcoe.h [SCSI] libfc: introduce __fc_fill_fc_hdr that accepts fc_hdr as an argument [SCSI] fcoe, libfc: initialize EM anchors list and then update npiv EMs [SCSI] Revert "[SCSI] libfc: fix exchange being deleted when the abort itself is timed out" [SCSI] libfc: Fixing a memory leak when destroying an interface [SCSI] megaraid_sas: Version and Changelog update ... Fix up trivial conflicts due to whitespace differences in drivers/scsi/libsas/{sas_ata.c,sas_scsi_host.c}
2011-03-17block: Require subsystems to explicitly allocate bio_set integrity mempoolMartin K. Petersen9-19/+35
MD and DM create a new bio_set for every metadevice. Each bio_set has an integrity mempool attached regardless of whether the metadevice is capable of passing integrity metadata. This is a waste of memory. Instead we defer the allocation decision to MD and DM since we know at metadevice creation time whether integrity passthrough is needed or not. Automatic integrity mempool allocation can then be removed from bioset_create() and we make an explicit integrity allocation for the fs_bio_set. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Acked-by: Mike Snitzer <snizer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6Linus Torvalds1-1/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1480 commits) bonding: enable netpoll without checking link status xfrm: Refcount destination entry on xfrm_lookup net: introduce rx_handler results and logic around that bonding: get rid of IFF_SLAVE_INACTIVE netdev->priv_flag bonding: wrap slave state work net: get rid of multiple bond-related netdevice->priv_flags bonding: register slave pointer for rx_handler be2net: Bump up the version number be2net: Copyright notice change. Update to Emulex instead of ServerEngines e1000e: fix kconfig for crc32 dependency netfilter ebtables: fix xt_AUDIT to work with ebtables xen network backend driver bonding: Improve syslog message at device creation time bonding: Call netif_carrier_off after register_netdevice bonding: Incorrect TX queue offset net_sched: fix ip_tos2prio xfrm: fix __xfrm_route_forward() be2net: Fix UDP packet detected status in RX compl Phonet: fix aligned-mode pipe socket buffer header reserve netxen: support for GbE port settings ... Fix up conflicts in drivers/staging/brcm80211/brcmsmac/wl_mac80211.c with the staging updates.
2011-03-16Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-1/+1
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix build failure introduced by s/freezeable/freezable/ workqueue: add system_freezeable_wq rds/ib: use system_wq instead of rds_ib_fmr_wq net/9p: replace p9_poll_task with a work net/9p: use system_wq instead of p9_mux_wq xfs: convert to alloc_workqueue() reiserfs: make commit_wq use the default concurrency level ocfs2: use system_wq instead of ocfs2_quota_wq ext4: convert to alloc_workqueue() scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path scsi/be2iscsi,qla2xxx: convert to alloc_workqueue() misc/iwmc3200top: use system_wq instead of dedicated workqueues i2o: use alloc_workqueue() instead of create_workqueue() acpi: kacpi*_wq don't need WQ_MEM_RECLAIM fs/aio: aio_wq isn't used in memory reclaim path input/tps6507x-ts: use system_wq instead of dedicated workqueue cpufreq: use system_wq instead of dedicated workqueues wireless/ipw2x00: use system_wq instead of dedicated workqueues arm/omap: use system_wq in mailbox workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
2011-03-10Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/coreJens Axboe16-386/+63
Conflicts: block/blk-core.c block/blk-flush.c drivers/md/raid1.c drivers/md/raid10.c drivers/md/raid5.c fs/nilfs2/btnode.c fs/nilfs2/mdt.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10block: kill off REQ_UNPLUGJens Axboe4-9/+5
With the plugging now being explicitly controlled by the submitter, callers need not pass down unplugging hints to the block layer. If they want to unplug, it's because they manually plugged on their own - in which case, they should just unplug at will. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10block: remove per-queue pluggingJens Axboe15-372/+58
Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-03Merge branch 'master' of ↵David S. Miller8-13/+38
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x.h
2011-03-03netlink: kill eff_cap from struct netlink_skb_parmsPatrick McHardy1-1/+1
Netlink message processing in the kernel is synchronous these days, capabilities can be checked directly in security_netlink_recv() from the current process. Signed-off-by: Patrick McHardy <kaber@trash.net> Reviewed-by: James Morris <jmorris@namei.org> [chrisw: update to include pohmelfs and uvesafb] Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24md: Fix - again - partition detection when array becomes activeNeilBrown2-1/+23
Revert b821eaa572fd737faaf6928ba046e571526c36c6 and f3b99be19ded511a1bf05a148276239d9f13eefa When I wrote the first of these I had a wrong idea about the lifetime of 'struct block_device'. It can disappear at any time that the block device is not open if it falls out of the inode cache. So relying on the 'size' recorded with it to detect when the device size has changed and so we need to revalidate, is wrong. Rather, we really do need the 'changed' attribute stored directly in the mddev and set/tested as appropriate. Without this patch, a sequence of: mknod / open / close / unlink (which can cause a block_device to be created and then destroyed) will result in a rescan of the partition table and consequence removal and addition of partitions. Several of these in a row can get udev racing to create and unlink and other code can get confused. With the patch, the rescan is only performed when needed and so there are no races. This is suitable for any stable kernel from 2.6.35. Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
2011-02-21Merge branch 'master' into for-2.6.39Tejun Heo5-49/+106
2011-02-21md: avoid spinlock problem in blk_throtl_exitNeilBrown6-9/+8
blk_throtl_exit assumes that ->queue_lock still exists, so make sure that it does. To do this, we stop redirecting ->queue_lock to conf->device_lock and leave it pointing where it is initialised - __queue_lock. As the blk_plug functions check the ->queue_lock is held, we now take that spin_lock explicitly around the plug functions. We don't need the locking, just the warning removal. This is needed for any kernel with the blk_throtl code, which is which is 2.6.37 and later. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2011-02-16md: correctly handle probe of an 'mdp' device.NeilBrown1-0/+3
'mdp' devices are md devices with preallocated device numbers for partitions. As such it is possible to mknod and open a partition before opening the whole device. this causes md_probe() to be called with a device number of a partition, which in-turn calls mddev_find with such a number. However mddev_find expects the number of a 'whole device' and does the wrong thing with partition numbers. So add code to mddev_find to remove the 'partition' part of a device number and just work with the 'whole device'. This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=28652 Reported-by: hkmaly@bigfoot.com Signed-off-by: NeilBrown <neilb@suse.de> Cc: <stable@kernel.org>
2011-02-16md: don't set_capacity before array is active.NeilBrown1-3/+3
If the desired size of an array is set (via sysfs) before the array is active (which is the normal sequence), we currrently call set_capacity immediately. This means that a subsequent 'open' (as can be caused by some udev-triggers program) will notice the new size and try to probe for partitions. However as the array isn't quite ready yet the read will fail. Then when the array is read, as the size doesn't change again we don't try to re-probe. So when setting array size via sysfs, only call set_capacity if the array is already active. Signed-off-by: NeilBrown <neilb@suse.de>
2011-02-14md: Fix raid1->raid0 takeoverKrzysztof Wojcik1-0/+1
Takeover raid1->raid0 not succeded. Kernel message is shown: "md/raid0:md126: too few disks (1 of 2) - aborting!" Problem was that we weren't updating ->raid_disks for that takeover, unlike all the others. Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-02-12[SCSI] dm mpath: propagate target errors immediatelyHannes Reinecke1-12/+10
DM now has more information about the nature of the underlying storage failure. Path failure is avoided if a request failed due to a target error. Instead the target error is immediately passed up the stack. Discard requests that fail due to non-target errors may now be retried. Errors restricted to the path will be retried or returned if no paths are available, irregarding the no_path_retry setting. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2011-02-08FIX: md: process hangs at wait_barrier after 0->10 takeoverKrzysztof Wojcik1-2/+4
Following symptoms were observed: 1. After raid0->raid10 takeover operation we have array with 2 missing disks. When we add disk for rebuild, recovery process starts as expected but it does not finish- it stops at about 90%, md126_resync process hangs in "D" state. 2. Similar behavior is when we have mounted raid0 array and we execute takeover to raid10. After this when we try to unmount array- it causes process umount hangs in "D" In scenarios above processes hang at the same function- wait_barrier in raid10.c. Process waits in macro "wait_event_lock_irq" until the "!conf->barrier" condition will be true. In scenarios above it never happens. Reason was that at the end of level_store, after calling pers->run, we call mddev_resume. This calls pers->quiesce(mddev, 0) with RAID10, that calls lower_barrier. However raise_barrier hadn't been called on that 'conf' yet, so conf->barrier becomes negative, which is bad. This patch introduces setting conf->barrier=1 after takeover operation. It prevents to become barrier negative after call lower_barrier(). Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-02-08md_make_request: don't touch the bio after calling make_requestChris Mason1-2/+7
md_make_request was calling bio_sectors() for part_stat_add after it was calling the make_request function. This is bad because the make_request function can free the bio and because the bi_size field can change around. The fix here was suggested by Jens Axboe. It saves the sector count before the make_request call. I hit this with CONFIG_DEBUG_PAGEALLOC turned on while trying to break his pretty fusionio card. Cc: <stable@kernel.org> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-02-02md: Don't allow slot_store while resync/recovery is happening.NeilBrown1-0/+3
Activating a spare in an array while resync/recovery is already happening can lead the that spare being marked in-sync when it isn't really. So don't allow the 'slot' to be set (this activating the device) while resync/recovery is happening. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-31md: don't clear curr_resync_completed at end of resync.NeilBrown1-3/+0
There is no need to set this to zero at this point. It will be set to zero by remove_and_add_spares or at the start of md_do_sync at the latest. And setting it to zero before MD_RECOVERY_RUNNING is cleared can make a 'zero' appear briefly in the 'sync_completed' sysfs attribute just as resync is finishing. So simply remove this setting to zero. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-31md: Don't use remove_and_add_spares to remove failed devices from a ↵NeilBrown1-2/+15
read-only array remove_and_add_spares is called in two places where the needs really are very different. remove_and_add_spares should not be called on an array which is about to be reshaped as some extra devices might have been manually added and that would remove them. However if the array is 'read-auto', that will currently happen, which is bad. So in the 'ro != 0' case don't call remove_and_add_spares but simply remove the failed devices as the comment suggests is needed. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-31Add raid1->raid0 takeover supportKrzysztof Wojcik1-0/+40
This patch introduces raid 1 to raid0 takeover operation in kernel space. Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: Neil Brown <neilb@nbeee.brown>
2011-01-31md: Remove the AllReserved flag for component devices.NeilBrown2-10/+5
This flag is not needed and is used badly. Devices that are included in a native-metadata array are reserved exclusively for that array - and currently have AllReserved set. They all are bd_claimed for the rdev and so cannot be shared. Devices that are included in external-metadata arrays can be shared among multiple arrays - providing there is no overlap. These are bd_claimed for md in general - not for a particular rdev. When changing the amount of a device that is used in an array we need to check for overlap. This currently includes a check on AllReserved So even without overlap, sharing with an AllReserved device is not allowed. However the bd_claim usage already precludes sharing with these devices, so the test on AllReserved is not needed. And in fact it is wrong. As this is the only use of AllReserved, simply remove all usage and definition of AllReserved. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-31md: don't abort checking spares as soon as one cannot be added.NeilBrown1-2/+1
As spares can be added manually before a reshape starts, we need to find them all to mark some of them as in_sync. Previously we would abort looking for spares when we found an unallocated spare what could not be added to the array (implying there was no room for new spares). However already-added spares could be later in the list, so we need to keep searching. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-31md: fix the test for finding spares in raid5_start_reshape.NeilBrown1-2/+2
As spares can be added to the array before the reshape is started, we need to find and count them when checking there are enough. The array could have been degraded, so we need to check all devices, no just those out side of the range of devices in the array before the reshape. So instead of checking the index, check the In_sync flag as that reliably tells if the device is a spare or this purpose. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-31md: simplify some 'if' conditionals in raid5_start_reshape.NeilBrown1-27/+28
There are two consecutive 'if' statements. if (mddev->delta_disks >= 0) .... if (mddev->delta_disks > 0) The code in the second is equally valid if delta_disks == 0, and these two statements are the only place that 'added_devices' is used. So make them a single if statement, make added_devices a local variable, and re-indent it all. No functional change. Signed-off-by: NeilBrown <neilb@suse.de>
2011-01-31md: revert change to raid_disks on failure.NeilBrown1-0/+2
If we try to update_raid_disks and it fails, we should put 'delta_disks' back to zero. This is important because some code, such as slot_store, assumes that delta_disks has been validated. Signed-off-by: NeilBrown <neilb@suse.de>