summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2011-05-21btrfs: implement delayed inode items operationMiao Xie
Changelog V5 -> V6: - Fix oom when the memory load is high, by storing the delayed nodes into the root's radix tree, and letting btrfs inodes go. Changelog V4 -> V5: - Fix the race on adding the delayed node to the inode, which is spotted by Chris Mason. - Merge Chris Mason's incremental patch into this patch. - Fix deadlock between readdir() and memory fault, which is reported by Itaru Kitayama. Changelog V3 -> V4: - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache inode in time. Changelog V2 -> V3: - Fix the race between the delayed worker and the task which does delayed items balance, which is reported by Tsutomu Itoh. - Modify the patch address David Sterba's comment. - Fix the bug of the cpu recursion spinlock, reported by Chris Mason Changelog V1 -> V2: - break up the global rb-tree, use a list to manage the delayed nodes, which is created for every directory and file, and used to manage the delayed directory name index items and the delayed inode item. - introduce a worker to deal with the delayed nodes. Compare with Ext3/4, the performance of file creation and deletion on btrfs is very poor. the reason is that btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. If we can do some delayed b+ tree insertion or deletion, we can improve the performance, so we made this patch which implemented delayed directory name index insertion/deletion and delayed inode update. Implementation: - introduce a delayed root object into the filesystem, that use two lists to manage the delayed nodes which are created for every file/directory. One is used to manage all the delayed nodes that have delayed items. And the other is used to manage the delayed nodes which is waiting to be dealt with by the work thread. - Every delayed node has two rb-tree, one is used to manage the directory name index which is going to be inserted into b+ tree, and the other is used to manage the directory name index which is going to be deleted from b+ tree. - introduce a worker to deal with the delayed operation. This worker is used to deal with the works of the delayed directory name index items insertion and deletion and the delayed inode update. When the delayed items is beyond the lower limit, we create works for some delayed nodes and insert them into the work queue of the worker, and then go back. When the delayed items is beyond the upper bound, we create works for all the delayed nodes that haven't been dealt with, and insert them into the work queue of the worker, and then wait for that the untreated items is below some threshold value. - When we want to insert a directory name index into b+ tree, we just add the information into the delayed inserting rb-tree. And then we check the number of the delayed items and do delayed items balance. (The balance policy is above.) - When we want to delete a directory name index from the b+ tree, we search it in the inserting rb-tree at first. If we look it up, just drop it. If not, add the key of it into the delayed deleting rb-tree. Similar to the delayed inserting rb-tree, we also check the number of the delayed items and do delayed items balance. (The same to inserting manipulation) - When we want to update the metadata of some inode, we cached the data of the inode into the delayed node. the worker will flush it into the b+ tree after dealing with the delayed insertion and deletion. - We will move the delayed node to the tail of the list after we access the delayed node, By this way, we can cache more delayed items and merge more inode updates. - If we want to commit transaction, we will deal with all the delayed node. - the delayed node will be freed when we free the btrfs inode. - Before we log the inode items, we commit all the directory name index items and the delayed inode update. I did a quick test by the benchmark tool[1] and found we can improve the performance of file creation by ~15%, and file deletion by ~20%. Before applying this patch: Create files: Total files: 50000 Total time: 1.096108 Average time: 0.000022 Delete files: Total files: 50000 Total time: 1.510403 Average time: 0.000030 After applying this patch: Create files: Total files: 50000 Total time: 0.932899 Average time: 0.000019 Delete files: Total files: 50000 Total time: 1.215732 Average time: 0.000024 [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3 Many thanks for Kitayama-san's help! Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dave@jikos.cz> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-18Merge branch 'fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: configfs: Fix race between configfs_readdir() and configfs_d_iput() configfs: Don't try to d_delete() negative dentries. ocfs2/dlm: Target node death during resource migration leads to thread spin ocfs2: Skip mount recovery for hard-ro mounts ocfs2/cluster: Heartbeat mismatch message improved ocfs2/cluster: Increase the live threshold for global heartbeat ocfs2/dlm: Use negotiated o2dlm protocol version ocfs2: skip existing hole when removing the last extent_rec in punching-hole codes. ocfs2: Initialize data_ac (might be used uninitialized)
2011-05-18Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: don't delay blk_run_queue_async scsi: remove performance regression due to async queue run blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup block: rescan partitions on invalidated devices on -ENOMEDIA too cdrom: always check_disk_change() on open block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers
2011-05-18configfs: Fix race between configfs_readdir() and configfs_d_iput()Joel Becker
configfs_readdir() will use the existing inode numbers of inodes in the dcache, but it makes them up for attribute files that aren't currently instantiated. There is a race where a closing attribute file can be tearing down at the same time as configfs_readdir() is trying to get its inode number. We want to get the inode number of open attribute files, because they should match while instantiated. We can't lock down the transition where dentry->d_inode is set to NULL, so we just check for NULL there. We can, however, ensure that an inode we find isn't iput() in configfs_d_iput() until after we've accessed it. Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-18configfs: Don't try to d_delete() negative dentries.Joel Becker
When configfs is faking mkdir() on its subsystem or default group objects, it starts by adding a negative dentry. It then tries to instantiate the group. If that should fail, it must clean up after itself. I was using d_delete() here, but configfs_attach_group() promises to return an empty dentry on error. d_delete() explodes with the entry dentry. Let's try d_drop() instead. The unhashing is what we want for our dentry. Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-17cifs: fix cifsConvertToUCS() for the mapchars caseJeff Layton
As Metze pointed out, commit 84cdf74e broke mapchars option: Commit "cifs: fix unaligned accesses in cifsConvertToUCS" (84cdf74e8096a10dd6acbb870dd404b92f07a756) does multiple steps in just one commit (moving the function and changing it without testing). put_unaligned_le16(temp, &target[j]); is never called for any codepoint the goes via the 'default' switch statement. As a result we put just zero (or maybe uninitialized) bytes into the target buffer. His proposed patch looks correct, but doesn't apply to the current head of the tree. This patch should also fix it. Cc: <stable@kernel.org> # .38.x: 581ade4: cifs: clean up various nits in unicode routines (try #2) Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-17cifs: add fallback in is_path_accessible for old serversJeff Layton
The is_path_accessible check uses a QPathInfo call, which isn't supported by ancient win9x era servers. Fall back to an older SMBQueryInfo call if it fails with the magic error codes. Cc: stable@kernel.org Reported-and-Tested-by: Sandro Bonazzola <sandro.bonazzola@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix FS_IOC_SETFLAGS ioctl Btrfs: fix FS_IOC_GETFLAGS ioctl fs: remove FS_COW_FL Btrfs: fix easily get into ENOSPC in mixed case Prevent oopsing in posix_acl_valid()
2011-05-14Btrfs: fix FS_IOC_SETFLAGS ioctlLi Zefan
Steps to reproduce the bug: - Call FS_IOC_SETLFAGS ioctl with flags=FS_COMPR_FL - Call FS_IOC_SETFLAGS ioctl with flags=0 - Call FS_IOC_GETFLAGS ioctl, and you'll see FS_COMPR_FL is still set! Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14Btrfs: fix FS_IOC_GETFLAGS ioctlLi Zefan
As we've added per file compression/cow support. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14fs: remove FS_COW_FLLi Zefan
FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file COW in btrfs, but FS_NOCOW_FL is sufficient. The fact is we don't have corresponding BTRFS_INODE_COW flag. COW is default, and FS_NOCOW_FL can be used to switch off COW for a single file. If we mount btrfs with nodatacow, a newly created file will be set with the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the FS_NOCOW_FL flag. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14Btrfs: fix easily get into ENOSPC in mixed caseliubo
When a btrfs disk is created by mixed data & metadata option, it will have no pure data or pure metadata space info. In btrfs's for-linus branch, commit 78b1ea13838039cd88afdd62519b40b344d6c920 (Btrfs: fix OOPS of empty filesystem after balance) initializes space infos at the very beginning. The problem is this initialization does not take the mixed case into account, which will cause btrfs will easily get into ENOSPC in mixed case. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-14Prevent oopsing in posix_acl_valid()Daniel J Blueman
If posix_acl_from_xattr() returns an error code, a negative address is dereferenced causing an oops; fix by checking for error code first. Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-13Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFSv4.1: Ensure that layoutget uses the correct gfp modes NFSv4.1: remove pnfs_layout_hdr from pnfs_destroy_all_layouts tmp_list NFSv41: Resend on NFS4ERR_RETRY_UNCACHED_REP
2011-05-13vfs: micro-optimize acl_permission_check()Linus Torvalds
It's a hot function, and we're better off not mixing types in the mask calculations. The compiler just ends up mixing 16-bit and 32-bit operations, for no good reason. So do everything in 'unsigned int' rather than mixing 'unsigned int' masking with a 'umode_t' (16-bit) mode variable. This, together with the parent commit (47a150edc2ae: "Cache user_ns in struct cred") makes acl_permission_check() much nicer. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-13ocfs2/dlm: Target node death during resource migration leads to thread spinSunil Mushran
During resource migration, if the target node were to die, the thread doing the migration spins until the target node is not removed from the domain map. This patch slows the spin by making the thread wait for the recovery to kick in. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13ocfs2: Skip mount recovery for hard-ro mountsSunil Mushran
Patch skips mount recovery for hard-ro mounts which otherwise leads to an oops. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13ocfs2/cluster: Heartbeat mismatch message improvedSunil Mushran
If o2hb finds unexpected values in the heartbeat slot, it prints a message "ERROR: Device "dm-6": another node is heartbeating in our slot!" This message could be misleading. This patch adds two more messages to help users better diagnose the problem. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13ocfs2/cluster: Increase the live threshold for global heartbeatSunil Mushran
We have seen isolated cases (very few, I might add) of o2hb not detecting all live nodes on startup. One plausible reasoning for it is that other node had a hb io delay at the same time. The live threshold set at 2 (as low as it can be) could be increased to ameliorate the situation. But increasing the threshold directly affects mount time. Currently it takes around 5 secs to mount a volume in o2cb cluster with local heartbeat. Increasing the threshold will make mounts even slower. As the issue itself is rare, we have left things as they are for the local heartbeat mode. However we can improve the situation for global heartbeat mode as in that mode, we start the heartbeat much before the mount. This patch doubles the live threshold for the start of the first region in global heartbeat mode. Addresses internal Oracle bug#10635585. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13ocfs2/dlm: Use negotiated o2dlm protocol versionSunil Mushran
Patch fixes a bug in the o2dlm protocol negotiation in that it is using the builtin version rather than the negotiated version during the domain join. This causes join errors when a node having kernel >= 2.6.37 joins a cluster with nodes having kernels < 2.6.37. This only affects the o2cb cluster stack. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Reported-by: Jacek Stepniewski <Jacek.Stepniewski@agora.pl> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13ocfs2: skip existing hole when removing the last extent_rec in punching-hole ↵Tristan Ye
codes. In the case of removing a partial extent record which covers a hole, current punching-hole logic will try to remove more than the length of whole extent record, which leads to the failure of following assert(fs/ocfs2/alloc.c): 5507 BUG_ON(cpos < le32_to_cpu(rec->e_cpos) || trunc_range > rec_range); This patch tries to skip existing hole at the last attempt of removing a partial extent record, what's more, it also adds some necessary comments for better understanding of punching-hole codes. Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-13ocfs2: Initialize data_ac (might be used uninitialized)Marcus Meissner
CLANG found that there is a path that has data_ac uninitialized, this place 2917 /* This gets us the dx_root */ 2918 ret = ocfs2_reserve_new_metadata_blocks(osb, 1, &meta_ac); 2919 if (ret) { 3 Taking true branch 2920 mlog_errno(ret); 2921 goto out; 4 Control jumps to line 3168 2922 } Goes to the out: label without data_ac being initialized. Ciao, Marcus Signed-Off-By: Marcus Meissner <meissner@suse.de> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix oops in revalidate when called with NULL nameidata
2011-05-11NFSv4.1: Ensure that layoutget uses the correct gfp modesTrond Myklebust
Currently, writebacks may end up recursing back into the filesystem due to GFP_KERNEL direct reclaims in the pnfs subsystem. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-05-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: do not use i_wrbuffer_ref as refcount for Fb cap ceph: fix list_add in ceph_put_snap_realm ceph: print debug message before put mds session
2011-05-11NFSv4.1: remove pnfs_layout_hdr from pnfs_destroy_all_layouts tmp_listAndy Adamson
Prevents an infinite loop as list was never emptied. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-05-11NFSv41: Resend on NFS4ERR_RETRY_UNCACHED_REPAndy Adamson
Free the slot and resend the RPC with new session <slot#,seq#>. For nfs4_async_handle_error, return -EAGAIN and set the task->tk_status to 0 to restart the async rpc in the rpc_restart_call_prepare state which resets the slot. For nfs4_handle_exception, retrying a call that uses nfs4_call_sync will reset the slot via nfs41_call_sync_prepare. For open/close/lock/locku/delegreturn/layoutcommit/unlink/rename/write cachethis is true, so these operations will not trigger an NFS4ERR_RETRY_UNCACHED_REP. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-05-11ceph: do not use i_wrbuffer_ref as refcount for Fb capHenry C Chang
We increments i_wrbuffer_ref when taking the Fb cap. This breaks the dirty page accounting and causes looping in __ceph_do_pending_vmtruncate, and ceph client hangs. This bug can be reproduced occasionally by running blogbench. Add a new field i_wb_ref to inode and dedicate it to Fb reference counting. Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11ceph: fix list_add in ceph_put_snap_realmHenry C Chang
Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-11ceph: print debug message before put mds sessionHenry C Chang
The mds session, s, could be freed during ceph_put_mds_session. Move dout before ceph_put_mds_session. Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-10Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: fix race condition in AIL push trigger xfs: make AIL target updates and compares 32bit safe. xfs: always push the AIL to the target xfs: exit AIL push work correctly when AIL is empty xfs: ensure reclaim cursor is reset correctly at end of AG
2011-05-10fuse: fix oops in revalidate when called with NULL nameidataMiklos Szeredi
Some cases (e.g. ecryptfs) can call ->dentry_revalidate with NULL nameidata. https://bugzilla.kernel.org/show_bug.cgi?id=34732 Tyler Hicks pointed out that this bug was introduced by commit e7c0a16786 "fuse: make fuse_dentry_revalidate() RCU aware" Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-05-10nilfs2: fix infinite loop in nilfs_palloc_freev functionRyusuke Konishi
After having applied commit 9954e7af14868b8b ("nilfs2: add free entries count only if clear bit operation succeeded"), a free routine of nilfs came to fall into an infinite loop, outputting the same message endlessly: nilfs_palloc_freev: entry number 29497 already freed nilfs_palloc_freev: entry number 29497 already freed nilfs_palloc_freev: entry number 29497 already freed nilfs_palloc_freev: entry number 29497 already freed nilfs_palloc_freev: entry number 29497 already freed ... That patch broke the routine so that a loop counter is never updated in an abnormal state. This fixes the regression. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2011-05-09xfs: fix race condition in AIL push triggerDave Chinner
The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. One is caused by a race condition in determining whether there is a psh in progress or not. The XFS_AIL_PUSHING_BIT is used to determine whether a push is currently in progress. When the AIL push work completes, it checked whether the target changed and cleared the PUSHING bit to allow a new push to be requeued. The race condition is as follows: Thread 1 push work smp_wmb() smp_rmb() check ailp->xa_target unchanged update ailp->xa_target test/set PUSHING bit does not queue clear PUSHING bit does not requeue Now that the push target is updated, new attempts to push the AIL will not trigger as the push target will be the same, and hence despite trying to push the AIL we won't ever wake it again. The fix is to ensure that the AIL push work clears the PUSHING bit before it checks if the target is unchanged. As a result, both push triggers operate on the same test/set bit criteria, so even if we race in the push work and miss the target update, the thread requesting the push will still set the PUSHING bit and queue the push work to occur. For safety sake, the same queue check is done if the push work detects the target change, though only one of the two will will queue new work due to the use of test_and_set_bit() checks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> (cherry picked from commit e4d3c4a43b595d5124ae824d300626e6489ae857)
2011-05-09xfs: make AIL target updates and compares 32bit safe.Dave Chinner
The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. One of the problems noticed was that updates of the push target are not 32 bit safe as the target is a 64 bit value. We cannot copy a 64 bit LSN without the possibility of corrupting the result when racing with another updating thread. We have function to do this update safely without needing to care about 32/64 bit issues - xfs_trans_ail_copy_lsn() - so use that when updating the AIL push target. Also move the reading of the target in the push work inside the AIL lock, and use XFS_LSN_CMP() for the unlocked comparison during work termination to close read holes as well. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> (cherry picked from commit fd5670f22fce247754243cf2ed41941e5762d990)
2011-05-09xfs: always push the AIL to the targetDave Chinner
The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. One of the problems discovered is a target mismatch between the item pushing loop and the target itself. The push trigger checks for the target increasing (i.e. new target > current) while the push loop only pushes items that have a LSN < current. As a result, we can get the situation where the push target is X, the items at the tail of the AIL have LSN X and they don't get pushed. The push work then completes thinking it is done, and cannot be restarted until the push target increases to >= X + 1. If the push target then never increases (because the tail is not moving), then we never run the push work again and we stall. Fix it by making sure log items with a LSN that matches the target exactly are pushed during the loop. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> (cherry picked from commit cb64026b6e8af50db598ec7c3f59d504259b00bb)
2011-05-09xfs: exit AIL push work correctly when AIL is emptyDave Chinner
The recent conversion of the xfsaild functionality to a work queue introduced a hard-to-hit log space grant hang. The main cause is a regression where a work exit path fails to clear the PUSHING state and recheck the target correctly. Make both exit paths do the same PUSHING bit clearing and target checking when the "no more work to be done" condition is hit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> (cherry picked from commit ea35a20021f8497390d05b93271b4d675516c654)
2011-05-09xfs: ensure reclaim cursor is reset correctly at end of AGDave Chinner
On a 32 bit highmem PowerPC machine, the XFS inode cache was growing without bound and exhausting low memory causing the OOM killer to be triggered. After some effort, the problem was reproduced on a 32 bit x86 highmem machine. The problem is that the per-ag inode reclaim index cursor was not getting reset to the start of the AG if the radix tree tag lookup found no more reclaimable inodes. Hence every further reclaim attempt started at the same index beyond where any reclaimable inodes lay, and no further background reclaim ever occurred from the AG. Without background inode reclaim the VM driven cache shrinker simply cannot keep up with cache growth, and OOM is the result. While the change that exposed the problem was the conversion of the inode reclaim to use work queues for background reclaim, it was not the cause of the bug. The bug was introduced when the cursor code was added, just waiting for some weird configuration to strike.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-By: Christian Kujau <lists@nerdbynature.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> (cherry picked from commit b223221956675ce8a7b436d198ced974bb388571)
2011-05-09Don't lock guardpage if the stack is growing upMikulas Patocka
Linux kernel excludes guard page when performing mlock on a VMA with down-growing stack. However, some architectures have up-growing stack and locking the guard page should be excluded in this case too. This patch fixes lvm2 on PA-RISC (and possibly other architectures with up-growing stack). lvm2 calculates number of used pages when locking and when unlocking and reports an internal error if the numbers mismatch. [ Patch changed fairly extensively to also fix /proc/<pid>/maps for the grows-up case, and to move things around a bit to clean it all up and share the infrstructure with the /proc bits. Tested on ia64 that has both grow-up and grow-down segments - Linus ] Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Tested-by: Tony Luck <tony.luck@gmail.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09Merge branch 'hpfs'Linus Torvalds
* hpfs: HPFS: Remove unused variable HPFS: Move declaration up, so that there are no out-of-scope pointers HPFS: Fix some unaligned accesses HPFS: Fix endianity. Make hpfs work on big-endian machines HPFS: Implement fsync for hpfs HPFS: Fix a bug that filesystem was not marked dirty when remounting it HPFS: Restrict uid and gid to 16-bit values HPFS: When marking or clearing the dirty bit, sync the filesystem HPFS: Use types with defined width HPFS: Remove mark_inode_dirty HPFS: Remove CR/LF conversion option HPFS: Remove remaining locks HPFS: Introduce a global mutex and lock it on every callback from VFS. HPFS: Make HPFS compile on preempt and SMP
2011-05-09HPFS: Remove unused variableMikulas Patocka
Remove unused variable Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09HPFS: Move declaration up, so that there are no out-of-scope pointersMikulas Patocka
Move declaration up, so that there are no out-of-scope pointers Reported-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09HPFS: Fix some unaligned accessesMikulas Patocka
Fix some unaligned accesses Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09HPFS: Fix endianity. Make hpfs work on big-endian machinesMikulas Patocka
Fix endianity. Make hpfs work on big-endian machines. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09HPFS: Implement fsync for hpfsMikulas Patocka
Implement fsync for hpfs. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09HPFS: Fix a bug that filesystem was not marked dirty when remounting itMikulas Patocka
Fix a bug that filesystem was not marked dirty when remounting it Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09HPFS: Restrict uid and gid to 16-bit valuesMikulas Patocka
Restrict uid and gid to 16-bit values. HPFS stores only 2 bytes in the EAs. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09HPFS: When marking or clearing the dirty bit, sync the filesystemMikulas Patocka
When marking or clearing the dirty bit, sync the filesystem Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09HPFS: Use types with defined widthMikulas Patocka
Use types with defined width Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09HPFS: Remove mark_inode_dirtyMikulas Patocka
Remove mark_inode_dirty HPFS doesn't use kernel's dirty inode indicator anyway because writing an inode requires directory's mutex. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>