summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2011-01-07fs: use RCU in shrink_dentry_list to reduce lock nestingNick Piggin
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: reduce dcache_inode_lock width in lru scanningNick Piggin
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache reduce prune_one_dentry lockingNick Piggin
prune_one_dentry can avoid quite a bit of locking in the common case where ancestors have an elevated refcount. Alternatively, we could have gone the other way and made fewer trylocks in the case where d_count goes to zero, but is probably less common. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache reduce d_parent lockingNick Piggin
Use RCU to simplify locking in dget_parent. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache rationalise dget variantsNick Piggin
dget_locked was a shortcut to avoid the lazy lru manipulation when we already held dcache_lock (lru manipulation was relatively cheap at that point). However, how that the lru lock is an innermost one, we never hold it at any caller, so the lock cost can now be avoided. We already have well working lazy dcache LRU, so it should be fine to defer LRU manipulations to scan time. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache reduce dcache_inode_lockNick Piggin
dcache_inode_lock can be avoided in d_delete() and d_materialise_unique() in cases where it is not required. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache reduce locking in d_allocNick Piggin
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache reduce dput lockingNick Piggin
It is possible to run dput without taking data structure locks up-front. In many cases where we don't kill the dentry anyway, these locks are not required. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache avoid starvation in dcache multi-step operationsNick Piggin
Long lived dcache "multi-step" operations which retry on rename seq can be starved with a lot of rename activity. If they fail after the 1st pass, take the rename_lock for writing to avoid further starvation. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache remove dcache_lockNick Piggin
dcache_lock no longer protects anything. remove it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: Use rename lock and RCU for multi-step operationsNick Piggin
The remaining usages for dcache_lock is to allow atomic, multi-step read-side operations over the directory tree by excluding modifications to the tree. Also, to walk in the leaf->root direction in the tree where we don't have a natural d_lock ordering. This could be accomplished by taking every d_lock, but this would mean a huge number of locks and actually gets very tricky. Solve this instead by using the rename seqlock for multi-step read-side operations, retry in case of a rename so we don't walk up the wrong parent. Concurrent dentry insertions are not serialised against. Concurrent deletes are tricky when walking up the directory: our parent might have been deleted when dropping locks so also need to check and retry for that. We can also use the rename lock in cases where livelock is a worry (and it is introduced in subsequent patch). Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: increase d_name lock coverageNick Piggin
Cover d_name with d_lock in more cases, where there may be concurrent modification to it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: scale inode alias listNick Piggin
Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list from concurrent modification. d_alias is also protected by d_lock. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache scale subdirsNick Piggin
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't using dcache_lock for these anyway (eg. using i_mutex). Note: if we change the locking rule in future so that ->d_child protection is provided only with ->d_parent->d_lock, it may allow us to reduce some locking. But it would be an exception to an otherwise regular locking scheme, so we'd have to see some good results. Probably not worthwhile. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache scale d_unhashedNick Piggin
Protect d_unhashed(dentry) condition with d_lock. This means keeping DCACHE_UNHASHED bit in synch with hash manipulations. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache scale dentry refcountNick Piggin
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a 0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when we start protecting many other dentry members with d_lock. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache scale lruNick Piggin
Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent modification. d_lru is also protected by d_lock, which allows LRU lists to be accessed without the lru lock, using RCU in future patches. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache scale hashNick Piggin
Add a new lock, dcache_hash_lock, to protect the dcache hash table from concurrent modification. d_hash is also protected by d_lock. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07hostfs: simplify lockingNick Piggin
Remove dcache_lock locking from hostfs filesystem, and move it into dcache helpers. All that is required is a coherent path name. Protection from concurrent modification of the namespace after path name generation is not provided in current code, because dcache_lock is dropped before the path is used. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: change d_hash for rcu-walkNick Piggin
Change d_hash so it may be called from lock-free RCU lookups. See similar patch for d_compare for details. For in-tree filesystems, this is just a mechanical change. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: change d_compare for rcu-walkNick Piggin
Change d_compare so it may be called from lock-free RCU lookups. This does put significant restrictions on what may be done from the callback, however there don't seem to have been any problems with in-tree fses. If some strange use case pops up that _really_ cannot cope with the rcu-walk rules, we can just add new rcu-unaware callbacks, which would cause name lookup to drop out of rcu-walk mode. For in-tree filesystems, this is just a mechanical change. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: name case update methodNick Piggin
smpfs and ncpfs want to update a live dentry name in-place. Rather than have them open code the locking, provide a documented dcache API. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07jfs: dont overwrite dentry name in d_revalidateNick Piggin
Use vfat's method for dealing with negative dentries to preserve case, rather than overwrite dentry name in d_revalidate, which is a bit ugly and also gets in the way of doing lock-free path walking. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07cifs: dont overwrite dentry name in d_revalidateNick Piggin
Use vfat's method for dealing with negative dentries to preserve case, rather than overwrite dentry name in d_revalidate, which is a bit ugly and also gets in the way of doing lock-free path walking. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: change d_delete semanticsNick Piggin
Change d_delete from a dentry deletion notification to a dentry caching advise, more like ->drop_inode. Require it to be constant and idempotent, and not take d_lock. This is how all existing filesystems use the callback anyway. This makes fine grained dentry locking of dput and dentry lru scanning much simpler. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07config fs: avoid switching ->d_op on live dentryNick Piggin
Switching d_op on a live dentry is racy in general, so avoid it. In this case it is a negative dentry, which is safer, but there are still concurrent ops which may be called on d_op in that case (eg. d_revalidate). So in general a filesystem may not do this. Fix configfs so as not to do this. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: use fast counters for vfs cachesNick Piggin
percpu_counter library generates quite nasty code, so unless you need to dynamically allocate counters or take fast approximate value, a simple per cpu set of counters is much better. The percpu_counter can never be made to work as well, because it has an indirection from pointer to percpu memory, and it can't use direct this_cpu_inc interfaces because it doesn't use static PER_CPU data, so code will always be worse. In the fastpath, it is the difference between this: incl %gs:nr_dentry # nr_dentry and this: movl percpu_counter_batch(%rip), %edx # percpu_counter_batch, movl $1, %esi #, movq $nr_dentry, %rdi #, call __percpu_counter_add # (plus I clobber registers) __percpu_counter_add: pushq %rbp # movq %rsp, %rbp #, subq $32, %rsp #, movq %rbx, -24(%rbp) #, movq %r12, -16(%rbp) #, movq %r13, -8(%rbp) #, movq %rdi, %rbx # fbc, fbc #APP # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1 movq %gs:kernel_stack,%rax #, pfo_ret__ # 0 "" 2 #NO_APP incl -8124(%rax) # <variable>.preempt_count movq 32(%rdi), %r12 # <variable>.counters, tcp_ptr__ #APP # 78 "lib/percpu_counter.c" 1 add %gs:this_cpu_off, %r12 # this_cpu_off, tcp_ptr__ # 0 "" 2 #NO_APP movslq (%r12),%r13 #* tcp_ptr__, tmp73 movslq %edx,%rax # batch, batch addq %rsi, %r13 # amount, count cmpq %rax, %r13 # batch, count jge .L27 #, negl %edx # tmp76 movslq %edx,%rdx # tmp76, tmp77 cmpq %rdx, %r13 # tmp77, count jg .L28 #, .L27: movq %rbx, %rdi # fbc, call _raw_spin_lock # addq %r13, 8(%rbx) # count, <variable>.count movq %rbx, %rdi # fbc, movl $0, (%r12) #,* tcp_ptr__ call _raw_spin_unlock # .L29: #APP # 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1 movq %gs:kernel_stack,%rax #, pfo_ret__ # 0 "" 2 #NO_APP decl -8124(%rax) # <variable>.preempt_count movq -8136(%rax), %rax #, D.14625 testb $8, %al #, D.14625 jne .L32 #, .L31: movq -24(%rbp), %rbx #, movq -16(%rbp), %r12 #, movq -8(%rbp), %r13 #, leave ret .p2align 4,,10 .p2align 3 .L28: movl %r13d, (%r12) # count,* jmp .L29 # .L32: call preempt_schedule # .p2align 4,,6 jmp .L31 # .size __percpu_counter_add, .-__percpu_counter_add .p2align 4,,15 Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07vfs: revert per-cpu nr_unused counters for dentry and inodesNick Piggin
The nr_unused counters count the number of objects on an LRU, and as such they are synchronized with LRU object insertion and removal and scanning, and protected under the LRU lock. Making it per-cpu does not actually get any concurrency improvements because of this lock, and summing the counter is much slower, and incrementing/decrementing it costs more code size and is slower too. These counters should stay per-LRU, which currently means global. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: d_validate fixesNick Piggin
d_validate has been broken for a long time. kmem_ptr_validate does not guarantee that a pointer can be dereferenced if it can go away at any time. Even rcu_read_lock doesn't help, because the pointer might be queued in RCU callbacks but not executed yet. So the parent cannot be checked, nor the name hashed. The dentry pointer can not be touched until it can be verified under lock. Hashing simply cannot be used. Instead, verify the parent/child relationship by traversing parent's d_child list. It's slow, but only ncpfs and the destaged smbfs care about it, at this point. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-05Revert "fs: use RCU read side protection in d_validate"Nick Piggin
This reverts commit 3825bdb7ed920845961f32f364454bee5f469abb. You cannot dget() a dentry without having a reference, or holding a lock that guarantees it remains valid. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2010-12-23Merge branch 'upstream-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: ocfs2: Fix system inodes cache overflow. ocfs2: Hold ip_lock when set/clear flags for indexed dir. ocfs2: Adjust masklog flag values Ocfs2: Teach 'coherency=full' O_DIRECT writes to correctly up_read i_alloc_sem. ocfs2/dlm: Migrate lockres with no locks if it has a reference
2010-12-23Merge branch 'linus-hot-fix' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'linus-hot-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix on-line resizing regression
2010-12-23ext4: fix on-line resizing regressionTheodore Ts'o
https://bugzilla.kernel.org/show_bug.cgi?id=25352 This regression was caused by commit a31437b85: "ext4: use sb_issue_zeroout in setup_new_group_blocks", by accidentally dropping the code which reserved the block group descriptor and inode table blocks. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-22logfs: fix "Kernel BUG at readwrite.c:1193"Prasad Joshi
This happens when __logfs_create() tries to write a new inode to the disk which is full. __logfs_create() associates the transaction pointer with inode. During the logfs_write_inode() function call chain this transaction pointer is moved from inode to page->private using function move_inode_to_page (do_write_inode() -> inode_to_page() -> move_inode_to_page) When the write inode fails, the transaction is aborted and iput is called on the failed inode. During delete_inode the same transaction pointer associated with the page is getting used. Thus causing kernel BUG. The patch checks for error in write_inode() and restores the page->private to NULL. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20162 Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com> Cc: Joern Engel <joern@logfs.org> Cc: Florian Mickler <florian@mickler.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22logfs: fix deadlock in logfs_get_wblocks, hold and wait on super->s_write_mutexPrasad Joshi
do_logfs_journal_wl_pass() should use GFP_NOFS for memory allocation GC code calls btree_insert32 with GFP_KERNEL while holding a mutex super->s_write_mutex. The same mutex is used in address_space_operations->writepage(), and a call to writepage() could be triggered as a result of memory allocation in btree_insert32, causing a deadlock. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20342 Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com> Cc: Joern Engel <joern@logfs.org> Cc: Florian Mickler <florian@mickler.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22ocfs2: Fix system inodes cache overflow.Tao Ma
When we store system inodes cache in ocfs2_super, we use a array for global system inodes. But unfortunately, the range is calculated wrongly which makes it overflow and pollute ocfs2_super->local_system_inodes. This patch fix it by setting the range properly. The corresponding bug is ossbug1303. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1303 Cc: stable@kernel.org Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-20Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: handle partial result from get_user_pages ceph: mark user pages dirty on direct-io reads ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport ceph: fix direct-io on non-page-aligned buffers ceph: fix msgr_init error path
2010-12-20Fix btrfs b0rkageAl Viro
Buggered-in: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-17ceph: mark user pages dirty on direct-io readsHenry C Chang
For read operation, we have to set the argument _write_ of get_user_pages to 1 since we will write data to pages. Also, we need to SetPageDirty before releasing these pages. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17ceph: fix null pointer dereference in ceph_init_dentry for nfs reexportSage Weil
The fh_to_dentry etc. methods use ceph_init_dentry(), which assumes that d_parent is defined. It isn't for those callers, so check! Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-16Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notifyLinus Torvalds
* 'for-linus' of git://git.infradead.org/users/eparis/notify: fanotify: fill in the metadata_len field on struct fanotify_event_metadata fanotify: split version into version and metadata_len fanotify: Dont try to open a file descriptor for the overflow event fanotify: Introduce FAN_NOFD fanotify: do not leak user reference on allocation failure inotify: stop kernel memory leak on file creation failure fanotify: on group destroy allow all waiters to bypass permission check fanotify: Dont allow a mask of 0 if setting or removing a mark fanotify: correct broken ref counting in case adding a mark failed fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called fanotify: remove packed from access response message fanotify: deny permissions when no event was sent
2010-12-16ocfs2: Hold ip_lock when set/clear flags for indexed dir.Tao Ma
When we set/clear the dyn_features for an inode we hold the ip_lock. So do it when we set/clear OCFS2_INDEXED_DIR_FL also. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16ocfs2: Adjust masklog flag valuesSunil Mushran
Two masklogs had the same flag value. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16nilfs2: fix regression of garbage collection ioctlRyusuke Konishi
On 2.6.37-rc1, garbage collection ioctl of nilfs was broken due to the commit 263d90cefc7d82a0 ("nilfs2: remove own inode hash used for GC"), and leading to filesystem corruption. The patch doesn't queue gc-inodes for log writer if they are reused through the vfs inode cache. Here, gc-inode is the inode which buffers blocks to be relocated on GC. That patch queues gc-inodes in nilfs_init_gcinode() function, but this function is not called when they don't have I_NEW flag. Thus, some of live blocks are wrongly overrode without being moved to new logs. This resolves the problem by moving the gc-inode queueing to an outer function to ensure it's done right. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-12-15ceph: fix direct-io on non-page-aligned buffersHenry C Chang
The user buffer may be 512-byte aligned, not page-aligned. We were assuming the buffer was page-aligned and only accounting for non-page-aligned io offsets. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-15Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix typo which broke '..' detection in ext4_find_entry() ext4: Turn off multiple page-io submission by default
2010-12-15install_special_mapping skips security_file_mmap check.Tavis Ormandy
The install_special_mapping routine (used, for example, to setup the vdso) skips the security check before insert_vm_struct, allowing a local attacker to bypass the mmap_min_addr security restriction by limiting the available pages for special mappings. bprm_mm_init() also skips the check, and although I don't think this can be used to bypass any restrictions, I don't see any reason not to have the security check. $ uname -m x86_64 $ cat /proc/sys/vm/mmap_min_addr 65536 $ cat install_special_mapping.s section .bss resb BSS_SIZE section .text global _start _start: mov eax, __NR_pause int 0x80 $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o $ ./install_special_mapping & [1] 14303 $ cat /proc/14303/maps 0000f000-00010000 r-xp 00000000 00:00 0 [vdso] 00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping 00011000-ffffe000 rwxp 00000000 00:00 0 [stack] It's worth noting that Red Hat are shipping with mmap_min_addr set to 4096. Signed-off-by: Tavis Ormandy <taviso@google.com> Acked-by: Kees Cook <kees@ubuntu.com> Acked-by: Robert Swiecki <swiecki@google.com> [ Changed to not drop the error code - akpm ] Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-15fanotify: fill in the metadata_len field on struct fanotify_event_metadataEric Paris
The fanotify_event_metadata now has a field which is supposed to indicate the length of the metadata portion of the event. Fill in that field as well. Based-in-part-on-patch-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-14ext4: fix typo which broke '..' detection in ext4_find_entry()Aaro Koskinen
There should be a check for the NUL character instead of '0'. Fortunately the only thing that cares about this is NFS serving, which is why we didn't notice this in the merge window testing. Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14ext4: Turn off multiple page-io submission by defaultTheodore Ts'o
Jon Nelson has found a test case which causes postgresql to fail with the error: psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581 Under memory pressure, it looks like part of a file can end up getting replaced by zero's. Until we can figure out the cause, we'll roll back the change and use block_write_full_page() instead of ext4_bio_write_page(). The new, more efficient writing function can be used via the mount option mblk_io_submit, so we can test and fix the new page I/O code. To reproduce the problem, install postgres 8.4 or 9.0, and pin enough memory such that the system just at the end of triggering writeback before running the following sql script: begin; create temporary table foo as select x as a, ARRAY[x] as b FROM generate_series(1, 10000000 ) AS x; create index foo_a_idx on foo (a); create index foo_b_idx on foo USING GIN (b); rollback; If the temporary table is created on a hard drive partition which is encrypted using dm_crypt, then under memory pressure, approximately 30-40% of the time, pgsql will issue the above failure. This patch should fix this problem, and the problem will come back if the file system is mounted with the mblk_io_submit mount option. Reported-by: Jon Nelson <jnelson@jamponi.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>