summaryrefslogtreecommitdiff
path: root/fs/exec.c
AgeCommit message (Collapse)Author
2021-02-03Merge 4.4.255 into android-4.4-pGreg Kroah-Hartman
Changes in 4.4.255 ACPI: sysfs: Prefer "compatible" modalias wext: fix NULL-ptr-dereference with cfg80211's lack of commit() net: usb: qmi_wwan: added support for Thales Cinterion PLSx3 modem family KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[] mt7601u: fix kernel crash unplugging the device mt7601u: fix rx buffer refcounting y2038: futex: Move compat implementation into futex.c futex: Move futex exit handling into futex code futex: Replace PF_EXITPIDONE with a state exit/exec: Seperate mm_release() futex: Split futex_mm_release() for exit/exec futex: Set task::futex_state to DEAD right after handling futex exit futex: Mark the begin of futex exit explicitly futex: Sanitize exit state handling futex: Provide state handling for exec() as well futex: Add mutex around futex exit futex: Provide distinct return value when owner is exiting futex: Prevent exit livelock ARM: imx: build suspend-imx6.S with arm instruction set netfilter: nft_dynset: add timeout extension to template xfrm: Fix oops in xfrm_replay_advance_bmp RDMA/cxgb4: Fix the reported max_recv_sge value mac80211: pause TX while changing interface type can: dev: prevent potential information leak in can_fill_info() iommu/vt-d: Gracefully handle DMAR units with no supported address widths iommu/vt-d: Don't dereference iommu_device if IOMMU_API is not built NFC: fix resource leak when target index is invalid NFC: fix possible resource leak Linux 4.4.255 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I33bac27670b3cd649e2c8c1ce42efff148f8f202
2021-02-03exit/exec: Seperate mm_release()Thomas Gleixner
commit 4610ba7ad877fafc0a25a30c6c82015304120426 upstream. mm_release() contains the futex exit handling. mm_release() is called from do_exit()->exit_mm() and from exec()->exec_mm(). In the exit_mm() case PF_EXITING and the futex state is updated. In the exec_mm() case these states are not touched. As the futex exit code needs further protections against exit races, this needs to be split into two functions. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20Merge 4.4.224 into android-4.4-pGreg Kroah-Hartman
Changes in 4.4.224 USB: serial: qcserial: Add DW5816e support Revert "net: phy: Avoid polling PHY with PHY_IGNORE_INTERRUPTS" dp83640: reverse arguments to list_add_tail net/mlx4_core: Fix use of ENOSPC around mlx4_counter_alloc() sch_sfq: validate silly quantum values sch_choke: avoid potential panic in choke_reset() Revert "ACPI / video: Add force_native quirk for HP Pavilion dv6" enic: do not overwrite error code ipv6: fix cleanup ordering for ip6_mr failure binfmt_elf: move brk out of mmap when doing direct loader exec x86/apm: Don't access __preempt_count with zeroed fs Revert "IB/ipoib: Update broadcast object if PKey value was changed in index 0" USB: uas: add quirk for LaCie 2Big Quadra USB: serial: garmin_gps: add sanity checking for data length batman-adv: fix batadv_nc_random_weight_tq scripts/decodecode: fix trapping instruction formatting phy: micrel: Ensure interrupts are reenabled on resume binfmt_elf: Do not move brk for INTERP-less ET_EXEC ext4: add cond_resched() to ext4_protect_reserved_inode net: ipv6: add net argument to ip6_dst_lookup_flow net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup blktrace: Fix potential deadlock between delete & sysfs ops blktrace: fix unlocked access to init/start-stop/teardown blktrace: fix trace mutex deadlock blktrace: Protect q->blk_trace with RCU blktrace: fix dereference after null check ptp: do not explicitly set drvdata in ptp_clock_register() ptp: use is_visible method to hide unused attributes ptp: create "pins" together with the rest of attributes chardev: add helper function to register char devs with a struct device ptp: Fix pass zero to ERR_PTR() in ptp_clock_register ptp: fix the race between the release of ptp_clock and cdev ptp: free ptp device pin descriptors properly net: handle no dst on skb in icmp6_send net/sonic: Fix a resource leak in an error handling path in 'jazz_sonic_probe()' net: moxa: Fix a potential double 'free_irq()' drop_monitor: work around gcc-10 stringop-overflow warning scsi: sg: add sg_remove_request in sg_write spi: spi-dw: Add lock protect dw_spi rx/tx to prevent concurrent calls cifs: Check for timeout on Negotiate stage cifs: Fix a race condition with cifs_echo_request dmaengine: pch_dma.c: Avoid data race between probe and irq handler dmaengine: mmp_tdma: Reset channel error on release drm/qxl: lost qxl_bo_kunmap_atomic_page in qxl_image_init_helper() ipc/util.c: sysvipc_find_ipc() incorrectly updates position index net: openvswitch: fix csum updates for MPLS actions gre: do not keep the GRE header around in collect medata mode mm/memory_hotplug.c: fix overflow in test_pages_in_a_zone() scsi: qla2xxx: Avoid double completion of abort command i40e: avoid NVM acquire deadlock during NVM update net/mlx5: Fix driver load error flow when firmware is stuck netfilter: conntrack: avoid gcc-10 zero-length-bounds warning IB/mlx4: Test return value of calls to ib_get_cached_pkey pnp: Use list_for_each_entry() instead of open coding gcc-10 warnings: fix low-hanging fruit kbuild: compute false-positive -Wmaybe-uninitialized cases in Kconfig Stop the ad-hoc games with -Wno-maybe-initialized gcc-10: disable 'zero-length-bounds' warning for now gcc-10: disable 'array-bounds' warning for now gcc-10: disable 'stringop-overflow' warning for now gcc-10: disable 'restrict' warning for now block: defer timeouts to a workqueue blk-mq: Allow timeouts to run while queue is freezing blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter blk-mq: Allow blocking queue tag iter callbacks x86/paravirt: Remove the unused irq_enable_sysexit pv op gcc-10: avoid shadowing standard library 'free()' in crypto net: fix a potential recursive NETDEV_FEAT_CHANGE net: ipv4: really enforce backoff for redirects netlabel: cope with NULL catmap ALSA: hda/realtek - Limit int mic boost for Thinkpad T530 ALSA: rawmidi: Fix racy buffer resize under concurrent accesses ALSA: rawmidi: Initialize allocated buffers USB: gadget: fix illegal array access in binding with UDC ARM: dts: imx27-phytec-phycard-s-rdk: Fix the I2C1 pinctrl entries x86: Fix early boot crash on gcc-10, third try exec: Move would_dump into flush_old_exec usb: gadget: net2272: Fix a memory leak in an error handling path in 'net2272_plat_probe()' usb: gadget: audio: Fix a missing error return value in audio_bind() usb: gadget: legacy: fix error return code in gncm_bind() usb: gadget: legacy: fix error return code in cdc_bind() Revert "ALSA: hda/realtek: Fix pop noise on ALC225" ARM: dts: r8a7740: Add missing extal2 to CPG node KVM: x86: Fix off-by-one error in kvm_vcpu_ioctl_x86_setup_mce Makefile: disallow data races on gcc-10 as well scsi: iscsi: Fix a potential deadlock in the timeout handler Linux 4.4.224 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I384313d39dead8b0babb144803269033f4aacc53
2020-05-20exec: Move would_dump into flush_old_execEric W. Biederman
commit f87d1c9559164294040e58f5e3b74a162bf7c6e8 upstream. I goofed when I added mm->user_ns support to would_dump. I missed the fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and binfmt_script bprm->file is reassigned. Which made the move of would_dump from setup_new_exec to __do_execve_file before exec_binprm incorrect as it can result in would_dump running on the script instead of the interpreter of the script. The net result is that the code stopped making unreadable interpreters undumpable. Which allows them to be ptraced and written to disk without special permissions. Oops. The move was necessary because the call in set_new_exec was after bprm->mm was no longer valid. To correct this mistake move the misplaced would_dump from __do_execve_file into flos_old_exec, before exec_mmap is called. I tested and confirmed that without this fix I can attach with gdb to a script with an unreadable interpreter, and with this fix I can not. Cc: stable@vger.kernel.org Fixes: f84df2a6f268 ("exec: Ensure mm->user_ns contains the execed files") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-24Merge 4.4.220 into android-4.4-pGreg Kroah-Hartman
Changes in 4.4.220 bus: sunxi-rsb: Return correct data when mixing 16-bit and 8-bit reads net: vxge: fix wrong __VA_ARGS__ usage qlcnic: Fix bad kzalloc null test i2c: st: fix missing struct parameter description irqchip/versatile-fpga: Handle chained IRQs properly selftests/x86/ptrace_syscall_32: Fix no-vDSO segfault libata: Remove extra scsi_host_put() in ata_scsi_add_hosts() gfs2: Don't demote a glock until its revokes are written x86/boot: Use unsigned comparison for addresses locking/lockdep: Avoid recursion in lockdep_count_{for,back}ward_deps() btrfs: remove a BUG_ON() from merge_reloc_roots() btrfs: track reloc roots based on their commit root bytenr misc: rtsx: set correct pcr_ops for rts522A ASoC: fix regwmask ASoC: dapm: connect virtual mux with default value ASoC: dpcm: allow start or stop during pause for backend ASoC: topology: use name_prefix for new kcontrol usb: gadget: f_fs: Fix use after free issue as part of queue failure usb: gadget: composite: Inform controller driver of self-powered ALSA: usb-audio: Add mixer workaround for TRX40 and co ALSA: hda: Add driver blacklist ALSA: hda: Fix potential access overflow in beep helper ALSA: ice1724: Fix invalid access for enumerated ctl items ALSA: pcm: oss: Fix regression by buffer overflow fix acpi/x86: ignore unspecified bit positions in the ACPI global lock field thermal: devfreq_cooling: inline all stubs for CONFIG_DEVFREQ_THERMAL=n KEYS: reaching the keys quotas correctly irqchip/versatile-fpga: Apply clear-mask earlier MIPS: OCTEON: irq: Fix potential NULL pointer dereference ath9k: Handle txpower changes even when TPC is disabled signal: Extend exec_id to 64bits x86/entry/32: Add missing ASM_CLAC to general_protection entry KVM: x86: Allocate new rmap and large page tracking when moving memslot crypto: mxs-dcp - fix scatterlist linearization for hash futex: futex_wake_op, do not fail on invalid op xen-netfront: Rework the fix for Rx stall during OOM and network stress ALSA: hda: Initialize power_state field properly Btrfs: incremental send, fix invalid memory access IB/ipoib: Fix lockdep issue found on ipoib_ib_dev_heavy_flush scsi: zfcp: fix missing erp_lock in port recovery trigger for point-to-point arm64: armv8_deprecated: Fix undef_hook mask for thumb setend ext4: fix a data race at inode->i_blocks ocfs2: no need try to truncate file beyond i_size s390/diag: fix display of diagnose call statistics Input: i8042 - add Acer Aspire 5738z to nomux list kmod: make request_module() return an error when autoloading is disabled hfsplus: fix crash and filesystem corruption when deleting files libata: Return correct status in sata_pmp_eh_recover_pm() when ATA_DFLAG_DETACH is set powerpc/64/tm: Don't let userspace set regs->trap via sigreturn Btrfs: fix crash during unmount due to race with delayed inode workers drm/dp_mst: Fix clearing payload state on topology disable ipmi: fix hung processes in __get_guid() powerpc/fsl_booke: Avoid creating duplicate tlb1 entry misc: echo: Remove unnecessary parentheses and simplify check for zero mfd: dln2: Fix sanity checking for endpoints net: ipv4: devinet: Fix crash when add/del multicast IP with autojoin net: ipv6: do not consider routes via gateways for anycast address check scsi: ufs: Fix ufshcd_hold() caused scheduling while atomic jbd2: improve comments about freeing data buffers whose page mapping is NULL ext4: fix incorrect group count in ext4_fill_super error message ext4: fix incorrect inodes per group in error message ASoC: Intel: mrfld: fix incorrect check on p->sink ASoC: Intel: mrfld: return error codes when an error occurs ALSA: usb-audio: Don't override ignore_ctl_error value from the map mac80211_hwsim: Use kstrndup() in place of kasprintf() ext4: do not zeroout extents beyond i_disksize dm flakey: check for null arg_name in parse_features() kvm: x86: Host feature SSBD doesn't imply guest feature SPEC_CTRL_SSBD x86/mitigations: Clear CPU buffers on the SYSCALL fast path tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation scsi: sg: add sg_remove_request in sg_common_write ALSA: hda: Don't release card at firmware loading error video: fbdev: sis: Remove unnecessary parentheses and commented code drm: NULL pointer dereference [null-pointer-deref] (CWE 476) problem wil6210: increase firmware ready timeout wil6210: fix temperature debugfs scsi: ufs: ufs-qcom: remove broken hci version quirk wil6210: rate limit wil_rx_refill error rtc: pm8xxx: Fix issue in RTC write path soc: qcom: smem: Use le32_to_cpu for comparison of: fix missing kobject init for !SYSFS && OF_DYNAMIC config of: unittest: kmemleak in of_unittest_platform_populate() clk: at91: usb: continue if clk_hw_round_rate() return zero clk: tegra: Fix Tegra PMC clock out parents NFS: direct.c: Fix memory leak of dreq when nfs_get_lock_context fails ext4: do not commit super on read-only bdev percpu_counter: fix a data race at vm_committed_as compiler.h: fix error in BUILD_BUG_ON() reporting NFS: Fix memory leaks in nfs_pageio_stop_mirroring() ext2: fix empty body warnings when -Wextra is used iommu/amd: Fix the configuration of GCR3 table root pointer fbdev: potential information leak in do_fb_ioctl() tty: evh_bytechan: Fix out of bounds accesses locktorture: Print ratio of acquisitions, not failures mtd: lpddr: Fix a double free in probe() mtd: phram: fix a double free issue in error path x86/CPU: Add native CPUID variants returning a single datum x86/microcode/intel: replace sync_core() with native_cpuid_reg(eax) x86/vdso: Fix lsl operand order Linux 4.4.220 Change-Id: Ic931642c95ad95eb2755c3c20f6802e04283e68b Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2020-04-24signal: Extend exec_id to 64bitsEric W. Biederman
commit d1e7fd6462ca9fc76650fbe6ca800e35b24267da upstream. Replace the 32bit exec_id with a 64bit exec_id to make it impossible to wrap the exec_id counter. With care an attacker can cause exec_id wrap and send arbitrary signals to a newly exec'd parent. This bypasses the signal sending checks if the parent changes their credentials during exec. The severity of this problem can been seen that in my limited testing of a 32bit exec_id it can take as little as 19s to exec 65536 times. Which means that it can take as little as 14 days to wrap a 32bit exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7 days. Even my slower timing is in the uptime of a typical server. Which means self_exec_id is simply a speed bump today, and if exec gets noticably faster self_exec_id won't even be a speed bump. Extending self_exec_id to 64bits introduces a problem on 32bit architectures where reading self_exec_id is no longer atomic and can take two read instructions. Which means that is is possible to hit a window where the read value of exec_id does not match the written value. So with very lucky timing after this change this still remains expoiltable. I have updated the update of exec_id on exec to use WRITE_ONCE and the read of exec_id in do_notify_parent to use READ_ONCE to make it clear that there is no locking between these two locations. Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl Fixes: 2.3.23pre2 Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04Merge 4.4.187 into android-4.4-pGreg Kroah-Hartman
Changes in 4.4.187 MIPS: ath79: fix ar933x uart parity mode MIPS: fix build on non-linux hosts dmaengine: imx-sdma: fix use-after-free on probe error path ath10k: Do not send probe response template for mesh ath9k: Check for errors when reading SREV register ath6kl: add some bounds checking ath: DFS JP domain W56 fixed pulse type 3 RADAR detection batman-adv: fix for leaked TVLV handler. media: dvb: usb: fix use after free in dvb_usb_device_exit crypto: talitos - fix skcipher failure due to wrong output IV media: marvell-ccic: fix DMA s/g desc number calculation media: vpss: fix a potential NULL pointer dereference net: stmmac: dwmac1000: Clear unused address entries signal/pid_namespace: Fix reboot_pid_ns to use send_sig not force_sig af_key: fix leaks in key_pol_get_resp and dump_sp. xfrm: Fix xfrm sel prefix length validation media: staging: media: davinci_vpfe: - Fix for memory leak if decoder initialization fails. net: phy: Check against net_device being NULL tua6100: Avoid build warnings. locking/lockdep: Fix merging of hlocks with non-zero references media: wl128x: Fix some error handling in fm_v4l2_init_video_device() cpupower : frequency-set -r option misses the last cpu in related cpu list net: fec: Do not use netdev messages too early net: axienet: Fix race condition causing TX hang s390/qdio: handle PENDING state for QEBSM devices perf test 6: Fix missing kvm module load for s390 gpio: omap: fix lack of irqstatus_raw0 for OMAP4 gpio: omap: ensure irq is enabled before wakeup regmap: fix bulk writes on paged registers bpf: silence warning messages in core rcu: Force inlining of rcu_read_lock() xfrm: fix sa selector validation perf evsel: Make perf_evsel__name() accept a NULL argument vhost_net: disable zerocopy by default EDAC/sysfs: Fix memory leak when creating a csrow object media: i2c: fix warning same module names ntp: Limit TAI-UTC offset timer_list: Guard procfs specific code acpi/arm64: ignore 5.1 FADTs that are reported as 5.0 media: coda: fix mpeg2 sequence number handling media: coda: increment sequence offset for the last returned frame mt7601u: do not schedule rx_tasklet when the device has been disconnected x86/build: Add 'set -e' to mkcapflags.sh to delete broken capflags.c mt7601u: fix possible memory leak when the device is disconnected ath10k: fix PCIE device wake up failed rslib: Fix decoding of shortened codes rslib: Fix handling of of caller provided syndrome ixgbe: Check DDM existence in transceiver before access EDAC: Fix global-out-of-bounds write when setting edac_mc_poll_msec bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush() Bluetooth: hci_bcsp: Fix memory leak in rx_skb Bluetooth: 6lowpan: search for destination address in all peers Bluetooth: Check state in l2cap_disconnect_rsp Bluetooth: validate BLE connection interval updates crypto: ghash - fix unaligned memory access in ghash_setkey() crypto: arm64/sha1-ce - correct digest for empty data in finup crypto: arm64/sha2-ce - correct digest for empty data in finup Input: gtco - bounds check collection indent level regulator: s2mps11: Fix buck7 and buck8 wrong voltages tracing/snapshot: Resize spare buffer if size changed NFSv4: Handle the special Linux file open access mode lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE ALSA: seq: Break too long mutex context in the write loop media: v4l2: Test type instead of cfg->type in v4l2_ctrl_new_custom() media: coda: Remove unbalanced and unneeded mutex unlock KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed drm/nouveau/i2c: Enable i2c pads & busses during preinit padata: use smp_mb in padata_reorder to avoid orphaned padata jobs 9p/virtio: Add cleanup path in p9_virtio_init PCI: Do not poll for PME if the device is in D3cold take floppy compat ioctls to sodding floppy.c floppy: fix div-by-zero in setup_format_params floppy: fix out-of-bounds read in next_valid_format floppy: fix invalid pointer dereference in drive_name floppy: fix out-of-bounds read in copy_buffer coda: pass the host file in vma->vm_file on mmap gpu: ipu-v3: ipu-ic: Fix saturation bit offset in TPMEM parisc: Fix kernel panic due invalid values in IAOQ0 or IAOQ1 powerpc/32s: fix suspend/resume when IBATs 4-7 are used powerpc/watchpoint: Restore NV GPRs while returning from exception eCryptfs: fix a couple type promotion bugs intel_th: msu: Fix single mode with disabled IOMMU Bluetooth: Add SMP workaround Microsoft Surface Precision Mouse bug usb: Handle USB3 remote wakeup for LPM enabled devices correctly dm bufio: fix deadlock with loop device bnx2x: Prevent load reordering in tx completion processing caif-hsi: fix possible deadlock in cfhsi_exit_module() ipv4: don't set IPv6 only flags to IPv4 addresses net: bcmgenet: use promisc for unsupported filters net: neigh: fix multiple neigh timer scheduling nfc: fix potential illegal memory access sky2: Disable MSI on ASUS P6T netrom: fix a memory leak in nr_rx_frame() netrom: hold sock when setting skb->destructor tcp: Reset bytes_acked and bytes_received when disconnecting bonding: validate ip header before check IPPROTO_IGMP net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 query net: bridge: stp: don't cache eth dest pointer before skb pull elevator: fix truncation of icq_cache_name NFSv4: Fix open create exclusive when the server reboots nfsd: increase DRC cache limit nfsd: give out fewer session slots as limit approaches nfsd: fix performance-limiting session calculation nfsd: Fix overflow causing non-working mounts on 1 TB machines drm/panel: simple: Fix panel_simple_dsi_probe usb: core: hub: Disable hub-initiated U1/U2 tty: max310x: Fix invalid baudrate divisors calculator pinctrl: rockchip: fix leaked of_node references tty: serial: cpm_uart - fix init when SMC is relocated memstick: Fix error cleanup path of memstick_init tty/serial: digicolor: Fix digicolor-usart already registered warning tty: serial: msm_serial: avoid system lockup condition drm/virtio: Add memory barriers for capset cache. phy: renesas: rcar-gen2: Fix memory leak at error paths usb: gadget: Zero ffs_io_data powerpc/pci/of: Fix OF flags parsing for 64bit BARs PCI: sysfs: Ignore lockdep for remove attribute iio: iio-utils: Fix possible incorrect mask calculation recordmcount: Fix spurious mcount entries on powerpc mfd: core: Set fwnode for created devices mfd: arizona: Fix undefined behavior um: Silence lockdep complaint about mmap_sem powerpc/4xx/uic: clear pending interrupt after irq type/pol change serial: sh-sci: Fix TX DMA buffer flushing and workqueue races kallsyms: exclude kasan local symbols on s390 perf test mmap-thread-lookup: Initialize variable to suppress memory sanitizer warning f2fs: avoid out-of-range memory access mailbox: handle failed named mailbox channel request powerpc/eeh: Handle hugepages in ioremap space sh: prevent warnings when using iounmap mm/kmemleak.c: fix check for softirq context 9p: pass the correct prototype to read_cache_page mm/mmu_notifier: use hlist_add_head_rcu() locking/lockdep: Fix lock used or unused stats error locking/lockdep: Hide unused 'class' variable usb: wusbcore: fix unbalanced get/put cluster_id usb: pci-quirks: Correct AMD PLL quirk detection x86/sysfb_efi: Add quirks for some devices with swapped width and height x86/speculation/mds: Apply more accurate check on hypervisor platform hpet: Fix division by zero in hpet_time_div() ALSA: line6: Fix wrong altsetting for LINE6_PODHD500_1 ALSA: hda - Add a conexant codec entry to let mute led work powerpc/tm: Fix oops on sigreturn on systems without TM access: avoid the RCU grace period for the temporary subjective credentials vmstat: Remove BUG_ON from vmstat_update mm, vmstat: make quiet_vmstat lighter ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt tcp: reset sk_send_head in tcp_write_queue_purge ISDN: hfcsusb: checking idx of ep configuration media: cpia2_usb: first wake up, then free in disconnect media: radio-raremono: change devm_k*alloc to k*alloc Bluetooth: hci_uart: check for missing tty operations sched/fair: Don't free p->numa_faults with concurrent readers drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl ceph: hold i_ceph_lock when removing caps for freeing inode Linux 4.4.187 Change-Id: I6086b23376cdf9f6a905f727fb07175a7ebdd356 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-08-04sched/fair: Don't free p->numa_faults with concurrent readersJann Horn
commit 16d51a590a8ce3befb1308e0e7ab77f3b661af33 upstream. When going through execve(), zero out the NUMA fault statistics instead of freeing them. During execve, the task is reachable through procfs and the scheduler. A concurrent /proc/*/sched reader can read data from a freed ->numa_faults allocation (confirmed by KASAN) and write it back to userspace. I believe that it would also be possible for a use-after-free read to occur through a race between a NUMA fault and execve(): task_numa_fault() can lead to task_numa_compare(), which invokes task_weight() on the currently running task of a different CPU. Another way to fix this would be to make ->numa_faults RCU-managed or add extra locking, but it seems easier to wipe the NUMA fault statistics on execve. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()") Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-19Merge 4.4.168 into android-4.4-pGreg Kroah-Hartman
Changes in 4.4.168 ipv6: Check available headroom in ip6_xmit() even without options net: 8139cp: fix a BUG triggered by changing mtu with network traffic net: phy: don't allow __set_phy_supported to add unsupported modes net: Prevent invalid access to skb->prev in __qdisc_drop_all rtnetlink: ndo_dflt_fdb_dump() only work for ARPHRD_ETHER devices tcp: fix NULL ref in tail loss probe tun: forbid iface creation with rtnl ops neighbour: Avoid writing before skb->head in neigh_hh_output() ARM: OMAP2+: prm44xx: Fix section annotation on omap44xx_prm_enable_io_wakeup ARM: OMAP1: ams-delta: Fix possible use of uninitialized field sysv: return 'err' instead of 0 in __sysv_write_inode s390/cpum_cf: Reject request for sampling in event initialization hwmon: (ina2xx) Fix current value calculation ASoC: dapm: Recalculate audio map forcely when card instantiated hwmon: (w83795) temp4_type has writable permission Btrfs: send, fix infinite loop due to directory rename dependencies ASoC: omap-mcpdm: Add pm_qos handling to avoid under/overruns with CPU_IDLE ASoC: omap-dmic: Add pm_qos handling to avoid overruns with CPU_IDLE exportfs: do not read dentry after free bpf: fix check of allowed specifiers in bpf_trace_printk USB: omap_udc: use devm_request_irq() USB: omap_udc: fix crashes on probe error and module removal USB: omap_udc: fix omap_udc_start() on 15xx machines USB: omap_udc: fix USB gadget functionality on Palm Tungsten E KVM: x86: fix empty-body warnings net: thunderx: fix NULL pointer dereference in nic_remove ixgbe: recognize 1000BaseLX SFP modules as 1Gbps net: hisilicon: remove unexpected free_netdev drm/ast: fixed reading monitor EDID not stable issue xen: xlate_mmu: add missing header to fix 'W=1' warning fscache: fix race between enablement and dropping of object fscache, cachefiles: remove redundant variable 'cache' ocfs2: fix deadlock caused by ocfs2_defrag_extent() hfs: do not free node before using hfsplus: do not free node before using debugobjects: avoid recursive calls with kmemleak ocfs2: fix potential use after free pstore: Convert console write to use ->write_buf ALSA: pcm: remove SNDRV_PCM_IOCTL1_INFO internal command KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC KVM: nVMX: mark vmcs12 pages dirty on L2 exit KVM: nVMX: Eliminate vmcs02 pool KVM: VMX: introduce alloc_loaded_vmcs KVM: VMX: make MSR bitmaps per-VCPU KVM/x86: Add IBPB support KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL KVM/x86: Remove indirect MSR op calls from SPEC_CTRL x86: reorganize SMAP handling in user space accesses x86: fix SMAP in 32-bit environments x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec x86/usercopy: Replace open coded stac/clac with __uaccess_{begin, end} x86/uaccess: Use __uaccess_begin_nospec() and uaccess_try_nospec x86/bugs, KVM: Support the combination of guest and host IBRS x86/KVM/VMX: Expose SPEC_CTRL Bit(2) to the guest KVM: SVM: Move spec control call after restore of GS x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD bpf: support 8-byte metafield access bpf/verifier: Add spi variable to check_stack_write() bpf/verifier: Pass instruction index to check_mem_access() and check_xadd() bpf: Prevent memory disambiguation attack wil6210: missing length check in wmi_set_ie posix-timers: Sanitize overrun handling mm/hugetlb.c: don't call region_abort if region_chg fails hugetlbfs: fix offset overflow in hugetlbfs mmap hugetlbfs: check for pgoff value overflow hugetlbfs: fix bug in pgoff overflow checking swiotlb: clean up reporting sr: pass down correctly sized SCSI sense buffer mm: remove write/force parameters from __get_user_pages_locked() mm: remove write/force parameters from __get_user_pages_unlocked() mm/nommu.c: Switch __get_user_pages_unlocked() to use __get_user_pages() mm: replace get_user_pages_unlocked() write/force parameters with gup_flags mm: replace get_user_pages_locked() write/force parameters with gup_flags mm: replace get_vaddr_frames() write/force parameters with gup_flags mm: replace get_user_pages() write/force parameters with gup_flags mm: replace __access_remote_vm() write parameter with gup_flags mm: replace access_remote_vm() write parameter with gup_flags proc: don't use FOLL_FORCE for reading cmdline and environment proc: do not access cmdline nor environ from file-backed areas media: dvb-frontends: fix i2c access helpers for KASAN matroxfb: fix size of memcpy staging: speakup: Replace strncpy with memcpy rocker: fix rocker_tlv_put_* functions for KASAN selftests: Move networking/timestamping from Documentation Linux 4.4.168 Change-Id: Icd04a723739ae5e38258a2f6b0aee875f306a0bc Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-12-17mm: replace get_user_pages() write/force parameters with gup_flagsLorenzo Stoakes
commit 768ae309a96103ed02eb1e111e838c87854d8b51 upstream. This removes the 'write' and 'force' from get_user_pages() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 4.4: - Drop changes in rapidio, vchiq, goldfish - Keep the "write" variable in amdgpu_ttm_tt_pin_userptr() as it's still needed - Also update calls from various other places that now use get_user_pages_remote() upstream, which were updated there by commit 9beae1ea8930 "mm: replace get_user_pages_remote() write/force ..." - Also update calls from hfi1 and ipath - Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-13Merge 4.4.167 into android-4.4-pGreg Kroah-Hartman
Changes in 4.4.167 media: em28xx: Fix use-after-free when disconnecting Revert "wlcore: Add missing PM call for wlcore_cmd_wait_for_event_or_timeout()" rapidio/rionet: do not free skb before reading its length s390/qeth: fix length check in SNMP processing usbnet: ipheth: fix potential recvmsg bug and recvmsg bug 2 kvm: mmu: Fix race in emulated page table writes xtensa: enable coprocessors that are being flushed xtensa: fix coprocessor context offset definitions Btrfs: ensure path name is null terminated at btrfs_control_ioctl ALSA: wss: Fix invalid snd_free_pages() at error path ALSA: ac97: Fix incorrect bit shift at AC97-SPSA control write ALSA: control: Fix race between adding and removing a user element ALSA: sparc: Fix invalid snd_free_pages() at error path ext2: fix potential use after free dmaengine: at_hdmac: fix memory leak in at_dma_xlate() dmaengine: at_hdmac: fix module unloading btrfs: release metadata before running delayed refs USB: usb-storage: Add new IDs to ums-realtek usb: core: quirks: add RESET_RESUME quirk for Cherry G230 Stream series misc: mic/scif: fix copy-paste error in scif_create_remote_lookup Kbuild: suppress packed-not-aligned warning for default setting only exec: avoid gcc-8 warning for get_task_comm disable stringop truncation warnings for now kobject: Replace strncpy with memcpy unifdef: use memcpy instead of strncpy kernfs: Replace strncpy with memcpy ip_tunnel: Fix name string concatenate in __ip_tunnel_create() drm: gma500: fix logic error scsi: bfa: convert to strlcpy/strlcat staging: rts5208: fix gcc-8 logic error warning kdb: use memmove instead of overlapping memcpy iser: set sector for ambiguous mr status errors uprobes: Fix handle_swbp() vs. unregister() + register() race once more MIPS: ralink: Fix mt7620 nd_sd pinmux mips: fix mips_get_syscall_arg o32 check drm/ast: Fix incorrect free on ioregs scsi: scsi_devinfo: cleanly zero-pad devinfo strings ALSA: trident: Suppress gcc string warning scsi: csiostor: Avoid content leaks and casts kgdboc: Fix restrict error kgdboc: Fix warning with module build leds: call led_pwm_set() in leds-pwm to enforce default LED_OFF leds: turn off the LED and wait for completion on unregistering LED class device leds: leds-gpio: Fix return value check in create_gpio_led() Input: xpad - quirk all PDP Xbox One gamepads Input: matrix_keypad - check for errors from of_get_named_gpio() Input: elan_i2c - add ELAN0620 to the ACPI table Input: elan_i2c - add ACPI ID for Lenovo IdeaPad 330-15ARR Input: elan_i2c - add support for ELAN0621 touchpad btrfs: Always try all copies when reading extent buffers Btrfs: fix use-after-free when dumping free space ARC: change defconfig defaults to ARCv2 arc: [devboards] Add support of NFSv3 ACL mm: cleancache: fix corruption on missed inode invalidation mm: mlock: avoid increase mm->locked_vm on mlock() when already mlock2(,MLOCK_ONFAULT) usb: gadget: dummy: fix nonsensical comparisons iommu/vt-d: Fix NULL pointer dereference in prq_event_thread() iommu/ipmmu-vmsa: Fix crash on early domain free can: rcar_can: Fix erroneous registration batman-adv: Expand merged fragment buffer for full packet bnx2x: Assign unique DMAE channel number for FW DMAE transactions. qed: Fix PTT leak in qed_drain() qed: Fix reading wrong value in loop condition net/mlx4_core: Zero out lkey field in SW2HW_MPT fw command net/mlx4_core: Fix uninitialized variable compilation warning net/mlx4: Fix UBSAN warning of signed integer overflow net: faraday: ftmac100: remove netif_running(netdev) check before disabling interrupts iommu/vt-d: Use memunmap to free memremap net: amd: add missing of_node_put() usb: quirk: add no-LPM quirk on SanDisk Ultra Flair device usb: appledisplay: Add 27" Apple Cinema Display USB: check usb_get_extra_descriptor for proper size ALSA: usb-audio: Fix UAF decrement if card has no live interfaces in card.c ALSA: hda: Add support for AMD Stoney Ridge ALSA: pcm: Fix starvation on down_write_nonblock() ALSA: pcm: Call snd_pcm_unlink() conditionally at closing ALSA: pcm: Fix interval evaluation with openmin/max virtio/s390: avoid race on vcdev->config virtio/s390: fix race in ccw_io_helper() SUNRPC: Fix leak of krb5p encode pages xhci: Prevent U1/U2 link pm states if exit latency is too long Staging: lustre: remove two build warnings cifs: Fix separator when building path from dentry tty: serial: 8250_mtk: always resume the device in probe. kgdboc: fix KASAN global-out-of-bounds bug in param_set_kgdboc_var() mac80211_hwsim: Timer should be initialized before device registered mac80211: Clear beacon_int in ieee80211_do_stop mac80211: ignore tx status for PS stations in ieee80211_tx_status_ext mac80211: fix reordering of buffered broadcast packets mac80211: ignore NullFunc frames in the duplicate detection Linux 4.4.167 Change-Id: I67673edf3244cb17523bfb13f256d5b3ddd1bcba Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-12-13exec: avoid gcc-8 warning for get_task_commArnd Bergmann
commit 3756f6401c302617c5e091081ca4d26ab604bec5 upstream. gcc-8 warns about using strncpy() with the source size as the limit: fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess] This is indeed slightly suspicious, as it protects us from source arguments without NUL-termination, but does not guarantee that the destination is terminated. This keeps the strncpy() to ensure we have properly padded target buffer, but ensures that we use the correct length, by passing the actual length of the destination buffer as well as adding a build-time check to ensure it is exactly TASK_COMM_LEN. There are only 23 callsites which I all reviewed to ensure this is currently the case. We could get away with doing only the check or passing the right length, but it doesn't hurt to do both. Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Kees Cook <keescook@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Serge Hallyn <serge@hallyn.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Aleksa Sarai <asarai@suse.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21Merge 4.4.78 into android-4.4Greg Kroah-Hartman
Changes in 4.4.78 net_sched: fix error recovery at qdisc creation net: sched: Fix one possible panic when no destroy callback net/phy: micrel: configure intterupts after autoneg workaround ipv6: avoid unregistering inet6_dev for loopback net: dp83640: Avoid NULL pointer dereference. tcp: reset sk_rx_dst in tcp_disconnect() net: prevent sign extension in dev_get_stats() bpf: prevent leaking pointer via xadd on unpriviledged net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish() ipv6: dad: don't remove dynamic addresses if link is down net: ipv6: Compare lwstate in detecting duplicate nexthops vrf: fix bug_on triggered by rx when destroying a vrf rds: tcp: use sock_create_lite() to create the accept socket brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx() cfg80211: Define nla_policy for NL80211_ATTR_LOCAL_MESH_POWER_MODE cfg80211: Validate frequencies nested in NL80211_ATTR_SCAN_FREQUENCIES cfg80211: Check if PMKID attribute is of expected size irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity parisc: Report SIGSEGV instead of SIGBUS when running out of stack parisc: use compat_sys_keyctl() parisc: DMA API: return error instead of BUG_ON for dma ops on non dma devs parisc/mm: Ensure IRQs are off in switch_mm() tools/lib/lockdep: Reduce MAX_LOCK_DEPTH to avoid overflowing lock_chain/: Depth kernel/extable.c: mark core_kernel_text notrace mm/list_lru.c: fix list_lru_count_node() to be race free fs/dcache.c: fix spin lockup issue on nlru->lock checkpatch: silence perl 5.26.0 unescaped left brace warnings binfmt_elf: use ELF_ET_DYN_BASE only for PIE arm: move ELF_ET_DYN_BASE to 4MB arm64: move ELF_ET_DYN_BASE to 4GB / 4MB powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB s390: reduce ELF_ET_DYN_BASE exec: Limit arg stack to at most 75% of _STK_LIM vt: fix unchecked __put_user() in tioclinux ioctls mnt: In umount propagation reparent in a separate pass mnt: In propgate_umount handle visiting mounts in any order mnt: Make propagate_umount less slow for overlapping mount propagation trees selftests/capabilities: Fix the test_execve test tpm: Get rid of chip->pdev tpm: Provide strong locking for device removal Add "shutdown" to "struct class". tpm: Issue a TPM2_Shutdown for TPM2 devices. mm: fix overflow check in expand_upwards() crypto: talitos - Extend max key length for SHA384/512-HMAC and AEAD crypto: atmel - only treat EBUSY as transient if backlog crypto: sha1-ssse3 - Disable avx2 crypto: caam - fix signals handling sched/topology: Fix overlapping sched_group_mask sched/topology: Optimize build_group_mask() PM / wakeirq: Convert to SRCU PM / QoS: return -EINVAL for bogus strings tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results KVM: x86: disable MPX if host did not enable MPX XSAVE features kvm: vmx: Do not disable intercepts for BNDCFGS kvm: x86: Guest BNDCFGS requires guest MPX support kvm: vmx: Check value written to IA32_BNDCFGS kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS Linux 4.4.78 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-07-21exec: Limit arg stack to at most 75% of _STK_LIMKees Cook
commit da029c11e6b12f321f36dac8771e833b65cec962 upstream. To avoid pathological stack usage or the need to special-case setuid execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB). Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29Merge 4.4.75 into android-4.4Greg Kroah-Hartman
Changes in 4.4.75 fs/exec.c: account for argv/envp pointers autofs: sanity check status reported with AUTOFS_DEV_IOCTL_FAIL lib/cmdline.c: fix get_options() overflow while parsing ranges KVM: PPC: Book3S HV: Preserve userspace HTM state properly CIFS: Improve readdir verbosity HID: Add quirk for Dell PIXART OEM mouse signal: Only reschedule timers on signals timers have sent powerpc/kprobes: Pause function_graph tracing during jprobes handling Input: i8042 - add Fujitsu Lifebook AH544 to notimeout list time: Fix clock->read(clock) race around clocksource changes target: Fix kref->refcount underflow in transport_cmd_finish_abort iscsi-target: Reject immediate data underflow larger than SCSI transfer length drm/radeon: add a PX quirk for another K53TK variant drm/radeon: add a quirk for Toshiba Satellite L20-183 drm/amdgpu/atom: fix ps allocation size for EnableDispPowerGating drm/amdgpu: adjust default display clock USB: usbip: fix nonconforming hub descriptor rxrpc: Fix several cases where a padded len isn't checked in ticket decode of: Add check to of_scan_flat_dt() before accessing initial_boot_params mtd: spi-nor: fix spansion quad enable powerpc/slb: Force a full SLB flush when we insert for a bad EA usb: gadget: f_fs: avoid out of bounds access on comp_desc net: phy: Initialize mdio clock at probe function net: phy: fix marvell phy status reading nvme/quirk: Add a delay before checking for adapter readiness nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too Linux 4.4.75 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-06-29fs/exec.c: account for argv/envp pointersKees Cook
commit 98da7d08850fb8bdeb395d6368ed15753304aa0c upstream. When limiting the argv/envp strings during exec to 1/4 of the stack limit, the storage of the pointers to the strings was not included. This means that an exec with huge numbers of tiny strings could eat 1/4 of the stack limit in strings and then additional space would be later used by the pointers to the strings. For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721 single-byte strings would consume less than 2MB of stack, the max (8MB / 4) amount allowed, but the pointers to the strings would consume the remaining additional stack space (1677721 * 4 == 6710884). The result (1677721 + 6710884 == 8388605) would exhaust stack space entirely. Controlling this stack exhaustion could result in pathological behavior in setuid binaries (CVE-2017-1000365). [akpm@linux-foundation.org: additional commenting from Kees] Fixes: b6a2fea39318 ("mm: variable length argument support") Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Qualys Security Advisory <qsa@qualys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-15Merge remote-tracking branch 'common/android-4.4' into android-4.4.yDmitry Shmidt
Change-Id: Icf907f5067fb6da5935ab0d3271df54b8d5df405
2017-01-26ANDROID: vfs: Add permission2 for filesystems with per mount permissionsDaniel Rosenberg
This allows filesystems to use their mount private data to influence the permssions they return in permission2. It has been separated into a new call to avoid disrupting current permission users. Change-Id: I9d416e3b8b6eca84ef3e336bd2af89ddd51df6ca Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-01-06exec: Ensure mm->user_ns contains the execed filesEric W. Biederman
commit f84df2a6f268de584a201e8911384a2d244876e3 upstream. When the user namespace support was merged the need to prevent ptrace from revealing the contents of an unreadable executable was overlooked. Correct this oversight by ensuring that the executed file or files are in mm->user_ns, by adjusting mm->user_ns. Use the new function privileged_wrt_inode_uidgid to see if the executable is a member of the user namespace, and as such if having CAP_SYS_PTRACE in the user namespace should allow tracing the executable. If not update mm->user_ns to the parent user namespace until an appropriate parent is found. Reported-by: Jann Horn <jann@thejh.net> Fixes: 9e4a36ece652 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06fs: exec: apply CLOEXEC before changing dumpable task flagsAleksa Sarai
commit 613cc2b6f272c1a8ad33aefa21cad77af23139f7 upstream. If you have a process that has set itself to be non-dumpable, and it then undergoes exec(2), any CLOEXEC file descriptors it has open are "exposed" during a race window between the dumpable flags of the process being reset for exec(2) and CLOEXEC being applied to the file descriptors. This can be exploited by a process by attempting to access /proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE. The race in question is after set_dumpable has been (for get_link, though the trace is basically the same for readlink): [vfs] -> proc_pid_link_inode_operations.get_link -> proc_pid_get_link -> proc_fd_access_allowed -> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS); Which will return 0, during the race window and CLOEXEC file descriptors will still be open during this window because do_close_on_exec has not been called yet. As a result, the ordering of these calls should be reversed to avoid this race window. This is of particular concern to container runtimes, where joining a PID namespace with file descriptors referring to the host filesystem can result in security issues (since PRCTL_SET_DUMPABLE doesn't protect against access of CLOEXEC file descriptors -- file descriptors which may reference filesystem objects the container shouldn't have access to). Cc: dev@opencontainers.org Cc: <stable@vger.kernel.org> # v3.2+ Reported-by: Michael Crosby <crosbymichael@gmail.com> Signed-off-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ptrace: Capture the ptracer's creds not PT_PTRACE_CAPEric W. Biederman
commit 64b875f7ac8a5d60a4e191479299e931ee949b67 upstream. When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was overlooked. This can result in incorrect behavior when an application like strace traces an exec of a setuid executable. Further PT_PTRACE_CAP does not have enough information for making good security decisions as it does not report which user namespace the capability is in. This has already allowed one mistake through insufficient granulariy. I found this issue when I was testing another corner case of exec and discovered that I could not get strace to set PT_PTRACE_CAP even when running strace as root with a full set of caps. This change fixes the above issue with strace allowing stracing as root a setuid executable without disabling setuid. More fundamentaly this change allows what is allowable at all times, by using the correct information in it's decision. Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10vfs: Commit to never having exectuables on proc and sysfs.Eric W. Biederman
Today proc and sysfs do not contain any executable files. Several applications today mount proc or sysfs without noexec and nosuid and then depend on there being no exectuables files on proc or sysfs. Having any executable files show on proc or sysfs would cause a user space visible regression, and most likely security problems. Therefore commit to never allowing executables on proc and sysfs by adding a new flag to mark them as filesystems without executables and enforce that flag. Test the flag where MNT_NOEXEC is tested today, so that the only user visible effect will be that exectuables will be treated as if the execute bit is cleared. The filesystems proc and sysfs do not currently incoporate any executable files so this does not result in any user visible effects. This makes it unnecessary to vet changes to proc and sysfs tightly for adding exectuable files or changes to chattr that would modify existing files, as no matter what the individual file say they will not be treated as exectuable files by the vfs. Not having to vet changes to closely is important as without this we are only one proc_create call (or another goof up in the implementation of notify_change) from having problematic executables on proc. Those mistakes are all too easy to make and would create a situation where there are security issues or the assumptions of some program having to be broken (and cause userspace regressions). Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-05-12parisc,metag: Fix crashes due to stack randomization on stack-grows-upwards ↵Helge Deller
architectures On architectures where the stack grows upwards (CONFIG_STACK_GROWSUP=y, currently parisc and metag only) stack randomization sometimes leads to crashes when the stack ulimit is set to lower values than STACK_RND_MASK (which is 8 MB by default if not defined in arch-specific headers). The problem is, that when the stack vm_area_struct is set up in fs/exec.c, the additional space needed for the stack randomization (as defined by the value of STACK_RND_MASK) was not taken into account yet and as such, when the stack randomization code added a random offset to the stack start, the stack effectively got smaller than what the user defined via rlimit_max(RLIMIT_STACK) which then sometimes leads to out-of-stack situations and crashes. This patch fixes it by adding the maximum possible amount of memory (based on STACK_RND_MASK) which theoretically could be added by the stack randomization code to the initial stack size. That way, the user-defined stack size is always guaranteed to be at minimum what is defined via rlimit_max(RLIMIT_STACK). This bug is currently not visible on the metag architecture, because on metag STACK_RND_MASK is defined to 0 which effectively disables stack randomization. The changes to fs/exec.c are inside an "#ifdef CONFIG_STACK_GROWSUP" section, so it does not affect other platformws beside those where the stack grows upwards (parisc and metag). Signed-off-by: Helge Deller <deller@gmx.de> Cc: linux-parisc@vger.kernel.org Cc: James Hogan <james.hogan@imgtec.com> Cc: linux-metag@vger.kernel.org Cc: stable@vger.kernel.org # v3.16+
2015-04-19fs: take i_mutex during prepare_binprm for set[ug]id executablesJann Horn
This prevents a race between chown() and execve(), where chowning a setuid-user binary to root would momentarily make the binary setuid root. This patch was mostly written by Linus Torvalds. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/exec.c:de_thread: move notify_count write under lockKirill Tkhai
We set sig->notify_count = -1 between RELEASE and ACQUIRE operations: spin_unlock_irq(lock); ... if (!thread_group_leader(tsk)) { ... for (;;) { sig->notify_count = -1; write_lock_irq(&tasklist_lock); There are no restriction on it so other processors may see this STORE mixed with other STOREs in both areas limited by the spinlocks. Probably, it may be reordered with the above sig->group_exit_task = tsk; sig->notify_count = zap_other_threads(tsk); in some way. Set it under tasklist_lock locked to be sure nothing will be reordered. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17prctl: avoid using mmap_sem for exe_file serializationDavidlohr Bueso
Oleg cleverly suggested using xchg() to set the new mm->exe_file instead of calling set_mm_exe_file() which requires some form of serialization -- mmap_sem in this case. For archs that do not have atomic rmw instructions we still fallback to a spinlock alternative, so this should always be safe. As such, we only need the mmap_sem for looking up the backing vm_file, which can be done sharing the lock. Naturally, this means we need to manually deal with both the new and old file reference counting, and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can probably be deleted in the future anyway. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-23fs: create proper filename objects using getname_kernel()Paul Moore
There are several areas in the kernel that create temporary filename objects using the following pattern: int func(const char *name) { struct filename *file = { .name = name }; ... return 0; } ... which for the most part works okay, but it causes havoc within the audit subsystem as the filename object does not persist beyond the lifetime of the function. This patch converts all of these temporary filename objects into proper filename objects using getname_kernel() and putname() which ensure that the filename object persists until the audit subsystem is finished with it. Also, a special thanks to Al Viro, Guenter Roeck, and Sabrina Dubroca for helping resolve a difficult kernel panic on boot related to a use-after-free problem in kern_path_create(); the thread can be seen at the link below: * https://lkml.org/lkml/2015/1/20/710 This patch includes code that was either based on, or directly written by Al in the above thread. CC: viro@zeniv.linux.org.uk CC: linux@roeck-us.net CC: sd@queasysnail.net CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-13syscalls: implement execveat() system callDavid Drysdale
This patchset adds execveat(2) for x86, and is derived from Meredydd Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528). The primary aim of adding an execveat syscall is to allow an implementation of fexecve(3) that does not rely on the /proc filesystem, at least for executables (rather than scripts). The current glibc version of fexecve(3) is implemented via /proc, which causes problems in sandboxed or otherwise restricted environments. Given the desire for a /proc-free fexecve() implementation, HPA suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be an appropriate generalization. Also, having a new syscall means that it can take a flags argument without back-compatibility concerns. The current implementation just defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be added in future -- for example, flags for new namespaces (as suggested at https://lkml.org/lkml/2006/7/11/474). Related history: - https://lkml.org/lkml/2006/12/27/123 is an example of someone realizing that fexecve() is likely to fail in a chroot environment. - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered documenting the /proc requirement of fexecve(3) in its manpage, to "prevent other people from wasting their time". - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a problem where a process that did setuid() could not fexecve() because it no longer had access to /proc/self/fd; this has since been fixed. This patch (of 4): Add a new execveat(2) system call. execveat() is to execve() as openat() is to open(): it takes a file descriptor that refers to a directory, and resolves the filename relative to that. In addition, if the filename is empty and AT_EMPTY_PATH is specified, execveat() executes the file to which the file descriptor refers. This replicates the functionality of fexecve(), which is a system call in other UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and so relies on /proc being mounted). The filename fed to the executed program as argv[0] (or the name of the script fed to a script interpreter) will be of the form "/dev/fd/<fd>" (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively reflecting how the executable was found. This does however mean that execution of a script in a /proc-less environment won't work; also, script execution via an O_CLOEXEC file descriptor fails (as the file will not be accessible after exec). Based on patches by Meredydd Luff. Signed-off-by: David Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-18fs: Do not include mpx.h in exec.cDave Hansen
We no longer need mpx.h in exec.c. This will obviously also break the build for non-x86 builds. We get the MPX includes that we need from mmu_context.h now. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141118003608.837015B3@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-18x86, mpx: On-demand kernel allocation of bounds tablesDave Hansen
This is really the meat of the MPX patch set. If there is one patch to review in the entire series, this is the one. There is a new ABI here and this kernel code also interacts with userspace memory in a relatively unusual manner. (small FAQ below). Long Description: This patch adds two prctl() commands to provide enable or disable the management of bounds tables in kernel, including on-demand kernel allocation (See the patch "on-demand kernel allocation of bounds tables") and cleanup (See the patch "cleanup unused bound tables"). Applications do not strictly need the kernel to manage bounds tables and we expect some applications to use MPX without taking advantage of this kernel support. This means the kernel can not simply infer whether an application needs bounds table management from the MPX registers. The prctl() is an explicit signal from userspace. PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to require kernel's help in managing bounds tables. PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel won't allocate and free bounds tables even if the CPU supports MPX. PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds directory out of a userspace register (bndcfgu) and then cache it into a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT will set "bd_addr" to an invalid address. Using this scheme, we can use "bd_addr" to determine whether the management of bounds tables in kernel is enabled. Also, the only way to access that bndcfgu register is via an xsaves, which can be expensive. Caching "bd_addr" like this also helps reduce the cost of those xsaves when doing table cleanup at munmap() time. Unfortunately, we can not apply this optimization to #BR fault time because we need an xsave to get the value of BNDSTATUS. ==== Why does the hardware even have these Bounds Tables? ==== MPX only has 4 hardware registers for storing bounds information. If MPX-enabled code needs more than these 4 registers, it needs to spill them somewhere. It has two special instructions for this which allow the bounds to be moved between the bounds registers and some new "bounds tables". They are similar conceptually to a page fault and will be raised by the MPX hardware during both bounds violations or when the tables are not present. This patch handles those #BR exceptions for not-present tables by carving the space out of the normal processes address space (essentially calling the new mmap() interface indroduced earlier in this patch set.) and then pointing the bounds-directory over to it. The tables *need* to be accessed and controlled by userspace because the instructions for moving bounds in and out of them are extremely frequent. They potentially happen every time a register pointing to memory is dereferenced. Any direct kernel involvement (like a syscall) to access the tables would obviously destroy performance. ==== Why not do this in userspace? ==== This patch is obviously doing this allocation in the kernel. However, MPX does not strictly *require* anything in the kernel. It can theoretically be done completely from userspace. Here are a few ways this *could* be done. I don't think any of them are practical in the real-world, but here they are. Q: Can virtual space simply be reserved for the bounds tables so that we never have to allocate them? A: As noted earlier, these tables are *HUGE*. An X-GB virtual area needs 4*X GB of virtual space, plus 2GB for the bounds directory. If we were to preallocate them for the 128TB of user virtual address space, we would need to reserve 512TB+2GB, which is larger than the entire virtual address space today. This means they can not be reserved ahead of time. Also, a single process's pre-popualated bounds directory consumes 2GB of virtual *AND* physical memory. IOW, it's completely infeasible to prepopulate bounds directories. Q: Can we preallocate bounds table space at the same time memory is allocated which might contain pointers that might eventually need bounds tables? A: This would work if we could hook the site of each and every memory allocation syscall. This can be done for small, constrained applications. But, it isn't practical at a larger scale since a given app has no way of controlling how all the parts of the app might allocate memory (think libraries). The kernel is really the only place to intercept these calls. Q: Could a bounds fault be handed to userspace and the tables allocated there in a signal handler instead of in the kernel? A: (thanks to tglx) mmap() is not on the list of safe async handler functions and even if mmap() would work it still requires locking or nasty tricks to keep track of the allocation state there. Having ruled out all of the userspace-only approaches for managing bounds tables that we could think of, we create them on demand in the kernel. Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-09handle suicide on late failure exits in execve() in search_binary_handler()Al Viro
... rather than doing that in the guts of ->load_binary(). [updated to fix the bug spotted by Shentino - for SIGSEGV we really need something stronger than send_sig_info(); again, better do that in one place] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-08fork/exec: cleanup mm initializationVladimir Davydov
mm initialization on fork/exec is spread all over the place, which makes the code look inconsistent. We have mm_init(), which is supposed to init/nullify mm's internals, but it doesn't init all the fields it should: - on fork ->mmap,mm_rb,vmacache_seqnum,map_count,mm_cpumask,locked_vm are zeroed in dup_mmap(); - on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before calling mm_init(); - ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is called before mm_init() on both fork and exec; - ->context is initialized by init_new_context(), which is called after mm_init() on both fork and exec; Let's consolidate all the initializations in mm_init() to make the code look cleaner. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-18seccomp: implement SECCOMP_FILTER_FLAG_TSYNCKees Cook
Applying restrictive seccomp filter programs to large or diverse codebases often requires handling threads which may be started early in the process lifetime (e.g., by code that is linked in). While it is possible to apply permissive programs prior to process start up, it is difficult to further restrict the kernel ABI to those threads after that point. This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for synchronizing thread group seccomp filters at filter installation time. When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC, filter) an attempt will be made to synchronize all threads in current's threadgroup to its new seccomp filter program. This is possible iff all threads are using a filter that is an ancestor to the filter current is attempting to synchronize to. NULL filters (where the task is running as SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS, ...) has been set on the calling thread, no_new_privs will be set for all synchronized threads too. On success, 0 is returned. On failure, the pid of one of the failing threads will be returned and no filters will have been applied. The race conditions against another thread are: - requesting TSYNC (already handled by sighand lock) - performing a clone (already handled by sighand lock) - changing its filter (already handled by sighand lock) - calling exec (handled by cred_guard_mutex) The clone case is assisted by the fact that new threads will have their seccomp state duplicated from their parent before appearing on the tasklist. Holding cred_guard_mutex means that seccomp filters cannot be assigned while in the middle of another thread's exec (potentially bypassing no_new_privs or similar). The call to de_thread() may kill threads waiting for the mutex. Changes across threads to the filter pointer includes a barrier. Based on patches by Will Drewry. Suggested-by: Julien Tinnes <jln@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-07-18sched: move no_new_privs into new atomic flagsKees Cook
Since seccomp transitions between threads requires updates to the no_new_privs flag to be atomic, the flag must be part of an atomic flag set. This moves the nnp flag into a separate task field, and introduces accessors. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-06-06perf: Differentiate exec() and non-exec() comm eventsAdrian Hunter
perf tools like 'perf report' can aggregate samples by comm strings, which generally works. However, there are other potential use-cases. For example, to pair up 'calls' with 'returns' accurately (from branch events like Intel BTS) it is necessary to identify whether the process has exec'd. Although a comm event is generated when an 'exec' happens it is also generated whenever the comm string is changed on a whim (e.g. by prctl PR_SET_NAME). This patch adds a flag to the comm event to differentiate one case from the other. In order to determine whether the kernel supports the new flag, a selection bit named 'exec' is added to struct perf_event_attr. The bit does nothing but will cause perf_event_open() to fail if the bit is set on kernels that do not have it defined. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/537D9EBE.7030806@intel.com Cc: Paul Mackerras <paulus@samba.org> Cc: Dave Jones <davej@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-06perf: Fix perf_event_comm() vs. exec() assumptionPeter Zijlstra
perf_event_comm() assumes that set_task_comm() is only called on exec(), and in particular that its only called on current. Neither are true, as Dave reported a WARN triggered by set_task_comm() being called on !current. Separate the exec() hook from the comm hook. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20140521153219.GH5226@laptop.programming.kicks-ass.net [ Build fix. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-15metag: Reduce maximum stack size to 256MBJames Hogan
Specify the maximum stack size for arches where the stack grows upward (parisc and metag) in asm/processor.h rather than hard coding in fs/exec.c so that metag can specify a smaller value of 256MB rather than 1GB. This fixes a BUG on metag if the RLIMIT_STACK hard limit is increased beyond a safe value by root. E.g. when starting a process after running "ulimit -H -s unlimited" it will then attempt to use a stack size of the maximum 1GB which is far too big for metag's limited user virtual address space (stack_top is usually 0x3ffff000): BUG: failure at fs/exec.c:589/shift_arg_pages()! Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: linux-parisc@vger.kernel.org Cc: linux-metag@vger.kernel.org Cc: John David Anglin <dave.anglin@bell.net> Cc: stable@vger.kernel.org # only needed for >= v3.9 (arch/metag)
2014-04-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
2014-04-07exec: kill bprm->tcomm[], simplify the "basename" logicOleg Nesterov
Starting from commit c4ad8f98bef7 ("execve: use 'struct filename *' for executable name passing") bprm->filename can not go away after flush_old_exec(), so we do not need to save the binary name in bprm->tcomm[] added by 96e02d158678 ("exec: fix use-after-free bug in setup_new_exec()"). And there was never need for filename_to_taskname-like code, we can simply do set_task_comm(kbasename(filename). This patch has to change set_task_comm() and trace_task_rename() to accept "const char *", but I think this change is also good. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07mm: per-thread vma cachingDavidlohr Bueso
This patch is a continuation of efforts trying to optimize find_vma(), avoiding potentially expensive rbtree walks to locate a vma upon faults. The original approach (https://lkml.org/lkml/2013/11/1/410), where the largest vma was also cached, ended up being too specific and random, thus further comparison with other approaches were needed. There are two things to consider when dealing with this, the cache hit rate and the latency of find_vma(). Improving the hit-rate does not necessarily translate in finding the vma any faster, as the overhead of any fancy caching schemes can be too high to consider. We currently cache the last used vma for the whole address space, which provides a nice optimization, reducing the total cycles in find_vma() by up to 250%, for workloads with good locality. On the other hand, this simple scheme is pretty much useless for workloads with poor locality. Analyzing ebizzy runs shows that, no matter how many threads are running, the mmap_cache hit rate is less than 2%, and in many situations below 1%. The proposed approach is to replace this scheme with a small per-thread cache, maximizing hit rates at a very low maintenance cost. Invalidations are performed by simply bumping up a 32-bit sequence number. The only expensive operation is in the rare case of a seq number overflow, where all caches that share the same address space are flushed. Upon a miss, the proposed replacement policy is based on the page number that contains the virtual address in question. Concretely, the following results are seen on an 80 core, 8 socket x86-64 box: 1) System bootup: Most programs are single threaded, so the per-thread scheme does improve ~50% hit rate by just adding a few more slots to the cache. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 50.61% | 19.90 | | patched | 73.45% | 13.58 | +----------------+----------+------------------+ 2) Kernel build: This one is already pretty good with the current approach as we're dealing with good locality. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 75.28% | 11.03 | | patched | 88.09% | 9.31 | +----------------+----------+------------------+ 3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 70.66% | 17.14 | | patched | 91.15% | 12.57 | +----------------+----------+------------------+ 4) Ebizzy: There's a fair amount of variation from run to run, but this approach always shows nearly perfect hit rates, while baseline is just about non-existent. The amounts of cycles can fluctuate between anywhere from ~60 to ~116 for the baseline scheme, but this approach reduces it considerably. For instance, with 80 threads: +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 1.06% | 91.54 | | patched | 99.97% | 14.18 | +----------------+----------+------------------+ [akpm@linux-foundation.org: fix nommu build, per Davidlohr] [akpm@linux-foundation.org: document vmacache_valid() logic] [akpm@linux-foundation.org: attempt to untangle header files] [akpm@linux-foundation.org: add vmacache_find() BUG_ON] [hughd@google.com: add vmacache_valid_mm() (from Oleg)] [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: adjust and enhance comments] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03fs, kernel: permit disabling the uselib syscallJosh Triplett
uselib hasn't been used since libc5; glibc does not use it. Support turning it off. When disabled, also omit the load_elf_library implementation from binfmt_elf.c, which only uselib invokes. bloat-o-meter: add/remove: 0/4 grow/shrink: 0/1 up/down: 0/-785 (-785) function old new delta padzero 39 36 -3 uselib_flags 20 - -20 sys_uselib 168 - -168 SyS_uselib 168 - -168 load_elf_library 426 - -426 The new CONFIG_USELIB defaults to `y'. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-01read_code(): go through vfs_read() instead of calling the method directlyAl Viro
... and don't skip on sanity checks. It's *not* a hot path, TYVM (a couple of calls per a.out execve(), for pity sake) and headers of random a.out binary are not to be trusted. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-06fs/compat: convert to COMPAT_SYSCALL_DEFINEHeiko Carstens
Convert all compat system call functions where all parameter types have a size of four or less than four bytes, or are pointer types to COMPAT_SYSCALL_DEFINE. The implicit casts within COMPAT_SYSCALL_DEFINE will perform proper zero and sign extension to 64 bit of all parameters if needed. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2014-02-05execve: use 'struct filename *' for executable name passingLinus Torvalds
This changes 'do_execve()' to get the executable name as a 'struct filename', and to free it when it is done. This is what the normal users want, and it simplifies and streamlines their error handling. The controlled lifetime of the executable name also fixes a use-after-free problem with the trace_sched_process_exec tracepoint: the lifetime of the passed-in string for kernel users was not at all obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize the pathname allocation lifetime with the execve() having finished, which in turn meant that the trace point that happened after mm_release() of the old process VM ended up using already free'd memory. To solve the kernel string lifetime issue, this simply introduces "getname_kernel()" that works like the normal user-space getname() function, except with the source coming from kernel memory. As Oleg points out, this also means that we could drop the tcomm[] array from 'struct linux_binprm', since the pathname lifetime now covers setup_new_exec(). That would be a separate cleanup. Reported-by: Igor Zhbanov <i.zhbanov@samsung.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/exec.c: call arch_pick_mmap_layout() only onceRichard Weinberger
Currently both setup_new_exec() and flush_old_exec() issue a call to arch_pick_mmap_layout(). As setup_new_exec() and flush_old_exec() are always called pairwise arch_pick_mmap_layout() is called twice. This patch removes one call from setup_new_exec() to have it only called once. Signed-off-by: Richard Weinberger <richard@nod.at> Tested-by: Pat Erley <pat-lkml@erley.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec: avoid propagating PF_NO_SETAFFINITY into userspace childZhang Yi
Userspace process doesn't want the PF_NO_SETAFFINITY, but its parent may be a kernel worker thread which has PF_NO_SETAFFINITY set, and this worker thread can do kernel_thread() to create the child. Clearing this flag in usersapce child to enable its migrating capability. Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec: kill task_struct->did_execOleg Nesterov
We can kill either task->did_exec or PF_FORKNOEXEC, they are mutually exclusive. The patch kills ->did_exec because it has a single user. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec: move the final allow_write_access/fput into free_bprm()Oleg Nesterov
Both success/failure paths cleanup bprm->file, we can move this code into free_bprm() to simlify and cleanup this logic. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec:check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logicOleg Nesterov
fs_struct->in_exec == T means that this ->fs is used by a single process (thread group), and one of the treads does do_execve(). To avoid the mt-exec races this code has the following complications: 1. check_unsafe_exec() returns -EBUSY if ->in_exec was already set by another thread. 2. do_execve_common() records "clear_in_exec" to ensure that the error path can only clear ->in_exec if it was set by current. However, after 9b1bf12d5d51 "signals: move cred_guard_mutex from task_struct to signal_struct" we do not need these complications: 1. We can't race with our sub-thread, this is called under per-process ->cred_guard_mutex. And we can't race with another CLONE_FS task, we already checked that this fs is not shared. We can remove the dead -EAGAIN logic. 2. "out_unmark:" in do_execve_common() is either called under ->cred_guard_mutex, or after de_thread() which kills other threads, so we can't race with sub-thread which could set ->in_exec. And if ->fs is shared with another process ->in_exec should be false anyway. We can clear in_exec unconditionally. This also means that check_unsafe_exec() can be void. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec:check_unsafe_exec: use while_each_thread() rather than next_thread()Oleg Nesterov
next_thread() should be avoided, change check_unsafe_exec() to use while_each_thread(). Nobody except signal->curr_target actually needs next_thread-like code, and we need to change (fix) this interface. This particular code is fine, p == current. But in general the code like this can loop forever if p exits and next_thread(t) can't reach the unhashed thread. This also saves 32 bytes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>