summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2014-05-30KVM: PPC: Book3S PR: Use SLB entry 0Alexander Graf
We didn't make use of SLB entry 0 because ... of no good reason. SLB entry 0 will always be used by the Linux linear SLB entry, so the fact that slbia does not invalidate it doesn't matter as we overwrite SLB 0 on exit anyway. Just enable use of SLB entry 0 for our shadow SLB code. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S HV: Fix machine check delivery to guestPaul Mackerras
The code that delivered a machine check to the guest after handling it in real mode failed to load up r11 before calling kvmppc_msr_interrupt, which needs the old MSR value in r11 so it can see the transactional state there. This adds the missing load. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugsPaul Mackerras
This adds workarounds for two hardware bugs in the POWER8 performance monitor unit (PMU), both related to interrupt generation. The effect of these bugs is that PMU interrupts can get lost, leading to tools such as perf reporting fewer counts and samples than they should. The first bug relates to the PMAO (perf. mon. alert occurred) bit in MMCR0; setting it should cause an interrupt, but doesn't. The other bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0. Setting PMAE when a counter is negative and counter negative conditions are enabled to cause alerts should cause an alert, but doesn't. The workaround for the first bug is to create conditions where a counter will overflow, whenever we are about to restore a MMCR0 value that has PMAO set (and PMAO_SYNC clear). The workaround for the second bug is to freeze all counters using MMCR2 before reading MMCR0. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S HV: Make sure we don't miss dirty pagesPaul Mackerras
Current, when testing whether a page is dirty (when constructing the bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit in the HPT entries mapping the page, and if it is 0, we consider the page to be clean. However, the Power ISA doesn't require processors to set the C bit to 1 immediately when writing to a page, and in fact allows them to delay the writeback of the C bit until they receive a TLB invalidation for the page. Thus it is possible that the page could be dirty and we miss it. Now, if there are vcpus running, this is not serious since the collection of the dirty log is racy already - some vcpu could dirty the page just after we check it. But if there are no vcpus running we should return definitive results, in case we are in the final phase of migrating the guest. Also, if the permission bits in the HPTE don't allow writing, then we know that no CPU can set C. If the HPTE was previously writable and the page was modified, any C bit writeback would have been flushed out by the tlbie that we did when changing the HPTE to read-only. Otherwise we need to do a TLB invalidation even if the C bit is 0, and then check the C bit. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S HV: Fix dirty map for hugepagesAlexey Kardashevskiy
The dirty map that we construct for the KVM_GET_DIRTY_LOG ioctl has one bit per system page (4K/64K). Currently, we only set one bit in the map for each HPT entry with the Change bit set, even if the HPT is for a large page (e.g., 16MB). Userspace then considers only the first system page dirty, though in fact the guest may have modified anywhere in the large page. To fix this, we make kvm_test_clear_dirty() return the actual number of pages that are dirty (and rename it to kvm_test_clear_dirty_npages() to emphasize that that's what it returns). In kvmppc_hv_get_dirty_log() we then set that many bits in the dirty map. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base addressPaul Mackerras
Currently, when a huge page is faulted in for a guest, we select the rmap chain to insert the HPTE into based on the guest physical address that the guest tried to access. Since there is an rmap chain for each system page, there are many rmap chains for the area covered by a huge page (e.g. 256 for 16MB pages when PAGE_SIZE = 64kB), and the huge-page HPTE could end up in any one of them. For consistency, and to make the huge-page HPTEs easier to find, we now put huge-page HPTEs in the rmap chain corresponding to the base address of the huge page. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()Paul Mackerras
The global_invalidates() function contains a check that is intended to tell whether we are currently executing in the context of a hypercall issued by the guest. The reason is that the optimization of using a local TLB invalidate instruction is only valid in that context. The check was testing local_paca->kvm_hstate.kvm_vcore, which gets set when entering the guest but no longer gets cleared when exiting the guest. To fix this, we use the kvm_vcpu field instead, which does get cleared when exiting the guest, by the kvmppc_release_hwthread() calls inside kvmppc_run_core(). The effect of having the check wrong was that when kvmppc_do_h_remove() got called from htab_write() on the destination machine during a migration, it cleared the current cpu's bit in kvm->arch.need_tlb_flush. This meant that when the guest started running in the destination VM, it may miss out on doing a complete TLB flush, and therefore may end up using stale TLB entries from a previous guest that used the same LPID value. This should make migration more reliable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register numberPaul Mackerras
Commit b005255e12a3 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs") added a definition of KVM_REG_PPC_WORT with the same register number as the existing KVM_REG_PPC_VRSAVE (though in fact the definitions are not identical because of the different register sizes.) For clarity, this moves KVM_REG_PPC_WORT to the next unused number, and also adds it to api.txt. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Add CAP to indicate hcall fixesAlexander Graf
We worked around some nasty KVM magic page hcall breakages: 1) NX bit not honored, so ignore NX when we detect it 2) LE guests swizzle hypercall instruction Without these fixes in place, there's no way it would make sense to expose kvm hypercalls to a guest. Chances are immensely high it would trip over and break. So add a new CAP that gives user space a hint that we have workarounds for the bugs above in place. It can use those as hint to disable PV hypercalls when the guest CPU is anything POWER7 or higher and the host does not have fixes in place. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: MPIC: Reset IRQ source private membersAlexander Graf
When we reset the in-kernel MPIC controller, we forget to reset some hidden state such as destmask and output. This state is usually set when the guest writes to the IDR register for a specific IRQ line. To make sure we stay in sync and don't forget hidden state, treat reset of the IDR register as a simple write of the IDR register. That automatically updates all the hidden state as well. Reported-by: Paul Janzen <pcj@pauljanzen.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Graciously fail broken LE hypercallsAlexander Graf
There are LE Linux guests out there that don't handle hypercalls correctly. Instead of interpreting the instruction stream from device tree as big endian they assume it's a little endian instruction stream and fail. When we see an illegal instruction from such a byte reversed instruction stream, bail out graciously and just declare every hcall as error. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30PPC: ePAPR: Fix hypercall on LE guestAlexander Graf
We get an array of instructions from the hypervisor via device tree that we write into a buffer that gets executed whenever we want to make an ePAPR compliant hypercall. However, the hypervisor passes us these instructions in BE order which we have to manually convert to LE when we want to run them in LE mode. With this fixup in place, I can successfully run LE kernels with KVM PV enabled on PR KVM. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handlerAneesh Kumar K.V
Use make_dsisr instead of open coding it. This also have the added benefit of handling alignment interrupt on additional instructions. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: BOOK3S: Always use the saved DAR valueAneesh Kumar K.V
Although it's optional, IBM POWER cpus always had DAR value set on alignment interrupt. So don't try to compute these values. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30PPC: KVM: Make NX bit available with magic pageAlexander Graf
Because old kernels enable the magic page and then choke on NXed trampoline code we have to disable NX by default in KVM when we use the magic page. However, since commit b18db0b8 we have successfully fixed that and can now leave NX enabled, so tell the hypervisor about this. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Disable NX for old magic page using guestsAlexander Graf
Old guests try to use the magic page, but map their trampoline code inside of an NX region. Since we can't fix those old kernels, try to detect whether the guest is sane or not. If not, just disable NX functionality in KVM so that old guests at least work at all. For newer guests, add a bit that we can set to keep NX functionality available. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: BOOK3S: HV: Add mixed page-size support for guestAneesh Kumar K.V
On recent IBM Power CPUs, while the hashed page table is looked up using the page size from the segmentation hardware (i.e. the SLB), it is possible to have the HPT entry indicate a larger page size. Thus for example it is possible to put a 16MB page in a 64kB segment, but since the hash lookup is done using a 64kB page size, it may be necessary to put multiple entries in the HPT for a single 16MB page. This capability is called mixed page-size segment (MPSS). With MPSS, there are two relevant page sizes: the base page size, which is the size used in searching the HPT, and the actual page size, which is the size indicated in the HPT entry. [ Note that the actual page size is always >= base page size ]. We use "ibm,segment-page-sizes" device tree node to advertise the MPSS support to PAPR guest. The penc encoding indicates whether we support a specific combination of base page size and actual page size in the same segment. We also use the penc value in the LP encoding of HPTE entry. This patch exposes MPSS support to KVM guest by advertising the feature via "ibm,segment-page-sizes". It also adds the necessary changes to decode the base page size and the actual page size correctly from the HPTE entry. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocationAneesh Kumar K.V
Today when KVM tries to reserve memory for the hash page table it allocates from the normal page allocator first. If that fails it falls back to CMA's reserved region. One of the side effects of this is that we could end up exhausting the page allocator and get linux into OOM conditions while we still have plenty of space available in CMA. This patch addresses this issue by first trying hash page table allocation from CMA's reserved region before falling back to the normal page allocator. So if we run out of memory, we really are out of memory. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S PR: Expose TM registersAlexander Graf
POWER8 introduces transactional memory which brings along a number of new registers and MSR bits. Implementing all of those is a pretty big headache, so for now let's at least emulate enough to make Linux's context switching code happy. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S PR: Expose EBB registersAlexander Graf
POWER8 introduces a new facility called the "Event Based Branch" facility. It contains of a few registers that indicate where a guest should branch to when a defined event occurs and it's in PR mode. We don't want to really enable EBB as it will create a big mess with !PR guest mode while hardware is in PR and we don't really emulate the PMU anyway. So instead, let's just leave it at emulation of all its registers. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S PR: Expose TAR facility to guestAlexander Graf
POWER8 implements a new register called TAR. This register has to be enabled in FSCR and then from KVM's point of view is mere storage. This patch enables the guest to use TAR. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S PR: Handle Facility interrupt and FSCRAlexander Graf
POWER8 introduced a new interrupt type called "Facility unavailable interrupt" which contains its status message in a new register called FSCR. Handle these exits and try to emulate instructions for unhandled facilities. Follow-on patches enable KVM to expose specific facilities into the guest. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S PR: Emulate TIR registerAlexander Graf
In parallel to the Processor ID Register (PIR) threaded POWER8 also adds a Thread ID Register (TIR). Since PR KVM doesn't emulate more than one thread per core, we can just always expose 0 here. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S PR: Ignore PMU SPRsAlexander Graf
When we expose a POWER8 CPU into the guest, it will start accessing PMU SPRs that we don't emulate. Just ignore accesses to them. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S: Move little endian conflict to HV KVMAlexander Graf
With the previous patches applied, we can now successfully use PR KVM on little endian hosts which means we can now allow users to select it. However, HV KVM still needs some work, so let's keep the kconfig conflict on that one. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructionsAlexander Graf
When the host CPU we're running on doesn't support dcbz32 itself, but the guest wants to have dcbz only clear 32 bytes of data, we loop through every executable mapped page to search for dcbz instructions and patch them with a special privileged instruction that we emulate as dcbz32. The only guests that want to see dcbz act as 32byte are book3s_32 guests, so we don't have to worry about little endian instruction ordering. So let's just always search for big endian dcbz instructions, also when we're on a little endian host. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Make shared struct aka magic page guest endianAlexander Graf
The shared (magic) page is a data structure that contains often used supervisor privileged SPRs accessible via memory to the user to reduce the number of exits we have to take to read/write them. When we actually share this structure with the guest we have to maintain it in guest endianness, because some of the patch tricks only work with native endian load/store operations. Since we only share the structure with either host or guest in little endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv. For booke, the shared struct stays big endian. For book3s_64 hv we maintain the struct in host native endian, since it never gets shared with the guest. For book3s_64 pr we introduce a variable that tells us which endianness the shared struct is in and route every access to it through helper inline functions that evaluate this variable. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: PR: Fill pvinfo hcall instructions in big endianAlexander Graf
We expose a blob of hypercall instructions to user space that it gives to the guest via device tree again. That blob should contain a stream of instructions necessary to do a hypercall in big endian, as it just gets passed into the guest and old guests use them straight away. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S PR: PAPR: Access RTAS in big endianAlexander Graf
When the guest does an RTAS hypercall it keeps all RTAS variables inside a big endian data structure. To make sure we don't have to bother about endianness inside the actual RTAS handlers, let's just convert the whole structure to host endian before we call our RTAS handlers and back to big endian when we return to the guest. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S PR: PAPR: Access HTAB in big endianAlexander Graf
The HTAB on PPC is always in big endian. When we access it via hypercalls on behalf of the guest and we're running on a little endian host, we need to make sure we swap the bits accordingly. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S PR: Default to big endian guestAlexander Graf
The default MSR when user space does not define anything should be identical on little and big endian hosts, so remove MSR_LE from it. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S_64 PR: Access shadow slb in big endianAlexander Graf
The "shadow SLB" in the PACA is shared with the hypervisor, so it has to be big endian. We access the shadow SLB during world switch, so let's make sure we access it in big endian even when we're on a little endian host. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S_64 PR: Access HTAB in big endianAlexander Graf
The HTAB is always big endian. We access the guest's HTAB using copy_from/to_user, but don't yet take care of the fact that we might be running on an LE host. Wrap all accesses to the guest HTAB with big endian accessors. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S_32: PR: Access HTAB in big endianAlexander Graf
The HTAB is always big endian. We access the guest's HTAB using copy_from/to_user, but don't yet take care of the fact that we might be running on an LE host. Wrap all accesses to the guest HTAB with big endian accessors. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: Book3S: PR: Fix C/R bit settingAlexander Graf
Commit 9308ab8e2d made C/R HTAB updates go byte-wise into the target HTAB. However, it didn't update the guest's copy of the HTAB, but instead the host local copy of it. Write to the guest's HTAB instead. Signed-off-by: Alexander Graf <agraf@suse.de> CC: Paul Mackerras <paulus@samba.org> Acked-by: Paul Mackerras <paulus@samba.org>
2014-05-30KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options onAneesh Kumar K.V
With debug option "sleep inside atomic section checking" enabled we get the below WARN_ON during a PR KVM boot. This is because upstream now have PREEMPT_COUNT enabled even if we have preempt disabled. Fix the warning by adding preempt_disable/enable around floating point and altivec enable. WARNING: at arch/powerpc/kernel/process.c:156 Modules linked in: kvm_pr kvm CPU: 1 PID: 3990 Comm: qemu-system-ppc Tainted: G W 3.15.0-rc1+ #4 task: c0000000eb85b3a0 ti: c0000000ec59c000 task.ti: c0000000ec59c000 NIP: c000000000015c84 LR: d000000003334644 CTR: c000000000015c00 REGS: c0000000ec59f140 TRAP: 0700 Tainted: G W (3.15.0-rc1+) MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 42000024 XER: 20000000 CFAR: c000000000015c24 SOFTE: 1 GPR00: d000000003334644 c0000000ec59f3c0 c000000000e2fa40 c0000000e2f80000 GPR04: 0000000000000800 0000000000002000 0000000000000001 8000000000000000 GPR08: 0000000000000001 0000000000000001 0000000000002000 c000000000015c00 GPR12: d00000000333da18 c00000000fb80900 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 00003fffce4e0fa1 GPR20: 0000000000000010 0000000000000001 0000000000000002 00000000100b9a38 GPR24: 0000000000000002 0000000000000000 0000000000000000 0000000000000013 GPR28: 0000000000000000 c0000000eb85b3a0 0000000000002000 c0000000e2f80000 NIP [c000000000015c84] .enable_kernel_fp+0x84/0x90 LR [d000000003334644] .kvmppc_handle_ext+0x134/0x190 [kvm_pr] Call Trace: [c0000000ec59f3c0] [0000000000000010] 0x10 (unreliable) [c0000000ec59f430] [d000000003334644] .kvmppc_handle_ext+0x134/0x190 [kvm_pr] [c0000000ec59f4c0] [d00000000324b380] .kvmppc_set_msr+0x30/0x50 [kvm] [c0000000ec59f530] [d000000003337cac] .kvmppc_core_emulate_op_pr+0x16c/0x5e0 [kvm_pr] [c0000000ec59f5f0] [d00000000324a944] .kvmppc_emulate_instruction+0x284/0xa80 [kvm] [c0000000ec59f6c0] [d000000003336888] .kvmppc_handle_exit_pr+0x488/0xb70 [kvm_pr] [c0000000ec59f790] [d000000003338d34] kvm_start_lightweight+0xcc/0xdc [kvm_pr] [c0000000ec59f960] [d000000003336288] .kvmppc_vcpu_run_pr+0xc8/0x190 [kvm_pr] [c0000000ec59f9f0] [d00000000324c880] .kvmppc_vcpu_run+0x30/0x50 [kvm] [c0000000ec59fa60] [d000000003249e74] .kvm_arch_vcpu_ioctl_run+0x54/0x1b0 [kvm] [c0000000ec59faf0] [d000000003244948] .kvm_vcpu_ioctl+0x478/0x760 [kvm] [c0000000ec59fcb0] [c000000000224e34] .do_vfs_ioctl+0x4d4/0x790 [c0000000ec59fd90] [c000000000225148] .SyS_ioctl+0x58/0xb0 [c0000000ec59fe30] [c00000000000a1e4] syscall_exit+0x0/0x98 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: BOOK3S: PR: Enable Little Endian PR guestAneesh Kumar K.V
This patch make sure we inherit the LE bit correctly in different case so that we can run Little Endian distro in PR mode Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: E500: Add dcbtls emulationAlexander Graf
The dcbtls instruction is able to lock data inside the L1 cache. We don't want to give the guest actual access to hardware cache locks, as that could influence other VMs on the same system. But we can tell the guest that its locking attempt failed. By implementing the instruction we at least don't give the guest a program exception which it definitely does not expect. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFRAlexander Graf
The L1 instruction cache control register contains bits that indicate that we're still handling a request. Mask those out when we set the SPR so that a read doesn't assume we're still doing something. Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-22KVM: vmx: DR7 masking on task switch emulation is wrongNadav Amit
The DR7 masking which is done on task switch emulation should be in hex format (clearing the local breakpoints enable bits 0,2,4 and 6). Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22x86: fix page fault tracing when KVM guest support enabledDave Hansen
I noticed on some of my systems that page fault tracing doesn't work: cd /sys/kernel/debug/tracing echo 1 > events/exceptions/enable cat trace; # nothing shows up I eventually traced it down to CONFIG_KVM_GUEST. At least in a KVM VM, enabling that option breaks page fault tracing, and disabling fixes it. I tried on some old kernels and this does not appear to be a regression: it never worked. There are two page-fault entry functions today. One when tracing is on and another when it is off. The KVM code calls do_page_fault() directly instead of calling the traced version: > dotraplinkage void __kprobes > do_async_page_fault(struct pt_regs *regs, unsigned long > error_code) > { > enum ctx_state prev_state; > > switch (kvm_read_and_reset_pf_reason()) { > default: > do_page_fault(regs, error_code); > break; > case KVM_PV_REASON_PAGE_NOT_PRESENT: I'm also having problems with the page fault tracing on bare metal (same symptom of no trace output). I'm unsure if it's related. Steven had an alternative to this which has zero overhead when tracing is off where this includes the standard noops even when tracing is disabled. I'm unconvinced that the extra complexity of his apporach: http://lkml.kernel.org/r/20140508194508.561ed220@gandalf.local.home is worth it, expecially considering that the KVM code is already making page fault entry slower here. This solution is dirt-simple. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86@kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Gleb Natapov <gleb@redhat.com> Cc: kvm@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22KVM: x86: get CPL from SS.DPLPaolo Bonzini
CS.RPL is not equal to the CPL in the few instructions between setting CR0.PE and reloading CS. And CS.DPL is also not equal to the CPL for conforming code segments. However, SS.DPL *is* always equal to the CPL except for the weird case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the value in the STAR MSR, but force CPL=3 (Intel instead forces SS.DPL=SS.RPL=CPL=3). So this patch: - modifies SVM to update the CPL from SS.DPL rather than CS.RPL; the above case with SYSRET is not broken further, and the way to fix it would be to pass the CPL to userspace and back - modifies VMX to always return the CPL from SS.DPL (except forcing it to 0 if we are emulating real mode via vm86 mode; in vm86 mode all DPLs have to be 3, but real mode does allow privileged instructions). It also removes the CPL cache, which becomes a duplicate of the SS access rights cache. This fixes doing KVM_IOCTL_SET_SREGS exactly after setting CR0.PE=1 but before CS has been reloaded. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22KVM: x86: check CS.DPL against RPL during task switchPaolo Bonzini
Table 7-1 of the SDM mentions a check that the code segment's DPL must match the selector's RPL. This was not done by KVM, fix it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22KVM: x86: drop set_rflags callbackPaolo Bonzini
Not needed anymore now that the CPL is computed directly during task switch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22KVM: x86: use new CS.RPL as CPL during task switchPaolo Bonzini
During task switch, all of CS.DPL, CS.RPL, SS.DPL must match (in addition to all the other requirements) and will be the new CPL. So far this worked by carefully setting the CS selector and flag before doing the task switch; setting CS.selector will already change the CPL. However, this will not work once we get the CPL from SS.DPL, because then you will have to set the full segment descriptor cache to change the CPL. ctxt->ops->cpl(ctxt) will then return the old CPL during the task switch, and the check that SS.DPL == CPL will fail. Temporarily assume that the CPL comes from CS.RPL during task switch to a protected-mode task. This is the same approach used in QEMU's emulation code, which (until version 2.0) manually tracks the CPL. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-16KVM: s390: split SIE state guest prefix fieldMichael Mueller
This patch splits the SIE state guest prefix at offset 4 into a prefix bit field. Additionally it provides the access functions: - kvm_s390_get_prefix() - kvm_s390_set_prefix() to access the prefix per vcpu. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
2014-05-16s390/sclp: add sclp_get_ibc functionMichael Mueller
The patch adds functionality to retrieve the IBC configuration by means of function sclp_get_ibc(). Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
2014-05-16KVM: s390: interpretive execution of SIGP EXTERNAL CALLDavid Hildenbrand
If the sigp interpretation facility is installed, most SIGP EXTERNAL CALL operations will be interpreted instead of intercepted. A partial execution interception will occurr at the sending cpu only if the target cpu is in the wait state ("W" bit in the cpuflags set). Instruction interception will only happen in error cases (e.g. cpu addr invalid). As a sending cpu might set the external call interrupt pending flags at the target cpu at every point in time, we can't handle this kind of interrupt using our kvm interrupt injection mechanism. The injection will be done automatically by the SIE when preparing the start of the target cpu. Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> CC: Thomas Huth <thuth@linux.vnet.ibm.com> [Adopt external call injection to check for sigp interpretion] Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2014-05-16KVM: s390: Use intercept_insn decoder in trace eventAlexander Yarygin
The current trace definition doesn't work very well with the perf tool. Perf shows a "insn_to_mnemonic not found" message. Let's handle the decoding completely in a parseable format. Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2014-05-16KVM: s390: decoder of SIE intercepted instructionsAlexander Yarygin
This patch adds a new decoder of SIE intercepted instructions. The decoder implemented as a macro and potentially can be used in both kernelspace and userspace. Note that this simplified instruction decoder is only intended to be used with the subset of instructions that may cause a SIE intercept. Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>