Age | Commit message (Collapse) | Author |
|
With subfacets, we'd expect megaflow updates message to carry
the original micro flow. If not, EINVAL is returned and kernel
logs an error message. Now that the user space subfacet layer is
removed, it is expected that flow updates can arrive with a
micro flow other than the original. Change the return code to
EEXIST and remove the kernel error log message.
Reported-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
ovs_flow_free() is not called under ovs-lock during packet
execute path (ovs_packet_cmd_execute()). Since packet execute
does not touch flow->mask, there is no need to take that
lock either. So move assert in case where flow->mask is checked.
Found by code inspection.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
commit 43d4be9cb55f3bac5253e9289996fd9d735531db (openvswitch: Allow user space
to announce ability to accept unaligned Netlink messages) introduced
OVS_DP_ATTR_USER_FEATURES netlink attribute in datapath responses,
but the attribute size was not taken into account in ovs_dp_cmd_msg_size().
Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Both mega flow mask's reference counter and per flow table mask list
should only be accessed when holding ovs_mutex() lock. However
this is not true with ovs_flow_table_flush(). The patch fixes this bug.
Reported-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
While the zerocopy method is correctly omitted if user space
does not support unaligned Netlink messages. The attribute is
still not padded correctly as skb_zerocopy() will not ensure
padding and the attribute size is no longer pre calculated
though nla_reserve() which ensured padding previously.
This patch applies appropriate padding if a linear data copy
was performed in skb_zerocopy().
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
This patch removes the net_random and net_srandom macros and replaces
them with direct calls to the prandom ones. As new commits only seem to
use prandom_u32 there is no use to keep them around.
This change makes it easier to grep for users of prandom_u32.
Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
memory allocated by kmem_cache_alloc() should be freed using
kmem_cache_free(), not kfree().
Fixes: e298e5057006 ('openvswitch: Per cpu flow stats.')
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Jesse Gross says:
====================
[GIT net-next] Open vSwitch
Open vSwitch changes for net-next/3.14. Highlights are:
* Performance improvements in the mechanism to get packets to userspace
using memory mapped netlink and skb zero copy where appropriate.
* Per-cpu flow stats in situations where flows are likely to be shared
across CPUs. Standard flow stats are used in other situations to save
memory and allocation time.
* A handful of code cleanups and rationalization.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Several functions and datastructures could be local
Found with 'make namespacecheck'
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
The copy & csum optimization is no longer present with zerocopy
enabled. Compute the checksum in skb_gso_segment() directly by
dropping the HW CSUM capability from the features passed in.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Use of skb_zerocopy() can avoid the expensive call to memcpy()
when copying the packet data into the Netlink skb. Completes
checksum through skb_checksum_help() if not already done in
GSO segmentation.
Zerocopy is only performed if user space supported unaligned
Netlink messages. memory mapped netlink i/o is preferred over
zerocopy if it is set up.
Cost of upcall is significantly reduced from:
+ 7.48% vhost-8471 [k] memcpy
+ 5.57% ovs-vswitchd [k] memcpy
+ 2.81% vhost-8471 [k] csum_partial_copy_generic
to:
+ 5.72% ovs-vswitchd [k] memcpy
+ 3.32% vhost-5153 [k] memcpy
+ 0.68% vhost-5153 [k] skb_zerocopy
(megaflows disabled)
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Allows removing the net and dp_ifindex argument and simplify the
code.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Drop user features if an outdated user space instance that does not
understand the concept of user_features attempted to create a new
datapath.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Netlink messages
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Remove duplicated include.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
As we're only doing a kfree() anyway in the RCU callback, we can
simply use kfree_rcu, which does the same job, and remove the
function rcu_free_sw_flow_mask_cb() and rcu_free_acts_callback().
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
With mega flow implementation ovs flow can be shared between
multiple CPUs which makes stats updates highly contended
operation. This patch uses per-CPU stats in cases where a flow
is likely to be shared (if there is a wildcard in the 5-tuple
and therefore likely to be spread by RSS). In other situations,
it uses the current strategy, saving memory and allocation time.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Use memory mapped Netlink i/o for all unicast openvswitch
communication if a ring has been set up.
Benchmark
* pktgen -> ovs internal port
* 5M pkts, 5M flows
* 4 threads, 8 cores
Before:
Result: OK: 67418743(c67108212+d310530) usec, 5000000 (9000byte,0frags)
74163pps 5339Mb/sec (5339736000bps) errors: 0
+ 2.98% ovs-vswitchd [k] copy_user_generic_string
+ 2.49% ovs-vswitchd [k] memcpy
+ 1.84% kpktgend_2 [k] memcpy
+ 1.81% kpktgend_1 [k] memcpy
+ 1.81% kpktgend_3 [k] memcpy
+ 1.78% kpktgend_0 [k] memcpy
After:
Result: OK: 24229690(c24127165+d102524) usec, 5000000 (9000byte,0frags)
206358pps 14857Mb/sec (14857776000bps) errors: 0
+ 2.80% ovs-vswitchd [k] memcpy
+ 1.31% kpktgend_2 [k] memcpy
+ 1.23% kpktgend_0 [k] memcpy
+ 1.09% kpktgend_1 [k] memcpy
+ 1.04% kpktgend_3 [k] memcpy
+ 0.96% ovs-vswitchd [k] copy_user_generic_string
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Flow lookup can happen either in packet processing context or userspace
context but it was annotated as requiring RCU read lock to be held. This
also allows OVS mutex to be held without causing warnings.
Reported-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Thomas Graf <tgraf@redhat.com>
|
|
API changes only for code readability. No functional chnages.
This patch removes the underscored version. Added a new API
ovs_flow_tbl_lookup_stats() that returns the n_mask_hits.
Reported by: Ben Pfaff <blp@nicira.com>
Reviewed-by: Thomas Graf <tgraf@redhat.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
We won't normally have a ton of flow masks but using a size_t to store
values no bigger than sizeof(struct sw_flow_key) seems excessive.
This reduces sw_flow_key_range and sw_flow_mask by 4 bytes on 32-bit
systems. On 64-bit systems it shrinks sw_flow_key_range by 12 bytes but
sw_flow_mask only by 8 bytes due to padding.
Compile tested only.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Signed-off-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
They are same, so unify them as one, pcpu_sw_netstats.
Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats
from if_tunnel and remove br_cpu_netstats from br_private.h
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In several places 'skb->rxhash = 0' is being done to clear the
rxhash value in an skb. This does not clear l4_rxhash which could
still be set so that the rxhash wouldn't be recalculated on subsequent
call to skb_get_rxhash. This patch adds an explict function to clear
all the rxhash related information in the skb properly.
skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as
l4_rxhash.
Fixed up places where 'skb->rxhash = 0' was being called.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently OVS uses jhash2() for calculating flow hashes in its
internal flow_hash() function. The performance of the flow_hash()
function is critical, as the input data can be hundreds of bytes
long.
OVS is largely deployed in x86_64 based datacenters. Therefore,
we argue that the performance critical fast path of OVS should
exploit underlying CPU features in order to reduce the per packet
processing costs. We replace jhash2 with the hash implementation
provided by the kernel hash lib, which exploits the crc32l
instruction to achieve high performance
Our patch greatly reduces the hash footprint from ~200 cycles of
jhash2() to around ~90 cycles in case of ovs_flow_hash_crc()
(measured with rdtsc over maximum length flow keys on an i7 Intel
CPU).
Additionally, we wrote a microbenchmark to stress the flow table
performance. The benchmark inserts random flows into the flow
hash and then performs lookups. Our hash deployed on a CRC32
capable CPU reduces the lookup for 1000 flows, 100 masks from
~10,100us to ~6,700us, for example.
Thus, simply use the newly introduced arch_fast_hash2() as a
drop-in replacement.
Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Thomas Graf <tgraf@redhat.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull networking fixes from David Miller:
"Mostly these are fixes for fallout due to merge window changes, as
well as cures for problems that have been with us for a much longer
period of time"
1) Johannes Berg noticed two major deficiencies in our genetlink
registration. Some genetlink protocols we passing in constant
counts for their ops array rather than something like
ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
using fixed IDs for their multicast groups.
We have to retain these fixed IDs to keep existing userland tools
working, but reserve them so that other multicast groups used by
other protocols can not possibly conflict.
In dealing with these two problems, we actually now use less state
management for genetlink operations and multicast groups.
2) When configuring interface hardware timestamping, fix several
drivers that simply do not validate that the hwtstamp_config value
is one the driver actually supports. From Ben Hutchings.
3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.
4) In dev_forward_skb(), set the skb->protocol in the right order
relative to skb_scrub_packet(). From Alexei Starovoitov.
5) Bridge erroneously fails to use the proper wrapper functions to make
calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
Makita.
6) When detaching a bridge port, make sure to flush all VLAN IDs to
prevent them from leaking, also from Toshiaki Makita.
7) Put in a compromise for TCP Small Queues so that deep queued devices
that delay TX reclaim non-trivially don't have such a performance
decrease. One particularly problematic area is 802.11 AMPDU in
wireless. From Eric Dumazet.
8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
here. Fix from Eric Dumzaet, reported by Dave Jones.
9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.
10) When computing mergeable buffer sizes, virtio-net fails to take the
virtio-net header into account. From Michael Dalton.
11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
bumping, this one has been with us for a while. From Eric Dumazet.
12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
Hugne.
13) 6lowpan bit used for traffic classification was wrong, from Jukka
Rissanen.
14) macvlan has the same issue as normal vlans did wrt. propagating LRO
disabling down to the real device, fix it the same way. From Michal
Kubecek.
15) CPSW driver needs to soft reset all slaves during suspend, from
Daniel Mack.
16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.
17) The xen-netfront RX buffer refill timer isn't properly scheduled on
partial RX allocation success, from Ma JieYue.
18) When ipv6 ping protocol support was added, the AF_INET6 protocol
initialization cleanup path on failure was borked a little. Fix
from Vlad Yasevich.
19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
blocks we can do the wrong thing with the msg_name we write back to
userspace. From Hannes Frederic Sowa. There is another fix in the
works from Hannes which will prevent future problems of this nature.
20) Fix route leak in VTI tunnel transmit, from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
genetlink: make multicast groups const, prevent abuse
genetlink: pass family to functions using groups
genetlink: add and use genl_set_err()
genetlink: remove family pointer from genl_multicast_group
genetlink: remove genl_unregister_mc_group()
hsr: don't call genl_unregister_mc_group()
quota/genetlink: use proper genetlink multicast APIs
drop_monitor/genetlink: use proper genetlink multicast APIs
genetlink: only pass array to genl_register_family_with_ops()
tcp: don't update snd_nxt, when a socket is switched from repair mode
atm: idt77252: fix dev refcnt leak
xfrm: Release dst if this dst is improper for vti tunnel
netlink: fix documentation typo in netlink_set_err()
be2net: Delete secondary unicast MAC addresses during be_close
be2net: Fix unconditional enabling of Rx interface options
net, virtio_net: replace the magic value
ping: prevent NULL pointer dereference on write to msg_name
bnx2x: Prevent "timeout waiting for state X"
bnx2x: prevent CFC attention
bnx2x: Prevent panic during DMAE timeout
...
|
|
Register generic netlink multicast groups as an array with
the family and give them contiguous group IDs. Then instead
of passing the global group ID to the various functions that
send messages, pass the ID relative to the family - for most
families that's just 0 because the only have one group.
This avoids the list_head and ID in each group, adding a new
field for the mcast group ID offset to the family.
At the same time, this allows us to prevent abusing groups
again like the quota and dropmon code did, since we can now
check that a family only uses a group it owns.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This doesn't really change anything, but prepares for the
next patch that will change the APIs to pass the group ID
within the family, rather than the global group ID.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a static inline to generic netlink to wrap netlink_set_err()
to make it easier to use here - use it in openvswitch (the only
generic netlink user of netlink_set_err()).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As suggested by David Miller, make genl_register_family_with_ops()
a macro and pass only the array, evaluating ARRAY_SIZE() in the
macro, this is a little safer.
The openvswitch has some indirection, assing ops/n_ops directly in
that code. This might ultimately just assign the pointers in the
family initializations, saving the struct genl_family_and_ops and
code (once mcast groups are handled differently.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now that genl_ops are no longer modified in place when
registering, they can be made const. This patch was done
mostly with spatch:
@@
identifier ops;
@@
+const
struct genl_ops ops[] = {
...
};
(except the struct thing in net/openvswitch/datapath.c)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking changes from Ingo Molnar:
"The biggest changes:
- add lockdep support for seqcount/seqlocks structures, this
unearthed both bugs and required extra annotation.
- move the various kernel locking primitives to the new
kernel/locking/ directory"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
block: Use u64_stats_init() to initialize seqcounts
locking/lockdep: Mark __lockdep_count_forward_deps() as static
lockdep/proc: Fix lock-time avg computation
locking/doc: Update references to kernel/mutex.c
ipv6: Fix possible ipv6 seqlock deadlock
cpuset: Fix potential deadlock w/ set_mems_allowed
seqcount: Add lockdep functionality to seqcount/seqlock structures
net: Explicitly initialize u64_stats_sync structures for lockdep
locking: Move the percpu-rwsem code to kernel/locking/
locking: Move the lglocks code to kernel/locking/
locking: Move the rwsem code to kernel/locking/
locking: Move the rtmutex code to kernel/locking/
locking: Move the semaphore core to kernel/locking/
locking: Move the spinlock code to kernel/locking/
locking: Move the lockdep code to kernel/locking/
locking: Move the mutex code to kernel/locking/
hung_task debugging: Add tracepoint to report the hang
x86/locking/kconfig: Update paravirt spinlock Kconfig description
lockstat: Report avg wait and hold times
lockdep, x86/alternatives: Drop ancient lockdep fixup message
...
|
|
In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.
The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.
This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.
Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.
Feedback would be appreciated!
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Jesse Gross says:
====================
Open vSwitch
A set of updates for net-next/3.13. Major changes are:
* Restructure flow handling code to be more logically organized and
easier to read.
* Rehashing of the flow table is moved from a workqueue to flow
installation time. Before, heavy load could block the workqueue for
excessive periods of time.
* Additional debugging information is provided to help diagnose megaflows.
* It's now possible to match on TCP flags.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
drivers/net/ethernet/emulex/benet/be.h
drivers/net/netconsole.c
net/bridge/br_private.h
Three mostly trivial conflicts.
The net/bridge/br_private.h conflict was a function signature (argument
addition) change overlapping with the extern removals from Joe Perches.
In drivers/net/netconsole.c we had one change adjusting a printk message
whilst another changed "printk(KERN_INFO" into "pr_info(".
Lastly, the emulex change was a new inline function addition overlapping
with Joe Perches's extern removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Flow->hash can be used to detect hash collisions and avoid flow key
compare in flow lookup.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
tcp_flags=flags/mask
Bitwise match on TCP flags. The flags and mask are 16-bit num‐
bers written in decimal or in hexadecimal prefixed by 0x. Each
1-bit in mask requires that the corresponding bit in port must
match. Each 0-bit in mask causes the corresponding bit to be
ignored.
TCP protocol currently defines 9 flag bits, and additional 3
bits are reserved (must be transmitted as zero), see RFCs 793,
3168, and 3540. The flag bits are, numbering from the least
significant bit:
0: FIN No more data from sender.
1: SYN Synchronize sequence numbers.
2: RST Reset the connection.
3: PSH Push function.
4: ACK Acknowledgement field significant.
5: URG Urgent pointer field significant.
6: ECE ECN Echo.
7: CWR Congestion Windows Reduced.
8: NS Nonce Sum.
9-11: Reserved.
12-15: Not matchable, must be zero.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Widen TCP flags handling from 7 bits (uint8_t) to 12 bits (uint16_t).
The kernel interface remains at 8 bits, which makes no functional
difference now, as none of the higher bits is currently of interest
to the userspace.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
OVS already can handle all types of segmentation offloads that
are supported by the kernel.
Following patch specifically enables UDP and IPV6 segmentation
offloads.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Collect mega flow mask stats. ovs-dpctl show command can be used to
display them for debugging and performance tuning.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
The combination of two commits:
commit 8e4e1713e4
("openvswitch: Simplify datapath locking.")
commit 2537b4dd0a
("openvswitch:: link upper device for port devices")
introduced a bug where upper_dev wasn't unlinked upon
netdev_unregister notification
The following steps:
modprobe openvswitch
ovs-dpctl add-dp test
ip tuntap add dev tap1 mode tap
ovs-dpctl add-if test tap1
ip tuntap del dev tap1 mode tap
are causing multiple warnings:
[ 62.747557] gre: GRE over IPv4 demultiplexor driver
[ 62.749579] openvswitch: Open vSwitch switching datapath
[ 62.755087] device test entered promiscuous mode
[ 62.765911] device tap1 entered promiscuous mode
[ 62.766033] IPv6: ADDRCONF(NETDEV_UP): tap1: link is not ready
[ 62.769017] ------------[ cut here ]------------
[ 62.769022] WARNING: CPU: 1 PID: 3267 at net/core/dev.c:5501 rollback_registered_many+0x20f/0x240()
[ 62.769023] Modules linked in: openvswitch gre vxlan ip_tunnel libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM iptable_mangle ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc vhost_net macvtap macvlan vhost kvm_intel kvm dm_crypt iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi hid_generic mxm_wmi eeepc_wmi asus_wmi sparse_keymap dm_multipath psmouse serio_raw usbhid hid parport_pc ppdev firewire_ohci lpc_ich firewire_core e1000e crc_itu_t binfmt_misc igb dca ptp pps_core mac_hid wmi lp parport i2o_config i2o_block video
[ 62.769051] CPU: 1 PID: 3267 Comm: ip Not tainted 3.12.0-rc3+ #60
[ 62.769052] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[ 62.769053] 0000000000000009 ffff8807f25cbd28 ffffffff8175e575 0000000000000006
[ 62.769055] 0000000000000000 ffff8807f25cbd68 ffffffff8105314c ffff8807f25cbd58
[ 62.769057] ffff8807f2634000 ffff8807f25cbdc8 ffff8807f25cbd88 ffff8807f25cbdc8
[ 62.769059] Call Trace:
[ 62.769062] [<ffffffff8175e575>] dump_stack+0x55/0x76
[ 62.769065] [<ffffffff8105314c>] warn_slowpath_common+0x8c/0xc0
[ 62.769067] [<ffffffff8105319a>] warn_slowpath_null+0x1a/0x20
[ 62.769069] [<ffffffff8162a04f>] rollback_registered_many+0x20f/0x240
[ 62.769071] [<ffffffff8162a101>] rollback_registered+0x31/0x40
[ 62.769073] [<ffffffff8162a488>] unregister_netdevice_queue+0x58/0x90
[ 62.769075] [<ffffffff8154f900>] __tun_detach+0x140/0x340
[ 62.769077] [<ffffffff8154fb36>] tun_chr_close+0x36/0x60
[ 62.769080] [<ffffffff811bddaf>] __fput+0xff/0x260
[ 62.769082] [<ffffffff811bdf5e>] ____fput+0xe/0x10
[ 62.769084] [<ffffffff8107b515>] task_work_run+0xb5/0xe0
[ 62.769087] [<ffffffff810029b9>] do_notify_resume+0x59/0x80
[ 62.769089] [<ffffffff813a41fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 62.769091] [<ffffffff81770f5a>] int_signal+0x12/0x17
[ 62.769093] ---[ end trace 838756c62e156ffb ]---
[ 62.769481] ------------[ cut here ]------------
[ 62.769485] WARNING: CPU: 1 PID: 92 at fs/sysfs/inode.c:325 sysfs_hash_and_remove+0xa9/0xb0()
[ 62.769486] sysfs: can not remove 'master', no directory
[ 62.769486] Modules linked in: openvswitch gre vxlan ip_tunnel libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM iptable_mangle ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc vhost_net macvtap macvlan vhost kvm_intel kvm dm_crypt iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi hid_generic mxm_wmi eeepc_wmi asus_wmi sparse_keymap dm_multipath psmouse serio_raw usbhid hid parport_pc ppdev firewire_ohci lpc_ich firewire_core e1000e crc_itu_t binfmt_misc igb dca ptp pps_core mac_hid wmi lp parport i2o_config i2o_block video
[ 62.769514] CPU: 1 PID: 92 Comm: kworker/1:2 Tainted: G W 3.12.0-rc3+ #60
[ 62.769515] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[ 62.769518] Workqueue: events ovs_dp_notify_wq [openvswitch]
[ 62.769519] 0000000000000009 ffff880807ad3ac8 ffffffff8175e575 0000000000000006
[ 62.769521] ffff880807ad3b18 ffff880807ad3b08 ffffffff8105314c ffff880807ad3b28
[ 62.769523] 0000000000000000 ffffffff81a87a1f ffff8807f2634000 ffff880037038500
[ 62.769525] Call Trace:
[ 62.769528] [<ffffffff8175e575>] dump_stack+0x55/0x76
[ 62.769529] [<ffffffff8105314c>] warn_slowpath_common+0x8c/0xc0
[ 62.769531] [<ffffffff81053236>] warn_slowpath_fmt+0x46/0x50
[ 62.769533] [<ffffffff8123e7e9>] sysfs_hash_and_remove+0xa9/0xb0
[ 62.769535] [<ffffffff81240e96>] sysfs_remove_link+0x26/0x30
[ 62.769538] [<ffffffff81631ef7>] __netdev_adjacent_dev_remove+0xf7/0x150
[ 62.769540] [<ffffffff81632037>] __netdev_adjacent_dev_unlink_lists+0x27/0x50
[ 62.769542] [<ffffffff8163213a>] __netdev_adjacent_dev_unlink_neighbour+0x3a/0x50
[ 62.769544] [<ffffffff8163218d>] netdev_upper_dev_unlink+0x3d/0x140
[ 62.769548] [<ffffffffa033c2db>] netdev_destroy+0x4b/0x80 [openvswitch]
[ 62.769550] [<ffffffffa033b696>] ovs_vport_del+0x46/0x60 [openvswitch]
[ 62.769552] [<ffffffffa0335314>] ovs_dp_detach_port+0x44/0x60 [openvswitch]
[ 62.769555] [<ffffffffa0336574>] ovs_dp_notify_wq+0xb4/0x150 [openvswitch]
[ 62.769557] [<ffffffff81075c28>] process_one_work+0x1d8/0x6a0
[ 62.769559] [<ffffffff81075bc8>] ? process_one_work+0x178/0x6a0
[ 62.769562] [<ffffffff8107659b>] worker_thread+0x11b/0x370
[ 62.769564] [<ffffffff81076480>] ? rescuer_thread+0x350/0x350
[ 62.769566] [<ffffffff8107f44a>] kthread+0xea/0xf0
[ 62.769568] [<ffffffff8107f360>] ? flush_kthread_worker+0x150/0x150
[ 62.769570] [<ffffffff81770bac>] ret_from_fork+0x7c/0xb0
[ 62.769572] [<ffffffff8107f360>] ? flush_kthread_worker+0x150/0x150
[ 62.769573] ---[ end trace 838756c62e156ffc ]---
[ 62.769574] ------------[ cut here ]------------
[ 62.769576] WARNING: CPU: 1 PID: 92 at fs/sysfs/inode.c:325 sysfs_hash_and_remove+0xa9/0xb0()
[ 62.769577] sysfs: can not remove 'upper_test', no directory
[ 62.769577] Modules linked in: openvswitch gre vxlan ip_tunnel libcrc32c ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM iptable_mangle ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc vhost_net macvtap macvlan vhost kvm_intel kvm dm_crypt iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi hid_generic mxm_wmi eeepc_wmi asus_wmi sparse_keymap dm_multipath psmouse serio_raw usbhid hid parport_pc ppdev firewire_ohci lpc_ich firewire_core e1000e crc_itu_t binfmt_misc igb dca ptp pps_core mac_hid wmi lp parport i2o_config i2o_block video
[ 62.769603] CPU: 1 PID: 92 Comm: kworker/1:2 Tainted: G W 3.12.0-rc3+ #60
[ 62.769604] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[ 62.769606] Workqueue: events ovs_dp_notify_wq [openvswitch]
[ 62.769607] 0000000000000009 ffff880807ad3ac8 ffffffff8175e575 0000000000000006
[ 62.769609] ffff880807ad3b18 ffff880807ad3b08 ffffffff8105314c ffff880807ad3b58
[ 62.769611] 0000000000000000 ffff880807ad3bd9 ffff8807f2634000 ffff880037038500
[ 62.769613] Call Trace:
[ 62.769615] [<ffffffff8175e575>] dump_stack+0x55/0x76
[ 62.769617] [<ffffffff8105314c>] warn_slowpath_common+0x8c/0xc0
[ 62.769619] [<ffffffff81053236>] warn_slowpath_fmt+0x46/0x50
[ 62.769621] [<ffffffff8123e7e9>] sysfs_hash_and_remove+0xa9/0xb0
[ 62.769622] [<ffffffff81240e96>] sysfs_remove_link+0x26/0x30
[ 62.769624] [<ffffffff81631f22>] __netdev_adjacent_dev_remove+0x122/0x150
[ 62.769627] [<ffffffff81632037>] __netdev_adjacent_dev_unlink_lists+0x27/0x50
[ 62.769629] [<ffffffff8163213a>] __netdev_adjacent_dev_unlink_neighbour+0x3a/0x50
[ 62.769631] [<ffffffff8163218d>] netdev_upper_dev_unlink+0x3d/0x140
[ 62.769633] [<ffffffffa033c2db>] netdev_destroy+0x4b/0x80 [openvswitch]
[ 62.769636] [<ffffffffa033b696>] ovs_vport_del+0x46/0x60 [openvswitch]
[ 62.769638] [<ffffffffa0335314>] ovs_dp_detach_port+0x44/0x60 [openvswitch]
[ 62.769640] [<ffffffffa0336574>] ovs_dp_notify_wq+0xb4/0x150 [openvswitch]
[ 62.769642] [<ffffffff81075c28>] process_one_work+0x1d8/0x6a0
[ 62.769644] [<ffffffff81075bc8>] ? process_one_work+0x178/0x6a0
[ 62.769646] [<ffffffff8107659b>] worker_thread+0x11b/0x370
[ 62.769648] [<ffffffff81076480>] ? rescuer_thread+0x350/0x350
[ 62.769650] [<ffffffff8107f44a>] kthread+0xea/0xf0
[ 62.769652] [<ffffffff8107f360>] ? flush_kthread_worker+0x150/0x150
[ 62.769654] [<ffffffff81770bac>] ret_from_fork+0x7c/0xb0
[ 62.769656] [<ffffffff8107f360>] ? flush_kthread_worker+0x150/0x150
[ 62.769657] ---[ end trace 838756c62e156ffd ]---
[ 62.769724] device tap1 left promiscuous mode
This patch also affects moving devices between net namespaces.
OVS used to ignore netns move notifications which caused problems.
Like:
ovs-dpctl add-if test tap1
ip link set tap1 netns 3512
and then removing tap1 inside the namespace will cause hang on missing dev_put.
With this patch OVS will detach dev upon receiving netns move event.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Hides mega-flow implementation in flow_table.c rather than
datapath.c.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
ovs-flow rehash does not touch mega flow list. Following patch
moves it dp struct datapath. Avoid one extra indirection for
accessing mega-flow list head on every packet receive.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Over the time datapath.c and flow.c has became pretty large files.
Following patch restructures functionality of component into three
different components:
flow.c: contains flow extract.
flow_netlink.c: netlink flow api.
flow_table.c: flow table api.
This patch restructures code without changing logic.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
- Move sysctl_local_ports from a global variable into struct netns_ipv4.
- Modify inet_get_local_port_range to take a struct net, and update all
of the callers.
- Move the initialization of sysctl_local_ports into
sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c
v2:
- Ensure indentation used tabs
- Fixed ip.h so it applies cleanly to todays net-next
v3:
- Compile fixes of strange callers of inet_get_local_port_range.
This patch now successfully passes an allmodconfig build.
Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range
Originally-by: Samya <samya@twitter.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Remove duplicated include.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Remove duplicated include.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
Rehashing in ovs-workqueue can cause ovs-mutex lock contentions
in case of heavy flow setups where both needs ovs-mutex. So by
moving rehashing to flow-setup we can eliminate contention.
This also simplify ovs locking and reduces dependence on
workqueue.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
|
|
In function __parse_flow_nlattrs(), we check for condition
(type > OVS_KEY_ATTR_MAX) and if true, print an error, but we do
not return from this function as in other checks. It seems this
has been forgotten, as otherwise, we could access beyond the
memory of ovs_key_lens, which is of ovs_key_lens[OVS_KEY_ATTR_MAX + 1].
Hence, a maliciously prepared nla_type from user space could access
beyond this upper limit.
Introduced by 03f0d916a ("openvswitch: Mega flow implementation").
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Andy Zhou <azhou@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
sw_flow_key alignment was declared as " __aligned(__alignof__(long))".
However, this breaks on the m68k architecture where long is 32 bit in
size but 16 bit aligned by default. This aligns to the size of a long to
ensure that we can always do comparsions in full long-sized chunks. It
also adds an additional build check to catch any reduction in alignment.
CC: Andy Zhou <azhou@nicira.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|