[dpdk-dev] vhost: remove lockless enqueue to the virtio ring

Message ID 1451918787-85887-1-git-send-email-huawei.xie@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Yuanhan Liu
Headers

Commit Message

Huawei Xie Jan. 4, 2016, 2:46 p.m. UTC
  This patch removes the internal lockless enqueue implmentation.
DPDK doesn't support receiving/transmitting packets from/to the same
queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK
applications normally have their own lock implmentation when enqueue
packets to the same queue of a port.

The atomic cmpset is a costly operation. This patch should help
performance a bit.

Signed-off-by: Huawei Xie <huawei.xie@intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 86 +++++++++++++------------------------------
 1 file changed, 25 insertions(+), 61 deletions(-)
  

Comments

Huawei Xie Jan. 5, 2016, 7:16 a.m. UTC | #1
On 1/5/2016 2:42 PM, Xie, Huawei wrote:
> This patch removes the internal lockless enqueue implmentation.
> DPDK doesn't support receiving/transmitting packets from/to the same
> queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK
> applications normally have their own lock implmentation when enqueue
> packets to the same queue of a port.
>
> The atomic cmpset is a costly operation. This patch should help
> performance a bit.
>
> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
This patch modifies the API's behavior, which is also a trivial ABI
change. In my opinion, application shouldn't rely on previous behavior.
Anyway, i am checking how to declare the ABI change.
  
Jianfeng Tan Jan. 19, 2016, 4:25 p.m. UTC | #2
Hi Huawei,

On 1/4/2016 10:46 PM, Huawei Xie wrote:
> This patch removes the internal lockless enqueue implmentation.
> DPDK doesn't support receiving/transmitting packets from/to the same
> queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK
> applications normally have their own lock implmentation when enqueue
> packets to the same queue of a port.
>
> The atomic cmpset is a costly operation. This patch should help
> performance a bit.
>
> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> ---
>   lib/librte_vhost/vhost_rxtx.c | 86 +++++++++++++------------------------------
>   1 file changed, 25 insertions(+), 61 deletions(-)
>
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index bbf3fac..26a1b9c 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c

I think vhost example will not work well with this patch when 
vm2vm=software.

Test case:
Two virtio ports handled by two pmd threads. Thread 0 polls pkts from 
physical NIC and sends to virtio0, while thread0 receives pkts from 
virtio1 and routes it to virtio0.

> -
>   		*(volatile uint16_t *)&vq->used->idx += entry_success;

Another unrelated question: We ever try to move this assignment out of 
loop to save cost as it's a data contention?

Thanks,
Jianfeng
  
Huawei Xie Jan. 19, 2016, 4:43 p.m. UTC | #3
On 1/20/2016 12:25 AM, Tan, Jianfeng wrote:
> Hi Huawei,
>
> On 1/4/2016 10:46 PM, Huawei Xie wrote:
>> This patch removes the internal lockless enqueue implmentation.
>> DPDK doesn't support receiving/transmitting packets from/to the same
>> queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK
>> applications normally have their own lock implmentation when enqueue
>> packets to the same queue of a port.
>>
>> The atomic cmpset is a costly operation. This patch should help
>> performance a bit.
>>
>> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
>> ---
>>   lib/librte_vhost/vhost_rxtx.c | 86
>> +++++++++++++------------------------------
>>   1 file changed, 25 insertions(+), 61 deletions(-)
>>
>> diff --git a/lib/librte_vhost/vhost_rxtx.c
>> b/lib/librte_vhost/vhost_rxtx.c
>> index bbf3fac..26a1b9c 100644
>> --- a/lib/librte_vhost/vhost_rxtx.c
>> +++ b/lib/librte_vhost/vhost_rxtx.c
>
> I think vhost example will not work well with this patch when
> vm2vm=software.
>
> Test case:
> Two virtio ports handled by two pmd threads. Thread 0 polls pkts from
> physical NIC and sends to virtio0, while thread0 receives pkts from
> virtio1 and routes it to virtio0.

vhost port will be wrapped as port, by vhost PMD. DPDK APP treats all
physical and virtual ports as ports equally. When two DPDK threads try
to enqueue to the same port, the APP needs to consider the contention.
All the physical PMDs doesn't support concurrent enqueuing/dequeuing.
Vhost PMD should expose the same behavior unless absolutely necessary
and we expose the difference of different PMD.

>
>> -
>>           *(volatile uint16_t *)&vq->used->idx += entry_success;
>
> Another unrelated question: We ever try to move this assignment out of
> loop to save cost as it's a data contention?

This operation itself is not that costly, but it has side effect on the
cache transfer.
It is outside of the loop for non-mergable case. For mergeable case, it
is inside the loop.
Actually it has pro and cons whether we do this in burst or in a smaller
step. I prefer to move it outside of the loop. Let us address this later.

>
> Thanks,
> Jianfeng
>
>
  
Mike A. Polehn Jan. 19, 2016, 6:33 p.m. UTC | #4
SMP operations can be very expensive, sometimes can impact operations by 100s to 1000s of clock cycles depending on what is the circumstances of the synchronization. It is how you arrange the SMP operations within the tasks at hand across the SMP cores that gives methods for top performance.  Using traditional general purpose SMP methods will result in traditional general purpose performance. Migrating to general libraries (understood by most general purpose programmers) from expert abilities (understood by much smaller group of expert programmers focused on performance) will greatly reduce the value of DPDK since the end result will be lower performance and/or have less predictable operation where rate performance, predictability, and low latency are the primary goals.

The best method to date, is to have multiple outputs to a single port is to use a DPDK queue with multiple producer, single consumer to do an SMP operation for multiple sources to feed a single non SMP task to output to the port (that is why the ports are not SMP protected). Also when considerable contention from multiple sources occur often (data feeding at same time), having DPDK queue with input and output variables  in separate cache lines can have a notable throughput improvement.

Mike 

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Xie, Huawei
Sent: Tuesday, January 19, 2016 8:44 AM
To: Tan, Jianfeng; dev@dpdk.org
Cc: ann.zhuangyanying@huawei.com
Subject: Re: [dpdk-dev] [PATCH] vhost: remove lockless enqueue to the virtio ring

On 1/20/2016 12:25 AM, Tan, Jianfeng wrote:
> Hi Huawei,
>
> On 1/4/2016 10:46 PM, Huawei Xie wrote:
>> This patch removes the internal lockless enqueue implmentation.
>> DPDK doesn't support receiving/transmitting packets from/to the same 
>> queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK 
>> applications normally have their own lock implmentation when enqueue 
>> packets to the same queue of a port.
>>
>> The atomic cmpset is a costly operation. This patch should help 
>> performance a bit.
>>
>> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
>> ---
>>   lib/librte_vhost/vhost_rxtx.c | 86
>> +++++++++++++------------------------------
>>   1 file changed, 25 insertions(+), 61 deletions(-)
>>
>> diff --git a/lib/librte_vhost/vhost_rxtx.c 
>> b/lib/librte_vhost/vhost_rxtx.c index bbf3fac..26a1b9c 100644
>> --- a/lib/librte_vhost/vhost_rxtx.c
>> +++ b/lib/librte_vhost/vhost_rxtx.c
>
> I think vhost example will not work well with this patch when
> vm2vm=software.
>
> Test case:
> Two virtio ports handled by two pmd threads. Thread 0 polls pkts from
> physical NIC and sends to virtio0, while thread0 receives pkts from
> virtio1 and routes it to virtio0.

vhost port will be wrapped as port, by vhost PMD. DPDK APP treats all
physical and virtual ports as ports equally. When two DPDK threads try
to enqueue to the same port, the APP needs to consider the contention.
All the physical PMDs doesn't support concurrent enqueuing/dequeuing.
Vhost PMD should expose the same behavior unless absolutely necessary
and we expose the difference of different PMD.

>
>> -
>>           *(volatile uint16_t *)&vq->used->idx += entry_success;
>
> Another unrelated question: We ever try to move this assignment out of
> loop to save cost as it's a data contention?

This operation itself is not that costly, but it has side effect on the
cache transfer.
It is outside of the loop for non-mergable case. For mergeable case, it
is inside the loop.
Actually it has pro and cons whether we do this in burst or in a smaller
step. I prefer to move it outside of the loop. Let us address this later.

>
> Thanks,
> Jianfeng
>
>
  
Huawei Xie Jan. 20, 2016, 3:39 a.m. UTC | #5
On 1/20/2016 2:33 AM, Polehn, Mike A wrote:
> SMP operations can be very expensive, sometimes can impact operations by 100s to 1000s of clock cycles depending on what is the circumstances of the synchronization. It is how you arrange the SMP operations within the tasks at hand across the SMP cores that gives methods for top performance.  Using traditional general purpose SMP methods will result in traditional general purpose performance. Migrating to general libraries (understood by most general purpose programmers) from expert abilities (understood by much smaller group of expert programmers focused on performance) will greatly reduce the value of DPDK since the end result will be lower performance and/or have less predictable operation where rate performance, predictability, and low latency are the primary goals.
>
> The best method to date, is to have multiple outputs to a single port is to use a DPDK queue with multiple producer, single consumer to do an SMP operation for multiple sources to feed a single non SMP task to output to the port (that is why the ports are not SMP protected). Also when considerable contention from multiple sources occur often (data feeding at same time), having DPDK queue with input and output variables  in separate cache lines can have a notable throughput improvement.
>
> Mike 

Mike:
Thanks for detailed explanation. Do you have comment to this patch?

>
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Xie, Huawei
> Sent: Tuesday, January 19, 2016 8:44 AM
> To: Tan, Jianfeng; dev@dpdk.org
> Cc: ann.zhuangyanying@huawei.com
> Subject: Re: [dpdk-dev] [PATCH] vhost: remove lockless enqueue to the virtio ring
>
> On 1/20/2016 12:25 AM, Tan, Jianfeng wrote:
>> Hi Huawei,
>>
>> On 1/4/2016 10:46 PM, Huawei Xie wrote:
>>> This patch removes the internal lockless enqueue implmentation.
>>> DPDK doesn't support receiving/transmitting packets from/to the same 
>>> queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK 
>>> applications normally have their own lock implmentation when enqueue 
>>> packets to the same queue of a port.
>>>
>>> The atomic cmpset is a costly operation. This patch should help 
>>> performance a bit.
>>>
>>> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
>>> ---
>>>   lib/librte_vhost/vhost_rxtx.c | 86
>>> +++++++++++++------------------------------
>>>   1 file changed, 25 insertions(+), 61 deletions(-)
>>>
>>> diff --git a/lib/librte_vhost/vhost_rxtx.c 
>>> b/lib/librte_vhost/vhost_rxtx.c index bbf3fac..26a1b9c 100644
>>> --- a/lib/librte_vhost/vhost_rxtx.c
>>> +++ b/lib/librte_vhost/vhost_rxtx.c
>> I think vhost example will not work well with this patch when
>> vm2vm=software.
>>
>> Test case:
>> Two virtio ports handled by two pmd threads. Thread 0 polls pkts from
>> physical NIC and sends to virtio0, while thread0 receives pkts from
>> virtio1 and routes it to virtio0.
> vhost port will be wrapped as port, by vhost PMD. DPDK APP treats all
> physical and virtual ports as ports equally. When two DPDK threads try
> to enqueue to the same port, the APP needs to consider the contention.
> All the physical PMDs doesn't support concurrent enqueuing/dequeuing.
> Vhost PMD should expose the same behavior unless absolutely necessary
> and we expose the difference of different PMD.
>
>>> -
>>>           *(volatile uint16_t *)&vq->used->idx += entry_success;
>> Another unrelated question: We ever try to move this assignment out of
>> loop to save cost as it's a data contention?
> This operation itself is not that costly, but it has side effect on the
> cache transfer.
> It is outside of the loop for non-mergable case. For mergeable case, it
> is inside the loop.
> Actually it has pro and cons whether we do this in burst or in a smaller
> step. I prefer to move it outside of the loop. Let us address this later.
>
>> Thanks,
>> Jianfeng
>>
>>
>
  
Thomas Monjalon March 14, 2016, 11:13 p.m. UTC | #6
2016-01-05 07:16, Xie, Huawei:
> On 1/5/2016 2:42 PM, Xie, Huawei wrote:
> > This patch removes the internal lockless enqueue implmentation.
> > DPDK doesn't support receiving/transmitting packets from/to the same
> > queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK
> > applications normally have their own lock implmentation when enqueue
> > packets to the same queue of a port.
> >
> > The atomic cmpset is a costly operation. This patch should help
> > performance a bit.
> >
> > Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> This patch modifies the API's behavior, which is also a trivial ABI
> change. In my opinion, application shouldn't rely on previous behavior.
> Anyway, i am checking how to declare the ABI change.

I guess this patch is now obsolete?
  
Huawei Xie March 16, 2016, 8:20 a.m. UTC | #7
On 3/15/2016 7:14 AM, Thomas Monjalon wrote:
> 2016-01-05 07:16, Xie, Huawei:
>> On 1/5/2016 2:42 PM, Xie, Huawei wrote:
>>> This patch removes the internal lockless enqueue implmentation.
>>> DPDK doesn't support receiving/transmitting packets from/to the same
>>> queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK
>>> applications normally have their own lock implmentation when enqueue
>>> packets to the same queue of a port.
>>>
>>> The atomic cmpset is a costly operation. This patch should help
>>> performance a bit.
>>>
>>> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
>> This patch modifies the API's behavior, which is also a trivial ABI
>> change. In my opinion, application shouldn't rely on previous behavior.
>> Anyway, i am checking how to declare the ABI change.
> I guess this patch is now obsolete?

How about we delay this to next release after more considerations,
whether we should keep this behavior, and what is the best way for
concurrency in vhost.
  
Yuanhan Liu March 16, 2016, 8:30 a.m. UTC | #8
On Wed, Mar 16, 2016 at 08:20:37AM +0000, Xie, Huawei wrote:
> On 3/15/2016 7:14 AM, Thomas Monjalon wrote:
> > 2016-01-05 07:16, Xie, Huawei:
> >> On 1/5/2016 2:42 PM, Xie, Huawei wrote:
> >>> This patch removes the internal lockless enqueue implmentation.
> >>> DPDK doesn't support receiving/transmitting packets from/to the same
> >>> queue. Vhost PMD wraps vhost device as normal DPDK port. DPDK
> >>> applications normally have their own lock implmentation when enqueue
> >>> packets to the same queue of a port.
> >>>
> >>> The atomic cmpset is a costly operation. This patch should help
> >>> performance a bit.
> >>>
> >>> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> >> This patch modifies the API's behavior, which is also a trivial ABI
> >> change. In my opinion, application shouldn't rely on previous behavior.
> >> Anyway, i am checking how to declare the ABI change.
> > I guess this patch is now obsolete?
> 
> How about we delay this to next release after more considerations,

I'd suggest so.

> whether we should keep this behavior, and what is the best way for
> concurrency in vhost.

I'm wondering should we do an announcement first, to notify user the
behaviour change?

	--yliu
  

Patch

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..26a1b9c 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -69,10 +69,8 @@  virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 	uint64_t buff_hdr_addr = 0;
 	uint32_t head[MAX_PKT_BURST];
 	uint32_t head_idx, packet_success = 0;
-	uint16_t avail_idx, res_cur_idx;
-	uint16_t res_base_idx, res_end_idx;
+	uint16_t avail_idx, res_cur_idx, res_end_idx;
 	uint16_t free_entries;
-	uint8_t success = 0;
 
 	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") virtio_dev_rx()\n", dev->device_fh);
 	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->virt_qp_nb))) {
@@ -88,29 +86,18 @@  virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 	count = (count > MAX_PKT_BURST) ? MAX_PKT_BURST : count;
 
-	/*
-	 * As many data cores may want access to available buffers,
-	 * they need to be reserved.
-	 */
-	do {
-		res_base_idx = vq->last_used_idx_res;
-		avail_idx = *((volatile uint16_t *)&vq->avail->idx);
-
-		free_entries = (avail_idx - res_base_idx);
-		/*check that we have enough buffers*/
-		if (unlikely(count > free_entries))
-			count = free_entries;
-
-		if (count == 0)
-			return 0;
-
-		res_end_idx = res_base_idx + count;
-		/* vq->last_used_idx_res is atomically updated. */
-		/* TODO: Allow to disable cmpset if no concurrency in application. */
-		success = rte_atomic16_cmpset(&vq->last_used_idx_res,
-				res_base_idx, res_end_idx);
-	} while (unlikely(success == 0));
-	res_cur_idx = res_base_idx;
+	avail_idx = *((volatile uint16_t *)&vq->avail->idx);
+	free_entries = (avail_idx - vq->last_used_idx_res);
+	/*check that we have enough buffers*/
+	if (unlikely(count > free_entries))
+		count = free_entries;
+	if (count == 0)
+		return 0;
+
+	res_cur_idx = vq->last_used_idx_res;
+	res_end_idx = res_cur_idx + count;
+	vq->last_used_idx_res = res_end_idx;
+
 	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") Current Index %d| End Index %d\n",
 			dev->device_fh, res_cur_idx, res_end_idx);
 
@@ -230,10 +217,6 @@  virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 	rte_compiler_barrier();
 
-	/* Wait until it's our turn to add our buffer to the used ring. */
-	while (unlikely(vq->last_used_idx != res_base_idx))
-		rte_pause();
-
 	*(volatile uint16_t *)&vq->used->idx += count;
 	vq->last_used_idx = res_end_idx;
 
@@ -474,7 +457,6 @@  virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,
 	uint32_t pkt_idx = 0, entry_success = 0;
 	uint16_t avail_idx;
 	uint16_t res_base_idx, res_cur_idx;
-	uint8_t success = 0;
 
 	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") virtio_dev_merge_rx()\n",
 		dev->device_fh);
@@ -496,46 +478,28 @@  virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,
 
 	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
 		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + vq->vhost_hlen;
+		uint32_t secure_len = 0;
+		uint32_t vec_idx = 0;
 
-		do {
-			/*
-			 * As many data cores may want access to available
-			 * buffers, they need to be reserved.
-			 */
-			uint32_t secure_len = 0;
-			uint32_t vec_idx = 0;
-
-			res_base_idx = vq->last_used_idx_res;
-			res_cur_idx = res_base_idx;
+		res_base_idx = res_cur_idx = vq->last_used_idx_res;
 
-			do {
-				avail_idx = *((volatile uint16_t *)&vq->avail->idx);
-				if (unlikely(res_cur_idx == avail_idx))
-					goto merge_rx_exit;
+		do {
+			avail_idx = *((volatile uint16_t *)&vq->avail->idx);
+			if (unlikely(res_cur_idx == avail_idx))
+				goto merge_rx_exit;
 
-				update_secure_len(vq, res_cur_idx,
-						  &secure_len, &vec_idx);
-				res_cur_idx++;
-			} while (pkt_len > secure_len);
+			update_secure_len(vq, res_cur_idx,
+					&secure_len, &vec_idx);
+			res_cur_idx++;
+		} while (pkt_len > secure_len);
 
-			/* vq->last_used_idx_res is atomically updated. */
-			success = rte_atomic16_cmpset(&vq->last_used_idx_res,
-							res_base_idx,
-							res_cur_idx);
-		} while (success == 0);
+		vq->last_used_idx_res = res_cur_idx;
 
 		entry_success = copy_from_mbuf_to_vring(dev, queue_id,
 			res_base_idx, res_cur_idx, pkts[pkt_idx]);
 
 		rte_compiler_barrier();
 
-		/*
-		 * Wait until it's our turn to add our buffer
-		 * to the used ring.
-		 */
-		while (unlikely(vq->last_used_idx != res_base_idx))
-			rte_pause();
-
 		*(volatile uint16_t *)&vq->used->idx += entry_success;
 		vq->last_used_idx = res_cur_idx;
 	}