[dpdk-dev] optimize vhost enqueue

Message ID 1471319402-112998-1-git-send-email-zhihong.wang@intel.com (mailing list archive)
State Superseded, archived

Commit Message

Zhihong Wang Aug. 16, 2016, 3:50 a.m. UTC
  This patch optimizes the vhost enqueue function: rte_vhost_enqueue_burst.

Currently there are 2 callbacks for vhost enqueue:
 *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
 *  virtio_dev_rx for mrg_rxbuf turned off cases.

The virtio_dev_merge_rx doesn't provide optimal performance, and it is also
reported to have a compatibility issue when working with Windows VMs.

Besides, having 2 separate functions increases maintenance effort.

This patch uses a single function logic to replace the current 2 for
better maintainability, and provides better performance by optimizing
caching behavior especially for mrg_rxbuf turned on cases.

It also fixes the issue working with Windows VMs.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 lib/librte_vhost/vhost-net.h  |   6 +-
 lib/librte_vhost/vhost_rxtx.c | 582 ++++++++++++++----------------------------
 lib/librte_vhost/virtio-net.c |  15 +-
 3 files changed, 208 insertions(+), 395 deletions(-)
  

Comments

Maxime Coquelin Aug. 16, 2016, 1:59 p.m. UTC | #1
Hi Zhihong,

On 08/16/2016 05:50 AM, Zhihong Wang wrote:
> This patch optimizes the vhost enqueue function: rte_vhost_enqueue_burst.
>
> Currently there're 2 callbacks for vhost enqueue:
>  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
>  *  virtio_dev_rx for mrg_rxbuf turned off cases.
>
> The virtio_dev_merge_rx doesn't provide optimal performance, also it is
> reported having compatibility issue working with Windows VMs.
Could you please tell us more about this compatibility issue?

>
> Besides, having 2 separated functions increases maintenance efforts.
>
> This patch uses a single function logic to replace the current 2 for
> better maintainability, and provides better performance by optimizing
> caching behavior especially for mrg_rxbuf turned on cases.
Do you have some benchmark comparison before and after your change?

Also, for maintainability, I would suggest that the enqueue
function be split, because vhost_enqueue_burst becomes very long (220
LoC) and the max level of indentation is too high (6).

It makes the code hard to understand, and makes it easy to miss bugs during
review and maintenance.
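(For illustration only: a split into static inline helpers typically costs nothing at run time, since the compiler folds them back into the caller; DPDK also has __rte_always_inline to force this. The helper names below are hypothetical, not the actual vhost code.)

```c
#include <stdint.h>

/* Hypothetical helpers: with static inline (or DPDK's
 * __rte_always_inline) the compiler merges these back into the burst
 * loop, so the split adds readability without function-call overhead. */
static inline uint32_t
desc_payload_len(uint32_t desc_len, uint32_t hdr_len)
{
	/* payload carried by one descriptor, after the virtio-net header */
	return desc_len > hdr_len ? desc_len - hdr_len : 0;
}

static inline uint32_t
enqueue_one(uint32_t desc_len, uint32_t hdr_len)
{
	return desc_payload_len(desc_len, hdr_len);
}

uint32_t
enqueue_burst(const uint32_t *desc_lens, uint32_t n, uint32_t hdr_len)
{
	uint32_t total = 0;
	uint32_t i;

	for (i = 0; i < n; i++)
		total += enqueue_one(desc_lens[i], hdr_len); /* inlined call */
	return total;
}
```

With -O2, both helpers disappear into enqueue_burst in the generated code.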

>
> It also fixes the issue working with Windows VMs.
Ideally, the fix should be sent separately, before the rework.
Indeed, we might want to have the fix in the stable branch, without
picking the optimization.

>
> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> ---
>  lib/librte_vhost/vhost-net.h  |   6 +-
>  lib/librte_vhost/vhost_rxtx.c | 582 ++++++++++++++----------------------------
>  lib/librte_vhost/virtio-net.c |  15 +-
>  3 files changed, 208 insertions(+), 395 deletions(-)
582 lines changed is a huge patch.
If possible, it would be better splitting it in incremental changes,
making the review process easier.

Also, for v2, please prefix the commit title with "vhost:".

Thanks for your contribution, I'm looking forward to the v2.
- Maxime
  
Zhihong Wang Aug. 17, 2016, 1:45 a.m. UTC | #2
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Tuesday, August 16, 2016 10:00 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> 
> Hi Zhihong,
> 
> On 08/16/2016 05:50 AM, Zhihong Wang wrote:
> > This patch optimizes the vhost enqueue function: rte_vhost_enqueue_burst.
> >
> > Currently there're 2 callbacks for vhost enqueue:
> >  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
> >  *  virtio_dev_rx for mrg_rxbuf turned off cases.
> >
> > The virtio_dev_merge_rx doesn't provide optimal performance, also it is
> > reported having compatibility issue working with Windows VMs.
> Could you tell us more please about this compatibility issue?


For example, when you have testpmd in the host and a Windows VM as the guest,
with mrg_rxbuf turned on, the guest will hang once packets are enqueued
by virtio_dev_merge_rx.

Let me know if you see the same issue.


> 
> >
> > Besides, having 2 separated functions increases maintenance efforts.
> >
> > This patch uses a single function logic to replace the current 2 for
> > better maintainability, and provides better performance by optimizing
> > caching behavior especially for mrg_rxbuf turned on cases.
> Do you have some benchmark comparison before and after your change?
> 
> Also, for maintainability, I would suggest the that the enqueue
> function be split. Because vhost_enqueue_burst becomes very long (220
> LoC), and max level of indentation is too high (6).
> 
> It makes the code hard to understand, and prone to miss bugs during
> review and maintenance.


This is something I've thought about while writing the code, the reason I
keep it as one function body is that:

 1. This function is very performance sensitive, and we need full control of
    code ordering (you can compare against the current performance with the
    mrg_rxbuf feature turned on to see the difference).

 2. I somehow find that a single-function logic makes it easier to understand;
    surely I can add comments to make it easier to read.

Please let me know if you still insist; we can discuss it further.


> 
> >
> > It also fixes the issue working with Windows VMs.
> Ideally, the fix should be sent separately, before the rework.
> Indeed, we might want to have the fix in the stable branch, without
> picking the optimization.
> 
> >
> > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> > ---
> >  lib/librte_vhost/vhost-net.h  |   6 +-
> >  lib/librte_vhost/vhost_rxtx.c | 582 ++++++++++++++----------------------------
> >  lib/librte_vhost/virtio-net.c |  15 +-
> >  3 files changed, 208 insertions(+), 395 deletions(-)
> 582 lines changed is a huge patch.
> If possible, it would be better splitting it in incremental changes,
> making the review process easier.


It looks like a huge patch, but it simply deletes the current implementation
and adds the new code. I think perhaps I can split it into 2: the 1st one to
replace just rte_vhost_enqueue_burst, the 2nd one to delete all the obsolete
functions. That should make the patch clearer, what do you think?  :)


> 
> Also, for v2, please prefix the commit title with "vhost:".

Thanks for the hint! Will do.

> 
> Thanks for your contribution, I'm looking forward for the v2.
> - Maxime
  
Yuanhan Liu Aug. 17, 2016, 2:38 a.m. UTC | #3
On Wed, Aug 17, 2016 at 01:45:26AM +0000, Wang, Zhihong wrote:
> 
> 
> > -----Original Message-----
> > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > Sent: Tuesday, August 16, 2016 10:00 PM
> > To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> > 
> > Hi Zhihong,
> > 
> > On 08/16/2016 05:50 AM, Zhihong Wang wrote:
> > > This patch optimizes the vhost enqueue function: rte_vhost_enqueue_burst.
> > >
> > > Currently there're 2 callbacks for vhost enqueue:
> > >  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
> > >  *  virtio_dev_rx for mrg_rxbuf turned off cases.
> > >
> > > The virtio_dev_merge_rx doesn't provide optimal performance, also it is
> > > reported having compatibility issue working with Windows VMs.
> > Could you tell us more please about this compatibility issue?
> 
> 
> For example, when you have testpmd in the host and Window VM as the guest,
> with mrg_rxbuf turned on, the guest will hang once there's packets enqueued
> by virtio_dev_merge_rx.

You should put it into the commit log.

> Let me know if you see the same issue.
> 
> 
> > 
> > >
> > > Besides, having 2 separated functions increases maintenance efforts.
> > >
> > > This patch uses a single function logic to replace the current 2 for
> > > better maintainability, and provides better performance by optimizing
> > > caching behavior especially for mrg_rxbuf turned on cases.

This sounds like two parts to me:

- one to unite mergeable and non-mergeable Rx

- another one to optimize the mergeable path

That means you should do it in two patches, so that we can clearly understand
which change brings the performance boost. It also helps review.

> > Do you have some benchmark comparison before and after your change?
> > 
> > Also, for maintainability, I would suggest the that the enqueue
> > function be split. Because vhost_enqueue_burst becomes very long (220
> > LoC), and max level of indentation is too high (6).
> > 
> > It makes the code hard to understand, and prone to miss bugs during
> > review and maintenance.

Agreed.

> 
> This is something I've thought about while writing the code, the reason I
> keep it as one function body is that:
> 
>  1. This function is very performance sensitive, and we need full control of
>     code ordering (You can compare with the current performance with the
>     mrg_rxbuf feature turned on to see the difference).

Will inline functions help?

>  2. I somehow find that a single function logic makes it easier to understand,
>     surely I can add comments to make it easiler to read for .
> 
> Please let me know if you still insist, we can discuss more on it.

I am personally not a fan of huge functions; I would try hard to avoid
too many levels of indentation as well.

> 
> > 
> > >
> > > It also fixes the issue working with Windows VMs.
> > Ideally, the fix should be sent separately, before the rework.
> > Indeed, we might want to have the fix in the stable branch, without
> > picking the optimization.

Agreed.

> > 
> > >
> > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> > > ---
> > >  lib/librte_vhost/vhost-net.h  |   6 +-
> > >  lib/librte_vhost/vhost_rxtx.c | 582 ++++++++++++++----------------------------
> > >  lib/librte_vhost/virtio-net.c |  15 +-
> > >  3 files changed, 208 insertions(+), 395 deletions(-)
> > 582 lines changed is a huge patch.
> > If possible, it would be better splitting it in incremental changes,
> > making the review process easier.
> 
> 
> It looks like a huge patch, but it simply deletes the current implementation
> and add the new code. I think perhaps split it into 2, 1st one to replace
> just the rte_vhost_enqueue_burst, 2nd one to delete all the obsolete functions.
> It should make the patch clear, how do you think?  :)

Nope, it's not working in that way. It should be:

- one patch to fix the hang issue for windows guest

  Please cc it to stable@dpdk.org as well so that we could pick it for
  v16.07 stable release.

- one patch to unite the two different Rx code path

- another patch to optimize mergeable code path

	--yliu
  
Zhihong Wang Aug. 17, 2016, 6:41 a.m. UTC | #4
> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Wednesday, August 17, 2016 10:38 AM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> 
> On Wed, Aug 17, 2016 at 01:45:26AM +0000, Wang, Zhihong wrote:
> >
> >
> > > -----Original Message-----
> > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > > Sent: Tuesday, August 16, 2016 10:00 PM
> > > To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> > >
> > > Hi Zhihong,
> > >
> > > On 08/16/2016 05:50 AM, Zhihong Wang wrote:
> > > > This patch optimizes the vhost enqueue function:
> rte_vhost_enqueue_burst.
> > > >
> > > > Currently there're 2 callbacks for vhost enqueue:
> > > >  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
> > > >  *  virtio_dev_rx for mrg_rxbuf turned off cases.
> > > >
> > > > The virtio_dev_merge_rx doesn't provide optimal performance, also it is
> > > > reported having compatibility issue working with Windows VMs.
> > > Could you tell us more please about this compatibility issue?
> >
> >
> > For example, when you have testpmd in the host and Window VM as the
> guest,
> > with mrg_rxbuf turned on, the guest will hang once there's packets enqueued
> > by virtio_dev_merge_rx.
> 
> You should put it into commit log.


Okay.


> 
> > Let me know if you see the same issue.
> >
> >
> > >
> > > >
> > > > Besides, having 2 separated functions increases maintenance efforts.
> > > >
> > > > This patch uses a single function logic to replace the current 2 for
> > > > better maintainability, and provides better performance by optimizing
> > > > caching behavior especially for mrg_rxbuf turned on cases.
> 
> Here, here sounds two parts to me:
> 
> - one to unite mergeable and non-mergeable Rx
> 
> - another one to optimize the mergeable path
> 
> That means you should do it in two patches, with that we can have clear
> understanding what changes the performance boost. It also helps review.


Please see explanation below.


> 
> > > Do you have some benchmark comparison before and after your change?
> > >
> > > Also, for maintainability, I would suggest the that the enqueue
> > > function be split. Because vhost_enqueue_burst becomes very long (220
> > > LoC), and max level of indentation is too high (6).
> > >
> > > It makes the code hard to understand, and prone to miss bugs during
> > > review and maintenance.
> 
> Agreed.
> 
> >
> > This is something I've thought about while writing the code, the reason I
> > keep it as one function body is that:
> >
> >  1. This function is very performance sensitive, and we need full control of
> >     code ordering (You can compare with the current performance with the
> >     mrg_rxbuf feature turned on to see the difference).
> 
> Will inline functions help?


The optimization in this patch actually reorganizes the code at the logic
level, so it's not suitable for splitting into separate functions.

I'll explain this in v2.


> 
> >  2. I somehow find that a single function logic makes it easier to understand,
> >     surely I can add comments to make it easiler to read for .
> >
> > Please let me know if you still insist, we can discuss more on it.
> 
> I am personally not a fan of huge function; I would try hard to avoid
> too many levels of indentation as well.
> 
> >
> > >
> > > >
> > > > It also fixes the issue working with Windows VMs.
> > > Ideally, the fix should be sent separately, before the rework.
> > > Indeed, we might want to have the fix in the stable branch, without
> > > picking the optimization.
> 
> Agreed.


The fact is that I don't have much time to debug with the current code,
since it's messy and I don't have the Windows virtio code or the debugging
environment.

This patch doesn't try to fix this issue, it rewrites the logic totally,
and somehow fixes this issue.

Do you think integrating this whole patch into the stable branch will work?
Personally I think it makes more sense.


> 
> > >
> > > >
> > > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> > > > ---
> > > >  lib/librte_vhost/vhost-net.h  |   6 +-
> > > >  lib/librte_vhost/vhost_rxtx.c | 582
> ++++++++++++++----------------------------
> > > >  lib/librte_vhost/virtio-net.c |  15 +-
> > > >  3 files changed, 208 insertions(+), 395 deletions(-)
> > > 582 lines changed is a huge patch.
> > > If possible, it would be better splitting it in incremental changes,
> > > making the review process easier.
> >
> >
> > It looks like a huge patch, but it simply deletes the current implementation
> > and add the new code. I think perhaps split it into 2, 1st one to replace
> > just the rte_vhost_enqueue_burst, 2nd one to delete all the obsolete
> functions.
> > It should make the patch clear, how do you think?  :)
> 
> Nope, it's not working in that way. It should be:
> 
> - one patch to fix the hang issue for windows guest
> 
>   Please cc it to stable@dpdk.org as well so that we could pick it for
>   v16.07 stable release.
> 
> - one patch to unite the two different Rx code path
> 
> - another patch to optimize mergeable code path


I can separate the optimization from the basic code in v2. However, as I
explained, this patch is built from scratch and doesn't take anything from the
existing code, so there's no way to transform the existing code incrementally
into the new code.


> 
> 	--yliu
  
Maxime Coquelin Aug. 17, 2016, 9:17 a.m. UTC | #5
On 08/17/2016 08:41 AM, Wang, Zhihong wrote:
>
>
>> -----Original Message-----
>> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
>> Sent: Wednesday, August 17, 2016 10:38 AM
>> To: Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
>>
>> On Wed, Aug 17, 2016 at 01:45:26AM +0000, Wang, Zhihong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>>> Sent: Tuesday, August 16, 2016 10:00 PM
>>>> To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
>>>>
>>>> Hi Zhihong,
>>>>
>>>> On 08/16/2016 05:50 AM, Zhihong Wang wrote:
>>>>> This patch optimizes the vhost enqueue function:
>> rte_vhost_enqueue_burst.
>>>>>
>>>>> Currently there're 2 callbacks for vhost enqueue:
>>>>>  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
>>>>>  *  virtio_dev_rx for mrg_rxbuf turned off cases.
>>>>>
>>>>> The virtio_dev_merge_rx doesn't provide optimal performance, also it is
>>>>> reported having compatibility issue working with Windows VMs.
>>>> Could you tell us more please about this compatibility issue?
>>>
>>>
>>> For example, when you have testpmd in the host and Window VM as the
>> guest,
>>> with mrg_rxbuf turned on, the guest will hang once there's packets enqueued
>>> by virtio_dev_merge_rx.
>>
>> You should put it into commit log.
>
>
> Okay.
>
>
>>
>>> Let me know if you see the same issue.
>>>
>>>
>>>>
>>>>>
>>>>> Besides, having 2 separated functions increases maintenance efforts.
>>>>>
>>>>> This patch uses a single function logic to replace the current 2 for
>>>>> better maintainability, and provides better performance by optimizing
>>>>> caching behavior especially for mrg_rxbuf turned on cases.
>>
>> Here, here sounds two parts to me:
>>
>> - one to unite mergeable and non-mergeable Rx
>>
>> - another one to optimize the mergeable path
>>
>> That means you should do it in two patches, with that we can have clear
>> understanding what changes the performance boost. It also helps review.
>
>
> Please see explanation below.
>
>
>>
>>>> Do you have some benchmark comparison before and after your change?
>>>>
>>>> Also, for maintainability, I would suggest the that the enqueue
>>>> function be split. Because vhost_enqueue_burst becomes very long (220
>>>> LoC), and max level of indentation is too high (6).
>>>>
>>>> It makes the code hard to understand, and prone to miss bugs during
>>>> review and maintenance.
>>
>> Agreed.
>>
>>>
>>> This is something I've thought about while writing the code, the reason I
>>> keep it as one function body is that:
>>>
>>>  1. This function is very performance sensitive, and we need full control of
>>>     code ordering (You can compare with the current performance with the
>>>     mrg_rxbuf feature turned on to see the difference).
>>
>> Will inline functions help?
>
>
> Optimization in this patch actually reorganizes the code from its logic,
> so it's not suitable for making separated functions.
>
> I'll explain this in v2.

I agree with Yuanhan.
Inline functions should not break the optimizations.
IMHO, this is mandatory for the patch to be accepted.

>
>
>>
>>>  2. I somehow find that a single function logic makes it easier to understand,
>>>     surely I can add comments to make it easiler to read for .
>>>
>>> Please let me know if you still insist, we can discuss more on it.
>>
>> I am personally not a fan of huge function; I would try hard to avoid
>> too many levels of indentation as well.
>>
>>>
>>>>
>>>>>
>>>>> It also fixes the issue working with Windows VMs.
>>>> Ideally, the fix should be sent separately, before the rework.
>>>> Indeed, we might want to have the fix in the stable branch, without
>>>> picking the optimization.
>>
>> Agreed.
>
>
> The fact is that I don't have much time to debug with the current code
> since it's messy and I don't have Windows virtio code and the debugging
> environment.

It seems you are not the only one facing the issue:
https://github.com/YanVugenfirer/kvm-guest-drivers-windows/issues/70

So a dedicated fix is really important.

> This patch doesn't try to fix this issue, it rewrites the logic totally,
> and somehow fixes this issue.
>
> Do you think integrating this whole patch into the stable branch will work?
> Personally I think it makes more sense.

No.
We don't even know why/how it fixes the Windows issue, which would be
the first thing to understand before integrating a fix in stable branch.

And the stable branch is not meant for integrating such big reworks,
it is only meant to fix bugs.

The risk of regressions has to be avoided as much as possible.

>
>
>>
>>>>
>>>>>
>>>>> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
>>>>> ---
>>>>>  lib/librte_vhost/vhost-net.h  |   6 +-
>>>>>  lib/librte_vhost/vhost_rxtx.c | 582
>> ++++++++++++++----------------------------
>>>>>  lib/librte_vhost/virtio-net.c |  15 +-
>>>>>  3 files changed, 208 insertions(+), 395 deletions(-)
>>>> 582 lines changed is a huge patch.
>>>> If possible, it would be better splitting it in incremental changes,
>>>> making the review process easier.
>>>
>>>
>>> It looks like a huge patch, but it simply deletes the current implementation
>>> and add the new code. I think perhaps split it into 2, 1st one to replace
>>> just the rte_vhost_enqueue_burst, 2nd one to delete all the obsolete
>> functions.
>>> It should make the patch clear, how do you think?  :)
>>
>> Nope, it's not working in that way. It should be:
>>
>> - one patch to fix the hang issue for windows guest
>>
>>   Please cc it to stable@dpdk.org as well so that we could pick it for
>>   v16.07 stable release.
>>
>> - one patch to unite the two different Rx code path
>>
>> - another patch to optimize mergeable code path
>
>
> I can separate optimization from the basic code in v2, however as I explained
> this patch is built from scratch and doesn't take anything from the existing
> code, so there's no way to transform from the existing code incrementally into
> the new code.
>
>
>>
>> 	--yliu
  
Yuanhan Liu Aug. 17, 2016, 9:51 a.m. UTC | #6
On Wed, Aug 17, 2016 at 11:17:46AM +0200, Maxime Coquelin wrote:
> >>>This is something I've thought about while writing the code, the reason I
> >>>keep it as one function body is that:
> >>>
> >>> 1. This function is very performance sensitive, and we need full control of
> >>>    code ordering (You can compare with the current performance with the
> >>>    mrg_rxbuf feature turned on to see the difference).
> >>
> >>Will inline functions help?
> >
> >
> >Optimization in this patch actually reorganizes the code from its logic,
> >so it's not suitable for making separated functions.
> >
> >I'll explain this in v2.
> 
> I agree with Yuanhan.
> Inline functions should not break the optimizations.
> IMHO, this is mandatory for the patch to be accepted.

Yes.

> It seems you are not the only one facing the issue:
> https://github.com/YanVugenfirer/kvm-guest-drivers-windows/issues/70
> 
> So a dedicated fix is really important.

Yes.

> 
> >This patch doesn't try to fix this issue, it rewrites the logic totally,
> >and somehow fixes this issue.
> >
> >Do you think integrating this whole patch into the stable branch will work?
> >Personally I think it makes more sense.
> 
> No.
> We don't even know why/how it fixes the Windows issue, which would be
> the first thing to understand before integrating a fix in stable branch.

Yes.

> 
> And the stable branch is not meant for integrating such big reworks,
> it is only meant to fix bugs.

Yes.

> The risk of regressions have to be avoided as much as possible.

Yes.

	--yliu
  
Zhihong Wang Aug. 17, 2016, 10:07 a.m. UTC | #7
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Wednesday, August 17, 2016 5:18 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu
> <yuanhan.liu@linux.intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> 
> 
> 
> On 08/17/2016 08:41 AM, Wang, Zhihong wrote:
> >
> >
> >> -----Original Message-----
> >> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> >> Sent: Wednesday, August 17, 2016 10:38 AM
> >> To: Wang, Zhihong <zhihong.wang@intel.com>
> >> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> >>
> >> On Wed, Aug 17, 2016 at 01:45:26AM +0000, Wang, Zhihong wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >>>> Sent: Tuesday, August 16, 2016 10:00 PM
> >>>> To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
> >>>> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> >>>>
> >>>> Hi Zhihong,
> >>>>
> >>>> On 08/16/2016 05:50 AM, Zhihong Wang wrote:
> >>>>> This patch optimizes the vhost enqueue function:
> >> rte_vhost_enqueue_burst.
> >>>>>
> >>>>> Currently there're 2 callbacks for vhost enqueue:
> >>>>>  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
> >>>>>  *  virtio_dev_rx for mrg_rxbuf turned off cases.
> >>>>>
> >>>>> The virtio_dev_merge_rx doesn't provide optimal performance, also it is
> >>>>> reported having compatibility issue working with Windows VMs.
> >>>> Could you tell us more please about this compatibility issue?
> >>>
> >>>
> >>> For example, when you have testpmd in the host and Window VM as the
> >> guest,
> >>> with mrg_rxbuf turned on, the guest will hang once there's packets
> enqueued
> >>> by virtio_dev_merge_rx.
> >>
> >> You should put it into commit log.
> >
> >
> > Okay.
> >
> >
> >>
> >>> Let me know if you see the same issue.
> >>>
> >>>
> >>>>
> >>>>>
> >>>>> Besides, having 2 separated functions increases maintenance efforts.
> >>>>>
> >>>>> This patch uses a single function logic to replace the current 2 for
> >>>>> better maintainability, and provides better performance by optimizing
> >>>>> caching behavior especially for mrg_rxbuf turned on cases.
> >>
> >> Here, here sounds two parts to me:
> >>
> >> - one to unite mergeable and non-mergeable Rx
> >>
> >> - another one to optimize the mergeable path
> >>
> >> That means you should do it in two patches, with that we can have clear
> >> understanding what changes the performance boost. It also helps review.
> >
> >
> > Please see explanation below.
> >
> >
> >>
> >>>> Do you have some benchmark comparison before and after your change?
> >>>>
> >>>> Also, for maintainability, I would suggest the that the enqueue
> >>>> function be split. Because vhost_enqueue_burst becomes very long (220
> >>>> LoC), and max level of indentation is too high (6).
> >>>>
> >>>> It makes the code hard to understand, and prone to miss bugs during
> >>>> review and maintenance.
> >>
> >> Agreed.
> >>
> >>>
> >>> This is something I've thought about while writing the code, the reason I
> >>> keep it as one function body is that:
> >>>
> >>>  1. This function is very performance sensitive, and we need full control of
> >>>     code ordering (You can compare with the current performance with the
> >>>     mrg_rxbuf feature turned on to see the difference).
> >>
> >> Will inline functions help?
> >
> >
> > Optimization in this patch actually reorganizes the code from its logic,
> > so it's not suitable for making separated functions.
> >
> > I'll explain this in v2.
> 
> I agree with Yuanhan.
> Inline functions should not break the optimizations.
> IMHO, this is mandatory for the patch to be accepted.


Excellent!


> 
> >
> >
> >>
> >>>  2. I somehow find that a single function logic makes it easier to understand,
> >>>     surely I can add comments to make it easiler to read for .
> >>>
> >>> Please let me know if you still insist, we can discuss more on it.
> >>
> >> I am personally not a fan of huge function; I would try hard to avoid
> >> too many levels of indentation as well.
> >>
> >>>
> >>>>
> >>>>>
> >>>>> It also fixes the issue working with Windows VMs.
> >>>> Ideally, the fix should be sent separately, before the rework.
> >>>> Indeed, we might want to have the fix in the stable branch, without
> >>>> picking the optimization.
> >>
> >> Agreed.
> >
> >
> > The fact is that I don't have much time to debug with the current code
> > since it's messy and I don't have Windows virtio code and the debugging
> > environment.
> 
> It seems you are not the only one facing the issue:
> https://github.com/YanVugenfirer/kvm-guest-drivers-windows/issues/70
> 
> So a dedicated fix is really important.


Yeah that's me raising this issue there.

But I think identifying the root cause and finding a fix for the existing
code is a separate standalone task.


> 
> > This patch doesn't try to fix this issue, it rewrites the logic totally,
> > and somehow fixes this issue.
> >
> > Do you think integrating this whole patch into the stable branch will work?
> > Personally I think it makes more sense.
> 
> No.
> We don't even know why/how it fixes the Windows issue, which would be
> the first thing to understand before integrating a fix in stable branch.
> 
> And the stable branch is not meant for integrating such big reworks,
> it is only meant to fix bugs.
> 
> The risk of regressions have to be avoided as much as possible.
> 
> >
> >
> >>
> >>>>
> >>>>>
> >>>>> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> >>>>> ---
> >>>>>  lib/librte_vhost/vhost-net.h  |   6 +-
> >>>>>  lib/librte_vhost/vhost_rxtx.c | 582
> >> ++++++++++++++----------------------------
> >>>>>  lib/librte_vhost/virtio-net.c |  15 +-
> >>>>>  3 files changed, 208 insertions(+), 395 deletions(-)
> >>>> 582 lines changed is a huge patch.
> >>>> If possible, it would be better splitting it in incremental changes,
> >>>> making the review process easier.
> >>>
> >>>
> >>> It looks like a huge patch, but it simply deletes the current implementation
> >>> and add the new code. I think perhaps split it into 2, 1st one to replace
> >>> just the rte_vhost_enqueue_burst, 2nd one to delete all the obsolete
> >> functions.
> >>> It should make the patch clear, how do you think?  :)
> >>
> >> Nope, it's not working in that way. It should be:
> >>
> >> - one patch to fix the hang issue for windows guest
> >>
> >>   Please cc it to stable@dpdk.org as well so that we could pick it for
> >>   v16.07 stable release.
> >>
> >> - one patch to unite the two different Rx code path
> >>
> >> - another patch to optimize mergeable code path
> >
> >
> > I can separate optimization from the basic code in v2, however as I explained
> > this patch is built from scratch and doesn't take anything from the existing
> > code, so there's no way to transform from the existing code incrementally into
> > the new code.
> >
> >
> >>
> >> 	--yliu
  
Zhihong Wang Aug. 18, 2016, 6:33 a.m. UTC | #8
This patch set optimizes the vhost enqueue function.

It implements the vhost logic from scratch into a single function designed
for high performance and good maintainability, and improves CPU efficiency
significantly by optimizing cache access, which means:

 *  For fast frontends (eg. DPDK virtio pmd), higher performance (maximum
    throughput) can be achieved.

 *  For slow frontends (eg. kernel virtio-net), better scalability can be
    achieved: each vhost core can support more connections, since it takes
    fewer cycles to handle each single frontend.

The main optimization techniques are:

 1. Reorder code to reduce CPU pipeline stall cycles.

 2. Batch update the used ring for better efficiency.

 3. Prefetch descriptor to hide cache latency.

 4. Remove useless volatile attribute to allow compiler optimization.
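A minimal sketch of techniques 2 and 3 (the structures and names here are simplified stand-ins, not the actual DPDK vhost types): stage used-ring entries in a local shadow array, prefetch the next descriptor while handling the current one, and publish the used index once per burst instead of once per packet:

```c
#include <stdint.h>

#define RING_SIZE 8 /* power of two, as in virtio rings */

struct used_elem { uint32_t id; uint32_t len; };

struct used_ring {
	struct used_elem ring[RING_SIZE];
	volatile uint16_t idx; /* free-running index published to the guest */
};

/* Batched used-ring update: instead of writing the shared ring and
 * bumping idx for every packet (one shared-cache-line bounce each),
 * stage entries locally and touch the shared ring once per burst.
 * A real implementation needs a write barrier before publishing idx. */
static uint16_t
enqueue_burst_batched(struct used_ring *ur, uint16_t used_idx,
		      const uint32_t *desc_ids, uint16_t count)
{
	struct used_elem shadow[RING_SIZE]; /* local staging, count <= RING_SIZE */
	uint16_t i;

	for (i = 0; i < count; i++) {
		if (i + 1 < count) /* technique 3: hide the next desc's cache miss */
			__builtin_prefetch(&desc_ids[i + 1], 0, 3);
		shadow[i].id  = desc_ids[i];
		shadow[i].len = 64; /* packet length, fixed for the sketch */
	}

	for (i = 0; i < count; i++) /* technique 2: one batched copy-out */
		ur->ring[(uint16_t)(used_idx + i) % RING_SIZE] = shadow[i];
	ur->idx = (uint16_t)(used_idx + count); /* single index publish */

	return ur->idx;
}
```

The shared cache lines holding the ring and idx are then written once per burst rather than once per packet, which is where most of the cache-efficiency gain comes from.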

In the existing code there're 2 callbacks for vhost enqueue:

 *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.

 *  virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature is turned on. Also, having 2 separate functions increases
maintenance effort.

---
Changes in v2:

 1. Split the big function into several small ones

 2. Use multiple patches to explain each optimization

 3. Add comments

Zhihong Wang (6):
  vhost: rewrite enqueue
  vhost: remove obsolete
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost-net.h  |   6 +-
 lib/librte_vhost/vhost_rxtx.c | 582 +++++++++++++++---------------------------
 lib/librte_vhost/virtio-net.c |  15 +-
 3 files changed, 228 insertions(+), 375 deletions(-)
  
Zhihong Wang Aug. 18, 2016, 1:44 p.m. UTC | #9
Thanks Maxime and Yuanhan for your review and suggestions!
Please help review the v2 of this patch.


> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Wednesday, August 17, 2016 5:51 PM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> 
> On Wed, Aug 17, 2016 at 11:17:46AM +0200, Maxime Coquelin wrote:
> > >>>This is something I've thought about while writing the code, the reason I
> > >>>keep it as one function body is that:
> > >>>
> > >>> 1. This function is very performance sensitive, and we need full control of
> > >>>    code ordering (You can compare with the current performance with
> the
> > >>>    mrg_rxbuf feature turned on to see the difference).
> > >>
> > >>Will inline functions help?
> > >
> > >
> > >Optimization in this patch actually reorganizes the code from its logic,
> > >so it's not suitable for making separated functions.
> > >
> > >I'll explain this in v2.
> >
> > I agree with Yuanhan.
> > Inline functions should not break the optimizations.
> > IMHO, this is mandatory for the patch to be accepted.
> 
> Yes.
> 
> > It seems you are not the only one facing the issue:
> > https://github.com/YanVugenfirer/kvm-guest-drivers-windows/issues/70
> >
> > So a dedicated fix is really important.
> 
> Yes.
> 
> >
> > >This patch doesn't try to fix this issue, it rewrites the logic totally,
> > >and somehow fixes this issue.
> > >
> > >Do you think integrating this whole patch into the stable branch will work?
> > >Personally I think it makes more sense.
> >
> > No.
> > We don't even know why/how it fixes the Windows issue, which would be
> > the first thing to understand before integrating a fix in stable branch.
> 
> Yes.
> 
> >
> > And the stable branch is not meant for integrating such big reworks,
> > it is only meant to fix bugs.
> 
> Yes.
> 
> > The risk of regressions have to be avoided as much as possible.
> 
> Yes.
> 
> 	--yliu
  
Zhihong Wang Aug. 19, 2016, 5:43 a.m. UTC | #10
This patch set optimizes the vhost enqueue function.

It implements the vhost logic from scratch into a single function designed
for high performance and good maintainability, and improves CPU efficiency
significantly by optimizing cache access, which means:

 *  For fast frontends (eg. DPDK virtio pmd), higher performance (maximum
    throughput) can be achieved.

 *  For slow frontends (eg. kernel virtio-net), better scalability can be
    achieved: each vhost core can support more connections since it takes
    fewer cycles to handle each single frontend.

The main optimization techniques are:

 1. Reorder code to reduce CPU pipeline stall cycles.

 2. Batch update the used ring for better efficiency.

 3. Prefetch descriptor to hide cache latency.

 4. Remove useless volatile attribute to allow compiler optimization.

Code reordering and batch used ring update bring most of the performance
improvements.
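
Technique 3 (descriptor prefetch) amounts to issuing a prefetch for the
next descriptor while the current one is being processed, so the memory
latency overlaps useful work. A minimal sketch, assuming a GCC/Clang-style
builtin as a stand-in for `rte_prefetch0()`, with simplified structures:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for rte_prefetch0(); assumes a GCC/Clang-style builtin. */
#if defined(__GNUC__)
#define prefetch0(p) __builtin_prefetch((p), 0, 3)
#else
#define prefetch0(p) ((void)(p))
#endif

struct vring_desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };

/* Walk the avail entries, prefetching the descriptor for iteration i+1
 * while "working" on iteration i (here the work is just summing lengths). */
static uint64_t process_descs(const struct vring_desc *descs,
                              const uint16_t *avail, uint16_t n)
{
    uint64_t total = 0;

    if (n > 0)
        prefetch0(&descs[avail[0]]);
    for (uint16_t i = 0; i < n; i++) {
        if (i + 1 < n)
            prefetch0(&descs[avail[i + 1]]); /* hide the next cache miss */
        total += descs[avail[i]].len;        /* process current descriptor */
    }
    return total;
}
```

The prefetch is only a hint, so this transformation never changes results;
it only overlaps the cache fill with the work on the current descriptor.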

In the existing code there're 2 callbacks for vhost enqueue:

 *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.

 *  virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature is turned on. Also, having 2 separate functions increases
maintenance effort.

---
Changes in v3:

 1. Remove unnecessary memset which causes frontend stall on SNB & IVB.

 2. Rename variables to follow naming convention.

 3. Rewrite enqueue and delete the obsolete code in the same patch.

---
Changes in v2:

 1. Split the big function into several small ones.

 2. Use multiple patches to explain each optimization.

 3. Add comments.

Zhihong Wang (5):
  vhost: rewrite enqueue
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost-net.h  |   6 +-
 lib/librte_vhost/vhost_rxtx.c | 573 +++++++++++++++---------------------------
 lib/librte_vhost/virtio-net.c |  15 +-
 3 files changed, 220 insertions(+), 374 deletions(-)
  
Maxime Coquelin Aug. 22, 2016, 8:11 a.m. UTC | #11
Hi Zhihong,

On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> This patch set optimizes the vhost enqueue function.
>
> It implements the vhost logic from scratch into a single function designed
> for high performance and good maintainability, and improves CPU efficiency
> significantly by optimizing cache access, which means:
>
>  *  For fast frontends (eg. DPDK virtio pmd), higher performance (maximum
>     throughput) can be achieved.
>
>  *  For slow frontends (eg. kernel virtio-net), better scalability can be
>     achieved, each vhost core can support more connections since it takes
>     less cycles to handle each single frontend.
>
> The main optimization techniques are:
>
>  1. Reorder code to reduce CPU pipeline stall cycles.
>
>  2. Batch update the used ring for better efficiency.
>
>  3. Prefetch descriptor to hide cache latency.
>
>  4. Remove useless volatile attribute to allow compiler optimization.

Thanks for these details, this is helpful to understand where the perf
gain comes from.
I would suggest adding this information as comments in the code
where/if it makes sense. If it is more of a general comment, at least add
it in the commit message of the patch introducing it.
Indeed, adding it to the cover letter is fine, but the information is
lost as soon as the series is applied.

You don't mention any figures, so I set up a benchmark on my side to
evaluate your series. It indeed shows an interesting performance gain.

My setup consists of one host running a guest.
The guest generates as many 64-byte packets as possible using
pktgen-dpdk. The host forwards received packets back to the guest
using testpmd on the vhost pmd interface. Guest's vCPUs are pinned to
physical CPUs.

I tested it with and without your v1 patch, with and without
rx-mergeable feature turned ON.
Results are the average of 8 runs of 60 seconds:

Rx-Mergeable ON : 7.72Mpps
Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
Rx-Mergeable OFF: 10.52Mpps
Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps

Regards,
Maxime
  
Maxime Coquelin Aug. 22, 2016, 10:01 a.m. UTC | #12
On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> Hi Zhihong,
>
> On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > This patch set optimizes the vhost enqueue function.
> >
> > It implements the vhost logic from scratch into a single function
> > designed
> > for high performance and good maintainability, and improves CPU
> > efficiency
> > significantly by optimizing cache access, which means:
> >
> >  *  For fast frontends (eg. DPDK virtio pmd), higher performance (maximum
> >     throughput) can be achieved.
> >
> >  *  For slow frontends (eg. kernel virtio-net), better scalability can be
> >     achieved, each vhost core can support more connections since it takes
> >     less cycles to handle each single frontend.
> >
> > The main optimization techniques are:
> >
> >  1. Reorder code to reduce CPU pipeline stall cycles.
> >
> >  2. Batch update the used ring for better efficiency.
> >
> >  3. Prefetch descriptor to hide cache latency.
> >
> >  4. Remove useless volatile attribute to allow compiler optimization.
>
> Thanks for these details, this is helpful to understand where the perf
> gain comes from.
> I would suggest to add these information as comments in the code
> where/if it makes sense. If more a general comment, at least add it in
> the commit message of the patch introducing it.
> Indeed, adding it to the cover letter is fine, but the information is
> lost as soon as the series is applied.
>
> You don't mention any figures, so I set up a benchmark on my side to
> evaluate your series. It indeed shows an interesting performance gain.
>
> My setup consists of one host running a guest.
> The guest generates as much 64bytes packets as possible using
> pktgen-dpdk. The hosts forwards received packets back to the guest
> using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> physical CPUs.
>
> I tested it with and without your v1 patch, with and without
> rx-mergeable feature turned ON.
> Results are the average of 8 runs of 60 seconds:
>
> Rx-Mergeable ON : 7.72Mpps
> Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> Rx-Mergeable OFF: 10.52Mpps
> Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
>
I forgot to add that before this series, I think we should first fix the Windows bug.
Otherwise we will need a dedicated fix for the stable branch.

Regards,
Maxime
  
Thomas Monjalon Aug. 22, 2016, 10:35 a.m. UTC | #13
2016-08-22 12:01, Maxime Coquelin:
> I forgot to add that before this series, I think we should first fix the windows bug.
> Else we will need a dedicated fix for the stable branch.

This is a funny situation :)
If Zhihong had reworked the code without mentioning it is fixing a scenario
with Windows guests, maybe nobody would have noticed ;)
That's probably why it is not written in v2/v3. But thanks to the v1,
we all know it:
	"It also fixes the issue working with Windows VMs."

So yes, it would be a lot better to find the root cause and try to have a
minimal fix for 16.07, then rework the code for performance in 16.11.
I think we must avoid silent fixes, and even more, avoid writing specific
fixes for stable branches without validating them in the master branch and
its large users base.

Thanks for your good works guys, DPDK vhost is improving very well.
  
Zhihong Wang Aug. 23, 2016, 2:15 a.m. UTC | #14
> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
> 
> Hi Zhihong,
> 
[...]
> > The main optimization techniques are:
> >
> >  1. Reorder code to reduce CPU pipeline stall cycles.
> >
> >  2. Batch update the used ring for better efficiency.
> >
> >  3. Prefetch descriptor to hide cache latency.
> >
> >  4. Remove useless volatile attribute to allow compiler optimization.
> 
> Thanks for these details, this is helpful to understand where the perf
> gain comes from.
> I would suggest to add these information as comments in the code
> where/if it makes sense. If more a general comment, at least add it in
> the commit message of the patch introducing it.
> Indeed, adding it to the cover letter is fine, but the information is
> lost as soon as the series is applied.

Hi Maxime,

I did add this info in the later optimization patches to explain each
optimization technique. The v1 was indeed hard to read.


> 
> You don't mention any figures, so I set up a benchmark on my side to
> evaluate your series. It indeed shows an interesting performance gain.
> 
> My setup consists of one host running a guest.
> The guest generates as much 64bytes packets as possible using
> pktgen-dpdk. The hosts forwards received packets back to the guest
> using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> physical CPUs.
> 

Thanks for doing the test!

I didn't publish any numbers since the gain varies across platforms
and test setups.

In my phy-to-vm test on both IVB and HSW, where testpmd in the host receives
from the nic and enqueues to the guest, the enqueue efficiency (cycles per
packet) of the v3 patch is 2.4x and 1.4x that of the current code for
mergeable on and mergeable off respectively.


> I tested it with and without your v1 patch, with and without
> rx-mergeable feature turned ON.
> Results are the average of 8 runs of 60 seconds:
> 
> Rx-Mergeable ON : 7.72Mpps
> Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> Rx-Mergeable OFF: 10.52Mpps
> Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> 
> Regards,
> Maxime
  
Zhihong Wang Aug. 23, 2016, 2:31 a.m. UTC | #15
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Monday, August 22, 2016 6:02 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
> Cc: yuanhan.liu@linux.intel.com
> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
>
> On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> > Hi Zhihong,
> >
> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > > This patch set optimizes the vhost enqueue function.
..
> > I tested it with and without your v1 patch, with and without
> > rx-mergeable feature turned ON.
> > Results are the average of 8 runs of 60 seconds:
> >
> > Rx-Mergeable ON : 7.72Mpps
> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> > Rx-Mergeable OFF: 10.52Mpps
> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> >
> I forgot to add that before this series, I think we should first fix the windows bug.
> Else we will need a dedicated fix for the stable branch.

Okay, I'll try to fix it, though I can't make any promises at present.

I have tried once but stopped, since we don't have enough debug info from the
frontend side, so basically I was debugging the backend based on guesses.

>
> Regards,
> Maxime
  
Zhihong Wang Aug. 23, 2016, 10:43 a.m. UTC | #16
> -----Original Message-----
> From: Wang, Zhihong
> Sent: Tuesday, August 23, 2016 10:31 AM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org
> Cc: yuanhan.liu@linux.intel.com
> Subject: RE: [PATCH v3 0/5] vhost: optimize enqueue
>
> > -----Original Message-----
> > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > Sent: Monday, August 22, 2016 6:02 PM
> > To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
> > Cc: yuanhan.liu@linux.intel.com
> > Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
> >
> > On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> > > Hi Zhihong,
> > >
> > > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > > > This patch set optimizes the vhost enqueue function.
..
> > I forgot to add that before this series, I think we should first fix the windows bug.
> > Else we will need a dedicated fix for the stable branch.
>
> Okay, I'll try to fix it, though I can't make any promises at present.
>
> I have tried once but stopped, since we don't have enough debug info from the
> frontend side, so basically I was debugging the backend based on guesses.

Hi Maxime, Yuanhan,

I've identified the root cause. Do you think it makes sense to put the fix
in the same patch set, or send it as a separate patch?

Thanks
Zhihong

>
> > Regards,
> > Maxime
  
Maxime Coquelin Aug. 23, 2016, 12:16 p.m. UTC | #17
On 08/23/2016 12:43 PM, Wang, Zhihong wrote:
>
>
>> -----Original Message-----
>> From: Wang, Zhihong
>> Sent: Tuesday, August 23, 2016 10:31 AM
>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org
>> Cc: yuanhan.liu@linux.intel.com
>> Subject: RE: [PATCH v3 0/5] vhost: optimize enqueue
>>
>>
>>
>>> -----Original Message-----
>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>> Sent: Monday, August 22, 2016 6:02 PM
>>> To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
>>> Cc: yuanhan.liu@linux.intel.com
>>> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
..
>>>>
>>> I forgot to add that before this series, I think we should first fix the windows
>> bug.
>>> Else we will need a dedicated fix for the stable branch.
>>
>> Okay I'll try to fix it, though I can't make any promises at present.
>>
>> Have tried once but stopped since we don't have enough debug info from the
>> frontend side so basically I was debugging the backend based on guesses.
>
> Hi Maxime, Yuanhan,
>
> I've identified the root cause, do you think it makes sense to put the fix
> in the same patch set? Or send it as a separated patch?

Good work!

Agree with Yuanhan, send it before the optimization series.

Thanks,
Maxime
  
Yuanhan Liu Aug. 23, 2016, 12:22 p.m. UTC | #18
On Tue, Aug 23, 2016 at 10:43:36AM +0000, Wang, Zhihong wrote:
> > > I forgot to add that before this series, I think we should first fix the windows
> > bug.
> > > Else we will need a dedicated fix for the stable branch.
> > 
> > Okay I'll try to fix it, though I can't make any promises at present.
> > 
> > Have tried once but stopped since we don't have enough debug info from the
> > frontend side so basically I was debugging the backend based on guesses.
> 
> Hi Maxime, Yuanhan,
> 
> I've identified the root cause, do you think it makes sense to put the fix
> in the same patch set? Or send it as a separated patch?

Great!

Yes, it's okay to put it in the patch set (normally, as the first patch,
before the rewrite).

Please also add following line before your Signed-off-by in the commit
log:

    Cc: <stable@dpdk.org>

	--yliu
  
Zhihong Wang Aug. 24, 2016, 3:37 a.m. UTC | #19
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Monday, August 22, 2016 6:35 PM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Wang, Zhihong
> <zhihong.wang@intel.com>; yuanhan.liu@linux.intel.com
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> 2016-08-22 12:01, Maxime Coquelin:
> > I forgot to add that before this series, I think we should first fix the
> windows bug.
> > Else we will need a dedicated fix for the stable branch.
> 
> This is a funny situation :)
> If Zhihong had reworked the code without mentioning it is fixing a scenario
> with Windows guests, maybe that nobody would have notice ;) That's
> probably why it is not written in v2/v3. But thanks to the v1, we all know it:
> 	"It also fixes the issue working with Windows VMs."

I thought it'd be more appropriate to send a dedicated fix for the stable
branch, so I removed this info.

> 
> So yes, it would be a lot better to find the root cause and try to have a
> minimal fix for 16.07, then rework the code for performance in 16.11.
> I think we must avoid silent fixes, and even more, avoid writing specific fixes
> for stable branches without validating them in the master branch and its large
> users base.

Okay, that's also what Maxime and Yuanhan suggested.

BTW the root cause has been identified and fix will be in v4.

> 
> Thanks for your good works guys, DPDK vhost is improving very well.
  
Zhihong Wang Aug. 30, 2016, 3:35 a.m. UTC | #20
This patch set optimizes the vhost enqueue function.

It implements the vhost logic from scratch into a single function designed
for high performance and good maintainability, and improves CPU efficiency
significantly by optimizing cache access, which means:

 *  Higher maximum throughput can be achieved for fast frontends like DPDK
    virtio pmd.

 *  Better scalability can be achieved in that each vhost core can support
    more connections, because it takes fewer cycles to handle each single
    frontend.

This patch set contains:

 1. A Windows VM compatibility fix for vhost enqueue in 16.07 release.

 2. A baseline patch to rewrite the vhost logic.

 3. A series of optimization patches added upon the baseline.

The main optimization techniques are:

 1. Reorder code to reduce CPU pipeline stall cycles.

 2. Batch update the used ring for better efficiency.

 3. Prefetch descriptor to hide cache latency.

 4. Remove useless volatile attribute to allow compiler optimization.

Code reordering and batch used ring update bring most of the performance
improvements.
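
Technique 4 is worth a short illustration: with a `volatile` qualifier on
the ring index field, the compiler must reload and store it on every
access; without it, a burst can work on a local copy. A hypothetical
sketch (`virtqueue` and `last_used_idx` here are simplified stand-ins for
the real vhost structures):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified queue; the point is that last_used_idx is no
 * longer declared volatile, so it need not hit memory per packet. */
struct virtqueue { uint16_t last_used_idx; };

/* With volatile removed, the index can be read once per burst, kept in a
 * register while the burst is processed, and written back once. */
static uint16_t consume_burst(struct virtqueue *vq, uint16_t count)
{
    uint16_t idx = vq->last_used_idx;   /* single load for the whole burst */

    for (uint16_t i = 0; i < count; i++)
        idx++;                          /* per-packet bookkeeping */
    vq->last_used_idx = idx;            /* single store */
    return idx;
}
```

This is only safe when the field is not concurrently written by another
thread, which is the premise of the "remove useless volatile" patch.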

In the existing code there're 2 callbacks for vhost enqueue:

 *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.

 *  virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature is turned on. Besides, having 2 callback paths increases
maintenance effort.

Also, there's a compatibility issue in the existing code which causes
Windows VMs to hang when the mrg_rxbuf feature is turned on.

---
Changes in v4:

 1. Fix a Windows VM compatibility issue.

 2. Free shadow used ring in the right place.

 3. Add failure check for shadow used ring malloc.

 4. Refactor the code for clearer logic.

 5. Add PRINT_PACKET for debugging.

---
Changes in v3:

 1. Remove unnecessary memset which causes frontend stall on SNB & IVB.

 2. Rename variables to follow naming convention.

 3. Rewrite enqueue and delete the obsolete code in the same patch.

---
Changes in v2:

 1. Split the big function into several small ones.

 2. Use multiple patches to explain each optimization.

 3. Add comments.

Zhihong Wang (6):
  vhost: fix windows vm hang
  vhost: rewrite enqueue
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost-net.h  |   6 +-
 lib/librte_vhost/vhost_rxtx.c | 572 +++++++++++++++---------------------------
 lib/librte_vhost/virtio-net.c |  42 +++-
 3 files changed, 244 insertions(+), 376 deletions(-)
  
Zhihong Wang Sept. 9, 2016, 3:39 a.m. UTC | #21
This patch set optimizes the vhost enqueue function.

It implements the vhost logic from scratch into a single function designed
for high performance and good maintainability, and improves CPU efficiency
significantly by optimizing cache access, which means:

 *  Higher maximum throughput can be achieved for fast frontends like DPDK
    virtio pmd.

 *  Better scalability can be achieved in that each vhost core can support
    more connections, because it takes fewer cycles to handle each single
    frontend.

This patch set contains:

 1. A Windows VM compatibility fix for vhost enqueue in 16.07 release.

 2. A baseline patch to rewrite the vhost logic.

 3. A series of optimization patches added upon the baseline.

The main optimization techniques are:

 1. Reorder code to reduce CPU pipeline stall cycles.

 2. Batch update the used ring for better efficiency.

 3. Prefetch descriptor to hide cache latency.

 4. Remove useless volatile attribute to allow compiler optimization.

Code reordering and batch used ring update bring most of the performance
improvements.

In the existing code there're 2 callbacks for vhost enqueue:

 *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.

 *  virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature is turned on. Besides, having 2 callback paths increases
maintenance effort.

Also, there's a compatibility issue in the existing code which causes
Windows VMs to hang when the mrg_rxbuf feature is turned on.

---
Changes in v5:

 1. Rebase to the latest branch.

 2. Rename variables for a consistent naming style.

 3. Small changes like return value adjustment and vertical alignment.

 4. Add details in commit log.

---
Changes in v4:

 1. Fix a Windows VM compatibility issue.

 2. Free shadow used ring in the right place.

 3. Add failure check for shadow used ring malloc.

 4. Refactor the code for clearer logic.

 5. Add PRINT_PACKET for debugging.

---
Changes in v3:

 1. Remove unnecessary memset which causes frontend stall on SNB & IVB.

 2. Rename variables to follow naming convention.

 3. Rewrite enqueue and delete the obsolete code in the same patch.

---
Changes in v2:

 1. Split the big function into several small ones.

 2. Use multiple patches to explain each optimization.

 3. Add comments.

Zhihong Wang (6):
  vhost: fix windows vm hang
  vhost: rewrite enqueue
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost.c      |  20 +-
 lib/librte_vhost/vhost.h      |   6 +-
 lib/librte_vhost/vhost_user.c |  31 ++-
 lib/librte_vhost/virtio_net.c | 561 +++++++++++++++---------------------------
 4 files changed, 242 insertions(+), 376 deletions(-)
  
Maxime Coquelin Sept. 12, 2016, 1:52 p.m. UTC | #22
Hi,

On 09/09/2016 05:39 AM, Zhihong Wang wrote:
> This patch set optimizes the vhost enqueue function.
>
> It implements the vhost logic from scratch into a single function designed
> for high performance and good maintainability, and improves CPU efficiency
> significantly by optimizing cache access, which means:
>
>  *  Higher maximum throughput can be achieved for fast frontends like DPDK
>     virtio pmd.
>
>  *  Better scalability can be achieved that each vhost core can support
>     more connections because it takes less cycles to handle each single
>     frontend.
>
> This patch set contains:
>
>  1. A Windows VM compatibility fix for vhost enqueue in 16.07 release.
>
>  2. A baseline patch to rewrite the vhost logic.
>
>  3. A series of optimization patches added upon the baseline.
>
> The main optimization techniques are:
>
>  1. Reorder code to reduce CPU pipeline stall cycles.
>
>  2. Batch update the used ring for better efficiency.
>
>  3. Prefetch descriptor to hide cache latency.
>
>  4. Remove useless volatile attribute to allow compiler optimization.
>
> Code reordering and batch used ring update bring most of the performance
> improvements.
>
> In the existing code there're 2 callbacks for vhost enqueue:
>
>  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
>
>  *  virtio_dev_rx for mrg_rxbuf turned off cases.
>
> The performance of the existing code is not optimal, especially when the
> mrg_rxbuf feature turned on. Besides, having 2 callback paths increases
> maintenance efforts.
>
> Also, there's a compatibility issue in the existing code which causes
> Windows VM to hang when the mrg_rxbuf feature turned on.
>
> ---
> Changes in v5:
>
>  1. Rebase to the latest branch.
>
>  2. Rename variables to keep consistent in naming style.
>
>  3. Small changes like return value adjustment and vertical alignment.
>
>  4. Add details in commit log.
Just tried to apply your series without success.
Apparently, it is not based directly on the master branch,
as it lacks some SHA-1 information.

Could you rebase it against master please?

Thanks,
Maxime
  
Maxime Coquelin Sept. 12, 2016, 1:56 p.m. UTC | #23
On 09/12/2016 03:52 PM, Maxime Coquelin wrote:
> Hi,
>
> On 09/09/2016 05:39 AM, Zhihong Wang wrote:
>> This patch set optimizes the vhost enqueue function.
>>
>> It implements the vhost logic from scratch into a single function
>> designed
>> for high performance and good maintainability, and improves CPU
>> efficiency
>> significantly by optimizing cache access, which means:
>>
>>  *  Higher maximum throughput can be achieved for fast frontends like
>> DPDK
>>     virtio pmd.
>>
>>  *  Better scalability can be achieved that each vhost core can support
>>     more connections because it takes less cycles to handle each single
>>     frontend.
>>
>> This patch set contains:
>>
>>  1. A Windows VM compatibility fix for vhost enqueue in 16.07 release.
>>
>>  2. A baseline patch to rewrite the vhost logic.
>>
>>  3. A series of optimization patches added upon the baseline.
>>
>> The main optimization techniques are:
>>
>>  1. Reorder code to reduce CPU pipeline stall cycles.
>>
>>  2. Batch update the used ring for better efficiency.
>>
>>  3. Prefetch descriptor to hide cache latency.
>>
>>  4. Remove useless volatile attribute to allow compiler optimization.
>>
>> Code reordering and batch used ring update bring most of the performance
>> improvements.
>>
>> In the existing code there're 2 callbacks for vhost enqueue:
>>
>>  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
>>
>>  *  virtio_dev_rx for mrg_rxbuf turned off cases.
>>
>> The performance of the existing code is not optimal, especially when the
>> mrg_rxbuf feature turned on. Besides, having 2 callback paths increases
>> maintenance efforts.
>>
>> Also, there's a compatibility issue in the existing code which causes
>> Windows VM to hang when the mrg_rxbuf feature turned on.
>>
>> ---
>> Changes in v5:
>>
>>  1. Rebase to the latest branch.
>>
>>  2. Rename variables to keep consistent in naming style.
>>
>>  3. Small changes like return value adjustment and vertical alignment.
>>
>>  4. Add details in commit log.
> Just tried to apply your series without success.
> Apparently, it is not based directly on master branch,
> as it lacks some SHA-1 information.
>
> Could you rebase it against master please?

Ok, it is in fact based on top of:
git://dpdk.org/next/dpdk-next-virtio master

For v6, if any, could you add this info to the cover letter please?

Thanks,
Maxime
  
Yuanhan Liu Sept. 12, 2016, 2:01 p.m. UTC | #24
On Mon, Sep 12, 2016 at 03:52:12PM +0200, Maxime Coquelin wrote:
> Just tried to apply your series without success.
> Apparently, it is not based directly on master branch,
> as it lacks some SHA-1 information.
> 
> Could you rebase it against master please?

It's rebased against the dpdk-next-virtio tree [0], where all the
virtio/vhost patches are applied first.

[0]: http://dpdk.org/browse/next/dpdk-next-virtio/

	--yliu
  
Zhihong Wang Sept. 20, 2016, 2 a.m. UTC | #25
This patch set optimizes the vhost enqueue function.

It reimplements the vhost logic from scratch as a single function designed
for high performance and good maintainability, and significantly improves
CPU efficiency by optimizing cache access, which means:

 *  Higher maximum throughput can be achieved for fast frontends like DPDK
    virtio pmd.

 *  Better scalability can be achieved: each vhost core can support
    more connections because it takes fewer cycles to handle each single
    frontend.

This patch set contains:

 1. A Windows VM compatibility fix for vhost enqueue in 16.07 release.

 2. A baseline patch to rewrite the vhost logic.

 3. A series of optimization patches added upon the baseline.

The main optimization techniques are:

 1. Reorder code to reduce CPU pipeline stall cycles.

 2. Batch update the used ring for better efficiency.

 3. Prefetch descriptor to hide cache latency.

 4. Remove useless volatile attribute to allow compiler optimization.

Code reordering and batch used ring update bring most of the performance
improvements.

In the existing code there are 2 callbacks for vhost enqueue:

 *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.

 *  virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature is turned on. Besides, having 2 callback paths increases
maintenance effort.

Also, there's a compatibility issue in the existing code which causes
Windows VMs to hang when the mrg_rxbuf feature is turned on.

---
Changes in v6:

 1. Merge duplicated code.

 2. Introduce a function for used ring write.

 3. Add necessary comments.

---
Changes in v5:

 1. Rebase to dpdk-next-virtio master.

 2. Rename variables to keep consistent in naming style.

 3. Small changes like return value adjustment and vertical alignment.

 4. Add details in commit log.

---
Changes in v4:

 1. Fix a Windows VM compatibility issue.

 2. Free shadow used ring in the right place.

 3. Add failure check for shadow used ring malloc.

 4. Refactor the code for clearer logic.

 5. Add PRINT_PACKET for debugging.

---
Changes in v3:

 1. Remove unnecessary memset which causes frontend stall on SNB & IVB.

 2. Rename variables to follow naming convention.

 3. Rewrite enqueue and delete the obsolete code in the same patch.

---
Changes in v2:

 1. Split the big function into several small ones.

 2. Use multiple patches to explain each optimization.

 3. Add comments.

Zhihong Wang (6):
  vhost: fix windows vm hang
  vhost: rewrite enqueue
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost.c      |  20 +-
 lib/librte_vhost/vhost.h      |   6 +-
 lib/librte_vhost/vhost_user.c |  31 ++-
 lib/librte_vhost/virtio_net.c | 541 ++++++++++++++----------------------------
 4 files changed, 225 insertions(+), 373 deletions(-)
  
Yuanhan Liu Sept. 21, 2016, 2:26 a.m. UTC | #26
Hi Maxime,

Do you have more comments about this set? If no, I think I could merge
it shortly.

Thanks.

	--yliu

  
Maxime Coquelin Sept. 21, 2016, 4:39 a.m. UTC | #27
Hi Yuanhan,

On 09/21/2016 04:26 AM, Yuanhan Liu wrote:
> Hi Maxime,
>
> Do you have more comments about this set? If no, I think I could merge
> it shortly.

No more comments, this is good to me.

Feel free to add:
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime

  
Jianbo Liu Sept. 21, 2016, 8:50 a.m. UTC | #28
Hi Maxime,

On 22 August 2016 at 16:11, Maxime Coquelin <maxime.coquelin@redhat.com> wrote:
> Hi Zhihong,
>
> On 08/19/2016 07:43 AM, Zhihong Wang wrote:
>>
>> This patch set optimizes the vhost enqueue function.
>>
...

>
> My setup consists of one host running a guest.
> The guest generates as much 64bytes packets as possible using

Have you tested with other different packet size?
My testing shows that performance is dropping when packet size is more than 256.

> pktgen-dpdk. The hosts forwards received packets back to the guest
> using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> physical CPUs.
>
> I tested it with and without your v1 patch, with and without
> rx-mergeable feature turned ON.
> Results are the average of 8 runs of 60 seconds:
>
> Rx-Mergeable ON : 7.72Mpps
> Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> Rx-Mergeable OFF: 10.52Mpps
> Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
>
> Regards,
> Maxime
  
Zhihong Wang Sept. 21, 2016, 9:27 a.m. UTC | #29
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Wednesday, September 21, 2016 4:50 PM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org;
> yuanhan.liu@linux.intel.com
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> Hi Maxime,
>
> On 22 August 2016 at 16:11, Maxime Coquelin
> <maxime.coquelin@redhat.com> wrote:
> > Hi Zhihong,
> >
> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> >>
> >> This patch set optimizes the vhost enqueue function.
> >>
> ...
>
> >
> > My setup consists of one host running a guest.
> > The guest generates as much 64bytes packets as possible using
>
> Have you tested with other different packet size?
> My testing shows that performance is dropping when packet size is more
> than 256.

Hi Jianbo,

Thanks for reporting this.

 1. Are you running the vector frontend with mrg_rxbuf=off?

 2. Could you please specify what CPU you're running? Is it Haswell
    or Ivy Bridge?

 3. What percentage of drop are you seeing?

I expected this because I've already found the root cause and the way
to optimize it, but since it missed the v0 deadline and requires
changes in eal/memcpy, I'll postpone it to the next release.

After the upcoming optimization the performance for packets larger
than 256 will be improved, and the new code will be much faster than
the current code.


Thanks
Zhihong


> 
> > pktgen-dpdk. The hosts forwards received packets back to the guest
> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> > physical CPUs.
> >
> > I tested it with and without your v1 patch, with and without
> > rx-mergeable feature turned ON.
> > Results are the average of 8 runs of 60 seconds:
> >
> > Rx-Mergeable ON : 7.72Mpps
> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> > Rx-Mergeable OFF: 10.52Mpps
> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> >
> > Regards,
> > Maxime
  
Jianbo Liu Sept. 21, 2016, 12:54 p.m. UTC | #30
On 21 September 2016 at 17:27, Wang, Zhihong <zhihong.wang@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Wednesday, September 21, 2016 4:50 PM
>> To: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org;
>> yuanhan.liu@linux.intel.com
>> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>>
>> Hi Maxime,
>>
>> On 22 August 2016 at 16:11, Maxime Coquelin
>> <maxime.coquelin@redhat.com> wrote:
>> > Hi Zhihong,
>> >
>> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
>> >>
>> >> This patch set optimizes the vhost enqueue function.
>> >>
>> ...
>>
>> >
>> > My setup consists of one host running a guest.
>> > The guest generates as much 64bytes packets as possible using
>>
>> Have you tested with other different packet size?
>> My testing shows that performance is dropping when packet size is more
>> than 256.
>
>
> Hi Jianbo,
>
> Thanks for reporting this.
>
>  1. Are you running the vector frontend with mrg_rxbuf=off?
>
>  2. Could you please specify what CPU you're running? Is it Haswell
>     or Ivy Bridge?
>
>  3. How many percentage of drop are you seeing?
>
> This is expected by me because I've already found the root cause and
> the way to optimize it, but since it missed the v0 deadline and
> requires changes in eal/memcpy, I postpone it to the next release.
>
> After the upcoming optimization the performance for packets larger
> than 256 will be improved, and the new code will be much faster than
> the current code.
>

Sorry, I tested on an ARM server, but I wonder if there is the same
issue for x86 platform.

>> > pktgen-dpdk. The hosts forwards received packets back to the guest
>> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
>> > physical CPUs.
>> >
>> > I tested it with and without your v1 patch, with and without
>> > rx-mergeable feature turned ON.
>> > Results are the average of 8 runs of 60 seconds:
>> >
>> > Rx-Mergeable ON : 7.72Mpps
>> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
>> > Rx-Mergeable OFF: 10.52Mpps
>> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
>> >
  
Zhihong Wang Sept. 22, 2016, 2:11 a.m. UTC | #31
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Wednesday, September 21, 2016 8:54 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org;
> yuanhan.liu@linux.intel.com
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On 21 September 2016 at 17:27, Wang, Zhihong <zhihong.wang@intel.com>
> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> >> Sent: Wednesday, September 21, 2016 4:50 PM
> >> To: Maxime Coquelin <maxime.coquelin@redhat.com>
> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org;
> >> yuanhan.liu@linux.intel.com
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> Hi Maxime,
> >>
> >> On 22 August 2016 at 16:11, Maxime Coquelin
> >> <maxime.coquelin@redhat.com> wrote:
> >> > Hi Zhihong,
> >> >
> >> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> >> >>
> >> >> This patch set optimizes the vhost enqueue function.
> >> >>
> >> ...
> >>
> >> >
> >> > My setup consists of one host running a guest.
> >> > The guest generates as much 64bytes packets as possible using
> >>
> >> Have you tested with other different packet size?
> >> My testing shows that performance is dropping when packet size is more
> >> than 256.
> >
> >
> > Hi Jianbo,
> >
> > Thanks for reporting this.
> >
> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >
> >  2. Could you please specify what CPU you're running? Is it Haswell
> >     or Ivy Bridge?
> >
> >  3. How many percentage of drop are you seeing?
> >
> > This is expected by me because I've already found the root cause and
> > the way to optimize it, but since it missed the v0 deadline and
> > requires changes in eal/memcpy, I postpone it to the next release.
> >
> > After the upcoming optimization the performance for packets larger
> > than 256 will be improved, and the new code will be much faster than
> > the current code.
> >
>
> Sorry, I tested on an ARM server, but I wonder if there is the same
> issue for x86 platform.


For the mrg_rxbuf=off path there might be a slight drop for packets
larger than 256B (~3% for 512B and ~1% for 1024B), and no drop in
other cases.

This is not a bug or an issue, only that we need to enhance memcpy to
complete the whole optimization. That should be done in a separate
patch, which unfortunately misses this release window.


> 
> >> > pktgen-dpdk. The hosts forwards received packets back to the guest
> >> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> >> > physical CPUs.
> >> >
> >> > I tested it with and without your v1 patch, with and without
> >> > rx-mergeable feature turned ON.
> >> > Results are the average of 8 runs of 60 seconds:
> >> >
> >> > Rx-Mergeable ON : 7.72Mpps
> >> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> >> > Rx-Mergeable OFF: 10.52Mpps
> >> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> >> >
  
Yuanhan Liu Sept. 22, 2016, 2:29 a.m. UTC | #32
On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> > My setup consists of one host running a guest.
> >> > The guest generates as much 64bytes packets as possible using
> >>
> >> Have you tested with other different packet size?
> >> My testing shows that performance is dropping when packet size is more
> >> than 256.
> >
> >
> > Hi Jianbo,
> >
> > Thanks for reporting this.
> >
> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >
> >  2. Could you please specify what CPU you're running? Is it Haswell
> >     or Ivy Bridge?
> >
> >  3. How many percentage of drop are you seeing?
> >
> > This is expected by me because I've already found the root cause and
> > the way to optimize it, but since it missed the v0 deadline and
> > requires changes in eal/memcpy, I postpone it to the next release.
> >
> > After the upcoming optimization the performance for packets larger
> > than 256 will be improved, and the new code will be much faster than
> > the current code.
> >
> 
> Sorry, I tested on an ARM server, but I wonder if there is the same
> issue for x86 platform.

Would you please provide more details? Say, answer the two left
questions from Zhihong?

Thanks.

	--yliu
  
Jianbo Liu Sept. 22, 2016, 5:47 a.m. UTC | #33
On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
>> >> > My setup consists of one host running a guest.
>> >> > The guest generates as much 64bytes packets as possible using
>> >>
>> >> Have you tested with other different packet size?
>> >> My testing shows that performance is dropping when packet size is more
>> >> than 256.
>> >
>> >
>> > Hi Jianbo,
>> >
>> > Thanks for reporting this.
>> >
>> >  1. Are you running the vector frontend with mrg_rxbuf=off?
>> >
Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.

>> >  2. Could you please specify what CPU you're running? Is it Haswell
>> >     or Ivy Bridge?
>> >
It's an ARM server.

>> >  3. How many percentage of drop are you seeing?
The testing result:
size (bytes)     improvement (%)
64                   3.92
128                 11.51
256                  24.16
512                  -13.79
1024                -22.51
1500                -12.22
A correction is that performance is dropping if byte size is larger than 512.

>> >
>> > This is expected by me because I've already found the root cause and
>> > the way to optimize it, but since it missed the v0 deadline and
>> > requires changes in eal/memcpy, I postpone it to the next release.
>> >
>> > After the upcoming optimization the performance for packets larger
>> > than 256 will be improved, and the new code will be much faster than
>> > the current code.
>> >
>>
>> Sorry, I tested on an ARM server, but I wonder if there is the same
>> issue for x86 platform.
>
> Would you please provide more details? Say, answer the two left
> questions from Zhihong?
>
> Thanks.
>
>         --yliu
  
Zhihong Wang Sept. 22, 2016, 6:58 a.m. UTC | #34
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Thursday, September 22, 2016 1:48 PM
> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> > My setup consists of one host running a guest.
> >> >> > The guest generates as much 64bytes packets as possible using
> >> >>
> >> >> Have you tested with other different packet size?
> >> >> My testing shows that performance is dropping when packet size is
> more
> >> >> than 256.
> >> >
> >> >
> >> > Hi Jianbo,
> >> >
> >> > Thanks for reporting this.
> >> >
> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >
> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
>
> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> >> >     or Ivy Bridge?
> >> >
> It's an ARM server.
>
> >> >  3. How many percentage of drop are you seeing?
> The testing result:
> size (bytes)     improvement (%)
> 64                   3.92
> 128                 11.51
> 256                  24.16
> 512                  -13.79
> 1024                -22.51
> 1500                -12.22
> A correction is that performance is dropping if byte size is larger than 512.



Jianbo,

Could you please verify whether this patch really causes enqueue perf to drop?

You can test the enqueue path only by setting the guest to do rxonly, and
compare the Mpps reported by show port stats all in the guest.


Thanks
Zhihong

> 
> >> >
> >> > This is expected by me because I've already found the root cause and
> >> > the way to optimize it, but since it missed the v0 deadline and
> >> > requires changes in eal/memcpy, I postpone it to the next release.
> >> >
> >> > After the upcoming optimization the performance for packets larger
> >> > than 256 will be improved, and the new code will be much faster than
> >> > the current code.
> >> >
> >>
> >> Sorry, I tested on an ARM server, but I wonder if there is the same
> >> issue for x86 platform.
> >
> > Would you please provide more details? Say, answer the two left
> > questions from Zhihong?
> >
> > Thanks.
> >
> >         --yliu
  
Jianbo Liu Sept. 22, 2016, 9:01 a.m. UTC | #35
On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Thursday, September 22, 2016 1:48 PM
>> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
>> <maxime.coquelin@redhat.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>>
>> On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> wrote:
>> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
>> >> >> > My setup consists of one host running a guest.
>> >> >> > The guest generates as much 64bytes packets as possible using
>> >> >>
>> >> >> Have you tested with other different packet size?
>> >> >> My testing shows that performance is dropping when packet size is
>> more
>> >> >> than 256.
>> >> >
>> >> >
>> >> > Hi Jianbo,
>> >> >
>> >> > Thanks for reporting this.
>> >> >
>> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
>> >> >
>> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
>>
>> >> >  2. Could you please specify what CPU you're running? Is it Haswell
>> >> >     or Ivy Bridge?
>> >> >
>> It's an ARM server.
>>
>> >> >  3. How many percentage of drop are you seeing?
>> The testing result:
>> size (bytes)     improvement (%)
>> 64                   3.92
>> 128                 11.51
>> 256                  24.16
>> 512                  -13.79
>> 1024                -22.51
>> 1500                -12.22
>> A correction is that performance is dropping if byte size is larger than 512.
>
>
> Jianbo,
>
> Could you please verify does this patch really cause enqueue perf to drop?
>
> You can test the enqueue path only by set guest to do rxonly, and compare
> the mpps by show port stats all in the guest.
>
>
Tested with testpmd, host: txonly, guest: rxonly
size (bytes)     improvement (%)
64                    4.12
128                   6
256                   2.65
512                   -1.12
1024                 -7.02
  
Zhihong Wang Sept. 22, 2016, 10:04 a.m. UTC | #36
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Thursday, September 22, 2016 5:02 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com>
> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> >> Sent: Thursday, September 22, 2016 1:48 PM
> >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> >> <maxime.coquelin@redhat.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 22 September 2016 at 10:29, Yuanhan Liu
> <yuanhan.liu@linux.intel.com>
> >> wrote:
> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> >> > My setup consists of one host running a guest.
> >> >> >> > The guest generates as much 64bytes packets as possible using
> >> >> >>
> >> >> >> Have you tested with other different packet size?
> >> >> >> My testing shows that performance is dropping when packet size is
> >> more
> >> >> >> than 256.
> >> >> >
> >> >> >
> >> >> > Hi Jianbo,
> >> >> >
> >> >> > Thanks for reporting this.
> >> >> >
> >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >> >
> >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> >>
> >> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> >> >> >     or Ivy Bridge?
> >> >> >
> >> It's an ARM server.
> >>
> >> >> >  3. How many percentage of drop are you seeing?
> >> The testing result:
> >> size (bytes)     improvement (%)
> >> 64                   3.92
> >> 128                 11.51
> >> 256                  24.16
> >> 512                  -13.79
> >> 1024                -22.51
> >> 1500                -12.22
> >> A correction is that performance is dropping if byte size is larger than 512.
> >
> >
> > Jianbo,
> >
> > Could you please verify does this patch really cause enqueue perf to drop?
> >
> > You can test the enqueue path only by set guest to do rxonly, and compare
> > the mpps by show port stats all in the guest.
> >
> >
> Tested with testpmd, host: txonly, guest: rxonly
> size (bytes)     improvement (%)
> 64                    4.12
> 128                   6
> 256                   2.65
> 512                   -1.12
> 1024                 -7.02




Your numbers are a little hard for me to understand. This patch's
optimization contains 2 parts:

 1. ring operation: works for both mrg_rxbuf on and off

 2. remote write ordering: works for mrg_rxbuf=on only

So, for mrg_rxbuf=off, if this patch is good for 64B packets, then it
shouldn't do anything bad for larger packets.

This is the gain on x86 platform: host iofwd between nic and vhost,
guest rxonly.

nic2vm	enhancement
64	21.83%
128	16.97%
256	6.34%
512	0.01%
1024	0.00%

I suspect there's some complication in ARM's micro-arch.

Could you try v6 and apply all patches except the last one:
[PATCH v6 6/6] vhost: optimize cache access

And see if there's still a perf drop?


Thanks
Zhihong
  
Jianbo Liu Sept. 22, 2016, 2:41 p.m. UTC | #37
On 22 September 2016 at 18:04, Wang, Zhihong <zhihong.wang@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Thursday, September 22, 2016 5:02 PM
>> To: Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin
>> <maxime.coquelin@redhat.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>>
>> On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com>
>> wrote:
>> >
>> >
>> >> -----Original Message-----
>> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> >> Sent: Thursday, September 22, 2016 1:48 PM
>> >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
>> >> <maxime.coquelin@redhat.com>; dev@dpdk.org
>> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>> >>
>> >> On 22 September 2016 at 10:29, Yuanhan Liu
>> <yuanhan.liu@linux.intel.com>
>> >> wrote:
>> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
>> >> >> >> > My setup consists of one host running a guest.
>> >> >> >> > The guest generates as much 64bytes packets as possible using
>> >> >> >>
>> >> >> >> Have you tested with other different packet size?
>> >> >> >> My testing shows that performance is dropping when packet size is
>> >> more
>> >> >> >> than 256.
>> >> >> >
>> >> >> >
>> >> >> > Hi Jianbo,
>> >> >> >
>> >> >> > Thanks for reporting this.
>> >> >> >
>> >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
>> >> >> >
>> >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
>> >>
>> >> >> >  2. Could you please specify what CPU you're running? Is it Haswell
>> >> >> >     or Ivy Bridge?
>> >> >> >
>> >> It's an ARM server.
>> >>
>> >> >> >  3. How many percentage of drop are you seeing?
>> >> The testing result:
>> >> size (bytes)     improvement (%)
>> >> 64                   3.92
>> >> 128                 11.51
>> >> 256                  24.16
>> >> 512                  -13.79
>> >> 1024                -22.51
>> >> 1500                -12.22
>> >> A correction is that performance is dropping if byte size is larger than 512.
>> >
>> >
>> > Jianbo,
>> >
>> > Could you please verify does this patch really cause enqueue perf to drop?
>> >
>> > You can test the enqueue path only by set guest to do rxonly, and compare
>> > the mpps by show port stats all in the guest.
>> >
>> >
>> Tested with testpmd, host: txonly, guest: rxonly
>> size (bytes)     improvement (%)
>> 64                    4.12
>> 128                   6
>> 256                   2.65
>> 512                   -1.12
>> 1024                 -7.02
>
>
>
> I think your number is little bit hard to understand for me, this patch's
> optimization contains 2 parts:
>
>  1. ring operation: works for both mrg_rxbuf on and off
>
>  2. remote write ordering: works for mrg_rxbuf=on only
>
> So, for mrg_rxbuf=off, if this patch is good for 64B packets, then it
> shouldn't do anything bad for larger packets.
>
> This is the gain on x86 platform: host iofwd between nic and vhost,
> guest rxonly.
>
> nic2vm  enhancement
> 64      21.83%
> 128     16.97%
> 256     6.34%
> 512     0.01%
> 1024    0.00%
>
I booted up a VM with 2 virtual ports, and stressed the traffic between them.
First, I stressed with pktgen-dpdk in the VM, and did iofwd in the host.
Then, as you suggested, I did rxonly in the VM, and txonly in the host.

> I suspect there's some complication in ARM's micro-arch.
>
> Could you try v6 and apply all patches except the the last one:
> [PATCH v6 6/6] vhost: optimize cache access
>
> And see if there's still perf drop?
>
The last patch can improve the performance. The drop is actually
caused by the second patch.

Jianbo
  
Zhihong Wang Sept. 23, 2016, 2:56 a.m. UTC | #38
> -----Original Message-----

> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]

> Sent: Thursday, September 22, 2016 10:42 PM

> To: Wang, Zhihong <zhihong.wang@intel.com>

> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin

> <maxime.coquelin@redhat.com>; dev@dpdk.org

> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

> 

> On 22 September 2016 at 18:04, Wang, Zhihong <zhihong.wang@intel.com>

> wrote:

> >

> >

> >> -----Original Message-----

> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]

> >> Sent: Thursday, September 22, 2016 5:02 PM

> >> To: Wang, Zhihong <zhihong.wang@intel.com>

> >> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin

> >> <maxime.coquelin@redhat.com>; dev@dpdk.org

> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

> >>

> >> On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com>

> >> wrote:

> >> >

> >> >

> >> >> -----Original Message-----

> >> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]

> >> >> Sent: Thursday, September 22, 2016 1:48 PM

> >> >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>

> >> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin

> >> >> <maxime.coquelin@redhat.com>; dev@dpdk.org

> >> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

> >> >>

> >> >> On 22 September 2016 at 10:29, Yuanhan Liu

> >> <yuanhan.liu@linux.intel.com>

> >> >> wrote:

> >> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:

> >> >> >> >> > My setup consists of one host running a guest.

> >> >> >> >> > The guest generates as much 64bytes packets as possible using

> >> >> >> >>

> >> >> >> >> Have you tested with other different packet size?

> >> >> >> >> My testing shows that performance is dropping when packet size is

> >> >> more

> >> >> >> >> than 256.

> >> >> >> >

> >> >> >> >

> >> >> >> > Hi Jianbo,

> >> >> >> >

> >> >> >> > Thanks for reporting this.

> >> >> >> >

> >> >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?

> >> >> >> >

> >> >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.

> >> >>

> >> >> >> >  2. Could you please specify what CPU you're running? Is it Haswell

> >> >> >> >     or Ivy Bridge?

> >> >> >> >

> >> >> It's an ARM server.

> >> >>

> >> >> >> >  3. How many percentage of drop are you seeing?

> >> >> The testing result:

> >> >> size (bytes)     improvement (%)

> >> >> 64                   3.92

> >> >> 128                 11.51

> >> >> 256                  24.16

> >> >> 512                  -13.79

> >> >> 1024                -22.51

> >> >> 1500                -12.22

> >> >> A correction is that performance is dropping if byte size is larger than 512.

> >> >

> >> >

> >> > Jianbo,

> >> >

> >> > Could you please verify does this patch really cause enqueue perf to drop?

> >> >

> >> > You can test the enqueue path only by set guest to do rxonly, and compare

> >> > the mpps by show port stats all in the guest.

> >> >

> >> >

> >> Tested with testpmd, host: txonly, guest: rxonly

> >> size (bytes)     improvement (%)

> >> 64                    4.12

> >> 128                   6

> >> 256                   2.65

> >> 512                   -1.12

> >> 1024                 -7.02

> >

> >

> >

> > I think your number is little bit hard to understand for me, this patch's

> > optimization contains 2 parts:

> >

> >  1. ring operation: works for both mrg_rxbuf on and off

> >

> >  2. remote write ordering: works for mrg_rxbuf=on only

> >

> > So, for mrg_rxbuf=off, if this patch is good for 64B packets, then it

> > shouldn't do anything bad for larger packets.

> >

> > This is the gain on x86 platform: host iofwd between nic and vhost,

> > guest rxonly.

> >

> > nic2vm  enhancement

> > 64      21.83%

> > 128     16.97%

> > 256     6.34%

> > 512     0.01%

> > 1024    0.00%

> >

> I bootup a VM with 2 virtual port, and stress the traffic between them.

> First, I stressed with pktgen-dpdk in VM, and did iofwd in host.

> Then, as you told, I did rxonly in VM, and txonly in host.

> 

> > I suspect there's some complication in ARM's micro-arch.

> >

> > Could you try v6 and apply all patches except the the last one:

> > [PATCH v6 6/6] vhost: optimize cache access

> >

> > And see if there's still perf drop?

> >

> The last patch can improve the performance. The drop is actually

> caused by the second patch.



This is expected, because the 2nd patch is just a baseline; all the
optimization patches are organized in the rest of this patch set.

I think you can do bottleneck analysis on ARM to see what's slowing down
the perf; there might be some micro-arch complications there, most likely
in memcpy.

Do you use glibc's memcpy? I suggest hand-crafting one of your own.

Could you publish the mrg_rxbuf=on data also? It's more widely used
in terms of spec integrity.


Thanks
Zhihong


> 

> Jianbo
  
Jianbo Liu Sept. 23, 2016, 10:41 a.m. UTC | #39
On 23 September 2016 at 10:56, Wang, Zhihong <zhihong.wang@intel.com> wrote:
.....
> This is expected because the 2nd patch is just a baseline and all optimization
> patches are organized in the rest of this patch set.
>
> I think you can do bottleneck analysis on ARM to see what's slowing down the
> perf, there might be some micro-arch complications there, mostly likely in
> memcpy.
>
> Do you use glibc's memcpy? I suggest to hand-crafted it on your own.
>
> Could you publish the mrg_rxbuf=on data also? Since it's more widely used
> in terms of spec integrity.
>
I don't think it will be helpful for you, considering the differences
between x86 and ARM.
So please move on with this patchset...

Thanks!
Jianbo
  
Thomas Monjalon Sept. 23, 2016, 1:41 p.m. UTC | #40
2016-09-23 18:41, Jianbo Liu:
> On 23 September 2016 at 10:56, Wang, Zhihong <zhihong.wang@intel.com> wrote:
> .....
> > This is expected because the 2nd patch is just a baseline and all optimization
> > patches are organized in the rest of this patch set.
> >
> > I think you can do bottleneck analysis on ARM to see what's slowing down the
> > perf, there might be some micro-arch complications there, mostly likely in
> > memcpy.
> >
> > Do you use glibc's memcpy? I suggest to hand-crafted it on your own.
> >
> > Could you publish the mrg_rxbuf=on data also? Since it's more widely used
> > in terms of spec integrity.
> >
> I don't think it will be helpful for you, considering the differences
> between x86 and arm.
> So please move on with this patchset...

Jianbo,
I don't understand.
You said that the 2nd patch is a regression:
-       volatile uint16_t       last_used_idx;
+       uint16_t                last_used_idx;
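
As a rough illustration of the pattern behind the quoted hunk (dropping `volatile` on a privately owned index and making ordering explicit only where results are published), a hedged sketch follows; the names and structure here are hypothetical, not the actual DPDK code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch only: keep the locally-owned index a plain variable (the
 * compiler may cache it in a register), and enforce ordering just at
 * the point where the value is made visible to the other side. */
struct vq {
	uint16_t last_used_idx;            /* private: no volatile needed */
	_Atomic uint16_t *shared_used_idx; /* shared with the guest */
};

static inline void
publish_used(struct vq *vq, uint16_t n)
{
	vq->last_used_idx += n;
	/* release store: prior descriptor writes become visible to the
	 * guest before the updated index does */
	atomic_store_explicit(vq->shared_used_idx, vq->last_used_idx,
			      memory_order_release);
}
```

This is one common way to replace `volatile` with explicit C11 ordering; the real series uses DPDK's own barrier primitives.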

And the overall series leads to a performance regression
for packets > 512 B, right?
But we don't know whether you have tested the v6 or not.

Zhihong talked about some improvements possible in rte_memcpy.
ARM64 is using libc memcpy in rte_memcpy.

Now you seem to give up.
Does it mean you accept having a regression in 16.11 release?
Are you working on rte_memcpy?
  
Zhihong Wang Sept. 25, 2016, 5:41 a.m. UTC | #41
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Friday, September 23, 2016 9:41 PM
> To: Jianbo Liu <jianbo.liu@linaro.org>
> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu
> <yuanhan.liu@linux.intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> 2016-09-23 18:41, Jianbo Liu:
> > On 23 September 2016 at 10:56, Wang, Zhihong <zhihong.wang@intel.com>
> wrote:
> > .....
> > > This is expected because the 2nd patch is just a baseline and all optimization
> > > patches are organized in the rest of this patch set.
> > >
> > > I think you can do bottleneck analysis on ARM to see what's slowing down the
> > > perf, there might be some micro-arch complications there, mostly likely in
> > > memcpy.
> > >
> > > Do you use glibc's memcpy? I suggest to hand-crafted it on your own.
> > >
> > > Could you publish the mrg_rxbuf=on data also? Since it's more widely used
> > > in terms of spec integrity.
> > >
> > I don't think it will be helpful for you, considering the differences
> > between x86 and arm.


Hi Jianbo,

This patch does help on ARM for small packets like 64B-sized ones;
this actually proves the similarity between x86 and ARM in terms
of the caching optimization in this patch.

My estimation is based on:

 1. The last patch is for mrg_rxbuf=on, and since you said it helps
    perf, we can ignore it for now when we discuss mrg_rxbuf=off

 2. Vhost enqueue perf =
    Ring overhead + Virtio header overhead + Data memcpy overhead

 3. This patch helps small-packet traffic, which means it helps
    ring + virtio header operations

 4. So, when you say perf drop when packet size larger than 512B,
    this is most likely caused by memcpy in ARM not working well
    with this patch
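
The cost decomposition in point 2 can be sketched as an annotated enqueue step; everything below is a heavily simplified, hypothetical skeleton (names and layout are not the real vhost code), just marking where the three cost components land:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor/ring layout for illustration only. */
struct desc { void *addr; uint32_t len; };
struct ring { uint16_t idx; uint8_t hdr_len; struct desc desc[256]; };

static int
enqueue_one(struct ring *r, const void *pkt, uint32_t pkt_len)
{
	/* 1. ring overhead: index/descriptor bookkeeping, and (not shown)
	 *    the used-ring update back to the guest */
	struct desc *d = &r->desc[r->idx++ & 255];
	if (d->len < (uint32_t)r->hdr_len + pkt_len)
		return -1;

	/* 2. virtio header overhead: write the virtio-net header into
	 *    guest memory ahead of the payload */
	memset(d->addr, 0, r->hdr_len);

	/* 3. data memcpy overhead: dominates for larger packets */
	memcpy((uint8_t *)d->addr + r->hdr_len, pkt, pkt_len);
	return 0;
}
```

The point of the decomposition: a patch that speeds up steps 1 and 2 shows up mostly at small packet sizes, while step 3 governs the large-packet numbers.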

I'm not saying glibc's memcpy is not good enough; it's just that
this is a rather special use case. And since we see that specialized
memcpy + this patch gives significantly better performance than other
combinations on x86, we suggest hand-crafting a specialized memcpy
for it.
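
As a rough illustration, a size-specialized copy could look like the sketch below. The function name and chunk size are hypothetical, and DPDK's actual rte_memcpy is far more elaborate; the idea is only that fixed-size memcpy() calls get inlined by the compiler into a few wide loads/stores, skipping the generic run-time size dispatch a libc memcpy performs:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch: copy cache-line-sized chunks with a
 * constant-size memcpy (inlined), then handle the tail. */
static inline void
vhost_pkt_copy(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	while (len >= 64) {
		memcpy(d, s, 64);	/* constant size: inlined by compiler */
		d += 64;
		s += 64;
		len -= 64;
	}
	if (len)
		memcpy(d, s, len);	/* remainder */
}
```

Whether this helps on a given ARM core depends on its load/store and cache behavior, which is exactly the open question in this thread.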

Of course on ARM this is still just my speculation, and we need to
either prove it or find the actual root cause.

It can be **REALLY HELPFUL** if you could help to test this patch on
ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful
to ARM at all, since mrg_rxbuf=on is the more widely used case.


Thanks
Zhihong


> > So please move on with this patchset...
> 
> Jianbo,
> I don't understand.
> You said that the 2nd patch is a regression:
> -       volatile uint16_t       last_used_idx;
> +       uint16_t                last_used_idx;
> 
> And the overrall series lead to performance regression
> for packets > 512 B, right?
> But we don't know wether you have tested the v6 or not.
> 
> Zhihong talked about some improvements possible in rte_memcpy.
> ARM64 is using libc memcpy in rte_memcpy.
> 
> Now you seem to give up.
> Does it mean you accept having a regression in 16.11 release?
> Are you working on rte_memcpy?
  
Jianbo Liu Sept. 26, 2016, 4:24 a.m. UTC | #42
Hi Thomas,

On 23 September 2016 at 21:41, Thomas Monjalon
<thomas.monjalon@6wind.com> wrote:
> 2016-09-23 18:41, Jianbo Liu:
>> On 23 September 2016 at 10:56, Wang, Zhihong <zhihong.wang@intel.com> wrote:
>> .....
>> > This is expected because the 2nd patch is just a baseline and all optimization
>> > patches are organized in the rest of this patch set.
>> >
>> > I think you can do bottleneck analysis on ARM to see what's slowing down the
>> > perf, there might be some micro-arch complications there, mostly likely in
>> > memcpy.
>> >
>> > Do you use glibc's memcpy? I suggest to hand-crafted it on your own.
>> >
>> > Could you publish the mrg_rxbuf=on data also? Since it's more widely used
>> > in terms of spec integrity.
>> >
>> I don't think it will be helpful for you, considering the differences
>> between x86 and arm.
>> So please move on with this patchset...
>
> Jianbo,
> I don't understand.
> You said that the 2nd patch is a regression:
> -       volatile uint16_t       last_used_idx;
> +       uint16_t                last_used_idx;
>
No, I meant "vhost: rewrite enqueue".

> And the overrall series lead to performance regression
> for packets > 512 B, right?
> But we don't know wether you have tested the v6 or not.
Yes, I tested v6, and found a performance regression for sizes >= 512B.

>
> Zhihong talked about some improvements possible in rte_memcpy.
> ARM64 is using libc memcpy in rte_memcpy.
>
> Now you seem to give up.
> Does it mean you accept having a regression in 16.11 release?
> Are you working on rte_memcpy?
This patchset actually improves performance according to Zhihong's
results on the x86 platform. And I also get an improvement, at least
with small-size packets, on ARM.
I don't want to give up, but I need more time to find out the reason
for the regression. I think rte_memcpy is definitely one of the ways
to improve performance, but could it be the reason here?
  
Jianbo Liu Sept. 26, 2016, 5:12 a.m. UTC | #43
On 25 September 2016 at 13:41, Wang, Zhihong <zhihong.wang@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
>> Sent: Friday, September 23, 2016 9:41 PM
>> To: Jianbo Liu <jianbo.liu@linaro.org>
>> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu
>> <yuanhan.liu@linux.intel.com>; Maxime Coquelin
>> <maxime.coquelin@redhat.com>
....
> This patch does help in ARM for small packets like 64B sized ones,
> this actually proves the similarity between x86 and ARM in terms
> of caching optimization in this patch.
>
> My estimation is based on:
>
>  1. The last patch are for mrg_rxbuf=on, and since you said it helps
>     perf, we can ignore it for now when we discuss mrg_rxbuf=off
>
>  2. Vhost enqueue perf =
>     Ring overhead + Virtio header overhead + Data memcpy overhead
>
>  3. This patch helps small packets traffic, which means it helps
>     ring + virtio header operations
>
>  4. So, when you say perf drop when packet size larger than 512B,
>     this is most likely caused by memcpy in ARM not working well
>     with this patch
>
> I'm not saying glibc's memcpy is not good enough, it's just that
> this is a rather special use case. And since we see specialized
> memcpy + this patch give better performance than other combinations
> significantly on x86, we suggest to hand-craft a specialized memcpy
> for it.
>
> Of course on ARM this is still just my speculation, and we need to
> either prove it or find the actual root cause.
>
> It can be **REALLY HELPFUL** if you could help to test this patch on
> ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful
> to ARM at all, since mrg_rxbuf=on the more widely used cases.
>
Actually it's worse than mrg_rxbuf=off.
  
Zhihong Wang Sept. 26, 2016, 5:25 a.m. UTC | #44
> -----Original Message-----

> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]

> Sent: Monday, September 26, 2016 1:13 PM

> To: Wang, Zhihong <zhihong.wang@intel.com>

> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org; Yuanhan

> Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin

> <maxime.coquelin@redhat.com>

> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

> 

> On 25 September 2016 at 13:41, Wang, Zhihong <zhihong.wang@intel.com>

> wrote:

> >

> >

> >> -----Original Message-----

> >> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]

> >> Sent: Friday, September 23, 2016 9:41 PM

> >> To: Jianbo Liu <jianbo.liu@linaro.org>

> >> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu

> >> <yuanhan.liu@linux.intel.com>; Maxime Coquelin

> >> <maxime.coquelin@redhat.com>

> ....

> > This patch does help in ARM for small packets like 64B sized ones,

> > this actually proves the similarity between x86 and ARM in terms

> > of caching optimization in this patch.

> >

> > My estimation is based on:

> >

> >  1. The last patch are for mrg_rxbuf=on, and since you said it helps

> >     perf, we can ignore it for now when we discuss mrg_rxbuf=off

> >

> >  2. Vhost enqueue perf =

> >     Ring overhead + Virtio header overhead + Data memcpy overhead

> >

> >  3. This patch helps small packets traffic, which means it helps

> >     ring + virtio header operations

> >

> >  4. So, when you say perf drop when packet size larger than 512B,

> >     this is most likely caused by memcpy in ARM not working well

> >     with this patch

> >

> > I'm not saying glibc's memcpy is not good enough, it's just that

> > this is a rather special use case. And since we see specialized

> > memcpy + this patch give better performance than other combinations

> > significantly on x86, we suggest to hand-craft a specialized memcpy

> > for it.

> >

> > Of course on ARM this is still just my speculation, and we need to

> > either prove it or find the actual root cause.

> >

> > It can be **REALLY HELPFUL** if you could help to test this patch on

> > ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful

> > to ARM at all, since mrg_rxbuf=on the more widely used cases.

> >

> Actually it's worse than mrg_rxbuf=off.


I mean compare the perf of original vs. original + patch with
mrg_rxbuf turned on. Is there any perf improvement?
  
Luke Gorrie Sept. 26, 2016, 5:37 a.m. UTC | #45
On 22 September 2016 at 11:01, Jianbo Liu <jianbo.liu@linaro.org> wrote:

> Tested with testpmd, host: txonly, guest: rxonly
> size (bytes)     improvement (%)
> 64                    4.12
> 128                   6
> 256                   2.65
> 512                   -1.12
> 1024                 -7.02
>

Have you considered testing with more diverse workloads e.g. mixed packet
sizes that are not always multiples of the cache line & register sizes?
  
Jianbo Liu Sept. 26, 2016, 5:38 a.m. UTC | #46
On 26 September 2016 at 13:25, Wang, Zhihong <zhihong.wang@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Monday, September 26, 2016 1:13 PM
>> To: Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org; Yuanhan
>> Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin
>> <maxime.coquelin@redhat.com>
>> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>>
>> On 25 September 2016 at 13:41, Wang, Zhihong <zhihong.wang@intel.com>
>> wrote:
>> >
>> >
>> >> -----Original Message-----
>> >> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
>> >> Sent: Friday, September 23, 2016 9:41 PM
>> >> To: Jianbo Liu <jianbo.liu@linaro.org>
>> >> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu
>> >> <yuanhan.liu@linux.intel.com>; Maxime Coquelin
>> >> <maxime.coquelin@redhat.com>
>> ....
>> > This patch does help in ARM for small packets like 64B sized ones,
>> > this actually proves the similarity between x86 and ARM in terms
>> > of caching optimization in this patch.
>> >
>> > My estimation is based on:
>> >
>> >  1. The last patch are for mrg_rxbuf=on, and since you said it helps
>> >     perf, we can ignore it for now when we discuss mrg_rxbuf=off
>> >
>> >  2. Vhost enqueue perf =
>> >     Ring overhead + Virtio header overhead + Data memcpy overhead
>> >
>> >  3. This patch helps small packets traffic, which means it helps
>> >     ring + virtio header operations
>> >
>> >  4. So, when you say perf drop when packet size larger than 512B,
>> >     this is most likely caused by memcpy in ARM not working well
>> >     with this patch
>> >
>> > I'm not saying glibc's memcpy is not good enough, it's just that
>> > this is a rather special use case. And since we see specialized
>> > memcpy + this patch give better performance than other combinations
>> > significantly on x86, we suggest to hand-craft a specialized memcpy
>> > for it.
>> >
>> > Of course on ARM this is still just my speculation, and we need to
>> > either prove it or find the actual root cause.
>> >
>> > It can be **REALLY HELPFUL** if you could help to test this patch on
>> > ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful
>> > to ARM at all, since mrg_rxbuf=on the more widely used cases.
>> >
>> Actually it's worse than mrg_rxbuf=off.
>
> I mean compare the perf of original vs. original + patch with
> mrg_rxbuf turned on. Is there any perf improvement?
>
Yes, orig + patch + on is better than orig + on, but orig + patch + on
is worse than orig + patch + off.
  
Jianbo Liu Sept. 26, 2016, 5:40 a.m. UTC | #47
On 26 September 2016 at 13:37, Luke Gorrie <luke@snabb.co> wrote:
> On 22 September 2016 at 11:01, Jianbo Liu <jianbo.liu@linaro.org> wrote:
>>
>> Tested with testpmd, host: txonly, guest: rxonly
>> size (bytes)     improvement (%)
>> 64                    4.12
>> 128                   6
>> 256                   2.65
>> 512                   -1.12
>> 1024                 -7.02
>
>
> Have you considered testing with more diverse workloads e.g. mixed packet
> sizes that are not always multiples of the cache line & register sizes?
>
No. Can testpmd stress performance with mixed sizes?
  
Zhihong Wang Sept. 26, 2016, 6 a.m. UTC | #48
> -----Original Message-----

> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]

> Sent: Monday, September 26, 2016 1:39 PM

> To: Wang, Zhihong <zhihong.wang@intel.com>

> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org; Yuanhan

> Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin

> <maxime.coquelin@redhat.com>

> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

> 

> On 26 September 2016 at 13:25, Wang, Zhihong <zhihong.wang@intel.com>

> wrote:

> >

> >

> >> -----Original Message-----

> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]

> >> Sent: Monday, September 26, 2016 1:13 PM

> >> To: Wang, Zhihong <zhihong.wang@intel.com>

> >> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org;

> Yuanhan

> >> Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin

> >> <maxime.coquelin@redhat.com>

> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

> >>

> >> On 25 September 2016 at 13:41, Wang, Zhihong <zhihong.wang@intel.com>

> >> wrote:

> >> >

> >> >

> >> >> -----Original Message-----

> >> >> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]

> >> >> Sent: Friday, September 23, 2016 9:41 PM

> >> >> To: Jianbo Liu <jianbo.liu@linaro.org>

> >> >> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan

> Liu

> >> >> <yuanhan.liu@linux.intel.com>; Maxime Coquelin

> >> >> <maxime.coquelin@redhat.com>

> >> ....

> >> > This patch does help in ARM for small packets like 64B sized ones,

> >> > this actually proves the similarity between x86 and ARM in terms

> >> > of caching optimization in this patch.

> >> >

> >> > My estimation is based on:

> >> >

> >> >  1. The last patch are for mrg_rxbuf=on, and since you said it helps

> >> >     perf, we can ignore it for now when we discuss mrg_rxbuf=off

> >> >

> >> >  2. Vhost enqueue perf =

> >> >     Ring overhead + Virtio header overhead + Data memcpy overhead

> >> >

> >> >  3. This patch helps small packets traffic, which means it helps

> >> >     ring + virtio header operations

> >> >

> >> >  4. So, when you say perf drop when packet size larger than 512B,

> >> >     this is most likely caused by memcpy in ARM not working well

> >> >     with this patch

> >> >

> >> > I'm not saying glibc's memcpy is not good enough, it's just that

> >> > this is a rather special use case. And since we see specialized

> >> > memcpy + this patch give better performance than other combinations

> >> > significantly on x86, we suggest to hand-craft a specialized memcpy

> >> > for it.

> >> >

> >> > Of course on ARM this is still just my speculation, and we need to

> >> > either prove it or find the actual root cause.

> >> >

> >> > It can be **REALLY HELPFUL** if you could help to test this patch on

> >> > ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful

> >> > to ARM at all, since mrg_rxbuf=on the more widely used cases.

> >> >

> >> Actually it's worse than mrg_rxbuf=off.

> >

> > I mean compare the perf of original vs. original + patch with

> > mrg_rxbuf turned on. Is there any perf improvement?

> >

> Yes, orig + patch + on is better than orig + on, but orig + patch + on

> is worse than orig + patch + off.



Hi Jianbo,

That's the way it is for virtio: if you compare with the current
enqueue, the mrg=on perf is even slower.

We should compare:

 1. mrg on: orig vs. orig + patch

 2. mrg off: orig vs. orig + patch

There are more memory touches in the frontend, which brings down the
performance when mrg is on.

Finally, even though mrg on is slower, it's still the mainstream use case
as far as I know.


Thanks
Zhihong
  
Yuanhan Liu Oct. 12, 2016, 2:53 a.m. UTC | #49
On Thu, Sep 22, 2016 at 01:47:45PM +0800, Jianbo Liu wrote:
> On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> > My setup consists of one host running a guest.
> >> >> > The guest generates as much 64bytes packets as possible using
> >> >>
> >> >> Have you tested with other different packet size?
> >> >> My testing shows that performance is dropping when packet size is more
> >> >> than 256.
> >> >
> >> >
> >> > Hi Jianbo,
> >> >
> >> > Thanks for reporting this.
> >> >
> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >
> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> 
> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> >> >     or Ivy Bridge?
> >> >
> It's an ARM server.
> 
> >> >  3. How many percentage of drop are you seeing?
> The testing result:
> size (bytes)     improvement (%)
> 64                   3.92
> 128                 11.51
> 256                  24.16
> 512                  -13.79
> 1024                -22.51
> 1500                -12.22
> A correction is that performance is dropping if byte size is larger than 512.

I have thought of this twice. Unfortunately, I think I may need NACK this
series.

Merging two code paths into one is really good: as you stated, it improves
the maintainability. But only if we see no performance regression on both
paths after the refactor. Unfortunately, that's not the case here: it hurts
the performance for one code path (non-mrg Rx).

That makes me think we perhaps should not do the code path merge at all. I think
that also aligns with what you have said before (internally): we could do the
merge if it gives comparable performance before and after that.

Besides that, I don't quite like the way you did it in patch 2 (rewrite enqueue):
you made a lot of changes in one patch. That means if something goes wrong,
it is hard to narrow down which change introduced the regression. Sadly,
that's exactly what we met here. Weeks have passed, and I see no progress.

That's the reason we like the idea of "one patch only does one thing, an
atomic thing".

So I will apply the first patch (it's a bug fixing patch) and ask you to
refactor the rest, without the code path merge.

I think we could still have a code base with good maintainability if we introduce
more common helper functions that can be used on both Rx paths, or even on the
Tx path (such as update_used_ring, or shadow_used_ring).
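
The shadow used ring idea mentioned here can be sketched roughly as below; all names are hypothetical, and the eventual DPDK implementation differs (e.g. it handles the ring wrap in batched copies):

```c
#include <stdint.h>

/* Sketch: accumulate used-ring entries in a private array and copy
 * them to the shared ring once per burst, instead of writing the
 * shared (cache-bouncing) ring for every packet. */
struct used_elem { uint32_t id; uint32_t len; };

struct shadow {
	struct used_elem elems[64];
	uint16_t count;
};

static inline void
shadow_add(struct shadow *s, uint32_t id, uint32_t len)
{
	s->elems[s->count].id = id;
	s->elems[s->count].len = len;
	s->count++;
}

static inline uint16_t
shadow_flush(struct shadow *s, struct used_elem *shared_ring,
	     uint16_t used_idx, uint16_t ring_mask)
{
	/* one batched pass over the shared ring per burst; a real
	 * implementation would split the copy at the ring wrap */
	for (uint16_t i = 0; i < s->count; i++)
		shared_ring[(uint16_t)(used_idx + i) & ring_mask] = s->elems[i];
	used_idx += s->count;
	s->count = 0;
	return used_idx;
}
```

A helper of this shape can serve both mrg and non-mrg Rx paths without merging them, which is the maintainability point being made.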

It's a bit late for too many changes in v16.11. I think you could just
grab patch 6 (vhost: optimize cache access) for the old mrg-Rx code path,
if that also helps the performance? Let us handle the rest in the next release,
such as the shadow used ring.

Thanks.

	--yliu
  
Zhihong Wang Oct. 12, 2016, 12:22 p.m. UTC | #50
> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Wednesday, October 12, 2016 10:53 AM
> To: Wang, Zhihong <zhihong.wang@intel.com>; Jianbo Liu <jianbo.liu@linaro.org>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org; Thomas
> Monjalon <thomas.monjalon@6wind.com>
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On Thu, Sep 22, 2016 at 01:47:45PM +0800, Jianbo Liu wrote:
> > On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
> > > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> > >> >> > My setup consists of one host running a guest.
> > >> >> > The guest generates as much 64bytes packets as possible using
> > >> >>
> > >> >> Have you tested with other different packet size?
> > >> >> My testing shows that performance is dropping when packet size is more
> > >> >> than 256.
> > >> >
> > >> >
> > >> > Hi Jianbo,
> > >> >
> > >> > Thanks for reporting this.
> > >> >
> > >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> > >> >
> > Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> >
> > >> >  2. Could you please specify what CPU you're running? Is it Haswell
> > >> >     or Ivy Bridge?
> > >> >
> > It's an ARM server.
> >
> > >> >  3. How many percentage of drop are you seeing?
> > The testing result:
> > size (bytes)     improvement (%)
> > 64                   3.92
> > 128                 11.51
> > 256                  24.16
> > 512                  -13.79
> > 1024                -22.51
> > 1500                -12.22
> > A correction is that performance is dropping if byte size is larger than 512.
> 
> I have thought of this twice. Unfortunately, I think I may need NACK this
> series.
> 
> Merging two code path into one is really good: as you stated, it improves
> the maintainability. But only if we see no performance regression on both
> path after the refactor. Unfortunately, that's not the case here: it hurts
> the performance for one code path (non-mrg Rx).
> 
> That makes me think we may should not do the code path merge at all. I think
> that also aligns with what you have said before (internally): we could do the
> merge if it gives comparable performance before and after that.
> 
> Besides that, I don't quite like the way you did in patch 2 (rewrite enqueue):
> you made a lot of changes in one patch. That means if something wrong
> happened,
> it is hard to narrow down which change introduces that regression. Badly,
> that's exactly what we met here. Weeks have been passed, I see no progress.
> 
> That's the reason we like the idea of "one patch only does one thing, an
> atomic thing".


Yuanhan, folks,

Thanks for the analysis. I disagree here though.

I analyze, develop, benchmark on x86 platforms, where this patch
works great.

I've been trying to analyze on ARM too, but it takes time and I have
a tight schedule. Also, since the ARM perf issue only surfaced at v6,
I might not be able to make it in time. However, that's what I have
to do for this patch to be merged in this or the next release.

In the meantime, may I suggest we consider the possibility of having
dedicated code for **perf-critical paths** on different kinds of
architecture?

It can be hard for a person to have both the knowledge and the
development environment for multiple archs at the same time.

Moreover, different optimization techniques might be required for
different archs, so it's hard and unnecessary to make one function
work well for all archs; sometimes it's just not the right thing to do.


Thanks
Zhihong


> 
> So I will apply the first patch (it's a bug fixing patch) and ask you to
> refactor the rest, without the code path merge.
> 
> I think we could still have a good maintainability code base if we introduce
> more common helper functions that can be used on both Rx path, or even on
> Tx path (such as update_used_ring, or shadow_used_ring).
> 
> It's a bit late for too many changes for v16.11. I think you could just
> grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path,
> if that also helps the performance? Let us handle the left in next release,
> such as shadow used ring.
> 
> Thanks.
> 
> 	--yliu
  
Thomas Monjalon Oct. 12, 2016, 3:31 p.m. UTC | #51
Sorry guys, you lost me in the discussion.

Is there some regression only on ARM?
Does it need some work specifically on memcpy for ARM,
or vhost for ARM?
Who can work on ARM optimization?

More comments below.

2016-10-12 12:22, Wang, Zhihong:
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> > > It's an ARM server.
> > >
> > > >> >  3. How many percentage of drop are you seeing?
> > > The testing result:
> > > size (bytes)     improvement (%)
> > > 64                   3.92
> > > 128                 11.51
> > > 256                  24.16
> > > 512                  -13.79
> > > 1024                -22.51
> > > 1500                -12.22
> > > A correction is that performance is dropping if byte size is larger than 512.
> > 
> > I have thought of this twice. Unfortunately, I think I may need NACK this
> > series.
> > 
> > Merging two code path into one is really good: as you stated, it improves
> > the maintainability. But only if we see no performance regression on both
> > path after the refactor. Unfortunately, that's not the case here: it hurts
> > the performance for one code path (non-mrg Rx).

+1

> > That makes me think we may should not do the code path merge at all. I think
> > that also aligns with what you have said before (internally): we could do the
> > merge if it gives comparable performance before and after that.
> > 
> > Besides that, I don't quite like the way you did in patch 2 (rewrite enqueue):
> > you made a lot of changes in one patch. That means if something wrong
> > happened,
> > it is hard to narrow down which change introduces that regression. Badly,
> > that's exactly what we met here. Weeks have been passed, I see no progress.

+1, it is important to have simple patches making changes step by step.

> > That's the reason we like the idea of "one patch only does one thing, an
> > atomic thing".
> 
> 
> Yuanhan, folks,
> 
> Thanks for the analysis. I disagree here though.
> 
> I analyze, develop, benchmark on x86 platforms, where this patch
> works great.
> 
> I've been trying to analyze on ARM too but it takes time and I've
> had a schedule. Also since the ARM perf issue comes when it's
> v6 already, I might not be able to make it in time. However
> that's what I have to do for this patch to be merged in this
> or the next release.
> 
> In the meantime, may I suggest we consider the possibility to
> have dedicated codes for **perf critical paths** for different
> kinds of architecture?

Yes that's what we do in several parts of DPDK.

> It can be hard for a person to have both the knowledge and the
> development environment for multiple archs at the same time.

Yes, we do not expect you to work on ARM.
So if nobody works on the ARM issue, you could make 2 code paths
in order to allow your optimization for x86 only.
But that's not the preferred way.
And you must split your rework to better identify which part
causes a regression on ARM.

> Moreover, different optimization techniques might be required for
> different archs, so it's hard and unnecessary to make a function
> works for all archs, sometimes it's just not the right thing to do.

Yes sometimes. Please help us to be convinced for this case.

> > So I will apply the first patch (it's a bug fixing patch) and ask you to
> > refactor the rest, without the code path merge.
> > 
> > I think we could still have a good maintainability code base if we introduce
> > more common helper functions that can be used on both Rx path, or even on
> > Tx path (such as update_used_ring, or shadow_used_ring).

Yes it is a good step.
And the code path merge could be reconsidered later.

> > It's a bit late for too many changes for v16.11. I think you could just
> > grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path,
> > if that also helps the performance? Let us handle the left in next release,
> > such as shadow used ring.

Thank you
  
Zhihong Wang Oct. 13, 2016, 1:21 a.m. UTC | #52
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Wednesday, October 12, 2016 11:31 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Jianbo Liu
> <jianbo.liu@linaro.org>; Maxime Coquelin <maxime.coquelin@redhat.com>;
> dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> Sorry guys, you lost me in the discussion.
> 
> Is there some regression only on ARM?

ARM is where we see it; no info on PPC yet.

> Does it need some work specifically on memcpy for ARM,
> or vhost for ARM?
> Who can work on ARM optimization?

These are still open questions. Jianbo, who reported this, doesn't
have time for more testing now, according to his reply.

I'm trying to do some test in the hope to identify the root cause.

> 
> More comments below.
> 
> 2016-10-12 12:22, Wang, Zhihong:
> > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> > > > It's an ARM server.
> > > >
> > > > >> >  3. How many percentage of drop are you seeing?
> > > > The testing result:
> > > > size (bytes)     improvement (%)
> > > > 64                   3.92
> > > > 128                 11.51
> > > > 256                  24.16
> > > > 512                  -13.79
> > > > 1024                -22.51
> > > > 1500                -12.22
> > > > A correction is that performance is dropping if byte size is larger than
> 512.
> > >
> > > I have thought of this twice. Unfortunately, I think I may need NACK this
> > > series.
> > >
> > > Merging two code path into one is really good: as you stated, it improves
> > > the maintainability. But only if we see no performance regression on both
> > > path after the refactor. Unfortunately, that's not the case here: it hurts
> > > the performance for one code path (non-mrg Rx).
> 
> +1
> 
> > > That makes me think we may should not do the code path merge at all. I
> think
> > > that also aligns with what you have said before (internally): we could do
> the
> > > merge if it gives comparable performance before and after that.
> > >
> > > Besides that, I don't quite like the way you did in patch 2 (rewrite
> enqueue):
> > > you made a lot of changes in one patch. That means if something wrong
> > > happened,
> > > it is hard to narrow down which change introduces that regression. Badly,
> > > that's exactly what we met here. Weeks have been passed, I see no
> progress.
> 
> +1, it is important to have simple patches making changes step by step.
> 
> > > That's the reason we like the idea of "one patch only does one thing, an
> > > atomic thing".
> >
> >
> > Yuanhan, folks,
> >
> > Thanks for the analysis. I disagree here though.
> >
> > I analyze, develop, benchmark on x86 platforms, where this patch
> > works great.
> >
> > I've been trying to analyze on ARM too but it takes time and I've
> > had a schedule. Also since the ARM perf issue comes when it's
> > v6 already, I might not be able to make it in time. However
> > that's what I have to do for this patch to be merged in this
> > or the next release.
> >
> > In the meantime, may I suggest we consider the possibility to
> > have dedicated codes for **perf critical paths** for different
> > kinds of architecture?
> 
> Yes that's what we do in several parts of DPDK.
> 
> > It can be hard for a person to have both the knowledge and the
> > development environment for multiple archs at the same time.
> 
> Yes we do not expect you work on ARM.
> So if nobody work on the ARM issue, you could make 2 code paths
> in order to allow your optimization for x86 only.
> But that's not the preferred way.
> And you must split your rework to better identify which part is
> a regression on ARM.
> 
> > Moreover, different optimization techniques might be required for
> > different archs, so it's hard and unnecessary to make a function
> > works for all archs, sometimes it's just not the right thing to do.
> 
> Yes sometimes. Please help us to be convinced for this case.
> 
> > > So I will apply the first patch (it's a bug fixing patch) and ask you to
> > > refactor the rest, without the code path merge.
> > >
> > > I think we could still have a good maintainability code base if we introduce
> > > more common helper functions that can be used on both Rx path, or
> even on
> > > Tx path (such as update_used_ring, or shadow_used_ring).
> 
> Yes it is a good step.
> And the code path merge could be reconsidered later.
> 
> > > It's a bit late for too many changes for v16.11. I think you could just
> > > grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path,
> > > if that also helps the performance? Let us handle the left in next release,
> > > such as shadow used ring.
> 
> Thank you
  
Jianbo Liu Oct. 13, 2016, 3:51 a.m. UTC | #53
Hi Thomas,

On 12 October 2016 at 23:31, Thomas Monjalon <thomas.monjalon@6wind.com> wrote:
> Sorry guys, you lost me in the discussion.
>
> Is there some regression only on ARM?
> Does it need some work specifically on memcpy for ARM,

I don't know if there is a common way to improve memcpy across different ARM
hardware. Even if there is, it could take time.
I have tried to do that using NEON (like SSE) instructions, but without success.
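For context, the wide-chunk copy pattern this refers to (what DPDK's rte_memcpy does with SSE on x86, and what a NEON port would mirror) can be sketched in portable C. This is only an illustration under my own assumptions: the `chunk_copy` name and the 16-byte chunk size are made up for the sketch, not the real implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for an rte_memcpy-style routine: copy in 16-byte
 * chunks so the compiler can emit one vector load/store pair per chunk
 * (SSE on x86, NEON on aarch64), then finish with a scalar tail. */
static void chunk_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
	while (n >= 16) {
		/* A fixed-size memcpy of 16 bytes typically lowers to a
		 * single 128-bit load/store pair. */
		memcpy(dst, src, 16);
		dst += 16;
		src += 16;
		n -= 16;
	}
	while (n > 0) {
		*dst++ = *src++;
		n--;
	}
}
```

Whether such a chunked copy wins on a given ARM core depends on the memory subsystem, which is consistent with the mixed results reported above.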

> or vhost for ARM?
> Who can work on ARM optimization?
>
  
Yuanhan Liu Oct. 14, 2016, 9:34 a.m. UTC | #54
This is a new set of patches to optimize the mergeable Rx code path.
No refactoring (rewrite) was made this time. It just applies some
findings from Zhihong (kudos to him!) that could improve the mergeable
Rx path on the old code.

The two major factors that could improve the performance greatly are:

- copy the virtio header together with the packet data. This removes
  the bubble between the two copies and optimizes cache access.

  This is implemented in patch 2 "vhost: optimize cache access"

- shadow used ring updates and flush them at once

  The basic idea is to stage used ring updates in a local buffer and
  flush them to the virtio used ring at once in the end. Again, this
  is for optimizing cache access.

  This is implemented in patch 5 "vhost: shadow used ring update"
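As a rough illustration of the shadow used ring idea (not the actual DPDK code: the `demo_vq`/`used_elem` types and the ring size below are invented for the sketch), completed descriptor chains are staged in a local buffer and then flushed to the guest-shared used ring with at most two memcpy calls:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the vring structures; the real
 * types live in lib/librte_vhost. */
struct used_elem { uint32_t id; uint32_t len; };

#define RING_SIZE 8  /* must be a power of two, like a real vring */

struct demo_vq {
	struct used_elem used[RING_SIZE];    /* guest-visible used ring */
	struct used_elem shadow[RING_SIZE];  /* local shadow buffer */
	uint32_t shadow_idx;                 /* entries pending flush */
	uint16_t last_used_idx;              /* free-running index */
};

/* Stage one completed descriptor chain in the local shadow buffer
 * instead of writing the cache-cold, guest-shared used ring each time. */
static void shadow_used_ring_add(struct demo_vq *vq, uint32_t id, uint32_t len)
{
	vq->shadow[vq->shadow_idx].id = id;
	vq->shadow[vq->shadow_idx].len = len;
	vq->shadow_idx++;
	vq->last_used_idx++;
}

/* Flush all staged entries to the used ring at once, handling ring
 * wrap-around with at most two memcpy calls. */
static void shadow_used_ring_flush(struct demo_vq *vq)
{
	uint32_t start = (uint16_t)(vq->last_used_idx - vq->shadow_idx) &
			 (RING_SIZE - 1);
	uint32_t first = RING_SIZE - start;

	if (first > vq->shadow_idx)
		first = vq->shadow_idx;
	memcpy(&vq->used[start], &vq->shadow[0],
	       first * sizeof(struct used_elem));
	memcpy(&vq->used[0], &vq->shadow[first],
	       (vq->shadow_idx - first) * sizeof(struct used_elem));
	vq->shadow_idx = 0;
}
```

In the real code the flush is also where `vhost_log_used_vring()` and the used-index update happen, so the shared cache lines are touched once per burst rather than once per packet.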

The two optimizations could yield a 40+% performance gain in micro
testing and 20+% in PVP case testing with 64B packet size.

Besides that, there are some tiny optimizations, such as prefetch
avail ring (patch 6) and retrieve avail head once (patch 7).
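The two tiny optimizations can be pictured with a simplified loop (the `demo_ring` type here is hypothetical, not the real vring structures): the avail head is read into a local once instead of re-reading the shared field every iteration, and the descriptor for iteration i+1 is prefetched while iteration i is processed.

```c
#include <stdint.h>

#define RING_SIZE 8  /* power of two, like a real vring */

/* Hypothetical flat stand-in for the avail ring + descriptor lengths. */
struct demo_ring {
	uint16_t avail_idx;           /* written by the guest */
	uint16_t avail[RING_SIZE];
	uint32_t desc_len[RING_SIZE];
};

static uint32_t process_avail(const struct demo_ring *r, uint16_t last_used)
{
	/* Retrieve the avail head once, instead of dereferencing the
	 * shared (cache-contended) field on every loop iteration. */
	uint16_t avail_head = r->avail_idx;
	uint32_t total = 0;

	for (uint16_t i = last_used; i != avail_head; i++) {
		uint16_t idx = r->avail[i & (RING_SIZE - 1)];

		/* Prefetch the next iteration's descriptor while this
		 * one is being processed, hiding the cache-miss latency. */
		if ((uint16_t)(i + 1) != avail_head) {
			uint16_t next = r->avail[(i + 1) & (RING_SIZE - 1)];
			__builtin_prefetch(&r->desc_len[next], 0, 0);
		}
		total += r->desc_len[idx];
	}
	return total;
}
```

The prefetch changes timing only, never the result, which is what makes it a safe "tiny" optimization to apply late in the cycle.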

Note: the shadow used ring technique could also be applied to the non-mrg
Rx path (and even the dequeue path). I didn't do that for two reasons:

- we already update the used ring in batch on both paths; it's just not
  shadowed first.

- it's a bit too late to make many changes at this stage: RC1 is out.

Please help testing.

Thanks.

	--yliu

Cc: Jianbo Liu <jianbo.liu@linaro.org>
---
Yuanhan Liu (4):
  vhost: simplify mergeable Rx vring reservation
  vhost: use last avail idx for avail ring reservation
  vhost: prefetch avail ring
  vhost: retrieve avail head once

Zhihong Wang (3):
  vhost: remove useless volatile
  vhost: optimize cache access
  vhost: shadow used ring update

 lib/librte_vhost/vhost.c      |  13 ++-
 lib/librte_vhost/vhost.h      |   5 +-
 lib/librte_vhost/vhost_user.c |  23 +++--
 lib/librte_vhost/virtio_net.c | 193 +++++++++++++++++++++++++-----------------
 4 files changed, 149 insertions(+), 85 deletions(-)
  
Jianbo Liu Oct. 18, 2016, 2:25 a.m. UTC | #55
On 14 October 2016 at 17:34, Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> This is a new set of patches to optimize the mergeable Rx code path.
> No refactoring (rewrite) was made this time. It just applies some
> findings from Zhihong (kudos to him!) that could improve the mergeable
> Rx path on the old code.
......

> ---
> Yuanhan Liu (4):
>   vhost: simplify mergeable Rx vring reservation
>   vhost: use last avail idx for avail ring reservation
>   vhost: prefetch avail ring
>   vhost: retrieve avail head once
>
> Zhihong Wang (3):
>   vhost: remove useless volatile
>   vhost: optimize cache access
>   vhost: shadow used ring update
>
>  lib/librte_vhost/vhost.c      |  13 ++-
>  lib/librte_vhost/vhost.h      |   5 +-
>  lib/librte_vhost/vhost_user.c |  23 +++--
>  lib/librte_vhost/virtio_net.c | 193 +++++++++++++++++++++++++-----------------
>  4 files changed, 149 insertions(+), 85 deletions(-)
>

Reviewed-by: Jianbo Liu <jianbo.liu@linaro.org>
  
Maxime Coquelin Oct. 18, 2016, 2:53 p.m. UTC | #56
Hi Yuanhan,

On 10/14/2016 11:34 AM, Yuanhan Liu wrote:
> This is a new set of patches to optimize the mergeable Rx code path.
> No refactoring (rewrite) was made this time. It just applies some
> findings from Zhihong (kudos to him!) that could improve the mergeable
> Rx path on the old code.
>
> The two major factors that could improve the performance greatly are:
>
> - copy virtio header together with packet data. This could remove
>   the buubbles between the two copy to optimize the cache access.
>
>   This is implemented in patch 2 "vhost: optimize cache access"
>
> - shadow used ring update and update them at once
>
>   The basic idea is to update used ring in a local buffer and flush
>   them to the virtio used ring at once in the end. Again, this is
>   for optimizing the cache access.
>
>   This is implemented in patch 5 "vhost: shadow used ring update"
>
> The two optimizations could yield 40+% performance in micro testing
> and 20+% in PVP case testing with 64B packet size.
>
> Besides that, there are some tiny optimizations, such as prefetch
> avail ring (patch 6) and retrieve avail head once (patch 7).
>
> Note: the shadow used ring tech could also be applied to the non-mrg
> Rx path (and even the dequeu) path. I didn't do that for two reasons:
>
> - we already update used ring in batch in both path: it's not shadowed
>   first though.
>
> - it's a bit too late too make many changes at this stage: RC1 is out.
>
> Please help testing.

I tested the following use-cases without noticing any functional problems:
  - Windows guests (mergeable ON & OFF, indirect disabled): ping other VM
  - Linux guests with kernel driver (mergeable ON & OFF, indirect OFF):
    iperf between 2 VMs
  - Linux guest with virtio PMD (mergeable ON & OFF): testpmd txonly on
    host, rxonly on guest.

Feel free to add my:
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime
  
Yuanhan Liu Oct. 21, 2016, 7:51 a.m. UTC | #57
Applied to dpdk-next-virtio.

And thanks for testing and reviewing.

	--yliu

On Fri, Oct 14, 2016 at 05:34:31PM +0800, Yuanhan Liu wrote:
> This is a new set of patches to optimize the mergeable Rx code path.
> No refactoring (rewrite) was made this time. It just applies some
> findings from Zhihong (kudos to him!) that could improve the mergeable
> Rx path on the old code.
> 
> The two major factors that could improve the performance greatly are:
> 
> - copy virtio header together with packet data. This could remove
>   the buubbles between the two copy to optimize the cache access.
> 
>   This is implemented in patch 2 "vhost: optimize cache access"
> 
> - shadow used ring update and update them at once
> 
>   The basic idea is to update used ring in a local buffer and flush
>   them to the virtio used ring at once in the end. Again, this is
>   for optimizing the cache access.
> 
>   This is implemented in patch 5 "vhost: shadow used ring update"
> 
> The two optimizations could yield 40+% performance in micro testing
> and 20+% in PVP case testing with 64B packet size.
> 
> Besides that, there are some tiny optimizations, such as prefetch
> avail ring (patch 6) and retrieve avail head once (patch 7).
> 
> Note: the shadow used ring tech could also be applied to the non-mrg
> Rx path (and even the dequeu) path. I didn't do that for two reasons:
> 
> - we already update used ring in batch in both path: it's not shadowed
>   first though.
> 
> - it's a bit too late too make many changes at this stage: RC1 is out. 
> 
> Please help testing.
> 
> Thanks.
> 
> 	--yliu
> 
> Cc: Jianbo Liu <jianbo.liu@linaro.org>
> ---
> Yuanhan Liu (4):
>   vhost: simplify mergeable Rx vring reservation
>   vhost: use last avail idx for avail ring reservation
>   vhost: prefetch avail ring
>   vhost: retrieve avail head once
> 
> Zhihong Wang (3):
>   vhost: remove useless volatile
>   vhost: optimize cache access
>   vhost: shadow used ring update
> 
>  lib/librte_vhost/vhost.c      |  13 ++-
>  lib/librte_vhost/vhost.h      |   5 +-
>  lib/librte_vhost/vhost_user.c |  23 +++--
>  lib/librte_vhost/virtio_net.c | 193 +++++++++++++++++++++++++-----------------
>  4 files changed, 149 insertions(+), 85 deletions(-)
> 
> -- 
> 1.9.0
  

Patch

diff --git a/lib/librte_vhost/vhost-net.h b/lib/librte_vhost/vhost-net.h
index 38593a2..a15182c 100644
--- a/lib/librte_vhost/vhost-net.h
+++ b/lib/librte_vhost/vhost-net.h
@@ -71,7 +71,7 @@  struct vhost_virtqueue {
 	uint32_t		size;
 
 	/* Last index used on the available ring */
-	volatile uint16_t	last_used_idx;
+	uint16_t		last_used_idx;
 #define VIRTIO_INVALID_EVENTFD		(-1)
 #define VIRTIO_UNINITIALIZED_EVENTFD	(-2)
 
@@ -85,6 +85,10 @@  struct vhost_virtqueue {
 
 	/* Physical address of used ring, for logging */
 	uint64_t		log_guest_addr;
+
+	/* Shadow used ring for performance */
+	struct vring_used_elem	*shadow_used_ring;
+	uint32_t		shadow_used_idx;
 } __rte_cache_aligned;
 
 /* Old kernels have no such macro defined */
diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 08a73fd..1263168 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -91,7 +91,7 @@  is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t qp_nb)
 	return (is_tx ^ (idx & 1)) == 0 && idx < qp_nb * VIRTIO_QNUM;
 }
 
-static void
+static inline void __attribute__((always_inline))
 virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr *net_hdr)
 {
 	if (m_buf->ol_flags & PKT_TX_L4_MASK) {
@@ -125,427 +125,227 @@  virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr *net_hdr)
 	}
 }
 
-static inline void
-copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr,
-		    struct virtio_net_hdr_mrg_rxbuf hdr)
-{
-	if (dev->vhost_hlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
-		*(struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr = hdr;
-	else
-		*(struct virtio_net_hdr *)(uintptr_t)desc_addr = hdr.hdr;
-}
-
-static inline int __attribute__((always_inline))
-copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
-		  struct rte_mbuf *m, uint16_t desc_idx)
+uint16_t
+rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
+	struct rte_mbuf **pkts, uint16_t count)
 {
-	uint32_t desc_avail, desc_offset;
-	uint32_t mbuf_avail, mbuf_offset;
-	uint32_t cpy_len;
+	struct virtio_net_hdr_mrg_rxbuf *virtio_hdr;
+	struct vhost_virtqueue *vq;
 	struct vring_desc *desc;
-	uint64_t desc_addr;
-	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
-
-	desc = &vq->desc[desc_idx];
-	desc_addr = gpa_to_vva(dev, desc->addr);
-	/*
-	 * Checking of 'desc_addr' placed outside of 'unlikely' macro to avoid
-	 * performance issue with some versions of gcc (4.8.4 and 5.3.0) which
-	 * otherwise stores offset on the stack instead of in a register.
-	 */
-	if (unlikely(desc->len < dev->vhost_hlen) || !desc_addr)
-		return -1;
-
-	rte_prefetch0((void *)(uintptr_t)desc_addr);
-
-	virtio_enqueue_offload(m, &virtio_hdr.hdr);
-	copy_virtio_net_hdr(dev, desc_addr, virtio_hdr);
-	vhost_log_write(dev, desc->addr, dev->vhost_hlen);
-	PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
-
-	desc_offset = dev->vhost_hlen;
-	desc_avail  = desc->len - dev->vhost_hlen;
-
-	mbuf_avail  = rte_pktmbuf_data_len(m);
-	mbuf_offset = 0;
-	while (mbuf_avail != 0 || m->next != NULL) {
-		/* done with current mbuf, fetch next */
-		if (mbuf_avail == 0) {
-			m = m->next;
-
-			mbuf_offset = 0;
-			mbuf_avail  = rte_pktmbuf_data_len(m);
-		}
-
-		/* done with current desc buf, fetch next */
-		if (desc_avail == 0) {
-			if ((desc->flags & VRING_DESC_F_NEXT) == 0) {
-				/* Room in vring buffer is not enough */
-				return -1;
-			}
-			if (unlikely(desc->next >= vq->size))
-				return -1;
-
-			desc = &vq->desc[desc->next];
-			desc_addr = gpa_to_vva(dev, desc->addr);
-			if (unlikely(!desc_addr))
-				return -1;
-
-			desc_offset = 0;
-			desc_avail  = desc->len;
-		}
-
-		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
-		rte_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)),
-			rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
-			cpy_len);
-		vhost_log_write(dev, desc->addr + desc_offset, cpy_len);
-		PRINT_PACKET(dev, (uintptr_t)(desc_addr + desc_offset),
-			     cpy_len, 0);
-
-		mbuf_avail  -= cpy_len;
-		mbuf_offset += cpy_len;
-		desc_avail  -= cpy_len;
-		desc_offset += cpy_len;
-	}
-
-	return 0;
-}
+	struct virtio_net *dev;
+	struct rte_mbuf *mbuf;
+	uint64_t desc_host_write_addr = 0;
+	uint32_t desc_chain_head = 0;
+	uint32_t desc_chain_len = 0;
+	uint32_t desc_current = 0;
+	uint32_t desc_write_offset = 0;
+	uint32_t used_idx_static = 0;
+	uint32_t pkt_idx = 0;
+	uint32_t pkt_left = 0;
+	uint32_t pkt_sent = 0;
+	uint32_t mbuf_len = 0;
+	uint32_t mbuf_len_left = 0;
+	uint32_t copy_len = 0;
+	uint32_t copy_virtio_hdr = 0;
+	uint32_t is_mrg_rxbuf = 0;
+	uint32_t is_virtio_1 = 0;
+
+	if (unlikely(count == 0))
+		return 0;
 
-/**
- * This function adds buffers to the virtio devices RX virtqueue. Buffers can
- * be received from the physical port or from another virtio device. A packet
- * count is returned to indicate the number of packets that are succesfully
- * added to the RX queue. This function works when the mbuf is scattered, but
- * it doesn't support the mergeable feature.
- */
-static inline uint32_t __attribute__((always_inline))
-virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
-	      struct rte_mbuf **pkts, uint32_t count)
-{
-	struct vhost_virtqueue *vq;
-	uint16_t avail_idx, free_entries, start_idx;
-	uint16_t desc_indexes[MAX_PKT_BURST];
-	uint16_t used_idx;
-	uint32_t i;
+	count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
 
-	LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
-	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->virt_qp_nb))) {
-		RTE_LOG(ERR, VHOST_DATA, "(%d) %s: invalid virtqueue idx %d.\n",
-			dev->vid, __func__, queue_id);
+	dev = get_device(vid);
+	if (unlikely(!dev))
 		return 0;
-	}
 
-	vq = dev->virtqueue[queue_id];
-	if (unlikely(vq->enabled == 0))
+	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->virt_qp_nb)))
 		return 0;
 
-	avail_idx = *((volatile uint16_t *)&vq->avail->idx);
-	start_idx = vq->last_used_idx;
-	free_entries = avail_idx - start_idx;
-	count = RTE_MIN(count, free_entries);
-	count = RTE_MIN(count, (uint32_t)MAX_PKT_BURST);
-	if (count == 0)
+	vq = dev->virtqueue[queue_id];
+	if (unlikely(!vq->enabled))
 		return 0;
 
-	LOG_DEBUG(VHOST_DATA, "(%d) start_idx %d | end_idx %d\n",
-		dev->vid, start_idx, start_idx + count);
+	if (dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF))
+		is_mrg_rxbuf = 1;
+
+	if (dev->features & (1ULL << VIRTIO_F_VERSION_1))
+		is_virtio_1 = 1;
+
+	pkt_idx = 0;
+	pkt_left = count;
+	used_idx_static = vq->last_used_idx & (vq->size - 1);
+	vq->shadow_used_idx = 0;
+
+	while (pkt_left > 0) {
+		if (unlikely(vq->avail->idx == vq->last_used_idx))
+			goto done;
+
+		if (pkt_left > 1 && vq->avail->idx != vq->last_used_idx + 1)
+			rte_prefetch0(&vq->desc[
+					vq->avail->ring[
+					(vq->last_used_idx + 1) &
+					(vq->size - 1)]]);
+
+		mbuf = pkts[pkt_idx];
+		mbuf_len = rte_pktmbuf_data_len(mbuf);
+		mbuf_len_left = mbuf_len;
+		pkt_idx++;
+		pkt_left--;
+
+		desc_chain_head = vq->avail->ring[(vq->last_used_idx) &
+			(vq->size - 1)];
+		desc_current = desc_chain_head;
+		desc = &vq->desc[desc_current];
+		desc_host_write_addr = gpa_to_vva(dev, desc->addr);
+		if (unlikely(!desc_host_write_addr))
+			goto done;
+
+		virtio_hdr = (struct virtio_net_hdr_mrg_rxbuf *)
+			(uintptr_t)desc_host_write_addr;
+		copy_virtio_hdr = 1;
+
+		vhost_log_write(dev, desc->addr, dev->vhost_hlen);
+		desc_write_offset = dev->vhost_hlen;
+		desc_chain_len = desc_write_offset;
+		desc_host_write_addr += desc_write_offset;
+
+		while (1) {
+			if (!mbuf_len_left) {
+				if (mbuf->next) {
+					mbuf = mbuf->next;
+					mbuf_len = rte_pktmbuf_data_len(mbuf);
+					mbuf_len_left = mbuf_len;
+				} else
+					break;
+			}
 
-	/* Retrieve all of the desc indexes first to avoid caching issues. */
-	rte_prefetch0(&vq->avail->ring[start_idx & (vq->size - 1)]);
-	for (i = 0; i < count; i++) {
-		used_idx = (start_idx + i) & (vq->size - 1);
-		desc_indexes[i] = vq->avail->ring[used_idx];
-		vq->used->ring[used_idx].id = desc_indexes[i];
-		vq->used->ring[used_idx].len = pkts[i]->pkt_len +
-					       dev->vhost_hlen;
-		vhost_log_used_vring(dev, vq,
-			offsetof(struct vring_used, ring[used_idx]),
-			sizeof(vq->used->ring[used_idx]));
-	}
+			if (desc->len <= desc_write_offset) {
+				if (desc->flags & VRING_DESC_F_NEXT) {
+					desc_write_offset = 0;
+					desc_current = desc->next;
+					desc = &vq->desc[desc_current];
+					desc_host_write_addr =
+						gpa_to_vva(dev, desc->addr);
+					if (unlikely(!desc_host_write_addr))
+						goto rollback;
+				} else if (is_mrg_rxbuf) {
+					vq->shadow_used_ring[
+						vq->shadow_used_idx].id =
+						desc_chain_head;
+					vq->shadow_used_ring[
+						vq->shadow_used_idx].len =
+						desc_chain_len;
+					vq->shadow_used_idx++;
+					vq->last_used_idx++;
+					virtio_hdr->num_buffers++;
+					if (unlikely(vq->avail->idx ==
+							vq->last_used_idx))
+						goto rollback;
+
+					desc_chain_head = vq->avail->ring[
+						(vq->last_used_idx) &
+						(vq->size - 1)];
+					desc_current = desc_chain_head;
+					desc = &vq->desc[desc_current];
+					desc_host_write_addr =
+						gpa_to_vva(dev, desc->addr);
+					if (unlikely(!desc_host_write_addr))
+						goto rollback;
+
+					desc_chain_len = 0;
+					desc_write_offset = 0;
+				} else
+					goto rollback;
+			}
 
-	rte_prefetch0(&vq->desc[desc_indexes[0]]);
-	for (i = 0; i < count; i++) {
-		uint16_t desc_idx = desc_indexes[i];
-		int err;
+			copy_len = RTE_MIN(desc->len - desc_write_offset,
+					mbuf_len_left);
+			if (copy_virtio_hdr) {
+				copy_virtio_hdr = 0;
+				memset((void *)(uintptr_t)&(virtio_hdr->hdr),
+						0, dev->vhost_hlen);
+				virtio_enqueue_offload(mbuf,
+						&(virtio_hdr->hdr));
+				if (is_mrg_rxbuf || is_virtio_1)
+					virtio_hdr->num_buffers = 1;
+			}
 
-		err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx);
-		if (unlikely(err)) {
-			used_idx = (start_idx + i) & (vq->size - 1);
-			vq->used->ring[used_idx].len = dev->vhost_hlen;
-			vhost_log_used_vring(dev, vq,
-				offsetof(struct vring_used, ring[used_idx]),
-				sizeof(vq->used->ring[used_idx]));
+			rte_memcpy((void *)(uintptr_t)desc_host_write_addr,
+					rte_pktmbuf_mtod_offset(mbuf, void *,
+						mbuf_len - mbuf_len_left),
+					copy_len);
+			vhost_log_write(dev, desc->addr + desc_write_offset,
+					copy_len);
+			mbuf_len_left -= copy_len;
+			desc_write_offset += copy_len;
+			desc_host_write_addr += copy_len;
+			desc_chain_len += copy_len;
 		}
 
-		if (i + 1 < count)
-			rte_prefetch0(&vq->desc[desc_indexes[i+1]]);
+		vq->shadow_used_ring[vq->shadow_used_idx].id = desc_chain_head;
+		vq->shadow_used_ring[vq->shadow_used_idx].len = desc_chain_len;
+		vq->shadow_used_idx++;
+		vq->last_used_idx++;
+		pkt_sent++;
 	}
 
-	rte_smp_wmb();
-
-	*(volatile uint16_t *)&vq->used->idx += count;
-	vq->last_used_idx += count;
-	vhost_log_used_vring(dev, vq,
-		offsetof(struct vring_used, idx),
-		sizeof(vq->used->idx));
-
-	/* flush used->idx update before we read avail->flags. */
-	rte_mb();
-
-	/* Kick the guest if necessary. */
-	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-			&& (vq->callfd >= 0))
-		eventfd_write(vq->callfd, (eventfd_t)1);
-	return count;
-}
-
-static inline int
-fill_vec_buf(struct vhost_virtqueue *vq, uint32_t avail_idx,
-	     uint32_t *allocated, uint32_t *vec_idx,
-	     struct buf_vector *buf_vec)
-{
-	uint16_t idx = vq->avail->ring[avail_idx & (vq->size - 1)];
-	uint32_t vec_id = *vec_idx;
-	uint32_t len    = *allocated;
-
-	while (1) {
-		if (unlikely(vec_id >= BUF_VECTOR_MAX || idx >= vq->size))
-			return -1;
-
-		len += vq->desc[idx].len;
-		buf_vec[vec_id].buf_addr = vq->desc[idx].addr;
-		buf_vec[vec_id].buf_len  = vq->desc[idx].len;
-		buf_vec[vec_id].desc_idx = idx;
-		vec_id++;
-
-		if ((vq->desc[idx].flags & VRING_DESC_F_NEXT) == 0)
-			break;
-
-		idx = vq->desc[idx].next;
-	}
-
-	*allocated = len;
-	*vec_idx   = vec_id;
-
-	return 0;
-}
-
-/*
- * Returns -1 on fail, 0 on success
- */
-static inline int
-reserve_avail_buf_mergeable(struct vhost_virtqueue *vq, uint32_t size,
-			    uint16_t *end, struct buf_vector *buf_vec)
-{
-	uint16_t cur_idx;
-	uint16_t avail_idx;
-	uint32_t allocated = 0;
-	uint32_t vec_idx = 0;
-	uint16_t tries = 0;
-
-	cur_idx  = vq->last_used_idx;
-
-	while (1) {
-		avail_idx = *((volatile uint16_t *)&vq->avail->idx);
-		if (unlikely(cur_idx == avail_idx))
-			return -1;
-
-		if (unlikely(fill_vec_buf(vq, cur_idx, &allocated,
-					  &vec_idx, buf_vec) < 0))
-			return -1;
-
-		cur_idx++;
-		tries++;
-
-		if (allocated >= size)
-			break;
-
-		/*
-		 * if we tried all available ring items, and still
-		 * can't get enough buf, it means something abnormal
-		 * happened.
-		 */
-		if (unlikely(tries >= vq->size))
-			return -1;
-	}
-
-	*end = cur_idx;
-	return 0;
-}
-
-static inline uint32_t __attribute__((always_inline))
-copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
-			    uint16_t end_idx, struct rte_mbuf *m,
-			    struct buf_vector *buf_vec)
-{
-	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
-	uint32_t vec_idx = 0;
-	uint16_t start_idx = vq->last_used_idx;
-	uint16_t cur_idx = start_idx;
-	uint64_t desc_addr;
-	uint32_t mbuf_offset, mbuf_avail;
-	uint32_t desc_offset, desc_avail;
-	uint32_t cpy_len;
-	uint16_t desc_idx, used_idx;
-
-	if (unlikely(m == NULL))
-		return 0;
-
-	LOG_DEBUG(VHOST_DATA, "(%d) current index %d | end index %d\n",
-		dev->vid, cur_idx, end_idx);
-
-	desc_addr = gpa_to_vva(dev, buf_vec[vec_idx].buf_addr);
-	if (buf_vec[vec_idx].buf_len < dev->vhost_hlen || !desc_addr)
-		return 0;
-
-	rte_prefetch0((void *)(uintptr_t)desc_addr);
-
-	virtio_hdr.num_buffers = end_idx - start_idx;
-	LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n",
-		dev->vid, virtio_hdr.num_buffers);
-
-	virtio_enqueue_offload(m, &virtio_hdr.hdr);
-	copy_virtio_net_hdr(dev, desc_addr, virtio_hdr);
-	vhost_log_write(dev, buf_vec[vec_idx].buf_addr, dev->vhost_hlen);
-	PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
-
-	desc_avail  = buf_vec[vec_idx].buf_len - dev->vhost_hlen;
-	desc_offset = dev->vhost_hlen;
-
-	mbuf_avail  = rte_pktmbuf_data_len(m);
-	mbuf_offset = 0;
-	while (mbuf_avail != 0 || m->next != NULL) {
-		/* done with current desc buf, get the next one */
-		if (desc_avail == 0) {
-			desc_idx = buf_vec[vec_idx].desc_idx;
-
-			if (!(vq->desc[desc_idx].flags & VRING_DESC_F_NEXT)) {
-				/* Update used ring with desc information */
-				used_idx = cur_idx++ & (vq->size - 1);
-				vq->used->ring[used_idx].id  = desc_idx;
-				vq->used->ring[used_idx].len = desc_offset;
-				vhost_log_used_vring(dev, vq,
+done:
+	if (likely(vq->shadow_used_idx > 0)) {
+		if (used_idx_static + vq->shadow_used_idx < vq->size) {
+			rte_memcpy(&vq->used->ring[used_idx_static],
+					&vq->shadow_used_ring[0],
+					vq->shadow_used_idx *
+					sizeof(struct vring_used_elem));
+			vhost_log_used_vring(dev, vq,
 					offsetof(struct vring_used,
-						 ring[used_idx]),
-					sizeof(vq->used->ring[used_idx]));
-			}
-
-			vec_idx++;
-			desc_addr = gpa_to_vva(dev, buf_vec[vec_idx].buf_addr);
-			if (unlikely(!desc_addr))
-				return 0;
-
-			/* Prefetch buffer address. */
-			rte_prefetch0((void *)(uintptr_t)desc_addr);
-			desc_offset = 0;
-			desc_avail  = buf_vec[vec_idx].buf_len;
-		}
-
-		/* done with current mbuf, get the next one */
-		if (mbuf_avail == 0) {
-			m = m->next;
+						ring[used_idx_static]),
+					vq->shadow_used_idx *
+					sizeof(struct vring_used_elem));
+		} else {
+			uint32_t part_1 = vq->size - used_idx_static;
+			uint32_t part_2 = vq->shadow_used_idx - part_1;
 
-			mbuf_offset = 0;
-			mbuf_avail  = rte_pktmbuf_data_len(m);
+			rte_memcpy(&vq->used->ring[used_idx_static],
+					&vq->shadow_used_ring[0],
+					part_1 *
+					sizeof(struct vring_used_elem));
+			vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used,
+						ring[used_idx_static]),
+					part_1 *
+					sizeof(struct vring_used_elem));
+			rte_memcpy(&vq->used->ring[0],
+					&vq->shadow_used_ring[part_1],
+					part_2 *
+					sizeof(struct vring_used_elem));
+			vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used,
+						ring[0]),
+					part_2 *
+					sizeof(struct vring_used_elem));
 		}
-
-		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
-		rte_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)),
-			rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
-			cpy_len);
-		vhost_log_write(dev, buf_vec[vec_idx].buf_addr + desc_offset,
-			cpy_len);
-		PRINT_PACKET(dev, (uintptr_t)(desc_addr + desc_offset),
-			cpy_len, 0);
-
-		mbuf_avail  -= cpy_len;
-		mbuf_offset += cpy_len;
-		desc_avail  -= cpy_len;
-		desc_offset += cpy_len;
 	}
 
-	used_idx = cur_idx & (vq->size - 1);
-	vq->used->ring[used_idx].id = buf_vec[vec_idx].desc_idx;
-	vq->used->ring[used_idx].len = desc_offset;
+	rte_smp_wmb();
+	vq->used->idx = vq->last_used_idx;
 	vhost_log_used_vring(dev, vq,
-		offsetof(struct vring_used, ring[used_idx]),
-		sizeof(vq->used->ring[used_idx]));
-
-	return end_idx - start_idx;
-}
-
-static inline uint32_t __attribute__((always_inline))
-virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,
-	struct rte_mbuf **pkts, uint32_t count)
-{
-	struct vhost_virtqueue *vq;
-	uint32_t pkt_idx = 0, nr_used = 0;
-	uint16_t end;
-	struct buf_vector buf_vec[BUF_VECTOR_MAX];
-
-	LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
-	if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->virt_qp_nb))) {
-		RTE_LOG(ERR, VHOST_DATA, "(%d) %s: invalid virtqueue idx %d.\n",
-			dev->vid, __func__, queue_id);
-		return 0;
-	}
-
-	vq = dev->virtqueue[queue_id];
-	if (unlikely(vq->enabled == 0))
-		return 0;
-
-	count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
-	if (count == 0)
-		return 0;
-
-	for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
-		uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
-
-		if (unlikely(reserve_avail_buf_mergeable(vq, pkt_len,
-							 &end, buf_vec) < 0)) {
-			LOG_DEBUG(VHOST_DATA,
-				"(%d) failed to get enough desc from vring\n",
-				dev->vid);
-			break;
-		}
-
-		nr_used = copy_mbuf_to_desc_mergeable(dev, vq, end,
-						      pkts[pkt_idx], buf_vec);
-		rte_smp_wmb();
-
-		*(volatile uint16_t *)&vq->used->idx += nr_used;
-		vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
+			offsetof(struct vring_used, idx),
 			sizeof(vq->used->idx));
-		vq->last_used_idx += nr_used;
-	}
-
-	if (likely(pkt_idx)) {
-		/* flush used->idx update before we read avail->flags. */
-		rte_mb();
-
-		/* Kick the guest if necessary. */
+	rte_mb();
+	if (likely(pkt_sent)) {
 		if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
 				&& (vq->callfd >= 0))
 			eventfd_write(vq->callfd, (eventfd_t)1);
 	}
 
-	return pkt_idx;
-}
-
-uint16_t
-rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
-	struct rte_mbuf **pkts, uint16_t count)
-{
-	struct virtio_net *dev = get_device(vid);
+	return pkt_sent;
 
-	if (!dev)
-		return 0;
+rollback:
+	if (is_mrg_rxbuf || is_virtio_1)
+		vq->last_used_idx -= virtio_hdr->num_buffers - 1;
 
-	if (dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF))
-		return virtio_dev_merge_rx(dev, queue_id, pkts, count);
-	else
-		return virtio_dev_rx(dev, queue_id, pkts, count);
+	goto done;
 }
 
 static void
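
The two-branch flush under the `done:` label above is the heart of the shadow used-ring technique: used-ring entries are first accumulated in a private, contiguous shadow array, then published to the guest-visible ring in at most two `rte_memcpy()` calls, the second one only when the batch wraps past the end of the ring. A minimal standalone sketch of that copy logic, using simplified stand-in structures (`used_elem`, `RING_SIZE`) rather than the real `struct vring_used_elem` and `vq->size`:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct vring_used_elem. */
struct used_elem {
	uint16_t id;
	uint32_t len;
};

#define RING_SIZE 8	/* ring size is a power of two, as in virtio */

/*
 * Flush a privately filled shadow array into the guest-visible used
 * ring starting at used_idx. The shadow entries are contiguous but the
 * destination ring wraps, so at most two copies are needed.
 */
static void
flush_shadow_used_ring(struct used_elem *used_ring, uint16_t used_idx,
		       const struct used_elem *shadow, uint16_t shadow_n)
{
	if (used_idx + shadow_n <= RING_SIZE) {
		/* No wrap: one contiguous copy. */
		memcpy(&used_ring[used_idx], shadow,
		       shadow_n * sizeof(*shadow));
	} else {
		/* Wrap-around: fill the tail of the ring, then the head. */
		uint16_t part_1 = RING_SIZE - used_idx;
		uint16_t part_2 = shadow_n - part_1;

		memcpy(&used_ring[used_idx], shadow,
		       part_1 * sizeof(*shadow));
		memcpy(&used_ring[0], &shadow[part_1],
		       part_2 * sizeof(*shadow));
	}
}
```

The real code additionally calls `vhost_log_used_vring()` after each copy so that live migration dirty logging covers both segments, and only then publishes `vq->used->idx` behind an `rte_smp_wmb()` barrier.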
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 1785695..87d09fa 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -152,10 +152,14 @@  cleanup_device(struct virtio_net *dev, int destroy)
 static void
 free_device(struct virtio_net *dev)
 {
+	struct vhost_virtqueue *vq;
 	uint32_t i;
 
-	for (i = 0; i < dev->virt_qp_nb; i++)
-		rte_free(dev->virtqueue[i * VIRTIO_QNUM]);
+	for (i = 0; i < dev->virt_qp_nb; i++) {
+		vq = dev->virtqueue[i * VIRTIO_QNUM];
+		rte_free(vq->shadow_used_ring);
+		rte_free(vq);
+	}
 
 	rte_free(dev);
 }
@@ -418,13 +422,18 @@  int
 vhost_set_vring_num(int vid, struct vhost_vring_state *state)
 {
 	struct virtio_net *dev;
+	struct vhost_virtqueue *vq;
 
 	dev = get_device(vid);
 	if (dev == NULL)
 		return -1;
 
 	/* State->index refers to the queue index. The txq is 1, rxq is 0. */
-	dev->virtqueue[state->index]->size = state->num;
+	vq = dev->virtqueue[state->index];
+	vq->size = state->num;
+	vq->shadow_used_ring = rte_malloc("",
+			vq->size * sizeof(struct vring_used_elem),
+			RTE_CACHE_LINE_SIZE);
 
 	return 0;
 }
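
The virtio-net.c hunks tie the shadow ring's lifetime to the virtqueue: it is sized in `vhost_set_vring_num()` (one `vring_used_elem` per ring slot, since a batch can produce at most `vq->size` used entries before it must flush) and released per queue in `free_device()`. A minimal sketch of that pairing, using plain `malloc()`/`free()` in place of `rte_malloc()`/`rte_free()` and a simplified queue structure; note the sketch also checks the allocation result, which the patch as posted does not:

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for struct vring_used_elem / vhost_virtqueue. */
struct used_elem {
	uint16_t id;
	uint32_t len;
};

struct virtqueue {
	uint16_t size;
	struct used_elem *shadow_used_ring;
};

/* Size the shadow ring to match the vring; returns 0 on success,
 * -1 on allocation failure (an error path this sketch adds). */
static int
vq_set_size(struct virtqueue *vq, uint16_t num)
{
	vq->size = num;
	vq->shadow_used_ring = malloc(num * sizeof(*vq->shadow_used_ring));
	return vq->shadow_used_ring == NULL ? -1 : 0;
}

/* Mirror of the free_device() change: release the shadow ring
 * before the queue structure itself goes away. */
static void
vq_free(struct virtqueue *vq)
{
	free(vq->shadow_used_ring);
	vq->shadow_used_ring = NULL;
}
```

Allocating once per queue, rather than keeping the scratch array on the enqueue stack, is what lets the data path reuse a cache-warm buffer across bursts.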