Message ID | 1471319402-112998-1-git-send-email-zhihong.wang@intel.com (mailing list archive) |
---|---|
State | Superseded, archived |
Commit Message
Zhihong Wang
Aug. 16, 2016, 3:50 a.m. UTC
This patch optimizes the vhost enqueue function: rte_vhost_enqueue_burst.
Currently there are two callbacks for vhost enqueue:
* virtio_dev_merge_rx for cases with mrg_rxbuf turned on.
* virtio_dev_rx for cases with mrg_rxbuf turned off.
virtio_dev_merge_rx does not provide optimal performance, and it is also
reported to have a compatibility issue when working with Windows VMs.
Besides, having two separate functions increases the maintenance effort.
This patch replaces the current two with a single function logic for
better maintainability, and provides better performance by optimizing
caching behavior, especially for cases with mrg_rxbuf turned on.
It also fixes the issue when working with Windows VMs.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
lib/librte_vhost/vhost-net.h | 6 +-
lib/librte_vhost/vhost_rxtx.c | 582 ++++++++++++++----------------------------
lib/librte_vhost/virtio-net.c | 15 +-
3 files changed, 208 insertions(+), 395 deletions(-)
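As a rough illustration of why the two code paths can be united (the helper below is hypothetical, not taken from the patch): with mrg_rxbuf turned off a packet must fit in a single guest buffer, which is simply the one-buffer special case of the mergeable logic, so one function can cover both.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): number of guest buffers a
 * packet needs. With mrg_rxbuf on, a packet may span several buffers;
 * with it off, it must fit in exactly one, so the non-mergeable case
 * is just the one-buffer special case of the same logic.
 */
static inline uint32_t
bufs_needed(uint32_t pkt_len, uint32_t buf_len, int mrg_rxbuf)
{
	if (buf_len == 0)
		return 0;
	if (!mrg_rxbuf)
		return pkt_len <= buf_len ? 1 : 0;  /* 0: cannot enqueue */
	return (pkt_len + buf_len - 1) / buf_len;   /* ceiling division */
}
```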
Comments
Hi Zhihong, On 08/16/2016 05:50 AM, Zhihong Wang wrote: > This patch optimizes the vhost enqueue function: rte_vhost_enqueue_burst. > > Currently there're 2 callbacks for vhost enqueue: > * virtio_dev_merge_rx for mrg_rxbuf turned on cases. > * virtio_dev_rx for mrg_rxbuf turned off cases. > > The virtio_dev_merge_rx doesn't provide optimal performance, also it is > reported having compatibility issue working with Windows VMs. Could you please tell us more about this compatibility issue? > > Besides, having 2 separated functions increases maintenance efforts. > > This patch uses a single function logic to replace the current 2 for > better maintainability, and provides better performance by optimizing > caching behavior especially for mrg_rxbuf turned on cases. Do you have some benchmark comparison before and after your change? Also, for maintainability, I would suggest that the enqueue function be split. Because vhost_enqueue_burst becomes very long (220 LoC), and max level of indentation is too high (6). It makes the code hard to understand, and prone to miss bugs during review and maintenance. > > It also fixes the issue working with Windows VMs. Ideally, the fix should be sent separately, before the rework. Indeed, we might want to have the fix in the stable branch, without picking the optimization. > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> > --- > lib/librte_vhost/vhost-net.h | 6 +- > lib/librte_vhost/vhost_rxtx.c | 582 ++++++++++++++---------------------------- > lib/librte_vhost/virtio-net.c | 15 +- > 3 files changed, 208 insertions(+), 395 deletions(-) 582 lines changed is a huge patch. If possible, it would be better splitting it in incremental changes, making the review process easier. Also, for v2, please prefix the commit title with "vhost:". Thanks for your contribution, I'm looking forward to the v2. - Maxime
> -----Original Message----- > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] > Sent: Tuesday, August 16, 2016 10:00 PM > To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue > > Hi Zhihong, > > On 08/16/2016 05:50 AM, Zhihong Wang wrote: > > This patch optimizes the vhost enqueue function: rte_vhost_enqueue_burst. > > > > Currently there're 2 callbacks for vhost enqueue: > > * virtio_dev_merge_rx for mrg_rxbuf turned on cases. > > * virtio_dev_rx for mrg_rxbuf turned off cases. > > > > The virtio_dev_merge_rx doesn't provide optimal performance, also it is > > reported having compatibility issue working with Windows VMs. > Could you tell us more please about this compatibility issue? For example, when you have testpmd in the host and a Windows VM as the guest, with mrg_rxbuf turned on, the guest will hang once there are packets enqueued by virtio_dev_merge_rx. Let me know if you see the same issue. > > > > > Besides, having 2 separated functions increases maintenance efforts. > > > > This patch uses a single function logic to replace the current 2 for > > better maintainability, and provides better performance by optimizing > > caching behavior especially for mrg_rxbuf turned on cases. > Do you have some benchmark comparison before and after your change? > > Also, for maintainability, I would suggest the that the enqueue > function be split. Because vhost_enqueue_burst becomes very long (220 > LoC), and max level of indentation is too high (6). > > It makes the code hard to understand, and prone to miss bugs during > review and maintenance. This is something I've thought about while writing the code, the reason I keep it as one function body is that: 1. This function is very performance sensitive, and we need full control of code ordering (You can compare with the current performance with the mrg_rxbuf feature turned on to see the difference).
2. I somehow find that a single function logic makes it easier to understand; surely I can add comments to make it easier to read. Please let me know if you still insist, we can discuss more on it. > > > > > It also fixes the issue working with Windows VMs. > Ideally, the fix should be sent separately, before the rework. > Indeed, we might want to have the fix in the stable branch, without > picking the optimization. > > > > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> > > --- > > lib/librte_vhost/vhost-net.h | 6 +- > > lib/librte_vhost/vhost_rxtx.c | 582 ++++++++++++++---------------------------- > > lib/librte_vhost/virtio-net.c | 15 +- > > 3 files changed, 208 insertions(+), 395 deletions(-) > 582 lines changed is a huge patch. > If possible, it would be better splitting it in incremental changes, > making the review process easier. It looks like a huge patch, but it simply deletes the current implementation and adds the new code. I think perhaps split it into 2, 1st one to replace just the rte_vhost_enqueue_burst, 2nd one to delete all the obsolete functions. It should make the patch clear, what do you think? :) > > Also, for v2, please prefix the commit title with "vhost:". Thanks for the hint! Will do. > > Thanks for your contribution, I'm looking forward for the v2. > - Maxime
On Wed, Aug 17, 2016 at 01:45:26AM +0000, Wang, Zhihong wrote: > > > > -----Original Message----- > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] > > Sent: Tuesday, August 16, 2016 10:00 PM > > To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org > > Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue > > > > Hi Zhihong, > > > > On 08/16/2016 05:50 AM, Zhihong Wang wrote: > > > This patch optimizes the vhost enqueue function: rte_vhost_enqueue_burst. > > > > > > Currently there're 2 callbacks for vhost enqueue: > > > * virtio_dev_merge_rx for mrg_rxbuf turned on cases. > > > * virtio_dev_rx for mrg_rxbuf turned off cases. > > > > > > The virtio_dev_merge_rx doesn't provide optimal performance, also it is > > > reported having compatibility issue working with Windows VMs. > > Could you tell us more please about this compatibility issue? > > > For example, when you have testpmd in the host and Window VM as the guest, > with mrg_rxbuf turned on, the guest will hang once there's packets enqueued > by virtio_dev_merge_rx. You should put it into commit log. > Let me know if you see the same issue. > > > > > > > > > > Besides, having 2 separated functions increases maintenance efforts. > > > > > > This patch uses a single function logic to replace the current 2 for > > > better maintainability, and provides better performance by optimizing > > > caching behavior especially for mrg_rxbuf turned on cases. Here, here sounds two parts to me: - one to unite mergeable and non-mergeable Rx - another one to optimize the mergeable path That means you should do it in two patches, with that we can have clear understanding what changes the performance boost. It also helps review. > > Do you have some benchmark comparison before and after your change? > > > > Also, for maintainability, I would suggest the that the enqueue > > function be split. Because vhost_enqueue_burst becomes very long (220 > > LoC), and max level of indentation is too high (6). 
> > > > It makes the code hard to understand, and prone to miss bugs during > > review and maintenance. Agreed. > > This is something I've thought about while writing the code, the reason I > keep it as one function body is that: > > 1. This function is very performance sensitive, and we need full control of > code ordering (You can compare with the current performance with the > mrg_rxbuf feature turned on to see the difference). Will inline functions help? > 2. I somehow find that a single function logic makes it easier to understand, > surely I can add comments to make it easiler to read for . > > Please let me know if you still insist, we can discuss more on it. I am personally not a fan of huge function; I would try hard to avoid too many levels of indentation as well. > > > > > > > > > It also fixes the issue working with Windows VMs. > > Ideally, the fix should be sent separately, before the rework. > > Indeed, we might want to have the fix in the stable branch, without > > picking the optimization. Agreed. > > > > > > > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> > > > --- > > > lib/librte_vhost/vhost-net.h | 6 +- > > > lib/librte_vhost/vhost_rxtx.c | 582 ++++++++++++++---------------------------- > > > lib/librte_vhost/virtio-net.c | 15 +- > > > 3 files changed, 208 insertions(+), 395 deletions(-) > > 582 lines changed is a huge patch. > > If possible, it would be better splitting it in incremental changes, > > making the review process easier. > > > It looks like a huge patch, but it simply deletes the current implementation > and add the new code. I think perhaps split it into 2, 1st one to replace > just the rte_vhost_enqueue_burst, 2nd one to delete all the obsolete functions. > It should make the patch clear, how do you think? :) Nope, it's not working in that way. It should be: - one patch to fix the hang issue for windows guest Please cc it to stable@dpdk.org as well so that we could pick it for v16.07 stable release. 
- one patch to unite the two different Rx code path - another patch to optimize mergeable code path --yliu
> -----Original Message----- > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com] > Sent: Wednesday, August 17, 2016 10:38 AM > To: Wang, Zhihong <zhihong.wang@intel.com> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue > > On Wed, Aug 17, 2016 at 01:45:26AM +0000, Wang, Zhihong wrote: > > > > > > > -----Original Message----- > > > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] > > > Sent: Tuesday, August 16, 2016 10:00 PM > > > To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org > > > Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue > > > > > > Hi Zhihong, > > > > > > On 08/16/2016 05:50 AM, Zhihong Wang wrote: > > > > This patch optimizes the vhost enqueue function: > rte_vhost_enqueue_burst. > > > > > > > > Currently there're 2 callbacks for vhost enqueue: > > > > * virtio_dev_merge_rx for mrg_rxbuf turned on cases. > > > > * virtio_dev_rx for mrg_rxbuf turned off cases. > > > > > > > > The virtio_dev_merge_rx doesn't provide optimal performance, also it is > > > > reported having compatibility issue working with Windows VMs. > > > Could you tell us more please about this compatibility issue? > > > > > > For example, when you have testpmd in the host and Window VM as the > guest, > > with mrg_rxbuf turned on, the guest will hang once there's packets enqueued > > by virtio_dev_merge_rx. > > You should put it into commit log. Okay. > > > Let me know if you see the same issue. > > > > > > > > > > > > > > > Besides, having 2 separated functions increases maintenance efforts. > > > > > > > > This patch uses a single function logic to replace the current 2 for > > > > better maintainability, and provides better performance by optimizing > > > > caching behavior especially for mrg_rxbuf turned on cases. 
> > Here, here sounds two parts to me: > > - one to unite mergeable and non-mergeable Rx > > - another one to optimize the mergeable path > > That means you should do it in two patches, with that we can have clear > understanding what changes the performance boost. It also helps review. Please see explanation below. > > > > Do you have some benchmark comparison before and after your change? > > > > > > Also, for maintainability, I would suggest the that the enqueue > > > function be split. Because vhost_enqueue_burst becomes very long (220 > > > LoC), and max level of indentation is too high (6). > > > > > > It makes the code hard to understand, and prone to miss bugs during > > > review and maintenance. > > Agreed. > > > > > This is something I've thought about while writing the code, the reason I > > keep it as one function body is that: > > > > 1. This function is very performance sensitive, and we need full control of > > code ordering (You can compare with the current performance with the > > mrg_rxbuf feature turned on to see the difference). > > Will inline functions help? Optimization in this patch actually reorganizes the code from its logic, so it's not suitable for making separated functions. I'll explain this in v2. > > > 2. I somehow find that a single function logic makes it easier to understand, > > surely I can add comments to make it easiler to read for . > > > > Please let me know if you still insist, we can discuss more on it. > > I am personally not a fan of huge function; I would try hard to avoid > too many levels of indentation as well. > > > > > > > > > > > > > > It also fixes the issue working with Windows VMs. > > > Ideally, the fix should be sent separately, before the rework. > > > Indeed, we might want to have the fix in the stable branch, without > > > picking the optimization. > > Agreed. 
The fact is that I don't have much time to debug with the current code since it's messy and I don't have Windows virtio code and the debugging environment. This patch doesn't try to fix this issue, it rewrites the logic totally, and somehow fixes this issue. Do you think integrating this whole patch into the stable branch will work? Personally I think it makes more sense. > > > > > > > > > > > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> > > > > --- > > > > lib/librte_vhost/vhost-net.h | 6 +- > > > > lib/librte_vhost/vhost_rxtx.c | 582 > ++++++++++++++---------------------------- > > > > lib/librte_vhost/virtio-net.c | 15 +- > > > > 3 files changed, 208 insertions(+), 395 deletions(-) > > > 582 lines changed is a huge patch. > > > If possible, it would be better splitting it in incremental changes, > > > making the review process easier. > > > > > > It looks like a huge patch, but it simply deletes the current implementation > > and add the new code. I think perhaps split it into 2, 1st one to replace > > just the rte_vhost_enqueue_burst, 2nd one to delete all the obsolete > functions. > > It should make the patch clear, how do you think? :) > > Nope, it's not working in that way. It should be: > > - one patch to fix the hang issue for windows guest > > Please cc it to stable@dpdk.org as well so that we could pick it for > v16.07 stable release. > > - one patch to unite the two different Rx code path > > - another patch to optimize mergeable code path I can separate optimization from the basic code in v2, however as I explained this patch is built from scratch and doesn't take anything from the existing code, so there's no way to transform from the existing code incrementally into the new code. > > --yliu
On 08/17/2016 08:41 AM, Wang, Zhihong wrote: > > >> -----Original Message----- >> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com] >> Sent: Wednesday, August 17, 2016 10:38 AM >> To: Wang, Zhihong <zhihong.wang@intel.com> >> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org >> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue >> >> On Wed, Aug 17, 2016 at 01:45:26AM +0000, Wang, Zhihong wrote: >>> >>> >>>> -----Original Message----- >>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] >>>> Sent: Tuesday, August 16, 2016 10:00 PM >>>> To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org >>>> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue >>>> >>>> Hi Zhihong, >>>> >>>> On 08/16/2016 05:50 AM, Zhihong Wang wrote: >>>>> This patch optimizes the vhost enqueue function: >> rte_vhost_enqueue_burst. >>>>> >>>>> Currently there're 2 callbacks for vhost enqueue: >>>>> * virtio_dev_merge_rx for mrg_rxbuf turned on cases. >>>>> * virtio_dev_rx for mrg_rxbuf turned off cases. >>>>> >>>>> The virtio_dev_merge_rx doesn't provide optimal performance, also it is >>>>> reported having compatibility issue working with Windows VMs. >>>> Could you tell us more please about this compatibility issue? >>> >>> >>> For example, when you have testpmd in the host and Window VM as the >> guest, >>> with mrg_rxbuf turned on, the guest will hang once there's packets enqueued >>> by virtio_dev_merge_rx. >> >> You should put it into commit log. > > > Okay. > > >> >>> Let me know if you see the same issue. >>> >>> >>>> >>>>> >>>>> Besides, having 2 separated functions increases maintenance efforts. >>>>> >>>>> This patch uses a single function logic to replace the current 2 for >>>>> better maintainability, and provides better performance by optimizing >>>>> caching behavior especially for mrg_rxbuf turned on cases. 
>> >> Here, here sounds two parts to me: >> >> - one to unite mergeable and non-mergeable Rx >> >> - another one to optimize the mergeable path >> >> That means you should do it in two patches, with that we can have clear >> understanding what changes the performance boost. It also helps review. > > > Please see explanation below. > > >> >>>> Do you have some benchmark comparison before and after your change? >>>> >>>> Also, for maintainability, I would suggest the that the enqueue >>>> function be split. Because vhost_enqueue_burst becomes very long (220 >>>> LoC), and max level of indentation is too high (6). >>>> >>>> It makes the code hard to understand, and prone to miss bugs during >>>> review and maintenance. >> >> Agreed. >> >>> >>> This is something I've thought about while writing the code, the reason I >>> keep it as one function body is that: >>> >>> 1. This function is very performance sensitive, and we need full control of >>> code ordering (You can compare with the current performance with the >>> mrg_rxbuf feature turned on to see the difference). >> >> Will inline functions help? > > > Optimization in this patch actually reorganizes the code from its logic, > so it's not suitable for making separated functions. > > I'll explain this in v2. I agree with Yuanhan. Inline functions should not break the optimizations. IMHO, this is mandatory for the patch to be accepted. > > >> >>> 2. I somehow find that a single function logic makes it easier to understand, >>> surely I can add comments to make it easiler to read for . >>> >>> Please let me know if you still insist, we can discuss more on it. >> >> I am personally not a fan of huge function; I would try hard to avoid >> too many levels of indentation as well. >> >>> >>>> >>>>> >>>>> It also fixes the issue working with Windows VMs. >>>> Ideally, the fix should be sent separately, before the rework. >>>> Indeed, we might want to have the fix in the stable branch, without >>>> picking the optimization. 
>> >> Agreed. > > > The fact is that I don't have much time to debug with the current code > since it's messy and I don't have Windows virtio code and the debugging > environment. It seems you are not the only one facing the issue: https://github.com/YanVugenfirer/kvm-guest-drivers-windows/issues/70 So a dedicated fix is really important. > This patch doesn't try to fix this issue, it rewrites the logic totally, > and somehow fixes this issue. > > Do you think integrating this whole patch into the stable branch will work? > Personally I think it makes more sense. No. We don't even know why/how it fixes the Windows issue, which would be the first thing to understand before integrating a fix in stable branch. And the stable branch is not meant for integrating such big reworks, it is only meant to fix bugs. The risk of regressions have to be avoided as much as possible. > > >> >>>> >>>>> >>>>> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> >>>>> --- >>>>> lib/librte_vhost/vhost-net.h | 6 +- >>>>> lib/librte_vhost/vhost_rxtx.c | 582 >> ++++++++++++++---------------------------- >>>>> lib/librte_vhost/virtio-net.c | 15 +- >>>>> 3 files changed, 208 insertions(+), 395 deletions(-) >>>> 582 lines changed is a huge patch. >>>> If possible, it would be better splitting it in incremental changes, >>>> making the review process easier. >>> >>> >>> It looks like a huge patch, but it simply deletes the current implementation >>> and add the new code. I think perhaps split it into 2, 1st one to replace >>> just the rte_vhost_enqueue_burst, 2nd one to delete all the obsolete >> functions. >>> It should make the patch clear, how do you think? :) >> >> Nope, it's not working in that way. It should be: >> >> - one patch to fix the hang issue for windows guest >> >> Please cc it to stable@dpdk.org as well so that we could pick it for >> v16.07 stable release. 
>> >> - one patch to unite the two different Rx code path >> >> - another patch to optimize mergeable code path > > > I can separate optimization from the basic code in v2, however as I explained > this patch is built from scratch and doesn't take anything from the existing > code, so there's no way to transform from the existing code incrementally into > the new code. > > >> >> --yliu
On Wed, Aug 17, 2016 at 11:17:46AM +0200, Maxime Coquelin wrote: > >>>This is something I've thought about while writing the code, the reason I > >>>keep it as one function body is that: > >>> > >>> 1. This function is very performance sensitive, and we need full control of > >>> code ordering (You can compare with the current performance with the > >>> mrg_rxbuf feature turned on to see the difference). > >> > >>Will inline functions help? > > > > > >Optimization in this patch actually reorganizes the code from its logic, > >so it's not suitable for making separated functions. > > > >I'll explain this in v2. > > I agree with Yuanhan. > Inline functions should not break the optimizations. > IMHO, this is mandatory for the patch to be accepted. Yes. > It seems you are not the only one facing the issue: > https://github.com/YanVugenfirer/kvm-guest-drivers-windows/issues/70 > > So a dedicated fix is really important. Yes. > > >This patch doesn't try to fix this issue, it rewrites the logic totally, > >and somehow fixes this issue. > > > >Do you think integrating this whole patch into the stable branch will work? > >Personally I think it makes more sense. > > No. > We don't even know why/how it fixes the Windows issue, which would be > the first thing to understand before integrating a fix in stable branch. Yes. > > And the stable branch is not meant for integrating such big reworks, > it is only meant to fix bugs. Yes. > The risk of regressions have to be avoided as much as possible. Yes. --yliu
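The reviewers' point that inline helpers need not break the optimization can be sketched in a few lines. The names below (copy_step, enqueue_burst_total) are made up for illustration, not from the patch: with GCC's always_inline attribute, a helper split out for readability is still merged back into the caller's compiled body, so control over the hot path is not necessarily lost.

```c
#include <stdint.h>

/* Hypothetical illustration: the helper is forced inline, so splitting
 * the hot loop body for readability produces the same compiled code as
 * one big function would. */
static inline __attribute__((always_inline)) uint32_t
copy_step(uint32_t acc, uint32_t len)
{
	return acc + len;  /* stand-in for one per-descriptor copy step */
}

static uint32_t
enqueue_burst_total(const uint32_t *lens, uint32_t count)
{
	uint32_t acc = 0;

	for (uint32_t i = 0; i < count; i++)
		acc = copy_step(acc, lens[i]);  /* inlined: no call overhead */
	return acc;
}
```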
> -----Original Message----- > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] > Sent: Wednesday, August 17, 2016 5:18 PM > To: Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu > <yuanhan.liu@linux.intel.com> > Cc: dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue > > > > On 08/17/2016 08:41 AM, Wang, Zhihong wrote: > > > > > >> -----Original Message----- > >> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com] > >> Sent: Wednesday, August 17, 2016 10:38 AM > >> To: Wang, Zhihong <zhihong.wang@intel.com> > >> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org > >> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue > >> > >> On Wed, Aug 17, 2016 at 01:45:26AM +0000, Wang, Zhihong wrote: > >>> > >>> > >>>> -----Original Message----- > >>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] > >>>> Sent: Tuesday, August 16, 2016 10:00 PM > >>>> To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org > >>>> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue > >>>> > >>>> Hi Zhihong, > >>>> > >>>> On 08/16/2016 05:50 AM, Zhihong Wang wrote: > >>>>> This patch optimizes the vhost enqueue function: > >> rte_vhost_enqueue_burst. > >>>>> > >>>>> Currently there're 2 callbacks for vhost enqueue: > >>>>> * virtio_dev_merge_rx for mrg_rxbuf turned on cases. > >>>>> * virtio_dev_rx for mrg_rxbuf turned off cases. > >>>>> > >>>>> The virtio_dev_merge_rx doesn't provide optimal performance, also it is > >>>>> reported having compatibility issue working with Windows VMs. > >>>> Could you tell us more please about this compatibility issue? > >>> > >>> > >>> For example, when you have testpmd in the host and Window VM as the > >> guest, > >>> with mrg_rxbuf turned on, the guest will hang once there's packets > enqueued > >>> by virtio_dev_merge_rx. > >> > >> You should put it into commit log. > > > > > > Okay. > > > > > >> > >>> Let me know if you see the same issue. 
> >>> > >>> > >>>> > >>>>> > >>>>> Besides, having 2 separated functions increases maintenance efforts. > >>>>> > >>>>> This patch uses a single function logic to replace the current 2 for > >>>>> better maintainability, and provides better performance by optimizing > >>>>> caching behavior especially for mrg_rxbuf turned on cases. > >> > >> Here, here sounds two parts to me: > >> > >> - one to unite mergeable and non-mergeable Rx > >> > >> - another one to optimize the mergeable path > >> > >> That means you should do it in two patches, with that we can have clear > >> understanding what changes the performance boost. It also helps review. > > > > > > Please see explanation below. > > > > > >> > >>>> Do you have some benchmark comparison before and after your change? > >>>> > >>>> Also, for maintainability, I would suggest the that the enqueue > >>>> function be split. Because vhost_enqueue_burst becomes very long (220 > >>>> LoC), and max level of indentation is too high (6). > >>>> > >>>> It makes the code hard to understand, and prone to miss bugs during > >>>> review and maintenance. > >> > >> Agreed. > >> > >>> > >>> This is something I've thought about while writing the code, the reason I > >>> keep it as one function body is that: > >>> > >>> 1. This function is very performance sensitive, and we need full control of > >>> code ordering (You can compare with the current performance with the > >>> mrg_rxbuf feature turned on to see the difference). > >> > >> Will inline functions help? > > > > > > Optimization in this patch actually reorganizes the code from its logic, > > so it's not suitable for making separated functions. > > > > I'll explain this in v2. > > I agree with Yuanhan. > Inline functions should not break the optimizations. > IMHO, this is mandatory for the patch to be accepted. Excellent! > > > > > > >> > >>> 2. 
I somehow find that a single function logic makes it easier to understand, > >>> surely I can add comments to make it easiler to read for . > >>> > >>> Please let me know if you still insist, we can discuss more on it. > >> > >> I am personally not a fan of huge function; I would try hard to avoid > >> too many levels of indentation as well. > >> > >>> > >>>> > >>>>> > >>>>> It also fixes the issue working with Windows VMs. > >>>> Ideally, the fix should be sent separately, before the rework. > >>>> Indeed, we might want to have the fix in the stable branch, without > >>>> picking the optimization. > >> > >> Agreed. > > > > > > The fact is that I don't have much time to debug with the current code > > since it's messy and I don't have Windows virtio code and the debugging > > environment. > > It seems you are not the only one facing the issue: > https://github.com/YanVugenfirer/kvm-guest-drivers-windows/issues/70 > > So a dedicated fix is really important. Yeah that's me raising this issue there. But I think it's another standalone task to identify the root cause and find the fix for the existing code. > > > This patch doesn't try to fix this issue, it rewrites the logic totally, > > and somehow fixes this issue. > > > > Do you think integrating this whole patch into the stable branch will work? > > Personally I think it makes more sense. > > No. > We don't even know why/how it fixes the Windows issue, which would be > the first thing to understand before integrating a fix in stable branch. > > And the stable branch is not meant for integrating such big reworks, > it is only meant to fix bugs. > > The risk of regressions have to be avoided as much as possible. 
> > > > > > >> > >>>> > >>>>> > >>>>> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> > >>>>> --- > >>>>> lib/librte_vhost/vhost-net.h | 6 +- > >>>>> lib/librte_vhost/vhost_rxtx.c | 582 > >> ++++++++++++++---------------------------- > >>>>> lib/librte_vhost/virtio-net.c | 15 +- > >>>>> 3 files changed, 208 insertions(+), 395 deletions(-) > >>>> 582 lines changed is a huge patch. > >>>> If possible, it would be better splitting it in incremental changes, > >>>> making the review process easier. > >>> > >>> > >>> It looks like a huge patch, but it simply deletes the current implementation > >>> and add the new code. I think perhaps split it into 2, 1st one to replace > >>> just the rte_vhost_enqueue_burst, 2nd one to delete all the obsolete > >> functions. > >>> It should make the patch clear, how do you think? :) > >> > >> Nope, it's not working in that way. It should be: > >> > >> - one patch to fix the hang issue for windows guest > >> > >> Please cc it to stable@dpdk.org as well so that we could pick it for > >> v16.07 stable release. > >> > >> - one patch to unite the two different Rx code path > >> > >> - another patch to optimize mergeable code path > > > > > > I can separate optimization from the basic code in v2, however as I explained > > this patch is built from scratch and doesn't take anything from the existing > > code, so there's no way to transform from the existing code incrementally into > > the new code. > > > > > >> > >> --yliu
This patch set optimizes the vhost enqueue function.

It implements the vhost logic from scratch into a single function designed
for high performance and good maintainability, and improves CPU efficiency
significantly by optimizing cache access, which means:

 * For fast frontends (e.g. DPDK virtio pmd), higher performance (maximum
   throughput) can be achieved.

 * For slow frontends (e.g. kernel virtio-net), better scalability can be
   achieved: each vhost core can support more connections since it takes
   fewer cycles to handle each single frontend.

The main optimization techniques are:

 1. Reorder code to reduce CPU pipeline stall cycles.

 2. Batch update the used ring for better efficiency.

 3. Prefetch descriptors to hide cache latency.

 4. Remove the useless volatile attribute to allow compiler optimization.

In the existing code there are 2 callbacks for vhost enqueue:

 * virtio_dev_merge_rx for mrg_rxbuf turned on cases.

 * virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature is turned on. Also, having 2 separate functions increases
maintenance efforts.

---
Changes in v2:

 1. Split the big function into several small ones.

 2. Use multiple patches to explain each optimization.

 3. Add comments.

Zhihong Wang (6):
  vhost: rewrite enqueue
  vhost: remove obsolete
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost-net.h  |   6 +-
 lib/librte_vhost/vhost_rxtx.c | 582 +++++++++++++++--------------------------
 lib/librte_vhost/virtio-net.c |  15 +-
 3 files changed, 228 insertions(+), 375 deletions(-)
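[Editor's note: the "batch update the used ring" technique above can be
sketched as below. This is a minimal illustration, not the actual DPDK
code; the struct layouts, names, and ring size are simplified stand-ins
for the real virtio vring definitions, and barriers are only noted in
comments.]

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8  /* illustrative ring size, power of two in practice */

struct used_elem { uint32_t id; uint32_t len; };

/* Shared used ring: the guest polls 'idx', so every write to it is
 * expensive (cache-line transfer between cores). */
struct used_ring {
    uint16_t idx;
    struct used_elem ring[RING_SIZE];
};

/* Private shadow copy filled per packet, flushed once per burst. */
struct shadow_ring {
    struct used_elem elem[RING_SIZE];
    uint16_t count;
};

static void shadow_add(struct shadow_ring *s, uint32_t id, uint32_t len)
{
    s->elem[s->count].id  = id;
    s->elem[s->count].len = len;
    s->count++;
}

/* One bulk copy plus a single index update replaces per-packet stores
 * to the shared ring. A real implementation issues a write barrier
 * before publishing the new index to the guest. */
static void shadow_flush(struct shadow_ring *s, struct used_ring *u)
{
    uint16_t i, pos = u->idx;

    for (i = 0; i < s->count; i++)
        u->ring[(uint16_t)(pos + i) % RING_SIZE] = s->elem[i];
    /* rte_smp_wmb() would go here in vhost */
    u->idx = (uint16_t)(pos + s->count);
    s->count = 0;
}
```

The win comes from touching the guest-visible index once per burst
instead of once per packet, so the shared cache line bounces far less.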
Thanks Maxime and Yuanhan for your review and suggestions!
Please help review the v2 of this patch.

> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Wednesday, August 17, 2016 5:51 PM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
>
> On Wed, Aug 17, 2016 at 11:17:46AM +0200, Maxime Coquelin wrote:
> > >>> This is something I've thought about while writing the code, the
> > >>> reason I keep it as one function body is that:
> > >>>
> > >>>    1. This function is very performance sensitive, and we need full
> > >>>       control of code ordering (You can compare with the current
> > >>>       performance with the mrg_rxbuf feature turned on to see the
> > >>>       difference).
> > >>
> > >> Will inline functions help?
> > >
> > > Optimization in this patch actually reorganizes the code from its
> > > logic, so it's not suitable for making separated functions.
> > >
> > > I'll explain this in v2.
> >
> > I agree with Yuanhan.
> > Inline functions should not break the optimizations.
> > IMHO, this is mandatory for the patch to be accepted.
>
> Yes.
>
> > It seems you are not the only one facing the issue:
> > https://github.com/YanVugenfirer/kvm-guest-drivers-windows/issues/70
> >
> > So a dedicated fix is really important.
>
> Yes.
>
> > > This patch doesn't try to fix this issue, it rewrites the logic
> > > totally, and somehow fixes this issue.
> > >
> > > Do you think integrating this whole patch into the stable branch
> > > will work? Personally I think it makes more sense.
> >
> > No.
> > We don't even know why/how it fixes the Windows issue, which would be
> > the first thing to understand before integrating a fix in stable branch.
>
> Yes.
>
> > And the stable branch is not meant for integrating such big reworks,
> > it is only meant to fix bugs.
>
> Yes.
>
> > The risk of regressions has to be avoided as much as possible.
>
> Yes.
>
> --yliu
This patch set optimizes the vhost enqueue function.

It implements the vhost logic from scratch into a single function designed
for high performance and good maintainability, and improves CPU efficiency
significantly by optimizing cache access, which means:

 * For fast frontends (e.g. DPDK virtio pmd), higher performance (maximum
   throughput) can be achieved.

 * For slow frontends (e.g. kernel virtio-net), better scalability can be
   achieved: each vhost core can support more connections since it takes
   fewer cycles to handle each single frontend.

The main optimization techniques are:

 1. Reorder code to reduce CPU pipeline stall cycles.

 2. Batch update the used ring for better efficiency.

 3. Prefetch descriptors to hide cache latency.

 4. Remove the useless volatile attribute to allow compiler optimization.

Code reordering and batch used ring update bring most of the performance
improvements.

In the existing code there are 2 callbacks for vhost enqueue:

 * virtio_dev_merge_rx for mrg_rxbuf turned on cases.

 * virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature is turned on. Also, having 2 separate functions increases
maintenance efforts.

---
Changes in v3:

 1. Remove unnecessary memset which causes frontend stall on SNB & IVB.

 2. Rename variables to follow the naming convention.

 3. Rewrite enqueue and delete the obsolete code in the same patch.

---
Changes in v2:

 1. Split the big function into several small ones.

 2. Use multiple patches to explain each optimization.

 3. Add comments.

Zhihong Wang (5):
  vhost: rewrite enqueue
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost-net.h  |   6 +-
 lib/librte_vhost/vhost_rxtx.c | 573 +++++++++++++++--------------------------
 lib/librte_vhost/virtio-net.c |  15 +-
 3 files changed, 220 insertions(+), 374 deletions(-)
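[Editor's note: the "prefetch descriptors to hide cache latency" technique
can be sketched as follows. This is an illustrative stand-in, not the DPDK
implementation: the descriptor layout is simplified and the "work" per
descriptor is just a byte count; real vhost prefetches the next descriptor
before doing the copy for the current one.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified descriptor; the real vring_desc also carries flags/next. */
struct desc_sketch {
    uint64_t addr;
    uint32_t len;
};

/* Process descriptor i while the cache line holding descriptor i+1 is
 * already being fetched, overlapping memory latency with useful work. */
static uint64_t process_burst(const struct desc_sketch *desc, size_t n)
{
    uint64_t bytes = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(&desc[i + 1], 0 /* read */, 0);
        bytes += desc[i].len;   /* stand-in for the real packet copy */
    }
    return bytes;
}
```

`__builtin_prefetch` is a GCC/Clang builtin and a hint only: the result is
identical with or without it, only the stall cycles change.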
Hi Zhihong,

On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> This patch set optimizes the vhost enqueue function.
>
> It implements the vhost logic from scratch into a single function
> designed for high performance and good maintainability, and improves
> CPU efficiency significantly by optimizing cache access, which means:
>
> * For fast frontends (eg. DPDK virtio pmd), higher performance (maximum
>   throughput) can be achieved.
>
> * For slow frontends (eg. kernel virtio-net), better scalability can be
>   achieved, each vhost core can support more connections since it takes
>   less cycles to handle each single frontend.
>
> The main optimization techniques are:
>
> 1. Reorder code to reduce CPU pipeline stall cycles.
>
> 2. Batch update the used ring for better efficiency.
>
> 3. Prefetch descriptor to hide cache latency.
>
> 4. Remove useless volatile attribute to allow compiler optimization.

Thanks for these details, this is helpful to understand where the perf
gain comes from.
I would suggest to add these information as comments in the code
where/if it makes sense. If more a general comment, at least add it in
the commit message of the patch introducing it.
Indeed, adding it to the cover letter is fine, but the information is
lost as soon as the series is applied.

You don't mention any figures, so I set up a benchmark on my side to
evaluate your series. It indeed shows an interesting performance gain.

My setup consists of one host running a guest.
The guest generates as much 64bytes packets as possible using
pktgen-dpdk. The host forwards received packets back to the guest
using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
physical CPUs.

I tested it with and without your v1 patch, with and without
rx-mergeable feature turned ON.
Results are the average of 8 runs of 60 seconds:

Rx-Mergeable ON :                                  7.72Mpps
Rx-Mergeable ON  + "vhost: optimize enqueue" v1:   9.19Mpps
Rx-Mergeable OFF:                                 10.52Mpps
Rx-Mergeable OFF + "vhost: optimize enqueue" v1:  10.60Mpps

Regards,
Maxime
On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> Hi Zhihong,
>
> On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > This patch set optimizes the vhost enqueue function.
> >
> > It implements the vhost logic from scratch into a single function
> > designed for high performance and good maintainability, and improves
> > CPU efficiency significantly by optimizing cache access, which means:
> >
> > * For fast frontends (eg. DPDK virtio pmd), higher performance
> >   (maximum throughput) can be achieved.
> >
> > * For slow frontends (eg. kernel virtio-net), better scalability can
> >   be achieved, each vhost core can support more connections since it
> >   takes less cycles to handle each single frontend.
> >
> > The main optimization techniques are:
> >
> > 1. Reorder code to reduce CPU pipeline stall cycles.
> >
> > 2. Batch update the used ring for better efficiency.
> >
> > 3. Prefetch descriptor to hide cache latency.
> >
> > 4. Remove useless volatile attribute to allow compiler optimization.
>
> Thanks for these details, this is helpful to understand where the perf
> gain comes from.
> I would suggest to add these information as comments in the code
> where/if it makes sense. If more a general comment, at least add it in
> the commit message of the patch introducing it.
> Indeed, adding it to the cover letter is fine, but the information is
> lost as soon as the series is applied.
>
> You don't mention any figures, so I set up a benchmark on my side to
> evaluate your series. It indeed shows an interesting performance gain.
>
> My setup consists of one host running a guest.
> The guest generates as much 64bytes packets as possible using
> pktgen-dpdk. The hosts forwards received packets back to the guest
> using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> physical CPUs.
>
> I tested it with and without your v1 patch, with and without
> rx-mergeable feature turned ON.
> Results are the average of 8 runs of 60 seconds:
>
> Rx-Mergeable ON :                                  7.72Mpps
> Rx-Mergeable ON  + "vhost: optimize enqueue" v1:   9.19Mpps
> Rx-Mergeable OFF:                                 10.52Mpps
> Rx-Mergeable OFF + "vhost: optimize enqueue" v1:  10.60Mpps
>
I forgot to add that before this series, I think we should first fix
the windows bug.
Else we will need a dedicated fix for the stable branch.

Regards,
Maxime
2016-08-22 12:01, Maxime Coquelin:
> I forgot to add that before this series, I think we should first fix
> the windows bug.
> Else we will need a dedicated fix for the stable branch.

This is a funny situation :)
If Zhihong had reworked the code without mentioning it is fixing a
scenario with Windows guests, maybe that nobody would have noticed ;)
That's probably why it is not written in v2/v3. But thanks to the v1,
we all know it:
"It also fixes the issue working with Windows VMs."

So yes, it would be a lot better to find the root cause and try to have
a minimal fix for 16.07, then rework the code for performance in 16.11.
I think we must avoid silent fixes, and even more, avoid writing
specific fixes for stable branches without validating them in the
master branch and its large user base.

Thanks for your good work guys, DPDK vhost is improving very well.
> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
>
> Hi Zhihong,
>
[...]
> > The main optimization techniques are:
> >
> > 1. Reorder code to reduce CPU pipeline stall cycles.
> >
> > 2. Batch update the used ring for better efficiency.
> >
> > 3. Prefetch descriptor to hide cache latency.
> >
> > 4. Remove useless volatile attribute to allow compiler optimization.
>
> Thanks for these details, this is helpful to understand where the perf
> gain comes from.
> I would suggest to add these information as comments in the code
> where/if it makes sense. If more a general comment, at least add it in
> the commit message of the patch introducing it.
> Indeed, adding it to the cover letter is fine, but the information is
> lost as soon as the series is applied.

Hi Maxime,

I did add this info in the later optimization patches to explain each
optimization technique. The v1 was indeed hard to read.

> You don't mention any figures, so I set up a benchmark on my side to
> evaluate your series. It indeed shows an interesting performance gain.
>
> My setup consists of one host running a guest.
> The guest generates as much 64bytes packets as possible using
> pktgen-dpdk. The hosts forwards received packets back to the guest
> using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> physical CPUs.

Thanks for doing the test!

I didn't publish any numbers since the gain varies on different
platforms and test setups.

In my phy to vm test on both IVB and HSW, where testpmd in the host
receives from the nic and enqueues to the guest, the enqueue efficiency
(cycles per packet) of the v3 patch is 2.4x and 1.4x as fast as the
current code for mergeable on and mergeable off respectively.

> I tested it with and without your v1 patch, with and without
> rx-mergeable feature turned ON.
> Results are the average of 8 runs of 60 seconds:
>
> Rx-Mergeable ON :                                  7.72Mpps
> Rx-Mergeable ON  + "vhost: optimize enqueue" v1:   9.19Mpps
> Rx-Mergeable OFF:                                 10.52Mpps
> Rx-Mergeable OFF + "vhost: optimize enqueue" v1:  10.60Mpps
>
> Regards,
> Maxime
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Monday, August 22, 2016 6:02 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
> Cc: yuanhan.liu@linux.intel.com
> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
>
> On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> > Hi Zhihong,
> >
> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > > This patch set optimizes the vhost enqueue function.
> > >
> > > It implements the vhost logic from scratch into a single function
> > > designed for high performance and good maintainability, and improves
> > > CPU efficiency significantly by optimizing cache access, which means:
> > >
> > > * For fast frontends (eg. DPDK virtio pmd), higher performance
> > >   (maximum throughput) can be achieved.
> > >
> > > * For slow frontends (eg. kernel virtio-net), better scalability can
> > >   be achieved, each vhost core can support more connections since it
> > >   takes less cycles to handle each single frontend.
> > >
> > > The main optimization techniques are:
> > >
> > > 1. Reorder code to reduce CPU pipeline stall cycles.
> > >
> > > 2. Batch update the used ring for better efficiency.
> > >
> > > 3. Prefetch descriptor to hide cache latency.
> > >
> > > 4. Remove useless volatile attribute to allow compiler optimization.
> >
> > Thanks for these details, this is helpful to understand where the perf
> > gain comes from.
> > I would suggest to add these information as comments in the code
> > where/if it makes sense. If more a general comment, at least add it in
> > the commit message of the patch introducing it.
> > Indeed, adding it to the cover letter is fine, but the information is
> > lost as soon as the series is applied.
> >
> > You don't mention any figures, so I set up a benchmark on my side to
> > evaluate your series. It indeed shows an interesting performance gain.
> >
> > My setup consists of one host running a guest.
> > The guest generates as much 64bytes packets as possible using
> > pktgen-dpdk. The hosts forwards received packets back to the guest
> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> > physical CPUs.
> >
> > I tested it with and without your v1 patch, with and without
> > rx-mergeable feature turned ON.
> > Results are the average of 8 runs of 60 seconds:
> >
> > Rx-Mergeable ON :                                  7.72Mpps
> > Rx-Mergeable ON  + "vhost: optimize enqueue" v1:   9.19Mpps
> > Rx-Mergeable OFF:                                 10.52Mpps
> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1:  10.60Mpps
> >
> I forgot to add that before this series, I think we should first fix
> the windows bug.
> Else we will need a dedicated fix for the stable branch.

Okay I'll try to fix it, though I can't make any promises at present.

Have tried once but stopped since we don't have enough debug info from
the frontend side, so basically I was debugging the backend based on
guesses.

> Regards,
> Maxime
> -----Original Message-----
> From: Wang, Zhihong
> Sent: Tuesday, August 23, 2016 10:31 AM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org
> Cc: yuanhan.liu@linux.intel.com
> Subject: RE: [PATCH v3 0/5] vhost: optimize enqueue
>
> > -----Original Message-----
> > From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> > Sent: Monday, August 22, 2016 6:02 PM
> > To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
> > Cc: yuanhan.liu@linux.intel.com
> > Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
> >
> > On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> > > Hi Zhihong,
> > >
> > > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > > > This patch set optimizes the vhost enqueue function.
> > > >
> > > > It implements the vhost logic from scratch into a single function
> > > > designed for high performance and good maintainability, and improves
> > > > CPU efficiency significantly by optimizing cache access, which means:
> > > >
> > > > * For fast frontends (eg. DPDK virtio pmd), higher performance
> > > >   (maximum throughput) can be achieved.
> > > >
> > > > * For slow frontends (eg. kernel virtio-net), better scalability can
> > > >   be achieved, each vhost core can support more connections since it
> > > >   takes less cycles to handle each single frontend.
> > > >
> > > > The main optimization techniques are:
> > > >
> > > > 1. Reorder code to reduce CPU pipeline stall cycles.
> > > >
> > > > 2. Batch update the used ring for better efficiency.
> > > >
> > > > 3. Prefetch descriptor to hide cache latency.
> > > >
> > > > 4. Remove useless volatile attribute to allow compiler optimization.
> > >
> > > Thanks for these details, this is helpful to understand where the perf
> > > gain comes from.
> > > I would suggest to add these information as comments in the code
> > > where/if it makes sense. If more a general comment, at least add it in
> > > the commit message of the patch introducing it.
> > > Indeed, adding it to the cover letter is fine, but the information is
> > > lost as soon as the series is applied.
> > >
> > > You don't mention any figures, so I set up a benchmark on my side to
> > > evaluate your series. It indeed shows an interesting performance gain.
> > >
> > > My setup consists of one host running a guest.
> > > The guest generates as much 64bytes packets as possible using
> > > pktgen-dpdk. The hosts forwards received packets back to the guest
> > > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> > > physical CPUs.
> > >
> > > I tested it with and without your v1 patch, with and without
> > > rx-mergeable feature turned ON.
> > > Results are the average of 8 runs of 60 seconds:
> > >
> > > Rx-Mergeable ON :                                  7.72Mpps
> > > Rx-Mergeable ON  + "vhost: optimize enqueue" v1:   9.19Mpps
> > > Rx-Mergeable OFF:                                 10.52Mpps
> > > Rx-Mergeable OFF + "vhost: optimize enqueue" v1:  10.60Mpps
> > >
> > I forgot to add that before this series, I think we should first fix
> > the windows bug.
> > Else we will need a dedicated fix for the stable branch.
>
> Okay I'll try to fix it, though I can't make any promises at present.
>
> Have tried once but stopped since we don't have enough debug info from
> the frontend side, so basically I was debugging the backend based on
> guesses.

Hi Maxime, Yuanhan,

I've identified the root cause. Do you think it makes sense to put the
fix in the same patch set? Or send it as a separate patch?

Thanks
Zhihong

> > Regards,
> > Maxime
On 08/23/2016 12:43 PM, Wang, Zhihong wrote:
>
>> -----Original Message-----
>> From: Wang, Zhihong
>> Sent: Tuesday, August 23, 2016 10:31 AM
>> To: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org
>> Cc: yuanhan.liu@linux.intel.com
>> Subject: RE: [PATCH v3 0/5] vhost: optimize enqueue
>>
>>> -----Original Message-----
>>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>>> Sent: Monday, August 22, 2016 6:02 PM
>>> To: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org
>>> Cc: yuanhan.liu@linux.intel.com
>>> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
..
>>>>
>>> I forgot to add that before this series, I think we should first fix
>>> the windows bug.
>>> Else we will need a dedicated fix for the stable branch.
>>
>> Okay I'll try to fix it, though I can't make any promises at present.
>>
>> Have tried once but stopped since we don't have enough debug info from
>> the frontend side so basically I was debugging the backend based on
>> guesses.
>
> Hi Maxime, Yuanhan,
>
> I've identified the root cause, do you think it makes sense to put the
> fix in the same patch set? Or send it as a separated patch?

Good work!
Agree with Yuanhan, send it before the optimization series.

Thanks,
Maxime
On Tue, Aug 23, 2016 at 10:43:36AM +0000, Wang, Zhihong wrote:
> > > I forgot to add that before this series, I think we should first
> > > fix the windows bug.
> > > Else we will need a dedicated fix for the stable branch.
> >
> > Okay I'll try to fix it, though I can't make any promises at present.
> >
> > Have tried once but stopped since we don't have enough debug info
> > from the frontend side so basically I was debugging the backend based
> > on guesses.
>
> Hi Maxime, Yuanhan,
>
> I've identified the root cause, do you think it makes sense to put the
> fix in the same patch set? Or send it as a separated patch?

Great! Yes, it's okay to put it in the patch set (normally, as the
first patch, before the rewrite). Please also add the following line
before your Signed-off-by in the commit log:

    Cc: <stable@dpdk.org>

	--yliu
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Monday, August 22, 2016 6:35 PM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>; Wang, Zhihong
> <zhihong.wang@intel.com>; yuanhan.liu@linux.intel.com
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> 2016-08-22 12:01, Maxime Coquelin:
> > I forgot to add that before this series, I think we should first fix
> > the windows bug.
> > Else we will need a dedicated fix for the stable branch.
>
> This is a funny situation :)
> If Zhihong had reworked the code without mentioning it is fixing a
> scenario with Windows guests, maybe that nobody would have noticed ;)
> That's probably why it is not written in v2/v3. But thanks to the v1,
> we all know it:
> "It also fixes the issue working with Windows VMs."

I thought it'd be more appropriate to send a dedicated fix for the
stable branch. So I removed this info.

> So yes, it would be a lot better to find the root cause and try to
> have a minimal fix for 16.07, then rework the code for performance in
> 16.11.
> I think we must avoid silent fixes, and even more, avoid writing
> specific fixes for stable branches without validating them in the
> master branch and its large user base.

Okay, that's also what Maxime and Yuanhan suggest.

BTW the root cause has been identified and the fix will be in v4.

> Thanks for your good work guys, DPDK vhost is improving very well.
This patch set optimizes the vhost enqueue function.

It implements the vhost logic from scratch into a single function designed
for high performance and good maintainability, and improves CPU efficiency
significantly by optimizing cache access, which means:

 * Higher maximum throughput can be achieved for fast frontends like DPDK
   virtio pmd.

 * Better scalability can be achieved: each vhost core can support more
   connections because it takes fewer cycles to handle each single
   frontend.

This patch set contains:

 1. A Windows VM compatibility fix for vhost enqueue in the 16.07 release.

 2. A baseline patch to rewrite the vhost logic.

 3. A series of optimization patches added upon the baseline.

The main optimization techniques are:

 1. Reorder code to reduce CPU pipeline stall cycles.

 2. Batch update the used ring for better efficiency.

 3. Prefetch descriptors to hide cache latency.

 4. Remove the useless volatile attribute to allow compiler optimization.

Code reordering and batch used ring update bring most of the performance
improvements.

In the existing code there are 2 callbacks for vhost enqueue:

 * virtio_dev_merge_rx for mrg_rxbuf turned on cases.

 * virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature is turned on. Besides, having 2 callback paths increases
maintenance efforts.

Also, there's a compatibility issue in the existing code which causes
Windows VMs to hang when the mrg_rxbuf feature is turned on.

---
Changes in v4:

 1. Fix a Windows VM compatibility issue.

 2. Free the shadow used ring in the right place.

 3. Add a failure check for the shadow used ring malloc.

 4. Refactor the code for clearer logic.

 5. Add PRINT_PACKET for debugging.

---
Changes in v3:

 1. Remove unnecessary memset which causes frontend stall on SNB & IVB.

 2. Rename variables to follow the naming convention.

 3. Rewrite enqueue and delete the obsolete code in the same patch.

---
Changes in v2:

 1. Split the big function into several small ones.

 2. Use multiple patches to explain each optimization.

 3. Add comments.

Zhihong Wang (6):
  vhost: fix windows vm hang
  vhost: rewrite enqueue
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost-net.h  |   6 +-
 lib/librte_vhost/vhost_rxtx.c | 572 +++++++++++++++--------------------------
 lib/librte_vhost/virtio-net.c |  42 +++-
 3 files changed, 244 insertions(+), 376 deletions(-)
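[Editor's note: the v4 changelog items "free shadow used ring in the right
place" and "add failure check for shadow used ring malloc" can be sketched
as below. The struct and function names here are hypothetical stand-ins,
not the actual DPDK symbols; the point is that the per-queue allocation is
NULL-checked up front so the hot path never dereferences a failed malloc.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct used_elem { uint32_t id; uint32_t len; };

/* Hypothetical per-virtqueue state carrying the private shadow ring. */
struct vq_sketch {
    uint16_t size;
    struct used_elem *shadow_used_ring;
};

/* Allocate the shadow ring when the queue is set up; report failure to
 * the caller instead of crashing later in the enqueue path. */
static int vq_alloc_shadow(struct vq_sketch *vq, uint16_t size)
{
    vq->size = size;
    vq->shadow_used_ring = calloc(size, sizeof(*vq->shadow_used_ring));
    return vq->shadow_used_ring != NULL ? 0 : -1;
}

/* Free in the teardown path that owns the queue, and clear the pointer
 * so a double free becomes a harmless free(NULL). */
static void vq_free_shadow(struct vq_sketch *vq)
{
    free(vq->shadow_used_ring);
    vq->shadow_used_ring = NULL;
}
```

Freeing "in the right place" means pairing the free with the same
lifecycle stage as the allocation (queue teardown, not an unrelated path),
which is what the changelog item refers to.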
This patch set optimizes the vhost enqueue function.

It implements the vhost logic from scratch into a single function designed
for high performance and good maintainability, and improves CPU efficiency
significantly by optimizing cache access, which means:

 * Higher maximum throughput can be achieved for fast frontends like DPDK
   virtio pmd.

 * Better scalability can be achieved: each vhost core can support more
   connections because it takes fewer cycles to handle each single
   frontend.

This patch set contains:

 1. A Windows VM compatibility fix for vhost enqueue in the 16.07 release.

 2. A baseline patch to rewrite the vhost logic.

 3. A series of optimization patches added upon the baseline.

The main optimization techniques are:

 1. Reorder code to reduce CPU pipeline stall cycles.

 2. Batch update the used ring for better efficiency.

 3. Prefetch descriptors to hide cache latency.

 4. Remove the useless volatile attribute to allow compiler optimization.

Code reordering and batch used ring update bring most of the performance
improvements.

In the existing code there are 2 callbacks for vhost enqueue:

 * virtio_dev_merge_rx for mrg_rxbuf turned on cases.

 * virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature is turned on. Besides, having 2 callback paths increases
maintenance efforts.

Also, there's a compatibility issue in the existing code which causes
Windows VMs to hang when the mrg_rxbuf feature is turned on.

---
Changes in v5:

 1. Rebase to the latest branch.

 2. Rename variables to keep a consistent naming style.

 3. Small changes like return value adjustment and vertical alignment.

 4. Add details in commit log.

---
Changes in v4:

 1. Fix a Windows VM compatibility issue.

 2. Free the shadow used ring in the right place.

 3. Add a failure check for the shadow used ring malloc.

 4. Refactor the code for clearer logic.

 5. Add PRINT_PACKET for debugging.

---
Changes in v3:

 1. Remove unnecessary memset which causes frontend stall on SNB & IVB.

 2. Rename variables to follow the naming convention.

 3. Rewrite enqueue and delete the obsolete code in the same patch.

---
Changes in v2:

 1. Split the big function into several small ones.

 2. Use multiple patches to explain each optimization.

 3. Add comments.

Zhihong Wang (6):
  vhost: fix windows vm hang
  vhost: rewrite enqueue
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost.c      |  20 +-
 lib/librte_vhost/vhost.h      |   6 +-
 lib/librte_vhost/vhost_user.c |  31 ++-
 lib/librte_vhost/virtio_net.c | 561 +++++++++++++++--------------------------
 4 files changed, 242 insertions(+), 376 deletions(-)
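[Editor's note: a sketch of why removing a needless volatile (technique 4
above) helps. This is an illustrative example, not the DPDK code: the
field name is hypothetical. Marking vhost-private state volatile forces a
memory load on every access and blocks common-subexpression elimination;
reading the guest-shared index once into a local and keeping private state
non-volatile lets the compiler use registers. The wrap-around behavior of
the uint16_t subtraction also matters for virtio ring indices.]

```c
#include <assert.h>
#include <stdint.h>

struct vq_sketch {
    uint16_t last_used_idx;  /* private to vhost: no need for volatile */
};

/* avail_idx is read once from guest-shared memory by the caller, then
 * treated as a plain value; uint16_t subtraction wraps correctly when
 * the ring indices overflow (virtio indices are free-running). */
static uint16_t free_entries(const struct vq_sketch *vq, uint16_t avail_idx)
{
    return (uint16_t)(avail_idx - vq->last_used_idx);
}
```

With a volatile field, each use of `last_used_idx` in a loop would be a
fresh load; without it, the compiler may keep it in a register across the
whole enqueue burst.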
Hi,

On 09/09/2016 05:39 AM, Zhihong Wang wrote:
> This patch set optimizes the vhost enqueue function.
>
> It implements the vhost logic from scratch into a single function
> designed for high performance and good maintainability, and improves
> CPU efficiency significantly by optimizing cache access, which means:
>
> * Higher maximum throughput can be achieved for fast frontends like
>   DPDK virtio pmd.
>
> * Better scalability can be achieved that each vhost core can support
>   more connections because it takes less cycles to handle each single
>   frontend.
>
> This patch set contains:
>
> 1. A Windows VM compatibility fix for vhost enqueue in 16.07 release.
>
> 2. A baseline patch to rewrite the vhost logic.
>
> 3. A series of optimization patches added upon the baseline.
>
> The main optimization techniques are:
>
> 1. Reorder code to reduce CPU pipeline stall cycles.
>
> 2. Batch update the used ring for better efficiency.
>
> 3. Prefetch descriptor to hide cache latency.
>
> 4. Remove useless volatile attribute to allow compiler optimization.
>
> Code reordering and batch used ring update bring most of the performance
> improvements.
>
> In the existing code there're 2 callbacks for vhost enqueue:
>
> * virtio_dev_merge_rx for mrg_rxbuf turned on cases.
>
> * virtio_dev_rx for mrg_rxbuf turned off cases.
>
> The performance of the existing code is not optimal, especially when the
> mrg_rxbuf feature turned on. Besides, having 2 callback paths increases
> maintenance efforts.
>
> Also, there's a compatibility issue in the existing code which causes
> Windows VM to hang when the mrg_rxbuf feature turned on.
>
> ---
> Changes in v5:
>
> 1. Rebase to the latest branch.
>
> 2. Rename variables to keep consistent in naming style.
>
> 3. Small changes like return value adjustment and vertical alignment.
>
> 4. Add details in commit log.

Just tried to apply your series without success.
Apparently, it is not based directly on master branch,
as it lacks some SHA-1 information.

Could you rebase it against master please?

Thanks,
Maxime
On 09/12/2016 03:52 PM, Maxime Coquelin wrote: > Hi, > > On 09/09/2016 05:39 AM, Zhihong Wang wrote: >> This patch set optimizes the vhost enqueue function. >> >> It implements the vhost logic from scratch into a single function >> designed >> for high performance and good maintainability, and improves CPU >> efficiency >> significantly by optimizing cache access, which means: >> >> * Higher maximum throughput can be achieved for fast frontends like >> DPDK >> virtio pmd. >> >> * Better scalability can be achieved that each vhost core can support >> more connections because it takes less cycles to handle each single >> frontend. >> >> This patch set contains: >> >> 1. A Windows VM compatibility fix for vhost enqueue in 16.07 release. >> >> 2. A baseline patch to rewrite the vhost logic. >> >> 3. A series of optimization patches added upon the baseline. >> >> The main optimization techniques are: >> >> 1. Reorder code to reduce CPU pipeline stall cycles. >> >> 2. Batch update the used ring for better efficiency. >> >> 3. Prefetch descriptor to hide cache latency. >> >> 4. Remove useless volatile attribute to allow compiler optimization. >> >> Code reordering and batch used ring update bring most of the performance >> improvements. >> >> In the existing code there're 2 callbacks for vhost enqueue: >> >> * virtio_dev_merge_rx for mrg_rxbuf turned on cases. >> >> * virtio_dev_rx for mrg_rxbuf turned off cases. >> >> The performance of the existing code is not optimal, especially when the >> mrg_rxbuf feature turned on. Besides, having 2 callback paths increases >> maintenance efforts. >> >> Also, there's a compatibility issue in the existing code which causes >> Windows VM to hang when the mrg_rxbuf feature turned on. >> >> --- >> Changes in v5: >> >> 1. Rebase to the latest branch. >> >> 2. Rename variables to keep consistent in naming style. >> >> 3. Small changes like return value adjustment and vertical alignment. >> >> 4. Add details in commit log. 
> Just tried to apply your series without success.
> Apparently, it is not based directly on master branch,
> as it lacks some SHA-1 information.
>
> Could you rebase it against master please?

Ok, it is in fact based on top of:
git://dpdk.org/next/dpdk-next-virtio master

For v6, if any, could you add this info to the cover letter please?

Thanks,
Maxime
On Mon, Sep 12, 2016 at 03:52:12PM +0200, Maxime Coquelin wrote:
> Just tried to apply your series without success.
> Apparently, it is not based directly on master branch,
> as it lacks some SHA-1 information.
>
> Could you rebase it against master please?

It's rebased against the dpdk-next-virtio tree [0], where all the
virtio/vhost patches are applied first.

[0]: http://dpdk.org/browse/next/dpdk-next-virtio/

	--yliu
This patch set optimizes the vhost enqueue function.

It implements the vhost logic from scratch into a single function designed
for high performance and good maintainability, and improves CPU efficiency
significantly by optimizing cache access, which means:

* Higher maximum throughput can be achieved for fast frontends like DPDK
  virtio pmd.

* Better scalability can be achieved that each vhost core can support
  more connections because it takes less cycles to handle each single
  frontend.

This patch set contains:

1. A Windows VM compatibility fix for vhost enqueue in 16.07 release.

2. A baseline patch to rewrite the vhost logic.

3. A series of optimization patches added upon the baseline.

The main optimization techniques are:

1. Reorder code to reduce CPU pipeline stall cycles.

2. Batch update the used ring for better efficiency.

3. Prefetch descriptor to hide cache latency.

4. Remove useless volatile attribute to allow compiler optimization.

Code reordering and batch used ring update bring most of the performance
improvements.

In the existing code there're 2 callbacks for vhost enqueue:

* virtio_dev_merge_rx for mrg_rxbuf turned on cases.

* virtio_dev_rx for mrg_rxbuf turned off cases.

The performance of the existing code is not optimal, especially when the
mrg_rxbuf feature turned on. Besides, having 2 callback paths increases
maintenance efforts.

Also, there's a compatibility issue in the existing code which causes
Windows VM to hang when the mrg_rxbuf feature turned on.

---
Changes in v6:

1. Merge duplicated code.

2. Introduce a function for used ring write.

3. Add necessary comments.

---
Changes in v5:

1. Rebase to dpdk-next-virtio master.

2. Rename variables to keep consistent in naming style.

3. Small changes like return value adjustment and vertical alignment.

4. Add details in commit log.

---
Changes in v4:

1. Fix a Windows VM compatibility issue.

2. Free shadow used ring in the right place.

3. Add failure check for shadow used ring malloc.

4. Refactor the code for clearer logic.

5. Add PRINT_PACKET for debugging.

---
Changes in v3:

1. Remove unnecessary memset which causes frontend stall on SNB & IVB.

2. Rename variables to follow naming convention.

3. Rewrite enqueue and delete the obsolete in the same patch.

---
Changes in v2:

1. Split the big function into several small ones.

2. Use multiple patches to explain each optimization.

3. Add comments.

Zhihong Wang (6):
  vhost: fix windows vm hang
  vhost: rewrite enqueue
  vhost: remove useless volatile
  vhost: add desc prefetch
  vhost: batch update used ring
  vhost: optimize cache access

 lib/librte_vhost/vhost.c      |  20 +-
 lib/librte_vhost/vhost.h      |   6 +-
 lib/librte_vhost/vhost_user.c |  31 ++-
 lib/librte_vhost/virtio_net.c | 541 ++++++++++++++----------------------------
 4 files changed, 225 insertions(+), 373 deletions(-)
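[Editor's note] The "batch update the used ring" technique from the cover letter can be sketched in a minimal, self-contained model. This is illustrative only, not the actual DPDK implementation: the struct layout and the names `shadow`, `update_shadow_used_ring`, and `flush_shadow_used_ring` are stand-ins, and the real code must also issue a write memory barrier before publishing `used_idx` and then notify the guest.

```c
#include <stdint.h>
#include <string.h>

/* Simplified used-ring element, mirroring struct vring_used_elem. */
struct used_elem {
	uint32_t id;  /* head index of the completed descriptor chain */
	uint32_t len; /* number of bytes written into that chain */
};

#define RING_SIZE 256 /* must be a power of two */

struct vq {
	struct used_elem used_ring[RING_SIZE]; /* shared with the guest */
	uint16_t used_idx;                     /* index published to the guest */
	struct used_elem shadow[RING_SIZE];    /* private, cache-hot staging copy */
	uint16_t shadow_count;
};

/* Record one finished descriptor chain in the private shadow ring.
 * No shared cache line is touched here. */
static void
update_shadow_used_ring(struct vq *vq, uint32_t desc_id, uint32_t len)
{
	vq->shadow[vq->shadow_count].id = desc_id;
	vq->shadow[vq->shadow_count].len = len;
	vq->shadow_count++;
}

/* Write all staged entries to the shared used ring in one pass and
 * publish used_idx once per burst instead of once per packet. */
static void
flush_shadow_used_ring(struct vq *vq)
{
	uint16_t start = vq->used_idx & (RING_SIZE - 1);
	uint16_t i;

	for (i = 0; i < vq->shadow_count; i++)
		vq->used_ring[(start + i) & (RING_SIZE - 1)] = vq->shadow[i];

	/* Real code: rte_smp_wmb() here, then eventfd kick if needed. */
	vq->used_idx += vq->shadow_count;
	vq->shadow_count = 0;
}
```

Batching this way keeps the shared used-ring cache lines owned by the vhost core for the whole burst, rather than bouncing between host and guest once per packet.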
Hi Maxime,

Do you have more comments about this set? If no, I think I could merge
it shortly.

Thanks.

	--yliu

On Mon, Sep 19, 2016 at 10:00:11PM -0400, Zhihong Wang wrote:
> This patch set optimizes the vhost enqueue function.
>
> It implements the vhost logic from scratch into a single function designed
> for high performance and good maintainability, and improves CPU efficiency
> significantly by optimizing cache access, which means:
>
> * Higher maximum throughput can be achieved for fast frontends like DPDK
>   virtio pmd.
>
> * Better scalability can be achieved that each vhost core can support
>   more connections because it takes less cycles to handle each single
>   frontend.
>
> This patch set contains:
>
> 1. A Windows VM compatibility fix for vhost enqueue in 16.07 release.
>
> 2. A baseline patch to rewrite the vhost logic.
>
> 3. A series of optimization patches added upon the baseline.
>
> The main optimization techniques are:
>
> 1. Reorder code to reduce CPU pipeline stall cycles.
>
> 2. Batch update the used ring for better efficiency.
>
> 3. Prefetch descriptor to hide cache latency.
>
> 4. Remove useless volatile attribute to allow compiler optimization.
>
> Code reordering and batch used ring update bring most of the performance
> improvements.
>
> In the existing code there're 2 callbacks for vhost enqueue:
>
> * virtio_dev_merge_rx for mrg_rxbuf turned on cases.
>
> * virtio_dev_rx for mrg_rxbuf turned off cases.
>
> The performance of the existing code is not optimal, especially when the
> mrg_rxbuf feature turned on. Besides, having 2 callback paths increases
> maintenance efforts.
>
> Also, there's a compatibility issue in the existing code which causes
> Windows VM to hang when the mrg_rxbuf feature turned on.
>
> ---
> Changes in v6:
>
> 1. Merge duplicated code.
>
> 2. Introduce a function for used ring write.
>
> 3. Add necessary comments.
>
> ---
> Changes in v5:
>
> 1. Rebase to dpdk-next-virtio master.
>
> 2. Rename variables to keep consistent in naming style.
>
> 3. Small changes like return value adjustment and vertical alignment.
>
> 4. Add details in commit log.
>
> ---
> Changes in v4:
>
> 1. Fix a Windows VM compatibility issue.
>
> 2. Free shadow used ring in the right place.
>
> 3. Add failure check for shadow used ring malloc.
>
> 4. Refactor the code for clearer logic.
>
> 5. Add PRINT_PACKET for debugging.
>
> ---
> Changes in v3:
>
> 1. Remove unnecessary memset which causes frontend stall on SNB & IVB.
>
> 2. Rename variables to follow naming convention.
>
> 3. Rewrite enqueue and delete the obsolete in the same patch.
>
> ---
> Changes in v2:
>
> 1. Split the big function into several small ones.
>
> 2. Use multiple patches to explain each optimization.
>
> 3. Add comments.
>
> Zhihong Wang (6):
>   vhost: fix windows vm hang
>   vhost: rewrite enqueue
>   vhost: remove useless volatile
>   vhost: add desc prefetch
>   vhost: batch update used ring
>   vhost: optimize cache access
>
>  lib/librte_vhost/vhost.c      |  20 +-
>  lib/librte_vhost/vhost.h      |   6 +-
>  lib/librte_vhost/vhost_user.c |  31 ++-
>  lib/librte_vhost/virtio_net.c | 541 ++++++++++++++----------------------------
>  4 files changed, 225 insertions(+), 373 deletions(-)
>
> --
> 2.7.4
Hi Yuanhan,

On 09/21/2016 04:26 AM, Yuanhan Liu wrote:
> Hi Maxime,
>
> Do you have more comments about this set? If no, I think I could merge
> it shortly.

No more comments, this is good to me.
Feel free to add:
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime

> Thanks.
>
> --yliu
>
> On Mon, Sep 19, 2016 at 10:00:11PM -0400, Zhihong Wang wrote:
>> This patch set optimizes the vhost enqueue function.
>>
>> It implements the vhost logic from scratch into a single function designed
>> for high performance and good maintainability, and improves CPU efficiency
>> significantly by optimizing cache access, which means:
>>
>> * Higher maximum throughput can be achieved for fast frontends like DPDK
>>   virtio pmd.
>>
>> * Better scalability can be achieved that each vhost core can support
>>   more connections because it takes less cycles to handle each single
>>   frontend.
>>
>> This patch set contains:
>>
>> 1. A Windows VM compatibility fix for vhost enqueue in 16.07 release.
>>
>> 2. A baseline patch to rewrite the vhost logic.
>>
>> 3. A series of optimization patches added upon the baseline.
>>
>> The main optimization techniques are:
>>
>> 1. Reorder code to reduce CPU pipeline stall cycles.
>>
>> 2. Batch update the used ring for better efficiency.
>>
>> 3. Prefetch descriptor to hide cache latency.
>>
>> 4. Remove useless volatile attribute to allow compiler optimization.
>>
>> Code reordering and batch used ring update bring most of the performance
>> improvements.
>>
>> In the existing code there're 2 callbacks for vhost enqueue:
>>
>> * virtio_dev_merge_rx for mrg_rxbuf turned on cases.
>>
>> * virtio_dev_rx for mrg_rxbuf turned off cases.
>>
>> The performance of the existing code is not optimal, especially when the
>> mrg_rxbuf feature turned on. Besides, having 2 callback paths increases
>> maintenance efforts.
>>
>> Also, there's a compatibility issue in the existing code which causes
>> Windows VM to hang when the mrg_rxbuf feature turned on.
>>
>> ---
>> Changes in v6:
>>
>> 1. Merge duplicated code.
>>
>> 2. Introduce a function for used ring write.
>>
>> 3. Add necessary comments.
>>
>> ---
>> Changes in v5:
>>
>> 1. Rebase to dpdk-next-virtio master.
>>
>> 2. Rename variables to keep consistent in naming style.
>>
>> 3. Small changes like return value adjustment and vertical alignment.
>>
>> 4. Add details in commit log.
>>
>> ---
>> Changes in v4:
>>
>> 1. Fix a Windows VM compatibility issue.
>>
>> 2. Free shadow used ring in the right place.
>>
>> 3. Add failure check for shadow used ring malloc.
>>
>> 4. Refactor the code for clearer logic.
>>
>> 5. Add PRINT_PACKET for debugging.
>>
>> ---
>> Changes in v3:
>>
>> 1. Remove unnecessary memset which causes frontend stall on SNB & IVB.
>>
>> 2. Rename variables to follow naming convention.
>>
>> 3. Rewrite enqueue and delete the obsolete in the same patch.
>>
>> ---
>> Changes in v2:
>>
>> 1. Split the big function into several small ones.
>>
>> 2. Use multiple patches to explain each optimization.
>>
>> 3. Add comments.
>>
>> Zhihong Wang (6):
>>   vhost: fix windows vm hang
>>   vhost: rewrite enqueue
>>   vhost: remove useless volatile
>>   vhost: add desc prefetch
>>   vhost: batch update used ring
>>   vhost: optimize cache access
>>
>>  lib/librte_vhost/vhost.c      |  20 +-
>>  lib/librte_vhost/vhost.h      |   6 +-
>>  lib/librte_vhost/vhost_user.c |  31 ++-
>>  lib/librte_vhost/virtio_net.c | 541 ++++++++++++++----------------------------
>>  4 files changed, 225 insertions(+), 373 deletions(-)
>>
>> --
>> 2.7.4
Hi Maxime,

On 22 August 2016 at 16:11, Maxime Coquelin <maxime.coquelin@redhat.com> wrote:
> Hi Zhihong,
>
> On 08/19/2016 07:43 AM, Zhihong Wang wrote:
>>
>> This patch set optimizes the vhost enqueue function.
>>
...
>
> My setup consists of one host running a guest.
> The guest generates as much 64bytes packets as possible using

Have you tested with other different packet size?
My testing shows that performance is dropping when packet size is more
than 256.

> pktgen-dpdk. The hosts forwards received packets back to the guest
> using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> physical CPUs.
>
> I tested it with and without your v1 patch, with and without
> rx-mergeable feature turned ON.
> Results are the average of 8 runs of 60 seconds:
>
> Rx-Mergeable ON : 7.72Mpps
> Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> Rx-Mergeable OFF: 10.52Mpps
> Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
>
> Regards,
> Maxime
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Wednesday, September 21, 2016 4:50 PM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org;
> yuanhan.liu@linux.intel.com
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> Hi Maxime,
>
> On 22 August 2016 at 16:11, Maxime Coquelin
> <maxime.coquelin@redhat.com> wrote:
> > Hi Zhihong,
> >
> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> >>
> >> This patch set optimizes the vhost enqueue function.
> >>
> ...
> >
> > My setup consists of one host running a guest.
> > The guest generates as much 64bytes packets as possible using
>
> Have you tested with other different packet size?
> My testing shows that performance is dropping when packet size is more
> than 256.

Hi Jianbo,

Thanks for reporting this.

1. Are you running the vector frontend with mrg_rxbuf=off?

2. Could you please specify what CPU you're running? Is it Haswell
   or Ivy Bridge?

3. How many percentage of drop are you seeing?

This is expected by me because I've already found the root cause and
the way to optimize it, but since it missed the v0 deadline and
requires changes in eal/memcpy, I postpone it to the next release.

After the upcoming optimization the performance for packets larger
than 256 will be improved, and the new code will be much faster than
the current code.

Thanks
Zhihong

> > pktgen-dpdk. The hosts forwards received packets back to the guest
> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> > physical CPUs.
> >
> > I tested it with and without your v1 patch, with and without
> > rx-mergeable feature turned ON.
> > Results are the average of 8 runs of 60 seconds:
> >
> > Rx-Mergeable ON : 7.72Mpps
> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> > Rx-Mergeable OFF: 10.52Mpps
> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> >
> > Regards,
> > Maxime
On 21 September 2016 at 17:27, Wang, Zhihong <zhihong.wang@intel.com> wrote:
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Wednesday, September 21, 2016 4:50 PM
>> To: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org;
>> yuanhan.liu@linux.intel.com
>> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>>
>> Hi Maxime,
>>
>> On 22 August 2016 at 16:11, Maxime Coquelin
>> <maxime.coquelin@redhat.com> wrote:
>> > Hi Zhihong,
>> >
>> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
>> >>
>> >> This patch set optimizes the vhost enqueue function.
>> >>
>> ...
>> >
>> > My setup consists of one host running a guest.
>> > The guest generates as much 64bytes packets as possible using
>>
>> Have you tested with other different packet size?
>> My testing shows that performance is dropping when packet size is more
>> than 256.
>
> Hi Jianbo,
>
> Thanks for reporting this.
>
> 1. Are you running the vector frontend with mrg_rxbuf=off?
>
> 2. Could you please specify what CPU you're running? Is it Haswell
>    or Ivy Bridge?
>
> 3. How many percentage of drop are you seeing?
>
> This is expected by me because I've already found the root cause and
> the way to optimize it, but since it missed the v0 deadline and
> requires changes in eal/memcpy, I postpone it to the next release.
>
> After the upcoming optimization the performance for packets larger
> than 256 will be improved, and the new code will be much faster than
> the current code.
>

Sorry, I tested on an ARM server, but I wonder if there is the same
issue for x86 platform.

>> > pktgen-dpdk. The hosts forwards received packets back to the guest
>> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
>> > physical CPUs.
>> >
>> > I tested it with and without your v1 patch, with and without
>> > rx-mergeable feature turned ON.
>> > Results are the average of 8 runs of 60 seconds:
>> >
>> > Rx-Mergeable ON : 7.72Mpps
>> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
>> > Rx-Mergeable OFF: 10.52Mpps
>> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
>> >
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Wednesday, September 21, 2016 8:54 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org;
> yuanhan.liu@linux.intel.com
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On 21 September 2016 at 17:27, Wang, Zhihong <zhihong.wang@intel.com>
> wrote:
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> >> Sent: Wednesday, September 21, 2016 4:50 PM
> >> To: Maxime Coquelin <maxime.coquelin@redhat.com>
> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; dev@dpdk.org;
> >> yuanhan.liu@linux.intel.com
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> Hi Maxime,
> >>
> >> On 22 August 2016 at 16:11, Maxime Coquelin
> >> <maxime.coquelin@redhat.com> wrote:
> >> > Hi Zhihong,
> >> >
> >> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> >> >>
> >> >> This patch set optimizes the vhost enqueue function.
> >> >>
> >> ...
> >> >
> >> > My setup consists of one host running a guest.
> >> > The guest generates as much 64bytes packets as possible using
> >>
> >> Have you tested with other different packet size?
> >> My testing shows that performance is dropping when packet size is
> >> more than 256.
> >
> > Hi Jianbo,
> >
> > Thanks for reporting this.
> >
> > 1. Are you running the vector frontend with mrg_rxbuf=off?
> >
> > 2. Could you please specify what CPU you're running? Is it Haswell
> >    or Ivy Bridge?
> >
> > 3. How many percentage of drop are you seeing?
> >
> > This is expected by me because I've already found the root cause and
> > the way to optimize it, but since it missed the v0 deadline and
> > requires changes in eal/memcpy, I postpone it to the next release.
> >
> > After the upcoming optimization the performance for packets larger
> > than 256 will be improved, and the new code will be much faster than
> > the current code.
> >
>
> Sorry, I tested on an ARM server, but I wonder if there is the same
> issue for x86 platform.

For the mrg_rxbuf=off path there might be a slight drop for packets larger
than 256B (~3% for 512B and ~1% for 1024B), and no drop for the other
cases. This is not a bug or issue; we only need to enhance memcpy to
complete the whole optimization, which should be done in a separate
patch, but unfortunately it misses this release window.

> >> > pktgen-dpdk. The hosts forwards received packets back to the guest
> >> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> >> > physical CPUs.
> >> >
> >> > I tested it with and without your v1 patch, with and without
> >> > rx-mergeable feature turned ON.
> >> > Results are the average of 8 runs of 60 seconds:
> >> >
> >> > Rx-Mergeable ON : 7.72Mpps
> >> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> >> > Rx-Mergeable OFF: 10.52Mpps
> >> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> >> >
On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> > My setup consists of one host running a guest.
> >> > The guest generates as much 64bytes packets as possible using
> >>
> >> Have you tested with other different packet size?
> >> My testing shows that performance is dropping when packet size is more
> >> than 256.
> >
> >
> > Hi Jianbo,
> >
> > Thanks for reporting this.
> >
> > 1. Are you running the vector frontend with mrg_rxbuf=off?
> >
> > 2. Could you please specify what CPU you're running? Is it Haswell
> >    or Ivy Bridge?
> >
> > 3. How many percentage of drop are you seeing?
> >
> > This is expected by me because I've already found the root cause and
> > the way to optimize it, but since it missed the v0 deadline and
> > requires changes in eal/memcpy, I postpone it to the next release.
> >
> > After the upcoming optimization the performance for packets larger
> > than 256 will be improved, and the new code will be much faster than
> > the current code.
> >
>
> Sorry, I tested on an ARM server, but I wonder if there is the same
> issue for x86 platform.

Would you please provide more details? Say, answer the two left
questions from Zhihong?

Thanks.

	--yliu
On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
>> >> > My setup consists of one host running a guest.
>> >> > The guest generates as much 64bytes packets as possible using
>> >>
>> >> Have you tested with other different packet size?
>> >> My testing shows that performance is dropping when packet size is more
>> >> than 256.
>> >
>> >
>> > Hi Jianbo,
>> >
>> > Thanks for reporting this.
>> >
>> > 1. Are you running the vector frontend with mrg_rxbuf=off?
>> >

Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.

>> > 2. Could you please specify what CPU you're running? Is it Haswell
>> >    or Ivy Bridge?
>> >

It's an ARM server.

>> > 3. How many percentage of drop are you seeing?

The testing result:

size (bytes)   improvement (%)
64             3.92
128            11.51
256            24.16
512            -13.79
1024           -22.51
1500           -12.22

A correction is that performance is dropping if byte size is larger than 512.

>> >
>> > This is expected by me because I've already found the root cause and
>> > the way to optimize it, but since it missed the v0 deadline and
>> > requires changes in eal/memcpy, I postpone it to the next release.
>> >
>> > After the upcoming optimization the performance for packets larger
>> > than 256 will be improved, and the new code will be much faster than
>> > the current code.
>> >
>>
>> Sorry, I tested on an ARM server, but I wonder if there is the same
>> issue for x86 platform.
>
> Would you please provide more details? Say, answer the two left
> questions from Zhihong?
>
> Thanks.
>
> 	--yliu
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Thursday, September 22, 2016 1:48 PM
> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> > My setup consists of one host running a guest.
> >> >> > The guest generates as much 64bytes packets as possible using
> >> >>
> >> >> Have you tested with other different packet size?
> >> >> My testing shows that performance is dropping when packet size is
> >> >> more than 256.
> >> >
> >> >
> >> > Hi Jianbo,
> >> >
> >> > Thanks for reporting this.
> >> >
> >> > 1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >
> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
>
> >> > 2. Could you please specify what CPU you're running? Is it Haswell
> >> >    or Ivy Bridge?
> >> >
> It's an ARM server.
>
> >> > 3. How many percentage of drop are you seeing?
>
> The testing result:
>
> size (bytes)   improvement (%)
> 64             3.92
> 128            11.51
> 256            24.16
> 512            -13.79
> 1024           -22.51
> 1500           -12.22
>
> A correction is that performance is dropping if byte size is larger than 512.

Jianbo,

Could you please verify does this patch really cause enqueue perf to drop?

You can test the enqueue path only by set guest to do rxonly, and compare
the mpps by show port stats all in the guest.

Thanks
Zhihong

> >> >
> >> > This is expected by me because I've already found the root cause and
> >> > the way to optimize it, but since it missed the v0 deadline and
> >> > requires changes in eal/memcpy, I postpone it to the next release.
> >> >
> >> > After the upcoming optimization the performance for packets larger
> >> > than 256 will be improved, and the new code will be much faster than
> >> > the current code.
> >> >
> >>
> >> Sorry, I tested on an ARM server, but I wonder if there is the same
> >> issue for x86 platform.
> >
> > Would you please provide more details? Say, answer the two left
> > questions from Zhihong?
> >
> > Thanks.
> >
> > 	--yliu
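[Editor's note] The enqueue-only measurement Zhihong suggests above maps to a testpmd session roughly like the following sketch. EAL options, port topology, and core masks are placeholders that depend on the actual setup; only the forwarding-mode and stats commands are standard testpmd CLI.

```console
# Guest side: receive only, so just the vhost enqueue path is exercised.
testpmd> set fwd rxonly
testpmd> start

# Host side: generate traffic toward the vhost port.
testpmd> set fwd txonly
testpmd> start

# Back in the guest, read the enqueue rate (Rx-pps):
testpmd> show port stats all
```

Comparing the guest's Rx-pps with and without the patch isolates the enqueue path from the dequeue path.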
On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com> wrote:
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Thursday, September 22, 2016 1:48 PM
>> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
>> <maxime.coquelin@redhat.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>>
>> On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> wrote:
>> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
>> >> >> > My setup consists of one host running a guest.
>> >> >> > The guest generates as much 64bytes packets as possible using
>> >> >>
>> >> >> Have you tested with other different packet size?
>> >> >> My testing shows that performance is dropping when packet size is
>> >> >> more than 256.
>> >> >
>> >> >
>> >> > Hi Jianbo,
>> >> >
>> >> > Thanks for reporting this.
>> >> >
>> >> > 1. Are you running the vector frontend with mrg_rxbuf=off?
>> >> >
>> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
>>
>> >> > 2. Could you please specify what CPU you're running? Is it Haswell
>> >> >    or Ivy Bridge?
>> >> >
>> It's an ARM server.
>>
>> >> > 3. How many percentage of drop are you seeing?
>>
>> The testing result:
>>
>> size (bytes)   improvement (%)
>> 64             3.92
>> 128            11.51
>> 256            24.16
>> 512            -13.79
>> 1024           -22.51
>> 1500           -12.22
>>
>> A correction is that performance is dropping if byte size is larger than 512.
>
>
> Jianbo,
>
> Could you please verify does this patch really cause enqueue perf to drop?
>
> You can test the enqueue path only by set guest to do rxonly, and compare
> the mpps by show port stats all in the guest.
>

Tested with testpmd, host: txonly, guest: rxonly

size (bytes)   improvement (%)
64             4.12
128            6
256            2.65
512            -1.12
1024           -7.02
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Thursday, September 22, 2016 5:02 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com>
> wrote:
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> >> Sent: Thursday, September 22, 2016 1:48 PM
> >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> >> <maxime.coquelin@redhat.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 22 September 2016 at 10:29, Yuanhan Liu
> >> <yuanhan.liu@linux.intel.com> wrote:
> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> >> > My setup consists of one host running a guest.
> >> >> >> > The guest generates as much 64bytes packets as possible using
> >> >> >>
> >> >> >> Have you tested with other different packet size?
> >> >> >> My testing shows that performance is dropping when packet size is
> >> >> >> more than 256.
> >> >> >
> >> >> >
> >> >> > Hi Jianbo,
> >> >> >
> >> >> > Thanks for reporting this.
> >> >> >
> >> >> > 1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >> >
> >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> >>
> >> >> > 2. Could you please specify what CPU you're running? Is it Haswell
> >> >> >    or Ivy Bridge?
> >> >> >
> >> It's an ARM server.
> >>
> >> >> > 3. How many percentage of drop are you seeing?
> >>
> >> The testing result:
> >>
> >> size (bytes)   improvement (%)
> >> 64             3.92
> >> 128            11.51
> >> 256            24.16
> >> 512            -13.79
> >> 1024           -22.51
> >> 1500           -12.22
> >>
> >> A correction is that performance is dropping if byte size is larger than 512.
> >
> >
> > Jianbo,
> >
> > Could you please verify does this patch really cause enqueue perf to drop?
> >
> > You can test the enqueue path only by set guest to do rxonly, and compare
> > the mpps by show port stats all in the guest.
> >
> >
> Tested with testpmd, host: txonly, guest: rxonly
>
> size (bytes)   improvement (%)
> 64             4.12
> 128            6
> 256            2.65
> 512            -1.12
> 1024           -7.02

I think your numbers are a little bit hard to understand for me; this
patch's optimization contains 2 parts:

1. ring operation: works for both mrg_rxbuf on and off

2. remote write ordering: works for mrg_rxbuf=on only

So, for mrg_rxbuf=off, if this patch is good for 64B packets, then it
shouldn't do anything bad for larger packets.

This is the gain on x86 platform: host iofwd between nic and vhost,
guest rxonly.

size (bytes)   nic2vm enhancement
64             21.83%
128            16.97%
256            6.34%
512            0.01%
1024           0.00%

I suspect there's some complication in ARM's micro-arch.

Could you try v6 and apply all patches except the last one:
[PATCH v6 6/6] vhost: optimize cache access

And see if there's still perf drop?

Thanks
Zhihong
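[Editor's note] The "desc prefetch" part of the series ([PATCH v6 4/6] vhost: add desc prefetch) mentioned in this exchange amounts to touching the next descriptor's cache line one iteration ahead, so its miss overlaps with useful work. A minimal illustrative sketch follows; the `struct desc` layout mirrors a virtio descriptor but the names and the `process_descs` helper are stand-ins, not the DPDK code (DPDK would use `rte_prefetch0`, shown here via the GCC/Clang builtin):

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified descriptor, mirroring struct vring_desc. */
struct desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Walk a batch of descriptors, prefetching desc[i + 1] while desc[i]
 * is being handled.  __builtin_prefetch emits a prefetch hint (or
 * nothing on targets without one) and never changes the result. */
static uint64_t
process_descs(const struct desc *descs, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (i + 1 < n)
			__builtin_prefetch(&descs[i + 1], 0 /* read */, 3 /* keep in cache */);
		total += descs[i].len; /* stand-in for the real copy work */
	}
	return total;
}
```

Because the prefetch is purely a hint, this transformation is always safe; it only pays off when the per-descriptor work is long enough to hide the memory latency, which is why its benefit can differ between x86 and ARM micro-architectures.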
On 22 September 2016 at 18:04, Wang, Zhihong <zhihong.wang@intel.com> wrote:
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Thursday, September 22, 2016 5:02 PM
>> To: Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin
>> <maxime.coquelin@redhat.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>>
>> On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com>
>> wrote:
>> >
>> >> -----Original Message-----
>> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> >> Sent: Thursday, September 22, 2016 1:48 PM
>> >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
>> >> <maxime.coquelin@redhat.com>; dev@dpdk.org
>> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>> >>
>> >> On 22 September 2016 at 10:29, Yuanhan Liu
>> >> <yuanhan.liu@linux.intel.com> wrote:
>> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
>> >> >> >> > My setup consists of one host running a guest.
>> >> >> >> > The guest generates as much 64bytes packets as possible using
>> >> >> >>
>> >> >> >> Have you tested with other different packet size?
>> >> >> >> My testing shows that performance is dropping when packet size is
>> >> >> >> more than 256.
>> >> >> >
>> >> >> >
>> >> >> > Hi Jianbo,
>> >> >> >
>> >> >> > Thanks for reporting this.
>> >> >> >
>> >> >> > 1. Are you running the vector frontend with mrg_rxbuf=off?
>> >> >> >
>> >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
>> >>
>> >> >> > 2. Could you please specify what CPU you're running? Is it Haswell
>> >> >> >    or Ivy Bridge?
>> >> >> >
>> >> It's an ARM server.
>> >>
>> >> >> > 3. How many percentage of drop are you seeing?
>> >>
>> >> The testing result:
>> >>
>> >> size (bytes)   improvement (%)
>> >> 64             3.92
>> >> 128            11.51
>> >> 256            24.16
>> >> 512            -13.79
>> >> 1024           -22.51
>> >> 1500           -12.22
>> >>
>> >> A correction is that performance is dropping if byte size is larger than 512.
>> >
>> >
>> > Jianbo,
>> >
>> > Could you please verify does this patch really cause enqueue perf to drop?
>> >
>> > You can test the enqueue path only by set guest to do rxonly, and compare
>> > the mpps by show port stats all in the guest.
>> >
>> >
>> Tested with testpmd, host: txonly, guest: rxonly
>>
>> size (bytes)   improvement (%)
>> 64             4.12
>> 128            6
>> 256            2.65
>> 512            -1.12
>> 1024           -7.02
>
>
> I think your numbers are a little bit hard to understand for me; this
> patch's optimization contains 2 parts:
>
> 1. ring operation: works for both mrg_rxbuf on and off
>
> 2. remote write ordering: works for mrg_rxbuf=on only
>
> So, for mrg_rxbuf=off, if this patch is good for 64B packets, then it
> shouldn't do anything bad for larger packets.
>
> This is the gain on x86 platform: host iofwd between nic and vhost,
> guest rxonly.
>
> size (bytes)   nic2vm enhancement
> 64             21.83%
> 128            16.97%
> 256            6.34%
> 512            0.01%
> 1024           0.00%
>

I booted up a VM with 2 virtual ports, and stressed the traffic between them.
First, I stressed with pktgen-dpdk in the VM, and did iofwd in the host.
Then, as you told, I did rxonly in the VM, and txonly in the host.

> I suspect there's some complication in ARM's micro-arch.
>
> Could you try v6 and apply all patches except the last one:
> [PATCH v6 6/6] vhost: optimize cache access
>
> And see if there's still perf drop?
>

The last patch can improve the performance. The drop is actually
caused by the second patch.

Jianbo
> -----Original Message----- > From: Jianbo Liu [mailto:jianbo.liu@linaro.org] > Sent: Thursday, September 22, 2016 10:42 PM > To: Wang, Zhihong <zhihong.wang@intel.com> > Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin > <maxime.coquelin@redhat.com>; dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue > > On 22 September 2016 at 18:04, Wang, Zhihong <zhihong.wang@intel.com> > wrote: > > > > > >> -----Original Message----- > >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org] > >> Sent: Thursday, September 22, 2016 5:02 PM > >> To: Wang, Zhihong <zhihong.wang@intel.com> > >> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin > >> <maxime.coquelin@redhat.com>; dev@dpdk.org > >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue > >> > >> On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com> > >> wrote: > >> > > >> > > >> >> -----Original Message----- > >> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org] > >> >> Sent: Thursday, September 22, 2016 1:48 PM > >> >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com> > >> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin > >> >> <maxime.coquelin@redhat.com>; dev@dpdk.org > >> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue > >> >> > >> >> On 22 September 2016 at 10:29, Yuanhan Liu > >> <yuanhan.liu@linux.intel.com> > >> >> wrote: > >> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote: > >> >> >> >> > My setup consists of one host running a guest. > >> >> >> >> > The guest generates as much 64bytes packets as possible using > >> >> >> >> > >> >> >> >> Have you tested with other different packet size? > >> >> >> >> My testing shows that performance is dropping when packet size is > >> >> more > >> >> >> >> than 256. > >> >> >> > > >> >> >> > > >> >> >> > Hi Jianbo, > >> >> >> > > >> >> >> > Thanks for reporting this. > >> >> >> > > >> >> >> > 1. Are you running the vector frontend with mrg_rxbuf=off? 
> >> >> >> > > >> >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD. > >> >> > >> >> >> > 2. Could you please specify what CPU you're running? Is it Haswell > >> >> >> > or Ivy Bridge? > >> >> >> > > >> >> It's an ARM server. > >> >> > >> >> >> > 3. How many percentage of drop are you seeing? > >> >> The testing result: > >> >> size (bytes) improvement (%) > >> >> 64 3.92 > >> >> 128 11.51 > >> >> 256 24.16 > >> >> 512 -13.79 > >> >> 1024 -22.51 > >> >> 1500 -12.22 > >> >> A correction is that performance is dropping if byte size is larger than 512. > >> > > >> > > >> > Jianbo, > >> > > >> > Could you please verify does this patch really cause enqueue perf to drop? > >> > > >> > You can test the enqueue path only by set guest to do rxonly, and compare > >> > the mpps by show port stats all in the guest. > >> > > >> > > >> Tested with testpmd, host: txonly, guest: rxonly > >> size (bytes) improvement (%) > >> 64 4.12 > >> 128 6 > >> 256 2.65 > >> 512 -1.12 > >> 1024 -7.02 > > > > > > > > I think your number is little bit hard to understand for me, this patch's > > optimization contains 2 parts: > > > > 1. ring operation: works for both mrg_rxbuf on and off > > > > 2. remote write ordering: works for mrg_rxbuf=on only > > > > So, for mrg_rxbuf=off, if this patch is good for 64B packets, then it > > shouldn't do anything bad for larger packets. > > > > This is the gain on x86 platform: host iofwd between nic and vhost, > > guest rxonly. > > > > nic2vm enhancement > > 64 21.83% > > 128 16.97% > > 256 6.34% > > 512 0.01% > > 1024 0.00% > > > I bootup a VM with 2 virtual port, and stress the traffic between them. > First, I stressed with pktgen-dpdk in VM, and did iofwd in host. > Then, as you told, I did rxonly in VM, and txonly in host. > > > I suspect there's some complication in ARM's micro-arch. 
> > > > Could you try v6 and apply all patches except the the last one: > > [PATCH v6 6/6] vhost: optimize cache access > > > > And see if there's still perf drop? > > > The last patch can improve the performance. The drop is actually > caused by the second patch. This is expected because the 2nd patch is just a baseline and all optimization patches are organized in the rest of this patch set. I think you can do bottleneck analysis on ARM to see what's slowing down the perf; there might be some micro-arch complications there, most likely in memcpy. Do you use glibc's memcpy? I suggest hand-crafting one on your own. Could you also publish the mrg_rxbuf=on data? It's the more widely used case in terms of spec integrity. Thanks Zhihong > > Jianbo
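To make the hand-crafted memcpy suggestion above concrete: the idea is a copy routine specialized for the sizes vhost actually moves, instead of glibc's general-purpose size dispatch. A minimal sketch of the technique (a hypothetical helper, not DPDK's actual rte_memcpy, which uses wider vector moves):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical specialized copy: vhost enqueue mostly moves small,
 * packet-sized buffers. Sizes in [8, 16] are covered by two (possibly
 * overlapping) fixed 8-byte moves, which compilers lower to plain
 * register loads/stores; everything else falls back to memcpy. */
static inline void
vhost_copy_sketch(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	uint64_t head, tail;

	if (n >= 8 && n <= 16) {
		memcpy(&head, s, 8);         /* first 8 bytes */
		memcpy(&tail, s + n - 8, 8); /* last 8 bytes, may overlap head */
		memcpy(d, &head, 8);
		memcpy(d + n - 8, &tail, 8);
		return;
	}
	memcpy(d, s, n); /* general fallback */
}
```

A real implementation would extend the same pattern with 16/32/64-byte vector moves (NEON on ARM, SSE/AVX on x86); the point is that the fixed sizes are visible to the compiler, so the per-call size dispatch of a generic memcpy disappears.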
On 23 September 2016 at 10:56, Wang, Zhihong <zhihong.wang@intel.com> wrote: ..... > This is expected because the 2nd patch is just a baseline and all optimization > patches are organized in the rest of this patch set. > > I think you can do bottleneck analysis on ARM to see what's slowing down the > perf, there might be some micro-arch complications there, mostly likely in > memcpy. > > Do you use glibc's memcpy? I suggest to hand-crafted it on your own. > > Could you publish the mrg_rxbuf=on data also? Since it's more widely used > in terms of spec integrity. > I don't think it will be helpful for you, considering the differences between x86 and arm. So please move on with this patchset... Thanks! Jianbo
2016-09-23 18:41, Jianbo Liu: > On 23 September 2016 at 10:56, Wang, Zhihong <zhihong.wang@intel.com> wrote: > ..... > > This is expected because the 2nd patch is just a baseline and all optimization > > patches are organized in the rest of this patch set. > > > > I think you can do bottleneck analysis on ARM to see what's slowing down the > > perf, there might be some micro-arch complications there, mostly likely in > > memcpy. > > > > Do you use glibc's memcpy? I suggest to hand-crafted it on your own. > > > > Could you publish the mrg_rxbuf=on data also? Since it's more widely used > > in terms of spec integrity. > > > I don't think it will be helpful for you, considering the differences > between x86 and arm. > So please move on with this patchset... Jianbo, I don't understand. You said that the 2nd patch is a regression: - volatile uint16_t last_used_idx; + uint16_t last_used_idx; And the overall series leads to a performance regression for packets > 512 B, right? But we don't know whether you have tested v6 or not. Zhihong talked about some improvements possible in rte_memcpy. ARM64 is using libc memcpy in rte_memcpy. Now you seem to give up. Does it mean you accept having a regression in the 16.11 release? Are you working on rte_memcpy?
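As background on the quoted `volatile` diff: dropping `volatile` from `last_used_idx` is only safe when each cross-thread publication of the index is paired with an explicit barrier, which is the pattern the rewritten enqueue relies on. A simplified stand-in (hypothetical types, GCC builtins instead of DPDK's `rte_smp_wmb()`) showing the idea:

```c
#include <stdint.h>

/* Simplified stand-in for a vring used-ring producer. Without volatile,
 * the compiler may keep last_used_idx in a register across the batch
 * loop; correctness then depends on an explicit release fence ordering
 * the ring writes before the index store that the consumer polls. */
struct used_ring {
	uint32_t ring[256];
	uint16_t idx; /* published to the other side */
};

static uint16_t last_used_idx; /* producer-private copy, no volatile */

static void
publish_batch(struct used_ring *r, const uint32_t *entries, int n)
{
	for (int i = 0; i < n; i++)
		r->ring[(last_used_idx + i) & 255] = entries[i];

	/* Order the ring writes before the index update. */
	__atomic_thread_fence(__ATOMIC_RELEASE);

	last_used_idx = (uint16_t)(last_used_idx + n);
	__atomic_store_n(&r->idx, last_used_idx, __ATOMIC_RELAXED);
}
```

The win is that the index is only stored once per batch instead of once per descriptor, at the cost of having to reason explicitly about ordering.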
> -----Original Message----- > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > Sent: Friday, September 23, 2016 9:41 PM > To: Jianbo Liu <jianbo.liu@linaro.org> > Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu > <yuanhan.liu@linux.intel.com>; Maxime Coquelin > <maxime.coquelin@redhat.com> > Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue > > 2016-09-23 18:41, Jianbo Liu: > > On 23 September 2016 at 10:56, Wang, Zhihong <zhihong.wang@intel.com> > wrote: > > ..... > > > This is expected because the 2nd patch is just a baseline and all optimization > > > patches are organized in the rest of this patch set. > > > > > > I think you can do bottleneck analysis on ARM to see what's slowing down the > > > perf, there might be some micro-arch complications there, mostly likely in > > > memcpy. > > > > > > Do you use glibc's memcpy? I suggest to hand-crafted it on your own. > > > > > > Could you publish the mrg_rxbuf=on data also? Since it's more widely used > > > in terms of spec integrity. > > > > > I don't think it will be helpful for you, considering the differences > > between x86 and arm. Hi Jianbo, This patch does help in ARM for small packets like 64B sized ones, this actually proves the similarity between x86 and ARM in terms of caching optimization in this patch. My estimation is based on: 1. The last patch are for mrg_rxbuf=on, and since you said it helps perf, we can ignore it for now when we discuss mrg_rxbuf=off 2. Vhost enqueue perf = Ring overhead + Virtio header overhead + Data memcpy overhead 3. This patch helps small packets traffic, which means it helps ring + virtio header operations 4. So, when you say perf drop when packet size larger than 512B, this is most likely caused by memcpy in ARM not working well with this patch I'm not saying glibc's memcpy is not good enough, it's just that this is a rather special use case. 
And since we see that specialized memcpy + this patch gives significantly better performance than other combinations on x86, we suggest hand-crafting a specialized memcpy for it. Of course on ARM this is still just my speculation, and we need to either prove it or find the actual root cause. It would be **REALLY HELPFUL** if you could help to test this patch on ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful to ARM at all, since mrg_rxbuf=on is the more widely used case. Thanks Zhihong > > So please move on with this patchset... > > Jianbo, > I don't understand. > You said that the 2nd patch is a regression: > - volatile uint16_t last_used_idx; > + uint16_t last_used_idx; > > And the overrall series lead to performance regression > for packets > 512 B, right? > But we don't know wether you have tested the v6 or not. > > Zhihong talked about some improvements possible in rte_memcpy. > ARM64 is using libc memcpy in rte_memcpy. > > Now you seem to give up. > Does it mean you accept having a regression in 16.11 release? > Are you working on rte_memcpy?
Hi Thomas, On 23 September 2016 at 21:41, Thomas Monjalon <thomas.monjalon@6wind.com> wrote: > 2016-09-23 18:41, Jianbo Liu: >> On 23 September 2016 at 10:56, Wang, Zhihong <zhihong.wang@intel.com> wrote: >> ..... >> > This is expected because the 2nd patch is just a baseline and all optimization >> > patches are organized in the rest of this patch set. >> > >> > I think you can do bottleneck analysis on ARM to see what's slowing down the >> > perf, there might be some micro-arch complications there, mostly likely in >> > memcpy. >> > >> > Do you use glibc's memcpy? I suggest to hand-crafted it on your own. >> > >> > Could you publish the mrg_rxbuf=on data also? Since it's more widely used >> > in terms of spec integrity. >> > >> I don't think it will be helpful for you, considering the differences >> between x86 and arm. >> So please move on with this patchset... > > Jianbo, > I don't understand. > You said that the 2nd patch is a regression: > - volatile uint16_t last_used_idx; > + uint16_t last_used_idx; > No, I meant "vhost: rewrite enqueue". > And the overrall series lead to performance regression > for packets > 512 B, right? > But we don't know wether you have tested the v6 or not. Yes, I tested v6, and found performance regression for size >=512B. > > Zhihong talked about some improvements possible in rte_memcpy. > ARM64 is using libc memcpy in rte_memcpy. > > Now you seem to give up. > Does it mean you accept having a regression in 16.11 release? > Are you working on rte_memcpy? This patchset actually improves performance according to Zhihong's result on x86 platform. And I also get an improvement, at least with small-sized packets, on ARM. I don't want to give up, but I need more time to find out the reason for the regression. I think rte_memcpy definitely is one of the ways to improve performance, but could it be the reason for the regression?
On 25 September 2016 at 13:41, Wang, Zhihong <zhihong.wang@intel.com> wrote: > > >> -----Original Message----- >> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] >> Sent: Friday, September 23, 2016 9:41 PM >> To: Jianbo Liu <jianbo.liu@linaro.org> >> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu >> <yuanhan.liu@linux.intel.com>; Maxime Coquelin >> <maxime.coquelin@redhat.com> .... > This patch does help in ARM for small packets like 64B sized ones, > this actually proves the similarity between x86 and ARM in terms > of caching optimization in this patch. > > My estimation is based on: > > 1. The last patch are for mrg_rxbuf=on, and since you said it helps > perf, we can ignore it for now when we discuss mrg_rxbuf=off > > 2. Vhost enqueue perf = > Ring overhead + Virtio header overhead + Data memcpy overhead > > 3. This patch helps small packets traffic, which means it helps > ring + virtio header operations > > 4. So, when you say perf drop when packet size larger than 512B, > this is most likely caused by memcpy in ARM not working well > with this patch > > I'm not saying glibc's memcpy is not good enough, it's just that > this is a rather special use case. And since we see specialized > memcpy + this patch give better performance than other combinations > significantly on x86, we suggest to hand-craft a specialized memcpy > for it. > > Of course on ARM this is still just my speculation, and we need to > either prove it or find the actual root cause. > > It can be **REALLY HELPFUL** if you could help to test this patch on > ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful > to ARM at all, since mrg_rxbuf=on the more widely used cases. > Actually it's worse than mrg_rxbuf=off.
> -----Original Message----- > From: Jianbo Liu [mailto:jianbo.liu@linaro.org] > Sent: Monday, September 26, 2016 1:13 PM > To: Wang, Zhihong <zhihong.wang@intel.com> > Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org; Yuanhan > Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin > <maxime.coquelin@redhat.com> > Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue > > On 25 September 2016 at 13:41, Wang, Zhihong <zhihong.wang@intel.com> > wrote: > > > > > >> -----Original Message----- > >> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > >> Sent: Friday, September 23, 2016 9:41 PM > >> To: Jianbo Liu <jianbo.liu@linaro.org> > >> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu > >> <yuanhan.liu@linux.intel.com>; Maxime Coquelin > >> <maxime.coquelin@redhat.com> > .... > > This patch does help in ARM for small packets like 64B sized ones, > > this actually proves the similarity between x86 and ARM in terms > > of caching optimization in this patch. > > > > My estimation is based on: > > > > 1. The last patch are for mrg_rxbuf=on, and since you said it helps > > perf, we can ignore it for now when we discuss mrg_rxbuf=off > > > > 2. Vhost enqueue perf = > > Ring overhead + Virtio header overhead + Data memcpy overhead > > > > 3. This patch helps small packets traffic, which means it helps > > ring + virtio header operations > > > > 4. So, when you say perf drop when packet size larger than 512B, > > this is most likely caused by memcpy in ARM not working well > > with this patch > > > > I'm not saying glibc's memcpy is not good enough, it's just that > > this is a rather special use case. And since we see specialized > > memcpy + this patch give better performance than other combinations > > significantly on x86, we suggest to hand-craft a specialized memcpy > > for it. > > > > Of course on ARM this is still just my speculation, and we need to > > either prove it or find the actual root cause. 
> > > > It can be **REALLY HELPFUL** if you could help to test this patch on > > ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful > > to ARM at all, since mrg_rxbuf=on the more widely used cases. > > > Actually it's worse than mrg_rxbuf=off. I mean compare the perf of original vs. original + patch with mrg_rxbuf turned on. Is there any perf improvement?
On 22 September 2016 at 11:01, Jianbo Liu <jianbo.liu@linaro.org> wrote: > Tested with testpmd, host: txonly, guest: rxonly > size (bytes) improvement (%) > 64 4.12 > 128 6 > 256 2.65 > 512 -1.12 > 1024 -7.02 > Have you considered testing with more diverse workloads e.g. mixed packet sizes that are not always multiples of the cache line & register sizes?
On 26 September 2016 at 13:25, Wang, Zhihong <zhihong.wang@intel.com> wrote: > > >> -----Original Message----- >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org] >> Sent: Monday, September 26, 2016 1:13 PM >> To: Wang, Zhihong <zhihong.wang@intel.com> >> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org; Yuanhan >> Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin >> <maxime.coquelin@redhat.com> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue >> >> On 25 September 2016 at 13:41, Wang, Zhihong <zhihong.wang@intel.com> >> wrote: >> > >> > >> >> -----Original Message----- >> >> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] >> >> Sent: Friday, September 23, 2016 9:41 PM >> >> To: Jianbo Liu <jianbo.liu@linaro.org> >> >> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan Liu >> >> <yuanhan.liu@linux.intel.com>; Maxime Coquelin >> >> <maxime.coquelin@redhat.com> >> .... >> > This patch does help in ARM for small packets like 64B sized ones, >> > this actually proves the similarity between x86 and ARM in terms >> > of caching optimization in this patch. >> > >> > My estimation is based on: >> > >> > 1. The last patch are for mrg_rxbuf=on, and since you said it helps >> > perf, we can ignore it for now when we discuss mrg_rxbuf=off >> > >> > 2. Vhost enqueue perf = >> > Ring overhead + Virtio header overhead + Data memcpy overhead >> > >> > 3. This patch helps small packets traffic, which means it helps >> > ring + virtio header operations >> > >> > 4. So, when you say perf drop when packet size larger than 512B, >> > this is most likely caused by memcpy in ARM not working well >> > with this patch >> > >> > I'm not saying glibc's memcpy is not good enough, it's just that >> > this is a rather special use case. And since we see specialized >> > memcpy + this patch give better performance than other combinations >> > significantly on x86, we suggest to hand-craft a specialized memcpy >> > for it. 
>> > >> > Of course on ARM this is still just my speculation, and we need to >> > either prove it or find the actual root cause. >> > >> > It can be **REALLY HELPFUL** if you could help to test this patch on >> > ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful >> > to ARM at all, since mrg_rxbuf=on the more widely used cases. >> > >> Actually it's worse than mrg_rxbuf=off. > > I mean compare the perf of original vs. original + patch with > mrg_rxbuf turned on. Is there any perf improvement? > Yes, orig + patch + on is better than orig + on, but orig + patch + on is worse than orig + patch + off.
On 26 September 2016 at 13:37, Luke Gorrie <luke@snabb.co> wrote: > On 22 September 2016 at 11:01, Jianbo Liu <jianbo.liu@linaro.org> wrote: >> >> Tested with testpmd, host: txonly, guest: rxonly >> size (bytes) improvement (%) >> 64 4.12 >> 128 6 >> 256 2.65 >> 512 -1.12 >> 1024 -7.02 > > > Have you considered testing with more diverse workloads e.g. mixed packet > sizes that are not always multiples of the cache line & register sizes? > No. Can testpmd stress performance with mixed sizes?
> -----Original Message----- > From: Jianbo Liu [mailto:jianbo.liu@linaro.org] > Sent: Monday, September 26, 2016 1:39 PM > To: Wang, Zhihong <zhihong.wang@intel.com> > Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org; Yuanhan > Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin > <maxime.coquelin@redhat.com> > Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue > > On 26 September 2016 at 13:25, Wang, Zhihong <zhihong.wang@intel.com> > wrote: > > > > > >> -----Original Message----- > >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org] > >> Sent: Monday, September 26, 2016 1:13 PM > >> To: Wang, Zhihong <zhihong.wang@intel.com> > >> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org; > Yuanhan > >> Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin > >> <maxime.coquelin@redhat.com> > >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue > >> > >> On 25 September 2016 at 13:41, Wang, Zhihong <zhihong.wang@intel.com> > >> wrote: > >> > > >> > > >> >> -----Original Message----- > >> >> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > >> >> Sent: Friday, September 23, 2016 9:41 PM > >> >> To: Jianbo Liu <jianbo.liu@linaro.org> > >> >> Cc: dev@dpdk.org; Wang, Zhihong <zhihong.wang@intel.com>; Yuanhan > Liu > >> >> <yuanhan.liu@linux.intel.com>; Maxime Coquelin > >> >> <maxime.coquelin@redhat.com> > >> .... > >> > This patch does help in ARM for small packets like 64B sized ones, > >> > this actually proves the similarity between x86 and ARM in terms > >> > of caching optimization in this patch. > >> > > >> > My estimation is based on: > >> > > >> > 1. The last patch are for mrg_rxbuf=on, and since you said it helps > >> > perf, we can ignore it for now when we discuss mrg_rxbuf=off > >> > > >> > 2. Vhost enqueue perf = > >> > Ring overhead + Virtio header overhead + Data memcpy overhead > >> > > >> > 3. 
This patch helps small packets traffic, which means it helps > >> > ring + virtio header operations > >> > > >> > 4. So, when you say perf drop when packet size larger than 512B, > >> > this is most likely caused by memcpy in ARM not working well > >> > with this patch > >> > > >> > I'm not saying glibc's memcpy is not good enough, it's just that > >> > this is a rather special use case. And since we see specialized > >> > memcpy + this patch give better performance than other combinations > >> > significantly on x86, we suggest to hand-craft a specialized memcpy > >> > for it. > >> > > >> > Of course on ARM this is still just my speculation, and we need to > >> > either prove it or find the actual root cause. > >> > > >> > It can be **REALLY HELPFUL** if you could help to test this patch on > >> > ARM for mrg_rxbuf=on cases to see if this patch is in fact helpful > >> > to ARM at all, since mrg_rxbuf=on the more widely used cases. > >> > > >> Actually it's worse than mrg_rxbuf=off. > > > > I mean compare the perf of original vs. original + patch with > > mrg_rxbuf turned on. Is there any perf improvement? > > > Yes, orig + patch + on is better than orig + on, but orig + patch + on > is worse than orig + patch + off. Hi Jianbo, That's the way it is for virtio: if you compare with the current enqueue code, the mrg on perf is even slower. We should compare: 1. mrg on: orig vs. orig + patch 2. mrg off: orig vs. orig + patch There's more memory touched here and in the frontend, which brings down the performance when mrg is on. Finally, even though mrg on is slower, it's still the mainstream use case as far as I know. Thanks Zhihong
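For anyone wanting to run the two comparisons listed above, mrg_rxbuf is negotiated per device and can be toggled on the QEMU command line. An illustrative vhost-user fragment (socket path and ids are placeholders; guest memory must also be shared with the backend, e.g. via a memory-backend-file with share=on):

```
-chardev socket,id=char0,path=/tmp/vhost-user.sock \
-netdev type=vhost-user,id=net0,chardev=char0 \
-device virtio-net-pci,netdev=net0,mrg_rxbuf=on
```

Switching `mrg_rxbuf=on` to `mrg_rxbuf=off` gives the second configuration while keeping everything else identical.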
On Thu, Sep 22, 2016 at 01:47:45PM +0800, Jianbo Liu wrote: > On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote: > > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote: > >> >> > My setup consists of one host running a guest. > >> >> > The guest generates as much 64bytes packets as possible using > >> >> > >> >> Have you tested with other different packet size? > >> >> My testing shows that performance is dropping when packet size is more > >> >> than 256. > >> > > >> > > >> > Hi Jianbo, > >> > > >> > Thanks for reporting this. > >> > > >> > 1. Are you running the vector frontend with mrg_rxbuf=off? > >> > > Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD. > > >> > 2. Could you please specify what CPU you're running? Is it Haswell > >> > or Ivy Bridge? > >> > > It's an ARM server. > > >> > 3. How many percentage of drop are you seeing? > The testing result: > size (bytes) improvement (%) > 64 3.92 > 128 11.51 > 256 24.16 > 512 -13.79 > 1024 -22.51 > 1500 -12.22 > A correction is that performance is dropping if byte size is larger than 512. I have thought of this twice. Unfortunately, I think I may need to NACK this series. Merging two code paths into one is really good: as you stated, it improves the maintainability. But only if we see no performance regression on both paths after the refactor. Unfortunately, that's not the case here: it hurts the performance for one code path (non-mrg Rx). That makes me think we perhaps should not do the code path merge at all. I think that also aligns with what you have said before (internally): we could do the merge if it gives comparable performance before and after that. Besides that, I don't quite like the way you did it in patch 2 (rewrite enqueue): you made a lot of changes in one patch. That means if something goes wrong, it is hard to narrow down which change introduced the regression. Sadly, that's exactly what we met here. Weeks have passed, and I see no progress. 
That's the reason we like the idea of "one patch only does one thing, an atomic thing". So I will apply the first patch (it's a bug fixing patch) and ask you to refactor the rest, without the code path merge. I think we could still have a maintainable code base if we introduce more common helper functions that can be used on both Rx paths, or even on the Tx path (such as update_used_ring, or shadow_used_ring). It's a bit late for too many changes for v16.11. I think you could just grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path, if that also helps the performance? Let us handle the rest in the next release, such as the shadow used ring. Thanks. --yliu
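To sketch what such shared helpers could look like: a shadow used ring stages updates in a local buffer and flushes them to the guest-visible ring in one pass, so both Rx paths could reuse the same batching logic. A minimal sketch under assumed, simplified types (hypothetical structures, not the actual vhost ones; barriers omitted):

```c
#include <stdint.h>
#include <string.h>

#define RING_SIZE 256

/* Simplified stand-ins for the virtqueue fields involved. */
struct vring_used_elem { uint32_t id; uint32_t len; };

struct vq_sketch {
	struct vring_used_elem used[RING_SIZE];   /* guest-visible used ring */
	uint16_t used_idx;
	struct vring_used_elem shadow[RING_SIZE]; /* local staging buffer */
	uint16_t shadow_count;
};

/* Stage one completed descriptor without touching guest memory yet. */
static void
shadow_used_ring(struct vq_sketch *vq, uint32_t desc_id, uint32_t len)
{
	vq->shadow[vq->shadow_count].id = desc_id;
	vq->shadow[vq->shadow_count].len = len;
	vq->shadow_count++;
}

/* Copy staged entries to the guest-visible ring in one burst, then
 * advance the index (a write barrier would precede the index update
 * in real code). */
static void
flush_shadow_used_ring(struct vq_sketch *vq)
{
	for (uint16_t i = 0; i < vq->shadow_count; i++)
		vq->used[(vq->used_idx + i) % RING_SIZE] = vq->shadow[i];
	vq->used_idx = (uint16_t)(vq->used_idx + vq->shadow_count);
	vq->shadow_count = 0;
}
```

The benefit is one contiguous burst of writes to guest memory per batch instead of scattered per-descriptor updates, which is exactly the kind of logic both the mrg and non-mrg paths could share without merging the paths themselves.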
> -----Original Message----- > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com] > Sent: Wednesday, October 12, 2016 10:53 AM > To: Wang, Zhihong <zhihong.wang@intel.com>; Jianbo Liu <jianbo.liu@linaro.org> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org; Thomas > Monjalon <thomas.monjalon@6wind.com> > Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue > > On Thu, Sep 22, 2016 at 01:47:45PM +0800, Jianbo Liu wrote: > > On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com> > wrote: > > > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote: > > >> >> > My setup consists of one host running a guest. > > >> >> > The guest generates as much 64bytes packets as possible using > > >> >> > > >> >> Have you tested with other different packet size? > > >> >> My testing shows that performance is dropping when packet size is more > > >> >> than 256. > > >> > > > >> > > > >> > Hi Jianbo, > > >> > > > >> > Thanks for reporting this. > > >> > > > >> > 1. Are you running the vector frontend with mrg_rxbuf=off? > > >> > > > Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD. > > > > >> > 2. Could you please specify what CPU you're running? Is it Haswell > > >> > or Ivy Bridge? > > >> > > > It's an ARM server. > > > > >> > 3. How many percentage of drop are you seeing? > > The testing result: > > size (bytes) improvement (%) > > 64 3.92 > > 128 11.51 > > 256 24.16 > > 512 -13.79 > > 1024 -22.51 > > 1500 -12.22 > > A correction is that performance is dropping if byte size is larger than 512. > > I have thought of this twice. Unfortunately, I think I may need NACK this > series. > > Merging two code path into one is really good: as you stated, it improves > the maintainability. But only if we see no performance regression on both > path after the refactor. Unfortunately, that's not the case here: it hurts > the performance for one code path (non-mrg Rx). 
> > That makes me think we may should not do the code path merge at all. I think > that also aligns with what you have said before (internally): we could do the > merge if it gives comparable performance before and after that. > > Besides that, I don't quite like the way you did in patch 2 (rewrite enqueue): > you made a lot of changes in one patch. That means if something wrong > happened, > it is hard to narrow down which change introduces that regression. Badly, > that's exactly what we met here. Weeks have been passed, I see no progress. > > That's the reason we like the idea of "one patch only does one thing, an > atomic thing". Yuanhan, folks, Thanks for the analysis. I disagree here though. I analyze, develop, and benchmark on x86 platforms, where this patch works great. I've been trying to analyze on ARM too, but it takes time and I've had a busy schedule. Also, since the ARM perf issue only came up at v6, I might not be able to make it in time. However, that's what I have to do for this patch to be merged in this or the next release. In the meantime, may I suggest we consider the possibility of having dedicated code for **perf-critical paths** for different kinds of architecture? It can be hard for a person to have both the knowledge and the development environment for multiple archs at the same time. Moreover, different optimization techniques might be required for different archs, so it's hard and unnecessary to make a function work for all archs; sometimes it's just not the right thing to do. Thanks Zhihong > > So I will apply the first patch (it's a bug fixing patch) and ask you to > refactor the rest, without the code path merge. > > I think we could still have a good maintainability code base if we introduce > more common helper functions that can be used on both Rx path, or even on > Tx path (such as update_used_ring, or shadow_used_ring). > > It's a bit late for too many changes for v16.11. 
I think you could just > grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path, > if that also helps the performance? Let us handle the left in next release, > such as shadow used ring. > > Thanks. > > --yliu
Sorry guys, you lost me in the discussion. Is there some regression only on ARM? Does it need some work specifically on memcpy for ARM, or vhost for ARM? Who can work on ARM optimization? More comments below. 2016-10-12 12:22, Wang, Zhihong: > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com] > > > It's an ARM server. > > > > > > >> > 3. How many percentage of drop are you seeing? > > > The testing result: > > > size (bytes) improvement (%) > > > 64 3.92 > > > 128 11.51 > > > 256 24.16 > > > 512 -13.79 > > > 1024 -22.51 > > > 1500 -12.22 > > > A correction is that performance is dropping if byte size is larger than 512. > > > > I have thought of this twice. Unfortunately, I think I may need NACK this > > series. > > > > Merging two code path into one is really good: as you stated, it improves > > the maintainability. But only if we see no performance regression on both > > path after the refactor. Unfortunately, that's not the case here: it hurts > > the performance for one code path (non-mrg Rx). +1 > > That makes me think we may should not do the code path merge at all. I think > > that also aligns with what you have said before (internally): we could do the > > merge if it gives comparable performance before and after that. > > > > Besides that, I don't quite like the way you did in patch 2 (rewrite enqueue): > > you made a lot of changes in one patch. That means if something wrong > > happened, > > it is hard to narrow down which change introduces that regression. Badly, > > that's exactly what we met here. Weeks have been passed, I see no progress. +1, it is important to have simple patches making changes step by step. > > That's the reason we like the idea of "one patch only does one thing, an > > atomic thing". > > > Yuanhan, folks, > > Thanks for the analysis. I disagree here though. > > I analyze, develop, benchmark on x86 platforms, where this patch > works great. > > I've been trying to analyze on ARM too but it takes time and I've > had a schedule. 
Also since the ARM perf issue comes when it's > v6 already, I might not be able to make it in time. However > that's what I have to do for this patch to be merged in this > or the next release. > > In the meantime, may I suggest we consider the possibility to > have dedicated codes for **perf critical paths** for different > kinds of architecture? Yes, that's what we do in several parts of DPDK. > It can be hard for a person to have both the knowledge and the > development environment for multiple archs at the same time. Yes, we do not expect you to work on ARM. So if nobody works on the ARM issue, you could make 2 code paths in order to allow your optimization for x86 only. But that's not the preferred way. And you must split your rework to better identify which part is a regression on ARM. > Moreover, different optimization techniques might be required for > different archs, so it's hard and unnecessary to make a function > works for all archs, sometimes it's just not the right thing to do. Yes, sometimes. Please help convince us in this case. > > So I will apply the first patch (it's a bug fixing patch) and ask you to > refactor the rest, without the code path merge. > > I think we could still have a good maintainability code base if we introduce > more common helper functions that can be used on both Rx path, or even on > Tx path (such as update_used_ring, or shadow_used_ring). Yes, it is a good step. And the code path merge could be reconsidered later. > > It's a bit late for too many changes for v16.11. I think you could just > > grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path, > > if that also helps the performance? Let us handle the left in next release, > > such as shadow used ring. Thank you
> -----Original Message----- > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com] > Sent: Wednesday, October 12, 2016 11:31 PM > To: Wang, Zhihong <zhihong.wang@intel.com> > Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Jianbo Liu > <jianbo.liu@linaro.org>; Maxime Coquelin <maxime.coquelin@redhat.com>; > dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue > > Sorry guys, you lost me in the discussion. > > Is there some regression only on ARM? ARM is what we see; no info on ppc yet. > Does it need some work specifically on memcpy for ARM, > or vhost for ARM? > Who can work on ARM optimization? These are still open questions. Jianbo, who reported this, doesn't have time for more testing now, according to his reply. I'm trying to do some tests in the hope of identifying the root cause. > > More comments below. > > 2016-10-12 12:22, Wang, Zhihong: > > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com] > > > > It's an ARM server. > > > > > > > > >> > 3. How many percentage of drop are you seeing? > > > > The testing result: > > > > size (bytes) improvement (%) > > > > 64 3.92 > > > > 128 11.51 > > > > 256 24.16 > > > > 512 -13.79 > > > > 1024 -22.51 > > > > 1500 -12.22 > > > > A correction is that performance is dropping if byte size is larger than > 512. > > > > > > I have thought of this twice. Unfortunately, I think I may need NACK this > > > series. > > > > > > Merging two code path into one is really good: as you stated, it improves > > > the maintainability. But only if we see no performance regression on both > > > path after the refactor. Unfortunately, that's not the case here: it hurts > > > the performance for one code path (non-mrg Rx). > > +1 > > > > That makes me think we may should not do the code path merge at all. I > think > > > that also aligns with what you have said before (internally): we could do > the > > > merge if it gives comparable performance before and after that. 
> > > > > > Besides that, I don't quite like the way you did in patch 2 (rewrite enqueue): > > > you made a lot of changes in one patch. That means if something wrong > > > happened, > > > it is hard to narrow down which change introduces that regression. Sadly, > > > that's exactly what we met here. Weeks have passed, and I see no progress. > > +1, it is important to have simple patches making changes step by step. > > > > That's the reason we like the idea of "one patch only does one thing, an > > > atomic thing". > > > > > > Yuanhan, folks, > > > > Thanks for the analysis. I disagree here though. > > > > I analyze, develop, benchmark on x86 platforms, where this patch > > works great. > > > > I've been trying to analyze on ARM too but it takes time and I've > > had a tight schedule. Also since the ARM perf issue comes when it's > > v6 already, I might not be able to make it in time. However > > that's what I have to do for this patch to be merged in this > > or the next release. > > > > In the meantime, may I suggest we consider the possibility to > > have dedicated code for **perf critical paths** for different > > kinds of architecture? > > Yes, that's what we do in several parts of DPDK. > > > It can be hard for a person to have both the knowledge and the > > development environment for multiple archs at the same time. > > Yes, we do not expect you to work on ARM. > So if nobody works on the ARM issue, you could make 2 code paths > in order to allow your optimization for x86 only. > But that's not the preferred way. > And you must split your rework to better identify which part is > a regression on ARM. > > > Moreover, different optimization techniques might be required for > > different archs, so it's hard and unnecessary to make a function > > work for all archs; sometimes it's just not the right thing to do. > > Yes, sometimes. Please help us to be convinced for this case. 
> > > > So I will apply the first patch (it's a bug fixing patch) and ask you to > > refactor the rest, without the code path merge. > > > > I think we could still have a good, maintainable code base if we introduce > > more common helper functions that can be used on both Rx paths, or even on > > the Tx path (such as update_used_ring, or shadow_used_ring). > > Yes, it is a good step. > And the code path merge could be reconsidered later. > > > It's a bit late for too many changes for v16.11. I think you could just > > > grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path, > > > if that also helps the performance? Let us handle the rest in the next release, > > > such as the shadow used ring. > > Thank you
Hi Thomas, On 12 October 2016 at 23:31, Thomas Monjalon <thomas.monjalon@6wind.com> wrote: > Sorry guys, you lost me in the discussion. > > Is there some regression only on ARM? > Does it need some work specifically on memcpy for ARM, I don't know if there is a common way to improve memcpy on different ARM hardware. Even if there is, it could take time. I have tried to do that using NEON (SSE-like) instructions, but without success. > or vhost for ARM? > Who can work on ARM optimization? >
This is a new set of patches to optimize the mergeable Rx code path.
No refactoring (rewrite) was made this time. It just applies some
findings from Zhihong (kudos to him!) that could improve the mergeable
Rx path on the old code.

The two major factors that could improve the performance greatly are:

- copy the virtio header together with the packet data. This removes
  the bubbles between the two copies and optimizes the cache access.

  This is implemented in patch 2 "vhost: optimize cache access"

- shadow the used ring updates and flush them at once. The basic idea
  is to update the used ring in a local buffer and flush it to the
  virtio used ring at once in the end. Again, this is for optimizing
  the cache access.

  This is implemented in patch 5 "vhost: shadow used ring update"

The two optimizations could yield 40+% performance improvement in
micro testing and 20+% in PVP case testing with 64B packet size.

Besides that, there are some tiny optimizations, such as prefetching
the avail ring (patch 6) and retrieving the avail head once (patch 7).

Note: the shadow used ring technique could also be applied to the
non-mrg Rx path (and even the dequeue path). I didn't do that for two
reasons:

- we already update the used ring in batch in both paths: it's just
  not shadowed first.

- it's a bit too late to make many changes at this stage: RC1 is out.

Please help testing.

Thanks.

	--yliu

Cc: Jianbo Liu <jianbo.liu@linaro.org>
---
Yuanhan Liu (4):
  vhost: simplify mergeable Rx vring reservation
  vhost: use last avail idx for avail ring reservation
  vhost: prefetch avail ring
  vhost: retrieve avail head once

Zhihong Wang (3):
  vhost: remove useless volatile
  vhost: optimize cache access
  vhost: shadow used ring update

 lib/librte_vhost/vhost.c      |  13 ++-
 lib/librte_vhost/vhost.h      |   5 +-
 lib/librte_vhost/vhost_user.c |  23 +++-
 lib/librte_vhost/virtio_net.c | 193 +++++++++++++++++++++++++-----------------
 4 files changed, 149 insertions(+), 85 deletions(-)
On 14 October 2016 at 17:34, Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote: > This is a new set of patches to optimize the mergeable Rx code path. > No refactoring (rewrite) was made this time. It just applies some > findings from Zhihong (kudos to him!) that could improve the mergeable > Rx path on the old code. ...... > --- > Yuanhan Liu (4): > vhost: simplify mergeable Rx vring reservation > vhost: use last avail idx for avail ring reservation > vhost: prefetch avail ring > vhost: retrieve avail head once > > Zhihong Wang (3): > vhost: remove useless volatile > vhost: optimize cache access > vhost: shadow used ring update > > lib/librte_vhost/vhost.c | 13 ++- > lib/librte_vhost/vhost.h | 5 +- > lib/librte_vhost/vhost_user.c | 23 +++-- > lib/librte_vhost/virtio_net.c | 193 +++++++++++++++++++++++++----------------- > 4 files changed, 149 insertions(+), 85 deletions(-) > Reviewed-by: Jianbo Liu <jianbo.liu@linaro.org>
Hi Yuanhan, On 10/14/2016 11:34 AM, Yuanhan Liu wrote: > This is a new set of patches to optimize the mergeable Rx code path. > No refactoring (rewrite) was made this time. It just applies some > findings from Zhihong (kudos to him!) that could improve the mergeable > Rx path on the old code. > > The two major factors that could improve the performance greatly are: > > - copy virtio header together with packet data. This could remove > the bubbles between the two copies to optimize the cache access. > > This is implemented in patch 2 "vhost: optimize cache access" > > - shadow used ring update and update them at once > > The basic idea is to update used ring in a local buffer and flush > them to the virtio used ring at once in the end. Again, this is > for optimizing the cache access. > > This is implemented in patch 5 "vhost: shadow used ring update" > > The two optimizations could yield 40+% performance improvement in micro testing > and 20+% in PVP case testing with 64B packet size. > > Besides that, there are some tiny optimizations, such as prefetch > avail ring (patch 6) and retrieve avail head once (patch 7). > > Note: the shadow used ring tech could also be applied to the non-mrg > Rx path (and even the dequeue) path. I didn't do that for two reasons: > > - we already update used ring in batch in both paths: it's not shadowed > first though. > > - it's a bit too late to make many changes at this stage: RC1 is out. > > Please help testing. I tested the following use-cases without noticing any functional problems: - Windows Guests (mergeable ON & OFF, indirect disabled): ping other VM - Linux guests with Kernel driver (mergeable ON & OFF, indirect OFF): iperf between 2 VMs - Linux guest with Virtio PMD (mergeable ON & OFF): testpmd txonly on host, rxonly on guest. Feel free to add my: Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com> Thanks, Maxime
Applied to dpdk-next-virtio. And thanks for testing and reviewing. --yliu On Fri, Oct 14, 2016 at 05:34:31PM +0800, Yuanhan Liu wrote: > This is a new set of patches to optimize the mergeable Rx code path. > No refactoring (rewrite) was made this time. It just applies some > findings from Zhihong (kudos to him!) that could improve the mergeable > Rx path on the old code. > > The two major factors that could improve the performance greatly are: > > - copy virtio header together with packet data. This could remove > the bubbles between the two copies to optimize the cache access. > > This is implemented in patch 2 "vhost: optimize cache access" > > - shadow used ring update and update them at once > > The basic idea is to update used ring in a local buffer and flush > them to the virtio used ring at once in the end. Again, this is > for optimizing the cache access. > > This is implemented in patch 5 "vhost: shadow used ring update" > > The two optimizations could yield 40+% performance improvement in micro testing > and 20+% in PVP case testing with 64B packet size. > > Besides that, there are some tiny optimizations, such as prefetch > avail ring (patch 6) and retrieve avail head once (patch 7). > > Note: the shadow used ring tech could also be applied to the non-mrg > Rx path (and even the dequeue) path. I didn't do that for two reasons: > > - we already update used ring in batch in both paths: it's not shadowed > first though. > > - it's a bit too late to make many changes at this stage: RC1 is out. > > Please help testing. > > Thanks. 
> > --yliu > > Cc: Jianbo Liu <jianbo.liu@linaro.org> > --- > Yuanhan Liu (4): > vhost: simplify mergeable Rx vring reservation > vhost: use last avail idx for avail ring reservation > vhost: prefetch avail ring > vhost: retrieve avail head once > > Zhihong Wang (3): > vhost: remove useless volatile > vhost: optimize cache access > vhost: shadow used ring update > > lib/librte_vhost/vhost.c | 13 ++- > lib/librte_vhost/vhost.h | 5 +- > lib/librte_vhost/vhost_user.c | 23 +++-- > lib/librte_vhost/virtio_net.c | 193 +++++++++++++++++++++++++----------------- > 4 files changed, 149 insertions(+), 85 deletions(-) > > -- > 1.9.0
diff --git a/lib/librte_vhost/vhost-net.h b/lib/librte_vhost/vhost-net.h index 38593a2..a15182c 100644 --- a/lib/librte_vhost/vhost-net.h +++ b/lib/librte_vhost/vhost-net.h @@ -71,7 +71,7 @@ struct vhost_virtqueue { uint32_t size; /* Last index used on the available ring */ - volatile uint16_t last_used_idx; + uint16_t last_used_idx; #define VIRTIO_INVALID_EVENTFD (-1) #define VIRTIO_UNINITIALIZED_EVENTFD (-2) @@ -85,6 +85,10 @@ struct vhost_virtqueue { /* Physical address of used ring, for logging */ uint64_t log_guest_addr; + + /* Shadow used ring for performance */ + struct vring_used_elem *shadow_used_ring; + uint32_t shadow_used_idx; } __rte_cache_aligned; /* Old kernels have no such macro defined */ diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c index 08a73fd..1263168 100644 --- a/lib/librte_vhost/vhost_rxtx.c +++ b/lib/librte_vhost/vhost_rxtx.c @@ -91,7 +91,7 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t qp_nb) return (is_tx ^ (idx & 1)) == 0 && idx < qp_nb * VIRTIO_QNUM; } -static void +static inline void __attribute__((always_inline)) virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr *net_hdr) { if (m_buf->ol_flags & PKT_TX_L4_MASK) { @@ -125,427 +125,227 @@ virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr *net_hdr) } } -static inline void -copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr, - struct virtio_net_hdr_mrg_rxbuf hdr) -{ - if (dev->vhost_hlen == sizeof(struct virtio_net_hdr_mrg_rxbuf)) - *(struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr = hdr; - else - *(struct virtio_net_hdr *)(uintptr_t)desc_addr = hdr.hdr; -} - -static inline int __attribute__((always_inline)) -copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, - struct rte_mbuf *m, uint16_t desc_idx) +uint16_t +rte_vhost_enqueue_burst(int vid, uint16_t queue_id, + struct rte_mbuf **pkts, uint16_t count) { - uint32_t desc_avail, desc_offset; - uint32_t mbuf_avail, 
mbuf_offset; - uint32_t cpy_len; + struct virtio_net_hdr_mrg_rxbuf *virtio_hdr; + struct vhost_virtqueue *vq; struct vring_desc *desc; - uint64_t desc_addr; - struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0}; - - desc = &vq->desc[desc_idx]; - desc_addr = gpa_to_vva(dev, desc->addr); - /* - * Checking of 'desc_addr' placed outside of 'unlikely' macro to avoid - * performance issue with some versions of gcc (4.8.4 and 5.3.0) which - * otherwise stores offset on the stack instead of in a register. - */ - if (unlikely(desc->len < dev->vhost_hlen) || !desc_addr) - return -1; - - rte_prefetch0((void *)(uintptr_t)desc_addr); - - virtio_enqueue_offload(m, &virtio_hdr.hdr); - copy_virtio_net_hdr(dev, desc_addr, virtio_hdr); - vhost_log_write(dev, desc->addr, dev->vhost_hlen); - PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0); - - desc_offset = dev->vhost_hlen; - desc_avail = desc->len - dev->vhost_hlen; - - mbuf_avail = rte_pktmbuf_data_len(m); - mbuf_offset = 0; - while (mbuf_avail != 0 || m->next != NULL) { - /* done with current mbuf, fetch next */ - if (mbuf_avail == 0) { - m = m->next; - - mbuf_offset = 0; - mbuf_avail = rte_pktmbuf_data_len(m); - } - - /* done with current desc buf, fetch next */ - if (desc_avail == 0) { - if ((desc->flags & VRING_DESC_F_NEXT) == 0) { - /* Room in vring buffer is not enough */ - return -1; - } - if (unlikely(desc->next >= vq->size)) - return -1; - - desc = &vq->desc[desc->next]; - desc_addr = gpa_to_vva(dev, desc->addr); - if (unlikely(!desc_addr)) - return -1; - - desc_offset = 0; - desc_avail = desc->len; - } - - cpy_len = RTE_MIN(desc_avail, mbuf_avail); - rte_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)), - rte_pktmbuf_mtod_offset(m, void *, mbuf_offset), - cpy_len); - vhost_log_write(dev, desc->addr + desc_offset, cpy_len); - PRINT_PACKET(dev, (uintptr_t)(desc_addr + desc_offset), - cpy_len, 0); - - mbuf_avail -= cpy_len; - mbuf_offset += cpy_len; - desc_avail -= cpy_len; - desc_offset += 
cpy_len; - } - - return 0; -} + struct virtio_net *dev; + struct rte_mbuf *mbuf; + uint64_t desc_host_write_addr = 0; + uint32_t desc_chain_head = 0; + uint32_t desc_chain_len = 0; + uint32_t desc_current = 0; + uint32_t desc_write_offset = 0; + uint32_t used_idx_static = 0; + uint32_t pkt_idx = 0; + uint32_t pkt_left = 0; + uint32_t pkt_sent = 0; + uint32_t mbuf_len = 0; + uint32_t mbuf_len_left = 0; + uint32_t copy_len = 0; + uint32_t copy_virtio_hdr = 0; + uint32_t is_mrg_rxbuf = 0; + uint32_t is_virtio_1 = 0; + + if (unlikely(count == 0)) + return 0; -/** - * This function adds buffers to the virtio devices RX virtqueue. Buffers can - * be received from the physical port or from another virtio device. A packet - * count is returned to indicate the number of packets that are succesfully - * added to the RX queue. This function works when the mbuf is scattered, but - * it doesn't support the mergeable feature. - */ -static inline uint32_t __attribute__((always_inline)) -virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) -{ - struct vhost_virtqueue *vq; - uint16_t avail_idx, free_entries, start_idx; - uint16_t desc_indexes[MAX_PKT_BURST]; - uint16_t used_idx; - uint32_t i; + count = RTE_MIN((uint32_t)MAX_PKT_BURST, count); - LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__); - if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->virt_qp_nb))) { - RTE_LOG(ERR, VHOST_DATA, "(%d) %s: invalid virtqueue idx %d.\n", - dev->vid, __func__, queue_id); + dev = get_device(vid); + if (unlikely(!dev)) return 0; - } - vq = dev->virtqueue[queue_id]; - if (unlikely(vq->enabled == 0)) + if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->virt_qp_nb))) return 0; - avail_idx = *((volatile uint16_t *)&vq->avail->idx); - start_idx = vq->last_used_idx; - free_entries = avail_idx - start_idx; - count = RTE_MIN(count, free_entries); - count = RTE_MIN(count, (uint32_t)MAX_PKT_BURST); - if (count == 0) + vq = 
dev->virtqueue[queue_id]; + if (unlikely(!vq->enabled)) return 0; - LOG_DEBUG(VHOST_DATA, "(%d) start_idx %d | end_idx %d\n", - dev->vid, start_idx, start_idx + count); + if (dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF)) + is_mrg_rxbuf = 1; + + if (dev->features & (1ULL << VIRTIO_F_VERSION_1)) + is_virtio_1 = 1; + + pkt_idx = 0; + pkt_left = count; + used_idx_static = vq->last_used_idx & (vq->size - 1); + vq->shadow_used_idx = 0; + + while (pkt_left > 0) { + if (unlikely(vq->avail->idx == vq->last_used_idx)) + goto done; + + if (pkt_left > 1 && vq->avail->idx != vq->last_used_idx + 1) + rte_prefetch0(&vq->desc[ + vq->avail->ring[ + (vq->last_used_idx + 1) & + (vq->size - 1)]]); + + mbuf = pkts[pkt_idx]; + mbuf_len = rte_pktmbuf_data_len(mbuf); + mbuf_len_left = mbuf_len; + pkt_idx++; + pkt_left--; + + desc_chain_head = vq->avail->ring[(vq->last_used_idx) & + (vq->size - 1)]; + desc_current = desc_chain_head; + desc = &vq->desc[desc_current]; + desc_host_write_addr = gpa_to_vva(dev, desc->addr); + if (unlikely(!desc_host_write_addr)) + goto done; + + virtio_hdr = (struct virtio_net_hdr_mrg_rxbuf *) + (uintptr_t)desc_host_write_addr; + copy_virtio_hdr = 1; + + vhost_log_write(dev, desc->addr, dev->vhost_hlen); + desc_write_offset = dev->vhost_hlen; + desc_chain_len = desc_write_offset; + desc_host_write_addr += desc_write_offset; + + while (1) { + if (!mbuf_len_left) { + if (mbuf->next) { + mbuf = mbuf->next; + mbuf_len = rte_pktmbuf_data_len(mbuf); + mbuf_len_left = mbuf_len; + } else + break; + } - /* Retrieve all of the desc indexes first to avoid caching issues. 
*/ - rte_prefetch0(&vq->avail->ring[start_idx & (vq->size - 1)]); - for (i = 0; i < count; i++) { - used_idx = (start_idx + i) & (vq->size - 1); - desc_indexes[i] = vq->avail->ring[used_idx]; - vq->used->ring[used_idx].id = desc_indexes[i]; - vq->used->ring[used_idx].len = pkts[i]->pkt_len + - dev->vhost_hlen; - vhost_log_used_vring(dev, vq, - offsetof(struct vring_used, ring[used_idx]), - sizeof(vq->used->ring[used_idx])); - } + if (desc->len <= desc_write_offset) { + if (desc->flags & VRING_DESC_F_NEXT) { + desc_write_offset = 0; + desc_current = desc->next; + desc = &vq->desc[desc_current]; + desc_host_write_addr = + gpa_to_vva(dev, desc->addr); + if (unlikely(!desc_host_write_addr)) + goto rollback; + } else if (is_mrg_rxbuf) { + vq->shadow_used_ring[ + vq->shadow_used_idx].id = + desc_chain_head; + vq->shadow_used_ring[ + vq->shadow_used_idx].len = + desc_chain_len; + vq->shadow_used_idx++; + vq->last_used_idx++; + virtio_hdr->num_buffers++; + if (unlikely(vq->avail->idx == + vq->last_used_idx)) + goto rollback; + + desc_chain_head = vq->avail->ring[ + (vq->last_used_idx) & + (vq->size - 1)]; + desc_current = desc_chain_head; + desc = &vq->desc[desc_current]; + desc_host_write_addr = + gpa_to_vva(dev, desc->addr); + if (unlikely(!desc_host_write_addr)) + goto rollback; + + desc_chain_len = 0; + desc_write_offset = 0; + } else + goto rollback; + } - rte_prefetch0(&vq->desc[desc_indexes[0]]); - for (i = 0; i < count; i++) { - uint16_t desc_idx = desc_indexes[i]; - int err; + copy_len = RTE_MIN(desc->len - desc_write_offset, + mbuf_len_left); + if (copy_virtio_hdr) { + copy_virtio_hdr = 0; + memset((void *)(uintptr_t)&(virtio_hdr->hdr), + 0, dev->vhost_hlen); + virtio_enqueue_offload(mbuf, + &(virtio_hdr->hdr)); + if (is_mrg_rxbuf || is_virtio_1) + virtio_hdr->num_buffers = 1; + } - err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx); - if (unlikely(err)) { - used_idx = (start_idx + i) & (vq->size - 1); - vq->used->ring[used_idx].len = dev->vhost_hlen; - 
vhost_log_used_vring(dev, vq, - offsetof(struct vring_used, ring[used_idx]), - sizeof(vq->used->ring[used_idx])); + rte_memcpy((void *)(uintptr_t)desc_host_write_addr, + rte_pktmbuf_mtod_offset(mbuf, void *, + mbuf_len - mbuf_len_left), + copy_len); + vhost_log_write(dev, desc->addr + desc_write_offset, + copy_len); + mbuf_len_left -= copy_len; + desc_write_offset += copy_len; + desc_host_write_addr += copy_len; + desc_chain_len += copy_len; } - if (i + 1 < count) - rte_prefetch0(&vq->desc[desc_indexes[i+1]]); + vq->shadow_used_ring[vq->shadow_used_idx].id = desc_chain_head; + vq->shadow_used_ring[vq->shadow_used_idx].len = desc_chain_len; + vq->shadow_used_idx++; + vq->last_used_idx++; + pkt_sent++; } - rte_smp_wmb(); - - *(volatile uint16_t *)&vq->used->idx += count; - vq->last_used_idx += count; - vhost_log_used_vring(dev, vq, - offsetof(struct vring_used, idx), - sizeof(vq->used->idx)); - - /* flush used->idx update before we read avail->flags. */ - rte_mb(); - - /* Kick the guest if necessary. 
*/ - if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT) - && (vq->callfd >= 0)) - eventfd_write(vq->callfd, (eventfd_t)1); - return count; -} - -static inline int -fill_vec_buf(struct vhost_virtqueue *vq, uint32_t avail_idx, - uint32_t *allocated, uint32_t *vec_idx, - struct buf_vector *buf_vec) -{ - uint16_t idx = vq->avail->ring[avail_idx & (vq->size - 1)]; - uint32_t vec_id = *vec_idx; - uint32_t len = *allocated; - - while (1) { - if (unlikely(vec_id >= BUF_VECTOR_MAX || idx >= vq->size)) - return -1; - - len += vq->desc[idx].len; - buf_vec[vec_id].buf_addr = vq->desc[idx].addr; - buf_vec[vec_id].buf_len = vq->desc[idx].len; - buf_vec[vec_id].desc_idx = idx; - vec_id++; - - if ((vq->desc[idx].flags & VRING_DESC_F_NEXT) == 0) - break; - - idx = vq->desc[idx].next; - } - - *allocated = len; - *vec_idx = vec_id; - - return 0; -} - -/* - * Returns -1 on fail, 0 on success - */ -static inline int -reserve_avail_buf_mergeable(struct vhost_virtqueue *vq, uint32_t size, - uint16_t *end, struct buf_vector *buf_vec) -{ - uint16_t cur_idx; - uint16_t avail_idx; - uint32_t allocated = 0; - uint32_t vec_idx = 0; - uint16_t tries = 0; - - cur_idx = vq->last_used_idx; - - while (1) { - avail_idx = *((volatile uint16_t *)&vq->avail->idx); - if (unlikely(cur_idx == avail_idx)) - return -1; - - if (unlikely(fill_vec_buf(vq, cur_idx, &allocated, - &vec_idx, buf_vec) < 0)) - return -1; - - cur_idx++; - tries++; - - if (allocated >= size) - break; - - /* - * if we tried all available ring items, and still - * can't get enough buf, it means something abnormal - * happened. 
- */ - if (unlikely(tries >= vq->size)) - return -1; - } - - *end = cur_idx; - return 0; -} - -static inline uint32_t __attribute__((always_inline)) -copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq, - uint16_t end_idx, struct rte_mbuf *m, - struct buf_vector *buf_vec) -{ - struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0}; - uint32_t vec_idx = 0; - uint16_t start_idx = vq->last_used_idx; - uint16_t cur_idx = start_idx; - uint64_t desc_addr; - uint32_t mbuf_offset, mbuf_avail; - uint32_t desc_offset, desc_avail; - uint32_t cpy_len; - uint16_t desc_idx, used_idx; - - if (unlikely(m == NULL)) - return 0; - - LOG_DEBUG(VHOST_DATA, "(%d) current index %d | end index %d\n", - dev->vid, cur_idx, end_idx); - - desc_addr = gpa_to_vva(dev, buf_vec[vec_idx].buf_addr); - if (buf_vec[vec_idx].buf_len < dev->vhost_hlen || !desc_addr) - return 0; - - rte_prefetch0((void *)(uintptr_t)desc_addr); - - virtio_hdr.num_buffers = end_idx - start_idx; - LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n", - dev->vid, virtio_hdr.num_buffers); - - virtio_enqueue_offload(m, &virtio_hdr.hdr); - copy_virtio_net_hdr(dev, desc_addr, virtio_hdr); - vhost_log_write(dev, buf_vec[vec_idx].buf_addr, dev->vhost_hlen); - PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0); - - desc_avail = buf_vec[vec_idx].buf_len - dev->vhost_hlen; - desc_offset = dev->vhost_hlen; - - mbuf_avail = rte_pktmbuf_data_len(m); - mbuf_offset = 0; - while (mbuf_avail != 0 || m->next != NULL) { - /* done with current desc buf, get the next one */ - if (desc_avail == 0) { - desc_idx = buf_vec[vec_idx].desc_idx; - - if (!(vq->desc[desc_idx].flags & VRING_DESC_F_NEXT)) { - /* Update used ring with desc information */ - used_idx = cur_idx++ & (vq->size - 1); - vq->used->ring[used_idx].id = desc_idx; - vq->used->ring[used_idx].len = desc_offset; - vhost_log_used_vring(dev, vq, +done: + if (likely(vq->shadow_used_idx > 0)) { + if (used_idx_static + vq->shadow_used_idx < 
vq->size) { + rte_memcpy(&vq->used->ring[used_idx_static], + &vq->shadow_used_ring[0], + vq->shadow_used_idx * + sizeof(struct vring_used_elem)); + vhost_log_used_vring(dev, vq, offsetof(struct vring_used, - ring[used_idx]), - sizeof(vq->used->ring[used_idx])); - } - - vec_idx++; - desc_addr = gpa_to_vva(dev, buf_vec[vec_idx].buf_addr); - if (unlikely(!desc_addr)) - return 0; - - /* Prefetch buffer address. */ - rte_prefetch0((void *)(uintptr_t)desc_addr); - desc_offset = 0; - desc_avail = buf_vec[vec_idx].buf_len; - } - - /* done with current mbuf, get the next one */ - if (mbuf_avail == 0) { - m = m->next; + ring[used_idx_static]), + vq->shadow_used_idx * + sizeof(struct vring_used_elem)); + } else { + uint32_t part_1 = vq->size - used_idx_static; + uint32_t part_2 = vq->shadow_used_idx - part_1; - mbuf_offset = 0; - mbuf_avail = rte_pktmbuf_data_len(m); + rte_memcpy(&vq->used->ring[used_idx_static], + &vq->shadow_used_ring[0], + part_1 * + sizeof(struct vring_used_elem)); + vhost_log_used_vring(dev, vq, + offsetof(struct vring_used, + ring[used_idx_static]), + part_1 * + sizeof(struct vring_used_elem)); + rte_memcpy(&vq->used->ring[0], + &vq->shadow_used_ring[part_1], + part_2 * + sizeof(struct vring_used_elem)); + vhost_log_used_vring(dev, vq, + offsetof(struct vring_used, + ring[0]), + part_2 * + sizeof(struct vring_used_elem)); } - - cpy_len = RTE_MIN(desc_avail, mbuf_avail); - rte_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)), - rte_pktmbuf_mtod_offset(m, void *, mbuf_offset), - cpy_len); - vhost_log_write(dev, buf_vec[vec_idx].buf_addr + desc_offset, - cpy_len); - PRINT_PACKET(dev, (uintptr_t)(desc_addr + desc_offset), - cpy_len, 0); - - mbuf_avail -= cpy_len; - mbuf_offset += cpy_len; - desc_avail -= cpy_len; - desc_offset += cpy_len; } - used_idx = cur_idx & (vq->size - 1); - vq->used->ring[used_idx].id = buf_vec[vec_idx].desc_idx; - vq->used->ring[used_idx].len = desc_offset; + rte_smp_wmb(); + vq->used->idx = vq->last_used_idx; 
vhost_log_used_vring(dev, vq, - offsetof(struct vring_used, ring[used_idx]), - sizeof(vq->used->ring[used_idx])); - - return end_idx - start_idx; -} - -static inline uint32_t __attribute__((always_inline)) -virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id, - struct rte_mbuf **pkts, uint32_t count) -{ - struct vhost_virtqueue *vq; - uint32_t pkt_idx = 0, nr_used = 0; - uint16_t end; - struct buf_vector buf_vec[BUF_VECTOR_MAX]; - - LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__); - if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->virt_qp_nb))) { - RTE_LOG(ERR, VHOST_DATA, "(%d) %s: invalid virtqueue idx %d.\n", - dev->vid, __func__, queue_id); - return 0; - } - - vq = dev->virtqueue[queue_id]; - if (unlikely(vq->enabled == 0)) - return 0; - - count = RTE_MIN((uint32_t)MAX_PKT_BURST, count); - if (count == 0) - return 0; - - for (pkt_idx = 0; pkt_idx < count; pkt_idx++) { - uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen; - - if (unlikely(reserve_avail_buf_mergeable(vq, pkt_len, - &end, buf_vec) < 0)) { - LOG_DEBUG(VHOST_DATA, - "(%d) failed to get enough desc from vring\n", - dev->vid); - break; - } - - nr_used = copy_mbuf_to_desc_mergeable(dev, vq, end, - pkts[pkt_idx], buf_vec); - rte_smp_wmb(); - - *(volatile uint16_t *)&vq->used->idx += nr_used; - vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx), + offsetof(struct vring_used, idx), sizeof(vq->used->idx)); - vq->last_used_idx += nr_used; - } - - if (likely(pkt_idx)) { - /* flush used->idx update before we read avail->flags. */ - rte_mb(); - - /* Kick the guest if necessary. 
*/ + rte_mb(); + if (likely(pkt_sent)) { if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT) && (vq->callfd >= 0)) eventfd_write(vq->callfd, (eventfd_t)1); } - return pkt_idx; -} - -uint16_t -rte_vhost_enqueue_burst(int vid, uint16_t queue_id, - struct rte_mbuf **pkts, uint16_t count) -{ - struct virtio_net *dev = get_device(vid); + return pkt_sent; - if (!dev) - return 0; +rollback: + if (is_mrg_rxbuf || is_virtio_1) + vq->last_used_idx -= virtio_hdr->num_buffers - 1; - if (dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF)) - return virtio_dev_merge_rx(dev, queue_id, pkts, count); - else - return virtio_dev_rx(dev, queue_id, pkts, count); + goto done; } static void diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c index 1785695..87d09fa 100644 --- a/lib/librte_vhost/virtio-net.c +++ b/lib/librte_vhost/virtio-net.c @@ -152,10 +152,14 @@ cleanup_device(struct virtio_net *dev, int destroy) static void free_device(struct virtio_net *dev) { + struct vhost_virtqueue *vq; uint32_t i; - for (i = 0; i < dev->virt_qp_nb; i++) - rte_free(dev->virtqueue[i * VIRTIO_QNUM]); + for (i = 0; i < dev->virt_qp_nb; i++) { + vq = dev->virtqueue[i * VIRTIO_QNUM]; + rte_free(vq->shadow_used_ring); + rte_free(vq); + } rte_free(dev); } @@ -418,13 +422,18 @@ int vhost_set_vring_num(int vid, struct vhost_vring_state *state) { struct virtio_net *dev; + struct vhost_virtqueue *vq; dev = get_device(vid); if (dev == NULL) return -1; /* State->index refers to the queue index. The txq is 1, rxq is 0. */ - dev->virtqueue[state->index]->size = state->num; + vq = dev->virtqueue[state->index]; + vq->size = state->num; + vq->shadow_used_ring = rte_malloc("", + vq->size * sizeof(struct vring_used_elem), + RTE_CACHE_LINE_SIZE); return 0; }