Message ID | 20160927102123.GL25823@yliu-dev.sh.intel.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Delegated to: | Yuanhan Liu |
Headers |
Date: Tue, 27 Sep 2016 18:21:23 +0800 From: Yuanhan Liu <yuanhan.liu@linux.intel.com> To: Jianbo Liu <jianbo.liu@linaro.org> Cc: "Wang, Zhihong" <zhihong.wang@intel.com>, Maxime Coquelin <maxime.coquelin@redhat.com>, "dev@dpdk.org" <dev@dpdk.org> Message-ID: <20160927102123.GL25823@yliu-dev.sh.intel.com> In-Reply-To: <CAP4Qi39YF6SoaiSaka0ioZFWb-2uzWZUbNP4CK7LqCQosaSmWQ@mail.gmail.com> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue |
Commit Message
Yuanhan Liu
Sept. 27, 2016, 10:21 a.m. UTC
On Thu, Sep 22, 2016 at 05:01:41PM +0800, Jianbo Liu wrote:
> On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com> wrote:
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> >> Sent: Thursday, September 22, 2016 1:48 PM
> >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> >> <maxime.coquelin@redhat.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> >> > My setup consists of one host running a guest.
> >> >> >> > The guest generates as much 64bytes packets as possible using
> >> >> >>
> >> >> >> Have you tested with other packet sizes?
> >> >> >> My testing shows that performance is dropping when the packet size is
> >> >> >> more than 256.
> >> >> >
> >> >> > Hi Jianbo,
> >> >> >
> >> >> > Thanks for reporting this.
> >> >> >
> >> >> > 1. Are you running the vector frontend with mrg_rxbuf=off?
> >>
> >> Yes, my testing is mrg_rxbuf=off, but not the vector frontend PMD.
> >>
> >> >> > 2. Could you please specify what CPU you're running? Is it Haswell
> >> >> >    or Ivy Bridge?
> >>
> >> It's an ARM server.
> >>
> >> >> > 3. What percentage of drop are you seeing?
> >>
> >> The testing result:
> >>
> >>     size (bytes)    improvement (%)
> >>       64              3.92
> >>      128             11.51
> >>      256             24.16
> >>      512            -13.79
> >>     1024            -22.51
> >>     1500            -12.22
> >>
> >> A correction is that performance is dropping if the byte size is larger than 512.
> >
> > Jianbo,
> >
> > Could you please verify whether this patch really causes the enqueue perf to drop?
> >
> > You can test the enqueue path only by setting the guest to do rxonly, and
> > compare the mpps with "show port stats all" in the guest.
>
> Tested with testpmd, host: txonly, guest: rxonly
>
>     size (bytes)    improvement (%)
>       64              4.12
>      128              6
>      256              2.65
>      512             -1.12
>     1024             -7.02

There is a difference between Zhihong's code and the old code that I spotted
the first time: Zhihong removed the avail_idx prefetch. I understand the
prefetch becomes a bit tricky when the mrg-rx code path is considered; thus,
I didn't comment on that.

That's one of the differences that, IMO, could cause a regression. I
finally got a chance to add it back.

A rough test shows it improves the performance of the 1400B packet size greatly
in the "txonly in host and rxonly in guest" case: +33% is the number I get
with my test server (Ivy Bridge).

I guess this might help your case as well. Would you mind running a test
and telling me the results?

BTW, I made it in a rush; I haven't tested the mrg-rx code path yet.

Thanks.

	--yliu

commit e5852d04bf87c02d6d0d8e6d8ded4c33030b9c9e
Author: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Date:   Tue Sep 27 17:51:15 2016 +0800

    xxxx

    Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
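[Editor's sketch] The batched avail-index fetch being discussed can be modeled in a few self-contained lines. This is a hypothetical simplification, not the patch itself: `vq_sim` stands in for `struct vhost_virtqueue`, the avail ring is a plain array, and here the buffer is primed by starting `next_avail_idx` at the batch size so the first call fills it before anything is read (the patch as attached handles the initial state differently, which the follow-up comments discuss).

```c
#include <stdint.h>

#define NR_AVAIL_IDX_PREFETCH 32
#define VQ_SIZE 256                     /* must be a power of two */

/* Hypothetical stand-in for the virtqueue fields the patch touches. */
struct vq_sim {
	uint16_t ring[VQ_SIZE];         /* models vq->avail->ring */
	uint16_t last_avail_idx;        /* ring entries already buffered */
	uint16_t next_avail_idx;        /* read position in the local buffer */
	uint16_t avail_idx_buf[NR_AVAIL_IDX_PREFETCH];
};

/* Copy the next batch of avail entries into a small local buffer, so
 * subsequent reads hit warm memory instead of the shared avail ring. */
static void prefetch_avail_idx(struct vq_sim *vq)
{
	for (int i = 0; i < NR_AVAIL_IDX_PREFETCH; i++)
		vq->avail_idx_buf[i] =
			vq->ring[(vq->last_avail_idx + i) & (VQ_SIZE - 1)];
}

/* Return the next avail entry, refilling the buffer when it runs out.
 * Initialize next_avail_idx to NR_AVAIL_IDX_PREFETCH so the very first
 * call fills the buffer before anything is read from it. */
static uint16_t next_avail_idx(struct vq_sim *vq)
{
	if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
		prefetch_avail_idx(vq);
		vq->next_avail_idx = 0;
		vq->last_avail_idx += NR_AVAIL_IDX_PREFETCH;
	}
	return vq->avail_idx_buf[vq->next_avail_idx++];
}
```

The batching trades one burst of sequential reads for many scattered single reads of the avail ring, which is the effect the prefetch is meant to have on the real descriptor path.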
Comments
> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Tuesday, September 27, 2016 6:21 PM
> To: Jianbo Liu <jianbo.liu@linaro.org>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> [...]
>
> There is a difference between Zhihong's code and the old I spotted in
> the first time: Zhihong removed the avail_idx prefetch. I understand
> the prefetch becomes a bit tricky when mrg-rx code path is considered;
> thus, I didn't comment on that.
>
> That's one of the difference that, IMO, could drop a regression. I then
> finally got a chance to add it back.
>
> A rough test shows it improves the performance of 1400B packet size greatly
> in the "txonly in host and rxonly in guest" case: +33% is the number I get
> with my test server (Ivybridge).

Thanks Yuanhan! I'll validate this on x86.

> I guess this might/would help your case as well. Mind to have a test
> and tell me the results?
>
> BTW, I made it in rush; I haven't tested the mrg-rx code path yet.
>
> Thanks.
>
> --yliu
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wang, Zhihong
> Sent: Wednesday, September 28, 2016 12:45 AM
> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Jianbo Liu
> <jianbo.liu@linaro.org>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> [...]
>
> > A rough test shows it improves the performance of 1400B packet size greatly
> > in the "txonly in host and rxonly in guest" case: +33% is the number I get
> > with my test server (Ivybridge).
>
> Thanks Yuanhan! I'll validate this on x86.

Hi Yuanhan,

Seems your code doesn't perform correctly. I wrote a new version
of avail idx prefetch but didn't see any perf benefit.

To be honest, I doubt the benefit of this idea. The previous mrg_off
code has this method but doesn't give any benefits.

Even if this were useful, the benefits should be more significant for
small packets; it's unlikely this simple idx prefetch could bring an
over 30% perf gain for large packets like 1400B ones.

But if you really do work it out like that, I'll be very glad to see it.

Thanks
Zhihong
On Sun, Oct 09, 2016 at 12:09:07PM +0000, Wang, Zhihong wrote:
> [...]
>
> Hi Yuanhan,
>
> Seems your code doesn't perform correctly. I wrote a new version
> of avail idx prefetch but didn't see any perf benefit.
>
> To be honest I doubt the benefit of this idea. The previous mrg_off
> code has this method but doesn't give any benefits.

Good point. I thought of that before, too. But you know that I made it
in a rush, and didn't think further or test more.

I looked at the code a bit closer this time, and spotted a bug: the prefetch
actually didn't happen, due to the following code piece:

    if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
        prefetch_avail_idx(vq);
        ...
    }

Since vq->next_avail_idx is set to 0 at the entrance of the enqueue path,
prefetch_avail_idx() won't be called for the first batch. The fix is easy
though: just put prefetch_avail_idx() before invoking enqueue_packet.

In summary, Zhihong is right: I see no more gains with that fix :(

However, as stated, that's kind of the only difference I found between
yours and the old code, so maybe it's still worthwhile to have a
test on ARM, Jianbo?

	--yliu
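[Editor's sketch] The effect of the guard at burst entry can be checked with a tiny model. This is a hypothetical simplification: `refills` merely counts the calls that would have gone to `prefetch_avail_idx()`, so it shows how many buffer fills the first batch of reads actually gets.

```c
#include <stdint.h>

#define NR_AVAIL_IDX_PREFETCH 32

/* Minimal model of the consume path: next_avail_idx is reset to 0 at
 * the start of a burst, exactly as in the patch. */
struct vq_model {
	uint16_t next_avail_idx;
	int refills;    /* how many times the buffer was (re)filled */
};

/* One call models one read of the buffered avail index. */
static void consume_one(struct vq_model *vq)
{
	if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
		vq->refills++;          /* stands in for prefetch_avail_idx(vq) */
		vq->next_avail_idx = 0;
	}
	vq->next_avail_idx++;           /* index into avail_idx_buf */
}
```

With `next_avail_idx` starting at 0, all of the first NR_AVAIL_IDX_PREFETCH reads complete while `refills` is still 0, i.e. they come from a buffer that was never filled; doing the first fill up front (or starting `next_avail_idx` at NR_AVAIL_IDX_PREFETCH) makes the very first read trigger one.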
On 10 October 2016 at 10:44, Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> [...]
>
> In summary, Zhihong is right, I see no more gains with that fix :(
>
> However, as stated, that's kind of the only difference I found between
> yours and the old code, that maybe it's still worthwhile to have a
> test on ARM, Jianbo?

I haven't tested it, but I think there could be no improvement for ARM either.

A small suggestion for enqueue_packet:

    .....
    +	/* start copy from mbuf to desc */
    +	while (mbuf_avail || mbuf->next) {
    .....

Considering pkt_len is in the first cache line (same as data_len), while
the next pointer is in the second cache line, is it better to check the
total packet length, instead of the last mbuf's next pointer, to jump out
of the while loop and avoid a possible cache miss?
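[Editor's sketch] The suggestion above can be illustrated as follows. This is a hypothetical simplification: `mbuf_sim` stands in for `rte_mbuf`, where `pkt_len` and `data_len` sit in the first cache line and `next` in the second, and the copy itself is reduced to a byte count.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-cache-line mbuf stand-in. */
struct mbuf_sim {
	uint32_t pkt_len;       /* total bytes in the whole chain (first segment) */
	uint32_t data_len;      /* bytes in this segment */
	struct mbuf_sim *next;  /* next segment, NULL for the last one */
};

/* Drive the copy loop with the remaining total length, read once from
 * the first cache line, instead of testing mbuf->next every iteration. */
static uint32_t copy_by_pkt_len(const struct mbuf_sim *mbuf)
{
	uint32_t remaining = mbuf->pkt_len;
	uint32_t copied = 0;

	while (remaining > 0) {
		uint32_t seg = mbuf->data_len;

		copied += seg;          /* a real loop would memcpy here */
		remaining -= seg;
		if (remaining > 0)
			mbuf = mbuf->next;  /* ->next touched only when more data exists */
	}
	return copied;
}
```

For the common single-segment case, the loop never dereferences `->next` at all, which is the cache-miss saving being proposed.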
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Monday, October 10, 2016 1:32 PM
> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> [...]
>
> Considering pkt_len is in the first cache line (same as data_len),
> while next pointer is in the second cache line,
> is it better to check the total packet len, instead of the last mbuf's
> next pointer to jump out of while loop and avoid possible cache miss?

Jianbo,

Thanks for the reply!

This idea sounds good, but it won't help the general perf in my
opinion, since the 2nd cache line is accessed anyway earlier in
virtio_enqueue_offload.

Also, this would bring a NULL check when actually accessing mbuf->next.

BTW, could you please publish the numbers for:

 1. mrg_rxbuf=on, comparison between original and original + this patch

 2. mrg_rxbuf=off, comparison between original and original + this patch

so we can have a whole picture of how this patch impacts the ARM platform?

Thanks
Zhihong
On 10 October 2016 at 14:22, Wang, Zhihong <zhihong.wang@intel.com> wrote:
> [...]
>
> This idea sounds good, but it won't help the general perf in my
> opinion, since the 2nd cache line is accessed anyway prior in
> virtio_enqueue_offload.

Yes, you are right. I was thinking of prefetching it beforehand. And if
it's a chained mbuf, virtio_enqueue_offload will not be called in the
next loop.

> Also this would bring a NULL check when actually access mbuf->next.
>
> BTW, could you please publish the number of:
>
> 1. mrg_rxbuf=on, comparison between original and original + this patch
>
> 2. mrg_rxbuf=off, comparison between original and original + this patch
>
> So we can have a whole picture of how this patch impact on ARM platform.

I think you have already got many results in my previous emails.
Sorry, I can't test right now; I'm busy with other things.
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Monday, October 10, 2016 2:58 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>
> [...]
>
> I think you already have got many results in my previous emails.
> Sorry I can't test right now and busy with other things.

We're still missing the mrg_rxbuf=on data.
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 381dc27..41bfeba 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -61,6 +61,8 @@ struct buf_vector {
 	uint32_t desc_idx;
 };
 
+#define NR_AVAIL_IDX_PREFETCH	32
+
 /**
  * Structure contains variables relevant to RX/TX virtqueues.
  */
@@ -70,7 +72,7 @@ struct vhost_virtqueue {
 	struct vring_used	*used;
 	uint32_t		size;
 
-	/* Last index used on the available ring */
+	uint16_t		last_avail_idx;
 	uint16_t		last_used_idx;
 #define VIRTIO_INVALID_EVENTFD		(-1)
 #define VIRTIO_UNINITIALIZED_EVENTFD	(-2)
@@ -89,6 +91,9 @@ struct vhost_virtqueue {
 	/* Shadow used ring for performance */
 	struct vring_used_elem	*shadow_used_ring;
 	uint32_t		shadow_used_idx;
+
+	uint16_t		next_avail_idx;
+	uint16_t		avail_idx_buf[NR_AVAIL_IDX_PREFETCH];
 } __rte_cache_aligned;
 
 /* Old kernels have no such macro defined */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 11a2c1a..1cc22fc 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -170,6 +170,41 @@ flush_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	}
 }
 
+/* Fetch NR_AVAIL_IDX_PREFETCH avail entries at once */
+static void
+prefetch_avail_idx(struct vhost_virtqueue *vq)
+{
+	int i;
+
+	for (i = 0; i < NR_AVAIL_IDX_PREFETCH; i++) {
+		vq->avail_idx_buf[i] = vq->avail->ring[(vq->last_avail_idx + i) &
+					(vq->size - 1)];
+	}
+}
+
+static uint16_t
+next_avail_idx(struct vhost_virtqueue *vq)
+{
+	if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
+		prefetch_avail_idx(vq);
+		vq->next_avail_idx = 0;
+		vq->last_avail_idx += NR_AVAIL_IDX_PREFETCH;
+	}
+
+	return vq->avail_idx_buf[vq->next_avail_idx++];
+}
+
+/*
+ * Just peek, but don't move forward the "next_avail_idx" pointer
+ * The caller also has to make sure the point doesn't go beyond
+ * the array.
+ */
+static uint16_t
+peek_next_avail_idx(struct vhost_virtqueue *vq)
+{
+	return vq->avail_idx_buf[vq->next_avail_idx];
+}
+
 static inline int __attribute__((always_inline))
 enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	       uint16_t avail_idx, struct rte_mbuf *mbuf,
@@ -193,7 +228,7 @@ enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	mbuf_avail = mbuf_len;
 
 	/* get the current desc */
-	desc_current = vq->avail->ring[(vq->last_used_idx) & (vq->size - 1)];
+	desc_current = next_avail_idx(vq);
 	desc_chain_head = desc_current;
 	desc = &vq->desc[desc_current];
 	desc_addr = gpa_to_vva(dev, desc->addr);
@@ -235,9 +270,7 @@ enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			if (avail_idx == vq->last_used_idx)
 				goto error;
 
-			desc_current =
-				vq->avail->ring[(vq->last_used_idx) &
-				(vq->size - 1)];
+			desc_current = next_avail_idx(vq);
 			desc_chain_head = desc_current;
 			desc_chain_len = 0;
 		} else
@@ -298,6 +331,7 @@ notify_guest(struct virtio_net *dev, struct vhost_virtqueue *vq)
 		eventfd_write(vq->callfd, (eventfd_t)1);
 }
 
+
 uint16_t
 rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 	struct rte_mbuf **pkts, uint16_t count)
@@ -331,14 +365,15 @@ rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 
 	/* start enqueuing packets 1 by 1 */
 	vq->shadow_used_idx = 0;
+	vq->next_avail_idx = 0;
 	used_idx = vq->last_used_idx & (vq->size - 1);
 	avail_idx = *((volatile uint16_t *)&vq->avail->idx);
 	while (pkt_left && avail_idx != vq->last_used_idx) {
 		/* prefetch the next desc */
-		if (pkt_left > 1 && avail_idx != vq->last_used_idx + 1)
-			rte_prefetch0(&vq->desc[vq->avail->ring[
-					(vq->last_used_idx + 1) &
-					(vq->size - 1)]]);
+		if (pkt_left > 1 &&
+		    vq->next_avail_idx + 1 < NR_AVAIL_IDX_PREFETCH) {
+			rte_prefetch0(&vq->desc[peek_next_avail_idx(vq)]);
+		}
 
 		if (enqueue_packet(dev, vq, avail_idx, pkts[pkt_idx],
 				   is_mrg_rxbuf))
@@ -347,6 +382,7 @@ rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 		pkt_idx++;
 		pkt_left--;
 	}
+	vq->last_avail_idx += vq->next_avail_idx;
 
 	/* batch update used ring for better performance */
 	if (likely(vq->shadow_used_idx > 0))