[dpdk-dev] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics

Message ID 1447011596-2993-1-git-send-email-harish.patil@qlogic.com (mailing list archive)
State Accepted, archived
Headers

Commit Message

Harish Patil Nov. 8, 2015, 7:39 p.m. UTC
  From: Harish Patil <harish.patil@qlogic.com>

l3fwd app expects PMDs to return packets whose L2 header is
16-byte aligned due to usage of _mm_load_si128()/_mm_store_si128()
intrinsics in the app. However, most of the protocol stacks expects
packets such that its IP/L3 header be aligned on a 16-byte boundary.

Based on the recommendations received on dpdk-dev, we are changing
the l3fwd app to use _mm_loadu_si128()/_mm_loadu_si128() so that the
address need not be 16-byte aligned and thereby preventing crash.
We have tested that there is no performance impact due to this
change.

Signed-off-by: Harish Patil <harish.patil@qlogic.com>
---
 examples/l3fwd/main.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)
  

Comments

Ananyev, Konstantin Nov. 13, 2015, 10:35 a.m. UTC | #1
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of harish.patil@qlogic.com
> Sent: Sunday, November 08, 2015 7:40 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics
> 
> From: Harish Patil <harish.patil@qlogic.com>
> 
> l3fwd app expects PMDs to return packets whose L2 header is
> 16-byte aligned due to usage of _mm_load_si128()/_mm_store_si128()
> intrinsics in the app. However, most of the protocol stacks expects
> packets such that its IP/L3 header be aligned on a 16-byte boundary.
> 
> Based on the recommendations received on dpdk-dev, we are changing
> the l3fwd app to use _mm_loadu_si128()/_mm_loadu_si128() so that the
> address need not be 16-byte aligned and thereby preventing crash.
> We have tested that there is no performance impact due to this
> change.
> 
> Signed-off-by: Harish Patil <harish.patil@qlogic.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

As a side notice:
In fact with gcc build I do see a slight regression: ~1%
for 4 ports over 1 core test-case.
Though I think the problem is not in the patch itself.
By some, unknown to me reason, gcc treats aligned and unaligned load/store
instrincts in a different way (at least for that particular case).
With aligned load/store in use it generates code that is pretty close to the source:
 4 loads first, then 4 BLENDs, then 4  stores  (with some interfering scalar instructions of course).
But with unaligned ones  gcc starts to mix loads and blends for the same register, so now it is:
load x0; blend x0; load x1; blend x1; .. 
As if the source code was:

te[0] = _mm_loadu_si128(p[0]);
te[0] =  _mm_blend_epi16(te[0], ve[0], MASK_ETH);
te[1] = _mm_loadu_si128(p[1]);
te[1] =  _mm_blend_epi16(te[1], ve[1], MASK_ETH);  
...

So load latency is not hidden any more.
I tried it with different versions of  - same story for all of them.
Clang doesn't have such issue and generates similar code for both
aligned and unaligned instrincts. 

The only way to fix it I can think about  -  put rte_compiler_barrier() just before the first blend instinct.
That helped, now there are no noticeable differences in generated code and results before and after the patch.
 So I suppose, I'll have to submit a patch after yours one to fix that problem.
Konstantin
  
Harish Patil Nov. 16, 2015, 6:16 p.m. UTC | #2
>


>

>

>> -----Original Message-----

>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of

>>harish.patil@qlogic.com

>> Sent: Sunday, November 08, 2015 7:40 PM

>> To: dev@dpdk.org

>> Subject: [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned

>>load/store intrinsics

>>

>> From: Harish Patil <harish.patil@qlogic.com>

>>

>> l3fwd app expects PMDs to return packets whose L2 header is

>> 16-byte aligned due to usage of _mm_load_si128()/_mm_store_si128()

>> intrinsics in the app. However, most of the protocol stacks expects

>> packets such that its IP/L3 header be aligned on a 16-byte boundary.

>>

>> Based on the recommendations received on dpdk-dev, we are changing

>> the l3fwd app to use _mm_loadu_si128()/_mm_loadu_si128() so that the

>> address need not be 16-byte aligned and thereby preventing crash.

>> We have tested that there is no performance impact due to this

>> change.

>>

>> Signed-off-by: Harish Patil <harish.patil@qlogic.com>

>> ---

>

>Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

>

>As a side notice:

>In fact with gcc build I do see a slight regression: ~1%

>for 4 ports over 1 core test-case.

>Though I think the problem is not in the patch itself.

>By some, unknown to me reason, gcc treats aligned and unaligned load/store

>instrincts in a different way (at least for that particular case).

>With aligned load/store in use it generates code that is pretty close to

>the source:

> 4 loads first, then 4 BLENDs, then 4  stores  (with some interfering

>scalar instructions of course).

>But with unaligned ones  gcc starts to mix loads and blends for the same

>register, so now it is:

>load x0; blend x0; load x1; blend x1; ..

>As if the source code was:

>

>te[0] = _mm_loadu_si128(p[0]);

>te[0] =  _mm_blend_epi16(te[0], ve[0], MASK_ETH);

>te[1] = _mm_loadu_si128(p[1]);

>te[1] =  _mm_blend_epi16(te[1], ve[1], MASK_ETH);

>...

>

>So load latency is not hidden any more.

>I tried it with different versions of  - same story for all of them.

>Clang doesn't have such issue and generates similar code for both

>aligned and unaligned instrincts.

>

>The only way to fix it I can think about  -  put rte_compiler_barrier()

>just before the first blend instinct.

>That helped, now there are no noticeable differences in generated code

>and results before and after the patch.

> So I suppose, I'll have to submit a patch after yours one to fix that

>problem.

>Konstantin

>

>


Sure, thanks for verifying and providing fix.

Harish


________________________________

This message and any attached documents contain information from the sending company or its parent company(s), subsidiaries, divisions or branch offices that may be confidential. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message.
  
Thomas Monjalon Dec. 7, 2015, 2:16 a.m. UTC | #3
> > l3fwd app expects PMDs to return packets whose L2 header is
> > 16-byte aligned due to usage of _mm_load_si128()/_mm_store_si128()
> > intrinsics in the app. However, most of the protocol stacks expects
> > packets such that its IP/L3 header be aligned on a 16-byte boundary.
> > 
> > Based on the recommendations received on dpdk-dev, we are changing
> > the l3fwd app to use _mm_loadu_si128()/_mm_loadu_si128() so that the
> > address need not be 16-byte aligned and thereby preventing crash.
> > We have tested that there is no performance impact due to this
> > change.
> > 
> > Signed-off-by: Harish Patil <harish.patil@qlogic.com>
> 
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

Applied, thanks
  

Patch

diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
index 1f3e5c6..4b8b754 100644
--- a/examples/l3fwd/main.c
+++ b/examples/l3fwd/main.c
@@ -1220,14 +1220,14 @@  process_packet(struct lcore_conf *qconf, struct rte_mbuf *pkt,
 	dst_ipv4 = rte_be_to_cpu_32(dst_ipv4);
 	dp = get_dst_port(qconf, pkt, dst_ipv4, portid);
 
-	te = _mm_load_si128((__m128i *)eth_hdr);
+	te = _mm_loadu_si128((__m128i *)eth_hdr);
 	ve = val_eth[dp];
 
 	dst_port[0] = dp;
 	rfc1812_process(ipv4_hdr, dst_port, pkt->packet_type);
 
 	te =  _mm_blend_epi16(te, ve, MASK_ETH);
-	_mm_store_si128((__m128i *)eth_hdr, te);
+	_mm_storeu_si128((__m128i *)eth_hdr, te);
 }
 
 /*
@@ -1313,16 +1313,16 @@  processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
 	p[3] = rte_pktmbuf_mtod(pkt[3], __m128i *);
 
 	ve[0] = val_eth[dst_port[0]];
-	te[0] = _mm_load_si128(p[0]);
+	te[0] = _mm_loadu_si128(p[0]);
 
 	ve[1] = val_eth[dst_port[1]];
-	te[1] = _mm_load_si128(p[1]);
+	te[1] = _mm_loadu_si128(p[1]);
 
 	ve[2] = val_eth[dst_port[2]];
-	te[2] = _mm_load_si128(p[2]);
+	te[2] = _mm_loadu_si128(p[2]);
 
 	ve[3] = val_eth[dst_port[3]];
-	te[3] = _mm_load_si128(p[3]);
+	te[3] = _mm_loadu_si128(p[3]);
 
 	/* Update first 12 bytes, keep rest bytes intact. */
 	te[0] =  _mm_blend_epi16(te[0], ve[0], MASK_ETH);
@@ -1330,10 +1330,10 @@  processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
 	te[2] =  _mm_blend_epi16(te[2], ve[2], MASK_ETH);
 	te[3] =  _mm_blend_epi16(te[3], ve[3], MASK_ETH);
 
-	_mm_store_si128(p[0], te[0]);
-	_mm_store_si128(p[1], te[1]);
-	_mm_store_si128(p[2], te[2]);
-	_mm_store_si128(p[3], te[3]);
+	_mm_storeu_si128(p[0], te[0]);
+	_mm_storeu_si128(p[1], te[1]);
+	_mm_storeu_si128(p[2], te[2]);
+	_mm_storeu_si128(p[3], te[3]);
 
 	rfc1812_process((struct ipv4_hdr *)((struct ether_hdr *)p[0] + 1),
 		&dst_port[0], pkt[0]->packet_type);