[dpdk-dev,2/2] uio: new driver to support PCI MSI-X

Message ID 1443652138-31782-3-git-send-email-stephen@networkplumber.org (mailing list archive)
State Not Applicable, archived

Commit Message

Stephen Hemminger Sept. 30, 2015, 10:28 p.m. UTC
  This driver allows using a PCI device with Message Signalled Interrupts
from userspace. The API is similar to the igb_uio driver used by the DPDK.
Via an ioctl it provides a mechanism to map MSI-X interrupts to event
file descriptors, similar to VFIO.

VFIO is a better choice if an IOMMU is available, but userspace drivers often
have to work in environments where IOMMU support (real or emulated) is
not available.  No UIO driver that supports DMA is secure against a
rogue userspace application programming the DMA hardware to access
private memory; this driver is no less secure than the existing code.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 drivers/uio/Kconfig          |   9 ++
 drivers/uio/Makefile         |   1 +
 drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/Kbuild    |   1 +
 include/uapi/linux/uio_msi.h |  22 +++
 5 files changed, 411 insertions(+)
 create mode 100644 drivers/uio/uio_msi.c
 create mode 100644 include/uapi/linux/uio_msi.h
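
For reference, here is a minimal sketch of how userspace is expected to
drive this interface, based on the uio_msi.h header added below; the
device node /dev/uio0 and the choice of vector 0 are illustrative, and
error handling is mostly omitted.

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <sys/eventfd.h>

/* Mirrors include/uapi/linux/uio_msi.h from this patch. */
struct uio_msi_irq_set {
	uint32_t vec;
	int fd;
};
#define UIO_MSI_BASE	0x86
#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE + 1, struct uio_msi_irq_set)

int main(void)
{
	int uio = open("/dev/uio0", O_RDWR);	/* device bound to uio_msi */
	int efd = eventfd(0, 0);		/* one eventfd per MSI-X vector */
	struct uio_msi_irq_set req = { .vec = 0, .fd = efd };
	uint64_t count;

	if (uio < 0 || efd < 0)
		return 1;
	/* Attach MSI-X vector 0 to the eventfd; passing fd = -1 detaches it. */
	if (ioctl(uio, UIO_MSI_IRQ_SET, &req) < 0)
		return 1;
	/* Each interrupt on vector 0 increments the eventfd counter. */
	if (read(efd, &count, sizeof(count)) == sizeof(count))
		printf("vector 0 fired %llu time(s)\n", (unsigned long long)count);
	return 0;
}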
  

Comments

Michael S. Tsirkin Oct. 1, 2015, 8:33 a.m. UTC | #1
On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> This driver allows using PCI device with Message Signalled Interrupt
> from userspace. The API is similar to the igb_uio driver used by the DPDK.
> Via ioctl it provides a mechanism to map MSI-X interrupts into event
> file descriptors similar to VFIO.
>
> VFIO is a better choice if IOMMU is available, but often userspace drivers
> have to work in environments where IOMMU support (real or emulated) is
> not available.  All UIO drivers that support DMA are not secure against
> rogue userspace applications programming DMA hardware to access
> private memory; this driver is no less secure than existing code.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

I don't think copying the igb_uio interface is a good idea.
What DPDK is doing with igb_uio (and indeed uio_pci_generic)
is abusing the sysfs BAR access to provide unlimited
access to hardware.

MSI messages are memory writes so any generic device capable
of MSI is capable of corrupting kernel memory.
This means that a bug in userspace will lead to kernel memory corruption
and crashes.  This is something distributions can't support.

uio_pci_generic is already abused like that, mostly
because when I wrote it, I didn't add enough protections
against using it with DMA capable devices,
and we can't go back and break working userspace.
But at least it does not bind to VFs, all of which
are capable of DMA.

The result of merging this driver will be userspace abusing the
sysfs BAR access with VFs as well, and we do not want that.


Just forwarding events is not enough to make a valid driver.
What is missing is a way to access the device in a safe way.

On a more positive note:

What would be a reasonable interface? One that does the following
in the kernel:

1. initializes device rings (can be in pinned userspace memory,
   but cannot be writable by userspace), brings up the interface link
2. pins userspace memory (unless using e.g. hugetlbfs)
3. gets requests, makes sure each is valid and belongs to
   the correct task, and puts it in the ring
4. in the reverse direction, notifies userspace when buffers
   are available in the ring
5. notifies userspace about MSI (what this driver does)

What userspace can be allowed to do:

	format requests (e.g. transmit, receive) in userspace
	read ring contents

What userspace can't be allowed to do:

	access BAR
	write rings


This means that the driver cannot be a generic one,
and there will be system call overhead when you
write the ring, but that's the price you have to
pay for the ability to run on systems without an IOMMU.
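
To make the proposal above concrete, here is a purely hypothetical sketch
of what such a restricted uapi could look like (none of these names exist
anywhere; the point is that the kernel owns and validates the rings, and
userspace only submits requests):

#include <linux/types.h>
#include <linux/ioctl.h>

struct xuio_tx_req {			/* one transmit request */
	__u64 addr;			/* userspace buffer, validated/pinned by the kernel */
	__u32 len;
	__u32 flags;
};

struct xuio_tx_batch {
	__u32 count;			/* number of entries behind 'reqs' */
	__u32 reserved;
	__u64 reqs;			/* pointer to struct xuio_tx_req[count] */
};

/* Kernel brings up the link and owns the rings (item 1 above). */
#define XUIO_LINK_UP	_IO('X', 0)
/* Kernel checks each request and places it on the TX ring (item 3). */
#define XUIO_TX_SUBMIT	_IOW('X', 1, struct xuio_tx_batch)
/*
 * RX-side notifications (item 4) and MSI notifications (item 5) would come
 * back through an eventfd, as in the patch below; userspace never maps the
 * BAR and never writes the rings directly.
 */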




> ---
>  drivers/uio/Kconfig          |   9 ++
>  drivers/uio/Makefile         |   1 +
>  drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/Kbuild    |   1 +
>  include/uapi/linux/uio_msi.h |  22 +++
>  5 files changed, 411 insertions(+)
>  create mode 100644 drivers/uio/uio_msi.c
>  create mode 100644 include/uapi/linux/uio_msi.h
> 
> diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> index 52c98ce..04adfa0 100644
> --- a/drivers/uio/Kconfig
> +++ b/drivers/uio/Kconfig
> @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
>  	  primarily, for virtualization scenarios.
>  	  If you compile this as a module, it will be called uio_pci_generic.
>  
> +config UIO_PCI_MSI
> +	tristate "Generic driver supporting MSI-X on PCI Express cards"
> +	depends on PCI
> +	help
> +	  Generic driver that provides Message Signalled IRQ events
> +	  similar to VFIO. If an IOMMU is available please use VFIO
> +	  instead since it provides more security.
> +	  If you compile this as a module, it will be called uio_msi.
> +
>  config UIO_NETX
>  	tristate "Hilscher NetX Card driver"
>  	depends on PCI
> diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
> index 8560dad..62fc44b 100644
> --- a/drivers/uio/Makefile
> +++ b/drivers/uio/Makefile
> @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX)	+= uio_netx.o
>  obj-$(CONFIG_UIO_PRUSS)         += uio_pruss.o
>  obj-$(CONFIG_UIO_MF624)         += uio_mf624.o
>  obj-$(CONFIG_UIO_FSL_ELBC_GPCM)	+= uio_fsl_elbc_gpcm.o
> +obj-$(CONFIG_UIO_PCI_MSI)	+= uio_msi.o
> diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
> new file mode 100644
> index 0000000..802b5c4
> --- /dev/null
> +++ b/drivers/uio/uio_msi.c
> @@ -0,0 +1,378 @@
> +/*-
> + *
> + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> + * Author: Stephen Hemminger <stephen@networkplumber.org>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 only.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +#include <linux/eventfd.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/uio_driver.h>
> +#include <linux/msi.h>
> +#include <linux/uio_msi.h>
> +
> +#define DRIVER_VERSION		"0.1.1"
> +#define MAX_MSIX_VECTORS	64
> +
> +/* MSI-X vector information */
> +struct uio_msi_pci_dev {
> +	struct uio_info info;		/* UIO driver info */
> +	struct pci_dev *pdev;		/* PCI device */
> +	struct mutex	mutex;		/* open/release/ioctl mutex */
> +	int		ref_cnt;	/* references to device */
> +	unsigned int	max_vectors;	/* MSI-X slots available */
> +	struct msix_entry *msix;	/* MSI-X vector table */
> +	struct uio_msi_irq_ctx {
> +		struct eventfd_ctx *trigger; /* vector to eventfd */
> +		char *name;		/* name in /proc/interrupts */
> +	} *ctx;
> +};
> +
> +static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
> +{
> +	struct uio_msi_pci_dev *udev = arg;
> +
> +	if (pci_check_and_mask_intx(udev->pdev)) {
> +		eventfd_signal(udev->ctx->trigger, 1);
> +		return IRQ_HANDLED;
> +	}
> +
> +	return IRQ_NONE;
> +}
> +
> +static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
> +{
> +	struct eventfd_ctx *trigger = arg;
> +
> +	eventfd_signal(trigger, 1);
> +	return IRQ_HANDLED;
> +}
> +
> +/* set the mapping between vector # and existing eventfd. */
> +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
> +{
> +	struct eventfd_ctx *trigger;
> +	int irq, err;
> +
> +	if (vec >= udev->max_vectors) {
> +		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
> +			   vec, udev->max_vectors);
> +		return -ERANGE;
> +	}
> +
> +	irq = udev->msix[vec].vector;
> +	trigger = udev->ctx[vec].trigger;
> +	if (trigger) {
> +		/* Clean up existing irq mapping */
> +		free_irq(irq, trigger);
> +		eventfd_ctx_put(trigger);
> +		udev->ctx[vec].trigger = NULL;
> +	}
> +
> +	/* Passing -1 is used to disable interrupt */
> +	if (fd < 0)
> +		return 0;
> +
> +	trigger = eventfd_ctx_fdget(fd);
> +	if (IS_ERR(trigger)) {
> +		err = PTR_ERR(trigger);
> +		dev_notice(&udev->pdev->dev,
> +			   "eventfd ctx get failed: %d\n", err);
> +		return err;
> +	}
> +
> +	if (udev->msix)
> +		err = request_irq(irq, uio_msi_irqhandler, 0,
> +				  udev->ctx[vec].name, trigger);
> +	else
> +		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
> +				  udev->ctx[vec].name, udev);
> +
> +	if (err) {
> +		dev_notice(&udev->pdev->dev,
> +			   "request irq failed: %d\n", err);
> +		eventfd_ctx_put(trigger);
> +		return err;
> +	}
> +
> +	udev->ctx[vec].trigger = trigger;
> +	return 0;
> +}
> +
> +static int
> +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	struct uio_msi_irq_set hdr;
> +	int err;
> +
> +	switch (cmd) {
> +	case UIO_MSI_IRQ_SET:
> +		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
> +			return -EFAULT;
> +
> +		mutex_lock(&udev->mutex);
> +		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
> +		mutex_unlock(&udev->mutex);
> +		break;
> +	default:
> +		err = -EOPNOTSUPP;
> +	}
> +	return err;
> +}
> +
> +/* Opening the UIO device for first time enables MSI-X */
> +static int
> +uio_msi_open(struct uio_info *info, struct inode *inode)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int err = 0;
> +
> +	mutex_lock(&udev->mutex);
> +	if (udev->ref_cnt++ == 0) {
> +		if (udev->msix)
> +			err = pci_enable_msix(udev->pdev, udev->msix,
> +					      udev->max_vectors);
> +	}
> +	mutex_unlock(&udev->mutex);
> +
> +	return err;
> +}
> +
> +/* Last close of the UIO device releases/disables all IRQ's */
> +static int
> +uio_msi_release(struct uio_info *info, struct inode *inode)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int i;
> +
> +	mutex_lock(&udev->mutex);
> +	if (--udev->ref_cnt == 0) {
> +		for (i = 0; i < udev->max_vectors; i++) {
> +			int irq = udev->msix[i].vector;
> +			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
> +
> +			if (!trigger)
> +				continue;
> +
> +			free_irq(irq, trigger);
> +			eventfd_ctx_put(trigger);
> +			udev->ctx[i].trigger = NULL;
> +		}
> +
> +		if (udev->msix)
> +			pci_disable_msix(udev->pdev);
> +	}
> +	mutex_unlock(&udev->mutex);
> +
> +	return 0;
> +}
> +
> +/* Unmap previously ioremap'd resources */
> +static void
> +release_iomaps(struct uio_mem *mem)
> +{
> +	int i;
> +
> +	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
> +		if (mem->internal_addr)
> +			iounmap(mem->internal_addr);
> +	}
> +}
> +
> +static int
> +setup_maps(struct pci_dev *pdev, struct uio_info *info)
> +{
> +	int i, m = 0, p = 0, err;
> +	static const char * const bar_names[] = {
> +		"BAR0",	"BAR1",	"BAR2",	"BAR3",	"BAR4",	"BAR5",
> +	};
> +
> +	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
> +		unsigned long start = pci_resource_start(pdev, i);
> +		unsigned long flags = pci_resource_flags(pdev, i);
> +		unsigned long len = pci_resource_len(pdev, i);
> +
> +		if (start == 0 || len == 0)
> +			continue;
> +
> +		if (flags & IORESOURCE_MEM) {
> +			void __iomem *addr;
> +
> +			if (m >= MAX_UIO_MAPS)
> +				continue;
> +
> +			addr = ioremap(start, len);
> +			if (addr == NULL) {
> +				err = -EINVAL;
> +				goto fail;
> +			}
> +
> +			info->mem[m].name = bar_names[i];
> +			info->mem[m].addr = start;
> +			info->mem[m].internal_addr = addr;
> +			info->mem[m].size = len;
> +			info->mem[m].memtype = UIO_MEM_PHYS;
> +			++m;
> +		} else if (flags & IORESOURCE_IO) {
> +			if (p >= MAX_UIO_PORT_REGIONS)
> +				continue;
> +
> +			info->port[p].name = bar_names[i];
> +			info->port[p].start = start;
> +			info->port[p].size = len;
> +			info->port[p].porttype = UIO_PORT_X86;
> +			++p;
> +		}
> +	}
> +
> +	return 0;
> + fail:
> +	for (i = 0; i < m; i++)
> +		iounmap(info->mem[i].internal_addr);
> +	return err;
> +}
> +
> +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> +{
> +	struct uio_msi_pci_dev *udev;
> +	int i, err, vectors;
> +
> +	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
> +	if (!udev)
> +		return -ENOMEM;
> +
> +	err = pci_enable_device(pdev);
> +	if (err != 0) {
> +		dev_err(&pdev->dev, "cannot enable PCI device\n");
> +		goto fail_free;
> +	}
> +
> +	err = pci_request_regions(pdev, "uio_msi");
> +	if (err != 0) {
> +		dev_err(&pdev->dev, "Cannot request regions\n");
> +		goto fail_disable;
> +	}
> +
> +	pci_set_master(pdev);
> +
> +	/* remap resources */
> +	err = setup_maps(pdev, &udev->info);
> +	if (err)
> +		goto fail_release_iomem;
> +
> +	/* fill uio infos */
> +	udev->info.name = "uio_msi";
> +	udev->info.version = DRIVER_VERSION;
> +	udev->info.priv = udev;
> +	udev->pdev = pdev;
> +	udev->info.ioctl = uio_msi_ioctl;
> +	udev->info.open = uio_msi_open;
> +	udev->info.release = uio_msi_release;
> +	udev->info.irq = UIO_IRQ_CUSTOM;
> +	mutex_init(&udev->mutex);
> +
> +	vectors = pci_msix_vec_count(pdev);
> +	if (vectors > 0) {
> +		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
> +		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
> +			udev->max_vectors);
> +
> +		err = -ENOMEM;
> +		udev->msix = kcalloc(udev->max_vectors,
> +				     sizeof(struct msix_entry), GFP_KERNEL);
> +		if (!udev->msix)
> +			goto fail_release_iomem;
> +	} else if (!pci_intx_mask_supported(pdev)) {
> +		dev_err(&pdev->dev,
> +			"device does not support MSI-X or INTX\n");
> +		err = -EINVAL;
> +		goto fail_release_iomem;
> +	} else {
> +		dev_notice(&pdev->dev, "using INTX\n");
> +		udev->info.irq_flags = IRQF_SHARED;
> +		udev->max_vectors = 1;
> +	}
> +
> +	udev->ctx = kcalloc(udev->max_vectors,
> +			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
> +	if (!udev->ctx)
> +		goto fail_free_msix;
> +
> +	for (i = 0; i < udev->max_vectors; i++) {
> +		udev->msix[i].entry = i;
> +
> +		udev->ctx[i].name = kasprintf(GFP_KERNEL,
> +					      KBUILD_MODNAME "[%d](%s)",
> +					      i, pci_name(pdev));
> +		if (!udev->ctx[i].name)
> +			goto fail_free_ctx;
> +	}
> +
> +	/* register uio driver */
> +	err = uio_register_device(&pdev->dev, &udev->info);
> +	if (err != 0)
> +		goto fail_free_ctx;
> +
> +	pci_set_drvdata(pdev, udev);
> +	return 0;
> +
> +fail_free_ctx:
> +	for (i = 0; i < udev->max_vectors; i++)
> +		kfree(udev->ctx[i].name);
> +	kfree(udev->ctx);
> +fail_free_msix:
> +	kfree(udev->msix);
> +fail_release_iomem:
> +	release_iomaps(udev->info.mem);
> +	pci_release_regions(pdev);
> +fail_disable:
> +	pci_disable_device(pdev);
> +fail_free:
> +	kfree(udev);
> +
> +	pr_notice("%s ret %d\n", __func__, err);
> +	return err;
> +}
> +
> +static void uio_msi_remove(struct pci_dev *pdev)
> +{
> +	struct uio_info *info = pci_get_drvdata(pdev);
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int i;
> +
> +	uio_unregister_device(info);
> +	release_iomaps(info->mem);
> +
> +	pci_release_regions(pdev);
> +	for (i = 0; i < udev->max_vectors; i++)
> +		kfree(udev->ctx[i].name);
> +	kfree(udev->ctx);
> +	kfree(udev->msix);
> +	pci_disable_device(pdev);
> +
> +	pci_set_drvdata(pdev, NULL);
> +	kfree(udev);
> +}
> +
> +static struct pci_driver uio_msi_pci_driver = {
> +	.name = "uio_msi",
> +	.probe = uio_msi_probe,
> +	.remove = uio_msi_remove,
> +};
> +
> +module_pci_driver(uio_msi_pci_driver);
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
> +MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index f7b2db4..d9497691 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -411,6 +411,7 @@ header-y += udp.h
>  header-y += uhid.h
>  header-y += uinput.h
>  header-y += uio.h
> +header-y += uio_msi.h
>  header-y += ultrasound.h
>  header-y += un.h
>  header-y += unistd.h
> diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
> new file mode 100644
> index 0000000..297de00
> --- /dev/null
> +++ b/include/uapi/linux/uio_msi.h
> @@ -0,0 +1,22 @@
> +/*
> + * UIO_MSI API definition
> + *
> + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +#ifndef _UIO_PCI_MSI_H
> +#define _UIO_PCI_MSI_H
> +
> +struct uio_msi_irq_set {
> +	u32 vec;
> +	int fd;
> +};
> +
> +#define UIO_MSI_BASE	0x86
> +#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
> +
> +#endif
> -- 
> 2.1.4
> 
  
Michael S. Tsirkin Oct. 1, 2015, 10:37 a.m. UTC | #2
On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > This driver allows using PCI device with Message Signalled Interrupt
> > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > file descriptors similar to VFIO.
> >
> > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > have to work in environments where IOMMU support (real or emulated) is
> > not available.  All UIO drivers that support DMA are not secure against
> > rogue userspace applications programming DMA hardware to access
> > private memory; this driver is no less secure than existing code.
> > 
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> 
> I don't think copying the igb_uio interface is a good idea.
> What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> is abusing the sysfs BAR access to provide unlimited
> access to hardware.
> 
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.
> 
> uio_pci_generic is already abused like that, mostly
> because when I wrote it, I didn't add enough protections
> against using it with DMA capable devices,
> and we can't go back and break working userspace.
> But at least it does not bind to VFs which all of
> them are capable of DMA.
> 
> The result of merging this driver will be userspace abusing the
> sysfs BAR access with VFs as well, and we do not want that.
> 
> 
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.
> 
> On a more positive note:
> 
> What would be a reasonable interface? One that does the following
> in kernel:
> 
> 1. initializes device rings (can be in pinned userspace memory,
>    but can not be writeable by userspace), brings up interface link
> 2. pins userspace memory (unless using e.g. hugetlbfs)
> 3. gets request, make sure it's valid and belongs to
>    the correct task, put it in the ring
> 4. in the reverse direction, notify userspace when buffers
>    are available in the ring
> 5. notify userspace about MSI (what this driver does)
> 
> What userspace can be allowed to do:
> 
> 	format requests (e.g. transmit, receive) in userspace
> 	read ring contents
> 
> What userspace can't be allowed to do:
> 
> 	access BAR
> 	write rings

Thinking about it some more, many devices have separate rings for DMA:
TX (the device reads memory) and RX (the device writes memory).
With such devices, a mode where userspace can write the TX ring
but not the RX ring might make sense.

This will mean userspace might read kernel memory
through the device, but cannot corrupt it.

That's already a big win!

And RX buffers do not have to be added one at a time.
If we assume 0.2 usec per system call, batching some 100 buffers per
system call gives you about 2 nanoseconds of overhead per buffer.
That seems quite reasonable.
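
A hypothetical sketch of that batching arithmetic, reusing the made-up 'X'
ioctl namespace from the earlier sketch (XUIO_RX_REFILL does not exist; it
stands in for whatever a kernel-side ring driver would expose):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define RX_BATCH 100

struct xuio_rx_refill {			/* made up, for illustration only */
	uint32_t count;
	uint64_t addrs[RX_BATCH];	/* userspace buffer addresses */
};
#define XUIO_RX_REFILL	_IOW('X', 2, struct xuio_rx_refill)

/* Re-arm RX_BATCH descriptors with a single system call. At ~0.2 usec
 * per call that is roughly 200 ns / 100 = ~2 ns of overhead per buffer. */
static void refill_rx(int dev_fd, uint64_t (*alloc_buf)(void))
{
	struct xuio_rx_refill batch = { .count = RX_BATCH };

	for (uint32_t i = 0; i < RX_BATCH; i++)
		batch.addrs[i] = alloc_buf();	/* e.g. from a hugepage pool */

	ioctl(dev_fd, XUIO_RX_REFILL, &batch);
}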





> 
> This means that the driver can not be a generic one,
> and there will be a system call overhead when you
> write the ring, but that's the price you have to
> pay for ability to run on systems without an IOMMU.
> 
> 
> 
> 
  
Stephen Hemminger Oct. 1, 2015, 2:50 p.m. UTC | #3
On Thu, 1 Oct 2015 11:33:06 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > This driver allows using PCI device with Message Signalled Interrupt
> > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > file descriptors similar to VFIO.
> >
> > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > have to work in environments where IOMMU support (real or emulated) is
> > not available.  All UIO drivers that support DMA are not secure against
> > rogue userspace applications programming DMA hardware to access
> > private memory; this driver is no less secure than existing code.
> > 
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> 
> I don't think copying the igb_uio interface is a good idea.
> What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> is abusing the sysfs BAR access to provide unlimited
> access to hardware.
> 
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.
> 
> uio_pci_generic is already abused like that, mostly
> because when I wrote it, I didn't add enough protections
> against using it with DMA capable devices,
> and we can't go back and break working userspace.
> But at least it does not bind to VFs which all of
> them are capable of DMA.
> 
> The result of merging this driver will be userspace abusing the
> sysfs BAR access with VFs as well, and we do not want that.
> 
> 
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.
> 
> On a more positive note:
> 
> What would be a reasonable interface? One that does the following
> in kernel:
> 
> 1. initializes device rings (can be in pinned userspace memory,
>    but can not be writeable by userspace), brings up interface link
> 2. pins userspace memory (unless using e.g. hugetlbfs)
> 3. gets request, make sure it's valid and belongs to
>    the correct task, put it in the ring
> 4. in the reverse direction, notify userspace when buffers
>    are available in the ring
> 5. notify userspace about MSI (what this driver does)
> 
> What userspace can be allowed to do:
> 
> 	format requests (e.g. transmit, receive) in userspace
> 	read ring contents
> 
> What userspace can't be allowed to do:
> 
> 	access BAR
> 	write rings
> 
> 
> This means that the driver can not be a generic one,
> and there will be a system call overhead when you
> write the ring, but that's the price you have to
> pay for ability to run on systems without an IOMMU.

I think I understand what you are proposing, but it really doesn't
fit into the high-speed userspace networking model.

1. Device rings are device specific, so they can't be in a generic driver.
2. DPDK uses hugepage memory.
3. Performance requires all ring requests be done in pure userspace
   (i.e. no syscalls).
4. Ditto, can't have a kernel-to-userspace notification per packet.
  
Michael S. Tsirkin Oct. 1, 2015, 3:22 p.m. UTC | #4
On Thu, Oct 01, 2015 at 07:50:37AM -0700, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 11:33:06 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > This driver allows using PCI device with Message Signalled Interrupt
> > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > file descriptors similar to VFIO.
> > >
> > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > have to work in environments where IOMMU support (real or emulated) is
> > > not available.  All UIO drivers that support DMA are not secure against
> > > rogue userspace applications programming DMA hardware to access
> > > private memory; this driver is no less secure than existing code.
> > > 
> > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > 
> > I don't think copying the igb_uio interface is a good idea.
> > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > is abusing the sysfs BAR access to provide unlimited
> > access to hardware.
> > 
> > MSI messages are memory writes so any generic device capable
> > of MSI is capable of corrupting kernel memory.
> > This means that a bug in userspace will lead to kernel memory corruption
> > and crashes.  This is something distributions can't support.
> > 
> > uio_pci_generic is already abused like that, mostly
> > because when I wrote it, I didn't add enough protections
> > against using it with DMA capable devices,
> > and we can't go back and break working userspace.
> > But at least it does not bind to VFs which all of
> > them are capable of DMA.
> > 
> > The result of merging this driver will be userspace abusing the
> > sysfs BAR access with VFs as well, and we do not want that.
> > 
> > 
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
> > 
> > On a more positive note:
> > 
> > What would be a reasonable interface? One that does the following
> > in kernel:
> > 
> > 1. initializes device rings (can be in pinned userspace memory,
> >    but can not be writeable by userspace), brings up interface link
> > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > 3. gets request, make sure it's valid and belongs to
> >    the correct task, put it in the ring
> > 4. in the reverse direction, notify userspace when buffers
> >    are available in the ring
> > 5. notify userspace about MSI (what this driver does)
> > 
> > What userspace can be allowed to do:
> > 
> > 	format requests (e.g. transmit, receive) in userspace
> > 	read ring contents
> > 
> > What userspace can't be allowed to do:
> > 
> > 	access BAR
> > 	write rings
> > 
> > 
> > This means that the driver can not be a generic one,
> > and there will be a system call overhead when you
> > write the ring, but that's the price you have to
> > pay for ability to run on systems without an IOMMU.
> 
> I think I understand what you are proposing, but it really doesn't
> fit into the high speed userspace networking model.

I'm aware of the fact that the current model does everything, including
bringing up the link, in userspace.
But there's really no justification for this.
Only data-path things should be in userspace.

A userspace bug should not be able to do things like overwrite the
on-device EEPROM.


> 1. Device rings are device specific, can't be in a generic driver.

So that's more work, and it is not going to happen if people
can get by with insecure hacks.

> 2. DPDK uses huge mememory.

Hugetlbfs? Don't see why this is an issue. Might make things simpler.

> 3. Performance requires all ring requests be done in pure userspace,
>    (ie no syscalls)

Make only the TX ring writeable then. At least you won't be
able to corrupt the kernel memory.

> 4. Ditto, can't have kernel to userspace notification per packet

RX ring can be read-only, so userspace can read it directly.
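
A hypothetical sketch of that read-only RX ring idea (the descriptor layout
and mmap offset are made up; a real device defines its own): the kernel owns
the ring and userspace polls a PROT_READ mapping of it, so no syscall is
needed per packet but a stray store cannot corrupt the descriptors.

#include <stdint.h>
#include <sys/mman.h>

struct rx_desc {			/* device-specific in reality */
	uint64_t addr;
	uint16_t len;
	uint16_t status;		/* e.g. bit 0 = descriptor done */
	uint32_t pad;
};

#define RING_SIZE		512
#define XUIO_RX_RING_OFFSET	4096UL	/* made-up mmap offset */

static volatile const struct rx_desc *map_rx_ring(int dev_fd)
{
	/* PROT_READ only: polling works, writes fault instead of
	 * corrupting descriptors the device will DMA through. */
	void *p = mmap(NULL, RING_SIZE * sizeof(struct rx_desc),
		       PROT_READ, MAP_SHARED, dev_fd, XUIO_RX_RING_OFFSET);

	return p == MAP_FAILED ? NULL : p;
}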
  
Michael S. Tsirkin Oct. 1, 2015, 4:06 p.m. UTC | #5
On Thu, Oct 01, 2015 at 01:37:12PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > This driver allows using PCI device with Message Signalled Interrupt
> > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > file descriptors similar to VFIO.
> > >
> > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > have to work in environments where IOMMU support (real or emulated) is
> > > not available.  All UIO drivers that support DMA are not secure against
> > > rogue userspace applications programming DMA hardware to access
> > > private memory; this driver is no less secure than existing code.
> > > 
> > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > 
> > I don't think copying the igb_uio interface is a good idea.
> > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > is abusing the sysfs BAR access to provide unlimited
> > access to hardware.
> > 
> > MSI messages are memory writes so any generic device capable
> > of MSI is capable of corrupting kernel memory.
> > This means that a bug in userspace will lead to kernel memory corruption
> > and crashes.  This is something distributions can't support.
> > 
> > uio_pci_generic is already abused like that, mostly
> > because when I wrote it, I didn't add enough protections
> > against using it with DMA capable devices,
> > and we can't go back and break working userspace.
> > But at least it does not bind to VFs which all of
> > them are capable of DMA.
> > 
> > The result of merging this driver will be userspace abusing the
> > sysfs BAR access with VFs as well, and we do not want that.
> > 
> > 
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
> > 
> > On a more positive note:
> > 
> > What would be a reasonable interface? One that does the following
> > in kernel:
> > 
> > 1. initializes device rings (can be in pinned userspace memory,
> >    but can not be writeable by userspace), brings up interface link
> > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > 3. gets request, make sure it's valid and belongs to
> >    the correct task, put it in the ring
> > 4. in the reverse direction, notify userspace when buffers
> >    are available in the ring
> > 5. notify userspace about MSI (what this driver does)
> > 
> > What userspace can be allowed to do:
> > 
> > 	format requests (e.g. transmit, receive) in userspace
> > 	read ring contents
> > 
> > What userspace can't be allowed to do:
> > 
> > 	access BAR
> > 	write rings
> 
> Thinking about it some more, many devices
> 
> 
> Have separate rings for DMA: TX (device reads memory)
> and RX (device writes memory).
> With such devices, a mode where userspace can write TX ring
> but not RX ring might make sense.
> 
> This will mean userspace might read kernel memory
> through the device, but can not corrupt it.
> 
> That's already a big win!
> 
> And RX buffers do not have to be added one at a time.
> If we assume 0.2usec per system call, batching some 100 buffers per
> system call gives you 2 nano seconds overhead.  That seems quite
> reasonable.
> 

To add to that, there's no reason to do this on the same core.
Re-arming descriptors can happen in parallel with packet
processing, so this overhead won't affect PPS or latency at all:
only the CPU utilization.
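
A hypothetical sketch of that split, assuming a batching helper like the
refill_rx() sketched earlier: a housekeeping thread keeps re-arming
descriptors while the polling cores never make the system call.

#include <pthread.h>
#include <stdint.h>

/* Made-up helpers from the earlier sketches. */
extern void refill_rx(int dev_fd, uint64_t (*alloc_buf)(void));
extern uint64_t alloc_buf_from_pool(void);

static void *refill_thread(void *arg)
{
	int dev_fd = *(int *)arg;

	for (;;)			/* syscall cost stays off the packet path */
		refill_rx(dev_fd, alloc_buf_from_pool);
	return NULL;
}

static pthread_t start_refill(int *dev_fd)
{
	pthread_t t;

	/* In practice this thread would be pinned to a housekeeping core
	 * (e.g. pthread_setaffinity_np) away from the polling cores. */
	pthread_create(&t, NULL, refill_thread, dev_fd);
	return t;
}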


> 
> 
> 
> > 
> > This means that the driver can not be a generic one,
> > and there will be a system call overhead when you
> > write the ring, but that's the price you have to
> > pay for ability to run on systems without an IOMMU.
> > 
> > 
> > 
> > 
> > > ---
> > >  drivers/uio/Kconfig          |   9 ++
> > >  drivers/uio/Makefile         |   1 +
> > >  drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/Kbuild    |   1 +
> > >  include/uapi/linux/uio_msi.h |  22 +++
> > >  5 files changed, 411 insertions(+)
> > >  create mode 100644 drivers/uio/uio_msi.c
> > >  create mode 100644 include/uapi/linux/uio_msi.h
> > > 
> > > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> > > index 52c98ce..04adfa0 100644
> > > --- a/drivers/uio/Kconfig
> > > +++ b/drivers/uio/Kconfig
> > > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
> > >  	  primarily, for virtualization scenarios.
> > >  	  If you compile this as a module, it will be called uio_pci_generic.
> > >  
> > > +config UIO_PCI_MSI
> > > +       tristate "Generic driver supporting MSI-x on PCI Express cards"
> > > +	depends on PCI
> > > +	help
> > > +	  Generic driver that provides Message Signalled IRQ events
> > > +	  similar to VFIO. If IOMMMU is available please use VFIO
> > > +	  instead since it provides more security.
> > > +	  If you compile this as a module, it will be called uio_msi.
> > > +
> > >  config UIO_NETX
> > >  	tristate "Hilscher NetX Card driver"
> > >  	depends on PCI
> > > diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
> > > index 8560dad..62fc44b 100644
> > > --- a/drivers/uio/Makefile
> > > +++ b/drivers/uio/Makefile
> > > @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX)	+= uio_netx.o
> > >  obj-$(CONFIG_UIO_PRUSS)         += uio_pruss.o
> > >  obj-$(CONFIG_UIO_MF624)         += uio_mf624.o
> > >  obj-$(CONFIG_UIO_FSL_ELBC_GPCM)	+= uio_fsl_elbc_gpcm.o
> > > +obj-$(CONFIG_UIO_PCI_MSI)	+= uio_msi.o
> > > diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
> > > new file mode 100644
> > > index 0000000..802b5c4
> > > --- /dev/null
> > > +++ b/drivers/uio/uio_msi.c
> > > @@ -0,0 +1,378 @@
> > > +/*-
> > > + *
> > > + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> > > + * Author: Stephen Hemminger <stephen@networkplumber.org>
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2 only.
> > > + */
> > > +
> > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > +
> > > +#include <linux/device.h>
> > > +#include <linux/interrupt.h>
> > > +#include <linux/eventfd.h>
> > > +#include <linux/module.h>
> > > +#include <linux/pci.h>
> > > +#include <linux/uio_driver.h>
> > > +#include <linux/msi.h>
> > > +#include <linux/uio_msi.h>
> > > +
> > > +#define DRIVER_VERSION		"0.1.1"
> > > +#define MAX_MSIX_VECTORS	64
> > > +
> > > +/* MSI-X vector information */
> > > +struct uio_msi_pci_dev {
> > > +	struct uio_info info;		/* UIO driver info */
> > > +	struct pci_dev *pdev;		/* PCI device */
> > > +	struct mutex	mutex;		/* open/release/ioctl mutex */
> > > +	int		ref_cnt;	/* references to device */
> > > +	unsigned int	max_vectors;	/* MSI-X slots available */
> > > +	struct msix_entry *msix;	/* MSI-X vector table */
> > > +	struct uio_msi_irq_ctx {
> > > +		struct eventfd_ctx *trigger; /* vector to eventfd */
> > > +		char *name;		/* name in /proc/interrupts */
> > > +	} *ctx;
> > > +};
> > > +
> > > +static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
> > > +{
> > > +	struct uio_msi_pci_dev *udev = arg;
> > > +
> > > +	if (pci_check_and_mask_intx(udev->pdev)) {
> > > +		eventfd_signal(udev->ctx->trigger, 1);
> > > +		return IRQ_HANDLED;
> > > +	}
> > > +
> > > +	return IRQ_NONE;
> > > +}
> > > +
> > > +static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
> > > +{
> > > +	struct eventfd_ctx *trigger = arg;
> > > +
> > > +	eventfd_signal(trigger, 1);
> > > +	return IRQ_HANDLED;
> > > +}
> > > +
> > > +/* set the mapping between vector # and existing eventfd. */
> > > +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
> > > +{
> > > +	struct eventfd_ctx *trigger;
> > > +	int irq, err;
> > > +
> > > +	if (vec >= udev->max_vectors) {
> > > +		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
> > > +			   vec, udev->max_vectors);
> > > +		return -ERANGE;
> > > +	}
> > > +
> > > +	irq = udev->msix[vec].vector;
> > > +	trigger = udev->ctx[vec].trigger;
> > > +	if (trigger) {
> > > +		/* Clearup existing irq mapping */
> > > +		free_irq(irq, trigger);
> > > +		eventfd_ctx_put(trigger);
> > > +		udev->ctx[vec].trigger = NULL;
> > > +	}
> > > +
> > > +	/* Passing -1 is used to disable interrupt */
> > > +	if (fd < 0)
> > > +		return 0;
> > > +
> > > +	trigger = eventfd_ctx_fdget(fd);
> > > +	if (IS_ERR(trigger)) {
> > > +		err = PTR_ERR(trigger);
> > > +		dev_notice(&udev->pdev->dev,
> > > +			   "eventfd ctx get failed: %d\n", err);
> > > +		return err;
> > > +	}
> > > +
> > > +	if (udev->msix)
> > > +		err = request_irq(irq, uio_msi_irqhandler, 0,
> > > +				  udev->ctx[vec].name, trigger);
> > > +	else
> > > +		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
> > > +				  udev->ctx[vec].name, udev);
> > > +
> > > +	if (err) {
> > > +		dev_notice(&udev->pdev->dev,
> > > +			   "request irq failed: %d\n", err);
> > > +		eventfd_ctx_put(trigger);
> > > +		return err;
> > > +	}
> > > +
> > > +	udev->ctx[vec].trigger = trigger;
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
> > > +{
> > > +	struct uio_msi_pci_dev *udev
> > > +		= container_of(info, struct uio_msi_pci_dev, info);
> > > +	struct uio_msi_irq_set hdr;
> > > +	int err;
> > > +
> > > +	switch (cmd) {
> > > +	case UIO_MSI_IRQ_SET:
> > > +		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
> > > +			return -EFAULT;
> > > +
> > > +		mutex_lock(&udev->mutex);
> > > +		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
> > > +		mutex_unlock(&udev->mutex);
> > > +		break;
> > > +	default:
> > > +		err = -EOPNOTSUPP;
> > > +	}
> > > +	return err;
> > > +}
> > > +
> > > +/* Opening the UIO device for first time enables MSI-X */
> > > +static int
> > > +uio_msi_open(struct uio_info *info, struct inode *inode)
> > > +{
> > > +	struct uio_msi_pci_dev *udev
> > > +		= container_of(info, struct uio_msi_pci_dev, info);
> > > +	int err = 0;
> > > +
> > > +	mutex_lock(&udev->mutex);
> > > +	if (udev->ref_cnt++ == 0) {
> > > +		if (udev->msix)
> > > +			err = pci_enable_msix(udev->pdev, udev->msix,
> > > +					      udev->max_vectors);
> > > +	}
> > > +	mutex_unlock(&udev->mutex);
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/* Last close of the UIO device releases/disables all IRQ's */
> > > +static int
> > > +uio_msi_release(struct uio_info *info, struct inode *inode)
> > > +{
> > > +	struct uio_msi_pci_dev *udev
> > > +		= container_of(info, struct uio_msi_pci_dev, info);
> > > +	int i;
> > > +
> > > +	mutex_lock(&udev->mutex);
> > > +	if (--udev->ref_cnt == 0) {
> > > +		for (i = 0; i < udev->max_vectors; i++) {
> > > +			int irq = udev->msix[i].vector;
> > > +			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
> > > +
> > > +			if (!trigger)
> > > +				continue;
> > > +
> > > +			free_irq(irq, trigger);
> > > +			eventfd_ctx_put(trigger);
> > > +			udev->ctx[i].trigger = NULL;
> > > +		}
> > > +
> > > +		if (udev->msix)
> > > +			pci_disable_msix(udev->pdev);
> > > +	}
> > > +	mutex_unlock(&udev->mutex);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/* Unmap previously ioremap'd resources */
> > > +static void
> > > +release_iomaps(struct uio_mem *mem)
> > > +{
> > > +	int i;
> > > +
> > > +	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
> > > +		if (mem->internal_addr)
> > > +			iounmap(mem->internal_addr);
> > > +	}
> > > +}
> > > +
> > > +static int
> > > +setup_maps(struct pci_dev *pdev, struct uio_info *info)
> > > +{
> > > +	int i, m = 0, p = 0, err;
> > > +	static const char * const bar_names[] = {
> > > +		"BAR0",	"BAR1",	"BAR2",	"BAR3",	"BAR4",	"BAR5",
> > > +	};
> > > +
> > > +	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
> > > +		unsigned long start = pci_resource_start(pdev, i);
> > > +		unsigned long flags = pci_resource_flags(pdev, i);
> > > +		unsigned long len = pci_resource_len(pdev, i);
> > > +
> > > +		if (start == 0 || len == 0)
> > > +			continue;
> > > +
> > > +		if (flags & IORESOURCE_MEM) {
> > > +			void __iomem *addr;
> > > +
> > > +			if (m >= MAX_UIO_MAPS)
> > > +				continue;
> > > +
> > > +			addr = ioremap(start, len);
> > > +			if (addr == NULL) {
> > > +				err = -EINVAL;
> > > +				goto fail;
> > > +			}
> > > +
> > > +			info->mem[m].name = bar_names[i];
> > > +			info->mem[m].addr = start;
> > > +			info->mem[m].internal_addr = addr;
> > > +			info->mem[m].size = len;
> > > +			info->mem[m].memtype = UIO_MEM_PHYS;
> > > +			++m;
> > > +		} else if (flags & IORESOURCE_IO) {
> > > +			if (p >= MAX_UIO_PORT_REGIONS)
> > > +				continue;
> > > +
> > > +			info->port[p].name = bar_names[i];
> > > +			info->port[p].start = start;
> > > +			info->port[p].size = len;
> > > +			info->port[p].porttype = UIO_PORT_X86;
> > > +			++p;
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > + fail:
> > > +	for (i = 0; i < m; i++)
> > > +		iounmap(info->mem[i].internal_addr);
> > > +	return err;
> > > +}
> > > +
> > > +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > > +{
> > > +	struct uio_msi_pci_dev *udev;
> > > +	int i, err, vectors;
> > > +
> > > +	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
> > > +	if (!udev)
> > > +		return -ENOMEM;
> > > +
> > > +	err = pci_enable_device(pdev);
> > > +	if (err != 0) {
> > > +		dev_err(&pdev->dev, "cannot enable PCI device\n");
> > > +		goto fail_free;
> > > +	}
> > > +
> > > +	err = pci_request_regions(pdev, "uio_msi");
> > > +	if (err != 0) {
> > > +		dev_err(&pdev->dev, "Cannot request regions\n");
> > > +		goto fail_disable;
> > > +	}
> > > +
> > > +	pci_set_master(pdev);
> > > +
> > > +	/* remap resources */
> > > +	err = setup_maps(pdev, &udev->info);
> > > +	if (err)
> > > +		goto fail_release_iomem;
> > > +
> > > +	/* fill uio infos */
> > > +	udev->info.name = "uio_msi";
> > > +	udev->info.version = DRIVER_VERSION;
> > > +	udev->info.priv = udev;
> > > +	udev->pdev = pdev;
> > > +	udev->info.ioctl = uio_msi_ioctl;
> > > +	udev->info.open = uio_msi_open;
> > > +	udev->info.release = uio_msi_release;
> > > +	udev->info.irq = UIO_IRQ_CUSTOM;
> > > +	mutex_init(&udev->mutex);
> > > +
> > > +	vectors = pci_msix_vec_count(pdev);
> > > +	if (vectors > 0) {
> > > +		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
> > > +		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
> > > +			udev->max_vectors);
> > > +
> > > +		err = -ENOMEM;
> > > +		udev->msix = kcalloc(udev->max_vectors,
> > > +				     sizeof(struct msix_entry), GFP_KERNEL);
> > > +		if (!udev->msix)
> > > +			goto fail_release_iomem;
> > > +	} else if (!pci_intx_mask_supported(pdev)) {
> > > +		dev_err(&pdev->dev,
> > > +			"device does not support MSI-X or INTX\n");
> > > +		err = -EINVAL;
> > > +		goto fail_release_iomem;
> > > +	} else {
> > > +		dev_notice(&pdev->dev, "using INTX\n");
> > > +		udev->info.irq_flags = IRQF_SHARED;
> > > +		udev->max_vectors = 1;
> > > +	}
> > > +
> > > +	udev->ctx = kcalloc(udev->max_vectors,
> > > +			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
> > > +	if (!udev->ctx)
> > > +		goto fail_free_msix;
> > > +
> > > +	for (i = 0; i < udev->max_vectors; i++) {
> > > +		udev->msix[i].entry = i;
> > > +
> > > +		udev->ctx[i].name = kasprintf(GFP_KERNEL,
> > > +					      KBUILD_MODNAME "[%d](%s)",
> > > +					      i, pci_name(pdev));
> > > +		if (!udev->ctx[i].name)
> > > +			goto fail_free_ctx;
> > > +	}
> > > +
> > > +	/* register uio driver */
> > > +	err = uio_register_device(&pdev->dev, &udev->info);
> > > +	if (err != 0)
> > > +		goto fail_free_ctx;
> > > +
> > > +	pci_set_drvdata(pdev, udev);
> > > +	return 0;
> > > +
> > > +fail_free_ctx:
> > > +	for (i = 0; i < udev->max_vectors; i++)
> > > +		kfree(udev->ctx[i].name);
> > > +	kfree(udev->ctx);
> > > +fail_free_msix:
> > > +	kfree(udev->msix);
> > > +fail_release_iomem:
> > > +	release_iomaps(udev->info.mem);
> > > +	pci_release_regions(pdev);
> > > +fail_disable:
> > > +	pci_disable_device(pdev);
> > > +fail_free:
> > > +	kfree(udev);
> > > +
> > > +	pr_notice("%s ret %d\n", __func__, err);
> > > +	return err;
> > > +}
> > > +
> > > +static void uio_msi_remove(struct pci_dev *pdev)
> > > +{
> > > +	struct uio_info *info = pci_get_drvdata(pdev);
> > > +	struct uio_msi_pci_dev *udev
> > > +		= container_of(info, struct uio_msi_pci_dev, info);
> > > +	int i;
> > > +
> > > +	uio_unregister_device(info);
> > > +	release_iomaps(info->mem);
> > > +
> > > +	pci_release_regions(pdev);
> > > +	for (i = 0; i < udev->max_vectors; i++)
> > > +		kfree(udev->ctx[i].name);
> > > +	kfree(udev->ctx);
> > > +	kfree(udev->msix);
> > > +	pci_disable_device(pdev);
> > > +
> > > +	pci_set_drvdata(pdev, NULL);
> > > +	kfree(udev);
> > > +}
> > > +
> > > +static struct pci_driver uio_msi_pci_driver = {
> > > +	.name = "uio_msi",
> > > +	.probe = uio_msi_probe,
> > > +	.remove = uio_msi_remove,
> > > +};
> > > +
> > > +module_pci_driver(uio_msi_pci_driver);
> > > +MODULE_VERSION(DRIVER_VERSION);
> > > +MODULE_LICENSE("GPL v2");
> > > +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
> > > +MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
> > > diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> > > index f7b2db4..d9497691 100644
> > > --- a/include/uapi/linux/Kbuild
> > > +++ b/include/uapi/linux/Kbuild
> > > @@ -411,6 +411,7 @@ header-y += udp.h
> > >  header-y += uhid.h
> > >  header-y += uinput.h
> > >  header-y += uio.h
> > > +header-y += uio_msi.h
> > >  header-y += ultrasound.h
> > >  header-y += un.h
> > >  header-y += unistd.h
> > > diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
> > > new file mode 100644
> > > index 0000000..297de00
> > > --- /dev/null
> > > +++ b/include/uapi/linux/uio_msi.h
> > > @@ -0,0 +1,22 @@
> > > +/*
> > > + * UIO_MSI API definition
> > > + *
> > > + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> > > + * All rights reserved.
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + */
> > > +#ifndef _UIO_PCI_MSI_H
> > > +#define _UIO_PCI_MSI_H
> > > +
> > > +struct uio_msi_irq_set {
> > > +	u32 vec;
> > > +	int fd;
> > > +};
> > > +
> > > +#define UIO_MSI_BASE	0x86
> > > +#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
> > > +
> > > +#endif
> > > -- 
> > > 2.1.4
> > > 
  
Michael S. Tsirkin Oct. 1, 2015, 4:31 p.m. UTC | #6
On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > This driver allows using PCI device with Message Signalled Interrupt
> > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > file descriptors similar to VFIO.
> >
> > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > have to work in environments where IOMMU support (real or emulated) is
> > not available.  All UIO drivers that support DMA are not secure against
> > rogue userspace applications programming DMA hardware to access
> > private memory; this driver is no less secure than existing code.
> > 
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> 
> I don't think copying the igb_uio interface is a good idea.
> What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> is abusing the sysfs BAR access to provide unlimited
> access to hardware.
> 
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.
> 
> uio_pci_generic is already abused like that, mostly
> because when I wrote it, I didn't add enough protections
> against using it with DMA capable devices,
> and we can't go back and break working userspace.
> But at least it does not bind to VFs which all of
> them are capable of DMA.
> 
> The result of merging this driver will be userspace abusing the
> sysfs BAR access with VFs as well, and we do not want that.
> 
> 
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.
> 
> On a more positive note:
> 
> What would be a reasonable interface? One that does the following
> in kernel:
> 
> 1. initializes device rings (can be in pinned userspace memory,
>    but can not be writeable by userspace), brings up interface link
> 2. pins userspace memory (unless using e.g. hugetlbfs)
> 3. gets request, make sure it's valid and belongs to
>    the correct task, put it in the ring
> 4. in the reverse direction, notify userspace when buffers
>    are available in the ring
> 5. notify userspace about MSI (what this driver does)
> 
> What userspace can be allowed to do:
> 
> 	format requests (e.g. transmit, receive) in userspace
> 	read ring contents
> 
> What userspace can't be allowed to do:
> 
> 	access BAR
> 	write rings
> 
> 
> This means that the driver can not be a generic one,
> and there will be a system call overhead when you
> write the ring, but that's the price you have to
> pay for ability to run on systems without an IOMMU.
> 


The device specific parts can be taken from John Fastabend's patches
BTW:

https://patchwork.ozlabs.org/patch/396713/

IIUC what was missing there was exactly the memory protection
we are looking for here.


> 
> 
> > ---
> >  drivers/uio/Kconfig          |   9 ++
> >  drivers/uio/Makefile         |   1 +
> >  drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/Kbuild    |   1 +
> >  include/uapi/linux/uio_msi.h |  22 +++
> >  5 files changed, 411 insertions(+)
> >  create mode 100644 drivers/uio/uio_msi.c
> >  create mode 100644 include/uapi/linux/uio_msi.h
> > 
> > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> > index 52c98ce..04adfa0 100644
> > --- a/drivers/uio/Kconfig
> > +++ b/drivers/uio/Kconfig
> > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
> >  	  primarily, for virtualization scenarios.
> >  	  If you compile this as a module, it will be called uio_pci_generic.
> >  
> > +config UIO_PCI_MSI
> > +       tristate "Generic driver supporting MSI-x on PCI Express cards"
> > +	depends on PCI
> > +	help
> > +	  Generic driver that provides Message Signalled IRQ events
> > +	  similar to VFIO. If IOMMMU is available please use VFIO
> > +	  instead since it provides more security.
> > +	  If you compile this as a module, it will be called uio_msi.
> > +
> >  config UIO_NETX
> >  	tristate "Hilscher NetX Card driver"
> >  	depends on PCI
> > diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
> > index 8560dad..62fc44b 100644
> > --- a/drivers/uio/Makefile
> > +++ b/drivers/uio/Makefile
> > @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX)	+= uio_netx.o
> >  obj-$(CONFIG_UIO_PRUSS)         += uio_pruss.o
> >  obj-$(CONFIG_UIO_MF624)         += uio_mf624.o
> >  obj-$(CONFIG_UIO_FSL_ELBC_GPCM)	+= uio_fsl_elbc_gpcm.o
> > +obj-$(CONFIG_UIO_PCI_MSI)	+= uio_msi.o
> > diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
> > new file mode 100644
> > index 0000000..802b5c4
> > --- /dev/null
> > +++ b/drivers/uio/uio_msi.c
> > @@ -0,0 +1,378 @@
> > +/*-
> > + *
> > + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> > + * Author: Stephen Hemminger <stephen@networkplumber.org>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 only.
> > + */
> > +
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> > +#include <linux/device.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/module.h>
> > +#include <linux/pci.h>
> > +#include <linux/uio_driver.h>
> > +#include <linux/msi.h>
> > +#include <linux/uio_msi.h>
> > +
> > +#define DRIVER_VERSION		"0.1.1"
> > +#define MAX_MSIX_VECTORS	64
> > +
> > +/* MSI-X vector information */
> > +struct uio_msi_pci_dev {
> > +	struct uio_info info;		/* UIO driver info */
> > +	struct pci_dev *pdev;		/* PCI device */
> > +	struct mutex	mutex;		/* open/release/ioctl mutex */
> > +	int		ref_cnt;	/* references to device */
> > +	unsigned int	max_vectors;	/* MSI-X slots available */
> > +	struct msix_entry *msix;	/* MSI-X vector table */
> > +	struct uio_msi_irq_ctx {
> > +		struct eventfd_ctx *trigger; /* vector to eventfd */
> > +		char *name;		/* name in /proc/interrupts */
> > +	} *ctx;
> > +};
> > +
> > +static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
> > +{
> > +	struct uio_msi_pci_dev *udev = arg;
> > +
> > +	if (pci_check_and_mask_intx(udev->pdev)) {
> > +		eventfd_signal(udev->ctx->trigger, 1);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	return IRQ_NONE;
> > +}
> > +
> > +static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
> > +{
> > +	struct eventfd_ctx *trigger = arg;
> > +
> > +	eventfd_signal(trigger, 1);
> > +	return IRQ_HANDLED;
> > +}
> > +
> > +/* set the mapping between vector # and existing eventfd. */
> > +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
> > +{
> > +	struct eventfd_ctx *trigger;
> > +	int irq, err;
> > +
> > +	if (vec >= udev->max_vectors) {
> > +		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
> > +			   vec, udev->max_vectors);
> > +		return -ERANGE;
> > +	}
> > +
> > +	irq = udev->msix[vec].vector;
> > +	trigger = udev->ctx[vec].trigger;
> > +	if (trigger) {
> > +		/* Clearup existing irq mapping */
> > +		free_irq(irq, trigger);
> > +		eventfd_ctx_put(trigger);
> > +		udev->ctx[vec].trigger = NULL;
> > +	}
> > +
> > +	/* Passing -1 is used to disable interrupt */
> > +	if (fd < 0)
> > +		return 0;
> > +
> > +	trigger = eventfd_ctx_fdget(fd);
> > +	if (IS_ERR(trigger)) {
> > +		err = PTR_ERR(trigger);
> > +		dev_notice(&udev->pdev->dev,
> > +			   "eventfd ctx get failed: %d\n", err);
> > +		return err;
> > +	}
> > +
> > +	if (udev->msix)
> > +		err = request_irq(irq, uio_msi_irqhandler, 0,
> > +				  udev->ctx[vec].name, trigger);
> > +	else
> > +		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
> > +				  udev->ctx[vec].name, udev);
> > +
> > +	if (err) {
> > +		dev_notice(&udev->pdev->dev,
> > +			   "request irq failed: %d\n", err);
> > +		eventfd_ctx_put(trigger);
> > +		return err;
> > +	}
> > +
> > +	udev->ctx[vec].trigger = trigger;
> > +	return 0;
> > +}
> > +
> > +static int
> > +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
> > +{
> > +	struct uio_msi_pci_dev *udev
> > +		= container_of(info, struct uio_msi_pci_dev, info);
> > +	struct uio_msi_irq_set hdr;
> > +	int err;
> > +
> > +	switch (cmd) {
> > +	case UIO_MSI_IRQ_SET:
> > +		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
> > +			return -EFAULT;
> > +
> > +		mutex_lock(&udev->mutex);
> > +		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
> > +		mutex_unlock(&udev->mutex);
> > +		break;
> > +	default:
> > +		err = -EOPNOTSUPP;
> > +	}
> > +	return err;
> > +}
> > +
> > +/* Opening the UIO device for first time enables MSI-X */
> > +static int
> > +uio_msi_open(struct uio_info *info, struct inode *inode)
> > +{
> > +	struct uio_msi_pci_dev *udev
> > +		= container_of(info, struct uio_msi_pci_dev, info);
> > +	int err = 0;
> > +
> > +	mutex_lock(&udev->mutex);
> > +	if (udev->ref_cnt++ == 0) {
> > +		if (udev->msix)
> > +			err = pci_enable_msix(udev->pdev, udev->msix,
> > +					      udev->max_vectors);
> > +	}
> > +	mutex_unlock(&udev->mutex);
> > +
> > +	return err;
> > +}
> > +
> > +/* Last close of the UIO device releases/disables all IRQ's */
> > +static int
> > +uio_msi_release(struct uio_info *info, struct inode *inode)
> > +{
> > +	struct uio_msi_pci_dev *udev
> > +		= container_of(info, struct uio_msi_pci_dev, info);
> > +	int i;
> > +
> > +	mutex_lock(&udev->mutex);
> > +	if (--udev->ref_cnt == 0) {
> > +		for (i = 0; i < udev->max_vectors; i++) {
> > +			int irq = udev->msix[i].vector;
> > +			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
> > +
> > +			if (!trigger)
> > +				continue;
> > +
> > +			free_irq(irq, trigger);
> > +			eventfd_ctx_put(trigger);
> > +			udev->ctx[i].trigger = NULL;
> > +		}
> > +
> > +		if (udev->msix)
> > +			pci_disable_msix(udev->pdev);
> > +	}
> > +	mutex_unlock(&udev->mutex);
> > +
> > +	return 0;
> > +}
> > +
> > +/* Unmap previously ioremap'd resources */
> > +static void
> > +release_iomaps(struct uio_mem *mem)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
> > +		if (mem->internal_addr)
> > +			iounmap(mem->internal_addr);
> > +	}
> > +}
> > +
> > +static int
> > +setup_maps(struct pci_dev *pdev, struct uio_info *info)
> > +{
> > +	int i, m = 0, p = 0, err;
> > +	static const char * const bar_names[] = {
> > +		"BAR0",	"BAR1",	"BAR2",	"BAR3",	"BAR4",	"BAR5",
> > +	};
> > +
> > +	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
> > +		unsigned long start = pci_resource_start(pdev, i);
> > +		unsigned long flags = pci_resource_flags(pdev, i);
> > +		unsigned long len = pci_resource_len(pdev, i);
> > +
> > +		if (start == 0 || len == 0)
> > +			continue;
> > +
> > +		if (flags & IORESOURCE_MEM) {
> > +			void __iomem *addr;
> > +
> > +			if (m >= MAX_UIO_MAPS)
> > +				continue;
> > +
> > +			addr = ioremap(start, len);
> > +			if (addr == NULL) {
> > +				err = -EINVAL;
> > +				goto fail;
> > +			}
> > +
> > +			info->mem[m].name = bar_names[i];
> > +			info->mem[m].addr = start;
> > +			info->mem[m].internal_addr = addr;
> > +			info->mem[m].size = len;
> > +			info->mem[m].memtype = UIO_MEM_PHYS;
> > +			++m;
> > +		} else if (flags & IORESOURCE_IO) {
> > +			if (p >= MAX_UIO_PORT_REGIONS)
> > +				continue;
> > +
> > +			info->port[p].name = bar_names[i];
> > +			info->port[p].start = start;
> > +			info->port[p].size = len;
> > +			info->port[p].porttype = UIO_PORT_X86;
> > +			++p;
> > +		}
> > +	}
> > +
> > +	return 0;
> > + fail:
> > +	for (i = 0; i < m; i++)
> > +		iounmap(info->mem[i].internal_addr);
> > +	return err;
> > +}
> > +
> > +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > +{
> > +	struct uio_msi_pci_dev *udev;
> > +	int i, err, vectors;
> > +
> > +	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
> > +	if (!udev)
> > +		return -ENOMEM;
> > +
> > +	err = pci_enable_device(pdev);
> > +	if (err != 0) {
> > +		dev_err(&pdev->dev, "cannot enable PCI device\n");
> > +		goto fail_free;
> > +	}
> > +
> > +	err = pci_request_regions(pdev, "uio_msi");
> > +	if (err != 0) {
> > +		dev_err(&pdev->dev, "Cannot request regions\n");
> > +		goto fail_disable;
> > +	}
> > +
> > +	pci_set_master(pdev);
> > +
> > +	/* remap resources */
> > +	err = setup_maps(pdev, &udev->info);
> > +	if (err)
> > +		goto fail_release_iomem;
> > +
> > +	/* fill uio infos */
> > +	udev->info.name = "uio_msi";
> > +	udev->info.version = DRIVER_VERSION;
> > +	udev->info.priv = udev;
> > +	udev->pdev = pdev;
> > +	udev->info.ioctl = uio_msi_ioctl;
> > +	udev->info.open = uio_msi_open;
> > +	udev->info.release = uio_msi_release;
> > +	udev->info.irq = UIO_IRQ_CUSTOM;
> > +	mutex_init(&udev->mutex);
> > +
> > +	vectors = pci_msix_vec_count(pdev);
> > +	if (vectors > 0) {
> > +		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
> > +		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
> > +			udev->max_vectors);
> > +
> > +		err = -ENOMEM;
> > +		udev->msix = kcalloc(udev->max_vectors,
> > +				     sizeof(struct msix_entry), GFP_KERNEL);
> > +		if (!udev->msix)
> > +			goto fail_release_iomem;
> > +	} else if (!pci_intx_mask_supported(pdev)) {
> > +		dev_err(&pdev->dev,
> > +			"device does not support MSI-X or INTX\n");
> > +		err = -EINVAL;
> > +		goto fail_release_iomem;
> > +	} else {
> > +		dev_notice(&pdev->dev, "using INTX\n");
> > +		udev->info.irq_flags = IRQF_SHARED;
> > +		udev->max_vectors = 1;
> > +	}
> > +
> > +	udev->ctx = kcalloc(udev->max_vectors,
> > +			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
> > +	if (!udev->ctx)
> > +		goto fail_free_msix;
> > +
> > +	for (i = 0; i < udev->max_vectors; i++) {
> > +		udev->msix[i].entry = i;
> > +
> > +		udev->ctx[i].name = kasprintf(GFP_KERNEL,
> > +					      KBUILD_MODNAME "[%d](%s)",
> > +					      i, pci_name(pdev));
> > +		if (!udev->ctx[i].name)
> > +			goto fail_free_ctx;
> > +	}
> > +
> > +	/* register uio driver */
> > +	err = uio_register_device(&pdev->dev, &udev->info);
> > +	if (err != 0)
> > +		goto fail_free_ctx;
> > +
> > +	pci_set_drvdata(pdev, udev);
> > +	return 0;
> > +
> > +fail_free_ctx:
> > +	for (i = 0; i < udev->max_vectors; i++)
> > +		kfree(udev->ctx[i].name);
> > +	kfree(udev->ctx);
> > +fail_free_msix:
> > +	kfree(udev->msix);
> > +fail_release_iomem:
> > +	release_iomaps(udev->info.mem);
> > +	pci_release_regions(pdev);
> > +fail_disable:
> > +	pci_disable_device(pdev);
> > +fail_free:
> > +	kfree(udev);
> > +
> > +	pr_notice("%s ret %d\n", __func__, err);
> > +	return err;
> > +}
> > +
> > +static void uio_msi_remove(struct pci_dev *pdev)
> > +{
> > +	struct uio_info *info = pci_get_drvdata(pdev);
> > +	struct uio_msi_pci_dev *udev
> > +		= container_of(info, struct uio_msi_pci_dev, info);
> > +	int i;
> > +
> > +	uio_unregister_device(info);
> > +	release_iomaps(info->mem);
> > +
> > +	pci_release_regions(pdev);
> > +	for (i = 0; i < udev->max_vectors; i++)
> > +		kfree(udev->ctx[i].name);
> > +	kfree(udev->ctx);
> > +	kfree(udev->msix);
> > +	pci_disable_device(pdev);
> > +
> > +	pci_set_drvdata(pdev, NULL);
> > +	kfree(udev);
> > +}
> > +
> > +static struct pci_driver uio_msi_pci_driver = {
> > +	.name = "uio_msi",
> > +	.probe = uio_msi_probe,
> > +	.remove = uio_msi_remove,
> > +};
> > +
> > +module_pci_driver(uio_msi_pci_driver);
> > +MODULE_VERSION(DRIVER_VERSION);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
> > +MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
> > diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> > index f7b2db4..d9497691 100644
> > --- a/include/uapi/linux/Kbuild
> > +++ b/include/uapi/linux/Kbuild
> > @@ -411,6 +411,7 @@ header-y += udp.h
> >  header-y += uhid.h
> >  header-y += uinput.h
> >  header-y += uio.h
> > +header-y += uio_msi.h
> >  header-y += ultrasound.h
> >  header-y += un.h
> >  header-y += unistd.h
> > diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
> > new file mode 100644
> > index 0000000..297de00
> > --- /dev/null
> > +++ b/include/uapi/linux/uio_msi.h
> > @@ -0,0 +1,22 @@
> > +/*
> > + * UIO_MSI API definition
> > + *
> > + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> > + * All rights reserved.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +#ifndef _UIO_PCI_MSI_H
> > +#define _UIO_PCI_MSI_H
> > +
> > +struct uio_msi_irq_set {
> > +	u32 vec;
> > +	int fd;
> > +};
> > +
> > +#define UIO_MSI_BASE	0x86
> > +#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
> > +
> > +#endif
> > -- 
> > 2.1.4
> > 
  
Stephen Hemminger Oct. 1, 2015, 5:26 p.m. UTC | #7
On Thu, 1 Oct 2015 19:31:08 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > This driver allows using PCI device with Message Signalled Interrupt
> > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > file descriptors similar to VFIO.
> > >
> > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > have to work in environments where IOMMU support (real or emulated) is
> > > not available.  All UIO drivers that support DMA are not secure against
> > > rogue userspace applications programming DMA hardware to access
> > > private memory; this driver is no less secure than existing code.
> > > 
> > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > 
> > I don't think copying the igb_uio interface is a good idea.
> > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > is abusing the sysfs BAR access to provide unlimited
> > access to hardware.
> > 
> > MSI messages are memory writes so any generic device capable
> > of MSI is capable of corrupting kernel memory.
> > This means that a bug in userspace will lead to kernel memory corruption
> > and crashes.  This is something distributions can't support.
> > 
> > uio_pci_generic is already abused like that, mostly
> > because when I wrote it, I didn't add enough protections
> > against using it with DMA capable devices,
> > and we can't go back and break working userspace.
> > But at least it does not bind to VFs which all of
> > them are capable of DMA.
> > 
> > The result of merging this driver will be userspace abusing the
> > sysfs BAR access with VFs as well, and we do not want that.
> > 
> > 
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
> > 
> > On a more positive note:
> > 
> > What would be a reasonable interface? One that does the following
> > in kernel:
> > 
> > 1. initializes device rings (can be in pinned userspace memory,
> >    but can not be writeable by userspace), brings up interface link
> > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > 3. gets request, make sure it's valid and belongs to
> >    the correct task, put it in the ring
> > 4. in the reverse direction, notify userspace when buffers
> >    are available in the ring
> > 5. notify userspace about MSI (what this driver does)
> > 
> > What userspace can be allowed to do:
> > 
> > 	format requests (e.g. transmit, receive) in userspace
> > 	read ring contents
> > 
> > What userspace can't be allowed to do:
> > 
> > 	access BAR
> > 	write rings
> > 
> > 
> > This means that the driver can not be a generic one,
> > and there will be a system call overhead when you
> > write the ring, but that's the price you have to
> > pay for ability to run on systems without an IOMMU.
> > 
> 
> 
> The device specific parts can be taken from John Fastabend's patches
> BTW:
> 
> https://patchwork.ozlabs.org/patch/396713/
> 
> IIUC what was missing there was exactly the memory protection
> we are looking for here.

The bifurcated drivers are interesting from an architecture
point of view, but do nothing to solve the immediate use case.
The problem is not in bare metal environments; most of those already have an IOMMU.
The issues are in environments like VMware with SR-IOV or vmxnet3,
and neither of those is really helped by a bifurcated driver or VFIO.
  
Michael S. Tsirkin Oct. 1, 2015, 6:25 p.m. UTC | #8
On Thu, Oct 01, 2015 at 10:26:19AM -0700, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 19:31:08 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > > This driver allows using PCI device with Message Signalled Interrupt
> > > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > > file descriptors similar to VFIO.
> > > >
> > > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > > have to work in environments where IOMMU support (real or emulated) is
> > > > not available.  All UIO drivers that support DMA are not secure against
> > > > rogue userspace applications programming DMA hardware to access
> > > > private memory; this driver is no less secure than existing code.
> > > > 
> > > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > > 
> > > I don't think copying the igb_uio interface is a good idea.
> > > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > > is abusing the sysfs BAR access to provide unlimited
> > > access to hardware.
> > > 
> > > MSI messages are memory writes so any generic device capable
> > > of MSI is capable of corrupting kernel memory.
> > > This means that a bug in userspace will lead to kernel memory corruption
> > > and crashes.  This is something distributions can't support.
> > > 
> > > uio_pci_generic is already abused like that, mostly
> > > because when I wrote it, I didn't add enough protections
> > > against using it with DMA capable devices,
> > > and we can't go back and break working userspace.
> > > But at least it does not bind to VFs which all of
> > > them are capable of DMA.
> > > 
> > > The result of merging this driver will be userspace abusing the
> > > sysfs BAR access with VFs as well, and we do not want that.
> > > 
> > > 
> > > Just forwarding events is not enough to make a valid driver.
> > > What is missing is a way to access the device in a safe way.
> > > 
> > > On a more positive note:
> > > 
> > > What would be a reasonable interface? One that does the following
> > > in kernel:
> > > 
> > > 1. initializes device rings (can be in pinned userspace memory,
> > >    but can not be writeable by userspace), brings up interface link
> > > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > > 3. gets request, make sure it's valid and belongs to
> > >    the correct task, put it in the ring
> > > 4. in the reverse direction, notify userspace when buffers
> > >    are available in the ring
> > > 5. notify userspace about MSI (what this driver does)
> > > 
> > > What userspace can be allowed to do:
> > > 
> > > 	format requests (e.g. transmit, receive) in userspace
> > > 	read ring contents
> > > 
> > > What userspace can't be allowed to do:
> > > 
> > > 	access BAR
> > > 	write rings
> > > 
> > > 
> > > This means that the driver can not be a generic one,
> > > and there will be a system call overhead when you
> > > write the ring, but that's the price you have to
> > > pay for ability to run on systems without an IOMMU.
> > > 
> > 
> > 
> > The device specific parts can be taken from John Fastabend's patches
> > BTW:
> > 
> > https://patchwork.ozlabs.org/patch/396713/
> > 
> > IIUC what was missing there was exactly the memory protection
> > we are looking for here.
> 
> The bifurcated drivers are interesting from an architecture
> point of view, but do nothing to solve the immediate use case.
> The problem is not in bare metal environments; most of those already have an IOMMU.
> The issues are in environments like VMware with SR-IOV or vmxnet3,
> and neither of those is really helped by a bifurcated driver or VFIO.

Two points I tried to make (and apparently failed, so I'm trying again,
more verbosely):
- bifurcated drivers do DMA into both kernel and userspace
  memory from the same PCI address (bus/dev/fn). As the IOMMU uses this
  source address to validate accesses, there is no way to
  have the IOMMU prevent userspace from accessing kernel memory.
  If you are prepared to use dynamic mappings for kernel
  memory, it might be possible to limit the harm that userspace
  can do, but this will slow down kernel networking
  (changing IOMMU mappings is expensive) and userspace will
  likely be able to at least disrupt kernel networking.
  So what I am discussing might still have value there.

- bifurcated drivers have code to bring up the link and map rings into
  userspace (they also map other rings into the kernel, and tweak the rx filter
  in hardware, which might not be necessary for this use case).
  What I proposed above can use that code, with
  the twist that the RX ring is made read-only for userspace, and a system
  call that safely copies requests from a userspace ring into it is provided.
  In other words, here's the device-specific part in
  the kernel that you wanted to use, which will only
  need some tweaks.
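
To make that concrete, here is a rough sketch of what such a "write the
ring through the kernel" call could look like. All the names here
(struct xring, the xring_* helpers) are made up for the example; nothing
is taken from an existing driver:

struct xring_desc {
	__u64 addr;	/* DMA address, must lie inside a pinned region */
	__u32 len;
	__u32 flags;
};

struct xring {
	spinlock_t lock;
	struct xring_desc *tx;	/* kernel-owned ring, mapped read-only to userspace */
	u32 tail, mask;
};

static long xring_post_tx(struct xring *ring, const void __user *uptr)
{
	struct xring_desc d;

	if (copy_from_user(&d, uptr, sizeof(d)))
		return -EFAULT;

	/* reject anything outside the memory pinned for this task */
	if (!xring_range_pinned(ring, d.addr, d.len))
		return -EINVAL;

	spin_lock(&ring->lock);
	if (xring_tx_full(ring)) {
		spin_unlock(&ring->lock);
		return -ENOSPC;
	}
	ring->tx[ring->tail & ring->mask] = d;	/* userspace can read, not write */
	ring->tail++;
	spin_unlock(&ring->lock);

	xring_notify_hw(ring);	/* device-specific doorbell write */
	return 0;
}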
  
Alexander Duyck Oct. 1, 2015, 11:40 p.m. UTC | #9
On 09/30/2015 03:28 PM, Stephen Hemminger wrote:
> This driver allows using PCI device with Message Signalled Interrupt
> from userspace. The API is similar to the igb_uio driver used by the DPDK.
> Via ioctl it provides a mechanism to map MSI-X interrupts into event
> file descriptors similar to VFIO.
>
> VFIO is a better choice if IOMMU is available, but often userspace drivers
> have to work in environments where IOMMU support (real or emulated) is
> not available.  All UIO drivers that support DMA are not secure against
> rogue userspace applications programming DMA hardware to access
> private memory; this driver is no less secure than existing code.
>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>   drivers/uio/Kconfig          |   9 ++
>   drivers/uio/Makefile         |   1 +
>   drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
>   include/uapi/linux/Kbuild    |   1 +
>   include/uapi/linux/uio_msi.h |  22 +++
>   5 files changed, 411 insertions(+)
>   create mode 100644 drivers/uio/uio_msi.c
>   create mode 100644 include/uapi/linux/uio_msi.h
>
> diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> index 52c98ce..04adfa0 100644
> --- a/drivers/uio/Kconfig
> +++ b/drivers/uio/Kconfig
> @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
>   	  primarily, for virtualization scenarios.
>   	  If you compile this as a module, it will be called uio_pci_generic.
>
> +config UIO_PCI_MSI
> +       tristate "Generic driver supporting MSI-x on PCI Express cards"
> +	depends on PCI
> +	help
> +	  Generic driver that provides Message Signalled IRQ events
> +	  similar to VFIO. If IOMMMU is available please use VFIO
> +	  instead since it provides more security.
> +	  If you compile this as a module, it will be called uio_msi.
> +
>   config UIO_NETX
>   	tristate "Hilscher NetX Card driver"
>   	depends on PCI

Should you maybe instead depend on CONFIG_PCI_MSI?  Without MSI this is 
essentially just uio_pci_generic with a bit more greedy mapping setup.

> diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
> index 8560dad..62fc44b 100644
> --- a/drivers/uio/Makefile
> +++ b/drivers/uio/Makefile
> @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX)	+= uio_netx.o
>   obj-$(CONFIG_UIO_PRUSS)         += uio_pruss.o
>   obj-$(CONFIG_UIO_MF624)         += uio_mf624.o
>   obj-$(CONFIG_UIO_FSL_ELBC_GPCM)	+= uio_fsl_elbc_gpcm.o
> +obj-$(CONFIG_UIO_PCI_MSI)	+= uio_msi.o
> diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
> new file mode 100644
> index 0000000..802b5c4
> --- /dev/null
> +++ b/drivers/uio/uio_msi.c
> @@ -0,0 +1,378 @@
> +/*-
> + *
> + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> + * Author: Stephen Hemminger <stephen@networkplumber.org>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 only.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +#include <linux/eventfd.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/uio_driver.h>
> +#include <linux/msi.h>
> +#include <linux/uio_msi.h>
> +
> +#define DRIVER_VERSION		"0.1.1"
> +#define MAX_MSIX_VECTORS	64
> +
> +/* MSI-X vector information */
> +struct uio_msi_pci_dev {
> +	struct uio_info info;		/* UIO driver info */
> +	struct pci_dev *pdev;		/* PCI device */
> +	struct mutex	mutex;		/* open/release/ioctl mutex */
> +	int		ref_cnt;	/* references to device */
> +	unsigned int	max_vectors;	/* MSI-X slots available */
> +	struct msix_entry *msix;	/* MSI-X vector table */
> +	struct uio_msi_irq_ctx {
> +		struct eventfd_ctx *trigger; /* vector to eventfd */
> +		char *name;		/* name in /proc/interrupts */
> +	} *ctx;
> +};
> +

I would move the definition of uio_msi_irq_ctx out of uio_msi_pci_dev. 
It would help to make this a bit more readable.
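
Something along these lines, I mean (the fields are unchanged, just split
out):

struct uio_msi_irq_ctx {
	struct eventfd_ctx *trigger;	/* vector to eventfd */
	char *name;			/* name in /proc/interrupts */
};

struct uio_msi_pci_dev {
	struct uio_info info;		/* UIO driver info */
	struct pci_dev *pdev;		/* PCI device */
	struct mutex	mutex;		/* open/release/ioctl mutex */
	int		ref_cnt;	/* references to device */
	unsigned int	max_vectors;	/* MSI-X slots available */
	struct msix_entry *msix;	/* MSI-X vector table */
	struct uio_msi_irq_ctx *ctx;	/* per-vector eventfd context */
};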

> +static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
> +{
> +	struct uio_msi_pci_dev *udev = arg;
> +
> +	if (pci_check_and_mask_intx(udev->pdev)) {
> +		eventfd_signal(udev->ctx->trigger, 1);
> +		return IRQ_HANDLED;
> +	}
> +
> +	return IRQ_NONE;
> +}
> +

I would really prefer to see the INTx handling dropped since there are 
already 2 different UIO drivers set up for handling INTx-style 
interrupts.  Let's focus on the parts from the last decade and drop 
support for INTx now in favor of MSI-X and maybe MSI.  If we _REALLY_ 
need it we can always come back later and add it.

> +static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
> +{
> +	struct eventfd_ctx *trigger = arg;
> +
> +	eventfd_signal(trigger, 1);
> +	return IRQ_HANDLED;
> +}
> +
> +/* set the mapping between vector # and existing eventfd. */
> +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
> +{
> +	struct eventfd_ctx *trigger;
> +	int irq, err;
> +
> +	if (vec >= udev->max_vectors) {
> +		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
> +			   vec, udev->max_vectors);
> +		return -ERANGE;
> +	}
> +
> +	irq = udev->msix[vec].vector;
> +	trigger = udev->ctx[vec].trigger;
> +	if (trigger) {
> +		/* Clearup existing irq mapping */

Minor spelling issue here, "Clear up" should be 2 words, "Cleanup" can 
be one.

> +		free_irq(irq, trigger);
> +		eventfd_ctx_put(trigger);
> +		udev->ctx[vec].trigger = NULL;
> +	}
> +
> +	/* Passing -1 is used to disable interrupt */
> +	if (fd < 0)
> +		return 0;
> +
> +	trigger = eventfd_ctx_fdget(fd);
> +	if (IS_ERR(trigger)) {
> +		err = PTR_ERR(trigger);
> +		dev_notice(&udev->pdev->dev,
> +			   "eventfd ctx get failed: %d\n", err);
> +		return err;
> +	}
> +
> +	if (udev->msix)
> +		err = request_irq(irq, uio_msi_irqhandler, 0,
> +				  udev->ctx[vec].name, trigger);
> +	else
> +		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
> +				  udev->ctx[vec].name, udev);
> +
> +	if (err) {
> +		dev_notice(&udev->pdev->dev,
> +			   "request irq failed: %d\n", err);
> +		eventfd_ctx_put(trigger);
> +		return err;
> +	}
> +
> +	udev->ctx[vec].trigger = trigger;
> +	return 0;
> +}
> +
> +static int
> +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	struct uio_msi_irq_set hdr;
> +	int err;
> +
> +	switch (cmd) {
> +	case UIO_MSI_IRQ_SET:
> +		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
> +			return -EFAULT;
> +
> +		mutex_lock(&udev->mutex);
> +		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
> +		mutex_unlock(&udev->mutex);
> +		break;
> +	default:
> +		err = -EOPNOTSUPP;
> +	}
> +	return err;
> +}
> +
> +/* Opening the UIO device for first time enables MSI-X */
> +static int
> +uio_msi_open(struct uio_info *info, struct inode *inode)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int err = 0;
> +
> +	mutex_lock(&udev->mutex);
> +	if (udev->ref_cnt++ == 0) {
> +		if (udev->msix)
> +			err = pci_enable_msix(udev->pdev, udev->msix,
> +					      udev->max_vectors);
> +	}
> +	mutex_unlock(&udev->mutex);
> +
> +	return err;
> +}
> +

I agree with some other reviewers.  Why call pci_enable_msix in open? 
It seems like it would make much more sense to do this on probe, and 
then disable MSI-X on free.  I can only assume you are trying to do it 
to save on resources but the fact is this is a driver you have to 
explicitly force onto a device so you would probably be safe to assume 
that they plan to use it in the near future.
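
Roughly what I have in mind, as an untested sketch reusing the names from
this patch (partial MSI-X allocation is treated as failure to keep it
short):

	/* in uio_msi_probe(), after udev->msix has been allocated */
	for (i = 0; i < udev->max_vectors; i++)
		udev->msix[i].entry = i;

	err = pci_enable_msix(pdev, udev->msix, udev->max_vectors);
	if (err) {
		dev_notice(&pdev->dev, "MSI-X unavailable, using INTx\n");
		kfree(udev->msix);
		udev->msix = NULL;
		udev->max_vectors = 1;
		udev->info.irq_flags = IRQF_SHARED;
	}

	/* pci_disable_msix() would then move out of release(), e.g. into remove() */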

> +/* Last close of the UIO device releases/disables all IRQ's */
> +static int
> +uio_msi_release(struct uio_info *info, struct inode *inode)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int i;
> +
> +	mutex_lock(&udev->mutex);
> +	if (--udev->ref_cnt == 0) {
> +		for (i = 0; i < udev->max_vectors; i++) {
> +			int irq = udev->msix[i].vector;
> +			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
> +
> +			if (!trigger)
> +				continue;
> +
> +			free_irq(irq, trigger);
> +			eventfd_ctx_put(trigger);
> +			udev->ctx[i].trigger = NULL;
> +		}
> +
> +		if (udev->msix)
> +			pci_disable_msix(udev->pdev);
> +	}
> +	mutex_unlock(&udev->mutex);
> +
> +	return 0;
> +}
> +
> +/* Unmap previously ioremap'd resources */
> +static void
> +release_iomaps(struct uio_mem *mem)
> +{
> +	int i;
> +
> +	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
> +		if (mem->internal_addr)
> +			iounmap(mem->internal_addr);
> +	}
> +}
> +
> +static int
> +setup_maps(struct pci_dev *pdev, struct uio_info *info)
> +{
> +	int i, m = 0, p = 0, err;
> +	static const char * const bar_names[] = {
> +		"BAR0",	"BAR1",	"BAR2",	"BAR3",	"BAR4",	"BAR5",
> +	};
> +
> +	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
> +		unsigned long start = pci_resource_start(pdev, i);
> +		unsigned long flags = pci_resource_flags(pdev, i);
> +		unsigned long len = pci_resource_len(pdev, i);
> +
> +		if (start == 0 || len == 0)
> +			continue;
> +
> +		if (flags & IORESOURCE_MEM) {
> +			void __iomem *addr;
> +
> +			if (m >= MAX_UIO_MAPS)
> +				continue;
> +
> +			addr = ioremap(start, len);
> +			if (addr == NULL) {
> +				err = -EINVAL;
> +				goto fail;
> +			}
> +
> +			info->mem[m].name = bar_names[i];
> +			info->mem[m].addr = start;
> +			info->mem[m].internal_addr = addr;
> +			info->mem[m].size = len;
> +			info->mem[m].memtype = UIO_MEM_PHYS;
> +			++m;
> +		} else if (flags & IORESOURCE_IO) {
> +			if (p >= MAX_UIO_PORT_REGIONS)
> +				continue;
> +
> +			info->port[p].name = bar_names[i];
> +			info->port[p].start = start;
> +			info->port[p].size = len;
> +			info->port[p].porttype = UIO_PORT_X86;
> +			++p;
> +		}
> +	}
> +
> +	return 0;
> + fail:
> +	for (i = 0; i < m; i++)
> +		iounmap(info->mem[i].internal_addr);
> +	return err;
> +}
> +

Do you really need to map IORESOURCE bars?  Most drivers I can think of 
don't use IO BARs anymore.  Maybe we could look at just dropping the 
code and adding it back later if we have a use case that absolutely 
needs it.

Also how many devices actually need resources beyond BAR 0?  I'm just 
curious as I know BAR 2 on many of the Intel devices is the register 
space related to MSI-X so now we have both the PCIe subsystem and user 
space with access to this region.

> +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> +{
> +	struct uio_msi_pci_dev *udev;
> +	int i, err, vectors;
> +
> +	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
> +	if (!udev)
> +		return -ENOMEM;
> +
> +	err = pci_enable_device(pdev);
> +	if (err != 0) {
> +		dev_err(&pdev->dev, "cannot enable PCI device\n");
> +		goto fail_free;
> +	}
> +
> +	err = pci_request_regions(pdev, "uio_msi");
> +	if (err != 0) {
> +		dev_err(&pdev->dev, "Cannot request regions\n");
> +		goto fail_disable;
> +	}
> +
> +	pci_set_master(pdev);
> +
> +	/* remap resources */
> +	err = setup_maps(pdev, &udev->info);
> +	if (err)
> +		goto fail_release_iomem;
> +
> +	/* fill uio infos */
> +	udev->info.name = "uio_msi";
> +	udev->info.version = DRIVER_VERSION;
> +	udev->info.priv = udev;
> +	udev->pdev = pdev;
> +	udev->info.ioctl = uio_msi_ioctl;
> +	udev->info.open = uio_msi_open;
> +	udev->info.release = uio_msi_release;
> +	udev->info.irq = UIO_IRQ_CUSTOM;
> +	mutex_init(&udev->mutex);
> +
> +	vectors = pci_msix_vec_count(pdev);
> +	if (vectors > 0) {
> +		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
> +		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
> +			udev->max_vectors);
> +
> +		err = -ENOMEM;
> +		udev->msix = kcalloc(udev->max_vectors,
> +				     sizeof(struct msix_entry), GFP_KERNEL);
> +		if (!udev->msix)
> +			goto fail_release_iomem;
> +	} else if (!pci_intx_mask_supported(pdev)) {
> +		dev_err(&pdev->dev,
> +			"device does not support MSI-X or INTX\n");
> +		err = -EINVAL;
> +		goto fail_release_iomem;
> +	} else {
> +		dev_notice(&pdev->dev, "using INTX\n");
> +		udev->info.irq_flags = IRQF_SHARED;
> +		udev->max_vectors = 1;
> +	}
> +
> +	udev->ctx = kcalloc(udev->max_vectors,
> +			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
> +	if (!udev->ctx)
> +		goto fail_free_msix;
> +
> +	for (i = 0; i < udev->max_vectors; i++) {
> +		udev->msix[i].entry = i;
> +
> +		udev->ctx[i].name = kasprintf(GFP_KERNEL,
> +					      KBUILD_MODNAME "[%d](%s)",
> +					      i, pci_name(pdev));
> +		if (!udev->ctx[i].name)
> +			goto fail_free_ctx;
> +	}
> +
> +	/* register uio driver */
> +	err = uio_register_device(&pdev->dev, &udev->info);
> +	if (err != 0)
> +		goto fail_free_ctx;
> +
> +	pci_set_drvdata(pdev, udev);
> +	return 0;
> +
> +fail_free_ctx:
> +	for (i = 0; i < udev->max_vectors; i++)
> +		kfree(udev->ctx[i].name);
> +	kfree(udev->ctx);
> +fail_free_msix:
> +	kfree(udev->msix);
> +fail_release_iomem:
> +	release_iomaps(udev->info.mem);
> +	pci_release_regions(pdev);
> +fail_disable:
> +	pci_disable_device(pdev);
> +fail_free:
> +	kfree(udev);
> +
> +	pr_notice("%s ret %d\n", __func__, err);
> +	return err;
> +}
> +
> +static void uio_msi_remove(struct pci_dev *pdev)
> +{
> +	struct uio_info *info = pci_get_drvdata(pdev);
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int i;
> +
> +	uio_unregister_device(info);
> +	release_iomaps(info->mem);
> +
> +	pci_release_regions(pdev);
> +	for (i = 0; i < udev->max_vectors; i++)
> +		kfree(udev->ctx[i].name);
> +	kfree(udev->ctx);
> +	kfree(udev->msix);
> +	pci_disable_device(pdev);
> +
> +	pci_set_drvdata(pdev, NULL);
> +	kfree(udev);
> +}
> +
> +static struct pci_driver uio_msi_pci_driver = {
> +	.name = "uio_msi",
> +	.probe = uio_msi_probe,
> +	.remove = uio_msi_remove,
> +};
> +
> +module_pci_driver(uio_msi_pci_driver);
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
> +MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index f7b2db4..d9497691 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -411,6 +411,7 @@ header-y += udp.h
>   header-y += uhid.h
>   header-y += uinput.h
>   header-y += uio.h
> +header-y += uio_msi.h
>   header-y += ultrasound.h
>   header-y += un.h
>   header-y += unistd.h
> diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
> new file mode 100644
> index 0000000..297de00
> --- /dev/null
> +++ b/include/uapi/linux/uio_msi.h
> @@ -0,0 +1,22 @@
> +/*
> + * UIO_MSI API definition
> + *
> + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +#ifndef _UIO_PCI_MSI_H
> +#define _UIO_PCI_MSI_H
> +
> +struct uio_msi_irq_set {
> +	u32 vec;
> +	int fd;
> +};
> +
> +#define UIO_MSI_BASE	0x86
> +#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
> +
> +#endif
>
  
Stephen Hemminger Oct. 2, 2015, 12:01 a.m. UTC | #10
On Thu, 1 Oct 2015 16:40:10 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> I agree with some other reviewers.  Why call pci_enable_msix in open? 
> It seems like it would make much more sense to do this on probe, and 
> then disable MSI-X on free.  I can only assume you are trying to do it 
> to save on resources but the fact is this is a driver you have to 
> explicitly force onto a device so you would probably be safe to assume 
> that they plan to use it in the near future.

Because if the interface is not up, the MSI handle doesn't have to be open.
This saves resources and avoids some races.
  
Stephen Hemminger Oct. 2, 2015, 12:04 a.m. UTC | #11
On Thu, 1 Oct 2015 16:40:10 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> Do you really need to map IORESOURCE bars?  Most drivers I can think of 
> don't use IO BARs anymore.  Maybe we could look at just dropping the 
> code and adding it back later if we have a use case that absolutely 
> needs it.
Mapping is not strictly necessary, but for virtio it acts as a way to communicate
the regions.

> Also how many devices actually need resources beyond BAR 0?  I'm just 
> curious as I know BAR 2 on many of the Intel devices is the register 
> space related to MSI-X so now we have both the PCIe subsystem and user 
> space with access to this region.

VMXNet3 needs 2 bars. Most use only one.
  
Alexander Duyck Oct. 2, 2015, 1:21 a.m. UTC | #12
On 10/01/2015 05:01 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 16:40:10 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> I agree with some other reviewers.  Why call pci_enable_msix in open?
>> It seems like it would make much more sense to do this on probe, and
>> then disable MSI-X on free.  I can only assume you are trying to do it
>> to save on resources but the fact is this is a driver you have to
>> explicitly force onto a device so you would probably be safe to assume
>> that they plan to use it in the near future.
> Because if the interface is not up, the MSI handle doesn't have to be open.
> This saves resources and avoids some races.

Yes, but it makes things a bit messier for the interrupts.  Most drivers 
take care of interrupts during probe so that if there are any allocation 
problems they can take care of them then, instead of exposing an interface 
that will later fail when it is brought up.  It ends up being a way 
to deal with the whole MSI-X fall-back issue.

- Alex
  
Alexander Duyck Oct. 2, 2015, 2:33 a.m. UTC | #13
On 10/01/2015 05:04 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 16:40:10 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> Do you really need to map IORESOURCE bars?  Most drivers I can think of
>> don't use IO BARs anymore.  Maybe we could look at just dropping the
>> code and adding it back later if we have a use case that absolutely
>> needs it.
> Mapping is not strictly necessary, but for virtio it acts as a way to communicate
> the regions.

I think I see what you are saying.  I was hoping we could get away from 
having to map any I/O ports but it looks like virtio is still using them 
for BAR 0, or at least that is what I am seeing on my VM with 
virtio_net.  I was really hoping we could get away from that since a 16-bit 
I/O port address space is far too restrictive anyway.

>> Also how many devices actually need resources beyond BAR 0?  I'm just
>> curious as I know BAR 2 on many of the Intel devices is the register
>> space related to MSI-X so now we have both the PCIe subsystem and user
>> space with access to this region.
> VMXNet3 needs 2 bars. Most use only one.

So essentially we need to make exceptions for the virtual interfaces.

I guess there isn't much we can do then and we probably need to map any 
and all base address registers we can find for the given device.  I was 
hoping for something a bit more surgical since we are opening a security 
hole of sorts, but I guess it can't be helped if we want to support 
multiple devices and they all have such radically different configurations.

- Alex
  
Michael S. Tsirkin Oct. 5, 2015, 9:54 p.m. UTC | #14
On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.

Thinking about it some more, maybe some devices don't do DMA, and merely
signal events with MSI/MSI-X.

The fact you mention igb_uio in the cover letter seems to hint that this
isn't the case, and that the real intent is to abuse it for DMA-capable
devices, but still ...

If we assume such a simple device, we need to block userspace from
tweaking at least the MSI control and the MSI-X table.
And changing BARs might make someone else corrupt the MSI-X
table, so we need to block it from changing BARs, too.

Things like device reset will clear the table.  I guess this means we
need to track access to reset, too, and make sure we restore the 
table to a sane config.

The PM capability can be used to reset things too, I think. Better be 
careful about that.

And a bunch of devices could be doing weird things that need
to be special-cased.

All of this is what VFIO is already dealing with.

Maybe extending VFIO for this usecase, or finding another way to share
code might be a better idea than duplicating the code within uio?
  
Vladislav Zolotarov Oct. 5, 2015, 10:09 p.m. UTC | #15
On Oct 6, 2015 12:55 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
>
> Thinking about it some more, maybe some devices don't do DMA, and merely
> signal events with MSI/MSI-X.
>
> The fact you mention igb_uio in the cover letter seems to hint that this
> isn't the case, and that the real intent is to abuse it for DMA-capable
> devices, but still ...
>
> If we assume such a simple device, we need to block userspace from
> tweaking at least the MSI control and the MSI-X table.
> And changing BARs might make someone else corrupt the MSI-X
> table, so we need to block it from changing BARs, too.
>
> Things like device reset will clear the table.  I guess this means we
> need to track access to reset, too, make sure we restore the
> table to a sane config.
>
> PM  capability can be used to reset things tooI think. Better be
> careful about that.
>
> And a bunch of devices could be doing weird things that need
> to be special-cased.
>
> All of this is what VFIO is already dealing with.
>
> Maybe extending VFIO for this usecase, or finding another way to share
> code might be a better idea than duplicating the code within uio?

How about, instead of trying to reinvent the wheel, just attacking the
problem directly, as I've proposed a few times in the last few days:
rather than limiting UIO itself, limit the users that are allowed to use
UIO to privileged users only (e.g. root). This would solve all the clearly
unresolvable issues you are raising here all at once, wouldn't it?
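
Even something as small as this in the UIO open path would do it (a
sketch; whether CAP_SYS_RAWIO is the right capability to check is a
separate question):

static int uio_msi_open(struct uio_info *info, struct inode *inode)
{
	/* refuse to hand the device out to unprivileged users */
	if (!capable(CAP_SYS_RAWIO))
		return -EPERM;

	/* ... existing ref_cnt / pci_enable_msix() logic ... */
	return 0;
}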

>
> --
> MST
  
Michael S. Tsirkin Oct. 5, 2015, 10:49 p.m. UTC | #16
On Tue, Oct 06, 2015 at 01:09:55AM +0300, Vladislav Zolotarov wrote:
> How about, instead of trying to reinvent the wheel, just attacking the
> problem directly, as I've proposed a few times in the last few days:
> rather than limiting UIO itself, limit the users that are allowed to use
> UIO to privileged users only (e.g. root). This would solve all the clearly
> unresolvable issues you are raising here all at once, wouldn't it?

No - root or no root, if the user can modify the addresses in the MSI-X
table and make the chip corrupt random memory, this is IMHO a non-starter.

And tainting kernel is not a solution - your patch adds a pile of
code that either goes completely unused or taints the kernel.
Not just that - it's a dedicated userspace API that either
goes completely unused or taints the kernel.

> >
> > --
> > MST
>
  
Stephen Hemminger Oct. 6, 2015, 7:33 a.m. UTC | #17
Other than implementation objections, so far the two main arguments
against this reduce to:
  1. If you allow UIO ioctl then it opens an API hook for all the crap out
     of tree UIO drivers to do what they want.
  2. If you allow UIO MSI-X then you are expanding the usage of userspace
     device access in an insecure manner.

Another alternative which I explored was making a version of VFIO that
works without an IOMMU. It solves #1 but actually increases the likely negative
response to argument #2. This would keep the same API, and avoid having to
modify UIO. But we would still have the same (if not more) resistance
from IOMMU developers who believe all systems have to be secure against
root.
  
Vladislav Zolotarov Oct. 6, 2015, 8:23 a.m. UTC | #18
On 10/06/15 01:49, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 01:09:55AM +0300, Vladislav Zolotarov wrote:
>> How about, instead of trying to reinvent the wheel, just attacking the
>> problem directly, as I've proposed a few times in the last few days:
>> rather than limiting UIO itself, limit the users that are allowed to use
>> UIO to privileged users only (e.g. root). This would solve all the clearly
>> unresolvable issues you are raising here all at once, wouldn't it?
> No - root or no root, if the user can modify the addresses in the MSI-X
> table and make the chip corrupt random memory, this is IMHO a non-starter.

Michael, how is this or any other related patch related to the problem you
are describing? The above ability has been there for years, and if memory
serves me well it was you who wrote uio_pci_generic with this "security flaw".  ;)

This patch in general only adds the ability to receive notifications per 
MSI-X interrupt, and it has nothing to do with the ability to reprogram 
the MSI-X related registers from user space, which was always there.

>
> And tainting kernel is not a solution - your patch adds a pile of
> code that either goes completely unused or taints the kernel.
> Not just that - it's a dedicated userspace API that either
> goes completely unused or taints the kernel.
>
>>> --
>>> MST
  
Avi Kivity Oct. 6, 2015, 12:15 p.m. UTC | #19
On 10/06/2015 10:33 AM, Stephen Hemminger wrote:
> Other than implementation objections, so far the two main arguments
> against this reduce to:
>    1. If you allow UIO ioctl then it opens an API hook for all the crap out
>       of tree UIO drivers to do what they want.
>    2. If you allow UIO MSI-X then you are expanding the usage of userspace
>       device access in an insecure manner.
>
> Another alternative which I explored was making a version of VFIO that
> works without an IOMMU. It solves #1 but actually increases the likely negative
> response to argument #2. This would keep the same API, and avoid having to
> modify UIO. But we would still have the same (if not more) resistance
> from IOMMU developers who believe all systems have to be secure against
> root.

vfio's charter was explicitly aiming for modern setups with iommus.

This could be revisited, but I agree it will have even more resistance, 
justified IMO.

btw, (2) doesn't really add any insecurity.  The user could already poke 
at the msix tables (as well as perform DMA); they just couldn't get a 
useful interrupt out of them.

Maybe a module parameter "allow_insecure_dma" can be added to 
uio_pci_generic.  Without the parameter, bus mastering and MSI-X are 
disabled; with the parameter they are allowed.  This requires the sysadmin 
to take a positive step in order to make use of their hardware.
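
A sketch of what that could look like in uio_pci_generic (the parameter
does not exist today; the name and semantics are just the suggestion
above):

static bool allow_insecure_dma;
module_param(allow_insecure_dma, bool, 0444);
MODULE_PARM_DESC(allow_insecure_dma,
	"Enable bus mastering and MSI-X for userspace drivers (unsafe without an IOMMU)");

	/* in probe(): */
	if (allow_insecure_dma) {
		pci_set_master(pdev);
		/* ... MSI-X setup ... */
	} else {
		dev_info(&pdev->dev,
			 "bus mastering and MSI-X left disabled; set allow_insecure_dma to enable\n");
	}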
  
Michael S. Tsirkin Oct. 6, 2015, 1:42 p.m. UTC | #20
On Tue, Oct 06, 2015 at 08:33:56AM +0100, Stephen Hemminger wrote:
> Other than implementation objections, so far the two main arguments
> against this reduce to:
>   1. If you allow UIO ioctl then it opens an API hook for all the crap out
>      of tree UIO drivers to do what they want.
>   2. If you allow UIO MSI-X then you are expanding the usage of userspace
>      device access in an insecure manner.

That's not all. Without MSI one can detect insecure usage by detecting
userspace enabling bus mastering.  This can be detected simply using
lspci.  Or one can also imagine a configuration where this ability is
disabled, is logged, or taints the kernel.  This seems like something that 
might be worth having for some locked-down systems.

OTOH enabling MSI requires enabling bus mastering so suddenly we have no
idea whether device can be/is used in a safe way.
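
For reference, the bus-master state in question is just bit 2 of the
config-space command register, so it can also be checked programmatically
rather than with lspci. A minimal userspace sketch; the device address below
is only a placeholder:

/* Sketch: report whether bus mastering is enabled for a PCI device by
 * reading its command register (offset 0x04, Bus Master Enable = bit 2)
 * from the sysfs config file. Assumes a little-endian host. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char *cfg = "/sys/bus/pci/devices/0000:03:00.0/config";
        uint16_t cmd;
        int fd = open(cfg, O_RDONLY);

        if (fd < 0 || pread(fd, &cmd, sizeof(cmd), 0x04) != sizeof(cmd)) {
                perror(cfg);
                return 1;
        }
        printf("bus mastering %s\n", (cmd & 0x4) ? "enabled" : "disabled");
        close(fd);
        return 0;
}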

> 
> Another alternative which I explored was making a version of VFIO that
> works without IOMMU. It solves #1 but actually increases the likely negative
> response to argument #2.

No - because VFIO has limited protection against device misuse by
userspace, by limiting access to sub-ranges of device BARs and config
space.  For a device that doesn't do DMA, that will be enough to make it
secure to use.

That's a pretty weak excuse to support userspace drivers for PCI devices
without an IOMMU, but it's the best I heard so far.

Is that worth the security trade-off? I'm still not sure.

> This would keep the same API, and avoid having to
> modify UIO. But we would still have the same (if not more) resistance
> from IOMMU developers who believe all systems have to be secure against
> root.

"Secure against root" is a confusing way to put it IMHO. We are talking
about memory protection.

So that's not IOMMU developers IIUC. I believe most kernel developers will
agree it's not a good idea to let userspace corrupt kernel memory.
Otherwise, the driver can't be supported, and maintaining upstream
drivers that can't be supported serves no useful purpose.  Anyone can
load out of tree ones just as well.

VFIO already supports MSI, so VFIO developers have a lot of
experience with these issues. Getting their input would be valuable.
  
Michael S. Tsirkin Oct. 6, 2015, 1:58 p.m. UTC | #21
On Tue, Oct 06, 2015 at 11:23:11AM +0300, Vlad Zolotarov wrote:
> Michael, how is this or any other related patch related to the problem
> you are describing?
> The above ability has been there for years and, if memory serves me
> well, it was you who wrote uio_pci_generic with this "security flaw".  ;)

I answered all this already.

This patch enables bus mastering, enables MSI or MSI-X, and requires
userspace to map the MSI-X table and read/write the config space.
This means that a single userspace bug is enough to corrupt kernel
memory.

uio_pci_generic does not enable bus mastering or MSI, and
it might be a good idea to have uio_pci_generic block
access to MSI/MSI-X config.
  
Michael S. Tsirkin Oct. 6, 2015, 2:07 p.m. UTC | #22
On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote:
> btw, (2) doesn't really add any insecurity.  The user could already poke at
> the msix tables (as well as perform DMA); they just couldn't get a useful
> interrupt out of them.

Poking at msix tables won't cause memory corruption unless msix and bus
mastering is enabled.  It's true root can enable msix and bus mastering
through sysfs - but that's easy to block or detect. Even if you don't
buy a security story, it seems less likely to trigger as a result
of a userspace bug.
  
Vladislav Zolotarov Oct. 6, 2015, 2:49 p.m. UTC | #23
On 10/06/15 16:58, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 11:23:11AM +0300, Vlad Zolotarov wrote:
>> Michael, how is this or any other related patch related to the problem
>> you are describing?
>> The above ability has been there for years and, if memory serves me
>> well, it was you who wrote uio_pci_generic with this "security flaw".  ;)
> I answered all this already.
>
> This patch enables bus mastering, enables MSI or MSI-X

This can be done from user space right now, even without this patch...

> , and requires
> userspace to map the MSI-X table

Hmmm... I must have missed this requirement. Could you please clarify? 
From what I see, the MSI/MSI-X table is configured completely in the kernel 
here...

> and read/write the config space.
> This means that a single userspace bug is enough to corrupt kernel
> memory.

Could you please provide an example of this simple bug? Because it's 
absolutely not obvious...

>
> uio_pci_generic does not enable bus mastering or MSI, and
> it might be a good idea to have uio_pci_generic block
> access to MSI/MSI-X config.

Since device BARs may be mapped bypassing UIO/uio_pci_generic, this 
won't solve any issue.
  
Michael S. Tsirkin Oct. 6, 2015, 3 p.m. UTC | #24
On Tue, Oct 06, 2015 at 05:49:21PM +0300, Vlad Zolotarov wrote:
> >and read/write the config space.
> >This means that a single userspace bug is enough to corrupt kernel
> >memory.
> 
> Could you please provide an example of this simple bug? Because it's
> absolutely not obvious...

Stick a value that happens to match a kernel address into the Msg Addr
field of an unmasked MSI-X entry.
  
Avi Kivity Oct. 6, 2015, 3:41 p.m. UTC | #25
On 10/06/2015 05:07 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote:
>> btw, (2) doesn't really add any insecurity.  The user could already poke at
>> the msix tables (as well as perform DMA); they just couldn't get a useful
>> interrupt out of them.
> Poking at msix tables won't cause memory corruption unless msix and bus
> mastering is enabled.

It's a given that bus mastering is enabled.  It's true that msix is 
unlikely to be enabled, unless msix support is added.

>    It's true root can enable msix and bus mastering
> through sysfs - but that's easy to block or detect. Even if you don't
> buy a security story, it seems less likely to trigger as a result
> of a userspace bug.

If you're doing DMA, that's the least of your worries.

Still, zero-mapping the msix space seems reasonable, and can protect 
userspace from silly stuff.  It can't be considered to have anything to 
do with security though, as long as users can simply DMA to every bit of 
RAM in the system they want to.
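
To make the zero-mapping idea concrete: the MSI-X table's BAR index and
offset can be read from the MSI-X capability, so a driver could locate it and
leave that page out of (or zero-map it within) whatever it exposes to
userspace. A rough kernel-side sketch, not part of the posted patch:

#include <linux/pci.h>

/*
 * Sketch: locate the MSI-X table (BAR index, offset, vector count) so the
 * page(s) containing it could be excluded from - or zero-mapped in - the
 * BAR mapping handed to userspace. Each table entry is 16 bytes.
 */
static int find_msix_table(struct pci_dev *pdev, int *bar, u32 *offset, u32 *nvec)
{
        int pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
        u32 table;
        u16 control;

        if (!pos)
                return -ENODEV;

        pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &control);
        pci_read_config_dword(pdev, pos + PCI_MSIX_TABLE, &table);

        *bar = table & PCI_MSIX_TABLE_BIR;
        *offset = table & PCI_MSIX_TABLE_OFFSET;
        *nvec = (control & PCI_MSIX_FLAGS_QSIZE) + 1;
        return 0;
}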
  
Vladislav Zolotarov Oct. 6, 2015, 4:40 p.m. UTC | #26
On 10/06/15 18:00, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 05:49:21PM +0300, Vlad Zolotarov wrote:
>>> and read/write the config space.
>>> This means that a single userspace bug is enough to corrupt kernel
>>> memory.
>> Could you please provide an example of this simple bug? Because it's
>> absolutely not obvious...
> Stick a value that happens to match a kernel address into the Msg Addr
> field of an unmasked MSI-X entry.

This patch neither configures MSI-X entries in user space nor provides 
additional means to do so; therefore this "sticking" would be a matter of 
some extra code that is absolutely unrelated to this patch. So this 
example seems irrelevant to this particular discussion.

thanks,
vlad

>
  
Thomas Monjalon Oct. 16, 2015, 5:11 p.m. UTC | #27
To sum it up,
We want to remove the need for the out-of-tree module igb_uio.
3 possible implementations were discussed so far:
- new UIO driver
- extend uio_pci_generic
- VFIO without IOMMU

It is preferred to avoid creating yet another module to support.
That's why the uio_pci_generic extension would be nice.
In my understanding, there are currently 2 issues with the patches
from Vlad and Stephen:
- IRQ must be mapped to a fd without using a new ioctl
- MSI-X handling in userspace breaks the memory protection

I'm confident the first issue can be fixed with something like sysfs.
About the "security" concern, mainly expressed by MST, I think the idea
of Avi (below) deserves to be discussed.
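
On the first point, as a purely hypothetical illustration (the attribute name
and text format are invented here, not something that exists), the
set_irq_eventfd() helper from Stephen's patch below could just as easily be
driven from a sysfs store callback as from an ioctl:

/* Hypothetical sketch only: accept "vector eventfd-fd" pairs via sysfs
 * instead of the UIO_MSI_IRQ_SET ioctl. Assumes drvdata points at the
 * driver's uio_msi_pci_dev, as set up in Stephen's probe(). */
static ssize_t msi_irq_store(struct device *dev, struct device_attribute *attr,
                             const char *buf, size_t count)
{
        struct uio_msi_pci_dev *udev = dev_get_drvdata(dev);
        u32 vec;
        int fd, err;

        if (sscanf(buf, "%u %d", &vec, &fd) != 2)
                return -EINVAL;

        mutex_lock(&udev->mutex);
        err = set_irq_eventfd(udev, vec, fd);   /* from the patch below */
        mutex_unlock(&udev->mutex);

        return err ? err : count;
}
static DEVICE_ATTR_WO(msi_irq);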

2015-10-06 15:15, Avi Kivity:
> On 10/06/2015 10:33 AM, Stephen Hemminger wrote:
> > Other than implementation objections, so far the two main arguments
> > against this reduce to:
> >    1. If you allow UIO ioctl then it opens an API hook for all the crap out
> >       of tree UIO drivers to do what they want.
> >    2. If you allow UIO MSI-X then you are expanding the usage of userspace
> >       device access in an insecure manner.
[...]
> btw, (2) doesn't really add any insecurity.  The user could already poke 
> at the msix tables (as well as perform DMA); they just couldn't get a 
> useful interrupt out of them.
> 
> Maybe a module parameter "allow_insecure_dma" can be added to 
> uio_pci_generic.  Without the parameter, bus mastering and MSI-X are 
> disabled; with the parameter they are allowed.  This requires the sysadmin 
> to take a positive step in order to make use of their hardware.

Giving control of the memory protection level to the distribution or
the administrator looks like a good idea.
When insecure DMA is allowed, a log message will make clear how it is
supported - or not - by the system provider.

From another thread:
2015-10-01 14:09, Michael S. Tsirkin:
> If Linux keeps enabling hacks, no one will bother doing the right thing.
> Upstream inclusion is the only carrot Linux has to make people do the
> right thing.

The "right thing" should be guided by the users needs at a given time.
The "carrot" for a better solution will be to have a well protected system.
  
Stephen Hemminger Oct. 16, 2015, 5:20 p.m. UTC | #28
On Fri, 16 Oct 2015 19:11:35 +0200
Thomas Monjalon <thomas.monjalon@6wind.com> wrote:

> To sum it up,
> We want to remove the need of the out-of-tree module igb_uio.
> 3 possible implementations were discussed so far:
> - new UIO driver
> - extend uio_pci_generic
> - VFIO without IOMMU

There is recent progress on VFIO without IOMMU. This looks like the most
promising long-term solution.
  

Patch

diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
index 52c98ce..04adfa0 100644
--- a/drivers/uio/Kconfig
+++ b/drivers/uio/Kconfig
@@ -93,6 +93,15 @@  config UIO_PCI_GENERIC
 	  primarily, for virtualization scenarios.
 	  If you compile this as a module, it will be called uio_pci_generic.
 
+config UIO_PCI_MSI
+       tristate "Generic driver supporting MSI-X on PCI Express cards"
+	depends on PCI
+	help
+	  Generic driver that provides Message Signalled IRQ events
+	  similar to VFIO. If IOMMU is available, please use VFIO
+	  instead, since it provides more security.
+	  If you compile this as a module, it will be called uio_msi.
+
 config UIO_NETX
 	tristate "Hilscher NetX Card driver"
 	depends on PCI
diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
index 8560dad..62fc44b 100644
--- a/drivers/uio/Makefile
+++ b/drivers/uio/Makefile
@@ -9,3 +9,4 @@  obj-$(CONFIG_UIO_NETX)	+= uio_netx.o
 obj-$(CONFIG_UIO_PRUSS)         += uio_pruss.o
 obj-$(CONFIG_UIO_MF624)         += uio_mf624.o
 obj-$(CONFIG_UIO_FSL_ELBC_GPCM)	+= uio_fsl_elbc_gpcm.o
+obj-$(CONFIG_UIO_PCI_MSI)	+= uio_msi.o
diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
new file mode 100644
index 0000000..802b5c4
--- /dev/null
+++ b/drivers/uio/uio_msi.c
@@ -0,0 +1,378 @@ 
+/*-
+ *
+ * Copyright (c) 2015 by Brocade Communications Systems, Inc.
+ * Author: Stephen Hemminger <stephen@networkplumber.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 only.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/uio_driver.h>
+#include <linux/msi.h>
+#include <linux/uio_msi.h>
+
+#define DRIVER_VERSION		"0.1.1"
+#define MAX_MSIX_VECTORS	64
+
+/* MSI-X vector information */
+struct uio_msi_pci_dev {
+	struct uio_info info;		/* UIO driver info */
+	struct pci_dev *pdev;		/* PCI device */
+	struct mutex	mutex;		/* open/release/ioctl mutex */
+	int		ref_cnt;	/* references to device */
+	unsigned int	max_vectors;	/* MSI-X slots available */
+	struct msix_entry *msix;	/* MSI-X vector table */
+	struct uio_msi_irq_ctx {
+		struct eventfd_ctx *trigger; /* vector to eventfd */
+		char *name;		/* name in /proc/interrupts */
+	} *ctx;
+};
+
+static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
+{
+	struct uio_msi_pci_dev *udev = arg;
+
+	if (pci_check_and_mask_intx(udev->pdev)) {
+		eventfd_signal(udev->ctx->trigger, 1);
+		return IRQ_HANDLED;
+	}
+
+	return IRQ_NONE;
+}
+
+static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
+{
+	struct eventfd_ctx *trigger = arg;
+
+	eventfd_signal(trigger, 1);
+	return IRQ_HANDLED;
+}
+
+/* set the mapping between vector # and existing eventfd. */
+static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
+{
+	struct eventfd_ctx *trigger;
+	int irq, err;
+
+	if (vec >= udev->max_vectors) {
+		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
+			   vec, udev->max_vectors);
+		return -ERANGE;
+	}
+
+	irq = udev->msix[vec].vector;
+	trigger = udev->ctx[vec].trigger;
+	if (trigger) {
+		/* Clean up existing irq mapping */
+		free_irq(irq, trigger);
+		eventfd_ctx_put(trigger);
+		udev->ctx[vec].trigger = NULL;
+	}
+
+	/* Passing -1 is used to disable interrupt */
+	if (fd < 0)
+		return 0;
+
+	trigger = eventfd_ctx_fdget(fd);
+	if (IS_ERR(trigger)) {
+		err = PTR_ERR(trigger);
+		dev_notice(&udev->pdev->dev,
+			   "eventfd ctx get failed: %d\n", err);
+		return err;
+	}
+
+	if (udev->msix)
+		err = request_irq(irq, uio_msi_irqhandler, 0,
+				  udev->ctx[vec].name, trigger);
+	else
+		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
+				  udev->ctx[vec].name, udev);
+
+	if (err) {
+		dev_notice(&udev->pdev->dev,
+			   "request irq failed: %d\n", err);
+		eventfd_ctx_put(trigger);
+		return err;
+	}
+
+	udev->ctx[vec].trigger = trigger;
+	return 0;
+}
+
+static int
+uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
+{
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	struct uio_msi_irq_set hdr;
+	int err;
+
+	switch (cmd) {
+	case UIO_MSI_IRQ_SET:
+		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
+			return -EFAULT;
+
+		mutex_lock(&udev->mutex);
+		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
+		mutex_unlock(&udev->mutex);
+		break;
+	default:
+		err = -EOPNOTSUPP;
+	}
+	return err;
+}
+
+/* Opening the UIO device for the first time enables MSI-X */
+static int
+uio_msi_open(struct uio_info *info, struct inode *inode)
+{
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	int err = 0;
+
+	mutex_lock(&udev->mutex);
+	if (udev->ref_cnt++ == 0) {
+		if (udev->msix)
+			err = pci_enable_msix(udev->pdev, udev->msix,
+					      udev->max_vectors);
+	}
+	mutex_unlock(&udev->mutex);
+
+	return err;
+}
+
+/* Last close of the UIO device releases/disables all IRQ's */
+static int
+uio_msi_release(struct uio_info *info, struct inode *inode)
+{
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	int i;
+
+	mutex_lock(&udev->mutex);
+	if (--udev->ref_cnt == 0) {
+		for (i = 0; i < udev->max_vectors; i++) {
+			int irq = udev->msix[i].vector;
+			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
+
+			if (!trigger)
+				continue;
+
+			free_irq(irq, trigger);
+			eventfd_ctx_put(trigger);
+			udev->ctx[i].trigger = NULL;
+		}
+
+		if (udev->msix)
+			pci_disable_msix(udev->pdev);
+	}
+	mutex_unlock(&udev->mutex);
+
+	return 0;
+}
+
+/* Unmap previously ioremap'd resources */
+static void
+release_iomaps(struct uio_mem *mem)
+{
+	int i;
+
+	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
+		if (mem->internal_addr)
+			iounmap(mem->internal_addr);
+	}
+}
+
+static int
+setup_maps(struct pci_dev *pdev, struct uio_info *info)
+{
+	int i, m = 0, p = 0, err;
+	static const char * const bar_names[] = {
+		"BAR0",	"BAR1",	"BAR2",	"BAR3",	"BAR4",	"BAR5",
+	};
+
+	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
+		unsigned long start = pci_resource_start(pdev, i);
+		unsigned long flags = pci_resource_flags(pdev, i);
+		unsigned long len = pci_resource_len(pdev, i);
+
+		if (start == 0 || len == 0)
+			continue;
+
+		if (flags & IORESOURCE_MEM) {
+			void __iomem *addr;
+
+			if (m >= MAX_UIO_MAPS)
+				continue;
+
+			addr = ioremap(start, len);
+			if (addr == NULL) {
+				err = -EINVAL;
+				goto fail;
+			}
+
+			info->mem[m].name = bar_names[i];
+			info->mem[m].addr = start;
+			info->mem[m].internal_addr = addr;
+			info->mem[m].size = len;
+			info->mem[m].memtype = UIO_MEM_PHYS;
+			++m;
+		} else if (flags & IORESOURCE_IO) {
+			if (p >= MAX_UIO_PORT_REGIONS)
+				continue;
+
+			info->port[p].name = bar_names[i];
+			info->port[p].start = start;
+			info->port[p].size = len;
+			info->port[p].porttype = UIO_PORT_X86;
+			++p;
+		}
+	}
+
+	return 0;
+ fail:
+	for (i = 0; i < m; i++)
+		iounmap(info->mem[i].internal_addr);
+	return err;
+}
+
+static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct uio_msi_pci_dev *udev;
+	int i, err, vectors;
+
+	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
+	if (!udev)
+		return -ENOMEM;
+
+	err = pci_enable_device(pdev);
+	if (err != 0) {
+		dev_err(&pdev->dev, "cannot enable PCI device\n");
+		goto fail_free;
+	}
+
+	err = pci_request_regions(pdev, "uio_msi");
+	if (err != 0) {
+		dev_err(&pdev->dev, "Cannot request regions\n");
+		goto fail_disable;
+	}
+
+	pci_set_master(pdev);
+
+	/* remap resources */
+	err = setup_maps(pdev, &udev->info);
+	if (err)
+		goto fail_release_iomem;
+
+	/* fill uio infos */
+	udev->info.name = "uio_msi";
+	udev->info.version = DRIVER_VERSION;
+	udev->info.priv = udev;
+	udev->pdev = pdev;
+	udev->info.ioctl = uio_msi_ioctl;
+	udev->info.open = uio_msi_open;
+	udev->info.release = uio_msi_release;
+	udev->info.irq = UIO_IRQ_CUSTOM;
+	mutex_init(&udev->mutex);
+
+	vectors = pci_msix_vec_count(pdev);
+	if (vectors > 0) {
+		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
+		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
+			udev->max_vectors);
+
+		err = -ENOMEM;
+		udev->msix = kcalloc(udev->max_vectors,
+				     sizeof(struct msix_entry), GFP_KERNEL);
+		if (!udev->msix)
+			goto fail_release_iomem;
+	} else if (!pci_intx_mask_supported(pdev)) {
+		dev_err(&pdev->dev,
+			"device does not support MSI-X or INTX\n");
+		err = -EINVAL;
+		goto fail_release_iomem;
+	} else {
+		dev_notice(&pdev->dev, "using INTX\n");
+		udev->info.irq_flags = IRQF_SHARED;
+		udev->max_vectors = 1;
+	}
+
+	udev->ctx = kcalloc(udev->max_vectors,
+			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
+	if (!udev->ctx)
+		goto fail_free_msix;
+
+	for (i = 0; i < udev->max_vectors; i++) {
+		udev->msix[i].entry = i;
+
+		udev->ctx[i].name = kasprintf(GFP_KERNEL,
+					      KBUILD_MODNAME "[%d](%s)",
+					      i, pci_name(pdev));
+		if (!udev->ctx[i].name)
+			goto fail_free_ctx;
+	}
+
+	/* register uio driver */
+	err = uio_register_device(&pdev->dev, &udev->info);
+	if (err != 0)
+		goto fail_free_ctx;
+
+	pci_set_drvdata(pdev, udev);
+	return 0;
+
+fail_free_ctx:
+	for (i = 0; i < udev->max_vectors; i++)
+		kfree(udev->ctx[i].name);
+	kfree(udev->ctx);
+fail_free_msix:
+	kfree(udev->msix);
+fail_release_iomem:
+	release_iomaps(udev->info.mem);
+	pci_release_regions(pdev);
+fail_disable:
+	pci_disable_device(pdev);
+fail_free:
+	kfree(udev);
+
+	pr_notice("%s ret %d\n", __func__, err);
+	return err;
+}
+
+static void uio_msi_remove(struct pci_dev *pdev)
+{
+	struct uio_info *info = pci_get_drvdata(pdev);
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	int i;
+
+	uio_unregister_device(info);
+	release_iomaps(info->mem);
+
+	pci_release_regions(pdev);
+	for (i = 0; i < udev->max_vectors; i++)
+		kfree(udev->ctx[i].name);
+	kfree(udev->ctx);
+	kfree(udev->msix);
+	pci_disable_device(pdev);
+
+	pci_set_drvdata(pdev, NULL);
+	kfree(udev);
+}
+
+static struct pci_driver uio_msi_pci_driver = {
+	.name = "uio_msi",
+	.probe = uio_msi_probe,
+	.remove = uio_msi_remove,
+};
+
+module_pci_driver(uio_msi_pci_driver);
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
+MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index f7b2db4..d9497691 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -411,6 +411,7 @@  header-y += udp.h
 header-y += uhid.h
 header-y += uinput.h
 header-y += uio.h
+header-y += uio_msi.h
 header-y += ultrasound.h
 header-y += un.h
 header-y += unistd.h
diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
new file mode 100644
index 0000000..297de00
--- /dev/null
+++ b/include/uapi/linux/uio_msi.h
@@ -0,0 +1,22 @@ 
+/*
+ * UIO_MSI API definition
+ *
+ * Copyright (c) 2015 by Brocade Communications Systems, Inc.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _UIO_PCI_MSI_H
+#define _UIO_PCI_MSI_H
+
+struct uio_msi_irq_set {
+	u32 vec;
+	int fd;
+};
+
+#define UIO_MSI_BASE	0x86
+#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
+
+#endif
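
For reference, a minimal userspace sketch of how this interface is meant to
be used. The device node and vector number are placeholders, and the struct
and ioctl definitions simply mirror the header above:

/* Sketch: bind MSI-X vector 0 of a uio_msi device to an eventfd and wait
 * for one interrupt. /dev/uio0 is a placeholder for the actual node. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct uio_msi_irq_set {                /* mirrors linux/uio_msi.h above */
        uint32_t vec;
        int fd;
};

#define UIO_MSI_BASE    0x86
#define UIO_MSI_IRQ_SET _IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)

int main(void)
{
        int uio = open("/dev/uio0", O_RDWR);    /* first open enables MSI-X */
        int efd = eventfd(0, 0);
        struct uio_msi_irq_set set = { .vec = 0, .fd = efd };
        uint64_t count;

        if (uio < 0 || efd < 0 || ioctl(uio, UIO_MSI_IRQ_SET, &set) < 0) {
                perror("uio_msi setup");
                return 1;
        }
        /* blocks until the driver's handler signals the eventfd */
        if (read(efd, &count, sizeof(count)) == sizeof(count))
                printf("vector 0 fired %llu time(s)\n",
                       (unsigned long long)count);
        return 0;
}

With multiple vectors, each one would simply get its own eventfd, which can
then be multiplexed with epoll as usual.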