Mapping MMIO region write-back does not work
I want all read & write requests to a PCIe device to be cached by CPU caches. However, it does not work as I expected.
These are my assumptions on write-back MMIO regions.
- Writes to the PCIe device happen only on cache write-back.
- The size of TLP payloads is cache block size (64B).
However, captured TLPs do not follow my assumptions.
- Writes to the PCIe device happen on every write to the MMIO region.
- The size of TLP payloads is 1B.
I write 8-byte of 0xff
to the MMIO region with the following user space program & device driver.
Part of User Program
struct pcie_ioctl ioctl_control;
ioctl_control.bar_select = BAR_ID;
ioctl_control.num_bytes_to_write = atoi(argv[1]);
if (ioctl(fd, IOCTL_WRITE_0xFF, &ioctl_control) < 0) {
printf("ioctl failedn");
}
Part of Device Driver
case IOCTL_WRITE_0xFF:
{
int i;
char *buff;
struct pci_cdev_struct *pci_cdev = pci_get_drvdata(fpga_pcie_dev.pci_device);
copy_from_user(&ioctl_control, (void __user *)arg, sizeof(ioctl_control));
buff = kmalloc(sizeof(char) * ioctl_control.num_bytes_to_write, GFP_KERNEL);
for (i = 0; i < ioctl_control.num_bytes_to_write; i++) {
buff[i] = 0xff;
}
memcpy(pci_cdev->bar[ioctl_control.bar_select], buff, ioctl_control.num_bytes_to_write);
kfree(buff);
break;
}
I modified MTRRs to make the corresponding MMIO region write-back. The MMIO region starts from 0x0c7300000, and the length is 0x100000 (1MB). Followings are cat /proc/mtrr
results for different policies. Please note that I made each region exclusive.
Uncacheable
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: uncachable
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Write-combining
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-combining
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Write-back
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-back
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Followings are waveform captures for 8B write with different policies. I have used integrated logic analyzer (ILA) to capture these waveform. Please watch pcie_endpoint_litepcietlpdepacketizer_tlp_req_payload_dat
when pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid
is set. You can count the number of packets by counting pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid
in these waveform example.
Uncacheable: link -> correct, 1B x 8 packets
Write-combining: link -> correct, 8B x 1 packet
Write-back: link -> unexpected, 1B x 8 packets
System configuration is like below.
CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
OS: Linux kernel 4.15.0-38
PCIe Device: Xilinx FPGA KC705 programmed with litepcie
Related Links
- Generating a 64-byte read PCIe TLP from an x86 CPU
- How to Implement a 64B PCIe* Burst Transfer on Intel® Architecture
- Write Combining Buffer Out of Order Writes and PCIe
- Do Ryzen support write-back caching for Memory Mapped IO (through PCIe interface)?
- MTRR (Memory Type Range Register) control
- PATting Linux
- Down to the TLP: How PCI express devices talk (Part I)
linux caching x86 fpga pci-e
add a comment |
I want all read & write requests to a PCIe device to be cached by CPU caches. However, it does not work as I expected.
These are my assumptions on write-back MMIO regions.
- Writes to the PCIe device happen only on cache write-back.
- The size of TLP payloads is cache block size (64B).
However, captured TLPs do not follow my assumptions.
- Writes to the PCIe device happen on every write to the MMIO region.
- The size of TLP payloads is 1B.
I write 8-byte of 0xff
to the MMIO region with the following user space program & device driver.
Part of User Program
struct pcie_ioctl ioctl_control;
ioctl_control.bar_select = BAR_ID;
ioctl_control.num_bytes_to_write = atoi(argv[1]);
if (ioctl(fd, IOCTL_WRITE_0xFF, &ioctl_control) < 0) {
printf("ioctl failedn");
}
Part of Device Driver
case IOCTL_WRITE_0xFF:
{
int i;
char *buff;
struct pci_cdev_struct *pci_cdev = pci_get_drvdata(fpga_pcie_dev.pci_device);
copy_from_user(&ioctl_control, (void __user *)arg, sizeof(ioctl_control));
buff = kmalloc(sizeof(char) * ioctl_control.num_bytes_to_write, GFP_KERNEL);
for (i = 0; i < ioctl_control.num_bytes_to_write; i++) {
buff[i] = 0xff;
}
memcpy(pci_cdev->bar[ioctl_control.bar_select], buff, ioctl_control.num_bytes_to_write);
kfree(buff);
break;
}
I modified MTRRs to make the corresponding MMIO region write-back. The MMIO region starts from 0x0c7300000, and the length is 0x100000 (1MB). Followings are cat /proc/mtrr
results for different policies. Please note that I made each region exclusive.
Uncacheable
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: uncachable
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Write-combining
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-combining
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Write-back
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-back
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Followings are waveform captures for 8B write with different policies. I have used integrated logic analyzer (ILA) to capture these waveform. Please watch pcie_endpoint_litepcietlpdepacketizer_tlp_req_payload_dat
when pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid
is set. You can count the number of packets by counting pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid
in these waveform example.
Uncacheable: link -> correct, 1B x 8 packets
Write-combining: link -> correct, 8B x 1 packet
Write-back: link -> unexpected, 1B x 8 packets
System configuration is like below.
CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
OS: Linux kernel 4.15.0-38
PCIe Device: Xilinx FPGA KC705 programmed with litepcie
Related Links
- Generating a 64-byte read PCIe TLP from an x86 CPU
- How to Implement a 64B PCIe* Burst Transfer on Intel® Architecture
- Write Combining Buffer Out of Order Writes and PCIe
- Do Ryzen support write-back caching for Memory Mapped IO (through PCIe interface)?
- MTRR (Memory Type Range Register) control
- PATting Linux
- Down to the TLP: How PCI express devices talk (Part I)
linux caching x86 fpga pci-e
1
Is it possible that something else (like PAT) is overriding the MTRR setting and making it actually be UC instead of WB? An BTW, the single-byte transactions might be frommempcy
being implemented asrep movsb
inside the kernel (because your CPU is new enough for ERMSB which makesrep movsb
fairly good.)
– Peter Cordes
Nov 15 '18 at 1:48
1
@PeterCordes Thanks, Peter. I could find a conflict in PAT. The region is set asuncached-minus
in PAT. I will try to solve it.
– Taekyung Heo
Nov 15 '18 at 5:22
add a comment |
I want all read & write requests to a PCIe device to be cached by CPU caches. However, it does not work as I expected.
These are my assumptions on write-back MMIO regions.
- Writes to the PCIe device happen only on cache write-back.
- The size of TLP payloads is cache block size (64B).
However, captured TLPs do not follow my assumptions.
- Writes to the PCIe device happen on every write to the MMIO region.
- The size of TLP payloads is 1B.
I write 8-byte of 0xff
to the MMIO region with the following user space program & device driver.
Part of User Program
struct pcie_ioctl ioctl_control;
ioctl_control.bar_select = BAR_ID;
ioctl_control.num_bytes_to_write = atoi(argv[1]);
if (ioctl(fd, IOCTL_WRITE_0xFF, &ioctl_control) < 0) {
printf("ioctl failedn");
}
Part of Device Driver
case IOCTL_WRITE_0xFF:
{
int i;
char *buff;
struct pci_cdev_struct *pci_cdev = pci_get_drvdata(fpga_pcie_dev.pci_device);
copy_from_user(&ioctl_control, (void __user *)arg, sizeof(ioctl_control));
buff = kmalloc(sizeof(char) * ioctl_control.num_bytes_to_write, GFP_KERNEL);
for (i = 0; i < ioctl_control.num_bytes_to_write; i++) {
buff[i] = 0xff;
}
memcpy(pci_cdev->bar[ioctl_control.bar_select], buff, ioctl_control.num_bytes_to_write);
kfree(buff);
break;
}
I modified MTRRs to make the corresponding MMIO region write-back. The MMIO region starts from 0x0c7300000, and the length is 0x100000 (1MB). Followings are cat /proc/mtrr
results for different policies. Please note that I made each region exclusive.
Uncacheable
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: uncachable
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Write-combining
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-combining
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Write-back
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-back
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Followings are waveform captures for 8B write with different policies. I have used integrated logic analyzer (ILA) to capture these waveform. Please watch pcie_endpoint_litepcietlpdepacketizer_tlp_req_payload_dat
when pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid
is set. You can count the number of packets by counting pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid
in these waveform example.
Uncacheable: link -> correct, 1B x 8 packets
Write-combining: link -> correct, 8B x 1 packet
Write-back: link -> unexpected, 1B x 8 packets
System configuration is like below.
CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
OS: Linux kernel 4.15.0-38
PCIe Device: Xilinx FPGA KC705 programmed with litepcie
Related Links
- Generating a 64-byte read PCIe TLP from an x86 CPU
- How to Implement a 64B PCIe* Burst Transfer on Intel® Architecture
- Write Combining Buffer Out of Order Writes and PCIe
- Do Ryzen support write-back caching for Memory Mapped IO (through PCIe interface)?
- MTRR (Memory Type Range Register) control
- PATting Linux
- Down to the TLP: How PCI express devices talk (Part I)
linux caching x86 fpga pci-e
I want all read & write requests to a PCIe device to be cached by CPU caches. However, it does not work as I expected.
These are my assumptions on write-back MMIO regions.
- Writes to the PCIe device happen only on cache write-back.
- The size of TLP payloads is cache block size (64B).
However, captured TLPs do not follow my assumptions.
- Writes to the PCIe device happen on every write to the MMIO region.
- The size of TLP payloads is 1B.
I write 8-byte of 0xff
to the MMIO region with the following user space program & device driver.
Part of User Program
struct pcie_ioctl ioctl_control;
ioctl_control.bar_select = BAR_ID;
ioctl_control.num_bytes_to_write = atoi(argv[1]);
if (ioctl(fd, IOCTL_WRITE_0xFF, &ioctl_control) < 0) {
printf("ioctl failedn");
}
Part of Device Driver
case IOCTL_WRITE_0xFF:
{
int i;
char *buff;
struct pci_cdev_struct *pci_cdev = pci_get_drvdata(fpga_pcie_dev.pci_device);
copy_from_user(&ioctl_control, (void __user *)arg, sizeof(ioctl_control));
buff = kmalloc(sizeof(char) * ioctl_control.num_bytes_to_write, GFP_KERNEL);
for (i = 0; i < ioctl_control.num_bytes_to_write; i++) {
buff[i] = 0xff;
}
memcpy(pci_cdev->bar[ioctl_control.bar_select], buff, ioctl_control.num_bytes_to_write);
kfree(buff);
break;
}
I modified MTRRs to make the corresponding MMIO region write-back. The MMIO region starts from 0x0c7300000, and the length is 0x100000 (1MB). Followings are cat /proc/mtrr
results for different policies. Please note that I made each region exclusive.
Uncacheable
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: uncachable
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Write-combining
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-combining
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Write-back
reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-back
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
Followings are waveform captures for 8B write with different policies. I have used integrated logic analyzer (ILA) to capture these waveform. Please watch pcie_endpoint_litepcietlpdepacketizer_tlp_req_payload_dat
when pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid
is set. You can count the number of packets by counting pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid
in these waveform example.
Uncacheable: link -> correct, 1B x 8 packets
Write-combining: link -> correct, 8B x 1 packet
Write-back: link -> unexpected, 1B x 8 packets
System configuration is like below.
CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
OS: Linux kernel 4.15.0-38
PCIe Device: Xilinx FPGA KC705 programmed with litepcie
Related Links
- Generating a 64-byte read PCIe TLP from an x86 CPU
- How to Implement a 64B PCIe* Burst Transfer on Intel® Architecture
- Write Combining Buffer Out of Order Writes and PCIe
- Do Ryzen support write-back caching for Memory Mapped IO (through PCIe interface)?
- MTRR (Memory Type Range Register) control
- PATting Linux
- Down to the TLP: How PCI express devices talk (Part I)
linux caching x86 fpga pci-e
linux caching x86 fpga pci-e
edited Nov 15 '18 at 2:03
Taekyung Heo
asked Nov 15 '18 at 1:21
Taekyung HeoTaekyung Heo
315
315
1
Is it possible that something else (like PAT) is overriding the MTRR setting and making it actually be UC instead of WB? An BTW, the single-byte transactions might be frommempcy
being implemented asrep movsb
inside the kernel (because your CPU is new enough for ERMSB which makesrep movsb
fairly good.)
– Peter Cordes
Nov 15 '18 at 1:48
1
@PeterCordes Thanks, Peter. I could find a conflict in PAT. The region is set asuncached-minus
in PAT. I will try to solve it.
– Taekyung Heo
Nov 15 '18 at 5:22
add a comment |
1
Is it possible that something else (like PAT) is overriding the MTRR setting and making it actually be UC instead of WB? An BTW, the single-byte transactions might be frommempcy
being implemented asrep movsb
inside the kernel (because your CPU is new enough for ERMSB which makesrep movsb
fairly good.)
– Peter Cordes
Nov 15 '18 at 1:48
1
@PeterCordes Thanks, Peter. I could find a conflict in PAT. The region is set asuncached-minus
in PAT. I will try to solve it.
– Taekyung Heo
Nov 15 '18 at 5:22
1
1
Is it possible that something else (like PAT) is overriding the MTRR setting and making it actually be UC instead of WB? An BTW, the single-byte transactions might be from
mempcy
being implemented as rep movsb
inside the kernel (because your CPU is new enough for ERMSB which makes rep movsb
fairly good.)– Peter Cordes
Nov 15 '18 at 1:48
Is it possible that something else (like PAT) is overriding the MTRR setting and making it actually be UC instead of WB? An BTW, the single-byte transactions might be from
mempcy
being implemented as rep movsb
inside the kernel (because your CPU is new enough for ERMSB which makes rep movsb
fairly good.)– Peter Cordes
Nov 15 '18 at 1:48
1
1
@PeterCordes Thanks, Peter. I could find a conflict in PAT. The region is set as
uncached-minus
in PAT. I will try to solve it.– Taekyung Heo
Nov 15 '18 at 5:22
@PeterCordes Thanks, Peter. I could find a conflict in PAT. The region is set as
uncached-minus
in PAT. I will try to solve it.– Taekyung Heo
Nov 15 '18 at 5:22
add a comment |
1 Answer
1
active
oldest
votes
In short, it seems that mapping MMIO region write-back does not work by design.
Please upload an answer if anyone finds that it is possible.
I came to find John McCalpin's articles and answers. First, mapping MMIO region write-back is not possible. Second, workaround is possible on some processors.
Mapping MMIO region write-back is not possible
Quote from this link
FYI: The WB type will not work with memory-mapped IO. You can
program the bits to set up the mapping as WB, but the system will
crash as soon as it gets a transaction that it does not know how to
handle. It is theoretically possible to use WP or WT to get cached
reads from MMIO, but coherence has to be handled in software.
Quote from this link
Only when I set both PAT and MTRR to WB does the kernel crash
Workaround is possible on some processors
Notes on Cached Access to Memory-Mapped IO Regions, John McCalpin
There is one set of mappings that can be made to work on at least some
x86-64 processors, and it is based on mapping the MMIO space twice.
Map the MMIO range with a set of attributes that allow write-combining
stores (but only uncached reads). Map the MMIO range a second time
with a set of attributes that allow cache-line reads (but only
uncached, non-write-combined stores).
1
I wonder if anything changes with Skylake-SP where 64-byte stores are possible to UC MMIO regions with a single AVX512 instruction. But probably cache eviction is still different from amovntps [rdi], zmm0
, ormovaps
on a UC memory region.
– Peter Cordes
Nov 15 '18 at 11:06
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53311131%2fmapping-mmio-region-write-back-does-not-work%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
In short, it seems that mapping MMIO region write-back does not work by design.
Please upload an answer if anyone finds that it is possible.
I came to find John McCalpin's articles and answers. First, mapping MMIO region write-back is not possible. Second, workaround is possible on some processors.
Mapping MMIO region write-back is not possible
Quote from this link
FYI: The WB type will not work with memory-mapped IO. You can
program the bits to set up the mapping as WB, but the system will
crash as soon as it gets a transaction that it does not know how to
handle. It is theoretically possible to use WP or WT to get cached
reads from MMIO, but coherence has to be handled in software.
Quote from this link
Only when I set both PAT and MTRR to WB does the kernel crash
Workaround is possible on some processors
Notes on Cached Access to Memory-Mapped IO Regions, John McCalpin
There is one set of mappings that can be made to work on at least some
x86-64 processors, and it is based on mapping the MMIO space twice.
Map the MMIO range with a set of attributes that allow write-combining
stores (but only uncached reads). Map the MMIO range a second time
with a set of attributes that allow cache-line reads (but only
uncached, non-write-combined stores).
1
I wonder if anything changes with Skylake-SP where 64-byte stores are possible to UC MMIO regions with a single AVX512 instruction. But probably cache eviction is still different from amovntps [rdi], zmm0
, ormovaps
on a UC memory region.
– Peter Cordes
Nov 15 '18 at 11:06
add a comment |
In short, it seems that mapping MMIO region write-back does not work by design.
Please upload an answer if anyone finds that it is possible.
I came to find John McCalpin's articles and answers. First, mapping MMIO region write-back is not possible. Second, workaround is possible on some processors.
Mapping MMIO region write-back is not possible
Quote from this link
FYI: The WB type will not work with memory-mapped IO. You can
program the bits to set up the mapping as WB, but the system will
crash as soon as it gets a transaction that it does not know how to
handle. It is theoretically possible to use WP or WT to get cached
reads from MMIO, but coherence has to be handled in software.
Quote from this link
Only when I set both PAT and MTRR to WB does the kernel crash
Workaround is possible on some processors
Notes on Cached Access to Memory-Mapped IO Regions, John McCalpin
There is one set of mappings that can be made to work on at least some
x86-64 processors, and it is based on mapping the MMIO space twice.
Map the MMIO range with a set of attributes that allow write-combining
stores (but only uncached reads). Map the MMIO range a second time
with a set of attributes that allow cache-line reads (but only
uncached, non-write-combined stores).
1
I wonder if anything changes with Skylake-SP where 64-byte stores are possible to UC MMIO regions with a single AVX512 instruction. But probably cache eviction is still different from amovntps [rdi], zmm0
, ormovaps
on a UC memory region.
– Peter Cordes
Nov 15 '18 at 11:06
add a comment |
In short, it seems that mapping MMIO region write-back does not work by design.
Please upload an answer if anyone finds that it is possible.
I came to find John McCalpin's articles and answers. First, mapping MMIO region write-back is not possible. Second, workaround is possible on some processors.
Mapping MMIO region write-back is not possible
Quote from this link
FYI: The WB type will not work with memory-mapped IO. You can
program the bits to set up the mapping as WB, but the system will
crash as soon as it gets a transaction that it does not know how to
handle. It is theoretically possible to use WP or WT to get cached
reads from MMIO, but coherence has to be handled in software.
Quote from this link
Only when I set both PAT and MTRR to WB does the kernel crash
Workaround is possible on some processors
Notes on Cached Access to Memory-Mapped IO Regions, John McCalpin
There is one set of mappings that can be made to work on at least some
x86-64 processors, and it is based on mapping the MMIO space twice.
Map the MMIO range with a set of attributes that allow write-combining
stores (but only uncached reads). Map the MMIO range a second time
with a set of attributes that allow cache-line reads (but only
uncached, non-write-combined stores).
In short, it seems that mapping MMIO region write-back does not work by design.
Please upload an answer if anyone finds that it is possible.
I came to find John McCalpin's articles and answers. First, mapping MMIO region write-back is not possible. Second, workaround is possible on some processors.
Mapping MMIO region write-back is not possible
Quote from this link
FYI: The WB type will not work with memory-mapped IO. You can
program the bits to set up the mapping as WB, but the system will
crash as soon as it gets a transaction that it does not know how to
handle. It is theoretically possible to use WP or WT to get cached
reads from MMIO, but coherence has to be handled in software.
Quote from this link
Only when I set both PAT and MTRR to WB does the kernel crash
Workaround is possible on some processors
Notes on Cached Access to Memory-Mapped IO Regions, John McCalpin
There is one set of mappings that can be made to work on at least some
x86-64 processors, and it is based on mapping the MMIO space twice.
Map the MMIO range with a set of attributes that allow write-combining
stores (but only uncached reads). Map the MMIO range a second time
with a set of attributes that allow cache-line reads (but only
uncached, non-write-combined stores).
edited Nov 15 '18 at 11:24
answered Nov 15 '18 at 6:07
Taekyung HeoTaekyung Heo
315
315
1
I wonder if anything changes with Skylake-SP where 64-byte stores are possible to UC MMIO regions with a single AVX512 instruction. But probably cache eviction is still different from amovntps [rdi], zmm0
, ormovaps
on a UC memory region.
– Peter Cordes
Nov 15 '18 at 11:06
add a comment |
1
I wonder if anything changes with Skylake-SP where 64-byte stores are possible to UC MMIO regions with a single AVX512 instruction. But probably cache eviction is still different from amovntps [rdi], zmm0
, ormovaps
on a UC memory region.
– Peter Cordes
Nov 15 '18 at 11:06
1
1
I wonder if anything changes with Skylake-SP where 64-byte stores are possible to UC MMIO regions with a single AVX512 instruction. But probably cache eviction is still different from a
movntps [rdi], zmm0
, or movaps
on a UC memory region.– Peter Cordes
Nov 15 '18 at 11:06
I wonder if anything changes with Skylake-SP where 64-byte stores are possible to UC MMIO regions with a single AVX512 instruction. But probably cache eviction is still different from a
movntps [rdi], zmm0
, or movaps
on a UC memory region.– Peter Cordes
Nov 15 '18 at 11:06
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53311131%2fmapping-mmio-region-write-back-does-not-work%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Is it possible that something else (like PAT) is overriding the MTRR setting and making it actually be UC instead of WB? An BTW, the single-byte transactions might be from
mempcy
being implemented asrep movsb
inside the kernel (because your CPU is new enough for ERMSB which makesrep movsb
fairly good.)– Peter Cordes
Nov 15 '18 at 1:48
1
@PeterCordes Thanks, Peter. I could find a conflict in PAT. The region is set as
uncached-minus
in PAT. I will try to solve it.– Taekyung Heo
Nov 15 '18 at 5:22