Is there a limit on the number of hugepage entries that can be stored in the TLB











I'm trying to analyze the network performance boost that VMs get when they use hugepages. For this I configured the hypervisor to reserve 36 1G hugepages by changing the grub command line and rebooting, and when launching the VMs I made sure the hugepages were passed through to them. When I launched 8 VMs (each backed by two 1G hugepages) and ran network throughput tests between them, throughput was drastically lower than when running without hugepages. That got me wondering whether it had something to do with the number of hugepages I was using. Is there a limit on the number of 1G hugepages that can be referenced via the TLB, and if so, is it lower than the limit for regular-sized pages? How do I find this information? In this scenario I was using an Ivy Bridge system, and using the cpuid command I saw something like
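For reference, a typical way to reserve 1G pages at boot looks like the following (a sketch only; the grub file location and the config-regeneration command vary by distro):

```shell
# Reserve 36 x 1GiB huge pages at boot (sketch; paths vary by distro).
# Append to the kernel command line in /etc/default/grub:
#   GRUB_CMDLINE_LINUX="... default_hugepagesz=1G hugepagesz=1G hugepages=36"
# then regenerate the grub config and reboot, e.g.:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg

# After reboot, verify the reservation:
grep -i huge /proc/meminfo
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
```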



cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xff: cache data is in CPUID 4
0xb5: instruction TLB: 4K, 8-way, 64 entries
0xf0: 64 byte prefetching
0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries


Does it mean I can have only 4 1G hugepage mappings in the TLB at any time?
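For comparison, the TLB reach implied by those cpuid descriptors can be tallied with a few lines of arithmetic (entry counts are taken from the cpuid dump above):

```python
# Rough data-TLB reach arithmetic based on the cpuid dump above.
# Entry counts come straight from the descriptors; sizes are in bytes.
KIB, GIB = 1024, 1024**3

dtlb = {
    "1G pages": (4, 1 * GIB),   # 0x63: 4 entries, 4-way
    "4K pages": (64, 4 * KIB),  # 0x03: 64 entries, 4-way
}

for name, (entries, page_size) in dtlb.items():
    reach = entries * page_size
    print(f"{name}: {entries} entries -> {reach / KIB:.0f} KiB of reach")

# 4 x 1 GiB = 4 GiB of reach for huge pages versus 64 x 4 KiB = 256 KiB
# for small pages: fewer entries, but vastly more coverage per entry.
```

So yes, the first-level dTLB holds only 4 1G entries, but those 4 entries map far more memory than the 64 small-page entries do.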










  • 2

    Welcome to Stack Overflow. While your question is set in a virtualization scenario involving different CPUs, it is substantially answered by this question: stackoverflow.com/questions/40649655/…. Effectively, yes, the processor's TLB has dedicated space for each type of entry, with very limited space for huge pages.
    – Brian
    Nov 7 at 20:29










  • Yes, you've found a way to create very poor hugepage locality. Most workloads that do a lot of kernel access to memory have more accesses within the same 1G hugepage. (User-space memory on Linux usually uses 2M hugepages when it's using anonymous hugepages at all). In Haswell for example, 2M and 4k TLB entries can go into the 2nd-level TLB victim cache, but apparently 1G entries can't, if 7-cpu.com/cpu/Haswell.html is fully accurate.
    – Peter Cordes
    Nov 8 at 8:57















cpu cpu-architecture tlb huge-pages






asked Nov 7 at 20:21
Sai Malleni
1 Answer
Yes, of course. Having no upper limit on the number of TLB entries would require an unbounded amount of physical space in the CPU die.

Every TLB in every architecture has an upper limit on the number of entries it can hold.

For 1GiB pages on x86 this number is probably smaller than you expected: it is 4.

It was 4 in your Ivy Bridge and it is still 4 in my Kaby Lake, four generations later.

It's worth noting that 4 entries cover 4 GiB of RAM (4 × 1 GiB), which seems enough to handle networking if used properly.

Finally, TLBs are per-core resources; each core has its own set of TLBs.

If you disable SMT (e.g. Intel Hyper-Threading) or assign both threads of a core to the same VM, the VMs won't compete for TLB entries.

However, each VM can have at most 4 × C huge-page entries cached, where C is the number of cores dedicated to that VM.

The VM's ability to fully exploit these entries depends on how the host OS, the hypervisor, and the guest OS work together, and on the memory layout of the guest application of interest (pages shared across cores have duplicated TLB entries in each core).

It's hard (almost impossible?) to use 1GiB pages transparently; I'm not sure how the hypervisor and the VM are going to use those pages. I'd say you need specific support for that, but I'm not sure.

As Peter Cordes noted, 1GiB pages use a single-level TLB (though on Skylake there is apparently also a second-level TLB with 16 entries for 1GiB pages).
A miss in the 1GiB TLB results in a page walk, so it's very important that all the software involved uses these pages in a TLB-friendly way.
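The per-VM arithmetic above can be sketched as follows (a toy model, not a measurement; it assumes each guest vCPU is pinned to its own dedicated physical core):

```python
# Toy model of first-level 1GiB-dTLB capacity available to a VM,
# using the Ivy Bridge figure of 4 entries per core (cpuid 0x63).
# Assumes each vCPU is pinned to a dedicated physical core.
ENTRIES_PER_CORE = 4

def cached_hugepage_entries(cores: int) -> int:
    """Upper bound (4 x C) on 1GiB TLB entries a VM can have resident."""
    return ENTRIES_PER_CORE * cores

# The question's setup: each VM is backed by two 1GiB pages.
vm_pages = 2
for cores in (1, 2, 4):
    capacity = cached_hugepage_entries(cores)
    print(f"{cores} core(s): up to {capacity} entries; "
          f"{vm_pages} pages fit: {vm_pages <= capacity}")
```

Even one dedicated core's 4 entries would cover a 2 × 1 GiB guest, so the slowdown is more likely about how the entries are shared and evicted than about raw capacity.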






  • Worth mentioning that at least according to 7-cpu.com/cpu/Haswell.html, the 2nd level TLB victim cache doesn't hold 1G TLB entries in Haswell, so if you have misses they have to come from the page-walker. But Skylake has a 16-entry 2nd-level TLB for 1G pages to back up the 4-entry 1st level TLB. 7-cpu.com/cpu/Skylake.html.
    – Peter Cordes
    Nov 8 at 11:08












  • Thanks @PeterCordes, that's nice to know and have in the answer.
    – Margaret Bloom
    Nov 8 at 11:19
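The entry counts discussed in these comments can be tabulated for a quick comparison (figures as quoted above from cpuid and 7-cpu.com; treat them as reported, not independently verified):

```python
# 1GiB-page dTLB entries per core, as quoted in the discussion above
# (cpuid dump for Ivy Bridge; 7-cpu.com figures for Haswell/Skylake).
ONE_GIB_DTLB = {
    #             (L1 entries, 2nd-level entries that back 1G pages)
    "Ivy Bridge": (4, 0),
    "Haswell":    (4, 0),   # L2 victim cache reportedly skips 1G entries
    "Skylake":    (4, 16),  # 16-entry 2nd-level TLB for 1G pages
}

for uarch, (l1, l2) in ONE_GIB_DTLB.items():
    print(f"{uarch}: {l1} L1 + {l2} L2 = {l1 + l2} resident 1GiB mappings max")
```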











edited Nov 8 at 22:21
BeeOnRope

answered Nov 8 at 9:57
Margaret Bloom