* Fix PIE options
We were missing passing the -pie linker option. That means that while we
were compiling our code as position independent, the executables
(not shared libraries) were not marked as position independent and
ASLR was not applied to them. They were always loaded to fixed addresses.
This change adds the missing -pie option and also replaces all the individual
settings of -fPIE / -fPIC on the targets we build with a centralized setting
of the CMAKE_POSITION_INDEPENDENT_CODE variable, which causes CMake to add
the appropriate compiler options everywhere.
* Fix native parts of coreclr tests build
The native parts of the tests are not built using the root CMakeLists.txt,
so I am moving the enabling of position independent code to configurecompiler.cmake.
Change-Id: Ieafff8984ec23e5fdb00fb0c2fb017e53afbce88
|
|
Port of dotnet/runtime#1389.
|
|
Ports https://github.com/dotnet/runtime/pull/206 to release/3.1.
The code in PAL_GetCurrentThreadAffinitySet relied on the fact that the
number of processors reported as configured in the system is always
larger than the maximum CPU index. However, it turns out that this is
not true on some devices / distros. The Jetson TX2 reports CPUs 0, 3, 4
and 5 in the affinity mask; CPUs 1 and 2 are never reported. glibc
reports 6 as the number of configured CPUs, but musl reports just 4.
PAL_GetCurrentThreadAffinitySet was using the number of CPUs reported
as configured as the upper bound for scanning the affinity set, so on
the Jetson TX2 the returned affinity mask had just two bits set even
though there were 4 CPUs. That triggered an assert in
GCToOSInterface::Initialize.
This change fixes that by reading the maximum CPU index from
/proc/cpuinfo. It falls back to using the number of processors
configured when /proc/cpuinfo is not available (on macOS, FreeBSD, ...)
Fixes https://github.com/dotnet/runtime/issues/170
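A minimal sketch of the described approach, assuming the standard
`processor : N` rows of /proc/cpuinfo (the function name and fallback
parameter are illustrative, not the actual PAL code):

```cpp
#include <cstdio>

// Scan /proc/cpuinfo for "processor : N" rows and return the highest
// CPU index seen; fall back to the configured-CPU count when the file
// is unavailable (macOS, FreeBSD, ...).
int GetMaxCpuIndex(int configuredCpuCount)
{
    int maxIndex = configuredCpuCount - 1;   // fallback
    if (FILE* f = fopen("/proc/cpuinfo", "r"))
    {
        char line[256];
        int index;
        while (fgets(line, sizeof(line), f) != nullptr)
        {
            if (sscanf(line, "processor : %d", &index) == 1 && index > maxIndex)
                maxIndex = index;
        }
        fclose(f);
    }
    return maxIndex;
}
```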
|
|
* Fix available memory extraction on Linux
GlobalMemoryStatusEx in the PAL returns the number of free physical
pages in the ullAvailPhys member. But there are additional pages,
allocated as buffers and caches, that get released under memory
pressure and are thus effectively available too.
This change extracts the available memory on Linux from the MemAvailable
row of /proc/meminfo, which the kernel reports as the most precise
estimate of the amount of available memory.
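A minimal sketch of the extraction, assuming the usual
`MemAvailable: <n> kB` row format (names and error handling are
simplified; not the actual PAL implementation):

```cpp
#include <cstdio>
#include <cstdint>

// Return the kernel's estimate of available memory in bytes, or 0 when
// /proc/meminfo or the MemAvailable row is missing (older kernels), in
// which case the caller falls back to counting free physical pages.
uint64_t GetAvailableMemoryBytes()
{
    unsigned long long kb = 0;
    if (FILE* f = fopen("/proc/meminfo", "r"))
    {
        char line[256];
        while (fgets(line, sizeof(line), f) != nullptr)
        {
            if (sscanf(line, "MemAvailable: %llu kB", &kb) == 1)
                break;
        }
        fclose(f);
    }
    return (uint64_t)kb * 1024;
}
```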
|
|
+ When a hardlimit is specified, we should only retry when we didn't fail due to a commit failure; if commit failed, it means we simply didn't have as much memory as the hardlimit specified, and we should throw OOM in this case.
+ Added some diagnostic info around OOM history to help with future diagnostics.
(cherry picked from commit 7dca41fd36721068e610c537654765e8e42275d7)
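A hedged sketch of the retry policy described above (the enum and names
are illustrative, not the actual gc.cpp code):

```cpp
// Whether a failed allocation attempt should be retried. With a
// hardlimit configured, a commit failure means the limit itself is
// exhausted, so retrying cannot help and the right answer is OOM.
enum class AllocFailure { Transient, CommitFailed };

bool ShouldRetryAllocation(bool hardLimitConfigured, AllocFailure failure)
{
    if (hardLimitConfigured && failure == AllocFailure::CommitFailed)
        return false;   // no more memory than the hardlimit allows: OOM
    return true;        // transient failure: retry after a GC
}
```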
|
|
* Fix a potential division by 0 in post GC counter computation
* Remove useless code
|
|
* Fixes when accessing fgn_maxgen_percent
PR #25350 changed `fgn_maxgen_percent` to be a per-heap property when
`MULTIPLE_HEAPS` is set. A few uses need to be updated.
* In `full_gc_wait`, must re-read `fgn_maxgen_percent` before the
second test of `maxgen_percent == 0`.
(Otherwise the second test is statically unreachable.)
* In `RegisterForFullGCNotification`, must set `fgn_maxgen_percent` when
`MULTIPLE_HEAPS` is not set.
* In `CancelFullGCNotification`, must set `fgn_maxgen_percent` for each
heap separately when `MULTIPLE_HEAPS` is set.
Fix dotnet/corefx#39374
* Avoid duplicate code when getting fgn_maxgen_percent twice in full_gc_wait
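A hedged sketch of the re-read pattern (types simplified; not the actual
`full_gc_wait` code): with `MULTIPLE_HEAPS`, `fgn_maxgen_percent` is
per-heap state that another thread can reset via `CancelFullGCNotification`,
so it must be read again after waiting rather than testing a cached local
twice:

```cpp
#include <cstdint>

struct heap { volatile uint32_t fgn_maxgen_percent; };

int full_gc_wait_sketch(heap* hp)
{
    if (hp->fgn_maxgen_percent == 0)
        return -1;                        // notification not registered
    // ... block on the full GC notification event ...
    if (hp->fgn_maxgen_percent == 0)      // re-read, not the cached value
        return -2;                        // notification was cancelled
    return 0;
}
```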
|
|
* Add property HardLimitBytes to GCMemoryInfo
This adds a new property HardLimitBytes.
Unlike TotalAvailableMemoryBytes,
this will reflect an explicitly set COMPLUS_GCHeapHardLimit.
It will also reflect the fraction of a container's size that we use,
where TotalAvailableMemoryBytes is the total container size.
Normally, though, it is equal to TotalAvailableMemoryBytes.
Fix #38821
* Remove HardLimitBytes; have TotalAvailableMemoryBytes take on its behavior
* Fix typos
* Separate total_physical_mem and heap_hard_limit
so we can compute highMemoryLoadThresholdBytes and memoryLoadBytes
* Do more work in gc.cpp instead of Gc.cs
* Consistently end names in "Bytes"
|
|
Fix brick table logic to fix perf issue in several ASP.NET tests, remove #ifdef FFIND_OBJECT.
What I observed was that some GCs spent a lot of time in find_first_object called from find_object, which is called during stack scanning to find the containing object for interior pointers. A substantial fraction of generation 0 was being scanned, indicating that the brick table logic didn't work properly in these cases.
The root cause was the fact that the brick table entries were not being set in adjust_limit_clr if the allocation was satisfied from the free list in gen0 instead of newly allocated space. This is the case if there are pinned objects in gen0 as well.
The main fix is in adjust_limit_clr: if the allocation is satisfied from the free list, seg is nullptr; the change is to set the bricks in this case as well, provided we are allocating in gen0 and the allocated piece is above a reasonable size threshold.
The bricks are not always set during allocation - instead, when we detect an interior pointer during GC, we make the allocator set the bricks during the next GC cycles by setting gen0_must_clear_bricks. I changed the way this is handled for server GC (multiple heaps). We used to multiply the decay time by the number of heaps (gc_heap::n_heaps), but only applied it to the single heap where an interior pointer was found. I think it's better to instead set gen0_must_clear_bricks for all heaps, but leave the decay time unchanged compared to workstation GC.
Maoni suggested to remove the #ifdef FFIND_OBJECT - interior pointers are not going away, so the #ifdefs are unnecessary clutter.
Addressed code review feedback:
- add parentheses as per GC coding conventions
- use max instead of if-statement
- merge body of for-loop over all heaps into existing for-loop
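For readers unfamiliar with bricks, here is a minimal sketch of the idea
(sizes, names, and the flat table are illustrative, not the actual
gc.cpp layout):

```cpp
#include <cstdint>
#include <cstddef>

const size_t brick_size_log2 = 12;            // assume 4KB bricks
static uint8_t* brick_table[1 << 20];         // toy fixed-size table

size_t brick_of(uint8_t* addr)
{
    return ((uintptr_t)addr >> brick_size_log2) % (1 << 20);
}

// Allocation path: record the object start for every brick the newly
// allocated region touches. The fix above ensures this also happens for
// gen0 allocations satisfied from the free list.
void set_bricks(uint8_t* start, uint8_t* end)
{
    for (size_t b = brick_of(start); b <= brick_of(end); b++)
        brick_table[b] = start;
}

// GC path: resolve an interior pointer by walking forward from the
// recorded object start instead of scanning a large part of gen0.
// size_of is assumed to return the size of the object at the address.
uint8_t* find_object_sketch(uint8_t* interior, size_t (*size_of)(uint8_t*))
{
    uint8_t* o = brick_table[brick_of(interior)];
    while (o + size_of(o) <= interior)
        o += size_of(o);
    return o;
}
```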
|
|
Large pages will have segments aligned to 16MB (the default min seg size for hardlimit).
|
|
* ensure process-wide fences when updating GC write barrier on ARM64
|
|
It doesn't seem like something we would want to export outside the standalone build.
|
|
This was in CoreRT's copy of gcinterface.dac.h, but got lost in dotnet/corert#7517.
|
|
Otherwise, gen0_min_size is eventually capped by gen0_max_size, which makes it impossible to raise the gen0 size above the default max size for gen0.
This is required for some scenarios (CppCodeGen, WASM) in CoreRT.
|
|
Use CMake's C# support to build DacTableGen instead of manually invoking csc.exe ourselves. (#24342)
* Use CMake's C# support to build DacTableGen instead of manually invoking csc.exe ourselves.
* Fix x86 failures.
* Disable DAC generation when building with NMake Makefiles and issue an error since the CMake C# support is VS-only. We don't actually support building with NMake (only configure) so this is ok.
* Clean up rest of the macro=1's
PR Feedback.
* Fix Visual Studio generator matching.
* Explicitly specify anycpu32bitpreferred for DacTableGen so the ARM64 build doesn't accidentally make it 64-bit
* Fix bad merge
|
|
asserts. (#24992)
Fixes: #24879
|
|
* Just use `new T[]` when elements are not pointer-free
* Reduce zeroing out when not necessary.
* Use AllocateUninitializedArray in ArrayPool.
|
|
* Fix initial thread affinity on Linux
On Linux, a new thread inherits the affinity mask of the thread
that created it. This is a problem for background GC threads that
are created by one of the server GC threads, which are affinitized
to a single core.
This change resets each new thread's affinity to match the current
process affinity.
In addition to that, I've also fixed the extraction of the CPU count
that was using PID 0. While the doc says that 0 represents the current
process, it in fact means the current thread.
And as a small bonus, I've added caching of the value returned by
PAL_GetLogicalCpuCountFromOS, since it cannot change during runtime.
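A minimal sketch of the reset, assuming Linux and glibc (function names
are illustrative; note that a pid of 0 in sched_getaffinity /
sched_setaffinity refers to the current thread, which is exactly the
pitfall mentioned above):

```cpp
#include <sched.h>

// Captured once at startup, on the main thread, while it still carries
// the full process affinity mask.
static cpu_set_t g_processAffinity;

void InitProcessAffinity()
{
    sched_getaffinity(0, sizeof(g_processAffinity), &g_processAffinity);
}

// Called at the start of every new thread so it does not inherit the
// narrow mask of the affinitized server GC thread that created it.
void ResetThreadAffinity()
{
    sched_setaffinity(0, sizeof(g_processAffinity), &g_processAffinity);
}
```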
|
|
* Add Series/CounterType to CounterPayload and IncrementingCounterPayload
* merging with master
* Add Generation sizes counter
* Some cleanup
* Add allocation rate counter
* Fix build
* add Allocation Rate runtime counter
* Fix a potential div by zero exception
* Add back in code commented out
* Add LOH size counter
* Fix linux build
* GetTotalAllocated -> GetTotalAllocation
* PR feedback
* More cleanup + renaming per PR feedback
* undo comments
* more pr feedback
* Use existing GC.GetTotalAllocatedBytes API instead
* Remove duplicate GetTotalAllocation
* More PR feedback
* Fix x86 build
* Match type between C++/C#
* Remove unused variables
|
|
The code was using GCToOSInterface::SetThreadAffinity, which
effectively pinned the current thread to a specific processor. On
Windows, it calls SetThreadIdealProcessor, which is basically just a
scheduler hint; the thread can still run on other processors.
Since there is no way to set an ideal affinity on Unix, the fix is to
do nothing in GCToOSInterface::SetCurrentThreadIdealAffinity.
|
|
Fix CPUSET_T definition for FreeBSD
|
|
* Remove concept of AppDomains from the GC
- Leave constructs allowing for multiple handle tables, as scenarios for that have been proposed
- Remove FEATURE_APPDOMAIN_RESOURCE_MONITORING
|
|
* keep what's allocated so far on each heap
* Implement GC.GetTotalAllocatedBytes
It is based on https://github.com/dotnet/corefx/issues/34631 and https://github.com/dotnet/corefx/issues/30644
* Fixing races related to dead_threads_non_alloc_bytes
* Separated per-heap SOH and LOH counters. Different locks imply that we need different counters.
* Allow/ignore torn 64-bit reads on 32-bit in imprecise mode.
* PR feedback
* Simplified the test a little to avoid OOM on ARM
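A hedged sketch of the precise vs. imprecise distinction (names and
locking are illustrative, not the actual per-heap code):

```cpp
#include <cstdint>
#include <mutex>

struct Heap { std::mutex alloc_lock; uint64_t allocated_bytes = 0; };

uint64_t GetTotalAllocatedBytes(Heap* heaps, int n, bool precise)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++)
    {
        if (precise)
        {
            // Take the same lock the allocator holds when it updates the
            // counter, so the 64-bit read cannot be torn on 32-bit.
            std::lock_guard<std::mutex> hold(heaps[i].alloc_lock);
            total += heaps[i].allocated_bytes;
        }
        else
        {
            // Imprecise mode: racy read; on 32-bit platforms this may
            // observe a torn value, which the API explicitly tolerates.
            total += heaps[i].allocated_bytes;
        }
    }
    return total;
}
```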
|
|
* Generate eventpipe implementation as part of CMake configure.
* Generate Etw provider as part of CMake configure.
* First pass porting over lttng provider to cmake.
* Fix up CMake Lttng provider generation.
* Move Lttng provider into CMake tree.
* Move dummy event provider to CMake
* Move genEventing into the CMake tree.
* Remove extraneous logging and unused python locator.
* Clean up build.sh
* Clean up genEventingTests.py
* Add dependencies to enable more incremental builds (providers not fully incremental).
* Convert to custom commands and targets instead of generating at configure time.
* Get each eventing target to incrementally build.
* Fix incremental builds
* Add missing dependencies on eventing headers.
* PR Feedback. Mark all generated files as generated
* Clean up eventprovider test CMakeLists
|
|
Fix some small issues with stress logging.
|
|
* Do not expand to allocation_quantum in SOH when GC_ALLOC_ZEROING_OPTIONAL
* short-circuit short arrays to use `new T[size]`
* Clean syncblock of large-aligned objects on ARM32
* specialize single-dimensional path AllocateSzArray
* Unit tests
* Some PR feedback. Made AllocateUninitializedArray not be trimmed away.
* PR feedback on gchelpers
- replaced use of multiple bool parameters with flags enum
- merged some methods with nearly identical implementation
- switched callers to use AllocateSzArray vs. AllocateArrayEx where appropriate.
* PR feedback. Removed X86 specific array/string allocation helpers.
|
|
When large pages are enabled, we must commit everything we reserve.
Previously we reserved 2x the segment size for LOH, which is a problem
with large pages. Thanks to https://github.com/dotnet/coreclr/pull/24081
this does not cause a performance regression with large pages; but
without large pages we were seeing regressions when the loh_seg_size was
reduced. So this change only takes effect when large pages are enabled.
|
|
* thier -> their
* exeption -> exception
* Estbalisher -> Establisher
* neeed -> need
* neeed -> need
* neeeded -> needed
* neeeded -> needed
* facilitiate -> facilitate
* extremly -> extremely
* extry -> extra
|
|
* Improve LOH heap balancing
Previously in `balance_heaps_loh`, we would default to `org_hp` being
`acontext->get_alloc_heap()`.
Since `alloc_large_object` is an instance method, that ultimately came
from the heap instance this was called on. In `GCHeap::Alloc` that came
from `acontext->get_alloc_heap()` (this is a different acontext). That
variable is set when we allocate a small object. So the heap we were
allocating large objects on was affected by the heap we were allocating
small objects on. This isn't necessary, as the small object heap and
large object heap have separate areas. In scenarios with limited memory,
we can unnecessarily run out of memory by refusing to move away from that
heap. However, we do want to ensure that the large object heap accessed
is not on a different NUMA node than the small object heap.
I experimented with adding a `get_loh_alloc_heap()` to the acontext,
similar to the SOH alloc heap, but performance tests showed that it was
usually better to just start from the home heap. The chosen policy
(sketched after this list) was:
* Start searching from the home heap -- this is the one corresponding to
our processor.
* Have a low (but non-zero) preference for that heap (dd_min_size(dd) /
2), as long as we stay within the same NUMA node.
* Have a higher cost of switching to a different NUMA node. However,
this is still much less than before; it was dd_min_size(dd) * 4, now
dd_min_size(dd) * 3 / 2.
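A hedged sketch of the resulting cost model (names are illustrative;
`dd_min_size` stands in for the real per-generation bookkeeping):

```cpp
#include <cstddef>

// Penalty an allocating thread pays for picking a candidate heap other
// than its home heap; balance_heaps_loh-style code would weigh this
// against the candidate heap's remaining budget.
size_t switch_cost(int candidate_heap, int home_heap,
                   int candidate_node, int home_node, size_t dd_min_size)
{
    if (candidate_heap == home_heap)
        return 0;                       // home heap: no penalty
    if (candidate_node == home_node)
        return dd_min_size / 2;         // same NUMA node: low preference
    return dd_min_size * 3 / 2;         // cross-node: higher cost (was * 4)
}
```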
This showed big performance improvements (over 30% less time) in a
scenario with lots of LOH allocation where there were fewer allocating
threads than GC heaps. The changes were more pronounced the more we
allocated large objects vs small objects. There was usually slight
improvement (1-2%) when there were 48 constantly allocating threads and
48 heaps. The one place we did see a slight regression was in an 800MB
container with 4 allocating threads on a 48 processor machine; however,
similar tests with less memory or more threads were prone to running out
of memory or running very slowly on the master branch, so we've improved
stability. Previously the GC could get lucky by having the SOH choice
happen to be a good choice for LOH, but we shouldn't rely on that, as
it failed in some container scenarios.
One more change is in joined_generation_to_condemn: If there is a memory
limit and we are about to OOM, we should always do a compacting GC. This
helps avoid the OOM and feeds into the next change.
This PR also adds a *second* balance_heaps_loh function for when there
is a memory limit and we previously failed to allocate into the chosen
heap. `balance_heaps_loh` works based on allocation budgets, whereas
`balance_heaps_loh_hard_limit_retry` works on the actual space available
at the end of the segment. Thanks to the change to
joined_generation_to_condemn, the heaps should be compact, so we don't
need to look at free space here.
* Fix uninitialized variable
* In a container, use space available instead of budget
* Fix duplicate semicolon
|
|
The current implementation assumes that the NUMA nodes of the CPUs
used for GC threads form a zero-based contiguous range. However, that
doesn't have to be true when the user selects only a subset of the
available CPUs for the GC heap threads using
COMPlus_GCHeapAffinitizeMask or COMPlus_GCHeapAffinitizeRanges. The
selected CPUs may belong to a subset of NUMA nodes that doesn't
necessarily start at node 0 or form a contiguous range.
This change fixes the algorithm that initializes the
numa_node_to_heap_map lookup array so that it works correctly even in
such cases.
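A hedged sketch of the corrected initialization (container types and
names are illustrative, not the actual lookup-array code):

```cpp
#include <map>

// Map each NUMA node that actually hosts an affinitized GC CPU to the
// index of the first heap placed on that node. The node IDs may be
// sparse (e.g. {2, 5}) and need not start at 0.
std::map<int, int> BuildNumaNodeToHeapMap(const int* cpu_node,
                                          const bool* cpu_selected,
                                          int cpu_count)
{
    std::map<int, int> nodeToFirstHeap;
    int heap = 0;
    for (int cpu = 0; cpu < cpu_count; cpu++)
    {
        if (!cpu_selected[cpu])
            continue;                   // not selected for GC heap threads
        nodeToFirstHeap.emplace(cpu_node[cpu], heap);  // keeps first entry
        heap++;
    }
    return nodeToFirstHeap;
}
```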
|
|
Fix NUMA node for heap when NUMA is not available
|
|
The recent refactoring of GCToOSInterface::GetProcessorForHeap
accidentally changed the NUMA node returned when NUMA is disabled
(either via COMPlus_GCNumaAware or because there is just a single NUMA
node on the system) and CPU groups are disabled.
Before the refactoring, the code incorrectly returned 0 as the NUMA
node when CPU groups were disabled, no matter whether NUMA was enabled
or disabled. The refactoring fixed that by returning the current CPU
group number when NUMA was enabled; however, it still returned an
incorrect value, this time GroupProcNo::NoGroup, as the NUMA node
number when NUMA was disabled.
This change fixes it by returning the current group number in this
case too.
|
|
* Switch to workstation GC in case of constrained CPU resources
Right now, if the user sets the configuration so that the server GC is
used, the server GC will be loaded even in conditions where we know the
workstation GC would fare better. An example of such conditions is a
constrained environment with only a single CPU or with very low memory.
This can be harmful if users deploy the same projects on different kinds
of platforms: deploying to a 20+ core server and to Azure Functions
will require largely different configurations for the runtime.
There are already multiple ways for the user to specify to use the
server GC or not:
- setting `COMPlus_gcServer` as an environment variable
- setting `gcServer` in the configuration file
- setting `System.GC.Server` passed to `coreclr_initialize`
Fix https://github.com/dotnet/coreclr/issues/23949
* Address review
* Address review
Remove GCToOSInterface::GetCurrentProcessCpuLimit in favor of
GCToOSInterface::GetCurrentProcessCpuCount, because the CpuLimit is now
taken into account in the CpuCount.
* Address review
Do the work in src/vm/ceemain.cpp, otherwise there would be a disparity
between what the VM and the GC are running. Before, only the GC would be
aware of the switch from server to workstation GC, but not the VM.
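A hedged sketch of the fallback decision (thresholds are illustrative,
not the runtime's exact values):

```cpp
#include <cstdint>

// Even when configuration asks for server GC, fall back to workstation
// GC when the environment is too constrained for multiple GC heaps.
bool ShouldUseServerGC(bool serverGCConfigured, int cpuCount,
                       uint64_t physicalMemoryBytes)
{
    const uint64_t lowMemoryBytes = 512ULL * 1024 * 1024;  // assumed
    if (!serverGCConfigured)
        return false;
    if (cpuCount <= 1 || physicalMemoryBytes < lowMemoryBytes)
        return false;   // constrained: workstation GC fares better
    return true;
}
```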
|
|
The CPU limiting was accidentally removed during the refactoring of the
CPU groups support in the GC. This change puts it back.
|
|
Calling delete on types allocated with new[] leads to undefined
behaviour.
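A minimal illustration of the bug class being fixed:

```cpp
void Example()
{
    int* values = new int[16];
    // delete values;   // wrong: undefined behaviour for new[] allocations
    delete[] values;    // correct: matches the new[] allocation
}
```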
|
|
Adjust plug_size_to_fit to consider large alignment on ARM32
|