I'm looking deeper into the virtual memory subsystem, particularly into how pages are swapped in and out.
The use case I want to examine is the following: assume we have 3 tasks, each of which has in its page table a pte pointing to the exact same physical page. This can happen after a fork in the context of COW (there may be more cases, but that is the simplest one that comes to mind). Now, for completeness, in the Linux version I looked at, let's assume that during a run of the kernel function page_launder_zone (located in vmscan.c) this physical page is chosen to be swapped out. First, add_to_swap (swap_state.c) is called successfully: it chooses a free slot in the appropriate swap area, adds the page to the page cache under the appropriate page_hash, and increments the page's atomic count (the swap slot count is 1). After this, still in the launder function, try_to_unmap (vmscan.c) is called, which unmaps all the page table ptes from that physical page (each unmapping also clears the present bit and rewrites the pte to hold the information needed to access the swap area). Finally, once the page cache is the only remaining user of the page and all the unmappings are done, the physical page is written into the swap area and removed from the page cache (the swap slot now holds a count of 3, because 3 tasks point to the swap slot).
Now we have the situation where the ptes of our 3 tasks, whose present bit is 0, all point to the same swap slot (using the other 31 bits of the pte to identify it).
Let's assume now that each of the processes touches the address covered by its pte, trying to read a value from the swapped-out page.
The first access will take a major page fault, for which handle_mm_fault (memory.c) is called, as far as I understand. During its run it calls handle_pte_fault (memory.c), which checks the present bit and calls do_swap_page (memory.c). That function in turn executes this portion of code:
page = lookup_swap_cache(entry);
if (!page) {
        swapin_readahead(entry);
        page = read_swap_cache_async(entry);
        if (!page) {
                /*
                 * Back out if somebody else faulted in this pte while
                 * we released the page table lock.
                 */
                int retval;
                spin_lock(&mm->page_table_lock);
                retval = pte_same(*page_table, orig_pte) ? -1 : 1;
                spin_unlock(&mm->page_table_lock);
                return retval;
        }
        /* Had to read the page from swap area: Major fault */
        ret = 2;
}
For this first fault the top lookup will not find the page, because it isn't in the swap cache yet, so we enter the inner block: swapin_readahead finds the swap slot in which the page resides, adds the page to the swap and page cache, and starts swapping it in; read_swap_cache_async then returns the page once readahead has brought it in. This is handled as a major page fault, as can be seen from ret being set to 2.
Later in this function the following code is executed:
/* The page isn't present yet, go ahead with the fault. */
swap_free(entry);
if (vm_swap_full())
        remove_exclusive_swap_page(page);

mm->rss++;
pte = mk_pte(page, vma->vm_page_prot);
if (write_access && can_share_swap_page(page))
        pte = pte_mkdirty(pte_mkwrite(pte));
unlock_page(page);

flush_page_to_ram(page);
flush_icache_page(vma, page);
set_pte(page_table, pte);
page_add_rmap(page, page_table);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, address, pte);
spin_unlock(&mm->page_table_lock);
return ret;
Here the swap slot count is decreased by swap_free(entry) (because the pte is now attached to the physical page that was brought in, not to the swap slot), and the pte is pointed at the physical page (the present bit is set, and the other bits once again hold the usual fields used by the virtual-to-physical translation).
For the other 2 tasks, when they hit the same fault, it won't be a major page fault but a minor one.
During page = lookup_swap_cache(entry); the page will be found in the swap cache, and the inner block of the if statement won't be entered at all.
swap_free(entry); is still called to decrease the swap slot count, since during the fault handling we found the page and once more attached it to the pte that caused the fault.
But nowhere in the code of this function, nor in the outer or inner scopes of the functions it uses, do I see a call to #define get_page(p) atomic_inc(&(p)->count), as I would expect.
Why do I expect it?
Because in mm.h, where it is defined, the following comment appears:
/*
* Methods to modify the page usage count.
*
* What counts for a page usage:
* - cache mapping (page->mapping)
* - disk mapping (page->buffers)
* - page mapped in a task's page tables, each mapping
* is counted separately
*
* Also, many kernel routines increase the page count before a critical
* routine so they can be sure the page doesn't go away from under them.
*/
#define get_page(p) atomic_inc(&(p)->count)
#define put_page(p) __free_page(p)
#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
#define page_count(p) atomic_read(&(p)->count)
#define set_page_count(p,v) atomic_set(&(p)->count, v)
which clarifies that a page mapped in a task's page tables has each mapping counted separately.
Thus the atomic count should be 3: after one major page fault and two minor ones, 3 tasks in total are mapped to the physical page allocated when the page was swapped in from the swap area.
If I am right (that it isn't done anywhere there), then there is a bug in this old Linux release; but maybe I didn't read the code properly, and I'd be glad if someone could point out where this increment happens, in the described scenario, for the 2 remaining tasks that fault on their ptes pointing to the same swap slot.
This is a real scenario, not a hypothetical case.
I'm attaching the archive of the kernel code. I didn't quote all the functions explicitly, just enough to understand the issue, so it may be necessary to look at the kernel source directly and trace the function calls in this use case. That is not a hard job for someone who reads code often, and I hope someone will have the curiosity to do it, for the sake of a better understanding of Linux page caching, swapping and virtual memory management (even though it is indeed an old version of Linux). kernel download link