Kernel address poisoning by clearing upper bits?

123 Views Asked by ensc At 06 July 2017 at 17:45

Is there some mechanism in Linux that poisons addresses by zeroing the upper 16 bits?

I am debugging a kernel crash on an Intel x86-64 machine. The faulting instruction tries to access the address 0x880139f3da00:

crash> bt
R10: 0000000000000001  R11: 0000000000000001  R12: 0000880139f3da00
                                              ~~~~~~~~~~~~~~~~~~~~~

crash> p arp_tbl.nht->hash_buckets[255]
$66 = (struct neighbour *) 0x880139f3da00

crash> p *arp_tbl.nht->hash_buckets[255]
Cannot access memory at address 0x880139f3da00

The hash_buckets table is valid:

crash> p arp_tbl.nht->hash_buckets[253]
$70 = (struct neighbour *) 0xffff88007325ae00

crash> p *arp_tbl.nht->hash_buckets[253]
$71 = {
  next = 0x0, 
  tbl = 0xffffffff81abbf20 <arp_tbl>,

Setting the upper 16 bits to 0xffff makes the address valid and yields a plausible data structure:

crash> p *((struct neighbour *)0xffff880139f3da00)
$73 = {
  next = 0xffff88006de69a00, 
  tbl = 0xffffffff81abbf20 <arp_tbl>, 
  ... rest looks reasonable too ...

The structure is updated by RCU operations (very likely those in neigh_flush_dev()). So what could cause the address to become invalid in this way?

I can rule out hardware defects (seen on two machines and with different addresses). The systems are running CentOS 7 with kernels from 3.10.0-514.6.1.el7.centos.plus.x86_64 through 3.10.0-514.21.2.el7.centos.plus.x86_64.

Update

From another crash dump, I see an skb of an IPv6 packet with the following contents:

crash> p *((struct sk_buff *)0xffff880070e25e00)
$57 = {
  transport_header = 54, 
  network_header = 14, 
  mac_header = 0, 
  ...
  head = 0xffff880138e28000 "", 
  data = 0xffff880138e2800e "`", 
  ...
}

This crashes when writing the first 0x8 bytes in neigh_hh_output():

#define HH_DATA_MOD 16

static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff *skb)
{
                if (likely(hh_len <= HH_DATA_MOD)) {
                        memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD);   <<<<<

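This would explain why two bytes are overwritten: skb->data is only 14 bytes past skb->head, so the 16-byte memcpy to skb->data - HH_DATA_MOD starts 2 bytes (16 - 14) before head.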


There is 1 best solution below

AudioBubble On 06 July 2017 at 21:14

Can you inspect the memory location this address was read from? Typically, such a "partially zeroed" read is the result of memset running on that area. After this CPU triggered the crash, there was possibly enough time for whoever else was modifying the area to finish zeroing it and possibly even fill it with other data.

So far there is no reason to suspect that RCU plays any role here.

This is most definitely not "poisoning" done by the kernel (it would be quite weird to do it this way). However, if the crash is reproducible (you say it occurred on at least 2 different machines?), then running a debug kernel may help, especially with slab debugging enabled.
