Adversarial Training and Testing


Maybe this is more of a conceptual problem, but I hope you can give me your opinions. I understand that adversarial training means introducing some corrupted instances into the training process in order to confuse the model and produce false predictions at test time. However, is this applicable in the following scenario? Let's assume an adversarial patch is created to fool a classifier that detects a stop sign, so a normal object detector will not be able to recognize a real stop sign in the presence of this patch. But what if the model is trained on instances both with and without patches? This would not be so difficult to do for the object classifier, and the attack would lose all chances of succeeding, right? I do not understand why these attacks can succeed when it would take the model only a bit more training to include those adversarial samples.
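To make the idea concrete, here is a minimal, hypothetical sketch (PyTorch assumed) of the defense I have in mind: pasting a known adversarial patch onto some training images so that the classifier sees stop signs both with and without the patch. Names like model, train_loader and patch are placeholders of my own, not a specific library API.

    import torch
    import torch.nn.functional as F

    def apply_patch(images, patch, top=0, left=0):
        """Paste a fixed patch tensor onto a batch of images at a given location."""
        patched = images.clone()
        _, ph, pw = patch.shape
        patched[:, :, top:top + ph, left:left + pw] = patch
        return patched

    def train_with_patch_augmentation(model, train_loader, patch, optimizer, device):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            # Train on the clean batch and on a patched copy with the *correct*
            # labels; this is the "train both instances" idea from the question.
            patched = apply_patch(images, patch.to(device))
            inputs = torch.cat([images, patched], dim=0)
            targets = torch.cat([labels, labels], dim=0)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()
            optimizer.step()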

There are 2 answers below.

Answer 1:

As far as I am aware, adversarial training (i.e., continuously training / fine-tuning on new adversarial images with correct labels) is the only robust defense against adversarial examples that cannot be completely overcome by some form of adversarial attack (please correct me if I'm wrong). There have been many other attempts to defend against adversarial examples, but there is typically a way around them if the attacker has an idea of what the defense is (for instance, see "Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples").

Note that to truly obtain robustness with adversarial training, you have to generate adversarial examples during training, or keep updating with new adversarial images. As I understand it, this is because once you train on some adversarial examples, your model changes slightly: while it becomes robust to your initial adversarial examples, other adversarial examples still exist that target the newly trained / fine-tuned model. Adversarial training gradually changes your model to minimize the availability of effective adversarial perturbations.
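As a rough sketch (not a definitive recipe), this kind of adversarial training might look like the following in PyTorch, with adversarial examples regenerated against the current model on every batch. I use single-step FGSM here only for brevity; multi-step PGD is the stronger and more common choice. model, train_loader, and epsilon are placeholder names of mine.

    import torch
    import torch.nn.functional as F

    def fgsm_example(model, images, labels, epsilon):
        """Craft FGSM adversarial examples against the model's current parameters."""
        images = images.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(images), labels)
        grad = torch.autograd.grad(loss, images)[0]
        # Assumes inputs are scaled to [0, 1].
        return (images + epsilon * grad.sign()).clamp(0.0, 1.0).detach()

    def adversarial_training_epoch(model, train_loader, optimizer, epsilon, device):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            # Fresh adversarial examples for the model as it is right now;
            # examples crafted for last epoch's weights would no longer be
            # the worst case for the updated model.
            adv_images = fgsm_example(model, images, labels, epsilon)
            optimizer.zero_grad()
            # Train on adversarial inputs with their correct labels.
            loss = F.cross_entropy(model(adv_images), labels)
            loss.backward()
            optimizer.step()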

However, doing this can be at odds with accuracy (see "Robustness May Be at Odds with Accuracy"). A model that is truly robust to adversarial examples may have significantly lower accuracy on non-adversarial examples. Additionally, adversarial training may be hard to scale to datasets with larger images.
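To see that trade-off, you would typically measure clean accuracy and adversarial ("robust") accuracy side by side; here is a small sketch, reusing the hypothetical fgsm_example helper from the previous snippet:

    import torch

    def evaluate(model, loader, epsilon, device):
        model.eval()
        clean_correct, robust_correct, total = 0, 0, 0
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                clean_correct += (model(images).argmax(1) == labels).sum().item()
            # Gradients are needed to craft the attack, so no torch.no_grad() here.
            adv = fgsm_example(model, images, labels, epsilon)
            with torch.no_grad():
                robust_correct += (model(adv).argmax(1) == labels).sum().item()
            total += labels.size(0)
        # A robustly trained model often shows the gap discussed above: lower
        # clean accuracy than a standard model, but much higher robust accuracy.
        return clean_correct / total, robust_correct / total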

Answer 2:

I doubt many scholars will answer your question here; you may have better luck asking a senior PhD student at your school. My research topic is more in the SLAM area, but I'll still try to answer it.

You can train on the modified set of inputs, but the model itself will change its properties after you further train it on the modified samples. It will lose some of its original ability on task A in favor of being more optimized for task B, where task A and task B could be related.

The attack can then also be modified to target the model's new attributes, which means fooling it with something else.

But if you go down this path, you may be defeating your original purpose.

Hopefully, this is the answer you are looking for.

Go find a research chat group on WeChat, QQ, or WhatsApp; it's easier to get answers there.