Rails 5 - How to strip tags from a string in Rails (NOT in/for HTML)


I need to strip tags from user input before saving it to the DB.

I'm well aware of the strip_tags method, but it also HTML-escapes the string, as do all the other recommended methods:

Rails::Html::FullSanitizer.new.sanitize '&'
 => "&amp;" 
Rails::Html::WhiteListSanitizer.new.sanitize('&', tags: [])
 => "&amp;" 
ActionController::Base.helpers.strip_tags "&"
 => "&amp;" 

The string I want to sanitize must NOT be escaped: it is exported via the API, written to files, etc., and is not only output as HTML. (Even in HTML there are cases like link_to ActionController::Base.helpers.strip_tags("&"), where link_to escapes the string a second time, so the link text shows up as &amp; in the frontend.)
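To illustrate the double escaping, here is a minimal sketch that uses only CGI.escapeHTML from the standard library. The escape_once helper is hypothetical and stands in for both the sanitizer and the view helper; the real Rails helpers do more than this, so treat it as an approximation of the effect rather than of their implementation.

```ruby
require "cgi"

# Hypothetical stand-in for any step that HTML-escapes its input
# (the sanitizer when saving, then the view helper when rendering).
def escape_once(text)
  CGI.escapeHTML(text)
end

stored   = escape_once("JPMorgan Chase & Co") # value saved after "sanitizing"
rendered = escape_once(stored)                # value emitted when the view escapes again

puts stored    # JPMorgan Chase &amp; Co
puts rendered  # JPMorgan Chase &amp;amp; Co
```

A browser decodes the second value once, so the user sees the literal text "&amp;" instead of "&".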

As a workaround I've wrapped strip_tags in CGI.unescapeHTML, which gives more or less the expected result, but I'd like to find a more direct solution. I'm also wary of what else strip_tags might do: that's a lot of moving parts for such a small piece of functionality, and more that can go wrong or break.

Real-world example: JPMorgan Chase & Co should still be JPMorgan Chase & Co after removing tags (not JPMorgan Chase &amp; Co).

test<script>alert('hacked!');</script>&test should become test&test after stripping tags

Also, the string:

"test &#x3C;script&#x3E;alert(&#x27;hacked!&#x27;)&#x3C;/script&#x3E;"

should still be

"test &#x3C;script&#x3E;alert(&#x27;hacked!&#x27;)&#x3C;/script&#x3E;"

after stripping the HTML.
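The CGI.unescapeHTML workaround mentioned earlier handles the first example but not this last one. The sketch below uses only the standard library; unescape_sanitized is a hypothetical name, and the strings passed to it are hand-written samples rather than real strip_tags output.

```ruby
require "cgi"

# Assumed shape of the workaround: unescape whatever the sanitizer returned.
def unescape_sanitized(sanitized)
  CGI.unescapeHTML(sanitized)
end

# An escaped ampersand is turned back into a literal one, as wanted.
puts unescape_sanitized("JPMorgan Chase &amp; Co")
# JPMorgan Chase & Co

# Entities the user typed are decoded too, which is NOT wanted here.
already_encoded = "test &#x3C;script&#x3E;alert(&#x27;hacked!&#x27;)&#x3C;/script&#x3E;"
puts unescape_sanitized(already_encoded)
# test <script>alert('hacked!')</script>
```

Unescaping after the fact cannot tell entities added by the sanitizer apart from entities that were already in the input.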

The alternative solutions that I've found, or that were proposed, decode the entities instead:

> Nokogiri::HTML("test &#x3C;script&#x3E;alert(&#x27;hacked!&#x27;)&#x3C;/script&#x3E;").text
 => "test <script>alert('hacked!')</script>"

> Loofah.fragment("test &#x3C;script&#x3E;alert(&#x27;hacked!&#x27;)&#x3C;/script&#x3E;").text(encode_special_chars: false)
 => "test <script>alert('hacked!')</script>"

So they're a no-go as well.

Answer by Schwern

You have to parse the HTML and extract the text elements. Use Nokogiri to do that.

Nokogiri::HTML("<div>Strip <i>this</i> & <b>this</b> & <u>this</u>!</div>").text

Nokogiri is already a Rails dependency, so using it adds no extra cost.


You will get all the text, including the content of <script> tags.

Nokogiri::HTML(%q[test<script>alert('hacked!');</script>&test]).text

# testalert('hacked!');&test

You can strip the <script> tags.

doc = Nokogiri::HTML(%q[test<script>alert('hacked!');</script>&test])
doc.search('//script').each { |node| node.replace('') }
doc.text

# test&test

But with the tags stripped out, the remaining script content is just harmless text, so it might not be worth the effort.

See the Nokogiri tutorials for more.