Is there something like a "CSS selector" or XPath grep?

I need to find all places in a bunch of HTML files that match the following structure, expressed as a CSS selector:

div.a ul.b

or XPath:

//div[@class="a"]//ul[@class="b"]

grep doesn't help me here. Is there a command-line tool that returns all files (and optionally all places therein) that match this criterion? I.e., one that returns file names if the file matches a certain HTML or XML structure.
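If you'd rather avoid installing anything, the check can be sketched with Python's standard library alone. This is an illustrative sketch, not one of the tools from the answers below: the class name SelectorGrep, the helper file_matches, and the hard-coded div.a ul.b selector are all made up for this example, and html.parser will not repair badly malformed markup the way dedicated tools do:

```python
from html.parser import HTMLParser

class SelectorGrep(HTMLParser):
    """Detects whether a ul.b occurs anywhere inside a div.a."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # one bool per open <div>: does it have class "a"?
        self.matched = False

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if tag == "div":
            self.div_stack.append("a" in classes)
        elif tag == "ul" and "b" in classes and any(self.div_stack):
            # A <ul class="b"> while at least one <div class="a"> is open.
            self.matched = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

def file_matches(path):
    """Return True if the HTML file at `path` contains div.a ul.b."""
    parser = SelectorGrep()
    with open(path, encoding="utf-8", errors="replace") as fh:
        parser.feed(fh.read())
    return parser.matched
```

Looping file_matches over a glob of *.html and printing the names that return True gives the grep -l style listing asked for.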

There are 4 solutions below

BEST ANSWER

Try this:

  1. Install HTML-XML-utils (http://www.w3.org/Tools/HTML-XML-utils/).
    • Ubuntu: aptitude install html-xml-utils
    • macOS: brew install html-xml-utils
  2. Save a web page (call it filename.html).
  3. Run: hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "label.black"

Here "label.black" is the CSS selector that identifies the elements you want to extract. To make this reusable, write a helper script named cssgrep:

#!/bin/bash

# Ignore errors, write the results to standard output.
hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"

You can then run:

cssgrep filename.html "label.black"

This will generate the content for all HTML label elements of the class black.

The -l 240 argument is important: it tells hxnormalize to wrap output lines only at column 240, so that elements are not split across line breaks. For example, if <label class="black">Text to \nextract</label> is the input, then -l 240 reformats it to <label class="black">Text to extract</label> on a single line, which simplifies parsing. Extending the value to 1024 or beyond is also possible.


I have built a command-line tool with Node.js which does just this. You enter a CSS selector, and it will search through all of the HTML files in the directory and tell you which files have matches for that selector.

You will need to install Element Finder, cd into the directory you want to search, and then run:

elfinder -s "div.a ul.b"

For more info please see http://keegan.st/2012/06/03/find-in-files-with-css-selectors/


There are at least 4 tools:

  • pup - Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

  • htmlq - Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.

  • hq - Lightweight command line HTML processor using CSS and XPath selectors.

  • xq - Command-line XML and HTML beautifier and content extractor.

Examples:

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

$ pup --color 'title' < robots.html
<title>
 Robots exclusion standard - Wikipedia
</title>

$ htmlq --text 'title' < robots.html
Robots exclusion standard - Wikipedia

$ hq --xpath '//title' < robots.html
<title>robots.txt - Wikipedia</title>

$ xq --xpath '//title' < robots.html
robots.txt - Wikipedia

Per Nat's answer to "How to parse XML in Bash?":

Command-line tools that can be called from shell scripts include:

  • 4xpath - command-line wrapper around Python's 4Suite package
  • XMLStarlet
  • xpath - command-line wrapper around Perl's XPath library
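For well-formed XML (as opposed to tag-soup HTML), Python's standard xml.etree.ElementTree also accepts a limited XPath subset that covers the query from the question. A small sketch with made-up sample markup; note that the [@class="a"] predicate matches the attribute value exactly, unlike CSS class matching, which also matches class="a extra":

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body>'
    '<div class="a"><ul class="b"><li>match</li></ul></div>'
    '<div class="c"><ul class="b"><li>no match</li></ul></div>'
    '</body></html>'
)

# ElementTree's XPath subset supports descendant steps (//)
# and [@attr="value"] predicates.
hits = doc.findall('.//div[@class="a"]//ul[@class="b"]')
print(len(hits))  # 1
```

Only the first div matches, because the second one's class attribute is "c" rather than "a".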