Unix - Parse HTML File And Get A List Of All Its Resources
Solution 1:
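Use jsoup.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. For this task, you would select the elements that carry an href or src attribute and read those attribute values.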
Solution 2:
The following should get you some of the way:
% sed -n -E 's/.*(href|src)="([^#"]*).*/\2/p' input.html
The -n means don't print lines by default; the -E means use extended regular expressions (so we can use the vertical bar for alternation); the trailing p on the substitution means print out any lines which have a successful substitution on them. Together, this finds any lines which have an href= or src= on them, replaces the entire line by what's between the "..." (or up to a #), and prints out the result.
On your input, this produces:
css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
javascript:WWHClickedPopup('HelpSR2', 'Page4.htm
javascript:WWHClickedPopup('HelpSR2', 'Page2.htm
javascript:WWHClickedPopup('HelpSR2', 'Page3.htm
Limitations of this simple version:
- it won't work if there's more than one href or src on a line;
- it fails to extract the contents of the Javascript argument;
- it presumes that the input uses "..." rather than '...' to delimit file names.
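As one possible way around the first and third limitations, here is a minimal sketch that pairs grep -o with sed instead of extending the sed script itself. It assumes a grep that supports -o (GNU and BSD grep do, though POSIX does not require it), and list_resources is a made-up name used only for illustration. Because grep -o prints every match on its own line, several href or src attributes on one input line are all reported, and the pattern accepts either quoting style.

```shell
# list_resources is a hypothetical helper, not part of the original answer.
# Assumes grep supports -o (GNU and BSD grep do; POSIX does not require it).
#
# Step 1 (grep): emit every href=... or src=... attribute on its own line,
#                whether the value is wrapped in double or single quotes.
# Step 2 (sed):  drop the attribute name, the "=" and the opening quote,
#                then drop the closing quote.
list_resources() {
  grep -oE "(href|src)=(\"[^\"]*\"|'[^']*')" "$1" |
    sed -e 's/^[a-z]*=.//' -e 's/.$//'
}
```

Called as list_resources input.html, this prints one attribute value per line. Unlike the sed one-liner, this sketch keeps any #fragment and leaves javascript: links untouched.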
Each of these could probably be improved by suitable additions to the sed script, though the second would probably be best done by sending the output through another sed script or...
% cat /tmp/t.sed
s/.*(href|src)="([^#"]*).*/\2/
s/javascript.*'//
t x
b
:x
p
% sed -n -E -f /tmp/t.sed /tmp/so.txt
css/webworks.css
wwhdata/common/context.js
Page1.htm
images/Pic.2.jpg
Page4.htm
Page2.htm
Page3.htm
%
That last one's a little bit special! I'll leave you and the manpage to work out the details.
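For readers who would rather not dig through the manpage, here is an annotated sketch of the same script, wrapped in a shell function. The name extract_resources and the temporary script file are illustrative choices introduced here; the sed commands themselves are the ones listed in /tmp/t.sed above. The p flag cannot simply be attached to the first substitution, because the line would then be printed before the javascript wrapper is stripped; t and b arrange for a single print after both steps.

```shell
# extract_resources is a hypothetical wrapper around the script shown above.
extract_resources() {
  script=$(mktemp) || return 1
  cat > "$script" <<'EOF'
# Step 1: replace the whole line with the href/src value (up to a " or #).
s/.*(href|src)="([^#"]*).*/\2/
# Step 2: strip a javascript wrapper, up to and including the last quote mark.
s/javascript.*'//
# t: jump to label x if any substitution has succeeded on this line.
t x
# b with no label: jump to the end of the script; with -n nothing is printed.
b
:x
p
EOF
  sed -n -E -f "$script" "$1"
  rm -f "$script"
}
```

Running extract_resources /tmp/so.txt should then behave like the sed -n -E -f /tmp/t.sed /tmp/so.txt invocation above.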