need script: convert html-text to text

I have HTML text and need to convert it to plain text without the
HTML tags.

-- 
Posted via http://www.ruby-forum.com/.


keal21 (3)
1/4/2006 10:30:03 AM
comp.lang.ruby


keal wrote:
> i have html-text. i have to convert this text to simple text without
> html-tags.
>
> --
> Posted via http://www.ruby-forum.com/.

path o'least resistance

lynx -dump www.myurl
or use links2, or w3m -dump www.myurl

or high-falutin solution
http://groups.google.com/group/comp.lang.ruby/browse_frm/thread/e0fb1207f1814c77/37cd5e35a1ffb8d7?q=strip+HTML+tags&rnum=7#37cd5e35a1ffb8d7

gene.tani (520)
1/4/2006 10:40:14 AM
On Wed, 04 Jan 2006 10:30:03 -0000, keal <keal21@mail.ru> wrote:

> i have html-text. i have to convert this text to simple text without
> html-tags.
>

It's tricky, there's more to it than you'd think. The best way is probably  
to use Lynx, or another browser, to do it for you, e.g.:

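	# Render a page to plain text with lynx (must be installed and on the PATH).
	def plain(url)
	  # The array form bypasses the shell, so odd characters in the URL are safe.
	  IO.popen(['lynx', '-dump', url], &:read)
	end

	text = plain('http://www.google.com/')
	puts text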

Outputs:

                      [1]Personalised Home | [2]Sign in

   [3]A picture of the Braille letters spelling out "Google." Happy Birthday
                               Louis Braille!

     Web    [4]Images    [5]Groups    [6]News    [7]Froogle    [8]more »

... [snip] ...

Of course you'll need lynx installed for that to work, but other text
browsers (w3m, for example) can do the same job. Try a Google search.

Cheers,

-- 
Ross Bamford - rosco@roscopeco.remove.co.uk
rosco (414)
1/4/2006 10:49:18 AM
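
A rough sketch of the "you can use others too" idea: look along PATH for the first text browser that is installed, then run it with -dump. The names DUMPERS, first_available and the reworked plain are made up for illustration, and the sketch assumes that lynx and w3m both take "-dump URL" as shown earlier in the thread.

```ruby
# Candidate text browsers, in order of preference (an assumed list).
DUMPERS = %w[lynx w3m].freeze

# Return the first program from +names+ that exists as an executable file
# in +path+ (a PATH-style string), or nil if none is found.
def first_available(names, path = ENV.fetch('PATH', ''))
  dirs = path.split(File::PATH_SEPARATOR)
  names.find do |name|
    dirs.any? do |dir|
      file = File.join(dir, name)
      File.file?(file) && File.executable?(file)
    end
  end
end

# Render +url+ to plain text with whichever browser is available.
def plain(url)
  browser = first_available(DUMPERS) or raise 'no text browser found on PATH'
  # Array form: no shell is involved, so the URL is passed through untouched.
  IO.popen([browser, '-dump', url], &:read)
end
```

Calling plain('http://www.google.com/') then behaves like the lynx-only version, but falls back to w3m when lynx is missing.
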
keal wrote:
> i have html-text. i have to convert this text to simple text without
> html-tags.

This is a very low cost variant - I guess the lynx approach is much more
effective and complete:

ruby -pe '$_.gsub!(%r{</?.*?>}, "")' index.html

Kind regards

    robert

bob.news (3807)
1/4/2006 10:51:22 AM
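
The one-liner above works one line at a time, so a tag that wraps across lines slips through, and script/style bodies and entities such as &amp; are left in the output. A slightly fuller sketch, assuming the whole document fits in memory; strip_tags is a made-up name, and this is still regex-based, so it is not a real HTML parser and will trip on malformed markup:

```ruby
require 'cgi'

# Strip tags from a whole HTML string rather than line by line.
def strip_tags(html)
  # Drop <script> and <style> elements together with their contents.
  text = html.gsub(%r{<(script|style)\b.*?</\1\s*>}mi, '')
  # Drop HTML comments.
  text = text.gsub(/<!--.*?-->/m, '')
  # Drop the remaining tags; [^>] also matches newlines, so wrapped tags go too.
  text = text.gsub(/<[^>]*>/, '')
  # Turn the basic entities back into characters (&amp; -> &, &lt; -> <, ...).
  text = CGI.unescapeHTML(text)
  # Squeeze runs of spaces/tabs and collapse blank-line runs.
  text.gsub(/[ \t]+/, ' ').gsub(/\n\s*\n+/, "\n\n").strip
end
```

For a file, something like puts strip_tags(File.read('index.html')) does the job. The lynx approach is still likely to give better-looking output, since it lays the page out instead of just deleting markup.
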