How to convert markup text to plain text in python?

I have some marked up text and would like to convert it to plain text,
by simply removing all the tags. Of course I can do it from first
principles but I felt that among all Python's markup tools there must
be something that would do this simply, without having to create an
XML parser etc.

I've looked around a bit but failed to find anything, any tips?

(e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

Regards,
Geoff
2/1/2008 4:07:37 PM
comp.lang.python

> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
> 
> I've looked around a bit but failed to find anything, any tips?
> 
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")


Well, if all you want to do is remove everything from a "<" to a
">", you can use

  >>> s = "<B>Today</B> is <U>Friday</U>"
  >>> import re
  >>> r = re.compile('<[^>]*>')
  >>> print r.sub('', s)
  Today is Friday

it should even work for semi-pathological cases such as

 s = """You can find my <a
   href='http://example.com'>thesis</a
   > online"""

where the tag contents are split across lines.  There are more
pathological cases where tags aren't well-formed, e.g.

  s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

in which case you get what you deserve for making such
pathological conditions ;-)
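
For anyone on Python 3, here is a minimal sketch of the same substitution run against the three strings above. The helper name strip_tags is made up for illustration; the pattern is the one from the session above.

```python
import re

# Same pattern as above: a "<", any run of characters other than ">", then ">".
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(text):
    """Remove everything from a "<" to the next ">" (hypothetical helper)."""
    return TAG_RE.sub('', text)

simple = "<B>Today</B> is <U>Friday</U>"
multiline = """You can find my <a
   href='http://example.com'>thesis</a
   > online"""
broken = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

print(strip_tags(simple))     # Today is Friday
print(strip_tags(multiline))  # You can find my thesis online
print(strip_tags(broken))     # This has a > sign in it and -nested> tags
```

The character class [^>] also matches newlines, which is why the tag split across lines is removed. In the malformed string the stray ">" and the "-nested>" tail survive, because the pattern has no notion of nesting.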

-tkc



python.list (1515)
2/1/2008 4:27:28 PM
On 01-Feb-2008, geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
> 
> I've looked around a bit but failed to find anything, any tips?
> 
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

Quick but very dirty way:

import urllib
data = urllib.urlopen('http://google.com').read()
# Split on '<', then keep only what follows the first '>' in each piece
data = ''.join([x.split('>', 1)[-1] for x in data.split('<')])
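
To see what the join/split trick does without fetching a page, here is a small Python 3 sketch on in-memory strings. The function name strip_tags_split is hypothetical; the expression inside it is the one-liner above.

```python
def strip_tags_split(text):
    """Hypothetical helper wrapping the split-based one-liner."""
    # Splitting on "<" leaves each piece starting with "tagname>...".
    # split('>', 1)[-1] keeps what follows the first ">" in that piece,
    # or the whole piece when it contains no ">" at all.
    return ''.join(piece.split('>', 1)[-1] for piece in text.split('<'))

print(strip_tags_split("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
print(repr(strip_tags_split("a > b <i>x</i>")))           # ' b x'
```

The second call shows why this is "very dirty": a literal ">" in the text is treated as the end of a tag, so whatever precedes it in that piece is lost.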



ph3845 (1)
2/1/2008 4:34:25 PM
Tim Chase wrote:
>> I have some marked up text and would like to convert it to plain text,
>> by simply removing all the tags. Of course I can do it from first
>> principles but I felt that among all Python's markup tools there must
>> be something that would do this simply, without having to create an
>> XML parser etc.
>>
>> I've looked around a bit but failed to find anything, any tips?
>>
>> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
> 
> 
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
> 
>   >>> s = "<B>Today</B> is <U>Friday</U>"
>   >>> import re
>   >>> r = re.compile('<[^>]*>')
>   >>> print r.sub('', s)
>   Today is Friday
> 
> it should even work for semi-pathological cases such as
> 
>  s = """You can find my <a
>    href='http://example.com'>thesis</a
>    > online"""
> 
> where the tag contents are split across lines.  There are more
> pathological cases where tags aren't well-formed, e.g.
> 
>   s ="This <tag>has a > sign in it and <odd<ly>-nested> tags"
> 
> in which case you get what you deserve for making such
> pathological conditions ;-)
> 
The real answer to this question is "learn how to use Beautiful Soup" -- 
see http://www.crummy.com/software/BeautifulSoup/

regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/

steve73 (4801)
2/1/2008 4:43:55 PM
>> Well, if all you want to do is remove everything from a "<" to a
>> ">", you can use
>>
>>   >>> s = "<B>Today</B> is <U>Friday</U>"
>>   >>> import re
>>   >>> r = re.compile('<[^>]*>')
>>   >>> print r.sub('', s)
>>   Today is Friday
>>
[Tim's ramblings about pathological cases snipped]
>
> The real answer to this question is "learn how to use Beautiful Soup" -- 
> see http://www.crummy.com/software/BeautifulSoup/

Yes, for more pathological cases, BS does a great job of parsing
junk :)

However, as BS isn't batteries-included [Aside: BS and pyparsing
are two common solutions to problems that would make great
additions to the standard library], using a RE to make a
best-effort guess is a good first approximation of a solution
without needing to download extra packages--no matter how useful
those extra packages may be.
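
As a possible middle ground that needs no download, here is a sketch using the standard library's HTML parser, written for Python 3 (where the module is html.parser). This is neither the regex approach nor Beautiful Soup; the class name TextExtractor and the function strip_tags are invented for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text between tags (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs=True turns references like &amp; into "&"
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text outside of tags
        self.parts.append(data)

def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(strip_tags("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
print(strip_tags("<p>Tom &amp; Jerry</p>"))         # Tom & Jerry
```

Unlike the regex, this also decodes character references, which may or may not be what you want.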

-tkc



python.list (1515)
2/1/2008 4:54:19 PM
On Feb 1, 10:54 am, Tim Chase <python.l...@tim.thechases.com> wrote:
> >> Well, if all you want to do is remove everything from a "<" to a
> >> ">", you can use
>
> >>   >>> s = "<B>Today</B> is <U>Friday</U>"
> >>   >>> import re
> >>   >>> r = re.compile('<[^>]*>')
> >>   >>> print r.sub('', s)
> >>   Today is Friday
>
> [Tim's ramblings about pathological cases snipped]

pyparsing includes an example script for stripping tags from HTML
source.  See it on the wiki at
http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.

-- Paul
ptmcg (802)
2/1/2008 5:20:50 PM
On Feb 1, 8:07 am, geoffbache <geoff.ba...@pobox.com> wrote:
> I have some marked up text and would like to convert it to plain text,

If this is just a quick and dirty problem, you can also use one of the
lynx/elinks/links2 browsers and dump the contents to a file.  On Linux
it would be
lynx -dump http://www.etc > text.txt
Lynx is also available for MS Windows, but I am not sure about the
other two.
zentraders (96)
2/2/2008 4:44:19 PM
geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
> 
> I've looked around a bit but failed to find anything, any tips?
> 
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

   >>> import lxml.etree as et
   >>> doc = et.HTML("<b>Today</b> is <u>Friday</u>")
   >>> et.tostring(doc, method='text', encoding=unicode)
   u'Today is Friday'


http://codespeak.net/lxml
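
If lxml is not installed, a roughly similar idea can be sketched with the standard library's xml.etree.ElementTree. This is a different technique from the lxml call above: it assumes the markup is well-formed XML and wraps it in a made-up root element so that text between sibling tags is kept. The function name text_of_xml is hypothetical.

```python
import xml.etree.ElementTree as ET

def text_of_xml(fragment):
    """Return the text content of a well-formed XML fragment (hypothetical helper)."""
    # Wrap in an artificial <root> so the fragment has a single top-level element
    root = ET.fromstring("<root>%s</root>" % fragment)
    # itertext() yields every piece of text in document order
    return ''.join(root.itertext())

print(text_of_xml("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
```

Unlike lxml's HTML parser, this raises xml.etree.ElementTree.ParseError on unclosed or badly nested tags, so it only suits input you know to be well-formed.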

Stefan
stefan_ml (868)
2/3/2008 5:34:31 PM
geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
> 
> I've looked around a bit but failed to find anything, any tips?
> 
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

This might be of interest:

http://pypi.python.org/pypi/haufe.stripml

Stefan
stefan_ml (868)
2/11/2008 10:02:38 AM