Re: Importing into Mathematica from URL (PubMed)

I don't know the field, but the problem you are experiencing has nothing to
do with Mathematica, or with one site providing better data than another.
This just returned the FASTA text amino acid sequence:

Import["http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=protein&dopt=fasta&val=3336842&extrafeat=0&maxplex=1", "Text"]

Or, to import straight into "FASTA" format:

Import["http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=protein&dopt=fasta&val=3336842&extrafeat=0&maxplex=1", "FASTA"]

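If the goal is a list of one-letter residues, the FASTA text returned by
the "Text" import can be reduced with plain string functions. This is only
a minimal sketch, assuming the text holds a single record; fastaResidues is
a name made up for illustration, not a built-in:

```mathematica
(* Hypothetical helper: drop FASTA header lines (those starting with ">")
   and blank lines, then split what is left into single characters.
   Assumes the text holds exactly one record. *)
fastaResidues[text_String] :=
 Characters@StringJoin@
   Select[StringTrim /@ StringSplit[text, "\n"],
    # =!= "" && ! StringMatchQ[#, ">" ~~ ___] &]

(* Made-up two-line record, just to show the shape of the result *)
fastaResidues[">example header\nMKWV\nTFIS\n"]
(* {M, K, W, V, T, F, I, S} *)
```

To use it on the real record, pass it the string that the "Text" import
returns.
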
This returned the GenPept full format

Import["http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=protein&dopt=gpwithparts&val=3336842&extrafeat=0&maxplex=1", "Text"]

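If only the sequence is wanted from the GenPept text, the part between
ORIGIN and // can be cut out and cleaned up. A rough sketch, assuming the
text contains one ORIGIN block laid out like the one in the original
message; genPeptResidues is a made-up name:

```mathematica
(* Hypothetical helper: take the text between ORIGIN and the closing //,
   strip the position numbers and whitespace, and return upper-case
   one-letter residues. Returns $Failed if no ORIGIN block is found. *)
genPeptResidues[text_String] :=
 Module[{blocks},
  blocks = StringCases[text,
    "ORIGIN" ~~ Shortest[body__] ~~ "//" :> body, 1];
  If[blocks === {}, $Failed,
   Characters@ToUpperCase@
     StringReplace[First[blocks],
      (DigitCharacter | WhitespaceCharacter) -> ""]]]

(* Shortened, made-up ORIGIN block to show the shape of the result *)
genPeptResidues["ORIGIN\n        1 mkwvtfisll llfss\n//"]
(* {M, K, W, V, T, F, I, S, L, L, L, L, F, S, S} *)
```
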
The page at your original URL needs to run in a browser: what you see there
seems to be generated by JavaScript. Mathematica imports only the raw
source, not the final content that a browser produces after its JavaScript
engine has run the page's scripts.
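
One way to check this for yourself is to look for the start of the
sequence in the raw page source. This is only a sketch, and I am not
stating what it returns, since that depends on what the server sends:

```mathematica
(* "Source" is one of the elements Import lists for this page; it should
   be the raw HTML as served, before any JavaScript runs. *)
src = Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1", "Source"];

(* First ten residues of the sequence as shown in the browser.
   True here would mean the sequence is not in the raw page at all. *)
StringFreeQ[src, "mkwvtfisll", IgnoreCase -> True]
```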

If you view your original page, "http://www.ncbi.nlm.nih.gov/protein/CAA76847.1",
in a web browser, the "Send to:" link and drop-down image give you the
opportunity to save what you are viewing to a file. I just copied the URL
that the save-to-file action produced.
This did not involve Mathematica in any way.

You may also be able to get this in XML format if you play around with the
viewer.fcgi query string.

$Version
"7.0 for Microsoft Windows (32-bit) (February 18, 2009)"

Hans


-----Original Message-----
From: Thomas Dowling [mailto:thomasgdowling@gmail.com] 
Sent: Thursday, March 24, 2011 6:27 AM
Subject:  Importing into Mathematica from URL (PubMed)

Hello,

Does anyone know how to import a protein sequence from the PubMed database
into Mathematica, or
can anyone advise me as to where I am going wrong in the following approach?


1. As an example, I'd like to import the data for BSA (bovine serum albumin)
from the following site:


http://www.ncbi.nlm.nih.gov/protein/CAA76847.1


I wish to import all meaningful data from this page, but the bit I am
particularly interested in is the amino acid sequence

(in one-letter code), which is right at the end (between ORIGIN and //):

ORIGIN
        1 mkwvtfisll llfssaysrg vfrrdthkse iahrfkdlge ehfkglvlia fsqylqqcpf
       61 dehvklvnel tefaktcvad eshagceksl htlfgdelck vaslretygd madccekqep
      121 ernecflshk ddspdlpklk pdpntlcdef kadekkfwgk ylyeiarrhp yfyapellyy
      181 ankyngvfqe ccqaedkgac llpkietmre kvltssarqr lrcasiqkfg eralkawsva
      241 rlsqkfpkae fvevtklvtd ltkvhkecch gdllecaddr adlakyicdn qdtissklke
      301 ccdkplleks hciaevekda ipenlpplta dfaedkdvck nyqeakdafl gsflyeysrr
      361 hpeyavsvll rlakeyeatl eeccakddph acystvfdkl khlvdepqnl ikqncdqfek
      421 lgeygfqnal ivrytrkvpq vstptlvevs rslgkvgtrc ctkpesermp ctedylslil
      481 nrlcvlhekt pvsekvtkcc teslvnrrpc fsaltpdety vpkafdeklf tfhadictlp
      541 dtekqikkqt alvellkhkp kateeqlktv menfvafvdk ccaaddkeac favegpklvv
      601 stqtala
//

2.  Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1","Elements"]

gives the following

{Data,FullData,Hyperlinks,Images,ImageURLs,Plaintext,Source,Title,XMLObject}

However,

Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1","Data"];

only gives what looks like a load of rubbish, and there is NO SEQUENCE

3.  Trying in FASTA format (which is  supported by Mathematica)

Import["http://www.ncbi.nlm.nih.gov/protein/3336842?report=fasta","Elements"]

gives

{Data,FullData,Hyperlinks,Images,ImageURLs,Plaintext,Source,Title,XMLObject}

but

Import["http://www.ncbi.nlm.nih.gov/protein/3336842?report=fasta","Data"]

also gives only rubbish.

 (Changing the Element to FullData or Plaintext has no effect)


4. NO SUCH PROBLEMS OCCUR with Uniprot (Swiss-Prot).

Import["http://www.uniprot.org/uniprot/P02769","Elements"]

Import["http://www.uniprot.org/uniprot/P02769","Data"]

(Note that the sequence is now imported)

Or, BEST, using the FASTA format and importing from this site

Flatten@Characters@StringReplace[Import["http://www.uniprot.org/uniprot/P02769.fasta", "Plaintext"], Whitespace -> ""]//Short

giving

{M,K,W,V,T,F,I,S,<<592>>,S,T,Q,T,A,L,A}

which is where I'd like to get to.


So my question is the following: What is so unusual about the PubMed site,
and what am I doing wrong in the approach I am taking? It would
be a great advantage to me to be able to import from PubMed in the manner
shown above for UniProt.


Thanks for your help

Tom Dowling


hmichel1 (97)
3/29/2011 11:49:13 AM
comp.soft-sys.math.mathematica

On Mar 29, 6:49 am, "Hans Michel" <hmic...@cox.net> wrote:
> [...]

Tom (and others):

I played around with this and discovered an API of sorts for the site.
While I'm not sure of *exactly* what you want to grab from the page,
consider the following:

Import["http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CAA76847.1&rettype=fasta&retmode=text","Text"]

I discovered this trick by digging for info on 'URL encoding',
specifically the page below. I altered their example by removing the two
numbers they provided (along with the comma separating them) and
then plugging in your ref number, or whatever the hell it means!

http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Downloading_Full_Records

"Downloading Full Records" is a section from a 'book' on E-Utilities
(their description):

(...The E-utilities use a fixed URL syntax that translates a standard
set of input parameters into the values necessary for various NCBI
software components to search for and retrieve the requested data. The
E-utilities are therefore the structured interface to the Entrez
system, which currently includes 38 databases covering a variety of
biomedical data, including nucleotide and protein sequences, gene
records, three-dimensional molecular structures, and the biomedical
literature.)
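
Since the syntax is fixed, the URL can be put together from its
parameters. Here is a rough sketch; efetchURL is a name I made up, and it
does no escaping of special characters in the values. I pass db=protein in
the example because that looks like the natural database for a protein
record. That is my assumption; the call above used db=nuccore.

```mathematica
(* Hypothetical helper: assemble an efetch URL from the four parameters
   used in the Import call above. db and id are required; rettype and
   retmode default to the values from that call. No URL escaping is done. *)
efetchURL[db_String, id_String, rettype_String: "fasta", retmode_String: "text"] :=
 "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" <>
  StringJoin@Riffle[
    {"db=" <> db, "id=" <> id, "rettype=" <> rettype, "retmode=" <> retmode},
    "&"]

efetchURL["protein", "CAA76847.1"]
(* http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=CAA76847.1&rettype=fasta&retmode=text *)
```

The resulting string can be handed to Import with "Text" as the format,
the same way as the hand-written URL.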

Hope this helps. Most data sites have some sort of URL encoding API
these days, a good thing for Mathematica users. :D

-RG

rgorka (44)
3/30/2011 9:17:52 AM