Extracting Links from an HTML document using a Script

Discussion:

JoJo

2009-08-28 15:40:03 UTC

Folks:

I have an HTML document that is about 100 pages long. I assembled this
document from the "Articles By This Author" section of the following web page:
http://www.tigersharktrading.com/authors/23/Harry-Boxer

Scattered throughout this document are many links to the web. The links of
interest to me all start with the ">>" characters, as seen at
TigerSharkTrading, followed by the name of the article, which is given as a link.

* How can I quickly extract these links and copy them to a new file?
* Is there some type of script that can quickly accomplish this task?

Thanks,
JoJo.

mr_unreliable

2009-08-28 16:33:40 UTC

hi JoJo,

I suggest using the "all" collection (of the document
object).

Let's say that your links appear in an "anchor" (A) tag.

Then you could get your collection of anchor tags like this:

document.all.tags("A")

To get the tags you want, you could "walk-the-list" with
some sort of a loop (your choice, try "For Each").

The individual items would be addressed as:

document.all.tags("A")(i) ' where i is your index

And the number of items would be:

document.all.tags("A").Length

In your discussion, you mentioned the URLs, which probably
appear as the "href" attribute of the "A" tag. My guess is
that you can get the URL as:

document.all.tags("A")(i).href
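
Putting those pieces together, a minimal sketch might look like the one below. It assumes the script runs under Windows Script Host rather than inside the page itself, so it loads the saved document into an "htmlfile" COM object to get a document object to work with. The file name articles.html is a placeholder, not something from the original post.

```vbscript
Option Explicit

Const ForReading = 1
Dim fso, doc, anchors, i

Set fso = CreateObject("Scripting.FileSystemObject")

' Assumption: the saved page is "articles.html" in the current folder.
' "htmlfile" gives an in-memory HTML document without opening a browser window.
Set doc = CreateObject("htmlfile")
doc.write fso.OpenTextFile("articles.html", ForReading).ReadAll
doc.close

' Walk the collection of anchor tags by index and print each URL.
Set anchors = doc.all.tags("A")
For i = 0 To anchors.length - 1
    WScript.Echo anchors(i).href
Next
```

If this is saved as, say, links.vbs, running "cscript //nologo links.vbs > links.txt" should send the URLs to a new text file. Anchors without an href come back as empty strings, and relative links may not resolve sensibly because the document has no real base address, so some filtering is likely needed.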

cheers, jw
____________________________________________________________

You got questions? WE GOT ANSWERS!!! ..(but, no guarantee
the answers will be applicable to the questions)

Larry Serflaten

2009-08-29 15:39:05 UTC

As indicated by mr_unreliable, you will probably want to use the DOM
objects to parse the document. I was just going to add that it appears
all the links of interest are contained in SPAN objects that have a class name
of 'title'. So, instead of grabbing 'all' anchors, you could grab all 'SPAN'
objects, check for a className of 'title', and then do another grab
within that object for all anchors (of which there is only one, the one you
want).

Something like: (warning - air code)

For Each sp In document.all.tags("SPAN")
    ' Only the spans with a class name of "title" hold the article links
    If sp.className = "title" Then
        For Each ref In sp.all.tags("A")
            ' Save the href to the new file, etc.
            AppendToFile ref.href
        Next
    End If
Next

Your own AppendToFile routine might as well make the file an HTML
document, so you can load it in a browser and click on any interesting
links...
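
One possible shape for that AppendToFile routine is sketched below. It is only an illustration: the output name links.html is made up, and the URL is written as-is, without escaping characters such as "&" or quotes.

```vbscript
Option Explicit

Const ForAppending = 8
Const OUT_FILE = "links.html"   ' hypothetical output file name

' Append one clickable link to OUT_FILE, creating the file if it does not exist.
Sub AppendToFile(url)
    Dim fso, ts
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile(OUT_FILE, ForAppending, True)
    ' Doubled quotes ("") produce a literal quote inside a VBScript string.
    ts.WriteLine "<a href=""" & url & """>" & url & "</a><br>"
    ts.Close
End Sub

' Example call with the author page from the original question.
AppendToFile "http://www.tigersharktrading.com/authors/23/Harry-Boxer"
```

Reopening the file on every call keeps the routine simple, which should be fine for a few hundred links. Browsers will generally display the result even without html and body tags around it.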

Have fun!
LFS