Tags: apis, attributes, core, extract, file, gui, html, htmldocumentiterator, input, iterator, java, parse, tags, value

HTMLDocument.Iterator

20,519 words with 8 Comments; published: Wed, 19 Sep 2007 14:48:00 GMT

I need to parse an HTML file and extract the VALUE attributes from the INPUT tags. I am trying to use HTMLDocument.Iterator to do this, but I've found that it only works on certain tags such as A and H1. Does anyone know of a workaround for iterating through other tags with the Iterator?

Thanks.

All Comments

  • 8 Comments
    • Hello, I don't know if that will work, but you can create your own tag like this: HTML.Tag YOURTAG = new Tag("theTagName"); But VALUE is an Attribute, not a Tag. The INPUT tag is in HTML.Tag.
      #1; Thu, 05 Jul 2007 22:48:00 GMT
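The distinction drawn in this comment (INPUT is a tag, VALUE is an attribute) can be checked against the constants in javax.swing.text.html.HTML. The following is a minimal sketch; the class name TagLookupSketch and the tag name "thetagname" are made up for illustration, and HTML.UnknownTag is used here as one possible way to represent a tag that Swing does not predefine.

```java
import javax.swing.text.html.HTML;

class TagLookupSketch {

    /** Returns the predefined tag constant for a lower-case tag name, or null if there is none. */
    static HTML.Tag knownTag(String name) {
        return HTML.getTag(name);
    }

    /** Returns the predefined attribute constant for a lower-case attribute name, or null. */
    static HTML.Attribute knownAttribute(String name) {
        return HTML.getAttributeKey(name);
    }

    public static void main(String[] args) {
        // INPUT is a tag constant, VALUE is an attribute constant.
        System.out.println(knownTag("input") == HTML.Tag.INPUT);
        System.out.println(knownAttribute("value") == HTML.Attribute.VALUE);

        // isBlock() tells block tags (P, H1, ...) apart from tags such as INPUT.
        System.out.println("INPUT is block: " + HTML.Tag.INPUT.isBlock());
        System.out.println("P is block: " + HTML.Tag.P.isBlock());

        // A tag without a predefined constant can be represented by UnknownTag.
        HTML.Tag custom = new HTML.UnknownTag("thetagname");
        System.out.println("custom tag: " + custom);
    }
}
```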
    • Hi,

      to find an element such as the INPUT tag you can use

      /**
       * Find the first occurrence of an <code>Element</code> in the
       * element tree below a given <code>Element</code>.
       *
       * @param name the name of the <code>Element</code> to search for
       * @param parent the <code>Element</code> to start looking
       *
       * @return the found <code>Element</code> or null if none is found
       */
      public static Element findElementDown(String name, Element parent) {
          Element foundElement = null;
          ElementIterator eli = new ElementIterator(parent);
          Element thisElement = eli.first();
          while (thisElement != null && foundElement == null) {
              if (thisElement.getName().equalsIgnoreCase(name)) {
                  foundElement = thisElement;
              }
              thisElement = eli.next();
          }
          return foundElement;
      }

      If the element is found, use element.getAttributes().getAttribute(HTML.Attribute.VALUE) to get the value.

      Ulrich

      #2; Thu, 05 Jul 2007 22:48:00 GMT
    • ...and I forgot: the call to findElement would look like this in your case (document being an HTMLDocument):

      Element root = document.getDefaultRootElement();
      Element e = findElement(HTML.Tag.INPUT.toString());
      if (e != null) {
          Object value = e.getAttributes().getAttribute(HTML.Attribute.VALUE);
          if (value != null) {
              // the value was found and can be used
          }
      }

      #3; Thu, 05 Jul 2007 22:48:00 GMT
    • Sorry, it must read: Element e = findElement(HTML.Tag.INPUT.toString(), root);
      #4; Thu, 05 Jul 2007 22:48:00 GMT
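Ulrich's helper returns only the first match. For the original question (all VALUE attributes of all INPUT tags), the same ElementIterator walk can collect every match. This is a minimal sketch under stated assumptions: the class and method names (InputValueCollector, parse, inputValues) and the sample HTML are invented for illustration, and elements are matched through StyleConstants.NameAttribute rather than by comparing getName() strings.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class InputValueCollector {

    /** Parses an HTML string into an HTMLDocument; no GUI component is needed. */
    static HTMLDocument parse(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Keeps the parser from aborting when the page declares a charset in a META tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("could not parse HTML", e);
        }
        return doc;
    }

    /** Walks the whole element tree and collects the VALUE attribute of every INPUT element. */
    static List<String> inputValues(HTMLDocument doc) {
        List<String> values = new ArrayList<>();
        ElementIterator it = new ElementIterator(doc);
        for (Element e = it.first(); e != null; e = it.next()) {
            AttributeSet attrs = e.getAttributes();
            // The tag of an element is stored under StyleConstants.NameAttribute.
            if (attrs.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.INPUT) {
                Object value = attrs.getAttribute(HTML.Attribute.VALUE);
                if (value != null) {
                    values.add(value.toString());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<html><body><form>"
                + "<input name=a value=one>"
                + "<input type=submit value=Save>"
                + "</form></body></html>";
        System.out.println(inputValues(parse(html)));
    }
}
```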
    • Thanks once more, Ulrich. Your answers in the forum have helped me.

      The HTMLDocument class has an abstract Iterator as an inner class.

      The Java developers have provided an implementation of it named LeafIterator.

      I fail to understand why they did not provide an iterator for BranchElements, although they have left a TODO comment there.

      Your solution would still help.

      Faruk.

      #5; Thu, 05 Jul 2007 22:48:00 GMT
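For reference, this is roughly how the built-in iterator is used for a tag it does support, such as A. It is a minimal sketch: the class name LinkIteratorSketch, its helper methods and the sample HTML are assumptions made for illustration. The null check reflects the behaviour described in this thread (no iterator for block tags) rather than a guarantee about every JDK version.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class LinkIteratorSketch {

    /** Parses an HTML string into an HTMLDocument. */
    static HTMLDocument parse(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("could not parse HTML", e);
        }
        return doc;
    }

    /** Collects the HREF of every anchor by walking doc.getIterator(HTML.Tag.A). */
    static List<String> hrefs(HTMLDocument doc) {
        List<String> result = new ArrayList<>();
        HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
        // Guard against null: according to the thread, block tags get no iterator at all.
        if (it == null) {
            return result;
        }
        for (; it.isValid(); it.next()) {
            AttributeSet attrs = it.getAttributes();
            Object href = (attrs == null) ? null : attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                result.add(href.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        HTMLDocument doc = parse("<html><body><p><a href=\"a.html\">A</a> and "
                + "<a href=\"b.html\">B</a></p></body></html>");
        System.out.println("links: " + hrefs(doc));
        // Shows whether this JDK hands out an iterator for a block tag such as P.
        System.out.println("iterator for P is null: " + (doc.getIterator(HTML.Tag.P) == null));
    }
}
```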
    • Now, guys, we have an Iterator for branch elements as well, so elements like P, H1, H2, ... H6, BODY, etc. do not get a null iterator.

      Just make the two changes below in HTMLDocument:

      /**
       * Fetches an iterator for the specified HTML tag.
       * This can be used for things like iterating over the
       * set of anchors contained, or iterating over the input
       * elements.
       *
       * @param t the requested <code>HTML.Tag</code>
       * @return the <code>Iterator</code> for the given HTML tag
       * @see javax.swing.text.html.HTML.Tag
       */
      public Iterator getIterator(HTML.Tag t) {
          if (t.isBlock()) {
              return new BranchIterator(t, this);
          }
          return new LeafIterator(t, this);
      }

      /**
       * An iterator to iterate over branch elements.
       * [FARUK]-- I am not sure why the SUN guys did not give the
       * implementation for BranchElements. I have implemented
       * this class in an effort to achieve an implementation of a
       * BranchElement Iterator.
       */
      static class BranchIterator extends Iterator {

          private static final String nonLeafElements[] = {
              "BLOCKQUOTE", "BODY", "DD", "DIR", "DIV", "DL", "DT", "H1", "H2",
              "H3", "H4", "H5", "H6", "HEAD", "LI", "MENU", "NOFRAMES", "OL",
              "P", "PRE", "TABLE", "TD", "TH", "TITLE", "TR", "UL"
          };

          BranchIterator(HTML.Tag t, Document doc) {
              tag = t;
              elemIterator = new ElementIterator(doc);
              endOffset = 0;
              // Position the iterator on the first occurrence of the tag.
              // A single call is enough; a second call would skip the first match.
              next();
          }

          /**
           * Returns the attributes for this tag.
           * @return the <code>AttributeSet</code> for this tag,
           * or <code>null</code> if none can be found
           */
          public AttributeSet getAttributes() {
              Element elem = elemIterator.current();
              if (elem != null) {
                  //System.err.println("Trying to fetch 'AttributeSet' for tag : " + tag); /* DEBUGSTATEMENT */
                  return elem.getAttributes();
              }
              return null;
          }

          /**
           * Returns the start of the range for which the current occurrence of
           * the tag is defined and has the same attributes.
           *
           * @return the start of the range, or -1 if it can't be found
           */
          public int getStartOffset() {
              Element elem = elemIterator.current();
              if (elem != null) {
                  return elem.getStartOffset();
              }
              return -1;
          }

          /**
           * Returns the end of the range for which the current occurrence of
           * the tag is defined and has the same attributes.
           *
           * @return the end of the range
           */
          public int getEndOffset() {
              return endOffset;
          }

          /**
           * Moves the iterator forward to the next occurrence
           * of the tag it represents.
           */
          public void next() {
              for (nextBranch(elemIterator); isValid(); nextBranch(elemIterator)) {
                  Element elem = elemIterator.current();
                  if (elem.getStartOffset() >= endOffset) {
                      if (elem.getName().equalsIgnoreCase(tag.toString())) {
                          //System.err.println("The tag '" + tag + "' is present in this html page"); /* DEBUGSTATEMENT */
                          // we found the next one
                          setEndOffset();
                          break;
                      }
                  }
              }
          }

          /**
           * Returns the type of tag this iterator represents.
           *
           * @return the <code>HTML.Tag</code> that this iterator represents.
           * @see javax.swing.text.html.HTML.Tag
           */
          public HTML.Tag getTag() {
              return tag;
          }

          /**
           * Returns true if the current position is not <code>null</code>.
           * @return true if current position is not <code>null</code>,
           * otherwise returns false
           */
          public boolean isValid() {
              return (elemIterator.current() != null);
          }

          /**
           * Moves the given iterator to the next branch element.
           * @param iter the iterator to be scanned
           */
          void nextBranch(ElementIterator iter) {
              for (iter.next(); iter.current() != null; iter.next()) {
                  Element e = iter.current();
                  if (isBranch(e)) {
                      break;
                  }
              }
          }

          /** Returns true if the element's name is one of the known branch (block) tags. */
          public boolean isBranch(Element elem) {
              boolean retval = false;
              for (int i = 0; i < nonLeafElements.length; i++) {
                  if (elem.getName().equalsIgnoreCase(nonLeafElements[i])) {
                      retval = true;
                      break;
                  }
              }
              return retval;
          }

          /**
           * Marches a cloned iterator forward to locate the end
           * of the run. This sets the value of <code>endOffset</code>.
           */
          void setEndOffset() {
              AttributeSet a0 = getAttributes();
              endOffset = elemIterator.current().getEndOffset();
              ElementIterator fwd = (ElementIterator) elemIterator.clone();
              for (nextBranch(fwd); fwd.current() != null; nextBranch(fwd)) {
                  Element e = fwd.current();
                  AttributeSet a1 = (AttributeSet) e.getAttributes().getAttribute(tag);
                  if ((a1 == null) || (!a1.equals(a0))) {
                      break;
                  }
                  endOffset = e.getEndOffset();
              }
          }

          private int endOffset;
          private HTML.Tag tag;
          private ElementIterator elemIterator;
      }

      #6; Thu, 05 Jul 2007 22:48:00 GMT
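The patch above requires editing HTMLDocument itself. Where that is not an option, block elements can also be collected from outside the class with a plain ElementIterator walk. The sketch below is an alternative technique, not the approach from the post: the class name BlockElementFinder, its methods and the sample HTML are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class BlockElementFinder {

    /** Parses an HTML string into an HTMLDocument. */
    static HTMLDocument parse(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("could not parse HTML", e);
        }
        return doc;
    }

    /** Returns every element whose tag (stored under NameAttribute) is the given one. */
    static List<Element> findAll(HTMLDocument doc, HTML.Tag tag) {
        List<Element> found = new ArrayList<>();
        ElementIterator it = new ElementIterator(doc);
        for (Element e = it.first(); e != null; e = it.next()) {
            if (e.getAttributes().getAttribute(StyleConstants.NameAttribute) == tag) {
                found.add(e);
            }
        }
        return found;
    }

    /** Returns the trimmed text covered by an element. */
    static String text(Element e) {
        Document doc = e.getDocument();
        int start = e.getStartOffset();
        // The last element may end one position past the document length, so clamp it.
        int end = Math.min(e.getEndOffset(), doc.getLength());
        try {
            return doc.getText(start, Math.max(0, end - start)).trim();
        } catch (BadLocationException ex) {
            return "";
        }
    }

    /** Convenience wrapper: the text of every element with the given tag. */
    static List<String> texts(HTMLDocument doc, HTML.Tag tag) {
        List<String> result = new ArrayList<>();
        for (Element e : findAll(doc, tag)) {
            result.add(text(e));
        }
        return result;
    }

    public static void main(String[] args) {
        HTMLDocument doc = parse("<html><body><h1>Title</h1><p>first</p><p>second</p></body></html>");
        System.out.println("H1: " + texts(doc, HTML.Tag.H1));
        System.out.println("P:  " + texts(doc, HTML.Tag.P));
    }
}
```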
    • Hi!

      I have a problem with the HTMLDocument.Iterator.

      I would like to grab some specific HTML links (hrefs) from this Yahoo page:

      http://uk.biz.yahoo.com/p/us/cpi/cpia0.html

      The links are contained in tables (specifically, tables nested in tables ...).

      It happens that, when moving the iterator through the page, the content of the table cells is not visible.

      What is more, one element is reported correctly (img), but the others are not correctly identified (href & A).

      The iterator gets the table, gets the row, gets the cell (td), and states that there are three elements inside the td element, BUT ... you can access none of them! (One is the link I need ...)

      Any idea how to get to the cell content?

      Has any of you ever accessed the content of an HTML table cell?

      I enclose the source code to show you what happens.

      Any help would be much appreciated.

      Giuliano.

      ==================================================================

      /*
       * TestReader.java
       *
       * Created on 21 October 2004, 22:00
       */
      package webread;

      import java.io.*;
      import java.net.*;
      import java.util.*;
      import javax.swing.*;
      import javax.swing.text.*;
      import javax.swing.text.html.parser.AttributeList;
      import javax.swing.text.html.*;

      /**
       *
       * @author Giuliano
       */
      public class TestReader {

          static HTMLDocument doc;

          /** Creates a new instance of TestReader */
          public TestReader() {
          }

          public static void getPage(String uriStr) {
              try {
                  // Create a reader on the HTML content
                  URL url = new URI(uriStr).toURL();
                  URLConnection conn = url.openConnection();
                  Reader rd = new InputStreamReader(conn.getInputStream());
                  // Parse the HTML
                  EditorKit kit = new HTMLEditorKit();
                  doc = (HTMLDocument)kit.createDefaultDocument();
                  doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
                  kit.read(rd, doc, 0);
                  Element[] elemList = doc.getRootElements();
                  // start recursion
                  recursePrint(getSubElementAtOrder(elemList[0], 1, HTML.Tag.BODY), 0);
              } catch (MalformedURLException e) {
              } catch (URISyntaxException e) {
              } catch (BadLocationException e) {
              } catch (IOException e) {
              }
          }

          /**
           * Gets a given element that is a child of the root element.
           * It looks only for the given HTML type
           * and takes only the "order"-th occurrence of this element type.
           * @param root - the parent object
           * @param order - the occurrence number of the desired element
           * @param type - the type of the desired element
           *
           * @return - the found element or null
           */
          public static Element getSubElementAtOrder(Element root, int order, HTML.Tag type)
          {
              if (root.getElementCount() < order) return null;
              int count = 0;
              for (int i = 0; i < root.getElementCount(); i++) {
                  if (root.getElement(i).getAttributes().getAttribute(StyleConstants.NameAttribute) == type) count += 1;
                  if (count == order) return root.getElement(i);
              }
              return null;
          }

          /**
           * Recursively prints the element level, name and number of subelements.
           */
          public static void recursePrint(Element elem, int level) {
              String tag = "";
              for (int i = 0; i < level; i++) tag += "- ";
              System.out.print(tag + "LEVEL : " + level);
              System.out.print(tag + "Element name " + elem.getName());
              System.out.println(tag + "Subelements Number : " + elem.getElementCount());
              if (level >= 5)
              {
                  int index = elem.getStartOffset();
                  int length = elem.getEndOffset() - index;
                  String tableContent = "Element content : \n";
                  try
                  {
                      tableContent += doc.getText(index, length);
                  }
                  catch (Exception e)
                  {
                      tableContent += "";
                  }
                  System.out.println(tag + tableContent);
                  System.out.println(" Index : " + index + " - Length : " + length);
              }
              if (level >= 6)
              {
                  printAttributeList(tag, (AttributeList)elem.getAttributes().getAttribute(HTML.Attribute.HREF));
              }
              if ((elem.getElementCount() > 0) && (level < 8)) {
                  for (int i = 0; i < elem.getElementCount(); i++)
                      recursePrint(elem.getElement(i), level + 1);
              } else {
                  int start = elem.getStartOffset();
                  int length = elem.getEndOffset() - start;
                  try {
                      //System.out.println("Value : " + doc.getText(start, length));
                  } catch (Exception e) {}
              }
          }

          /**
           * Prints every name/value pair of the given attribute list.
           */
          public static void printAttributeList(String tag, AttributeList aList) {
              while (aList != null) {
                  System.out.println(tag + aList.getName() + " - " + aList.getValue());
                  aList = aList.getNext();
              }
          }

          public static void main(String[] args)
          {
              getPage("http://uk.biz.yahoo.com/p/us/cpi/cpia0.html");
          }
      }

      #7; Thu, 05 Jul 2007 22:48:00 GMT
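One likely reason the links look inaccessible: in an HTMLDocument, character-level tags such as A are usually not elements of their own. They are stored as an attribute (keyed by HTML.Tag.A) on the leaf content elements inside the cell, and the value of that attribute is an AttributeSet holding HREF. The sketch below illustrates that lookup; the class name CellLinkFinder, its methods and the sample table are assumptions made for illustration, and it has not been run against the Yahoo page from the post.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class CellLinkFinder {

    /** Parses an HTML string into an HTMLDocument. */
    static HTMLDocument parse(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("could not parse HTML", e);
        }
        return doc;
    }

    /**
     * Collects the HREF of every link found below the given element
     * (for example a TD element, or the document root).
     */
    static List<String> links(Element parent) {
        List<String> hrefs = new ArrayList<>();
        ElementIterator it = new ElementIterator(parent);
        for (Element e = it.first(); e != null; e = it.next()) {
            if (!e.isLeaf()) {
                continue; // anchors are attached to leaf (content) elements
            }
            // The anchor is an attribute of the text run, keyed by HTML.Tag.A.
            Object anchor = e.getAttributes().getAttribute(HTML.Tag.A);
            if (anchor instanceof AttributeSet) {
                Object href = ((AttributeSet) anchor).getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    // A link whose text is split into several runs is reported once per run.
                    hrefs.add(href.toString());
                }
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        HTMLDocument doc = parse("<html><body><table><tr>"
                + "<td><a href=\"x.html\">X</a></td>"
                + "<td>plain text <a href=\"y.html\">Y</a></td>"
                + "</tr></table></body></html>");
        System.out.println(links(doc.getDefaultRootElement()));
    }
}
```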
    • Hi giulsun,

      I need your help, if you can.

      Suppose I have the content of an HTML file in the form of a String, and I want to print the VALUE attribute of the INPUT, SELECT and LABEL tags.

      To achieve this I am using a method like buttonAction(String browserContents) (I am not sure whether the code in the sample method will work), and I have trouble understanding how to iterate over the elements.

      So can you help me, please?

      public static void buttonAction(String browserContents) {
          JEditorPane pan = new JEditorPane();
          pan.setEditable(false);
          pan.setContentType("text/html");
          pan.setText(browserContents);
          HTMLDocument htmlDoc = (HTMLDocument)pan.getDocument();
          /// code to print the VALUE attribute of INPUT, SELECT and LABEL
          // .......
          ///-
      }

      Note: here is an example of the HTML file content:

      <HTML xmlns="http://www.w3.org/1999/xhtml">
      <HEAD>
      <TITLE>Untitled Document</TITLE>
      <META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
      </HEAD>
      <BODY>
      <FORM id=form1 name=form1 action="" method=post>
      <LABEL>Name <INPUT value=rohit name=textfield> </LABEL>
      <LABEL>Second Field <INPUT value=Test2 name=textfield2> </LABEL>
      <LABEL>Color <SELECT name=select>
      <OPTION value=Red>Red</OPTION>
      <OPTION value=Blue selected>Blue</OPTION>
      <OPTION value=Green>Green</OPTION></SELECT>
      </LABEL>
      <LABEL><INPUT type=submit value=Save name=Submit> </LABEL>
      </FORM>
      </BODY>
      </HTML>

      The result should be something like this: name=rohit, second field=Test2, Color=Blue.

      Message was edited by:

      Rohit_Kumar

      #8; Thu, 05 Jul 2007 22:48:00 GMT
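One possible way to approach this last question without building an HTMLDocument at all is to listen to the parser directly through HTMLEditorKit.ParserCallback. The sketch below is only an illustration under stated assumptions: the class name FormValueScanner and the simplified sample form are invented, values are keyed by each control's NAME attribute rather than by the LABEL text (LABEL itself has no VALUE attribute), and a SELECT reports its selected OPTION, falling back to the first OPTION when none is marked as selected.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class FormValueScanner extends HTMLEditorKit.ParserCallback {

    private final Map<String, String> values = new LinkedHashMap<>();
    private String currentSelect; // NAME of the SELECT being parsed, or null outside a SELECT

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // INPUT has no end tag, so the parser reports it as a simple tag.
        if (t == HTML.Tag.INPUT) {
            record(attr(a, HTML.Attribute.NAME), attr(a, HTML.Attribute.VALUE));
        }
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.SELECT) {
            currentSelect = attr(a, HTML.Attribute.NAME);
        } else if (t == HTML.Tag.OPTION && currentSelect != null) {
            boolean selected = a.getAttribute(HTML.Attribute.SELECTED) != null;
            // The first OPTION is the default; an explicitly selected one overrides it.
            if (selected || !values.containsKey(currentSelect)) {
                record(currentSelect, attr(a, HTML.Attribute.VALUE));
            }
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.SELECT) {
            currentSelect = null;
        }
    }

    private void record(String name, String value) {
        if (name != null && value != null) {
            values.put(name, value);
        }
    }

    private static String attr(MutableAttributeSet a, HTML.Attribute key) {
        Object v = a.getAttribute(key);
        return (v == null) ? null : v.toString();
    }

    /** Parses the HTML string and returns control name to value pairs in document order. */
    static Map<String, String> scan(String html) {
        FormValueScanner scanner = new FormValueScanner();
        try {
            // true = ignore any charset directive in the page
            new ParserDelegator().parse(new StringReader(html), scanner, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return scanner.values;
    }

    public static void main(String[] args) {
        String html = "<html><body><form>"
                + "<input name=textfield value=rohit>"
                + "<select name=color><option value=Red>Red</option>"
                + "<option value=Blue selected>Blue</option></select>"
                + "<input type=submit name=Submit value=Save>"
                + "</form></body></html>";
        System.out.println(scan(html));
    }
}
```

In buttonAction(String browserContents), a call such as FormValueScanner.scan(browserContents) could replace the JEditorPane setup, since nothing needs to be rendered.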