Parsing HTML using the jsoup library

This blog post shows how to parse an HTML table using jsoup, an open source Java library.  To get started, either download the jsoup JAR and place it on the classpath for your project, or add the Maven dependency.

For our tutorial, let’s parse a table at http://en.wikipedia.org/wiki/List_of_blogs.  To do this, we set up a connection to the site:

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/List_of_blogs").get();

Next, we need to extract the table.  There are 2 tables on this page, however.  The first table contains this language, “This article needs additional…” The second table is the one we’d like to iterate over.  We’ll select the second table by referencing its CSS class, like so:

Elements trs = doc.select("table.wikitable tr");

table means ‘select table elements’, . means ‘with the CSS class named’, wikitable is the CSS class we’re looking for, and ‘<space>tr’ means ‘then get every table row nested inside.’  So all together that’s, “select the tables with CSS class wikitable, then get all the table rows (trs) inside them.” I determined that the table had a wikitable class by examining the HTML with Chrome’s Inspect Element feature.  Firebug is a nice Firefox extension that lets you do the same thing, as is Developer Tools in IE.  Jsoup lets you search by more than just CSS class, and its documentation has a full list of selectors beyond ‘.’ and ‘<space>’.
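
To make the pieces of that selector concrete, here is a minimal sketch that runs it against a small inline HTML string instead of the live Wikipedia page. The markup and the SelectorDemo class are made up for illustration; only Jsoup.parse(), select() and size() come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectorDemo {
    // Hypothetical markup: one plain table and one with class "wikitable".
    static final String HTML =
        "<table><tr><td>notice</td></tr></table>"
      + "<table class=\"wikitable\">"
      + "<tr><th>Blog</th></tr>"
      + "<tr><td>Example Blog</td></tr>"
      + "</table>";

    // Counts the elements matched by a CSS selector in the sample markup.
    static int countRows(String selector) {
        Document doc = Jsoup.parse(HTML);
        Elements rows = doc.select(selector);
        return rows.size();
    }

    public static void main(String[] args) {
        // Rows of every table on the page.
        System.out.println(countRows("table tr"));
        // Rows of the table carrying the wikitable class only.
        System.out.println(countRows("table.wikitable tr"));
    }
}
```

Dropping the .wikitable part widens the match to the rows of every table on the page, which is why the class matters here.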

Good news!  That was the hard part.

Next, we remove the header row:

trs.remove(0);

and iterate over each row.  We can pull out the table data cells (<td>) within each row using the getElementsByTag() method, and pull out the first <td> (the one containing the blog title) using the first() method.  From there, the text() method gives us the text that appeared in that specific td. On the off chance that wasn’t crystal clear:

for (Element tr : trs) {
    Elements tds = tr.getElementsByTag("td");
    Element td = tds.first();
    System.out.println("Blog: " + td.text());
}

Here’s the main() method soup-to-nuts. I’ll leave the imports as an exercise for the reader ;-)

public static void main(String[] args) {
    try {
        Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/List_of_blogs").get();
        Elements trs = doc.select("table.wikitable tr");

        //remove header row
        trs.remove(0);

        for (Element tr : trs) {
            Elements tds = tr.getElementsByTag("td");
            Element td = tds.first();
            System.out.println("Blog: " + td.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
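
If you’d rather skip the exercise, the following is one possible complete version, imports included. Treat it as a sketch rather than the definitive listing: the BlogTitles class and firstCells() method are names I made up, and instead of removing the first row it skips any row that has no <td>, which also copes with pages containing more than one wikitable.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BlogTitles {
    // Returns the text of the first <td> in each row of every table.wikitable.
    static List<String> firstCells(Document doc) {
        List<String> titles = new ArrayList<>();
        for (Element tr : doc.select("table.wikitable tr")) {
            Element td = tr.select("td").first();
            if (td != null) { // header rows hold only <th> cells, so first() is null
                titles.add(td.text());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/List_of_blogs").get();
        for (String title : firstCells(doc)) {
            System.out.println("Blog: " + title);
        }
    }
}
```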

27 Comments

  1. Sheela
    Posted September 22, 2012 at 6:27 am | Permalink

    What if the HTML has two tables with no difference in their tags?
    For example:

    heading
    details

    name
    abcd

    hobby
    playing cricket

    name

    city

    a

    madurai

    b

    coimbatore

    —————————————-
    In this case, how do I retrieve the content of each table, since the tables can’t be told apart by class or width?

  2. Posted September 26, 2012 at 12:18 am | Permalink

    Elements tables = doc.select("table");

    will return all the tables in an HTML document. From there you have to iterate through each table until you find the one that you want.

    Hope that helps! -Rich
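
    As a rough sketch of that iteration, assuming the table you want can be recognised by some text in its first row. The TableFinder class, the findByHeader() helper and the sample markup are all hypothetical, loosely modelled on the question above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableFinder {
    // Returns the first table whose first row contains the given text, or null.
    static Element findByHeader(Document doc, String headerText) {
        for (Element table : doc.select("table")) {
            Element firstRow = table.select("tr").first();
            if (firstRow != null && firstRow.text().contains(headerText)) {
                return table;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed markup: two tables with no class or id to tell them apart.
        String html =
            "<table><tr><td>heading</td><td>details</td></tr>"
          + "<tr><td>name</td><td>abcd</td></tr></table>"
          + "<table><tr><td>name</td><td>city</td></tr>"
          + "<tr><td>a</td><td>madurai</td></tr></table>";
        Element cities = findByHeader(Jsoup.parse(html), "city");
        if (cities != null) {
            for (Element row : cities.select("tr")) {
                System.out.println(row.text());
            }
        }
    }
}
```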

  3. Manoj
    Posted October 17, 2012 at 2:30 pm | Permalink

    What if I have the HTML file on my local machine? How do I load it and do the same thing?

  4. Manoj
    Posted October 17, 2012 at 2:33 pm | Permalink

    Also, can you please explain iterating through tables?

  5. Richard Krajunus
    Posted October 23, 2012 at 11:35 pm | Permalink

    The jsoup documentation explains how to work with local files: http://jsoup.org/cookbook/input/load-document-from-file
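
    A minimal sketch of that approach, assuming the file is UTF-8 encoded. The LocalFileDemo class and its sample page are made up for illustration; Jsoup.parse(File, String) is the jsoup call doing the work.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LocalFileDemo {
    // Parses an HTML file from disk, assuming it is UTF-8 encoded.
    static Document load(File file) {
        try {
            return Jsoup.parse(file, "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a tiny sample page to a temp file, parses it, and returns its <title>.
    static String titleOfSample() {
        try {
            Path tmp = Files.createTempFile("sample", ".html");
            String page = "<html><head><title>Local page</title></head><body></body></html>";
            Files.write(tmp, page.getBytes(StandardCharsets.UTF_8));
            String title = load(tmp.toFile()).title();
            Files.delete(tmp);
            return title;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleOfSample());
    }
}
```

    Once the Document is loaded, the select() calls from the post work exactly as they do on a page fetched with Jsoup.connect().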

  6. Richard Krajunus
    Posted October 23, 2012 at 11:47 pm | Permalink

    “Iterating through tables” means looping through each table on the page. I imagine something like this would work:
    Elements tables = doc.select("table");

    for (Element table : tables) {
        // insert the code that you'd like to execute for each table here
    }

  7. nahid
    Posted November 10, 2012 at 8:45 pm | Permalink

    Please explain how I can use div:has(p) in Java,
    and how to extract text from a tag.
    Document doc = Jsoup.parse(html);
    thanks

  8. nahid
    Posted November 10, 2012 at 8:47 pm | Permalink

    How do I extract text from a tag?

  9. nahid
    Posted November 11, 2012 at 5:53 am | Permalink

    hi

    Please explain how I can extract childNodes() in Java with jsoup,
    and div:has(p)
    thanks,

  10. Richard Krajunus
    Posted November 13, 2012 at 4:06 pm | Permalink

    Hi, Nahid,

    childNodes() can be called from any Node class, and of course any class extending Node. Since Element extends Node, we could write:
    List<Node> nodes = td.childNodes();
    in the for loop.
    childNodes() documentation

    div:has(p) is a selector that matches every div containing at least one p tag. It originated in jQuery, and jsoup’s selector syntax supports it directly:
    Elements divsHavingP = doc.select("div:has(p)");
    if (!divsHavingP.isEmpty())
    {
        // this code will be executed if at least one div contains a p
    }

    The selector-syntax documentation also lists "div p" and "div > p". Note that these return the p elements themselves rather than the divs: the first matches any p nested inside a div, the second only a p that is a direct child of a div.

    Extracting text from a tag was in my original post…
    System.out.println("Blog: " + td.text());
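
    Here is a small self-contained sketch that pulls these pieces together on an inline HTML string. The markup and the HasSelectorDemo class are invented for the example; it uses jsoup’s :has() pseudo-selector, which matches the containing div elements themselves, along with childNodes().

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

class HasSelectorDemo {
    // Hypothetical markup: only the first div contains a <p>.
    static final String HTML =
        "<div id=\"a\">intro <p>one</p></div>"
      + "<div id=\"b\"><span>two</span></div>";

    // Returns the divs that contain at least one <p> descendant.
    static Elements divsWithParagraph(Document doc) {
        return doc.select("div:has(p)");
    }

    // Returns how many child nodes (elements and text) the first match has.
    static int childNodeCount(Document doc, String selector) {
        Element el = doc.select(selector).first();
        if (el == null) {
            return 0;
        }
        List<Node> nodes = el.childNodes();
        return nodes.size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        for (Element div : divsWithParagraph(doc)) {
            System.out.println("div with p: #" + div.id() + " -> " + div.text());
        }
        System.out.println("child nodes of #a: " + childNodeCount(doc, "#a"));
    }
}
```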

  11. rajiv
    Posted January 3, 2013 at 10:06 am | Permalink

    Hi, i need to parse an html string but retain the formatting.
    for e.g para1 para2 should print as
    para1
    para2
    but it displays as
    para1 para2.
    Similarly for other html tags as well. Kindly help… urgent

  12. karan
    Posted January 4, 2013 at 4:25 pm | Permalink

    Hi, I need to parse a table in HTML using the jsoup library from the site http://www.informatik.uni-trier.de/~ley/pers/hd/k/Kumar:G=_Praveen.html
    Since there are two tables on the page, I do not know exactly how to parse the table contents. I need to extract the contents of the 1st table, that is, only author names and their publications, and the 2nd table at the end, named coauthors…
    I tried the code given below, but it gives errors…

    import java.io.*;
    import org.jsoup.*;
    import org.jsoup.nodes.*;
    import org.jsoup.select.*;
    import java.io.BufferedWriter.*;
    import java.io.FileWriter.*;
    import java.io.IOException.*;
    import java.util.*;
    public class Main {

        public static void main(String[] args) {

            try {
                Document doc = Jsoup.connect("http://www.informatik.uni-trier.de/~ley/pers/hd/k/Kumar:G=_Praveen.html").get();
                Elements trs = doc.select("table tr");
                Element table = doc.select("table[class=coauthor]").first();
                Iterator<Element> ite = table.select("td").iterator();
                ite.next();
                System.out.println("Value 1: " + ite.next().text());
                System.out.println("Value 2: " + ite.next().text());
                System.out.println("Value 3: " + ite.next().text());
                System.out.println("Value 4: " + ite.next().text());

                trs.remove(0);
                for (Element tr : trs) {
                    Elements tds = tr.getElementsByTag("td");
                    Element td = tds.first();
                    System.out.println("Blog: " + td.text());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    ….Any help will be appreciated..thanx in advance..

  13. Richard Krajunus
    Posted January 10, 2013 at 11:09 pm | Permalink

    You write:
    Element table = doc.select("table[class=coauthor]").first();

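    This is fragile. The [class=coauthor] attribute selector only matches a table whose class attribute is exactly “coauthor”, and first() returns null when nothing matches, in which case the following call on table throws a NullPointerException. According to the selector syntax documentation, “.” is the usual way to specify a class, as in table.coauthor. To get the second table regardless of its class use: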
    Element table = doc.select("table").get(1);

    You also write:
    Iterator<Element> ite = table.select("td").iterator();
    ite.next();
    System.out.println("Value 1: " + ite.next().text());

    This is incorrect because it gets all the tds in the table as one flat list. You should get a row first, then get the 2nd td of each row, and call text() on that.


    public static void main(String[] args) {

        Document doc = null;

        try {
            doc = Jsoup
                    .connect("http://www.informatik.uni-trier.de/~ley/pers/hd/k/Kumar:G=_Praveen.html")
                    .get();
        } catch (IOException e) {
            e.printStackTrace();
            return;
        }

        Element secondTable = doc.select("table").get(1);
        Elements rowsFromSecondTable = secondTable.select("tr");

        for (Element row : rowsFromSecondTable)
        {
            Elements tds = row.select("td");
            // skip header rows and any row with fewer than two cells
            if (tds.size() > 1)
            {
                System.out.println(tds.get(1).text());
            }
        }
    }

  14. Richard Krajunus
    Posted January 10, 2013 at 10:37 pm | Permalink

    Rajiv, after reading the documentation, it looks like the outerHtml() or html() methods would be of help to you. Both return the markup rather than the flattened text, so the tags that carry the formatting are kept.
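
    If the goal is simply one line per paragraph, another option is to select the p elements and print each one’s text separately. This sketch assumes the input looks like <p>para1</p><p>para2</p> (the tags in the question were lost, so that is a guess), and ParagraphLines is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParagraphLines {
    // Returns the text of each <p> element on its own line.
    static String linesOf(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element p : doc.select("p")) {
            lines.add(p.text());
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        System.out.println(linesOf("<p>para1</p><p>para2</p>"));
    }
}
```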

  15. Jenita
    Posted January 27, 2013 at 10:40 pm | Permalink

    How do I parse different pages from the same website that have different IDs?

  16. Richard Krajunus
    Posted January 28, 2013 at 3:48 pm | Permalink

    Hi, Jenita,

    You can parse different pages by specifying a different url for each page in the call to Jsoup.connect(...).

    The documentation on the selector syntax states you can supply an id with a # sign, as in
    doc.select("#logo") which retrieves all elements with an id equal to “logo”.
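
    A tiny sketch of the id lookup, run on an inline string so it works offline. The markup and the IdDemo class are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class IdDemo {
    // Returns the text of the element with the given id, or null if absent.
    static String textOfId(String html, String id) {
        Element el = Jsoup.parse(html).select("#" + id).first();
        return el == null ? null : el.text();
    }

    public static void main(String[] args) {
        String html = "<div id=\"logo\">My Site</div><div id=\"nav\">Home</div>";
        System.out.println(textOfId(html, "logo"));
    }
}
```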

  17. Jenita
    Posted January 29, 2013 at 3:44 am | Permalink

    Thank you Richard. Been able to do it. Thanks a lot

  18. prabhu
    Posted February 17, 2013 at 7:08 am | Permalink

    I am not able to retrieve the label in an HTML file when part of the label is in parentheses, like (nct)Number. I am using the jsoup 1.6 jar.

    My code is
    public Element findElementByLabel(Document htmlDocObj, String label, String elType) throws AbstractScriptException
    {
        if(elType.equalsIgnoreCase("img"))
        {
            return findIMG(htmlDocObj, label, elType);
        }

        Elements els = htmlDocObj.getElementsByAttributeValueContaining("value", label);

    how to retrieve elements…

  19. Richard Krajunus
    Posted February 20, 2013 at 5:38 pm | Permalink

    Hi, prabhu, what do you mean by label? Do you mean the HTML label tag? It seems like you’re working with the img tag. Could you describe what you’re trying to do?

  20. karan
    Posted March 22, 2013 at 9:58 am | Permalink

    Hi, I am trying to parse the following page.

    Document doc = Jsoup.connect("http://www.informatik.uni-trier.de/~ley/pers/hd/h/Han:Jiawei.html").get();

    I need to extract the contents of the 1st table, that is, only author names and their publications, but only for the years 1986 to 2012.
    What would the jsoup syntax for that be?

  21. zulkarnain
    Posted March 23, 2013 at 2:59 am | Permalink

    What if the URL doesn’t change but the text content does? I mean, what if the webpage is an Ajax one, where clicking a link changes the content but not the URL? How do I get the text behind all the links that generate new text when clicked without going to a new page? For example, this webpage:
    http://www.islamicuniversity.edu.in/Descrip?date=SELID1

  22. Posted August 1, 2013 at 11:41 am | Permalink

    You can find some more details at this link: http://javadomain.in/parsing-title-of-the-website-using-jsoup/

  23. Bhawna
    Posted September 3, 2013 at 4:46 am | Permalink

    Hello, I have just started using jsoup. How do I use jsoup if I have an index.html page with a form input and I want to fetch its data?

    Thanks in advance!

  24. Richard Krajunus
    Posted September 11, 2013 at 12:25 pm | Permalink

    Hi, Karan,

    The following code ought to return all of the list elements with the “year” class applied to them:
    Elements allYears = doc.select("li.year");
    but for the particular website you’ve supplied I notice only the years from 2013-2007 are showing up. I think you’ve found a bug in the jsoup library.

    The bug is being discussed at https://github.com/mariuszs/jsoup/commit/49f16476c71cd995724c4edec089c9b97237cc41

  25. Richard Krajunus
    Posted September 11, 2013 at 1:42 pm | Permalink

    Hi, zulkarnain, I’m not aware of a way to parse something like that with jsoup. The problem is that jsoup isn’t conducive to clicking on html links; it’s mainly a parser. You may have better luck with a test framework.

    You might find these questions helpful:
    http://stackoverflow.com/questions/13666453/trying-to-parse-html-hidden-by-javascript
    http://stackoverflow.com/questions/11016122/how-to-click-submit-button-using-jsoup

  26. Richard Krajunus
    Posted September 11, 2013 at 2:10 pm | Permalink

    Hi, Bhawna, take a look at the selector syntax documentation, which describes how to select elements by tag (input) and attribute (value).
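
    For example, assuming the form has a text input named email (a made-up name, as is the FormInputDemo class), something like this should read its value attribute:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class FormInputDemo {
    // Returns the value of the first <input> with the given name, or null if absent.
    static String inputValue(String html, String name) {
        Element input = Jsoup.parse(html).select("input[name=" + name + "]").first();
        return input == null ? null : input.val();
    }

    public static void main(String[] args) {
        // Assumed markup standing in for the form on index.html.
        String html = "<form><input type=\"text\" name=\"email\" value=\"a@example.com\"></form>";
        System.out.println(inputValue(html, "email"));
    }
}
```

    Keep in mind that jsoup only sees the value present in the HTML source; text a user types into the browser is not visible to it.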

  27. Milan
    Posted October 18, 2014 at 6:54 pm | Permalink

    Hi guys,
    I have a problem and need some help. I have this code, which gets some data from a web site, parses it, and puts it in a Document. In my application I want to input a String, e.g. a String that represents the town column on the web site, for example “Kac”.
    For that String I want to know whether a matching String exists in the table I parsed into the Document, and if it does, I need to view the entire row, which contains number, town, address, and time.
    Document doc = Jsoup.connect("http://www.elektrovojvodina.rs/sl/mediji/Dana-20-i-21-10-2014-g-se-zbog-PLANIRANIH-radova-u-el-mrezi-iskljucuju").get();
    for (Element table : doc.select("div.content_body_left")) {
        for (Element row : table.select("tr")) {
            Elements tds = row.select("td");
            if (tds.size() > 3) {
                System.out.println(tds.get(0).text() + tds.get(1).text());
            }
        }
    }
