Tallan's Technology Blog

Tallan's Top Technologists Share Their Thoughts on Today's Technology Challenges

Parsing HTML using jsoup library

This blog post shows how to parse an HTML table using jsoup, an open source Java library. To get started, either download the jsoup jar and place it on the classpath for your project, or add the Maven dependency.

For our tutorial, let’s parse a table at http://en.wikipedia.org/wiki/List_of_blogs.  To do this, we set up a connection to the site:

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/List_of_blogs").get();

Next, we need to extract the table.  There are 2 tables on this page, however.  The first table contains this language, “This article needs additional…” The second table is the one we’d like to iterate over.  We’ll select the second table by referencing its CSS class, like so:

Elements trs = doc.select("table.wikitable tr");

Reading the selector piece by piece: table means ‘select a table’, . means ‘with the CSS class named’, wikitable is the CSS class we’re looking for, and the space followed by tr means ‘and then get all the table rows inside it.’ So all together that’s, “select a table with CSS class named wikitable and then get all the table rows (trs) inside it.” I was able to determine that the table had a wikitable class on it by examining the HTML using Chrome’s Inspect Element feature. Firebug is a nice Firefox extension that allows you to do the same thing, as is Developer Tools in IE. Jsoup allows you to search by more than just CSS class, and its documentation gives a full list of the selectors you can use besides ‘.’ and the space.
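
To see the selector at work without hitting the network, here is a minimal sketch that runs it against a small inline document. The HTML string, the ambox class on the first table, and the SelectorDemo class are made up for illustration; only the table.wikitable tr selector comes from the post.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorDemo {
    // Hypothetical stand-in for the page: one notice table, one data table.
    static final String HTML =
        "<table class=\"ambox\"><tr><td>This article needs additional...</td></tr></table>"
      + "<table class=\"wikitable\">"
      + "<tr><th>Blog</th><th>Language</th></tr>"
      + "<tr><td>Example Blog A</td><td>English</td></tr>"
      + "<tr><td>Example Blog B</td><td>French</td></tr>"
      + "</table>";

    // Counts the elements matched by a CSS selector in the sample document.
    static int countRows(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Rows of the wikitable only (header row included).
        System.out.println(countRows("table.wikitable tr"));
        // Rows of every table on the page.
        System.out.println(countRows("table tr"));
    }
}
```

The class filter is what keeps the notice table's row out of the result.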

Good news!  That was the hard part.

Next, remove the header row, for example by dropping the first element of trs:

trs.remove(0);

Then iterate over each row. We can pull out the table data cells (<td>) within each row using the getElementsByTag() method, and pull out the first <td> (the one containing the blog title) by using the first() method. From there, the text() method gives us the text that appeared in that specific td. On the off chance that wasn’t crystal clear:

for (Element tr : trs) {
    Elements tds = tr.getElementsByTag("td");
    Element td = tds.first();
    System.out.println("Blog: " + td.text());
}
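
As an alternative to removing the header row, the loop can simply skip any row that has no <td> cells, since first() returns null on an empty Elements. This is a sketch under assumptions: the sample HTML and the HeaderSafeLoop class are invented for illustration and are not the real Wikipedia markup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HeaderSafeLoop {
    // Hypothetical sample: a header row made of <th> cells, then two data rows.
    static final String SAMPLE =
        "<table class=\"wikitable\">"
      + "<tr><th>Blog</th><th>Language</th></tr>"
      + "<tr><td>Example Blog A</td><td>English</td></tr>"
      + "<tr><td>Example Blog B</td><td>French</td></tr>"
      + "</table>";

    // Returns the text of the first <td> in each row, skipping rows without one.
    static List<String> firstCells(String html) {
        Document doc = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        for (Element tr : doc.select("table.wikitable tr")) {
            Element td = tr.getElementsByTag("td").first();
            if (td == null) {
                continue; // header rows hold only <th> cells
            }
            titles.add(td.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        for (String title : firstCells(SAMPLE)) {
            System.out.println("Blog: " + title);
        }
    }
}
```

The null check also protects against tables with several header rows, where removing only the first row would not be enough.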

Here’s the main() method soup-to-nuts. I’ll leave the imports as an exercise for the reader 😉

public static void main(String[] args) {
    try {
        Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/List_of_blogs").get();
        Elements trs = doc.select("table.wikitable tr");

        // remove header row (its cells are <th>, so tds.first() would be null)
        trs.remove(0);

        for (Element tr : trs) {
            Elements tds = tr.getElementsByTag("td");
            Element td = tds.first();
            System.out.println("Blog: " + td.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Tags: HTML, Java, jsoup

27 Comments

What if the HTML has two tables with no difference in their tags? For example:

playing cricket

In this case, how do I retrieve the content of each table, given that the tables cannot be told apart by class or width?

Elements tables = doc.select("table");

will return all the tables in an HTML document. From there you have to iterate through each table until you find the one that you want.

Hope that helps! -Rich

What if I have the HTML on my local machine? How do I load it and do the same thing?

Also, can you please explain iterating through tables?

Richard Krajunus
October 23, 2012 11:35 pm

The jsoup documentation explains how to work with local files: http://jsoup.org/cookbook/input/load-document-from-file
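
A minimal sketch of the local-file case, assuming the file is saved as UTF-8. Jsoup.parse(File, String) is the jsoup call; the LocalFileDemo class and the temp-file helper are hypothetical and exist only so the example runs without a pre-existing file.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LocalFileDemo {
    // Parses an HTML file from disk; the charset should match how the file was saved.
    static Document load(File htmlFile) throws IOException {
        return Jsoup.parse(htmlFile, "UTF-8");
    }

    // Demo helper (hypothetical): writes the HTML to a temp file, then loads it back.
    static String titleViaTempFile(String html) {
        try {
            File tmp = File.createTempFile("jsoup-demo", ".html");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return load(tmp).title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Local page</title></head><body></body></html>";
        System.out.println(titleViaTempFile(html));
    }
}
```

Once the Document is loaded, doc.select(...) works exactly as it does for a page fetched with Jsoup.connect(...).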

Richard Krajunus
October 23, 2012 11:47 pm

“Iterating through tables” means looping through each table on the page. I imagine something like this would work:
Elements tables = doc.select("table");

for (Element table : tables) {
    //insert the code that you'd like to execute for each table here
}
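
When the tables carry no class or id, one way to pick the right one inside such a loop is to look at its content. The sketch below matches on header text; the sample HTML, the header names, and the TableFinder class are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableFinder {
    // Hypothetical page: two tables with identical, attribute-free tags.
    static final String SAMPLE =
        "<table><tr><th>Sport</th></tr><tr><td>football</td></tr></table>"
      + "<table><tr><th>Hobby</th></tr><tr><td>playing cricket</td></tr></table>";

    // Returns the first table that has a <th> with the given text, or null.
    static Element findTableWithHeader(Document doc, String headerText) {
        for (Element table : doc.select("table")) {
            for (Element th : table.select("th")) {
                if (th.text().equalsIgnoreCase(headerText)) {
                    return table;
                }
            }
        }
        return null;
    }

    // Convenience wrapper: text of the first <td> in the matching table, or null.
    static String firstCellUnderHeader(String html, String headerText) {
        Element table = findTableWithHeader(Jsoup.parse(html), headerText);
        if (table == null) {
            return null;
        }
        Element td = table.select("td").first();
        return td == null ? null : td.text();
    }

    public static void main(String[] args) {
        System.out.println(firstCellUnderHeader(SAMPLE, "Hobby"));
    }
}
```

If the position of the table is stable, doc.select("table").get(index) is a simpler option.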

Please explain how I can use div:has(p) in Java, and how to extract text from a tag.
Document doc = Jsoup.parse(html);

Please also explain how I can use childNodes() in Java with jsoup.

Richard Krajunus
November 13, 2012 4:06 pm

Hi, Nahid,

childNodes() can be called from any Node class, and of course any class extending Node. Since Element extends Node, we could write:
List<Node> nodes = td.childNodes();
in the for loop.
childNodes() documentation
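
A short sketch of what childNodes() returns for a cell that mixes text and markup. The sample HTML and the ChildNodesDemo class are hypothetical; the point is that the list holds Node objects, which may be TextNode or Element instances.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class ChildNodesDemo {
    // Describes each direct child of the first <td> in the given HTML.
    static List<String> describeChildren(String html) {
        Element td = Jsoup.parse(html).select("td").first();
        List<String> out = new ArrayList<>();
        for (Node node : td.childNodes()) {
            if (node instanceof TextNode) {
                out.add("text:" + ((TextNode) node).text());
            } else if (node instanceof Element) {
                out.add("element:" + ((Element) node).tagName());
            } else {
                out.add("other:" + node.nodeName());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Hello <b>bold</b> world</td></tr></table>";
        for (String line : describeChildren(html)) {
            System.out.println(line);
        }
    }
}
```

Unlike children(), which returns only elements, childNodes() also includes the text between them.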

div:has(p) is a selector that matches every div containing a p element. It originated in jQuery, and jsoup's selector syntax supports it directly:
Elements divsHavingP = doc.select("div:has(p)");
if (!divsHavingP.isEmpty()) {
    //this code will be executed if at least one div contains a p
}

Note that a plain descendant selector, doc.select("div p"), returns the p elements themselves rather than the divs. The selector-syntax documentation says we can also use:
Elements directChildPs = doc.select("div > p");
if we want only the p elements that are direct children of a div.
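
To make the difference between these selectors concrete, here is a small sketch that lists the tag names each one matches. jsoup's selector syntax includes a :has(selector) pseudo-selector, which is what the sketch relies on; the sample HTML and the HasDemo class are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class HasDemo {
    // Hypothetical markup: a div with a direct p, a div without p, a div with a nested p.
    static final String HTML =
        "<div id=\"a\"><p>direct</p></div>"
      + "<div id=\"b\"><span>no paragraph</span></div>"
      + "<div id=\"c\"><section><p>nested</p></section></div>";

    // Returns the tag names of all matches, comma separated.
    static String matchedTags(String cssQuery) {
        StringBuilder sb = new StringBuilder();
        for (Element el : Jsoup.parse(HTML).select(cssQuery)) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(el.tagName());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(matchedTags("div:has(p)")); // the divs that contain a p
        System.out.println(matchedTags("div p"));      // the p elements inside any div
        System.out.println(matchedTags("div > p"));    // only p elements directly under a div
    }
}
```

Which form to use depends on whether you want the containers or the paragraphs.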

Extracting text from a tag was in my original post…
System.out.println("Blog: " + td.text());

Hi, I need to parse an HTML string but retain the formatting.
For example, para1 para2 should print as
but it displays as
para1 para2.
The same goes for other HTML tags as well. Kindly help… urgent

Hi, I need to parse a table in HTML using the jsoup library from the site http://www.informatik.uni-trier.de/~ley/pers/hd/k/Kumar:G=_Praveen.html
Since there are two tables on the page, I do not know exactly how to parse the table contents. I need to extract the contents of the first table, that is, only the author names and their publications, and the second table at the end, named coauthors…
I tried the code given below, but it gives errors…

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.BufferedWriter.*;
import java.io.FileWriter.*;
import java.io.IOException.*;
import java.util.*;
public class Main {

public static void main(String[] args) {

try {
Document doc = Jsoup.connect(“http://www.informatik.uni-trier.de/~ley/pers/hd/k/Kumar:G=_Praveen.html
Elements trs = doc.select(“table tr”);
Element table = doc.select(“table[class=coauthor]”).first();
Iterator ite = table.select(“td”).iterator();
System.out.println(“Value 1: ” + ite.next().text());
System.out.println(“Value 2: ” + ite.next().text());
System.out.println(“Value 3: ” + ite.next().text());
System.out.println(“Value 4: ” + ite.next().text());

for (Element tr : trs) {
Elements tds = tr.getElementsByTag(“td”);
Element td = tds.first();
System.out.println(“Blog: ” + td.text());
} catch (IOException e) {
….Any help will be appreciated..thanx in advance..

Richard Krajunus
January 10, 2013 11:09 pm

You write:
Element table = doc.select("table[class=coauthor]").first();

According to the selector syntax documentation, “.” is the usual way to specify a class, as in table.coauthor. The [class=coauthor] attribute form only matches when the class attribute is exactly “coauthor”, and first() returns null when nothing matches. To get the second table use:
Element table = doc.select("table").get(1);

You also write:
Iterator ite = table.select("td").iterator();
System.out.println("Value 1: " + ite.next().text());

This is incorrect because this gets all the tds. You should get a row first, then get the 2nd td of each row, and call text on that.

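public static void main(String[] args) {

    Document doc = null;

    try {
        // URL taken from the question above
        doc = Jsoup.connect("http://www.informatik.uni-trier.de/~ley/pers/hd/k/Kumar:G=_Praveen.html").get();
    } catch (IOException e) {
        e.printStackTrace();
        return;
    }

    Element secondTable = doc.select("table").get(1);
    Elements rowsFromSecondTable = secondTable.select("tr");

    for (Element row : rowsFromSecondTable) {
        Element secondTd = row.select("td").get(1);
        System.out.println(secondTd.text());
    }
}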
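
The same idea can be tried offline with a guard for rows that have fewer than two cells, since get(1) throws on a short row. This is only a sketch: the sample HTML and the SecondTableDemo class are invented and do not reflect the actual markup of the page in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SecondTableDemo {
    // Hypothetical page with two tables; the second one lists coauthors.
    static final String SAMPLE =
        "<table><tr><td>2012</td><td>Some publication</td></tr></table>"
      + "<table>"
      + "<tr><th>No.</th><th>Coauthor</th></tr>"
      + "<tr><td>1</td><td>Author One</td></tr>"
      + "<tr><td>2</td><td>Author Two</td></tr>"
      + "</table>";

    // Collects the text of the second <td> of every row in the second table.
    static List<String> secondCellsOfSecondTable(String html) {
        Elements tables = Jsoup.parse(html).select("table");
        List<String> cells = new ArrayList<>();
        if (tables.size() < 2) {
            return cells; // no second table on this page
        }
        for (Element row : tables.get(1).select("tr")) {
            Elements tds = row.select("td");
            if (tds.size() > 1) { // skips header rows and short rows
                cells.add(tds.get(1).text());
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(secondCellsOfSecondTable(SAMPLE));
    }
}
```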

Richard Krajunus
January 10, 2013 10:37 pm

Rajiv, after reading the documentation, it looks like the outerHtml() or html() methods would be of help to you. Both return markup with the tags intact, whereas text() strips them.
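
Here is a minimal sketch of how the three methods differ. The markup in the original question was lost, so the two-paragraph sample below is an assumption, as is the FormattingDemo class.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class FormattingDemo {
    // Assumed input: two paragraphs inside a div.
    static final String SAMPLE = "<div><p>para1</p><p>para2</p></div>";

    // Returns { text(), html(), outerHtml() } for the first <div> in the input.
    static String[] views(String html) {
        Element div = Jsoup.parse(html).select("div").first();
        return new String[] { div.text(), div.html(), div.outerHtml() };
    }

    public static void main(String[] args) {
        String[] v = views(SAMPLE);
        System.out.println("text():      " + v[0]); // tags stripped, text joined
        System.out.println("html():      " + v[1]); // inner markup kept
        System.out.println("outerHtml(): " + v[2]); // markup including the div itself
    }
}
```

To print each paragraph on its own line as plain text, another option is to loop over select("p") and call text() on each element.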

How can I parse different pages from the same website that have different IDs?

Richard Krajunus
January 28, 2013 3:48 pm

Hi, Jenita,

You can parse different pages by specifying a different url for each page in the call to Jsoup.connect(...).

The documentation on the selector syntax states you can supply an id with a # sign, as in
doc.select("#logo") which retrieves all elements with an id equal to “logo”.
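
A small sketch of the id lookup, run against an inline document so it needs no network. The sample HTML and the IdSelectorDemo class are hypothetical; only the #logo selector comes from the reply above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class IdSelectorDemo {
    // Hypothetical page fragment with two elements carrying ids.
    static final String SAMPLE =
        "<div id=\"logo\">Example Site</div><div id=\"footer\">Contact</div>";

    // Returns the text of the first element with the given id, or null if absent.
    static String textOfId(String html, String id) {
        Element el = Jsoup.parse(html).select("#" + id).first();
        return el == null ? null : el.text();
    }

    public static void main(String[] args) {
        System.out.println(textOfId(SAMPLE, "logo"));
    }
}
```

For several pages, the same method can be called once per page after fetching each URL with Jsoup.connect(...).get().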

Thank you Richard. Been able to do it. Thanks a lot

I am not able to retrieve the label in an HTML file when the label is in parentheses, like (nct)Number. I am using the jsoup 1.6 jar.

My code is
public Element findElementByLabel(Document htmlDocObj, String label, String elType) throws AbstractScriptException
return findIMG(htmlDocObj, label, elType);

Elements els = htmlDocObj.getElementsByAttributeValueContaining("value", label);

How do I retrieve the elements…

Richard Krajunus
February 20, 2013 5:38 pm

Hi, prabhu, what do you mean by label? Do you mean the HTML label tag? It seems like you’re working with the img tag. Could you describe what you’re trying to do?

Hi, I am trying to parse the following page.

Document doc = Jsoup.connect("http://www.informatik.uni-trier.de/~ley/pers/hd/h/Han:Jiawei.html").get();

I need to extract the contents of the first table, that is, only the author names and their publications, but I need only the rows from 1986 to 2012.
What would the jsoup syntax for this be?

What if the URL doesn’t change but the text content does? I mean, what if the web page is an Ajax one, where the URL stays the same but the content changes when you click a link? How can I get the text behind all the links that, when clicked, don’t go to a new web page but do generate new text? For example, this web page:

Hello, I have just started using jsoup. How do I use jsoup if I have an index.html page with a form input and I want to fetch its data?

Thanks in advance!

Richard Krajunus
September 11, 2013 12:25 pm

Hi, Karan,

The following code ought to return all of the list elements with the “year” class applied to them:
Elements allYears = doc.select("li.year");
but for the particular website you’ve supplied I notice only the years from 2013-2007 are showing up. I think you’ve found a bug in the jsoup library.

The bug is being discussed at https://github.com/mariuszs/jsoup/commit/49f16476c71cd995724c4edec089c9b97237cc41

Richard Krajunus
September 11, 2013 1:42 pm

Hi, zulkarnain, I’m not aware of a way to parse something like that with jsoup. The problem is that jsoup isn’t conducive to clicking on html links; it’s mainly a parser. You may have better luck with a test framework.

You might find these questions helpful:

Richard Krajunus
September 11, 2013 2:10 pm

Hi, Bhawna, take a look at the selector syntax documentation, which states how to select elements by tag (input) and by attribute (value).
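
As a rough sketch of that approach, the code below collects the name and value attributes of every named input in a form. The form markup, the field names, and the FormInputDemo class are made up for illustration. Keep in mind that jsoup only sees the values present in the HTML source, not anything a user types into the browser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class FormInputDemo {
    // Hypothetical index.html content with two named inputs and a submit button.
    static final String SAMPLE =
        "<form action=\"/submit\">"
      + "<input type=\"text\" name=\"user\" value=\"alice\">"
      + "<input type=\"text\" name=\"city\" value=\"Springfield\">"
      + "<input type=\"submit\">"
      + "</form>";

    // Maps each input's name attribute to its value attribute, in document order.
    static Map<String, String> inputValues(String html) {
        Map<String, String> values = new LinkedHashMap<>();
        for (Element input : Jsoup.parse(html).select("form input[name]")) {
            values.put(input.attr("name"), input.attr("value"));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(inputValues(SAMPLE));
    }
}
```

The [name] attribute filter leaves out the submit button, which has no name in this sample.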

Hi guys,
I have a problem and need some help. I have this code, which gets some data from a web site, parses it, and puts it in a Document. In my application I want to input a String, e.g. “Kac”, which represents a value in the town column on the web site. For that String I want to know whether a matching String exists in the table I parsed into the Document, and if it exists I need to view the entire row that contains it: number, town, address, and time.
Document doc = Jsoup.connect("http://www.elektrovojvodina.rs/sl/mediji/Dana-20-i-21-10-2014-g-se-zbog-PLANIRANIH-radova-u-el-mrezi-iskljucuju").get();
for (Element table : doc.select("div.content_body_left")) {
    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        if (tds.size() > 3) {
            System.out.println(tds.get(0).text() + tds.get(1).text());
        }
    }
}
