public class Jsoup extends Object
java.lang.Object
   ↳ org.jsoup.Jsoup

Class Overview

The core public access point to the jsoup functionality.

Summary

Public Methods
static String clean(String bodyHtml, String baseUri, Whitelist whitelist)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
static String clean(String bodyHtml, Whitelist whitelist)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
static Connection connect(String url)
Creates a new Connection to a URL.
static boolean isValid(String bodyHtml, Whitelist whitelist)
Test if the input HTML has only tags and attributes allowed by the Whitelist.
static Document parse(InputStream in, String charsetName, String baseUri)
Read an input stream, and parse it to a Document.
static Document parse(File in, String charsetName)
Parse the contents of a file as HTML.
static Document parse(String html, String baseUri)
Parse HTML into a Document.
static Document parse(URL url, int timeoutMillis)
Fetch a URL, and parse it as HTML.
static Document parse(File in, String charsetName, String baseUri)
Parse the contents of a file as HTML.
static Document parse(String html)
Parse HTML into a Document.
static Document parseBodyFragment(String bodyHtml)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
static Document parseBodyFragment(String bodyHtml, String baseUri)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
Inherited Methods
From class java.lang.Object

Public Methods

public static String clean (String bodyHtml, String baseUri, Whitelist whitelist)

Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.

Parameters
bodyHtml input untrusted HTML
baseUri URL to resolve relative URLs against
whitelist white-list of permitted HTML elements
Returns
  • safe HTML

public static String clean (String bodyHtml, Whitelist whitelist)

Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.

Parameters
bodyHtml input untrusted HTML
whitelist white-list of permitted HTML elements
Returns
  • safe HTML
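
As an illustration of both clean overloads, here is a minimal sketch. It assumes the built-in Whitelist.basic() preset from org.jsoup.safety (later jsoup releases rename this class Safelist); the class name CleanSketch and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical helper class; only Jsoup.clean and Whitelist.basic() are jsoup API.
final class CleanSketch {
    // Filters untrusted HTML through the basic whitelist (no base URI).
    static String cleanBasic(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Whitelist.basic());
    }

    // Same filter, but supplies a base URI so relative links can be resolved.
    static String cleanWithBase(String untrustedHtml, String baseUri) {
        return Jsoup.clean(untrustedHtml, baseUri, Whitelist.basic());
    }

    public static void main(String[] args) {
        // The onclick handler is not on the basic whitelist, so it is dropped.
        String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        System.out.println(cleanBasic(unsafe));

        // With a base URI, the relative href can be made absolute.
        System.out.println(cleanWithBase("<a href='/about'>About</a>", "http://example.com/"));
    }
}
```

The three-argument form is the one to reach for when the untrusted input may contain relative links.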

public static Connection connect (String url)

Creates a new Connection to a URL. Use to fetch and parse an HTML page.

Usage examples:

  • Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();
  • Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();

Parameters
url URL to connect to. The protocol must be http or https.
Returns
  • the connection. You can add data, cookies, and headers; set the user-agent, referrer, method; and then execute.

public static boolean isValid (String bodyHtml, Whitelist whitelist)

Test if the input HTML has only tags and attributes allowed by the Whitelist. Useful for form validation. The input HTML should still be run through the cleaner to set up enforced attributes, and to tidy the output.

Parameters
bodyHtml HTML to test
whitelist whitelist to test against
Returns
  • true if no tags or attributes were removed; false otherwise
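
A possible form-validation flow is sketched below: reject input that fails isValid, and still pass accepted input through clean, as recommended above. ValidateSketch and its method names are hypothetical, and Whitelist.basic() is an assumed choice of whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical helper class illustrating validate-then-clean.
final class ValidateSketch {
    // True only if nothing in the input would be removed by the basic whitelist.
    static boolean isAcceptable(String html) {
        return Jsoup.isValid(html, Whitelist.basic());
    }

    // Rejects invalid input, then cleans valid input so that enforced
    // attributes are applied and the output is tidied.
    static String validateAndClean(String html) {
        if (!isAcceptable(html)) {
            throw new IllegalArgumentException("HTML contains disallowed tags or attributes");
        }
        return Jsoup.clean(html, Whitelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("<b>ok</b>"));
        System.out.println(isAcceptable("<script>alert(1)</script>"));
    }
}
```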

public static Document parse (InputStream in, String charsetName, String baseUri)

Read an input stream, and parse it to a Document.

Parameters
in input stream to read. Make sure to close it after parsing.
charsetName (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri The URL where the HTML was retrieved from, to resolve relative links against.
Returns
  • sane HTML
Throws
IOException if the stream could not be read, or if the charsetName is invalid.
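
The stream overload might be used as in the sketch below, which reads from an in-memory stream and closes it with try-with-resources. StreamParseSketch and parseUtf8 are hypothetical names, and wrapping IOException in UncheckedIOException is a choice made here for brevity, not something the API requires.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class; the stream here is in-memory for illustration.
final class StreamParseSketch {
    static Document parseUtf8(String html, String baseUri) {
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        // try-with-resources closes the stream once parsing has finished.
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return Jsoup.parse(in, "UTF-8", baseUri);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseUtf8("<title>Hello</title><a href='a.html'>x</a>",
                                 "http://example.com/dir/");
        System.out.println(doc.title());
        System.out.println(doc.select("a").first().absUrl("href"));
    }
}
```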

public static Document parse (File in, String charsetName)

Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.

Parameters
in file to load HTML from
charsetName (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
Returns
  • sane HTML
Throws
IOException if the file could not be found, or read, or if the charsetName is invalid.

public static Document parse (String html, String baseUri)

Parse HTML into a Document. The parser will make a sensible, balanced document tree out of any HTML.

Parameters
html HTML to parse
baseUri The URL where the HTML was retrieved from. Used to resolve relative URLs into absolute URLs when they occur before the HTML declares a <base href> tag.
Returns
  • sane HTML
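
To illustrate both the tree balancing and the role of baseUri, here is a small sketch. The class name ParseWithBaseSketch, the sample markup, and the base URL are assumptions for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; the base URL is an example value.
final class ParseWithBaseSketch {
    static Document parse(String html) {
        return Jsoup.parse(html, "http://example.com/");
    }

    public static void main(String[] args) {
        // Neither <p> is closed in the input; the parser balances the tree.
        Document doc = parse("<p>One<p>Two <a href='/docs/'>Docs</a>");
        System.out.println(doc.select("p").size());

        // The relative href is resolved against the supplied base URI.
        Element link = doc.select("a").first();
        System.out.println(link.absUrl("href"));
    }
}
```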

public static Document parse (URL url, int timeoutMillis)

Fetch a URL, and parse it as HTML. Provided for compatibility; in most cases use connect(String) instead.

The encoding character set is determined by the content-type header or http-equiv meta tag, or falls back to UTF-8.

Parameters
url URL to fetch (with a GET). The protocol must be http or https.
timeoutMillis Connection and read timeout, in milliseconds. If exceeded, IOException is thrown.
Returns
  • The parsed HTML.
Throws
IOException If the final server response is not 200 OK (redirects are followed), or if there is an error reading the response stream.

public static Document parse (File in, String charsetName, String baseUri)

Parse the contents of a file as HTML.

Parameters
in file to load HTML from
charsetName (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri The URL where the HTML was retrieved from, to resolve relative links against.
Returns
  • sane HTML
Throws
IOException if the file could not be found, or read, or if the charsetName is invalid.
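
A sketch of the file overload with an explicit base URI follows. FileParseSketch, writeTemp, and parseFile are hypothetical names; the temporary file exists only to keep the example self-contained.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class for parsing a file on disk.
final class FileParseSketch {
    // Writes the given HTML to a temporary UTF-8 file (illustration only).
    static File writeTemp(String html) {
        try {
            Path path = Files.createTempFile("jsoup-sketch", ".html");
            Files.write(path, html.getBytes(StandardCharsets.UTF_8));
            return path.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses the file as UTF-8 and resolves relative links against baseUri.
    static Document parseFile(File in, String baseUri) {
        try {
            return Jsoup.parse(in, "UTF-8", baseUri);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = writeTemp("<title>T</title><img src='logo.png'>");
        Document doc = parseFile(file, "http://example.com/site/");
        System.out.println(doc.select("img").first().absUrl("src"));
        file.delete();
    }
}
```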

public static Document parse (String html)

Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.

Parameters
html HTML to parse
Returns
  • sane HTML
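
The effect of omitting the base URI can be seen in the sketch below: without a <base href> tag a relative link has nothing to resolve against, while with one it can be made absolute. ParseNoBaseSketch and firstLinkAbsUrl are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class comparing input with and without <base href>.
final class ParseNoBaseSketch {
    // Returns the absolute URL of the first link, or "" if there is no link
    // or the link cannot be resolved to an absolute URL.
    static String firstLinkAbsUrl(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // No base URI available: absUrl yields an empty string.
        System.out.println("[" + firstLinkAbsUrl("<a href='/docs/'>Docs</a>") + "]");

        // The document declares its own base, so the link resolves.
        System.out.println(firstLinkAbsUrl(
            "<base href='http://example.com/'><a href='/docs/'>Docs</a>"));
    }
}
```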

public static Document parseBodyFragment (String bodyHtml)

Parse a fragment of HTML, with the assumption that it forms the body of the HTML.

Parameters
bodyHtml body HTML fragment
Returns
  • sane HTML document
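
A minimal sketch of fragment parsing: the fragment is placed inside the body of a new Document, so the parsed content is reached through body(). FragmentSketch and bodyOf are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class for working with body fragments.
final class FragmentSketch {
    // Parses the fragment and returns the <body> element that wraps it.
    static Element bodyOf(String fragment) {
        return Jsoup.parseBodyFragment(fragment).body();
    }

    public static void main(String[] args) {
        // The unclosed <p> and <b> tags are balanced by the parser.
        Element body = bodyOf("<p>Hello <b>world");
        System.out.println(body.html());
        System.out.println(body.text());
    }
}
```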

public static Document parseBodyFragment (String bodyHtml, String baseUri)

Parse a fragment of HTML, with the assumption that it forms the body of the HTML.

Parameters
bodyHtml body HTML fragment
baseUri URL to resolve relative URLs against.
Returns
  • sane HTML document