java.lang.Object
↳ org.jsoup.Jsoup
The core public access point to the jsoup functionality.
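Public Methods | |
---|---|
static String clean(String bodyHtml, String baseUri, Whitelist whitelist) | Get safe HTML from untrusted input HTML by parsing it and filtering it through a white-list of permitted tags and attributes. |
static String clean(String bodyHtml, Whitelist whitelist) | Get safe HTML from untrusted input HTML by parsing it and filtering it through a white-list of permitted tags and attributes. |
static Connection connect(String url) | Creates a new Connection to a URL. |
static boolean isValid(String bodyHtml, Whitelist whitelist) | Test if the input HTML has only tags and attributes allowed by the Whitelist. |
static Document parse(InputStream in, String charsetName, String baseUri) | Read an input stream, and parse it to a Document. |
static Document parse(File in, String charsetName) | Parse the contents of a file as HTML. |
static Document parse(String html, String baseUri) | Parse HTML into a Document. |
static Document parse(URL url, int timeoutMillis) | Fetch a URL, and parse it as HTML. |
static Document parse(File in, String charsetName, String baseUri) | Parse the contents of a file as HTML. |
static Document parse(String html) | Parse HTML into a Document. |
static Document parseBodyFragment(String bodyHtml) | Parse a fragment of HTML, with the assumption that it forms the body of the HTML. |
static Document parseBodyFragment(String bodyHtml, String baseUri) | Parse a fragment of HTML, with the assumption that it forms the body of the HTML. |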
Inherited Methods
From class java.lang.Object
public static String clean (String bodyHtml, String baseUri, Whitelist whitelist)
Get safe HTML from untrusted input HTML by parsing it and filtering it through a white-list of permitted tags and attributes.
Parameter | Description |
---|---|
bodyHtml | untrusted input HTML |
baseUri | URL to resolve relative URLs against |
whitelist | white-list of permitted HTML elements |
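As a minimal sketch of the three-argument clean, the example below passes untrusted markup through the library's preset Whitelist.basic(). The class name CleanExample, the helper sanitize, and the sample markup are assumptions made for illustration; later jsoup releases rename Whitelist to Safelist, so adjust the import if your version differs.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanExample {
    // Hypothetical helper: sanitize untrusted markup with the preset basic white-list.
    // Relative links are resolved against baseUri before their protocol is checked.
    static String sanitize(String untrustedHtml, String baseUri) {
        return Jsoup.clean(untrustedHtml, baseUri, Whitelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p><a href='/page' onclick='steal()'>Link</a>"
                + "<script>alert(1)</script></p>";
        // The onclick attribute and the script element are not in the basic
        // white-list, so they should be absent from the result.
        System.out.println(sanitize(dirty, "http://example.com/"));
    }
}
```

With the basic white-list, the relative href should come back as an absolute URL, because relative links are not preserved by default.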
public static String clean (String bodyHtml, Whitelist whitelist)
Get safe HTML from untrusted input HTML by parsing it and filtering it through a white-list of permitted tags and attributes.
Parameter | Description |
---|---|
bodyHtml | untrusted input HTML |
whitelist | white-list of permitted HTML elements |
public static Connection connect (String url)
Creates a new Connection to a URL. Use it to fetch and parse an HTML page.
Usage examples:
Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();
Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();
Parameter | Description |
---|---|
url | URL to connect to. The protocol must be http or https. |
public static boolean isValid (String bodyHtml, Whitelist whitelist)
Test if the input HTML has only tags and attributes allowed by the Whitelist. Useful for form validation. The input HTML should still be run through the cleaner to set up enforced attributes and to tidy the output.
Parameter | Description |
---|---|
bodyHtml | HTML to test |
whitelist | whitelist to test against |
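The validate-then-clean pattern described above might look like the following sketch. The class name ValidateExample, its two helpers, and the sample inputs are hypothetical; Whitelist.basic() is used only as a convenient preset that does not permit img elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class ValidateExample {
    // Hypothetical helper: true when the input uses only tags and attributes
    // permitted by the basic white-list.
    static boolean isAcceptable(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Whitelist.basic());
    }

    // Hypothetical helper: valid input is still cleaned so that enforced
    // attributes are applied and the output is tidied.
    static String normalize(String bodyHtml) {
        return Jsoup.clean(bodyHtml, Whitelist.basic());
    }

    public static void main(String[] args) {
        String ok = "<p>Hello <b>world</b></p>";
        String bad = "<p>Hi <img src='http://example.com/x.png'></p>";
        System.out.println(isAcceptable(ok));
        System.out.println(isAcceptable(bad));
        System.out.println(normalize(bad));
    }
}
```

A form handler could reject the submission when isAcceptable returns false and store the output of normalize otherwise.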
public static Document parse (InputStream in, String charsetName, String baseUri)
Read an input stream, and parse it to a Document.
Parameter | Description |
---|---|
in | input stream to read. Make sure to close it after parsing. |
charsetName | (optional) character set of file contents. Set to null to determine it from the http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do). |
baseUri | The URL where the HTML was retrieved from, to resolve relative links against. |

Throws | Description |
---|---|
IOException | if the file could not be found, or read, or if the charsetName is invalid. |
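One way to follow the advice about closing the stream is try-with-resources, sketched below. The class name StreamParseExample, the helper parseUtf8, and the in-memory byte array standing in for a real stream are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StreamParseExample {
    // Hypothetical helper: parse UTF-8 encoded bytes through the InputStream overload.
    static Document parseUtf8(byte[] htmlBytes, String baseUri) throws IOException {
        // try-with-resources closes the stream once parsing has finished
        try (InputStream in = new ByteArrayInputStream(htmlBytes)) {
            return Jsoup.parse(in, "UTF-8", baseUri);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "<title>Caf\u00e9</title><p>Menu</p>"
                .getBytes(StandardCharsets.UTF_8);
        Document doc = parseUtf8(bytes, "http://example.com/");
        System.out.println(doc.title());
    }
}
```

Passing null instead of "UTF-8" would leave the charset to be detected from the markup, as the parameter description explains.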
public static Document parse (File in, String charsetName)
Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.
Parameter | Description |
---|---|
in | file to load HTML from |
charsetName | (optional) character set of file contents. Set to null to determine it from the http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do). |

Throws | Description |
---|---|
IOException | if the file could not be found, or read, or if the charsetName is invalid. |
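The two-argument file overload can be exercised with a temporary file, as in this sketch. The class name FileParseExample, the helper parseFile, and the temp-file name are assumptions; the null charset relies on the detection and UTF-8 fallback described above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FileParseExample {
    // Hypothetical helper: parse a local HTML file, leaving the charset to be
    // detected from a meta tag or to fall back to UTF-8.
    static Document parseFile(File htmlFile) throws IOException {
        return Jsoup.parse(htmlFile, null);
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("jsoup-example", ".html");
        try {
            Files.write(tmp.toPath(),
                    "<title>Hello</title><p>From disk</p>".getBytes(StandardCharsets.UTF_8));
            Document doc = parseFile(tmp);
            System.out.println(doc.title());
        } finally {
            tmp.delete(); // remove the temporary file
        }
    }
}
```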
public static Document parse (String html, String baseUri)
Parse HTML into a Document. The parser will make a sensible, balanced document tree out of any HTML.
Parameter | Description |
---|---|
html | HTML to parse |
baseUri | The URL where the HTML was retrieved from. Used to resolve relative URLs that occur before the HTML declares a <base href> tag to absolute URLs. |
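To illustrate how the base URI is used, the sketch below parses a string and resolves the first link with Element.absUrl. The class name BaseUriExample, the helper firstLinkAbsolute, and the sample URLs are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {
    // Hypothetical helper: return the first link's href resolved against baseUri,
    // or an empty string when the document has no link.
    static String firstLinkAbsolute(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='docs/index.html'>Docs</a>";
        System.out.println(firstLinkAbsolute(html, "http://example.com/app/"));
    }
}
```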
public static Document parse (URL url, int timeoutMillis)
Fetch a URL, and parse it as HTML. Provided for compatibility; in most cases use connect(String) instead.
The encoding character set is determined by the content-type header or http-equiv meta tag, or falls back to UTF-8.
Parameter | Description |
---|---|
url | URL to fetch (with a GET). The protocol must be http or https. |
timeoutMillis | Connection and read timeout, in milliseconds. If exceeded, IOException is thrown. |

Throws | Description |
---|---|
IOException | If the final server response != 200 OK (redirects are followed), or if there's an error reading the response stream. |
public static Document parse (File in, String charsetName, String baseUri)
Parse the contents of a file as HTML.
Parameter | Description |
---|---|
in | file to load HTML from |
charsetName | (optional) character set of file contents. Set to null to determine it from the http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do). |
baseUri | The URL where the HTML was retrieved from, to resolve relative links against. |

Throws | Description |
---|---|
IOException | if the file could not be found, or read, or if the charsetName is invalid. |
public static Document parse (String html)
Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
Parameter | Description |
---|---|
html | HTML to parse |
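The dependence on a <base href> tag can be seen in this sketch, which resolves the same relative link with and without one. The class name BaseTagExample, the helper firstLinkAbsolute, and the sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseTagExample {
    // Hypothetical helper: parse without a base URI and try to resolve the first link.
    // absUrl returns an empty string when no absolute URL can be formed.
    static String firstLinkAbsolute(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String withBase = "<html><head><base href='http://example.com/'></head>"
                + "<body><a href='/about'>About</a></body></html>";
        String withoutBase = "<a href='/about'>About</a>";
        System.out.println(firstLinkAbsolute(withBase));
        System.out.println(firstLinkAbsolute(withoutBase).isEmpty());
    }
}
```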
public static Document parseBodyFragment (String bodyHtml)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
Parameter | Description |
---|---|
bodyHtml | body HTML fragment |
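public static Document parseBodyFragment (String bodyHtml, String baseUri)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
Parameter | Description |
---|---|
bodyHtml | body HTML fragment |
baseUri | URL to resolve relative URLs against. |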
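A short sketch of body-fragment parsing follows: the fragment is placed in the body of a new Document, and unclosed tags are balanced by the parser. The class name FragmentExample, the helper balancedBody, and the sample fragment are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FragmentExample {
    // Hypothetical helper: parse a fragment as body content and return the
    // normalized inner HTML of the body element.
    static String balancedBody(String fragment, String baseUri) {
        Document doc = Jsoup.parseBodyFragment(fragment, baseUri);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Two unclosed paragraphs; the parser should close each one.
        System.out.println(balancedBody("<p>One<p>Two", "http://example.com/"));
    }
}
```

Because the result is a full Document, the usual selector methods work on it even though only a fragment was supplied.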