
public class Jsoup extends Object

java.lang.Object
↳	org.jsoup.Jsoup

Class Overview

The core public access point to the jsoup functionality.

Summary

Public Methods
static String	clean(String bodyHtml, String baseUri, Whitelist whitelist) Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
static String	clean(String bodyHtml, Whitelist whitelist) Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
static Connection	connect(String url) Creates a new `Connection` to a URL.
static boolean	isValid(String bodyHtml, Whitelist whitelist) Test if the input HTML has only tags and attributes allowed by the Whitelist.
static Document	parse(InputStream in, String charsetName, String baseUri) Read an input stream, and parse it to a Document.
static Document	parse(File in, String charsetName) Parse the contents of a file as HTML.
static Document	parse(String html, String baseUri) Parse HTML into a Document.
static Document	parse(URL url, int timeoutMillis) Fetch a URL, and parse it as HTML.
static Document	parse(File in, String charsetName, String baseUri) Parse the contents of a file as HTML.
static Document	parse(String html) Parse HTML into a Document.
static Document	parseBodyFragment(String bodyHtml) Parse a fragment of HTML, with the assumption that it forms the `body` of the HTML.
static Document	parseBodyFragment(String bodyHtml, String baseUri) Parse a fragment of HTML, with the assumption that it forms the `body` of the HTML.

Inherited Methods

From class java.lang.Object

Public Methods

public static String clean (String bodyHtml, String baseUri, Whitelist whitelist)

Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.

Parameters

bodyHtml	input untrusted HTML
baseUri	URL to resolve relative URLs against
whitelist	white-list of permitted HTML elements

Returns

safe HTML

public static String clean (String bodyHtml, Whitelist whitelist)

Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.

Parameters

bodyHtml	input untrusted HTML
whitelist	white-list of permitted HTML elements

Returns

safe HTML
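
As a rough illustration of the cleaning step described above, the sketch below passes an untrusted fragment through `Whitelist.basic()`. The class name `CleanExample`, the helper `sanitize`, and the sample input are made up for illustration. It assumes a jsoup version that still ships `org.jsoup.safety.Whitelist`, as this page does; more recent releases rename the class to `Safelist`.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanExample {
    // Hypothetical helper: keeps only the tags and attributes that the
    // basic whitelist permits and drops everything else.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Whitelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p>Hello <script>alert('x')</script><b>world</b></p>";
        // The script element is not on the basic whitelist, so it is removed.
        System.out.println(sanitize(unsafe));
    }
}
```

The three-argument overload works the same way, with the extra `baseUri` used to resolve relative URLs in the input.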

public static Connection connect (String url)

Creates a new Connection to a URL. Use to fetch and parse an HTML page.

Use examples:

Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();
Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();


  
Parameters

url	URL to connect to. The protocol must be http or https.

Returns

the connection. You can add data, cookies, and headers; set the user-agent, referrer, method; and then execute.

public static boolean isValid (String bodyHtml, Whitelist whitelist)

Test if the input HTML has only tags and attributes allowed by the Whitelist. Useful for form validation. The input HTML should still be run through the cleaner to set up enforced attributes, and to tidy the output.

Parameters

bodyHtml	HTML to test
whitelist	whitelist to test against

Returns

true if no tags or attributes were removed; false otherwise

See Also

clean(String, org.jsoup.safety.Whitelist)
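
The validate-then-clean pattern described above might look roughly like the following sketch. The class `ValidateExample` and both helper names are hypothetical, and `Whitelist.basic()` is just one possible whitelist. The point is that input which passes `isValid` is still sent through `clean`, so enforced attributes are applied and the output is tidied.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class ValidateExample {
    // Hypothetical check: true only if cleaning would remove nothing.
    static boolean isAcceptable(String html) {
        return Jsoup.isValid(html, Whitelist.basic());
    }

    // Hypothetical form handler: reject invalid input, then clean what is
    // accepted so that enforced attributes are set on the result.
    static String validateAndClean(String html) {
        if (!isAcceptable(html)) {
            throw new IllegalArgumentException("input contains disallowed HTML");
        }
        return Jsoup.clean(html, Whitelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("<b>ok</b>"));
        System.out.println(isAcceptable("<script>alert(1)</script>"));
        System.out.println(validateAndClean("<b>ok</b>"));
    }
}
```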
      
public static Document parse (InputStream in, String charsetName, String baseUri)

Read an input stream, and parse it to a Document.

Parameters

in	input stream to read. Make sure to close it after parsing.
charsetName	(optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri	The URL where the HTML was retrieved from, to resolve relative links against.

Returns

sane HTML

Throws

IOException	if the file could not be found, or read, or if the charsetName is invalid.
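
A minimal sketch of the stream overload follows, using try-with-resources so the stream is closed after parsing, as the parameter note advises. The class `StreamParseExample` and the helper `parseUtf8` are hypothetical, and the in-memory byte array only stands in for a real source such as a file or socket stream. Wrapping the IOException in an UncheckedIOException is a choice made for this example, not something the API requires.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StreamParseExample {
    // Hypothetical helper: encodes the HTML as UTF-8 bytes, then parses
    // those bytes back through the InputStream overload.
    static Document parseUtf8(String html, String baseUri) {
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        // try-with-resources closes the stream once parsing has finished.
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return Jsoup.parse(in, "UTF-8", baseUri);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseUtf8("<title>Example</title><p>Body text</p>",
                "http://example.com/");
        System.out.println(doc.title());
    }
}
```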

public static Document parse (File in, String charsetName)

Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.

Parameters

in	file to load HTML from
charsetName	(optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).

Returns

sane HTML

Throws

IOException	if the file could not be found, or read, or if the charsetName is invalid.

See Also

parse(File, String, String)
      
public static Document parse (String html, String baseUri)

Parse HTML into a Document. The parser will make a sensible, balanced document tree out of any HTML.

Parameters

html	HTML to parse
baseUri	The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag.

Returns

sane HTML
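
To show what the baseUri parameter does in practice, the sketch below parses a fragment containing a relative link and resolves it with `Element.absUrl`. The class `BaseUriExample`, the helper `firstLinkAbsolute`, and the example.com URLs are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {
    // Hypothetical helper: absolute URL of the first link in the HTML,
    // or an empty string when the HTML contains no link.
    static String firstLinkAbsolute(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        // absUrl resolves the attribute value against the base URI.
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/docs/page.html'>Docs</a>";
        // prints http://example.com/docs/page.html
        System.out.println(firstLinkAbsolute(html, "http://example.com/"));
    }
}
```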

public static Document parse (URL url, int timeoutMillis)

Fetch a URL, and parse it as HTML. Provided for compatibility; in most cases use connect(String) instead.

The encoding character set is determined by the content-type header or http-equiv meta tag, or falls back to UTF-8.

Parameters

url	URL to fetch (with a GET). The protocol must be http or https.
timeoutMillis	Connection and read timeout, in milliseconds. If exceeded, IOException is thrown.

Returns

The parsed HTML.

Throws

IOException	If the final server response != 200 OK (redirects are followed), or if there's an error reading the response stream.

See Also

connect(String)
      
public static Document parse (File in, String charsetName, String baseUri)

Parse the contents of a file as HTML.

Parameters

in	file to load HTML from
charsetName	(optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri	The URL where the HTML was retrieved from, to resolve relative links against.

Returns

sane HTML

Throws

IOException	if the file could not be found, or read, or if the charsetName is invalid.

public static Document parse (String html)

Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.

Parameters

html	HTML to parse

Returns

sane HTML

See Also

parse(String, String)
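
The effect of omitting the base URI can be sketched as below: a relative link cannot be made absolute unless the HTML itself declares a `<base href>`. The class `NoBaseUriExample`, the helper name, and the example.com URL are hypothetical. The sketch relies on `Element.absUrl` returning an empty string when no absolute URL can be formed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NoBaseUriExample {
    // Hypothetical helper: absolute URL of the first link, parsed without
    // an explicit base URI.
    static String firstLinkAbsolute(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // No base URI anywhere, so the relative link cannot be resolved.
        String withoutBase = "<a href='/page.html'>Page</a>";
        // The document supplies its own base through a <base href> tag.
        String withBase = "<html><head><base href='http://example.com/'></head>"
                + "<body><a href='/page.html'>Page</a></body></html>";
        System.out.println("[" + firstLinkAbsolute(withoutBase) + "]");
        System.out.println("[" + firstLinkAbsolute(withBase) + "]");
    }
}
```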
      
public static Document parseBodyFragment (String bodyHtml)

Parse a fragment of HTML, with the assumption that it forms the body of the HTML.

Parameters

bodyHtml	body HTML fragment

Returns

sane HTML document

See Also

Document.body()
      
public static Document parseBodyFragment (String bodyHtml, String baseUri)

Parse a fragment of HTML, with the assumption that it forms the body of the HTML.

Parameters

bodyHtml	body HTML fragment
baseUri	URL to resolve relative URLs against.

Returns

sane HTML document

See Also

Document.body()
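
As a small sketch of how a parsed fragment is usually read back, the example below parses two paragraphs and inspects them through `Document.body()`. The class `BodyFragmentExample` and its helper are hypothetical names chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BodyFragmentExample {
    // Hypothetical helper: number of top-level elements that the fragment
    // contributes to the body of the resulting document.
    static int topLevelElementCount(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        Element body = doc.body();
        return body.children().size();
    }

    public static void main(String[] args) {
        // The fragment is wrapped in a full document; its two paragraphs
        // become direct children of the body element.
        System.out.println(topLevelElementCount("<p>One</p><p>Two</p>"));
    }
}
```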
      
Generated by Doclava.
