This set of libraries includes a couple of HTML parsers and classes for working with HTML on the Java™ platform.
Reader re = ...
// Create the document
HTMLDoc doc = new HTMLDoc();
// …
Loofah is a general library for manipulating and transforming HTML/XML documents and fragments. It's built on top of Nokogiri and libxml2, so it's fast and has a nice API. Loofah excels at HTML sanitization …
Jerry is a jQuery in Java. Jerry is a fast and concise Java Library that simplifies HTML document parsing, traversing and manipulating. Jerry is designed to change the way that you parse HTML content. …
This utility is a single class, HTMLFilter, which can be used to parse user-submitted input and sanitize it against potential cross-site scripting attacks, malicious HTML, or simply badly formed HTML. …
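The whitelist idea behind this kind of filter can be sketched with Python's standard library alone. This is a minimal illustration, not HTMLFilter's API: the names `ALLOWED_TAGS`, `WhitelistSanitizer`, and `sanitize` are hypothetical, all attributes are dropped, and a real sanitizer needs far more care (URL schemes, nesting rules, encodings).

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: only these tags survive, and without attributes.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}
# Elements whose text content is discarded entirely.
DROP_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.parts.append(f"<{tag}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(escape(data))  # re-escape all text


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

Calling `sanitize('<b onclick="x()">hi</b>')` keeps the `<b>` element but strips the event handler attribute.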
RenderSnake is a Java library for creating components and pages that produce HTML using only Java. Its purpose is to support the creation of Web applications that are more maintainable, allow for easier …
A powerful Python module to generate HTML code from a Python script. PyH allows you to simply generate HTML pages from within your Python code in an object-oriented fashion. Each HTML tag is an object that …
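The "each tag is an object" approach can be illustrated with a small sketch. This is not PyH's actual API; the `Tag` class and its `add` and `render` methods are hypothetical names chosen for the example.

```python
from html import escape


class Tag:
    """A hypothetical HTML element that renders itself and its children."""

    def __init__(self, name, *children, **attrs):
        self.name = name
        self.children = list(children)
        self.attrs = attrs

    def add(self, child):
        # Append a child (a Tag or plain text) and return it for chaining.
        self.children.append(child)
        return child

    def render(self):
        attrs = "".join(
            f' {key}="{escape(str(value))}"' for key, value in self.attrs.items()
        )
        inner = "".join(
            child.render() if isinstance(child, Tag) else escape(str(child))
            for child in self.children
        )
        return f"<{self.name}{attrs}>{inner}</{self.name}>"


page = Tag("div", Tag("h1", "Hi & bye"), id="main")
page.add(Tag("p", "Generated from Python objects."))
print(page.render())
```

Text children are escaped on output, so markup characters in strings cannot break the generated page.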
MozillaParser is a Java HTML parser based on Mozilla's HTML parser. It acts as a bridge from Java classes to Mozilla's classes and outputs a Java Document object from raw (and dirty) HTML input. Real …
Scrender - A Website Screenshot Renderer project ("scrender"). Scrender is a Java library that performs screen capturing of web sites. It essentially captures the web site's appearance as it is being …
NodeHtmlParser: A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS (yes, despite the name it works just fine in any modern browser). The parser can handle streams (chunked data …
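Chunked (streaming) parsing in general can be sketched with Python's standard `html.parser`, whose `feed()` method accepts input in arbitrary pieces. This is an illustration of the technique only, not NodeHtmlParser's interface; `LinkCollector` and `collect_links` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags as data arrives in chunks."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collect_links(chunks):
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)  # an incomplete tag is buffered until the next chunk
    parser.close()
    return parser.hrefs


# The start tag is split across chunk boundaries on purpose.
print(collect_links(['<a hr', 'ef="/x">li', 'nk</a>']))
```

Because the parser buffers an incomplete tag between calls, chunk boundaries can fall anywhere, including in the middle of an attribute.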
Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides …
libhtml is a minimal, open source (ISC-licensed) C library for parsing, serialising, and manipulating HTML-4.01-strict and XHTML-1.0-strict documents. You may enjoy this library if you're interested in …
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist …
Hpricot is a standalone library. It requires no other libraries. Just Ruby! While priding itself on speed, Hpricot works hard to sort out bad HTML and pays a small penalty in order to get that right. So …
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful: Beautiful Soup won't choke if you give it bad markup. It yields …
QueryPath is a PHP library resembling jQuery. It implements much of the jQuery API, but it is oriented toward server-side programming. Use it to: parse XML and HTML; use CSS 3 selectors to find things; retrieve …
Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize. mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so: any URL can be opened, …
Cobra: Java HTML Renderer & Parser. Cobra is a pure Java HTML renderer and DOM parser that is being developed to support HTML 4, JavaScript and CSS 2. Cobra can be used as a JavaScript-aware and CSS-aware …
This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you actually don't HAVE to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that …
Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers. Parses valid and invalid HTML documents to a tree. Support …
An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way! Requires PHP 5+. Supports invalid HTML. Find tags on an HTML page with selectors just like jQuery. Extract contents from HTML …
htmlcxx is a simple non-validating CSS1 and HTML parser for C++. Although there are several other HTML parsers available, htmlcxx has some characteristics that make it unique: STL-like navigation of DOM …
Implementation of an HTML and JavaScript context scanner with no lookahead. Its purpose is to scan an HTML document and provide context information at any point within the input stream. An example of a …
HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary …
JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides …
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files …
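The tag-balancing idea can be sketched with a stack of open elements. The sketch below uses Python's standard `html.parser` and is not NekoHTML's algorithm or API: `TagBalancer`, `balance`, and the `VOID` set are hypothetical, attributes are omitted for brevity, and real balancers also apply element-specific nesting rules.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag (a partial, illustrative list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TagBalancer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # names of currently open elements

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}>")  # attributes omitted for brevity
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag with no matching open element: ignore it
        # Close every element opened after the matching one, then the match.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance(markup):
    balancer = TagBalancer()
    balancer.feed(markup)
    balancer.close()
    while balancer.stack:  # close whatever is still open at end of input
        balancer.out.append(f"</{balancer.stack.pop()}>")
    return "".join(balancer.out)


print(balance("<div><b>bold<i>both</div>tail</i>"))
```

Once the output is balanced in this way, it is well-formed enough to be handed to tools that expect properly nested, XML-style markup.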
jsoup: Java HTML Parser. jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods …
HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy-to-use JavaBeans …