HTML Content
HTML stands for HyperText Markup Language, the standard markup language for creating web pages. An HTML document consists of content marked up with tags that appear in angle brackets, such as <html>. Tags typically come in pairs, with one on each end of the content they enclose (such as <h1> and </h1>). The first tag in a pair is the start tag and the second is the end tag (also known as the opening tag and the closing tag). Some tags represent empty elements and don't come in pairs, such as <img>.
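To make the start tag and end tag distinction concrete, here is a minimal sketch in Python using the standard library's html.parser module. It is only an illustration and has nothing to do with how Content Grabber works internally. The TagLogger class, the log_tags function, and the sample markup are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each start tag, end tag, and text run in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.events.append(("text", data.strip()))


def log_tags(html):
    """Return the list of (kind, value) events found in the markup."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()
    return parser.events


# Hypothetical sample: one paired element and one empty element.
SAMPLE = '<h1>Welcome</h1><img src="logo.png">'
events = log_tags(SAMPLE)

for kind, value in events:
    print(kind, value)
```

The <h1> element produces a start event, a text event, and an end event, while the empty <img> element produces only a start event.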
The purpose of a web browser is to read HTML documents and compose them into visible web pages. The browser does not display the HTML tags themselves, but interprets them to decide how the content they enclose is displayed on the page. HTML describes the structure of a web page semantically, with cues for presentation. This distinguishes it as a markup language rather than a programming language.
HTML elements are the building blocks of any website, including embedded images and objects as well as interactive forms. HTML provides the structure for a page by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes, and other items. An HTML document can also contain scripts written in languages such as JavaScript, which control the behavior of the web page.
Content Grabber uses XPath to select specific HTML tags and then extracts content from those tags. An HTML tag can contain both text and attributes. For example, an HTML tag that displays an image will contain a src attribute that specifies the URL of the image to display. Content Grabber can extract both tag text and tag attributes, and may perform certain actions on the content it extracts. For example, it may extract the src attribute from an <img> HTML tag and then use the URL to download the image.
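The idea of selecting tags with XPath and then reading their text or attributes can be sketched outside Content Grabber as well. The following Python example is an illustration only, not Content Grabber's own implementation or API. It uses the standard library's xml.etree.ElementTree module, which supports only a subset of XPath and requires well-formed markup, so the sample page is written with a self-closing <img /> tag. Real-world HTML would usually need a more tolerant parser. The extract function, the PAGE sample, and the image paths are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this sketch.
PAGE = """
<html>
  <body>
    <h1>Product list</h1>
    <img src="/images/widget.png" alt="Widget" />
    <img src="/images/gadget.png" alt="Gadget" />
  </body>
</html>
"""


def extract(page, xpath, attribute=None):
    """Select elements by XPath and return their text or one attribute."""
    root = ET.fromstring(page.strip())
    results = []
    for element in root.findall(xpath):
        if attribute is None:
            # Tag text: the content between the start tag and the end tag.
            results.append((element.text or "").strip())
        else:
            # Tag attribute: for example the src attribute of an <img> tag.
            results.append(element.get(attribute))
    return results


headings = extract(PAGE, ".//h1")
image_urls = extract(PAGE, ".//img", attribute="src")

print(headings)
print(image_urls)
```

The first call extracts tag text and the second extracts a tag attribute. A scraping tool could then pass each extracted URL to a download step.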
There are many websites that have HTML tutorials. Here is one example:
http://www.w3schools.com/html/html_intro.asp