Text Extraction: HTML API
AlchemyAPI provides easy-to-use facilities for extracting page text and title information from any posted (uploaded) HTML file. Post (upload) any HTML content directly for immediate processing.
An HTML page-cleaning facility is provided, which normalizes and cleans HTML content (removing ads, navigation links, and other unimportant content), enabling extraction of only the important article text.
These API calls may be utilized to process posted (uploaded) webpages and other HTML content. If you are processing content hosted on a publicly accessible website, consider using our URL processing calls instead.
| API call | description |
|---|---|
| Text Extraction (cleaned) | Cleans the specified HTML content, removing advertisements, navigation links, and other unimportant content. Extracts the primary article / page text from an HTML document. |
| Text Extraction (raw) | Extracts the raw text content from an HTML document, including advertisements, navigation links, and all other content. |
| Title Extraction | Extracts the title from an HTML document. |
API Call: HTMLGetText
Description: The HTMLGetText call is used to extract the primary page / article text from a posted web page. AlchemyAPI will extract text from the posted HTML document structure, ignoring page navigation, advertisements, and other undesirable page content.
Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetText
Parameters:
| http argument | parameter description |
|---|---|
| apikey | your private api key (required parameter) |
| html | HTML document content (must be uri-argument encoded) (required parameter) |
| url | HTML document URL (must be uri-argument encoded) (optional parameter, for response tracking purposes) |
| useMetadata | whether to use description information embedded in web page meta-data. Possible values: 1 - enable (default), 0 - disable. (optional parameter) |
| extractLinks | whether to include hyperlinks in the extracted web page text. Possible values: 1 - enable, 0 - disable (default). (optional parameter) |
| outputMode | desired API output format. Possible values: xml (default), json, rdf. (optional parameter) |
| jsonp | desired JSONP callback (optional parameter, requires "outputMode" to be set to json) |
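The parameter table above can be turned into a request body with standard form encoding. The following is a minimal Python sketch, not an official client: the helper name `build_get_text_body`, its keyword arguments, and the sample values are assumptions made for illustration. Only the HTTP argument names come from the table.

```python
from urllib.parse import parse_qs, urlencode


def build_get_text_body(apikey, html, url=None, use_metadata=True,
                        extract_links=False, output_mode="xml", jsonp=None):
    """Build a form-encoded body for HTMLGetText (hypothetical helper)."""
    # The table states that jsonp requires outputMode to be json.
    if jsonp is not None and output_mode != "json":
        raise ValueError('jsonp requires outputMode to be "json"')
    params = {
        "apikey": apikey,
        "html": html,
        "useMetadata": 1 if use_metadata else 0,
        "extractLinks": 1 if extract_links else 0,
        "outputMode": output_mode,
    }
    # Optional arguments are only sent when the caller supplies them.
    if url is not None:
        params["url"] = url
    if jsonp is not None:
        params["jsonp"] = jsonp
    # urlencode percent-encodes every value, which covers the
    # "must be uri-argument encoded" requirement for html and url.
    return urlencode(params)


sample_html = "<html><body><p>Fish & chips</p></body></html>"
body = build_get_text_body("YOUR_API_KEY", sample_html,
                           url="http://example.com/menu?id=1")

# Decoding the body again recovers the original values unchanged.
decoded = parse_qs(body)
```

Encoding through `urlencode` rather than by hand keeps characters such as `&` and `<` inside the HTML from being mistaken for argument separators.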
Response Format (XML):
<results>
<status>REQUEST_STATUS</status>
<url>DOCUMENT_URL</url>
<text>EXTRACTED_TEXT</text>
</results>
Response Format (JSON):
{
"status": "REQUEST_STATUS",
"url": "DOCUMENT_URL",
"text": "EXTRACTED_TEXT"
}
Response Format (RDF):
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:DocText>EXTRACTED_TEXT</aapi:DocText>
</rdf:Description>
</rdf:RDF>
Response Fields:
| field name | field description |
|---|---|
| status | success / failure status indicating whether the request was processed. Possible values: OK, ERROR |
| url | the HTTP URL for which information was requested. |
| text | the extracted web page text. |
| statusInfo | failure status information (sent only if "status" == "ERROR"). Possible values: invalid-api-key, page-is-not-html |
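As an illustration of how the response fields might be consumed, here is a small Python sketch that reads a JSON response and separates the success and failure cases. The function `extract_text` and the exception class `AlchemyRequestError` are hypothetical names introduced for this example, and the sample payloads are made up to match the documented field layout.

```python
import json


class AlchemyRequestError(Exception):
    """Hypothetical error type carrying the statusInfo value."""


def extract_text(response_body):
    """Return the "text" field of a JSON response, or raise on ERROR."""
    data = json.loads(response_body)
    if data.get("status") != "OK":
        # statusInfo is only sent when status is ERROR, so fall back
        # to a generic label if it is missing for any reason.
        raise AlchemyRequestError(data.get("statusInfo", "unknown-error"))
    return data["text"]


ok_response = '{"status": "OK", "url": "http://example.com/", "text": "Article body"}'
error_response = '{"status": "ERROR", "statusInfo": "invalid-api-key"}'

article = extract_text(ok_response)
```

Checking `status` before touching `text` avoids a `KeyError` on failed requests, where the `text` field may not be present.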
API Notes:
1. Calls to HTMLGetText should be made using HTTP POST.
2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
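The three notes above can be combined into one request-building step. The sketch below uses Python's standard `urllib.request.Request`. The helper name `build_get_text_request` is hypothetical, and two details are assumptions rather than documented behavior: that a kilobyte means 1024 bytes, and that the limit applies to the UTF-8 encoded HTML before form encoding.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://access.alchemyapi.com/calls/html/HTMLGetText"
# Assumption: 600 kilobytes is interpreted as 600 * 1024 bytes.
MAX_HTML_BYTES = 600 * 1024


def build_get_text_request(apikey, html):
    """Build (but do not send) a POST request for HTMLGetText."""
    # Reject oversized documents locally instead of waiting for a
    # "content-exceeds-size-limit" error response.
    if len(html.encode("utf-8")) > MAX_HTML_BYTES:
        raise ValueError("HTML document exceeds the 600 kilobyte limit")
    body = urlencode({"apikey": apikey, "html": html,
                      "outputMode": "json"}).encode("ascii")
    return Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_get_text_request("YOUR_API_KEY", "<p>hi</p>")
```

Passing the resulting object to `urllib.request.urlopen` would send it. The sketch stops short of that so it can be inspected without network access.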
API Call: HTMLGetRawText
Description: The HTMLGetRawText call is utilized to extract all text from a posted web page. AlchemyAPI will extract text from the posted HTML document structure, including page navigation, advertisements, and other page content. To ignore this content, please use the HTMLGetText call.
Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetRawText
Parameters:
| http argument | parameter description |
|---|---|
| apikey | your private api key (required parameter) |
| html | HTML document content (must be uri-argument encoded) (required parameter) |
| url | HTML document URL (must be uri-argument encoded) (optional parameter, for response tracking purposes) |
| outputMode | desired API output format. Possible values: xml (default), json, rdf. (optional parameter) |
Response Format (XML):
<results>
<status>REQUEST_STATUS</status>
<url>DOCUMENT_URL</url>
<text>EXTRACTED_TEXT</text>
</results>
Response Format (JSON):
{
"status": "REQUEST_STATUS",
"url": "DOCUMENT_URL",
"text": "EXTRACTED_TEXT"
}
Response Format (RDF):
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:DocText>EXTRACTED_TEXT</aapi:DocText>
</rdf:Description>
</rdf:RDF>
Response Fields:
| field name | field description |
|---|---|
| status | success / failure status indicating whether the request was processed. Possible values: OK, ERROR |
| url | the HTTP URL for which information was requested. |
| text | the extracted web page text. |
| statusInfo | failure status information (sent only if "status" == "ERROR"). Possible values: invalid-api-key, page-is-not-html |
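For the default XML output mode, the response fields listed above map directly onto child elements of `<results>`. A minimal Python sketch of reading them with the standard library follows. The function name `parse_xml_results` and the sample response are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def parse_xml_results(xml_body):
    """Return the child elements of <results> as a plain dict."""
    root = ET.fromstring(xml_body)
    if root.tag != "results":
        raise ValueError("unexpected root element: " + root.tag)
    # Empty elements have text None, so normalize those to "".
    return {child.tag: (child.text or "") for child in root}


sample_xml = (
    "<results>"
    "<status>OK</status>"
    "<url>http://example.com/</url>"
    "<text>Home About Contact Article body</text>"
    "</results>"
)

fields = parse_xml_results(sample_xml)
raw_text = fields["text"] if fields["status"] == "OK" else None
```

Because HTMLGetRawText keeps navigation and advertisement text, the `text` value in this made-up sample includes menu labels alongside the article body.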
API Notes:
1. Calls to HTMLGetRawText should be made using HTTP POST.
2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
API Call: HTMLGetTitle
Description: The HTMLGetTitle call is utilized to extract title information from a posted web page. The posted HTML document is processed, extracting any title information.
Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetTitle
Parameters:
| http argument | parameter description |
|---|---|
| html | HTML document content (must be uri-argument encoded) (required parameter) |
| apikey | your private api key (required parameter) |
| url | uri-argument encoded http url (optional parameter) |
| useMetadata | whether to use title information embedded in web page meta-data. Possible values: 1 - enable (default), 0 - disable. (optional parameter) |
| outputMode | desired API output format. Possible values: xml (default), json, rdf. (optional parameter) |
Response Format (XML):
<results>
<status>REQUEST_STATUS</status>
<url>DOCUMENT_URL</url>
<title>EXTRACTED_TITLE</title>
</results>
Response Format (JSON):
{
"status": "REQUEST_STATUS",
"url": "DOCUMENT_URL",
"title": "EXTRACTED_TITLE"
}
Response Format (RDF):
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:DocTitle>EXTRACTED_TITLE</aapi:DocTitle>
</rdf:Description>
</rdf:RDF>
Response Fields:
| field name | field description |
|---|---|
| status | success / failure status indicating whether the request was processed. Possible values: OK, ERROR |
| url | the HTTP URL for which information was requested. |
| title | the detected web page title. |
| statusInfo | failure status information (sent only if "status" == "ERROR"). Possible values: invalid-api-key, page-is-not-html |
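When `outputMode` is set to rdf, the same fields arrive as namespaced elements, so lookups need the namespace URIs shown in the RDF response format. The Python sketch below is one possible way to pull out the title. The function name `title_from_rdf`, the sample document, and the `abc123` hash are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
AAPI_NS = "http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"


def title_from_rdf(rdf_body):
    """Return the aapi:DocTitle value from an RDF response."""
    root = ET.fromstring(rdf_body)
    # ElementTree addresses namespaced tags as "{namespace-uri}local".
    description = root.find("{%s}Description" % RDF_NS)
    if description is None:
        raise ValueError("no rdf:Description element found")
    status = description.findtext("{%s}ResultStatus" % AAPI_NS)
    if status != "OK":
        raise ValueError("request failed with status %r" % status)
    return description.findtext("{%s}DocTitle" % AAPI_NS)


sample_rdf = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
 xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="abc123">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>OK</aapi:ResultStatus>
<aapi:URL>http://example.com/</aapi:URL>
<aapi:DocTitle>Example Title</aapi:DocTitle>
</rdf:Description>
</rdf:RDF>"""

title = title_from_rdf(sample_rdf)
```

Matching on full namespace URIs rather than on the `aapi:` prefix keeps the lookup working even if a response were to bind the same namespace to a different prefix.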
API Notes:
1. Calls to HTMLGetTitle should be made using HTTP POST.
2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
