AlchemyAPI provides easy-to-use facilities for extracting text, titles, and other information from your publicly accessible web content. These URL processing calls automatically fetch the requested webpage and extract the desired content (article text or title). An HTML page cleaning facility normalizes and cleans HTML content (removing ads, navigation links, and other unimportant content), so that only the important article text is extracted. These API calls can be used to process hosted webpages, blogs, and other publicly accessible Internet content. If you are processing content that is not hosted on a public web server, use our HTML API calls instead.
| api call | description |
|---|---|
| Text Extraction (cleaned) | Cleans the specified HTML content, removing advertisements, navigation links, and other unimportant content. Extracts the primary article / page text from an HTML document. |
| Text Extraction (raw) | Extracts the raw text content from an HTML document, including advertisements, navigation links, and all other content. |
| Title Extraction | Extracts the title from an HTML document. |
Description: The URLGetText call extracts the primary page / article text from a web page. AlchemyAPI downloads the requested URL, extracts text from the HTML document structure, and ignores page navigation, advertisements, and other undesirable page content.
Endpoint: http://access.alchemyapi.com/calls/url/URLGetText
| http argument | parameter description |
|---|---|
| url | http url (must be uri-argument encoded) (required parameter) |
| apikey | your private api key (required parameter) |
| useMetadata | whether to use description information embedded in web page meta-data. Possible values: 1 - enable (default), 0 - disable. (optional parameter) |
| extractLinks | whether to include hyperlinks in the extracted web page text. Possible values: 1 - enable, 0 - disable (default). (optional parameter) |
| outputMode | desired API output format. Possible values: xml (default), json, rdf. (optional parameter) |
| jsonp | desired JSONP callback (optional parameter, requires "outputMode" to be set to json) |
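The parameters above can be combined into a request URL roughly as follows. This is a minimal sketch, not an official client: the helper name `build_get_text_url`, the example page URL, and the placeholder key `YOUR_API_KEY` are assumptions made for illustration. It only builds the URL; sending the request is left to whatever HTTP client you use.

```python
from urllib.parse import urlencode

# Endpoint as listed in this section.
URL_GET_TEXT_ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetText"


def build_get_text_url(page_url, api_key, use_metadata=True,
                       extract_links=False, output_mode="json"):
    """Build a URLGetText request URL (hypothetical helper, not part of AlchemyAPI)."""
    if output_mode not in ("xml", "json", "rdf"):
        raise ValueError("outputMode must be xml, json, or rdf")
    params = {
        "url": page_url,  # urlencode() percent-encodes this value
        "apikey": api_key,
        "useMetadata": 1 if use_metadata else 0,
        "extractLinks": 1 if extract_links else 0,
        "outputMode": output_mode,
    }
    return URL_GET_TEXT_ENDPOINT + "?" + urlencode(params)


# Placeholder values; substitute your own page URL and API key.
request_url = build_get_text_url("http://www.example.com/a page?id=1", "YOUR_API_KEY")
print(request_url)
```

Passing the page URL through `urlencode` is what satisfies the "must be uri-argument encoded" requirement: characters such as `:`, `/`, `?`, and spaces in the target URL are escaped so they cannot be confused with the API call's own query string.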
<results>
<status>REQUEST_STATUS</status>
<url>REQUESTED_URL</url>
<text>EXTRACTED_TEXT</text>
</results>
{
"status": "REQUEST_STATUS",
"url": "REQUESTED_URL",
"text": "EXTRACTED_TEXT"
}
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:DocText>EXTRACTED_TEXT</aapi:DocText>
</rdf:Description>
</rdf:RDF>
| field name | field description |
|---|---|
| status | success / failure status indicating whether the request was processed. Possible values: OK, ERROR |
| url | http url for which information was requested. |
| text | the extracted web page text. |
| statusInfo | failure status information (sent only if "status" == "ERROR"). Possible values: invalid-api-key, cannot-retrieve, page-is-not-html |
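A caller will usually want to check `status` before reading `text`. The sketch below shows one way to do that for the JSON output format. The `AlchemyError` exception, the `extract_text` helper, and both sample response bodies are hypothetical, written to match the field descriptions above rather than captured from the live service.

```python
import json


class AlchemyError(Exception):
    """Hypothetical exception for responses whose status is not OK."""


def extract_text(response_body):
    """Return the "text" field of a JSON response, or raise AlchemyError."""
    data = json.loads(response_body)
    if data.get("status") != "OK":
        # statusInfo is sent only when status is ERROR.
        raise AlchemyError(data.get("statusInfo", "unknown error"))
    return data["text"]


# Illustrative bodies shaped like the JSON response format in this section.
ok_body = '{"status": "OK", "url": "http://www.example.com/", "text": "Hello world"}'
error_body = '{"status": "ERROR", "statusInfo": "cannot-retrieve"}'

print(extract_text(ok_body))
try:
    extract_text(error_body)
except AlchemyError as exc:
    print("request failed:", exc)
```

Raising on any status other than `OK` keeps failure handling in one place, and the `statusInfo` value (for example `invalid-api-key` or `page-is-not-html`) tells the caller whether a retry is likely to help.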
XML: http://access.alchemyapi.com/calls/...
RDF: http://access.alchemyapi.com/calls/...
| http argument | parameter description |
|---|---|
| url | http url (must be uri-argument encoded) (required parameter) |
| apikey | your private api key (required parameter) |
| outputMode | desired API output format. Possible values: xml (default), json, rdf. (optional parameter) |
<results>
<status>REQUEST_STATUS</status>
<url>REQUESTED_URL</url>
<text>EXTRACTED_TEXT</text>
</results>
{
"status": "REQUEST_STATUS",
"url": "REQUESTED_URL",
"text": "EXTRACTED_TEXT"
}
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:DocText>EXTRACTED_TEXT</aapi:DocText>
</rdf:Description>
</rdf:RDF>
| field name | field description |
|---|---|
| status | success / failure status indicating whether the request was processed. Possible values: OK, ERROR |
| url | http url for which information was requested. |
| text | the extracted web page text. |
| statusInfo | failure status information (sent only if "status" == "ERROR"). Possible values: invalid-api-key, cannot-retrieve, page-is-not-html |
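When `outputMode` is set to rdf, the same fields arrive as namespaced elements (`aapi:ResultStatus`, `aapi:DocText`). The following sketch reads them with Python's standard `xml.etree.ElementTree`. The helper name `read_rdf_text` and the sample document are assumptions; the namespace URI is copied from the RDF response format shown in this section.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the RDF response format in this section.
NAMESPACES = {"aapi": "http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"}


def read_rdf_text(rdf_body):
    """Return (status, text) from an RDF response (hypothetical helper)."""
    root = ET.fromstring(rdf_body)
    status = root.findtext(".//aapi:ResultStatus", namespaces=NAMESPACES)
    text = root.findtext(".//aapi:DocText", namespaces=NAMESPACES)
    return status, text


# Illustrative response with the placeholders filled in by hand.
sample_rdf = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
    xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
  <rdf:Description rdf:ID="d1">
    <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
    <aapi:ResultStatus>OK</aapi:ResultStatus>
    <aapi:URL>http://www.example.com/</aapi:URL>
    <aapi:DocText>Example page text</aapi:DocText>
  </rdf:Description>
</rdf:RDF>"""

status, text = read_rdf_text(sample_rdf)
print(status, text)
```

A plain XML parser is enough here because only two literal values are needed; a full RDF library would be the better choice if the response were to be merged into a larger graph.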
XML: http://access.alchemyapi.com/calls/...
RDF: http://access.alchemyapi.com/calls/...
Description: The URLGetTitle call extracts title information from a web page. The requested URL is downloaded, and the retrieved HTML document is processed to extract any title information.
Endpoint: http://access.alchemyapi.com/calls/url/URLGetTitle
| http argument | parameter description |
|---|---|
| url | uri-argument encoded http url (required parameter) |
| apikey | your private api key (required parameter) |
| useMetadata | whether to use title information embedded in web page meta-data. Possible values: 1 - enable (default), 0 - disable. (optional parameter) |
| outputMode | desired API output format. Possible values: xml (default), json, rdf. (optional parameter) |
<results>
<status>REQUEST_STATUS</status>
<url>REQUESTED_URL</url>
<title>EXTRACTED_TITLE</title>
</results>
{
"status": "REQUEST_STATUS",
"url": "REQUESTED_URL",
"title": "EXTRACTED_TITLE"
}
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:DocTitle>EXTRACTED_TITLE</aapi:DocTitle>
</rdf:Description>
</rdf:RDF>
| field name | field description |
|---|---|
| status | success / failure status indicating whether the request was processed. Possible values: OK, ERROR |
| url | http url for which information was requested. |
| title | the detected web page title. |
| statusInfo | failure status information (sent only if "status" == "ERROR"). Possible values: invalid-api-key, cannot-retrieve, page-is-not-html |
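For the default xml output mode, the title can be read from the `<results>` element as sketched below. The helper name `read_title` and both sample bodies are hypothetical. In particular, the sketch assumes that on failure `statusInfo` appears as a child element of `<results>`, which the field table implies but the XML sample does not show.

```python
import xml.etree.ElementTree as ET


def read_title(xml_body):
    """Return (title, error) from an XML response (hypothetical helper).

    Exactly one of the two values is None.
    """
    root = ET.fromstring(xml_body)
    if root.findtext("status") != "OK":
        # Assumption: statusInfo is a direct child of <results> on failure.
        return None, root.findtext("statusInfo", default="unknown error")
    return root.findtext("title"), None


# Illustrative bodies shaped like the XML response format in this section.
ok_xml = (
    "<results><status>OK</status>"
    "<url>http://www.example.com/</url>"
    "<title>Example Domain</title></results>"
)
error_xml = (
    "<results><status>ERROR</status>"
    "<statusInfo>page-is-not-html</statusInfo></results>"
)

print(read_title(ok_xml))
print(read_title(error_xml))
```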
XML: http://access.alchemyapi.com/calls/...
RDF: http://access.alchemyapi.com/calls/...