Text Extraction: Web API

AlchemyAPI provides easy-to-use facilities for extracting text, titles, and other information from your publicly accessible web-based content. These URL processing calls automatically fetch the requested web page and extract the desired content (article text or title).

An HTML page-cleaning facility is provided, which normalizes and cleans HTML content (removing ads, navigation links, and other unimportant content), so that only the important article text is extracted.

These API calls may be utilized to process hosted webpages, blogs, and other publicly-accessible Internet content. If you are processing content that is not hosted on a public webserver, use our HTML API calls instead.

Text Extraction (cleaned): Cleans the specified HTML content, removing advertisements, navigation links, and other unimportant content, and extracts the primary article / page text from an HTML document.
Text Extraction (raw): Extracts the raw text content from an HTML document, including advertisements, navigation links, and all other content.
Title Extraction: Extracts the title from an HTML document.

API Call: URLGetText

Description: The URLGetText call extracts the primary page / article text from a web page. AlchemyAPI downloads the requested URL, extracts text from the HTML document structure, and ignores page navigation, advertisements, and other undesirable page content.

Endpoint: http://access.alchemyapi.com/calls/url/URLGetText

Parameters:

HTTP argument (required / optional): parameter description
url (required): HTTP URL (must be URI-argument encoded)
apikey (required): your private API key
useMetadata (optional): whether to use description information embedded in web page meta-data
    Possible values: 1 - enable (default), 0 - disable
extractLinks (optional): whether to include hyperlinks in the extracted web page text
    Possible values: 1 - enable, 0 - disable (default)
outputMode (optional): desired API output format
    Possible values: xml (default), json, rdf
jsonp (optional, requires "outputMode" to be set to json): desired JSONP callback
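
The parameters above can be combined into a GET request URL. The following is a minimal sketch using only the Python standard library; the function name `build_get_text_url`, the example page URL, and the `YOUR_API_KEY` placeholder are assumptions made for illustration and are not part of the API.

```python
from urllib.parse import urlencode

# Endpoint as documented for URLGetText.
ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetText"


def build_get_text_url(page_url, api_key, output_mode="xml",
                       use_metadata=True, extract_links=False, jsonp=None):
    """Return a URLGetText request URL with every argument percent-encoded."""
    # The jsonp argument is documented as requiring outputMode=json.
    if jsonp is not None and output_mode != "json":
        raise ValueError('jsonp requires outputMode to be "json"')
    params = {
        "url": page_url,                      # encoded by urlencode below
        "apikey": api_key,
        "useMetadata": int(use_metadata),     # 1 = enable, 0 = disable
        "extractLinks": int(extract_links),   # 1 = enable, 0 = disable
        "outputMode": output_mode,
    }
    if jsonp is not None:
        params["jsonp"] = jsonp
    return ENDPOINT + "?" + urlencode(params)


request_url = build_get_text_url(
    "http://www.example.com/news/story.html?id=42",  # placeholder page
    "YOUR_API_KEY",                                   # placeholder key
    output_mode="json",
)
print(request_url)
```

Passing the target page through `urlencode` is what satisfies the "URI-argument encoded" requirement: characters such as `?`, `&`, and `=` inside the page URL would otherwise be read as part of the API call's own query string.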

Response Format (XML):

<results>
    <status>REQUEST_STATUS</status>
    <url>REQUESTED_URL</url>
    <text>EXTRACTED_TEXT</text>
</results>
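
An XML response of this shape could be read with Python's standard `xml.etree.ElementTree` module. This is a sketch only: the sample body uses made-up values, and `parse_xml_response` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample response in the documented shape; the values are made up.
SAMPLE_XML = """<results>
    <status>OK</status>
    <url>http://www.example.com/</url>
    <text>Example article text.</text>
</results>"""


def parse_xml_response(body):
    """Map each child element of <results> to its text content."""
    root = ET.fromstring(body)
    # Empty elements have text None; normalize those to an empty string.
    return {child.tag: (child.text or "") for child in root}


fields = parse_xml_response(SAMPLE_XML)
print(fields["status"], fields["text"])
```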

Response Format (JSON):

{
    "status": "REQUEST_STATUS",
    "url": "REQUESTED_URL",
    "text": "EXTRACTED_TEXT"
}
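
When `outputMode` is set to `json`, the response body can be decoded with the standard `json` module. The sketch below assumes the three fields shown above; the `TextResult` class and `parse_json_response` function are illustrative names, and the sample values are invented.

```python
import json
from dataclasses import dataclass


@dataclass
class TextResult:
    status: str
    url: str
    text: str


def parse_json_response(body):
    """Decode a URLGetText-style JSON body into a TextResult."""
    data = json.loads(body)
    # .get() keeps this tolerant of fields that are absent from a response.
    return TextResult(
        status=data.get("status", ""),
        url=data.get("url", ""),
        text=data.get("text", ""),
    )


SAMPLE_JSON = (
    '{"status": "OK", "url": "http://www.example.com/", '
    '"text": "Example article text."}'
)
result = parse_json_response(SAMPLE_JSON)
print(result.status, result.text)
```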

Response Format (RDF):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
                 xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
    <rdf:Description rdf:ID="DOCUMENT_HASH">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
        <aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
        <aapi:URL>DOCUMENT_URL</aapi:URL>
        <aapi:DocText>EXTRACTED_TEXT</aapi:DocText>
    </rdf:Description>
</rdf:RDF>
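
The RDF output is namespaced XML, so element lookups need the `rdf` and `aapi` namespace URIs shown above. A minimal sketch, assuming a response with a single `rdf:Description` element; the sample document hash `doc1` and the other values are placeholders, and `parse_rdf_response` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the documented RDF response format.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "aapi": "http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#",
}

# Sample response in the documented shape; the values are made up.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
         xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
    <rdf:Description rdf:ID="doc1">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
        <aapi:ResultStatus>OK</aapi:ResultStatus>
        <aapi:URL>http://www.example.com/</aapi:URL>
        <aapi:DocText>Example article text.</aapi:DocText>
    </rdf:Description>
</rdf:RDF>"""


def parse_rdf_response(body):
    """Pull status, URL, and text out of an RDF response body."""
    root = ET.fromstring(body)
    doc = root.find("rdf:Description", NS)
    if doc is None:
        raise ValueError("no rdf:Description element in response")
    return {
        "status": doc.findtext("aapi:ResultStatus", default="", namespaces=NS),
        "url": doc.findtext("aapi:URL", default="", namespaces=NS),
        "text": doc.findtext("aapi:DocText", default="", namespaces=NS),
    }


rdf_fields = parse_rdf_response(SAMPLE_RDF)
print(rdf_fields["status"], rdf_fields["text"])
```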

Response Fields:

field name: field description
status: success / failure status indicating whether the request was processed
    Possible values: OK, ERROR
url: HTTP URL that information was requested for
text: the extracted web page text
statusInfo: failure status information (sent only if "status" == "ERROR")
    Possible values: invalid-api-key, cannot-retrieve, page-is-not-html
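
Because `statusInfo` is sent only on failure, client code may want to check `status` before reading `text`. One possible approach is sketched below; `AlchemyAPIError` and `require_ok` are hypothetical names, not part of any official client library, and the input is assumed to be a dictionary of already-parsed response fields.

```python
class AlchemyAPIError(Exception):
    """Hypothetical exception type carrying the statusInfo value."""

    def __init__(self, status_info):
        super().__init__(status_info or "unknown error")
        self.status_info = status_info


def require_ok(fields):
    """Return the parsed fields unchanged if status is OK, otherwise raise."""
    if fields.get("status") != "OK":
        # statusInfo is only present when status is ERROR.
        raise AlchemyAPIError(fields.get("statusInfo"))
    return fields


ok_fields = require_ok({"status": "OK", "url": "http://www.example.com/",
                        "text": "Example article text."})

try:
    require_ok({"status": "ERROR", "statusInfo": "cannot-retrieve"})
except AlchemyAPIError as exc:
    failure = exc.status_info
    print("request failed:", failure)
```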

Example Calls:

XML: http://access.alchemyapi.com/calls/...
RDF: http://access.alchemyapi.com/calls/...

API Notes:

  1. Calls to URLGetText can be made using HTTP GET or POST.
  2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
  3. URL retrieval is attempted for a maximum of 10 seconds. Requests taking longer than this will result in a "cannot-retrieve" error response.
  4. Requested HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
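
The POST variant described in these notes might be prepared as follows. This sketch only builds the request object and does not send it; `build_post_request` is an illustrative name, the key is a placeholder, and the 15-second client timeout mentioned in the comment is an assumption chosen to sit above the 10-second retrieval limit.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetText"


def build_post_request(page_url, api_key, output_mode="json"):
    """Build a form-encoded POST request for URLGetText without sending it."""
    body = urlencode({
        "url": page_url,
        "apikey": api_key,
        "outputMode": output_mode,
    }).encode("ascii")
    return Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


post_request = build_post_request("http://www.example.com/", "YOUR_API_KEY")

# To send it (assumed client timeout, slightly above the 10-second limit):
#     from urllib.request import urlopen
#     with urlopen(post_request, timeout=15) as response:
#         body = response.read().decode("utf-8")
```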


API Call: URLGetRawText

Description: The URLGetRawText call extracts all text from a web page. AlchemyAPI downloads the requested URL and extracts text from the HTML document structure, including page navigation, advertisements, and other page content. To ignore this content, use the URLGetText call instead.

Endpoint: http://access.alchemyapi.com/calls/url/URLGetRawText

Parameters:

HTTP argument (required / optional): parameter description
url (required): HTTP URL (must be URI-argument encoded)
apikey (required): your private API key
outputMode (optional): desired API output format
    Possible values: xml (default), json, rdf
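
Since URLGetRawText and URLGetText share the `url`, `apikey`, and `outputMode` arguments, a caller could select between them with a single flag. The sketch below is one way to do that; `text_call_url` and its `keep_boilerplate` flag are hypothetical names, and the key is a placeholder.

```python
from urllib.parse import urlencode

BASE = "http://access.alchemyapi.com/calls/url/"


def text_call_url(page_url, api_key, keep_boilerplate=False, output_mode="xml"):
    """Build a URLGetRawText URL if boilerplate is wanted, else URLGetText."""
    call = "URLGetRawText" if keep_boilerplate else "URLGetText"
    query = urlencode({
        "url": page_url,
        "apikey": api_key,
        "outputMode": output_mode,
    })
    return BASE + call + "?" + query


raw_url = text_call_url("http://www.example.com/", "YOUR_API_KEY",
                        keep_boilerplate=True)
clean_url = text_call_url("http://www.example.com/", "YOUR_API_KEY")
print(raw_url)
print(clean_url)
```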

Response Format (XML):

<results>
    <status>REQUEST_STATUS</status>
    <url>REQUESTED_URL</url>
    <text>EXTRACTED_TEXT</text>
</results>

Response Format (JSON):

{
    "status": "REQUEST_STATUS",
    "url": "REQUESTED_URL",
    "text": "EXTRACTED_TEXT"
}

Response Format (RDF):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
                 xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
    <rdf:Description rdf:ID="DOCUMENT_HASH">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
        <aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
        <aapi:URL>DOCUMENT_URL</aapi:URL>
        <aapi:DocText>EXTRACTED_TEXT</aapi:DocText>
    </rdf:Description>
</rdf:RDF>

Response Fields:

field name: field description
status: success / failure status indicating whether the request was processed
    Possible values: OK, ERROR
url: HTTP URL that information was requested for
text: the extracted web page text
statusInfo: failure status information (sent only if "status" == "ERROR")
    Possible values: invalid-api-key, cannot-retrieve, page-is-not-html

Example Calls:

XML: http://access.alchemyapi.com/calls/...
RDF: http://access.alchemyapi.com/calls/...

API Notes:

  1. Calls to URLGetRawText can be made using HTTP GET or POST.
  2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
  3. URL retrieval is attempted for a maximum of 10 seconds. Requests taking longer than this will result in a "cannot-retrieve" error response.
  4. Requested HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
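
The error codes in these notes suggest a simple retry policy on the client side. The sketch below rests on an assumption that is not stated in this documentation: that "cannot-retrieve" may be transient (for example, a slow page that exceeded the 10-second limit once), while the other codes will fail again on every attempt. The names `RETRYABLE` and `should_retry` are illustrative.

```python
# Assumption: only "cannot-retrieve" is treated as possibly transient.
# An oversized page, a non-HTML page, or a bad key would fail again.
RETRYABLE = {"cannot-retrieve"}


def should_retry(status_info, attempt, max_attempts=3):
    """Return True if a failed call is worth repeating.

    attempt is the 1-based number of the attempt that just failed.
    """
    return status_info in RETRYABLE and attempt < max_attempts


print(should_retry("cannot-retrieve", attempt=1))
print(should_retry("content-exceeds-size-limit", attempt=1))
```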


API Call: URLGetTitle

Description: The URLGetTitle call extracts title information from a web page. AlchemyAPI downloads the requested URL and processes the retrieved HTML document, extracting any title information.

Endpoint: http://access.alchemyapi.com/calls/url/URLGetTitle

Parameters:

HTTP argument (required / optional): parameter description
url (required): HTTP URL (must be URI-argument encoded)
apikey (required): your private API key
useMetadata (optional): whether to use title information embedded in web page meta-data
    Possible values: 1 - enable (default), 0 - disable
outputMode (optional): desired API output format
    Possible values: xml (default), json, rdf
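
A URLGetTitle request that disables the `useMetadata` option could be built as shown in this sketch. The helper name `build_title_url` is an assumption, as are the example page and the placeholder key; the validation of `outputMode` simply mirrors the three values listed above.

```python
from urllib.parse import urlencode

TITLE_ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetTitle"
OUTPUT_MODES = ("xml", "json", "rdf")


def build_title_url(page_url, api_key, use_metadata=True, output_mode="xml"):
    """Return a URLGetTitle request URL."""
    if output_mode not in OUTPUT_MODES:
        raise ValueError("outputMode must be one of: " + ", ".join(OUTPUT_MODES))
    params = {
        "url": page_url,
        "apikey": api_key,
        "useMetadata": 1 if use_metadata else 0,  # 1 = enable, 0 = disable
        "outputMode": output_mode,
    }
    return TITLE_ENDPOINT + "?" + urlencode(params)


title_url = build_title_url("http://www.example.com/", "YOUR_API_KEY",
                            use_metadata=False)
print(title_url)
```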

Response Format (XML):

<results>
    <status>REQUEST_STATUS</status>
    <url>REQUESTED_URL</url>
    <title>EXTRACTED_TITLE</title>
</results>

Response Format (JSON):

{
    "status": "REQUEST_STATUS",
    "url": "REQUESTED_URL",
    "title": "EXTRACTED_TITLE"
}

Response Format (RDF):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
                 xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
    <rdf:Description rdf:ID="DOCUMENT_HASH">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
        <aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
        <aapi:URL>DOCUMENT_URL</aapi:URL>
        <aapi:DocTitle>EXTRACTED_TITLE</aapi:DocTitle>
    </rdf:Description>
</rdf:RDF>

Response Fields:

field name: field description
status: success / failure status indicating whether the request was processed
    Possible values: OK, ERROR
url: HTTP URL that information was requested for
title: the detected web page title
statusInfo: failure status information (sent only if "status" == "ERROR")
    Possible values: invalid-api-key, cannot-retrieve, page-is-not-html

Example Calls:

XML: http://access.alchemyapi.com/calls/...
RDF: http://access.alchemyapi.com/calls/...

API Notes:

  1. Calls to URLGetTitle can be made using HTTP GET or POST.
  2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
  3. URL retrieval is attempted for a maximum of 10 seconds. Requests taking longer than this will result in a "cannot-retrieve" error response.
  4. Requested HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.

