Keyword / Term Extraction: HTML API
AlchemyAPI provides easy-to-use facilities for extracting topic keywords from any web page: post (upload) any content directly to our service for analysis.
Posted content is normalized / cleaned (removing ads, navigation links, and other unimportant content), the primary document language is detected, and topic keywords are extracted automatically.
These API calls may be utilized to process posted (uploaded) webpages and other HTML content. If you are processing content hosted on a publicly accessible website, consider using our URL processing calls instead.
API Call: HTMLGetRankedKeywords
Description: The HTMLGetRankedKeywords call is utilized to extract a relevancy-ranked list of topic keywords from a posted HTML document. AlchemyAPI will extract text from the posted HTML document structure (ignoring navigation links, advertisements, and other undesirable content), and perform keyword extraction operations.
Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetRankedKeywords
Parameters:
| http argument | parameter description |
|---|---|
| apikey | your private api key (required parameter) |
| html | HTML document content (must be uri-argument encoded) (required parameter) |
| url | HTML document URL (must be uri-argument encoded) (optional parameter, for response tracking purposes) |
| maxRetrieve | maximum number of keywords to extract (default: 50) (optional parameter) |
| keywordExtractMode | keyword extraction mode. Possible values: normal - normal keyword extraction mode (default); strict - strict keyword extraction mode (returns more "well-formed" keywords, refining results at the expense of returning fewer keywords). (optional parameter) |
| sentiment | whether to enable keyword-level sentiment analysis. Possible values: 1 - enabled; 0 - disabled (default). (optional parameter. Note that enabling this option will incur usage of one (1) additional AlchemyAPI transaction) |
| outputMode | desired API output format. Possible values: xml (default), json, rdf, rel-tag, rel-tag-raw. (optional parameter) |
| jsonp | desired JSONP callback (optional parameter, requires "outputMode" to be set to json) |
| showSourceText | whether to include the original 'source text' the keywords were extracted from within the API response. Possible values: 1 - enabled; 0 - disabled (default). (optional parameter) |
| sourceText | where to obtain the text that will be processed by this API call. AlchemyAPI supports multiple modes of text extraction: web page cleaning (removes ads, navigation links, etc.), raw text extraction (processes all web page text, including ads / nav links), visual constraint queries, and XPath queries. Possible values: (optional parameter) |
| cquery | a visual constraints query to apply to the web page. Constraint queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description. (optional parameter, used when sourceText is set to 'cquery'. must be uri-argument encoded) |
| xpath | an XPath query to apply to the web page. XPath queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description. (optional parameter, used when sourceText is set to 'xpath'. must be uri-argument encoded) |
| baseUrl | rel-tag output base http url (must be uri-argument encoded) (optional parameter, used with rel-tag or rel-tag-raw outputMode) |
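Example (Python): the sketch below shows one way the parameters above might be assembled into a form-encoded POST request using only the standard library. The helper name `build_keyword_request`, the placeholder API key, and the sample HTML are assumptions made for illustration; the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://access.alchemyapi.com/calls/html/HTMLGetRankedKeywords"


def build_keyword_request(api_key, html, **options):
    """Build (but do not send) a POST request for HTMLGetRankedKeywords.

    `options` carries the optional arguments from the parameter table,
    for example outputMode="json" or maxRetrieve=10.
    """
    params = {"apikey": api_key, "html": html}
    params.update(options)
    # urlencode performs the uri-argument encoding the table asks for.
    body = urlencode(params).encode("ascii")
    return Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder key and document, for illustration only.
request = build_keyword_request(
    "YOUR_API_KEY",
    "<html><body><p>Solar panels cut energy costs.</p></body></html>",
    outputMode="json",
    maxRetrieve=10,
    keywordExtractMode="strict",
)
```

Passing the resulting object to `urllib.request.urlopen` would send the call; that step is left out here so the sketch runs without network access.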
Response Format (XML):
<results>
<status>REQUEST_STATUS</status>
<url>DOCUMENT_URL</url>
<language>DOCUMENT_LANGUAGE</language>
<text>DOCUMENT_TEXT</text>
<keywords>
<keyword>
<text>DETECTED_KEYWORD</text>
<relevance>DETECTED_RELEVANCE</relevance>
<sentiment>
<type>SENTIMENT_LABEL</type>
<score>SENTIMENT_SCORE</score>
</sentiment>
</keyword>
</keywords>
</results>
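Example (Python): a minimal sketch of reading the XML response format above with `xml.etree.ElementTree`. The sample document and its values are invented for illustration, and the helper name `parse_keywords_xml` is an assumption, not part of the API.

```python
import xml.etree.ElementTree as ET


def parse_keywords_xml(xml_text):
    """Return (status, keywords) from an XML response shaped like the template."""
    root = ET.fromstring(xml_text)
    status = root.findtext("status")
    keywords = []
    for node in root.findall("./keywords/keyword"):
        # The <sentiment> element is only present when sentiment=1 was requested.
        sentiment = node.find("sentiment")
        keywords.append({
            "text": node.findtext("text"),
            "relevance": float(node.findtext("relevance")),
            "sentiment": None if sentiment is None else sentiment.findtext("type"),
        })
    return status, keywords


# Invented sample values, for illustration only.
sample_xml = """
<results>
  <status>OK</status>
  <url>http://www.example.com/page.html</url>
  <language>english</language>
  <keywords>
    <keyword>
      <text>solar panels</text>
      <relevance>0.93</relevance>
      <sentiment><type>positive</type><score>0.41</score></sentiment>
    </keyword>
    <keyword>
      <text>energy costs</text>
      <relevance>0.62</relevance>
    </keyword>
  </keywords>
</results>
"""

status, keywords = parse_keywords_xml(sample_xml)
```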
Response Format (JSON):
{
"status": "REQUEST_STATUS",
"url": "DOCUMENT_URL",
"language": "DOCUMENT_LANGUAGE",
"text": "DOCUMENT_TEXT",
"keywords": [
{
"text": "DETECTED_KEYWORD",
"relevance": "DETECTED_RELEVANCE",
"sentiment": {
"type": "SENTIMENT_LABEL",
"score": "SENTIMENT_SCORE"
}
}
]
}
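Example (Python): a sketch of consuming the JSON response format. The template above quotes the relevance value, so this example assumes it may arrive as a string and converts it with `float()`, which also accepts a number. The helper name `ranked_keywords`, the `min_relevance` filter, and the sample values are assumptions for illustration.

```python
import json


def ranked_keywords(payload, min_relevance=0.0):
    """Return (text, relevance, sentiment_type) tuples, most relevant first."""
    data = json.loads(payload)
    ranked = []
    for item in data.get("keywords", []):
        relevance = float(item["relevance"])
        if relevance >= min_relevance:
            # "sentiment" is absent unless keyword-level sentiment was enabled.
            sentiment = item.get("sentiment", {}).get("type")
            ranked.append((item["text"], relevance, sentiment))
    return sorted(ranked, key=lambda entry: entry[1], reverse=True)


# Invented sample values, for illustration only.
sample_json = json.dumps({
    "status": "OK",
    "language": "english",
    "keywords": [
        {"text": "energy costs", "relevance": "0.62"},
        {"text": "solar panels", "relevance": "0.93",
         "sentiment": {"type": "positive", "score": "0.41"}},
    ],
})

top = ranked_keywords(sample_json, min_relevance=0.5)
```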
Response Format (RDF):
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:Language>DOCUMENT_LANGUAGE</aapi:Language>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:DocText>DOCUMENT_TEXT</aapi:DocText>
</rdf:Description>
<rdf:Description rdf:ID="DOCUMENT_HASH-KEYWORD_NUM">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#KeywordOccurrences"/>
<aapi:Doc>DOCUMENT_HASH</aapi:Doc>
<aapi:Relevance>DETECTED_RELEVANCE</aapi:Relevance>
<aapi:Name>DETECTED_KEYWORD</aapi:Name>
<aapi:Sentiment>
<rdf:Description rdf:about="#DOCUMENT_HASH-KEYWORD_NUM">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#Sentiment"/>
<aapi:Doc>DOCUMENT_HASH</aapi:Doc>
<aapi:SentimentType>SENTIMENT_LABEL</aapi:SentimentType>
<aapi:SentimentScore>SENTIMENT_SCORE</aapi:SentimentScore>
</rdf:Description>
</aapi:Sentiment>
</rdf:Description>
</rdf:RDF>
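Example (Python): the RDF output uses XML namespaces, so element lookups need a prefix map. The sketch below, which assumes the response follows the template above, collects keyword names and relevance scores from the top-level `KeywordOccurrences` descriptions. The helper name and the sample values are invented for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
AAPI = "http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
NS = {"rdf": RDF, "aapi": AAPI}


def keywords_from_rdf(rdf_text):
    """Return (name, relevance) pairs from KeywordOccurrences descriptions."""
    root = ET.fromstring(rdf_text)
    found = []
    # Only direct children are scanned, so nested sentiment descriptions are skipped.
    for desc in root.findall("rdf:Description", NS):
        rdf_type = desc.find("rdf:type", NS)
        if rdf_type is None:
            continue
        resource = rdf_type.get("{%s}resource" % RDF, "")
        if resource.endswith("#KeywordOccurrences"):
            name = desc.findtext("aapi:Name", namespaces=NS)
            relevance = float(desc.findtext("aapi:Relevance", namespaces=NS))
            found.append((name, relevance))
    return found


# Invented sample values, for illustration only.
sample_rdf = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#">
  <rdf:Description rdf:ID="abc123">
    <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
    <aapi:ResultStatus>OK</aapi:ResultStatus>
  </rdf:Description>
  <rdf:Description rdf:ID="abc123-1">
    <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#KeywordOccurrences"/>
    <aapi:Doc>abc123</aapi:Doc>
    <aapi:Relevance>0.93</aapi:Relevance>
    <aapi:Name>solar panels</aapi:Name>
  </rdf:Description>
</rdf:RDF>
"""

rdf_keywords = keywords_from_rdf(sample_rdf)
```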
Response Format (REL-TAG Microformat [XML-embedded] ):
<results>
<status>REQUEST_STATUS</status>
<url>REQUESTED_URL</url>
<language>DOCUMENT_LANGUAGE</language>
<text>DOCUMENT_TEXT</text>
<microformats>
<a href="REQUESTED_BASE_URL/DETECTED_KEYWORD" rel="tag">DETECTED_KEYWORD</a>
<a href="REQUESTED_BASE_URL/DETECTED_KEYWORD" rel="tag">DETECTED_KEYWORD</a>
</microformats>
</results>
Response Format (REL-TAG Microformat [raw] ):
<a href="REQUESTED_BASE_URL/DETECTED_KEYWORD" rel="tag">DETECTED_KEYWORD</a>
<a href="REQUESTED_BASE_URL/DETECTED_KEYWORD" rel="tag">DETECTED_KEYWORD</a>
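Example (Python): the raw rel-tag output is a bare sequence of anchor elements rather than a single XML document, so an HTML parser is a convenient way to read it. This is a sketch under that assumption; the class name `RelTagParser`, the base URL, and the keywords in the sample are invented for illustration.

```python
from html.parser import HTMLParser


class RelTagParser(HTMLParser):
    """Collect (href, keyword) pairs from <a rel="tag"> elements."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._href = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "tag" in (attributes.get("rel") or "").split():
            self._href = attributes.get("href") or ""
            self._chunks = []

    def handle_data(self, data):
        # Only collect text while inside a rel="tag" anchor.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.tags.append((self._href, "".join(self._chunks).strip()))
            self._href = None


# Invented sample values, for illustration only.
sample_rel_tags = (
    '<a href="http://www.example.com/tags/solar" rel="tag">solar</a>\n'
    '<a href="http://www.example.com/tags/energy" rel="tag">energy</a>\n'
)

parser = RelTagParser()
parser.feed(sample_rel_tags)
parser.close()
```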
Response Fields:
| field name | field description |
|---|---|
| status | success / failure status indicating whether the request was processed. Possible values: OK, ERROR |
| language | the detected language that the source text was written in. |
| url | http url information was requested for. |
| relevance | relevance score for a detected keyword. Possible values: (0.0 - 1.0) [1.0 = most relevant] |
| text | the detected keyword text. |
| sentiment | sentiment for the detected keyword (sent only if keyword-level sentiment analysis is enabled) |
| statusInfo | failure status information (sent only if "status" == "ERROR"). Possible values: invalid-api-key, page-is-not-html |
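Example (Python): one possible way to act on the status and statusInfo fields once a response has been decoded into a dictionary (for instance from the JSON output). The exception class `AlchemyAPIError` and the helper `ensure_ok` are hypothetical names introduced for this sketch, as is the "unknown-error" fallback used when statusInfo is missing.

```python
class AlchemyAPIError(Exception):
    """Hypothetical exception carrying the statusInfo value of a failed call."""

    def __init__(self, status_info):
        super().__init__(status_info)
        self.status_info = status_info


def ensure_ok(response):
    """Return the response unchanged if status is OK, otherwise raise."""
    if response.get("status") == "OK":
        return response
    # statusInfo is only sent when status == "ERROR".
    raise AlchemyAPIError(response.get("statusInfo", "unknown-error"))


try:
    ensure_ok({"status": "ERROR", "statusInfo": "invalid-api-key"})
    failure = None
except AlchemyAPIError as error:
    failure = error.status_info
```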
API Notes:
1. Calls to HTMLGetRankedKeywords should be made using HTTP POST.
2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
4. Language detection is performed on the posted document before attempting keyword extraction. A minimum of 15 characters of text must exist within the posted HTML document to perform language detection.
5. Documents containing fewer than 15 characters of text are assumed to be English-language content.
6. Keyword extraction is currently supported for all languages listed on the language support page. Submissions in unsupported languages will be rejected and an error response returned.
7. Enabling keyword-level sentiment analysis results in one additional transaction utilized against your daily API limit.
8. Sentiment analysis is currently available for English-language content only. Support for other languages is in development.
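Example (Python): a client-side pre-check based on the notes above. The notes do not say whether "600 kilobytes" means 600,000 or 614,400 bytes, or whether the limit applies before or after form encoding, so this sketch conservatively assumes 600,000 bytes of UTF-8 HTML. It also assumes a plain call costs one transaction, which the notes imply but do not state. Both helper names are hypothetical.

```python
# Assumption: "600 kilobytes" is read conservatively as 600,000 bytes.
MAX_HTML_BYTES = 600 * 1000


def check_html_size(html):
    """Return the UTF-8 size of the document, or raise if it looks too large."""
    size = len(html.encode("utf-8"))
    if size > MAX_HTML_BYTES:
        raise ValueError(
            "document is %d bytes; the service may answer with "
            "content-exceeds-size-limit" % size
        )
    return size


def transactions_for_call(sentiment_enabled):
    """Estimate transactions used by one call.

    Assumption: the base call costs one transaction; enabling keyword-level
    sentiment adds one more, as stated in the notes.
    """
    return 2 if sentiment_enabled else 1


size = check_html_size("<p>hi</p>")
```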
