HTML API

HTML API: Concept Tagging

AlchemyAPI provides easy-to-use facilities for performing concept tagging on any posted web page.

Posted content is normalized and cleaned (removing ads, navigation links, and other unimportant content), the primary document language is detected, and concept tagging is performed automatically. Use these API calls to process posted (uploaded) web pages and other HTML content. If you are processing content hosted on a publicly accessible website, consider using our URL processing calls instead.

API Call: HTMLGetRankedConcepts

Description: The HTMLGetRankedConcepts call extracts a relevancy-ranked list of concept tags for a posted HTML document. AlchemyAPI extracts text from the posted HTML document structure (ignoring navigation links, advertisements, and other undesirable content) and performs concept tagging operations.

Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetRankedConcepts

Parameters:

HTTP argument: description

apikey: your private API key. (required parameter)

html: HTML document content; must be URI-argument encoded. (required parameter)

url: HTML document URL; must be URI-argument encoded. (optional parameter, for response tracking purposes)

maxRetrieve: maximum number of concept tags to extract (default: 8). (optional parameter)

outputMode: desired API output format. (optional parameter)
Possible values:
xml (default)
json
rdf

jsonp: desired JSONP callback. (optional parameter, requires "outputMode" to be set to json)

linkedData: whether to include Linked Data content links with identified concept tags. (optional parameter)
Possible values:
1 - enabled (default)
0 - disabled

showSourceText: whether to include the original 'source text' the concept tags were extracted from within the API response. (optional parameter)
Possible values:
1 - enabled
0 - disabled (default)

sourceText: where to obtain the text that will be processed by this API call. (optional parameter)
AlchemyAPI supports multiple modes of text extraction: web page cleaning (removes ads, navigation links, etc.), raw text extraction (processes all web page text, including ads / nav links), visual constraint queries, and XPath queries.
Possible values:
cleaned_or_raw - cleaning enabled, fallback to raw when cleaning produces no text (default)
cleaned - operate on 'cleaned' web page text (web page cleaning enabled)
raw - operate on raw web page text (web page cleaning disabled)
cquery - operate on the results of a visual constraints query. Note: The 'cquery' HTTP argument must also be set to a valid visual constraints query.
xpath - operate on the results of an XPath query. Note: The 'xpath' HTTP argument must also be set to a valid XPath query.

cquery: a visual constraints query to apply to the web page. (optional parameter, used when sourceText is set to 'cquery'; must be URI-argument encoded)
Constraint queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.

xpath: an XPath query to apply to the web page. (optional parameter, used when sourceText is set to 'xpath'; must be URI-argument encoded)
XPath queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
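
The parameters above could be assembled into a form-encoded POST body roughly as follows. This is a minimal sketch using only the Python standard library; the helper name `build_concept_request` and its validation rules are illustrative assumptions drawn from the parameter notes, not part of AlchemyAPI itself. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://access.alchemyapi.com/calls/html/HTMLGetRankedConcepts"


def build_concept_request(api_key, html, url=None, **options):
    """Build (but do not send) a POST request for HTMLGetRankedConcepts.

    `options` may carry any optional parameter listed above, for example
    maxRetrieve=5, outputMode="json", sourceText="xpath", xpath="//h1".
    This helper is hypothetical; only the parameter names come from the docs.
    """
    params = {"apikey": api_key, "html": html}
    if url is not None:
        params["url"] = url
    params.update(options)

    # sourceText values 'cquery' and 'xpath' need their companion argument.
    source = params.get("sourceText")
    if source in ("cquery", "xpath") and not params.get(source):
        raise ValueError(f"sourceText={source!r} requires the {source!r} argument")

    # The jsonp callback only applies when outputMode is json.
    if "jsonp" in params and params.get("outputMode") != "json":
        raise ValueError("jsonp requires outputMode='json'")

    # urlencode percent-encodes every value, which covers the
    # "must be URI-argument encoded" requirement for html, url, cquery, xpath.
    body = urlencode(params).encode("ascii")
    return Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

A call such as `build_concept_request("YOUR_API_KEY", page_html, outputMode="json", maxRetrieve=5)` returns a `urllib.request.Request` that can be passed to `urllib.request.urlopen` when you are ready to send it.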

Response Format (XML):

<results>
    <status>REQUEST_STATUS</status>
    <url>DOCUMENT_URL</url>
    <language>DOCUMENT_LANGUAGE</language>
    <text>DOCUMENT_TEXT</text>
    <concepts>
        <concept>
            <text>DETECTED_CONCEPT</text>
            <relevance>DETECTED_RELEVANCE</relevance>
            <website>WEBSITE</website>
            <geo>LATITUDE LONGITUDE</geo>
            <dbpedia>LINKED_DATA_DBPEDIA</dbpedia>
            <yago>LINKED_DATA_YAGO</yago>
            <opencyc>LINKED_DATA_OPENCYC</opencyc>
            <freebase>LINKED_DATA_FREEBASE</freebase>
            <ciaFactbook>LINKED_DATA_FACTBOOK</ciaFactbook>
            <census>LINKED_DATA_CENSUS</census>
            <geonames>LINKED_DATA_GEONAMES</geonames>
            <musicBrainz>LINKED_DATA_MUSICBRAINZ</musicBrainz>
            <crunchbase>CRUNCHBASE_WEB_LINK</crunchbase>
        </concept>
    </concepts>
</results>
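
As an illustration of reading the XML format above, the following sketch walks each `concept` element with Python's standard `xml.etree.ElementTree` module. The function name `parse_concepts_xml` is an assumption made for this example; the element names are taken from the template above.

```python
import xml.etree.ElementTree as ET


def parse_concepts_xml(xml_text):
    """Return (status, concepts) from an XML response body.

    Each concept is returned as a dict keyed by child element name.
    Linked-data children are simply absent when the response omits them.
    """
    root = ET.fromstring(xml_text)
    status = root.findtext("status")
    concepts = []
    for node in root.findall("./concepts/concept"):
        # Every child element (text, relevance, dbpedia, ...) becomes a key.
        fields = {child.tag: (child.text or "") for child in node}
        concepts.append(fields)
    return status, concepts
```

Values are left as strings here; convert `relevance` with `float()` if you need to compare scores.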

Response Format (JSON):

{
    "status": "REQUEST_STATUS",
    "url": "DOCUMENT_URL",
    "language": "DOCUMENT_LANGUAGE",
    "text": "DOCUMENT_TEXT",
    "concepts": [
        {
        "text": "DETECTED_CONCEPT",
        "relevance": "DETECTED_RELEVANCE",
        "website": "WEBSITE",
        "geo": "LATITUDE LONGITUDE",
        "dbpedia": "LINKED_DATA_DBPEDIA",
        "yago": "LINKED_DATA_YAGO",
        "opencyc": "LINKED_DATA_OPENCYC",
        "freebase": "LINKED_DATA_FREEBASE",
        "ciaFactbook": "LINKED_DATA_FACTBOOK",
        "census": "LINKED_DATA_CENSUS",
        "geonames": "LINKED_DATA_GEONAMES",
        "musicBrainz": "LINKED_DATA_MUSICBRAINZ",
        "crunchbase": "CRUNCHBASE_WEB_LINK"
        }
    ]
}
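
A JSON response in the shape above might be consumed along these lines. This is a sketch, not an official client: the function name `ranked_concepts` and the `min_relevance` filter are assumptions for illustration, and the code assumes `relevance` arrives as a quoted decimal string, as the template suggests.

```python
import json


def ranked_concepts(json_text, min_relevance=0.0):
    """Return (concept text, relevance) pairs, most relevant first."""
    data = json.loads(json_text)
    results = []
    for concept in data.get("concepts", []):
        # float() accepts both a quoted string and a bare JSON number.
        relevance = float(concept["relevance"])
        if relevance >= min_relevance:
            results.append((concept["text"], relevance))
    # Sort explicitly rather than relying on the order of the response.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```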

Response Format (RDF):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
                 xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
    <rdf:Description rdf:ID="DOCUMENT_HASH">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
        <aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
        <aapi:Language>DOCUMENT_LANGUAGE</aapi:Language>
        <aapi:URL>DOCUMENT_URL</aapi:URL>
        <aapi:DocText>DOCUMENT_TEXT</aapi:DocText>
    </rdf:Description>
    <rdf:Description rdf:ID="DOCUMENT_HASH-CONCEPT_NUM">
        <rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#ConceptOccurrence"/>
        <aapi:Doc>DOCUMENT_HASH</aapi:Doc>
        <aapi:Relevance>DETECTED_RELEVANCE</aapi:Relevance>
        <aapi:Name>DETECTED_CONCEPT</aapi:Name>
        <aapi:URL>WEBSITE</aapi:URL>
        <aapi:Geo>LATITUDE LONGITUDE</aapi:Geo>
        <owl:sameAs rdf:resource="LINKED_DATA_DBPEDIA"/>
        <owl:sameAs rdf:resource="LINKED_DATA_YAGO"/>
        <owl:sameAs rdf:resource="LINKED_DATA_OPENCYC"/>
        <owl:sameAs rdf:resource="LINKED_DATA_FREEBASE"/>
        <owl:sameAs rdf:resource="LINKED_DATA_FACTBOOK"/>
        <owl:sameAs rdf:resource="LINKED_DATA_CENSUS"/>
        <owl:sameAs rdf:resource="LINKED_DATA_GEONAMES"/>
        <owl:sameAs rdf:resource="LINKED_DATA_MUSICBRAINZ"/>
        <owl:sameAs rdf:resource="LINKED_DATA_CRUNCHBASE"/>
    </rdf:Description>
</rdf:RDF>

Response Fields:

Field name: field description

status: success / failure status indicating whether the request was processed.
Possible values:
OK
ERROR

statusInfo: failure status information (sent only if "status" == "ERROR").
Possible values:
invalid-api-key
page-is-not-html

language: the detected language that the source text was written in.

url: the HTTP URL that information was requested for.

relevance: relevance score for a detected concept tag.
Possible values: (0.0 - 1.0)   [1.0 = most relevant]

text: the detected concept tag.

linked data: linked data for the detected concept tag (sent only if linkedData is enabled).

Linked data field: field description

website: the website associated with this concept tag.

geo: latitude longitude; the geographic coordinates associated with this concept tag.

dbpedia: sameAs link to DBpedia for this concept tag.
Note: Provided only for entities that exist in this linked data-set.

yago: sameAs link to YAGO for this concept tag.
Note: Provided only for entities that exist in this linked data-set.

opencyc: sameAs link to OpenCyc for this concept tag.
Note: Provided only for entities that exist in this linked data-set.

freebase: sameAs link to Freebase for this concept tag.
Note: Provided only for entities that exist in this linked data-set.

ciaFactbook: sameAs link to the CIA World Factbook for this concept tag.
Note: Provided only for entities that exist in this linked data-set.

census: sameAs link to the US Census for this concept tag.
Note: Provided only for entities that exist in this linked data-set.

geonames: sameAs link to Geonames for this concept tag.
Note: Provided only for entities that exist in this linked data-set.

musicBrainz: sameAs link to MusicBrainz for this concept tag.
Note: Provided only for entities that exist in this linked data-set.

crunchbase: website link to CrunchBase for this concept tag.
Note: Provided only for entities that exist in CrunchBase.
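
Because linked data fields are sent only when they exist for a concept, and `statusInfo` only on failure, client code may want to treat both as optional. The sketch below assumes a response already decoded from JSON into a Python dict; `AlchemyAPIError`, `ensure_ok`, and `linked_data` are hypothetical names introduced for this example.

```python
# Linked data field names as listed in the response field descriptions.
LINKED_DATA_FIELDS = (
    "website", "geo", "dbpedia", "yago", "opencyc", "freebase",
    "ciaFactbook", "census", "geonames", "musicBrainz", "crunchbase",
)


class AlchemyAPIError(Exception):
    """Raised when a response reports status ERROR (hypothetical helper)."""


def ensure_ok(response):
    """Return the response unchanged if status is OK, otherwise raise."""
    if response.get("status") == "OK":
        return response
    # statusInfo is only present on errors, e.g. "invalid-api-key".
    raise AlchemyAPIError(response.get("statusInfo", "unknown-error"))


def linked_data(concept):
    """Return only the linked data fields that are present for a concept."""
    return {name: concept[name] for name in LINKED_DATA_FIELDS if name in concept}
```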

API Notes:

  1. Calls to HTMLGetRankedConcepts should be made using HTTP POST.
  2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
  3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
  4. Language detection is performed on the posted document before attempting concept tagging. A minimum of 15 characters of text must exist within the posted HTML document to perform language detection.
  5. Documents containing fewer than 15 characters of text are assumed to be English-language content.
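
The size and text-length notes above can be checked on the client before posting. The following is a rough sketch under stated assumptions: it treats a kilobyte as 1024 bytes, measures the UTF-8 size of the HTML before URL encoding, and approximates "characters of text" by collecting all text nodes (including script and style contents), which will not match AlchemyAPI's own text extraction exactly. The name `preflight` is hypothetical.

```python
from html.parser import HTMLParser

MAX_HTML_BYTES = 600 * 1024   # assumption: 1 kilobyte = 1024 bytes
MIN_LANG_DETECT_CHARS = 15    # minimum text length for language detection


class _TextCollector(HTMLParser):
    """Collects every text node; a crude stand-in for real text extraction."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def preflight(html):
    """Report whether a document fits the size limit and has enough text."""
    size = len(html.encode("utf-8"))
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return {
        # False suggests a "content-exceeds-size-limit" error is likely.
        "size_ok": size <= MAX_HTML_BYTES,
        # False suggests the document would be assumed to be English.
        "language_detectable": len(text) >= MIN_LANG_DETECT_CHARS,
    }
```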