Entity Extraction: HTML API
AlchemyAPI provides easy-to-use facilities for extracting the semantic richness from any web page: Post (upload) any content directly to our service for analysis.
Posted content is normalized / cleaned (removing ads, navigation links, and other unimportant content), the primary document language is detected, and named entities, topic keywords, and other content are extracted automatically.
These API calls may be utilized to process posted (uploaded) webpages and other HTML content. If you are processing content hosted on a publicly accessible website, consider using our URL processing calls instead.
API Call: HTMLGetRankedNamedEntities
Description: The HTMLGetRankedNamedEntities call extracts a grouped, relevancy-ranked list of named entities (people, companies, organizations, etc.) from a posted HTML document. AlchemyGrid extracts text from the posted HTML document (ignoring navigation links, advertisements, and other undesirable content) and performs entity extraction on it.
Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetRankedNamedEntities
Parameters:
| http argument |
parameter description |
| apikey |
your private api key
(required parameter)
|
| html |
HTML document content (must be uri-argument encoded)
(required parameter)
|
| url |
HTML document URL
(optional parameter, must be uri-argument encoded)
|
| outputMode |
desired API output format
Possible values:
xml (default)
json
rdf
rel-tag
rel-tag-raw
(optional parameter)
|
| disambiguate |
whether to disambiguate detected entities.
Possible values:
1 - enabled (default)
0 - disabled
(optional parameter)
|
| linkedData |
whether to include Linked Data content links with disambiguated entities.
Possible values:
1 - enabled (default)
0 - disabled
(optional parameter. disambiguation must be enabled to utilize the linkedData feature.)
|
| coreference |
whether to resolve he/she/etc coreferences into detected entities.
Possible values:
1 - enabled (default)
0 - disabled
(optional parameter)
|
| quotations |
whether to enable quotations extraction.
Possible values:
1 - enabled
0 - disabled (default)
(optional parameter)
|
| showSourceText |
whether to include the original 'source text' the entities were extracted from within the API response.
Possible values:
1 - enabled
0 - disabled (default)
(optional parameter)
|
| sourceText |
where to obtain the text that will be processed by this API call.
AlchemyAPI supports multiple modes of text extraction: web page cleaning (removes ads, navigation links, etc.), raw text extraction (processes all web page text, including ads / nav links), visual constraint queries, and XPath queries.
Possible values:
|
cleaned_or_raw
|
cleaning enabled, fallback to raw when cleaning produces no text (default)
|
|
cleaned
|
operate on 'cleaned' web page text (web page cleaning enabled)
|
|
raw
|
operate on raw web page text (web page cleaning disabled)
|
|
cquery
|
operate on the results of a visual constraints query
Note: The 'cquery' http argument must also be set to a valid visual constraints query.
|
|
xpath
|
operate on the results of an XPath query
Note: The 'xpath' http argument must also be set to a valid XPath query.
|
(optional parameter)
|
| cquery |
a visual constraints query to apply to the web page.
Constraint queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
(optional parameter, used when sourceText is set to 'cquery'. must be uri-argument encoded)
|
| xpath |
an XPath query to apply to the web page.
XPath queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
(optional parameter, used when sourceText is set to 'xpath'. must be uri-argument encoded)
|
| maxRetrieve |
maximum number of named entities to extract (default: no limit)
(optional parameter)
|
| baseUrl |
rel-tag output base http url
(optional parameter, used with rel-tag or rel-tag-raw outputMode. must be uri-argument encoded) |
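As an illustration of how the sourceText and xpath arguments combine, the sketch below builds a form body that restricts extraction to the result of an XPath query. The helper name build_form_body, the API key placeholder, and the sample HTML and XPath are assumptions made for this example only.

```python
from urllib.parse import urlencode, parse_qs


def build_form_body(apikey, html, xpath_query, max_retrieve=None):
    """Build a urlencoded body that targets an XPath result (hypothetical helper)."""
    params = {
        "apikey": apikey,
        "html": html,
        "outputMode": "json",
        "sourceText": "xpath",  # tells the service to use the 'xpath' argument
        "xpath": xpath_query,
    }
    if max_retrieve is not None:
        params["maxRetrieve"] = str(max_retrieve)
    # urlencode performs the uri-argument encoding the parameter table asks for.
    return urlencode(params)


sample_html = "<html><body><h1>Ada Lovelace visits London</h1></body></html>"
body = build_form_body("YOUR_API_KEY", sample_html, "//h1", max_retrieve=10)

# Decoding the body again shows the arguments survive the round trip.
decoded = parse_qs(body)
print(decoded["sourceText"], decoded["xpath"])
```

Setting sourceText to 'cquery' would follow the same pattern, with the query passed in the cquery argument instead.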
Response Format (XML):
<results>
<status>REQUEST_STATUS</status>
<language>DOCUMENT_LANGUAGE</language>
<url>DOCUMENT_URL</url>
<text>DOCUMENT_TEXT</text>
<entities>
<entity>
<type>DETECTED_TYPE</type>
<relevance>DETECTED_RELEVANCE</relevance>
<count>DETECTED_COUNT</count>
<text>DETECTED_ENTITY</text>
<disambiguated>
<name>DISAMBIGUATED_ENTITY</name>
<subType>ENTITY_SUBTYPE</subType>
<website>WEBSITE</website>
<geo>LATITUDE LONGITUDE</geo>
<dbpedia>LINKED_DATA_DBPEDIA</dbpedia>
<yago>LINKED_DATA_YAGO</yago>
<opencyc>LINKED_DATA_OPENCYC</opencyc>
<umbel>LINKED_DATA_UMBEL</umbel>
<freebase>LINKED_DATA_FREEBASE</freebase>
<ciaFactbook>LINKED_DATA_FACTBOOK</ciaFactbook>
<census>LINKED_DATA_CENSUS</census>
<geonames>LINKED_DATA_GEONAMES</geonames>
<musicBrainz>LINKED_DATA_MUSICBRAINZ</musicBrainz>
<crunchbase>CRUNCHBASE_WEB_LINK</crunchbase>
<semanticCrunchbase>LINKED_DATA_CRUNCHBASE</semanticCrunchbase>
</disambiguated>
<quotations>
<quotation>ENTITY_QUOTATION</quotation>
</quotations>
</entity>
</entities>
</results>
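A minimal sketch of reading this XML layout with Python's standard library follows. The sample document is invented for illustration and fills in only a few of the fields shown above; parse_ranked_entities is a hypothetical helper, not part of any AlchemyAPI SDK.

```python
import xml.etree.ElementTree as ET

# Invented sample that follows the documented element names.
SAMPLE_XML = """<results>
  <status>OK</status>
  <language>english</language>
  <entities>
    <entity>
      <type>Person</type>
      <relevance>0.91</relevance>
      <count>3</count>
      <text>Ada Lovelace</text>
      <disambiguated>
        <name>Ada Lovelace</name>
      </disambiguated>
    </entity>
    <entity>
      <type>City</type>
      <relevance>0.42</relevance>
      <count>1</count>
      <text>London</text>
    </entity>
  </entities>
</results>"""


def parse_ranked_entities(xml_text):
    """Return a list of dicts, one per <entity> element."""
    root = ET.fromstring(xml_text)
    if root.findtext("status") != "OK":
        raise RuntimeError(root.findtext("statusInfo") or "unknown error")
    entities = []
    for node in root.iterfind("entities/entity"):
        entities.append({
            "type": node.findtext("type"),
            "relevance": float(node.findtext("relevance")),
            "count": int(node.findtext("count")),
            "text": node.findtext("text"),
            # <disambiguated> is sent only if disambiguation occurred.
            "name": node.findtext("disambiguated/name"),
        })
    return entities


for entity in parse_ranked_entities(SAMPLE_XML):
    print(entity["text"], entity["type"], entity["relevance"])
```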
Response Format (JSON):
{
"status": "REQUEST_STATUS",
"language": "DOCUMENT_LANGUAGE",
"url": "DOCUMENT_URL",
"text": "DOCUMENT_TEXT",
"entities": [
{
"type": "DETECTED_TYPE",
"relevance": "DETECTED_RELEVANCE",
"count": "DETECTED_COUNT",
"text": "DETECTED_ENTITY",
"disambiguated": {
"name": "DISAMBIGUATED_ENTITY",
"subType": "ENTITY_SUBTYPE",
"website": "WEBSITE",
"geo": "LATITUDE LONGITUDE",
"dbpedia": "LINKED_DATA_DBPEDIA",
"yago": "LINKED_DATA_YAGO",
"opencyc": "LINKED_DATA_OPENCYC",
"umbel": "LINKED_DATA_UMBEL",
"freebase": "LINKED_DATA_FREEBASE",
"ciaFactbook": "LINKED_DATA_FACTBOOK",
"census": "LINKED_DATA_CENSUS",
"geonames": "LINKED_DATA_GEONAMES",
"musicBrainz": "LINKED_DATA_MUSICBRAINZ",
"crunchbase": "CRUNCHBASE_WEB_LINK",
"semanticCrunchbase": "LINKED_DATA_CRUNCHBASE"
},
"quotations": [
{
"quotation": "ENTITY_QUOTATION"
}
]
}
]
}
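The sketch below filters a JSON response by relevance. It assumes each element of the "entities" array is a plain object and that numeric fields arrive as strings, as the placeholders above suggest. The sample payload and the helper name top_entities are invented for this example.

```python
import json

# Invented sample; only a subset of the documented fields is filled in.
SAMPLE_RESPONSE = """
{
  "status": "OK",
  "language": "english",
  "entities": [
    {"type": "City", "relevance": "0.42", "count": "1", "text": "London"},
    {"type": "Person", "relevance": "0.91", "count": "3", "text": "Ada Lovelace"}
  ]
}
"""


def top_entities(response_text, min_relevance):
    """Return (text, relevance) pairs at or above min_relevance, best first."""
    data = json.loads(response_text)
    if data.get("status") != "OK":
        # statusInfo is sent only when status is ERROR.
        raise RuntimeError(data.get("statusInfo", "unknown error"))
    pairs = [(e["text"], float(e["relevance"])) for e in data.get("entities", [])]
    kept = [p for p in pairs if p[1] >= min_relevance]
    return sorted(kept, key=lambda p: p[1], reverse=True)


print(top_entities(SAMPLE_RESPONSE, 0.5))
```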
Response Format (RDF):
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:Language>DOCUMENT_LANGUAGE</aapi:Language>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:DocText>DOCUMENT_TEXT</aapi:DocText>
</rdf:Description>
<rdf:Description rdf:ID="DOCUMENT_HASH-ENTITY_NUM">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#EntityOccurrences"/>
<aapi:Doc>DOCUMENT_HASH</aapi:Doc>
<aapi:EntityType>DETECTED_TYPE</aapi:EntityType>
<aapi:Relevance>DETECTED_RELEVANCE</aapi:Relevance>
<aapi:NumOccurs>DETECTED_COUNT</aapi:NumOccurs>
<aapi:Name>DETECTED_ENTITY</aapi:Name>
<aapi:Disambiguation>
<rdf:Description rdf:about="#DOCUMENT_HASH-ENTITY_NUM">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#Disambiguation"/>
<aapi:Doc>DOCUMENT_HASH</aapi:Doc>
<aapi:ResolvedName>DISAMBIGUATED_ENTITY</aapi:ResolvedName>
<aapi:SubType>ENTITY_SUBTYPE</aapi:SubType>
<aapi:URL>WEBSITE</aapi:URL>
<aapi:Geo>LATITUDE LONGITUDE</aapi:Geo>
<owl:sameAs rdf:resource="LINKED_DATA_DBPEDIA"/>
<owl:sameAs rdf:resource="LINKED_DATA_YAGO"/>
<owl:sameAs rdf:resource="LINKED_DATA_OPENCYC"/>
<owl:sameAs rdf:resource="LINKED_DATA_UMBEL"/>
<owl:sameAs rdf:resource="LINKED_DATA_FREEBASE"/>
<owl:sameAs rdf:resource="LINKED_DATA_FACTBOOK"/>
<owl:sameAs rdf:resource="LINKED_DATA_CENSUS"/>
<owl:sameAs rdf:resource="LINKED_DATA_GEONAMES"/>
<owl:sameAs rdf:resource="LINKED_DATA_MUSICBRAINZ"/>
<owl:sameAs rdf:resource="LINKED_DATA_CRUNCHBASE"/>
</rdf:Description>
</aapi:Disambiguation>
<aapi:Quotations>
<rdf:Description rdf:about="#DOCUMENT_HASH-ENTITY_NUM">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#Quotations"/>
<aapi:Doc>DOCUMENT_HASH</aapi:Doc>
<aapi:Quotation>ENTITY_QUOTATION</aapi:Quotation>
</rdf:Description>
</aapi:Quotations>
</rdf:Description>
</rdf:RDF>
Response Format (REL-TAG Microformat [XML-embedded] ):
<results>
<status>REQUEST_STATUS</status>
<language>DOCUMENT_LANGUAGE</language>
<url>REQUESTED_URL</url>
<text>DOCUMENT_TEXT</text>
<microformats>
<a href="REQUESTED_BASE_URL/DETECTED_ENTITY" rel="tag">DETECTED_ENTITY</a>
<a href="REQUESTED_BASE_URL/DETECTED_ENTITY" rel="tag">DETECTED_ENTITY</a>
</microformats>
</results>
Response Format (REL-TAG Microformat [raw] ):
<a href="REQUESTED_BASE_URL/DETECTED_ENTITY" rel="tag">DETECTED_ENTITY</a>
<a href="REQUESTED_BASE_URL/DETECTED_ENTITY" rel="tag">DETECTED_ENTITY</a>
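The raw rel-tag output is a sequence of anchor elements, so it can be read with an ordinary HTML parser. The sketch below collects (href, text) pairs using Python's html.parser. The base URL and entity names in the sample are invented, and whether the service percent-encodes the entity inside the href is an assumption.

```python
from html.parser import HTMLParser


class RelTagParser(HTMLParser):
    """Collect (href, text) for every <a rel="tag"> element."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._inside = False
        self._href = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("rel") == "tag":
            self._inside = True
            self._href = attributes.get("href")
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.tags.append((self._href, "".join(self._buffer)))
            self._inside = False


# Invented sample in the documented raw rel-tag shape.
SAMPLE_REL_TAGS = (
    '<a href="http://example.com/tags/Ada%20Lovelace" rel="tag">Ada Lovelace</a>\n'
    '<a href="http://example.com/tags/London" rel="tag">London</a>\n'
)

parser = RelTagParser()
parser.feed(SAMPLE_REL_TAGS)
parser.close()
print(parser.tags)
```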
Response Fields:
| field name |
field description |
| status |
success / failure status indicating whether the request was processed.
Possible values:
OK
ERROR
|
| language |
detected language (english, french, ..) for the submitted content. |
| url |
http url information was requested for. |
| type |
the detected entity type.
Possible values: see the list of supported entity types.
|
| relevance |
relevance score for a detected entity.
Possible values: (0.0 - 1.0) [1.0 = most relevant] |
| count |
number of times an entity was seen within the source web page. |
| text |
the detected entity text. |
| disambiguated |
disambiguation information for the detected entity (sent only if disambiguation occurred)
| disambiguation field |
field description |
| name |
the disambiguated entity name. |
| subType |
the disambiguated entity subType
SubTypes expose additional ontological mappings for a detected entity, such as identification of a Person as a Politician or Athlete. |
| website |
the disambiguated entity website. |
| geo |
latitude longitude
the disambiguated entity geographic coordinates. |
| dbpedia |
sameAs link to DBpedia for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| yago |
sameAs link to YAGO for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| opencyc |
sameAs link to OpenCyc for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| umbel |
sameAs link to UMBEL for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| freebase |
sameAs link to Freebase for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| ciaFactbook |
sameAs link to the CIA World Factbook for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| census |
sameAs link to the US Census for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| geonames |
sameAs link to Geonames for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| musicBrainz |
sameAs link to MusicBrainz for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| crunchbase |
website link to CrunchBase for the disambiguated entity.
Note: Provided only for entities that exist in CrunchBase. |
| semanticCrunchbase |
sameAs link to Semantic CrunchBase for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
|
| quotations |
extracted quotations for the detected entity (sent only if quotations extraction is enabled)
| field |
field description |
| quotation |
quotation extracted for a particular named entity. |
|
| statusInfo |
failure status information (sent only if "status" == "ERROR").
Possible values:
invalid-api-key
page-is-not-html
|
API Notes:
- 1. Calls to HTMLGetRankedNamedEntities should be made using HTTP POST.
- 2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
- 3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
- 4. Language detection is performed on the posted document before attempting named entity extraction. The posted HTML document must contain at least 15 characters of text for language detection to be performed.
- 5. Documents containing fewer than 15 characters of text are assumed to be English-language content.
- 6. Disambiguation of detected entities is enabled by default. Disambiguation information will be included for each entity that is successfully resolved.
- 7. Entity extraction is currently supported for all languages listed on the language support page. Other foreign-language submissions will be rejected and an error response returned.
- 8. Disambiguation and quotations extraction are currently available for English-language content only. Support for other languages is in development.
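Notes 1 to 3 can be sketched as a request builder that uses HTTP POST, sets the Content-Type header, and rejects oversized documents before they are sent. The helper name build_request and the API key are placeholders, and reading "600 kilobytes" as 600 * 1024 bytes of UTF-8 is an assumption. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://access.alchemyapi.com/calls/html/HTMLGetRankedNamedEntities"
# Assumption: the 600 kilobyte limit is measured as 600 * 1024 bytes.
MAX_HTML_BYTES = 600 * 1024


def build_request(apikey, html, **options):
    """Build (but do not send) a POST request for the call (hypothetical helper)."""
    if len(html.encode("utf-8")) > MAX_HTML_BYTES:
        raise ValueError("HTML document exceeds the 600 kilobyte posting limit")
    params = {"apikey": apikey, "html": html}
    params.update(options)
    body = urlencode(params).encode("ascii")
    return Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_request(
    "YOUR_API_KEY",
    "<html><body><p>Ada Lovelace lived in London.</p></body></html>",
    outputMode="json",
    quotations="1",
)
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request).read()
```

Checking the size locally avoids a round trip that would only return a "content-exceeds-size-limit" error.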
API Call: HTMLGetNamedEntities
Description: The HTMLGetNamedEntities call extracts named entities (people, companies, organizations, etc.) from a posted HTML document. AlchemyGrid extracts text from the posted HTML document (ignoring navigation links, advertisements, and other undesirable content) and performs entity extraction on it.
Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetNamedEntities
Parameters:
| http argument |
parameter description |
| apikey |
your private api key
(required parameter)
|
| html |
HTML document content (must be uri-argument encoded)
(required parameter)
|
| url |
HTML document URL
(optional parameter, must be uri-argument encoded)
|
| outputMode |
desired API output format
Possible values:
xml (default)
json
rdf
(optional parameter)
|
| disambiguate |
whether to disambiguate detected entities.
Possible values:
1 - enabled (default)
0 - disabled
(optional parameter)
|
| linkedData |
whether to include Linked Data content links with disambiguated entities.
Possible values:
1 - enabled (default)
0 - disabled
(optional parameter. disambiguation must be enabled to utilize the linkedData feature.)
|
| coreference |
whether to resolve he/she/etc coreferences into detected entities.
Possible values:
1 - enabled (default)
0 - disabled
(optional parameter)
|
| quotations |
whether to enable quotations extraction.
Possible values:
1 - enabled
0 - disabled (default)
(optional parameter)
|
| showSourceText |
whether to include the original 'source text' the entities were extracted from within the API response.
Possible values:
1 - enabled
0 - disabled (default)
(optional parameter)
|
| sourceText |
where to obtain the text that will be processed by this API call.
AlchemyAPI supports multiple modes of text extraction: web page cleaning (removes ads, navigation links, etc.), raw text extraction (processes all web page text, including ads / nav links), visual constraint queries, and XPath queries.
Possible values:
|
cleaned_or_raw
|
cleaning enabled, fallback to raw when cleaning produces no text (default)
|
|
cleaned
|
operate on 'cleaned' web page text (web page cleaning enabled)
|
|
raw
|
operate on raw web page text (web page cleaning disabled)
|
|
cquery
|
operate on the results of a visual constraints query
Note: The 'cquery' http argument must also be set to a valid visual constraints query.
|
|
xpath
|
operate on the results of an XPath query
Note: The 'xpath' http argument must also be set to a valid XPath query.
|
(optional parameter)
|
| cquery |
a visual constraints query to apply to the web page.
Constraint queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
(optional parameter, used when sourceText is set to 'cquery'. must be uri-argument encoded)
|
| xpath |
an XPath query to apply to the web page.
XPath queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
(optional parameter, used when sourceText is set to 'xpath'. must be uri-argument encoded)
|
Response Format (XML):
<results>
<status>REQUEST_STATUS</status>
<language>DOCUMENT_LANGUAGE</language>
<url>DOCUMENT_URL</url>
<text>DOCUMENT_TEXT</text>
<entities>
<entity>
<type>DETECTED_TYPE</type>
<start>START_POS</start>
<end>END_POS</end>
<text>DETECTED_ENTITY</text>
<disambiguated>
<name>DISAMBIGUATED_ENTITY</name>
<subType>ENTITY_SUBTYPE</subType>
<website>WEBSITE</website>
<geo>LATITUDE LONGITUDE</geo>
<dbpedia>LINKED_DATA_DBPEDIA</dbpedia>
<yago>LINKED_DATA_YAGO</yago>
<opencyc>LINKED_DATA_OPENCYC</opencyc>
<umbel>LINKED_DATA_UMBEL</umbel>
<freebase>LINKED_DATA_FREEBASE</freebase>
<ciaFactbook>LINKED_DATA_FACTBOOK</ciaFactbook>
<census>LINKED_DATA_CENSUS</census>
<geonames>LINKED_DATA_GEONAMES</geonames>
<musicBrainz>LINKED_DATA_MUSICBRAINZ</musicBrainz>
<crunchbase>CRUNCHBASE_WEB_LINK</crunchbase>
<semanticCrunchbase>LINKED_DATA_CRUNCHBASE</semanticCrunchbase>
</disambiguated>
<quotations>
<quotation>ENTITY_QUOTATION</quotation>
</quotations>
</entity>
</entities>
</results>
Response Format (JSON):
{
"status": "REQUEST_STATUS",
"language": "DOCUMENT_LANGUAGE",
"url": "DOCUMENT_URL",
"text": "DOCUMENT_TEXT",
"entities": [
{
"type": "DETECTED_TYPE",
"start": "START_POS",
"end": "END_POS",
"text": "DETECTED_ENTITY",
"disambiguated": {
"name": "DISAMBIGUATED_ENTITY",
"subType": "ENTITY_SUBTYPE",
"website": "WEBSITE",
"geo": "LATITUDE LONGITUDE",
"dbpedia": "LINKED_DATA_DBPEDIA",
"yago": "LINKED_DATA_YAGO",
"opencyc": "LINKED_DATA_OPENCYC",
"umbel": "LINKED_DATA_UMBEL",
"freebase": "LINKED_DATA_FREEBASE",
"ciaFactbook": "LINKED_DATA_FACTBOOK",
"census": "LINKED_DATA_CENSUS",
"geonames": "LINKED_DATA_GEONAMES",
"musicBrainz": "LINKED_DATA_MUSICBRAINZ",
"crunchbase": "CRUNCHBASE_WEB_LINK",
"semanticCrunchbase": "LINKED_DATA_CRUNCHBASE"
},
"quotations": [
{
"quotation": "ENTITY_QUOTATION"
}
]
}
]
}
Response Format (RDF):
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:Language>DOCUMENT_LANGUAGE</aapi:Language>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:DocText>DOCUMENT_TEXT</aapi:DocText>
</rdf:Description>
<rdf:Description rdf:ID="DOCUMENT_HASH-ENTITY_NUM">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#EntityOccurrence"/>
<aapi:Doc>DOCUMENT_HASH</aapi:Doc>
<aapi:EntityType>DETECTED_TYPE</aapi:EntityType>
<aapi:TextStartPos>START_POS</aapi:TextStartPos>
<aapi:TextEndPos>END_POS</aapi:TextEndPos>
<aapi:Name>DETECTED_ENTITY</aapi:Name>
<aapi:Disambiguation>
<rdf:Description rdf:about="#DOCUMENT_HASH-ENTITY_NUM">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#Disambiguation"/>
<aapi:Doc>DOCUMENT_HASH</aapi:Doc>
<aapi:ResolvedName>DISAMBIGUATED_ENTITY</aapi:ResolvedName>
<aapi:SubType>ENTITY_SUBTYPE</aapi:SubType>
<aapi:URL>WEBSITE</aapi:URL>
<aapi:Geo>LATITUDE LONGITUDE</aapi:Geo>
<owl:sameAs rdf:resource="LINKED_DATA_DBPEDIA"/>
<owl:sameAs rdf:resource="LINKED_DATA_YAGO"/>
<owl:sameAs rdf:resource="LINKED_DATA_OPENCYC"/>
<owl:sameAs rdf:resource="LINKED_DATA_UMBEL"/>
<owl:sameAs rdf:resource="LINKED_DATA_FREEBASE"/>
<owl:sameAs rdf:resource="LINKED_DATA_FACTBOOK"/>
<owl:sameAs rdf:resource="LINKED_DATA_CENSUS"/>
<owl:sameAs rdf:resource="LINKED_DATA_GEONAMES"/>
<owl:sameAs rdf:resource="LINKED_DATA_MUSICBRAINZ"/>
<owl:sameAs rdf:resource="LINKED_DATA_CRUNCHBASE"/>
</rdf:Description>
</aapi:Disambiguation>
<aapi:Quotations>
<rdf:Description rdf:about="#DOCUMENT_HASH-ENTITY_NUM">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#Quotations"/>
<aapi:Doc>DOCUMENT_HASH</aapi:Doc>
<aapi:Quotation>ENTITY_QUOTATION</aapi:Quotation>
</rdf:Description>
</aapi:Quotations>
</rdf:Description>
</rdf:RDF>
Response Fields:
| field name |
field description |
| status |
success / failure status indicating whether the request was processed.
Possible values:
OK
ERROR
|
| language |
detected language (english, french, ..) for the submitted content. |
| url |
http url information was requested for. |
| type |
the detected entity type.
Possible values: see the list of supported entity types.
|
| start |
start offset (in bytes) of this entity in the text stream.
Only included in API responses for AlchemyAPI subscription users |
| end |
end offset (in bytes) of this entity in the text stream.
Only included in API responses for AlchemyAPI subscription users |
| text |
the detected entity text. |
| disambiguated |
disambiguation information for the detected entity (sent only if disambiguation occurred)
| disambiguation field |
field description |
| name |
the disambiguated entity name. |
| subType |
the disambiguated entity subType
SubTypes expose additional ontological mappings for a detected entity, such as identification of a Person as a Politician or Athlete. |
| website |
the disambiguated entity website. |
| geo |
latitude longitude
the disambiguated entity geographic coordinates. |
| dbpedia |
sameAs link to DBpedia for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| yago |
sameAs link to YAGO for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| opencyc |
sameAs link to OpenCyc for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| umbel |
sameAs link to UMBEL for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| freebase |
sameAs link to Freebase for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| ciaFactbook |
sameAs link to the CIA World Factbook for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| census |
sameAs link to the US Census for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| geonames |
sameAs link to Geonames for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| musicBrainz |
sameAs link to MusicBrainz for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
| crunchbase |
website link to CrunchBase for the disambiguated entity.
Note: Provided only for entities that exist in CrunchBase. |
| semanticCrunchbase |
sameAs link to Semantic CrunchBase for the disambiguated entity.
Note: Provided only for entities that exist in this linked data-set. |
|
| quotations |
extracted quotations for the detected entity (sent only if quotations extraction is enabled)
| field |
field description |
| quotation |
quotation extracted for a particular named entity. |
|
| statusInfo |
failure status information (sent only if "status" == "ERROR").
Possible values:
invalid-api-key
page-is-not-html
|
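Because start and end are byte offsets rather than character offsets, slicing a decoded string directly can land on the wrong characters once the text contains multi-byte characters. The sketch below assumes the text stream is UTF-8 and that the end offset is exclusive. Neither detail is stated above, so both should be checked against real responses. The sample sentence is invented.

```python
def entity_at(text, start, end):
    """Slice bytes [start, end) out of the UTF-8 encoding of text."""
    raw = text.encode("utf-8")
    return raw[start:end].decode("utf-8")


source_text = "Café owner Zoë Smith spoke."
prefix = "Café owner "

# Byte offsets differ from character offsets because 'é' and 'ë' take 2 bytes each.
start = len(prefix.encode("utf-8"))
end = start + len("Zoë Smith".encode("utf-8"))

print(entity_at(source_text, start, end))
print(source_text[start:end])  # naive character slicing gives a different span
```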
API Notes:
- 1. Calls to HTMLGetNamedEntities should be made using HTTP POST.
- 2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
- 3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
- 4. Language detection is performed on the posted document before attempting named entity extraction. The posted HTML document must contain at least 15 characters of text for language detection to be performed.
- 5. Documents containing fewer than 15 characters of text are assumed to be English-language content.
- 6. Disambiguation of detected entities is enabled by default. Disambiguation information will be included for each entity that is successfully resolved.
- 7. Entity extraction is currently supported for all languages listed on the language support page. Other foreign-language submissions will be rejected and an error response returned.
- 8. Disambiguation and quotations extraction are currently available for English-language content only. Support for other languages is in development.
API Call: HTMLGetAnnotatedNamedEntityText
Description: The HTMLGetAnnotatedNamedEntityText call annotates posted webpage text with detected named entities (people, companies, organizations, etc.). AlchemyGrid processes the posted HTML document, extracts text from the document structure (ignoring navigation links, advertisements, and other undesirable content), performs entity extraction, and annotates the original webpage text according to the provided variable substitution template.
Endpoint: http://access.alchemyapi.com/calls/html/HTMLGetAnnotatedNamedEntityText
Parameters:
| http argument |
parameter description |
| apikey |
your private api key
(required parameter)
|
| html |
HTML document content (must be uri-argument encoded)
(required parameter)
|
| url |
HTML document URL
(optional parameter, must be uri-argument encoded)
|
| template |
the annotation template to apply to the text being marked up. This template is applied to each detected entity in the source text. Variable substitution is used to integrate entity information into the annotated text.
Supported substitution variables:
| variable |
variable description |
| $ENTITY |
the detected named entity text |
| $ENCODED_ENTITY |
the detected named entity text (uri-argument encoded) |
| $TRIMMED_ENTITY |
the detected named entity text (punctuation-trimmed) |
| $ENCODED_TRIMMED_ENTITY |
the detected named entity text (punctuation-trimmed, uri-argument encoded) |
| $RESOLVED_ENTITY |
the resolved, disambiguated named entity |
| $ENCODED_RESOLVED_ENTITY |
the resolved, disambiguated named entity (uri-argument encoded) |
| $TYPE |
the detected named entity type |
(optional parameter)
|
| outputMode |
desired API output format
Possible values:
xml (default)
json
rdf
(optional parameter)
|
| disambiguate |
whether to disambiguate detected entities.
Possible values:
1 - enabled (default)
0 - disabled
(optional parameter)
|
| coreference |
whether to resolve he/she/etc coreferences into detected entities.
Possible values:
1 - enabled (default)
0 - disabled
(optional parameter)
|
| quotations |
whether to enable quotations extraction.
Possible values:
1 - enabled
0 - disabled (default)
(optional parameter)
|
| showSourceText |
whether to include the original 'source text' the entities were extracted from within the API response.
Possible values:
1 - enabled
0 - disabled (default)
(optional parameter)
|
| sourceText |
where to obtain the text that will be processed by this API call.
AlchemyAPI supports multiple modes of text extraction: web page cleaning (removes ads, navigation links, etc.), raw text extraction (processes all web page text, including ads / nav links), visual constraint queries, and XPath queries.
Possible values:
|
cleaned_or_raw
|
cleaning enabled, fallback to raw when cleaning produces no text (default)
|
|
cleaned
|
operate on 'cleaned' web page text (web page cleaning enabled)
|
|
raw
|
operate on raw web page text (web page cleaning disabled)
|
|
cquery
|
operate on the results of a visual constraints query
Note: The 'cquery' http argument must also be set to a valid visual constraints query.
|
|
xpath
|
operate on the results of an XPath query
Note: The 'xpath' http argument must also be set to a valid XPath query.
|
(optional parameter)
|
| cquery |
a visual constraints query to apply to the web page.
Constraint queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
(optional parameter, used when sourceText is set to 'cquery'. must be uri-argument encoded)
|
| xpath |
an XPath query to apply to the web page.
XPath queries enable API operations to be performed on a targeted area of a web page, such as a story title or product description.
(optional parameter, used when sourceText is set to 'xpath'. must be uri-argument encoded)
|
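To give a feel for the template argument, the sketch below previews locally what a template might produce for one entity. It only mimics the documented substitution variables with Python's string.Template and is not the service's implementation. The template, the example.com URL, and the exact form of uri-argument encoding (percent-encoding via urllib.parse.quote) are assumptions.

```python
from string import Template
from urllib.parse import quote

# Hypothetical template that wraps each entity in a link.
ANNOTATION_TEMPLATE = (
    '<a href="http://example.com/entity/$ENCODED_ENTITY" class="$TYPE">$ENTITY</a>'
)


def preview_annotation(template, entity, entity_type):
    """Local preview only: fills in $ENTITY, $ENCODED_ENTITY and $TYPE."""
    return Template(template).safe_substitute(
        ENTITY=entity,
        ENCODED_ENTITY=quote(entity, safe=""),
        TYPE=entity_type,
    )


preview = preview_annotation(ANNOTATION_TEMPLATE, "Ada Lovelace", "Person")
print(preview)
```

The same template string would be passed, uri-argument encoded, as the template argument of the POST body.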
Response Format (XML):
<results>
<status>REQUEST_STATUS</status>
<url>REQUESTED_URL</url>
<language>DETECTED_LANGUAGE</language>
<text>DOCUMENT_TEXT</text>
<annotatedText>ANNOTATED_TEXT</annotatedText>
</results>
Response Format (JSON):
{
"status": "REQUEST_STATUS",
"url": "REQUESTED_URL",
"language": "DETECTED_LANGUAGE",
"text": "DOCUMENT_TEXT",
"annotatedText": "ANNOTATED_TEXT"
}
Response Format (RDF):
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:aapi="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#"
xml:base="http://rdf.alchemyapi.com/rdf/v1/r/response.rdf">
<rdf:Description rdf:ID="DOCUMENT_HASH">
<rdf:type rdf:resource="http://rdf.alchemyapi.com/rdf/v1/s/aapi-schema#DocInfo"/>
<aapi:ResultStatus>REQUEST_STATUS</aapi:ResultStatus>
<aapi:Language>DOCUMENT_LANGUAGE</aapi:Language>
<aapi:URL>DOCUMENT_URL</aapi:URL>
<aapi:DocText>DOCUMENT_TEXT</aapi:DocText>
<aapi:AnnotatedDocText>ANNOTATED_TEXT</aapi:AnnotatedDocText>
</rdf:Description>
</rdf:RDF>
Response Fields:
| field name |
field description |
| status |
success / failure status indicating whether the request was processed.
Possible values:
OK
ERROR
|
| language |
detected language (english, french, ..) for the submitted content. |
| url |
http url information was requested for. |
| annotatedText |
the source text annotated with all identified named entities. Text annotation is controlled by the provided template parameter. |
| statusInfo |
failure status information (sent only if "status" == "ERROR").
Possible values:
invalid-api-key
cannot-retrieve
page-is-not-html
|
API Notes:
- 1. Calls to HTMLGetAnnotatedNamedEntityText should be made using HTTP POST.
- 2. HTTP POST calls should include the Content-Type header: application/x-www-form-urlencoded
- 3. Posted HTML documents can be a maximum of 600 kilobytes. Larger documents will result in a "content-exceeds-size-limit" error response.
- 4. Language detection is performed on the posted document before attempting text annotation. The posted HTML document must contain at least 15 characters of text for language detection to be performed.
- 5. Documents containing fewer than 15 characters of text are assumed to be English-language content.
- 6. Disambiguation of detected entities is enabled by default. Disambiguation information will be made available for variable substitution for each entity that is successfully resolved.
- 7. Entity extraction is currently supported for all languages listed on the language support page. Other foreign-language submissions will be rejected and an error response returned.
- 8. Disambiguation and quotations extraction are currently available for English-language content only. Support for other languages is in development.