Diffbot's Search API allows you to search the extracted content of one or all of your Diffbot "collections." A collection is a discrete Crawlbot (:php:class:`Swader\\Diffbot\\Api\\Crawl`) or Bulk API job, and includes all of the web pages processed within that job.

In order to search a collection, you must first create that collection using either Crawlbot or the Bulk API. A collection can be searched before a crawl or bulk job is finished.

Whereas Crawlbot returns information about a specific crawl job, the Search API returns sets of matching documents from Diffbot's database, depending on the provided query parameters.

The API consists of two parts: the API class used to make the call and return the results, and the SearchInfo class, an alternative result that provides metadata about the query and the complete result set. We'll describe both, in order.

Note that the API class extends :php:class:`Swader\\Diffbot\\Abstracts\\Api`, so be sure to read that first if you haven't already.

.. php:namespace:: Swader\Diffbot\Api

.. php:class:: Search

This API class is a bit specific in that it only extends :php:class:`Swader\\Diffbot\\Abstracts\\Api` to inherit part of a single function; almost everything else is custom implemented, due to the highly specific nature of the API.

Basic usage:

.. code-block:: php

    use Swader\Diffbot\Diffbot;

    $diffbot = new Diffbot('my_token');
    $search = $diffbot->search('author:"Miles Johnson" AND type:article');
    $result = $search->call();

    foreach ($result as $article) {
        echo $article->getTitle();
    }

    $info = $search->call(true);
    echo $info->getHits(); // 50

.. php:method:: __construct($q)

    :param string $q: Query string to run on the collection(s)

    The constructor takes a string like ``foo AND bar AND title:baz``. This would make the API search for documents containing both "foo" and "bar" in any of the fields, and "baz" in the title field.
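
As a rough illustration of the query syntax described above, the following sketch assembles a query string from plain terms and field/value pairs. The ``buildQuery`` helper is hypothetical and not part of the client library. It assumes that values containing whitespace should be wrapped in double quotes, as the ``author:"Miles Johnson"`` term in the basic usage example suggests, and it does not escape quotes inside values.

.. code-block:: php

    <?php

    /**
     * Hypothetical helper (not part of the library): joins search terms with AND.
     * Integer keys are treated as plain terms, string keys as field names.
     */
    function buildQuery(array $terms): string
    {
        $parts = [];
        foreach ($terms as $field => $value) {
            // Quote values containing whitespace so they stay together as a phrase.
            if (preg_match('/\s/', $value)) {
                $value = '"' . $value . '"';
            }
            $parts[] = is_int($field) ? $value : $field . ':' . $value;
        }

        return implode(' AND ', $parts);
    }

    echo buildQuery(['foo', 'bar', 'title' => 'baz']), PHP_EOL;
    // foo AND bar AND title:baz

The resulting string could then be passed to ``$diffbot->search()`` in place of a hand-written query.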

.. php:method:: setCol($col = null)

    :param string $col: [Optional] Name of collection to search
    :returns: $this

    If a collection name is not provided, the Search API will search all the collections under the currently active token.

.. php:method:: setNum($num = 20)

    :param string|int $num: Number of results to return
    :returns: $this

    The ``$num`` param should either be a number, or the string "all" if you want the API to return all the results. Note that this may be quite a large payload if the search terms are broad, and you'd likely be better off paginating the results (see below).

.. php:method:: setStart($start = 0)

    :param int $start: The starting result number. Used during pagination.
    :returns: $this
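
To make the pagination arithmetic concrete, here is a minimal sketch that lists the start offsets needed to page through a result set. The ``pageOffsets`` helper is hypothetical and not part of the client library, and it assumes that result numbering starts at zero, in line with the default value of ``$start``.

.. code-block:: php

    <?php

    /**
     * Hypothetical helper (not part of the library): lists the start offsets
     * needed to walk through $hits results in pages of $perPage.
     */
    function pageOffsets(int $hits, int $perPage): array
    {
        if ($perPage < 1) {
            throw new InvalidArgumentException('Page size must be at least 1.');
        }

        $offsets = [];
        for ($start = 0; $start < $hits; $start += $perPage) {
            $offsets[] = $start;
        }

        return $offsets;
    }

    // 50 matching documents, 20 per page: three requests are needed.
    print_r(pageOffsets(50, 20)); // offsets 0, 20 and 40

    // With a configured Search instance, each offset could then be applied as:
    // $search->setNum(20)->setStart($offset)->call();

Because both setters return ``$this``, they can be chained before ``call()`` as shown in the final comment.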

.. php:method:: buildUrl()

    :returns: string

    This method is called automatically when :php:meth:`Swader\\Diffbot\\Abstracts\\Api::call` is called. It builds and returns the URL to be requested by the HTTP client set via :php:meth:`Swader\\Diffbot\\Diffbot::setHttpClient`. It can also be used to get the URL for testing in third-party API clients like `Postman <https://www.getpostman.com/>`_.

    Usage:

    .. code-block:: php

        // ... set up $api ...
        $myUrl = $api->buildUrl();

.. php:method:: call($info = false)

    :param bool $info: Either ``true`` or ``false``
    :returns: :php:class:`Swader\\Diffbot\\Entity\\SearchInfo` | :php:class:`Swader\\Diffbot\\Entity\\EntityIterator`

    When the API instance has been fully configured, this method executes the call. If the ``$info`` parameter passed into the method is ``false``, the return value will be an iterable collection (:php:class:`Swader\\Diffbot\\Entity\\EntityIterator`) of appropriate entities. Refer to each API's documentation for details on the entities returned from each API call.

    If you pass in ``true``, you force info mode and get back a :php:class:`Swader\\Diffbot\\Entity\\SearchInfo` object related to the last call. Keep in mind that passing in ``true`` before a default ``call()`` has been made will implicitly execute that default ``call()`` first, and then return the SearchInfo. So::

        $searchApi->call(); // gets entities
        $searchApi->call(true); // gets SearchInfo about the executed query

When the Search API is called with info mode forced, the API will return an info object containing various properties useful for pagination and metadata.

.. php:namespace:: Swader\Diffbot\Entity

.. php:class:: SearchInfo

.. php:method:: getType()

    :returns: string

    Will always return "searchInfo"::

        // ... API setup ...
        $result = $api->call(true);
        echo $result->getType(); // "searchInfo"

.. php:method:: getCurrentTimeUTC()

    :returns: int

    Current UTC time as timestamp

.. php:method:: getResponseTimeMS()

    :returns: int

    Response time in milliseconds. Time it took to process the query on Diffbot's end.

.. php:method:: getNumResultsOmitted()

    :returns: int

    Number of results skipped for any reason

.. php:method:: getNumShardsSkipped()

    :returns: int

    Number of skipped shards (@todo find out what those are)

.. php:method:: getTotalShards()

    :returns: int

    Total number of shards (@todo find out what those are)

.. php:method:: getDocsInCollection()

    :returns: int

    Total number of documents in collection. Should resemble the total number you got on the crawl job. (@todo: find out why not identical)

.. php:method:: getHits()

    :returns: int

    Number of results that match, NOT the number of *returned* results! Use this for pagination as a total result count.

.. php:method:: getQueryInfo()

    :returns: array

    Returns an associative array containing the following keys and example values::

        /**
        "fullQuery" => "type:json AND (author:\"Miles Johnson\" AND type:article)",
        "queryLanguageAbbr" => "xx",
        "queryLanguage" => "Unknown",
        "terms" => [
            [
                "termNum" => 0,
                "termStr" => "Miles Johnson",
                "termFreq" => 2621376,
                "termHash48" => 224575481707228,
                "termHash64" => 4150001371756911641,
                "prefixHash64" => 3732660069076179349
            ],
            [
                "termNum" => 1,
                "termStr" => "type:json",
                "termFreq" => 2621664,
                "termHash48" => 272064464231140,
                "termHash64" => 9877301297136722857,
                "prefixHash64" => 7586288672657224048
            ],
            [
                "termNum" => 2,
                "termStr" => "type:article",
                "termFreq" => 524448,
                "termHash48" => 210861560163398,
                "termHash64" => 12449358332005671483,
                "prefixHash64" => 7586288672657224048
            ]
        ]
        **/

    @todo: find out what hashes are, and to what the freq is relative
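
As an illustration of how the array described above might be consumed, the following sketch prints one summary line per term. The ``summarizeTerms`` function is a hypothetical helper, not part of the library, and the sample data is a trimmed copy of the example values rather than a live response.

.. code-block:: php

    <?php

    // Sample data trimmed from the documented example; with a live call this
    // would come from $info->getQueryInfo().
    $queryInfo = [
        'fullQuery' => 'type:json AND (author:"Miles Johnson" AND type:article)',
        'terms' => [
            ['termNum' => 0, 'termStr' => 'Miles Johnson', 'termFreq' => 2621376],
            ['termNum' => 1, 'termStr' => 'type:json', 'termFreq' => 2621664],
            ['termNum' => 2, 'termStr' => 'type:article', 'termFreq' => 524448],
        ],
    ];

    /**
     * Hypothetical helper: builds one human-readable line per query term.
     */
    function summarizeTerms(array $queryInfo): array
    {
        $lines = [];
        // Fall back to an empty list if the "terms" key is missing.
        foreach ($queryInfo['terms'] ?? [] as $term) {
            $lines[] = sprintf('#%d %s (freq: %d)', $term['termNum'], $term['termStr'], $term['termFreq']);
        }

        return $lines;
    }

    echo implode(PHP_EOL, summarizeTerms($queryInfo)), PHP_EOL;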