How much data can Watson Discovery process from IBM Cloud Object Storage?


I'm using Watson Assistant together with Watson Discovery in a Node.js app. The idea is to use the Discovery service for Q&A-style requests: I pass the utterance from the assistant to Discovery and get an answer back. To support this, I have prepared a data structure in JSON format that will act as a Q&A database. Example:

{
  "elements":[
    {
      "ProductID":12345,
      "Questions":[
        "What is included in insurance Type A?",
        "Does insurance Type A provide this kind of protection?"
      ],
      "Answer":"Insurance Type A can be used for the cases ..." 
    },
    ...
  ]
}

This data can be updated, deleted, extended, etc. (all the normal database operations) through an API, and after each change it must be updated on the Discovery side as well. I've checked the integration types (Salesforce, Box, etc.) and found that there is an IBM Cloud Object Storage integration, which I want to use as the database. My question is:

After we have set up a connection to an endpoint, will Discovery process all of the data in that bucket, even if it grows to 1 GB in the future?

There is 1 best solution below

You can use Discovery to connect to and crawl documents from remote sources.

The following general requirements apply to all data sources:

  • The individual document file size limit for Box, Salesforce, SharePoint Online, SharePoint 2016, IBM Cloud Object Storage, and Web Crawl is 10 MB.
  • You must have the credentials, file locations, or URLs for each data source. A developer or system administrator typically provides the credentials, file locations, and URLs of the data source.
  • You must know which resources of the data source to crawl, which the source administrator can provide. If you crawl Box or Salesforce, a list of available resources is presented when you configure a source, using the Discovery tooling.
  • If you are using the Discovery tooling, you can configure a collection with a single data source. If you are using the API, you can ingest documents from multiple data sources into a single collection.
  • Crawling a data source uses resources, namely API calls, of the data source. The number of API calls depends on the number of documents that need to be crawled. You must obtain an appropriate level of service license, for example Enterprise, for the data source. For information about the appropriate service level license that you need, contact the source system administrator.
  • Discovery source crawls do not delete documents that are stored in a collection, but you can manually delete them using the API. When a source is re-crawled, new documents are added, updated documents are modified to the current version, and deleted documents remain as the version last stored.
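
The 10 MB limit in the first bullet is stated per individual document, not per bucket, so one way to keep a growing Q&A database crawlable is to store each element as its own small JSON object. The following is a minimal sketch under that assumption. `splitIntoDocuments`, `findOversized`, and the `product-<ProductID>.json` naming are hypothetical helpers, not part of any Watson SDK, and treating 1 MB as 1024 × 1024 bytes is also an assumption.

```javascript
// Per-document crawl limit quoted above.
// Assumption: 1 MB is counted as 1024 * 1024 bytes.
const CRAWL_LIMIT_BYTES = 10 * 1024 * 1024;

// Turn the single Q&A database object into one small document per element.
// The "product-<ProductID>.json" key scheme is hypothetical.
function splitIntoDocuments(database) {
  return database.elements.map((element) => ({
    key: `product-${element.ProductID}.json`,
    body: JSON.stringify(element),
  }));
}

// Return the keys of documents whose serialized size exceeds the limit.
function findOversized(documents, limitBytes = CRAWL_LIMIT_BYTES) {
  return documents
    .filter((doc) => Buffer.byteLength(doc.body, "utf8") > limitBytes)
    .map((doc) => doc.key);
}

const database = {
  elements: [
    {
      ProductID: 12345,
      Questions: ["What is included in insurance Type A?"],
      Answer: "Insurance Type A can be used for the cases ...",
    },
  ],
};

const documents = splitIntoDocuments(database);
console.log(documents.map((doc) => doc.key));
console.log(findOversized(documents));
```

With one object per product, an update or extension touches only the affected object, and no single file approaches the per-document limit as the total grows. Whether your service plan caps the total number of documents or the collection size is a separate question that is not covered here.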

For the complete list, see "General source requirements" in the Discovery documentation.
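
Because a re-crawl leaves deleted documents in the collection, the application has to work out for itself which entries to remove. A minimal sketch of that bookkeeping follows. `findStaleIds` is a hypothetical helper, and it assumes you already have the document IDs held in the collection and the IDs currently in your database as plain arrays of strings. The actual deletion call against the Discovery API is deliberately left out.

```javascript
// Compute which collection document IDs no longer exist in the source database.
// Both inputs are plain arrays of string IDs; how they are obtained is up to the app.
function findStaleIds(collectionIds, databaseIds) {
  const current = new Set(databaseIds);
  return collectionIds.filter((id) => !current.has(id));
}

// Hypothetical sample data: 12346 was deleted from the database
// but is still stored in the collection.
const inCollection = ["12345", "12346", "12347"];
const inDatabase = ["12345", "12347"];

const staleIds = findStaleIds(inCollection, inDatabase);
console.log(staleIds);
```

Each ID in `staleIds` would then be deleted manually through the API, as the last bullet above describes.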

If you decide to use the API or the tooling instead, consider the following when you are ready to add documents to your collection in the IBM Watson Discovery service:

  • The maximum file size that can be uploaded to Discovery is 50 MB.
  • Only the first 50,000 characters of each JSON field selected for enrichment are enriched.
  • When creating a collection, you select the document language (English is the default). See Language support for the list of languages. Your documents are enriched in the selected language. Do not mix languages within the same collection.
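
A pre-upload check for the first two limits above might look like the sketch below. `checkBeforeUpload` and `tooLong` are hypothetical helpers. The sketch assumes top-level fields that are strings or arrays of strings, as in the question's JSON, and treats 1 MB as 1024 × 1024 bytes. It counts characters with JavaScript's string `length` (UTF-16 code units), which may not match how Discovery counts them.

```javascript
// Limits quoted above. Assumption: 1 MB is counted as 1024 * 1024 bytes.
const UPLOAD_LIMIT_BYTES = 50 * 1024 * 1024;
const ENRICH_LIMIT_CHARS = 50000;

// True if a string, or any string inside an array, exceeds the enrichment cutoff.
function tooLong(value) {
  if (typeof value === "string") return value.length > ENRICH_LIMIT_CHARS;
  if (Array.isArray(value)) return value.some(tooLong);
  return false;
}

// Report whether a document fits the upload limit and which top-level
// fields contain text longer than the enrichment cutoff.
function checkBeforeUpload(doc) {
  const sizeBytes = Buffer.byteLength(JSON.stringify(doc), "utf8");
  const longFields = Object.entries(doc)
    .filter(([, value]) => tooLong(value))
    .map(([name]) => name);
  return { fitsUpload: sizeBytes <= UPLOAD_LIMIT_BYTES, longFields };
}

const report = checkBeforeUpload({
  ProductID: 12345,
  Questions: ["What is included in insurance Type A?"],
  Answer: "x".repeat(60000), // deliberately over the 50,000-character cutoff
});
console.log(report);
```

A field listed in `longFields` can still be uploaded; per the second bullet, only its first 50,000 characters would be enriched.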

For more information, see "Adding content with the API or tooling" in the Discovery documentation.