How to find documents in Oracle Text that have not been indexed

102 Views Asked by At

Environment: Oracle 19c

We have a system that uses Oracle Text (Oracle 19c) to index PDF files that are stored as BFiles in a table. We pre-process the PDF files using Adobe Acrobat to OCR the PDF files. About a year ago, the OCRing was occasionally rotating pages either 90 or 180 degrees. This appears to be an Adobe bug as it is reproducible. We informed our customers that they would need to OCR their documents prior to uploading them as we were disabling our OCR process. We are now making changes to the process of how we OCR the documents. So we need to identify all the documents that don't have any tokens (words) in the Oracle Text index so we can re-process them. Our assumption is that those documents were probably not OCR’d by the customer and therefore are not searchable using Oracle Text.

How can I query Oracle Text to find the documents that don't have any indexed tokens?

Second tangential question: Is there is some way to test a PDF document to check if it was OCR’d?

Example Table Scripts:

CREATE TABLE MY_SCHEMA.ORADOCS
(
  DOC_ID            NUMBER                      NOT NULL GENERATED BY DEFAULT ON NULL AS IDENTITY,
  SUBJECT_OR_TITLE  VARCHAR2(255 BYTE)          NOT NULL,
  DOC_DT            DATE                        NOT NULL,
  AUTHOR            VARCHAR2(255 BYTE),
  CREATE_DT         DATE                        DEFAULT sysdate               NOT NULL,
  CREATE_USER       VARCHAR2(100 BYTE)          DEFAULT USER                  NOT NULL,
  ORIG_FILENAME     VARCHAR2(255 BYTE),
  ORADOC_FILE       BFILE
)
TABLESPACE MY_DOC

ALTER TABLE MY_SCHEMA.ORADOCS ADD (
  CONSTRAINT ORADOCS_PK1
  PRIMARY KEY
  (DOC_ID)
  USING INDEX MY_SCHEMA.ORADOCS_PK1
  ENABLE VALIDATE);

CREATE INDEX MY_SCHEMA.AUTHOR_ORADOCS ON MY_SCHEMA.ORADOCS
(AUTHOR)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('storage meta_storage  memory 4G SYNC(ON COMMIT)');

CREATE INDEX MY_SCHEMA.ORADOC_FILE ON MY_SCHEMA.ORADOCS
(ORADOC_FILE)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('storage meta_storage memory 6G SYNC(ON COMMIT)');

CREATE INDEX MY_SCHEMA.SUBJECT_ORADOCS ON MY_SCHEMA.ORADOCS
(SUBJECT_OR_TITLE)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('storage meta_storage  memory 4G SYNC(ON COMMIT)');

Thanks, Rick

1

There are 1 best solutions below

1
JJDiez On

You cal look at CTX_USER_INDEX_ERRORS view. When ctx_ddl.sync_index encounters a problem with a content, generates an entry on this "table".

You have info about the Oracle Text index affected, a timestamp, the error message, and ERR_TEXTKEY value contains the rowid of the row in the data table.

Hope this helps