Environment: Oracle 19c
We have a system that uses Oracle Text (Oracle 19c) to index PDF files that are stored as BFiles in a table. We pre-process the PDF files using Adobe Acrobat to OCR the PDF files. About a year ago, the OCRing was occasionally rotating pages either 90 or 180 degrees. This appears to be an Adobe bug as it is reproducible. We informed our customers that they would need to OCR their documents prior to uploading them as we were disabling our OCR process. We are now making changes to the process of how we OCR the documents. So we need to identify all the documents that don't have any tokens (words) in the Oracle Text index so we can re-process them. Our assumption is that those documents were probably not OCR’d by the customer and therefore are not searchable using Oracle Text.
How can I query Oracle Text to find the documents that don't have any indexed tokens?
Second tangential question: Is there is some way to test a PDF document to check if it was OCR’d?
Example Table Scripts:
CREATE TABLE MY_SCHEMA.ORADOCS
(
DOC_ID NUMBER NOT NULL GENERATED BY DEFAULT ON NULL AS IDENTITY,
SUBJECT_OR_TITLE VARCHAR2(255 BYTE) NOT NULL,
DOC_DT DATE NOT NULL,
AUTHOR VARCHAR2(255 BYTE),
CREATE_DT DATE DEFAULT sysdate NOT NULL,
CREATE_USER VARCHAR2(100 BYTE) DEFAULT USER NOT NULL,
ORIG_FILENAME VARCHAR2(255 BYTE),
ORADOC_FILE BFILE
)
TABLESPACE MY_DOC
ALTER TABLE MY_SCHEMA.ORADOCS ADD (
CONSTRAINT ORADOCS_PK1
PRIMARY KEY
(DOC_ID)
USING INDEX MY_SCHEMA.ORADOCS_PK1
ENABLE VALIDATE);
CREATE INDEX MY_SCHEMA.AUTHOR_ORADOCS ON MY_SCHEMA.ORADOCS
(AUTHOR)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('storage meta_storage memory 4G SYNC(ON COMMIT)');
CREATE INDEX MY_SCHEMA.ORADOC_FILE ON MY_SCHEMA.ORADOCS
(ORADOC_FILE)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('storage meta_storage memory 6G SYNC(ON COMMIT)');
CREATE INDEX MY_SCHEMA.SUBJECT_ORADOCS ON MY_SCHEMA.ORADOCS
(SUBJECT_OR_TITLE)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('storage meta_storage memory 4G SYNC(ON COMMIT)');
Thanks, Rick
You cal look at CTX_USER_INDEX_ERRORS view. When ctx_ddl.sync_index encounters a problem with a content, generates an entry on this "table".
You have info about the Oracle Text index affected, a timestamp, the error message, and ERR_TEXTKEY value contains the rowid of the row in the data table.
Hope this helps