How to decode text extracted from a PDF file in Android (Kotlin)?


I'm new to Kotlin. I am creating an application in which, after the user picks a PDF file, they see fragments of the extracted text. Unfortunately, every time I read the text from the PDF file it comes out unreadable, e.g.:

%PDF-1.4
%����
1 0 obj
<</Title (MojPdf)
/Producer (Skia/PDF m123 Google Docs Renderer)>>
endobj
3 0 obj
<</ca 1
/BM /Normal>>
endobj
5 0 obj
<</Filter /FlateDecode
/Length 326>> stream
x��SQN�0��)r�evllj��`l���&!����D�n*sBk�6ʋ_^�m�P��'s������~������x��arػ����5\{�H���v��Ac{+�K�c����[n*������+���J�w�d��*1e߽�??�[߽9�!BV�\�Db�䀂:���n��!�\�ϋ��R(',�)����  Z�V=P�KB.4و��Q3F�:b}9�Ιe�!wCa@�Z��4��tDV�B�??%J,�M��??P,*z.��+�����qm5e�����ej��F5��d��l9��m�@�_u�Q�v#����[�}V�(;

and so on...

I have already tried various approaches: different charsets (UTF-8 and others), a Reader, a BufferedReader... I'm using this method to get the text from the PDF:

val result = remember { mutableStateOf<Uri?>(null) }
        var stringResult = remember {
            ""
        }
        var stringDienst = remember {
            ""
        }
        val applicationContext = LocalContext.current
        val contentResolver = applicationContext.contentResolver

        @Throws(IOException::class)
        fun readTextFromUri(uri: Uri): String {
            val stringBuilder = StringBuilder()
            contentResolver.openInputStream(uri)?.use { inputStream ->
                BufferedReader(InputStreamReader(inputStream, "UTF-8")).use { reader ->
                    var line: String? = reader.readText()
                    while (line != null) {
                        stringBuilder.append(line)
                        line = reader.readLine()
                    }
                }
            }
            Log.d(TAG, "stringBuilder: $stringBuilder")

            return stringBuilder.toString()
        }
        val launcher = rememberLauncherForActivityResult(ActivityResultContracts.OpenDocument()) {
            result.value = it
            if (it != null) {
                stringResult = readTextFromUri(it)
            }
        }


        Column {
            Row {
                Button(onClick = {
                    launcher.launch(arrayOf("application/pdf"))
                }) {
                    Text(text = "Select Document")
                }
            }
            Row {
                Text(text = "stringDienst: $stringDienst")
            }
        }

After selecting the file and running the method, the text is completely unreadable. Thanks for any help.


There are 2 best solutions below

1
Simon Kocurek On

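PDF is not a plain-text format. Most of its content is stored in binary, usually compressed, streams (your dump shows /Filter /FlateDecode), so changing the charset will not make it readable.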

If you want to parse it yourself, you can find the specification here: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf

Or, preferably, you could import one of many Java/Kotlin PDF libraries and use that to read it.
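To illustrate why no charset helps, here is a minimal sketch in plain Kotlin (no Android or PDF library involved). It assumes the stream in the question uses only /FlateDecode, which is zlib/deflate compression. The function names and the sample string are made up for the demonstration. The sketch compresses a tiny piece of PDF-like content and shows that it only becomes text again after inflating:

```kotlin
import java.io.ByteArrayOutputStream
import java.util.zip.DeflaterOutputStream
import java.util.zip.InflaterInputStream

// Hypothetical helper: compresses bytes with zlib/deflate, the same kind of
// compression a PDF writer applies to a /FlateDecode stream.
fun deflate(data: ByteArray): ByteArray {
    val out = ByteArrayOutputStream()
    DeflaterOutputStream(out).use { it.write(data) }
    return out.toByteArray()
}

// Hypothetical helper: the reverse step, which a PDF parser has to perform
// before it can interpret the content of such a stream.
fun inflate(data: ByteArray): ByteArray =
    InflaterInputStream(data.inputStream()).use { it.readBytes() }

// Made-up sample of what a very simple page content stream can look like.
val sampleContent = "BT /F1 12 Tf (Hello PDF) Tj ET"

// Decoding the compressed bytes as UTF-8 gives garbage, whatever charset is used.
fun readCompressedAsText(): String =
    String(deflate(sampleContent.toByteArray(Charsets.UTF_8)), Charsets.UTF_8)

// Inflating first and decoding afterwards gives the original content back.
fun readInflatedAsText(): String =
    String(inflate(deflate(sampleContent.toByteArray(Charsets.UTF_8))), Charsets.UTF_8)
```

Even the inflated result is not yet the visible text: it consists of drawing operators such as Tj, and in real documents the strings are often encoded per font. That is the main reason a library is the more practical route.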

0
ollo On

Yes, yes, yes, it finally happened. Thank you all very much for your help. The comments about the external library from K J and Šimon Kocúrek led me to the appropriate solution. Here is the code using https://github.com/TomRoush/PdfBox-Android. Once again, many thanks to all of you.

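// Needs the PdfBox-Android dependency and, at the top of the file:
// import com.tom_roush.pdfbox.android.PDFBoxResourceLoader
// import com.tom_roush.pdfbox.pdmodel.PDDocument
// import com.tom_roush.pdfbox.text.PDFTextStripper
val result = remember { mutableStateOf<Uri?>(null) }
// Kept as state so the UI recomposes once the text has been extracted.
val stringResult = remember { mutableStateOf("") }
val applicationContext = LocalContext.current
val contentResolver = applicationContext.contentResolver
// PdfBox-Android has to be initialised before its first use.
PDFBoxResourceLoader.init(applicationContext)

@Throws(IOException::class)
fun readTextFromUri(uri: Uri): String {
    // Returns an empty string if the stream cannot be opened or the PDF is encrypted.
    return contentResolver.openInputStream(uri)?.use { inputStream ->
        PDDocument.load(inputStream).use { pdfDocument ->
            if (!pdfDocument.isEncrypted) PDFTextStripper().getText(pdfDocument) else ""
        }
    } ?: ""
}

val launcher = rememberLauncherForActivityResult(ActivityResultContracts.OpenDocument()) { uri ->
    result.value = uri
    // Extract the text as soon as a document has been picked.
    if (uri != null) {
        stringResult.value = readTextFromUri(uri)
    }
}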