I'm using Apple's VisionKit framework to OCR scanned receipt images. My goal is to extract each item with its respective price from the image. VisionKit's extraction works great -- all of the text is pulled off the image -- however, the structure of the text always seems incorrect. For example, using this receipt:

I get the following text output from VNRecognizeTextRequest:
"44\nBierhaus\nBierhaus NYC\n712 Third Avenue\nNew York, NY 10017\nServer: Tiffany R\nCheck #44\nGuest Count: 3\nOrdered:\n1 Lt Delicator\n1L Hofbräu Dunkel\n.5L Hofbräu Original\nBierhaus Burger\nWell Done\nBratwurst\nSpicy Mustard\nChicken Bratwurst\nSpicy Mustard\nSubtotal\nTax\nTotal\nTable 304\n12/31/23 7:27 PM\n$25.00\n$19.00\n$10.00\n$19.00\n$17.00\n$17.00\n$107.00\n$9.48\n$116.48\nSuggested Tip:\n22%: (Tip $23.54\nTotal $140.02)\n20%: (Tip $21.40 Total $137.88)\n18%: (Tip $19.26 Total $135.74)\nTip percentages are based on the check\nprice before taxes.\nOktoberfest All Year Round!\nPowered by Hofbräu Bier"
Are there config settings I can change on VNRecognizeTextRequest so that it recognizes the items inline with their respective prices? The resulting data seems very unstructured. How would you recommend I parse this data to achieve my goal (digitizing a list of the items with their prices)? I'm looking for output in the format (Total, 116.48).
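For context, a naive pass over the flat string (a hypothetical helper, not what I want to ship) can pick out the price-looking lines, but it has no way to tell which item each price belongs to, since the names and the prices arrive as two separate runs of lines:

```swift
import Foundation

// Hypothetical helper: collect price-looking lines ("$25.00", "$116.48", ...)
// from the flat OCR string, in the order Vision emitted them.
func extractPrices(from ocrText: String) -> [Double] {
    ocrText
        .components(separatedBy: "\n")
        .compactMap { line in
            guard line.hasPrefix("$") else { return nil }
            return Double(line.dropFirst())
        }
}
```

Running this over the output above yields the prices in order, but with no link back to "Subtotal", "Tax", or "Total".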
My code:
import Foundation
import UIKit
import Vision
import VisionKit

final class TextRecognizer {
    let cameraScan: VNDocumentCameraScan?
    let uploadedImage: UIImage?

    private let queue = DispatchQueue(label: "scan-codes", qos: .default, attributes: [], autoreleaseFrequency: .workItem)

    init(cameraScan: VNDocumentCameraScan?, image: UIImage?) {
        self.cameraScan = cameraScan
        self.uploadedImage = image
    }

    func recognizeText(withCompletionHandler completionHandler: @escaping ([String]) -> Void) {
        queue.async {
            // Note: minimumTextHeight is relative to the image height.
            let minimumTextHeight: Float = 1
            if let cameraScan = self.cameraScan {
                let images = (0..<cameraScan.pageCount).compactMap {
                    cameraScan.imageOfPage(at: $0).cgImage
                }
                let imagesAndRequests = images.map { (image: $0, request: VNRecognizeTextRequest()) }
                let textPerPage = imagesAndRequests.map { image, request -> String in
                    // Configure the request for receipt recognition
                    request.recognitionLevel = .accurate
                    request.usesLanguageCorrection = true
                    request.minimumTextHeight = minimumTextHeight
                    let handler = VNImageRequestHandler(cgImage: image, options: [:])
                    do {
                        try handler.perform([request])
                        guard let observations = request.results else { return "" }
                        return observations.compactMap { $0.topCandidates(1).first?.string }.joined(separator: "\n")
                    } catch {
                        print(error)
                        return ""
                    }
                }
                DispatchQueue.main.async {
                    completionHandler(textPerPage)
                }
            } else if let image = self.uploadedImage {
                let images = [image.cgImage].compactMap { $0 }
                let imagesAndRequests = images.map { (image: $0, request: VNRecognizeTextRequest()) }
                let textPerPage = imagesAndRequests.map { image, request -> String in
                    // Configure the request for receipt recognition
                    request.recognitionLevel = .accurate
                    request.usesLanguageCorrection = false
                    request.minimumTextHeight = minimumTextHeight
                    let handler = VNImageRequestHandler(cgImage: image, options: [:])
                    do {
                        try handler.perform([request])
                        guard let observations = request.results else { return "" }
                        return observations.compactMap { $0.topCandidates(1).first?.string }.joined(separator: "\n")
                    } catch {
                        print(error)
                        return ""
                    }
                }
                DispatchQueue.main.async {
                    completionHandler(textPerPage)
                }
            }
        }
    }
}
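One direction I'm considering (untested sketch, my own naming): instead of joining every observation into one string, keep each observation's normalized boundingBox and pair each price with the text fragment whose vertical midpoint lines up with it. Reduced here to plain (text, CGRect) pairs so it's easy to show:

```swift
import Foundation

// Untested sketch: pair right-column prices with the left-column name whose
// vertical midpoint lines up, using Vision-style normalized bounding boxes
// (origin at the bottom-left of the image).
struct ReceiptLine {
    let name: String
    let price: String
}

func pairItemsWithPrices(_ fragments: [(text: String, box: CGRect)]) -> [ReceiptLine] {
    let prices = fragments.filter { $0.text.hasPrefix("$") }
    let names = fragments.filter { !$0.text.hasPrefix("$") }
    return prices.compactMap { price in
        // Match the name whose vertical center falls within half a line height.
        names.first { abs($0.box.midY - price.box.midY) < price.box.height / 2 }
            .map { ReceiptLine(name: $0.text, price: price.text) }
    }
}
```

Would matching on each observation's boundingBox like this be the recommended route, or is there a request configuration that does this row grouping for me?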
