Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English
Charangan Vasantharajan, Laksika Tharmalingam, Uthayasanker Thayasivam
Dept. of Computer Sci. and ...
Filetype PDF | Posted on 07 Feb 2023
Words contained in this file:
...Adapting the Tesseract open source OCR engine for Tamil and Sinhala legacy fonts creating a parallel corpus English Charangan Vasantharajan Laksika Tharmalingam Uthayasanker Thayasivam Dept of Computer Sci Engineering University Moratuwa Colombo Sri Lanka cse mrt ac lk rtuthaya abstract most low resource languages do not have neces so far recent study revealed that first half century sary resources to create even substantial monolingual research in computational linguistics from circa up these may often be found government proceedings present has touched on less than world s but mainly Portable Document Format PDF contains only further corpora extracting text documents consist two or more would aid is challenging due font usage printer friendly encoding which are optimized development machine translation language extraction therefore we propose simple automatic interoperability novel idea can scale though LRL gained much traction many along with since building need technologies process...