A novel approach for content equivalence analysis in compressed document images a systematic study | [Journal of Theoretical and Applied Information Technology • 2022]

Author(s):

1. KAVITA V. HORADI: Department of ISE, Global College of Engineering and Technology, Bangalore, India

2. JAGADEESH PUJARI: Department of ISE, SDM College of Engineering and Technology, Dharwad, India

3. NARASIMHA PRASAD BHAT: Infosys, Brookfield WI, USA

Abstract:

The rapid growth of digital data with complex content has created various processing challenges. The exponential increase in the size of 'Big Data' due to video, audio, images and textual content has created several problems that the research community needs to address. A huge amount of digital data is currently generated by various sources. High-quality data require more storage space and consume excessive bandwidth during transmission. To overcome these issues, digital data are stored in compressed form using the compression algorithms described in the literature. To analyze such data, traditional schemes first decompress it, which is time-consuming and increases computational overhead. Compressed-domain image processing techniques have therefore been adopted, in which complete decompression may not be required. In this work, we apply compressed-domain processing to document images that contain printed text. Our main aim is to identify the similarity and find the equivalence between two or more compressed document images. To achieve this, we first apply JPEG encoding, which generates the encoded data. This data is then processed by the proposed line, word and character segmentation scheme. Next, we apply SIFT (Scale-Invariant Feature Transform) to extract features from the segmented compressed-domain data. Finally, a feature matching scheme is applied that uses a brute-force feature matcher and k-nearest neighbor. We have tested this approach on the publicly available PubLayNet, IIIT-AR-13K, and Tobacco-3482 datasets, which contain large-scale document images. The experimental analysis shows the robustness of the proposed approach in identifying the similarity between compressed document images.
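
The final matching stage named in the abstract (brute-force matching with k-nearest neighbor) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `knn_match`, `ratio_filter` and `similarity` are hypothetical, the ratio test with a 0.75 threshold is an assumed filtering step that the abstract does not mention, and short toy vectors stand in for real 128-dimensional SIFT descriptors.

```python
import math


def knn_match(query, train, k=2):
    """Brute-force k-NN: compare every query descriptor with every train
    descriptor and keep the k closest as (train_index, distance) pairs,
    nearest first."""
    matches = []
    for q in query:
        dists = sorted((math.dist(q, t), j) for j, t in enumerate(train))
        matches.append([(j, d) for d, j in dists[:k]])
    return matches


def ratio_filter(matches, ratio=0.75):
    """Keep a match only if its best distance is clearly smaller than the
    second best. ASSUMPTION: this ratio test and its threshold are not
    taken from the abstract."""
    good = []
    for i, candidates in enumerate(matches):
        if len(candidates) >= 2 and candidates[0][1] < ratio * candidates[1][1]:
            good.append((i, candidates[0][0]))
    return good


def similarity(query, train, ratio=0.75):
    """Fraction of query descriptors with an unambiguous match in train
    (a hypothetical score, used here only for illustration)."""
    if not query or len(train) < 2:
        return 0.0
    good = ratio_filter(knn_match(query, train, k=2), ratio)
    return len(good) / len(query)


# Toy descriptors: doc_b is a slightly perturbed copy of doc_a,
# doc_c is unrelated content with two near-identical descriptors.
doc_a = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
doc_b = [(0.1, 0.0), (10.0, 10.1), (20.0, 0.2)]
doc_c = [(5.0, 5.0), (5.0, 5.1)]

score_equivalent = similarity(doc_a, doc_b)
score_different = similarity(doc_a, doc_c)
```

In this toy setting the perturbed copy scores higher than the unrelated document. In a real pipeline the descriptors would come from a SIFT extractor run on the segmented data, and the decision threshold on the score would have to be tuned per dataset.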

Page(s): 5401-5417

DOI: not available

Published: Journal: Journal of Theoretical and Applied Information Technology, Volume: 100, Issue: 17, Year: 2022

Keywords:

JPEG, Compressed Domain, KNN, SIFT, Content Equivalence, Brute force, Document processing, Compressed Document Images (CDI)
