CL!TR - Cross-Language !ndian Text Reuse

Introduction

With the advent of the World Wide Web, information in many different formats is easily accessible. Texts, images, videos, and audio are all available for consultation, download, and modification. Under these circumstances, text re-use has increased in recent years. In particular, plagiarism has been defined by IEEE as the reuse of someone else's prior ideas, processes, results, or words without explicitly acknowledging the original author and source. The problem has attracted attention from many research areas, even generating new terms such as the so-called copy&paste syndrome, or a new kind of text re-use: cyberplagiarism.

While people have enough expertise to detect text re-use when reading a document, the scale of potential source documents (that of the Web) makes manual analysis infeasible. As a countermeasure, different systems that assist in the detection of text re-use have been developed. The main idea is to automatically detect those text fragments in a document that are suspected of being re-used and, if available, to provide their presumable source. In that way, on the basis of the given linguistic evidence, a human can make the final decision.

Recent efforts have been devoted to developing better models for the detection of text re-use. Probably one of the most interesting cases is PAN, the International Competition on Plagiarism Detection, held in conjunction with CLEF.

A special kind of phenomenon is cross-language text re-use, where the re-used text fragment and its source are written in different languages, making its automatic detection even harder than in the monolingual case. Cross-language text re-use detection has barely been addressed in recent years, and better models are necessary.

With the current initiative we aim to further promote the development of better models for text re-use detection and, in particular, cross-language text re-use detection. Our interest in the second kind of text re-use is motivated by the following facts:

Speakers of less-resourced languages (also known as under-resourced languages) are forced to consult documentation in a foreign language; and
People immersed in a foreign country can still consult material written in their native language.

Such environments make cross-language text re-use more likely and turn it into an interesting problem nowadays.

Task Description

The focus of the CL!TR evaluation task is on cross-language text re-use detection. To start with, this year's task targets one language pair: English and Hindi. The source text is in English and the suspicious text is in Hindi.

You are provided with a set of suspicious documents in Hindi and a set of potential source documents in English. The task is to identify the documents in the suspicious set (Hindi) that are created by re-using fragments from the source set (English).

You are expected to identify the suspicious documents that have actually been generated by re-use, together with their corresponding sources. Note that this is a document-level task. No specific fragments inside the documents are expected to be identified; only pairs of documents. Determining whether a text has been re-used from its corresponding source is enough. Specifying the kind of re-use (Exact, Heavy, or Light) is not necessary.

CL!TR is divided into two phases: training and test. For the training phase we provide an annotated corpus including different levels of re-use. It includes information about whether a text fragment has been re-used and, if so, what its source is. In the test phase no annotation or hints about the cases are provided.

Result Submission

The results of your re-use detection software are required to be formatted in XML:

<document>
<reuse_case
  reused_reference="..."    
  source_reference="..."    
/>
<reuse_case
  reused_reference="..."    
  source_reference="..."   
/>
.........................    
</document>

For each detected pair of suspicious and source documents, there must be one <reuse_case .../> entry in the XML file.
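As an illustration of the format above, the following is a minimal Python sketch that serializes a list of detected (suspicious, source) pairs into the required XML. The function name build_detection_xml and the file names in the example are hypothetical; only the element and attribute names (document, reuse_case, reused_reference, source_reference) come from the template above.

```python
import xml.etree.ElementTree as ET


def build_detection_xml(pairs):
    """Serialize (suspicious, source) pairs into the submission format.

    `pairs` is an iterable of 2-tuples: the first item is the reference of
    the suspicious (Hindi) document, the second the reference of its
    source (English) document.
    """
    root = ET.Element("document")
    for reused, source in pairs:
        # One <reuse_case .../> entry per detected pair.
        ET.SubElement(
            root,
            "reuse_case",
            reused_reference=reused,
            source_reference=source,
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical file names, used only for illustration.
example_pairs = [
    ("suspicious-0001.txt", "source-0042.txt"),
    ("suspicious-0007.txt", "source-1310.txt"),
]
print(build_detection_xml(example_pairs))
```

Building the file through an XML library rather than string concatenation means that special characters in document references are escaped automatically.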

Evaluation Corpus

Training Collection

The training corpus is available for download here:

CLITR_training_data.tar.bz2
md5sum 53381673b76196110adf29428b552bb0 , 14.8 MB
(note that the potential source documents include Wiki-markup)

Test Collection

The test corpus is available for download here:

CLITR_test_data.tar.bz2
md5sum dc2af9095c01270264e25604d9d9f2a4 , 14.8 MB
(note that the potential source documents include Wiki-markup)
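The MD5 checksums listed above can be used to confirm that the archives were downloaded intact. The following is a minimal Python sketch, assuming the archives sit in the current working directory under the file names given above; the helper names md5_of_file and verify_archive are hypothetical.

```python
import hashlib
import os

# Checksums as published in the Training Collection and Test Collection
# entries above.
EXPECTED_MD5 = {
    "CLITR_training_data.tar.bz2": "53381673b76196110adf29428b552bb0",
    "CLITR_test_data.tar.bz2": "dc2af9095c01270264e25604d9d9f2a4",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5):
    """Return True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if not os.path.exists(name):
            print(f"{name}: not found in the current directory")
        elif verify_archive(name, expected):
            print(f"{name}: checksum OK")
        else:
            print(f"{name}: checksum MISMATCH")
```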

Evaluation Task

Let S be a set of suspicious documents. Let D be a set of potential source documents. The task is to find those documents s in S that have actually been re-used, together with their source documents d in D.

Evaluation Corpus

The corpus contains a set of potential source documents D, written in English, and a set of suspicious documents S, written in Hindi. In the corpus you will find plain text files encoded in UTF-8. The source documents are taken from English Wikipedia and include Wiki-markup.

Training Collection

To help you prepare and develop your detection software, we provide a training collection. This collection includes annotations for every case of re-use.

Training Corpus Statistics

5032 Source files in English
198 suspicious files in Hindi

Test Collection

The test collection is composed in the same way as the training collection: a set of suspicious documents together with a set of potential source documents.

Test Corpus Statistics

5032 Source files in English
190 suspicious files in Hindi

Both corpora can be downloaded from the Corpus section of this website.

Submission of Detection Results

Participants are allowed to submit up to three runs in order to experiment with different settings.

The results of your detection are required to be formatted in XML. The result document must be valid with respect to the following XML schema:

Performance Measures

The success of a text re-use detection approach will be measured in terms of its Precision (P), Recall (R), and F-measure (F) in detecting the re-used documents together with their sources in the test corpus.

A detection is considered correct if the re-used document is identified together with its corresponding source document. We consider:

total detected to be the set of suspicious-source pairs detected by the system.
correctly detected to be the subset of pairs detected by the system which actually compose cases of re-use.
total re-used to be the gold standard, which includes all those pairs which compose actual re-used cases.

P, R and F are defined as follows:

$$P = \frac{|\text{correctly detected}|}{|\text{total detected}|}$$

$$R = \frac{|\text{correctly detected}|}{|\text{total re-used}|}$$

$$F\text{-measure} = \frac{2 \cdot R \cdot P}{R + P}$$
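The three measures can be computed directly from the two sets of pairs defined above. The following is a minimal Python sketch, not the Perl reference implementation; it assumes that both the gold standard and the detection file follow the <reuse_case .../> format of the result submission, and it returns 0 when a denominator is empty, which is a convention chosen here. The function names and document references are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_pairs(xml_text):
    """Return the set of (suspicious, source) pairs found in an XML string."""
    root = ET.fromstring(xml_text)
    return {
        (case.get("reused_reference"), case.get("source_reference"))
        for case in root.iter("reuse_case")
    }


def precision_recall_f(detected, gold):
    """Compute P, R and F-measure over sets of (suspicious, source) pairs."""
    correct = detected & gold  # the "correctly detected" pairs
    p = len(correct) / len(detected) if detected else 0.0
    r = len(correct) / len(gold) if gold else 0.0
    f = 2 * r * p / (r + p) if (r + p) > 0 else 0.0
    return p, r, f


# Hypothetical document references, used only for illustration.
gold_xml = """<document>
<reuse_case reused_reference="s1" source_reference="d1"/>
<reuse_case reused_reference="s2" source_reference="d2"/>
<reuse_case reused_reference="s3" source_reference="d3"/>
<reuse_case reused_reference="s4" source_reference="d4"/>
</document>"""

detection_xml = """<document>
<reuse_case reused_reference="s1" source_reference="d1"/>
<reuse_case reused_reference="s2" source_reference="d2"/>
<reuse_case reused_reference="s3" source_reference="d9"/>
<reuse_case reused_reference="s5" source_reference="d5"/>
</document>"""

p, r, f = precision_recall_f(read_pairs(detection_xml), read_pairs(gold_xml))
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # prints: P=0.500 R=0.500 F=0.500
```

In the example, the pair (s3, d9) names a document that was indeed re-used but pairs it with the wrong source, so it does not count as correctly detected.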

A reference implementation of the measures, coded in Perl, is no longer available.

It was run as follows:

perl getmeasures.pl <gold_standard.xml> <detection.xml>

(For an example, run it with ref_small.xml as the gold standard and multiple_detection.xml as the detection file.)

Evaluation Results

Participants

Participant	Institution	Country
Aniruddha Ghosh	Jadavpur University	India
Karteek Addanki et al.	Hong Kong University of Science and Technology	Hong Kong (China)
Nitish Aggarwal et al.	DERI Galway and UPM Madrid	Ireland / Spain
Parth Gupta et al.	UPV & DA-IICT	Spain / India
Rambhoopal K.	IIIT Hyderabad	India
Yurii Palkovskii	Zhytomyr State University / SkyLine Inc.	Ukraine

Ranking

Rank	F-measure	Recall	Precision	Run	Leader
1	0.649	0.750	0.571	3	Rambhoopal K.
2	0.609	0.821	0.484	1	Nitish Aggarwal
3	0.608	0.643	0.576	2	Rambhoopal K.
4	0.603	0.589	0.617	1	Yurii Palkovskii
5	0.596	0.804	0.474	2	Parth Gupta
6	0.589	0.795	0.468	2	Nitish Aggarwal
7	0.576	0.589	0.564	1	Rambhoopal K.
8	0.541	0.473	0.631	2	Yurii Palkovskii
9	0.523	0.500	0.549	3	Yurii Palkovskii
10	0.509	0.607	0.439	3	Parth Gupta
11	0.430	0.580	0.342	1	Parth Gupta
12	0.220	0.214	0.226	2	Aniruddha Ghosh
13	0.220	0.214	0.226	3	Aniruddha Ghosh
14	0.085	0.107	0.070	1	Aniruddha Ghosh
15	0.000	0.000	0.000	1	Karteek Addanki

Organizing Committee

Alberto Barrón-Cedeño, Paolo Rosso
NLE Lab @ Universidad Politécnica de Valencia, Spain
Sobha Lalitha Devi
CLR Group @ AU-KBC Research Centre, Chennai, India
Paul Clough, Mark Stevenson
IR & NLP Groups @ University of Sheffield, UK

Program Committee

Tim Baldwin	Melbourne University
Rafael E. Banchs	Institute for Infocomm Research Singapore
Carole Chaski	Institute for Linguistic Evidence
Malcolm Coulthard	Centre for Forensic Linguistics, University of Aston
Marcelo Errecalde	Universidad Nacional de San Luis
Michael Granitzer	Know-Center Graz
Roman Kern	Graz University of Technology
Adam Kilgarriff	Lexicography MasterClass Ltd
Elisabeth Lex	Know-Center Graz
Qin Lu	The Hong Kong Polytechnic University
Manuel Montes y Gomez	INAOE-Puebla
Ted Pedersen	University of Minnesota in Duluth
Anselmo Peñas	UNED
Martin Potthast	Bauhaus-Universität Weimar
Ganesh Ramakrishnan	IIT Bombay
Grigori Sidorov	Instituto Politécnico Nacional
Thamar Solorio	University of Alabama at Birmingham
Efstathios Stamatatos	University of the Aegean
Benno Stein	Bauhaus-Universität Weimar
Dan Tufis	Romanian Academy
María Teresa Turell Juliá	ForensicLab, Universitat Pompeu Fabra
Vasudeva Varma	IIIT Hyderabad
Juan Velásquez	Universidad de Chile
Luis Villaseñor	INAOE-Puebla
Piek Vossen	Vrije Universiteit (VU) Amsterdam
Dekai Wu	Hong Kong University of Science and Technology

CL!TR - Cross-Language !ndian Text Reuse

held in conjunction with the FIRE 2011 Forum for Information Retrieval Evaluation

2 - 4 December 2011, IIT Bombay
