• Adapting Web Archive Catalogues for Dynamic Change

      Wu, Paul H-J; Ichsan, Tamsir P.; Nguyen, Ngoc Giang; Julien, Masanes; Andreas, Rauber (2007)
      Web archives are an important source of information. However, before a Web archive can be properly utilized, it needs to be catalogued, so that the materials accessed yield the historical understanding the researcher intends. At the same time, the dynamic nature of the Web easily renders these catalogues outdated, so there is a constant need to monitor when catalogue records become irrelevant as Web content changes. Maintaining catalogue records for Web archives therefore requires substantial human effort, adding to the burden on any institution that maintains them. In this paper, we propose an automatic mechanism for monitoring changes in Web content, so that the human workload can be reduced. The system combines two component technologies: (1) a contextualized annotation module and (2) an evidence change detection module. Contextualized annotation enables the cataloguing process to link content on the Web page (the evidence) to the value assigned to an element of a metadata schema. Thus, the metadata is "supported" by certain Web content that functions as evidence for a cataloguing decision. Regardless of changes to the Web page outside the evidence, the metadata remains valid as long as all the evidence remains the same. To achieve evidence-specific change detection, we extend the traditional Longest Common Subsequence (LCS) based Diff engine with a Page Coordinate translation algorithm, which, we argue on the basis of a survey, is the first of its kind among Web content monitoring approaches.
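
The idea of evidence-specific change detection can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes pages are plain strings and evidence is a character span, and it uses Python's `difflib.SequenceMatcher` as a stand-in for an LCS-based Diff engine. The function names `translate_span` and `metadata_still_valid` are hypothetical.

```python
from difflib import SequenceMatcher


def translate_span(old, new, start, end):
    """Map the evidence old[start:end] into coordinates of `new`.

    Returns (new_start, new_end) if the evidence lies entirely inside one
    matching block of the diff, otherwise None (evidence treated as changed).
    """
    # autojunk=False disables difflib's popularity heuristic on long inputs.
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for a, b, size in matcher.get_matching_blocks():
        if a <= start and end <= a + size:
            offset = b - a  # shift between old and new page coordinates
            return (start + offset, end + offset)
    return None


def metadata_still_valid(old, new, evidence_spans):
    """Metadata stays valid only if every evidence span is unchanged."""
    return all(
        translate_span(old, new, s, e) is not None for s, e in evidence_spans
    )


if __name__ == "__main__":
    old_page = "Home | Author: Jane Doe | Visitors: 10"
    evidence = "Author: Jane Doe"
    start = old_page.index(evidence)
    span = (start, start + len(evidence))

    # Content outside the evidence changed: the record should stay valid.
    new_page = "Welcome! Home | Author: Jane Doe | Visitors: 9999"
    print(metadata_still_valid(old_page, new_page, [span]))

    # The evidence itself changed: the record should be flagged for review.
    edited_page = "Home | Author: John Doe | Visitors: 10"
    print(metadata_still_valid(old_page, edited_page, [span]))
```

The check is deliberately conservative: evidence that is moved elsewhere on the page, or that the diff happens to align differently, is reported as changed and left for a cataloguer to review.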