Web Pages, Text Types, and Linguistic Features: Some Issues

This is an open access article distributed under the terms of Creative Commons Attribution 4.0 International License (CC-BY 4.0).
From a textual point of view, the web is a huge reservoir of documents. On the web, virtually everything can be seen as a ‘document’ or, better, a ‘web page’. The sheer amount of text available is overwhelming. Furthermore, the web is largely wild and uncontrolled. This becomes clear if we compare a ‘tamed’ resource of the paper world, like the British National Library, with the ‘untamed’ English Web. In this empirical study, I investigated text typologies in a random sample of raw web pages rather than in a corpus of pre-selected and pre-processed documents. I realized that the textuality of web pages might differ from the textuality of linear documents (whether paper or electronic). This new textuality makes automatic feature extraction and the application of NLP tools more troublesome. I also realized that the text typologies already available in the literature might not cover all web page types. The issues pointed out in this study do not have an easy solution. For the time being, my suggestion is to keep them in mind when assessing results from any automatic approach to web pages.
