WW2 People's War

15 October 2014
Technical overview of the archive build

The original People's War site was developed using the BBC's in-house article/commentary system called DNA. Originally developed for the h2g2 site, this is a database-driven system that allows multiple sites to be stored in a single Microsoft SQL Server database. A registered user's profile is associated with a particular DNA site, but they can use the same credentials to post articles and/or comments to any site.

The combination of a monolithic structure encompassing multiple sites and a proprietary database back end meant that a considerable amount of work had to be done to convert the data into something that could be used to build a permanent archive. The following procedure was eventually adopted.

First, the DNA database administrators copied all the relevant data into a 'snapshot' SQL Server database. This had the same structure and table names as the full DNA database, but contained only those records pertinent to the People's War site and not any other site held in DNA.

A custom Perl program was run to extract the relevant articles from the snapshot database and store them as XML files, one file per article. Not all records from the article table were extracted - for example some of those discarded were Research Desks and explanatory articles written by BBC editorial staff. At the same time any relevant forum entries relating to each story were also extracted (from separate tables within the database) and combined into the XML. Note that this 'interim XML' format still closely mirrored the format of the data within the relational database, rather than aiming for a specific final destination format.
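The extraction programs themselves are not reproduced on this page, but the following is a minimal sketch of the kind of program described above. The table, column and element names (articles, forum_posts and so on), and the DBD::ODBC connection, are purely illustrative rather than the real DNA schema; DBI and XML::LibXML are used here to write one interim XML file per article.

    #!/usr/bin/perl
    # Sketch only: extract articles plus their forum entries from the
    # snapshot database and write one interim-XML file per article.
    # Table and column names are illustrative, not the real DNA schema.
    use strict;
    use warnings;
    use DBI;
    use XML::LibXML;

    my $dbh = DBI->connect('dbi:ODBC:dna_snapshot', 'user', 'pass',
                           { RaiseError => 1 });

    my $articles = $dbh->prepare(
        'SELECT article_id, title, body, contributor_id FROM articles');
    my $forum = $dbh->prepare(
        'SELECT author_id, body FROM forum_posts WHERE article_id = ?');

    $articles->execute;
    while (my $row = $articles->fetchrow_hashref) {
        my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
        my $root = $doc->createElement('article');
        $doc->setDocumentElement($root);

        $root->setAttribute('id', $row->{article_id});
        $root->appendTextChild('title',       $row->{title});
        $root->appendTextChild('contributor', $row->{contributor_id});
        $root->appendTextChild('body',        $row->{body});

        # Pull in the forum thread attached to this story.
        my $forum_el = $doc->createElement('forum');
        $root->appendChild($forum_el);
        $forum->execute($row->{article_id});
        while (my $post = $forum->fetchrow_hashref) {
            my $post_el = $doc->createElement('post');
            $post_el->setAttribute('author', $post->{author_id});
            $post_el->appendText($post->{body});
            $forum_el->appendChild($post_el);
        }

        # Assumes an 'articles' output directory already exists.
        $doc->toFile("articles/A$row->{article_id}.xml", 1);
    }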

Another custom Perl program extracted information about contributors from the database. As well as capturing any information that the users may have written on their personal pages, this program also scanned the database for any articles or forum contributions they had made. Again, the data was ultimately written out as XML, with one XML file for each contributor.

A further process was required to ensure that the images uploaded to the site were included in the archive and viewable in category galleries. The content of each extracted article was examined for image tags; where these were found, the image URL was determined, and the image was downloaded from the live DNA site and stored locally. Information about each downloaded image, including its caption, size, and associated article, was also captured and stored within a single XML file.
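Again as an illustrative sketch only (the real markup was GuideML and the actual storage layout is not documented here), scanning the extracted articles for image references and recording the downloads might look roughly like this, using a simple pattern match and LWP::Simple:

    #!/usr/bin/perl
    # Sketch only: find image references in each extracted article, download
    # the images and record their details in a single images.xml file.
    use strict;
    use warnings;
    use LWP::Simple qw(getstore);
    use XML::LibXML;
    use File::Basename qw(basename);

    my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
    my $root = $doc->createElement('images');
    $doc->setDocumentElement($root);

    for my $file (glob 'articles/*.xml') {
        my $article = XML::LibXML->load_xml(location => $file);
        my $text    = $article->documentElement->textContent;

        # Assume simple <img src="..."> references for illustration;
        # the real contributed markup was GuideML.
        while ($text =~ /<img[^>]+src="([^"]+)"/gi) {
            my $url   = $1;
            my $local = 'images/' . basename($url);
            getstore($url, $local);

            my $el = $doc->createElement('image');
            $el->setAttribute('article', basename($file));
            $el->setAttribute('src',     $local);
            $el->setAttribute('size',    -s $local);
            $root->appendChild($el);
        }
    }

    $doc->toFile('images.xml', 1);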

A categories listing XML file was created to list the number of articles (pre-categorised by editorial staff) within each category, and for each article its title, the name of the contributor, word-count, and whether or not it had any images or was an 'editorial pick' (known as Recommended Reads in the archive site). This file would later be combined with the results of the Bayesian categorisation process to create the category listings on the archive site.

Finally, all of the above XML files were cross-referenced against each other to try to obtain a degree of internal consistency.

The above set of programs took around 48 hours to run to completion on a Sun UltraSparc 250 server, and the XML written out took up approximately 450MB (with another 280MB of image files).

As there was some inconsistency in the interim XML files extracted from the database, they were transformed into a set of target XML files using a combination of Perl and XSLT. The structure of the target XML data was defined using XML Schema to create a well-defined output format. Because the content of the XML files had been created by thousands of individuals, there was a lot of variation in how it was structured and coded; a large part of the XSLT process was therefore to transform all contributed text and GuideML into standards-compliant HTML while preserving the original layout as displayed by the browser. This process was used for both the article and contributor XML.
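Driving that kind of transformation from Perl can be sketched as follows. The stylesheet and schema filenames (interim-to-target.xsl, target.xsd) are hypothetical, and XML::LibXSLT and XML::LibXML::Schema stand in for whatever tooling was actually used:

    #!/usr/bin/perl
    # Sketch only: apply an XSLT stylesheet to each interim XML file and
    # validate the result against the target XML Schema.
    use strict;
    use warnings;
    use File::Basename qw(basename);
    use XML::LibXML;
    use XML::LibXSLT;

    my $stylesheet = XML::LibXSLT->new->parse_stylesheet(
        XML::LibXML->load_xml(location => 'interim-to-target.xsl'));
    my $schema = XML::LibXML::Schema->new(location => 'target.xsd');

    for my $file (glob 'interim/*.xml') {
        my $source = XML::LibXML->load_xml(location => $file);
        my $result = $stylesheet->transform($source);

        # validate() dies on failure, so trap and report rather than stopping.
        eval { $schema->validate($result) };
        warn "$file failed schema validation: $@" if $@;

        $stylesheet->output_file($result, 'target/' . basename($file));
    }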

An XML topic map was built to describe the structure of the archive including detailed information about all articles. This was done using XSLT to combine the categories listing XML, Bayesian categories data and a further XML file describing the basic categorisation for the archive.

A further XSLT transformation then created the actual HTML files for this website. The XML topic map was used to build the crumb trail and categories information for article pages and to determine whether an article listed on a contributor's page was a 'recommended story' and/or contained an image.

Another Perl program created the archive listing, categories and gallery pages. Again this was done with XSLT, using the XML topic map as source data.

Bayesian analysis

Of the approximately 47,000 articles extracted, around 17,500 had been placed into categories by BBC editorial staff. The remainder needed to be placed into categories using some kind of automated process - a notoriously difficult problem to solve using computers. Due to the relatively large sample of good-quality pre-categorised information, it was decided that Bayesian analysis would provide a good trade-off between ease-of-use, speed, and accuracy.

Bayesian analysis is a statistical technique based on equations derived by an 18th-century British amateur mathematician, the Reverend Thomas Bayes (1702-1761). These equations provide a way to use mathematical probability theory to determine the chances of an event occurring, based on the number of times it has (or has not) occurred in the past.

Bayes' equations (or, more strictly, derivations of them) have been used extensively in many fields of modern life, and have found particular application within the computing fields of artificial intelligence and natural language processing. In particular, they can be used to build up a set of probabilities that a particular document matches one or more already-known categories; when used in this way, the classification system is known as 'naive Bayes classification'.
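In equation form (standard naive Bayes notation, rather than anything quoted from the project itself), the classifier scores a category C for a document containing the words w_1, ..., w_n as:

    P(C \mid w_1, \dots, w_n) \propto P(C) \prod_{i=1}^{n} P(w_i \mid C)

Each P(w_i | C) is estimated from how often the word w_i appears in the manually categorised training articles for category C, and the document is then assigned to the highest-scoring category. The product form is exactly the word-independence assumption discussed below.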

There are a number of pre-existing implementations of the naive Bayes algorithm available. The one we chose was the Perl module Algorithm::NaiveBayes. Perl is the programming language of choice used by the BBC iF&L software engineering team, and the above module is freely available, open source and well-proven, making it an ideal basis for the categorisation solution. A Perl program was written, based around Algorithm::NaiveBayes, to read in the pre-categorised articles as a training set, and use the Bayesian knowledge gleaned from them to categorise the remaining 30,000-odd articles.
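The categorisation program itself is not published here, but its core can be sketched with Algorithm::NaiveBayes roughly as follows. The sample stories, category names, article id and the crude word-count tokeniser are all made up for illustration; in reality the training set was loaded from the extracted XML files.

    #!/usr/bin/perl
    # Sketch only: train on editorially categorised articles, then keep the
    # single top-scoring category for each uncategorised article.
    use strict;
    use warnings;
    use Algorithm::NaiveBayes;

    # Crude bag-of-words tokeniser: lower-cased word => count.
    sub features {
        my ($text) = @_;
        my %count;
        $count{lc $_}++ for $text =~ /\w+/g;
        return \%count;
    }

    # Illustrative stand-ins for the ~17,500 pre-categorised articles.
    my @training = (
        { text => 'We were evacuated from London to the countryside',
          category => 'Childhood and Evacuation' },
        { text => 'My ship escorted the Atlantic convoys in 1941',
          category => 'Royal Navy' },
    );
    my @uncategorised = (
        { id   => 'A1234567',
          text => 'As children we were sent away from the city when the bombing began' },
    );

    my $nb = Algorithm::NaiveBayes->new;
    for my $article (@training) {
        $nb->add_instance(
            attributes => features($article->{text}),
            label      => $article->{category},
        );
    }
    $nb->train;

    # predict() returns a hash of category => score; keep only the top one,
    # as described above.
    for my $article (@uncategorised) {
        my $scores = $nb->predict(attributes => features($article->{text}));
        my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
        print "$article->{id}\t$best\n";
    }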

The naive Bayes classifier analyses words found in manually categorised documents, and calculates probabilities that a word will appear in a document of a given category. This training process builds a knowledge set that can be used to classify similar, unclassified documents. The classifier picks out the most likely categories for each document. When the classifier suggested more than one category, we chose to limit the selection to only the top category since we found that gave the most meaningful results overall.

The classifier makes some assumptions about how the data is generated. It assumes that all the words found in a document are independent of each other with respect to its category. In real life, we know this is false, but this model does produce surprisingly good results.

A Bayes classifier contrasts with a rules-based classifier, in which a series of semantic rules is built up and manually refined in an iterative manner until the documents are categorised appropriately. We chose a Bayes classifier because it made maximum use of the manually classified stories.

Why are some documents in the wrong category?

It's important to remember that naive Bayesian analysis is a mathematical, statistical technique, and like any such deterministic algorithm it will never have the flashes of insight or understanding that might allow a human being to make categorisation choices. There are also assumptions inherent in the mathematical nature of the algorithm, such as the assumption that the unknown data is of a similar quality to the known data. However, even allowing for these limitations, we believe that the Bayesian categorisation process was around 85% to 90% accurate (inasmuch as accuracy can be determined for such a subjective question as "has this article been placed in an accurate category?").

We hope that the occasional mis-classification does not detract from the value of the content in the archive. Stories can also be found by searching on particular keywords, and it is assumed that, in the future, search mechanisms will become ever more efficient at returning the results the archive user wants.
