For PageFreezer, we created a proprietary web crawler that accounts for the peculiarities of a wide range of web servers and browsers. It is a Java library that integrates well with any project and provides interfaces for overriding various behaviors.
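Such an extension point might look like the sketch below. The `FetchPolicy` interface, its methods, and the user-agent string are purely illustrative assumptions, not the library's actual API.

```java
// Hypothetical sketch of the kind of overridable behavior such a crawler
// library could expose; all names here are illustrative, not real API.
public interface FetchPolicy {
    /** Decide whether a discovered URL should be fetched at all. */
    boolean accept(String url);

    /** User-agent string to send for this request (overridable per site). */
    default String userAgent(String url) {
        return "ExampleCrawler/1.0"; // placeholder value
    }
}
```

Because the interface has a single abstract method, a client could override just the acceptance rule with a lambda while keeping the default user agent.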
To make monitoring the crawling processes as convenient as possible, we created an informative admin interface. We made it possible to capture images and Flash animations as well as text, even when they were hosted on different domains; an extra URL list was created for this purpose.
Include, exclude, and advanced website settings were introduced, making it convenient for users who wish to crawl only certain URLs depending on keywords. Flexible user-agent selection for crawling was also added. The mechanism was designed to crawl web pages at moments when they are not under high load, and clients can use the crawling-speed option to configure the number of crawl workers for each individual task, reducing the load on the website.
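A minimal sketch of how keyword-based include/exclude filtering could work is shown below; the `UrlFilter` class and its precedence rules are hypothetical assumptions, not PageFreezer's actual implementation.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical include/exclude URL filter: exclude rules win over include
// rules, and an empty include list accepts everything not excluded.
public class UrlFilter {
    private final List<Pattern> includes;
    private final List<Pattern> excludes;

    public UrlFilter(List<String> includeKeywords, List<String> excludeKeywords) {
        this.includes = includeKeywords.stream().map(Pattern::compile).toList();
        this.excludes = excludeKeywords.stream().map(Pattern::compile).toList();
    }

    public boolean shouldCrawl(String url) {
        for (Pattern p : excludes) {
            if (p.matcher(url).find()) return false; // excluded keyword found
        }
        if (includes.isEmpty()) return true; // no include rules: accept the rest
        for (Pattern p : includes) {
            if (p.matcher(url).find()) return true;
        }
        return false;
    }
}
```

Exclude-over-include precedence is one common design choice; a real product would document which rule wins when both match.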
Redwerk also implemented standard sitemap XML crawling to reduce the time it takes to crawl large websites: only modified pages and their contents are crawled and archived.
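The idea of sitemap-driven incremental crawling can be sketched with the JDK's built-in XML parser. `SitemapDelta` is a hypothetical helper, and the sketch assumes date-only `<lastmod>` values, whereas real sitemaps may use full W3C datetimes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: read <url> entries from a sitemap and keep only the
// pages whose <lastmod> is after the previous crawl date.
public class SitemapDelta {
    public static List<String> modifiedSince(String sitemapXml, LocalDate lastCrawl)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));
        NodeList urls = doc.getElementsByTagName("url");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < urls.getLength(); i++) {
            Element url = (Element) urls.item(i);
            String loc = url.getElementsByTagName("loc").item(0).getTextContent().trim();
            NodeList lastmod = url.getElementsByTagName("lastmod");
            // No <lastmod> means we cannot prove the page is unchanged, so crawl it.
            if (lastmod.getLength() == 0
                    || LocalDate.parse(lastmod.item(0).getTextContent().trim()).isAfter(lastCrawl)) {
                result.add(loc);
            }
        }
        return result;
    }
}
```

Pages without a `<lastmod>` are crawled anyway, erring on the side of completeness.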
A number of technically advanced crawling options were also made available:
- parsing links out of XML files using XSLT templates
- a generic authentication mechanism that lets the crawler log in to almost any website
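The first option above, extracting links from XML with XSLT, can be sketched with the JDK's built-in transformer. The helper class and the stylesheet in the test are illustrative examples, not PageFreezer's actual templates.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

// Hypothetical sketch: apply a user-supplied XSLT template to an XML feed,
// producing crawlable links as plain text, one URL per line.
public class XsltLinkExtractor {
    public static String extractLinks(String xml, String xslt) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

Letting users supply the stylesheet is what makes this generic: the same extractor handles RSS, Atom, or any custom XML feed without code changes.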
All of these features make PageFreezer a much more technologically advanced solution compared to the competition.
To let users jump to a desired point in time, a convenient calendar was created that highlights the dates on which snapshots were taken. To let users see the site structure, we created a simple navigation tree reflecting the URL hierarchy; every tree node is clickable and opens the corresponding site page.
One of the most difficult tasks in this project was selecting the best storage option, which had to be highly scalable: the project started with approximately 500 sites but had to be prepared for many more. We toyed with the idea of using S3 or Google for some time, but those proved too slow to access and too expensive. So Redwerk came up with a more flexible, custom-tailored approach, and after some benchmarking we built a simple yet scalable custom storage cloud from scratch, based on a database and an NFS file system.
It was essential, as always, to ensure that no information would be lost if any part of the system failed. We implemented logic that makes crawlers stop and wait if the database or the file system becomes unavailable; once these components come back, no information gathered by the crawlers is lost, and checksums help maintain the integrity of all stored data.
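The stop-and-wait idea can be sketched as a queue that buffers captured pages and retries flushing them until the store accepts writes again. `ResilientWriter` and its method names are illustrative assumptions, not the actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Predicate;

// Hypothetical sketch: captured pages are queued in memory and flushed to
// storage; if a write fails (storage down), the crawler keeps the queue and
// retries later, so nothing gathered is lost.
public class ResilientWriter {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Predicate<String> store; // returns false while storage is down

    public ResilientWriter(Predicate<String> store) {
        this.store = store;
    }

    public void capture(String page) {
        pending.add(page);
        flush();
    }

    /** Drain the queue in order; stop at the first failed write and keep the rest. */
    public void flush() {
        while (!pending.isEmpty()) {
            if (!store.test(pending.peek())) return; // storage unavailable: wait
            pending.remove();
        }
    }

    public int pendingCount() {
        return pending.size();
    }
}
```

Draining in FIFO order preserves the capture sequence, which matters when snapshots are timestamped.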
A digital signature is a cryptographic mechanism for validating digital documents or messages. Digital signatures are used in almost every sector of the economy to detect forgery or tampering, making them a fundamental security tool.
The PageFreezer service is no exception. Here, Redwerk opted for a TSA (Time Stamping Authority), which PageFreezer uses to digitally sign all crawled content. Hashes of crawled content, verified certificates, user keys, and timestamps are all used when signing through the TSA. A valid TSA signature therefore gives PageFreezer clients reason to believe that the original web page was crawled at a particular moment in time. Thanks to this implementation, PageFreezer data can even be used as evidence in court.
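The core idea can be sketched as follows: what a TSA attests to is a binding between a snapshot's digest and a point in time. A real RFC 3161 token is a signed ASN.1 structure issued by the authority; the simplified record below only illustrates what is being attested and is not PageFreezer's implementation.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.HexFormat;

// Simplified illustration of timestamp attestation: "this digest existed at
// this time". A real TSA token carries the authority's signature over this
// binding; that part is omitted here.
public class TimestampSketch {
    public record Attestation(String sha256Hex, Instant signedAt) {}

    public static Attestation attest(byte[] snapshot, Instant now)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return new Attestation(HexFormat.of().formatHex(md.digest(snapshot)), now);
    }

    /** Verification: recompute the digest and compare it with the attested one. */
    public static boolean verify(byte[] snapshot, Attestation a)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return HexFormat.of().formatHex(md.digest(snapshot)).equals(a.sha256Hex());
    }
}
```

Because verification only needs the content and the attested digest, any third party can check a snapshot without trusting the archive itself.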
Once the feature is enabled, all snapshots available to the user are signed through the TSA, and the signature can be verified on the browsing page at any time.
To protect data from destructive forces and the unwanted actions of unauthorized users, we use a rock-solid combination of firewalls, fail2ban, backups, and slave database servers. Generally speaking, the system was created to be as modular and scalable as possible: the components do not affect each other's performance, crawlers run as separate processes, and separate modules were designed for logged-in users and guests.