Project Alpha
The home of Project Equinox

Brief look at ALEKSI


Oliver said on 15:36:00 01-Apr-2017

OK, so what is ALEKSI? Well, here's some information!

Many many many years ago, I found myself trawling the internet for information on various subjects, collecting it and compiling it into many different forms. This, as you can imagine, was time-consuming, agonising and, to be quite honest, tedious. Wouldn't it be great if there was some way of scouring the internet automatically for what I needed?

Step forward ELKS, as it was then known (Electronic Link Kollection Service): a MySQL, cURL and PHP-driven engine for sourcing web pages, searching their content, links and images, downloading whatever was needed and storing any new links for the next pass.
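
To give a rough idea of that loop, here's a minimal sketch of a single fetch-scan-store pass in PHP with cURL, DOMDocument and PDO. The "links" table and its "url"/"visited" columns are placeholders for illustration, not the real schema:

<?php
// Minimal sketch of a single ELKS fetch/scan/store pass. A unique index on url is assumed
// so that INSERT IGNORE can silently drop duplicates.

$db = new PDO('mysql:host=localhost;dbname=elks', 'user', 'pass');

function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body rather than printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

function extractLinks($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);              // @ suppresses warnings from messy real-world markup
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;            // relative URLs are left as-is in this sketch
        }
    }
    return $links;
}

// Take one unvisited link, scan it, queue anything new it points at, and mark it done.
$row = $db->query("SELECT id, url FROM links WHERE visited = 0 LIMIT 1")->fetch(PDO::FETCH_ASSOC);
if ($row) {
    $html = fetchPage($row['url']);
    if ($html !== false && $html !== '') {
        $insert = $db->prepare("INSERT IGNORE INTO links (url, visited) VALUES (?, 0)");
        foreach (extractLinks($html) as $link) {
            $insert->execute([$link]);
        }
    }
    $db->prepare("UPDATE links SET visited = 1 WHERE id = ?")->execute([$row['id']]);
}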

The application initially sat on a WAMP stack, requiring an Internet Explorer window to run the script with a meta refresh at the end of each page, but the software was changed swiftly after this: the program took too long to process each page, was resource-intensive, and was running on a Windows server, which made it slower and hungrier still.
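
For the curious, the "browser as scheduler" trick looked roughly like this: the script did one page per request, then printed a meta refresh so the open IE window kicked off the next run. The script name and the one-second delay are invented for the example:

<?php
// Sketch of the WAMP-era driver: one page per request, then a meta refresh so the open
// Internet Explorer window immediately reloads the script and does the next one.

// ... fetch, scan and store a single page here, as in the sketch above ...

echo '<p>Page done, fetching the next one.</p>';
echo '<meta http-equiv="refresh" content="1;url=trawl.php">';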

Version 2 saw some changes to the system. It was designed to run on a LAMP stack, which meant a cron instance was set up for the application, and some major database changes were made. For the most part these changed the schema from MyISAM to InnoDB to allow row-level locking and more instances to run at once. A four-digit checksum was added to aid sorting and to support future plans for multi-machine processing. The changes also meant it no longer took over three minutes to process a page once the database held 1.7 million entries. This was down to the system's new approach of checking new entries against the list of already-visited links as and when they came up for processing, as opposed to checking them when they were added.
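
Here's a rough sketch of that "check on pickup" idea, with a four-digit checksum to narrow the comparison. The queue/visited table names and the checksum formula are assumptions for illustration, not the actual code:

<?php
// Sketch of the version-2 approach: new links go straight into the queue, and a candidate
// is only compared against the already-visited set when it is pulled out for processing.

$db = new PDO('mysql:host=localhost;dbname=elks', 'user', 'pass');

// Four-digit checksum used to bucket URLs, helping with sorting and, later, with
// splitting work across machines.
function urlChecksum($url)
{
    return sprintf('%04d', abs(crc32($url)) % 10000);
}

// When a link is queued it is stored together with its checksum.
$newUrl = 'http://example.com/some/page';
$db->prepare("INSERT INTO queue (url, checksum) VALUES (?, ?)")
   ->execute([$newUrl, urlChecksum($newUrl)]);

// Pull the next queued link; drop it if a visited row with the same checksum and URL exists.
$row = $db->query("SELECT id, url, checksum FROM queue ORDER BY id LIMIT 1")->fetch(PDO::FETCH_ASSOC);
if ($row) {
    $seen = $db->prepare("SELECT 1 FROM visited WHERE checksum = ? AND url = ? LIMIT 1");
    $seen->execute([$row['checksum'], $row['url']]);
    if ($seen->fetchColumn()) {
        // Already processed: discard without fetching the page.
        $db->prepare("DELETE FROM queue WHERE id = ?")->execute([$row['id']]);
    } else {
        // Not seen before: fetch and scan it here, then move the row into visited.
    }
}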

When a page was added, it was scanned for "interest" words. If the words did not appear, the interest level dropped by one for each followed link; if they did appear, that page's interest was incremented by one before continuing. All pages were given a level number: the top page usually started at level 0, all pages linked from level 0 were level 1, and so forth. A cap was set so that when level 100 was reached, the link was retained but not scanned.
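
In rough PHP terms, the scoring and the level cap looked something like this. The word list and the starting score are invented for the example; only the +1/-1 behaviour and the cap come from the description above:

<?php
// Sketch of the interest scoring and level cap.

$interestWords = ['archive', 'catalogue', 'index'];   // hypothetical "interest" words

// +1 if any interest word appears in the page text, otherwise -1 for this followed link.
function scoreInterest($pageText, $currentInterest, array $interestWords)
{
    foreach ($interestWords as $word) {
        if (stripos($pageText, $word) !== false) {
            return $currentInterest + 1;
        }
    }
    return $currentInterest - 1;
}

// Levels: the top page starts at 0 and every link found on a level-N page becomes level N+1.
// Links at level 100 are kept in the database but never scanned.
function shouldScan($level)
{
    return $level < 100;
}

$interest = scoreInterest('An archive of old manuals', 0, $interestWords);   // gives 1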

The ability to scan depth-first was added alongside the default breadth-first scanning in the main link trawler.
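
The nice thing is that breadth-first versus depth-first is really just a question of which pending link gets pulled out next. Something along these lines, with the table and column names assumed:

<?php
// Sketch of how depth-first and breadth-first can share one queue: the only difference is
// the order in which pending links are fetched from it.

$db = new PDO('mysql:host=localhost;dbname=elks', 'user', 'pass');

$mode = 'breadth';   // or 'depth'

// Breadth-first: lowest level first, so a whole level is finished before the next starts.
// Depth-first: highest level first, so the trawler keeps chasing the newest, deepest links.
$order = ($mode === 'breadth') ? 'level ASC' : 'level DESC';
$next = $db->query("SELECT id, url, level FROM queue WHERE scanned = 0 ORDER BY $order, id LIMIT 1")
           ->fetch(PDO::FETCH_ASSOC);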

Version 3 saw the database change back to MyISAM; yes, it made things slower, but at least it didn't break. New routines for offline data cleaning were created: all links to be processed were checked against the to-be-processed database before entering a temporary table, and dropped if they were duplicates. Once that was complete, they were checked against the completed table and dropped if duplicated, or run if not. This was in response to having databases with over 80,000,000 rows in them.
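
A sketch of that offline cleaning pass, using a temporary staging table; the harvested, queue and completed table names are placeholders rather than the real schema:

<?php
// Sketch of the offline cleaning pass: harvested links go into a temporary staging table,
// anything already queued or already completed is dropped, and whatever survives is queued.

$db = new PDO('mysql:host=localhost;dbname=elks', 'user', 'pass');

// Stage the freshly harvested links.
$db->exec("CREATE TEMPORARY TABLE staging (url VARCHAR(2048) NOT NULL)");
$db->exec("INSERT INTO staging (url) SELECT url FROM harvested");

// Drop anything already waiting to be processed...
$db->exec("DELETE staging FROM staging INNER JOIN queue ON queue.url = staging.url");

// ...and anything that has already been completed...
$db->exec("DELETE staging FROM staging INNER JOIN completed ON completed.url = staging.url");

// ...then queue what's left, de-duplicated within the batch itself.
$db->exec("INSERT INTO queue (url) SELECT DISTINCT url FROM staging");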

A group level spread function was added to allow multiple levels to be run at any given time (breadth-first and depth-first would eventually create so many links per level that all links would be of the same level for days on end).
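
Roughly speaking, the spread just pulls work from a band of levels rather than a single one. Something like this, where the band width of five and the column names are assumptions:

<?php
// Sketch of a group level spread: work is drawn from several consecutive levels at once,
// so a single enormous level can't monopolise the trawler for days.

$db = new PDO('mysql:host=localhost;dbname=elks', 'user', 'pass');

$spread = 5;   // how many consecutive levels to draw from at once

// Start at the shallowest level that still has unscanned links...
$minLevel = (int) $db->query("SELECT MIN(level) FROM queue WHERE scanned = 0")->fetchColumn();

// ...and pull a mixed batch from that level and the next few.
$stmt = $db->prepare(
    "SELECT id, url, level FROM queue
     WHERE scanned = 0 AND level BETWEEN ? AND ?
     ORDER BY RAND() LIMIT 20"
);
$stmt->execute([$minLevel, $minLevel + $spread - 1]);
$batch = $stmt->fetchAll(PDO::FETCH_ASSOC);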

Massive OOP changes were made and duplicate code was removed. A new ID stamping system was implemented so that when the INT auto-increment hit its limit, the column could be changed to BIGINT as a temporary measure and eventually replaced completely by a 16-digit alphanumeric code from the ranges system.
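
The migration path looked roughly like this. The ALTER is the stop-gap; the generator below is only a stand-in for the ranges system (which isn't described here) and simply produces a random 16-character alphanumeric code:

<?php
// Sketch of the ID migration path. Table and column names are placeholders.

$db = new PDO('mysql:host=localhost;dbname=elks', 'user', 'pass');

// Stop-gap: widen the auto-increment key before INT runs out.
$db->exec("ALTER TABLE links MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT");

// Eventual replacement: a 16-digit alphanumeric identifier.
function makeLinkId()
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $id = '';
    for ($i = 0; $i < 16; $i++) {
        $id .= $alphabet[random_int(0, strlen($alphabet) - 1)];
    }
    return $id;
}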
