Blog

2007-05-29 - Todd Holbrook

I'm well into the ERM development now, and that has left the trunk of the SVN tree in a state that may not be usable for new installs. It's probably best to grab the CUFTS2 branch.

2007-03-26 - Todd Holbrook

The subject and association tables have now been normalized. The update is here:

sql/updates/normalize_cjdb_subjects_associations-2007-03-08.sql

There have been a number of minor bug fixes as well. I'm currently working on roughing in some screens for the upcoming ERM system.

2007-02-26 - Todd Holbrook

Over the past few months I've focused on optimizing some parts of CUFTS that still used prototyping code that messed with the internals of Class::DBI (CDBI::Relationship::HasDetails), and on normalizing the heavily used CJDB title tables. Hopefully I'll have time to do the subject and association tables this week as well. The main production CUFTS system currently handles about 20,000 requests per day and was starting to see speed issues with a few queries.

The SQL update for removing the Details tables is here:

sql/updates/move_all_details_into_real_tables-2006-11-15.sql

The SQL update for normalizing the CJDB title tables is here:

sql/updates/normalize_cjdb_titles-2007-02-20.sql

I'm trying to get some optimization done in anticipation of the next big development: adding a database of databases (public resource list), ERM data, and a number of associated changes. This is being driven by funding from a few COPPUL sites, so their needs will have priority, but if there's development in this area you'd like to see, you can join the discussion in the CUFTS ERM Forum.

2006-07-26 - Todd Holbrook

I haven't been keeping the blog up to date... sorry about that. Most of the last few months has been spent on maintenance, bug fixes, and adding minor features as they come up. I have a rewrite of most of the resource modules which will be committed in the next day or two. This includes a new module to preload all the resource modules when the web applications start up so that any syntax errors will be thrown immediately. Since on-the-fly loading is really only a benefit if you're not running a persistent Perl interpreter (i.e. not using mod_perl), preloading should speed up individual requests a bit at the expense of slightly longer startup times. I'll be looking at removing all the on-the-fly loading code eventually; I just have to make sure it doesn't break anything.
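
As a rough illustration of the preloading idea (the directory, namespace, and error handling here are my own assumptions, not the actual CUFTS code), something like this runs once at application startup:

use strict;
use warnings;
use File::Basename qw(basename);

# Assumed location of the resource modules; adjust to the real install path.
my $resource_dir = 'lib/CUFTS/Resources';

foreach my $file ( glob("$resource_dir/*.pm") ) {
    my $module = 'CUFTS::Resources::' . basename( $file, '.pm' );

    # Compile the module now so syntax errors show up at startup,
    # not on the first request that needs the resource.
    eval "require $module";
    die "Failed to load $module: $@" if $@;
}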

A big new feature is support for RSS feeds, in two ways. The journal_auth table has a field for persistent storage of the RSS feed URL, and that gets passed to the CJDB when it is built. The CJDB displays a little RSS button beside the journal when in browse mode. That button links to a proxied RSS feed: commonly used link fields are rewritten to include the proxy prefix before the feed is passed to the patron. This lets patrons use their own RSS reader while still getting proxied links directly to the articles. As well, the individual journal display will attempt to grab the RSS feed and present the results directly to the patron in the CJDB display. Currently it displays as much as possible (abstracts, etc.); however, given the variance in the quality of publisher data, it might be better to display only basic information to avoid confusion.
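
For anyone curious how the link rewriting might look, here's a minimal sketch using XML::RSS. The proxy prefix, the feed URL, and the choice of fields rewritten are illustrative assumptions rather than the actual CUFTS code:

use strict;
use warnings;
use LWP::Simple qw(get);
use XML::RSS;

# Hypothetical proxy prefix and feed URL for illustration only.
my $proxy_prefix = 'http://proxy.example.edu/login?url=';
my $feed_url     = 'http://publisher.example.com/feed.rss';

my $content = get($feed_url) or die "Unable to fetch $feed_url";

my $rss = XML::RSS->new();
$rss->parse($content);

# Rewrite each item's link so the patron's RSS reader goes through the proxy.
foreach my $item ( @{ $rss->{items} } ) {
    $item->{link} = $proxy_prefix . $item->{link} if $item->{link};
}

print $rss->as_string;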

There are a few new update SQL files you should load if you're trying to keep current with the SVN tree. They support the features above and should be datestamped and in the sql/updates dir.

I've also updated the Catalyst scripts to 5.7, but they have not been tested in production. That will be happening over the next few days as well.

2006-03-22 - Todd Holbrook

I've spent a bunch of time over the last couple of weeks working on speed optimizations for CUFTS.

The first was fairly simple to fix: deleting resources took forever. I tracked this down to an index I had changed a few months ago. The resource and journal columns in the local_journals table were indexed together to enforce a UNIQUE constraint on them, which solved an earlier bug of duplicate local_journals records when people hit submit twice (or something, I never did track down what caused the dupes for sure). I removed the index on the journal column at that time, thinking that local journals were never retrieved without the resource. Of course, it turns out that Class::DBI has_many delete cascading uses the journal column alone and thus was forcing table scans without the index. Doh. Adding the journal index back in fixes that one. Deleting a 300 title global resource is down from 134 seconds to 7. Here's the index, plus a couple of other index tweaks which should save space, though they probably won't save any query time.

CREATE INDEX local_journals_j_idx ON local_journals (journal);

DROP INDEX local_journals_e_issn_idx;
DROP INDEX local_journals_issn_idx;
DROP INDEX local_journals_title_idx;

CREATE INDEX local_journals_e_issn_idx ON local_journals (e_issn) WHERE e_issn IS NOT NULL;
CREATE INDEX local_journals_issn_idx ON local_journals (issn) WHERE issn IS NOT NULL;
CREATE INDEX local_journals_title_idx ON local_journals (title) WHERE title IS NOT NULL;

While looking for this error, I decided to implement fast deleting in the HasDetails module. This should be safe since HasDetails tables are pretty standard and should never have has_many relationships that need to be followed. So now the HasDetails module looks for the presence of CDBI::Plugin::FastDelete in the details table class and uses it to delete all details in one database call rather than iterating through them like Class::DBI normally does. This can be avoided by not loading the FastDelete plugin if there's a reason to avoid it. Unfortunately it adds another CPAN module to the CUFTS dependencies, but the delete example above goes from 7 seconds to 4. That doesn't sound like much on a 300 journal table, but there are some in the 10,000 range where this will save lots of time.
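
The dispatch logic amounts to something like the sketch below; the method and column names are placeholders of my own, not the real HasDetails code:

# Sketch only: if the details class can do a bulk delete (because a
# FastDelete-style plugin is loaded), use it; otherwise fall back to
# Class::DBI's usual row-by-row delete. 'delete_all_matching' and the
# 'parent' column are hypothetical names for illustration.
sub delete_details {
    my ( $self, $details_class ) = @_;

    if ( $details_class->can('delete_all_matching') ) {
        # One DELETE statement covering every detail row for this record
        $details_class->delete_all_matching( parent => $self->id );
    }
    else {
        # Standard Class::DBI behaviour: retrieve each row and delete it
        $_->delete for $details_class->search( parent => $self->id );
    }
}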

2006-03-02 - Todd Holbrook

Ok, it turns out that caching hits on web services may be useful for more than just CrossRef, so I've removed the CrossRef-specific cache and implemented a generic one. Sorry if you've already updated! If you have, drop the "crossrefcache" table and add this:

/CUFTS/sql/CUFTS/searchcache.sql

2006-02-28 - Todd Holbrook

Lots of new interesting stuff on the go. There's now an AIM/AOL/ICQ bot which will do basic journal information queries like Jake did. You can chat it up as "CUFTS2" on those networks.

I've done basic work for a Google Scholar dump that sites will be able to do from CUFTS. I'm waiting to hear back from Google about whether the format is correct.

The journal_auth decisions have been made and the tools are now pretty much finished. There's a nice interface in the MaintTool for merging and editing journal_auth records. You can now also edit global journal data in the MaintTool, mainly so you can change which journal_auth record a journal is linking to, but you can also use it to edit other data. That's not advisable, however, since it bypasses the scripts which notify sites of database changes.

I've added a CrossRef caching system which should cut a few seconds off of repeat requests. It does mean another periodic "cleanup" script will have to be run (you don't want things cached forever). Alternatively, I may add cache expiring to the CrossRef module so it happens on the fly.
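
For what it's worth, the periodic cleanup could be as simple as the sketch below. The column name ("created") and the seven-day cutoff are assumptions on my part, so check crossrefcache.sql for the actual schema before running anything like it:

use strict;
use warnings;
use DBI;

# Connection details are placeholders; use your own CUFTS database settings.
my $dbh = DBI->connect( 'dbi:Pg:dbname=CUFTS', 'cufts_user', '',
    { RaiseError => 1, AutoCommit => 1 } );

# Remove cached CrossRef results older than a week so nothing is cached forever.
my $deleted = $dbh->do(
    q{DELETE FROM crossrefcache WHERE created < NOW() - INTERVAL '7 days'}
);

printf "Removed %d expired cache rows\n", $deleted;

$dbh->disconnect;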

If you update an existing CUFTS2 install, please note the new database table that should be added:

/CUFTS/sql/CUFTS/crossrefcache.sql

2006-01-17 - Todd Holbrook

I've been spending a lot more time than I anticipated grappling with how to manage the "journals_auth" records. This is the table of unique journals that is stable across CJDB loads. User tags are tied to it, and it also contains base MARC information for building CJDB records without associated print information. I'm trying to avoid running into the same problem as the jake project: updates to the database that are too human-intensive. After trying out a variety of loads and hand-checking a bunch of merged records, I've decided not to merge any journal_auth records based on title during the automated loads. There's too much chance of errors, and managing the splitting of those records later is not trivial. Merging them later through a tool is pretty simple and should only require a couple of clicks, however, and those merged records should be "sticky" in that later data loads will not require any more human intervention. Hopefully I'll have a first pass at this loading done today, along with simple tools for testing whether this method of maintaining these records will be workable.

2006-01-16 - Todd Holbrook

If you're synced to the SVN tree, note that there's a new field in the sites table of the database that you'll need to add. It adds support for WAM (III) proxying, which does not work with a simple URL prefix.

ALTER TABLE sites ADD COLUMN proxy_WAM VARCHAR(512);

2006-01-09 - Todd Holbrook

All the 2.0 beta tickets are now complete and I've moved the new code to a production server for local testing. There's still some functionality (tag/user administration, journal_auth tools, etc.) I'd like to add, but it's mostly stuff that can be added onto existing installations without a problem.

Now I need to move the bulk of the information from the original CUFTS website over to Trac and its Wiki and then proxy it. The existing pages are pretty outdated, but hopefully having them up here, where they're easier to maintain and other people can modify them (with permission), will keep things a little more up to date.

2006-01-03 - Todd Holbrook

I spent some time over the holidays investigating why CJDB page displays were slower than I had anticipated. My early implementation was fairly quick: under 2 seconds for a page with a couple hundred journals with links to display. It turns out that the problem was the template code I added to check the CUFTS database for any link-level notes associated with each journal. It added 1-5 database searches per journal, which was enough to almost double the template building time. I was all set to change the link-grabbing code into a join; however, the separation between the CJDB and CUFTS databases made this impossible to do on the fly. Since most of the reasons for having two truly separate databases have gone away anyway, I spent some time merging them back together and testing the template rendering. Thankfully, the CJDB display speed is back where it should be, though it'll set me back a bit as I rewrite a bunch of code to use the new table names. I'll also have to rewrite the installer, but most of that is removing code.
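
To illustrate the kind of change involved (table and column names here are hypothetical, not the actual schema), the idea is to replace the per-journal note lookups with one query covering all the links on a page, which the merged database now makes possible:

use strict;
use warnings;
use DBI;

# Placeholder connection settings for illustration.
my $dbh = DBI->connect( 'dbi:Pg:dbname=CUFTS', 'cufts_user', '', { RaiseError => 1 } );

# The links being rendered on the current CJDB page (hypothetical ids).
my @link_ids     = ( 101, 102, 103 );
my $placeholders = join ',', ('?') x @link_ids;

# One query for every link on the page, instead of 1-5 queries per journal.
my $rows = $dbh->selectall_arrayref(
    "SELECT link, note FROM link_notes WHERE link IN ($placeholders)",
    { Slice => {} },
    @link_ids,
);

# Hand the template a hash of notes keyed by link id.
my %notes_for_link;
push @{ $notes_for_link{ $_->{link} } }, $_->{note} for @$rows;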