Teamware/SCCS history conversion to Mercurial


Originally posted back in December 2007. I've added some new references and some possible strategies at the end.
Silver Falls, Oregon. No, it doesn't use Mercurial, yet.

Just a few notes on converting source file change history from Teamware/SCCS to Mercurial. These are just notes because in the JDK area, and in any Teamware JDK workspaces being converted, we don't plan on converting the old source change history into Mercurial. The major reason why we aren't is a legal issue, and you can imagine what the legal issues are with regard to non-open sources that become open. I won't get into that. But there are some technical issues too, which I will try to cover in case someone decides to attempt such a conversion.

Why convert the revision history?


The complete source history is an extremely valuable asset; being able to know when a change was made years ago, and by whom, is often essential to understanding a problem in a product. Initially we wanted to preserve this source change history and assumed it wasn't a difficult job. Most engineers have been upset that our current plans don't include this history conversion, but read on if you are curious as to the problems encountered.

The Basic Idea


The basic idea in doing an 'ideal' source history conversion would be to create a Mercurial changeset for each Teamware putback. That means you need to identify the putback event, the specific SCCS revisions of the files, and any file renames or deletes. And each changeset is built upon the previous changeset, so the ordering of the changesets is critical here.
Sounds simple, right? Well, read on, it's not so simple.

The Problems


History Files: You need to understand how the Teamware history file works. The Codemgr_wsdata/history file in a workspace does not propagate, so the specifics of a putback won't percolate around your tree of workspaces. This means that each workspace has a history of the Teamware events that happened to it, but not the details of anything that happened to the other workspaces. So to get accurate Teamware history you need the entire tree of integration workspaces (any workspaces that might be the target of a putback), including all that ever existed, and then you'd need to fold all the events in these history files together in the proper time order. So the more complicated the Teamware workspace integration tree, the more difficult this task becomes. The JDK workspaces (there are many different workspaces) each have 6-20 different integration workspace instances, and some of these workspaces go back quite a few years, so we are talking about some major source change history here.
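The folding step could look roughly like the sketch below. It assumes the events have already been parsed out of each workspace's history file and normalized to a single timezone; the `Event` fields and the `merge_histories` name are made up for illustration and say nothing about the real file layout.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(order=True)
class Event:
    """One parsed history event (hypothetical shape); ordering uses only `when`."""
    when: datetime
    workspace: str = field(compare=False)
    kind: str = field(compare=False)      # e.g. "putback" or "bringover"
    user: str = field(compare=False)
    files: tuple = field(compare=False, default=())


def merge_histories(per_workspace):
    """Fold per-workspace event lists into one time-ordered list."""
    # heapq.merge expects each input to be sorted already, so sort defensively.
    streams = [sorted(events) for events in per_workspace]
    return list(heapq.merge(*streams))
```

The merge itself is the easy part. Finding every integration workspace that ever existed, and trusting its timestamps, is where the real work would be.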
SCCS file revision numbers: The details in the Teamware history will just list the files involved in a putback or bringover, not the specific SCCS Rev numbers for the files. So matching up the specific SCCS Revs on files to the specific putback event that putback these SCCS revisions is not trivial. (I think there may have been an option in Teamware to record the SCCS revision numbers in the workspace history file, but it is off by default, which is a shame.) So to create a nice neat Mercurial changeset you need to somehow match up the file lists and timestamps of the putbacks with the individual SCCS revision numbers of the source files. Unfortunately, the SCCS files record a time but no timezone, so anyone who decides to do this kind of history conversion will need lots of fuzzy timestamp logic to match up the right SCCS revisions with the putbacks. The username is included in the Teamware history file and the SCCS revisions, so that may also help, except that often the integrator of changes isn't the same person that did the SCCS revision.
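As a rough illustration of what that fuzzy timestamp logic might involve, here is a minimal sketch. The tuple layout, the `match_deltas` name, and the 12-hour default `slack` are all assumptions made for the example, not anything Teamware or SCCS provides. It ignores usernames, since the integrator often isn't the author of the delta.

```python
from datetime import datetime, timedelta


def match_deltas(putback_time, putback_files, deltas, slack=timedelta(hours=12)):
    """Guess which SCCS revision of each file went in with a putback.

    deltas: iterable of (path, rev, delta_time) tuples with naive timestamps.
    Returns {path: rev} using the newest delta that is not later than
    putback_time + slack. The slack stands in for the unknown timezone.
    """
    best = {}
    for path, rev, when in deltas:
        if path not in putback_files:
            continue                      # file was not part of this putback
        if when > putback_time + slack:
            continue                      # too late even allowing for timezone skew
        if path not in best or when > best[path][1]:
            best[path] = (rev, when)
    return {path: rev for path, (rev, _when) in best.items()}
```

Note that widening `slack` can change the answer for a file with several deltas on the same day, so the results are candidates to review, not facts.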
SCCS Revision Tree: The SCCS revision tree for each file can be a fairly complex graph, depending on how many file merges happened to the file. You might be able to just use the top level SCCS revision number, but the SCCS comments of the other revisions will contain important information to preserve.
Deleted files: Teamware deletes files by moving them from where they are to a top level deleted_files directory. So they don't really get deleted, just renamed. However, a common practice with many teams is to purge the deleted_files directory once a product reaches a major milestone. So some of the files may actually be permanently gone, and this needs to be taken into consideration. At some point, you can't recreate the complete source tree if this has happened.
Empty comments: Empty SCCS revision comments, and empty putback comments, would also create problems if you planned on using these comments or cookies of information in these comments to connect up the files to the putback events (e.g. bug id numbers or change request numbers). So more specific SCCS revision comments and more specific putback comments might make this job easier.

Approaches


We considered multiple approaches to doing a source revision history conversion. You could come at it from the putback events, using the history files to identify 'real' changesets, and hope any deleted files are still around. What you'd use as Revision 1 of the files might be a little tricky. Or you could try to just look at the SCCS revisions, and figure out via timestamps, usernames, and perhaps SCCS comments, which files were changed as a group. Or a combination of both. Or you could try to come at it from a time perspective, e.g. all the changes to get you from April 1, 2004 to May 1, 2004.
The simple approach of one changeset per SCCS revision isn't really that simple because Mercurial changesets have an order to them. To do it right you'd need to view the Teamware workspace as a large graph of file nodes, with small sub-graphs of SCCS revisions. Then pick a time T to start Revision 1 of the Mercurial sources, find all the file instances at time T, add these files as a changeset to Mercurial, then repeat that for T+1. Or perhaps T+N where N is selected based on sampling timestamps after T for a quiet time (to avoid picking a time that might split up file changes that happened in a group). Just some wild ideas.
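The quiet-time idea could be sketched like this: sort all the SCCS revision timestamps and cut wherever the gap between neighbours reaches some threshold. The function name and the one-hour default are arbitrary choices for the example.

```python
from datetime import datetime, timedelta


def split_at_quiet_times(timestamps, quiet=timedelta(hours=1)):
    """Group change timestamps into bursts separated by at least `quiet` of inactivity."""
    groups = []
    for t in sorted(timestamps):
        if groups and t - groups[-1][-1] < quiet:
            groups[-1].append(t)      # still inside the current burst
        else:
            groups.append([t])        # the gap was long enough: start a new burst
    return groups
```

Each burst would become one candidate changeset, with its last timestamp playing the role of T+N.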
But it just feels wrong: there is no putback data, the files won't be bunched right, and the resulting repository would contain inaccurate source state in any of these converted changesets.
We never fully explored all the approaches because once the legal constraint came in, there seemed no need to pursue it. It's an interesting and complicated problem, but ultimately one we decided we didn't have to solve.

Conclusion


So the bottom line is that whatever can be created would likely have questionable data if someone asked to have the sources per a particular date or if they wanted to know the state of the entire source tree when a given change was made... Hard to ever be perfect here, and not being perfect could send a few engineers down some deep rabbit holes. :^(
The old history isn't being destroyed, it's just being left in the old Teamware workspaces. So we will still have access to it, just not via Mercurial repositories. As time passes, we'll build up new and better history in our Mercurial repositories, and maybe by the time I retire, it won't matter much. ;^)

Update: Some Ideas


Jesse's conversion script turns out to be a possibility. He documents the problems with it, but it's certainly a step toward something.
With the OpenJDK6 repositories, which were originally in TeamWare, we had two ways to gain some history. With each build promotion while in TeamWare, we saved a source bundle, so we had a raw snapshot of the source for each build. Using these as potential working set files allowed us to start rev0 with the Build 0 source bundle, then for each build promoted after that, repeat the steps:
  1. Delete the working set files
  2. Copy in new working set files from the source bundle for Build promotion N.
  3. Run: hg addremove ; hg commit --message "Build Promotion N" ; hg tag BuildN
This provides a coarse-grained history, not great, but it could be very valuable for narrowing down when a change came in. Adding in more specific history required patch files that you would apply in between, but you needed the patches, you needed to know which build a change went into, and most critically, you needed to know the order of the patches or changes in case two fixes modified the same files. Ultimately, it worked for OpenJDK6, to a degree. The Build Promotion revisions were accurate, but sometimes getting the others accurate was hard to do. And unfortunately, all the changesets in OpenJDK6 look like they were created by me, which is right in a way, but I really wasn't responsible for many of the patches. So the authors, dates, and SCCS comments were not included, but the bugids were.
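For anyone wanting to script the three build promotion steps above, a minimal sketch might look like the following. The function names are made up, the source bundle is assumed to be already unpacked into a directory, and keeping `.hgtags` alongside `.hg` during the delete step is my assumption so that earlier tags survive. This is not the script that was actually used for OpenJDK6.

```python
import shutil
import subprocess
from pathlib import Path


def promotion_commands(n):
    """Return the hg commands from step 3 for build promotion n."""
    return [
        ["hg", "addremove"],
        ["hg", "commit", "--message", f"Build Promotion {n}"],
        ["hg", "tag", f"Build{n}"],
    ]


def import_promotion(repo, bundle_dir, n):
    """Replace the working set with the bundle for promotion n, then commit and tag."""
    repo, bundle_dir = Path(repo), Path(bundle_dir)
    # Step 1: delete the working set files, keeping Mercurial's own data
    # (.hg) and the tag file (.hgtags, an assumption of this sketch).
    for entry in repo.iterdir():
        if entry.name in (".hg", ".hgtags"):
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    # Step 2: copy in the working set files from the unpacked source bundle.
    shutil.copytree(bundle_dir, repo, dirs_exist_ok=True)
    # Step 3: record the snapshot as one changeset and tag it.
    for cmd in promotion_commands(n):
        subprocess.run(cmd, cwd=repo, check=True)
```

Calling `import_promotion` once per saved bundle, in build order, would give the one-changeset-per-promotion history described above. Because of `check=True`, a bundle identical to the previous one would stop the loop when `hg commit` reports that nothing changed.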
Anyway, just thought I would update this rather old posting.

-kto
