[revml] Re: RevML DTD
Peter Miller
Peter Miller <millerp@canb.auug.org.au>
Mon, 10 Jan 2005 08:50:56 +1100
On Sat, 2005-01-08 at 02:18, Barrie Slaymaker wrote:
> See this link for the latest:
>
> http://public.perforce.com/public/revml/revml.dtd
I've grabbed this one (0.38).
What follows could be seen as me not liking RevML. The reverse is the
case: I think it is a great idea, and one whose time has come. There
are just a stack of questions I have concerning implementation, which
have arisen as I sit here and code an Aegis RevML import/export tool.
> I can send you some [example].
Yes, please!
> VCP generates only valid XML (it checks elements
> against the DTD).
I was going to check my output via nsgmls and the DTD.
I'm writing C++, not Perl. Is there a "DTD to C++" tool?
> > How come you introduced <char code="0xNN"> instead of using the existing
> > &#xNN; mechanism? Maybe some words of explanation in the DTD would
> > help.
>
> It's an XML thing: no matter what XML method you use, you are not
> allowed to encode any character point below a space (32) with the
> exception of a few control characters like carriage return and line
> feed. Even in XML1.1 you can't encode a NUL (0x00). So we need a
> non-builtin way to carry the occasional illegal character through XML.
If I understand correctly this means that
for 0x7F to 0xFF, I use &#xNN; and
for 0x00 to 0x1F, I use <char code="0xNN">
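
A minimal C++ sketch of that reading of the rule, for discussion only. The
function name revml_escape is hypothetical. It assumes <char> is written as
an empty element, that tab, LF and CR pass through unchanged (XML allows
them), and that bytes 0x80 to 0xFF are Latin-1 code points; none of this is
confirmed by the DTD comments.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: escape a byte string for use as RevML element content.
// Assumptions (not confirmed by the DTD): <char code="0xNN"/> is an empty
// element, and input bytes 0x80..0xFF are Latin-1 code points.
std::string revml_escape(const std::string &in)
{
    std::string out;
    for (unsigned char c : in)
    {
        char buf[32];
        if (c == '\t' || c == '\n' || c == '\r')
        {
            // The control characters XML 1.0 does allow: pass through.
            out += static_cast<char>(c);
        }
        else if (c < 0x20)
        {
            // Not representable in XML, even as &#xNN;, so use <char>.
            std::snprintf(buf, sizeof buf, "<char code=\"0x%02X\"/>",
                          static_cast<unsigned>(c));
            out += buf;
        }
        else if (c == '<')
            out += "&lt;";
        else if (c == '>')
            out += "&gt;";
        else if (c == '&')
            out += "&amp;";
        else if (c >= 0x7F)
        {
            // DEL and high bytes: ordinary numeric character reference.
            std::snprintf(buf, sizeof buf, "&#x%02X;",
                          static_cast<unsigned>(c));
            out += buf;
        }
        else
            out += static_cast<char>(c);
    }
    return out;
}
```

A call such as revml_escape(std::string("x\0y", 3)) would carry the NUL
through as a <char> element while leaving the printable bytes alone.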
> Most of RevML has few assumptions other than a series of revisions
> linked in some way.
For a SCM with real change sets (a few pre-CVS and most post-CVS VC/SCM
systems have them) is it expected that a single RevML file describes a
single change set?
The typical usage of diff/patch is that of a change set, even when the
underlying repository is CVS which does not itself understand change
sets. The change set model of diff/patch is one developers understand
intuitively. I'm not sure putting anything else in a RevML file would
meet user expectations. Aggregating several related change sets into
one big change set is still a change set and doesn't break the model.
> Systems like Subversion and, I presume, Aegis, would need their own
> element; the DTD above defines <cvs_info>, <p4_info>, etc. in each rev
> currently as PCDATA blobs, but we can define structured information in
> to them at some point as well.
But that's inherently invalid. It means that you are (implicitly)
encouraging vendors to add tags which *aren't* in the DTD to support
their own platforms.
For example, if I produce a tool which writes RevML files with
<aegis_info> tags, this is not (yet) a valid RevML file, until you
happen one day to see one and add it to the next rev of the DTD. This
doesn't scale for an arbitrary number of VC/SCM systems that the RevML
DTD author has never, and probably will never, see or use or even hear
of.
BTW: maybe "vendor" is the wrong word... it could be interpreted to
exclude Aegis, arch, darcs, monotone, OpenCM, etc.
> > What if a system supports file attributes beyond the ones in the DTD?
>
> We'd open the DTD up to allow them. Let's define them :).
I'd rather add attributes in a way that didn't require DTD changes (see
above). That way a new vendor can appear on the scene, and still
produce valid RevML files.
> I want to capture standard stuff in the DTD to prevent accidental or
> overly creative misuse. By standardizing the commonly available pieces
> in the DTD, including element ordering where convenient, we narrow the
> range of variation and limit accidental dependence on unspecified
> ordering, for instance.
Agreed. But by having commentary in the DTD which says it's OK to add
vendor tags as required just means you get actual misuse (not valid
RevML), instead of just creative misuse.
> > Given the presence of the <REP_TYPE>, why is the rep_type redundantly
> > present in the names of all the <*_INFO> forms?
> It is not now, not sure why it ever was.
Err... the
<!ELEMENT p4_info (#PCDATA|char)* >
(etc) definitions are still in the DTD, and still referenced by the
<!ELEMENT rev> definition.
> > What if a change set moves a file *and* changes it?
> That has not been considered. [...] I've not implemented any backends
> that support a discrete "move"
Aegis, Arch, Subversion, almost any VC/SCM project started since 2000,
all support file moves as a first-class operation. Some also support
"forking" (my term) a file, so they share common history up to a certain
point, and then diverge... a semantic quagmire.
> Yes, although two of the four systems (CVS, VSS) have no changeset
> concept
It's not especially difficult code to extract implicit change set info
from CVS... the Aegis import facility does this. (Sliding time window
across the set of all files, plus different users mean different change
sets.) I've tried it on large projects. At worst it produces too many
change sets rather than too few (e.g. half a commit before lunch, the
other half after lunch).
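
To make the sliding-window idea concrete, here is a rough C++ sketch. It is
not the Aegis import code; the FileRev struct, the infer_change_sets name and
the window parameter are all made up for illustration, and it ignores log
messages and repeated revisions of the same file.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record for one file revision pulled from a CVS log.
struct FileRev
{
    std::string file;
    std::string user;
    long time; // commit time in seconds
};

using ChangeSet = std::vector<FileRev>;

// Group revisions into implicit change sets: same user, and each revision
// within `window` seconds of the previous one in the set. The window slides
// with every revision added, so a long commit stays in one change set.
std::vector<ChangeSet> infer_change_sets(std::vector<FileRev> revs, long window)
{
    // Different users always mean different change sets, so sort by user
    // first and by time second.
    std::stable_sort(revs.begin(), revs.end(),
                     [](const FileRev &a, const FileRev &b) {
                         if (a.user != b.user)
                             return a.user < b.user;
                         return a.time < b.time;
                     });

    std::vector<ChangeSet> out;
    for (const FileRev &r : revs)
    {
        bool start_new = out.empty()
            || out.back().back().user != r.user
            || r.time - out.back().back().time > window;
        if (start_new)
            out.emplace_back();
        out.back().push_back(r);
    }
    return out;
}
```

Note that the result comes back ordered by user and then by time, not in
overall chronological order. A gap longer than the window (the lunch break
case) splits one commit into two change sets, which is the "too many rather
than too few" failure mode.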
> and so the DTD does not assume changesets. I'd like to see some
> explicit support for declaring changeset-wide information and then
> referring to it in individual revs, but that would mean a whole lot more
> logic to handle indirection and save little or no disk space when RevML
> is compressed.
This gets back to my earlier question: does a single RevML contain a
single change set? Or does it contain all change sets for the entire
history of a project? Or something else?
If it only contains a single change set, then no extra machinery beyond
additional non-<REV> attributes is required.
I have found that the users of Aegis accept the
branch-as-a-single-change-set model with few problems.
However, grabbing (and applying) all the change sets of a branch _as
separate change sets_ requires more machinery, for which there are
few successful implementations.
If RevML is supposed to be able to encapsulate change sets, maybe
<!ELEMENT revml (change_set*)> would be enough, with <!ELEMENT
change_set> being defined as the current <REVML> definition. Adding
recursion would probably be helpful... meaning a change set can be
composed of a sequenced set of sub-change-sets.
> I want to limit the ad-hoc use of a generic form to truly
> generic attributes; common attributes should be embodied in the DTD to
> encourage standardization and once a common attribute escapes in to the
> wild encapsulated in a generic form, it can never be recaptured in a
> standard form without having every tool support both forms (ugh).
Yes, this can be a problem. But mostly it is a case of looking up
attributes with several synonyms: extra rows in a lookup table.
The different formats for the values would be a pain, though.
But in a way this is like email headers. The truly generic ones don't
have an X- prefix.
For aegis, the change set attributes include brief_description,
description (RevML's <COMMENT> ?), cause, several testing flags, a pile
of history information including developer(s) and reviewer(s), plus
arbitrary user defined attributes. The file-in-a-change-set attributes
include action, usage (source, test, etc), Content-Type plus arbitrary
user defined attributes.
> > The <TYPE> form is too limited.
>
> It is sufficient for the systems we've used RevML with
And will always be for the vast majority of VC/SCM uses, I expect;
however, there was an interesting thread on the OpenCM mailing list some
time in early 2004 about content types and their applications.
Maybe allowing "text/*" to be understood to mean "text" would be
sufficient in the DTD comments.
> Ignoring portions of an XML grammar is easy :). Coping with multiple
> authors who do not happen to choose the same spelling for a <name> is
> difficult, I think.
Yes. This is something I would like to avoid. I think it needs
extension mechanisms which are in the RevML content, rather than the
RevML structure.
Re: <attribute><name>blah</name><value>blah</value></attribute>
> > Plus, they can all have X-system-blah-blah extensions. The ones that
> > support arbitrary user defined attributes could have User-blah-blah
> > attributes, too.
>
> Nice approach, actually.
Not new with me; it's how extension email headers are written.
> I like the idea of a <user_attribute> and
> <site_attribute> if an SCM makes some true semantic difference between
> them.
Why "site", why not "vendor"? And why make it a different part of the
RevML structure?
Ideally, I'd like to be able to receive a change set in RevML into an
Aegis repository, and all attributes that Aegis doesn't understand, it
simply inserts into the arbitrary attributes of the change set. When
the change set is exported again via RevML, it gets all those attributes
Aegis didn't understand, plus all of the ones Aegis did understand. All
it takes is a little code to say that "[xX]-*-*" attributes don't get a
"User-" prefix.
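
Roughly the "little code" I have in mind, as a C++ sketch. The function
names are hypothetical, and the exact shape of the pattern (an "X-", a
non-empty system name, another "-", a non-empty remainder) is my assumption
about what "[xX]-*-*" should mean.

```cpp
#include <string>

// True when name looks like "X-system-rest" (or "x-system-rest"), with a
// non-empty system part and a non-empty remainder. Assumed pattern.
bool is_extension_name(const std::string &name)
{
    if (name.size() < 5) // shortest possible match is "X-a-b"
        return false;
    if (name[0] != 'x' && name[0] != 'X')
        return false;
    if (name[1] != '-')
        return false;
    std::string::size_type dash = name.find('-', 2);
    return dash != std::string::npos && dash > 2 && dash + 1 < name.size();
}

// Name to use when exporting an arbitrary stored attribute: extension
// attributes keep their name, everything else gets the "User-" prefix.
std::string export_attribute_name(const std::string &name)
{
    return is_extension_name(name) ? name : "User-" + name;
}
```

With this, an attribute received as "X-perforce-jobs" is exported again
under the same name, while a locally defined "reviewed_by" goes out as
"User-reviewed_by".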
> > Note that some systems give each file a unique ID (at least two that I
> > know of use the standard GUID/UUID format) which is immutable; they
> > model filenames as an editable attribute of a file, thus a file rename
> > is a simple change of the filename attribute.
>
> The <rev id="..."> should contain the GUID/UUID while the <name> should
> be its current public identity.
Time to clarify things... what is the rev id supposed to be? The
language in the DTD comments is too loose for me.
Each change set has a UUID, meaning that when I package it up (using
aedist) and email it to a developer, when it unpacks at their end, it
gets the same UUID. Each change set is "the same" no matter which
repository it is in.
But... each file also has its own UUID, from a completely different pool
of UUIDs. (Change set UUIDs have nothing to do with file UUIDs, and
vice versa.) Now, when the REV element is given an ID attribute, is it
the ID of the file, or the ID of the change set?
It makes sense that it would be the ID of the change set, because this
allows all the file revisions of a single change set to be grouped
together... if a RevML file can contain more than one change set.
But if a RevML file only ever contains a single change set, it would
make sense that the REV element's ID would be the file's ID because this
would allow grouping file histories in the face of renames.
Specific RevML DTD 0.38 comments:
Maybe a preamble comment with a glossary? Especially when "site",
"vendor", "repository", "tool", "user" (etc) are used as adjectives.
Being fairly pedantic about wording is a Good Thing in a standard.
In the REVML element, does the COMMENT element refer to the tool and/or
vendor, the site, the specific repository/project at a site, the
specific repository/project replicated to several sites, or a change
set? Or something else?
The REVML element references a BRANCHES element which is never
defined.
The REP_TYPE element's data can be a "vendor" name. Is this case
sensitive? I also notice that you interchangeably use p4/perforce,
vss/sourcesafe, etc, all through the DTD comments. Are aliases allowed
in the REP_TYPE tag value?
The REP_DESC description talks about the repository as if it was a site
specific attribute, but the suggested values look more like a tool
("vendor") attribute. Which is it?
The REV_ROOT element appears to describe what Aegis calls a project... a
unique (within a site) repository identifier. A site could potentially
host many, many projects. Would this map to a CVS module name, or the
actual path to the CVS_ROOT? I ask because CVS_ROOT potentially refers to
*many* projects (it doesn't help that CVS itself is rather fuzzy about
the distinction).
The BRANCH_ID element comment talks about exporting a branch. Does this
mean that a single RevML file is intended to describe all of the change
sets to a branch? (What if the branch has sub-branches? are they in
there too?)
Is it really necessary to have ACTION, P4_ACTION and SOURCESAFE_ACTION?
Surely a single ACTION with add/create, delete/remove, edit/modify and
move/rename values is sufficient?
(Well, not 8 alternatives, 4 will do. Aegis has more, but they can all
be encoded as "edit".)
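
As a C++ sketch of what a reader could do with a single ACTION element: the
eight spellings above collapse to four values through a lookup table. The
enum and function names are hypothetical, and treating any unrecognized
word as "edit" is my assumption, generalized from the Aegis case.

```cpp
#include <map>
#include <string>

// The four canonical actions suggested above.
enum class Action { Add, Delete, Edit, Move };

// Map any of the eight spellings onto its canonical action.
// Assumption: anything not in the table is treated as an edit.
Action normalize_action(const std::string &word)
{
    static const std::map<std::string, Action> table = {
        {"add", Action::Add},       {"create", Action::Add},
        {"delete", Action::Delete}, {"remove", Action::Delete},
        {"edit", Action::Edit},     {"modify", Action::Edit},
        {"move", Action::Move},     {"rename", Action::Move},
    };
    std::map<std::string, Action>::const_iterator it = table.find(word);
    return it == table.end() ? Action::Edit : it->second;
}
```

The lookup is case sensitive as written; whether ACTION values should be
case sensitive is the same open question as for REP_TYPE.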
The DIGEST element - what exactly does it contain? Is it the md5sum of
the content/delta text, or is it the md5sum of the file after the delta
is applied? Or something else? It is only by context that I guess an
md5sum is the value.
What happened to the <FILE_COUNT> element? I like the idea of a
progress bar. (probably misnamed, maybe it should have been a
<REV_COUNT>, although it does highlight the need to carefully
distinguish between a REV and a file in the glossary and then rigorously
use the terms as defined.)
That's plenty for now.
--
Peter Miller <millerp@canb.auug.org.au>