So, hi all. First time commenting on the list.

I personally think having tags (whether of type "author:" or type "by") is useful for two reasons. One: it allows multiple tags to be entered. Two: it clears up the potential problem listed by Fletcher regarding tags.

    by Christoph Freitag
    Affiliation: XYZ
    by Fletcher T. Penney
    Affiliation: ABC
    tags: Markdown, Standardization, MMD, Metadata
    desc: An interesting discussion of how metadata could be included
    usefully in Markdown, whilst being readable etc.

Regarding the localisation problem then, I thought that this was a solved problem when it came to computing? (At least in the cases of the major world languages.) A parser could have a table of equivalent words, so in English "by", en français "de" (pardon my French*).

* By which I mean, I'm not sure that's correct, because I'm only a learner.

> From: Christoph Freitag <mail at christoph-freitag.de>
>
> Fletcher, sorry, but personally -- despite loving MMD (and even having used MMD CMS for a diary) -- I have never liked the way MMD handles metadata. Partly this is because, not being a native English speaker, I dislike English meta descriptors. A localization could resolve this -- but I still think it looks ugly. However, do you actually need descriptors at all? I doubt it:
>
> * The title could be anything "at the start" of the document. Blosxom is a good example. Anything up to the first blank line is the title.
> * After that, anything between the first blank line and the second blank line would be treated as additional metadata.
> * Instead of the "Author:" descriptor, explicitly stated, it should suffice to write "by". What follows is the name of the author. (Localization would be easier as only this "keyword" would have to be known to the parser in a number of languages.)
> * Dates would be self-explanatory, to a clever parser.
> * Any list of words separated by commas on a single line would be treated as tags.
> * Any more fanciful meta descriptors might be given explicitly just as in MMD before. This could be left to non-standard, personalized variants of Markdown.
>
> Thus the following would be a valid document:
>
> ---
> Test Document for Automatic Metadata Detection
>
> by Christoph Freitag
> 08/17/2011
> Markdown, Standardization, MMD, Metadata
>
> A Markdown document may contain metadata in a human readable form that the parser converts to a machine readable form of metadata automatically. A casual reader will understand the content directly and without distraction. Bowerbird will love this.

> From: "Fletcher T. Penney" <fletcher at fletcherpenney.net>
>
> You mention the English-centric nature of MMD metadata. This is certainly true, but no more so than HTML itself. One could certainly localize MMD to use any language you like (the beauty of open source), but to match your proposal in multiple languages would be quite complicated.
>
> For example, the following are valid MMD metadata dates, and easily used:
>
> date: 8/17/2011
> date: August 17th, 2011
> date: 2011-08-17
> date: 17/8/2011
> date: 14. Juni 2001
> date: 8 avril 2000
>
> Writing a parser that would correctly catch all of these dates in any language would be quite difficult, and prone to error.
>
> You mention tags as being easily recognized, but this is not always true:
>
> A sample document
>
> by John Smith, MD
> Director of Palliative Care, Division of General Medicine, Medical University of Somewhere
>
> While perhaps not the best example of potential problems, this would be incorrectly interpreted as tags, when the author probably implies that this represents his academic affiliation and would like it to be properly placed after his name on the title page, or on the slide deck if generating via beamer.
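To make the disagreement above concrete, here is a minimal Python sketch of the descriptor-free detection Christoph describes, run against Fletcher's counterexample. Everything in it is an illustrative assumption rather than part of MMD or any existing Markdown implementation: the names `classify` and `parse_header`, the tiny `BY_WORDS` keyword table (the French entry is the uncertain guess from the message above), and the deliberately narrow numeric date pattern.

```python
import re

# Hypothetical keyword table for the author marker. "de" is the guess made
# in the thread and may well be wrong.
BY_WORDS = {"by", "de"}

# Assumption: only purely numeric dates such as 08/17/2011 or 2011-08-17.
DATE_RE = re.compile(r"^\d{1,4}[/-]\d{1,2}[/-]\d{1,4}$")


def classify(line):
    """Guess the kind of metadata on one line, using the proposed rules."""
    words = line.split()
    if words and words[0].lower() in BY_WORDS:
        return "author"
    if DATE_RE.match(line.strip()):
        return "date"
    if "," in line:
        return "tags"  # "any list of words separated by commas"
    return "unknown"


def parse_header(text):
    """Chunk 1 (up to the first blank line) is the title, chunk 2 the metadata."""
    chunks = [c for c in re.split(r"\n\s*\n", text.strip()) if c.strip()]
    title = chunks[0].strip()
    meta = []
    if len(chunks) > 1:
        meta = [(classify(l), l.strip()) for l in chunks[1].splitlines()]
    return title, meta


sample = (
    "A sample document\n\n"
    "by John Smith, MD\n"
    "Director of Palliative Care, Division of General Medicine, "
    "Medical University of Somewhere\n"
)
title, meta = parse_header(sample)
for kind, line in meta:
    print(kind, "|", line)
```

With these rules the affiliation line is labelled "tags", which is the misreading Fletcher warns about. The same comma rule also labels "August 17th, 2011" as tags, and "14. Juni 2001" falls through as unknown, so the date objection shows up as well. A smarter date pattern or a larger keyword table would move the failure cases around but would not obviously remove them.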
It is true that certain metadata (author and date, to provide two examples) are used far more frequently than return addresses or URIs for graphical signatures. That said, it would be foolish to try to imagine every way in which metadata might be used, nor do I see much value in doing so.

If Markdown is to process metadata, the syntax should support arbitrary key-value pairs. For example:

    author: Jesper Nøhr
    date: 17 August 2011
    tags: lol, omg, lulz

Formatted differently:

    author: Jesper Nøhr
    date: 17 August 2011
    tag: lol
    tag: omg
    tag: lulz

If (again, if) Markdown is to be charged with parsing metadata, my opinion is that its role should be limited to returning a dictionary-like metadata object (in addition to the HTML string generated from the remainder of the document's contents).

For the first example:

    {"date": "17 August 2011", "tags": "lol, omg, lulz", "author": "Jesper Nøhr"}

For the second example:

    {"date": "17 August 2011", "author": "Jesper Nøhr", "tag": ["lol", "omg", "lulz"]}

In my opinion, Markdown should *not* be responsible for any of the following:

- splitting lists (note that "lol, omg, lulz" is a string in the first example)
- converting date strings into date objects
- any other manipulation of values

In other words, every value should be either a string, or an ordered, list-like object containing two or more strings (in the case of a repeated key). In addition to converting strings into appropriate objects, applications making use of Markdown's metadata feature would also be responsible for handling the fact that the value for a particular key may be a string for one document and a list of strings for another.

Fletcher touched on another question that should be discussed: should multiline values be accommodated and if so, how? I think it'd be great to support multiline strings.
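The "unfancy" behaviour described in this message can be sketched in a few lines of Python. This is only one reading of the proposal, not code from any Markdown implementation: the function name `parse_metadata` is hypothetical, and skipping lines that contain no colon is an assumption the message does not address.

```python
def parse_metadata(block):
    """Parse 'key: value' lines into a dict of strings or lists of strings."""
    meta = {}
    for line in block.splitlines():
        if not line.strip():
            continue
        # Split at the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # assumption: ignore lines without a colon
        key, value = key.strip(), value.strip()
        if key not in meta:
            meta[key] = value                # first occurrence: plain string
        elif isinstance(meta[key], list):
            meta[key].append(value)          # third and later occurrences
        else:
            meta[key] = [meta[key], value]   # second occurrence: make a list
    return meta


first = parse_metadata(
    "author: Jesper Nøhr\ndate: 17 August 2011\ntags: lol, omg, lulz"
)
second = parse_metadata(
    "author: Jesper Nøhr\ndate: 17 August 2011\ntag: lol\ntag: omg\ntag: lulz"
)
print(first["tags"])   # the comma-separated string, left unsplit
print(second["tag"])   # a list, because the key was repeated
```

Values are never split or converted, so the calling application has to cope with a key holding a string in one document and a list in another, as the message points out.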
I imagine the formatting looking something like this:

    author: Jesper Nøhr
    date: 17 August 2011
    lol: Irony keffiyeh pitchfork, mustache letterpress tofu cred twee
         scenester thundercats gluten-free yr chambray sartorial stumptown.
         Homo cosby sweater gentrify banh mi letterpress, vinyl beard hoodie
         terry richardson. Art party whatever banksy, readymade skateboard
         you probably haven't heard of them tumblr tattooed PBR letterpress
         photo booth carles vegan organic.
    omg: VHS carles photo booth food truck synth craft beer, wes anderson
         tofu banksy fanny pack stumptown.

This strikes me as being in the spirit of Markdown, as it's how one might structure this content if one were to produce it on a typewriter.

I'm interested to hear people's thoughts on multiline values and on the unfancy approach to metadata parsing that I (currently) favour.

David

On 17 August 2011 15:17, M Harris <mark at 2011.n0b.org> wrote:
> So, hi all. First time commenting on the list.
>
> I personally think having tags (whether of type "author:" or type "by")
> is useful for two reasons.
> One: It allows multiple tags to be entered. Two, it clears up the
> potential problem listed by Fletcher regarding tags.
>
> by Christoph Freitag
> Affiliation: XYZ
> by Fletcher T. Penney
> Affiliation: ABC
> tags: Markdown, Standardization, MMD, Metadata
> desc: An interesting discussion of how metadata could be included
> usefully in Markdown, whilst being readable etc.
>
> Regarding the localisation problem then, I thought that this was a
> solved problem when it came to computing? (At least in the cases of the
> major world languages.) A parser could have a table of equivalent words,
> so in English "by", en français "de" (pardon my French*).
>
> * By which I mean, I'm not sure that's correct, because I'm only a
> learner.
Bowerbird at aol.com
2011-Aug-26 18:01 UTC
Metadata syntax (was Universal syntax for Markdown)
all pooped out, are you?

oh well, the conversation this time lasted longer than it ever has before, in my memory, so maybe you're just working up your stamina for next time...

so let me finish off this round...

***

christoph said:
> A Markdown document may contain metadata
> in a human readable form that the parser converts
> to a machine readable form of metadata automatically.
> A casual reader will understand the content directly
> and without distraction. Bowerbird will love this.

indeed, christoph... because you've begun to describe the very system that i use, for the very reason i use it.

i'll describe it more fully below, but first other stuff....

***

i'm not sure i fully understand the mentality that says "implementations of markdown 2.0 can toss metadata". isn't the objective to dispense with implementations that act differently from each other?

ok, sure, i'm not naive; i realize that once a "standard" for "markup 2.0" is made, someone will come along and "tweak" it for their benefit, and then we are once again on the path toward fracture. but still, the goal for here and now is to unify all. right?

i feel the same way about command-line switches that turn on different "modes", like "quirks" and "extensions". isn't it our zeitgeist to gather everyone under one roof?

you'll just ignore (or never learn) features you don't need. so everyone gets what they want. and if it's not possible, if you want to use the system you have been using which is tweaked the way you want it, just continue to do that... it's not like those scripts will stop working or something.

but manufacturing a situation where all of the differences are _blessed_ (rather than removed) is counterproductive.

***

now on to "metadata"...

as for the color of the metadata bikeshed, we have one shade of paint -- "simple" -- so that's what it must be...

you've probably over-discussed it already, without even getting to the meat of the matter.
for _most_ purposes, the "metadata" is relatively unimportant, which you'll see quite clearly if you only begin to concentrate on specifics.

in a .pdf, for example, the "metadata" consists merely of title, author, subject, creator, and keywords. that's it...

in an .epub or a .mobi, you can specify a ton of metadata, if you want, but there's no standardized way of getting it, so you're basically whistling at a noisy construction site... (or doing pantomime in the dark, if you prefer that image.)

unless/until the "microformat" people get an upper-hand -- and lord help us if that kind of bureaucracy wins out -- "metadata" in .html continues to be a rather iffy thing, so at least for now, i think this issue needs little attention...

as for the matter of "tags" or "keywords", they're _lame_, to a large degree, because they can be gleaned from the text itself in most cases. and perhaps more importantly, such descriptive judgments need to be accumulated over the input from hundreds or thousands of "objective" users, rather than plugged in by a document's author or publisher, or the specter of gaming the system makes it all worthless...

i'm not telling people not to use tags, but i think it's obvious that any worthwhile recommendation system will ignore 'em. your metadata often tries to tell lies; google knows the truth.

there are a lot of consultants selling metadata as a cure-all. it's more like snake-oil.

***

as for my system...

as i said, my focus is on _books_, so for me, the concept of the "title-page" (plus the "cover") is the one that rules here.

the first "section" or "chapter" in a .zml file is the title-page, and _everything_ on that page is considered as "metadata".

remember that my first pass consists of separating "chunks" -- a sequence of non-blank lines bordered by blank lines -- so the top chunk (of one or more lines) is defined as the title.

the second chunk is considered to be the subtitle, and the third is considered to be the author.
the "author" chunk is required to start with the word "by", so if the second chunk starts with "by" and the third chunk does not, my routines assume that the book has no subtitle, so the second chunk is considered to be the "author" chunk.

subsequent chunks are required to be labeled appropriately, such as "edited by" or "illustrations by" or "plus additional contributions by" or "with preface by", and so on. you get the picture; it's clear.

other things which commonly appear on the title-page are the publisher's name and often the city where it is located, publication date, contact information for the author(s), etc.

none of this is particularly difficult to parse. nor does it sacrifice any power _or_ flexibility.

other info about the document is obtained in the course of analyzing it, like the number of chapters and illustrations, the size of the file, the number of references, and so forth.

you also have to acknowledge, at some point in time, that no matter what you do, you ain't gonna make a professional book-cataloger happy... and one of my close friends is just such an animal, working in the library system over at u.c.l.a. their cataloging workflow can summon hundreds of variables, depending on the unique characteristics of a particular book, and that's a complexity that we could never hope to replicate.

at the same time, though, we can get 80% of the utility with 2% of the effort (yes i did say 2%, and not the expected 20%), so that's the sweet spot we need for maximum cost-benefit.

as i said, there are a lot of consultants selling metadata as snake-oil, and the most common pitch is that metadata will give better discovery. that's hogwash. discovery will always be inferior until we develop good collaborative filtering, and that's necessary anyway, and fully independent of metadata.

***

there's something else that i generally put under "metadata" -- which other people do not -- which are the specifications used to create the output-formats.
these include things like straight-quotes vs. curly, indented paragraphs vs. block, and the pagesize (for .pdf), the font, fontsize, leading, and so on. this allows the end-user who receives the z.m.l. file to create outputs matching what the author intended them to look like.

in accordance with the all-text-in-one-file mandate of z.m.l., these specifications should be included in the text-file itself, and can fall in the "metadata" section, the "colophon" section, or in their own "output specifications" section, as you desire...

and, of course, end-users can also change the specifications, so as to create output that is formatted to their own desires...

-bowerbird
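The title-page rules described in this message (chunks separated by blank lines, with title, optional subtitle, then a "by" chunk) are simple enough to sketch. The Python below is one reading of that description, not the actual z.m.l. routines, which are not shown in the thread. The function names are hypothetical, the input is assumed to be the title-page text only, and later labeled chunks ("edited by" and so on) are returned unparsed.

```python
import re

# Matches a leading "by" as a whole word, plus any whitespace after it.
BY_RE = re.compile(r"by\b\s*", re.IGNORECASE)


def split_chunks(text):
    """Return runs of non-blank lines that are bordered by blank lines."""
    return [c.strip() for c in re.split(r"\n\s*\n", text.strip()) if c.strip()]


def starts_with_by(chunk):
    return chunk is not None and BY_RE.match(chunk) is not None


def parse_title_page(text):
    chunks = split_chunks(text)
    page = {"title": None, "subtitle": None, "author": None, "other": []}
    if not chunks:
        return page
    page["title"] = chunks[0]
    second = chunks[1] if len(chunks) > 1 else None
    third = chunks[2] if len(chunks) > 2 else None
    if starts_with_by(second) and not starts_with_by(third):
        # No subtitle: the second chunk is the author chunk.
        page["author"] = BY_RE.sub("", second, count=1)
        page["other"] = chunks[2:]
    else:
        page["subtitle"] = second
        if starts_with_by(third):
            page["author"] = BY_RE.sub("", third, count=1)
            page["other"] = chunks[3:]
        else:
            page["other"] = chunks[2:]
    return page


with_subtitle = parse_title_page(
    "A Sample Book\n\nA Subtitle\n\nby Jane Doe\n\nedited by John Roe\n"
)
without_subtitle = parse_title_page(
    "A Sample Book\n\nby Jane Doe\n\nedited by John Roe\n"
)
print(with_subtitle)
print(without_subtitle)
```

Both sample inputs are invented for illustration. In each case the "edited by" chunk ends up in `other`, where a fuller version would match it against the list of labels mentioned in the message.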