thr3ads.net - Markdown Discuss - Metadata syntax (was Universal syntax for Markdown) [Aug 2011]

M Harris

2011-Aug-17 22:17 UTC

Metadata syntax (was Universal syntax for Markdown)

So, hi all. First time commenting on the list. 

I personally think having tags (whether of type "author:" or type "by")
is useful for two reasons.
One: it allows multiple tags to be entered. Two: it clears up the
potential problem listed by Fletcher regarding tags.

by Christoph Freitag
Affiliation: XYZ
by Fletcher T. Penney
Affiliation: ABC
tags: Markdown, Standardization, MMD, Metadata
desc: An interesting discussion of how metadata could be included
usefully in Markdown, whilst being readable etc.


Regarding the localisation problem then, I thought that this was a
solved problem when it came to computing? (At least in the cases of the
major world languages.) A parser could have a table of equivalent words,
so in English "by", en français "de" (pardon my French*).

* By which I mean, I'm not sure that's correct, because I'm only a
learner. 
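
The "table of equivalent words" idea can be sketched in a few lines of Python. This is only an illustration: the table name, the function name and the language codes are assumptions, and the French entry is the tentative "de" from the post above, not a checked translation.

```python
# Hypothetical lookup table: language code -> byline keyword.
# "de" for French is the unverified guess from the post, kept as-is.
BYLINE_KEYWORDS = {
    "en": "by",
    "fr": "de",
}


def extract_author(line, lang="en"):
    """Return the author name if the line starts with the byline keyword, else None."""
    prefix = BYLINE_KEYWORDS[lang] + " "
    if line.lower().startswith(prefix):
        return line[len(prefix):].strip()
    return None


print(extract_author("by Christoph Freitag"))        # Christoph Freitag
print(extract_author("de Christoph Freitag", "fr"))  # Christoph Freitag
print(extract_author("Affiliation: XYZ"))            # None
```

Only the keyword table changes per language; the matching logic stays the same.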

> From: Christoph Freitag <mail at christoph-freitag.de>
>
> Fletcher, sorry, but personally -- despite loving MMD (and even having used
> MMD CMS for a diary) -- I have never liked the way MMD handles metadata. Partly
> this is because, not being a native English speaker, I dislike English meta
> descriptors. A localization could resolve this -- but I still think it looks
> ugly. However, do you actually need descriptors at all? I doubt it:
>
> *   The title could be anything "at the start" of the document.
>     Blosxom is a good example. Anything up to the first blank line is the title.
> *   After that, anything between the first blank line and the second blank
>     line would be treated as additional metadata.
> *   Instead of the "Author:" descriptor, explicitly stated, it
>     should suffice to write "by". What follows is the name of the author.
>     (Localization would be easier as only this "keyword" would have to be
>     known to the parser in a number of languages.)
> *   Dates would be self-explanatory, to a clever parser.
> *   Any list of words separated by commas on a single line would be treated
>     as tags.
> *   Any more fanciful meta descriptors might be given explicitly just as in
>     MMD before. This could be left to non-standard, personalized variants of
>     Markdown.
>
> Thus the following would be a valid document:
>
> ---
> Test Document for Automatic Metadata Detection
>
> by Christoph Freitag  
> 08/17/2011  
> Markdown, Standardization, MMD, Metadata
>
> A Markdown document may contain metadata in a human readable form that the
> parser converts to a machine readable form of metadata automatically. A casual
> reader will understand the content directly and without distraction. Bowerbird
> will love this.
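
A rough Python sketch of the descriptor-free rules in Christoph's list, to make the heuristics concrete. The function name and the returned keys are assumptions made for illustration; date detection is deliberately left out, so a date line simply lands under "other".

```python
import re


def parse_implicit_metadata(text):
    """Split a document into implicit metadata and body.

    First block = title, second block = metadata lines, rest = body.
    """
    blocks = re.split(r"\n\s*\n", text.strip(), maxsplit=2)
    title = " ".join(blocks[0].split())
    meta_lines = blocks[1].splitlines() if len(blocks) > 1 else []
    body = blocks[2] if len(blocks) > 2 else ""

    meta = {"title": title, "author": None, "tags": [], "other": []}
    for line in meta_lines:
        line = line.strip()
        if line.lower().startswith("by "):
            meta["author"] = line[3:].strip()      # the "by" keyword rule
        elif "," in line:
            meta["tags"] = [t.strip() for t in line.split(",")]  # comma rule
        elif line:
            meta["other"].append(line)             # e.g. an unparsed date
    return meta, body


doc = """Test Document for Automatic Metadata Detection

by Christoph Freitag
08/17/2011
Markdown, Standardization, MMD, Metadata

A Markdown document may contain metadata in a human readable form.
"""

meta, body = parse_implicit_metadata(doc)
print(meta["author"])  # Christoph Freitag
print(meta["tags"])    # ['Markdown', 'Standardization', 'MMD', 'Metadata']
print(meta["other"])   # ['08/17/2011']
```

The comma rule is the weakest part: any metadata line containing a comma is taken as a tag list.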

> From: "Fletcher T. Penney" <fletcher at fletcherpenney.net>
>
> You mention the English-centric nature of MMD metadata.  This is certainly
> true, but no more so than HTML itself.  One could certainly localize MMD to use
> any language you like (the beauty of open source), but to match your proposal in
> multiple languages would be quite complicated.
>
> For example, the following are valid MMD metadata dates, and easily used:
>
> 	date:	8/17/2011
> 	date:	August 17th, 2011
> 	date:	2011-08-17
> 	date:	17/8/2011
> 	date:	14. Juni 2001
> 	date:	8 avril 2000
>
> Writing a parser that would correctly catch all of these dates in any
> language would be quite difficult, and prone to error.
>
> You mention tags as being easily recognized, but this is not always true:
>
> 	A sample document
>
> 	by John Smith, MD
> 	Director of Palliative Care, Division of General Medicine, Medical University of Somewhere
>
> While perhaps not the best example of potential problems, this would be
> incorrectly interpreted as tags, when the author probably intends it to
> represent his academic affiliation and would like it to be properly placed
> after his name on the title page, or on the slide deck if generating via beamer.
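
Fletcher's point about dates can be illustrated with a short Python sketch that tries several numeric formats and reports every reading that fits. The list of candidate formats is an assumption chosen for the example; a real parser would need many more, plus localized month names such as "Juni" and "avril".

```python
from datetime import datetime

# Assumed candidate formats, for illustration only.
CANDIDATE_FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d"]


def interpretations(value):
    """Return every distinct calendar date the string could denote."""
    found = set()
    for fmt in CANDIDATE_FORMATS:
        try:
            found.add(datetime.strptime(value, fmt).date())
        except ValueError:
            pass  # this format does not fit the string
    return sorted(found)


print(len(interpretations("8/17/2011")))  # 1 (only month/day/year fits)
print(len(interpretations("3/4/2011")))   # 2 (4 March or 3 April)
```

A string like "3/4/2011" parses cleanly under two formats, so the parser cannot know which date was meant without an explicit convention.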

David Chambers

2011-Aug-18 01:17 UTC

Metadata syntax (was Universal syntax for Markdown)

It is true that certain metadata (author and date, to provide two examples)
are used far more frequently than return addresses or URIs for graphical
signatures. That said, it would be foolish to try to imagine every way in
which metadata might be used, nor do I see much value in doing so.

If Markdown is to process metadata, the syntax should support arbitrary
key-value pairs.

For example:

    author: Jesper Nøhr
    date: 17 August 2011
    tags: lol, omg, lulz

Formatted differently:

    author: Jesper Nøhr
    date: 17 August 2011
    tag: lol
    tag: omg
    tag: lulz

If -- again, if -- Markdown is to be charged with parsing metadata, my opinion
is that its role should be limited to returning a dictionary-like metadata
object (in addition to the HTML string generated from the remainder of the
document's contents).

For the first example:

    {"date": "17 August 2011", "tags": "lol, omg, lulz", "author": "Jesper Nøhr"}

For the second example:

    {"date": "17 August 2011", "author": "Jesper Nøhr", "tag": ["lol", "omg", "lulz"]}

In my opinion, Markdown should *not* be responsible for any of the
following:

   - splitting lists (note that "lol, omg, lulz" is a string in the first
     example)
   - converting date strings into date objects
   - any other manipulation of values

In other words, every value should be either a string, or an ordered,
list-like object containing two or more strings (in the case of a repeated
key).
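
A minimal Python sketch of that contract, assuming one "key: value" pair per line. The function name is hypothetical; values are left as raw strings, and a repeated key is promoted to a list in order of appearance.

```python
def parse_metadata(lines):
    """Collect "key: value" lines into a dict; repeated keys become lists."""
    meta = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key-value line; skipped in this sketch
        key, value = key.strip(), value.strip()
        if key not in meta:
            meta[key] = value                 # first occurrence: plain string
        elif isinstance(meta[key], list):
            meta[key].append(value)           # third and later occurrences
        else:
            meta[key] = [meta[key], value]    # second occurrence: make a list
    return meta


print(parse_metadata(["date: 17 August 2011", "tags: lol, omg, lulz"]))
# {'date': '17 August 2011', 'tags': 'lol, omg, lulz'}
print(parse_metadata(["date: 17 August 2011", "tag: lol", "tag: omg", "tag: lulz"]))
# {'date': '17 August 2011', 'tag': ['lol', 'omg', 'lulz']}
```

Nothing is split or converted: "lol, omg, lulz" stays a single string, and the date stays text.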

In addition to converting strings into appropriate objects, applications
making use of Markdown's metadata feature would also be responsible for
handling the fact that the value for a particular key may be a string for
one document and a list of strings for another.
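
On the application side, that string-or-list handling might look like the following Python sketch. Both helper names are assumptions, as is the choice to accept either a "tags" or a "tag" key.

```python
def as_list(value):
    """Normalize a metadata value (string or list of strings) to a list."""
    return [value] if isinstance(value, str) else list(value)


def split_tags(meta):
    """Application-level tag handling: accept "tags" or "tag", split on commas."""
    tags = []
    for key in ("tags", "tag"):
        for raw in as_list(meta.get(key, [])):
            tags.extend(t.strip() for t in raw.split(",") if t.strip())
    return tags


print(split_tags({"tags": "lol, omg, lulz"}))        # ['lol', 'omg', 'lulz']
print(split_tags({"tag": ["lol", "omg", "lulz"]}))   # ['lol', 'omg', 'lulz']
```

Both document styles end up with the same tag list, while the parser itself stays free of list-splitting logic.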

Fletcher touched on another question that should be discussed: should
multiline values be accommodated and if so, how?

I think it'd be great to support multiline strings. I imagine the formatting
looking something like this:

author:
  Jesper Nøhr
date:
  17 August 2011
lol:
  Irony keffiyeh pitchfork, mustache letterpress tofu cred twee scenester
  thundercats gluten-free yr chambray sartorial stumptown. Homo cosby sweater
  gentrify banh mi letterpress, vinyl beard hoodie terry richardson. Art party
  whatever banksy, readymade skateboard you probably haven't heard of them
  tumblr tattooed PBR letterpress photo booth carles vegan organic.
omg:
  VHS carles photo booth food truck synth craft beer, wes anderson tofu banksy
  fanny pack stumptown.

This strikes me as being in the spirit of Markdown, as it's how one might
structure this content if one were to produce it on a typewriter.
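
One possible reading of this layout, sketched in Python: an unindented line ending in a colon opens a key, and the indented lines that follow are its value. The function name and the choice to join continuation lines with single spaces are assumptions; repeated keys are not handled here.

```python
def parse_multiline(text):
    """Parse "key:" headers followed by indented continuation lines."""
    collected = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue                          # blank lines are ignored
        if line[0].isspace():
            if key is not None:
                collected[key].append(line.strip())
        elif line.rstrip().endswith(":"):
            key = line.rstrip()[:-1].strip()  # drop the trailing colon
            collected[key] = []
        else:
            key = None                        # stray unindented line: no key
    return {k: " ".join(v) for k, v in collected.items()}


sample = """date:
  17 August 2011
omg:
  VHS carles photo booth food truck synth craft beer,
  fanny pack stumptown.
"""

print(parse_multiline(sample)["date"])
# 17 August 2011
print(parse_multiline(sample)["omg"])
# VHS carles photo booth food truck synth craft beer, fanny pack stumptown.
```

The values are still plain strings, so this stays consistent with the no-manipulation approach described earlier in the message.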

I'm interested to hear people's thoughts on multiline values and on the
unfancy approach to metadata parsing that I (currently) favour.

David


Bowerbird at aol.com

2011-Aug-26 18:01 UTC

Metadata syntax (was Universal syntax for Markdown)

all pooped out, are you?

oh well, the conversation this time lasted longer
than it ever has before, in my memory, so maybe
you're just working up your stamina for next time...

so let me finish off this round...

***

christoph said:

> A Markdown document may contain metadata
> in a human readable form that the parser converts
> to a machine readable form of metadata automatically.
> A casual reader will understand the content directly
> and without distraction. Bowerbird will love this.

indeed, christoph... because you've begun to describe
the very system that i use, for the very reason i use it.

i'll describe it more fully below, but first other stuff....

***

i'm not sure i fully understand the mentality that says
"implementations of markdown 2.0 can toss metadata".

isn't the objective to dispense with implementations that
act differently from each other? ok, sure, i'm not naive;
i realize that once a "standard" for "markdown 2.0" is made,
someone will come along and "tweak" it for their benefit,
and then we are once again on the path toward fracture.
but still, the goal for here and now is to unify all. right?

i feel the same way about command-line switches that
turn on different "modes", like "quirks" and "extensions".
isn't it our zeitgeist to gather everyone under one roof?
you'll just ignore (or never learn) features you don't need.

so everyone gets what they want. and if it's not possible,
if you want to use the system you have been using which
is tweaked the way you want it, just continue to do that...
it's not like those scripts will stop working or something.

but manufacturing a situation where all of the differences
are _blessed_ (rather than removed) is counterproductive.

***

now on to "metadata"...

as for the color of the metadata bikeshed, we have one
shade of paint -- "simple" -- so that's what it must be...

you've probably over-discussed it already, without even
getting to the meat of the matter. for _most_ purposes,
the "metadata" is relatively unimportant, which you'll see
quite clearly if you only begin to concentrate on specifics.

in a .pdf, for example, the "metadata" consists merely of
title, author, subject, creator, and keywords. that's it...

in an .epub or a .mobi, you can specify a ton of metadata,
if you want, but there's no standardized way of getting it,
so you're basically whistling at a noisy construction site...
(or doing pantomime in the dark, if you prefer that image.)

unless/until the "microformat" people get an upper-hand
-- and lord help us if that kind of bureaucracy wins out --
"metadata" in .html continues to be a rather iffy thing, so
at least for now, i think this issue needs little attention...

as for the matter of "tags" or "keywords", they're _lame_,
to a large degree, because they can be gleaned from the
text itself in most cases. and perhaps more importantly,
such descriptive judgments need to be accumulated over
the input from hundreds or thousands of "objective" users,
rather than plugged in by a document's author or publisher,
or the specter of gaming the system makes it all worthless...
i'm not telling people not to use tags, but i think it's obvious
that any worthwhile recommendation system will ignore 'em.
your metadata often tries to tell lies; google knows the truth.

there are a lot of consultants selling metadata as a cure-all.
it's more like snake-oil.

***

as for my system...

as i said, my focus is on _books_, so for me, the concept of
the "title-page" (plus the "cover") is the one that rules here.

the first "section" or "chapter" in a .zml file is the title-page,
and _everything_ on that page is considered as "metadata".

remember that my first pass consists of separating "chunks"
-- a sequence of non-blank lines bordered by blank lines --
so the top chunk (of one or more lines) is defined as the title.

the second chunk is considered to be the subtitle, and the
third is considered to be the author. the "author" chunk is
required to start with the word "by", so if the second chunk
starts with "by" and the third chunk does not, my routines
assume that the book has no subtitle, so the second chunk
is considered to be the "author" chunk. subsequent chunks
are required to be labeled appropriately, such as "edited by"
or "illustrations by" or "plus additional contributions by" or
"with preface by", and so on. you get the picture; it's clear.

other things which commonly appear on the title-page are
the publisher's name and often the city where it is located,
publication date, contact information for the author(s), etc.

none of this is particularly difficult to parse.
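
as a rough illustration, the chunk rules described above could be sketched in python like this. it is not the actual z.m.l. code; the function names, the returned keys, and the sample book are all made up for the example.

```python
import re


def split_chunks(text):
    """Split text into chunks: runs of non-blank lines bordered by blank lines."""
    return [" ".join(c.split()) for c in re.split(r"\n\s*\n", text.strip()) if c.strip()]


def starts_with_by(chunk):
    return chunk.lower().startswith("by ")


def parse_title_page(text):
    """Apply the title / subtitle / author rules to the title-page chunks."""
    chunks = split_chunks(text)
    page = {"title": None, "subtitle": None, "author": None, "other": []}
    if not chunks:
        return page
    page["title"] = chunks[0]
    if len(chunks) >= 3 and starts_with_by(chunks[2]):
        # third chunk is the author, so the second one is the subtitle
        page["subtitle"] = chunks[1]
        page["author"] = chunks[2][3:].strip()
        page["other"] = chunks[3:]
    elif len(chunks) >= 2 and starts_with_by(chunks[1]):
        # second chunk starts with "by" and the third does not: no subtitle
        page["author"] = chunks[1][3:].strip()
        page["other"] = chunks[2:]
    else:
        page["subtitle"] = chunks[1] if len(chunks) > 1 else None
        page["other"] = chunks[2:]
    return page


# hypothetical title-page, invented for the example
page = parse_title_page("My Book\n\nby Jane Doe\n\nedited by John Roe\n")
print(page["subtitle"])  # None
print(page["author"])    # Jane Doe
print(page["other"])     # ['edited by John Roe']
```

the labeled chunks ("edited by" and so on) are just collected under "other" here; splitting them by label would be the next step.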

nor does it sacrifice any power _or_ flexibility.

other info about the document is obtained in the course of
analyzing it, like the number of chapters and illustrations,
the size of the file, the number of references, and so forth.

you also have to acknowledge, at some point in time, that
no matter what you do, you ain't gonna make a professional
book-cataloger happy... and one of my close friends is just
such an animal, working in the library system over at u.c.l.a.

their cataloging workflow can summon hundreds of variables,
depending on the unique characteristics of a particular book,
and that's a complexity that we could never hope to replicate.

at the same time, though, we can get 80% of the utility with
2% of the effort (yes i did say 2%, and not the expected 20%),
so that's the sweet spot we need for maximum cost-benefit.

as i said, there are a lot of consultants selling metadata as
snake-oil, and the most common pitch is that metadata will
give better discovery. that's hogwash. discovery will always
be inferior until we develop good collaborative filtering, and
that's necessary anyway, and fully independent of metadata.

***

there's something else that i generally put under "metadata"
-- which other people do not -- which are the specifications
used to create the output-formats. these include things like
straight-quotes vs. curly, indented paragraphs vs. block, and
the pagesize (for .pdf), the font, fontsize, leading, and so on.
this allows the end-user who receives the z.m.l. file to create
outputs matching what the author intended them to look like.
in accordance with the all-text-in-one-file mandate of z.m.l.,
these specifications should be included in the text-file itself,
and can fall in the "metadata" section, the "colophon" section,
or in their own "output specifications" section, as you desire...
and, of course, end-users can also change the specifications,
so as to create output that is formatted to their own desires...

-bowerbird
