thr3ads.net - llvm dev - [llvm-dev] [GSoC 2017] Clang-based diff tool project [Mar 2017]

If this information is useful, please help other people find it:
Share via:

Greg Clayton via llvm-dev

2017-Mar-23 17:41 UTC

[llvm-dev] [GSoC 2017] Clang-based diff tool project

My original idea was to write a semantic diff tool that just does some simple
things up front:

create an MD5 from all top level blocks of the code. Start by just finding
matching blocks of code ('{' and '}', '(' and
')') and remember the source locations for these and their MD5 values.
Run a normal diff on the code and see what blocks the diffs fall into. Then try
to figure out where things moved by possibly delving deeper into each block that
matched something from the diff. Also if any blocks moved to a completely
different location, try and figure that out by matching the MD5 of any blocks.

For example if you had:

int main(int argc, const char **argv) {
  if (argc > 2) {
  }
  switch (argc) {
  }
} 

You would first make MD5s for the '(' and ')' in the
"main" line and for the '{' at the end of the main line, and
ending at the end of the code. Now the code looks like:

int main(int argc, const char **argv) {
  switch (argc) {
  }
  if (argc > 2) {
  }
} 

The diff would show that the "if" is gone and a new "if" is
found after the switch in the new version of the file. We would notice that the
diff appears inside the block from the first:

{
  if (argc > 2) {
  }
  switch (argc) {
  }
} 

And in the block from the second:

{
  switch (argc) {
  }
  if (argc > 2) {
  }
}

So we would then compute the MD5 for the blocks inside each of these blocks and
try to match things up. The MD5 would of course remove spaces that aren't in
strings and only compute the MD5 from the characters that make sense. This
simple type of approach could almost work on any language without the need to be
able to correctly compile each file with all the right options.

Greg
> On Mar 20, 2017, at 4:47 PM, Mehdi Amini <mehdi.amini at apple.com>
wrote:
> 
> (+CC: Greg Clayton who gave me this idea in the first place)
> 
>> On Mar 20, 2017, at 3:20 PM, Johannes Altmanninger via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>> 
>> Hello,
>> 
>> I am currently studying Computer Science at TU Eindhoven. I am doing a
>> course that involves programming assignments on parts of LLVM such as
>> lowering, scheduling and optimization. For this year's Google
Summer of
>> Code I plan to submit a proposal to implement a clang-based diff tool
>> [1].
> 
> Great! I look forward to see this :)
> 
>> 
>> I think it really pays off to have decent developer tools available, as
>> they can save tons of time. Clang tooling has obviously been very
>> successful.  I think it would be a good idea to develop a diff tool
that
>> considers the structure of the code, as opposed to just the lines.
Plain
>> old diff only thinks in terms of "additions" and
"deletions", although
>> it would be more natural to also consider "updates" and
"moves".
>> 
>> So a structural diff would work solely on the AST, hence formatting
>> changes are ignored. It would allow to highlight the exact location of
a
>> change, and not a whole line. Furthermore, it would allow to compare
>> pieces of code with the same structure (think subclasses).
>> 
>> Besides some papers with clever AST-matching algorithms, a quick web
>> search yielded [2], which is a proof-of-concept implementation of a
>> structural comparison algorithm.  I think it demonstrates rather nicely
>> what could be done: movement of chunks of code can be easily traced.
>> 
>> Anyway, one could make all kinds of nice visualizations using a AST
diff
>> tool, however, I think the initial focus should probably be on creating
>> one with a similar output to traditional diff, with the difference that
>> updates and moves are displayed in a easily readable way, which already
>> could improve developer productivity and happiness.
>> 
>> As of now I have one question: The output of the tool is meant just for
>> humans to read (and not for actual patching), right?
> 
> Yes. But we developed software as libraries usually. Practically I expect
the main part of the work to write some piece of API that generate an
“in-memory” representation of the diff.
> 
> A tool that is generating a textual-human readable output is likely the
first client of this API and is likely critical to be able to functionally test
it in the early development. In the future I hope it’d enable other graphical
diff client to plug-in, or git-merge resolution tools as well.
> 
> Best,
> 
> — 
> Mehdi
>

Johannes Altmanninger via llvm-dev

2017-Mar-23 19:28 UTC

head link

[llvm-dev] [GSoC 2017] Clang-based diff tool project

I am currently considering the gumtree algorithm which is described in
[1]. It is able to detect code moves, updates.  There is also a java
implementation of gumtree.  It supports C++ via srcML, however this
seems to be quite incomplete and buggy. The algorithm consists of a top
down and a bottom up traversal of the AST that result in a matching of
corresponding nodes. It contains some heuristics similarity measurement.

Greg Clayton <clayborg at gmail.com> writes:
> My original idea was to write a semantic diff tool that just does some
> simple things up front:
>
> create an MD5 from all top level blocks of the code. Start by just
> finding matching blocks of code ('{' and '}', '('
and ')') and
> remember the source locations for these and their MD5 values. Run a
> normal diff on the code and see what blocks the diffs fall into. Then
> try to figure out where things moved by possibly delving deeper into
> each block that matched something from the diff. Also if any blocks
> moved to a completely different location, try and figure that out by
> matching the MD5 of any blocks.
Ok, I think I see how this works in principle. This is possibly more
efficient than gumtree, which has O(n^2) worst case complexity where n
is the number of nodes in the AST. But I am not sure..
>
> For example if you had:
>
> int main(int argc, const char **argv) {
>   if (argc > 2) {
>   }
>   switch (argc) {
>   }
> } 
>
> You would first make MD5s for the '(' and ')' in the
"main" line and
> for the '{' at the end of the main line, and ending at the end of
the
> code. Now the code looks like:
>
> int main(int argc, const char **argv) {
>   switch (argc) {
>   }
>   if (argc > 2) {
>   }
> } 
>
> The diff would show that the "if" is gone and a new
"if" is found
> after the switch in the new version of the file. We would notice that
> the diff appears inside the block from the first:
>
> {
>   if (argc > 2) {
>   }
>   switch (argc) {
>   }
> } 
>
> And in the block from the second:
>
> {
>   switch (argc) {
>   }
>   if (argc > 2) {
>   }
> }
>
> So we would then compute the MD5 for the blocks inside each of these
> blocks and try to match things up. The MD5 would of course remove
> spaces that aren't in strings and only compute the MD5 from the
> characters that make sense. This simple type of approach could almost
> work on any language without the need to be able to correctly compile
> each file with all the right options.
Yeah, this is another advantage, one does not need to basically have
a compiler in order to create the diff. So this would just consider
blocks enclosed by parentheses or braces. It is nice that it works for
any language that uses those for blocks.

So this would be a more lightweight and smart tool, while the gumtree is
more powerful and extensible because it operates only on the AST. If the
goal is to produce a diffing API that is able to exploit the semantics
of the language and enables tools for visualization and merging then I
think gumtree is the better choice. On the other hand, if we just want a
Unix tool that is a smart and lean improvement to diff then your idea
might be more appropriate in addition to being more flexible.

What do others think about this?


[1] https://hal.archives-ouvertes.fr/hal-01054552/document
>
> Greg
>
>> On Mar 20, 2017, at 4:47 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>> 
>> (+CC: Greg Clayton who gave me this idea in the first place)
>> 
>>> On Mar 20, 2017, at 3:20 PM, Johannes Altmanninger via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>>> 
>>> Hello,
>>> 
>>> I am currently studying Computer Science at TU Eindhoven. I am
doing a
>>> course that involves programming assignments on parts of LLVM such
as
>>> lowering, scheduling and optimization. For this year's Google
Summer of
>>> Code I plan to submit a proposal to implement a clang-based diff
tool
>>> [1].
>> 
>> Great! I look forward to see this :)
>> 
>>> 
>>> I think it really pays off to have decent developer tools
available, as
>>> they can save tons of time. Clang tooling has obviously been very
>>> successful.  I think it would be a good idea to develop a diff tool
that
>>> considers the structure of the code, as opposed to just the lines.
Plain
>>> old diff only thinks in terms of "additions" and
"deletions", although
>>> it would be more natural to also consider "updates" and
"moves".
>>> 
>>> So a structural diff would work solely on the AST, hence formatting
>>> changes are ignored. It would allow to highlight the exact location
of a
>>> change, and not a whole line. Furthermore, it would allow to
compare
>>> pieces of code with the same structure (think subclasses).
>>> 
>>> Besides some papers with clever AST-matching algorithms, a quick
web
>>> search yielded [2], which is a proof-of-concept implementation of a
>>> structural comparison algorithm.  I think it demonstrates rather
nicely
>>> what could be done: movement of chunks of code can be easily
traced.
>>> 
>>> Anyway, one could make all kinds of nice visualizations using a AST
diff
>>> tool, however, I think the initial focus should probably be on
creating
>>> one with a similar output to traditional diff, with the difference
that
>>> updates and moves are displayed in a easily readable way, which
already
>>> could improve developer productivity and happiness.
>>> 
>>> As of now I have one question: The output of the tool is meant just
for
>>> humans to read (and not for actual patching), right?
>> 
>> Yes. But we developed software as libraries usually. Practically I
expect the main part of the work to write some piece of API that generate an
“in-memory” representation of the diff.
>> 
>> A tool that is generating a textual-human readable output is likely the
first client of this API and is likely critical to be able to functionally test
it in the early development. In the future I hope it’d enable other graphical
diff client to plug-in, or git-merge resolution tools as well.
>> 
>> Best,
>> 
>> — 
>> Mehdi
>>

Mehdi Amini via llvm-dev

2017-Mar-23 19:52 UTC

head link

[llvm-dev] [GSoC 2017] Clang-based diff tool project

> On Mar 23, 2017, at 12:28 PM, Johannes Altmanninger <aclopte at
gmail.com> wrote:
> 
> I am currently considering the gumtree algorithm which is described in
> [1]. It is able to detect code moves, updates.  There is also a java
> implementation of gumtree.  It supports C++ via srcML, however this
> seems to be quite incomplete and buggy. The algorithm consists of a top
> down and a bottom up traversal of the AST that result in a matching of
> corresponding nodes. It contains some heuristics similarity measurement.
> 
> Greg Clayton <clayborg at gmail.com <mailto:clayborg at
gmail.com>> writes:
> 
>> My original idea was to write a semantic diff tool that just does some
>> simple things up front:
>> 
>> create an MD5 from all top level blocks of the code. Start by just
>> finding matching blocks of code ('{' and '}',
'(' and ')') and
>> remember the source locations for these and their MD5 values. Run a
>> normal diff on the code and see what blocks the diffs fall into. Then
>> try to figure out where things moved by possibly delving deeper into
>> each block that matched something from the diff. Also if any blocks
>> moved to a completely different location, try and figure that out by
>> matching the MD5 of any blocks.
> 
> Ok, I think I see how this works in principle. This is possibly more
> efficient than gumtree, which has O(n^2) worst case complexity where n
> is the number of nodes in the AST. But I am not sure..
> 
>> 
>> For example if you had:
>> 
>> int main(int argc, const char **argv) {
>>  if (argc > 2) {
>>  }
>>  switch (argc) {
>>  }
>> } 
>> 
>> You would first make MD5s for the '(' and ')' in the
"main" line and
>> for the '{' at the end of the main line, and ending at the end
of the
>> code. Now the code looks like:
>> 
>> int main(int argc, const char **argv) {
>>  switch (argc) {
>>  }
>>  if (argc > 2) {
>>  }
>> } 
>> 
>> The diff would show that the "if" is gone and a new
"if" is found
>> after the switch in the new version of the file. We would notice that
>> the diff appears inside the block from the first:
>> 
>> {
>>  if (argc > 2) {
>>  }
>>  switch (argc) {
>>  }
>> } 
>> 
>> And in the block from the second:
>> 
>> {
>>  switch (argc) {
>>  }
>>  if (argc > 2) {
>>  }
>> }
>> 
>> So we would then compute the MD5 for the blocks inside each of these
>> blocks and try to match things up. The MD5 would of course remove
>> spaces that aren't in strings and only compute the MD5 from the
>> characters that make sense. This simple type of approach could almost
>> work on any language without the need to be able to correctly compile
>> each file with all the right options.
> 
> Yeah, this is another advantage, one does not need to basically have
> a compiler in order to create the diff. So this would just consider
> blocks enclosed by parentheses or braces. It is nice that it works for
> any language that uses those for blocks.
> 
> So this would be a more lightweight and smart tool, while the gumtree is
> more powerful and extensible because it operates only on the AST. If the
> goal is to produce a diffing API that is able to exploit the semantics
> of the language and enables tools for visualization and merging then I
> think gumtree is the better choice. On the other hand, if we just want a
> Unix tool that is a smart and lean improvement to diff then your idea
> might be more appropriate in addition to being more flexible.
> 
> What do others think about this?

That seems like a good summary to me.

What I find cool about an AST based tool, is the ability to extract the diff as
semantic informations, for example “function (or variable) was renamed from A to
B”.
I feel that you can get a totally different level of information with such a
tool.

— 
Mehdi

> 
> 
> [1] https://hal.archives-ouvertes.fr/hal-01054552/document
<https://hal.archives-ouvertes.fr/hal-01054552/document>
> 
>> 
>> Greg
>> 
>>> On Mar 20, 2017, at 4:47 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>>> 
>>> (+CC: Greg Clayton who gave me this idea in the first place)
>>> 
>>>> On Mar 20, 2017, at 3:20 PM, Johannes Altmanninger via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I am currently studying Computer Science at TU Eindhoven. I am
doing a
>>>> course that involves programming assignments on parts of LLVM
such as
>>>> lowering, scheduling and optimization. For this year's
Google Summer of
>>>> Code I plan to submit a proposal to implement a clang-based
diff tool
>>>> [1].
>>> 
>>> Great! I look forward to see this :)
>>> 
>>>> 
>>>> I think it really pays off to have decent developer tools
available, as
>>>> they can save tons of time. Clang tooling has obviously been
very
>>>> successful.  I think it would be a good idea to develop a diff
tool that
>>>> considers the structure of the code, as opposed to just the
lines. Plain
>>>> old diff only thinks in terms of "additions" and
"deletions", although
>>>> it would be more natural to also consider "updates"
and "moves".
>>>> 
>>>> So a structural diff would work solely on the AST, hence
formatting
>>>> changes are ignored. It would allow to highlight the exact
location of a
>>>> change, and not a whole line. Furthermore, it would allow to
compare
>>>> pieces of code with the same structure (think subclasses).
>>>> 
>>>> Besides some papers with clever AST-matching algorithms, a
quick web
>>>> search yielded [2], which is a proof-of-concept implementation
of a
>>>> structural comparison algorithm.  I think it demonstrates
rather nicely
>>>> what could be done: movement of chunks of code can be easily
traced.
>>>> 
>>>> Anyway, one could make all kinds of nice visualizations using a
AST diff
>>>> tool, however, I think the initial focus should probably be on
creating
>>>> one with a similar output to traditional diff, with the
difference that
>>>> updates and moves are displayed in a easily readable way, which
already
>>>> could improve developer productivity and happiness.
>>>> 
>>>> As of now I have one question: The output of the tool is meant
just for
>>>> humans to read (and not for actual patching), right?
>>> 
>>> Yes. But we developed software as libraries usually. Practically I
expect the main part of the work to write some piece of API that generate an
“in-memory” representation of the diff.
>>> 
>>> A tool that is generating a textual-human readable output is likely
the first client of this API and is likely critical to be able to functionally
test it in the early development. In the future I hope it’d enable other
graphical diff client to plug-in, or git-merge resolution tools as well.
>>> 
>>> Best,
>>> 
>>> — 
>>> Mehdi
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170323/3372f45e/attachment.html>

Vassil Vassilev via llvm-dev

2017-Mar-30 13:56 UTC

head link

[llvm-dev] [GSoC 2017] Clang-based diff tool project

Hi,

   This seems a very exciting project.

   As part of GSoC16 Raphael developed a code clone detection tool 
(https://docs.google.com/presentation/d/1mJ6dA6XmAQ8s8Zqm_j518yoW-_QZ72e69fPG1u_nbj8/edit#slide=id.g35f391192_00).
We are working on turning the infrastructure into a reusable set of 
components (https://reviews.llvm.org/D23418).

   Raphael hacked together a few lines of code, addressing Greg's 
proposal based on D23418.


r1 	r2
int main(int argc, const char **argv) {
   switch (argc) {
   }
   if (argc > 2) {
     return 1;
   }
   while (false);
   int funkyVariable = 1;
   funkyVariable++;
}
	int main(int argc, const char **argv) {
   if (argc > 2) {
     return 1;
   }
   switch (argc) {
   }
   while (false);
   int funkyVariable = 1;
   funkyVariable++;
}


   ./clangDiff
   Change: SwitchStmt moved from line 2 to line 5
   Change: IfStmt moved from line 4 to line 2


   Here is how it looks 
https://gist.github.com/Teemperor/b252bae4b2544f57d6bb9580a0e890e4


   Let us know if we can help you further with this!

-- Vassil and Raphael

On 23/03/17 18:41, Greg Clayton via llvm-dev wrote:> My original idea was to write a semantic diff tool that just does some
simple things up front:
>
> create an MD5 from all top level blocks of the code. Start by just finding
matching blocks of code ('{' and '}', '(' and
')') and remember the source locations for these and their MD5 values.
Run a normal diff on the code and see what blocks the diffs fall into. Then try
to figure out where things moved by possibly delving deeper into each block that
matched something from the diff. Also if any blocks moved to a completely
different location, try and figure that out by matching the MD5 of any blocks.
>
> For example if you had:
>
> int main(int argc, const char **argv) {
>    if (argc > 2) {
>    }
>    switch (argc) {
>    }
> }
>
> You would first make MD5s for the '(' and ')' in the
"main" line and for the '{' at the end of the main line, and
ending at the end of the code. Now the code looks like:
>
> int main(int argc, const char **argv) {
>    switch (argc) {
>    }
>    if (argc > 2) {
>    }
> }
>
> The diff would show that the "if" is gone and a new
"if" is found after the switch in the new version of the file. We
would notice that the diff appears inside the block from the first:
>
> {
>    if (argc > 2) {
>    }
>    switch (argc) {
>    }
> }
>
> And in the block from the second:
>
> {
>    switch (argc) {
>    }
>    if (argc > 2) {
>    }
> }
>
> So we would then compute the MD5 for the blocks inside each of these blocks
and try to match things up. The MD5 would of course remove spaces that
aren't in strings and only compute the MD5 from the characters that make
sense. This simple type of approach could almost work on any language without
the need to be able to correctly compile each file with all the right options.
>
> Greg
>
>> On Mar 20, 2017, at 4:47 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>>
>> (+CC: Greg Clayton who gave me this idea in the first place)
>>
>>> On Mar 20, 2017, at 3:20 PM, Johannes Altmanninger via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>>>
>>> Hello,
>>>
>>> I am currently studying Computer Science at TU Eindhoven. I am
doing a
>>> course that involves programming assignments on parts of LLVM such
as
>>> lowering, scheduling and optimization. For this year's Google
Summer of
>>> Code I plan to submit a proposal to implement a clang-based diff
tool
>>> [1].
>> Great! I look forward to see this :)
>>
>>> I think it really pays off to have decent developer tools
available, as
>>> they can save tons of time. Clang tooling has obviously been very
>>> successful.  I think it would be a good idea to develop a diff tool
that
>>> considers the structure of the code, as opposed to just the lines.
Plain
>>> old diff only thinks in terms of "additions" and
"deletions", although
>>> it would be more natural to also consider "updates" and
"moves".
>>>
>>> So a structural diff would work solely on the AST, hence formatting
>>> changes are ignored. It would allow to highlight the exact location
of a
>>> change, and not a whole line. Furthermore, it would allow to
compare
>>> pieces of code with the same structure (think subclasses).
>>>
>>> Besides some papers with clever AST-matching algorithms, a quick
web
>>> search yielded [2], which is a proof-of-concept implementation of a
>>> structural comparison algorithm.  I think it demonstrates rather
nicely
>>> what could be done: movement of chunks of code can be easily
traced.
>>>
>>> Anyway, one could make all kinds of nice visualizations using a AST
diff
>>> tool, however, I think the initial focus should probably be on
creating
>>> one with a similar output to traditional diff, with the difference
that
>>> updates and moves are displayed in a easily readable way, which
already
>>> could improve developer productivity and happiness.
>>>
>>> As of now I have one question: The output of the tool is meant just
for
>>> humans to read (and not for actual patching), right?
>> Yes. But we developed software as libraries usually. Practically I
expect the main part of the work to write some piece of API that generate an
“in-memory” representation of the diff.
>>
>> A tool that is generating a textual-human readable output is likely the
first client of this API and is likely critical to be able to functionally test
it in the early development. In the future I hope it’d enable other graphical
diff client to plug-in, or git-merge resolution tools as well.
>>
>> Best,
>>
>> —
>> Mehdi
>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170330/fe1f4f5b/attachment.html>

Mehdi Amini via llvm-dev

2017-Mar-30 15:03 UTC

head link

[llvm-dev] [GSoC 2017] Clang-based diff tool project

> On Mar 30, 2017, at 6:56 AM, Vassil Vassilev <v.g.vassilev at
gmail.com> wrote:
> 
> Hi,
> 
>   This seems a very exciting project.
Do I take that you’re volunteering to mentor it? ;-)

> 
>   As part of GSoC16 Raphael developed a code clone detection tool
(https://docs.google.com/presentation/d/1mJ6dA6XmAQ8s8Zqm_j518yoW-_QZ72e69fPG1u_nbj8/edit#slide=id.g35f391192_00
<https://docs.google.com/presentation/d/1mJ6dA6XmAQ8s8Zqm_j518yoW-_QZ72e69fPG1u_nbj8/edit#slide=id.g35f391192_00>).
We are working on turning the infrastructure into a reusable set of components
(https://reviews.llvm.org/D23418 <https://reviews.llvm.org/D23418>).
> 
>   Raphael hacked together a few lines of code, addressing Greg's
proposal based on D23418.
> 
>   
> r1	r2
> int main(int argc, const char **argv) {
>   switch (argc) {
>   }
>   if (argc > 2) {
>     return 1;
>   }
>   while (false);
>   int funkyVariable = 1;
>   funkyVariable++;
> }
> int main(int argc, const char **argv) {
>   if (argc > 2) {
>     return 1;
>   }
>   switch (argc) {
>   }
>   while (false);
>   int funkyVariable = 1;
>   funkyVariable++;
> }
>   ./clangDiff
>   Change: SwitchStmt moved from line 2 to line 5
>   Change: IfStmt moved from line 4 to line 2
>  
Neat! :)

> 
>   Here is how it looks
https://gist.github.com/Teemperor/b252bae4b2544f57d6bb9580a0e890e4
<https://gist.github.com/Teemperor/b252bae4b2544f57d6bb9580a0e890e4>
> 
>   Let us know if we can help you further with this!
I’d be happy if you could take the lead. Johannes asked earlier how to start in
clang and show his ability, any bug to fix or small improvement to implement you
can suggest?
He also asked about libclang vs libtooling, not sure if anyone already answered.

— 
Mehdi


> 
> -- Vassil and Raphael
> 
> On 23/03/17 18:41, Greg Clayton via llvm-dev wrote:
>> My original idea was to write a semantic diff tool that just does some
simple things up front:
>> 
>> create an MD5 from all top level blocks of the code. Start by just
finding matching blocks of code ('{' and '}', '(' and
')') and remember the source locations for these and their MD5 values.
Run a normal diff on the code and see what blocks the diffs fall into. Then try
to figure out where things moved by possibly delving deeper into each block that
matched something from the diff. Also if any blocks moved to a completely
different location, try and figure that out by matching the MD5 of any blocks.
>> 
>> For example if you had:
>> 
>> int main(int argc, const char **argv) {
>>   if (argc > 2) {
>>   }
>>   switch (argc) {
>>   }
>> } 
>> 
>> You would first make MD5s for the '(' and ')' in the
"main" line and for the '{' at the end of the main line, and
ending at the end of the code. Now the code looks like:
>> 
>> int main(int argc, const char **argv) {
>>   switch (argc) {
>>   }
>>   if (argc > 2) {
>>   }
>> } 
>> 
>> The diff would show that the "if" is gone and a new
"if" is found after the switch in the new version of the file. We
would notice that the diff appears inside the block from the first:
>> 
>> {
>>   if (argc > 2) {
>>   }
>>   switch (argc) {
>>   }
>> } 
>> 
>> And in the block from the second:
>> 
>> {
>>   switch (argc) {
>>   }
>>   if (argc > 2) {
>>   }
>> }
>> 
>> So we would then compute the MD5 for the blocks inside each of these
blocks and try to match things up. The MD5 would of course remove spaces that
aren't in strings and only compute the MD5 from the characters that make
sense. This simple type of approach could almost work on any language without
the need to be able to correctly compile each file with all the right options.
>> 
>> Greg
>> 
>>> On Mar 20, 2017, at 4:47 PM, Mehdi Amini <mehdi.amini at
apple.com> <mailto:mehdi.amini at apple.com> wrote:
>>> 
>>> (+CC: Greg Clayton who gave me this idea in the first place)
>>> 
>>>> On Mar 20, 2017, at 3:20 PM, Johannes Altmanninger via llvm-dev
<llvm-dev at lists.llvm.org> <mailto:llvm-dev at lists.llvm.org>
wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I am currently studying Computer Science at TU Eindhoven. I am
doing a
>>>> course that involves programming assignments on parts of LLVM
such as
>>>> lowering, scheduling and optimization. For this year's
Google Summer of
>>>> Code I plan to submit a proposal to implement a clang-based
diff tool
>>>> [1].
>>> Great! I look forward to see this :)
>>> 
>>>> I think it really pays off to have decent developer tools
available, as
>>>> they can save tons of time. Clang tooling has obviously been
very
>>>> successful.  I think it would be a good idea to develop a diff
tool that
>>>> considers the structure of the code, as opposed to just the
lines. Plain
>>>> old diff only thinks in terms of "additions" and
"deletions", although
>>>> it would be more natural to also consider "updates"
and "moves".
>>>> 
>>>> So a structural diff would work solely on the AST, hence
formatting
>>>> changes are ignored. It would allow to highlight the exact
location of a
>>>> change, and not a whole line. Furthermore, it would allow to
compare
>>>> pieces of code with the same structure (think subclasses).
>>>> 
>>>> Besides some papers with clever AST-matching algorithms, a
quick web
>>>> search yielded [2], which is a proof-of-concept implementation
of a
>>>> structural comparison algorithm.  I think it demonstrates
rather nicely
>>>> what could be done: movement of chunks of code can be easily
traced.
>>>> 
>>>> Anyway, one could make all kinds of nice visualizations using a
AST diff
>>>> tool, however, I think the initial focus should probably be on
creating
>>>> one with a similar output to traditional diff, with the
difference that
>>>> updates and moves are displayed in a easily readable way, which
already
>>>> could improve developer productivity and happiness.
>>>> 
>>>> As of now I have one question: The output of the tool is meant
just for
>>>> humans to read (and not for actual patching), right?
>>> Yes. But we developed software as libraries usually. Practically I
expect the main part of the work to write some piece of API that generate an
“in-memory” representation of the diff.
>>> 
>>> A tool that is generating a textual-human readable output is likely
the first client of this API and is likely critical to be able to functionally
test it in the early development. In the future I hope it’d enable other
graphical diff client to plug-in, or git-merge resolution tools as well.
>>> 
>>> Best,
>>> 
>>> — 
>>> Mehdi
>>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170330/c6ffaa22/attachment.html>

Greg Clayton via llvm-dev

2017-Mar-30 21:22 UTC

head link

[llvm-dev] [GSoC 2017] Clang-based diff tool project

> On Mar 30, 2017, at 6:56 AM, Vassil Vassilev <v.g.vassilev at
gmail.com> wrote:
> 
> Hi,
> 
>   This seems a very exciting project.
> 
>   As part of GSoC16 Raphael developed a code clone detection tool
(https://docs.google.com/presentation/d/1mJ6dA6XmAQ8s8Zqm_j518yoW-_QZ72e69fPG1u_nbj8/edit#slide=id.g35f391192_00
<https://docs.google.com/presentation/d/1mJ6dA6XmAQ8s8Zqm_j518yoW-_QZ72e69fPG1u_nbj8/edit#slide=id.g35f391192_00>).
We are working on turning the infrastructure into a reusable set of components
(https://reviews.llvm.org/D23418 <https://reviews.llvm.org/D23418>).
> 
>   Raphael hacked together a few lines of code, addressing Greg's
proposal based on D23418.
> 
>   
> r1	r2
> int main(int argc, const char **argv) {
>   switch (argc) {
>   }
>   if (argc > 2) {
>     return 1;
>   }
>   while (false);
>   int funkyVariable = 1;
>   funkyVariable++;
> }
> int main(int argc, const char **argv) {
>   if (argc > 2) {
>     return 1;
>   }
>   switch (argc) {
>   }
>   while (false);
>   int funkyVariable = 1;
>   funkyVariable++;
> }
>   ./clangDiff
>   Change: SwitchStmt moved from line 2 to line 5
>   Change: IfStmt moved from line 4 to line 2
>  
Very cool! Now we just need a graphical tool/IDE to integrate this with so we
can do these kinds of merges graphically.
>   Here is how it looks
https://gist.github.com/Teemperor/b252bae4b2544f57d6bb9580a0e890e4
<https://gist.github.com/Teemperor/b252bae4b2544f57d6bb9580a0e890e4>
> 
> 
>   Let us know if we can help you further with this!
> 
> -- Vassil and Raphael
> 
> On 23/03/17 18:41, Greg Clayton via llvm-dev wrote:
>> My original idea was to write a semantic diff tool that just does some
simple things up front:
>> 
>> create an MD5 from all top level blocks of the code. Start by just
finding matching blocks of code ('{' and '}', '(' and
')') and remember the source locations for these and their MD5 values.
Run a normal diff on the code and see what blocks the diffs fall into. Then try
to figure out where things moved by possibly delving deeper into each block that
matched something from the diff. Also if any blocks moved to a completely
different location, try and figure that out by matching the MD5 of any blocks.
>> 
>> For example if you had:
>> 
>> int main(int argc, const char **argv) {
>>   if (argc > 2) {
>>   }
>>   switch (argc) {
>>   }
>> } 
>> 
>> You would first make MD5s for the '(' and ')' in the
"main" line and for the '{' at the end of the main line, and
ending at the end of the code. Now the code looks like:
>> 
>> int main(int argc, const char **argv) {
>>   switch (argc) {
>>   }
>>   if (argc > 2) {
>>   }
>> } 
>> 
>> The diff would show that the "if" is gone and a new
"if" is found after the switch in the new version of the file. We
would notice that the diff appears inside the block from the first:
>> 
>> {
>>   if (argc > 2) {
>>   }
>>   switch (argc) {
>>   }
>> } 
>> 
>> And in the block from the second:
>> 
>> {
>>   switch (argc) {
>>   }
>>   if (argc > 2) {
>>   }
>> }
>> 
>> So we would then compute the MD5 for the blocks inside each of these
blocks and try to match things up. The MD5 would of course remove spaces that
aren't in strings and only compute the MD5 from the characters that make
sense. This simple type of approach could almost work on any language without
the need to be able to correctly compile each file with all the right options.
>> 
>> Greg
>> 
>>> On Mar 20, 2017, at 4:47 PM, Mehdi Amini <mehdi.amini at
apple.com> <mailto:mehdi.amini at apple.com> wrote:
>>> 
>>> (+CC: Greg Clayton who gave me this idea in the first place)
>>> 
>>>> On Mar 20, 2017, at 3:20 PM, Johannes Altmanninger via llvm-dev
<llvm-dev at lists.llvm.org> <mailto:llvm-dev at lists.llvm.org>
wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I am currently studying Computer Science at TU Eindhoven. I am
doing a
>>>> course that involves programming assignments on parts of LLVM
such as
>>>> lowering, scheduling and optimization. For this year's
Google Summer of
>>>> Code I plan to submit a proposal to implement a clang-based
diff tool
>>>> [1].
>>> Great! I look forward to see this :)
>>> 
>>>> I think it really pays off to have decent developer tools
available, as
>>>> they can save tons of time. Clang tooling has obviously been
very
>>>> successful.  I think it would be a good idea to develop a diff
tool that
>>>> considers the structure of the code, as opposed to just the
lines. Plain
>>>> old diff only thinks in terms of "additions" and
"deletions", although
>>>> it would be more natural to also consider "updates"
and "moves".
>>>> 
>>>> So a structural diff would work solely on the AST, hence
formatting
>>>> changes are ignored. It would allow to highlight the exact
location of a
>>>> change, and not a whole line. Furthermore, it would allow to
compare
>>>> pieces of code with the same structure (think subclasses).
>>>> 
>>>> Besides some papers with clever AST-matching algorithms, a
quick web
>>>> search yielded [2], which is a proof-of-concept implementation
of a
>>>> structural comparison algorithm.  I think it demonstrates
rather nicely
>>>> what could be done: movement of chunks of code can be easily
traced.
>>>> 
>>>> Anyway, one could make all kinds of nice visualizations using a
AST diff
>>>> tool, however, I think the initial focus should probably be on
creating
>>>> one with a similar output to traditional diff, with the
difference that
>>>> updates and moves are displayed in a easily readable way, which
already
>>>> could improve developer productivity and happiness.
>>>> 
>>>> As of now I have one question: The output of the tool is meant
just for
>>>> humans to read (and not for actual patching), right?
>>> Yes. But we developed software as libraries usually. Practically I
expect the main part of the work to write some piece of API that generate an
“in-memory” representation of the diff.
>>> 
>>> A tool that is generating a textual-human readable output is likely
the first client of this API and is likely critical to be able to functionally
test it in the early development. In the future I hope it’d enable other
graphical diff client to plug-in, or git-merge resolution tools as well.
>>> 
>>> Best,
>>> 
>>> — 
>>> Mehdi
>>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170330/316d9500/attachment.html>

llvm dev - Mar 2017 - [GSoC 2017] Clang-based diff tool project

[llvm-dev] [GSoC 2017] Clang-based diff tool project

[llvm-dev] [GSoC 2017] Clang-based diff tool project

[llvm-dev] [GSoC 2017] Clang-based diff tool project

[llvm-dev] [GSoC 2017] Clang-based diff tool project

[llvm-dev] [GSoC 2017] Clang-based diff tool project

[llvm-dev] [GSoC 2017] Clang-based diff tool project