Discussion:
RFC: PATCH to adjust libcpp raw string support for N3077 changes
Jason Merrill
2010-03-26 21:45:10 UTC
At the Pittsburgh meeting earlier this month, the committee approved
several significant changes to raw strings:

1) Making them truly raw by reverting any transformations of extended
characters/UCNs, trigraphs or backslash/newline.
2) Avoiding other trigraph issues by changing the inner delimiters from
[] to ().
3) Avoiding UCN issues in the delimiters by prohibiting '\'.
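
As a minimal illustration of changes 1 and 2 (not taken from the patches), the sketch below shows the post-N3077 spelling with '(' and ')' as the inner delimiters, and a backslash/newline pair that stays in the string verbatim. The function names and the delimiter "x" are made up for the example, and it assumes a compiler that implements the final C++11 raw string rules.

```cpp
#include <string>

// Ordinary literal: the backslash must be escaped to get the two
// characters '\' and 'n'.
inline std::string ordinary_spelling() { return "a\\nb"; }

// Raw literal using the new inner delimiters '(' and ')': nothing in the
// body is treated as an escape sequence.
inline std::string raw_spelling() { return R"(a\nb)"; }

// An outer delimiter ("x", chosen arbitrarily) lets the body contain
// the two-character sequence )" without ending the literal.
inline std::string raw_with_delimiter() { return R"x(say ")" here)x"; }

// Backslash followed by a newline is not spliced inside a raw string:
// the result is 'a', '\', newline, 'b'.
inline std::string raw_backslash_newline() {
    return R"(a\
b)";
}
```

Here raw_spelling() and ordinary_spelling() yield the same four characters, and raw_backslash_newline() keeps both the backslash and the line break.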

I think we really want at least #2 in 4.5, preferably all 3, as it will
be the first release with raw string support.

I've attached two patches to implement these changes. The first
implements #2 and #3, the second patch implements #1. I think they are
safe to go in now, as the only change to non-raw string code is to pass
down line notes into lex_raw_string.

Any comments?
Jakub Jelinek
2010-03-26 22:12:39 UTC
Post by Jason Merrill
At the Pittsburgh meeting earlier this month, the committee approved
several significant changes to raw strings:
1) Making them truly raw by reverting any transformations of
extended characters/UCNs, trigraphs or backslash/newline.
2) Avoiding other trigraph issues by changing the inner delimiters
from [] to ().
3) Avoiding UCN issues in the delimiters by prohibiting '\'.
I think we really want at least #2 in 4.5, preferably all 3, as it
will be the first release with raw string support.
I think we want all 3 in 4.5.

Jakub
Joseph S. Myers
2010-03-27 00:24:29 UTC
Post by Jason Merrill
At the Pittsburgh meeting earlier this month, the committee approved several
significant changes to raw strings:
Where is the document specifying these changes? I don't think the patch
can be reviewed without it.
Post by Jason Merrill
1) Making them truly raw by reverting any transformations of extended
characters/UCNs, trigraphs or backslash/newline.
But is it still considered in accordance with the specification that the
translation of the source file to UTF-8, and conversion of all end-of-line
sequences (CR, LF or CRLF) to LF, take place to produce the characters
that go in the string? (And if so, does the implementation follow this -
handling all newline sequences the same - and are there testcases for it?)

I note the second patch has a FIXME regarding reproducing
backslash-whitespace-newline sequences (where it's not just the number of
spaces that's relevant, but the choice of which non-vertical whitespace
characters are there). What will it do with ??/ followed by newline (with
or without whitespace) - will it correctly reconstitute the trigraph and
the newline? I think there should be tests for each trigraph sequence.
These FIXME issues don't need fixing now; we lived for years with PR
20078, which seems similar in spirit as an issue with an internal
representation failing to track whitespace and spelling differences in the
source code that are significant in corner cases; but we should understand
what is missing and have a PR in Bugzilla for it.
--
Joseph S. Myers
***@codesourcery.com
Joseph S. Myers
2010-03-28 15:10:39 UTC
Post by Jason Merrill
Post by Joseph S. Myers
Where is the document specifying these changes? I don't think the patch
can be reviewed without it.
Ah yes, I meant to attach it. Done.
Thanks.

The first patch is OK. I think the code changes in the second patch are
OK, but it needs extra testcases for the consequences of the model that
trigraphs and backslash-newline sequences are replaced for the purpose of
delimiting the raw string token but then those changes are reversed
afterwards, if there aren't already such testcases in the testsuite:

* Trigraphs and backslash-newline may be used in the beginning and ending
d-char-sequences, with the source code characters before the replacements
not necessarily being the same in the beginning and ending sequences.

* ??( in the opening d-char-sequence is a trigraph for [ rather than being
?? at the end of the d-char-sequence followed by the ( before the
r-char-sequence.

* ??) followed by the opening d-char-sequence and " does not serve to end
the string because it is a trigraph at this lexical stage and ends up as
literal ??) in the raw string.

I expect these cases to work with the current implementation - but I think
they need testcases.
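
The "replace, then reverse" model described above can be sketched in a few lines. This is a simplified stand-alone model, not libcpp code: the Note struct and the clean/restore functions are hypothetical names, only trigraphs are handled (no backslash-newline), and the question marks are written one at a time so that the example itself contains no trigraphs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one replacement: where the replacement character
// landed in the cleaned text, and which character completed the trigraph.
struct Note {
    std::size_t pos;  // index in the cleaned buffer
    char third;       // third character of the original trigraph
};

// Map the third character of a trigraph to its replacement, or 0 if the
// three characters do not form a trigraph.
inline char trigraph_value(char third) {
    switch (third) {
        case '=':  return '#';
        case '(':  return '[';
        case '/':  return '\\';
        case ')':  return ']';
        case '\'': return '^';
        case '<':  return '{';
        case '!':  return '|';
        case '>':  return '}';
        case '-':  return '~';
        default:   return 0;
    }
}

// Replace every trigraph and record a note for each replacement.
inline std::string clean(const std::string &src, std::vector<Note> &notes) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '?' && i + 2 < src.size() && src[i + 1] == '?') {
            char v = trigraph_value(src[i + 2]);
            if (v != 0) {
                notes.push_back({out.size(), src[i + 2]});
                out += v;
                i += 2;  // skip the rest of the trigraph
                continue;
            }
        }
        out += src[i];
    }
    return out;
}

// Undo the replacements, reproducing the original spelling.
inline std::string restore(const std::string &cleaned,
                           const std::vector<Note> &notes) {
    std::string out;
    std::size_t next = 0;
    for (std::size_t i = 0; i < cleaned.size(); ++i) {
        if (next < notes.size() && notes[next].pos == i) {
            out += '?';
            out += '?';
            out += notes[next].third;
            ++next;
        } else {
            out += cleaned[i];
        }
    }
    return out;
}
```

In this model the token boundaries would be found on the output of clean, while the characters stored in the string come from restore, which is why each trigraph needs its own testcase: every one must round-trip.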

I would not expect the testcase change to need to add -trigraphs to
dg-options; -std=c++0x automatically enables trigraphs.
--
Joseph S. Myers
***@codesourcery.com
Jason Merrill
2010-03-29 15:55:48 UTC
Post by Joseph S. Myers
The first patch is OK. I think the code changes in the second patch are
OK, but it needs extra testcases for the consequences of the model that
trigraphs and backslash-newline sequences are replaced for the purpose of
delimiting the raw string token but then those changes are reversed
* Trigraphs and backslash-newline may be used in the beginning and ending
d-char-sequences, with the source code characters before the replacements
not necessarily being the same in the beginning and ending sequences.
* ??( in the opening d-char-sequence is a trigraph for [ rather than being
?? at the end of the d-char-sequence followed by the ( before the
r-char-sequence.
Both of these are now tested in raw-string-2.C.
Post by Joseph S. Myers
* ??) followed by the opening d-char-sequence and " does not serve to end
the string because it is a trigraph at this lexical stage and ends up as
literal ??) in the raw string.
Actually, I think it should end the string; I'll work on clarifying the
standardese to that effect.

I've fixed the implementation accordingly, along with a few other bugs I
noticed when adding more tests.
Post by Joseph S. Myers
I would not expect the testcase change to need to add -trigraphs to
dg-options; -std=c++0x automatically enables trigraphs.
Ah, right.

The new patch doesn't pass down notes anymore, so the only change to
non-raw-string code is to ignore line notes in _cpp_process_line_notes
that have been marked handled (by setting type to 0) in lex_raw_string.

How does this look?
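
The "marked handled" idea can be pictured with a small stand-alone sketch. It is not the libcpp code: LineNote, mark_handled and pending_notes are hypothetical names, and the only detail taken from the description above is that a note whose type has been set to 0 is ignored by later processing.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical line note: a position in the buffer plus a type tag.
// A type of 0 means "already handled, ignore".
struct LineNote {
    std::size_t pos;
    unsigned int type;
};

// What a raw string lexer might do: mark every note that falls inside
// the half-open range [begin, end) of the raw string as handled.
inline void mark_handled(std::vector<LineNote> &notes,
                         std::size_t begin, std::size_t end) {
    for (LineNote &n : notes)
        if (n.pos >= begin && n.pos < end)
            n.type = 0;
}

// What the generic note processing would then do: skip handled notes
// and return only the ones that still need attention.
inline std::vector<LineNote> pending_notes(const std::vector<LineNote> &notes) {
    std::vector<LineNote> out;
    for (const LineNote &n : notes)
        if (n.type != 0)
            out.push_back(n);
    return out;
}
```

Marking notes in place keeps the change to the shared code down to a single check for type 0, instead of passing the notes into the raw string lexer.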
Joseph S. Myers
2010-03-29 19:44:54 UTC
Post by Jason Merrill
The new patch doesn't pass down notes anymore, so the only change to
non-raw-string code is to ignore line notes in _cpp_process_line_notes
that have been marked handled (by setting type to 0) in lex_raw_string.
How does this look?
This patch looks OK. I've noticed a comment that needs fixing, although
Post by Jason Merrill
/* Lexes a raw string. The stored string contains the spelling, including
double quotes, delimiter string, '[' and ']', any leading
'L', 'u', 'U' or 'u8' and 'R' modifier. It returns the type of the
Should now reference '(' and ')'.
--
Joseph S. Myers
***@codesourcery.com
Jason Merrill
2010-03-29 20:10:11 UTC
Post by Joseph S. Myers
Post by Jason Merrill
/* Lexes a raw string. The stored string contains the spelling, including
double quotes, delimiter string, '[' and ']', any leading
'L', 'u', 'U' or 'u8' and 'R' modifier. It returns the type of the
Should now reference '(' and ')'.
Here's the patch I'm checking in, which fixes that and moves the new
tests to c-c++-common.

Jason
Jason Merrill
2010-03-29 19:42:36 UTC
Post by Joseph S. Myers
The first patch is OK.
I checked it in, and then another patch to move the tests to c-c++-common.

Jason
H.J. Lu
2010-03-29 20:27:38 UTC
Post by Jason Merrill
Post by Joseph S. Myers
The first patch is OK.
I checked it in, and then another patch to move the tests to c-c++-common.
This patch caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43587
--
H.J.
Jason Merrill
2010-03-29 20:51:29 UTC
Post by H.J. Lu
Post by Jason Merrill
Post by Joseph S. Myers
The first patch is OK.
I checked it in, and then another patch to move the tests to c-c++-common.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43587
The patch to move the tests to c-c++-common fixed the raw-string
failures. I'll fix the other one now.

Jason
