
Why does TalkYard's robots.txt exclude user profile pages?

By Michael Lynch @michael
    2021-04-10 22:49:18.930Z

    I got an email from Google Search warning me that they're indexing pages that are disallowed by robots.txt (weird that they're ignoring the robots.txt, but anyway...).

    I checked it out, and it looks like they're user profile pages:

    I can see that my robots.txt is indeed blocking those URLs:

    User-agent: *
    Disallow: /-/
    # Googlebot needs the CSS files to know that Talkyard is mobile friendly.
    # Probably good to let it access Javascript too?
    Allow: /-/assets/
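
    To make the effect of these rules concrete, here is a minimal sketch of longest-match precedence, which is how Google documents resolving a conflict between Allow and Disallow (the most specific rule wins, and Allow wins a tie). It handles plain path prefixes only, not wildcards, and the profile path `/-/users/example` is a hypothetical URL used for illustration.

```python
from typing import List, Tuple

# Rules copied from the robots.txt above: (directive, path prefix).
RULES: List[Tuple[str, str]] = [
    ("disallow", "/-/"),
    ("allow", "/-/assets/"),
]


def is_allowed(path: str, rules: List[Tuple[str, str]] = RULES) -> bool:
    """Return True if `path` may be crawled under longest-match precedence."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        # The longer (more specific) prefix wins; on a tie, allow wins.
        longer = len(prefix) > best_len
        tie_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_allow:
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for p in ("/-/users/example", "/-/assets/style.css", "/latest"):
        print(p, "allowed" if is_allowed(p) else "blocked")
```

    Under these rules, anything below /-/ is blocked except the assets folder, which is why the profile pages end up disallowed.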
    

    Is that intentional? Those seem like reasonable pages for Google to index.

    Related thread: Feature suggestion: Improve Google mobile scores

    • 4 replies
    1. KajMagnus @KajMagnus
        2021-04-12 16:05:50.032Z, 2021-04-12 16:29:43.300Z

      Is that intentional? Those seem like reasonable pages for Google to index.

      Yes it is intentional — although maybe it's better to change this no-index thing, or make it optional.

      I was thinking as follows: Somewhat often, people reuse the same usernames, across different places on the Internet. Then, one can websearch for someone's username, and find many places that person has joined — but maybe some people don't like this?

      Maybe they'd want their user profile pages not to be indexed by search engines, for privacy reasons.

      So that's why it's not indexed. (The other things below /-/ are API endpoints and admin/moderation pages, and shouldn't be indexed, by the way.)

      At the same time, if one posts a reply, then one's username appears next to the reply, anyway. And can be found via a websearch anyway.

      Now I'm thinking:

      • Maybe it's better that, by default, the user profile pages get search-engine-indexed. And admins could disable indexing, and individual members could disable indexing for themselves.

      • If disabled, I thought that maybe there was a way to tell Googlebot to not index one's username when it appears next to one's replies,
        however, it seems that's not possible:
        https://stackoverflow.com/questions/14314111/avoid-crawling-part-of-a-page-with-googleoff-and-googleon

        [...] "googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, [...]

        It seems Googlebot indexes everything on a page, or nothing.

      • Maybe instead one could choose to display only one's first name & avatar, but not one's username, next to one's replies. Then only one's first name would get indexed (and first names are pretty pointless to websearch for). But maybe then it's better if people simply choose different usernames.

      Actually no one has asked about their profile pages and search engines, so I suspect that it's ok to index profile pages, by default.

      they're indexing pages that are disallowed by robots.txt (weird that they're ignoring the robots.txt, but anyway...).

      Yes that was surprising. Looking in the "Affected pages" graph, seems they started doing this recently? Hmm.

      1. Michael Lynch @michael
          2021-04-13 01:27:22.788Z

          Oh, interesting.

          Yeah, obviously privacy expectations are very subjective. When I post on a public forum, I have no expectation that the forum admin would protect my user profile from search indexing. I assume that if I'm posting, Google will index what I write. If it was some sort of private community, that's different, but for something like a tech support forum, I expect indexing and don't expect the site admin to protect me from it.

          As a TalkYard forum admin, I'd prefer to make user profiles indexable because Google is indexing the posts anyway, which include the user's username.

          Maybe it's better that, by default, the user profile pages get search-engine-indexed. And admins could disable indexing, and individual members could disable indexing for themselves.

          Indexing by default would be my preference.

          I wouldn't want users to be able to override the setting, though, because it will generate warnings from Google search. As owner of the forum, I'd prefer to have ultimate control over what gets indexed.

          Maybe instead one could choose to display only one's first name & avatar, but not one's username, next to one's replies. Then only one's first name would get indexed (and first names are pretty pointless to websearch for). But maybe then it's better if people simply choose different usernames.

          I can see the logic in offering options to display only a "display name" instead of the username, but I feel like it's extra complexity for something that's not important to me, and that I don't think is important to my forum users. It could also lead to the opposite problem where someone has a unique first name, so they choose a non-Googleable username, and then they're surprised to see their first name displayed publicly.

        • In reply to michael:

          Anyway, for now, I can add a <meta name="robots" content="noindex" /> tag. And, as mentioned in a reply to your tweet, the profile pages must then also stop being blocked by robots.txt:

          For the noindex directive to be effective, the page must not be blocked by a robots.txt file
          https://developers.google.com/search/docs/advanced/crawling/block-indexing
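
          The interaction between the two mechanisms can be sketched roughly as follows. This is only an illustration of the rule quoted above, not Talkyard's or Google's code: the class and function names are made up for the example, and whether a page is blocked by robots.txt is passed in as a plain boolean.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the page carries a robots noindex meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


def noindex_is_effective(html: str, blocked_by_robots_txt: bool) -> bool:
    """A crawler can only see the tag if it is allowed to fetch the page."""
    return has_noindex(html) and not blocked_by_robots_txt


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex" /></head></html>'
    print(noindex_is_effective(page, blocked_by_robots_txt=True))
    print(noindex_is_effective(page, blocked_by_robots_txt=False))
```

          The point of the second function is that the tag alone is not enough: if the crawler may not fetch the page, it never reads the tag.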

          ***

          Apparently Googlebot can index robots.txt-disallowed pages if other pages link to them (which sounds a bit weird to me):

          https://www.searchenginejournal.com/google-pages-blocked-robots-txt-will-get-indexed-theyre-linked/255911/

          if these pages are blocked by robots.txt, then it could theoretically happen that someone randomly links to one of these pages. And if they do that then it could happen that we index this URL without any content because its blocked by robots.txt. So we wouldn’t know that you don’t want to have these pages actually indexed.

          Whereas if they’re not blocked by robots.txt you can put a noindex meta tag on those pages ... then we would know that these pages don’t need to be indexed and we can just skip them

          says John Mueller, Senior Webmaster Trends Analyst at Google.

          ***

          ( There was a Noindex: ... instruction one could add in robots.txt, but it's been deprecated; the meta tag is recommended instead:

          the use of noindex within robots.txt will no longer be supported by Google.
          Gary Illyes [Google Webmaster Trends Analyst at Google] explained that after running analysis around the use of noindex in robots.txt files, Google found “the number of sites that were hurting themselves was very high.”

          https://www.deepcrawl.com/blog/best-practice/robots-txt-noindex-the-best-kept-secret-in-seo/ )
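
          If one wanted to check an existing robots.txt for leftover Noindex lines, a small scan like the one below might do. It is a sketch only: the function name is invented for the example, and the sample file and its `/-/users/` path are hypothetical.

```python
from typing import List


def find_noindex_lines(robots_txt: str) -> List[int]:
    """Return 1-based line numbers of `Noindex:` rules in a robots.txt body."""
    hits: List[int] = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        rule = line.split("#", 1)[0].strip()  # ignore comments
        if rule.lower().startswith("noindex:"):
            hits.append(number)
    return hits


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = "User-agent: *\nDisallow: /-/\nNoindex: /-/users/\n# Noindex: /old/\n"
    print(find_noindex_lines(sample))
```

          Any line it reports would need to be replaced with a noindex meta tag on the pages themselves.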

          1. In reply to michael:

            Status update: In the upcoming version (in a few days), there'll be a <meta name="robots" content="noindex" /> tag, so the Google Search Console warnings should go away.

            Later, in the next-next? version, I'd like to add a per user and per group (incl the Everyone group) setting, so one can configure the search engine visibility, so user profile pages can get indexed.

            And also, because I need this myself, settings for making group membership invisible, so one cannot see if someone is a member of a "hidden group", or that the group even exists.
            And settings for making group members invisible to others in the forum.
            (I'm wondering if it's not-so-easy to guess what the background for these invisible group and members is :- ))