+6 votes
by (1k points)
*Screaming Frog* I'm trying to crawl a subfolder (/blog) of a site (as per Screaming Frog's guidance), but for some reason the crawl stops/reaches 100% at around 20 URLs, even though there are hundreds of blog posts in the subfolder. Has anyone encountered the same problem before? My configuration setup is working fine for another site.  
https://www.screamingfrog.co.uk/how...ites/

8 Answers

+6 votes
by (1.2k points)
Something server-side in place to prevent crawls, maybe? Have you tried changing the user agent to Googlebot? Or increasing the time between requests, perhaps?  
by (1k points)
I'll give that a shot!  
by (1k points)
No luck with that
by (1.2k points)
Hmmm. Maybe try turning off "respect canonicals", pagination, etc. Does it crawl other subfolders on the site OK?  
0 votes
by (310 points)
As @fro5 is saying, try to change the user agent :p
+5 votes
by (350 points)
Can you upload a picture of your configuration settings? That could be it
by (350 points)
@archeozoic yup, I see the problem. Check the crawl pagination box
by (1k points)
@lefthanded didn't do it :(
+1 vote
by (1k points)
Could the fact that the posts on the 'blog' page are loaded via JS be affecting it? Only a small number of posts appear on the page itself; the rest only start to appear as you scroll. (Hope that makes sense.)
0 votes
by (350 points)
Yes, it does. Under the Rendering tab, try enabling JavaScript rendering.
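One quick way to sanity-check the JS theory outside of Screaming Frog is to count how many /blog links exist in the raw HTML, before any JavaScript runs. Below is a minimal sketch using only Python's standard library. The sample HTML, the example.com domain, and the `blog_links_in_raw_html` helper are made up for illustration; in practice you would paste in the page's "view source" output.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in static HTML (no JavaScript is run)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def blog_links_in_raw_html(html, base_url, prefix="/blog/"):
    """Return the unique same-host links under `prefix` found in the raw HTML."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    found = set()
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.netloc == host and parsed.path.startswith(prefix):
            found.add(absolute)
    return sorted(found)


# Hypothetical page source: two posts in the HTML, the rest would be
# injected into the empty <div> by JavaScript as the visitor scrolls.
sample_html = """
<html><body>
<a href="/blog/post-1">Post 1</a>
<a href="/blog/post-2">Post 2</a>
<a href="/about">About</a>
<div id="more-posts"></div>
</body></html>
"""

links = blog_links_in_raw_html(sample_html, "https://example.com/blog/")
print(len(links), links)
```

If the count from the raw source is close to the ~20 URLs the crawl stops at, that would suggest the remaining posts only exist after scripts run.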
by (1k points)
So I'm now getting this: a bunch of canonicalized URLs repeating themselves.  
by (1k points)
I'll reach out to Screaming Frog support. Don't want to waste any more of your time. Appreciate the help! :)
by (1k points)
@lefthanded if SF is having issues crawling this page, would you say this could also be an issue for Google?  
by (350 points)
@archeozoic hard to say without knowing the site. If you want to DM me the site I can take a look. If you’d rather not, that’s also okay. Just extending the offer  
+6 votes
by (1.5k points)
Did you figure it out?  
by (1k points)
Not yet - I think I'm gonna call it a night and try again tomorrow. The SEO rabbit hole is too deep right now lol
+6 votes
by (2.9k points)
Judging by the first screenshot of the multiple canonicalized URLs when JavaScript is on, it's possible you have redirect chains, loops, or possibly continuous parameter generation happening. Look at the URLs from the JavaScript crawl and see what differences/commonalities are present. If you filter it to a single subfolder (if there are any in the folder you're crawling), you can also uncheck "max redirects to follow" and strip all parameters, depending on the need. Also, check these options: "ignore robots.txt but report", "follow all redirects", "crawl all canonicals", and "allow meta-refresh crawl". There are several things that could cause what you are seeing, depending on the setup and on whether you use a tag manager to populate/generate things.  
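To make the "differences/commonalities" check concrete, here is a rough sketch that groups exported crawl URLs by what they look like once the query string and fragment are stripped. The URL list, the example.com domain, and the helper names are hypothetical; the idea is just that many variants collapsing to one base URL hints at continuous parameter generation.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def strip_params(url):
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def group_by_stripped_url(urls):
    """Map each stripped URL to its variants, keeping only bases with more than one."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_params(url)].append(url)
    return {base: variants for base, variants in groups.items() if len(variants) > 1}


# Hypothetical export from the JavaScript crawl.
crawled = [
    "https://example.com/blog/?page=2",
    "https://example.com/blog/?page=2&ref=scroll",
    "https://example.com/blog/?page=3",
    "https://example.com/blog/post-1",
]

duplicates = group_by_stripped_url(crawled)
for base, variants in duplicates.items():
    print(base, "has", len(variants), "parameter variants")
```

Any base URL with a large and growing number of variants is a good candidate for parameter stripping in the crawl configuration.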
by (1k points)
Thanks @vanhouten44 - I'll give this a go!  
by (2.9k points)
@archeozoic If you still can't get to the bottom of it, send me the url and criteria in the morning on a DM and I'll take a look.  
by (2.9k points)
@archeozoic Also, the "redirect chains" report may well be very helpful (but noisy, so use filters), depending on what's actually happening.  
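If the redirect data ends up in a spreadsheet, a loop can also be spotted by hand. A small sketch, assuming you have reduced the export to a simple source-to-target mapping; the mapping below, the example.com URLs, and the `follow_redirects` helper are invented for illustration and are not a Screaming Frog feature.

```python
def follow_redirects(start, redirects, max_hops=10):
    """Follow a source -> target mapping from `start`.

    Returns (chain, is_loop), where is_loop is True if a URL repeats.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in redirects and len(chain) <= max_hops:
        current = redirects[current]
        chain.append(current)
        if current in seen:
            return chain, True
        seen.add(current)
    return chain, False


# Hypothetical mapping: /blog/ and /blog/?page=1 redirect to each other.
redirects = {
    "https://example.com/blog": "https://example.com/blog/",
    "https://example.com/blog/": "https://example.com/blog/?page=1",
    "https://example.com/blog/?page=1": "https://example.com/blog/",
}

chain, is_loop = follow_redirects("https://example.com/blog", redirects)
print(" -> ".join(chain))
print("loop detected:", is_loop)
```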
by (1k points)
Thanks @vanhouten44 :)
+3 votes
by (1k points)
Appears to be an issue with "infinite scroll" and JS on the blog page. I've got some more digging to do. Thanks for the help guys!  