It has been a while since my last blog post. To be honest, we have been waiting for Google and co to properly index all our images, without much success, until we found out that manually submitting URLs seemed to work better. We don’t really know how the crawler works, but we found that some requests resulted in a 403 error.
Manual submission
In the Google Webmaster Tools we can manually submit a webpage for indexing. We tried this after waiting for the crawler to automatically pick up all our pages based on the sitemap. Funnily enough, the manual submission resulted in almost instant indexing, and all our main categories can now be found on Google.
To submit a URL manually, we just had to use the URL inspection tool in the webmaster tools:
Obviously, we had planned this a bit differently by creating a sitemap, and we are not sure why that hasn’t worked so far. But some of the manual submissions resulted in an HTTP status code of 403, which was a bit weird. It turns out we had to change the web application firewall rules in Plesk, as the error may have occurred when the crawler tried to access our resources. To be honest, we are speculating here, but it would explain why the message indicated that the server was overloaded when Google tried to index our page.
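If you want to check from the outside whether a firewall rule answers crawler-like requests with a 403, a small script can help. The following is only a minimal sketch: the function name `fetch_status` and the `opener` parameter are made up for this example, and sending a Googlebot-style User-Agent does not make the request come from Google’s IP addresses, so a firewall may still treat it differently from the real crawler.

```python
import urllib.error
import urllib.request

# A User-Agent string in the style Googlebot sends. This only imitates the
# header; the request still comes from your own IP address.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch_status(url, user_agent=GOOGLEBOT_UA, opener=urllib.request.urlopen):
    """Return the HTTP status code the server sends for `url`.

    `opener` is a hypothetical hook so the function can be tried without
    network access; by default it is urllib's regular urlopen.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions; the code is what we want.
        return err.code
```

Calling `fetch_status("https://www.usmalbilder.ch/")` once with the default User-Agent and once with a normal browser string shows whether the two are treated differently.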
Other improvements
Besides the manual submissions and the (hopefully working) fix for the 403 error while crawling, it is interesting that our direct image URLs (the most important ones!) are not indexed at all. Such a URL looks like:
https://www.usmalbilder.ch/view/images/space/space3.png
We suspected that the problem might be that the page URL looks like an image resource. The Webmaster Tools don’t give us any indication of this, but it’s really great that we can test the URLs and see what the crawler sees. This gives us much more confidence that our page is properly rendered when Google and co try to crawl our pages. You can also find this useful functionality in the URL inspection tool:
First test the live URL:
Then “view tested page”:
Which will show you the DOM it used, a screenshot and some other data:
Anyhow, we decided to change the way we construct the URL: we basically stripped away the .png suffix. We hope this will finally allow all our pages to be indexed properly. A manual test indexing request shows positive results, cool!
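To illustrate the idea, here is a rough sketch of how such a page URL could be derived from the old one. It is not the code from our build; the function name `strip_image_suffix` and its `suffixes` parameter are invented for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_image_suffix(url, suffixes=(".png",)):
    """Return `url` with a trailing image suffix removed from its path."""
    parts = urlsplit(url)
    path = parts.path
    for suffix in suffixes:
        if path.lower().endswith(suffix):
            path = path[: -len(suffix)]
            break
    # Query string and fragment, if any, are kept unchanged.
    return urlunsplit(parts._replace(path=path))


print(strip_image_suffix("https://www.usmalbilder.ch/view/images/space/space3.png"))
```

URLs without a matching suffix are returned unchanged, so the function can safely be applied to every page URL.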
Up next
Meanwhile, we also played around with some other AI models on Hugging Face, specifically to describe our coloring pages automatically. Most of the models we tried were not really great, but this particular one is mind-blowing!
Salesforce/blip-image-captioning-large · Hugging Face
The plan is to integrate this model into our build so that we can just add an image and let the title, description and image registration be done fully automatically by combining several AI models. This will save us a ton of manual work.
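As a first idea of what such a build step could look like, here is a minimal sketch based on the usage shown on the model card. It assumes the `transformers`, `torch` and `Pillow` packages are installed. The function names `describe_image` and `caption_to_title` are hypothetical, and our real integration may end up looking quite different.

```python
def caption_to_title(caption):
    """Tidy a raw caption into a title: collapse whitespace, capitalize the first letter."""
    text = " ".join(caption.split())
    return text[:1].upper() + text[1:]


def describe_image(image_path, model_name="Salesforce/blip-image-captioning-large"):
    """Generate a caption for the image at `image_path` with BLIP."""
    # Imported here so the rest of the build does not need these heavy packages.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

A build script could then call `caption_to_title(describe_image("space3.png"))` for each new image. The first call downloads the model weights, so it takes a while.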