Commit Graph

  • e565a861ae give nice reply from seed in json. show how many outlinks from same domain were found and how many outlinks were filtered. same for addurls (bulk add). Matt Wells 2013-10-16 14:03:14 -07:00
  • 36e6e21ae3 url filter processing looking good. Matt Wells 2013-10-16 12:19:25 -07:00
  • f5e5b0f5d3 fix crawlbot bugs Matt Wells 2013-10-16 12:12:22 -07:00
  • d2f39cd1a0 comment updates mwells 2013-10-15 23:13:50 -07:00
  • 6052f60c48 speed up dirty word detection since we added a bunch of new dirty words/phrases. mwells 2013-10-15 22:41:31 -07:00
  • 70e4a57449 add respider freq parm to crawlbot page Matt Wells 2013-10-15 16:57:34 -07:00
  • f345c35927 crawlbot fixes. Matt Wells 2013-10-15 16:31:59 -07:00
  • f892b828d9 crawlbot api fixes Matt Wells 2013-10-15 14:08:55 -07:00
  • 00029129fd crawlbot api fixes Matt Wells 2013-10-15 12:40:56 -07:00
  • 65d6af7791 show expression/action pairs in crawlbot json output. Matt Wells 2013-10-15 11:50:57 -07:00
  • ed4eff784b json output crawlbot fixes. mwells 2013-10-15 12:45:23 -06:00
  • 70eaa542f3 minor change Matt Wells 2013-10-15 11:17:44 -07:00
  • a3a9a43ded api nomenclature mwells 2013-10-15 12:31:02 -06:00
  • 313eb1209e more crawlbot fixes mwells 2013-10-15 12:22:59 -06:00
  • d8835acfef crawlbot api work. mwells 2013-10-15 11:54:54 -06:00
  • 37a9e82060 update the dirty word list. but we still should remove tags, except maybe outlinks, and detect the dirty words on what remains. getting too many false positives in tags still. mwells 2013-10-15 01:01:19 -07:00
  • 3db726c22e take out references to AdultBit.cpp, since it is no longer used. mwells 2013-10-14 23:21:58 -07:00
  • 12bff1e9b0 fix potential problem of tons of points in our statsdb div graph. use hashtable to dedup points and save from printing out too many <div> tags. mwells 2013-10-14 22:52:29 -07:00
  • 90fca8c171 fix "search in category" link. mwells 2013-10-14 22:39:42 -07:00
  • 0096877127 fix "statsdb" graph so it seems to work now. mwells 2013-10-14 22:31:00 -07:00
  • 9e9ef9c2cc still getting statsdb link to work. a little better now. mwells 2013-10-14 21:21:27 -07:00
  • 8f93a72961 start using html div graph for PageStatsdb.cpp now too. mwells 2013-10-14 20:35:45 -07:00
  • a0808df2ae got new diffbot api compiled mwells 2013-10-14 18:19:59 -06:00
  • c19310cb7e code checkpoint mwells 2013-10-14 17:19:30 -06:00
  • a562c65627 another code checkpoint. new json api for crawlbot. new url filters for crawlbot. mwells 2013-10-14 16:10:48 -06:00
  • 5a7d70f7b2 code checkpoint mwells 2013-10-14 13:00:05 -06:00
  • 80918ca6e3 remove old libplotter references and files. mwells 2013-10-13 23:48:07 -07:00
  • 553c28fbe0 get performance graphing working again. use absolute divs to draw the graph instead of old gif plotter library. mwells 2013-10-13 23:39:31 -07:00
  • 81a09f9835 half way done fixing performance graph. needs more work. mwells 2013-10-13 22:02:21 -07:00
  • c7cf6a817a dmoz directory root page search box should just search all sites in dmoz. mwells 2013-10-13 20:13:15 -07:00
  • 3ac5838b8f fix the search tabs for the dmoz directory search box. allow more error types when spidering dmoz docs. mwells 2013-10-13 18:43:45 -07:00
  • 66364c581a minor fix when indexing dmoz urls. mwells 2013-10-13 17:12:20 -07:00
  • 876af6d8c6 dmoz support is now updated and re-integrated. mwells 2013-10-13 16:53:28 -07:00
  • 3bc85cf528 a few cleanups for the new dmoz code. mwells 2013-10-13 16:48:59 -07:00
  • 0cc78dc2e0 fix dup bug. mwells 2013-10-13 16:06:38 -07:00
  • d41d5554da fix dmoz search. mwells 2013-10-13 16:00:44 -07:00
  • 4cbb31e180 added searchbox for dmoz pages/sites. mwells 2013-10-13 15:45:12 -07:00
  • b60bdcc038 documentation updates. fixed sd=0. mwells 2013-10-13 14:24:41 -07:00
  • 2c7bc9031f documentation updates. mwells 2013-10-13 13:15:31 -07:00
  • 8547b8f802 print pretty dmoz pages. mwells 2013-10-13 00:39:05 -07:00
  • d4b5c37f45 Merge branch 'master' into testing mwells 2013-10-13 00:20:37 -07:00
  • fbcaefa6ff so we can spider https sites, add the old gigablast private/public key file. mwells 2013-10-13 00:15:39 -07:00
  • 65bad44450 try to fix EBADIP stopping a page from getting indexed into dmoz. mwells 2013-10-13 00:14:16 -07:00
  • c949bfe315 ignore certain errors and index the doc anyway so we at least have it in our dmoz index with its designated title and summary from dmoz. mwells 2013-10-13 00:02:25 -07:00
  • eeb10bb99a fix ip vector logic in xmldoc.cpp. mwells 2013-10-12 23:14:39 -07:00
  • c283e85e40 add support for noindex meta tag. use it in the gbdmoz.urls.txt.* files that contain the dmoz urls we want to spider. mwells 2013-10-12 22:50:23 -07:00
  • d300dc42f7 added XmlDoc::getDmozTitles() and related functions. mwells 2013-10-12 21:56:25 -07:00
  • 3374ce450a fix a couple catdb generation bugs. MAX_CATIDS violation causing corruption. not saving catdb tree to catdb-saved.dat causing missing catdb recs. mwells 2013-10-12 20:33:04 -07:00
  • 0de777d80d parser fixes mwells 2013-10-11 17:35:12 -06:00
  • 6d5643e185 json parsing mwells 2013-10-11 16:14:26 -06:00
  • 1d133e87c9 just print dmoz pages verbatim for now. later we can show the dmoz entries as search results. mwells 2013-10-10 23:18:57 -07:00
  • 547420396f added Categories::printUrlsInTopic() to print the dmoz urls for a catid. can replace us doing a search for the time being. mwells 2013-10-10 23:03:07 -07:00
  • 38bb82c902 make /Top the top dir page now. mwells 2013-10-10 22:46:31 -07:00
  • 55c5ad2921 fix "Top/" issues in breadcrumb etc. mwells 2013-10-10 22:27:49 -07:00
  • ca6af65217 got dmoz navigation system working. now we just need to index the urls to populate dmoz. mwells 2013-10-10 22:08:21 -07:00
  • be01041e36 added support for new url filter: "lastspidertime>={roundstart} --> IGNORE" so we can spider all urls before we advance to the next spider round and re-spider everything again. CollectionRec::m_spiderRoundStartTime and CollectionRec::m_spiderRoundNum are the new collection rec parms. show the round stuff on url filters page. mwells 2013-10-10 18:47:46 -06:00
  • 08c153bdb0 add old gb.pem file, not used by gigablast any more, but allows https server to startup. mwells 2013-10-09 17:37:01 -06:00
  • ea859ef685 added 'gb emailmandrill' for testing. got it working. it posts json, not url encoded. mwells 2013-10-09 17:35:51 -06:00
  • 2bb8b818d6 more bug fixes with notification system. mwells 2013-10-09 16:28:15 -06:00
  • c1c5c4e3d0 send notifications if no urls available for immediate spidering. mwells 2013-10-09 15:24:35 -06:00
  • 24e3b8cf52 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-10-09 13:07:22 -06:00
  • 0b4bbf926e fix potential compiler error. Matt Wells 2013-10-09 11:52:58 -07:00
  • 58e2be8b6f take out log msg Matt Wells 2013-10-09 11:51:39 -07:00
  • 283ec2f6b4 email and webhook alerts when spider runs out of urls to spider. Matt Wells 2013-10-09 11:42:56 -07:00
  • 7ba9994804 many dmoz fixes. but still more we need to do. isn't printing subcategories right now. mwells 2013-10-08 23:55:11 -07:00
  • 3702a05d64 add sendEmailThroughMandrill() to send through mail chimp http api. Matt Wells 2013-10-08 18:01:38 -07:00
  • 9eecfd378c added support for pageprocesspattern again. || separated strings to find in m_content before sending to diffbot. Matt Wells 2013-10-08 17:08:58 -07:00
  • a78dc35169 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-10-08 17:51:07 -06:00
  • ddbacab12f fix shard mapping of spiderdb. Matt Wells 2013-10-08 16:35:37 -07:00
  • a76e8e42c3 fix json parsing oopsy. Matt Wells 2013-10-08 16:28:25 -07:00
  • e1b798aa62 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-10-08 13:30:57 -06:00
  • ae57540af1 prevent core during page parser Matt Wells 2013-10-08 11:54:55 -07:00
  • ed0fbf2b99 fix core from not decoding json properly. Matt Wells 2013-10-08 11:46:18 -07:00
  • a0e9e0df7a fix a little corage. Matt Wells 2013-10-07 16:47:55 -07:00
  • 8adb920deb return all fields in diffbot reply. Matt Wells 2013-10-07 15:37:56 -07:00
  • 1374fea70b do not run htmlDecode nor do iframe expansion on documents of type JSON. it was decoding &quot; and messing up the json. Matt Wells 2013-10-07 14:50:32 -07:00
  • 0b338161e4 fix a couple cores. Matt Wells 2013-10-07 11:59:07 -07:00
  • 63c7764cd1 c=dmoz3 to c=dmoz mwells 2013-10-06 17:12:45 -07:00
  • 59b491f007 return fake tag recs for links if usefakeips meta tag is given. saves some lookups in tagdb when adding gbdmoz.urls.txt.* files which have tons of links each. like 500,000. mwells 2013-10-06 16:42:32 -07:00
  • 2383905c80 start using fakeips flag to stop ip tagrec lookups mwells 2013-10-06 16:40:04 -06:00
  • 183b7c372e make sections grow dynamically so we do not OOM when trying to index a gbdmoz.urls.txt.* file which can be 25MB. mwells 2013-10-06 11:04:10 -06:00
  • 3780789201 improve winning spider req selection using hop count. Matt Wells 2013-10-06 10:01:10 -07:00
  • d8e6ac8748 fixed bug of not putting meta tags in all gbdmoz.urls.txt.* files in dmozparse.cpp mwells 2013-10-06 00:18:59 -06:00
  • 000caa5a26 support for usefakeips meta tag mwells 2013-10-06 00:10:07 -06:00
  • 2935a143f0 if downloading a url on 127.0.0.1 or other local ip then do not limit download size. should fix downloading of gbdmoz.urls.txt.* files which can be > 25MB big. mwells 2013-10-05 23:43:00 -06:00
  • 612f2872f7 use addurl to add the gbdmoz url files to gigablast. it should index just those dmoz urls, and not spider their links. it should ignore external errors like ETCPTIMEDOUT when indexing so it will be identical to dmoz. mwells 2013-10-05 23:22:51 -06:00
  • 9f73ba1531 fix core again. Matt Wells 2013-10-04 21:41:38 -07:00
  • d464066da4 use catdb/ not cat/ mwells 2013-10-04 22:39:41 -06:00
  • f21fb98c16 fix core when getting new spider reply when g_errno was ECORRUPTDATA Matt Wells 2013-10-04 20:44:29 -07:00
  • 71d5d05f7c use catdb/ subdir not cat/ for consistency. mwells 2013-10-04 21:35:13 -06:00
  • 2ef54f3601 fix core from hashing json fields when not diffbot reply Matt Wells 2013-10-04 20:00:03 -07:00
  • f69b1f46c6 fix another bug from shard change. Matt Wells 2013-10-04 16:49:50 -07:00
  • 0fea60eaae fix core Matt Wells 2013-10-04 16:35:19 -07:00
  • fe97e08281 move from groups to shards. got rid of annoying groupid bit mask thing. Matt Wells 2013-10-04 16:18:56 -07:00
  • 3fa0ad5786 fix './gb install' cmd to install the new files. Matt Wells 2013-10-04 14:04:47 -07:00
  • c3afed946d Merge branch 'master' into diffbot mwells 2013-10-04 14:59:19 -06:00
  • ad209bb403 debug log changes mwells 2013-10-04 14:59:03 -06:00
  • e1bde7b7fe fixed bug of getting lock from the wrong group. mwells 2013-10-04 12:42:01 -06:00
  • d4aa65c0fe try to fix spiders with m_msg5StartKey logic. mwells 2013-10-04 09:39:05 -06:00
  • 78c4bda368 fix dmozparse urldump -s bugs for dumping out urls in dmoz. mwells 2013-10-04 00:00:26 -06:00